
Remove All HTML Tags in a PDF File




Do you want to remove all of the HTML tags contained within your PDF form?

I'll list two of the major options for tackling this issue: using an HTML parser and using a regular expression.

Option 1 : Use the HTML Agility Pack

The HTML Agility Pack is an HTML parser for .NET that reads and writes HTML and handles most situations you are likely to run into, including malformed markup. (As a bonus, it is also available through NuGet.)

A related Stack Overflow discussion gives the code listed below to strip all of the HTML tags from some text:

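using System.Linq;
using System.Text;
using System.Web;          // HttpUtility
using HtmlAgilityPack;     // NuGet package: HtmlAgilityPack

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(Properties.Resources.HtmlContents);
// SelectNodes returns null when nothing matches, so guard against that.
var nodes = doc.DocumentNode.SelectNodes("//body//text()");
var text = nodes == null ? Enumerable.Empty<string>() : nodes.Select(node => node.InnerText);
StringBuilder output = new StringBuilder();
foreach (string line in text)
{
   output.AppendLine(line);
}
// Decode entities such as &amp; into their literal characters.
string textOnly = HttpUtility.HtmlDecode(output.ToString());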

I haven't worked with the HTML Agility Pack myself, but I have heard nothing but good things about it, so I am listing it.

Option 2 : Regular Expression

If you already have the entire contents of your PDF as a string, you could use the following regular expression to strip out all of the HTML tags it contains (be aware that this may affect the appearance of your PDF):
<[^>]*>

which you can use in the following way :

//Uses a Regular Expression to strip your HTML tags (RegexOptions.Compiled for improved performance)
string result = new Regex("<[^>]*>", RegexOptions.Compiled).Replace(yourString, "");
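
This will likely be the "easier" method, but it is far from perfect, as regular expressions rarely are when applied to HTML. The pattern stops at the first ">" it sees, so a ">" inside an attribute value leaves debris behind; it keeps the contents of script and style elements; and it does not decode entities such as &amp;.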
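
To make the trade-off concrete, here is a minimal sketch that wraps the regular expression above in a reusable helper and decodes entities afterwards. The class and method names (HtmlStripper, StripTags) are hypothetical and chosen only for illustration, and using WebUtility.HtmlDecode for the decoding step is an assumption on my part rather than something the approach above prescribes.

```csharp
using System;
using System.Net;                      // WebUtility.HtmlDecode
using System.Text.RegularExpressions;

public static class HtmlStripper
{
    // Same pattern as above, compiled once and reused across calls.
    private static readonly Regex TagPattern = new Regex("<[^>]*>", RegexOptions.Compiled);

    // Hypothetical helper: removes tags, then decodes entities such as &amp;.
    public static string StripTags(string input)
    {
        if (string.IsNullOrEmpty(input)) return string.Empty;
        string withoutTags = TagPattern.Replace(input, "");
        return WebUtility.HtmlDecode(withoutTags);
    }
}

public static class Program
{
    public static void Main()
    {
        // Well-formed markup: the tags disappear and &amp; becomes &.
        Console.WriteLine(HtmlStripper.StripTags("<p>Total: <b>5</b> &amp; counting</p>"));

        // A '>' inside an attribute value ends the match early,
        // so part of the tag survives in the output.
        Console.WriteLine(HtmlStripper.StripTags("<a title=\"a > b\">link</a>"));
    }
}
```

The second call shows the kind of input where the regular expression falls short. If your content can contain markup like that, the parser-based approach in Option 1 is probably the safer choice.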
