Remove All HTML Tags in a PDF File

Do you want to remove all of the HTML tags contained within your PDF form?

I'll list the two major options for tackling this issue: using an HTML parser and using a regular expression.

Option 1 : Use the HTML Agility Pack

The HTML Agility Pack is an agile HTML parser for .NET that reads, writes, and handles most of what you would need to do with HTML. (As a bonus, it is also available through NuGet.)

A related Stack Overflow discussion gives the code below for stripping all of the HTML tags from some text :

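using System.Linq;
using System.Text;
using System.Web;
using HtmlAgilityPack;

// Load the HTML and select every text node under <body>.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(Properties.Resources.HtmlContents);
var text = doc.DocumentNode.SelectNodes("//body//text()").Select(node => node.InnerText);
StringBuilder output = new StringBuilder();
foreach (string line in text)
{
    output.AppendLine(line);
}
// Decode entities such as &amp; into plain characters.
string textOnly = HttpUtility.HtmlDecode(output.ToString());

Note that SelectNodes returns null when nothing matches, so add a null check if your input might not contain a <body> element. I haven't worked with the HTML Agility Pack myself, but I have heard nothing but good things about it, so I am listing it.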

Option 2 : Regular Expression

If you currently have the entire contents of your PDF in a string, you could use the following regular expression to strip out all of the HTML tags contained within it (be aware, however, that this may affect the appearance of your PDF) :

<[^>]*>

which you can use in the following way :

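using System.Text.RegularExpressions;

// Strips HTML tags with a regular expression. RegexOptions.Compiled only helps if the Regex instance is stored and reused.
string result = new Regex("<[^>]*>", RegexOptions.Compiled).Replace(yourString, "");

This will likely be the "easier" method, but it is far from perfect: regular expressions cannot fully parse HTML. For example, a tag whose attribute value contains ">" will be cut short, and the contents of <script> and <style> elements stay in the output.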
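
If you stay with the regular expression route, a slightly more defensive variant might look like the sketch below. It is only an illustration, not part of the original answer: the HtmlStripper class and its StripTags method are hypothetical names, and the extra patterns for script/style blocks and comments are my own assumptions about what you would want removed. It reuses the same <[^>]*> pattern, keeps the compiled Regex instances in static fields so the compilation cost is paid once, and decodes entities with WebUtility.HtmlDecode.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; the class and method names are assumptions.
public static class HtmlStripper
{
    // Matches <script> and <style> elements together with their contents.
    private static readonly Regex ScriptOrStyle = new Regex(
        @"<(script|style)\b[^>]*>.*?</\1\s*>",
        RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Matches HTML comments, which may themselves contain '>' characters.
    private static readonly Regex Comment = new Regex(
        @"<!--.*?-->",
        RegexOptions.Compiled | RegexOptions.Singleline);

    // The tag pattern from the post: '<', anything except '>', then '>'.
    private static readonly Regex Tag = new Regex(@"<[^>]*>", RegexOptions.Compiled);

    public static string StripTags(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return string.Empty;
        }

        // Drop script/style blocks and comments first so their contents
        // do not leak into the result, then remove the remaining tags.
        string text = ScriptOrStyle.Replace(html, "");
        text = Comment.Replace(text, "");
        text = Tag.Replace(text, "");

        // Turn entities such as &amp; back into plain characters.
        return WebUtility.HtmlDecode(text).Trim();
    }
}
```

You would call it as HtmlStripper.StripTags(yourString). It still shares the main weakness of the simple pattern: a tag with a ">" inside an attribute value is cut short, so a real parser remains the safer choice for messy input.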

Asp.net,C#.net,Sql Srver problems and solutions Blog
