So you want to remove all of the HTML tags contained within your PDF form?
I'll list the two major options for tackling this issue: using an HTML parser and using a regular expression.
Option 1 : Use the HTML Agility Pack
The HTML Agility Pack is an agile HTML parser that reads, writes, and handles most of what you would need to do with HTML in .NET. (As a bonus, it is also available through NuGet.)
The code below, taken from a related Stack Overflow discussion, strips all of the HTML tags from some text :
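// Requires the HtmlAgilityPack NuGet package, plus:
// using System.Linq; using System.Text; using System.Web; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(Properties.Resources.HtmlContents);
// Select every text node under <body>; SelectNodes returns null when nothing matches
var text = doc.DocumentNode.SelectNodes("//body//text()")?.Select(node => node.InnerText)
           ?? Enumerable.Empty<string>();
StringBuilder output = new StringBuilder();
foreach (string line in text)
{
    output.AppendLine(line);
}
// Decode entities such as &amp; back into plain characters
string textOnly = HttpUtility.HtmlDecode(output.ToString());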
I haven't worked with the HTML Agility Pack myself; however, I have heard nothing but good things about it, so I am listing it.
Option 2 : Regular Expression
If you currently have the entire contents of your PDF as a string, you could use the following regular expression to strip out all of the HTML tags contained within it (however, realize that this may affect the appearance of your PDF) :
<[^>]*>
which you can use in the following way :
// Uses a regular expression to strip your HTML tags (requires using System.Text.RegularExpressions;).
// RegexOptions.Compiled only improves performance if this Regex instance is cached and reused.
string result = new Regex("<[^>]*>", RegexOptions.Compiled).Replace(yourString, "");
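This will likely be the "easier" method, but it is by no means perfect (regular expressions typically aren't when it comes to HTML). For example, it leaves entities such as &amp; undecoded, and it keeps the text inside script and style elements.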
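To make the regex option a bit more concrete, here is a minimal sketch of how it could be wrapped up for reuse. The names HtmlStripper and StripTags are my own (hypothetical), and I am assuming WebUtility.HtmlDecode from System.Net is acceptable in place of HttpUtility so that no reference to System.Web is needed. The second call shows one of the cases where the pattern falls short.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlStripper
{
    // Cached so the cost of RegexOptions.Compiled is paid once and then reused.
    private static readonly Regex TagPattern = new Regex("<[^>]*>", RegexOptions.Compiled);

    // Hypothetical helper: removes anything that looks like a tag,
    // then decodes entities such as &amp; into plain characters.
    public static string StripTags(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return string.Empty;
        }

        string withoutTags = TagPattern.Replace(html, string.Empty);
        return WebUtility.HtmlDecode(withoutTags);
    }

    public static void Main()
    {
        // Prints: Fish & chips
        Console.WriteLine(StripTags("<p>Fish &amp; <b>chips</b></p>"));

        // A '>' inside an attribute value ends the match early,
        // so part of the tag is left behind. Prints:  b">link
        Console.WriteLine(StripTags("<a title=\"a > b\">link</a>"));
    }
}
```

If leftovers like the second case matter for your PDF, that is a good sign the parser from Option 1 is the safer choice.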