Html Agility Pack HTML Parsing Engine

Note: to get the latest official Html Agility Pack releases, please use the NuGet package.

Html Agility Pack is an HTML parsing engine written for .NET. It is available for many .NET platforms, including the .NET Compact Framework (.NET CF), Windows Phone 7 (WP7), and Silverlight.

What exactly is the Html Agility Pack (HAP)?

HAP is an agile HTML parser that builds a read/write DOM and supports plain XPath or XSLT (you don't have to understand XPath or XSLT to use it). It is a .NET code library that lets you parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to the one System.Xml offers, but for HTML documents (or streams).

Html Agility Pack now supports LINQ to Objects (via a LINQ to XML-like interface). Check out the new beta to try this feature.
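
As a rough illustration of the LINQ-style interface, the sketch below collects the link targets from an HTML string without any XPath. The class name LinqLinkExample and the method ExtractLinks are hypothetical names chosen for this example; LoadHtml, Descendants and GetAttributeValue are members of the library. It assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper class, not part of Html Agility Pack.
public static class LinqLinkExample
{
    // Returns the href value of every <a> element that has a non-empty one.
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants("a")                                        // every <a> in the tree
            .Select(a => a.GetAttributeValue("href", string.Empty))  // "" when href is missing
            .Where(href => href.Length > 0)
            .ToList();
    }
}
```

Descendants yields an empty sequence when nothing matches, so no null check is needed here, unlike with SelectNodes.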

Sample applications:

Page fixing or generation. You can fix a page the way you want: modify the DOM, add nodes, copy nodes, and so on.
Web scanners. You can easily get to img/src or a/href values with a handful of XPath queries.
Web scrapers. You can easily scrape any existing web page into an RSS feed, for example, with just an XSLT file serving as the binding. An example of this is provided.
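
To make the "page fixing" case concrete, here is a minimal sketch that adds rel="nofollow" to every link that has no rel attribute and returns the rewritten markup. The class PageFixExample and the method AddNoFollow are hypothetical names for illustration, and the choice of fix is just an example; SelectNodes, SetAttributeValue and OuterHtml are members of the library.

```csharp
using HtmlAgilityPack;

// Hypothetical helper class, not part of Html Agility Pack.
public static class PageFixExample
{
    // Adds rel="nofollow" to every <a> without a rel attribute
    // and returns the serialized document.
    public static string AddNoFollow(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[not(@rel)]");
        if (links != null)
        {
            foreach (HtmlNode link in links)
            {
                link.SetAttributeValue("rel", "nofollow");
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Because the DOM is read/write, the same pattern (select, modify, serialize) applies to other fixes such as rewriting src values or removing nodes.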

There is no dependency on anything other than .NET's XPath implementation. There is no dependency on Internet Explorer's MSHTML DLL, W3C's HTML Tidy, ActiveX / COM objects, or anything like that. There is also no adherence to XHTML or XML, although you can actually produce XML using the tool. The version posted on CodePlex is for the .NET Framework 2.0. If you need the old version, please go to the old page or drop me a note.

Examples
http://htmlagilitypack.codeplex.com/wikipage?title=Examples

Download
http://htmlagilitypack.codeplex.com/

For More Info

http://runtingsproper.blogspot.in/2009/11/easily-extracting-links-from-snippet-of.html
http://runtingsproper.blogspot.in/2009/09/introduction-to-htmlagilitypack-library.html

Sample Code

using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.Load(@"C:\Sample.HTM");

// Select every <a> element that has an href attribute.
// SelectNodes returns null (not an empty collection) when nothing matches.
HtmlNodeCollection linkNodes = doc.DocumentNode.SelectNodes("//a[@href]");

// Run only if there are links in the document.
if (linkNodes != null)
{
    foreach (HtmlNode linkNode in linkNodes)
    {
        HtmlAttribute attrib = linkNode.Attributes["href"];
        string url = attrib.Value; // the link target
        // Do whatever else you need here
    }
}
