Skip to main content

Html Agility Pack HTML Parsing Engine


Attention to get the latest Official Html Agility Pack releases please use the Nuget Package

Html Agility Pack is an HTML parsing engine written for .NET. It is available for many .NET platforms including .NET CF, WP7 and Silverlight


What is exactly the Html Agility Pack (HAP)?

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).

Html Agility Pack now supports Linq to Objects (via a LINQ to Xml Like interface). Check out the new beta to play with this feature

Sample applications:

Page fixing or generation. You can fix a page the way you want, modify the DOM, add nodes, copy nodes, well... you name it.
Web scanners. You can easily get to img/src or a/hrefs with a bunch XPATH queries.
Web scrapers. You can easily scrap any existing web page into an RSS feed for example, with just an XSLT file serving as the binding. An example of this is provided.

There is no dependency on anything else than .Net's XPATH implementation. There is no dependency on Internet Explorer's MSHTML dll or W3C's HTML tidy or ActiveX / COM object, or anything like that. There is also no adherence to XHTML or XML, although you can actually produce XML using the tool. The version posted here on CodePlex is for the .NET Framework 2.0. If you need the old version, please go to the old page or drop me a note.

Examples
http://htmlagilitypack.codeplex.com/wikipage?title=Examples

Download
http://htmlagilitypack.codeplex.com/

For More Info

http://runtingsproper.blogspot.in/2009/11/easily-extracting-links-from-snippet-of.html
http://runtingsproper.blogspot.in/2009/09/introduction-to-htmlagilitypack-library.html


Sample Code

HtmlDocument doc = new HtmlDocument();
doc.Load(@"C:\Sample.HTM");
HtmlNodeCollection linkNodes = doc.DocumentNode.SelectNodes("//a/@href");

Content match = null;

// Run only if there are links in the document.
if (linkNodes != null)
{
    foreach (HtmlNode linkNode in linkNodes)
    {
        HtmlAttribute attrib = linkNode.Attributes["href"];
        // Do whatever else you need here
    }
}

Comments

Popular posts from this blog

Asp.Net AjaxFileUpload Control With Drag Drop And Progress Bar

This Example explains how to use AjaxFileUpload Control With Drag Drop And Progress Bar Functionality In Asp.Net 2.0 3.5 4.0 C# And VB.NET. Previous Post  I was Explained about the   jQuery - Allow Alphanumeric (Alphabets & Numbers) Characters in Textbox using JavaScript  ,  Fileupload show selected file in label when file selected  ,  Check for file size with JavaScript before uploading  . May 2012 release of AjaxControlToolkit includes a new AjaxFileUpload Control  which supports Multiple File Upload, Progress Bar and Drag And Drop functionality. These new features are supported by Google Chrome version 16+, Firefox 8+ , Safari 5+ and Internet explorer 10 + , IE9 or earlier does not support this feature. To start with it, download and put latest AjaxControlToolkit.dll in Bin folder of application, Place ToolkitScriptManager  and AjaxFileUpload on the page. HTML SOURCE < asp:ToolkitScriptManager I...

View online files using the Google Docs Viewer

Use Google Docs Viewer for Document viewing within Browser I was looking for a way to let users see Microsoft Word Doc or PDF files online while using my application without leaving their browser without downloading files and then opening it to view with Word or PDF viewer . I was looking for some way out either via any PHP or Microsoft.NET libraries, did some googling on that; but later on I just got an idea that google already has all code written for me.. when I have any email attachment in PDF or DOC or DOCX google does it for me ..! Even while searching I can see PDFs by converting them in HTML. So I just googled it up and found that Google already has this ability that we can use Google Docs Viewer without any Google Account Login . YES that's true no Google Account login is required. It's damn simple and easy. Just pass document path as attachment as parameter and we are done. Google Docs Viewer gives us ability to embed PDF, DOC/DOCX, PPT, TIFF:...

How to send mail asynchronously in asp.net with MailMessage

With Microsoft.NET Framework 2.0 everything is asynchronous and we can send mail also asynchronously. This features is very useful when you send lots of bulk mails like offers , Discounts , Greetings . You don’t have to wait for response from mail server and you can do other task . By using     SmtpClient . SendAsync Method (MailMessage, Object)    you need to do  System.Net.Mail has also added asynchronous support for sending email. To send asynchronously, you need need to Wire up a SendCompleted event Create the SendCompleted event Call SmtpClient.SendAsync smtpClient.send() will initiate the sending on the main/ui  thread and would block.  smtpClient.SendAsync() will pick a thread from the .NET Thread Pool and execute the method on that thread. So your main UI will not hang or block . Let's create a simple example to send mail. For sending mail asynchronously you need to create a event handler that will notify that mail success...