C# Disciples

my life in Avalon ….

Parsing HTML in C#

Today I stumbled upon a bizarre problem: I wanted to parse the HTML of a site to find out whether the site exposes any RSS feeds. After some research I found out that finding the RSS feed for a site is not that hard; you have to look for a link element that looks like this:

[image: the RSS link element]

“Easy task”, I said, me, Mr. Optimistic; “I will just load it up in XElement and use LINQ to XML to get the data I want”. But guess what: the web is filled with crazy HTML that a standard .NET XML parser such as XElement just gives up on, going up in flames.
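
A minimal sketch of that failure, assuming a small made-up HTML fragment with unclosed tags. The `StrictParsingDemo` class and `TryParseAsXml` helper are hypothetical names used only for illustration; `XElement.Parse` and `XmlException` are the real .NET types.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class StrictParsingDemo
{
    // Hypothetical helper: tries to parse markup as strict XML and
    // reports the parser's complaint instead of letting it throw.
    public static bool TryParseAsXml(string markup, out string error)
    {
        try
        {
            XElement.Parse(markup);
            error = null;
            return true;
        }
        catch (XmlException ex)
        {
            // Unclosed tags such as <link ...> or <br> are legal HTML
            // but not well-formed XML, so the parser ends up here.
            error = ex.Message;
            return false;
        }
    }

    public static void Main()
    {
        // Made-up page fragment: <link> and <br> are never closed.
        string html =
            "<html><head>" +
            "<link rel=\"alternate\" type=\"application/rss+xml\" href=\"/feed\">" +
            "</head><body><p>Hello<br></p></body></html>";

        string error;
        if (!TryParseAsXml(html, out error))
        {
            Console.WriteLine("XElement gave up: " + error);
        }
    }
}
```

The same fragment is perfectly acceptable to a browser, which is why a forgiving HTML parser is needed for this job.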

After some heavy head-banging against walls and ceilings, I found the solution: HtmlAgilityPack. This open source project lets you load HTML even if it is not in good shape. With some options set, HtmlAgilityPack will fix these errors, and you can then query the HTML to get elements and their attributes as you please.

Here is the code I used to load the HTML and find the data I need:

[image: code screenshot]
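
A minimal sketch of this approach, not necessarily the exact code from the post. It assumes the HtmlAgilityPack NuGet package is referenced and that feeds are advertised with `type="application/rss+xml"`; the `FeedFinder` class, the `FindRssFeeds` method and the sample markup are hypothetical names and data chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class FeedFinder
{
    // Hypothetical helper: returns the href of every <link> element
    // whose type attribute announces an RSS feed.
    public static List<string> FindRssFeeds(string html)
    {
        var doc = new HtmlDocument();
        // Ask the parser to repair badly nested tags; set before loading.
        doc.OptionFixNestedTags = true;
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants("link")
            .Where(link => string.Equals(
                link.GetAttributeValue("type", string.Empty),
                "application/rss+xml",
                StringComparison.OrdinalIgnoreCase))
            .Select(link => link.GetAttributeValue("href", string.Empty))
            .Where(href => href.Length > 0)
            .ToList();
    }

    public static void Main()
    {
        // Made-up fragment with unclosed <link> and <br> tags.
        string html =
            "<html><head>" +
            "<link rel=\"alternate\" type=\"application/rss+xml\" href=\"/feed\">" +
            "</head><body><p>Hello<br></p></body></html>";

        foreach (string feed in FindRssFeeds(html))
        {
            Console.WriteLine(feed);
        }
    }
}
```

To work against a live site rather than a string, `new HtmlWeb().Load(url)` returns an `HtmlDocument` that can be queried the same way.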

Kudos to the guys from HtmlAgilityPack!!!


January 5, 2012 - Posted by | WPF

3 Comments »

  1. What I love about HAP is that you can use it almost exactly like Linq to XML; the interface is very natural. Great tool!

    Comment by Thomas Levesque | January 5, 2012

  2. You might as well skip Linq and use XPath the whole way.

    var nodes = htmlDoc.DocumentNode.SelectNodes("//link[@type='application/rss+xml']");

    Comment by Rune Juhl-Petersen | January 6, 2012

  3. Thanks for the tip!

    Btw, what template are you using for VS? I love the looks!

    Comment by Stian | January 16, 2012
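
The XPath suggestion from comment 2 can be sketched as follows, again assuming the HtmlAgilityPack NuGet package and the `application/rss+xml` type value. `FeedFinderXPath` and `FindRssFeeds` are hypothetical names. One detail worth handling is that `SelectNodes` returns `null`, not an empty collection, when nothing matches.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class FeedFinderXPath
{
    // Hypothetical helper: same job as a Linq query over <link> elements,
    // but expressed as a single XPath expression.
    public static List<string> FindRssFeeds(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNodeCollection nodes =
            doc.DocumentNode.SelectNodes("//link[@type='application/rss+xml']");

        // SelectNodes yields null when there are no matches.
        if (nodes == null)
        {
            return new List<string>();
        }

        return nodes
            .Select(node => node.GetAttributeValue("href", string.Empty))
            .Where(href => href.Length > 0)
            .ToList();
    }
}
```

Note that the XPath comparison on the attribute value is case-sensitive, so a page that writes the type in a different case would be missed by this variant.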

