Parsing HTML in C#

Today I stumbled upon a bizarre problem, I wanted to parse an HTML for a site to find out if the site contains any RSS feeds. After some research I found out that finding the RSS feed for a site is not that hard, you have to look for an element that looks like this

image

“Easy task”, I said, me Mr.Optimistic; “I will just load it up in XElement and using Linq to XML to get the data I want”. BUT guess what the web is filled with crazy HTML that a standard .NET parser such as XElement just gives up on and blows up in flames.

After some heavy head banging to walls and ceilings, I found the solution, HtmlAgilityPack. This open source project lets you load HTML even if it is not in a good shape. With some options HTMLAgilityPack will fix these errors and then you can query the HTML to get elements and their attributes as you please.

Here is the code I used to load the HTML and find the data I need

image

Kudos to the guys from HTMLAgilityPack!!!

About these ads

3 thoughts on “Parsing HTML in C#

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s