Parsing HTML in C#

Today I stumbled upon a bizarre problem, I wanted to parse an HTML for a site to find out if the site contains any RSS feeds. After some research I found out that finding the RSS feed for a site is not that hard, you have to look for an element that looks like this

image

“Easy task”, I said, me Mr.Optimistic; “I will just load it up in XElement and using Linq to XML to get the data I want”. BUT guess what the web is filled with crazy HTML that a standard .NET parser such as XElement just gives up on and blows up in flames.

After some heavy head banging to walls and ceilings, I found the solution, HtmlAgilityPack. This open source project lets you load HTML even if it is not in a good shape. With some options HTMLAgilityPack will fix these errors and then you can query the HTML to get elements and their attributes as you please.

Here is the code I used to load the HTML and find the data I need

image

Kudos to the guys from HTMLAgilityPack!!!

3 thoughts on “Parsing HTML in C#

Leave a comment