Today I stumbled upon a bizarre problem, I wanted to parse an HTML for a site to find out if the site contains any RSS feeds. After some research I found out that finding the RSS feed for a site is not that hard, you have to look for an element that looks like this
“Easy task”, I said, me Mr.Optimistic; “I will just load it up in XElement and using Linq to XML to get the data I want”. BUT guess what the web is filled with crazy HTML that a standard .NET parser such as XElement just gives up on and blows up in flames.
After some heavy head banging to walls and ceilings, I found the solution, HtmlAgilityPack. This open source project lets you load HTML even if it is not in a good shape. With some options HTMLAgilityPack will fix these errors and then you can query the HTML to get elements and their attributes as you please.
Here is the code I used to load the HTML and find the data I need
Kudos to the guys from HTMLAgilityPack!!!