Today I stumbled upon a bizarre problem: I wanted to parse a site's HTML to find out whether it contains any RSS feeds. After some research I found that locating the RSS feed for a site is not that hard; you just have to look for an element that looks like this
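For reference, the standard RSS autodiscovery element sits in the page's `<head>`; the title and href below are placeholders that vary per site:

```html
<link rel="alternate" type="application/rss+xml"
      title="Site RSS Feed" href="https://example.com/feed.xml" />
```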
“Easy task,” I said, optimist that I am; “I will just load it up in XElement and use LINQ to XML to get the data I want.” But guess what: the web is filled with crazy HTML that a standard .NET parser such as XElement just gives up on and blows up in flames.
After some heavy banging of my head against walls and ceilings, I found the solution: HtmlAgilityPack. This open-source project lets you load HTML even when it is not in good shape. With a few options set, HtmlAgilityPack will fix those errors, and you can then query the HTML for elements and their attributes as you please.
Here is the code I used to load the HTML and find the data I need:
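The original snippet did not survive here, but a minimal sketch of the approach with HtmlAgilityPack might look like the following. The URL is a placeholder, and the exact options are an assumption on my part; `OptionFixNestedTags` is the setting that tells the parser to repair badly nested markup:

```csharp
using System;
using System.Linq;
using System.Net.Http;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

class RssFinder
{
    static void Main()
    {
        // Placeholder URL; substitute the site you want to inspect.
        using var http = new HttpClient();
        string html = http.GetStringAsync("https://example.com/").Result;

        var doc = new HtmlDocument();
        // Ask the parser to repair badly nested tags before we query.
        doc.OptionFixNestedTags = true;
        doc.LoadHtml(html);

        // LINQ query over <link> elements that advertise an RSS feed.
        var feedUrls =
            from link in doc.DocumentNode.Descendants("link")
            where link.GetAttributeValue("type", "")
                      .Contains("application/rss")
            select link.GetAttributeValue("href", "");

        foreach (var url in feedUrls)
            Console.WriteLine(url);
    }
}
```

The nice part is that `Descendants` and `GetAttributeValue` let you write the query in the same shape as a LINQ to XML query, even against broken HTML.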
Kudos to the guys behind HtmlAgilityPack!
What I love about HAP is that you can use it almost exactly like LINQ to XML; the interface feels very natural. Great tool!
You might as well skip LINQ and use XPath the whole way.
var nodes = htmlDoc.DocumentNode.SelectNodes("//link[@type='application/rss+xml']");
Thanks for the tip!
Btw, what template are you using for VS? I love the looks!