Hpricot - My New Favourite Ruby XML Parser

The default Ruby distribution lacks a good XML parser. The included REXML is only sufficient for the most basic scenarios, as its performance degrades quickly as the XML grows.

Recently I needed to parse a 700 KB XML file and extract some values with XPath queries. Doing this in REXML proved too slow (around 30 seconds). Since I was on OS X, it was a small task to install the Ruby libxml bindings. The speed increase was immense and everything worked smoothly.

As usual, requirements changed, and the application needed to run on both Windows and OS X. Unfortunately the Ruby libxml API does not work on Windows. Looking around, I couldn’t find a decent XML parser for Ruby that worked on both platforms, and I didn’t want to code for both REXML and libxml.

Enter Hpricot. Originally written for HTML scraping, it is actually very capable of working with XML too. And it has binaries for Windows, Linux and OS X.

A quick example shows how easy it is to load and get data from an XML file:

require 'hpricot'

# Load and parse the XML file
doc = Hpricot(open("lazaridis_msgs.xml"))

# Pull two characters out of each message's subject attribute
doc.search("//message").each do |message|
  e_number = message.attributes["subject"][16..17]
  puts "Evaluation identifier is #{e_number}"
end

Technically, Hpricot isn’t an XML parser. It doesn’t check that the document is well-formed, let alone validate it, which means that malformed XML can slip through. You will have to be careful if your application relies on the well-formedness of the XML data.
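
If well-formedness matters, one option is to run a strict check before handing the data to Hpricot. The following is only a minimal sketch: the helper name `well_formed_xml?` is made up for illustration and is not part of Hpricot, and it assumes REXML is available. Since REXML is slow on large documents, a check like this is probably best kept to small inputs or to your test suite.

```ruby
require 'rexml/document'

# Hypothetical helper (not part of Hpricot): returns true when the string
# parses as well-formed XML with REXML, false when REXML rejects it.
def well_formed_xml?(xml)
  REXML::Document.new(xml)
  true
rescue REXML::ParseException
  false
end

# A properly nested document passes the check
puts well_formed_xml?('<messages><message subject="ok"/></messages>')

# A mismatched closing tag makes REXML raise, so the check fails
puts well_formed_xml?('<messages><message></messages>')
```

Only documents that pass the check would then go on to Hpricot; the others can be rejected or logged, whichever suits your application.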

I will be switching the Ruby Accessibility Analysis Kit over to Hpricot soon. It will be a nice speed increase for your Rails unit tests using RAAKT. This will also solve the problem in RubyfulSoup where the author declared a “Tag” class with a bad scope.

So, maybe I am the last person on earth to discover this, but if you need a great library for XML parsing on multiple platforms, check out Hpricot.