Hpricot – My New Favourite Ruby XML Parser

One feature missing from the default Ruby distribution is a good XML parser. The included REXML is sufficient only for the most basic scenarios, as its performance degrades quickly as the XML grows.

Recently I needed to parse a 700 KB XML file and extract some values with XPath queries. Doing this in REXML proved too slow (around 30 seconds). Since I was on OS X, it was a small task to get the Ruby libxml bindings. The speed increase was immense and everything worked smoothly.
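For reference, here is a minimal sketch of how a REXML baseline like that could be timed. The sample document and its attribute values are made up for illustration; the real file was far larger, so the numbers this prints say nothing about the 30-second case.

```ruby
require 'rexml/document'

# Hypothetical sample data standing in for the real 700 KB file.
xml = <<~XML
  <messages>
    <message subject="Evaluation run 01"/>
    <message subject="Evaluation run 02"/>
  </messages>
XML

# Time the parse plus one XPath query using a monotonic clock.
started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
doc = REXML::Document.new(xml)
subjects = REXML::XPath.match(doc, "//message").map { |m| m.attributes["subject"] }
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

puts "Found #{subjects.size} subjects in #{format('%.4f', elapsed)} s"
```

Swapping the string for the contents of a real file makes it easy to see how the time grows with document size.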

As usual, requirements changed, and the application needed to run on both Windows and OS X. Unfortunately, the Ruby libxml API does not work on Windows. Looking around, I couldn’t find a decent XML parser for Ruby that worked on both platforms, and I didn’t want to code for both REXML and libxml.

Enter Hpricot. Originally written for HTML scraping, it is actually very capable of working with XML too. And it has binaries for Windows, Linux and OS X.

A quick example shows how easy it is to load and get data from an XML file:


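require 'hpricot'

# Load and parse the XML file.
doc = Hpricot(open("lazaridis_msgs.xml"))

# Visit every <message> element and slice the identifier out of its subject attribute.
doc.search("//message").each do |message|
  e_number = message.attributes["subject"][16..17]
  puts "Evaluation identifier is #{e_number}"
end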

Technically, Hpricot isn’t an XML parser. It doesn’t validate the document, which means that malformed XML can slip through. You will have to be careful if your application relies on the well-formedness of the XML data.
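If well-formedness matters, one option is to run the input through a strict parser first. The sketch below uses REXML from the standard library for that check. The helper name well_formed? is my own invention, not part of Hpricot, and given REXML's speed this is probably only practical for small inputs or in tests.

```ruby
require 'rexml/document'

# Hypothetical helper: returns true when the string parses as
# well-formed XML, false when REXML rejects it.
def well_formed?(xml)
  REXML::Document.new(xml)
  true
rescue REXML::ParseException
  false
end

puts well_formed?("<messages><message subject='a'/></messages>")
puts well_formed?("<messages><message></messages>")
```

The second call has an unclosed message tag, which a strict parser rejects, while a lenient one may accept it silently.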

I will be switching the Ruby Accessibility Analysis Kit over to Hpricot soon. It will be a nice speed increase for your Rails unit tests using RAAKT. This will also solve the problem in RubyfulSoup where the author declared a “Tag” class with a bad scope.

So, maybe I am the last person on earth to discover this, but if you need a great library for XML parsing on multiple platforms, check out Hpricot.

Comment

  1. Thanks for the posting!

    I also had the same problem with REXML and found a Windows binary that someone had blogged about making. My question for you is, how would you compare the performance of Hpricot vs. libxml? My app is particularly sensitive to speed, so although Hpricot might be easier, I am wondering what you saw performance-wise. One file I would like to parse is 20 MB! I might have to pre-process it first, but still – some larger data sets.

    I also pull in an XML web service and one page was taking 15-20 seconds using REXML and it dropped to about 1.5 seconds using libxml, and most of that is in waiting for the external web service.

  2. Hi ..
    Can you post a link to the reference for Hpricot?
    As with everything in Ruby the documentation is hard .. very hard to find.
    The link by mechanized on rubyforge is dead .. has been that way for a few days.

    Thanks

  3. Just today, I experienced the same problem as Hedley with Hpricot’s XPath and CSS selectors failing to navigate a namespaced xml doc. With a little bit of hunting around in the wiki, I discovered that the problem seems to have been fixed in a recent change:

    http://code.whytheluckystiff.net/hpricot/changeset/146

    Upgrading my gem to 0.5 got things working for me. I can now navigate namespaced xml docs as normal with Hpricot. The one tangle seems to be that Hpricot still downcases all of the tags, so I had to take that into account when constructing my selectors. There’s some discussion of the problem over on the wiki so it is somewhat likely on its way towards resolution.

  4. Thank you so much for making this comment on the web. I also found REXML to be very slow, and libxml is super hard to install on Windows – at least I couldn’t get it to run after hours of work and finally decided to look for other XML implementations.
    Hpricot installed out of the box with gem, and I successfully replaced all REXML commands with the Hpricot API without any problems. The XML processing time improved from 30 seconds to 1. Thanks!
