Parsing ASP.NET sites with WWW::Mechanize and Hpricot

Users of Hpricot (which WWW::Mechanize uses as its default HTML parser) may have discovered that the buffer size for attribute values is set to 16384 bytes by default. Typically this isn’t a problem; I mean, who would put 16 KB of data into an HTML attribute? Well, ASP.NET uses a hidden input field to store view state in order to save a few clock cycles on the server side (and spare developers the hassle of coding view state).

Typically, developers tend to forget to turn off view state, resulting in a lot of data that is never used. The guy who made the decision to have this default view state behaviour has probably caused a lot of unnecessary bytes to clog your internet connection (as the view state is typically included in each request).

If you are using mechanize and/or Hpricot to parse such a site, you may have come across this error:

ran out of buffer space on element <input>, starting on line 38. (Hpricot::ParseError)

If you want to try it out, load this sample viewstate file into Hpricot. The buffer space error has been reported in the Hpricot issue tracker.

Fortunately, from version 0.5 of Hpricot it is easy to increase the buffer size before loading data. This is done by setting the buffer_size attribute to a sufficiently large number:

require 'hpricot'
# Raise the attribute buffer from the default 16384 bytes to 256 KB.
Hpricot.buffer_size = 262144
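
How large is “sufficiently large”? One rough way to find out is to measure the longest attribute value in a saved copy of the page and round up. The sketch below uses only the Ruby standard library; the helper name suggested_buffer_size is made up for this example and is not part of Hpricot, and the regular expression only looks at quoted attribute values, so treat the result as an estimate rather than a guarantee.

```ruby
# Hypothetical helper (not part of Hpricot): estimate a buffer size by
# finding the longest quoted attribute value in an HTML string and
# doubling the default size until it is strictly larger than that value.
def suggested_buffer_size(html, minimum = 16_384)
  # Collect the byte length of every double- or single-quoted value.
  lengths = html.scan(/=\s*(?:"([^"]*)"|'([^']*)')/).map do |double_quoted, single_quoted|
    (double_quoted || single_quoted).bytesize
  end
  longest = lengths.max || 0

  size = minimum
  size *= 2 while size <= longest
  size
end

# Simulated ASP.NET page with a 40000-byte view state field.
html = %(<input type="hidden" name="__VIEWSTATE" value="#{'A' * 40_000}" />)
puts suggested_buffer_size(html) # prints 65536
```

The returned number can then be assigned to Hpricot.buffer_size before parsing. If you scrape many different pages, a fixed generous value such as the 262144 used above is probably simpler.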

Fixing Mechanize

As mechanize uses Hpricot as the default parser, this error will happen when loading many ASP.NET pages. Fortunately, mechanize allows the user to specify a custom parser class through the pluggable_parser attribute. To make mechanize use Hpricot with a larger buffer size:

require 'hpricot'
require 'mechanize'

# Enlarge the buffer before any page is parsed.
Hpricot.buffer_size = 262144
agent = WWW::Mechanize.new
agent.pluggable_parser.default = Hpricot
agent.get('http://www.peterkrantz.com/wp-content/uploads/2007/02/viewstatesample.htm')

…and we’re back on track mechanizing the world again.

Comments

  1. Ed says at 2007-06-15 13:06:

    What a star, I am using hpricot to parse search results from ask.com and was getting exactly this error.

  2. Chris Papadopoulos says at 2007-12-11 20:12:

    I was getting the “ran out of buffer space” error from using Hpricot with a particular site and after consulting the Google I arrived here and found the easy fix. Thanks!

  3. coderrr says at 2008-03-09 13:03:

    this patch fixes hpricot to dynamically allocate more memory as needed, so you never get these errors…

    http://coderrr.wordpress.com/2007/09/14/hpricot-patch-to-support-arbitrarily-large-elements/

  4. Zeroday 01100100011010010 » Hpricot Workaround for ASPX viewstate says at 2008-11-28 23:11:

    [...] are various pages out there which detail the work around and the rumor is that the memory cap is to ensure that the script doesn’t end up consuming [...]

  5. ‘ran out of buffer space on element’ errors in Hpricot at Naofumi Kagami says at 2009-01-05 09:01:

    [...] problem, mentioned in this blog post, is that an ever increasing number of ASP.NET web sites have huge amounts of data in an HTML [...]
