
Parsing ASP.NET sites with WWW::Mechanize and Hpricot

Users of Hpricot (which WWW::Mechanize uses as its default HTML parser) may have discovered that the buffer size for attribute values is set to 16384 bytes by default. Typically this isn’t a problem. I mean, who would put 16 KB of data into an HTML attribute? Well, ASP.NET uses a hidden input field to store view state in order to save a few clock cycles on the server side (and spare developers the hassle of coding view state handling themselves).

Developers often forget to turn off view state, resulting in a lot of data that is never used. Whoever decided to make this the default behaviour has probably caused a lot of unnecessary bytes to clog your internet connection (as the view state is typically included in each request).

If you are using mechanize and/or Hpricot to parse such a site, you may have come across this error:

ran out of buffer space on element <input>, starting on line 38. (Hpricot::ParseError)

If you want to try it out, load this sample viewstate file into Hpricot. The buffer space error has been reported in the Hpricot issue tracker.
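
Before touching any parser settings, it can help to check whether a page really carries an attribute larger than the 16384-byte default. The following is a minimal sketch in plain Ruby with no Hpricot involved. The method name `largest_attribute_bytes` and the sample markup are made up for illustration, and the regex only looks at double-quoted attribute values, so treat the result as an estimate rather than a real parse.

```ruby
# Rough check using only the Ruby standard library.
# A regex is not an HTML parser, so this is an estimate.
DEFAULT_HPRICOT_BUFFER = 16_384 # default size mentioned in the post

# Hypothetical helper: byte size of the longest double-quoted attribute value.
def largest_attribute_bytes(html)
  html.scan(/=\s*"([^"]*)"/).flatten.map(&:bytesize).max || 0
end

# Made-up page fragment with an oversized ASP.NET view state field.
sample = %(<input type="hidden" name="__VIEWSTATE" value="#{'A' * 20_000}" />)

size = largest_attribute_bytes(sample)
puts "largest attribute: #{size} bytes"
puts "exceeds the default buffer" if size > DEFAULT_HPRICOT_BUFFER
```

If the reported size is above the default, the page is a likely candidate for the buffer space error.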

Fortunately, from version 0.5 of Hpricot it is easy to increase the buffer size before loading data. This is done by setting the buffer_size attribute to a sufficiently large number:

[source:ruby]
require 'hpricot'
Hpricot.buffer_size = 262144
[/source]
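
How large is “sufficiently large”? One rough approach, assuming you know (or have measured) the biggest attribute value you expect, is to keep doubling the 16384-byte default until the value fits. The helper `suggested_buffer_size` below is hypothetical and not part of Hpricot. It only does the arithmetic.

```ruby
# Hypothetical helper, not part of Hpricot: double the default buffer size
# until it is strictly larger than the biggest attribute value expected.
def suggested_buffer_size(needed_bytes, default = 16_384)
  size = default
  size *= 2 while size <= needed_bytes
  size
end

puts suggested_buffer_size(200_000)
# The result could then be assigned to Hpricot.buffer_size before parsing.
```

For a view state of 200000 bytes this arrives at 262144, the same value used in the snippet above.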

Fixing Mechanize

As mechanize uses Hpricot as the default parser, this error will show up when loading many ASP.NET pages. Fortunately, mechanize allows the user to specify a custom parser class through the pluggable_parser attribute. To make mechanize use Hpricot with a larger buffer size:

[source:ruby]
require 'hpricot'
require 'mechanize'

Hpricot.buffer_size = 262144
agent = WWW::Mechanize.new
agent.pluggable_parser.default = Hpricot
agent.get('http://www.peterkrantz.com/wp-content/uploads/2007/02/viewstatesample.htm')
[/source]

…and we’re back on track mechanizing the world again.


Comment

  1. What a star, I am using hpricot to parse search results from ask.com and was getting exactly this error.

  2. I was getting the “ran out of buffer space” error from using Hpricot with a particular site and after consulting the Google I arrived here and found the easy fix. Thanks!
