SimpleCrawler - A Simple Web Crawler Library

With SimpleCrawler (SC), basic web crawling in Ruby becomes easy. Use SimpleCrawler as the foundation for your own crawling needs.

SC is inspired by code in an article by Scott Nedderman (which didn't work properly for me).

A minimal example (crawl a website and print page titles):

require 'rubygems'
require 'hpricot'
require 'simplecrawler'
 
# Set up a new crawler
sc = SimpleCrawler::Crawler.new("http://www.peterkrantz.com/")
 
# The crawler yields a Document object for each visited page.
sc.crawl { |document|
   # Parse page title with Hpricot and print it (skip pages without a title)
   hdoc = Hpricot(document.data)
   title = hdoc.at("title")
   puts title.inner_html if title
}
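
The block passed to crawl can also collect results instead of printing them. The following is only a rough sketch of that idea and runs without the crawler or Hpricot: FakeDocument and page_title are hypothetical names made up for illustration, and the only thing assumed about the real Document object is the data attribute used in the example above. A regular expression stands in for Hpricot here, which is fine for a quick title lookup but is not a real HTML parser.

```ruby
# Hypothetical stand-in for the Document object the crawler yields.
# Only the `data` attribute (the raw page body) comes from the example above.
FakeDocument = Struct.new(:data)

# Hypothetical helper: returns the page title, or nil if the page has none.
# Uses a regular expression instead of Hpricot so the sketch needs no gems.
def page_title(document)
  match = document.data.to_s.match(/<title[^>]*>(.*?)<\/title>/im)
  match && match[1].strip
end

# Sample pages standing in for what a crawl would yield.
pages = [
  FakeDocument.new("<html><head><title> Home </title></head></html>"),
  FakeDocument.new("<html><body>No title here</body></html>")
]

# Collect one title per page, with a placeholder for untitled pages.
titles = pages.map { |doc| page_title(doc) || "(untitled)" }
puts titles
```

In a real crawl, the same helper could be called inside the sc.crawl block, appending each result to an array instead of mapping over sample pages.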

Quickstart

Getting started with SC is easy.

Examples

Questions, Feedback, Bugs, Praise

To contact me, please send email to peter.krantz@NODAMNSPAMgmail.com (and start the subject line with “SimpleCrawler”) or use the Rubyforge issue tracker to report bugs and feature requests.

 
start.txt · Last modified: Y-m-d H:i by peterkz