====== SimpleCrawler - A Simple Web Crawler Library ======

SimpleCrawler (SC) makes basic [[http://en.wikipedia.org/wiki/Web_crawler|web crawling]] easy in Ruby. Use it as the foundation for your own crawling needs. SC is inspired by code in an article [[http://blog.netphase.com/2007/04/19/ruby-web-crawler/|by Scott Nedderman]] (which didn't work properly for me).

A minimal example that crawls a website and prints each page title:

<code ruby>
require 'rubygems'
require 'hpricot'
require 'simplecrawler'

# Set up a new crawler
sc = SimpleCrawler::Crawler.new("http://www.peterkrantz.com/")

# The crawler yields a Document object for each visited page.
sc.crawl do |document|
  # Parse the page with Hpricot and print its title, if it has one
  hdoc = Hpricot(document.data)
  title = hdoc.at("title")
  puts title.inner_html if title
end
</code>

===== Quickstart =====

Getting started with SC is easy.

  * [[Install|Installing the SimpleCrawler gem]]
  * [[principles|Basic principles]] in SC (the Document class)

===== Examples =====

  * [[examples:find-pdf-documents|Find all PDF documents on a website]]
  * [[examples:accessibility-report|Site accessibility report with Raakt and Ruport]]

===== Questions, Feedback, Bugs, Praise =====

To contact me, send email to peter.krantz@NODAMNSPAMgmail.com (start the subject line with "SimpleCrawler"), or use the [[http://rubyforge.org/tracker/?atid=16645&group_id=4318&func=browse|RubyForge issue tracker]] to report bugs and request features.