Basic Principles of Using SimpleCrawler

Using SimpleCrawler involves the following steps:

1. Require the simplecrawler library

require 'simplecrawler'

2. Create an instance of the SimpleCrawler::Crawler object

Pass the website address as a parameter if you like.

require 'simplecrawler'
crawler = SimpleCrawler::Crawler.new("http://www.peterkrantz.com/")

3. Configure optional parameters

The main options are:

  • load_binary_data: Defaults to false. If true, the contents of binary files are loaded into the data property of the yielded Document object.
  • skip_patterns: An array of regular expression patterns that each URI is checked against. If any of the expressions match, the URI is skipped. E.g. to skip Microsoft Word documents based on the filename, use ["\\.doc$"] (please note that the dot has to be escaped with a double backslash).
  • include_patterns: An array of regular expression patterns that each URI is checked against. Every URI must match at least one of the expressions or it is skipped. E.g. to only include blog items from 2008, use ["\\/2008\\/"].
  • maxcount: The maximum number of pages to crawl. Defaults to unlimited.

require 'simplecrawler'
crawler = SimpleCrawler::Crawler.new("http://www.peterkrantz.com/")
crawler.skip_patterns = ["\\.doc$", "\\.pdf$", "\\.xls$", "\\.zip$"]
crawler.maxcount = 100
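
The remaining options can be set the same way. A minimal sketch of include_patterns and load_binary_data, assuming they are plain accessors like skip_patterns and maxcount (the pattern and URL here are only illustrative):

require 'simplecrawler'

crawler = SimpleCrawler::Crawler.new("http://www.peterkrantz.com/")
crawler.include_patterns = ["\\/2008\\/"]  # only follow URIs containing /2008/
crawler.load_binary_data = true            # binary file contents end up in document.data
crawler.maxcount = 100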

4. Call the crawl method and process the yielded Document object

require 'simplecrawler'
 
crawler = SimpleCrawler::Crawler.new("http://www.peterkrantz.com/")
crawler.skip_patterns = ["\\.doc$", "\\.pdf$", "\\.xls$", "\\.zip$"]
crawler.maxcount = 100
 
crawler.crawl { |document|
   puts document.uri
}

The yielded document is an instance of the Document class. It has the following properties:

  • uri: The URI of the crawled page.
  • fetched_at: The time when the document was fetched.
  • headers: A Hash containing the HTTP headers returned by the server.
  • http_status: An array containing the HTTP response code and message (e.g. "200" and "OK").
  • data: The data returned by the server. If the document is an HTML page, you can load the data into Hpricot for further processing (see the example below).
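
For example, the data property can be parsed with Hpricot to list the links on each crawled HTML page. A minimal sketch, assuming the hpricot gem is installed; the "content-type" header key follows Net::HTTP's downcased convention and may need adjusting:

require 'simplecrawler'
require 'hpricot'

crawler = SimpleCrawler::Crawler.new("http://www.peterkrantz.com/")
crawler.maxcount = 10

crawler.crawl { |document|
   # Skip anything the server did not return as HTML
   next unless document.headers["content-type"].to_s.include?("text/html")

   puts "#{document.uri} (#{document.http_status.join(' ')})"

   # Parse the page and print the href of every link
   Hpricot(document.data).search("a[@href]").each do |link|
      puts "  #{link.attributes['href']}"
   end
}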
 