Basic Principles of Using SimpleCrawler

Using SimpleCrawler involves the following steps:

1. Require the simplecrawler library

require 'simplecrawler'

2. Create an instance of the SimpleCrawler::Crawler object

Optionally, pass the address of the website to crawl as a parameter.

require 'simplecrawler'
crawler = SimpleCrawler::Crawler.new("http://www.peterkrantz.com/")

3. Configure optional parameters

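The main options are skip_patterns (patterns for URLs the crawler should skip) and maxcount (an upper limit on the number of pages crawled):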

require 'simplecrawler'
crawler = SimpleCrawler::Crawler.new("http://www.peterkrantz.com/")
crawler.skip_patterns = ["\\.doc$", "\\.pdf$", "\\.xls$", "\\.zip$"]
crawler.maxcount = 100
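
To see which addresses such patterns would exclude, the following is a minimal sketch in plain Ruby that does not use SimpleCrawler at all. It assumes that each skip pattern is treated as a regular expression and tested against the full URL; the skipped? helper, the SKIP_PATTERNS constant and the sample file paths are made up for illustration and are not part of the library.

```ruby
# Patterns from the configuration above, each listed once.
SKIP_PATTERNS = ["\\.doc$", "\\.pdf$", "\\.xls$", "\\.zip$"]

# Hypothetical helper (not part of SimpleCrawler): true when any pattern
# matches the URL. Assumption: each pattern string is compiled as a
# regular expression and matched against the whole URL.
def skipped?(url, patterns)
  patterns.any? { |pattern| url =~ Regexp.new(pattern) }
end

# Made-up sample addresses, used only to exercise the helper.
sample_urls = [
  "http://www.peterkrantz.com/",
  "http://www.peterkrantz.com/files/report.pdf",
  "http://www.peterkrantz.com/files/report.pdf?download=1"
]

sample_urls.each do |url|
  puts "#{url} skipped: #{skipped?(url, SKIP_PATTERNS)}"
end
```

Because each pattern ends with $, it only matches when the extension is the very last part of the URL, so an address with a query string after the file name would not be matched under this assumption.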

4. Call the crawl method and process each yielded Document object

require 'simplecrawler'

crawler = SimpleCrawler::Crawler.new("http://www.peterkrantz.com/")
crawler.skip_patterns = ["\\.doc$", "\\.pdf$", "\\.xls$", "\\.zip$"]
crawler.maxcount = 100

# The block runs once for each document the crawler yields
crawler.crawl do |document|
  puts document.uri
end
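
The block passed to crawl can do more than print. As a rough sketch, the processing can be wrapped in a method that collects every URI into an array. The collect_uris method is a hypothetical helper, and FakeCrawler and FakeDocument are stand-ins invented here so the example runs without network access; they only mimic the crawl and uri calls shown above and are not part of SimpleCrawler.

```ruby
# Hypothetical helper: gathers the URI of every yielded document.
# `crawler` can be any object whose #crawl yields objects responding to #uri.
def collect_uris(crawler)
  uris = []
  crawler.crawl { |document| uris << document.uri.to_s }
  uris
end

# Stand-ins for illustration only (not SimpleCrawler classes).
FakeDocument = Struct.new(:uri)

class FakeCrawler
  def initialize(uris)
    @uris = uris
  end

  # Yields one FakeDocument per stored URI, mimicking the crawl call above.
  def crawl
    @uris.each { |uri| yield FakeDocument.new(uri) }
  end
end

fake = FakeCrawler.new(["http://example.com/", "http://example.com/about"])
puts collect_uris(fake).inspect
```

With the real library, the same method could presumably be given the crawler object configured in the steps above in place of the stand-in.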

The yielded document is an instance of the Document class. It has the following properties: