Find all PDF documents on a site

This example shows how SimpleCrawler can be used to find documents of a specific type on a website. The site given as the command line argument is crawled, and the URIs of all documents with the content type “application/pdf” are printed.

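require 'simplecrawler'

# Set up a new crawler for the site given on the command line
sc = SimpleCrawler::Crawler.new(ARGV[0])
sc.maxcount = 200 # Only crawl 200 pages

# Print the URI of every crawled document that is served as a PDF
sc.crawl { |document|
   # The header may be missing or carry parameters such as "; charset=..."
   content_type = document.headers["content-type"].to_s
   if content_type.start_with?("application/pdf")
      puts document.uri
   end
}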
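
Servers do not always send the content type as the bare string “application/pdf”: the value can differ in case or carry parameters (for example “application/pdf; charset=binary”), and the header can be absent altogether. The following is a minimal sketch of a more tolerant check. The helper name pdf_content_type? is hypothetical and not part of SimpleCrawler; it only works on the header string and makes no assumptions about the crawler itself.

```ruby
# Hypothetical helper (not part of SimpleCrawler): returns true when a
# Content-Type header value denotes a PDF, ignoring case and parameters.
def pdf_content_type?(header_value)
  # nil.to_s is "", so a missing header simply yields false
  media_type = header_value.to_s.split(";").first.to_s.strip.downcase
  media_type == "application/pdf"
end

puts pdf_content_type?("application/pdf")
puts pdf_content_type?("Application/PDF; charset=binary")
puts pdf_content_type?("text/html; charset=utf-8")
puts pdf_content_type?(nil)
```

Inside the crawl block, the condition could then be written as pdf_content_type?(document.headers["content-type"]), assuming the headers are available as a hash of strings as in the example above.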
 
examples/find-pdf-documents.txt · Last modified by peterkz