A very simple web crawler that saves academic papers, built on the edu.stanford.pubcrawl.newcrawler web crawler. It can be invoked at the command line with the following command:
java edu.stanford.pubcrawl.newcrawler.papercrawler.PaperCrawler numThreads startURL [hostRestrictionSuffix]
This command creates a crawler with numThreads threads and starts crawling at startURL.
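For example, a hypothetical invocation (the start URL and host suffix below are placeholders, not real addresses) that runs four crawler threads and restricts the crawl to hosts under example.edu might look like:

java edu.stanford.pubcrawl.newcrawler.papercrawler.PaperCrawler 4 http://www.example.edu/papers/ example.edu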
If the optional third argument is specified, the crawl is restricted to hosts whose names end in the provided suffix; links to other hosts are ignored. During the crawl, files ending in ".ps" are prioritized for download and are saved to disk in a directory structure rooted at the current directory. A file called "pages.db" is created in the current directory to record the mapping between the original URL of each file and its new location on disk. The directory structure contains one base directory per crawler thread, so that each thread writes to its own base directory; within each base directory, files are saved in subdirectories two levels down, with at most 100 files per directory.
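The exact naming scheme is internal to PaperCrawler; the following is only a minimal sketch of how such a layout could be generated, assuming a sequential per-thread file counter and a fan-out of 100 at each level (the class and method names here are hypothetical, not part of the crawler's API):

import java.io.File;

// Sketch only (not the actual PaperCrawler code): maps the n-th file saved by a
// given thread to a path of the form <threadId>/<level1>/<level2>/<fileName>,
// assuming each leaf directory holds at most 100 files.
public class SavePathSketch {

    private static final int FILES_PER_DIR = 100;

    // Hypothetical helper: relative path for the fileIndex-th file (0-based)
    // downloaded by the thread with the given id.
    static File pathFor(int threadId, int fileIndex, String fileName) {
        int leaf = (fileIndex / FILES_PER_DIR) % FILES_PER_DIR;    // second-level directory
        int branch = fileIndex / (FILES_PER_DIR * FILES_PER_DIR);  // first-level directory
        return new File(String.format("%d/%d/%d/%s", threadId, branch, leaf, fileName));
    }

    public static void main(String[] args) {
        // e.g. the 12345th ".ps" file saved by thread 3
        System.out.println(pathFor(3, 12345, "paper12345.ps"));    // prints 3/1/23/paper12345.ps
    }
}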
See also the package documentation for the edu.stanford.pubcrawl.newcrawler package. Send all questions and comments to the author, Teg Grenager.