This is a simple tool for downloading the binary data referenced by Common Crawl indexes. It is forked from the CommonCrawlDocumentDownload project.
Adjust options by editing application.yml.
- Include or exclude specific file extensions for download
- Adjust download speed and location. When requests are sent too quickly, the Common Crawl server sometimes responds with a 503 error.
- Fetch the index and download the documents in one step, without running lookupURLs and downloadDocuments separately
- The log output is large and tedious to read. Unnecessary parts of the log will be removed soon
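The 503 responses mentioned in the list above are commonly handled by retrying with an increasing delay. The following is a minimal Python sketch of that idea, not the code this tool uses. The function name `fetch_with_backoff`, its parameters, and the `fetch` callable returning a `(status, body)` pair are all assumptions made for illustration.

```python
import time
from typing import Callable, Tuple


def fetch_with_backoff(
    fetch: Callable[[], Tuple[int, bytes]],
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call fetch() until it stops returning 503, doubling the wait each time.

    `fetch` is a hypothetical callable that performs one request and
    returns (http_status, body). `sleep` is injectable so the delays
    can be observed without actually waiting.
    """
    for attempt in range(max_retries + 1):
        status, body = fetch()
        if status != 503:
            if status >= 400:
                # Any other error is not retried in this sketch.
                raise RuntimeError(f"request failed with HTTP {status}")
            return body
        if attempt < max_retries:
            # Wait 1x, 2x, 4x, ... the base delay before the next attempt.
            sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"still receiving HTTP 503 after {max_retries} retries")
```

Lowering the download speed in application.yml reduces how often such retries are needed in the first place.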
cd CommonCrawlDocumentDownload
./gradlew check
./gradlew lookupURLs
Reads the current Common Crawl URL index data, extracts all URLs with
interesting MIME types or file extensions, and stores the URLs in a file
called commoncrawl-CC-MAIN-<year>-<crawl>.txt
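The filtering step described above can be sketched roughly as follows in Python. This is an illustration only, not the tool's implementation. It assumes each index line has the form `<urlkey> <timestamp> <json>` and that the JSON part carries `url` and `mime` fields; the function name `matching_urls` and its parameters are hypothetical.

```python
import json
from typing import Iterable, Iterator, Set, Tuple


def matching_urls(
    lines: Iterable[str],
    mime_types: Set[str],
    extensions: Tuple[str, ...],
) -> Iterator[str]:
    """Yield URLs whose MIME type or file extension is of interest.

    Assumes index lines look like "<urlkey> <timestamp> <json>".
    """
    for line in lines:
        # Split off the JSON part; it may itself contain spaces.
        parts = line.strip().split(" ", 2)
        if len(parts) < 3:
            continue  # not an index record
        try:
            record = json.loads(parts[2])
        except json.JSONDecodeError:
            continue  # skip malformed records
        url = record.get("url", "")
        # Compare the extension against the path only, ignoring any
        # query string or fragment, and ignoring case.
        path = url.split("?", 1)[0].split("#", 1)[0].lower()
        if record.get("mime") in mime_types or path.endswith(extensions):
            yield url
```

A call such as `matching_urls(lines, {"application/pdf"}, (".doc", ".xls"))` would keep PDF records by MIME type and office documents by extension.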
The crawl to read can be selected with the key option:
./gradlew lookupURLs -Pkey='YYYY-NN'
The default key is '2023-14'. See the Common Crawl list of crawls for the available keys.
./gradlew downloadDocuments
Uses the URLs listed in commoncrawl-CC-MAIN-<year>-<crawl>.txt to
download the documents from the Common Crawl.
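A single document is typically fetched from the Common Crawl with an HTTP range request against the archive file that contains it. The Python sketch below shows the idea under stated assumptions: the index record supplies `filename`, `offset`, and `length`, the archives are served from `https://data.commoncrawl.org/`, and each record is an individually gzip-compressed member. The function names are hypothetical and this is not the code behind the Gradle task.

```python
import gzip
import urllib.request

# Assumed public endpoint for Common Crawl archive files.
BASE_URL = "https://data.commoncrawl.org/"


def build_range_request(filename: str, offset: int, length: int) -> urllib.request.Request:
    """Build a request for exactly one record inside an archive file."""
    last_byte = offset + length - 1  # HTTP byte ranges are inclusive
    return urllib.request.Request(
        BASE_URL + filename,
        headers={"Range": f"bytes={offset}-{last_byte}"},
    )


def fetch_record(filename: str, offset: int, length: int) -> bytes:
    """Download one record and return it decompressed.

    The result still contains the archive record headers and the HTTP
    response headers in front of the document body.
    """
    request = build_range_request(filename, offset, length)
    with urllib.request.urlopen(request, timeout=60) as response:
        return gzip.decompress(response.read())
```

Requesting only the needed byte range avoids downloading whole archive files, which are far larger than any single document.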
./gradlew deduplicate
Some files have identical content. This task detects them based on file size and content hash and moves all duplicates to a backup directory, leaving only unique files in place.
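The size-then-hash approach described above can be sketched in Python as follows. It is a minimal illustration rather than the task's actual code: the function names, the choice of SHA-256, and the flat directory layout are assumptions.

```python
import hashlib
import shutil
from collections import defaultdict
from pathlib import Path
from typing import Dict, List, Set


def file_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def deduplicate(source_dir: Path, backup_dir: Path) -> List[str]:
    """Move duplicate files from source_dir to backup_dir.

    Returns the names of the files that were moved.
    """
    # Group by size first: files of different sizes cannot be equal,
    # so hashing is only needed within a group.
    by_size: Dict[int, List[Path]] = defaultdict(list)
    for path in sorted(source_dir.iterdir()):
        if path.is_file():
            by_size[path.stat().st_size].append(path)

    moved: List[str] = []
    for candidates in by_size.values():
        if len(candidates) < 2:
            continue
        seen: Set[str] = set()
        for path in candidates:
            content_hash = file_hash(path)
            if content_hash in seen:
                # Same size and same hash: treat as a duplicate.
                backup_dir.mkdir(parents=True, exist_ok=True)
                shutil.move(str(path), str(backup_dir / path.name))
                moved.append(path.name)
            else:
                seen.add(content_hash)
    return moved
```

Comparing sizes before hashing keeps the work small, since most files have a unique size and never need to be read.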
- common-crawl-download is licensed under the BSD 2-Clause License.