goc9000 / octa-format

Database format for storing crawling traces

OCTA Format

OCTA (Online Crawling Trace Archive) is a database format for storing the complete history ("trace") of Web crawling sessions.

The format is designed to allow online, concurrent operation by multiple parties while the crawler is still running. It also aims to be open, portable, extensible, and tolerant of incomplete data, and to support large binary content, compression, deduplication, and arbitrary annotations.

Details can be found in the Octa Format document.

License: MIT
