DataRescue Workflow -- Overview

This document describes the workflow our coalition uses for the Data Rescue project, both at in-person events and when people work remotely. It explains the process that a URL/dataset goes through from the time it has been identified by a Seeder/Sorter as "uncrawlable" until it is made available as a record in the datarefuge.org CKAN data catalog. The process involves several distinct stages and is designed to maximize smooth hand-offs, so that each phase is handled by someone with expertise in the area they are tackling, while the data is tracked securely throughout.

Before you begin

We are so glad that you are participating in this project!

  • If you are an event organizer: learn about what you need to do to prepare the event.
  • If you are a regular participant: get a role assignment (e.g., Seeder, or Harvester), get account credentials needed for your role, and go over the workflow corresponding to your role.

Plan Overview

Seeders and Sorters canvass the resources of a given government agency, identifying important URLs. They determine whether those URLs can be crawled by the Internet Archive's web crawler. If the URLs are crawlable, the Seeders/Sorters nominate them to the End-of-Term (EOT) project; otherwise, they add them to the Uncrawlable spreadsheet using the project's Chrome Extension.
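
The crawlability decision is made through the project's Chrome Extension and human judgment, but the sketch below gives a rough, hypothetical sense of what "crawlable" means in practice: it checks whether a site's robots.txt permits crawling and whether the Internet Archive's Wayback Machine already has a snapshot of the URL. The helper names and the example URL are ours, not part of the project tooling.

```python
# Hypothetical illustration only -- not the project's Chrome Extension logic.
# Two quick checks that approximate "is this URL crawlable / already archived?":
#   1. does robots.txt allow a generic crawler to fetch the URL?
#   2. does the Wayback Machine already hold a snapshot of it?
import urllib.robotparser
from typing import Optional
from urllib.parse import urlparse

import requests  # assumed third-party dependency


def robots_allows(url: str, user_agent: str = "*") -> bool:
    """Return True if the site's robots.txt permits crawling this URL."""
    parts = urlparse(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, url)


def wayback_snapshot(url: str) -> Optional[str]:
    """Return the closest Wayback Machine snapshot URL, if one exists."""
    resp = requests.get(
        "https://archive.org/wayback/available", params={"url": url}, timeout=30
    )
    resp.raise_for_status()
    closest = resp.json().get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest else None


if __name__ == "__main__":
    url = "https://www.example.gov/some/dataset"  # placeholder URL
    print("robots.txt allows crawl:", robots_allows(url))
    print("existing Wayback snapshot:", wayback_snapshot(url))
```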

Researchers inspect the "uncrawlable" list to confirm that the Seeders' assessments were correct (that is, that the URL/dataset is indeed uncrawlable). Research.md describes this process in more detail.

Often this step is incorporated into either "Seeding and Sorting" or "Harvesting".

Harvesters take the "uncrawlable" data and figure out how to capture it. This is a complex task that can require substantial technical expertise, and different datasets call for different techniques. Harvesters should see the included Harvesting Toolkit for more details and tools.
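
Harvesting techniques vary widely, but many harvests come down to scripting bulk downloads of files that a crawler cannot discover on its own. The sketch below is not part of the Harvesting Toolkit; it is a generic, hypothetical illustration (with a placeholder URL and directory name) of fetching a list of direct file links into a local staging directory.

```python
# Hypothetical harvesting helper: given a list of direct file URLs (for example,
# collected from an FTP index or a JavaScript-driven data portal), download each
# file into a local staging directory. Real harvests usually need site-specific
# logic; this only shows the general shape of the task.
import os
from urllib.parse import urlparse

import requests  # assumed third-party dependency


def harvest(file_urls, dest_dir="harvest_staging"):
    os.makedirs(dest_dir, exist_ok=True)
    for url in file_urls:
        filename = os.path.basename(urlparse(url).path) or "index.html"
        target = os.path.join(dest_dir, filename)
        with requests.get(url, stream=True, timeout=60) as resp:
            resp.raise_for_status()
            with open(target, "wb") as fh:
                for chunk in resp.iter_content(chunk_size=1 << 20):
                    fh.write(chunk)
        print(f"saved {url} -> {target}")


if __name__ == "__main__":
    harvest(["https://www.example.gov/data/observations_2016.csv"])  # placeholder
```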

Checkers inspect a harvested dataset and make sure that it is complete. The main question the Checkers need to answer is "Will the bag make sense to a scientist?" Checkers need an in-depth understanding of harvesting goals and of the potential content variations across datasets. They then:

  • Do quality assurance on the dataset to make sure the content is correct and corresponds to what was described in the spreadsheet.
  • Package the data into a BagIt file (or "bag"), which includes basic technical metadata, and upload it to the final DataRefuge destination (see the sketch after this list).
  • Create a CKAN record for this S3 resource.
  • Link the bag to the record and make it public.
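
The following is a minimal, hypothetical sketch of the packaging and publishing steps above, using the python bagit library, boto3 for S3, and the CKAN action API. The bucket name, CKAN API key, and dataset metadata are placeholders, not the project's real configuration.

```python
# Hypothetical sketch of the Checker packaging steps: create a BagIt bag, zip it,
# upload it to an S3 bucket, then register a public CKAN record pointing at the
# uploaded bag. All names and keys below are placeholders.
import shutil

import bagit     # pip install bagit
import boto3     # pip install boto3
import requests

DATASET_DIR = "harvest_staging"           # directory produced by the Harvester
BUCKET = "datarefuge-example-bucket"      # placeholder bucket name
CKAN_URL = "https://www.datarefuge.org"   # CKAN catalog
CKAN_API_KEY = "REPLACE_ME"               # placeholder API key

# 1. Turn the checked dataset directory into a BagIt bag (adds manifests,
#    checksums, and basic technical metadata).
bag = bagit.make_bag(DATASET_DIR, {"Contact-Name": "Data Rescue Checker"})
assert bag.is_valid()

# 2. Zip the bag and upload it to S3.
archive = shutil.make_archive("dataset-bag", "zip", DATASET_DIR)
s3 = boto3.client("s3")
s3.upload_file(archive, BUCKET, "bags/dataset-bag.zip")
bag_url = f"https://{BUCKET}.s3.amazonaws.com/bags/dataset-bag.zip"

# 3. Create a CKAN record that links to the bag on S3.
resp = requests.post(
    f"{CKAN_URL}/api/3/action/package_create",
    headers={"Authorization": CKAN_API_KEY},
    json={
        "name": "example-rescued-dataset",   # placeholder dataset slug
        "title": "Example Rescued Dataset",
        "notes": "Harvested and bagged by the Data Rescue workflow.",
        "resources": [{"url": bag_url, "format": "ZIP", "name": "BagIt bag"}],
    },
    timeout=60,
)
resp.raise_for_status()
print("CKAN record created:", resp.json()["result"]["name"])
```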

Partners

Data Rescue is a broad, grassroots effort with support from numerous local and nationwide networks. Thanks particularly to EDGI and Data Refuge for their leadership, and to our numerous supporters for their hard work.

About

License: Creative Commons Attribution Share Alike 4.0 International


Languages

Python 97.8%, Shell 2.2%