Data.World

SLACK CHANNEL: TBD

Project Description:

The idea is a shareable staging area for wrangling data before transfer to an open data portal (e.g., https://data.world). The repo stores raw data and wrangling scripts. Citizen Labs members can push raw data into the repo, and authorized members can write scripts to prepare data for transfer to an open data portal.

With some effort, we can create a repository of data and scripts to facilitate future data wrangling and help keep our open data manageable.

Quick Start

Curator Process, starts in raw-data folder
Developer Process
Maintainer Process

Definitions

Comma Separated Values (CSV). CSV is a plain-text format in which each line is a row and the values in a row are separated by commas.
GitHub is a hosting service for Git repositories, which we use for versioning files.
Open Data
Open Data Portal is a technology that facilitates sharing datasets via an application programming interface (API).

Citizen Labs Members

Members and areas of responsibility.

Name	Role	Raw-data	Clean-data	Scripts	Open Data
Cara D	Curator	X
Eileen B	Curator	X
James W	Developer		X	X
Jace B	Maintainer			X	X

Member Roles & Responsibilities

Declares the duties of members.

Curator	Developer	Maintainer
Curates dataset(s)	Writes/Updates scripts	Puts data into production
Loads raw dataset to GitHub	Tests dataset load	Removes data from production
Creates Git pull request	Maintains Git scripts folder
Signs off on Prod dataset load	Maintains clean-data folder	Maintains the Git master branch
	Creates clean-data set	Maintains the Prod Environment
	Maintains Dev Environment
	Creates Git pull request
	Adds raw-data sub-folders

Member Provisioning

Role	CL Membership	GHF Account	DDW Account	CLDW Account	Jupyter Notebook
Curator	X	X
Developer	X	X	X		X
Maintainer	X	X		X	X

GHF Account is GitHub Free Account
CL Membership is Citizen Labs Membership
DDW Account is Development Data.World Account. This is the Developer's personal account.
CLDW Account is Citizen Labs Data.World Account
X is a link to more info or install instructions

Applications & Datasets

Every dataset in the citizenlabs/data.world repo is used by at least one application. For instance, the Grand River Drains data (gr_drains.csv) is used by the "Adopt a Drain" application.

Application	Dataset
Adopt a Drain	gr_drains

Raw-data

Raw data (raw-data) is the original data provided by a client (the Curator).
Raw data is expected to be text organized in rows and columns, with values separated by commas. The first row must list the column names. The Curator is responsible for putting raw data into the raw-data folder.
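
As a rough illustration of the expectations above, a Curator or Developer could check a file before it goes into the raw-data folder. This is a minimal sketch using only Python's standard library; the function name `validate_raw_csv` and the sample column names are hypothetical and are not part of this repo.

```python
import csv
import io


def validate_raw_csv(text):
    """Return a list of problems found in raw CSV text (empty list means OK)."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]

    header = rows[0]
    problems = []
    # The first row must name every column.
    if any(not name.strip() for name in header):
        problems.append("header row has an empty column name")
    if len(set(header)) != len(header):
        problems.append("header row has duplicate column names")

    # Every data row should have as many values as the header has columns.
    for row_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(
                f"row {row_no} has {len(row)} values, expected {len(header)}"
            )
    return problems


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    print(validate_raw_csv("id,name\n1,Main St\n2\n"))
```

An empty result means the text has a usable header row and consistently sized rows; it says nothing about whether the values themselves are correct.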

More info about providing raw-data can be found here.

Clean-data

Raw data doesn't always conform to the expectations of an application.
Cleaning includes renaming columns, enforcing types, formatting dates, and removing duplicate rows and outliers.
Clean data is stored in the clean-data folder. The Developer is responsible for cleaning raw data.
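
The cleaning steps listed above might look something like the following. This is a sketch under stated assumptions, not the repo's actual script: the column names in `RENAME`, the `%m/%d/%Y` raw date format, and the `max_depth_ft` outlier threshold are all hypothetical, and rows that cannot be coerced are simply skipped.

```python
import csv
import io
from datetime import datetime

# Hypothetical mapping from raw column names to the names an application expects.
RENAME = {"ID": "id", "Installed": "installed_on", "Depth (ft)": "depth_ft"}


def clean_rows(raw_text, max_depth_ft=100.0):
    """Yield cleaned rows (dicts) from raw CSV text."""
    seen = set()
    for raw in csv.DictReader(io.StringIO(raw_text)):
        # Skip ragged rows (too few or too many values for the header).
        if None in raw or None in raw.values():
            continue
        # Rename columns.
        row = {RENAME.get(key, key): value.strip() for key, value in raw.items()}
        try:
            # Enforce types.
            row["id"] = int(row["id"])
            row["depth_ft"] = float(row["depth_ft"])
            # Format dates as ISO 8601 (assumes raw dates look like 3/14/2019).
            parsed = datetime.strptime(row["installed_on"], "%m/%d/%Y")
            row["installed_on"] = parsed.date().isoformat()
        except (KeyError, ValueError):
            continue  # skip rows that cannot be coerced
        # Remove outliers (assumed valid range: 0 to max_depth_ft).
        if not 0 <= row["depth_ft"] <= max_depth_ft:
            continue
        # Remove duplicate rows.
        key = tuple(row.values())
        if key in seen:
            continue
        seen.add(key)
        yield row
```

A real script would likely log or report the skipped rows rather than drop them silently, so the Curator can correct the raw data.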

More info about clean-data can be found here.

Scripts

Scripts encapsulate the steps to process raw-data into clean-data and finally move data to https://data.world. Each dataset in the raw-data folder has a script to control how it is cleaned and moved into the test or production environments.

We use Python in the form of Jupyter Notebooks to build, document, and run our scripts.
The Developer is responsible for writing and testing scripts to clean the raw data.
The Maintainer is responsible for running scripts that create or refresh data used by applications.
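
To make the shape of a script concrete, here is a minimal sketch of one run: read from raw-data, write to clean-data, then hand the result to a publish step. The function `run_script`, the `<dataset>.csv` file layout, and the `clean` and `publish` callables are assumptions made for illustration. No data.world API call is shown; `publish` stands in for whatever upload step the Developer (test) or Maintainer (production) uses.

```python
from pathlib import Path


def run_script(dataset, repo_root, clean, publish=None):
    """Clean raw-data/<dataset>.csv into clean-data/<dataset>.csv, then optionally publish."""
    repo_root = Path(repo_root)
    raw_path = repo_root / "raw-data" / f"{dataset}.csv"
    clean_path = repo_root / "clean-data" / f"{dataset}.csv"

    # `clean` is any function that takes raw CSV text and returns clean CSV text.
    clean_text = clean(raw_path.read_text(encoding="utf-8"))
    clean_path.parent.mkdir(parents=True, exist_ok=True)
    clean_path.write_text(clean_text, encoding="utf-8")

    # Publishing is passed in so the same script can target test or production.
    if publish is not None:
        publish(clean_path)
    return clean_path
```

Passing the publish step in as a parameter keeps the cleaning logic identical between the test and production runs; only the target changes.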

More info about scripts can be found here.

Local		GitHub		Data.World.Test		Data.World.Prod
Curator	>	raw-data
Developer	<	raw-data
	>	clean-data	>	test-data
Maintainer	<	clean-data
	>				>	open-data
App(s)	<				<	open-data

Process Overview

This is a best-case scenario with no failures. Use it as a guide to completing the process successfully.

About

Work with data sets prior to uploading to data.world


GNU General Public License v3.0
