NicolaasJKotze / Data-cleaning

This represents data extracted according to this format https://github.com/South-Africa-Government-Procurement/project-docs/wiki/Data-models-and-standards#abstract-records-of-amounts

Home Page:https://join.keepthereceipts.org.za/

Procurement Data Cleaning

Thanks for offering to help out! For more info, visit http://join.keepthereceipts.org.za/ or join the #keep-the-receipts Slack channel on https://zatech.co.za/

Basic steps to get started:

  • Download and install Tabula: https://tabula.technology/
  • Download a copy of the PDF from the GitHub issue that you will be processing.
  • Load the PDF into Tabula.
  • Highlight/select the tables in Tabula, export to CSV.
  • Open the CSV files in a spreadsheet app (Excel / Google Sheets / LibreOffice)
  • Examine the CSV, and make any adjustments:
      - DON'T fix any spelling mistakes or typos; these should match the original document as closely as possible.
      - DON'T remove the headings for the table.
      - DO remove any totals rows; we are interested in individual line items, not totals.
      - DO remove any empty lines that aren't needed.
      - DO make sure that everything that is in one row on the PDF is one row on the CSV. More info here: keep-the-receipts#104 (comment)
  • Save the resulting CSV file, which you will use for creating the Pull Request. Use the same name as the source PDF file for the CSV (naturally replacing the .pdf extension with .csv).
  • Raise a PR using the GitHub UI.
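The two "DO remove" adjustments and the file-naming rule above can be sketched in Python for anyone who prefers a script to a spreadsheet. This is only an illustration, not part of the project's tooling: the function names are hypothetical, and the rule for spotting a totals row (first non-empty cell starts with "total") is an assumption that will not fit every document. The first remaining row is assumed to be the table heading and is always kept, and no cell text is altered, so typos from the PDF survive.

```python
import csv
import io
from pathlib import Path
from typing import Iterable, List


def csv_name_for(pdf_name: str) -> str:
    """Return the CSV file name matching a source PDF (same name, .csv extension)."""
    return str(Path(pdf_name).with_suffix(".csv"))


def is_empty(row: List[str]) -> bool:
    """True when every cell in the row is blank."""
    return all(cell.strip() == "" for cell in row)


def looks_like_totals(row: List[str]) -> bool:
    """Assumed heuristic: the first non-empty cell starts with 'total'."""
    for cell in row:
        text = cell.strip()
        if text:
            return text.lower().startswith("total")
    return False


def clean_rows(rows: Iterable[List[str]]) -> List[List[str]]:
    """Drop empty rows and totals rows; keep the heading and all cell text as-is."""
    kept = [row for row in rows if not is_empty(row)]
    if not kept:
        return []
    header, body = kept[0], kept[1:]  # assumption: first remaining row is the heading
    return [header] + [row for row in body if not looks_like_totals(row)]


def clean_csv_text(text: str) -> str:
    """Apply clean_rows to CSV text exported from Tabula and return CSV text."""
    rows = csv.reader(io.StringIO(text, newline=""))
    out = io.StringIO(newline="")
    csv.writer(out, lineterminator="\n").writerows(clean_rows(rows))
    return out.getvalue()
```

Because the totals check is a guess, a line item whose first cell happens to begin with "Total" would also be dropped, so the result still needs to be compared against the PDF by eye.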

If you're already familiar with Git, some extra tips:

F.A.Q.:

  1. Should we put different tables into different CSVs?

If the column headings are the same, they can be in one CSV. But if the tables relate to different departments or entities, add a column to specify who that part of the CSV relates to. If the column headings are different, they should be separate CSVs.
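As a rough sketch of that rule, the snippet below combines several tables into one only when their headings match, and adds a leading column naming the department or entity. The function name and the "Entity" column heading are hypothetical choices for illustration; the project does not prescribe them here.

```python
from typing import Dict, List


def combine_tables(tables: Dict[str, List[List[str]]]) -> List[List[str]]:
    """Merge tables keyed by entity name into one table with an extra first column.

    Each table is a list of rows whose first row is the heading. Tables with
    different headings are rejected, since they belong in separate CSVs.
    """
    combined: List[List[str]] = []
    expected_header = None
    for entity, rows in tables.items():
        if not rows:
            continue  # nothing to add for an empty table
        header, body = rows[0], rows[1:]
        if expected_header is None:
            expected_header = header
            combined.append(["Entity"] + header)  # hypothetical column name
        elif header != expected_header:
            raise ValueError(
                f"Headings for {entity!r} differ; save this table as a separate CSV"
            )
        combined.extend([entity] + row for row in body)
    return combined
```

Raising an error on mismatched headings, rather than silently padding columns, keeps the decision about splitting files with the person doing the cleaning.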

  2. What should I do if there are merged cells that should be split?

Instructions for managing merged cells are here: keep-the-receipts#119 (comment)

  3. What should I do if Tabula splits cells that should be on a single row?

Instructions for this are here: keep-the-receipts#104 (comment)

  4. Should I do 'pass 2' of a file?

Preferably do pass 1 first, so that we can easily keep track and ensure that every file gets a first pass. Once pass 1 is done, you can do pass 2, unless it was you who did pass 1. The idea behind two passes is to identify errors by looking at the differences between passes done by two different people.
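One possible way to look at the differences between two passes is a line-by-line diff of the two CSV files. The sketch below uses Python's standard `difflib` module; the function name and the `pass1.csv` / `pass2.csv` labels are placeholders for illustration, not names used by the project.

```python
import difflib
from typing import List


def diff_passes(pass1_text: str, pass2_text: str) -> List[str]:
    """Return unified-diff lines between two passes; an empty list means they match."""
    return list(
        difflib.unified_diff(
            pass1_text.splitlines(),
            pass2_text.splitlines(),
            fromfile="pass1.csv",  # placeholder label
            tofile="pass2.csv",    # placeholder label
            lineterm="",
        )
    )
```

Lines starting with `-` appear only in the first pass and lines starting with `+` only in the second, so each such pair points at a row worth checking against the PDF.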
