
test-ODP

This repo illustrates the production of ODP-format datasets in GitHub through automatically triggered workflows/actions.

Source material (dataset + code): https://github.com/ceperezegma/ODP_datasets_distributions.

objective: Automatically run the production script at 8:00am every morning.

constraints:

  1. input datasets are stored in a Google spreadsheet,
  2. input datasets are fetched through the Google API using service authentication,
  3. operations are run from a Jupyter notebook (rather than from a script or a library): ODP_datasets_formats_update.ipynb

resolution:

  • the updated notebook ODP_datasets_formats_update.ipynb circumvents constraint #2 by accessing and fetching the data from the Google spreadsheet without the need to authenticate:
    • the function gsheet_into_df is defined in place of set_google_drive_access_scope, access_google_sheet and last_processed_date, so as to avoid the Google API and service authentication:
      import datetime
      from io import BytesIO

      import pandas as pd
      import requests

      def gsheet_into_df(urlbase, sheet_key, worksheet_id, file, first_column, last_column):
          """Fetch the not-yet-processed rows of a public Google spreadsheet
          as a pandas DataFrame, without any authentication."""
          url_domain = '%s/ccc?key=%s&gid=%s&output=csv' % (urlbase, sheet_key, worksheet_id)

          # Work out where to resume reading from the log of already-processed dates.
          try:
              date_parser = lambda x: datetime.datetime.strptime(x, "%d/%m/%Y")
              last_proc_dates = pd.read_csv(file, header=0, parse_dates=[0], date_parser=date_parser)
          except FileNotFoundError:
              print('File %s not found -- All data will be loaded' % file)
              next_row_to_read = 0
          else:
              next_row_to_read = last_proc_dates.shape[0] + 2

          # Fetch the first (timestamp) column only, to count the rows in the sheet.
          url = '%s&range=A:A' % url_domain
          try:
              r = requests.get(url)
          except requests.exceptions.RequestException:
              raise IOError('URL %s not reached' % url)

          try:
              date_parser = lambda x: datetime.datetime.strptime(x, "%m/%d/%Y %H:%M:%S")
              dates = pd.read_csv(BytesIO(r.content), parse_dates=['1'], date_parser=date_parser)
          except Exception:
              raise IOError('Data not loaded')
          else:
              row_total = len(dates)

          # No rows have been added since the last run.
          if row_total < next_row_to_read:
              return None

          # Fetch only the new rows, restricted to the relevant columns.
          cell_range = '%s%s:%s%s' % (first_column, next_row_to_read, last_column, row_total)
          url = '%s&range=%s' % (url_domain, cell_range)
          try:
              r = requests.get(url)
          except requests.exceptions.RequestException:
              raise IOError('URL %s not reached' % url)
          else:
              df = pd.read_csv(BytesIO(r.content), header=None)

          return df
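    • for illustration, the function might be invoked as below; all concrete values are assumptions, not the repository's actual configuration:
      # Illustrative call only: URL base, spreadsheet key, worksheet gid,
      # log file and column bounds are all assumed values.
      URLBASE = 'https://docs.google.com/spreadsheet'  # legacy CSV-export endpoint
      SHEET_KEY = '<spreadsheet-key>'                  # key of the public source spreadsheet
      WORKSHEET_ID = '0'                               # gid of the worksheet to read

      df = gsheet_into_df(URLBASE, SHEET_KEY, WORKSHEET_ID,
                          'datasets_formats_processed.csv',  # assumed log of processed rows
                          'A', 'H')                          # assumed first/last columns
      if df is None:
          print('No new rows since the last run')
      else:
          print('%d new rows fetched' % len(df))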
    • packages gspread and oauth2client are discarded since no authentication is required; requests objects are used instead (see the sketch below),
    • package tqdm is also discarded for compatibility with the workflow,
    • references to local data/variables (e.g., local paths, local credentials, etc.) are discarded,
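    • for contrast, the removed service-authenticated access typically looks like the minimal sketch below (the credentials file and spreadsheet key are hypothetical; the original helper functions are not reproduced here):
      # Sketch of the discarded, authenticated approach with gspread/oauth2client.
      import gspread
      from oauth2client.service_account import ServiceAccountCredentials

      scope = ['https://spreadsheets.google.com/feeds',
               'https://www.googleapis.com/auth/drive']
      # 'credentials.json' is a hypothetical service-account key file.
      creds = ServiceAccountCredentials.from_json_keyfile_name('credentials.json', scope)
      client = gspread.authorize(creds)
      worksheet = client.open_by_key('<spreadsheet-key>').get_worksheet(0)
      rows = worksheet.get_all_values()  # list of lists, one entry per spreadsheet row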
  • the action configuration file automated.yml is used to automatically run the production of ODP (a sketch of the fully assembled file follows this list):
    • the repository is checked out and Python is set up:
        - uses: actions/checkout@v2
          with:
            persist-credentials: false
            fetch-depth: 0
        - uses: actions/setup-python@v2
          with:
            python-version: ${{ matrix.python-version }}
    • the environment packages are installed:
        run: |
          python -m pip install --upgrade pip
          if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
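    • presumably, requirements.txt lists at least the packages imported by the refactored notebook; the pin-free listing below is an assumption:
        # Assumed contents of requirements.txt (gspread, oauth2client and tqdm
        # are no longer needed; exact version pins are unknown).
        pandas
        requests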
    • nbconvert is installed:
        run: |
          python -m pip install --upgrade pip
          pip install nbconvert
    • the notebook ODP_datasets_formats_update.ipynb is converted to a Python script:
        run: |
          jupyter nbconvert --to script ODP_datasets_formats_update.ipynb
    • the generated Python script ODP_datasets_formats_update.py is run:
        run: |
          python ODP_datasets_formats_update.py
    • the processed output is committed back to the repository:
        uses: EndBug/add-and-commit@v4
        with:
          message: "Automated data fetching"
          add: 'datasets_formats_processed.csv'
          force: true
    • the jobs above are scheduled to run every day at 7:00am UTC (i.e., 8:00am CET):
       schedule:
         - cron: "0 7 * * *"
      or are automatically triggered upon a change of the source notebook:
       push:
         paths:
           - '**.ipynb'
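Assembled, automated.yml might look like the sketch below; the workflow name, job and step names, runner, and Python version matrix are assumptions, while the individual steps are those shown above:

    # Sketch only: names, runner and matrix values are assumed, not taken from the repo.
    name: Automated data fetching
    on:
      schedule:
        - cron: "0 7 * * *"        # every day at 7:00am UTC
      push:
        paths:
          - '**.ipynb'             # re-run whenever the source notebook changes
    jobs:
      build:
        runs-on: ubuntu-latest
        strategy:
          matrix:
            python-version: ["3.8"]  # assumed version
        steps:
          - uses: actions/checkout@v2
            with:
              persist-credentials: false
              fetch-depth: 0
          - uses: actions/setup-python@v2
            with:
              python-version: ${{ matrix.python-version }}
          - name: Install dependencies
            run: |
              python -m pip install --upgrade pip
              if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
              pip install nbconvert
          - name: Convert the notebook and run the resulting script
            run: |
              jupyter nbconvert --to script ODP_datasets_formats_update.ipynb
              python ODP_datasets_formats_update.py
          - uses: EndBug/add-and-commit@v4
            with:
              message: "Automated data fetching"
              add: 'datasets_formats_processed.csv'
              force: true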
