bigscience-workshop / data_tooling

Tools for managing datasets for governance and training.

Create license-compliant version of the Pile: Enron Emails

albertvillanova opened this issue

This one will be a nice trial for PII removal.

DONE: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_en_the_pile_enron_emails

I have processed the raw data file to extract only the email bodies as .txt files.

  • 51 emails raised a Unicode decoding error and were ignored:
      File ".../.venv/lib/python3.9/site-packages/mailparser/mailparser.py", line 446, in parse
        payload = payload.decode('raw-unicode-escape')
    UnicodeDecodeError: 'rawunicodeescape' codec can't decode bytes in position 190-191: truncated \UXXXXXXXX escape
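
For anyone wanting to reproduce the extraction step, here is a rough sketch of the idea, not the script that was actually run. It assumes the raw data is unpacked as a `maildir/` tree with one message per file. The default parser uses Python's standard `email` package instead of the `mailparser` library seen in the traceback, and the names `stdlib_body`, `extract_bodies` and `parse_body` are made up for illustration. The output naming (drop a trailing dot, append `.txt`) is only inferred from the `meta` path in the sample.

```python
import email
from email import policy
from pathlib import Path
from typing import Callable, List


def stdlib_body(raw: bytes) -> str:
    """Return the plain-text body of one raw message (standard library only)."""
    msg = email.message_from_bytes(raw, policy=policy.default)
    part = msg.get_body(preferencelist=("plain",))
    return part.get_content() if part is not None else ""


def extract_bodies(
    maildir: Path,
    out_dir: Path,
    parse_body: Callable[[bytes], str] = stdlib_body,  # hypothetical hook
) -> List[Path]:
    """Write each body under out_dir as .txt; return inputs that failed to decode."""
    skipped = []
    for src in sorted(p for p in maildir.rglob("*") if p.is_file()):
        try:
            body = parse_body(src.read_bytes())
        except UnicodeDecodeError:
            # Same policy as in the report: ignore the message and move on.
            skipped.append(src)
            continue
        rel = src.relative_to(maildir)
        # Assumed naming: "1." or "1" becomes "1.txt" in the same subfolder.
        dst = out_dir / rel.parent / (rel.name.rstrip(".") + ".txt")
        dst.parent.mkdir(parents=True, exist_ok=True)
        dst.write_text(body, encoding="utf-8")
    return skipped
```

Passing the parser in as a callable keeps the skip-on-error loop independent of the parsing library, so a `mailparser`-based function could be swapped in; `len(extract_bodies(...))` then gives the number of ignored emails.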
    

Sample:

{'text': '\n\n\n\nName\t\t\tNew Title\t\t\t\tEffective Date\t\t\tMid Year promotion Yes/No\n\nFloyd, Jodie\t\tSr Cust Svc Rep (no change)\t\t7/16/01\t\t\t\tNo\n\nBuehler, Craig\t\tSr Mkt/Sup Analyst (no change)\t\t7/16/01\t\t\t\tNo\n\nWagoner, Mike\t\tTeam Advisor - Gas Control\t\t7/1/01\t\t\t\tNo\n\nClapper, Karen\t\tSr Cust Svc Rep\t\t\t8/1/01\t\t\t\tYes\n\nGreaney, Chris\t\tSr Cust Svc Rep\t\t\t8/1/01\t\t\t\tYes\n\nWilkens, Jerry\t\tSr Cust Svc Rep\t\t\t8/1/01\t\t\t\tYes\n\nMinton, Kevin\t\tPipeline Controller\t\t\t8/1/01\t\t\t\tYes\n\nCox, Don\t\tPipeline Controller\t\t\t8/1/01\t\t\t\tYes\n\nHanagriff, Richard\tSr Accounting Control Spec\t\t8/1/01\t\t\t\tYes\n\n\nThanks,\nMS\n\n\n\n\n\n\n',
 'meta': "{'file': 'maildir/blair-l/personnel___promotions/1.txt'}"}
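
Note that in the sample `meta` is a string holding a Python dict literal, not a dict. A small sketch of how it could be read back, using `ast.literal_eval` from the standard library; the `record` below is an abbreviated stand-in for the sample above, with the `text` field shortened for illustration:

```python
import ast

# Abbreviated stand-in for the sample record; only the end of `text` is kept.
record = {
    "text": "\n\nThanks,\nMS\n",
    "meta": "{'file': 'maildir/blair-l/personnel___promotions/1.txt'}",
}

# literal_eval safely evaluates the dict literal without running arbitrary code.
meta = ast.literal_eval(record["meta"])
source_file = meta["file"]
print(source_file)
```

`json.loads` would not work here, since the string uses single quotes.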