IQSS / dataverse-sample-data

Scripts and sample data for demo purposes

(Sample Data Broken for Some Users) UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 834: ordinal not in range(128)

atc0005 opened this issue

As noted on gdcc/dataverse-ansible#38, I encountered the following error first when running the Ansible playbook from that repo, then again when following the steps in this repo's README file.

Snippet of the output just prior and then the error message:

Creating dataverse ubiquity-press.json in dataverse :root
{'name': 'Ubiquity Press Dataverse', 'alias': 'ubiquity-press', 'dataverseContacts': [{'contactEmail': 'ubiquity-press@mailinator.com'}], 'affiliation': '', 'description': 'Ubiquity Press is an open access publisher of peer-reviewed, academic journals. Our flexible publishing model makes journals affordable, and enables researchers around the world to find and access the information they need, without barriers. The following gives an overview of how we work. More information can be found in a recent interview with Chronicle of Higher Education: <a href="http://chronicle.com/blogs/profhacker/ubiquity/43312" rel="nofollow" target="_blank">"Open Access Ahoy: An Interview with Ubiquity Press"</a>.', 'dataverseType': 'JOURNALS'}
Dataverse ubiquity-press created.
<Response [201]>
Dataverse ubiquity-press published.
<Response [200]>
Creating dataverse jopd.json in dataverse ubiquity-press
{'name': 'Journal of Open Psychology Data (JOPD) Dataverse', 'alias': 'jopd', 'dataverseContacts': [{'contactEmail': 'jopd@mailinator.com'}], 'affiliation': 'Ubiquity Press', 'description': 'Datasets from data papers published in the Journal of Open Psychology Data (JOPD).', 'dataverseType': 'JOURNALS'}
Dataverse jopd created.
<Response [201]>
Dataverse jopd published.
<Response [200]>
Creating dataset flynn-effect-in-estonia.json in dataverse jopd
Traceback (most recent call last):
  File "create_sample_data.py", line 56, in <module>
    metadata = json.load(f)
  File "/usr/lib64/python3.6/json/__init__.py", line 296, in load
    return loads(fp.read(),
  File "/usr/lib64/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 834: ordinal not in range(128)

The environment is a CentOS 7 x64 LXD container. I attempted to replicate within a local CentOS 7 x64 VM, but my (unfortunately remote) VMware Workstation environment is acting up. I'll attempt to further replicate in a non-LXD environment when I have more time.

FWIW, I think adding encoding='utf-8' to the open() calls right before the json.load statements would work. That said, I just had a similar situation in dataverse-metrics, where it turned out I was able to read the Unicode in Python 3 but not Python 2, so there must also be some environment variable (or module?) that can be set, which would explain why this hasn't been seen by others.
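
A minimal sketch of the explicit-encoding approach suggested above. The helper name `load_json_utf8` and the temporary demo file are illustrative assumptions, not code from create_sample_data.py; the only change that matters is passing `encoding="utf-8"` to `open()` so that decoding no longer depends on the locale.

```python
import json
import os
import tempfile


def load_json_utf8(path):
    """Load a JSON file, decoding it as UTF-8 regardless of the locale."""
    # Without encoding=..., open() falls back to
    # locale.getpreferredencoding(False), which can be ASCII
    # under a C/POSIX locale.
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Hypothetical demo file containing a right single quotation mark (U+2019),
# whose UTF-8 encoding starts with byte 0xe2.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.json")
with open(demo_path, "wb") as f:
    f.write('{"title": "Flynn\u2019s effect"}'.encode("utf-8"))

metadata = load_json_utf8(demo_path)
# ascii() keeps the demo output independent of the terminal encoding.
print(ascii(metadata["title"]))
```

On the environment-variable question: in Python 3.6 the default encoding for `open()` comes from the locale, so an unset or C/POSIX `LANG`/`LC_ALL` (common in minimal containers) typically yields ASCII. `PYTHONIOENCODING` only affects the standard streams, not files opened with `open()`.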

@qqmyers thanks for the tip about the Python version.

@atc0005 which version of Python was used above, please?

@pdurbin: which version of Python was used above, please?

What @donsizemore said. Please let me know if you need more info.

Thanks all for the details here. I'm going to get this into a sprint so that we can get it fixed.

  • This could be a Python version mismatch; consider asking/telling people to use Python 3
  • This could be that there are some malformed(?) UTF-8 characters in the data itself
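
One way to test the malformed-data hypothesis is to read the file as raw bytes and check two things: where the first non-ASCII byte sits, and whether the whole file decodes cleanly as UTF-8. Byte 0xe2 is an ordinary lead byte for three-byte UTF-8 sequences (curly quotes, for example), so on its own it does not indicate corruption. The function names and the sample bytes below are illustrative assumptions.

```python
def first_non_ascii(data):
    """Return (offset, byte) of the first byte >= 0x80, or None if pure ASCII."""
    for offset, byte in enumerate(data):
        if byte >= 0x80:
            return offset, byte
    return None


def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical sample: ASCII text around U+2019 (UTF-8 bytes e2 80 99).
sample = b'{"note": "it' + "\u2019".encode("utf-8") + b's fine"}'
result = first_non_ascii(sample)
valid = is_valid_utf8(sample)
print(result, valid)
```

To apply this to the failing dataset, pass in the bytes of the JSON file read with mode "rb". If the data is valid UTF-8, the file itself is fine and the problem lies in how it is being decoded.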

If it helps, I believe that I was using Python 3.6 at the time I encountered the issue. The error snippet in the OP suggests this, but it's been long enough since my attempt to load the sample data that I don't recall for sure.
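
To remove the guesswork about the interpreter and its defaults, a short diagnostic like the one below could be run in the same shell that runs create_sample_data.py. This is a sketch; the function name `describe_text_defaults` is an assumption for illustration. The value to watch is the preferred encoding, since that is what `open()` uses when no `encoding` argument is given.

```python
import locale
import sys


def describe_text_defaults():
    """Collect the settings that decide how text is decoded by default."""
    return {
        "python": "{}.{}.{}".format(*sys.version_info[:3]),
        # Default encoding used by open() when encoding= is omitted.
        "preferred_encoding": locale.getpreferredencoding(False),
        "filesystem_encoding": sys.getfilesystemencoding(),
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
    }


info = describe_text_defaults()
for key, value in info.items():
    print(key, "=", value)
```

A preferred encoding of ANSI_X3.4-1968 or US-ASCII would be consistent with the traceback in the original post.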