(Sample Data Broken for Some Users) UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 834: ordinal not in range(128)

Question

atc0005 opened this issue 4 years ago

As noted on gdcc/dataverse-ansible#38, I encountered the following error first when running the Ansible playbook from that repo, then again when following the steps in this repo's README file.

Snippet of the output just before the error, followed by the error message itself:

Creating dataverse ubiquity-press.json in dataverse :root
{'name': 'Ubiquity Press Dataverse', 'alias': 'ubiquity-press', 'dataverseContacts': [{'contactEmail': 'ubiquity-press@mailinator.com'}], 'affiliation': '', 'description': 'Ubiquity Press is an open access publisher of peer-reviewed, academic journals. Our flexible publishing model makes journals affordable, and enables researchers around the world to find and access the information they need, without barriers. The following gives an overview of how we work. More information can be found in a recent interview with Chronicle of Higher Education: <a href="http://chronicle.com/blogs/profhacker/ubiquity/43312" rel="nofollow" target="_blank">"Open Access Ahoy: An Interview with Ubiquity Press"</a>.', 'dataverseType': 'JOURNALS'}
Dataverse ubiquity-press created.
<Response [201]>
Dataverse ubiquity-press published.
<Response [200]>
Creating dataverse jopd.json in dataverse ubiquity-press
{'name': 'Journal of Open Psychology Data (JOPD) Dataverse', 'alias': 'jopd', 'dataverseContacts': [{'contactEmail': 'jopd@mailinator.com'}], 'affiliation': 'Ubiquity Press', 'description': 'Datasets from data papers published in the Journal of Open Psychology Data (JOPD).', 'dataverseType': 'JOURNALS'}
Dataverse jopd created.
<Response [201]>
Dataverse jopd published.
<Response [200]>
Creating dataset flynn-effect-in-estonia.json in dataverse jopd
Traceback (most recent call last):
  File "create_sample_data.py", line 56, in <module>
    metadata = json.load(f)
  File "/usr/lib64/python3.6/json/__init__.py", line 296, in load
    return loads(fp.read(),
  File "/usr/lib64/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 834: ordinal not in range(128)

The environment is a CentOS 7 x64 LXD container. I attempted to replicate within a local CentOS 7 x64 VM, but my (unfortunately remote) VMware Workstation environment is acting up. I'll attempt to further replicate in a non-LXD environment when I have more time.
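The traceback passes through encodings/ascii.py, which suggests that open() fell back to ASCII as its default text encoding. A minimal sketch for checking what a given environment reports, assuming (this is not confirmed in the thread) that the container runs with a C/POSIX locale. The function name describe_text_encoding is made up for illustration:

```python
import locale
import os
import sys


def describe_text_encoding():
    """Collect the settings that influence open()'s default text encoding."""
    return {
        "python_version": sys.version.split()[0],
        # open() without an explicit encoding argument uses this value.
        "preferred_encoding": locale.getpreferredencoding(False),
        "filesystem_encoding": sys.getfilesystemencoding(),
        # Locale-related environment variables; None means "not set".
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
        "LC_CTYPE": os.environ.get("LC_CTYPE"),
    }


if __name__ == "__main__":
    for key, value in describe_text_encoding().items():
        print(f"{key}: {value}")
```

If preferred_encoding comes back as ASCII or ANSI_X3.4-1968 (an alias for ASCII), that would be consistent with the error above.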

qqmyers · Answer 1 · Sat Aug 29 2020 20:09:25 GMT+0800 (China Standard Time)

FWIW, I think adding ", encoding='utf-8'" to the open calls right before the json.load statements would work. That said, I just had a similar situation in dataverse-metrics, where it turned out I was able to read the Unicode in Python 3 but not in Python 2, so there must also be some environment variable (or module?) that can be set, which would explain why this hasn't been seen by others.
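A rough sketch of that suggestion, not the actual code from create_sample_data.py. The helper name load_json_utf8 and the sample file contents are invented for illustration:

```python
import json
import os
import tempfile


def load_json_utf8(path):
    """Load a JSON file, decoding it as UTF-8 regardless of the locale."""
    # The explicit encoding overrides the locale-dependent default of open().
    with open(path, encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    # Made-up sample data containing U+2019 (right single quotation mark),
    # whose UTF-8 encoding starts with byte 0xe2.
    sample = {"note": "it\u2019s"}
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump(sample, f, ensure_ascii=False)
        print(load_json_utf8(path) == sample)
```

Setting a UTF-8 locale in the environment (for example LC_ALL=en_US.UTF-8, if that locale is installed) would likely change the default as well, but passing the encoding explicitly does not depend on how the machine is configured.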

Philip Durbin · Answer 2 · Mon Aug 31 2020 22:28:54 GMT+0800 (China Standard Time)

@qqmyers thanks for the tip about the Python version.

@atc0005 which version of Python was used above, please?

Don Sizemore · Answer 3 · Mon Aug 31 2020 22:33:13 GMT+0800 (China Standard Time)

@pdurbin he first hit the bug using dataverse-ansible, which installs Python 3.6:
https://github.com/GlobalDataverseCommunityConsortium/dataverse-ansible/blob/master/tasks/sampledata.yml#L16

Adam Chalkley · Answer 4 · Mon Aug 31 2020 22:40:04 GMT+0800 (China Standard Time)

@pdurbin: which version of Python was used above, please?

What @donsizemore said. Please let me know if you need more info.

Danny Brooke · Answer 5 · Sat Dec 05 2020 02:10:17 GMT+0800 (China Standard Time)

Thanks all for the details here. I'm going to get this into a sprint so that we can get it fixed.

Danny Brooke · Answer 6 · Thu Jan 07 2021 03:43:10 GMT+0800 (China Standard Time)

This could be a Python version mismatch; consider asking/telling people to use Python 3.
It could also be that there are some malformed(?) UTF-8 characters in the data itself.
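One way to test the second possibility is to read the file as raw bytes and try a strict UTF-8 decode. A minimal sketch with hypothetical helper names (inspect_bytes, inspect_file). Note that byte 0xe2 on its own is a normal lead byte of a three-byte UTF-8 sequence (curly quotes and dashes start with it), so seeing it does not by itself mean the data is malformed:

```python
def inspect_bytes(raw):
    """Report the first non-ASCII byte and whether the data is valid UTF-8."""
    # Offset of the first byte outside the ASCII range, or None if all ASCII.
    first_non_ascii = next((i for i, b in enumerate(raw) if b >= 0x80), None)
    try:
        raw.decode("utf-8")  # strict decoding raises on malformed sequences
        utf8_error = None
    except UnicodeDecodeError as err:
        utf8_error = f"{err.reason} at byte {err.start}"
    return {"first_non_ascii_offset": first_non_ascii, "utf8_error": utf8_error}


def inspect_file(path):
    """Run inspect_bytes on the contents of a file."""
    with open(path, "rb") as f:
        return inspect_bytes(f.read())
```

For a small file, first_non_ascii_offset should be comparable to the position reported in the ASCII decode error. If utf8_error is None, the file decodes cleanly as UTF-8 and the data itself is probably not the problem.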

Adam Chalkley · Answer 7 · Thu Jan 07 2021 17:13:06 GMT+0800 (China Standard Time)

This could be a Python version mismatch; consider asking/telling people to use Python 3.

If it helps, I believe that I was using Python 3.6 at the time I encountered the issue. The error snippet in the OP suggests this, but it's been long enough since my attempt to load the sample data that I don't recall for sure.