clics / clics2

A Python package to create and analyze colexification networks from lexical datasets.

Home Page: http://clics.clld.org

Clics tries to write non-ASCII to a GML file

Anaphory opened this issue

Using pyclics on a Windows computer, I got the characteristic encoding error that Python throws when it tries to write non-ASCII characters to Windows text files or to the console.
The culprit sits here:

    with self.fname.open('w') as fp:
        fp.write('\n'.join(html.unescape(line) for line in nx.generate_gml(graph)))
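The failure mode can be reproduced on any platform by encoding explicitly instead of relying on the system default. The following is a minimal sketch, assuming that cp1252 stands in for the default text encoding on the affected Windows machine (the report does not say which code page was active) and that a hand-written string stands in for one line of `nx.generate_gml` output. The helper `try_encode` is hypothetical.

```python
import html

# Stand-in for one line from nx.generate_gml: the non-ASCII character
# arrives as a numeric character reference.
escaped_line = '    form "&#331;a"'

# html.unescape turns the reference back into the character U+014B.
unescaped_line = html.unescape(escaped_line)


def try_encode(text, encoding):
    """Return the encoded bytes, or None if the codec cannot represent text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


# Assumption: cp1252 plays the role of the Windows default encoding.
legacy_result = try_encode(unescaped_line, "cp1252")  # None, U+014B is not in cp1252
utf8_result = try_encode(unescaped_line, "utf-8")     # bytes, UTF-8 covers all of Unicode
```

The escaped line is pure ASCII and encodes under any of these codecs; only the unescaped form depends on the encoding of the target file.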

I first tried to force fp to have UTF-8 encoding, but closer investigation upon trying to draw the graph showed that networkx's GML reader accepts only ASCII input. So why are we unescaping the non-ASCII characters in there in the first place, instead of simply using nx.write_gml(graph, self.fname)?
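The two options can be compared side by side. This is a rough sketch rather than the pyclics code: the graph, the `form` attribute, and the file names are made up for illustration, and it assumes a networkx version whose `read_gml` rejects non-ASCII bytes, as described above.

```python
import html
import tempfile
from pathlib import Path

import networkx as nx

# Hypothetical two-node graph with an IPA word form as a node attribute.
graph = nx.Graph()
graph.add_node("n1", form="ŋa")
graph.add_node("n2", form="ma")
graph.add_edge("n1", "n2")

with tempfile.TemporaryDirectory() as tmp:
    ascii_path = Path(tmp) / "escaped.gml"
    utf8_path = Path(tmp) / "unescaped.gml"

    # Option 1: let networkx write the file; non-ASCII characters are
    # stored as numeric character references, so the file is pure ASCII.
    nx.write_gml(graph, str(ascii_path))
    ascii_text = ascii_path.read_text(encoding="ascii")
    restored = nx.read_gml(str(ascii_path))

    # Option 2: unescape the references and write real Unicode, with the
    # encoding stated explicitly so the platform default does not matter.
    with utf8_path.open("w", encoding="utf-8") as fp:
        fp.write("\n".join(html.unescape(line) for line in nx.generate_gml(graph)))

    # Check whether networkx can read its own unescaped output back.
    try:
        nx.read_gml(str(utf8_path))
        utf8_readable = True
    except nx.NetworkXError:
        utf8_readable = False
```

Option 1 round-trips through networkx, while Option 2 produces a file meant for tools that read UTF-8 GML directly.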

Because we want to read the graph with Unicode in Cytoscape, and Cytoscape accepts GML with Unicode. We want to see real word forms when looking at CLICS data, which is why we write GML in this form. The standard is a bit problematic, but this form was useful for inspection when clics1 was created, and back then these GML problems did not come up.

If Cytoscape could unescape HTML character references, the solution would be to just let generate_gml produce those for the IPA symbols, so I assume Cytoscape does not have that decoding step?
(For ISO 8859-1, the &name; syntax is actually part of the GML standard, so that would be a bug in Cytoscape. The GML standard is very old and restricts this to ISO 8859-1, knowing nothing about UTF-8, which is a much better solution nowadays. Still, supporting &name; would probably not be hard and would help compatibility, but that's not a discussion for here.)
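For illustration, the decoding step in question could look roughly like the sketch below. It uses Python's `html.unescape`, which follows HTML5 rules and therefore accepts more than the ISO 8859-1 names mentioned above; treating it as a stand-in for a GML decoder is an assumption, and `decode_gml_string` is a hypothetical helper, not part of networkx or Cytoscape.

```python
import html
from html.entities import name2codepoint

# HTML 4 named references whose code point fits in ISO 8859-1 (below 256).
latin1_names = {name for name, cp in name2codepoint.items() if cp < 256}


def decode_gml_string(value):
    """Decode &name; and &#NNN; references in a GML string value."""
    return html.unescape(value)


# A Latin-1 character has a named reference ...
decoded_named = decode_gml_string("M&uuml;nchen")

# ... while an IPA symbol such as U+014B falls outside ISO 8859-1 and is
# written here as a numeric reference instead.
decoded_numeric = decode_gml_string("&#331;a")
```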

I will just install Cytoscape; it's probably the nicer solution anyway.

I have marked this issue as “won't fix” for now. Let's close it and re-open it if Cytoscape's behaviour ever changes to permit &name;.