composewell / unicode-transforms

Fast Unicode normalization in Haskell

Switch to text files for UCD instead of XML?

harendra-kumar opened this issue

Comments from #31

We currently parse the Unicode data from an XML file, ucdxml/ucd.all.flat.zip. Every time we add a new version of it, the repository grows, and I am not sure git can create an efficient delta for this file (the .git directory on my machine is 36 MB). Parsing the XML also takes a lot of time, though it is only a one-time job. The file contains the entire Unicode database, including a lot of data that we do not need.

I am wondering if we should switch to the text files: https://www.unicode.org/Public/13.0.0/ucd/ . That should significantly reduce both the repository size and the parsing time, because we need only a few of these files and git should handle incremental changes to them much more efficiently. It requires some effort and testing, though. We can consider this when creating a new package for the data.
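To give a rough idea of the effort involved, here is a minimal sketch of parsing one record from UnicodeData.txt, one of the text files in that directory. It assumes the semicolon-separated field layout described in UAX #44 (code point, name, general category, canonical combining class, bidi class, decomposition mapping, ...). The names `Entry`, `splitOn`, `parseHex`, `parseCanonical`, and `parseEntry` are hypothetical and only for illustration; this is not how unicode-transforms or unicode-data actually parses the files.

```haskell
import Data.Char (chr)
import Numeric (readHex)
import Text.Read (readMaybe)

-- | Hypothetical record keeping only the fields normalization needs.
data Entry = Entry
  { codePoint      :: Char
  , combiningClass :: Int
  , decomposition  :: String  -- canonical decomposition, empty if none
  } deriving (Eq, Show)

-- | Split a string on a separator, keeping empty fields.
splitOn :: Char -> String -> [String]
splitOn sep s = case break (== sep) s of
  (field, [])       -> [field]
  (field, _ : rest) -> field : splitOn sep rest

-- | Parse a hexadecimal code point such as "00C0".
parseHex :: String -> Maybe Char
parseHex str = case readHex str of
  [(n, "")] | n >= 0 && n <= 0x10FFFF -> Just (chr n)
  _                                   -> Nothing

-- | Parse the decomposition field. A leading tag such as "<compat>"
-- marks a compatibility mapping, which this sketch simply ignores.
parseCanonical :: String -> Maybe String
parseCanonical ('<' : _) = Just ""
parseCanonical field     = mapM parseHex (words field)

-- | Parse one line of UnicodeData.txt; Nothing if it is malformed.
parseEntry :: String -> Maybe Entry
parseEntry line = case splitOn ';' line of
  (cp : _name : _category : ccc : _bidi : decomp : _) ->
    Entry <$> parseHex cp <*> readMaybe ccc <*> parseCanonical decomp
  _ -> Nothing

main :: IO ()
main = print (parseEntry
  "00C0;LATIN CAPITAL LETTER A WITH GRAVE;Lu;0;L;0041 0300;;;;N;LATIN CAPITAL LETTER A GRAVE;;;00E0;")
```

The sketch uses only `base`, so the generator would not need an XML parsing dependency. A real generator would also have to handle the range entries in UnicodeData.txt and read the other files we need.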

We will be depending on unicode-data (#57), which parses the text files.