JuliaStrings / utf8proc

a clean C library for processing UTF-8 Unicode data

Home Page: http://juliastrings.github.io/utf8proc/

New data file generator with support for UCD 13 & 14

chris0e3 opened this issue

Attached is data_make.py, a Python 3 script designed to combine & replace data/data_generator.rb & data/charwidths.jl and to support both UCD 13 & 14. Also attached is utf8proc.c.patch, a small change to utf8proc.c needed to support UCD 14.

Here are some of its features:

  • Written in Python (easier to read & support?), only uses (a little) sed. Tested on Python 3.7.4.
  • Doesn’t use an unspecified version of Ruby.
  • Doesn’t use an unspecified version of Julia.
  • Doesn’t require a previously built, unspecified, version of libutf8proc.
  • Runs to completion in 5-6 seconds (about 10x as fast as data_generator.rb).
  • Passes all utf8proc tests.
  • No changes to the public API.
  • Can generate a utf8proc_data.c file that is byte-for-byte identical to the one contained in utf8proc 2.6.1.
  • Can generate an equivalent utf8proc_data.c source file that is over 1.1 MB smaller.
  • Writes informative header comments to the generated file.
  • Can process the latest UCD 14-dev data files and generate a utf8proc_data.c that passes all current tests.
    [Due to the increased size of the UCD 14 data I have had to split utf8proc_sequences & add utf8proc_casemap to prevent index overflow. This requires a small patch to utf8proc.c.]
  • Can halve the size of utf8proc_stage1table. (Saves 4352 bytes.)
  • Can be used to create a utf8proc_properties table that is > 64,000 bytes smaller.
  • Doesn’t need data/Uppercase.txt, data/Lowercase.txt or data/CharWidths.txt files.
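For readers unfamiliar with the stage tables named in the list above, here is a minimal sketch of a two-stage lookup table with block deduplication. It is a generic illustration, not the code in data_make.py; the block size of 256 and the names `build_two_stage` and `lookup` are assumptions made for this example.

```python
BLOCK = 256  # assumed block size for this sketch


def build_two_stage(values):
    """Split `values` into fixed-size blocks and store each distinct block once.

    Returns (stage1, stage2): stage1[i] is the offset in stage2 of block i.
    """
    if len(values) % BLOCK:
        raise ValueError("length must be a multiple of BLOCK")
    stage1, stage2, seen = [], [], {}
    for start in range(0, len(values), BLOCK):
        block = tuple(values[start:start + BLOCK])
        if block not in seen:
            # First time this block content appears: append it to stage2.
            seen[block] = len(stage2)
            stage2.extend(block)
        stage1.append(seen[block])
    return stage1, stage2


def lookup(stage1, stage2, codepoint):
    """Find the value for `codepoint` via the two tables."""
    return stage2[stage1[codepoint // BLOCK] + codepoint % BLOCK]


# Toy data: 16 blocks; blocks 0 and 3 share one pattern, the rest are all zero.
values = [0] * (16 * BLOCK)
values[0x041] = 7
values[0x341] = 7
stage1, stage2 = build_two_stage(values)
print(len(values), "->", len(stage1), "+", len(stage2))
```

With this block size, the full code point range 0..0x10FFFF needs 0x110000 / 256 = 4352 stage-1 entries. That is consistent with the 4352-byte saving quoted above if one byte is saved per entry, but the issue does not say how the saving is obtained.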

Building with the (still in development) UCD 14 requires a new Makefile. I haven’t supplied one here, as UCD 14 is still in a state of flux & the URLs are changing. (I can supply one if requested.)
UCD 14 has increased the size of the generated data. I have had to split utf8proc_sequences & add utf8proc_casemap to prevent index overflow. This requires the small patch to utf8proc.c contained in utf8proc.c.patch. With the patch applied, utf8proc.c still works with the original utf8proc_data.c as well as with the new-format UCD 13 & 14 data.
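As a rough illustration of why splitting a table can avoid index overflow, the sketch below assumes that offsets into a table are stored in a fixed number of bits. That width, and the names `fits` and `split`, are hypothetical; neither is taken from utf8proc.c or data_make.py. The demo uses a deliberately tiny 3-bit index so the effect is visible.

```python
def fits(table, bits):
    """True if every offset into `table` is representable in `bits` bits."""
    return len(table) <= (1 << bits)


def split(entries):
    """Partition (kind, sequence) pairs into one table per kind.

    Returns (tables, index), where index[i] == (kind, offset) for entries[i].
    Each table gets its own offset space, so each one grows more slowly.
    """
    tables, index = {}, []
    for kind, seq in entries:
        table = tables.setdefault(kind, [])
        index.append((kind, len(table)))
        table.extend(seq)
    return tables, index


# Toy entries; the kinds mirror the table names in the issue, the data is made up.
entries = [
    ("sequences", [1, 2, 3]),
    ("casemap", [4, 5]),
    ("sequences", [6, 7, 8]),
    ("casemap", [9, 10]),
]
combined = [x for _, seq in entries for x in seq]
tables, index = split(entries)

print("combined fits in 3 bits:", fits(combined, 3))
for kind, table in tables.items():
    print(kind, "fits in 3 bits:", fits(table, 3))
```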

To use:

  1. Download & unpack a clean copy of utf8proc-2.6.1.tar.gz.
  2. Unpack & copy the attached data_make.py & utf8proc.c.patch into the utf8proc-2.6.1 dir.
  3. Run make -kC data to download the UCD 13 data files. [It’s OK if CharWidths.txt is not made.]
  4. Run patch < utf8proc.c.patch.
  5. Run ./data_make.py --verbose --format=1 --output=utf8proc_data.c
  6. Run make check.
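Steps 3 to 6 above could be scripted along these lines. This is only a sketch: the helper names `build_steps` and `run_steps` are invented for the example, `patch -i` is used in place of the shell redirection in step 4, and the first step is allowed to fail because step 3 notes that CharWidths.txt may not be made.

```python
import subprocess


def build_steps(output="utf8proc_data.c"):
    """Return steps 3-6 as (argv, must_succeed) pairs."""
    return [
        (["make", "-kC", "data"], False),             # step 3: may fail on CharWidths.txt
        (["patch", "-i", "utf8proc.c.patch"], True),  # step 4
        (["./data_make.py", "--verbose", "--format=1",
          f"--output={output}"], True),               # step 5
        (["make", "check"], True),                    # step 6
    ]


def run_steps(steps, cwd=".", dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv, must_succeed in steps:
        print("+", " ".join(argv))
        if not dry_run:
            subprocess.run(argv, cwd=cwd, check=must_succeed)


run_steps(build_steps())  # dry run: only prints the commands
```

To actually run the commands, call `run_steps(build_steps(), cwd="utf8proc-2.6.1", dry_run=False)` from the directory that contains the unpacked sources.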

Usage is:

data_make.py [-v|--verbose] [-f#|--format=#] [--fix26] [--cmap] [-o ‹out-file›|--output=‹out-file›] [‹data-dir›]

If unspecified, the output file is utf8proc_data.out.c.
If unspecified, the input data-dir is ./data.
If --format=0 alone is used (the default) then the output file should be identical to the original utf8proc_data.c file.
If --fix26 is used then the fixes described in issue #226 are applied to the tables.
If --cmap is used then the utf8proc_sequences table is split & the utf8proc_casemap table added. This requires the utf8proc.c.patch to be applied.
If --format=1 is used then --fix26 & --cmap are implied and the output file uses the new compact source form.
Using UCD 14 automatically forces --format=1 (thus --fix26 & --cmap too).
Using --verbose reports the options in effect & successful generation of the output file.
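The way the options imply one another can be sketched with argparse as below. This is not the parser from data_make.py. The function name `parse_options` and the `ucd_major` parameter are hypothetical stand-ins; in particular, the issue does not say how the script detects the UCD version.

```python
import argparse


def parse_options(argv, ucd_major=13):
    """Parse the documented options and apply the documented implications."""
    p = argparse.ArgumentParser(prog="data_make.py")
    p.add_argument("-v", "--verbose", action="store_true")
    p.add_argument("-f", "--format", type=int, default=0)
    p.add_argument("--fix26", action="store_true")
    p.add_argument("--cmap", action="store_true")
    p.add_argument("-o", "--output", default="utf8proc_data.out.c")
    p.add_argument("data_dir", nargs="?", default="./data")
    opts = p.parse_args(argv)
    if ucd_major >= 14:
        opts.format = 1                   # UCD 14 forces --format=1
    if opts.format == 1:
        opts.fix26 = opts.cmap = True     # --format=1 implies both
    return opts


print(parse_options([]))
print(parse_options(["-f1", "--output=utf8proc_data.c"]))
print(parse_options(["--cmap"], ucd_major=14))
```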

data_make.zip

With the release of Unicode 14.0 I have now also updated the make files.
I’ve attached the changed files below.

I also updated data_make.py to add a UNICODE_VERSION macro to utf8proc_data.c, and changed the utf8proc_unicode_version API to return it if defined.
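One way to check which version a built library reports is to call utf8proc_unicode_version through ctypes, as in the sketch below. The helper name `unicode_version` is invented here, and the lookup assumes a shared libutf8proc that the loader can find (or whose path you pass in); it returns None otherwise.

```python
import ctypes
import ctypes.util


def unicode_version(lib_path=None):
    """Return the Unicode version string reported by libutf8proc, or None."""
    path = lib_path or ctypes.util.find_library("utf8proc")
    if path is None:
        return None  # no shared library found
    try:
        lib = ctypes.CDLL(path)
    except OSError:
        return None  # found, but could not be loaded
    fn = lib.utf8proc_unicode_version
    fn.restype = ctypes.c_char_p  # the C function returns const char *
    fn.argtypes = []
    return fn().decode("ascii")


print(unicode_version())
```

Passing the path of a freshly built shared library as `lib_path` checks that build rather than a system-wide install.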

To build & test: Copy utf8proc-2.6.1.tar.gz & the attached utf8proc-2.6.1-changes.tar.gz into ‹your-work-dir› and:

cd ‹your-work-dir›
tar -xf utf8proc-2.6.1.tar.gz
tar -xf utf8proc-2.6.1-changes.tar.gz
make -C utf8proc-2.6.1 update check UNICODE_VERSION=14.0.0

This will download the UCD data files, generate a utf8proc_data.c, compile the code & run the tests.

Alternatively, make -C utf8proc-2.6.1 update check UNICODE_VERSION=13.0.0 will build utf8proc using the older UCD 13 data. If you don’t specify any UNICODE_VERSION=…, it defaults to 14.0.0.

utf8proc-2.6.1-changes.tar.gz

This sounds great, I'll try to take a look at it later.

Can you convert this into a pull request? A PR is much easier to review than a tarball of changes.

> Can you convert this into a pull request? A PR is much easier to review than a tarball of changes.

I’m sorry, I’m not a git user. I don’t know how to do that.
I could probably attach a patch file here, if that would help.
[The python script is nearly 600 lines, but the other changes are very small.]

> I’m sorry, I’m not a git user. I don’t know how to do that.

There are hundreds of tutorials online — it's pretty indispensable for participating in any free/open-source software projects these days, not to mention a lot of commercial projects.

(If you can write Python code with all of the features listed above, I'm sure you can learn git!)

In a pinch, I can take the .tar.gz file you posted and make a pull request for you, though.

> (If you can write Python code with all of the features listed above, I'm sure you can learn git!)

I could, probably, learn git. I really don’t want to 🤡.
I’m not a Python programmer. I used it because I thought it would be acceptable, and I posted here because I was just trying to help out. I solved a problem and thought it could help others.

> In a pinch, I can take the .tar.gz file you posted and make a pull request for you, though.

Did the above commits give you what you wanted?
[I followed the instructions for ‘Linking a pull request …’, but just noticed that at the end it states “… will not be listed as a linked pull request”. So perhaps I have to do/should have done something else.]

Also, I appear to have missed the changes in 610730f.

You have to create a new pull request based on the commits: https://docs.github.com/en/github/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request

I read that and thought that “To open a pull request in a public repository, you must have write access …” meant it wasn’t what I wanted. So I followed this https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue. But apparently you have to push a pull request!
Anyway, I re-merged the changes from 610730f plus 1 additional warning.
[Of course I had based my changes on the released 2.6.1 code.]
And I also tweaked the Makefile so it still builds with the original 2.6.1 utf8proc_data.c as well as the newly generated ones for UCD 13 & 14.

[All done without git 🤓.]

Closed in favor of #258