New data file generator with support for UCD 13 & 14
chris0e3 opened this issue · comments
Attached is data_make.py, a python3 script designed to combine & replace data/data_generator.rb & data/charwidths.jl and support both UCD 13 & 14. Also utf8proc.c.patch, a small change to utf8proc.c needed to support UCD 14.
Here are some of its features:
- Written in Python (easier to read & support?), only uses (a little) sed. Tested on Python 3.7.4.
- Doesn’t use an unspecified version of Ruby.
- Doesn’t use an unspecified version of Julia.
- Doesn’t require a previously built, unspecified, version of libutf8proc.
- Runs to completion in 5-6 seconds (about 10x as fast as data_generator.rb).
- Passes all utf8proc tests.
- No changes to the public API.
- Can generate a byte-for-byte identical utf8proc_data.c file compared to that contained in utf8proc 2.6.1.
- Can generate an equivalent utf8proc_data.c source file that is over 1.1 MB smaller.
- Writes informative header comments to the generated file.
- Can process the latest UCD 14-dev data files and generate a utf8proc_data.c that passes all current tests.
  [Due to the increased size of UCD 14 data I have had to split utf8proc_sequences & add utf8proc_casemap to prevent index overflow. This requires a small patch to utf8proc.c.]
- Can halve the size of utf8proc_stage1table. (Saves 4352 bytes.)
- Can be used to create a utf8proc_properties table that is > 64,000 bytes smaller.
- Doesn’t need data/Uppercase.txt, data/Lowercase.txt or data/CharWidths.txt files.
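The 4352-byte figure quoted for utf8proc_stage1table can be sanity-checked with a little arithmetic. This is only a sketch under assumptions that the post does not state: that the stage-1 table holds one entry per 256-code-point block of the Unicode code space, and that each entry is currently 2 bytes wide. Under those assumptions, halving the table saves exactly the reported amount. How the script actually achieves the halving is not described here.

```python
# Assumptions (not from the post): one stage-1 entry per 256-code-point
# block, and 2 bytes per entry in the existing table.
CODE_POINTS = 0x110000   # size of the Unicode code space
BLOCK_SIZE = 256         # assumed code points covered by one stage-1 entry
ENTRY_BYTES = 2          # assumed current width of one entry

entries = CODE_POINTS // BLOCK_SIZE
current_bytes = entries * ENTRY_BYTES
saved_bytes = current_bytes // 2   # saving if the table is halved

print(entries, current_bytes, saved_bytes)  # 4352 8704 4352
```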
Building with the (still in development) UCD 14 requires a new Makefile. I haven’t supplied that here as UCD 14 is still in a state of flux & the URLs are changing. (I can supply one if requested.)
UCD 14 has increased the size of the generated data. I have had to split utf8proc_sequences & add utf8proc_casemap to prevent index overflow. This requires the small patch to utf8proc.c contained in utf8proc.c.patch. With the patch applied, utf8proc.c still works with the original utf8proc_data.c as well as with the new-format UCD 13 & 14 data.
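To illustrate why splitting a table can prevent index overflow, here is a minimal sketch. It does not reproduce data_make.py or the real utf8proc tables. The function names, the synthetic sequences and the tiny LIMIT value are all hypothetical, chosen only to show that two separate flat tables each get their own index space.

```python
from typing import Dict, List, Tuple

Seq = Tuple[int, ...]

def build_table(seqs: List[Seq]) -> Tuple[List[int], Dict[Seq, int]]:
    """Flatten sequences into one array and record each sequence's start index."""
    flat: List[int] = []
    index: Dict[Seq, int] = {}
    for seq in seqs:
        if seq not in index:          # store each distinct sequence once
            index[seq] = len(flat)
            flat.extend(seq)
    return flat, index

def fits(index: Dict[Seq, int], index_limit: int) -> bool:
    """True if every start index is below the (assumed) index limit."""
    return all(i < index_limit for i in index.values())

# Synthetic data and a deliberately tiny limit, purely for illustration.
decomp = [(0x41, 0x300), (0x45, 0x301), (0x4F, 0x308)]
casemap = [(0x53, 0x53), (0x46, 0x46, 0x49)]
LIMIT = 6  # hypothetical; stands in for whatever the real index width allows

_, combined = build_table(decomp + casemap)
_, only_decomp = build_table(decomp)
_, only_case = build_table(casemap)

print(fits(combined, LIMIT))                             # False
print(fits(only_decomp, LIMIT), fits(only_case, LIMIT))  # True True
```

With one combined table the later sequences start past the limit. With two tables every start index stays in range, at the cost of the lookup code needing to know which table to consult.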
To use:
- Download & unpack a clean copy of utf8proc-2.6.1.tar.gz.
- Unpack & copy the attached data_make.py & utf8proc.c.patch into the utf8proc-2.6.1 dir.
- Run make -kC data to download the UCD 13 data files. [It’s OK if CharWidths.txt is not made.]
- Run patch < utf8proc.c.patch.
- Run ./data_make.py --verbose --format=1 --output=utf8proc_data.c
- Run make check.
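For anyone who wants to script the command steps above, a minimal Python sketch follows. The commands themselves are taken from the list. The helper names build_steps and run_steps are hypothetical, and the sketch assumes the tarball and attachments have already been unpacked into the utf8proc-2.6.1 directory and that data_make.py is executable.

```python
import subprocess
from pathlib import Path

def build_steps(patch_name="utf8proc.c.patch"):
    """Return (argv, stdin_file, must_succeed) for each step, in order."""
    return [
        # make -k keeps going; a missing CharWidths.txt is said to be OK,
        # so a non-zero exit status is tolerated for this step only.
        (["make", "-kC", "data"], None, False),
        (["patch"], patch_name, True),   # equivalent to: patch < utf8proc.c.patch
        (["./data_make.py", "--verbose", "--format=1",
          "--output=utf8proc_data.c"], None, True),
        (["make", "check"], None, True),
    ]

def run_steps(src_dir):
    """Run every step inside src_dir (the unpacked utf8proc-2.6.1 directory)."""
    src = Path(src_dir)
    for argv, stdin_name, must_succeed in build_steps():
        if stdin_name is None:
            subprocess.run(argv, cwd=src, check=must_succeed)
        else:
            with open(src / stdin_name, "rb") as patch_file:
                subprocess.run(argv, cwd=src, stdin=patch_file,
                               check=must_succeed)

# To execute the steps for real:
# run_steps("utf8proc-2.6.1")
```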
Usage is:
data_make.py [-v|--verbose] [-f#|--format=#] [--fix26] [--cmap] [-o ‹out-file›|--output=‹out-file›] [‹data-dir›]
If unspecified, the output file is utf8proc_data.out.c.
If unspecified, the input data-dir is ./data.
If --format=0 alone is used (the default) then the output file should be identical to the original utf8proc_data.c file.
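A byte-for-byte identity claim like this one can be checked with Python's standard filecmp module. The sketch below is not part of data_make.py. The helper name is_identical is hypothetical and the demo uses throwaway files rather than real generated data.

```python
import filecmp
import os
import tempfile

def is_identical(path_a, path_b):
    """Compare file contents byte by byte (shallow=False skips the stat shortcut)."""
    return filecmp.cmp(path_a, path_b, shallow=False)

# Demo with throwaway files of equal size.
with tempfile.TemporaryDirectory() as tmp:
    a = os.path.join(tmp, "a.c")
    b = os.path.join(tmp, "b.c")
    c = os.path.join(tmp, "c.c")
    for path, data in ((a, b"int x = 1;\n"),
                       (b, b"int x = 1;\n"),
                       (c, b"int x = 2;\n")):
        with open(path, "wb") as fh:
            fh.write(data)
    same = is_identical(a, b)
    differs = is_identical(a, c)

print(same, differs)  # True False
```

In practice the call would look something like is_identical("utf8proc_data.out.c", "utf8proc_data.c"), using the default output name and the file shipped in the 2.6.1 release.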
If --fix26 is used then the fixes described in issue #226 are applied to the tables.
If --cmap is used then the utf8proc_sequences table is split & the utf8proc_casemap table added. This requires the utf8proc.c.patch to be applied.
If --format=1 is used then --fix26 & --cmap are implied and the output file uses the new compact source form.
Using UCD 14 automatically forces --format=1 (thus --fix26 & --cmap too).
Using --verbose reports the options in effect & successful generation of the output file.
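The option rules described above can be summarised in a short argparse sketch. This is not the actual code of data_make.py, which is not shown in this thread. The function names and the ucd14 flag are hypothetical, and the sketch only mirrors the documented defaults and implications.

```python
import argparse

def make_parser():
    """Build a parser matching the documented usage line."""
    p = argparse.ArgumentParser(prog="data_make.py")
    p.add_argument("-v", "--verbose", action="store_true")
    p.add_argument("-f", "--format", type=int, default=0)
    p.add_argument("--fix26", action="store_true")
    p.add_argument("--cmap", action="store_true")
    p.add_argument("-o", "--output", default="utf8proc_data.out.c")
    p.add_argument("data_dir", nargs="?", default="./data")
    return p

def resolve(args, ucd14):
    """Apply the documented implications.

    ucd14 is a hypothetical flag saying whether the input is UCD 14 data;
    how the real script detects this is not stated.
    """
    if ucd14:
        args.format = 1          # UCD 14 forces --format=1
    if args.format == 1:
        args.fix26 = True        # --format=1 implies --fix26 & --cmap
        args.cmap = True
    return args

opts = resolve(make_parser().parse_args(["-f1", "-o", "out.c"]), ucd14=False)
print(opts.format, opts.fix26, opts.cmap, opts.output, opts.data_dir)
# 1 True True out.c ./data
```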
With the release of Unicode 14.0 I have now also updated the make files.
I’ve attached the changed files below.
I also updated data_make.py to add a UNICODE_VERSION macro to utf8proc_data.c, and changed the utf8proc_unicode_version API to return it if defined.
To build & test: Copy utf8proc-2.6.1.tar.gz & the attached utf8proc-2.6.1-changes.tar.gz into ‹your-work-dir› and:
cd ‹your-work-dir›
tar -xf utf8proc-2.6.1.tar.gz
tar -xf utf8proc-2.6.1-changes.tar.gz
make -C utf8proc-2.6.1 update check UNICODE_VERSION=14.0.0
This will download the UCD data files, generate a utf8proc_data.c, compile the code & run the tests.
Alternatively, make -C utf8proc-2.6.1 update check UNICODE_VERSION=13.0.0 will build utf8proc using the older UCD 13 data. Also, if you don’t specify any UNICODE_VERSION=… it defaults to 14.0.0.
This sounds great, I'll try to take a look at it later.
Can you convert this into a pull request? A PR is much easier to review than a tarball of changes.
> Can you convert this into a pull request? A PR is much easier to review than a tarball of changes.
I’m sorry, I’m not a git user. I don’t know how to do that.
I could probably attach a patch file here, if that would help.
[The python script is nearly 600 lines, but the other changes are very small.]
> I’m sorry, I’m not a git user. I don’t know how to do that.
There are hundreds of tutorials online — it's pretty indispensable for participating in any free/open-source software projects these days, not to mention a lot of commercial projects.
(If you can write Python code with all of the features listed above, I'm sure you can learn git!)
In a pinch, I can take the .tar.gz file you posted and make a pull request for you, though.
> (If you can write Python code with all of the features listed above, I'm sure you can learn git!)
I could, probably, learn git. I really don’t want to 🤡.
I’m not a Python programmer. I used it because I thought it would be acceptable, and I posted here because I was just trying to help out. I solved a problem and thought it could help others.
> In a pinch, I can take the .tar.gz file you posted and make a pull request for you, though.
Did the above commits give you what you wanted?
[I followed the instructions for ‘Linking a pull request …’, but just noticed that at the end it states “… will not be listed as a linked pull request”. So perhaps I have to do/should have done something else.]
Also, I appear to have missed the changes in 610730f.
You have to create a new pull request based on the commits: https://docs.github.com/en/github/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request
> You have to create a new pull request based on the commits: https://docs.github.com/en/github/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request
I read that and thought that “To open a pull request in a public repository, you must have write access …” meant it wasn’t what I wanted. So I followed this https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue. But apparently you have to push a pull request!
Anyway, I re-merged the changes from 610730f plus 1 additional warning.
[Of course I had based my changes on the released 2.6.1 code.]
And I also tweaked the Makefile so it still builds with the original 2.6.1 utf8proc_data.c as well as the newly generated ones for UCD 13 & 14.
[All done without git 🤓.]
Closed in favor of #258