TreeRex / han-eacc-ucs

Improved EACC to Unicode mappings

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

EACC/Unicode Ideograph Mappings

The kEACC field in Unihan 6.2 is woefully out of date. Compared to the mappings in the latest MARC-8 Code Table at the Library of Congress (LoC) it has 8 different mappings and is missing 235.

This directory contains an updated table for Unihan derived from the LoC data.

The Source Data

  • Unihan_OtherMappings.txt 6.2 from the Unicode Consortium
  • “MARC-8 to Unicode XML mapping file” from the Library of Congress

The Mapping Table

loc-eacc-ucs.txt was generated with loc.xslt XSLT script from the LoC MARC-8 table.

The Programs

loc.xslt
XSLT script to extract the Han Ideograph mappings from the LoC XML file. Handles the cases where the EACC code maps to both the PUA and to U+3013. The output of this script is a file containing two tab-separated columns:
  1. The 3-byte EACC code as six hexadecimal numbers
  2. The USV of the corresponding Unicode character
eacc-loc-unihan.lisp
functions for reading the mapping tables and comparing their entries. This uses the CL-PPCRE library which is easily installable via QuickLisp. Tested with Clozure Common Lisp it should work with any implementation.

Comparing the Tables

Load eacc-loc-unihan.lisp into your Lisp image and switch to the EACC package.

EACC> (defvar *unihan* (read-unihan-eacc-mappings "Unihan_OtherMappings.txt"))
*UNIHAN*
EACC> (defvar *loc* (read-loc-eacc-mappings "loc-eacc-ucs.txt"))
*LOC*
EACC> (compare-entries *UNIHAN* *LOC*)
4B5F58	0F9B2	096F6
215C32	0FA25	09038
215061	0FA1D	07CBE
4B7421	0F9A9	056F9
4B4B3E	0F9AD	073B2
215F71	0FA1C	09756
4B333E	0F92E	051B7
214339	0FA12	06674
NIL

The output of the call to compare-entries shows the 8 ideographs in EACC that have different mappings in Unihan (e.g., U+F982) than in the LoC table (e.g., U+96F6).

Comparing in the other direction shows the 235 characters that have mappings in the LoC table without a kEACC mapping in Unihan:

EACC> (compare-entries *LOC* *UNIHAN*)
4B3474		0537F
213F53		061F2
4B5361		089D2
214456		06813
;;; lots deleted
216053		0985E
216044		09818
3A284C		053A9
45564B		0865E
NIL

License

The source code is in the public domain: do with it what you will.

About

Improved EACC to Unicode mappings


Languages

Language:Common Lisp 100.0%