Use GitHub Releases for assets

Question

wooorm opened this issue 5 years ago

👋

Publishing builds as GitHub Releases would let people:

watch releases
see the changelog
download assets, which is the main reason I’m opening this issue: I’ve been trying for a couple of hours on macOS to build the hunspell .aff and .dic files but can’t seem to make it work 😩

Extra info: I package hunspell dictionaries for use in the JS ecosystem in wooorm/dictionaries, and similar projects (such as this one) do offer this feature.

crash5 · Answer 1 · Sat May 04 2019 14:54:08 GMT+0800 (China Standard Time)

Can you post the error message? Until an official version is available you can use an unofficial version from here.

Titus · Answer 2 · Sat May 04 2019 15:46:10 GMT+0800 (China Standard Time)

Sweet, thanks!

Setup

So, I’m on macOS (latest), and updated to use GNU stuff:

ispell: brew install ispell
hunspell: brew install hunspell
coreutils: brew install coreutils
sed: brew install gnu-sed
awk: brew install gawk
m4: brew install m4

Where the installation logs said to, I’ve put them on PATH to override the Apple versions of the tools.
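
One way to double-check that override is to ask which executable each tool name actually resolves to. Below is a minimal Python sketch, assuming the build scripts call the tools by their plain names (`awk`, `sed`, `m4`, `recode`); `resolve_tools` is a hypothetical helper for illustration, not something from this repo:

```python
import shutil


def resolve_tools(names, path=None):
    """Return {tool name: executable found by PATH lookup, or None}.

    `path` is optional; when omitted, the current PATH is searched.
    """
    return {name: shutil.which(name, path=path) for name in names}


if __name__ == "__main__":
    # Assumed tool names; adjust to whatever the build scripts really call.
    for name, location in resolve_tools(["awk", "sed", "m4", "recode"]).items():
        print(f"{name}: {location or 'not found'}")
```

If `awk` or `sed` still resolves to something under /usr/bin, the Apple version is the one being picked up for that name.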

I’ve updated the max open files:

$ launchctl limit maxfiles
	maxfiles    65536          2000000

Usage

After cloning this repo I do:

$ LC_ALL=C make myspell

Yields:

===> magyar myspell alapszótár (magyar4myspell.dict) előállítása
==> szimbolikus kötések létrehozása a szotar.konf alapján
konfigurációs állomány nincs megadva, alapértelmezett a szotar.konf
/Users/tilde/Downloads/oss/magyarispell
. Rendben.
==> szótárak egybemásolása
.recode: Invalid input in step `UTF-8..ISO-8859-2'
recode: Invalid input in step `UTF-8..ISO-8859-2'
.recode: Invalid input in step `UTF-8..ISO-8859-2'
recode: Invalid input in step `UTF-8..ISO-8859-2'
.recode: Invalid input in step `UTF-8..ISO-8859-2'
recode: Invalid input in step `UTF-8..ISO-8859-2'
..............recode: Invalid input in step `UTF-8..ISO-8859-2'
recode: Invalid input in step `UTF-8..ISO-8859-2'
..................recode: Invalid input in step `UTF-8..ISO-8859-2'
recode: Invalid input in step `UTF-8..ISO-8859-2'
................recode: Invalid input in step `UTF-8..ISO-8859-2'
recode: Invalid input in step `UTF-8..ISO-8859-2'
.recode: Invalid input in step `UTF-8..ISO-8859-2'
recode: Invalid input in step `UTF-8..ISO-8859-2'
.recode: Invalid input in step `UTF-8..ISO-8859-2'
recode: Invalid input in step `UTF-8..ISO-8859-2'
.recode: Invalid input in step `UTF-8..ISO-8859-2'
recode: Invalid input in step `UTF-8..ISO-8859-2'
.recode: Invalid input in step `UTF-8..ISO-8859-2'
recode: Invalid input in step `UTF-8..ISO-8859-2'
.recode: Invalid input in step `UTF-8..ISO-8859-2'
recode: Invalid input in step `UTF-8..ISO-8859-2'
.sed: 1: "/Users/tilde/Downloads/ ...": undefined label 'ilde/Downloads/oss/magyarispell/tmp/fonev_osszetett.1'
 Rendben.
# hy: összetétel
-n .
==> igéből képzett alakok előállítása
.......... Rendben.
==> igék
... Rendben.
==> kivételek
-n .
-n .
-n .
-n .
Rendben.
==> névszók
.../usr/bin/awk: fonev_igekoto.1 makes too many open files
 source line number 30
./usr/bin/awk: fonev_igekoto.1 makes too many open files
 source line number 30
............/usr/bin/awk: fonev_igekoto.1 makes too many open files
 source line number 30
./usr/bin/awk: fonev_igekoto.1 makes too many open files
 source line number 30
/usr/bin/awk: fonev_igekoto.1 makes too many open files
 source line number 30
/usr/bin/awk: fonev_igekoto.1 makes too many open files
 source line number 30
.. Rendben.
==> morfológiai kódok
-n .
==> tiltott szavak
.recode: Invalid input in step `UTF-8..ISO-8859-2'
..... Rendben.
Rendben.
===> ragozási táblázat (magyar.aff) előállítása
===> myspell ragozási táblázat (hu_HU.aff) előállítása
===> myspell szótár (hu_HU.dic) előállítása
awk: cmd. line:2: (FILENAME=tmp/allomorf.txt FNR=1) warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale.
===> Unicode karakterkódolású állományok előállítása
===> Tömörített Hunspell szótárak elkészítése
output: hu_HU_u8_alias.dic, hu_HU_u8_alias.aff
output: hu_HU_u8_gen_alias.dic, hu_HU_u8_gen_alias.aff
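
The repeated recode message suggests that some input cannot pass through the UTF-8 to ISO-8859-2 step. Here is a rough Python sketch for locating such lines, assuming a line can fail either because it is not valid UTF-8 or because it contains a character that ISO-8859-2 lacks. `transcode_problems` is a hypothetical helper, and the sample bytes are made up for illustration, not taken from the dictionary sources:

```python
def transcode_problems(data: bytes):
    """List (line number, reason) for lines that cannot go UTF-8 -> ISO-8859-2."""
    problems = []
    for number, raw in enumerate(data.splitlines(), start=1):
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            problems.append((number, "not valid UTF-8"))
            continue
        try:
            text.encode("iso8859_2")
        except UnicodeEncodeError:
            problems.append((number, "no ISO-8859-2 equivalent"))
    return problems


if __name__ == "__main__":
    # Made-up sample: two clean lines, one stray 0xF5 byte, one euro sign.
    sample = (
        "alma\n\u0151\n".encode("utf-8")
        + b"\xf5\n"
        + "\u20ac\n".encode("utf-8")
    )
    print(transcode_problems(sample))
    # [(3, 'not valid UTF-8'), (4, 'no ISO-8859-2 equivalent')]
```

To try it on a real file, read it in binary mode (`open(path, "rb").read()`) so that no decoding happens before the check.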

Results

Actual

$ tail hu_HU_u8.aff -n20
PFX ! 0 nyolc/% .	[adj_num]+
PFX ! 0 nulla/% .	[adj_num]+
PFX ! 0 negyven/% .	[adj_num]+
PFX ! 0 millió/% .	[adj_num]+
PFX ! 0 milliárd/% .	[adj_num]+
PFX ! 0 két/% .	[adj_num]+
PFX ! 0 kilencven/% .	[adj_num]+
PFX ! 0 kilenc/% .	[adj_num]+
PFX ! 0 húsz/% .	[adj_num]+
PFX ! 0 hét/% .	[adj_num]+
PFX ! 0 három/% .	[adj_num]+
PFX ! 0 hetven/% .	[adj_num]+
PFX ! 0 hatvan/% .	[adj_num]+
PFX ! 0 hat/% .	[adj_num]+
PFX ! 0 harminc/% .	[adj_num]+
PFX ! 0 fél/% .	[adj_num]+
PFX ! 0 ezer/% .	[adj_num]+
PFX ! 0 egy/% .	[adj_num]+
PFX ! 0 billió/% .	[adj_num]+

Expected

$ tail hu_HU_u8.aff -n20
PFX ! 0 nyolc/% .
PFX ! 0 nulla/% .
PFX ! 0 negyven/% .
PFX ! 0 millió/% .
PFX ! 0 milliárd/% .
PFX ! 0 két/% .
PFX ! 0 kilencven/% .
PFX ! 0 kilenc/% .
PFX ! 0 húsz/% .
PFX ! 0 hét/% .
PFX ! 0 három/% .
PFX ! 0 hetven/% .
PFX ! 0 hatvan/% .
PFX ! 0 hat/% .
PFX ! 0 harminc/% .
PFX ! 0 fél/% .
PFX ! 0 ezer/% .
PFX ! 0 egy/% .
PFX ! 0 billió/% .

Jason Dent · Answer 3 · Tue Jan 16 2024 19:29:02 GMT+0800 (China Standard Time)

@crash5,

It looks like the release at https://github.com/crash5/mozilla-hungarian-spellchecker/releases/tag/2023.12.25.04.07 has HTML-style entities such as &permil; embedded in the file:

FORBIDDENWORD w
WORDCHARS -.&permil;&sect;%&deg;0123456789&ndash;&euro;'&apos;&ffi;&ffl;&ff;&fi;&fl;
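
For anyone who wants to check their own copy, here is a small Python sketch that lists entity-like tokens in a line. It assumes an entity looks like `&name;` with an alphanumeric name; `find_entities` is a hypothetical helper, and the pattern is a guess at the notation rather than the project's own definition:

```python
import re

# Assumed shape of an entity: "&", a letter, more letters/digits, then ";".
ENTITY = re.compile(r"&[A-Za-z][A-Za-z0-9]*;")


def find_entities(line: str):
    """Return every entity-like token in `line`, in order of appearance."""
    return ENTITY.findall(line)


if __name__ == "__main__":
    line = (
        "WORDCHARS -.&permil;&sect;%&deg;0123456789&ndash;&euro;"
        "'&apos;&ffi;&ffl;&ff;&fi;&fl;"
    )
    print(find_entities(line))
    # ['&permil;', '&sect;', '&deg;', '&ndash;', '&euro;', '&apos;',
    #  '&ffi;', '&ffl;', '&ff;', '&fi;', '&fl;']
```

An empty list for every line of a file would mean no such tokens were left behind.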

crash5 · Answer 4 · Tue Jan 16 2024 23:48:18 GMT+0800 (China Standard Time)

I'm simply building it with the commands from the README, so I can't really do anything about it.

From a quick look at the git history and the makefiles, these codes are replaced with their Unicode counterparts during Unicode output generation (see bin/l1_u8.sed and bin/u8myspell).

I don't know MySpell, so I can't tell whether it is a bug that they are left in the output of the non-Unicode version. Maybe @laszlonemeth or @tgyurci can answer this question.

As far as I can see, the Unicode outputs (hu_HU_u8*) are fine. Can you use these instead of the non-Unicode versions?

Jason Dent · Answer 5 · Wed Jan 17 2024 01:20:17 GMT+0800 (China Standard Time)

@crash5, @laszlonemeth,

The files might have u8 in their names, but they are not UTF-8; they are a mix of ISO-8859-2 and UTF-8.

Viewed as UTF-8:

Viewed as ISO-8859-2:

But, as you stated, WORDCHARS is correctly converted to UTF-8.
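
A rough way to confirm the mix is to classify each line by whether its bytes decode as UTF-8. The Python sketch below assumes that a non-ASCII line which fails UTF-8 decoding is probably ISO-8859-2, which is a guess (it only proves the line is not UTF-8). `classify_line`, `encoding_profile` and `is_mixed` are hypothetical helpers, and the sample bytes are invented, not copied from the release:

```python
from collections import Counter


def classify_line(raw: bytes) -> str:
    """Label one line as 'ascii', 'utf8', or 'not-utf8'."""
    if raw.isascii():
        return "ascii"
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return "not-utf8"
    return "utf8"


def encoding_profile(data: bytes) -> Counter:
    """Count how many lines fall into each label."""
    return Counter(classify_line(line) for line in data.splitlines())


def is_mixed(data: bytes) -> bool:
    """True when both valid multi-byte UTF-8 lines and non-UTF-8 lines occur."""
    profile = encoding_profile(data)
    return profile["utf8"] > 0 and profile["not-utf8"] > 0


if __name__ == "__main__":
    # Invented sample: an ASCII line, a UTF-8 line, an ISO-8859-2 line.
    sample = (
        b"SET UTF-8\n"
        + "WORDCHARS \u2030\u00a7\n".encode("utf-8")
        + "f\u0151n\u00e9v\n".encode("iso8859_2")
    )
    print(encoding_profile(sample))
    print(is_mixed(sample))  # True
```

Short ISO-8859-2 byte runs can occasionally happen to be valid UTF-8, so a clean result from this check is a strong hint, not a proof.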