laszlonemeth / magyarispell

Hungarian Hunspell dictionary

Use GitHub Releases for assets

wooorm opened this issue

commented

👋

GitHub Releases makes it easy to:

  • watch releases
  • see the changelog
  • download assets, which is the main reason I’m opening this issue: I’ve been trying for a couple of hours on macOS to build the Hunspell .aff and .dic files but can’t get it to work 😩

Extra info: I package Hunspell dictionaries for use in the JS ecosystem (wooorm/dictionaries), and similar projects (such as this one) do offer this feature.

Can you post the error message? Until an official version is available you can use an unofficial version from here.

commented

Sweet, thanks!

Setup

So, I’m on macOS (latest), and updated to use GNU stuff:

  • ispell: brew install ispell
  • hunspell: brew install hunspell
  • coreutils: brew install coreutils
  • sed: brew install gnu-sed
  • awk: brew install gawk
  • m4: brew install m4

Where the installation logs said to, I’ve put them on PATH to override the Apple versions of the tools.
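A quick way to confirm which binaries the shell actually resolves is sketched below. It is only a minimal check, not part of the magyarispell build: the `report_tool` helper and the tool list are assumptions made for illustration, and a Makefile that calls a tool by absolute path would bypass PATH regardless of what this prints.

```shell
#!/bin/sh
# Print the path the shell resolves for a tool, or "not found".
# report_tool is a hypothetical helper, not something the repo provides.
report_tool() {
  tool=$1
  path=$(command -v "$tool" 2>/dev/null) || path=""
  if [ -z "$path" ]; then
    printf '%s: not found\n' "$tool"
  else
    printf '%s: %s\n' "$tool" "$path"
  fi
}

# Tool list mirrors the packages installed above; adjust as needed.
for tool in sed awk m4 recode ispell hunspell; do
  report_tool "$tool"
done
```

If a line shows a path under /usr/bin, the Apple version of that tool is still the one being picked up.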

I’ve updated the max open files:

$ launchctl limit maxfiles
	maxfiles    65536          2000000        

Usage

After cloning this repo I do:

$ LC_ALL=C make myspell

Yields:

===> magyar myspell alapszótár (magyar4myspell.dict) előállítása
==> szimbolikus kötések létrehozása a szotar.konf alapján
konfigurációs állomány nincs megadva, alapértelmezett a szotar.konf
/Users/tilde/Downloads/oss/magyarispell
. Rendben.
==> szótárak egybemásolása
.recode: Invalid input in step `UTF-8..ISO-8859-2'
recode: Invalid input in step `UTF-8..ISO-8859-2'
.recode: Invalid input in step `UTF-8..ISO-8859-2'
recode: Invalid input in step `UTF-8..ISO-8859-2'
.recode: Invalid input in step `UTF-8..ISO-8859-2'
recode: Invalid input in step `UTF-8..ISO-8859-2'
..............recode: Invalid input in step `UTF-8..ISO-8859-2'
recode: Invalid input in step `UTF-8..ISO-8859-2'
..................recode: Invalid input in step `UTF-8..ISO-8859-2'
recode: Invalid input in step `UTF-8..ISO-8859-2'
................recode: Invalid input in step `UTF-8..ISO-8859-2'
recode: Invalid input in step `UTF-8..ISO-8859-2'
.recode: Invalid input in step `UTF-8..ISO-8859-2'
recode: Invalid input in step `UTF-8..ISO-8859-2'
.recode: Invalid input in step `UTF-8..ISO-8859-2'
recode: Invalid input in step `UTF-8..ISO-8859-2'
.recode: Invalid input in step `UTF-8..ISO-8859-2'
recode: Invalid input in step `UTF-8..ISO-8859-2'
.recode: Invalid input in step `UTF-8..ISO-8859-2'
recode: Invalid input in step `UTF-8..ISO-8859-2'
.recode: Invalid input in step `UTF-8..ISO-8859-2'
recode: Invalid input in step `UTF-8..ISO-8859-2'
.sed: 1: "/Users/tilde/Downloads/ ...": undefined label 'ilde/Downloads/oss/magyarispell/tmp/fonev_osszetett.1'
 Rendben.
# hy: összetétel
-n .
==> igéből képzett alakok előállítása
.......... Rendben.
==> igék
... Rendben.
==> kivételek
-n .
-n .
-n .
-n .
Rendben.
==> névszók
.../usr/bin/awk: fonev_igekoto.1 makes too many open files
 source line number 30
./usr/bin/awk: fonev_igekoto.1 makes too many open files
 source line number 30
............/usr/bin/awk: fonev_igekoto.1 makes too many open files
 source line number 30
./usr/bin/awk: fonev_igekoto.1 makes too many open files
 source line number 30
/usr/bin/awk: fonev_igekoto.1 makes too many open files
 source line number 30
/usr/bin/awk: fonev_igekoto.1 makes too many open files
 source line number 30
.. Rendben.
==> morfológiai kódok
-n .
==> tiltott szavak
.recode: Invalid input in step `UTF-8..ISO-8859-2'
..... Rendben.
Rendben.
===> ragozási táblázat (magyar.aff) előállítása
===> myspell ragozási táblázat (hu_HU.aff) előállítása
===> myspell szótár (hu_HU.dic) előállítása
awk: cmd. line:2: (FILENAME=tmp/allomorf.txt FNR=1) warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale.
===> Unicode karakterkódolású állományok előállítása
===> Tömörített Hunspell szótárak elkészítése
output: hu_HU_u8_alias.dic, hu_HU_u8_alias.aff
output: hu_HU_u8_gen_alias.dic, hu_HU_u8_gen_alias.aff

Results

Actual
$ tail hu_HU_u8.aff -n20
PFX ! 0 nyolc/% .	[adj_num]+
PFX ! 0 nulla/% .	[adj_num]+
PFX ! 0 negyven/% .	[adj_num]+
PFX ! 0 millió/% .	[adj_num]+
PFX ! 0 milliárd/% .	[adj_num]+
PFX ! 0 két/% .	[adj_num]+
PFX ! 0 kilencven/% .	[adj_num]+
PFX ! 0 kilenc/% .	[adj_num]+
PFX ! 0 húsz/% .	[adj_num]+
PFX ! 0 hét/% .	[adj_num]+
PFX ! 0 három/% .	[adj_num]+
PFX ! 0 hetven/% .	[adj_num]+
PFX ! 0 hatvan/% .	[adj_num]+
PFX ! 0 hat/% .	[adj_num]+
PFX ! 0 harminc/% .	[adj_num]+
PFX ! 0 fél/% .	[adj_num]+
PFX ! 0 ezer/% .	[adj_num]+
PFX ! 0 egy/% .	[adj_num]+
PFX ! 0 billió/% .	[adj_num]+
Expected
$ tail hu_HU_u8.aff -n20
PFX ! 0 nyolc/% .
PFX ! 0 nulla/% .
PFX ! 0 negyven/% .
PFX ! 0 millió/% .
PFX ! 0 milliárd/% .
PFX ! 0 két/% .
PFX ! 0 kilencven/% .
PFX ! 0 kilenc/% .
PFX ! 0 húsz/% .
PFX ! 0 hét/% .
PFX ! 0 három/% .
PFX ! 0 hetven/% .
PFX ! 0 hatvan/% .
PFX ! 0 hat/% .
PFX ! 0 harminc/% .
PFX ! 0 fél/% .
PFX ! 0 ezer/% .
PFX ! 0 egy/% .
PFX ! 0 billió/% .

@crash5,

It looks like the release at https://github.com/crash5/mozilla-hungarian-spellchecker/releases/tag/2023.12.25.04.07 has HTML entities embedded in the file, such as the one for ‰:

FORBIDDENWORD w
WORDCHARS -.‰§%°0123456789–€''&ffi;&ffl;&ff;&fi;&fl;

I'm simply building it with the commands from the README, so I can't really do anything about it.

From a quick look at the git history and the makefiles, these codes are replaced with their Unicode counterparts during Unicode output generation (see bin/l1_u8.sed and bin/u8myspell).
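To illustrate the kind of substitution described here, the following is a rough sketch only. The `entities_to_unicode` function is hypothetical, and the mapping of each code to a ligature character is an assumption based on the code names; the actual rules live in bin/l1_u8.sed and are not reproduced here.

```shell
#!/bin/sh
# Hypothetical stand-in for the entity-to-Unicode step.
# Assumed mapping: each code becomes the matching Latin ligature
# (U+FB00..U+FB04). The real script may differ.
entities_to_unicode() {
  sed -e 's/&ffi;/ﬃ/g' \
      -e 's/&ffl;/ﬄ/g' \
      -e 's/&ff;/ﬀ/g' \
      -e 's/&fi;/ﬁ/g' \
      -e 's/&fl;/ﬂ/g'
}

# Feed the codes seen at the end of the WORDCHARS line through the filter.
printf '%s\n' '&ffi;&ffl;&ff;&fi;&fl;' | entities_to_unicode
```

A filter like this only makes sense for UTF-8 output, since ISO-8859-2 has no code points for these characters, which may be why the codes remain in the non-Unicode files.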

I don't know MySpell, so I can't tell whether it is a bug that they are left in the output of the non-Unicode version. Maybe @laszlonemeth or @tgyurci can answer this question.

As far as I can see, the Unicode outputs (hu_HU_u8*) are fine. Can you use these instead of the non-Unicode versions?

@crash5, @laszlonemeth,

The files might have u8 in their names, but they are not UTF-8; they are a mix of ISO-8859-2 and UTF-8.

Viewed as UTF-8:
image

Viewed as ISO-8859-2:
image

But, as you stated, WORDCHARS is correctly converted to UTF-8.

image
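The mixed-encoding observation can also be checked from the command line. The sketch below assumes `iconv` is installed and that it exits with a non-zero status on invalid input when no `-c` flag is given; the `is_valid_utf8` helper is a name made up for this example, and the file names are the ones mentioned in this thread.

```shell
#!/bin/sh
# Succeed only if the whole file decodes cleanly as UTF-8.
# is_valid_utf8 is a hypothetical helper name.
is_valid_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Check the generated files if they exist in the current directory.
for f in hu_HU_u8.aff hu_HU_u8.dic; do
  [ -f "$f" ] || continue
  if is_valid_utf8 "$f"; then
    printf '%s: valid UTF-8\n' "$f"
  else
    printf '%s: contains bytes that are not valid UTF-8\n' "$f"
  fi
done
```

A file that mixes ISO-8859-2 and UTF-8 would normally fail this check, because a lone ISO-8859-2 accented letter such as é (byte 0xE9) is not a complete UTF-8 sequence.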