Rules for words added from symbols

Question

philpennock opened this issue 3 years ago

With the new .words support, I have just been working on the nats-io/nats.go client library and quite a few typos and stale comments are being fixed in the PR I'm working on, so thank you. But, this is exposing some nice-to-haves:

- symbols which are types, not of an array, should be added with an /S affix rule, so comments can talk about their plurals
- if the .words file were loaded before the symbol tables, arbitrary sane rules could be written, without being masked by the symbols being added without rules
- Comments can talk about functions from an imported module which aren't being used, explaining why, so it might be useful to add the exported symbols of imported libraries, if that can be done sanely with performance. Eg, explaining why strconv.AppendInt is not being used.
- If a struct field's tag starts [a-z]+:" then up until the next comma or double-quotes is probably a variant spelling for wire transfer formats, and it makes sense for comments to use that term. Eg, a NoWait field can be json:"no_wait,omitempty"
- omitempty should probably be in the built-in dictionary. :)
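
The struct tag heuristic in the list above could be sketched roughly as follows. This is only an illustration of the idea, not how gospel implements it: the names tagWord and tagWords are hypothetical, and the sketch looks at every key in the tag rather than only the first.

```go
package main

import (
	"fmt"
	"regexp"
)

// tagWord matches a struct tag key such as json:" and captures the text up
// to the next comma or double quote, following the heuristic described above.
var tagWord = regexp.MustCompile(`[a-z]+:"([^",]*)`)

// tagWords returns the candidate wire-format spellings found in a raw
// struct tag (given without its surrounding backquotes). The function name
// is hypothetical.
func tagWords(tag string) []string {
	var words []string
	for _, m := range tagWord.FindAllStringSubmatch(tag, -1) {
		// Skip empty names and the conventional "-" (field omitted).
		if m[1] != "" && m[1] != "-" {
			words = append(words, m[1])
		}
	}
	return words
}

func main() {
	fmt.Println(tagWords(`json:"no_wait,omitempty"`))
	fmt.Println(tagWords(`json:"name" yaml:"display_name"`))
}
```

Options such as omitempty are not captured, because they follow a comma rather than a key, which is why that word would still need to be in a dictionary.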

The other head-scratcher from this work is hostnames, or other fields which look like hostnames. In this case, NATS subject examples, such as time.us.east.atlanta and time.eu.east leading to complaints about EU and Atlanta being wrong. I'm not sure what could sanely be done here.

Phil Pennock · Answer 1 · Sun Feb 06 2022 10:38:02 GMT+0800 (China Standard Time)

FYI: the above were found while composing nats-io/nats.go#899 -- I figure seeing the context is good, and seeing positive results from your work on this tool. It's a great addition!

Dan Kortschak · Answer 2 · Sun Feb 06 2022 10:52:37 GMT+0800 (China Standard Time)

"symbols which are types, not of an array, should be added with an /S affix rule, so comments can talk about their plurals"

Unfortunately, local dictionaries don't include the affix rules, just the dictionary. This also impacts on the second bullet point, since there are no rules.

Phil Pennock · Answer 3 · Sun Feb 06 2022 10:59:00 GMT+0800 (China Standard Time)

I can't define affix rules, sure, but I can use them. Eg, I can add demarshal/SDG and it covers demarshaling. It's just that if the stem is already registered then the duplicate in the .words file is ignored, together with its affix rules. So we can't add affix rules for symbols from the source, but I think perhaps we could if the .words file were loaded before those symbols?

Dan Kortschak · Answer 4 · Sun Feb 06 2022 11:06:03 GMT+0800 (China Standard Time)

You're right; I was mangling the test.

Dan Kortschak · Answer 5 · Sun Feb 06 2022 11:36:39 GMT+0800 (China Standard Time)

That is fixed in 46bdc9c. Just reordering is not enough; you also need to check whether the word is already recognised before adding it, otherwise the entry with affix rules gets clobbered.
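
The ordering problem can be illustrated with a toy dictionary. This is a minimal sketch under assumptions, not the code from 46bdc9c: the dict type and its methods are hypothetical, and a real dictionary would be the hunspell handle rather than a map.

```go
package main

import "fmt"

// dict is a toy stand-in for a spelling dictionary: it maps a stem to the
// affix flags attached to it (empty when the stem was added without rules).
type dict map[string]string

// addWithRules registers a stem with affix flags, as for a .words entry
// such as "demarshal/SDG".
func (d dict) addWithRules(stem, flags string) { d[stem] = flags }

// addSymbol registers a bare identifier taken from the source, but only
// when the stem is not already known, so that an earlier entry keeps its
// affix flags instead of being overwritten.
func (d dict) addSymbol(stem string) {
	if _, known := d[stem]; known {
		return
	}
	d[stem] = ""
}

func main() {
	d := dict{}
	d.addWithRules("demarshal", "SDG") // loaded first, from .words
	d.addSymbol("demarshal")           // later, from the symbol table
	d.addSymbol("NoWait")
	fmt.Printf("%q %q\n", d["demarshal"], d["NoWait"])
}
```

Without the check in addSymbol, the second call would replace the flagged entry with a bare one, which is the clobbering described above.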

Dan Kortschak · Answer 6 · Sun Feb 13 2022 05:54:55 GMT+0800 (China Standard Time)

I think I have addressed most of these. The one I am not sure how to handle is adding pluralised words to the runtime dictionary for types that are elements of arrays or slices. We now have access to type information, but runtime word addition doesn't accept rule suffixes (confirmed from the hunspell source), and as far as I can see add_with_affix, which could be used as a partial workaround, is broken (unless I'm using it wrong): https://go.dev/play/p/tXHcnAY2w8p

Phil Pennock · Answer 7 · Tue Feb 15 2022 02:28:57 GMT+0800 (China Standard Time)

So the affix rules are strongly associated with one dictionary sibling file and not generally exposed for the language, and you have to use exemplars of "word like this one" to program to the API?

There's something strange in the en_US.dic file: thing/M means you can get thing's, and so thing's is accepted as correct, but this dictionary does not appear to accept things as a result of thing. Running hunspell -s =(echo things) shows that the stem is "the": the J affix rule replaces the trailing e with ings, giving the -> things.

So I think the issue might be that your example word to use for affix rules does not have the rules you expected it to have.

If I use item as the example, then I get your playground code to work as I expect.
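
A toy model may make the behaviour easier to see. It is a sketch under assumptions, not hunspell's implementation: the rule table is simplified, the flags given to item are assumed (the thread only says that item works as an example), and Conn and Msg are hypothetical type names.

```go
package main

import (
	"fmt"
	"strings"
)

// suffixRule is a simplified hunspell-style suffix rule: remove strip from
// the end of the stem, then append add.
type suffixRule struct{ strip, add string }

// rules maps affix flags to toy rules. They are assumptions for
// illustration, not copies of the entries in en_US.aff.
var rules = map[byte]suffixRule{
	'M': {"", "'s"},    // possessive
	'S': {"", "s"},     // plural
	'J': {"e", "ings"}, // the -> things, as hunspell -s reported
}

// dict maps each stem to its affix flags.
type dict map[string]string

// addLike adds word with the same flags as example, modelling the
// "word like this one" style of API.
func (d dict) addLike(word, example string) { d[word] = d[example] }

// accepts reports whether word is a stem, or can be derived from a stem by
// applying exactly one of that stem's suffix rules.
func (d dict) accepts(word string) bool {
	if _, ok := d[word]; ok {
		return true
	}
	for stem, flags := range d {
		for i := 0; i < len(flags); i++ {
			r, ok := rules[flags[i]]
			if !ok || !strings.HasSuffix(stem, r.strip) {
				continue
			}
			if strings.TrimSuffix(stem, r.strip)+r.add == word {
				return true
			}
		}
	}
	return false
}

func main() {
	d := dict{"thing": "M", "item": "MS", "the": "J"}
	d.addLike("Conn", "thing") // inherits only the possessive rule
	d.addLike("Msg", "item")   // inherits possessive and plural rules
	fmt.Println(d.accepts("Conns"), d.accepts("Conn's"))
	fmt.Println(d.accepts("Msgs"), d.accepts("things"))
}
```

In this model a word added with thing as the example gets a possessive but no plural, while things itself is still accepted because it derives from the stem the.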

Dan Kortschak · Answer 8 · Tue Feb 15 2022 05:39:23 GMT+0800 (China Standard Time)

Thanks @philpennock that is very helpful (I should look harder in future). I think this may give us a way forward to add plurals and possessives. Very funny to see that things is not an acceptable word (same in en_AU).

Phil Pennock · Answer 9 · Tue Feb 15 2022 05:57:09 GMT+0800 (China Standard Time)

Oh, things is acceptable, but not because of thing. Looks as though they've tried to ensure a unique derivation for every spelling, which is problematic when you have one spelling for two different semantic meanings from two different roots. I guess it's evidence that existing open source language dictionary stemming rules are not a good choice for seeding an AI semantic model.

But thank you. Over the past couple of weeks I've learnt more than I ever expected to about computerized spelling correction. Eww. I'm glad you're maintaining gospel and not me! 🥇

Dan Kortschak · Answer 10 · Wed Feb 16 2022 16:05:52 GMT+0800 (China Standard Time)

I think this is all done.