Rules for words added from symbols
philpennock opened this issue · comments
With the new `.words` support, I have just been working on the nats-io/nats.go client library, and quite a few typos and stale comments are being fixed in the PR I'm working on, so thank you. But this is exposing some nice-to-haves:
- symbols which are types, not of an array, should be added with an `/S` affix rule, so comments can talk about their plurals
- if the `.words` file were loaded before the symbol tables, arbitrary sane rules could be written, without being masked by the symbols being added without rules
- Comments can talk about functions from an imported module which aren't being used, explaining why, so it might be useful to add the exported symbols of imported libraries, if that can be done sanely with performance. Eg, explaining why `strconv.AppendInt` is not being used.
- If a struct field's tag starts `[a-z]+:"` then up until the next comma or double-quote is probably a variant spelling for wire transfer formats, and it makes sense for comments to use that term. Eg, a `NoWait` field can be `json:"no_wait,omitempty"`
- `omitempty` should probably be in the built-in dictionary. :)
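The struct tag bullet above could be sketched roughly as follows. This is only an illustration of the rule as described, not code from the tool; the names `tagWord` and `wireName` are made up for the example.

```go
package main

import (
	"fmt"
	"regexp"
)

// tagWord matches a tag that starts with key:" and captures everything up
// to the next comma or double quote. (Hypothetical name, for illustration.)
var tagWord = regexp.MustCompile(`^[a-z]+:"([^",]*)`)

// wireName returns the candidate wire-format spelling at the start of a
// struct tag, or "" if the tag does not start with key:" or the name is empty.
func wireName(tag string) string {
	m := tagWord.FindStringSubmatch(tag)
	if m == nil {
		return ""
	}
	return m[1]
}

func main() {
	fmt.Println(wireName(`json:"no_wait,omitempty"`)) // no_wait
	fmt.Println(wireName(`json:",omitempty"`) == "")  // true
}
```

A word such as `no_wait` found this way would presumably still need splitting on the underscore before being offered to the dictionary; that step is left out here.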
The other head-scratcher from this work is hostnames, or other fields which look like hostnames. In this case, NATS subject examples, such as `time.us.east.atlanta` and `time.eu.east` leading to complaints about EU and Atlanta being wrong. I'm not sure what could sanely be done here.
FYI: the above were found while composing nats-io/nats.go#899 -- I figure seeing the context is good, as is seeing positive results from your work on this tool. It's a great addition!
> symbols which are types, not of an array, should be added with an `/S` affix rule, so comments can talk about their plurals
Unfortunately, local dictionaries don't include the affix rules, just the dictionary. This also impacts on the second bullet point, since there are no rules.
I can't define affix rules, sure, but I can use them. Eg, I can add `demarshal/SDG` and it covers `demarshaling`. It's just that if the stem is already registered then the duplicate in the `.words` file is ignored, together with its affix rules. So we can't add affix rules for symbols from the source, but I think perhaps we could if the `.words` file were loaded before those symbols?
You're right; I was mangling the test.
That is fixed in 46bdc9c. Just reordering is not enough, you need to check whether it is recognised, otherwise the affix rule'd entry gets clobbered.
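To make the ordering point concrete, here is a toy sketch of "load `.words` first, then only add symbols that are not already recognised". It is not the code from that commit: the `dict` type and its methods are hypothetical stand-ins for the real dictionary, which keeps far more state than a map.

```go
package main

import (
	"fmt"
	"strings"
)

// dict is a toy stand-in for a spelling dictionary: stem -> affix flags.
type dict map[string]string

// addEntry registers a .words-style entry such as "demarshal/SDG".
func (d dict) addEntry(entry string) {
	stem, flags, _ := strings.Cut(entry, "/")
	d[stem] = flags
}

// addSymbol registers an identifier harvested from the source, without
// affix flags, but only if the stem is not already known. Skipping the
// check would overwrite the flags that came from the .words file.
func (d dict) addSymbol(name string) {
	if _, ok := d[name]; ok {
		return
	}
	d[name] = ""
}

func main() {
	d := dict{}
	d.addEntry("Conn/S") // from the .words file, loaded first
	d.addSymbol("Conn")  // from the symbol table, must not clobber "S"
	d.addSymbol("Msg")
	fmt.Printf("%q %q\n", d["Conn"], d["Msg"]) // "S" ""
}
```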
I think I have addressed most of these. One that I am not sure how to do is the pluralised word addition to the runtime dictionary for types that are elements of arrays or slices. We now have access to type information, but runtime word addition doesn't accept rule suffixes (confirmed from the hunspell source), and as far as I can see `add_with_affix` (which could be used as a partial workaround) is broken (unless I'm using it wrong): https://go.dev/play/p/tXHcnAY2w8p.
So the affix rules are strongly associated with one dictionary sibling file and not generally exposed for the language, and you have to use exemplars of "word like this one" to program to the API?
There's something strange in the `en_US.dic` file: `thing/M` means you can get `thing's` and so `thang's` is accepted as correct. This dictionary appears to not accept `things` as a result of `thing`. Running `hunspell -s =(echo things)` shows that the stem is `the` and the `J` affix rule is giving us the `ings` instead of `e`, `the -> things`.
So I think the issue might be that your example word to use for affix rules does not have the rules you expected it to have.
If I use `item` as the example, then I get your playground code to work as I expect.
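A toy model of "add this word with the same affix flags as that example" shows why the choice of exemplar matters. Everything here is a simplification and not hunspell's API: only two flags are modelled (`M` as possessive `'s`, `S` as plural `s`), and the `item/MS` entry is an assumption made for the example; only `thing/M` is taken from the dictionary as quoted above.

```go
package main

import "fmt"

// dict is a toy dictionary: stem -> affix flags.
type dict map[string]string

// suffixes models just two flags; real affix rules are much richer.
var suffixes = map[rune]string{'M': "'s", 'S': "s"}

// addWithAffix adds word with whatever flags the example word carries.
// (Toy analogue of exemplar-based addition, not the hunspell function.)
func (d dict) addWithAffix(word, example string) {
	d[word] = d[example]
}

// accepts reports whether w is a stem or a stem plus a flagged suffix.
func (d dict) accepts(w string) bool {
	for stem, flags := range d {
		if w == stem {
			return true
		}
		for _, f := range flags {
			if suf, ok := suffixes[f]; ok && w == stem+suf {
				return true
			}
		}
	}
	return false
}

func main() {
	d := dict{"thing": "M", "item": "MS"} // item/MS is assumed
	d.addWithAffix("thang", "thing")
	d.addWithAffix("Conn", "item")
	fmt.Println(d.accepts("thang's"), d.accepts("thangs")) // true false
	fmt.Println(d.accepts("Conn's"), d.accepts("Conns"))   // true true
}
```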
Thanks @philpennock that is very helpful (I should look harder in future). I think this may give us a way forward to add plurals and possessives. Very funny to see that things is not an acceptable word (same in en_AU).
Oh, `things` is acceptable, but not because of `thing`. Looks as though they've tried to ensure a unique derivation for every spelling, which is problematic when you have one spelling for two different semantic meanings from two different roots. I guess it's evidence that existing open source language dictionary stemming rules are not a good choice for seeding an AI semantic model.
But thank you. Over the past couple of weeks I've learnt more than I ever expected to about computerized spelling correction. Eww. I'm glad you're maintaining gospel and not me! 🥇
I think this is all done.