Is there a way to batch-feed the Morfologik spell checker with words and get each word and its suggestions back? The aim is to improve the suggestions, either with REP-like replacements or with plain replacements in the text file.
Maybe your best bet is to use the command-line options -eo -e ID, with ID being the ID of the spell checker rule.
The output below shows that a simple error of one missing letter yields a lot of suggestions. The suggestions are certainly not ordered by frequency.
ruud@TaalTik:/media/ruud/data2/LanguageTool-current$ java -jar languagetool-commandline.jar -l nl -eo -e MORFOLOGIK_RULE_NL_NL --line-by-line
Expected text language: Dutch
Warning: running in line by line mode. Cross-paragraph checks will not work.
Working on STDIN…
fiet
1.) Line 1, column 1, Rule ID: MORFOLOGIK_RULE_NL_NL
Message: Mogelijke spelfout gevonden
Suggestion: bijt; dijt; fit; mijt; wijt; zijt; Eijt; jijt; rijt; bij; dit; dient; feit; fiets; hij; jij; liet; lijkt; lijn; lijst; mij; mijn; niet; pijn; tijd; uit; vijf; wij; ziet; zij; zijn; zit; Eijk; FIE; Fie; Fiji; … and a lot more
The frequencies are:
fiets 211798
zijt 36350
So I would have expected fiets before zijt, not after.
Any idea what this is caused by?
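For batch use, the command-line output can be turned back into per-line suggestion lists. The following is only a rough sketch: the regular expression and the dictionary keys are my own assumptions, derived from nothing more than the output shown above, and mapping a line number back to the input word assumes a word list with one word per line.

```python
import re

# Header line as it appears in the output above, e.g.
# "1.) Line 1, column 1, Rule ID: MORFOLOGIK_RULE_NL_NL"
MATCH_RE = re.compile(r"^\d+\.\) Line (\d+), column (\d+), Rule ID: (\S+)")


def parse_matches(output):
    """Return one dict per reported match, with its suggestions as a list."""
    matches = []
    current = None
    for raw in output.splitlines():
        line = raw.strip()
        header = MATCH_RE.match(line)
        if header:
            current = {
                "line": int(header.group(1)),
                "column": int(header.group(2)),
                "rule_id": header.group(3),
                "suggestions": [],
            }
            matches.append(current)
        elif line.startswith("Suggestion:") and current is not None:
            # Suggestions are separated by semicolons in the output.
            items = line[len("Suggestion:"):].split(";")
            current["suggestions"] = [s.strip() for s in items if s.strip()]
    return matches


sample = """1.) Line 1, column 1, Rule ID: MORFOLOGIK_RULE_NL_NL
Message: Mogelijke spelfout gevonden
Suggestion: bijt; dijt; fit; mijt; wijt; zijt"""

for match in parse_matches(sample):
    print(match["line"], match["rule_id"], match["suggestions"])
```

The order of the suggestions is kept as printed, so the position of a word such as fiets in the list can be compared against its frequency afterwards.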
just speculation:
what if they aren’t frequencies but ‘priorities’?
(highest frequency gets priority 1, second highest gets priority 2 … and so on)
It’s not always reliable, but it might have been done to reduce file size.
(instead of a, let’s say, 1024-bit frequency value, only a, let’s say, 32-bit priority number needs to be stored)
Everything has been used as described on the wiki.
The frequency is only the second sort criterion, I think. The first one is how similar the words are. This can become a bit complicated because the replacement pairs from nl_NL.info are applied first. They might make zijt more similar to fiet than fiets (but I haven’t checked).
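To make the two-criterion idea concrete, here is a small sketch that sorts candidates by similarity first and frequency second. It is not LanguageTool's or Morfologik's actual code: it uses plain Levenshtein distance as a stand-in for the similarity measure, it does not model the replacement pairs at all, and the function names are made up for illustration. The two frequencies are the ones quoted earlier in the thread.

```python
def levenshtein(a, b):
    """Plain edit distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def rank(word, candidates, freq):
    """Sort by distance first; use higher frequency only to break ties."""
    return sorted(candidates,
                  key=lambda c: (levenshtein(word, c), -freq.get(c, 0)))


freq = {"fiets": 211798, "zijt": 36350}
print(rank("fiet", ["zijt", "fiets"], freq))
```

With plain edit distance, fiets is one edit away from fiet and zijt is two, so under this simplified model fiets sorts first regardless of frequency. If zijt still comes out ahead in practice, something else, such as the replacement pairs, must be reducing its effective distance.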
I will remove the replacements and try again.
Without all those replacements, it is better.
But actually, I assumed the alternatives would be presented purely in order of frequency class. Or maybe by a weighted balance between the Levenshtein distance, relative to the word length, and the frequency class…
I will do more tests, but so far it looks like the frequencies are applied too early; maybe that should be the last thing to do.
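The weighted balance mentioned above could look roughly like the sketch below. Everything in it is hypothetical: the function name, the use of a log-scaled raw frequency in place of a frequency class, and the 0.5 weight are assumptions chosen for illustration, not anything LanguageTool does. The edit distances for fiets (1) and zijt (2) are plain Levenshtein distances to fiet.

```python
import math


def combined_score(distance, word_length, freq, max_freq, weight=0.5):
    """Lower is better: relative edit distance minus a frequency bonus."""
    relative_distance = distance / max(word_length, 1)
    # Log scaling keeps very frequent words from dominating completely.
    frequency_bonus = math.log(freq + 1) / math.log(max_freq + 1)
    return weight * relative_distance - (1 - weight) * frequency_bonus


# (edit distance to "fiet", frequency quoted in the thread)
candidates = {"fiets": (1, 211798), "zijt": (2, 36350)}
max_freq = max(f for _, f in candidates.values())

ordered = sorted(candidates,
                 key=lambda w: combined_score(candidates[w][0], len("fiet"),
                                              candidates[w][1], max_freq))
print(ordered)
```

Dividing the distance by the word length makes a single edit weigh more in a short word than in a long one, which is where the short replacements seem to hurt most.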
Some of the replacements are quite short: ij <=> ei, f <=> v, s <=> z. These are completely valid, but their impact is very large in short words.
I am now checking all words, in order of decreasing frequency, with the spell checker without replacements. Then I will check which replacements are actually needed at the top of the frequency list, and make them as long as possible.
I fed all words, in order of descending frequency, to the spell checking rule. My conclusions so far are:
- LT is much better at suggesting than Hunspell is, especially when multiple letters in different parts of the word have changed.
- LT does not do compounding; in compounding languages, the word list needed for the ‘tail’ is enormous.
- The Hunspell REPs are not of much use in the .info file; too many changes lead to less optimal suggestions. Some are needed, though, for suggestions very far from the word.