Spellchecker improvement discussion

There’s one thing that confuses me.
The stored pair <uncorrected sentence, corrected sentence> may be impossible to reproduce with the current version of LT (e.g. the pair was logged with an old version of LT, the rules have changed since then, and the current version suggests different replacements). How likely do you think that is?
Since I cannot explore the private stored data myself, I could provide a command-line tool to check whether this problem exists. What do you think?

I’m not sure I understand the problem: we have the original sentence and a corrected sentence (plus some metadata like the rule id), so why would it be necessary to reproduce the correction? Is it that you don’t want to use “old” suggestions because LT might have become better by now?

Given a sentence and its correction, I’d like to get all the suggestions that LT offers. This is the simplest way to obtain data in the following format:

typo features suggestion was selected by user
siter sites false
siter sister true
siter sizer false

It’s handy to have the data in this form when training the model.

To get this data I want to send the uncorrected sentence to LT, collect all the suggested replacements, and mark which of them the user selected and which they didn’t. But I think some suggestion mechanisms have changed, and that can affect this workflow. It would be interesting to find out how big that problem is.
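The labelling step described above could look roughly like the sketch below. It is only an illustration: `SuggestionLabeler`, `LabeledSuggestion`, `label` and `isReproducible` are made-up names, and the suggestion list is passed in as plain strings instead of being fetched from LT.

```java
import java.util.ArrayList;
import java.util.List;

final class SuggestionLabeler {

    /** One training row: the typo, one suggested replacement, and whether the user picked it. */
    static final class LabeledSuggestion {
        final String typo;
        final String suggestion;
        final boolean selectedByUser;

        LabeledSuggestion(String typo, String suggestion, boolean selectedByUser) {
            this.typo = typo;
            this.suggestion = suggestion;
            this.selectedByUser = selectedByUser;
        }
    }

    /**
     * Marks every suggestion as selected or not, based on the replacement the user chose.
     * Assumption: the suggestions come from re-running the checker on the uncorrected sentence.
     */
    static List<LabeledSuggestion> label(String typo, List<String> suggestions, String chosen) {
        List<LabeledSuggestion> rows = new ArrayList<>();
        for (String suggestion : suggestions) {
            rows.add(new LabeledSuggestion(typo, suggestion, suggestion.equals(chosen)));
        }
        return rows;
    }

    /** True if the logged replacement still appears among the current suggestions. */
    static boolean isReproducible(List<LabeledSuggestion> rows) {
        return rows.stream().anyMatch(row -> row.selectedByUser);
    }

    public static void main(String[] args) {
        List<LabeledSuggestion> rows =
                label("siter", List.of("sites", "sister", "sizer"), "sister");
        for (LabeledSuggestion row : rows) {
            System.out.println(row.typo + " " + row.suggestion + " " + row.selectedByUser);
        }
    }
}
```

Counting the groups for which `isReproducible` returns false would give a rough measure of how often the current suggestions no longer contain the logged replacement.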

I see. It’s hard to tell how much the suggestions change. Probably not that much. Here’s how we log data now (leaving out the sentence here):

+----------------+-----------------+----------------+
| suggestion_pos | covered         | replacement    |
+----------------+-----------------+----------------+
|              0 | rhe             | the            |
|              0 | womens          | women          |
|              0 | frustated       | frustrated     |
|              2 | litteraly       | literally      |
+----------------+-----------------+----------------+

suggestion_pos is the position of the selected suggestion in the list. The value 99 is a special case: it means the user didn’t use one of the suggestions but typed in their own replacement.
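For anyone consuming these rows, decoding suggestion_pos might look like the sketch below. It assumes positions are zero-based (the sample rows with 0 and 2 suggest this, but it is not confirmed here), and `SuggestionPos` and `selected` are hypothetical names. The suggestion list in `main` is invented for the demo.

```java
import java.util.List;
import java.util.Optional;

final class SuggestionPos {

    /** Sentinel described in the thread: the user typed their own replacement. */
    static final int CUSTOM_REPLACEMENT = 99;

    /**
     * Returns the suggestion the user picked, or empty if they typed their own
     * replacement or the logged position does not exist in the given list.
     * Assumption: positions are zero-based.
     */
    static Optional<String> selected(List<String> suggestions, int suggestionPos) {
        if (suggestionPos == CUSTOM_REPLACEMENT
                || suggestionPos < 0
                || suggestionPos >= suggestions.size()) {
            return Optional.empty();
        }
        return Optional.of(suggestions.get(suggestionPos));
    }

    public static void main(String[] args) {
        // Invented suggestion list for "litteraly", for illustration only.
        List<String> suggestions = List.of("literary", "literal", "literally");
        System.out.println(selected(suggestions, 2));
        System.out.println(selected(suggestions, CUSTOM_REPLACEMENT));
    }
}
```

Checking the sentinel before the bounds keeps 99 from ever being read as a real index, even for a list that happens to have a hundred entries.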

So, in the end, can I count on the following data format? original sentence | corrected sentence | suggestion position | covered | replacement. Or is something in this list missing from the logged data, or the other way round?

Yes, you can assume that. There’s more metadata in there, but I don’t think it can be used now.

I’m generating prototype test data and prototyping a tool for comparing models, so any kind of information about the data format is welcome :slight_smile:
I’m not asking for the data, just for the format of that data.

I’ve created a tool to extract the features I’d like to start with. @dnaber Could you please run it?
P.S. Tested only on the data I have.

I get this error (I’ve added output about the sentence):

sentence  : En Venezuela es común que los hijos se independicen hasta que se casan.
correction: En Venezuela es común que los hijos se independicen hasta que se casein.
java.lang.StringIndexOutOfBoundsException: String index out of range: 72
	at java.lang.String.substring(String.java:1963)
	at io.github.oserikov.languagetool.Utils.startOfErrorString(Utils.java:44)
	at io.github.oserikov.languagetool.Main.processRow(Main.java:187)
	at io.github.oserikov.languagetool.Main.processDBData(Main.java:104)
	at io.github.oserikov.languagetool.Main.main(Main.java:69)

Aww, my bad. Will fix in a couple of minutes.

Should be fixed now. I’ve pushed a fix and published a release.
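For readers following along: in the failing example the original sentence has 71 characters and the correction has 72, so an index of 72 fits the corrected sentence but not the original. Whether that is what actually went wrong in `Utils.startOfErrorString` is not stated in the thread. The sketch below is not the real fix, just one bounds-safe way to locate where two sentences start to differ. `DiffBounds`, `commonPrefixLength` and `textBeforeError` are hypothetical names.

```java
final class DiffBounds {

    /** Length of the longest common prefix; never reads past the shorter string. */
    static int commonPrefixLength(String a, String b) {
        int limit = Math.min(a.length(), b.length());
        int i = 0;
        while (i < limit && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }

    /** Text of the original before the first differing character. */
    static String textBeforeError(String original, String corrected) {
        // The end index is at most original.length(), so substring cannot overflow.
        return original.substring(0, commonPrefixLength(original, corrected));
    }

    public static void main(String[] args) {
        // \u00fa is the accented u in "comun", escaped to avoid source-encoding issues.
        String original =
                "En Venezuela es com\u00fan que los hijos se independicen hasta que se casan.";
        String corrected =
                "En Venezuela es com\u00fan que los hijos se independicen hasta que se casein.";
        System.out.println(original.length() + " vs " + corrected.length());
        System.out.println(textBeforeError(original, corrected));
    }
}
```

Clamping every index to the length of the string being sliced is the main point; the same idea applies if the offsets come from the log rather than from a diff.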

@daniel,
Could you please run the following query

SELECT COUNT(*) FROM corrections WHERE language = 'ru-RU' AND rule_id = 'MORFOLOGIK_RULE_RU_RU'

on the logs database?

The language code is just ru (not ru-RU), but then I get: 934729.

Ok, thank you!

Could you please run the updated feature extractor? I’ve added extraction of the suggestion position (I forgot to do that earlier).

Done, sent the result via private message.

@dnaber Could you please run the updated feature extractor?

The features seem to be shuffled, but to order the corrections by the model’s score it’s handy to be able to group all the suggestions for the same sentence together, so I’ve added an id column: a hash value unique to each group of corrections.
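A minimal sketch of how such an id column could be used downstream: collect the shuffled rows by id, then sort each group by descending model score. The `ScoredRow` type, its fields and `rankWithinGroups` are assumptions for illustration and do not describe the extractor’s actual output.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SuggestionGrouping {

    /** One scored feature row; groupId stands in for the hash in the id column. */
    static final class ScoredRow {
        final String groupId;
        final String suggestion;
        final double score;

        ScoredRow(String groupId, String suggestion, double score) {
            this.groupId = groupId;
            this.suggestion = suggestion;
            this.score = score;
        }
    }

    /** Groups shuffled rows by id and sorts each group by descending model score. */
    static Map<String, List<ScoredRow>> rankWithinGroups(List<ScoredRow> rows) {
        // LinkedHashMap keeps groups in order of first appearance.
        Map<String, List<ScoredRow>> groups = new LinkedHashMap<>();
        for (ScoredRow row : rows) {
            groups.computeIfAbsent(row.groupId, key -> new ArrayList<>()).add(row);
        }
        for (List<ScoredRow> group : groups.values()) {
            group.sort(Comparator.comparingDouble((ScoredRow r) -> r.score).reversed());
        }
        return groups;
    }

    public static void main(String[] args) {
        // Group ids and scores are invented for the demo.
        List<ScoredRow> rows = List.of(
                new ScoredRow("a1", "sites", 0.2),
                new ScoredRow("b7", "the", 0.9),
                new ScoredRow("a1", "sister", 0.7),
                new ScoredRow("a1", "sizer", 0.1));
        rankWithinGroups(rows).forEach((id, group) -> {
            for (ScoredRow row : group) {
                System.out.println(id + " " + row.suggestion + " " + row.score);
            }
        });
    }
}
```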

I’m looking for a way to programmatically map each MORFOLOGIK_RULE_%_% rule id to an org.languagetool.language.Language subclass. Has anyone done that before?
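One possible starting point, sketched with the standard library only: derive a language code such as ru-RU from the rule id by pattern matching, and then look that code up in LT. The id shape `MORFOLOGIK_RULE_<LANG>[_<COUNTRY>]` is an assumption based on the ids mentioned in this thread, spellers whose ids are named differently would not match, and `RuleIdToLanguageCode` and `languageCode` are hypothetical names.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RuleIdToLanguageCode {

    // Assumed id shape: MORFOLOGIK_RULE_<LANG>[_<COUNTRY>], e.g. MORFOLOGIK_RULE_RU_RU.
    private static final Pattern RULE_ID =
            Pattern.compile("MORFOLOGIK_RULE_([A-Z]{2,3})(?:_([A-Z]{2}))?");

    /** Returns a code like "ru-RU" or "ru", or empty if the id has a different shape. */
    static Optional<String> languageCode(String ruleId) {
        Matcher matcher = RULE_ID.matcher(ruleId);
        if (!matcher.matches()) {
            return Optional.empty();
        }
        String language = matcher.group(1).toLowerCase(Locale.ROOT);
        String country = matcher.group(2); // null when the id has no country part
        return Optional.of(country == null ? language : language + "-" + country);
    }

    public static void main(String[] args) {
        System.out.println(languageCode("MORFOLOGIK_RULE_RU_RU"));
        System.out.println(languageCode("SOME_OTHER_RULE"));
    }
}
```

The resulting code could then be passed to a lookup on LT’s side; `Languages.getLanguageForShortCode` looks like a candidate, but that should be checked against the LT version in use. Note that the code derived this way may differ from the language column in the logs, which stores just `ru` rather than `ru-RU`.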

@dnaber, could you please run SELECT DISTINCT language, rule_id FROM corrections?

Result sent via private message.