Spellchecker improvement discussion

oserikov · April 1, 2018, 1:17pm

Something is confusing me.
A stored pair <uncorrected sentence, corrected sentence> may be impossible to reproduce with the current version of LT (e.g. the pair was logged with an older version, the rules have changed since, and the current version suggests different replacements). How likely do you think that is?
Since I cannot explore the private stored data, I could provide a command-line tool to check whether this problem exists. What do you think?

dnaber · April 1, 2018, 1:43pm

I’m not sure I understand the problem: We have the original sentence and a corrected sentence (plus some meta data like rule id), why would it be necessary to reproduce the correction? Don’t you want to use “old” suggestions because LT might have become better by now?

oserikov · April 1, 2018, 1:54pm

Given a sentence and its correction, I’d like to get all the suggestions LT gives for it. This is the simplest way to get data in the following format:

typo	features	suggestion	was selected by user
siter	…	sites	false
siter	…	sister	true
siter	…	sizer	false

It’s handy to have the data in this form when training the model.

To get this data I want to send the bad sentence to LT, receive all the suggested replacements, and mark which ones the user selected and which they didn’t. But I think some suggestion mechanisms have changed, and that can affect this workflow. It would be interesting to measure the scale of that problem.
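
The labeling step could be sketched roughly as follows. This is only an illustration in Python: `label_suggestions` and the row layout are hypothetical, and the suggestion list is hard-coded where a real run would take it from LT's response.

```python
from typing import List, Tuple


def label_suggestions(
    typo: str, suggestions: List[str], selected: str
) -> List[Tuple[str, str, bool]]:
    """Return one (typo, suggestion, was_selected) row per suggestion."""
    return [(typo, s, s == selected) for s in suggestions]


# Hard-coded stand-in for the replacements LT would return for "siter".
rows = label_suggestions("siter", ["sites", "sister", "sizer"], "sister")
for row in rows:
    print(*row, sep="\t")
```

If the logged replacement is no longer in the list that the current LT version returns, no row ends up marked as selected, which is exactly the mismatch in question.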

dnaber · April 1, 2018, 2:19pm

I see. It’s hard to tell how much the suggestions change. Probably not that much. Here’s how we log data now (leaving out the sentence here):

+----------------+-----------------+----------------+
| suggestion_pos | covered         | replacement    |
+----------------+-----------------+----------------+
|              0 | rhe             | the            |
|              0 | womens          | women          |
|              0 | frustated       | frustrated     |
|              2 | litteraly       | literally      |

suggestion_pos is the position of the selected suggestion in the list. It has a special case of 99 for those cases where the user doesn’t use one of the suggestions, but types in their own.
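
A small Python sketch of how a consumer of the logs might decode this column. It assumes positions are zero-based, as the 0 values above suggest; `CUSTOM_INPUT_POS` and `selected_suggestion` are names made up for illustration and are not part of LT's logging code.

```python
from typing import List, Optional

CUSTOM_INPUT_POS = 99  # special value: the user typed their own replacement


def selected_suggestion(suggestions: List[str], suggestion_pos: int) -> Optional[str]:
    """Return the suggestion the user picked, or None if they typed their own
    text or the logged position does not exist in the given list."""
    if suggestion_pos == CUSTOM_INPUT_POS:
        return None
    if 0 <= suggestion_pos < len(suggestions):
        return suggestions[suggestion_pos]
    return None
```

Returning None for an out-of-range position also covers the case where a newer LT version produces a shorter suggestion list than the one the user originally saw.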

oserikov · April 1, 2018, 5:56pm

So, can I count on the following data format? original sentence | corrected sentence | suggestion position | covered | replacement. Or is something in this list missing from the logged data, or vice versa?

dnaber · April 1, 2018, 6:05pm

Yes, you can assume that. There’s more metadata in there, but I don’t think it can be used now.

oserikov · April 1, 2018, 6:09pm

I’m generating prototype test data and prototyping a model comparison tool, so any information about the data format is welcome.
I’m not asking for the data, just for its format.

oserikov · May 10, 2018, 3:54pm

Created a tool to extract the features I’d like to start with. @dnaber Could you please run it?
P.S. Tested only on the data I have.

dnaber · May 10, 2018, 6:10pm

I get this error (I’ve added output about the sentence):

sentence  : En Venezuela es común que los hijos se independicen hasta que se casan.
correction: En Venezuela es común que los hijos se independicen hasta que se casein.
java.lang.StringIndexOutOfBoundsException: String index out of range: 72
	at java.lang.String.substring(String.java:1963)
	at io.github.oserikov.languagetool.Utils.startOfErrorString(Utils.java:44)
	at io.github.oserikov.languagetool.Main.processRow(Main.java:187)
	at io.github.oserikov.languagetool.Main.processDBData(Main.java:104)
	at io.github.oserikov.languagetool.Main.main(Main.java:69)
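
Going only by the trace and the two strings: the sentence has 71 characters and the correction has 72, so one possible cause is an index computed on the longer string being used to slice the shorter one. That is a guess, not a reading of `Utils.startOfErrorString`. A length-safe way to locate the changed span, sketched in Python with a hypothetical `changed_span` helper:

```python
from typing import Tuple


def changed_span(original: str, corrected: str) -> Tuple[int, int, int]:
    """Return (start, end_in_original, end_in_corrected) of the differing region.

    Both end indices are exclusive.
    """
    limit = min(len(original), len(corrected))
    # Length of the common prefix, never reading past the shorter string.
    start = 0
    while start < limit and original[start] == corrected[start]:
        start += 1
    # Length of the common suffix, without crossing into the common prefix.
    suffix = 0
    while (suffix < limit - start
           and original[len(original) - 1 - suffix]
           == corrected[len(corrected) - 1 - suffix]):
        suffix += 1
    return start, len(original) - suffix, len(corrected) - suffix


sentence = "En Venezuela es común que los hijos se independicen hasta que se casan."
correction = "En Venezuela es común que los hijos se independicen hasta que se casein."
start, end_orig, end_corr = changed_span(sentence, correction)
print(sentence[start:end_orig], "->", correction[start:end_corr])
```

Every index is bounded by the length of the string it is applied to, so sentences and corrections of different lengths are handled without special cases.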

oserikov · May 10, 2018, 6:24pm

Aww, my bad. Will fix in a couple of minutes.

oserikov · May 10, 2018, 7:07pm

Should be fixed now. I pushed a fix and published a release.

oserikov · May 15, 2018, 2:30pm

@daniel,
Could you, please, run the following query

SELECT COUNT(*) FROM corrections WHERE language = 'ru-RU' AND rule_id = 'MORFOLOGIK_RULE_RU_RU'

on the logs database?

dnaber · May 15, 2018, 2:40pm

The language code is just ru (not ru-RU), but then I get: 934729.
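
The adjusted query can be tried against a toy table. The sketch below uses Python's built-in `sqlite3` with an in-memory database; only the `corrections` table name and the `language` and `rule_id` columns come from this thread. The rows are invented, and the real logs database may use a different engine and schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE corrections (language TEXT, rule_id TEXT)")
conn.executemany(
    "INSERT INTO corrections VALUES (?, ?)",
    [
        ("ru", "MORFOLOGIK_RULE_RU_RU"),
        ("ru", "MORFOLOGIK_RULE_RU_RU"),
        ("ru", "SOME_OTHER_RULE"),  # made-up rule id
        ("en", "MORFOLOGIK_RULE_EN_US"),
    ],
)

query = "SELECT COUNT(*) FROM corrections WHERE language = ? AND rule_id = ?"
# Short language code, as stored in the logs.
count_short = conn.execute(query, ("ru", "MORFOLOGIK_RULE_RU_RU")).fetchone()[0]
# Long code, as in the original query: matches nothing in this toy data.
count_long = conn.execute(query, ("ru-RU", "MORFOLOGIK_RULE_RU_RU")).fetchone()[0]
print(count_short, count_long)
```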

oserikov · May 15, 2018, 6:20pm

Ok, thank you!

oserikov · May 17, 2018, 11:59pm

Could you, please, run the updated features extractor? I’ve added the suggestion position extraction (forgot to do that earlier).

dnaber · May 18, 2018, 12:54pm

Done, sent the result via private message.

oserikov · May 21, 2018, 1:53pm

@dnaber Could you, please, run the updated features extractor?

The features seem to be shuffled, but to order the corrections by the model’s score it’s handy to be able to group all the suggestions for the same sentence together, so I’ve added an id column: a hash value unique to each group of corrections.
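
Roughly, the id is meant to support something like the following. This is a Python sketch, not the extractor's actual code: the SHA-1 choice, the `group_id` and `rank_groups` helpers, the example sentence, and the score values are all assumptions for illustration.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Tuple


def group_id(sentence: str, covered: str) -> str:
    """Stable id shared by all suggestions for the same typo in the same sentence."""
    key = f"{sentence}\x00{covered}".encode("utf-8")
    return hashlib.sha1(key).hexdigest()


def rank_groups(rows: List[Tuple[str, str, float]]) -> Dict[str, List[str]]:
    """rows are (id, suggestion, model_score); return suggestions per id, best first."""
    groups: Dict[str, List[Tuple[float, str]]] = defaultdict(list)
    for gid, suggestion, score in rows:
        groups[gid].append((score, suggestion))
    return {
        gid: [s for _, s in sorted(items, key=lambda item: item[0], reverse=True)]
        for gid, items in groups.items()
    }


gid = group_id("My siter is here.", "siter")
shuffled = [(gid, "sites", 0.2), (gid, "sister", 0.7), (gid, "sizer", 0.1)]
ranked = rank_groups(shuffled)
print(ranked[gid])
```

Because the id depends only on the inputs, rows can be shuffled freely and still be regrouped afterwards. If the same typo can occur twice in one sentence, the key would also need the error offset.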

oserikov · June 1, 2018, 5:47am

I’m looking for a way to programmatically map each MORFOLOGIK_RULE_%_% rule id to an org.languagetool.language.Language subclass. Has anyone done that before?
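
To frame the question: one naive approach is to parse a language code out of the rule id itself. The Python sketch below assumes the ids follow the pattern MORFOLOGIK_RULE_<LANG> or MORFOLOGIK_RULE_<LANG>_<COUNTRY>, which has not been verified for every language; `rule_id_to_lang_code` is a hypothetical helper.

```python
import re
from typing import Optional

# Assumed id shape: MORFOLOGIK_RULE_<LANG> with an optional _<COUNTRY> suffix.
_RULE_ID = re.compile(r"^MORFOLOGIK_RULE_([A-Z]{2,3})(?:_([A-Z]{2}))?$")


def rule_id_to_lang_code(rule_id: str) -> Optional[str]:
    """Map e.g. 'MORFOLOGIK_RULE_RU_RU' to 'ru-RU'; return None if the id
    does not match the assumed pattern."""
    m = _RULE_ID.match(rule_id)
    if m is None:
        return None
    lang, country = m.group(1).lower(), m.group(2)
    return f"{lang}-{country}" if country else lang
```

The resulting code would still have to be resolved to a Language subclass on the Java side, and any id that does not follow the pattern would need a manual mapping.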

oserikov · June 1, 2018, 11:59am

@dnaber, could you please run SELECT DISTINCT language, rule_id FROM corrections on the logs database?

dnaber · June 1, 2018, 1:23pm

Result sent via private message.