Best way to apply suggestions/corrections automatically

Hi!

I want to use LanguageTool to correct a large number of product descriptions (fashion online shops). We need to eliminate spelling errors (and sometimes grammar errors as well) for further processing. Of course, this should mostly happen automatically.
After quite some time researching this forum, the API, and the documentation, I’ve realized that LanguageTool is not really good at that (and that’s not the purpose of LT, if I’m right?). So I want to implement this with my own method.

The approach would be:

  1. Create a word list from our product descriptions (all single words with their number of occurrences)
  2. Check new product descriptions
  3. If there are multiple suggestions for a word → look up the suggestions in our own word list and pick the one that appears most often there (see the sketch after this list)
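
For step 3, a minimal sketch of what I have in mind (plain Java; `pickBestSuggestion` is just a made-up helper name, and the frequency map is assumed to be loaded already):

```java
import java.util.List;
import java.util.Map;

class SuggestionPicker {

    // word -> number of occurrences in our own product-description corpus
    private final Map<String, Integer> wordFrequencies;

    SuggestionPicker(Map<String, Integer> wordFrequencies) {
        this.wordFrequencies = wordFrequencies;
    }

    // Return the LT suggestion that occurs most often in our corpus;
    // falls back to LT's first (i.e. best-ranked) suggestion if none is known.
    String pickBestSuggestion(List<String> ltSuggestions) {
        String best = ltSuggestions.get(0);
        int bestCount = wordFrequencies.getOrDefault(best, 0);
        for (String suggestion : ltSuggestions) {
            int count = wordFrequencies.getOrDefault(suggestion, 0);
            if (count > bestCount) {
                best = suggestion;
                bestCount = count;
            }
        }
        return best;
    }
}
```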

Example - Correct sentence:

Das schwarze Hemd ist aus Baumwolle.

Example - Wrong sentence:

Das schwarze Hmd ist aus Baumwolle.

Suggestions by LT:
Amt; Hut; Hemd; Hd; …

Obviously, the suggested word “Amt” is not the correct one.

My point is: What’s the best way to tell LT which word is the right one to choose?

  1. Use the approach I described above (there’s still a risk of choosing a wrong word: what if “Amt” really were the correct word?), and it offers no grammar correction.
  2. Use more n-gram data: I thought that the LT n-gram dataset could recognize the context, but unfortunately that’s not the case, at least not for the example above.
  3. Go through each error, correct it manually, and afterwards write a rule (Java or XML → which is better?) for this particular error.

I hope someone can help me and tell me what’s the best way to approach this issue. Do you think comparing the suggestions with our own word list is a reasonable way to do it?

Thanks in advance and best regards
Paprikamann

That’s what I would try. You can use org.languagetool.dev.bigdata.NGramLookup to compare sequence probabilities (that class is not part of a JAR, you need to check out the code). das schwarze Hemd is (slightly) more probable than das schwarze Amt according to that class. The complete sequence “das schwarze Hemd” has no occurrences in our n-gram data, though, probably because the Google ngram data we use has a minimum occurrence value of 40.

Adding your own ngrams would be even better, if you have enough data and the quality isn’t too bad.
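
For a quick standalone test, something like this should work with the LuceneLanguageModel class (a sketch; the index path is a placeholder, and you should check the signatures against the version of the code you’ve checked out):

```java
import java.io.File;
import java.util.Arrays;

import org.languagetool.languagemodel.LuceneLanguageModel;

public class SequenceCompare {
    public static void main(String[] args) throws Exception {
        // Directory that contains the 1grams/2grams/3grams Lucene indexes:
        LuceneLanguageModel lm = new LuceneLanguageModel(new File("/data/ngrams/de"));
        try {
            long hemd = lm.getCount(Arrays.asList("das", "schwarze", "Hemd"));
            long amt = lm.getCount(Arrays.asList("das", "schwarze", "Amt"));
            System.out.println("das schwarze Hemd: " + hemd);
            System.out.println("das schwarze Amt:  " + amt);
        } finally {
            lm.close();
        }
    }
}
```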


Thanks for the reply.
First, I’ll probably try the easy approach to see what the results look like.

But I guess that adding our own ngrams is really the best solution. Our texts are quite domain-specific (fashion online shops), so feeding LT with our own data would make sense.
I’ve never really worked with ngrams before, and LanguageTool is also new to me.
How do I do this ngram stuff?

  1. Is there a method to (automatically) create ngrams from my data? I’ve seen a CommonCrawlToNgram file. Should I just adjust it to read my own files? Or is there an easier way (maybe other tools)?
  2. The descriptions are not always complete and correct German sentences. Often it’s just a few words (just like you’d describe clothes). Can I use both for all ngrams? Or use the single-word descriptions only for 1-grams and the sentence descriptions for 1- to 3-grams? Or all together?
  3. After creating my ngrams, I run LuceneSimpleIndexCreator, right?
  4. Add the ngrams to LT (e.g. set the path in the standalone version)

I’m sure there’s more to it, right?

Some help is very much appreciated. :)

Regards
Paprikamann

Adapting CommonCrawlToNgram sounds like a good plan.

It should work automatically, i.e. if there’s only one token, that one will be used; if there are three, all of them will be used. But you should probably check whether this really works. Also, you should make sure that these short texts are not concatenated.

That’s only for internal testing, you don’t need to call it.

I don’t think so…


Another question regarding the use of GitHub and the source code. I’m not really familiar with GitHub and importing projects from there.

How do I do that so I can use the dev folder (i.e. the classes from there)? I’m using Eclipse, and it would be great if I could somehow import the project there, right-click CommonCrawlToNgram, and hit “Run”.

Is there a description or tutorial for that? Do I have to use Maven? I’ve already cloned the repo and even imported it into Eclipse, but what’s next? It’s not marked as a Java project.

I don’t know about Eclipse, but have you made sure you’ve not just imported the code as Java code, but the project as a Maven project?

First of all: Thanks! That fixed the problem. I was always trying to import it as a Git project.

I was able to create the ngrams with a test file. In fact, I didn’t even have to adapt CommonCrawlToNgram. It worked, but the index is a bit broken.

Let’s say I have a description line like this:

Auf eine Reise nach Japan können Sie getrost verzichten. Schneller geht´s mit diesem Blouson im Asia-Look, der Ihrem Outfit das gewisse Etwas verleiht. Sehr schön ist das bunte Blumen-Motiv im Vorder- und Rückteil. Hervorragende Ausstattung: mit kleinem geripptem Stehkragen (auch Baseball-Kragen genannt), Bündchen am Ärmelsaum und Saum und seitlichen Eingriffstaschen. Aus pflegeleichtem Material mit Innenfutter. Stylingtipp: Tragen Sie dazu ein schwarzes Shirt und Jeans. Passform: Gerade Produkt-Typ: BlazerjackeBlouson Futter :100% Polyester Material : 95% Polyester 5% Elasthan Rippe : 95% Polyester 5% ElasthanPflegehinweise: Maschinenwäsche

What’s the best way to pre-process this before starting the ngram creation? I should probably put each sentence on a new line, and each “short text” (e.g. Passform: Gerade), too? Because otherwise I would get Passform: Gerade Produkt as a 3-gram, although that’s not the right context. I’d rather have it as a 1-gram only.
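
To illustrate, roughly what I’d try for the splitting (a sketch; I’m assuming Language.getSentenceTokenizer() is the right entry point, and key/value snippets like Passform: Gerade would still need extra, domain-specific splitting that isn’t shown here):

```java
import java.util.List;

import org.languagetool.language.GermanyGerman;
import org.languagetool.tokenizers.SentenceTokenizer;

public class DescriptionSplitter {
    public static void main(String[] args) {
        SentenceTokenizer sentenceTokenizer = new GermanyGerman().getSentenceTokenizer();
        String description = "Auf eine Reise nach Japan können Sie getrost verzichten. "
                + "Schneller geht's mit diesem Blouson im Asia-Look. Passform: Gerade";
        // One sentence per line, so no ngram spans a sentence boundary:
        List<String> sentences = sentenceTokenizer.tokenize(description);
        for (String sentence : sentences) {
            System.out.println(sentence.trim());
        }
    }
}
```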

And regarding the use in LanguageTool: the rule to use for my custom ngrams is NgramProbabilityRule, right? The API description says:

LanguageTool’s probability check that uses ngram lookups to decide if an ngram of the input text is so rare in our ngram index that it should be considered an error.

This is only for detecting errors? What about suggested corrections? That’s what’s important and interesting to me. The error detection of vanilla LanguageTool is fine; I want to decide which suggestion is the right one based on ngrams (or any other method).

In the end it’s all about this: das schwarze Hmd should be (automatically) corrected to das schwarze Hemd and NOT to das schwarze Amt.

In that case you won’t need NgramProbabilityRule but will need to adapt the sorting. I think the right place is GermanSpellerRule.sortSuggestionByQuality(). That code will need to use the ngram data, so you’ll need to get it to that place without re-initializing Lucene etc. every time. You’ll probably need a static object for that.

NgramProbabilityRule was an idea to find improbable ngram sequences without using confusion pairs. It’s not enabled by default due to the false alarms it creates.

Hmm, OK, I think I got it. I see sortSuggestionByQuality is a protected method, so the source code is needed here again.

I guess I need to compare each suggestion as an ngram. So if I have ten suggestions, I have to compare the probabilities of the ngrams built from each suggestion, and the one with the highest probability is (hopefully) the correct solution. :)
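
In code, roughly this (a sketch; bestSuggestion is my own helper name, I’m assuming getPseudoProbability as the scoring method, and the left/right context words would come from the analyzed sentence):

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

import org.languagetool.languagemodel.LuceneLanguageModel;

class NgramReranker {

    private final LuceneLanguageModel lm;

    NgramReranker(LuceneLanguageModel lm) {
        this.lm = lm;
    }

    // Score each suggestion inside its trigram context (word before, suggestion,
    // word after) and return the one with the highest pseudo-probability.
    String bestSuggestion(String leftWord, String rightWord, List<String> suggestions) {
        return suggestions.stream()
                .max(Comparator.comparingDouble((String s) ->
                        lm.getPseudoProbability(Arrays.asList(leftWord, s, rightWord)).getProb()))
                .orElseThrow(IllegalStateException::new);
    }
}
```

With the example from above, bestSuggestion("schwarze", "ist", Arrays.asList("Amt", "Hut", "Hemd", "Hd")) should then (hopefully) return Hemd.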

But just out of curiosity: What are the suggestions when using ngram data for the “normal” error detection? Are there suggestions at all, or only if you specify them in confusion_sets.txt?

For NgramProbabilityRule (which is not active by default), see languagetool-core/src/main/java/org/languagetool/rules/ngrams/NgramProbabilityRule.java in the languagetool-org/languagetool repository on GitHub: it tries different part-of-speech tags and uses the more probable one as a suggestion.

For the ngram approach that is actually live on languagetool.org (GermanConfusionProbabilityRule etc.): this is just pairs, so the other member of the pair is used as a suggestion if it’s more probable.

Hi. Thanks for your help so far. I have another question. I’m trying to adapt the sorting in GermanSpellerRule.sortSuggestionByQuality() right now.
In order to do that, I need the ngram(s) of the misspelled word. To create that, I need the whole sentence and the words before and after the misspelling. My idea was to use the AnalyzedTokenReadings here.

What’s the best way to do that? I could create a method and call it from my own Java file. But maybe it’s possible to adapt the source code of LT itself? Would I have to change all the code from the beginning (the check method)?

If you don’t want to sort the result in your own code (outside LT), you’d need to change the LT source so the method gets not only the word, but also its context. It seems all methods starting from match() would need to be modified. Also, the spelling rule currently doesn’t know about ngram data. It would need to be initialized in German.getRelevantLanguageModelRules() instead, so it knows about the ngram data.

I’ve decided to use my own code to do the sorting. In the end it’s a bit easier to adapt.

I’m running into an error right now. Maybe you can give me a hint?

Using NGramLookup works perfectly, but when I call getCount in my method, it throws this error. The ngram indexes are the same.

This sounds more like a dependency issue. Does your code use the same version of Lucene as LT does? (see pom.xml)

Yep. I had used Lucene 7.2 to build my index. Switched it back to 5.5(?). Works now.

Next one … :D

Now it’s about the probabilities. Somehow I have to combine the probabilities of my unigrams, bigrams, and trigrams. Do I just multiply them?

Here’s a sample sentence to be corrected. In fact, there’s only one error: geschmedige. But LanguageTool marks Cain as an error, too.

Das geschmedige Shirt von Marc Cain Sports zeigt einen lässig-sportiven Style, der gleichzeitig einen hohen Komfort sicherstellt.

In my processing I get the following suggestions (one per ngram order):
[geschmeidige] [Das, geschmeidig] [geschmedige, Shirt, von]
As you can see, they are all different.
The problem is that Das geschmeidige and geschmeidige Shirt von (as well as Das geschmeidige Shirt) are very rare in the corpus. I haven’t added the rest of my available texts to the corpus yet, but this should still work here.

The word geschmeidige occurs very often in the texts. The idea is that its probability should “fix” the other probabilities (bigram, trigram) to get the right solution.

How would I do that? Do you already have a solution for that in LT?

You’re using getCount()? Then maybe using getPseudoProbability() will help, as that considers the probabilities of all 1- to 3-grams (see the source code).

I’m already using getPseudoProbability. The problem is that the probabilities of other trigrams are equal to or higher than that of the “correct” trigram.

Below are the full examples:

[geschmedige] -> count:1, 3.331367826315807E-6, log:-12.612127579656768
[geschmeidige] -> count:34, 5.8298936960526624E-5, log:-9.749926698727299
[geschweige] -> count:0, 1.6656839131579034E-6, log:-13.305274760216713

Unigram result = geschmeidige

[Das, geschmedige] -> count:1, 3.3313678263158073E-6, log:-12.612127579656768
[geschmedige, Shirt] -> count:1, 3.331367826315807E-6, log:-12.612127579656768
[Das, geschmeidige] -> count:1, 3.3313678263158073E-6, log:-12.612127579656768
[geschmeidige, Shirt] -> count:0, 1.6656839131579034E-6, log:-13.305274760216713
[Das, geschweige] -> count:0, 1.6656839131579036E-6, log:-13.305274760216713
[geschweige, Shirt] -> count:0, 1.6656839131579034E-6, log:-13.305274760216713
...
[Das, geschmeidig] -> count:1, 3.3313678263158073E-6, log:-12.612127579656768
[geschmeidig, Shirt] -> count:0, 1.6656839131579034E-6, log:-13.305274760216713

Bigram result = Das geschmeidig

[Das, geschmedige, Shirt] -> count:1, 4.745538214125082E-9, log:-19.166060983682577
[geschmedige, Shirt, von] -> count:1, 3.331367826315807E-6, log:-12.612127579656768
[., Das, geschmeidige] -> count:0, 1.4916127099112595E-10, log:-22.625993039662504
[Das, geschmeidige, Shirt] -> count:0, 2.372769107062541E-9, log:-19.859208164242524
[geschmeidige, Shirt, von] -> count:0, 4.759096894736867E-8, log:-16.860622821706126
[., Das, geschweige] -> count:0, 1.4916127099112595E-10, log:-22.625993039662504
[Das, geschweige, Shirt] -> count:0, 1.1863845535312704E-9, log:-20.552355344802468
[geschweige, Shirt, von] -> count:0, 1.6656839131579034E-6, log:-13.305274760216713

Trigram result = geschmedige Shirt von

I know the dataset is quite small, but this is still an issue that could always occur. The unigram suggestion is correct, the bigram suggestion is almost correct, but the trigram suggestion is not correct at all.

So far, I store the ngrams with their probabilities in a HashMap and take the key with the highest value (that’s the solution). Since a lot of the probabilities are equal, I could try to return a key only if it matches the unigram result.
E.g., instead of retrieving Das geschmeidig as a bigram suggestion, I would search for a bigram that contains geschmeidige (and has a high probability).
But that’s still error-prone …
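
Maybe the cleanest version of that idea is the standard linear interpolation from ngram language modeling, where the unigram gets the largest weight (the weights below are just guesses and must sum to 1):

```java
class InterpolatedProbability {
    // P(w) = lambda1 * P_unigram + lambda2 * P_bigram + lambda3 * P_trigram
    // The weights are tuning parameters; giving the unigram the largest weight
    // lets a frequent word like "geschmeidige" outvote sparse 2/3-gram counts.
    static double interpolate(double unigramProb, double bigramProb, double trigramProb) {
        final double lambda1 = 0.5, lambda2 = 0.3, lambda3 = 0.2;
        return lambda1 * unigramProb + lambda2 * bigramProb + lambda3 * trigramProb;
    }
}
```

That way a frequent unigram can outweigh 2-gram/3-gram probabilities that are all close to zero anyway.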

What about setting a threshold, so that ngrams with a low occurrence count are ignored in the probability calculation? With threshold = 1, the 2-gram and 3-gram would be ignored in this case.
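
Roughly like this (a sketch on top of the raw getCount lookup):

```java
import java.util.List;

import org.languagetool.languagemodel.LuceneLanguageModel;

class NgramFilter {
    // Skip ngrams whose raw occurrence count is at or below the threshold, so
    // that near-zero 2gram/3gram counts cannot override a solid unigram signal.
    static boolean isReliable(LuceneLanguageModel lm, List<String> ngram, long minCount) {
        return lm.getCount(ngram) > minCount;
    }
}
```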

Sorry for the late reply. I will definitely add this minimum threshold together with some other criteria.

And again, there’s the next obstacle. It’s about suggestions like Baumwoll-Denim. The suggestion is one word, i.e. a unigram, but in my ngram indexes it’s a trigram (the Google tokenizer splits everything) and therefore contains whitespace.
Such suggestions are added to the unigrams, while the bigrams and trigrams stay empty, so the lookup doesn’t work, of course.

I’m not sure what the best way to solve this is:

  • modify the LT source code and prevent LT from suggesting such words
  • try to split such suggestions: I would have to iterate over all suggestions and tokenize them
  • modify the ngram index tokenization so that Baumwoll-Denim is not split → this word would then be a unigram
  • ??

Another point is that I have to pay attention to performance. In the production system there could be over a million lines to check …

/EDIT: I’ve decided to recreate the ngrams using WordTokenizer instead of GoogleStyleWordTokenizer. Since LT itself doesn’t use GoogleStyleWordTokenizer internally, it would be a mess to reconcile the two tokenizations.
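
For reference, this is roughly how I check the new tokenization (a sketch; I’m assuming WordTokenizer keeps hyphenated compounds together and returns whitespace as separate tokens, which I filter out here):

```java
import org.languagetool.tokenizers.WordTokenizer;

public class TokenizerCheck {
    public static void main(String[] args) {
        WordTokenizer tokenizer = new WordTokenizer();
        // Expected: "Blouson", "aus", "Baumwoll-Denim" - the hyphenated compound
        // stays one token, so it can be looked up as a unigram.
        for (String token : tokenizer.tokenize("Blouson aus Baumwoll-Denim")) {
            if (!token.trim().isEmpty()) {
                System.out.println(token);
            }
        }
    }
}
```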