Best way to apply suggestions/corrections automatically

Paprikamann · February 26, 2018, 1:20pm

A problem regarding Maven:
How do I properly add the JARs to my Maven project if the JARs are built from source code?

I’ve added some words to spelling.txt and want to include the result in my (Maven) project. So far I have used the regular Maven dependencies of LT, but now I would have to build it from source code.

I’ve then tried to add language-de-4.1-SNAPSHOT.jar (language-modules/de) as an external library. Didn’t work. I’ve also tried to use:

mvn install:install-file -Dfile=C:\Users.…\LT_src\languagetool-language-modules\de\target\language-de-4.1-SNAPSHOT.jar -DgroupId=org.languagetool -DartifactId=language-de -Dversion=4.1-SNAPSHOT -Dpackaging=jar

Didn’t work either. For example, I get an error that .../de_DE.dict was not found on the classpath.

What am I missing here?

dnaber · February 26, 2018, 5:14pm

If you just call mvn install in your LT sources (mvn install -DskipTests is faster because it skips the tests), the SNAPSHOT JARs should also end up in your local repo and can be referenced from other projects.

Paprikamann · March 15, 2018, 1:53pm

What’s the difference between getPseudoProbability (BaseLanguageModel) and the probability calculation in ConfusionProbabilityRule, specifically get3gramProbabilityFor (Code link)?

I’m using the simple getPseudoProbability. I’m calling it on my unigram, bigram and trigram and afterwards I compare the probabilities of them.

I’m not sure if get3gramProbabilityFor would be the better solution? If I used that, I wouldn’t have to process unigrams and trigrams separately?

dnaber · March 16, 2018, 8:22am

get3gramProbabilityFor considers the context so that the relevant word’s context is used like this:

_ _ X
_ X _
X _ _

(X is the word from the confusion set, _ the context)
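
To make the three positions concrete, here is a minimal sketch of how the three trigrams around a word could be built. It is not the LT implementation; the class name, the method name and the `_END_` padding marker are assumptions for illustration (only `_START_` appears in this thread).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ContextWindows {

  // Padding markers. _START_ shows up in the log output in this thread;
  // _END_ is an assumption made for this sketch.
  static final String START = "_START_";
  static final String END = "_END_";

  /** Returns the trigrams that contain tokens[pos] as the last, middle and first word. */
  static List<List<String>> trigramsAround(List<String> tokens, int pos) {
    List<List<String>> result = new ArrayList<>();
    for (int start = pos - 2; start <= pos; start++) {
      List<String> trigram = new ArrayList<>();
      for (int i = start; i < start + 3; i++) {
        trigram.add(tokenAt(tokens, i));
      }
      result.add(trigram);
    }
    return result;
  }

  // Pads with START/END when the window reaches beyond the sentence.
  private static String tokenAt(List<String> tokens, int i) {
    if (i < 0) {
      return START;
    }
    if (i >= tokens.size()) {
      return END;
    }
    return tokens.get(i);
  }

  public static void main(String[] args) {
    List<String> sentence = Arrays.asList("Geblümte", "Vorderseite", "aus", "Webware");
    // Prints the windows "_ _ X", "_ X _" and "X _ _" for X = Vorderseite.
    for (List<String> trigram : trigramsAround(sentence, 1)) {
      System.out.println(trigram);
    }
  }
}
```

Each of the three trigrams would then be scored separately, and the three scores combined.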

Paprikamann · March 19, 2018, 3:39pm

Thanks. I’ve adapted this function and it already works quite well. There are still some problems, though. Let’s look at:

Ngram result: [Voderseite, aus, Webware]
    P for Voderseite: 0.00000018034727761952 (3)
    P for [Voderseite, aus]: 0.50000000000000000000 (1)
    P for [Voderseite, aus, Webware]: 0.50000000000000000000 (1)
  Voderseite aus Webware => 0.00000004508681940488
Ngram result: [Geblümte, Voderseite, aus]
    P for Geblümte: 0.00000112717048512200 (24)
    P for [Geblümte, Voderseite]: 0.08000000000000000000 (1)
    P for [Geblümte, Voderseite, aus]: 0.08000000000000000000 (1)
  Geblümte Voderseite aus => 0.00000000721389110478
Ngram result: [_START_, Geblümte, Voderseite]
    P for _START_: 0.07338181939834254000 (1627566)
    P for [_START_, Geblümte]: 0.00001536035075668160 (24)
    P for [_START_, Geblümte, Voderseite]: 0.00000122882806053453 (1)
  _START_ Geblümte Voderseite => 0.00000000000138509872
Left: 4.508681940488013E-8 Middle: 7.213891104780822E-9 Right: 1.385098721124234E-12
Ngram result: [Vorderseite, aus, Webware]
    P for Vorderseite: 0.00019211493748419426 (4260)
    P for [Vorderseite, aus]: 0.04881483219901431400 (207)
    P for [Vorderseite, aus, Webware]: 0.00750997418446374100 (31)
  Vorderseite aus Webware => 0.00000007042897675637
Ngram result: [Geblümte, Vorderseite, aus]
    P for Geblümte: 0.00000112717048512200 (24)
    P for [Geblümte, Vorderseite]: 0.04000000000000000000 (0)
    P for [Geblümte, Vorderseite, aus]: 0.04000000000000000000 (0)
  Geblümte Vorderseite aus => 0.00000000180347277620
Ngram result: [_START_, Geblümte, Vorderseite]
    P for _START_: 0.07338181939834254000 (1627566)
    P for [_START_, Geblümte]: 0.00001536035075668160 (24)
    P for [_START_, Geblümte, Vorderseite]: 0.00000061441403026726 (0)
  _START_ Geblümte Vorderseite => 0.00000000000069254936
Left: 7.042897675636757E-8 Middle: 1.8034727761952055E-9 Right: 6.92549360562117E-13
P(Voderseite) = 4.505053057295024E-28
P(Vorderseite) = 8.796536361580524E-29

You can see that the probability of Vorderseite is smaller, although the word occurs much more often than Voderseite. This is because of the relative probability calculation in getPseudoProbability. I’m not sure if using the count would help here.

In the confusion set you have a factor. I can’t use the same approach since my confusion pairs can always be different. Still, I have to somehow include the occurrence count of the correct word in these calculations.

Does anyone have an idea?

dnaber · March 20, 2018, 10:55am

Mhhh, which of the numbers are you comparing now?

Paprikamann · March 20, 2018, 11:55am

I was talking about:

P(Voderseite) = 4.505053057295024E-28
P(Vorderseite) = 8.796536361580524E-29

Calculation: left * middle * right (position of the word)
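
As a quick check of that calculation, the following sketch multiplies the Left / Middle / Right values copied from the log output above. The class and method names are made up for this example; the log-space variant is only a suggestion for comparing very small products, not something taken from LT.

```java
final class PositionProduct {

  /** Multiplies the per-position probabilities (left, middle, right). */
  static double product(double left, double middle, double right) {
    return left * middle * right;
  }

  /** The same quantity in log space; sums of logs are less prone to underflow. */
  static double logProduct(double left, double middle, double right) {
    return Math.log(left) + Math.log(middle) + Math.log(right);
  }

  public static void main(String[] args) {
    // Left / Middle / Right values copied from the log output in this thread.
    double voderseite = product(4.508681940488013E-8, 7.213891104780822E-9, 1.385098721124234E-12);
    double vorderseite = product(7.042897675636757E-8, 1.8034727761952055E-9, 6.92549360562117E-13);

    System.out.println("P(Voderseite)  = " + voderseite);
    System.out.println("P(Vorderseite) = " + vorderseite);
    // The misspelling ends up with the larger product, which is the problem described here.
    System.out.println("Misspelling wins: " + (voderseite > vorderseite));
  }
}
```

Only the left position favours Vorderseite; the middle and right positions both favour Voderseite, so the product does too.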

The problem is that the probability of e.g. a bigram is calculated as follows:
(n + 1) / (m + 1), where n is the count of the bigram and m is the count of its first word (the +1 avoids problems with 0 occurrences)
Example:

P for Voderseite: 0.00000018034727761952 (3)
P for [Voderseite, aus]: 0.50000000000000000000 (1) (calculation: 2/4)
P for [Voderseite, aus, Webware]: 0.50000000000000000000 (1)
Voderseite aus Webware => 0.00000004508681940488

vs

P for Vorderseite: 0.00019211493748419426 (4260)
P for [Vorderseite, aus]: 0.04881483219901431400 (207) (calculation: 208/4261)
P for [Vorderseite, aus, Webware]: 0.00750997418446374100 (31)
Vorderseite aus Webware => 0.00000007042897675637

In this case, the result is OK. The trigram with Vorderseite has a higher probability than the trigram with Voderseite, but the difference is really small.
And if you add the other probabilities to the calculation, the result will be wrong (-> Voderseite).
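
The effect can be reproduced with a few lines. This is a sketch of the formula as described in this post, not the code from BaseLanguageModel; the class and method names are invented, and the counts are the ones shown in parentheses in the log.

```java
final class AddOneRatio {

  /** (phraseCount + 1) / (firstWordCount + 1), the formula described in this post. */
  static double pseudoRatio(long phraseCount, long firstWordCount) {
    return (double) (phraseCount + 1) / (firstWordCount + 1);
  }

  public static void main(String[] args) {
    // Voderseite = 3, [Voderseite, aus] = 1  ->  2 / 4
    System.out.println(pseudoRatio(1, 3));
    // Vorderseite = 4260, [Vorderseite, aus] = 207  ->  208 / 4261
    System.out.println(pseudoRatio(207, 4260));
    // [Vorderseite, aus, Webware] = 31, still divided by the count of the first word  ->  32 / 4261
    System.out.println(pseudoRatio(31, 4260));
  }
}
```

A word seen only 3 times gets a ratio of 0.5 from a single bigram hit, while a word seen 4260 times needs thousands of hits to get there, so rare first words are systematically favoured.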

dnaber · March 20, 2018, 1:41pm

Have you tried what happens if you remove the +1 from the n+1?

SkyCharger001 · March 20, 2018, 2:15pm

A more reliable approach is to use
if (m == 0) { x = 0; } else { x = (double) n / m; }

Paprikamann · March 21, 2018, 3:46pm

Neither approach really helps. The problem is that the relative probability tends to be higher when the occurrence counts are small, and lower when they are high.

The problem with @SkyCharger001’s approach is that I calculate the probabilities of three different positions (left, middle, right). In ConfusionProbabilityRule those three probabilities are multiplied.
If one of the “position probs” is 0, the whole probability is 0.
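
A small sketch of that situation, with invented class and method names. The bigram [Geblümte, Vorderseite] has a count of 0 in the log, so without the +1 its ratio is 0 and wipes out the other two positions; the value used for the right position is a placeholder, not a number from the log.

```java
final class UnsmoothedRatio {

  /** Unsmoothed relative frequency n / m, defined as 0 when m is 0. */
  static double ratio(long n, long m) {
    return m == 0 ? 0.0 : (double) n / m;
  }

  public static void main(String[] args) {
    // From the log: [Geblümte, Vorderseite] was seen 0 times, Geblümte 24 times.
    double middle = ratio(0, 24);
    // Left value for Vorderseite copied from the log.
    double left = 7.042897675636757E-8;
    // Placeholder for the right position (hypothetical value).
    double right = 1.0E-12;

    // A single zero factor makes the whole product zero.
    System.out.println(left * middle * right);
  }
}
```

So the correct word would be ruled out entirely just because one of its context n-grams happens to be unseen.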

@dnaber
Could it be that the calculation in BaseLanguageModel is wrong?
Look at the code: getPseudoProbability

The calculation (chain rule probability) is as follows:

P(A) * P(B | A) * P(C | A, B)

In your code you only have:

double thisP = (double) (phraseCount + 1) / (firstWordCount + 1);

So even for the trigram calculation you’re dividing by firstWordCount, although it should be firstANDsecondWordCount.
Am I right?
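
To illustrate what I mean, here is a sketch of the chain rule with each factor conditioned on the full preceding context. It is not the LT code: the class and method names are invented, the +1 is simply applied to every ratio, and the corpus size is a hypothetical number. The n-gram counts are the ones from the log.

```java
final class ChainRule {

  /** Add-one ratio, in the same style as the line quoted above. */
  static double ratio(long numerator, long denominator) {
    return (double) (numerator + 1) / (denominator + 1);
  }

  /** P(A) * P(B | A) * P(C | A, B) computed from raw counts. */
  static double trigramProbability(long totalTokens, long countA, long countAB, long countABC) {
    double pA = ratio(countA, totalTokens);
    double pBGivenA = ratio(countAB, countA);
    double pCGivenAB = ratio(countABC, countAB); // conditioned on the bigram, not on A alone
    return pA * pBGivenA * pCGivenAB;
  }

  public static void main(String[] args) {
    long totalTokens = 22_000_000L; // hypothetical corpus size, not taken from this thread

    // Vorderseite = 4260, [Vorderseite, aus] = 207, [Vorderseite, aus, Webware] = 31
    System.out.println(trigramProbability(totalTokens, 4260, 207, 31));
    // Voderseite = 3, [Voderseite, aus] = 1, [Voderseite, aus, Webware] = 1
    System.out.println(trigramProbability(totalTokens, 3, 1, 1));
  }
}
```

With this conditioning the intermediate counts cancel, so the product reduces to (count(A B C) + 1) / (total + 1). For this trigram that is 32 against 2 in the numerator, which clearly favours Vorderseite.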

dnaber · March 26, 2018, 9:54am

Sorry, I overlooked your edit, as Discourse doesn’t seem to send notifications for edits… So basically,

double thisP = (double) (phraseCount + 1) / (firstWordCount + 1);

would become this (hacked in to test it)?

long firstTwoWordsCount = getCount(context.get(0) + " " + context.get(1));
(...)
double thisP = (double) (phraseCount + 1) / (firstWordCount + 1);   // P(A|B)
if (i == 3) {
  thisP *= (double) (phraseCount + 1) / (firstTwoWordsCount + 1);
}

Have you tested this, does it help with your issue?