Thanks. I’ve adapted this function and it already works quite well, but there are still some problems. Consider this output:
Ngram result: [Voderseite, aus, Webware]
P for Voderseite: 0.00000018034727761952 (3)
P for [Voderseite, aus]: 0.50000000000000000000 (1)
P for [Voderseite, aus, Webware]: 0.50000000000000000000 (1)
Voderseite aus Webware => 0.00000004508681940488
Ngram result: [Geblümte, Voderseite, aus]
P for Geblümte: 0.00000112717048512200 (24)
P for [Geblümte, Voderseite]: 0.08000000000000000000 (1)
P for [Geblümte, Voderseite, aus]: 0.08000000000000000000 (1)
Geblümte Voderseite aus => 0.00000000721389110478
Ngram result: [_START_, Geblümte, Voderseite]
P for _START_: 0.07338181939834254000 (1627566)
P for [_START_, Geblümte]: 0.00001536035075668160 (24)
P for [_START_, Geblümte, Voderseite]: 0.00000122882806053453 (1)
_START_ Geblümte Voderseite => 0.00000000000138509872
Left: 4.508681940488013E-8 Middle: 7.213891104780822E-9 Right: 1.385098721124234E-12
Ngram result: [Vorderseite, aus, Webware]
P for Vorderseite: 0.00019211493748419426 (4260)
P for [Vorderseite, aus]: 0.04881483219901431400 (207)
P for [Vorderseite, aus, Webware]: 0.00750997418446374100 (31)
Vorderseite aus Webware => 0.00000007042897675637
Ngram result: [Geblümte, Vorderseite, aus]
P for Geblümte: 0.00000112717048512200 (24)
P for [Geblümte, Vorderseite]: 0.04000000000000000000 (0)
P for [Geblümte, Vorderseite, aus]: 0.04000000000000000000 (0)
Geblümte Vorderseite aus => 0.00000000180347277620
Ngram result: [_START_, Geblümte, Vorderseite]
P for _START_: 0.07338181939834254000 (1627566)
P for [_START_, Geblümte]: 0.00001536035075668160 (24)
P for [_START_, Geblümte, Vorderseite]: 0.00000061441403026726 (0)
_START_ Geblümte Vorderseite => 0.00000000000069254936
Left: 7.042897675636757E-8 Middle: 1.8034727761952055E-9 Right: 6.92549360562117E-13
P(Voderseite) = 4.505053057295024E-28
P(Vorderseite) = 8.796536361580524E-29
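You can see that the probability of Vorderseite is smaller, although the word occurs much more often than Voderseite (4260 vs. 3 occurrences). This is because of the relative probability calculation in getPseudoProbability: Voderseite is so rare that a single matching bigram already yields 0.5, while Vorderseite gets only about 0.049 for 207 matches. I’m not sure whether using the count would help here.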
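To make the effect concrete, here is a minimal Python sketch that reproduces the left-window numbers from the log. The formula (count + 1) / (count of first word + 1) is only my reading of the logged values, not necessarily what getPseudoProbability does internally, and the name pseudo_probability is made up for illustration.

```python
def pseudo_probability(ngram_count, first_word_count):
    """Add-one smoothed relative probability.

    Assumed formula, inferred from the logged values; it may differ
    from the real getPseudoProbability implementation.
    """
    return (ngram_count + 1) / (first_word_count + 1)


# Unigram probabilities copied verbatim from the log above.
P_VODERSEITE = 0.00000018034727761952   # count 3
P_VORDERSEITE = 0.00019211493748419426  # count 4260

# "Voderseite aus Webware": bigram count 1, trigram count 1, first word count 3
left_typo = P_VODERSEITE * pseudo_probability(1, 3) * pseudo_probability(1, 3)

# "Vorderseite aus Webware": bigram count 207, trigram count 31, first word count 4260
left_correct = (P_VORDERSEITE
                * pseudo_probability(207, 4260)
                * pseudo_probability(31, 4260))

print("left, Voderseite: ", left_typo)
print("left, Vorderseite:", left_correct)
```

The left window still favours Vorderseite, but only by a factor of about 1.6, even though the word is more than a thousand times as frequent. The middle and right windows then favour Voderseite by factors of 4 and 2 (see the Middle and Right values in the log), which is enough to flip the final product.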
Your confusion sets come with a factor. I can’t use the same approach, since my confusion pairs can be different every time. Still, I have to somehow include the occurrence count of the correct word in this calculation.
Does anyone have an idea?