Not identified words

tiagosantos · January 11, 2019, 2:38am

No problem at all. It is just better to focus on one task at a time, for efficiency’s sake.
Regarding the symbols (e.g. ∑ ≠ →), I can see the issue now. You can add those as _PUNCT, but it may be a disheartening task if you do it on a case-by-case basis. Better to dump a Unicode math symbol table and “replace all” with the regexp (.) replaced by \1\t\1\t_PUNCT\n.
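A minimal sketch of that dump-and-replace step, assuming the symbols of interest are the characters in the Unicode general category "Sm" (Symbol, math) and that ASCII symbols such as + or = are already covered, so they are skipped here. The variable names are only illustrative:

```python
import re
import sys
import unicodedata

# Collect every non-ASCII character whose Unicode general category is
# "Sm" (Symbol, math). Skipping ASCII is an assumption of this sketch.
symbols = "".join(
    chr(cp)
    for cp in range(128, sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Sm"
)

# The same "replace all" as described above: every single character
# becomes a line of the form  form<TAB>lemma<TAB>_PUNCT.
entries = re.sub(r"(.)", r"\1\t\1\t_PUNCT\n", symbols)
```

The resulting `entries` string can then be written to a UTF-8 file and reviewed before anything is pasted into added.txt.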

marcoagpinto · January 23, 2019, 12:58pm

It is too hard/complex for me.

Tiago,

There are some unidentified words with verb forms:
abarcá-lo-á
abarcá-lo-emos
abarcá-lo-ão
abarcar-lhos

How should I add them to added.txt?

Thanks!

tiagosantos · January 24, 2019, 7:02pm

Hi Marco,

Regarding POS tags, I think they are now treated as separate particles, since I introduced some tokenization changes. You should do it via disambiguation, if you see fit, but the main issue with these word forms is actually their spelling recognition. Many verb forms are still not recognized by Hunspell when mesoclisis is used.
These mesoclisis forms are an issue I haven’t yet found a good solution for. Our Hunspell dictionary even lists uncompressed inflected verb forms with prefixes (uppercase L and P) to recognize all mesoclisis forms of some verbs, which is computationally very expensive and requires a great deal of manual input. I am still thinking of a solution that does not involve adding all base forms of a verb, as is done at the moment, or, if that is done, does it with an automated script for all relevant verbs. If you have feasible ideas, I am very happy to hear them.
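As a rough illustration of what such an automated script could look like, here is a sketch that builds the future and conditional forms with an inserted pronoun for a regular verb. It is not how the Hunspell dictionary or LanguageTool currently does it. The function name `mesoclitic_forms` and the pronoun lists are assumptions made for this example, and irregular verbs (dizer, fazer, trazer), verbs such as atrair that need an accented í, and contracted pronouns are deliberately not handled:

```python
# Future and conditional endings: the two tenses in which a pronoun can
# be placed between the infinitive stem and the ending.
FUTURE = ["ei", "ás", "á", "emos", "eis", "ão"]
CONDITIONAL = ["ia", "ias", "ia", "íamos", "íeis", "iam"]

# Before lo/la/los/las the infinitive drops its final "r" and the theme
# vowel is written á, ê or i (regular verbs only).
THEME_VOWEL = {"a": "á", "e": "ê", "i": "i"}

DIRECT = ["o", "a", "os", "as"]                      # realised as lo, la, los, las
OTHER = ["me", "te", "se", "lhe", "nos", "vos", "lhes"]


def mesoclitic_forms(infinitive):
    """Return the mesoclitic forms of a regular -ar/-er/-ir verb."""
    if infinitive[-2:] not in ("ar", "er", "ir"):
        raise ValueError(f"not a regular infinitive: {infinitive}")
    stem = infinitive[:-2] + THEME_VOWEL[infinitive[-2]]
    forms = []
    for ending in FUTURE + CONDITIONAL:
        for pronoun in DIRECT:
            forms.append(f"{stem}-l{pronoun}-{ending}")
        for pronoun in OTHER:
            forms.append(f"{infinitive}-{pronoun}-{ending}")
    # The 1st and 3rd person conditional share "ia", so drop duplicates
    # while keeping the original order.
    return list(dict.fromkeys(forms))
```

For "abarcar" this yields, among others, abarcá-lo-á, abarcá-lo-emos and abarcá-lo-ão. The output would still need to be checked against the speller before being added anywhere.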

For examples see:
https://raw.githubusercontent.com/languagetool-org/languagetool/master/languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/hunspell/pt_PT.dic

assentes/L	[$assentir$CAT=v,T=inf,TR=_$P=2,N=s,T=p]
assente/L	[$assentir$CAT=v,T=inf,TR=_$P=3,N=s,T=p]
assentimos/L	[$assentir$CAT=v,T=inf,TR=_$P=1,N=p,T=p]
assentis/L	[$assentir$CAT=v,T=inf,TR=_$P=2,N=p,T=p]
assentem/L	[$assentir$CAT=v,T=inf,TR=_$P=3,N=p,T=p]
assinto/L	[$assentir$CAT=v,T=inf,TR=_$P=1,N=s,T=p]
assinta/L	[$assentir$CAT=v,T=inf,TR=_$P=1,N=s,T=pc]
assintas/L	[$assentir$CAT=v,T=inf,TR=_$P=2,N=s,T=pc]
assinta/L	[$assentir$CAT=v,T=inf,TR=_$P=3,N=s,T=pc]
assintamos/L	[$assentir$CAT=v,T=inf,TR=_$P=1,N=p,T=pc]
assintais/L	[$assentir$CAT=v,T=inf,TR=_$P=2,N=p,T=pc]
assintam/L	[$assentir$CAT=v,T=inf,TR=_$P=3,N=p,T=pc]
assente/L	[$assentir$CAT=v,T=inf,TR=_$P=2,N=s,T=i]
assinta/L	[$assentir$CAT=v,T=inf,TR=_$P=3,N=s,T=i]
assintamos/L	[$assentir$CAT=v,T=inf,TR=_$P=1,N=p,T=i]
assenti/L	[$assentir$CAT=v,T=inf,TR=_$P=2,N=p,T=i]
assintam/L	[$assentir$CAT=v,T=inf,TR=_$P=3,N=p,T=i]
consentes/LS	[$consentir$CAT=v,T=inf,TR=_$P=2,N=s,T=p]
consente/LS	[$consentir$CAT=v,T=inf,TR=_$P=3,N=s,T=p]
consentimos/LS	[$consentir$CAT=v,T=inf,TR=_$P=1,N=p,T=p]
consentis/LS	[$consentir$CAT=v,T=inf,TR=_$P=2,N=p,T=p]
consentem/LS	[$consentir$CAT=v,T=inf,TR=_$P=3,N=p,T=p]
consinto/LS	[$consentir$CAT=v,T=inf,TR=_$P=1,N=s,T=p]
consinta/LS	[$consentir$CAT=v,T=inf,TR=_$P=1,N=s,T=pc]
consintas/LS	[$consentir$CAT=v,T=inf,TR=_$P=2,N=s,T=pc]
consinta/LS	[$consentir$CAT=v,T=inf,TR=_$P=3,N=s,T=pc]
consintamos/LS	[$consentir$CAT=v,T=inf,TR=_$P=1,N=p,T=pc]

marcoagpinto · January 27, 2019, 11:20pm

@tiagosantos

Hello!

Tonight’s diff gives a hit in “os estudantes que possuem diploma de uma escola profissionalizante também podem entrar.”
https://languagetool.org/regression-tests/20190127/result_pt-PT_20190127.html

I added the POS entries as:

|profissionalizante|profissionalizante|AQ0MS0|
|profissionalizantes|profissionalizante|AQ0MP0|
|profissionalizantes|profissionalizante|AQ0FP0|

To get a valid POS, I look for other words that the Priberam dictionary classifies as the same kind and that are already in the morphological database of LanguageTool.

Could you confirm if the three entries I added are the most correct ones?

Notice that for the plural above, Priberam says “masculine and feminine”, so I added two entries, one masculine and the other feminine, as I was not sure how to do it in one POS.

Thanks!

tiagosantos · January 28, 2019, 10:00pm

Hi Marco,

Given that profissionalizante is an ungendered adjective, you can either add more POS entries with the feminine form or change M to C in the tag. Notice that in your list you forgot to add the feminine entry for the singular form of profissionalizante, as you did for the plural.

marcoagpinto · January 28, 2019, 10:15pm

Hello Tiago,

I have just fixed it:

Thank you!

marcoagpinto · January 29, 2019, 1:09pm

@tiagosantos

Hello!

A few days ago I added the POS for “driver” and “drivers”.

Could you make it flag the word as foreign and suggest replacing it with “controlador” or “controladores”?

Thanks!

marcoagpinto · February 11, 2019, 8:37am

Hello @tiagosantos

I am adding POS to words.

The word “t-shirts” triggers a false positive in LibreOffice.

Could you check?

Thanks!

tiagosantos · February 11, 2019, 1:59pm

Hi @marcoagpinto,

This needs the dictionary to be changed. Have you tried replacing the standard Hunspell LibreOffice dictionaries with the ones I am maintaining (TiagoSantos81/PortugueseLibreOfficeExtension on GitHub)?
They are a bit outdated now, and I will push a new version one of these days, but they should work.

marcoagpinto · February 11, 2019, 2:14pm

@tiagosantos
I am using the University of Minho speller.

I am about to download and install your version.

The bad thing is that while adding POS to words, several words (from the list generated by the other LT member) appear as typos, and I have only been adding POS to words that appear as not identified, not to the ones that appear as typos.

My silly idea was to first process all based on the Minho speller and then do a second check with your speller.

This was a silly idea since I should have done it from the beginning with yours.

Now I will have twice the work.

tiagosantos · February 11, 2019, 2:27pm

Marco, both ideas are good. It is a daunting task. If you are already using those dictionaries, I may suggest one way to accelerate the task.
You can replace the U.Minho tags by POS if you decode them. For example:

|...|[CAT=punctj]|
|---|---|
|à|[$ao$CAT=cp,Prep=a,Art=o$G=f,N=s]|
|abacateiro/p|[CAT=nc,G=m,N=s]|
|abacate/p|[CAT=nc,G=m,N=s]|
|abacaxi/p|[CAT=nc,G=m,N=s]|
|ábaco/p|[CAT=nc,G=m,N=s]|

[CAT=punct*] is equivalent to POS _PUNCT
[CAT=nc,G=m,N=s] is equivalent to POS NCMS000

If you replace all those by their POS equivalent, remove the affixes and open the result in Calc (for example), you can create a simple “POS dictionary”. Then you just have to run LT on it to triage the words that don’t have POS. It still takes a lot of time, but it is faster because you can just delete large chunks of the table.
If you need help I can provide a baseline with part of the dictionary with this conversion.
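
The decoding step could be sketched roughly as below. Only the two mappings given above ([CAT=punct*] to _PUNCT and [CAT=nc,G=m,N=s] to NCMS000) come from this thread. The feminine singular NCFS000 is an extrapolation that should be checked, and the function names `minho_to_pos` and `convert` are made up for the example. Entries that carry their own lemma between $ signs (verbs, contractions such as à) are left untouched for manual work:

```python
import re

# One .dic line: word, optional "/affix flags", a tab, then the bracketed tag.
ENTRY = re.compile(r"^([^/\t]+)(?:/[^\t]*)?\t\[([^\]]*)\]$")


def minho_to_pos(tag):
    """Return a POS tag for a U.Minho tag string, or None if no rule applies."""
    if "$" in tag:
        return None  # tag carries its own lemma; not handled in this sketch
    fields = dict(part.split("=", 1) for part in tag.split(",") if "=" in part)
    cat = fields.get("CAT", "")
    if cat.startswith("punct"):
        return "_PUNCT"
    # Only singular common nouns, so the form can serve as its own lemma.
    # NCFS000 for the feminine is an assumption, not confirmed in the thread.
    if cat == "nc" and fields.get("G") in ("m", "f") and fields.get("N") == "s":
        return "NC" + fields["G"].upper() + "S000"
    return None


def convert(line):
    """Turn one .dic line into 'form<TAB>lemma<TAB>POS', or None if skipped."""
    match = ENTRY.match(line.rstrip("\n"))
    if not match:
        return None
    word, tag = match.groups()
    pos = minho_to_pos(tag)
    if pos is None:
        return None
    return f"{word}\t{word}\t{pos}"
```

Running `convert` over every line of the .dic file and keeping the non-None results gives a tab-separated table that opens directly in Calc. The skipped lines are the ones that still need a hand-written rule.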