Not identified words

No problem at all. It is just better to focus on one task at a time, for efficiency’s sake.
Regarding the symbols (e.g. ∑ ≠ →), I can see the issue now. You can add those as _PUNCT, but it may be a disheartening task if you do it on a case-by-case basis. It is better to dump a Unicode math symbol table and “replace all” with a regexp, turning (.) into \1\t\1\t_PUNCT\n.
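
If a ready-made symbol table is not at hand, the same rows can be generated directly. This is only a sketch: it assumes that the Unicode general category `Sm` (“Symbol, math”) is a good enough definition of “math symbol”, and that the target file uses tab-separated form, lemma and tag columns, as the regexp above suggests. The function name is made up for illustration.

```python
import sys
import unicodedata


def math_symbol_rows(tag="_PUNCT"):
    """Return 'form<TAB>lemma<TAB>tag' rows for every Unicode math symbol."""
    rows = []
    for codepoint in range(sys.maxunicode + 1):
        char = chr(codepoint)
        # 'Sm' is the Unicode general category "Symbol, math".
        if unicodedata.category(char) == "Sm":
            rows.append(f"{char}\t{char}\t{tag}")
    return rows


rows = math_symbol_rows()
print(len(rows), "rows generated")
print("\n".join(rows[:5]))
```

Plain ASCII symbols such as + and = are also in that category, so the output should be checked against the entries that already exist before pasting it in.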

It is too hard/complex for me.

Tiago,

There are some unidentified words that are verb forms:
abarcá-lo-á
abarcá-lo-emos
abarcá-lo-ão
abarcar-lhos

How should I add them to added.txt?

Thanks!

Hi Marco,

Regarding the POS tags, I think they are now considered different particles, since I introduced some tokenization changes. You can handle it via disambiguation, if you see fit, but the main issue with these word forms is actually their spelling recognition. There are many verb forms that Hunspell still does not recognize when ‘mesoclisis’ is used.
These ‘mesoclisis’ forms are an issue I have not yet found a good solution for. Our Hunspell dictionary even uses uncompressed inflected verb forms with prefixes (uppercase L and P) to recognize all ‘mesoclisis’ forms of some verbs, which is computationally very expensive and requires a great deal of manual input. I am still thinking of a solution that does not involve adding all base forms of a verb, as is done at the moment, or, if that is unavoidable, one where it is done with an automated script for all relevant verbs. If you have feasible ideas, I am very happy to hear them.

For examples see:
https://raw.githubusercontent.com/languagetool-org/languagetool/master/languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/hunspell/pt_PT.dic

assentes/L [$assentir$CAT=v,T=inf,TR=_$P=2,N=s,T=p]
assente/L [$assentir$CAT=v,T=inf,TR=_$P=3,N=s,T=p]
assentimos/L [$assentir$CAT=v,T=inf,TR=_$P=1,N=p,T=p]
assentis/L [$assentir$CAT=v,T=inf,TR=_$P=2,N=p,T=p]
assentem/L [$assentir$CAT=v,T=inf,TR=_$P=3,N=p,T=p]
assinto/L [$assentir$CAT=v,T=inf,TR=_$P=1,N=s,T=p]
assinta/L [$assentir$CAT=v,T=inf,TR=_$P=1,N=s,T=pc]
assintas/L [$assentir$CAT=v,T=inf,TR=_$P=2,N=s,T=pc]
assinta/L [$assentir$CAT=v,T=inf,TR=_$P=3,N=s,T=pc]
assintamos/L [$assentir$CAT=v,T=inf,TR=_$P=1,N=p,T=pc]
assintais/L [$assentir$CAT=v,T=inf,TR=_$P=2,N=p,T=pc]
assintam/L [$assentir$CAT=v,T=inf,TR=_$P=3,N=p,T=pc]
assente/L [$assentir$CAT=v,T=inf,TR=_$P=2,N=s,T=i]
assinta/L [$assentir$CAT=v,T=inf,TR=_$P=3,N=s,T=i]
assintamos/L [$assentir$CAT=v,T=inf,TR=_$P=1,N=p,T=i]
assenti/L [$assentir$CAT=v,T=inf,TR=_$P=2,N=p,T=i]
assintam/L [$assentir$CAT=v,T=inf,TR=_$P=3,N=p,T=i]
consentes/LS [$consentir$CAT=v,T=inf,TR=_$P=2,N=s,T=p]
consente/LS [$consentir$CAT=v,T=inf,TR=_$P=3,N=s,T=p]
consentimos/LS [$consentir$CAT=v,T=inf,TR=_$P=1,N=p,T=p]
consentis/LS [$consentir$CAT=v,T=inf,TR=_$P=2,N=p,T=p]
consentem/LS [$consentir$CAT=v,T=inf,TR=_$P=3,N=p,T=p]
consinto/LS [$consentir$CAT=v,T=inf,TR=_$P=1,N=s,T=p]
consinta/LS [$consentir$CAT=v,T=inf,TR=_$P=1,N=s,T=pc]
consintas/LS [$consentir$CAT=v,T=inf,TR=_$P=2,N=s,T=pc]
consinta/LS [$consentir$CAT=v,T=inf,TR=_$P=3,N=s,T=pc]
consintamos/LS [$consentir$CAT=v,T=inf,TR=_$P=1,N=p,T=pc]
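
As a starting point for the “automated script” idea, here is a rough sketch of how such forms could be generated for a regular verb. It is not how the dictionary currently works. It assumes regular future and conditional endings and the usual spelling change before lo/la/los/las (final r dropped; á for -ar verbs, ê for -er verbs, plain i for -ir verbs). Verbs with irregular future stems (dizer, fazer, trazer), verbs in -air/-uir, pôr and its derivatives, and contracted clitics such as lho are deliberately left out. All names are illustrative.

```python
# Regular endings only (an assumption; irregular verbs are not covered).
FUTURE = ["ei", "ás", "á", "emos", "eis", "ão"]
CONDITIONAL = ["ia", "ias", "íamos", "íeis", "iam"]
DIRECT_OBJECT = ["o", "a", "os", "as"]
OTHER_CLITICS = ["me", "te", "se", "lhe", "nos", "vos", "lhes"]
# Final vowel of the infinitive once the "r" is dropped before lo/la/los/las.
STEM_VOWEL = {"ar": "á", "er": "ê", "ir": "i"}


def mesoclitic_forms(infinitive):
    """List mesoclitic future/conditional forms of a regular verb."""
    suffix = infinitive[-2:]
    if suffix not in STEM_VOWEL:
        raise ValueError(f"unsupported infinitive: {infinitive}")
    reduced = infinitive[:-2] + STEM_VOWEL[suffix]
    forms = []
    for ending in FUTURE + CONDITIONAL:
        for clitic in DIRECT_OBJECT:
            # o/a/os/as become lo/la/los/las after the reduced infinitive.
            forms.append(f"{reduced}-l{clitic}-{ending}")
        for clitic in OTHER_CLITICS:
            # Other clitics attach to the unchanged infinitive.
            forms.append(f"{infinitive}-{clitic}-{ending}")
    return forms


forms = mesoclitic_forms("abarcar")
print(len(forms), "forms, e.g.", forms[:3])
```

The output is a flat word list, so it would still need to be checked by hand and tagged before going into the speller or into added.txt.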

@tiagosantos

Hello!

Tonight’s diff gives a hit in “os estudantes que possuem diploma de uma escola profissionalizante também podem entrar.”
https://languagetool.org/regression-tests/20190127/result_pt-PT_20190127.html

I added the POS entries as:

|profissionalizante|profissionalizante|AQ0MS0|
|profissionalizantes|profissionalizante|AQ0MP0|
|profissionalizantes|profissionalizante|AQ0FP0|

To get a valid POS, I try to find other words that the Priberam dictionary lists as the same kind and that are already in LanguageTool’s morphological database.

Could you confirm if the three entries I added are the most correct ones?

Notice that for the plural above, Priberam says “masculine and feminine”, so I added two entries, one masculine and the other feminine, as I was not sure how to do it in one POS.

Thanks!

Hi Marco,

Given that profissionalizante is an ungendered adjective, you can either add more POS entries with the feminine form or change M to C. Notice that in your list you forgot to add the feminine entry for the singular form of profissionalizante, as you did for the plural.
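
For the “change M to C” option, a small sketch of the rewrite. It assumes the gender letter is always the fourth character of the adjective tag, as in AQ0MS0 and AQ0FP0 above; the helper name is made up.

```python
def to_common_gender(entries):
    """Rewrite the gender letter to C and drop the resulting duplicates."""
    result = []
    for form, lemma, tag in entries:
        # Assumption: the 4th character of the tag encodes gender (M/F/C).
        row = (form, lemma, tag[:3] + "C" + tag[4:])
        if row not in result:
            result.append(row)
    return result


entries = [
    ("profissionalizante", "profissionalizante", "AQ0MS0"),
    ("profissionalizantes", "profissionalizante", "AQ0MP0"),
    ("profissionalizantes", "profissionalizante", "AQ0FP0"),
]
for row in to_common_gender(entries):
    print("\t".join(row))
```

With this option, one singular and one plural entry cover both genders.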

Hello Tiago,

I have just fixed it:

Thank you!

@tiagosantos

Hello!

A few days ago I added the POS for “driver” and “drivers”.

Could you make LanguageTool flag it as a foreign word and suggest “controlador” or “controladores” as a replacement?

Thanks!

Hello @tiagosantos

I am adding POS to words.

The word “t-shirts” triggers a false positive in LibreOffice.

Could you check?

Thanks!

Hi @marcoagpinto,

This needs the dictionary to be changed. Have you tried replacing the standard LibreOffice Hunspell dictionaries with the ones I am maintaining (GitHub - TiagoSantos81/PortugueseLibreOfficeExtension)?
They are a bit outdated now, and I will push a new version one of these days, but they should work.

@tiagosantos
I am using the Minho university speller.

I am about to download and install your version.

The bad thing is that while adding POS to words, several words (from the list generated by the other LT member) appear as typos, and I have only been adding POS to words that appear as not identified and not to the ones that appear as typos :frowning:

My silly idea was to first process all based on the Minho speller and then do a second check with your speller.

This was a silly idea since I should have done it from the beginning with yours.

Now I will have twice the work.

:slight_smile:

Marco, both ideas are good. It is a daunting task. If you are already using those dictionaries, I can suggest one way to speed it up.
You can replace the U.Minho tags by POS if you decode them. For example:

|...|[CAT=punctj]|
|---|---|
|à|[$ao$CAT=cp,Prep=a,Art=o$G=f,N=s]|
|abacateiro/p|[CAT=nc,G=m,N=s]|
|abacate/p|[CAT=nc,G=m,N=s]|
|abacaxi/p|[CAT=nc,G=m,N=s]|
|ábaco/p|[CAT=nc,G=m,N=s]|

[CAT=punct*] is equivalent to POS _PUNCT
[CAT=nc,G=m,N=s] is equivalent to POS NCMS000

If you replace all of those with their POS equivalents, remove the affixes and open the result in Calc (for example), you can create a simple “POS dictionary”. Then you just have to run LT on it to triage the words that don’t have a POS. It still takes a lot of time, but it is faster because you can delete large chunks of the table at once.
If you need help I can provide a baseline with part of the dictionary with this conversion.
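
The decoding described above could be sketched roughly like this. It assumes each .dic line has the shape `word/flags [tags]`, with the flags optional and whitespace before the bracket, and it only knows the two equivalences given in this post. Every other tag is returned as None so that it can be triaged by hand. The names are illustrative.

```python
import re

# word, optional /flags, whitespace, then the bracketed U.Minho tag.
ENTRY = re.compile(r"^(?P<word>[^/\s]+)(?:/\S+)?\s+\[(?P<tags>.*)\]\s*$")

# Only the mapping mentioned in the thread; extend it as tags are decoded.
TAG_TO_POS = {"CAT=nc,G=m,N=s": "NCMS000"}


def decode_entry(line):
    """Return (word, POS), with POS None if unknown; None if unparseable."""
    match = ENTRY.match(line)
    if match is None:
        return None
    word, tags = match["word"], match["tags"]
    if tags.startswith("CAT=punct"):
        # [CAT=punct*] covers every punctuation subtype.
        return word, "_PUNCT"
    return word, TAG_TO_POS.get(tags)


sample = [
    "...\t[CAT=punctj]",
    "abacateiro/p\t[CAT=nc,G=m,N=s]",
    "à\t[$ao$CAT=cp,Prep=a,Art=o$G=f,N=s]",
]
for line in sample:
    print(decode_entry(line))
```

The affix flag (/p) is dropped here, so plural forms are not expanded; the output only covers the base entries.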