Chinese part development daily record

t0iiz · May 11, 2018, 7:55am

Hi dnaber

I have finished that. Because the member variable wordTokenizer in Chinese.java is private and SC and TC use different word tokenizers, I think the code gets quite ugly when they extend Chinese. Here is my code. What do you think?

One more thing: I have already refactored the current code to use the new library. The existing grammar.xml only serves SC, and it conflicts with the new library. I will fix that next week. After that, I need to write an extra grammar.xml for TC. Do you have any advice?

P.S.
SC means Simplified Chinese.
TC means Traditional Chinese.

dnaber · May 11, 2018, 8:45am

Is there a reason to still use Chinese directly, or will everybody use the subclasses? If everybody uses the subclasses, please make Chinese and its getWordTokenizer method abstract.
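A minimal sketch of that suggestion, using stand-in types rather than LT's real Language and tokenizer classes. Only the names Chinese, SimplifiedChinese, TraditionalChinese and getWordTokenizer come from this thread; WordTokenizer, splitIntoCharacters and the demo class are hypothetical, and the per-character splitting is just a placeholder for the real SC and TC tokenizers.

```java
import java.util.ArrayList;
import java.util.List;

class AbstractChineseDemo {
  public static void main(String[] args) {
    Chinese[] variants = { new SimplifiedChinese(), new TraditionalChinese() };
    String[] samples = { "电视", "電視" };
    for (int i = 0; i < variants.length; i++) {
      System.out.println(variants[i].getShortCodeWithCountryAndVariant()
          + " " + variants[i].getWordTokenizer().tokenize(samples[i]));
    }
  }
}

// Hypothetical stand-in for LT's tokenizer type.
interface WordTokenizer {
  List<String> tokenize(String text);
}

abstract class Chinese {
  // No private tokenizer field here: every variant must provide its own tokenizer.
  abstract WordTokenizer getWordTokenizer();

  abstract String getShortCodeWithCountryAndVariant();

  String getShortCode() {
    return "zh";
  }

  // Placeholder segmentation (one token per character), used by this sketch only.
  static List<String> splitIntoCharacters(String text) {
    List<String> tokens = new ArrayList<>();
    text.codePoints().forEach(cp -> tokens.add(new String(Character.toChars(cp))));
    return tokens;
  }
}

class SimplifiedChinese extends Chinese {
  @Override
  WordTokenizer getWordTokenizer() {
    return Chinese::splitIntoCharacters; // placeholder for the real SC tokenizer
  }

  @Override
  String getShortCodeWithCountryAndVariant() {
    return "zh-CN";
  }
}

class TraditionalChinese extends Chinese {
  @Override
  WordTokenizer getWordTokenizer() {
    return Chinese::splitIntoCharacters; // placeholder for the real TC tokenizer
  }

  @Override
  String getShortCodeWithCountryAndVariant() {
    return "zh-TW";
  }
}
```

With this shape the base class no longer needs to expose a private field to its subclasses, and the compiler rejects any variant that forgets to supply a tokenizer.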

Please check English as an example, where we have a grammar.xml for en, en-US, and en-GB. The en-GB for example extends the en grammar.xml, so that all rules from en and en-GB will be active. Could we use the same for Chinese?

BTW, please indent your code with 2 spaces (not 4, and not tabs), as we use that everywhere in LT.

t0iiz · May 11, 2018, 9:08am

Thank you

I think the grammar.xml files for SC and TC are independent. That means neither of them should extend the other, and I need two separate grammar.xml files, because in most cases the characters used to express the same meaning are different, e.g. 电视 - 電視 (TV).

Or, maybe I can create 3 files. One for the common grammar cases. One for the SC special cases. One for the TC special cases.

dnaber · May 11, 2018, 9:30am

Maybe you could still use the same approach as for English - it’s just that the Chinese grammar.xml would be empty and only SC and TC would actually have rules?

t0iiz · May 20, 2018, 2:14pm

May 20th

Completed the first part of my proposal.
Feature

LT can now check both Simplified Chinese and Traditional Chinese.

Command-line Usage

Check Simplified Chinese text with the option -l zh-CN.
Check Traditional Chinese text with the option -l zh-TW.

e.g.

No.	Option	Input	Output	Description
1	zh-CN	简直是走投无路。	No error.	Correct sentence.
2	zh-CN	简直是走头无路。	走头无路 -> 走投无路	ConfusionProbabilityRule
3	zh-CN	簡直是走投無路。	No error.	Correct sentence with TC characters.
4	zh-CN	簡直是走頭無路。	No error.	TC sentence containing an error (走頭無路 should be 走投無路).

And vice versa for zh-TW.

~~[Question:]~~
In rules.xml, how can I write a message, suggestion, or correction that replaces only some of the characters in a token?
For example:

<rule>
    <pattern>
        <marker>
            <token regexp="yes">(夜|春|通|元)(霄)</token>
        </marker>
    </pattern>
    <message>您的意思是"\1宵"吗？</message>
    <example correction="\1宵">然而，这顿<marker>夜霄</marker>吃得并不开心。</example>
    <example correction="\1宵">今天我准备<marker>通霄</marker>。</example>
</rule>

The rule above doesn’t actually work. Is there any syntax that can represent something like \1宵?

dnaber · May 20, 2018, 2:45pm

Your <message> has no <suggestion>, so it will not work, i.e. it will not create a suggestion for the user. You also cannot use \1 in <example>, you need to specify the complete suggestion there in the correction attribute.

t0iiz · May 20, 2018, 2:56pm

I mean: is there a syntax in correction or <suggestion> that uses a symbol to refer to a specific group in a token?

dnaber · May 20, 2018, 3:16pm

Have you tried regexp_match and regexp_replace? (documentation)
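To illustrate the underlying idea only: the sketch below is plain java.util.regex, not LT's rule syntax, and the class and method names are made up. It captures the first character as a group and rewrites only the wrong second character, which is the kind of partial replacement asked about above.

```java
import java.util.regex.Pattern;

class GroupReplaceDemo {
  // One of the four leading characters (captured as group 1), followed by the wrong character 霄.
  private static final Pattern WRONG = Pattern.compile("(夜|春|通|元)霄");

  // Keeps group 1 and replaces 霄 with 宵; tokens that do not match are returned unchanged.
  static String suggest(String token) {
    return WRONG.matcher(token).replaceAll("$1宵");
  }

  public static void main(String[] args) {
    System.out.println(suggest("夜霄")); // 夜宵
    System.out.println(suggest("通霄")); // 通宵
  }
}
```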

t0iiz · May 20, 2018, 3:40pm

Thank you!

t0iiz · May 27, 2018, 2:39am

May 27th

[Discussion]
This is about improving the output for the two cases I mentioned above:

3	zh-CN	簡直是走投無路。	No error.	Correct sentence with TC characters.
4	zh-CN	簡直是走頭無路。	No error.	TC sentence containing an error (走頭無路 should be 走投無路).

I added a new feature for the case where a user inputs TC sentences but checks them with zh-CN, or the other way round. It is mostly complete, apart from one small but annoying issue.

For example, say the feature is named ChineseCharactersConversionRule:

No.	Option	Input	Suggestion
3	zh-CN	簡直是走投無路。	簡直_ -> 简直
			走投無路 -> 走投无路
4	zh-CN	簡直是走頭無路。	簡直_ -> 简直
			走頭無路 -> 走头无路

The small issue is that the user only gets the suggestion that 走头无路 should be 走投无路 after converting the characters to SC and checking again.

In my opinion, there are two solutions.

- Abandon the separate grammar.xml files for zh-CN and zh-TW and combine them into one file in the root folder. (I don’t actually like this. I think there are bound to be some cultural conflicts when only one table is used.)
- Directly tell the user that he has chosen the SC checker but entered TC characters, and that he should use the TC checker for this input. Once the input and the selected checker match, the result should be fine. (This idea may be the safest one, since it avoids many potential risks.) Mutual conversion between TC and SC is just a one-to-one mapping. If a user selects the wrong checker by mistake, then rather than correcting the characters, just telling him to use the other checker is a win-win for the user and for us.
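The second option could look roughly like the sketch below: a check that spots characters belonging only to the other script and returns a single hint instead of per-word conversions. The two tiny character sets are hand-picked from the examples in this thread and purely illustrative, and the class name, method name and hint text are hypothetical. A real implementation would need a full character table.

```java
import java.util.Optional;

class VariantMismatchDemo {
  // Illustrative samples only: characters that differ between the two scripts.
  private static final String TC_ONLY = "簡無頭電視";
  private static final String SC_ONLY = "简无头电视";

  // Returns a hint if the text contains characters of the script that was NOT selected.
  static Optional<String> hintFor(String selectedVariant, String text) {
    boolean simplifiedSelected = "zh-CN".equals(selectedVariant);
    String foreign = simplifiedSelected ? TC_ONLY : SC_ONLY;
    String other = simplifiedSelected ? "zh-TW" : "zh-CN";
    boolean mismatch = text.chars().anyMatch(c -> foreign.indexOf(c) >= 0);
    return mismatch
        ? Optional.of("This text looks like " + other + "; please switch the checker.")
        : Optional.empty();
  }

  public static void main(String[] args) {
    System.out.println(hintFor("zh-CN", "簡直是走頭無路。").isPresent()); // true
    System.out.println(hintFor("zh-CN", "简直是走投无路。").isPresent()); // false
  }
}
```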

What do you think? Is there another way to solve the problem?

dnaber · May 27, 2018, 10:34am

Do you think users will mix SC and TC input in one text? If not, we could maybe use our language identifier (optimaize/language-detector on GitHub, a language detection library for Java) and see if it can detect both variants reliably. This would be useful for all languages, e.g. someone might check German text but still have the setting on “English”. Currently we don’t give the user a useful hint in those cases.

t0iiz · May 27, 2018, 10:48am

I don’t think they will do that.

dnaber · May 27, 2018, 11:58am

Then could you check whether org.languagetool.language.LanguageIdentifier can reliably tell SC and TC apart?

t0iiz · May 27, 2018, 1:01pm

What is going wrong here? I get this exception when I create an instance of LanguageIdentifier:

java.lang.IllegalStateException: A language profile for language zh-CN was added already!

dnaber · May 27, 2018, 3:24pm

Please see the code of LanguageIdentifier, I think there’s a special case for Chinese which you might need to comment out for now.

lena.feinbube · May 28, 2018, 12:27pm

Hi!

I’ve been following this conversation and I’m impressed with your progress so far.

Could you (in a free minute, no hurry) provide me with a current build of the version you are developing (mvn package)?
Then I can join in the testing efforts…

Thanks!

t0iiz · May 28, 2018, 12:38pm

I tested it with a design in which the SimplifiedChinese and TraditionalChinese classes, which have no getShortCode() method of their own, extend the Chinese class, since zh-CN and zh-TW use the same tokenizer, tagger and part of the grammar.xml. (You can see the code on my GitHub.)

In this case, things behave differently from other languages:

private static List<String> getLanguageCodes() {
  List<String> langCodes = new ArrayList<>();
  for (Language lang : Languages.get()) {
    // getShortCode() returns "zh" for both Chinese (Simplified) and Chinese (Traditional);
    // getShortCodeWithCountryAndVariant() returns "zh-CN" and "zh-TW" respectively.
    String langCode = lang.getShortCode();
    boolean ignore = lang.isVariant() || ignoreLangCodes.contains(langCode) || externalLangCodes.contains(langCode);
    if (ignore) { // ignore will be true for zh-CN and zh-TW
      continue;
    }
    if ("zh".equals(langCode)) {
      langCodes.add("zh-CN");
      langCodes.add("zh-TW");
    } else {
      langCodes.add(langCode);
    }
  }
  return langCodes;
}

After editing the code above, LanguageIdentifier.detectLanguage() only ever returns Chinese (Simplified) as the Language, no matter which kind of Chinese characters I input.
However, when I make the languageDetector field in LanguageIdentifier public and use it directly, it tells TC and SC apart correctly in all of my 100 tests.
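The effect of the isVariant() check can be reproduced in isolation. The sketch below uses a hypothetical Lang stand-in instead of LT's Language class and takes the variant flag as given; it only shows that entries flagged as variants are skipped before the "zh" special case is ever reached.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class VariantFilterDemo {
  // Hypothetical stand-in for LT's Language, reduced to what the loop uses.
  static final class Lang {
    final String shortCode;
    final boolean variant;

    Lang(String shortCode, boolean variant) {
      this.shortCode = shortCode;
      this.variant = variant;
    }
  }

  static List<String> getLanguageCodes(List<Lang> langs) {
    List<String> codes = new ArrayList<>();
    for (Lang lang : langs) {
      if (lang.variant) {
        continue; // variants never reach the "zh" branch below
      }
      if ("zh".equals(lang.shortCode)) {
        codes.add("zh-CN");
        codes.add("zh-TW");
      } else {
        codes.add(lang.shortCode);
      }
    }
    return codes;
  }

  public static void main(String[] args) {
    // Only variant entries for zh: no Chinese code ends up in the list.
    List<Lang> onlyVariants = Arrays.asList(
        new Lang("de", false), new Lang("zh", true), new Lang("zh", true));
    System.out.println(getLanguageCodes(onlyVariants)); // [de]

    // With a non-variant zh entry present, both Chinese codes are added once.
    List<Lang> withBase = Arrays.asList(
        new Lang("de", false), new Lang("zh", false), new Lang("zh", true), new Lang("zh", true));
    System.out.println(getLanguageCodes(withBase)); // [de, zh-CN, zh-TW]
  }
}
```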

So, should I develop zh-CN and zh-TW separately?

t0iiz · May 28, 2018, 12:51pm

Welcome!

I am uploading it now. Link.
You can also see my code on GitHub. If you download the code from GitHub, you have to download some extra data as well. Here is the link. Unzip it somewhere, then in resources/hanlp.properties change root=G:/languagetool/languagetool-language-modules/zh/src/main/resources/org/languagetool/resource/ to the location where you unzipped it.

dnaber · May 28, 2018, 7:11pm

Great, then let’s assume that future versions of LT will tell users if they have selected the wrong variant or language (without the need for you to implement anything for that). As mentioned, this will also be useful for all other languages. But I cannot tell yet when I’ll be able to implement that. It’s not that much work, but the UI will also be affected a bit.

t0iiz · May 29, 2018, 1:15pm

May 29th

Added an extra feature (a rule) that goes beyond my proposal. Link
- A specific rule for zh-TW that finds ambiguous words and corrects them.
Plan
- Add more comments to all the code I have written.
- Move on to the second phase of the work.