Chinese part development daily record

Hi dnaber

I have finished that. Since the member variable wordTokenizer in Chinese.java is private, and SC and TC use different word tokenizers, I think the code gets quite ugly when they extend Chinese. Here is my code. What do you think?

One more thing: I have already refactored the current code to use the new library. The existing grammar.xml only serves SC, and it conflicts with the new library. I will fix that next week. After fixing it, I need to write an extra grammar.xml for TC. Do you have any advice?

P.S.
SC means Simplified Chinese.
TC means Traditional Chinese.

Is there a reason to still use Chinese, or will everybody use the subclasses? If everybody uses the subclasses, please make Chinese and its getWordTokenizer method abstract.
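
The suggestion above could look roughly like the following. This is only a minimal sketch: WordTokenizerSketch, ChineseSketch and the two subclasses are hypothetical stand-ins rather than the real LanguageTool classes, and the tokenizers are placeholders that merely show that each variant can supply its own.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a tokenizer type; not the real LanguageTool interface.
interface WordTokenizerSketch {
  List<String> tokenize(String text);
}

// Abstract base: keeps what SC and TC share and leaves the tokenizer to subclasses.
abstract class ChineseSketch {
  abstract String getVariantCode();
  abstract WordTokenizerSketch getWordTokenizer();

  // Shared behaviour can live here and rely on the abstract method.
  List<String> tokenize(String text) {
    return getWordTokenizer().tokenize(text);
  }
}

class SimplifiedChineseSketch extends ChineseSketch {
  @Override String getVariantCode() { return "zh-CN"; }
  @Override WordTokenizerSketch getWordTokenizer() {
    // Placeholder: splits on spaces; a real SC tokenizer would segment words.
    return text -> Arrays.asList(text.split(" "));
  }
}

class TraditionalChineseSketch extends ChineseSketch {
  @Override String getVariantCode() { return "zh-TW"; }
  @Override WordTokenizerSketch getWordTokenizer() {
    // Placeholder: splits on '|' only to show that the variants can differ.
    return text -> Arrays.asList(text.split("\\|"));
  }
}
```

With this shape, no private field in the base class has to be worked around: each subclass returns its own tokenizer and the base class never needs to know which one it is.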

Please check English as an example, where we have a grammar.xml for en, en-US, and en-GB. The en-GB for example extends the en grammar.xml, so that all rules from en and en-GB will be active. Could we use the same for Chinese?

BTW, please indent your code with 2 spaces (not 4, and not tabs), as we use that everywhere in LT.

Thank you

I think the grammar.xml files for SC and TC are independent. That means neither of them should extend the other, and I need two separate grammar.xml files, because in most cases different characters are used to express the same meaning, e.g. 电视 - 電視 (TV).

Or maybe I can create three files: one for the common grammar cases, one for the SC-specific cases, and one for the TC-specific cases.

Maybe you could still use the same approach as for English - it’s just that the Chinese grammar.xml would be empty and only SC and TC would actually have rules?
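
To make the idea concrete, here is a small sketch of how rule files could be collected for a variant: the base language file first (possibly empty), then the variant's own file. The class name, the ruleFilesFor helper and the path layout are assumptions for illustration only, not LT's actual loading code.

```java
import java.util.ArrayList;
import java.util.List;

class RuleFileSketch {
  // Returns the rule files to load for a language code:
  // the base language file first, then the variant file if the code has a variant.
  // The "rules/<base>/<variant>/grammar.xml" layout is hypothetical.
  static List<String> ruleFilesFor(String code) {
    List<String> files = new ArrayList<>();
    int dash = code.indexOf('-');
    String base = dash >= 0 ? code.substring(0, dash) : code;
    files.add("rules/" + base + "/grammar.xml");
    if (!base.equals(code)) {
      files.add("rules/" + base + "/" + code + "/grammar.xml");
    }
    return files;
  }
}
```

Under these assumptions, zh-CN and zh-TW would each load the shared zh file plus their own file, so the three-file idea and the "empty Chinese grammar.xml" idea end up with the same structure.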

May 20th

Completed the first part of my proposal.
Feature

  • LT can now check both Simplified Chinese and Traditional Chinese.

Command-line Usage

  • Check Simplified Chinese text with the option -l zh-CN.
  • Check Traditional Chinese text with the option -l zh-TW.

e.g.

No. | Option | Input | Output | Description
1 | zh-CN | 简直是走投无路。 | No error. | Correct sentence.
2 | zh-CN | 简直是走头无路。 | 走头无路 -> 走投无路 | ConfusionProbabilityRule
3 | zh-CN | 簡直是走投無路。 | No error. | Correct sentence with TC characters
4 | zh-CN | 簡直是走頭無路。 | No error. | TC characters sentence with an error 走投無路

And vice versa for zh-TW.

[Question:]
How can I write the message, suggestion, or correction of a rule so that it substitutes only some of the characters in a token?
For example:

<rule>
    <pattern>
        <marker>
            <token regexp="yes">(夜|春|通|元)(霄)</token>
        </marker>
    </pattern>
    <message>您的意思是"\1宵"吗?</message>
    <example correction="\1宵">然而,这顿<marker>夜霄</marker>吃得并不开心。</example>
    <example correction="\1宵">今天我准备<marker>通霄</marker>。</example>
</rule>

The rule above didn’t actually work. Is there any syntax that can represent something like \1宵?

Your <message> has no <suggestion>, so it will not work, i.e. it will not create a suggestion for the user. You also cannot use \1 in <example>, you need to specify the complete suggestion there in the correction attribute.

What I mean is: is there a syntax in correction or <suggestion> that uses a symbol to refer to a specific group in a token?

Have you tried regexp_match and regexp_replace? (documentation)

Thank you!
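
For readers unfamiliar with the idea behind a match-and-replace on part of a token, the plain-Java sketch below shows the same kind of substitution using java.util.regex. It is not LT's implementation of regexp_match and regexp_replace; the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class PartialReplaceSketch {
  // Matches 霄 only when it directly follows one of the listed characters.
  private static final Pattern WRONG = Pattern.compile("(夜|春|通|元)霄");

  // Builds the corrected form: $1 re-inserts the captured first character,
  // so only the second character is replaced.
  static String suggest(String token) {
    return WRONG.matcher(token).replaceAll("$1宵");
  }
}
```

Calling suggest on 夜霄 or 通霄 replaces only the second character, while a token that does not match the pattern is returned unchanged.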

May 27th

[Discussion]
To improve the output for the cases I described above:

3 | zh-CN | 簡直是走投無路。 | No error. | Correct sentence with TC characters
4 | zh-CN | 簡直是走頭無路。 | No error. | TC characters sentence with an error 走投無路

I added a new feature for the case where a user inputs TC sentences but checks them with zh-CN, or the other way round. It is mostly complete, apart from one small, annoying issue.

For example, say the feature is named ChineseCharactersConversionRule.

No. | Option | Input | Suggestion
3 | zh-CN | 簡直是走投無路。 | 簡直_ -> 简直; 走投無路 -> 走投无路
4 | zh-CN | 簡直是走頭無路。 | 簡直_ -> 简直; 走頭無路 -> 走头无路

The small issue is that the user only gets the result that 走头无路 should be 走投无路 after correcting the characters to SC and checking again.

In my opinion, there are two solutions.

  1. Abandon the separate grammar.xml files for zh-CN and zh-TW and combine them
    in the root folder. (I don’t actually like this. I think there would be some cultural
    collisions when using only one table.)

  2. Tell the user directly that he has chosen the SC checker but entered TC characters, and
    that he should use the TC checker for this input. Once the input and the selected checker
    match, the result should be fine. (This may be the safest idea, as it avoids many potential
    risks.) Mutual conversion between TC and SC is just a one-to-one mapping.
    If a user selects the wrong checker by mistake, telling him to use the other one, rather
    than correcting the characters, is a win-win for the user and for us.
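
The second option could be sketched as follows. The mapping table is a tiny illustrative sample built from the characters used in this thread, and VariantHintSketch with its methods is hypothetical; a real rule would need a complete conversion table.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class VariantHintSketch {
  // Tiny illustrative TC -> SC table; a real table would be far larger.
  private static final Map<String, String> TC_TO_SC = new LinkedHashMap<>();
  static {
    TC_TO_SC.put("簡", "简");
    TC_TO_SC.put("無", "无");
    TC_TO_SC.put("頭", "头");
    TC_TO_SC.put("電", "电");
    TC_TO_SC.put("視", "视");
  }

  // True if the text contains any character from the TC column of the table.
  static boolean containsTraditional(String text) {
    for (String tc : TC_TO_SC.keySet()) {
      if (text.contains(tc)) {
        return true;
      }
    }
    return false;
  }

  // Converts the known TC characters to SC, leaving everything else untouched.
  static String toSimplified(String text) {
    String result = text;
    for (Map.Entry<String, String> e : TC_TO_SC.entrySet()) {
      result = result.replace(e.getKey(), e.getValue());
    }
    return result;
  }

  // Option 2: return one hint for the whole text instead of per-word suggestions.
  // Returns null when no hint is needed.
  static String hintFor(String selectedCode, String text) {
    if ("zh-CN".equals(selectedCode) && containsTraditional(text)) {
      return "Input looks like Traditional Chinese; consider checking with zh-TW.";
    }
    return null;
  }
}
```

The toSimplified helper is included only to show where the "check again" problem comes from: the SC rules can find 走头无路 only after the conversion has been applied, whereas the hint avoids the second pass entirely.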

What do you think? Is there another way to solve the problem?

Do you think users will mix SC and TC input in one text? If not, we could maybe use our language identifier (optimaize/language-detector, a language detection library for Java) and see if it can detect both variants reliably. This would be useful for all languages, e.g. someone might check German text but still have the setting on “English”. Currently we don’t give the user a useful hint in those cases.

I don’t think they will do that.

Then could you try if org.languagetool.language.LanguageIdentifier can reliably tell SC and TC apart?

What’s wrong when I create an instance of LanguageIdentifier?

java.lang.IllegalStateException: A language profile for language zh-CN was added already!

Please see the code of LanguageIdentifier, I think there’s a special case for Chinese which you might need to comment out for now.

Hi!

I’ve been following this conversation and I’m impressed with your progress so far :)

Could you (in a free minute, no hurry) provide me with a current build of the version you are developing (mvn package)?
Then I can join in the testing efforts…

Thanks!

I tested it with a design in which the SimplifiedChinese and TraditionalChinese classes extend the Chinese class and do not define their own getShortCode() method, since zh-CN and zh-TW use the same tokenizer, tagger and part of the grammar.xml. (You can see the code on my GitHub.)

In this case, something behaves differently from other languages:

private static List<String> getLanguageCodes() {
  List<String> langCodes = new ArrayList<>();
  for (Language lang : Languages.get()) {
    String langCode = lang.getShortCode();
    // getShortCode() returns "zh" for both Chinese (Simplified) and Chinese (Traditional);
    // lang.getShortCodeWithCountryAndVariant() returns zh-CN and zh-TW respectively.
    boolean ignore = lang.isVariant() || ignoreLangCodes.contains(langCode) || externalLangCodes.contains(langCode);
    if (ignore) {  // ignore will be true for zh-CN and zh-TW
      continue;
    }
    if ("zh".equals(langCode)) {
      langCodes.add("zh-CN");
      langCodes.add("zh-TW");
    } else {
      langCodes.add(langCode);
    }
  }
  return langCodes;
}

After editing the code above, LanguageIdentifier.detectLanguage() always returns Chinese (Simplified) as the Language, no matter what kind of Chinese characters I input.
However, when I make the languageDetector field in LanguageIdentifier public and use it directly, it identifies TC and SC correctly in all 100 of my tests.
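
One possible explanation, which is an assumption and has not been checked against the source of LanguageIdentifier, is that the detected code is mapped back to a Language by its short code. Since both variants share the short code zh, such a lookup would always return the first match. The sketch below uses hypothetical classes (LangSketch, LookupSketch) to show the difference between the two lookups.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a language object with a short and a full code.
class LangSketch {
  final String shortCode;  // e.g. "zh"
  final String fullCode;   // e.g. "zh-TW"

  LangSketch(String shortCode, String fullCode) {
    this.shortCode = shortCode;
    this.fullCode = fullCode;
  }
}

class LookupSketch {
  static final List<LangSketch> LANGS = Arrays.asList(
      new LangSketch("zh", "zh-CN"),
      new LangSketch("zh", "zh-TW"));

  // Resolving by short code: the first "zh" entry always wins.
  static LangSketch byShortCode(String detected) {
    String shortCode = detected.split("-")[0];
    for (LangSketch lang : LANGS) {
      if (lang.shortCode.equals(shortCode)) {
        return lang;
      }
    }
    return null;
  }

  // Resolving by full code keeps the two variants apart.
  static LangSketch byFullCode(String detected) {
    for (LangSketch lang : LANGS) {
      if (lang.fullCode.equals(detected)) {
        return lang;
      }
    }
    return null;
  }
}
```

If this is what happens, the detector itself can be correct (as the 100 tests suggest) while the returned Language is still always the simplified variant.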

So, should I develop zh-CN and zh-TW separately?

Welcome!

I am uploading it now. Link.
You can also see my code on GitHub. If you download the code from GitHub, you also have to download some extra data. Here is the link. Then unzip it somewhere and, in resources/hanlp.properties, change root=G:/languagetool/languagetool-language-modules/zh/src/main/resources/org/languagetool/resource/ to the location where you unzipped it.

Great, then let’s assume that future versions of LT will tell the users if they have selected the wrong variant or language (without the need for you to implement anything for that). As mentioned, this will also be useful for all other languages. But I cannot tell yet, when I’ll be able to implement that. It’s not that much work, but the UI will also be affected a bit.

May 29th

  • Added an extra feature (rule) that is outside my proposal. Link
    • A specific rule for zh-TW that finds ambiguous words and corrects them.
  • Plan
    • Add more comments to all the code I have written.
    • Move on to the second phase of the work.