LT and GSoC 2018 - looking for students

oserikov · February 20, 2018, 1:07pm

Hello, my name is Oleg and I hope to participate in GSoC this year.
I’ve read the ideas list and am really interested in improving the spell checker: I have practiced NLP and developed spelling correction tools, and I am experienced with Elasticsearch (which I would suggest using in the server-based spell checker).
Also, when thinking about compressing the data when (or if) the task moves to the client side, the first idea that comes to mind is using data structures such as suffix trees to keep the dictionaries small.
Now I’m going to find a Java-related bug in the issues list and fix it, but I am happy to discuss the ideas above.

dnaber · February 20, 2018, 1:13pm

Hi Oleg, thanks for your interest in LT. We do already use Lucene in LT, so maybe we can avoid the complexity of ElasticSearch.

dnaber · February 21, 2018, 9:37am

Hi Christian, thanks for your interest in LT. For GSoC, you’ll need to write an application that describes what you want to do and what the timeline is. You can get some ideas from the wiki, but you’re encouraged to come up with your own ideas. Whether you have your own idea or select a task from the wiki, it should first be discussed here. We also expect you to provide pull requests or prototype code before you hand in the application.

oserikov · February 21, 2018, 3:40pm

I thought there was not enough data for an ML-based spelling corrector, which is why I wanted to use an algorithmic approach based on the ES suggester. Having read the updated task description, I see that LanguageTool has enough data to train a model.
Can you point me to the spell checker’s entry point in the code? Do the plugins or add-ons currently do spell checking on their side? The task description earlier mentioned server-only spell checking. How are the app’s functions divided between the client and the server?
With a trained model, we would be able to use it on the client side, but it’s important to find a way to update the clients’ copies of the models.

dnaber · February 21, 2018, 11:25pm

Here’s where the suggestions are collected.

No, it’s completely server-side and has to stay that way to prevent massive code duplication.

The server does all the checking. The client just displays the results, in some cases filtering spelling errors which the user has added to their personal dictionary.
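
To illustrate the client-side filtering, here is a minimal sketch, not the code of any actual add-on. It assumes each match is a dict with `offset`, `length`, and a `rule` entry whose `issueType` is `"misspelling"` for spelling errors; the function name and the case-insensitive comparison are assumptions made for the example.

```python
def filter_personal_dictionary(text, matches, personal_words):
    """Drop spelling matches whose flagged word is in the user's dictionary.

    `matches` is assumed to be a list of dicts with "offset", "length",
    and "rule" -> "issueType" keys (an assumption for this sketch).
    """
    # Assumption: personal dictionary lookups ignore case.
    known = {word.lower() for word in personal_words}
    kept = []
    for match in matches:
        start = match["offset"]
        flagged = text[start:start + match["length"]]
        is_spelling = match.get("rule", {}).get("issueType") == "misspelling"
        if is_spelling and flagged.lower() in known:
            continue  # the user has accepted this word, so hide the match
        kept.append(match)  # everything else is displayed unchanged
    return kept
```

Non-spelling matches always pass through, so a word in the personal dictionary only silences the spell checker, not other rules.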

Maybe there’s some confusion here: if we have small data, users can download it and use LT offline. That’s a good thing. But when you mention client/server above, “client” to me means “add-on in some software that uses the HTTP API”, not “stand-alone version of LT”.

drex · February 22, 2018, 6:17am

Hey dnaber,
So I extracted 26,046 sentences containing the words “your” or “you’re” and changed random occurrences of “your” to “you’re” and vice versa. You can find it here.
Next step: train a model and see if it works for this confusion pair?
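
For anyone who wants to reproduce this kind of data, the swapping step can be sketched roughly as follows. This is not the script used for the data set above; the function names, the `rate` parameter, and the 0/1 labels are assumptions for illustration, and only straight apostrophes are handled.

```python
import random
import re


def swap_confusion_pair(sentence, pair=("your", "you're"), rate=0.5, rng=None):
    """Return (new_sentence, changed) with some pair members swapped."""
    rng = rng or random.Random()
    first, second = pair
    swap = {first.lower(): second, second.lower(): first}
    # Longest alternative first; the lookarounds keep "yours" untouched.
    words = sorted(pair, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![\w'])(" + "|".join(re.escape(w) for w in words) + r")(?![\w'])",
        re.IGNORECASE,
    )
    changed = False

    def replace(match):
        nonlocal changed
        word = match.group(0)
        if rng.random() >= rate:
            return word  # leave this occurrence as it is
        changed = True
        other = swap[word.lower()]
        # Keep a leading capital, e.g. at the start of a sentence.
        return other.capitalize() if word[0].isupper() else other

    return pattern.sub(replace, sentence), changed


def make_dataset(sentences, pair=("your", "you're"), rate=0.5, seed=0):
    """Label each sentence 1 if an error was injected, otherwise 0."""
    rng = random.Random(seed)
    data = []
    for sentence in sentences:
        new_sentence, changed = swap_confusion_pair(sentence, pair, rate, rng)
        data.append((new_sentence, 1 if changed else 0))
    return data
```

Passing a different `pair` gives the same kind of data for other confusion pairs, and a fixed `seed` keeps the output reproducible.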

dnaber · February 22, 2018, 8:24am

I think so.

arselyne · February 24, 2018, 4:59pm

Hello, my name is Arselyne, and I am interested in adding a new language, Haitian Creole, to LanguageTool. However, I am not sure if that can stand alone as a GSoC project. As part of establishing the new language, I would be writing rules and working on spell checker suggestions. Would that be enough? Or are there other things I should be thinking about? Thanks in advance!

noob_rick · February 25, 2018, 5:06am

Hi LT,
I’m Vasudev Singh, a junior-year undergrad, and I’m interested in contributing to the “Extended AI approach”. I have some prior experience working with seq2seq models and would like to know how to get started with the problem.

noob_rick · February 25, 2018, 5:13am

It would be great if you could provide a few insights on the “Integrate a dependency parser” project as well.
Thanks

dnaber · February 25, 2018, 9:43am

Hi Vasudev, thanks for your interest in LT. Please have a look at this complete thread - you could try generating artificial errors for one pair (e.g. your and you're) and see if you manage to develop a seq2seq approach to tell apart the errors from the correct sentences.

dnaber · February 25, 2018, 9:46am

Hi Arselyne, welcome to LT. Indeed, “just” adding a language is a nice (and no small) task, but probably not enough for GSoC. Did you have a look at Missing Features - LanguageTool Wiki?

oserikov · February 25, 2018, 2:02pm

Should I create a new topic to continue the discussion?

dnaber · February 25, 2018, 3:20pm

If the discussion gets very specific, it’s maybe best to open a new topic, yes. Feel free to do that.

noob_rick · February 25, 2018, 4:36pm

So you basically want a seq2seq-based classifier that differentiates between correct and incorrect sentences in this case?

dnaber · February 25, 2018, 4:54pm

We want to detect when users use your when they should be using you're and vice versa. How to implement this is basically up to you. GSoC is not a static TODO list where students implement our wishes - it’s up to you to come up with suggestions and ideas about how to improve LT. Of course, any system that detects the your/you’re issue should be generic enough to work for other pairs too, when properly trained.
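
To make “generic enough to work for other pairs” concrete, here is a deliberately simple count-based baseline that takes the pair as a parameter. It is a stand-in for illustration, not a seq2seq model and not how LT is implemented; the class name, the tokenizer, and the one-word context window are all assumptions.

```python
import re
from collections import Counter

# Assumption: lowercase words with straight apostrophes are enough here.
TOKEN = re.compile(r"[a-z']+")


class ContextBaseline:
    """Score each member of a confusion pair by its neighbouring words."""

    def __init__(self, pair):
        self.pair = tuple(word.lower() for word in pair)
        self.counts = {word: Counter() for word in self.pair}

    def _contexts(self, sentence):
        tokens = TOKEN.findall(sentence.lower())
        for i, token in enumerate(tokens):
            if token in self.pair:
                prev_word = tokens[i - 1] if i > 0 else "<s>"
                next_word = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
                yield token, prev_word, next_word

    def train(self, correct_sentences):
        """Count left and right neighbours seen in correct text."""
        for sentence in correct_sentences:
            for token, prev_word, next_word in self._contexts(sentence):
                self.counts[token][("prev", prev_word)] += 1
                self.counts[token][("next", next_word)] += 1

    def suggest(self, sentence):
        """Return (used, better) for each occurrence the counts disagree with."""
        suggestions = []
        for token, prev_word, next_word in self._contexts(sentence):
            scores = {
                word: self.counts[word][("prev", prev_word)]
                + self.counts[word][("next", next_word)]
                for word in self.pair
            }
            best = max(self.pair, key=scores.get)
            if best != token and scores[best] > scores[token]:
                suggestions.append((token, best))
        return suggestions
```

Nothing in the class is specific to your/you’re: constructing it with another pair and training it on correct sentences for that pair is all that changes, which is the property any stronger model should keep.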

noob_rick · February 25, 2018, 5:03pm

OK, I get your point. I’ll train a model for this and let you know the result and my approach.

Hsankesara · February 26, 2018, 6:06am

Hey all,
As we need a deep learning model for homonym detection and correction, I went through some of the recent research papers in this field. I found some very good deep network architectures, and I’d like to create a deep learning architecture that handles homonym detection as well as grammar error detection with respect to the context of the sentence. The architecture will use a recurrent neural network to perform sequence-to-sequence mapping from erroneous to well-formed sentences. Additionally, it will rely on a post-processing step based on statistical word-based translation models to replace out-of-vocabulary words. It will use LSTM cells in the hidden layers and, depending on the size of the dataset, consist of 2-4 hidden layers. Please suggest some ideas for this model.

dnaber · February 26, 2018, 9:52am

Everyone interested in GSoC, please check out https://languagetool.org/gsoc2018/, I’ve updated that page with some details about how to apply (application period will be 2018-03-12 to 2018-03-27).

t0iiz · February 27, 2018, 2:28pm

Hi,
I am an undergraduate student in China. My name is Ze Dang. I am good at Chinese and Japanese.
I am interested in the idea “Take an orphaned language and make it state of the art”. I also have some experience in AI and Java. However, this is the first time I am participating in GSoC. Can you give me some advice?
Thank you.