Social Media Understanding
Text from social media is a key source of information for understanding social movements. However, social media text is typically short and concise, with many words omitted. Our task is to identify keywords that properly represent the content of the messages under study. Instead of training a keyword-extraction model directly on Twitter messages, we propose a new method that fine-tunes a model trained on known documents containing richer contextual information. We conducted an experiment on Twitter messages and visualized the results as a word-cloud timeline, which shows promising results.
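As a minimal sketch of this idea (with toy data, not the actual method or corpus), term statistics can be estimated from a richer background document collection and then applied to score keywords in short messages, since the messages themselves are too short to estimate such statistics reliably:

```python
from collections import Counter
from math import log

# Hypothetical background corpus with richer context than the tweets.
background_docs = [
    "flood warning issued for the river basin after heavy rain",
    "heavy rain causes traffic delays across the city",
    "city council discusses new traffic regulations",
]

def tokenize(text):
    return text.lower().split()

# Document frequencies learned from the background corpus.
df = Counter()
for doc in background_docs:
    df.update(set(tokenize(doc)))
n_docs = len(background_docs)

def keywords(text, top_k=3):
    """Rank words of a short message by TF-IDF, using background IDF."""
    tf = Counter(tokenize(text))
    # Only score words known to the background corpus.
    scores = {w: c * log(n_docs / df[w])
              for w, c in tf.items() if df[w] > 0}
    return [w for w, _ in sorted(scores.items(),
                                 key=lambda x: -x[1])[:top_k]]

print(keywords("heavy rain again, flood on my street"))
```

The word occurring in fewer background documents ("flood") scores highest, which is the behavior a keyword extractor wants on short, context-poor messages.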
Asian WordNet (http://www.asianwordnet.org)
WordNet is widely used in NLP research because of its important feature of computability. Each word in WordNet is expressed as a set of synonymous words called a synset, and the synsets are defined in a semantic relational structure with respect to each other. The original WordNet was created for the English language. Since the words in WordNet are defined by sets of words, it is reasonable to generate a WordNet for another language by translating each word in the synsets. We proposed an algorithm that disambiguates a word by considering its synonyms and their English translations. The Asian WordNet (AWN) was therefore generated for many Asian languages by using each language's local English dictionary.
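The disambiguation idea can be sketched as follows (toy, romanized dictionary entries; not the actual AWN algorithm details): each English member of a synset is looked up in a local English-to-target-language dictionary, and the candidate translation supported by the most synonyms is selected for the synset.

```python
from collections import Counter

# Hypothetical English-to-target-language dictionary (romanized).
dictionary = {
    "car":        ["rot", "rotyon"],
    "automobile": ["rotyon"],
    "auto":       ["rotyon"],
    "machine":    ["khrueang", "rotyon"],
}

def translate_synset(synset):
    """Pick the translation shared by the most members of a synset."""
    votes = Counter()
    for word in synset:
        for target in dictionary.get(word, []):
            votes[target] += 1
    # The candidate agreed on by the most synonyms wins;
    # translations of an ambiguous word that fit only one sense lose.
    return votes.most_common(1)[0][0] if votes else None

print(translate_synset(["car", "automobile", "auto", "machine"]))
```

Here "rotyon" wins because it is shared across all four synonyms, while sense-specific translations such as "khrueang" receive only one vote.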
Digitized Thailand (http://www.digitized-thailand.org)
The aim of the project is to create a national digital platform. We realized that many databases and applications cannot easily be shared or connected to provide more highly integrated solutions. Initially, the project stimulated data digitization and application development following a provided API standard. Through the Digitized Thailand initiative, we created large and useful databases, such as cultural information and language corpora; at the same time, many research algorithms, such as word segmentation, keyword extraction, and information extraction, have been developed and provided as services via the standard API.
In the current statistical approach to NLP research, collections of language resources are crucial for generating language models. Many types of language resources can be prepared depending on the purpose of the study: part-of-speech (POS) tagged corpora, bracketed corpora (syntactically and/or semantically annotated corpora, or parse-tree corpora), parallel translated corpora, speech corpora, and several types of lexicons. To generate such language resources, many issues must be overcome, such as annotation consistency, word/phrase-level alignment, multi-level annotation, and standardization, since a large amount of language data must be handled.
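As one concrete illustration of what such resources look like and why consistency matters, the sketch below parses a line in the common slash-delimited word/POS convention and checks every tag against an assumed tagset, the kind of validation needed when annotating language data at scale:

```python
# A tagged-corpus line in the common word/POS slash convention.
line = "The/DT quick/JJ fox/NN jumps/VBZ ./."

def parse_tagged(line):
    """Split a slash-delimited tagged line into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # rpartition keeps words that themselves contain "/" intact.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs

# Hypothetical tagset for the consistency check.
tagset = {"DT", "JJ", "NN", "VBZ", "."}
pairs = parse_tagged(line)
assert all(tag in tagset for _, tag in pairs)  # annotation consistency
print(pairs)
```

Similar validation passes (alignment checks for parallel corpora, schema checks for multi-level annotation) are what make large-scale resource construction manageable.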