Natural Language Processing Group Fukumoto Laboratory
University of Yamanashi


Research Topics

  • 1. Linguistic Knowledge Acquisition

     Acquisition of linguistic knowledge is the study of techniques for the collection of linguistic or extra-linguistic information from corpora. We are working on ways to extract morphological, syntactic, and semantic information in natural language by using statistics and machine learning techniques. Some recent topics are:

  • Retrieving bilingual verb-noun collocations

     Retrieving Japanese and English bilingual verb-noun collocations such as "メダルを獲得する (Medal-wo Kakutokusuru) - earn medal" and "三振する (Sanshinsuru) - get strikeout" from non-parallel corpora.

  • Linking and creating bilingual word senses

     Linking noun word senses between Japanese and English dictionaries based on sentence-level similarity, and identifying domain-specific senses by using a textual corpus with category information.

  • Semantic tagging of unknown words

     Semantic classification of unknown words, that is, words that are not listed in the thesaurus.

  • 2. Text Categorization

     Text categorization supports and improves several tasks such as automated topic tagging, building topic directories, spam filtering, creating digital libraries, sentiment analysis of user reviews, information retrieval, and even helping users interact with search engines. Much of the previous work on text categorization uses supervised machine learning techniques, in which classifiers are trained on texts with category labels. Once the category models are trained, each text in the test data is classified by using these models. Moreover, with the growth of big data on the Internet, term selection, categorization techniques based on the hierarchical structure of Internet directories, and more sophisticated machine learning techniques for text categorization have attracted growing interest. Some recent topics related to our text categorization work are:

  • Text categorization with relatively small positive documents and unlabeled data

     This work addresses the problem of obtaining a collection of negative training documents that is suitable for a relatively small number of positive documents, and presents a method, based on supervised machine learning techniques, that eliminates the need to collect negative training documents manually.

  • Learning time difference for text categorization

     This work addresses the text categorization problem in which the training data may come from a different time period than the test data.

  • Short text categorization

     Categorization of short texts such as search snippets, Web page titles, product reviews, and scientific paper titles. Because short texts are sparse, the approach maximizes the impact of informative words.

  • Large scale hierarchical categorization

     Classifying a large, heterogeneous collection of web content by using the hierarchical structure of an Internet directory.
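
The supervised workflow described in this section (train category models on labeled texts, then classify each test text with those models) can be illustrated with a minimal sketch. It uses a multinomial Naive Bayes classifier with add-one smoothing written in plain Python. The classifier choice, the function names `train` and `classify`, and the toy training texts are all assumptions made for illustration; they are not the methods or data used in the work listed above.

```python
import math
from collections import Counter, defaultdict


def train(labeled_texts):
    """Collect per-category word counts from (text, category) pairs."""
    word_counts = defaultdict(Counter)   # category -> word frequencies
    doc_counts = Counter()               # category -> number of training texts
    for text, category in labeled_texts:
        doc_counts[category] += 1
        word_counts[category].update(text.lower().split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Return the category with the highest smoothed log-probability."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, float("-inf")
    for category in doc_counts:
        # Log prior of the category.
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothed word likelihood.
            count = word_counts[category][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Toy training data (hypothetical, for illustration only).
training_texts = [
    ("the team won the match", "sports"),
    ("the player scored a goal", "sports"),
    ("the company reported higher profit", "business"),
    ("shares of the company fell", "business"),
]
model = train(training_texts)

print(classify("the player won", model))
print(classify("company profit", model))
```

With so few training texts the result depends entirely on word overlap, which is also why short or sparse texts are hard to categorize.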

  • 3. Text Summarization

  • Multi-document summarization

     This work focuses on continuous news documents and presents a method for extractive multi-document summarization.

  • 4. Recommendation

  • Incorporating guest preferences into collaborative filtering for hotel recommendation

     Hotel recommendation that incorporates different aspects of a product or hotel to improve the quality of the score.

  • Job recommendation for recruiting candidates

     Developing a recommendation system to assist candidates in finding jobs that best fit their individual preferences.
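
As background for the collaborative filtering mentioned above, the following is a minimal sketch of a plain user-based baseline: a missing rating is predicted as a similarity-weighted average of other guests' ratings. It does not incorporate guest preferences or hotel aspects, so it is not the method described above. The cosine similarity over co-rated items, the function names, and the toy ratings are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in shared))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict(ratings, user, item):
    """Predict a rating as the similarity-weighted mean of neighbours' ratings."""
    numerator = denominator = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue  # only users who rated the item can contribute
        similarity = cosine(ratings[user], other_ratings)
        numerator += similarity * other_ratings[item]
        denominator += abs(similarity)
    # None signals that no neighbour has rated the item.
    return numerator / denominator if denominator else None


# Toy ratings on a 1-5 scale (hypothetical, for illustration only).
ratings = {
    "guest_a": {"hotel_1": 5, "hotel_2": 4},
    "guest_b": {"hotel_1": 5, "hotel_2": 4, "hotel_3": 5},
    "guest_c": {"hotel_1": 1, "hotel_2": 2, "hotel_3": 1},
}

predicted = predict(ratings, "guest_a", "hotel_3")
print(predicted)
```

Because all ratings are positive, cosine similarity stays high even between guests with opposite tastes; mean-centering the ratings first is a common refinement.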

  • 5. Prediction

     Predicting future behaviour by analysing past data and building a model.

  • Prediction of a company's future prospects

     Predicting a company's future prospects for R&D in a business area by estimating prediction models from time-series publication statistics, such as the frequency of scientific papers and open patents, and from sentiment analysis that extracts positive news reports related to the company.