Fukumoto Laboratory

Research Topics

Knowledge Acquisition

A. Natural Language Processing
Acquisition of linguistic knowledge is the study of techniques for collecting linguistic or extra-linguistic information from corpora. We are working on ways to extract morphological, syntactic, and semantic information from natural language text by using statistical and machine learning techniques. Some recent topics are:

A.1. Retrieving bilingual verb-noun collocations
Retrieving Japanese and English bilingual verb-noun collocations such as "メダルを獲得する(Medal-wo Kakutokusuru)- earn medal" and "三振する(Sanshinsuru)-get strikeout" from non-parallel corpora.
A.2. Linking and creating bilingual word senses
Linking noun word senses between Japanese and English dictionaries based on sentence-level similarity, and identifying domain-specific senses by using a text corpus with category information.
A.3. Semantic tagging of unknown words
Semantic classification of unknown words, that is, words that are not listed in the thesaurus.

B. Crowdsourcing
Approaches for improving data quality and extracting valuable information from the large amounts of data provided by crowd workers, many of whom are non-experts, covering label, pairwise, text, and unstructured data; methods for ranking objects from matchup and comparison data and for predicting pairwise matchup results; and methods for cost optimization that maximize the volume and quality of data collected by crowdsourcing under a limited budget.
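As an illustration of ranking objects from pairwise comparison data, the following is a minimal sketch that assumes the standard Bradley-Terry model fitted with a simple iterative update. It is not necessarily the method used in the laboratory; the function names `bradley_terry` and `win_probability` and the sample matchups are hypothetical.

```python
def bradley_terry(matches, iterations=200):
    """Estimate item strengths from (winner, loser) pairs.

    Assumes the Bradley-Terry model: P(a beats b) = s_a / (s_a + s_b).
    Returns a dict mapping each item to a strength; strengths sum to 1.
    """
    wins = {}
    pair_counts = {}
    items = set()
    for winner, loser in matches:
        wins[winner] = wins.get(winner, 0) + 1
        pair = frozenset((winner, loser))
        pair_counts[pair] = pair_counts.get(pair, 0) + 1
        items.update((winner, loser))

    # Start from equal strengths and refine them iteratively.
    strength = {item: 1.0 / len(items) for item in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            denom = 0.0
            for j in items:
                n = pair_counts.get(frozenset((i, j)), 0)
                if j != i and n:
                    denom += n / (strength[i] + strength[j])
            # An item that never won ends up with strength 0.
            updated[i] = wins.get(i, 0) / denom if denom > 0 else strength[i]
        total = sum(updated.values())
        strength = {item: value / total for item, value in updated.items()}
    return strength


def win_probability(strength, a, b):
    """Predicted probability that item a beats item b."""
    return strength[a] / (strength[a] + strength[b])


# Hypothetical matchup data: each tuple is (winner, loser).
matches = [
    ("A", "B"), ("A", "B"), ("B", "A"),
    ("B", "C"), ("B", "C"), ("C", "B"),
    ("A", "C"), ("A", "C"), ("C", "A"),
]
strength = bradley_terry(matches)
ranking = sorted(strength, key=strength.get, reverse=True)
print(ranking)
print(win_probability(strength, "A", "C"))
```

The estimated strengths give both a ranking of the objects and a prediction for any pairwise matchup, including pairs that were never compared directly.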
C. Human Computation
Solving complex real-world problems by coordinating humans and computers, i.e., human-in-the-loop problem solving. Examples include interactive machine learning for improving model performance, crowd-based data cleansing, and crowd-based feature engineering.

Applications

A. Text Categorization
Text categorization supports and improves tasks such as automated topic tagging, building topic directories, spam filtering, creating digital libraries, sentiment analysis of user reviews, information retrieval, and even helping users interact with search engines. Much of the previous work on text categorization uses supervised machine learning, in which training texts with category labels are used to train classifiers. Once the category models are trained, each text in the test data is classified by using these models. Moreover, with the growth of big data on the Internet, term selection, categorization techniques based on the hierarchical structure of Internet directories, and more sophisticated machine learning techniques for text categorization have attracted growing interest. Some recent topics in our text categorization work are:

A.1. Learning time difference for text categorization
This work addresses the text categorization problem in which the training data may come from a different time period than the test data.
A.2. Short text categorization
Categorization of short texts such as search snippets, Web page titles, product reviews, and scientific paper titles, maximizing the impact of informative words to compensate for the sparseness of short texts.
A.3. Large scale hierarchical categorization
Classifying a large, heterogeneous collection of web content by using the hierarchical structure of an Internet directory.
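
The supervised setup described at the start of this section (train category models on labeled texts, then classify each test text) can be sketched as follows. This is a minimal illustration that assumes a multinomial Naive Bayes classifier with add-one smoothing and whitespace tokenization; it is not necessarily the classifier used in the work listed above, and the function names and sample texts are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(labeled_texts):
    """Build per-category word counts from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_texts:
        tokens = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def classify(text, model):
    """Return the category with the highest log-probability for the text."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # Log prior of the category.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocabulary:
                continue  # Words never seen in training are ignored.
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training texts with category labels.
training_data = [
    ("the team won the match", "sports"),
    ("the player scored a goal", "sports"),
    ("the company reported higher profit", "business"),
    ("the market price of shares fell", "business"),
]
model = train(training_data)
print(classify("the player won a goal", model))             # sports
print(classify("shares and profit of the company", model))  # business
```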

B. Text Summarization

B.1. Multi-document summarization
This work focuses on continuous news documents and presents a method for extractive multi-document summarization.

C. Recommendation
- C.1. Incorporating guest preferences into collaborative filtering for hotel recommendation
  Hotel recommendation that incorporates different aspects of a product/hotel to improve the quality of the score.
- C.2. Job recommendation for recruiting candidates
  Developing a recommendation system to assist candidates in finding jobs that best fit their individual preferences.
D. Prediction
Predicting future behaviour by analyzing past data and building a model.
- D.1. Prediction of a company's future prospects
  Predicting a company's future R&D prospects in a business area by using publication statistics, such as the frequency of scientific papers and open patents published over time, sentiment analysis to extract positive news reports related to the company, and estimated prediction models.
E. Information Retrieval
Scalable term indexing approaches for large amounts of data; search models that provide high-quality results matching user intention; learning-to-rank approaches for ranking search results based on multiple kinds of information; and evaluation methods for search results.
F. Question Answering
To generate correct answers to questions: knowledge extraction methods for large amounts of question-answer pair data collected by crowdsourcing; automatic classification methods for the reasoning patterns used in answer generation; and answer generation methods based on both the question content and the question context.