Knowledge Processing Group Fukumoto and Li Laboratory
University of Yamanashi

 


Research Topics

  • 1. Knowledge Acquisition

  • A. Natural Language Processing

     Acquisition of linguistic knowledge is the study of techniques for collecting linguistic or extra-linguistic information from corpora. We are working on ways to extract morphological, syntactic, and semantic information from natural language text by using statistical and machine learning techniques. Some recent topics are:

  • A.1. Retrieving bilingual verb-noun collocations

     Retrieving Japanese and English bilingual verb-noun collocations such as "メダルを獲得する(Medal-wo Kakutokusuru)- earn medal" and "三振する(Sanshinsuru)-get strikeout" from non-parallel corpora.

  • A.2. Linking and creating bilingual word senses

     Linking noun word senses between Japanese and English dictionaries based on sentence-based similarity, and deriving domain-specific senses by using a textual corpus with category information.

  • A.3. Semantic tagging of unknown words

     Semantic classification of unknown words, i.e., words that are not listed in the thesaurus.

  • B. Crowdsourcing

     Approaches for improving data quality and extracting valuable information from the large amounts of data provided by crowd workers, many of whom are non-experts, covering label, pairwise, text, and unstructured data, among others; methods for ranking objects from matchup and comparison data and for predicting pairwise matchup results; and cost optimization methods that maximize the volume and quality of data collected by crowdsourcing under a limited budget.
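
     As an illustration of ranking objects from comparison data, the following is a minimal sketch that assumes the Bradley-Terry model, a standard choice for pairwise data and not necessarily the method used in this research. The function names and the toy matchup records are hypothetical.

```python
from collections import defaultdict


def bradley_terry(matches, iterations=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Uses the classic minorization-maximization update. This is an
    assumed, illustrative model, not the laboratory's own method.
    """
    wins = defaultdict(int)
    pair_counts = defaultdict(int)
    items = set()
    for winner, loser in matches:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        items.update((winner, loser))
    if not items:
        return {}

    strength = {item: 1.0 for item in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            denom = 0.0
            for j in items:
                if j == i:
                    continue
                n = pair_counts.get(frozenset((i, j)), 0)
                if n:
                    denom += n / (strength[i] + strength[j])
            updated[i] = wins[i] / denom if denom else strength[i]
        # Rescale so that the strengths average to 1.
        total = sum(updated.values())
        strength = {i: s * len(items) / total for i, s in updated.items()}
    return strength


def win_probability(strength, a, b):
    """Predicted probability that a beats b under the fitted model."""
    return strength[a] / (strength[a] + strength[b])


# Hypothetical matchup data: each tuple is (winner, loser).
matches = [
    ("A", "B"), ("A", "B"), ("B", "A"),
    ("B", "C"), ("B", "C"), ("C", "B"),
    ("A", "C"), ("A", "C"), ("C", "A"),
]
strength = bradley_terry(matches)
ranking = sorted(strength, key=strength.get, reverse=True)
```

     Here `ranking` orders the objects by estimated strength, and `win_probability` gives a prediction for a single pairwise matchup.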

  • C. Human Computation

     Solving complex real-world problems by coordinating humans and computers, i.e., human-in-the-loop problem solving. Examples include interactive machine learning for improving model performance, crowd-based data cleansing, and crowd-based feature engineering.

  • 2. Applications

  • A. Text Categorization

    Text categorization supports and improves several tasks such as automated topic tagging, building topic directories, spam filtering, creating digital libraries, sentiment analysis of user reviews, information retrieval, and even helping users interact with search engines. Much of the previous work on text categorization uses supervised machine learning techniques, in which training texts with category labels are used to train classifiers. Once the category models are trained, each text in the test data is classified by using these models. Moreover, with the growth of big data on the Internet, term selection, categorization techniques based on the hierarchical structure of Internet directories, and more sophisticated machine learning techniques for text categorization have attracted growing interest. Some recent topics related to our text categorization work are:

  • A.1. Learning time difference for text categorization

     This work addresses the text categorization problem in which the training data may come from a different time period than the test data.

  • A.2. Short text categorization

     Categorization of short texts such as search snippets, Web page titles, product reviews, and scientific paper titles, maximizing the impact of informative words to compensate for the sparseness caused by short text length.

  • A.3. Large scale hierarchical categorization

    Classifying a large, heterogeneous collection of web content by using the hierarchical structure of an Internet directory.
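
     The supervised workflow described for text categorization (train category models on labeled texts, then classify test texts) can be sketched as follows. This is a minimal illustration that assumes a multinomial naive Bayes classifier with add-one smoothing and whitespace tokenization; the classifier choice, the function names, and the toy data are assumptions, not the laboratory's method.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_texts):
    """Train a multinomial naive Bayes model from (text, category) pairs."""
    doc_counts = Counter()              # number of training texts per category
    word_counts = defaultdict(Counter)  # word frequencies per category
    vocabulary = set()
    for text, category in labeled_texts:
        words = text.lower().split()
        doc_counts[category] += 1
        word_counts[category].update(words)
        vocabulary.update(words)
    return doc_counts, word_counts, vocabulary


def classify(model, text):
    """Return the category with the highest log-posterior score."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in sorted(doc_counts):
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            count = word_counts[category][word]
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical training texts with category labels.
training = [
    ("cheap pills buy now", "spam"),
    ("win money now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project report and agenda", "ham"),
]
model = train_naive_bayes(training)
prediction = classify(model, "monday project agenda")
```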

  • B. Text Summarization

  • B.1. Multi-document summarization

     This work focuses on continuous news documents and presents a method for extractive multi-document summarization.

  • C. Recommendation

  • C.1. Incorporating guest preferences into collaborative filtering for hotel recommendation

     Hotel recommendation that incorporates guest preferences for different aspects of a hotel to improve the quality of the score.

  • C.2. Job recommendation for recruiting candidates

     Developing a recommendation system to assist candidates in finding jobs that best fit their individual preferences.

  • D. Prediction

     Predicting future behaviour by analyzing past data and building a model.

  • D.1. Prediction of a company's future prospects

     Predicting a company's future R&D prospects in a business area by using time-series publication statistics, such as the frequency of published scientific papers and open patents, together with sentiment analysis to extract positive news reports related to the company, and estimating prediction models.

  • E. Information Retrieval

    Scalable term indexing approaches for large amounts of data; search models that provide high-quality results matching user intent; learning-to-rank approaches for ranking search results based on multiple kinds of information; and evaluation methods for search results.
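
    As a small illustration of term indexing and ranked search, the sketch below builds an inverted index and scores documents with a simple TF-IDF weighting. The scoring formula, the function names, and the sample documents are assumptions chosen for brevity, not the group's actual indexing or search models.

```python
import math
from collections import Counter, defaultdict


def build_index(documents):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term, freq in Counter(text.lower().split()).items():
            index[term][doc_id] = freq
    return index


def search(index, num_docs, query):
    """Rank documents by the sum of TF-IDF weights of the query terms."""
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur in any document
        # Rarer terms get a larger weight (assumed smoothed IDF).
        idf = math.log(1 + num_docs / len(postings))
        for doc_id, freq in postings.items():
            scores[doc_id] += freq * idf
    # Highest score first; ties are broken by document id.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical document collection.
documents = {
    "d1": "text categorization with machine learning",
    "d2": "learning to rank search results",
    "d3": "crowdsourcing data quality",
}
index = build_index(documents)
results = search(index, len(documents), "learning to rank")
```

    Only the postings of the query terms are visited, which is what lets an inverted index scale to large collections.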

  • F. Question Answering

     For generating correct answers to questions: methods for extracting knowledge from large amounts of question-answer pair data collected by crowdsourcing; methods for automatically classifying the reasoning patterns used in answer generation; and methods for generating answers based on both the question content and the question context.