Constructing a paraphrase database for agglutinative languages

Abstract

Paraphrase databases (PPDBs) are valuable resources for applications that use natural language processing (NLP) technology. To construct a high-quality PPDB for agglutinative languages, we propose a phrasal paraphrase extraction method, the affix modification-based bilingual pivoting method (AMBPM). AMBPM is suitable for agglutinative languages because it addresses two problems: lexical data sparsity and the failure to consider morphological word structure. In addition, we propose “improved AMBPM,” which uses rule-based filtering to address the extraction of incorrect stem paraphrase pairs caused by low semantic content stems (LSCSs). In our experiments on AMBPM, we evaluate AMBPM and compare it with two state-of-the-art paraphrase extraction methods: the syntactic constraints-based bilingual pivoting method (SCBPM) and a word embedding method. In the experiments on improved AMBPM, we evaluate our method and compare the resulting PPDB with four databases: a PPDB constructed using the original AMBPM, two PPDBs constructed using word-embedding-based methods (stem embedding and phrase embedding), and an existing thesaurus. The comparison is performed using two NLP applications: sentential paraphrase generation and a question answering (QA) system. The experimental results demonstrate that AMBPM outperforms the state-of-the-art paraphrase extraction methods and that the rule-based filtering in improved AMBPM significantly improves on AMBPM. Moreover, although only a small amount of training data was used, with no aid from linguistic resources, the PPDB constructed with improved AMBPM is more useful than the four other databases for the agglutinative language used in our study. We have also publicly released the Korean PPDB constructed using improved AMBPM.