Multilingual POS tagging by a composite deep architecture based on character-level features and on-the-fly enriched Word Embeddings

Abstract

The field of Natural Language Processing (NLP) is benefiting greatly from models and methodologies adopted from Artificial Intelligence. In particular, Part-Of-Speech (POS) tagging is a building block for many NLP applications. This paper proposes a POS tagging system based on a deep neural network. The system uses a static, task-independent pre-trained model that represents word semantics enriched with morphological information. This model approximates the Word Embedding representation learned by the fastText model from an unlabelled corpus, so that common, known words and rare or Out-of-Vocabulary words are handled consistently. A character-level representation of words is learned dynamically for the POS tagging task and concatenated to the Word Embedding. This joint representation is fed to the main network, which comprises a Bi-LSTM layer and is trained to associate a sequence of tags with a sequence of words. The effectiveness of the proposed system's contributions with respect to the state of the art is demonstrated by an extensive experimental campaign, which provides evidence that POS tagging accuracy improves when Word Embeddings are enriched with morphological information, when embeddings are estimated for both known and unknown words, and when Word Embeddings are concatenated with character-level information of the same size. Similar trends are obtained for two languages with different characteristics, namely English and Italian: in both cases, the overall accuracy on the POS tagging test set increased with respect to the most advanced existing systems, with particular improvements in accuracy on Out-of-Vocabulary words. Finally, the method rests on a general basis and could be used effectively for all languages, particularly those with rich morphology.
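As an illustration of the input representation described in the abstract, the following minimal sketch shows how a fastText-style subword scheme can assign an embedding to any word, including an Out-of-Vocabulary one, and how that vector can be concatenated with a character-level vector of the same size. It is not the paper's implementation. All function names, the dimension of 8, the n-gram range of 3 to 6, and the hash-seeded pseudo-random vectors (standing in for trained n-gram embeddings) are assumptions made for illustration, and the character-level vector here is a zero placeholder for what the paper learns during tagger training.

```python
import hashlib
import random


def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams


def ngram_vector(ngram, dim):
    """Stand-in for a trained n-gram embedding: a fixed pseudo-random vector.

    Assumption: a real model would look this vector up in a learned table.
    """
    digest = hashlib.sha256(ngram.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def word_vector(word, dim=8):
    """Average the n-gram vectors, so every string receives an embedding."""
    grams = char_ngrams(word) or [f"<{word}>"]  # fallback for the empty string
    vectors = [ngram_vector(g, dim) for g in grams]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def joint_representation(word_vec, char_vec):
    """Concatenate the word-level and character-level vectors."""
    return word_vec + char_vec


# Hypothetical Out-of-Vocabulary word: it still receives a word vector.
w = word_vector("untaggable", dim=8)
c = [0.0] * 8  # placeholder for a learned character-level vector
x = joint_representation(w, c)
print(len(x))  # 16
```

In a full tagger, one such joint vector per token would form the input sequence of the Bi-LSTM layer. Because related word forms share many n-grams, their averaged vectors overlap, which is one plausible reason subword information helps with morphologically rich languages.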

Keywords: NLP, POS tagging, Deep neural networks, Bi-LSTM, Out of Vocabulary

Review history: Received 10 August 2018, Revised 28 October 2018, Accepted 1 November 2018, Available online 7 November 2018, Version of Record 19 December 2018.

Official paper URL: https://doi.org/10.1016/j.knosys.2018.11.003