MICE: Mining Idioms with Contextual Embeddings

Authors:

Abstract

Idiomatic expressions can be problematic for natural language processing applications because their meaning cannot be inferred from their constituent words. A lack of successful methodological approaches and of sufficiently large datasets has hindered the development of machine learning approaches for detecting idioms, especially for expressions that do not occur in the training set. We present MICE, an approach that uses contextual embeddings for this purpose. We introduce a new dataset of multi-word expressions with literal and idiomatic meanings and use it to train a classifier based on two state-of-the-art contextual word embeddings: ELMo and BERT. We show that deep neural networks using both embeddings perform much better than existing approaches and can detect idiomatic word use, even for expressions that were not present in the training set. We also demonstrate cross-lingual transfer of the developed models and analyze the size of the dataset required.

Keywords: Machine learning, Natural language processing, Idiomatic expressions, Word embeddings, Contextual embeddings, Cross-lingual transfer

Review history: Received 8 August 2020, Revised 13 October 2021, Accepted 14 October 2021, Available online 19 October 2021, Version of Record 28 October 2021.

Paper URL: https://doi.org/10.1016/j.knosys.2021.107606