Unsupervised dialectal neural machine translation

Abstract:

In this paper, we present the first work on unsupervised dialectal Neural Machine Translation (NMT), where the source dialect is not represented in the parallel training corpus. Two systems are proposed for this problem. The first is the Dialectal to Standard Language Translation (D2SLT) system, which is based on the standard attentional sequence-to-sequence model and introduces two novel ideas that leverage similarities among dialects: using common words as anchor points when learning word embeddings, and a decoder scoring mechanism based on cosine similarity and language models. The second system is based on the celebrated Google NMT (GNMT) system. We first evaluate these systems in a supervised setting (where training and testing are done on our parallel corpus of Jordanian dialect and Modern Standard Arabic (MSA)) before moving to the unsupervised setting (where we train each system once on a Saudi-MSA parallel corpus and once on an Egyptian-MSA parallel corpus, and test both on the Jordanian-MSA parallel corpus). The highest BLEU score obtained in the unsupervised setting is 32.14 (by D2SLT trained on the Saudi-MSA data), which is remarkably high given that the highest BLEU score obtained in the supervised setting is 48.25.
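The abstract describes the D2SLT decoder's scoring mechanism only at a high level. The Python sketch below is one plausible reading, assuming the regression-based decoder emits a target-side embedding vector at each step and each candidate word is scored by linearly interpolating cosine similarity to that vector with a language-model probability; the function names, the weight `alpha`, and the linear interpolation itself are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity; the small epsilon guards against zero-norm vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def score_candidates(predicted_vec, embeddings, lm_probs, alpha=0.5):
    """Rank vocabulary words for one decoding step.

    predicted_vec -- embedding vector emitted by the regression-based decoder
    embeddings    -- dict: word -> embedding vector (shared across dialects)
    lm_probs      -- dict: word -> language-model probability in context
    alpha         -- interpolation weight (hypothetical; the paper does not
                     specify this exact combination)
    """
    scores = {
        word: alpha * cosine_similarity(predicted_vec, vec)
              + (1.0 - alpha) * lm_probs.get(word, 0.0)
        for word, vec in embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Toy usage: three candidate words with 4-dimensional embeddings.
rng = np.random.default_rng(0)
vocab = {w: rng.standard_normal(4) for w in ["كتاب", "قلم", "بيت"]}
lm = {"كتاب": 0.6, "قلم": 0.3, "بيت": 0.1}
predicted = vocab["كتاب"] + 0.1 * rng.standard_normal(4)
print(score_candidates(predicted, vocab, lm)[0])  # expected: "كتاب"
```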

Keywords: Neural machine translation, Unsupervised dialectal translation, Regression-based decoding, Shared embedding

Article history: Received 1 November 2018, Revised 19 September 2019, Accepted 11 December 2019, Available online 3 January 2020, Version of Record 3 January 2020.

DOI: https://doi.org/10.1016/j.ipm.2019.102181