CJE-TIG: Zero-shot cross-lingual text-to-image generation by Corpora-based Joint Encoding

Authors:

Highlights:

Abstract

Recently, text-to-image (T2I) generation has advanced considerably through improvements in synthesis authenticity, text consistency, and generation diversity. However, the large amount of paired image–text data required restricts the generalization of synthesis models to the language they were pre-trained on. In this paper, a cross-lingual pre-training method is proposed to adapt a target low-resource language to pre-trained generative models. To the best of our knowledge, this is the first time that arbitrary input languages can access T2I generation. The proposed joint encoding scheme fulfills both universal and visual semantic alignment. With any prepared GAN-based T2I framework, the pre-trained source encoder can be easily fine-tuned into a target encoder, thereby fully transferring T2I synthesis ability between languages. A semantic-level alignment, independent of the source T2I structure, is then established to guarantee optimal text consistency and detail generation. Unlike monolingual T2I methods that apply a discriminator to enhance generation quality, we use an adversarial training scheme that optimizes sentence-level alignment together with word-level alignment through a self-attention mechanism. Since low-resource languages typically lack parallel texts in practice, the target input embedding is designed to support zero-shot learning. Experimental results demonstrate the robustness of the proposed cross-lingual T2I pre-training across multiple downstream generative models and target languages.
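To make the joint sentence- and word-level alignment concrete, below is a minimal PyTorch sketch of what such an objective could look like. All module and variable names (`WordLevelAlignment`, `sentence_alignment_loss`, the attention configuration) are our own assumptions for illustration, not the authors' code, and the paper's adversarial discriminator on top of these alignments is omitted.

```python
# Hedged sketch: joint sentence- and word-level cross-lingual alignment,
# roughly in the spirit of the abstract. Names and choices are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordLevelAlignment(nn.Module):
    """Aligns target-language word features to the source word space
    via attention, standing in for the paper's self-attention mechanism."""
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, tgt_words: torch.Tensor, src_words: torch.Tensor) -> torch.Tensor:
        # Target word features attend over the source word sequence.
        aligned, _ = self.attn(tgt_words, src_words, src_words)
        # Penalize the gap between attended source context and target words.
        return F.mse_loss(aligned, tgt_words)

def sentence_alignment_loss(tgt_sent: torch.Tensor, src_sent: torch.Tensor) -> torch.Tensor:
    """Cosine-based sentence-level alignment between pooled embeddings."""
    return 1.0 - F.cosine_similarity(tgt_sent, src_sent, dim=-1).mean()

# Toy usage: random features stand in for source/target encoder outputs.
B, T, D = 8, 16, 256
src_words = torch.randn(B, T, D)
tgt_words = torch.randn(B, T, D, requires_grad=True)
word_align = WordLevelAlignment(D)

loss = (sentence_alignment_loss(tgt_words.mean(dim=1), src_words.mean(dim=1))
        + word_align(tgt_words, src_words))
loss.backward()
```

In a real training loop, `src_words`/`tgt_words` would come from the frozen source encoder and the fine-tuned target encoder on (pseudo-)parallel text, and the combined loss would be played against a discriminator in the adversarial scheme the abstract describes.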

Keywords: Cross-lingual pre-training, Text-to-image synthesis, Universal contextual word vector space, Semantic alignment, Joint adversarial training

Article history: Received 28 August 2021, Revised 14 December 2021, Accepted 17 December 2021, Available online 24 December 2021, Version of Record 4 January 2022.

Paper link: https://doi.org/10.1016/j.knosys.2021.108006