An architecture for Malay Tweet normalization

Authors:

Highlights:

• Three distinct corpus-based analyses are carried out to observe the features of Malay Tweets.

• A rule-based architecture is developed based on the results of the analyses.

• The architecture consists of seven distinct modules in a pipeline structure.

• Experimental results indicate high accuracy in terms of BLEU score.

• The architecture outperforms an SMT-like normalization approach.

Abstract

Three distinct corpus-based analyses are carried out to observe the features of Malay Tweets. Based on the results of these analyses, a rule-based architecture is developed, consisting of seven distinct modules in a pipeline structure. Experimental results indicate high accuracy in terms of BLEU score, and the architecture outperforms an SMT-like normalization approach.
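
The pipeline structure mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the seven modules are not enumerated in this summary, so the three stages below (lowercase, squeeze_repeats, expand_abbreviations), the ABBREVIATIONS table, and the normalize function are hypothetical stand-ins chosen only to show how rule-based modules might be chained, each consuming the output of the previous one.

```python
import re
from typing import Callable, List

# A stage takes a list of tokens and returns a new list of tokens.
Stage = Callable[[List[str]], List[str]]

# Hypothetical lookup table of common Malay shorthand; illustrative only,
# not taken from the paper.
ABBREVIATIONS = {"yg": "yang", "dgn": "dengan", "sy": "saya"}


def lowercase(tokens: List[str]) -> List[str]:
    """Fold every token to lower case."""
    return [t.lower() for t in tokens]


def squeeze_repeats(tokens: List[str]) -> List[str]:
    """Collapse a character repeated three or more times into one."""
    return [re.sub(r"(.)\1{2,}", r"\1", t) for t in tokens]


def expand_abbreviations(tokens: List[str]) -> List[str]:
    """Replace known shorthand with its full form; keep other tokens."""
    return [ABBREVIATIONS.get(t, t) for t in tokens]


# Assumed stage order; a real system would have its own modules and order.
PIPELINE: List[Stage] = [lowercase, squeeze_repeats, expand_abbreviations]


def normalize(text: str, stages: List[Stage] = PIPELINE) -> str:
    """Run whitespace-split tokens through each stage in sequence."""
    tokens = text.split()
    for stage in stages:
        tokens = stage(tokens)
    return " ".join(tokens)


if __name__ == "__main__":
    print(normalize("Sy suka makan dgn kawannn"))
```

Under these assumptions, the input "Sy suka makan dgn kawannn" is normalized to "saya suka makan dengan kawan". Keeping each rule set in its own function means a stage can be added, removed, or reordered without touching the others, which is the main appeal of a pipeline design.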

Keywords: Malay, Twitter, Text normalization, Noisy text

Article history: Received 10 October 2013, Revised 25 April 2014, Accepted 28 April 2014, Available online 24 May 2014.

Official paper URL: https://doi.org/10.1016/j.ipm.2014.04.009