Language identification in web documents using discrete HMMs

Authors:

Abstract

This paper deals with language identification in the domain of web documents. The proposed system is built on hidden Markov models (HMMs) that enable the modeling of character sequences. Furthermore, the use of HMMs provides the means for language tracking, that is, language identification across the segments of a multilingual document.
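
To make the idea concrete, the following is a minimal sketch of HMM-based identification and tracking, not the paper's implementation. It scores a character sequence under one discrete HMM per language with the forward algorithm and picks the model with the highest log-likelihood. Everything specific here is an assumption made for illustration: the two-symbol alphabet, the two toy "languages" (`alternating` and `repeating`), the hand-set parameters (a real system would train them, for example with Baum-Welch), and the fixed-window `track` helper, which is only one naive way to label segments of a multilingual document.

```python
import math

def forward_log_likelihood(obs, start, trans, emit):
    """Return log P(obs | model) for a discrete HMM (scaled forward algorithm)."""
    if not obs:
        return 0.0
    n = len(start)
    log_like = 0.0
    alpha = None
    for t, symbol in enumerate(obs):
        if t == 0:
            # Initialisation: start probability times emission probability.
            alpha = [start[i] * emit[i].get(symbol, 0.0) for i in range(n)]
        else:
            # Induction: sum over predecessor states, then emit the symbol.
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n))
                * emit[j].get(symbol, 0.0)
                for j in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            # Symbol impossible under this model (no smoothing in this sketch).
            return float("-inf")
        log_like += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_like

# Hypothetical hand-set models over the toy alphabet {"a", "b"}.
# State 0 mostly emits "a", state 1 mostly emits "b".
EMIT = [{"a": 0.9, "b": 0.1}, {"a": 0.1, "b": 0.9}]
MODELS = {
    "alternating": ([0.5, 0.5], [[0.1, 0.9], [0.9, 0.1]], EMIT),
    "repeating": ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], EMIT),
}

def identify(text, models=MODELS):
    """Pick the model name with the highest log-likelihood for text."""
    return max(models, key=lambda name: forward_log_likelihood(text, *models[name]))

def track(text, window, models=MODELS):
    """Naive tracking: label each fixed-size window independently."""
    segments = [text[i:i + window] for i in range(0, len(text), window)]
    return [(seg, identify(seg, models)) for seg in segments]

if __name__ == "__main__":
    print(identify("ababab"))
    print(track("abababaaabbb", 6))
```

In this sketch each window is classified on its own. Because the models are HMMs, the same likelihood computation applies to a segment of any length, which is what makes segment-level tracking possible at all.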

Keywords: Statistical language identification, Web documents, Tourism domain, Language tracking, Discrete hidden Markov models (DHMMs)

Article history: Received 17 October 2002, Accepted 22 May 2003, Available online 19 September 2003.

Paper URL: https://doi.org/10.1016/j.patcog.2003.05.001