Big data for Natural Language Processing: A streaming approach

Authors:

Highlights:

Abstract

Computational power requirements have grown dramatically in recent years. This is also the case in many language processing tasks, due to the overwhelming and ever-increasing amount of textual information that must be processed in a reasonable time frame. This scenario has led to a paradigm shift in the computing architectures and large-scale data processing strategies used in the Natural Language Processing field. In this paper we present a new distributed architecture and technology for scaling up text analysis by running a complete chain of linguistic processors on several virtual machines. We also describe a series of experiments carried out to analyze the scaling capabilities of the language processing pipeline used in this setting. We explore the use of Storm in a new approach for scalable distributed language processing across multiple machines, and evaluate its effectiveness and efficiency when processing documents on a medium and large scale. The experiments show that there is considerable room for improvement in language processing performance when adopting parallel architectures, and that even better results might be expected from large clusters with many processing nodes.

Keywords: Natural Language Processing, Distributed NLP architectures, Big data, Storm, NLP tools

Review history: Received 30 March 2014, Revised 22 October 2014, Accepted 8 November 2014, Available online 20 November 2014.

Official paper URL: https://doi.org/10.1016/j.knosys.2014.11.007