A tale of two epidemics: Contextual Word2Vec for classifying twitter streams during outbreaks

作者:

Highlights:

摘要

Unstructured tweet feeds are becoming the source of real-time information for various events. However, extracting actionable information in real-time from this unstructured text data is a challenging task. Hence, researchers are employing word embedding approach to classify unstructured text data. We set our study in the contexts of the 2014 Ebola and 2016 Zika outbreaks and probed the accuracy of domain-specific word vectors for identifying crisis-related actionable tweets. Our findings suggest that relatively smaller domain-specific input corpora from the Twitter corpus are better in extracting meaningful semantic relationship than generic pre-trained Word2Vec (contrived from Google News) or GloVe (of Stanford NLP group). However, domain-specific quality tweet corpora during the early stages of outbreaks are normally scant, and identifying actionable tweets during early stages is crucial to stemming the proliferation of an outbreak. To overcome this challenge, we consider scholarly abstracts, related to Ebola and Zika virus, from PubMed and probe the efficiency of cross-domain resource utilization for word vector generation. Our findings demonstrate that the relevance of PubMed abstracts for the training purpose when Twitter data (as input corpus) would be scant during the early stages of the outbreak. Thus, this approach can be implemented to handle future outbreaks in real time. We also explore the accuracy of our word vectors for various model architectures and hyper-parameter settings. We observe that Skip-gram accuracies are better than CBOW, and higher dimensions yield better accuracy.

论文关键词:Epidemics,Ebola,Zika,PubMed,Twitter,Text classification,Word Vectors

论文评审过程:Received 5 November 2017, Revised 29 August 2018, Accepted 12 October 2018, Available online 25 October 2018, Version of Record 25 October 2018.

论文官网地址:https://doi.org/10.1016/j.ipm.2018.10.010