Information theoretic-PSO-based feature selection: an application in biomedical entity extraction

作者:Shweta Yadav, Asif Ekbal, Sriparna Saha

摘要

Named entity recognition is a vital task for various applications related to biomedical natural language processing. It aims at extracting different biomedical entities from the text and classifying them into some predefined categories. The types could vary depending upon the genre and domain, such as gene versus non-gene in a coarse-grained scenario, or protein, DNA, RNA, cell line, and cell-type in a fine-grained scenario. In this paper, we present a novel filter-based feature selection technique utilizing the search capability of particle swarm optimization (PSO) for determining the most optimal feature combination. The technique yields in the most optimized feature set, that when used for classifiers learning, enhance the system performance. The proposed approach is assessed over four popular biomedical corpora, namely GENIA, GENETAG, AIMed, and Biocreative-II Gene Mention Recognition (BC-II). Our proposed model obtains the F score values of \(74.49\%\), \(91.11\%\), \(90.47\%\), \(88.64\%\) on GENIA, GENETAG, AIMed, and BC-II dataset, respectively. The efficiency of feature pruning through PSO is evident with significant performance gains, even with a much reduced set of features.

论文关键词:Named entity recognition, Feature selection, Binary PSO, Correlation, Mutual information, Normalized mutual information, Particle swarm optimization

论文评审过程:

论文官网地址:https://doi.org/10.1007/s10115-018-1265-z