Exploring feature sets for two-phase biomedical named entity recognition using semi-CRFs

Authors: Li Yang, Yanhong Zhou

Abstract

This paper presents a two-phase approach based on the semi-Markov conditional random fields model (semi-CRFs) and explores novel feature sets for classifying the entities in text into five types: protein, DNA, RNA, cell_line and cell_type. Semi-CRFs assign a label to a segment rather than to a single word, which is more natural than other machine learning methods such as the conditional random fields model (CRFs). Our approach divides the biomedical named entity recognition task into two sub-tasks: term boundary detection and semantic labeling. In the first phase, the term boundary detection sub-task detects the boundaries of the entities and assigns them all to a single type, C. In the second phase, the semantic labeling sub-task assigns the correct entity type to each entity detected in the first phase. We explore novel feature sets in both phases to improve performance. For comparison, experiments are conducted on both CRFs and semi-CRFs models in each phase. Our experiments on the JNLPBA 2004 datasets achieve an F-score of 74.64% based on semi-CRFs without deep domain knowledge or post-processing algorithms, which outperforms most state-of-the-art systems.
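
The two-phase decomposition described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: no CRF or semi-CRF model is trained here, the names `bio_to_segments`, `label_segments` and `toy_classify` are hypothetical, and the rule-based classifier is only a stand-in for the second-phase model. The sketch shows the data flow only: phase one yields segments that all carry the generic type C, and phase two replaces C with one of the five entity types.

```python
from typing import Callable, List, Tuple

# A segment is (start, end_exclusive, label): the label belongs to the whole span.
Segment = Tuple[int, int, str]

ENTITY_TYPES = ("protein", "DNA", "RNA", "cell_line", "cell_type")


def bio_to_segments(tags: List[str]) -> List[Segment]:
    """Phase 1 output (assumed here to be B/I/O tags) collapsed into segments of type C."""
    segments: List[Segment] = []
    start = None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        if start is not None and not tag.startswith("I"):
            segments.append((start, i, "C"))
            start = None
        if tag.startswith("B"):
            start = i
    return segments


def label_segments(
    tokens: List[str],
    segments: List[Segment],
    classify: Callable[[List[str]], str],
) -> List[Segment]:
    """Phase 2: replace the generic type C with a concrete entity type."""
    return [(s, e, classify(tokens[s:e])) for s, e, _ in segments]


def toy_classify(span: List[str]) -> str:
    """Hypothetical stand-in for the second-phase model (illustration only)."""
    if span[-1] == "gene":
        return "DNA"
    if span[-1] == "cells":
        return "cell_type"
    return "protein"


tokens = ["IL-2", "gene", "expression", "in", "T", "cells"]
tags = ["B", "I", "O", "O", "B", "I"]  # assumed phase-1 boundary predictions

boundaries = bio_to_segments(tags)
entities = label_segments(tokens, boundaries, toy_classify)
print(boundaries)
print(entities)
```

Keeping the label on the span rather than on each token mirrors the segment-level view that the abstract attributes to semi-CRFs; a token-level CRF would instead emit one tag per word, which then has to be merged into spans as `bio_to_segments` does.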

Keywords: Conditional random fields, Semi-conditional random fields, Feature sets, Two-phase

Paper URL: https://doi.org/10.1007/s10115-013-0637-7