TripleRank: An unsupervised keyphrase extraction algorithm

Authors:

Highlights:

Abstract

Automatic keyphrase extraction algorithms aim to identify the words and phrases that carry the core information of a document. As online scholarly resources have become widespread in recent years, better keyphrase extraction techniques are required to improve search efficiency. We present two features, keyphrase semantic diversity and keyphrase coverage, to overcome limitations of existing methods for unsupervised keyphrase extraction. Keyphrase semantic diversity is the degree of semantic variety in the extraction result; it is introduced to avoid extracting synonymous phrases that contain the same high-score candidate. Keyphrase coverage refers to how well a candidate represents the other words in the document. We propose an unsupervised keyphrase extraction method called TripleRank, which evaluates three features: word position (a sensitive feature for academic documents) and the two new features introduced above. The architecture of TripleRank consists of three sub-models, one scoring each feature, and a summing model. Although it involves multiple models, TripleRank has no typical iteration process; hence, its computational cost is relatively low. TripleRank outperformed four state-of-the-art baseline models on four academic datasets, which confirms the influence of keyphrase semantic diversity and keyphrase coverage and demonstrates the efficiency of our method.
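
The "three sub-models plus a summing model" architecture can be illustrated with a minimal Python sketch. The abstract does not give the scoring formulas, so everything below is an assumption for illustration only and not the paper's formulation: the reciprocal-position score, the co-occurrence-window proxy for coverage, the word-overlap penalty standing in for semantic diversity, and all function and parameter names (`position_scores`, `coverage_scores`, `score_candidates`, `window`, `penalty`) are hypothetical.

```python
from collections import Counter


def position_scores(tokens):
    """Assumed position feature: sum of 1/position over a word's occurrences,
    so words that appear early (and often) score higher."""
    scores = Counter()
    for i, tok in enumerate(tokens, start=1):
        scores[tok] += 1.0 / i
    return scores


def coverage_scores(tokens, window=2):
    """Assumed coverage proxy: fraction of the other vocabulary words that
    co-occur with a word within `window` tokens on either side."""
    neighbours = {tok: set() for tok in tokens}
    for i, tok in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        neighbours[tok].update(t for t in tokens[lo:hi] if t != tok)
    denom = max(1, len(neighbours) - 1)
    return {tok: len(ns) / denom for tok, ns in neighbours.items()}


def score_candidates(tokens, candidates, window=2, penalty=0.5):
    """Score each candidate phrase as position + coverage + diversity.
    Single pass over the candidates, no convergence loop."""
    pos = position_scores(tokens)
    cov = coverage_scores(tokens, window)
    pos_total = sum(pos.values()) or 1.0

    # Sub-models 1 and 2: position and coverage per phrase.
    base = {}
    for phrase in candidates:
        words = phrase.split()
        p = sum(pos.get(w, 0.0) for w in words) / pos_total
        c = sum(cov.get(w, 0.0) for w in words) / len(words)
        base[phrase] = (p, c)

    # Sub-model 3 (assumed stand-in for semantic diversity): penalise a phrase
    # for sharing words with any candidate that already scores higher.
    final = {}
    for phrase, (p, c) in base.items():
        words = set(phrase.split())
        stronger = [
            set(other.split())
            for other, (op, oc) in base.items()
            if other != phrase and op + oc > p + c
        ]
        overlap = max((len(words & s) / len(words) for s in stronger), default=0.0)
        d = -penalty * overlap
        final[phrase] = p + c + d  # summing model

    return sorted(final.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    doc = "graph ranking graph model topic coverage".split()
    for phrase, score in score_candidates(doc, ["graph ranking", "graph model", "topic coverage"]):
        print(f"{score:.3f}  {phrase}")
```

Each feature here is computed in one pass and the results are simply added, which mirrors the abstract's point that no iterative (convergence-style) process is needed. A real implementation would replace the word-overlap penalty with a genuine semantic similarity measure.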

Keywords: Keyphrase extraction, Keyphrase semantic diversity, Keyphrase coverage, Unsupervised approach

Article history: Received 16 November 2020, Revised 16 January 2021, Accepted 4 February 2021, Available online 19 February 2021, Version of Record 4 March 2021.

Official paper URL: https://doi.org/10.1016/j.knosys.2021.106846