MGRank: A keyword extraction system based on multigraph GoW model and novel edge weighting procedure

Authors:

Highlights:

Abstract

Keyword extraction is the process of extracting the most descriptive words from a textual document. State-of-the-art graph-based keyword extraction systems generally represent a text document as a simple graph called a graph-of-words (GoW), built with a sliding window. This representation requires choosing a proper window size, models the document only on a local scale, and allows only a single relation between two candidate keywords. In this study, we address these problems and propose a keyword extraction system called MGRank, which uses a complete multigraph structure to build the GoW model of a text document. The completeness property of the proposed GoW model represents the document globally and eliminates the window-size parameter. Parallel edges allow multiple relations to be established between candidate keywords. We also propose a new edge-weighting procedure based on the positional distance between candidate keywords. To evaluate the performance of MGRank, we performed experiments on seven benchmark datasets and compared the results with those of six baseline methods. The experimental results show that MGRank outperforms the baseline methods with statistical significance in precision, recall, and F1-score in almost all cases. In terms of mean average precision and mean reciprocal rank, MGRank performs statistically better than the node ranking-based and statistical baseline methods and achieves results on par with the topic-based baseline methods. Furthermore, the experimental results show that MGRank extracts the most relevant keywords.
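The idea of a complete multigraph GoW can be illustrated with a small sketch. The abstract does not give the actual weighting formula or ranking step of MGRank, so everything specific here is an assumption made for illustration: each pair of occurrences of two distinct words contributes one parallel edge, the edge weight is taken to be the inverse of the positional distance, and candidates are ranked by the sum of their incident edge weights. The function names `build_complete_multigraph` and `rank_by_weighted_degree` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_complete_multigraph(tokens):
    """Map each unordered word pair to a list of parallel edge weights.

    Every pair of distinct words is connected (completeness), so no
    window size is needed. Each pair of occurrences adds one parallel
    edge. The weight 1 / distance is an assumed formula, not the one
    used in the paper.
    """
    positions = defaultdict(list)
    for index, token in enumerate(tokens):
        positions[token].append(index)

    edges = defaultdict(list)
    for u, v in combinations(sorted(positions), 2):
        for i in positions[u]:
            for j in positions[v]:
                edges[(u, v)].append(1.0 / abs(i - j))
    return edges


def rank_by_weighted_degree(edges):
    """Rank words by the total weight of their incident edges.

    Weighted degree is a stand-in ranking step chosen for brevity.
    """
    scores = defaultdict(float)
    for (u, v), weights in edges.items():
        total = sum(weights)
        scores[u] += total
        scores[v] += total
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


tokens = ["graph", "keyword", "graph", "model"]
edges = build_complete_multigraph(tokens)
ranking = rank_by_weighted_degree(edges)
```

In this toy input, "graph" occurs twice, so it is linked to "keyword" by two parallel edges rather than one, and words that are far apart in the text still share an edge, only with a smaller weight. A sliding-window GoW would instead drop any pair whose distance exceeds the window size.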

Keywords: Keyword extraction, Multigraph, Complete graph, Window size, Edge weighting

Review history: Received 14 December 2021, Revised 15 June 2022, Accepted 16 June 2022, Available online 21 June 2022, Version of Record 1 July 2022.

Official paper URL: https://doi.org/10.1016/j.knosys.2022.109292