Extracting keywords of educational texts using a novel mechanism based on linguistic approaches and evolutive graphs

Abstract

Keyword extraction is an important topic applicable to a wide range of areas such as spam detection, information classification, and sentiment analysis. Hundreds of algorithms can extract keywords from text documents. Many of these algorithms also take the function of the keywords into account, which is important when extraction must be limited to a specific area of knowledge. This research work focuses on extracting keywords from educational texts. In an educational context, the keywords are the most important parts of the lesson and may answer the professor’s questions. Classic keyword extraction algorithms have a very low success rate on educational texts, as the words they extract are very different from those selected by teachers. Normally, the most important words from an educational point of view do not match the most repeated words in the text. This research work attempts to improve automatic keyword extraction in educational texts, sparing professors this tedious task. Detecting keywords automatically could be a starting point for applications capable of generating questions and exercises automatically. We tested whether the most popular keyword extraction algorithms could efficiently extract the keywords selected by professors. The results obtained by current algorithms were poor, as they showed a low true positive rate or very high false positive rates. For these reasons, we designed a novel algorithm based on linguistic approaches and evolutive graphs. The research method used to obtain the new algorithm was the design of a complex graph that operates on numerous characteristics related to the relationships between words and their linguistic properties. The graph was trained with a set of texts and keywords to establish the optimal weights for each of the characteristics.
The proposed algorithm achieves a true positive (TP) rate and an F1 score significantly better than those of other algorithms.
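
To make the evaluation measures named above concrete, the following minimal sketch shows one common way a true positive rate and an F1 score could be computed when an algorithm's extracted keywords are compared with the keywords selected by a professor. The abstract does not specify the matching procedure, so the exact, case-insensitive set matching used here is an assumption, and the function name `keyword_scores` and the sample word lists are hypothetical illustrations rather than the authors' implementation or data.

```python
def keyword_scores(extracted, reference):
    """Score extracted keywords against teacher-selected reference keywords.

    Assumption: a keyword counts as a true positive only on an exact,
    case-insensitive match; the paper's matching rule may differ.
    """
    ext = {w.strip().lower() for w in extracted}
    ref = {w.strip().lower() for w in reference}

    tp = len(ext & ref)   # extracted and selected by the teacher
    fp = len(ext - ref)   # extracted but not selected by the teacher
    fn = len(ref - ext)   # selected by the teacher but missed

    precision = tp / (tp + fp) if ext else 0.0
    tp_rate = tp / (tp + fn) if ref else 0.0  # also known as recall
    denom = precision + tp_rate
    f1 = 2 * precision * tp_rate / denom if denom else 0.0

    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "tp_rate": tp_rate, "f1": f1}


# Hypothetical example: one frequent but unimportant word is extracted,
# and one teacher-selected keyword is missed.
scores = keyword_scores(
    extracted=["cell", "nucleus", "the", "membrane"],
    reference=["Cell", "Nucleus", "Mitochondria", "Membrane"],
)
print(scores)
```

In this hypothetical example three of the four teacher-selected keywords are recovered and one extracted word is a false positive, so the precision, the true positive rate, and the F1 score all equal 0.75.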