Combining text and link analysis for focused crawling—An application for vertical search engines

Abstract

The number of vertical search engines and portals has increased rapidly in recent years, making the importance of topic-driven (focused) crawlers self-evident. In this paper, we develop a latent semantic indexing classifier that combines link analysis with text content in order to retrieve and index domain-specific web documents. Our implementation presents a different approach to focused crawling and aims to overcome the limitations imposed by the need to provide initial training data, while maintaining high recall and precision. We compare its efficiency with that of other well-known web information retrieval techniques.
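
The general idea of scoring pages in a latent space built from both text and links can be illustrated with a small sketch. This is not the classifier developed in the paper, whose details the abstract does not give. It assumes one simple way of combining the two signals: weighted link rows are stacked under a term-document matrix before a truncated SVD. The function name `lsi_rank`, the `link_weight` parameter, and the toy data are all hypothetical.

```python
import numpy as np

def lsi_rank(term_doc, links, query, k=2, link_weight=0.5):
    """Score documents against a topic query in a rank-k latent space.

    term_doc: (terms x docs) term-frequency matrix.
    links:    (docs x docs) adjacency matrix; links[i, j] = 1 if doc j links to doc i.
    query:    (terms,) term vector describing the topic.
    """
    # Assumption: append weighted link rows so each document column
    # carries both its text and its link profile.
    combined = np.vstack([term_doc, link_weight * links])
    # The query carries no link information, so pad it with zeros.
    q = np.concatenate([query, np.zeros(links.shape[0])])

    # Truncated SVD: keep the k largest singular triplets.
    u, s, vt = np.linalg.svd(combined, full_matrices=False)
    u_k, s_k, vt_k = u[:, :k], s[:k], vt[:k, :]

    # Documents and query projected onto the same k left singular vectors.
    docs_latent = (s_k[:, None] * vt_k).T   # shape (docs, k)
    q_latent = u_k.T @ q                    # shape (k,)

    # Cosine similarity, guarding against zero-length vectors.
    norms = np.linalg.norm(docs_latent, axis=1) * np.linalg.norm(q_latent)
    return docs_latent @ q_latent / np.where(norms == 0, 1.0, norms)

# Toy data: terms are (crawler, search, recipe, cooking);
# docs 0-1 are about crawling, docs 2-3 about cooking.
term_doc = np.array([[2, 1, 0, 0],
                     [1, 2, 0, 0],
                     [0, 0, 2, 1],
                     [0, 0, 1, 2]], dtype=float)
# Docs 0 and 1 link to each other, as do docs 2 and 3.
links = np.array([[0, 1, 0, 0],
                  [1, 0, 0, 0],
                  [0, 0, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
query = np.array([1, 1, 0, 0], dtype=float)

scores = lsi_rank(term_doc, links, query)
ranking = np.argsort(-scores)  # highest-scoring documents first
```

In a focused crawler, scores of this kind could be used to prioritise which fetched pages to index and whose outgoing links to follow first. In this toy example the two crawling documents score above the two cooking documents.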

Keywords: Focused crawling, Information retrieval, Latent semantic indexing, Text categorisation, Vertical search engines

Review history: Received 25 November 2005, Revised 7 July 2006, Accepted 29 September 2006, Available online 7 November 2006.

Official paper URL: https://doi.org/10.1016/j.is.2006.09.004