A cost-effective method for detecting web site replicas on search engine databases



Abstract

Identifying replicated sites is an important task for search engines. It can reduce data storage costs, improve query processing time, and remove noise that might affect the quality of the final answers given to the user. This paper introduces a new approach to detecting web sites that are likely to be replicas in a search engine database. Our method uses the web sites’ structure and the content of their pages to identify possible replicas. As we show through experiments, this combination improves precision and reduces the overall cost of the replica detection task. Our method achieves a quality improvement of 47.23% when compared to previously proposed approaches.
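
To make the idea of combining structure and content concrete, here is a minimal sketch in Python. It is not the paper's algorithm: the abstract does not give the actual features, similarity measures, or thresholds, so everything below is an assumption made for illustration. Each site is modelled as a hypothetical dictionary mapping URL paths to page text, structure is compared with a Jaccard similarity over path sets, and content is compared by hashing the pages found at shared paths. The function names and the 0.7 thresholds are likewise invented.

```python
import hashlib


def path_similarity(paths_a, paths_b):
    """Jaccard similarity between two collections of URL paths."""
    a, b = set(paths_a), set(paths_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def content_similarity(site_a, site_b):
    """Fraction of shared paths whose page text hashes identically."""
    shared = set(site_a) & set(site_b)
    if not shared:
        return 0.0
    same = sum(
        1
        for path in shared
        if hashlib.sha1(site_a[path].encode("utf-8")).digest()
        == hashlib.sha1(site_b[path].encode("utf-8")).digest()
    )
    return same / len(shared)


def likely_replicas(site_a, site_b, path_threshold=0.7, content_threshold=0.7):
    """Flag a site pair as probable replicas (illustrative thresholds)."""
    # Cheap structural check first: skip the content comparison
    # when the two sites do not even share most of their paths.
    if path_similarity(site_a.keys(), site_b.keys()) < path_threshold:
        return False
    return content_similarity(site_a, site_b) >= content_threshold
```

Running the structural test first reflects the cost argument in the abstract: comparing path sets is far cheaper than comparing page content, so the content check runs only on pairs that already look alike structurally.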

Keywords: Site replication, Mirror, Search engines

Review history: Received 12 August 2006, Accepted 12 August 2006, Available online 2 October 2006.

Paper URL: https://doi.org/10.1016/j.datak.2006.08.010