Learning to crawl deep web

作者：

Highlights：

•

摘要

Deep web or hidden web refers to the hidden part of the Web (usually residing in structured databases) that remains unavailable for standard Web crawlers. Obtaining content of the deep web is challenging and has been acknowledged as a significant gap in the coverage of search engines. The paper proposes a novel deep web crawling framework based on reinforcement learning, in which the crawler is regarded as an agent and deep web database as the environment. The agent perceives its current state and selects an action (query) to submit to the environment (the deep web database) according to Q-value. While the existing methods rely on an assumption that all deep web databases possess full-text search interfaces and solely utilize the statistics (TF or DF) of acquired data records to generate the next query, the reinforcement learning framework not only enables crawlers to learn a promising crawling strategy from its own experience, but also allows for utilizing diverse features of query keywords. Experimental results show that the method outperforms the state of art methods in terms of crawling capability and relaxes the assumption of full-text search implied by existing methods.

论文关键词：Hidden web,Deep web crawling,Reinforcement learning

论文评审过程：Received 23 June 2011, Revised 11 November 2012, Accepted 7 February 2013, Available online 19 February 2013.

论文官网地址：https://doi.org/10.1016/j.is.2013.02.001