AutoRM: An effective approach for automatic Web data record mining

Abstract

A Web database typically responds to a query with a Web page that encodes the query results as semi-structured data objects using HTML tags. We call such data objects Web data records, or simply data records. Mining Web data records is important for many applications, such as meta search and comparative shopping. This paper proposes a new approach called AutoRM, which automatically mines data records from a single Web page. AutoRM involves three major steps: (1) constructing the DOM tree of the given Web page; (2) mining all sets of adjacent similar C-Records (Candidate data Records) from the constructed DOM tree; and (3) mining the actual data records from the C-Records. In many Web pages, similar data records are distributed within larger, adjacent, similar objects. Existing approaches typically identify these larger objects as data records. In contrast, AutoRM treats them as C-Records and mines the actual data records from them. A key issue in mining similar data records is detecting the boundary of each data record. Existing approaches typically rely on brittle assumptions to handle this issue. By making more robust assumptions, AutoRM tends to detect data record boundaries more accurately. Experimental results show that AutoRM is highly effective and outperforms state-of-the-art approaches.
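
The three steps can be illustrated with a minimal Python sketch. It is not the AutoRM algorithm: the abstract does not specify the similarity measure or the boundary-detection assumptions, so the sketch assumes that two subtrees are "similar" when their tag structures are identical, and that a C-Record holds several data records when all of its children share one structure. All names (`Node`, `DomBuilder`, `signature`, `adjacent_similar_groups`, `records_from_group`) are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, enough for a sketch).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class DomBuilder(HTMLParser):
    """Step 1: build a minimal DOM tree (tags only; text is ignored)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def build_dom(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def signature(node):
    """Assumed similarity key: the tag plus the signatures of the children."""
    return (node.tag, tuple(signature(c) for c in node.children))


def adjacent_similar_groups(node, min_size=2):
    """Step 2 (simplified): runs of adjacent siblings with equal signatures."""
    groups, run = [], []
    for child in node.children:
        if run and signature(child) == signature(run[-1]):
            run.append(child)
        else:
            if len(run) >= min_size:
                groups.append(run)
            run = [child]
    if len(run) >= min_size:
        groups.append(run)
    for child in node.children:
        groups.extend(adjacent_similar_groups(child, min_size))
    return groups


def records_from_group(group):
    """Step 3 (simplified): split C-Records whose children are all alike."""
    records = []
    for c_record in group:
        kids = c_record.children
        if len(kids) >= 2 and len({signature(k) for k in kids}) == 1:
            records.extend(kids)      # the C-Record wraps several records
        else:
            records.append(c_record)  # the C-Record is itself one record
    return records
```

For a results table whose rows each hold two product cells, `adjacent_similar_groups(build_dom(html))` first reports the rows as a group of C-Records, and `records_from_group` then returns the individual cells as the data records. This mirrors the distinction the abstract draws between the larger similar objects and the records they contain.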

Keywords: Web data extraction, Web mining, Data record mining

Article history: Received 13 February 2015, Revised 24 May 2015, Accepted 14 July 2015, Available online 21 July 2015, Version of Record 19 October 2015.

Official paper URL: https://doi.org/10.1016/j.knosys.2015.07.012