DERIN: A data extraction method based on rendering information and n-gram

Authors:

Highlights:

• DERIN aims to improve wrapper performance.

• DERIN automatically selects the main data region from a search result page and extracts its records and attributes based on rendering information.

• DERIN detects different record structures using n-gram-based techniques.

• DERIN needs no examples to learn how to extract the data.

• DERIN is domain-independent.

• Experiments on web pages from several domains show that DERIN is highly effective and competitive with representative methods.
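
The highlights mention n-gram-based detection of record structures but give no algorithmic detail. As a rough illustration of the general idea only (this is not DERIN's actual procedure), the sketch below counts contiguous n-grams over a flattened sequence of tag names; frequently repeated n-grams hint at a recurring record structure. The function name `tag_ngrams` and the sample tag sequence are hypothetical.

```python
from collections import Counter

def tag_ngrams(tags, n):
    """Count contiguous n-grams over a sequence of tag names."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))

# Hypothetical flattened tag sequence of a data region holding three
# records; the third record has an extra <img>, so its structure differs.
tags = ["div", "a", "span", "div", "a", "span", "div", "a", "img", "span"]

counts = tag_ngrams(tags, 2)

# The most frequent bigram is a candidate for the start of a repeated record.
most_common_bigram, freq = counts.most_common(1)[0]
print(most_common_bigram, freq)
```

In this toy input the bigram `("div", "a")` occurs once per record, while `("a", "img")` occurs only in the record with the variant structure, which is the kind of signal an n-gram comparison can expose.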


Keywords: Rendering information, Visual information, Wrapper, Main data region, Path expression, N-gram, Data extraction

Review history: Received 5 April 2016, Revised 6 April 2017, Accepted 26 April 2017, Available online 11 May 2017, Version of Record 11 May 2017.

Paper URL: https://doi.org/10.1016/j.ipm.2017.04.007