Automating extraction of logical domains in a web site

Abstract

The domain name field in a uniform resource locator (URL) has been viewed as a natural choice for organizing Web pages. For example, Web search results may be grouped by domain and presented to users as clusters for ease of visualization. With this approach, however, large Web sites such as Geocities, W3C, and www.cs.umd.edu tend to yield many matches, which leads to a few large, flat, and unorganized clusters. In fact, many pages in these sites form “logical domains” of their own. For example, the Web sites for projects at a university or the XML section at W3C could be viewed as “logical domains”. In this paper, we propose the concept of a logical domain, which is identified by semantic relatedness, as opposed to a physical domain, which is identified simply by domain name. Identifying logical domains is important to many Web applications, such as query result reorganization, site map generation, and topic distillation. We have developed and implemented a set of rules based on link structure, path information, document metadata, and citations to identify logical domain entry pages (i.e., the root pages of logical domains). The importance of each rule is adjusted automatically using a novel decision tree algorithm and training data provided by human feedback. We also develop techniques to define the boundary of each logical domain based on the identified logical domain entry pages. We have conducted extensive experiments on real Web sites to evaluate the effectiveness of the proposed techniques. The experimental results show that our techniques perform very well in extracting logical domains from a Web site.
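The abstract does not spell out how rule importance is computed, but the keywords mention information gain. As a rough illustration only, the sketch below computes the standard information gain of a single binary rule over pages that a human has labeled as entry pages or not. The function names, the boolean encoding of rule outcomes, and the sample data are assumptions made for this example and are not taken from the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(rule_fired, is_entry_page):
    """Information gain from splitting labeled pages on one binary rule.

    rule_fired[i]    -- True if the (hypothetical) rule matched page i
    is_entry_page[i] -- True if a human marked page i as an entry page
    """
    total = len(is_entry_page)
    base = entropy(is_entry_page)
    remainder = 0.0
    for value in (True, False):
        subset = [y for f, y in zip(rule_fired, is_entry_page) if f == value]
        remainder += (len(subset) / total) * entropy(subset)
    return base - remainder


# Hypothetical feedback on four pages: two entry pages, two ordinary pages.
labels = [True, True, False, False]
predictive_rule = [True, True, False, False]     # fires exactly on entry pages
uninformative_rule = [True, False, True, False]  # fires regardless of label

gain_predictive = information_gain(predictive_rule, labels)
gain_uninformative = information_gain(uninformative_rule, labels)
```

Under these assumptions, a rule that separates entry pages from other pages more cleanly receives a higher gain, which is one plausible way to rank rules when building a decision tree from human feedback.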

Keywords: Logical domain, Domain boundary, Decision tree algorithm, Information gain, WWW, Link structures

Article history: Received 8 December 2000, Revised 2 May 2001, Accepted 6 February 2002, Available online 27 March 2002.

Official paper URL: https://doi.org/10.1016/S0169-023X(02)00055-1