Grammars have exceptions

Authors:

Abstract

Extending database-like techniques to semi-structured and Web data sources is becoming a prominent research field. These data sources are essentially collections of textual documents, so a key task in this context is wrapping documents to build database abstractions of their content that can be manipulated with high-level tools. However, the heterogeneity and lack of structure of these documents make standard grammar parsers excessively rigid and often unable to capture the richness of their constructs. This paper presents Minerva, a formalism for writing wrappers around Web sites and other textual data sources. The key feature of Minerva is that it attempts to couple the benefits of a declarative, grammar-based approach with the flexibility of procedural programming, by enriching regular grammars with an explicit exception-handling mechanism. The contributions of the paper are the definition of the formalism and the description of its implementation, which relies on a number of ad hoc document-parsing techniques, including an extension of the traditional LL(1) policy based on dynamic tokenization.
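
The idea of attaching exception handling to grammar productions can be illustrated with a minimal Python sketch. This is not Minerva's syntax or implementation, and it does not attempt to show the LL(1) extension or dynamic tokenization. The `Parser` class, the `ParseError` exception, the `book` and `books` productions, and the `Title | Year` line format are all hypothetical, chosen only to show a production that fails on a malformed record and a handler that recovers by skipping that record instead of aborting the whole parse.

```python
import re
from typing import Dict, List, Union


class ParseError(Exception):
    """Raised when a production cannot match the input at the current position."""


class Parser:
    """Hypothetical scanner that matches regular expressions at a cursor."""

    def __init__(self, text: str) -> None:
        self.text = text
        self.pos = 0

    def match(self, pattern: str) -> str:
        """Consume the pattern at the current position or raise ParseError."""
        m = re.compile(pattern).match(self.text, self.pos)
        if m is None:
            raise ParseError(f"expected /{pattern}/ at offset {self.pos}")
        self.pos = m.end()
        return m.group(0)

    def skip_line(self) -> None:
        """Recovery action: discard input up to and including the next newline."""
        nl = self.text.find("\n", self.pos)
        self.pos = len(self.text) if nl == -1 else nl + 1


Record = Dict[str, Union[str, int]]


def book(p: Parser) -> Record:
    """Production (assumed format): Book -> Title ' | ' Year end-of-line."""
    title = p.match(r"[^|\n]+?(?= \| )")
    p.match(r" \| ")
    year = p.match(r"\d{4}")
    p.match(r"\n|$")
    return {"title": title, "year": int(year)}


def books(p: Parser) -> List[Record]:
    """Production: Books -> Book*, with a handler that skips malformed lines."""
    records: List[Record] = []
    while p.pos < len(p.text):
        start = p.pos
        try:
            records.append(book(p))
        except ParseError:
            # Exception handler: rewind to the start of the failed record,
            # then drop the offending line and continue with the next one.
            p.pos = start
            p.skip_line()
    return records


if __name__ == "__main__":
    sample = "Dune | 1965\nbroken line\nNeuromancer | 1984\n"
    for record in books(Parser(sample)):
        print(record)
```

In this sketch the grammar stays declarative in spirit (each function mirrors one production), while the `except` clause plays the role of a procedural escape hatch for irregular input.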

Keywords: Wrappers, Grammars, Exceptions, Documents, Web

Article history: Received 17 February 1998; revised 15 October 1998; available online 12 February 1999.

Official paper URL: https://doi.org/10.1016/S0306-4379(98)00028-3