INDREX: In-database relation extraction

Authors:

Abstract

Managing text data has a long history. A particularly common task is extracting relations from text. Typically, the user performs this task with two separate systems: a relation extraction system and an SQL-based query engine for analytical tasks. During this iterative analytical workflow, the user must frequently ship data between these systems. Worse, the user must learn to manage both systems. Therefore, end users often desire a single system for both analytical and relation extraction tasks.

We propose INDREX, a system that provides a single, comprehensive view of the whole process, combining relation extraction with later exploitation through SQL. The system permits a data-warehouse-style extract-transform-load of generic relations extracted from text documents and can support additional text mining analysis libraries or systems. Once generic relations are loaded, the user can define SQL queries on the extracted relations to discover higher-level semantics or to join them with other relational data.

To execute this task, our system extends the SQL-based analytical capabilities of a columnar, massively parallel query processing engine with a broad set of user-defined functions and a data model that supports the task. Our white-box approach permits INDREX to benefit from the built-in query optimization and indexing techniques of the underlying query execution engine.

Applications that support both text mining and analytical workflows leverage new analytical platforms based on the MapReduce framework and its open source Hadoop implementation. We compare our system against this baseline. We measure execution times for common workflows and demonstrate orders-of-magnitude improvement in execution time using INDREX.
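The idea of querying loaded relations with SQL and joining them with other relational data can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the table names `extracted_relation` and `company`, their columns, and the sample rows are hypothetical and are not the INDREX data model, and Python's built-in `sqlite3` module merely stands in for the columnar, massively parallel engine the abstract mentions.

```python
import sqlite3


def acquisitions_with_sector(conn):
    """Join (hypothetical) extracted relations with an ordinary relational table."""
    query = """
        SELECT r.arg1 AS acquirer, r.arg2 AS target, c.sector
        FROM extracted_relation AS r
        JOIN company AS c ON c.name = r.arg1
        WHERE r.predicate = 'acquired'
        ORDER BY r.doc_id
    """
    return conn.execute(query).fetchall()


# In-memory database standing in for the analytical engine (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE extracted_relation (
        doc_id INTEGER, arg1 TEXT, predicate TEXT, arg2 TEXT
    );
    CREATE TABLE company (name TEXT PRIMARY KEY, sector TEXT);
""")

# "Load" step: generic relations as they might come out of an extractor.
conn.executemany(
    "INSERT INTO extracted_relation VALUES (?, ?, ?, ?)",
    [
        (1, "AcmeCorp", "acquired", "Widgets Ltd"),
        (2, "AcmeCorp", "is based in", "Berlin"),
        (3, "Globex", "acquired", "Initech"),
    ],
)

# Existing relational data to join against.
conn.executemany(
    "INSERT INTO company VALUES (?, ?)",
    [("AcmeCorp", "manufacturing"), ("Globex", "energy")],
)

rows = acquisitions_with_sector(conn)
print(rows)
```

Because the generic relations sit in an ordinary table in this sketch, the filter on `predicate` and the join with `company` are plain SQL, and no data has to be shipped between an extraction system and a separate query engine.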

Keywords: Iterative text mining in a RDBMS, Ad-hoc reports from text data, Information extraction

Review history: Received 20 March 2014, Revised 5 September 2014, Accepted 18 November 2014, Available online 10 December 2014, Version of Record 26 June 2015.

Official paper URL: https://doi.org/10.1016/j.is.2014.11.006