An empirical symbolic approach to natural language processing

Abstract

Empirical methods in the field of natural language processing (NLP) are usually based on a probabilistic model of language. These methods have recently gained popularity because of the claim that they provide better coverage of language phenomena. Though this claim has not been fully proven, empirical methods certainly outperform rationalist, or symbolic, methods in this regard. However, empirical methods provide a probabilistic, not conceptual, explanation of the linguistic phenomena they analyze. Probabilistic systems do “work” in real applications, and this is meritorious, but in our view they are intrinsically unable to provide insight into the mechanisms of human communication, because their output consists of plain words, or word clusters, with attached probabilities. Ultimately, a human analyst must make sense of these data. In the past few years, we have explored the possibility of combining the advantages of empirical and rationalist approaches in NLP. Our objective was to define methods for lexical knowledge acquisition that are both scalable and linguistically “appealing”, that is, amenable to a theoretically founded analysis of language. In this paper we describe and evaluate the results of a large-scale lexical learning system, ARIOSTO_LEX, that uses a combination of probabilistic and knowledge-based methods to acquire the selectional restrictions of words in sublanguages. We present extensive experimental data obtained from different corpora in different domains and languages, and show that the acquired lexical data not only have practical applications in NLP but are also useful for a comparative analysis of sublanguages. Importantly, ARIOSTO_LEX sheds light on recurrent linguistic phenomena that have a problematic impact on the large-scale applicability of commonly used NLP techniques.