Inference of regular languages using state merging algorithms with search

摘要

State merging algorithms have emerged as the solution of choice for the problem of inferring regular grammars from labeled samples, a known NP-complete problem of great importance in the grammatical inference area. These methods derive a small deterministic finite automaton from a set of labeled strings (the training set), by merging parts of the acceptor that corresponds to this training set. Experimental and theoretical evidence have shown that the generalization ability exhibited by the resulting automata is highly correlated with the number of states in the final solution.As originally proposed, state merging algorithms do not perform search. This means that they are fast, but also means that they are limited by the quality of the heuristics they use to select the states to be merged. Sub-optimal choices lead to automata that have many more states than needed and exhibit poor generalization ability.In this work, we survey the existing approaches that generalize state merging algorithms by using search to explore the tree that represents the space of possible sequences of state mergings. By using heuristic guided search in this space, many possible state merging sequences can be considered, leading to smaller automata and improved generalization ability, at the expense of increased computation time.We present comparisons of existing algorithms that show that, in widely accepted benchmarks, the quality of the derived solutions is improved by applying this type of search. However, we also point out that existing algorithms are not powerful enough to solve the more complex instances of the problem, leaving open the possibility that better and more powerful approaches need to be designed.