Coupled grouping and matching for sign and gesture recognition

Authors:

Highlights:

Abstract

Matching an image sequence to a model is a core problem in gesture and sign recognition. In this paper, we consider such a matching problem without requiring a perfect segmentation of the scene. Rather than requiring that low- and mid-level processes produce near-perfect segmentation, we accept that such processes can only produce uncertain information, and we use an intermediate grouping module to generate multiple candidates. From the set of low-level image primitives found in each image, such as constant-color region patches, a ranked set of salient, overlapping groups of these primitives is formed, based on low-level cues such as region shape, proximity, or color. These groups correspond to underlying object parts of interest, such as the hands. The sequence of frame-wise group hypotheses is then matched to a model by casting the problem as a minimization. We show that these hypotheses can be coupled with both non-statistical matching (matching to sample-based models of signs) and statistical matching (matching to HMM models). Our algorithm not only produces a matching score but also selects the best group in each image frame, i.e., recognition and final segmentation of the scene are coupled. In addition, there is no need to track features across sequences, which is known to be a hard task. We demonstrate our method on data from sign language recognition and gesture recognition, compare our results with ground-truth hand groups, and achieve less than 5% performance loss for both models. We also test our algorithm on a sports video dataset with a moving background.
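The coupled selection described above can be sketched as a dynamic program: each frame contributes several candidate groups (feature vectors), and we jointly minimize a per-frame matching cost against the model plus a smoothness cost between the groups chosen in consecutive frames. The following is a minimal illustrative sketch, not the paper's actual formulation; the Euclidean costs, the `smooth_weight` parameter, and the assumption that the model provides one exemplar feature per frame are all simplifications chosen for brevity.

```python
import numpy as np

def coupled_group_matching(frame_groups, model, smooth_weight=1.0):
    """Jointly select one candidate group per frame and score the match.

    frame_groups: list over frames; frame_groups[t] is a list of candidate
                  group feature vectors (np.ndarray) for frame t.
    model:        list of model feature vectors, one per frame (a
                  simplifying assumption for this sketch).
    Returns (total_cost, chosen_indices), where chosen_indices[t] is the
    index of the group selected in frame t.
    """
    T = len(frame_groups)
    # prev[k]: best cumulative cost ending with candidate k at current frame
    prev = np.array([np.linalg.norm(g - model[0]) for g in frame_groups[0]])
    backptrs = []
    for t in range(1, T):
        cur, back_t = [], []
        for g in frame_groups[t]:
            data_cost = np.linalg.norm(g - model[t])
            # transition cost favors temporally coherent group choices
            trans = [prev[j] + smooth_weight * np.linalg.norm(g - h)
                     for j, h in enumerate(frame_groups[t - 1])]
            j_best = int(np.argmin(trans))
            cur.append(trans[j_best] + data_cost)
            back_t.append(j_best)
        prev = np.array(cur)
        backptrs.append(back_t)
    # backtrack the optimal group choice per frame
    k = int(np.argmin(prev))
    path = [k]
    for back_t in reversed(backptrs):
        k = back_t[k]
        path.append(k)
    path.reverse()
    return float(prev[path[-1]]), path

# Hypothetical toy example: two frames, two candidate groups each.
frame_groups = [
    [np.array([0.0, 0.0]), np.array([5.0, 5.0])],  # frame 0 candidates
    [np.array([1.0, 1.0]), np.array([9.0, 9.0])],  # frame 1 candidates
]
model = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
cost, chosen = coupled_group_matching(frame_groups, model)
# The candidates closest to the model are selected in both frames.
```

The key point the sketch illustrates is the coupling: the final segmentation (which group is the hand in each frame) falls out of the same minimization that produces the recognition score, so no separate tracker is needed.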

Keywords:

Article history: Received 24 December 2007, Accepted 30 September 2008, Available online 17 November 2008.

DOI: https://doi.org/10.1016/j.cviu.2008.09.005