Ternary encoding based feature extraction for binary text classification

Authors: Hakan Altınçay, Zafer Erenel

Abstract

A novel framework for termset-based feature extraction is proposed for binary text classification. The approach encodes each term within a termset using a ternary code: ‘+1’ and ‘−1’ indicate which class the term supports, whereas ‘0’ denotes no support for either class. Four encoding schemes are proposed, in which the term weights and the term occurrence probabilities in the positive and negative documents are used to define the ternary code of a given term. The resulting ternary patterns are split into positive and negative codes, and each code is treated as a separate feature extractor. The derived features are investigated both on their own and together with the bag-of-words representation. Histograms of the resulting features are also employed to study the improvement that can be achieved by augmenting the bag-of-words representation with a small number of additional features. Experiments conducted on four benchmark datasets with different characteristics show that the proposed feature extraction framework provides significant improvements over the bag-of-words representation.
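The encoding and splitting steps described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the four encoding schemes, so the rule used here (comparing the difference between the two occurrence probabilities against a fixed `margin`) is only an assumed stand-in, and the function names, the `margin` parameter, and the probability values are all hypothetical.

```python
def ternary_code(p_pos, p_neg, margin=0.1):
    """Return +1, -1 or 0 for one term.

    Assumed rule for illustration only (not one of the paper's schemes):
    a term supports a class when its occurrence probability in that class
    exceeds the other by more than `margin`.
    """
    diff = p_pos - p_neg
    if diff > margin:
        return 1
    if diff < -margin:
        return -1
    return 0


def split_pattern(pattern):
    """Split a ternary pattern into a positive and a negative binary code."""
    positive = [1 if c == 1 else 0 for c in pattern]
    negative = [1 if c == -1 else 0 for c in pattern]
    return positive, negative


def code_to_index(bits):
    """Read a binary code as an integer, first term as most significant bit."""
    index = 0
    for b in bits:
        index = index * 2 + b
    return index


# Hypothetical (p_pos, p_neg) occurrence probabilities for a three-term termset.
termset_probs = [(0.60, 0.10), (0.20, 0.25), (0.05, 0.40)]

pattern = [ternary_code(p, n) for p, n in termset_probs]
positive, negative = split_pattern(pattern)

print(pattern)                                   # ternary pattern of the termset
print(positive, code_to_index(positive))         # positive code and its index
print(negative, code_to_index(negative))         # negative code and its index
```

Mapping each binary code to an integer index is one way a code could be used as a feature identifier or a histogram bin; whether the paper indexes its features this way is not stated in the abstract.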

Keywords: Local ternary patterns, Feature extraction, Termsets, n-grams, Termset weighting, Text classification

Paper URL: https://doi.org/10.1007/s10489-014-0515-3