Deep neural network framework and transformed MFCCs for speaker's age and gender classification

Abstract

Speaker age and gender classification is one of the most challenging problems in speech processing. Although many studies have focused on improving feature extraction and classifier design, classification accuracies are still not satisfactory. The key issue in identifying a speaker's age and gender is to generate robust features and to design a deep classifier. Age and gender information is concealed in the speaker's speech, which is affected by many factors, such as background noise, speech content, and phonetic divergences. The success of the deep neural network (DNN) architecture in many applications motivated this work to propose a new speaker age and gender classification system that uses a bottleneck feature (BNF) extractor together with a DNN. This work makes two major contributions: the introduction of shared class labels among misclassified classes to regularize the weights in the DNN, and the generation of a transformed Mel frequency cepstral coefficient (MFCC) feature set. The proposed system uses the Hidden Markov model toolkit (HTK) to find tied-state triphones for all utterances, and these are used as labels for the output layer of the DNNs for the first time in age and gender classification. The BNF extractor is used to generate the transformed MFCC features. The new features are evaluated with two classifiers, DNN and I-Vector. The transformed MFCCs are observed to be more effective than the traditional MFCCs in speaker age and gender classification. By using the transformed MFCCs, the overall classification accuracies are improved by about 13%.
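The bottleneck feature idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the layer sizes, the sigmoid activation, the random (untrained) weights, and the helper names `make_layer`, `forward_layer`, and `bottleneck_features` are all assumptions made for illustration. The sketch only shows the general mechanism, in which each input frame is propagated through a network up to a narrow hidden layer, and the activations of that layer are kept as the transformed features.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Return (weights, biases); random values stand in for trained parameters."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward_layer(vec, layer):
    """Apply one fully connected layer followed by a sigmoid."""
    weights, biases = layer
    return [
        sigmoid(sum(w * v for w, v in zip(row, vec)) + b)
        for row, b in zip(weights, biases)
    ]


def bottleneck_features(frames, layers, bottleneck_index):
    """Propagate each frame up to and including layers[bottleneck_index]
    and return the activations of that layer as the new feature vectors."""
    features = []
    for frame in frames:
        act = frame
        for layer in layers[: bottleneck_index + 1]:
            act = forward_layer(act, layer)
        features.append(act)
    return features


rng = random.Random(0)

# Hypothetical sizes: 39-dim input frames, a 13-unit bottleneck, and
# 100 output units standing in for tied-state triphone labels.
sizes = [39, 64, 13, 64, 100]
layers = [make_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

# Five synthetic frames in place of real MFCC vectors.
frames = [[rng.gauss(0.0, 1.0) for _ in range(39)] for _ in range(5)]

# layers[1] maps 64 -> 13 units, so it is the bottleneck in this sketch.
transformed = bottleneck_features(frames, layers, bottleneck_index=1)
print(len(transformed), len(transformed[0]))
```

In a trained system the layers after the bottleneck would be needed only during training, where the output layer is fit to the frame labels; at feature extraction time they are discarded, as in the sketch.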

Keywords: Deep neural network; DNN; I-Vector; MFCCs; Speaker age and gender classification

Abbreviations: DNN, Deep neural network; aGender, Age-annotated database of German telephone speech; HTK, Hidden Markov model toolkit; MFCCs, Mel frequency cepstral coefficients; RBM, Restricted Boltzmann machine; DBN, Deep belief networks; GMM, Gaussian mixture models; SVM, Support vector machines; MLLR, Maximum likelihood linear regression; TPP, Tandem posterior probability; UBM, Universal background model; PPR, Parallel phoneme recognizer; MAP, Maximum-a-posteriori; BNF, Bottle-neck feature; BB-RBM, Bernoulli-Bernoulli RBM; GB-RBM, Gaussian-Bernoulli RBM