Properties of the sample estimators used for statistical normalization of feature vectors

Authors: Mikhail Y. Prostov, Maria M. Suarez-Alvarez, Yuriy I. Prostov

Abstract

Normalization of feature vectors is often used as a data preprocessing step for clustering. The authors recently proposed a unified statistical approach to feature vector normalization. After the proposed normalization, the contributions of both numerical and categorical attributes to a specified objective function are statistically the same. Despite the importance of consistency for estimators, the consistency of the sample estimators used for normalization has never been considered. A mathematical justification of the statistical normalization procedure is given here. The sample estimators proposed for normalization of attributes of feature vectors are proven to have desirable properties, namely they are consistent and unbiased. Some other mathematical questions related to clustering are also treated rigorously here. In particular, the statistical normalization procedure is discussed in detail for objective functions based on the Chebyshev, attribute mismatch categorical, and Minkowski mixed \(p\)-metrics. As an application of the normalization procedure, several benchmark datasets are clustered with the non-normalized and the newly introduced normalized mixed metrics, using either the \(k\)-prototypes algorithm (for \(p=2\)) or another algorithm (for \(p \neq 2\)).
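
To make the metrics named in the abstract concrete, the following is a minimal sketch of a generic, non-normalized mixed Minkowski \(p\)-distance. It assumes the common textbook form in which numerical attributes contribute \(|a-b|^p\) and each categorical mismatch contributes 1. The function name `mixed_minkowski` and its signature are hypothetical, and the sketch does not reproduce the authors' normalization estimators, which the abstract does not specify.

```python
import math
from typing import Hashable, Sequence


def mixed_minkowski(
    x_num: Sequence[float],
    y_num: Sequence[float],
    x_cat: Sequence[Hashable],
    y_cat: Sequence[Hashable],
    p: float = 2.0,
) -> float:
    """Generic (non-normalized) mixed p-distance between two feature vectors.

    Assumption: numerical attributes contribute |a - b|**p and every
    categorical mismatch contributes 1 (attribute mismatch metric).
    """
    if len(x_num) != len(y_num) or len(x_cat) != len(y_cat):
        raise ValueError("feature vectors must have matching lengths")
    if p < 1:
        raise ValueError("p must be >= 1 for a Minkowski metric")

    # Absolute differences of the numerical attributes.
    num_terms = [abs(a - b) for a, b in zip(x_num, y_num)]
    # Number of categorical attributes whose values differ.
    mismatches = sum(1 for a, b in zip(x_cat, y_cat) if a != b)

    if math.isinf(p):
        # Chebyshev limit: the largest single attribute contribution.
        return max(num_terms + [1.0 if mismatches else 0.0])
    return (sum(t ** p for t in num_terms) + mismatches) ** (1.0 / p)
```

For example, with numerical parts `[0.0, 3.0]` and `[4.0, 3.0]` and categorical parts `["a", "b"]` and `["a", "c"]`, the call with `p=1` returns `5.0` and the call with `p=math.inf` returns `4.0`. In this unnormalized form the numerical attribute dominates, which is the kind of imbalance a statistical normalization is meant to remove.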

Keywords: Estimators, Normalization, Minkowski metrics, Mixed databases

Paper URL: https://doi.org/10.1007/s10618-014-0395-5