Overly optimistic prediction results on imbalanced data: a case study of flaws and benefits when applying over-sampling

Highlights:

• Several studies achieving near-perfect prediction results on the TPEHGDB dataset do this by introducing a methodological flaw in the data processing, in particular in the application of over-sampling to counter class imbalance.

• When reproducing the proposed methods with correct data processing, they often do not perform significantly better than random guessing.

• Over-sampling, when correctly applied, has a noticeable yet more moderate impact on prediction effectiveness.
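The highlights locate the flaw in how over-sampling is applied but do not spell it out. A common form of such a flaw, assumed here, is over-sampling the whole dataset before the test set is split off, so that copies of test samples end up in the training data. The sketch below illustrates this on pure-noise features, where no honest evaluation should beat chance. The dataset sizes, the random-duplication over-sampler, and the 1-nearest-neighbour classifier are all illustrative assumptions, not the data or methods of the studies discussed.

```python
import random

def make_noise_dataset(n_majority, n_minority, n_features, rng):
    """Features are pure noise, so the labels cannot be learned from them."""
    n = n_majority + n_minority
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n)]
    y = [0] * n_majority + [1] * n_minority
    return X, y

def random_oversample(X, y, rng):
    """Duplicate random minority (label 1) samples until classes are balanced."""
    minority = [i for i, label in enumerate(y) if label == 1]
    majority = [i for i, label in enumerate(y) if label == 0]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = list(range(len(y))) + extra
    return [X[i] for i in idx], [y[i] for i in idx]

def split(X, y, test_fraction, rng):
    """Shuffle and split into (X_train, y_train, X_test, y_test)."""
    idx = list(range(len(y)))
    rng.shuffle(idx)
    cut = int(len(idx) * test_fraction)
    test, train = idx[:cut], idx[cut:]
    return ([X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test], [y[i] for i in test])

def predict_1nn(X_train, y_train, x):
    """Label of the closest training sample (squared Euclidean distance)."""
    best = min(range(len(X_train)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(X_train[i], x)))
    return y_train[best]

def balanced_accuracy(X_train, y_train, X_test, y_test):
    """Mean of the per-class recalls on the test set."""
    correct = {0: 0, 1: 0}
    total = {0: 0, 1: 0}
    for x, label in zip(X_test, y_test):
        total[label] += 1
        correct[label] += predict_1nn(X_train, y_train, x) == label
    recalls = [correct[c] / total[c] for c in (0, 1) if total[c]]
    return sum(recalls) / len(recalls)

rng = random.Random(0)
X, y = make_noise_dataset(260, 40, 5, rng)  # arbitrary imbalanced sizes

# Flawed: over-sample first, then split. Duplicates of test samples leak into training.
X_bal, y_bal = random_oversample(X, y, rng)
leaky_score = balanced_accuracy(*split(X_bal, y_bal, 0.3, rng))

# Correct: split first, then over-sample the training portion only.
X_tr, y_tr, X_te, y_te = split(X, y, 0.3, rng)
X_tr, y_tr = random_oversample(X_tr, y_tr, rng)
correct_score = balanced_accuracy(X_tr, y_tr, X_te, y_te)

print(f"over-sample before split: {leaky_score:.2f}")
print(f"over-sample after split:  {correct_score:.2f}")
```

Under these assumptions the first score is inflated only because the classifier has already seen copies of the minority test samples, while the second stays near chance. In practice the same ordering is usually enforced by placing the sampler inside a pipeline that resamples only during fitting, for example the `Pipeline` class in imbalanced-learn.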

Keywords: Preterm birth risk estimation, Over-sampling, Electrohysterography

Article history: Received 15 January 2020, Revised 9 September 2020, Accepted 12 November 2020, Available online 20 November 2020, Version of Record 4 December 2020.

Official paper URL: https://doi.org/10.1016/j.artmed.2020.101987