Machine learning-based risk prediction model for cardiovascular disease using a hybrid dataset

Authors:

Abstract

Cardiovascular disease (CVD) is one of the most common causes of death worldwide. CVD prediction helps health professionals make informed decisions about their patients’ health. Data mining transforms large amounts of raw medical data into actionable insights that support intelligent forecasts and decisions. Machine learning (ML) based prediction models can support patient diagnosis in the health care industry. The objective of this research is to create a hybrid dataset to aid the development of an improved CVD risk prediction model. The Hungarian, Switzerland, Cleveland, and Long Beach datasets are the most commonly used datasets in heart disease (HD) prediction. These datasets have a maximum of 303 instances and contain missing values in their features, which reduces the accuracy of prediction models. In this article, we therefore created the “Sathvi” dataset by combining these datasets; it has 531 instances and 12 attributes, with no missing data. Pearson’s correlation method was used to eliminate redundant features during feature selection. The Naive Bayes (NB), XGBoost, k-nearest neighbour (k-NN), multilayer perceptron (MLP), support vector machine (SVM), and CatBoost ML classifiers were applied for prediction. The CatBoost classifier was validated with 10-fold cross-validation and achieved the best accuracy, which ranged from 88.67% to 98.11% with a mean of 94.34%.
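The two procedural steps named in the abstract, Pearson-based removal of redundant features and 10-fold cross-validation, can be illustrated with a minimal standard-library sketch. This is not the authors' implementation: the 0.9 correlation threshold, the greedy keep-first strategy, the unshuffled contiguous folds, and the column names in the usage note are all assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a constant column has no defined correlation; treat as 0
    return cov / (norm_x * norm_y)


def drop_redundant(columns, threshold=0.9):
    """Greedily keep a feature only if its absolute correlation with every
    already-kept feature is below `threshold` (assumed value, not from the paper)."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


def kfold_indices(n_samples, k=10):
    """Split range(n_samples) into k contiguous folds whose sizes differ by at most 1.
    No shuffling or stratification is done here; practical pipelines usually add both."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds
```

As a usage example with hypothetical columns, `drop_redundant({"age": [1, 2, 3, 4, 5], "age_months": [12, 24, 36, 48, 60], "chol": [5, 1, 4, 2, 3]})` keeps `age` and `chol` and drops `age_months`, which is perfectly correlated with `age`. Calling `kfold_indices(531, 10)` yields ten test folds covering all 531 instances; each fold in turn serves as the held-out set while the classifier is trained on the remaining nine.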

Keywords: Heart disease, Machine learning classifier, Feature selection, Hybrid heart disease dataset, CatBoost

Article history: Received 27 December 2021, Revised 8 May 2022, Accepted 21 May 2022, Available online 1 June 2022, Version of Record 10 June 2022.

Official article URL: https://doi.org/10.1016/j.datak.2022.102042