Issue | SHS Web Conf., Volume 218, 2025: 2025 2nd International Conference on Development of Digital Economy (ICDDE 2025)
---|---
Article Number | 02014
Number of page(s) | 8
Section | Finance Tech Advances: Impacts and Innovations
DOI | https://doi.org/10.1051/shsconf/202521802014
Published online | 03 July 2025

A Comparative Study on Missing Value Imputation Techniques in Machine Learning
College of Liberal Arts & Sciences (LAS), University of Illinois at Urbana-Champaign (UIUC), Urbana, Illinois, 61820, United States
* Corresponding author: haoyum4@illinois.edu
Handling missing values is a crucial step in data preprocessing, because incomplete data can significantly degrade model performance and overall data integrity. This study compares several missing value imputation techniques: deletion methods, simple imputation (mean, median), machine learning-based approaches (k-Nearest Neighbors (k-NN), multiple imputation), and hybrid strategies. The techniques are implemented on an extensive National Football League Play-by-Play dataset and evaluated using Root Mean Squared Error (RMSE) as the primary performance metric. The methodology consists of identifying missing values, applying the different imputation strategies, and assessing their impact on model performance. Experimental results indicate that machine learning-based imputation methods preserve the data distribution better than simple imputation, while hybrid techniques that combine multiple approaches yield the most robust results. The study further highlights that improper handling of missing data can lead to biased insights and reduced predictive accuracy. The findings suggest that deletion, although the simplest method, often results in excessive data loss; that simple imputation introduces bias; and that k-NN and multiple imputation provide superior accuracy and data retention. Future work should explore deep learning-driven imputation and automated techniques such as AutoML-based imputation to improve adaptability across diverse datasets.
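The evaluation loop described in the abstract (identify missing values, apply several imputation strategies, score each with RMSE) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy two-column data, the masked cells, the helper names `column_fill`, `knn_impute`, and `rmse`, and the choice of `k=2` are hypothetical and are not taken from the paper or its NFL dataset. Multiple imputation and the hybrid strategies are not covered here.

```python
import math
import statistics


def column_fill(rows, stat):
    """Simple imputation: replace each None with a per-column statistic
    (e.g. statistics.mean or statistics.median) of the observed values."""
    n_cols = len(rows[0])
    fills = [stat([r[j] for r in rows if r[j] is not None]) for j in range(n_cols)]
    return [[fills[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def knn_impute(rows, k=2):
    """k-NN imputation: fill a missing cell with the average of that column
    over the k nearest rows, using Euclidean distance on shared observed columns.
    A cell stays None if no donor row shares an observed column with it."""
    out = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue
                shared = [c for c in range(len(row))
                          if c != j and row[c] is not None and other[c] is not None]
                if not shared:
                    continue
                dist = math.sqrt(
                    sum((row[c] - other[c]) ** 2 for c in shared) / len(shared))
                donors.append((dist, other[j]))
            donors.sort(key=lambda d: d[0])
            nearest = [val for _, val in donors[:k]]
            if nearest:
                out[i][j] = sum(nearest) / len(nearest)
    return out


def rmse(truth, imputed, cells):
    """Root Mean Squared Error computed only over the artificially masked cells."""
    return math.sqrt(
        sum((truth[i][j] - imputed[i][j]) ** 2 for i, j in cells) / len(cells))


# Hypothetical complete data: second column is a linear function of the first.
truth = [[float(i), 2.0 * i + 1.0] for i in range(10)]

# Hide two known values so the imputed values can be scored against the truth.
masked = [(2, 1), (7, 1)]
observed = [list(r) for r in truth]
for i, j in masked:
    observed[i][j] = None

# Listwise deletion would instead drop every row that contains a None.
rows_after_deletion = [r for r in observed if None not in r]

results = {
    "mean": rmse(truth, column_fill(observed, statistics.mean), masked),
    "median": rmse(truth, column_fill(observed, statistics.median), masked),
    "knn": rmse(truth, knn_impute(observed, k=2), masked),
}

for name, score in results.items():
    print(f"{name}: RMSE = {score:.3f}")
print(f"rows kept by deletion: {len(rows_after_deletion)} of {len(truth)}")
```

On this toy data the k-NN variant recovers the masked values exactly, because the relationship between the columns is linear and each masked row has symmetric neighbours. That is a property of the example, not evidence about the paper's results.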
© The Authors, published by EDP Sciences, 2025
This is an Open Access article distributed under the terms of the Creative Commons Attribution License 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.