Use of Data Mining for Intelligent Evaluation of Imputation Methods.

David L. la Red Martínez; Carlos R. Primorac

doi:10.9781/ijimai.2023.03.002

Authors

David L. la Red Martínez, National Technological University, Resistencia Regional Faculty.
Carlos R. Primorac, Computer Science Department, National University of the Northeast.

Keywords:

Computer Science, Imputation, Data Mining, Interdisciplinary Applications, Performance Evaluation

Supporting Agencies

This work was developed in the context of the research project with code SIUTIRE0005231TC of the Resistencia Regional Faculty of the National Technological University, Argentina. We would like to thank the Co-Director of this project, Dr. Marcelo Karanik, for reviewing this work; Dr. Jorge Emilio Monzón, for reviewing the English version; and the scholarship holder, student Alejandro Nadal, for his effort and dedication to the multiple data mining processes.

Abstract

In real-world situations, researchers frequently face the problem of missing values (MV), i.e., values that are not observed in a dataset. Data imputation techniques estimate MV using different algorithms, so that important data can be imputed for a particular instance. Most of the literature in this field deals with different imputation methods. However, few studies offer a comparative evaluation of these methods so as to provide more appropriate guidelines for selecting the method to apply when imputing data in specific situations. The objective of this work is to present a methodology for evaluating the performance of imputation methods by means of new metrics derived from the quality metrics of data mining models. We started from a complete dataset, which was amputated using different amputation mechanisms to generate 63 datasets with MV; these were imputed using the Median, k-NN, k-Means and Hot-Deck imputation methods. The performance of the imputation methods was evaluated using new metrics derived from the quality metrics of the data mining processes performed on the original complete file and on the imputed files. This evaluation is not based on measuring the imputation error (the usual approach), but on the similarity between the values of the quality metrics of the data mining processes obtained with the original file and those obtained with the imputed files. The results show that, considered globally and according to the newly proposed metric, the imputation methods with the best performance were k-NN and k-Means. An additional advantage of the proposed methodology is that it provides predictive data mining models that can be used a posteriori.
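
As a rough illustration of the evaluation idea described in the abstract, the following minimal Python sketch amputates a complete dataset, imputes it with two of the methods named above (Median and k-NN), and compares a model quality metric obtained on the original data with the same metric obtained on each imputed version. Everything specific in it is an assumption made for illustration and is not taken from the paper: the synthetic data, the completely-at-random amputation of at most one value per row, the nearest-centroid classifier, accuracy as the quality metric, and the hypothetical `gap` score (absolute difference in accuracy). k-Means and Hot-Deck imputation are omitted for brevity.

```python
import random
import statistics

def make_dataset(n=200, seed=0):
    """Synthetic complete dataset (assumption): two numeric features, binary label."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        # Class 1 is shifted so that a simple classifier can separate the classes.
        rows.append(([rng.gauss(2.0 * label, 1.0), rng.gauss(-1.5 * label, 1.0)], label))
    return rows

def amputate_mcar(rows, rate=0.2, seed=1):
    """Remove at most one value per row, completely at random; None marks an MV."""
    rng = random.Random(seed)
    out = []
    for features, label in rows:
        features = list(features)
        if rng.random() < rate:
            features[rng.randrange(len(features))] = None
        out.append((features, label))
    return out

def impute_median(rows):
    """Replace each MV with the median of the observed values in its column."""
    n_cols = len(rows[0][0])
    medians = [statistics.median(f[j] for f, _ in rows if f[j] is not None)
               for j in range(n_cols)]
    return [([medians[j] if v is None else v for j, v in enumerate(f)], y)
            for f, y in rows]

def impute_knn(rows, k=5):
    """Replace each MV with the mean of that column over the k nearest complete rows."""
    donors = [f for f, _ in rows if None not in f]
    out = []
    for f, y in rows:
        observed = [j for j, v in enumerate(f) if v is not None]
        # Squared Euclidean distance, computed only on the observed features.
        nearest = sorted(donors,
                         key=lambda d: sum((d[j] - f[j]) ** 2 for j in observed))[:k]
        filled = [statistics.mean(d[j] for d in nearest) if v is None else v
                  for j, v in enumerate(f)]
        out.append((filled, y))
    return out

def accuracy_nearest_centroid(rows, train_fraction=0.7):
    """Quality metric (assumption): hold-out accuracy of a nearest-centroid classifier."""
    split = int(len(rows) * train_fraction)
    train, test = rows[:split], rows[split:]
    centroids = {}
    for label in {y for _, y in train}:
        members = [f for f, y in train if y == label]
        centroids[label] = [statistics.mean(col) for col in zip(*members)]

    def predict(f):
        return min(centroids,
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(f, centroids[c])))

    return sum(predict(f) == y for f, y in test) / len(test)

complete = make_dataset()
baseline = accuracy_nearest_centroid(complete)
amputated = amputate_mcar(complete)
imputed = {"median": impute_median(amputated), "knn": impute_knn(amputated)}

# Hypothetical score: the smaller the gap, the closer the model built on the
# imputed data behaves to the model built on the original complete data.
gaps = {name: abs(baseline - accuracy_nearest_centroid(data))
        for name, data in imputed.items()}
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: accuracy gap = {gap:.3f}")
```

In this sketch a method is ranked higher when its gap is smaller, which mirrors the abstract's point that the comparison rests on how similar the model quality metrics are, not on the error of the imputed values themselves.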
