Analysis of Code Smells Detection Using Machine Learning and Deep Learning Approaches

Enrique Alejandro Chim Mex; Antonio Armando Aguileta Güemez; Raúl Antonio Aguilar Vera

doi:10.14483/23448350.24617

Authors

Enrique Alejandro Chim Mex, Universidad Autónoma de Yucatán

Competing Interests

Software Engineering.

Antonio Armando Aguileta Güemez, Universidad Autónoma de Yucatán, https://orcid.org/0000-0001-5155-3543

Competing Interests

Software Engineering, Artificial Intelligence.

Raúl Antonio Aguilar Vera, Universidad Autónoma de Yucatán, https://orcid.org/0000-0002-1711-7016

Keywords:

code smells, machine learning, deep learning, software metrics, software engineering (en).

Abstract (en)

Detecting code smells (CS) is important for preventing future problems in software development; it also helps improve software quality and reduce maintenance time. This study presents a systematic experiment that integrates data leakage control, rigorous preprocessing, and a comparison of Machine Learning (ML) and Deep Learning (DL) models, yielding a replicable methodology for CS detection. To this end, an experiment was designed to analyze CS using artificial intelligence approaches. ML and DL models were applied to a dataset of method-level software metrics. The methodological process included comprehensive data processing, covering variable cleaning and normalization, transformations, and feature reduction. In addition, data leakage was controlled to ensure the validity of the results. Six ML models (Random Forest, Support Vector Machine, Decision Tree, K-Nearest Neighbors, Naive Bayes, and Logistic Regression) and a DL model based on a multilayer perceptron (MLP) were trained and evaluated. Most models performed remarkably well, achieving accuracies between 94% and 98% under 10-fold cross-validation. The MLP stood out with an accuracy close to 99%, making it the best-performing classifier for CS detection.
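To make the data leakage control mentioned in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It shows one common way to avoid leakage: the normalization statistics are computed from the training folds only and then applied to the held-out fold, inside a 10-fold cross-validation loop. The synthetic two-feature dataset, all function names, and the choice of a K-Nearest Neighbors classifier with k = 5 are assumptions made purely for illustration.

```python
import random
import statistics


def make_synthetic_metrics(n_per_class=100, seed=0):
    """Build a toy dataset (hypothetical stand-in for method-level metrics)."""
    rng = random.Random(seed)
    rows = []
    # Assumed class centers: (lines of code, cyclomatic complexity).
    for label, (loc_mean, cyclo_mean) in enumerate([(20.0, 3.0), (120.0, 15.0)]):
        for _ in range(n_per_class):
            features = [rng.gauss(loc_mean, 15.0), rng.gauss(cyclo_mean, 2.0)]
            rows.append((features, label))
    rng.shuffle(rows)
    return rows


def fit_scaler(train_rows):
    """Compute per-feature mean and standard deviation from training rows only."""
    columns = list(zip(*(features for features, _ in train_rows)))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) or 1.0 for col in columns]  # guard against zero
    return means, stds


def transform(features, scaler):
    """Standardize one feature vector with previously fitted statistics."""
    means, stds = scaler
    return [(x - m) / s for x, m, s in zip(features, means, stds)]


def knn_predict(train_rows, query, k=5):
    """Majority vote among the k nearest training points (squared Euclidean)."""
    nearest = sorted(
        train_rows,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], query)),
    )[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)


def cross_validate(rows, n_folds=10, k=5):
    """Return mean accuracy over n_folds, fitting the scaler inside each fold."""
    folds = [rows[i::n_folds] for i in range(n_folds)]
    accuracies = []
    for i, test_fold in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        scaler = fit_scaler(train)  # the held-out fold never influences the scaler
        train_scaled = [(transform(f, scaler), y) for f, y in train]
        correct = sum(
            knn_predict(train_scaled, transform(f, scaler), k) == y
            for f, y in test_fold
        )
        accuracies.append(correct / len(test_fold))
    return statistics.fmean(accuracies)


if __name__ == "__main__":
    data = make_synthetic_metrics()
    print(f"10-fold mean accuracy: {cross_validate(data):.3f}")
```

Fitting the scaler on the full dataset before splitting would let statistics from the held-out fold leak into training, which tends to inflate the reported accuracy. The same fold-local fitting applies to any other learned preprocessing step, such as power transformations or feature selection.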

Author Biographies

Enrique Alejandro Chim Mex, Universidad Autónoma de Yucatán

Software Engineer, graduate of the Universidad Autónoma de Yucatán.

Antonio Armando Aguileta Güemez, Universidad Autónoma de Yucatán

Professor at the Universidad Autónoma de Yucatán. PhD in Computer Science.

How to Cite

APA

Chim Mex, E. A., Aguileta Güemez, A. A., & Aguilar Vera, R. A. (2026). Analysis of Code Smells Detection Using Machine Learning and Deep Learning Approaches. Revista Científica, 53(1), e24617. https://doi.org/10.14483/23448350.24617

Vancouver

1.

Chim Mex EA, Aguileta Güemez AA, Aguilar Vera RA. Analysis of Code Smells Detection Using Machine Learning and Deep Learning Approaches. Rev. Cient. [Internet]. 2026 Jun. 19 [cited 2026 Jul. 21];53(1):e24617. Available from: https://revistas.udistrital.edu.co/index.php/revcie/article/view/24617
