The impact of SMOTE and hyperparameter tuning on Random Forest for predicting student attrition
DOI:
https://doi.org/10.52465/josre.v4i2.7
Keywords:
Student attrition, Random forest, SMOTE, Hyperparameter tuning, Machine Learning
Abstract
Student attrition remains a major challenge in higher education, requiring early intervention for at-risk students. This study examines the effect of the Synthetic Minority Oversampling Technique (SMOTE) and hyperparameter tuning on Random Forest performance for predicting student attrition. The data comprise 4,424 student records from the UCI Predict Students’ Dropout and Academic Success dataset, with the target variable recoded as a binary label (Dropout vs. Non-Dropout). Four Random Forest models were evaluated: RF-Baseline, RF-SMOTE, RF-Tuned, and RF-SMOTE-Tuned. Data were split into 80% training and 20% testing sets, and hyperparameters were optimized using Grid Search with 5-fold Stratified Cross-Validation. Performance was measured using accuracy, precision, recall, F1-score, and AUC. The RF-SMOTE-Tuned model achieved the best results, with an accuracy of 0.8825 and an AUC of 0.9314. The results show that SMOTE improved minority-class detection, while hyperparameter tuning increased model stability. Feature importance analysis identified approved curricular units, semester grades, and tuition fee status as the strongest predictors of student attrition.
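
The oversampling step named in the abstract can be illustrated with a minimal sketch of the core SMOTE idea: each synthetic minority sample is placed at a random point on the line segment between a real minority sample and one of its k nearest minority neighbours. The abstract does not say which implementation the authors used, so the function name `smote`, its parameters, and the toy data below are assumptions for illustration only. The sketch also simplifies the original algorithm by picking base samples at random instead of iterating over every minority sample.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    Hypothetical helper for illustration; not the authors' implementation.
    minority: list of equal-length numeric tuples (minority-class rows).
    """
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    if k < 1:
        raise ValueError("need at least two minority samples")

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (o - b) for b, o in zip(base, other))
        )
    return synthetic


# Toy minority class with two features (made-up values, not study data)
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote(minority, n_new=6, k=2)
```

Because every synthetic point is an interpolation between two real minority samples, the new points stay inside the region the minority class already occupies. Oversampling of this kind is normally applied to the training split only, so that the test set keeps the original class distribution.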