Comparison of the Performances of Machine Learning Algorithms Using WEKA Feature Selection Methods



Authors

  • Zeynep Güven Aydın, Dogus University
  • Rüya Şamlı, İstanbul Üniversitesi-Cerrahpaşa, Faculty of Engineering, Department of Computer Engineering, Istanbul/Türkiye

DOI:

https://doi.org/10.5281/zenodo.14568594

Keywords:

Machine learning algorithms, software defect prediction, feature selection, accuracy score, WEKA

Abstract

In this study, feature selection, a topic featured in many publications on software fault prediction in the literature, is investigated. Feature selection is generally used to increase classifier accuracy by removing irrelevant and redundant features from datasets. Various feature selection methods were tested on NASA datasets and an experimental dataset, and the experiments were then carried out with the two most suitable methods, CfsSubsetEval and Principal Components. On this basis, the study determines which classification algorithms achieve the highest success rates.

When the obtained results were examined, accuracy rates generally improved, while some algorithms showed only a minimal difference. After testing different feature selection methods on the JM1, KC1, CM1, and PC1 datasets, the 22 features present in all datasets were reduced to 8 by applying the most suitable methods, CfsSubsetEval and Principal Components. The accuracy rates of 46 classification algorithms were then calculated on the WEKA platform. The largest changes in accuracy across all datasets were observed with the Bayes Net, Voted Perceptron, K*, and Random Forest algorithms.
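WEKA's CfsSubsetEval scores candidate feature subsets with Hall's correlation-based merit, which rewards features that correlate with the class but not with each other. As a minimal illustration of that heuristic (a sketch only, not WEKA's implementation: plain Pearson correlation stands in here for the symmetrical-uncertainty measure WEKA uses on discretized attributes):

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def cfs_merit(features, target):
    """CFS merit of a subset: k*r_cf / sqrt(k + k*(k-1)*r_ff),
    where r_cf is the mean feature-class correlation and
    r_ff the mean feature-feature inter-correlation."""
    k = len(features)
    r_cf = sum(abs(pearson(f, target)) for f in features) / k
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    r_ff = (sum(abs(pearson(features[i], features[j])) for i, j in pairs)
            / len(pairs)) if pairs else 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

# Toy data: a feature identical to the class outscores a weakly related one.
defect = [0, 1, 0, 1, 1, 0]
relevant = [0, 1, 0, 1, 1, 0]
weak = [1, 1, 0, 0, 1, 0]
print(cfs_merit([relevant], defect), cfs_merit([weak], defect))
```

A forward (greedy) search over subsets maximizing this merit is what reduces a 22-feature dataset to a small subset such as the 8 features reported above.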

It was observed that the loc, n, v, and defect software metrics were retained by every feature selection method applied to the NASA datasets and the experimental dataset, and should therefore always be included. Clearly, loc (lines of code), n (the total number of distinct operators and distinct operands), v (program volume), and defect (whether a fault is present) are key features among the software metrics that make up each dataset.
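The n and v metrics above are Halstead measures derived from operator and operand counts. A small sketch of those formulas (the counts n1, n2, N1, N2 in the example are made-up values for illustration, not taken from the datasets):

```python
import math

def halstead(n1, n2, N1, N2):
    """Halstead metrics from token counts:
    n1, n2 = distinct operators / distinct operands,
    N1, N2 = total operator / operand occurrences."""
    n = n1 + n2            # vocabulary: distinct operators + operands
    N = N1 + N2            # program length: total token occurrences
    v = N * math.log2(n)   # program volume in bits
    return {"n": n, "N": N, "v": v}

# Hypothetical module with 4 distinct operators used 10 times
# and 5 distinct operands used 12 times.
m = halstead(n1=4, n2=5, N1=10, N2=12)
print(m)
```

Together with loc, these counts are the raw inputs from which the remaining Halstead-based metrics in the NASA datasets are computed.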


Published

2024-12-29

How to Cite

Güven Aydın, Z., & Şamlı, R. (2024). Comparison of the Performances of Machine Learning Algorithms Using WEKA Feature Selection Methods. Euroasia Journal of Mathematics, Engineering, Natural & Medical Sciences, 11(37), 1–13. https://doi.org/10.5281/zenodo.14568594

Issue

11(37)

Section

Articles