Abstract:
Recognizing gender from speech audio can improve human interaction with technology. Although the human ear can readily identify a speaker's gender from the sound of their voice, the task is considerably harder for an artificial intelligence (AI) system. Effective gender classification from speech depends not only on an effective representation of the audio signal but also on robust classification algorithms. In this research, we represent audio samples with Mel Frequency Cepstral Coefficients (MFCCs), chosen for their effectiveness in capturing spectral characteristics that reflect differences in vocal tract structure between males and females; MFCCs are widely regarded as one of the most effective audio representations because they mimic the human auditory system. This study analyzes the effectiveness of several machine learning (ML) and deep learning (DL) methods in classifying gender from speech using MFCC features: support vector machines (SVM), k-nearest neighbors (KNN), a stacking ensemble, and long short-term memory (LSTM) networks. Three speech datasets were used to evaluate these algorithms, and the best accuracies achieved on them were 93.889\%, 99.371\%, and 94.558\%, respectively. Based on these experimental findings, this study proposes a framework for effective gender classification from audio speech.