Hearing vocals to recognize schizophrenia: speech discriminant analysis with fusion of emotions and features based on deep learning

Abstract

Background and objective

Accurate detection of schizophrenia, a complex and heterogeneous mental disorder, poses a grand challenge. Current diagnostic criteria rely primarily on clinical symptoms, which may not fully capture individual differences and the heterogeneity of the disorder. In this study, a deep-learning-based discriminative model of schizophrenic speech is developed that combines different emotional stimuli and features.

Methods

A total of 156 schizophrenia patients and 74 healthy controls participated in the study, reading three fixed texts with varying emotional stimuli. The log-Mel spectrogram and Mel-frequency cepstral coefficients (MFCCs) were extracted using the librosa-0.9.2 toolkit. Convolutional neural networks were applied to analyze the log-Mel spectrogram. The effects of different emotional stimuli and the fusion of demographic information and MFCCs on schizophrenia detection were examined.

Results

The discriminant analysis results showed superior performance for neutral emotional stimuli compared to positive and negative stimuli. Integrating different emotional stimuli and fusing features with personal information improved sensitivity and specificity. The best discriminant model achieved an accuracy of 91.7%, sensitivity of 94.9%, specificity of 85.1%, and ROC-AUC of 0.963.

Conclusions

Speech analysis under neutral emotional stimulation demonstrated greater differences between schizophrenia patients and healthy controls, enhancing discriminative analysis of schizophrenia. Integrating different emotions, demographic information and MFCCs improved the accuracy of schizophrenia detection. This study provides a methodological foundation for constructing a personalized speech detection model for schizophrenia.

Introduction

Schizophrenia has a profound impact on both patients and their families, imposing a significant healthcare burden, with a lifetime prevalence of approximately 1% [1]. According to survey data published in 2015, the prevalence of schizophrenia in China increased from 0.57% to 0.83% in urban areas and from 0.37% to 0.50% in rural areas between 1990 and 2010 [2]. Despite extensive research efforts, the etiology and pathological mechanisms of schizophrenia remain incompletely understood. Schizophrenia exhibits considerable heterogeneity, and patients may present with diverse symptoms and clinical manifestations [3,4,5,6]. The detection of schizophrenia is therefore a complex process that requires careful consideration of the patient's symptoms, medical history, and physical and laboratory examinations [7, 8]. Even when employing the diagnostic criteria outlined in the Diagnostic and Statistical Manual of Mental Disorders (DSM-IV), psychiatric specialists can still make diagnostic errors [9]. Furthermore, psychiatric diagnosis is time-consuming and relies heavily on the experience and knowledge of psychiatrists. Robust diagnostic models built on stable biomarkers or easily obtainable characteristics would greatly alleviate the burden on physicians and patients alike.

Speech analysis has emerged as a promising diagnostic tool for schizophrenia, as it reveals notable differences in speech patterns between patients with schizophrenia (SZ) and healthy controls (HC). These differences are closely related to the symptom characteristics of schizophrenia and manifest in multiple ways. Patients with negative symptoms, such as alogia and emotional flattening, show slower speaking speed, longer pauses, poorer fluency, and monotonous rhythm [10,11,12,13], reflecting decreased motivation and emotional expression. Positive symptoms typically appear in speech as disorganized content and a lack of thematic coherence [14, 15], and are closely related to thought disorder. Difficulty in finding words and producing specific words [16,17,18,19] may indicate cognitive impairment. Furthermore, abnormalities in speech rhythm and intonation, characterized by erratic changes and instability, have been observed in schizophrenia patients [20, 21]. These are not merely artefacts of pathology but may reflect disrupted neural integration of linguistic and emotional processing. Such speech abnormalities can be quantified as features through speech analysis, and these features are often combined with machine learning algorithms to aid detection of schizophrenia and prediction of clinical symptom severity. For instance, Chakraborty et al. [22] achieved accuracies of 61–85% in predicting the severity of schizophrenia-related emotional symptoms from low-level prosodic acoustic signals. Xu et al. [23] employed conversational, phonatory, articulatory, and prosodic features obtained from interviews to predict negative symptoms in SZ, achieving 69–75% accuracy in differentiating schizophrenia from depression. The rise of deep learning has further enabled end-to-end automated prediction. Deep learning approaches often use speech spectrograms as the input to neural networks, surpassing traditional machine learning methods in recognition accuracy; spectrograms allow signal intensity changes to be observed over time across frequency bands. Voppel et al. [24] improved the identification accuracy of schizophrenia spectrum disorders from 80% to 85% by integrating acoustic and semantic features. Fu et al. [25] designed Sch-Net for end-to-end speech detection and analysis, achieving an accuracy of 97.76% on a schizophrenia speech dataset. He et al. [26] used wideband and narrowband spectrograms as inputs to WNSA-Net, achieving an accuracy of 97.37% on a schizophrenia dataset.

Deep learning has the potential to reveal the distinctive vocal patterns of patients with schizophrenia from spectrograms, but individual vocal differences are not driven by disease factors alone. For example, the vocal cords and laryngeal muscles tend to atrophy with age, leading to a raspy, low-pitched voice [27]. There are also significant differences in voice between the sexes: differences in the size and position of the larynx and in vocal cord length give males a lower, thicker voice than females [28, 29]. Another potential confounder is education, which primarily affects pronunciation accuracy and fluency [30, 31]. However, even individuals of the same age, sex, and education level may still produce different sounds because of differences in their vocal tracts. Mel-frequency cepstral coefficients (MFCCs) are a feature representation of speech signals that captures information related to timbre and pitch. They are obtained from the log-Mel spectrum and reflect the individual characteristics of the vocal tract [32, 33]. MFCCs are commonly used in the acoustic analysis of speech because they encode both the low-frequency envelope and the high-frequency details of the spectrum. They are also used in speaker recognition, as they carry individual attribute information [34], and they differ significantly between SZ and HC [35,36,37]. Additionally, patients with SZ often have impaired emotional perception and expression [38], leading to subtle differences in speech quality across emotions. These factors may affect the accuracy and stability of diagnostic models for heterogeneous diseases like schizophrenia.

Therefore, it is crucial to consider individual speech differences when developing a speech detection model for schizophrenia. A common approach to controlling these factors is a matched study design. In practice, however, simultaneously controlling multiple confounding factors is challenging, especially when they interact or are related in complex ways. Can integrating individual-difference variables into the construction of the detection model reduce the interference of these extraneous variables and achieve more accurate detection?

In this study, to avoid the influence of semantic interference and explicit emotional expression, we chose fixed texts with positive, neutral, and negative emotions as speech materials. The log-Mel spectrum served as the input, the convolutional layers of ResNet18 were employed as the feature extractor for deep features, and the fully connected layer was reconstructed to integrate MFCCs, demographic information, and the deep features. To mitigate the variability of emotional expression in speech, the discrimination results for the three emotional texts were weighted and integrated. This study aimed to explore the effects of different emotional stimuli, demographic information, and MFCCs on the detection of schizophrenia through speech, providing a methodological foundation for constructing a personalized speech detection model for schizophrenia.

Methods

Participant selection criteria

This study enrolled a total of 230 participants, including 156 SZ (78 males, 78 females) and 74 HC (28 males, 46 females). Prior to enrollment, all participants underwent screening by an experienced psychiatrist and received detailed explanations about the study’s objectives and procedures. The demographic information of the participants is presented in Table 1.

Table 1 Demographic characteristics of the two groups

The inclusion criteria for the patient group were as follows: (1) diagnosis of schizophrenia based on the DSM-IV; (2) age between 18 and 60 years; (3) completion of at least a junior high school education; (4) a stable condition, with no risk of aggression, on a fixed antipsychotic medication regimen; (5) ability to cooperate in completing the evaluation. The exclusion criteria for the patient group were: (1) a diagnosed neurological disorder of the brain; (2) a severe physical illness; (3) acute nasopharyngeal or respiratory disorders affecting vocal cord function; (4) auditory or visual perception impairments.

The healthy control group was recruited from hospital staff members and their relatives, selected according to the following inclusion criteria: (1) normal mental functioning, with no mental health or behavioral issues identified through doctor interviews; (2) no history of psychiatric treatment or consultations and no known familial predisposition to psychiatric disorders; (3) age between 18 and 60 years; (4) at least a junior high school education and fluency in reading Chinese; (5) scores below 4 on both the Generalized Anxiety Disorder Screener (GAD-7) and the Patient Health Questionnaire-9 (PHQ-9). Exclusion criteria for the healthy controls mirrored those of the patient group: (1) a diagnosed neurological disorder of the brain; (2) a serious physical illness; (3) acute nasopharyngeal or respiratory disorders affecting the vocal cords; (4) auditory or visual perception impairments.

The study received approval from the Ethics Committee of Beijing Huilongguan Hospital, and all participants provided written informed consent. The study adhered to the principles outlined in the Declaration of Helsinki.

Experimental task

Participants were instructed to read three Chinese texts displayed on a computer screen. The texts, selected based on previous research [39], comprised one positive text, one negative text, and one neutral text. The positive text depicted a joyful family reunion during the Spring Festival, the negative text portrayed various hardships in life, and the neutral text introduced the Lugou Bridge in Beijing. English translations of the texts are presented in Fig. 1 of the supplementary material. The speech durations of the two groups reading the three texts are shown in Table 1.

Fig. 1. Flow chart of processing

Data acquisition and preprocessing

General demographic information, including age, sex, and years of education, was collected from all participants. For the patient group, age at onset and disease duration were also recorded, and patients completed the Positive and Negative Syndrome Scale (PANSS) [40]. All participants were tested in a quiet room with background noise below 30 dB. Voice recordings were captured using a fixed ISK BM-5000 microphone at a sampling frequency of 44.1 kHz and a bit rate of 1411 kbps. Participants were instructed to read each of the three emotional texts in their natural tone of voice. Three audio signals were recorded per participant, and each signal file was processed individually. The signal processing flowchart is illustrated in Fig. 1. Audio processing was implemented with the librosa-0.9.2 toolkit [41]. Each recording was segmented using a window length of 4 s and a step length of 2 s. Next, pre-emphasis with a coefficient of 0.97 was applied. Pre-emphasis enhances the high-frequency components of the speech signal through a first-order high-pass filter, $S'(n) = S(n) - \alpha \cdot S(n-1)$. The value $\alpha = 0.97$ is commonly used in speech analysis; it effectively balances the spectral energy distribution and improves the robustness of subsequent feature extraction. Subsequently, a short-time Fourier transform was performed to obtain a spectrogram. The modulus of the spectrogram was squared and Mel filtering was applied; the resulting power spectrogram was then converted to decibel (dB) units, yielding the log-Mel spectrogram. From the log-Mel spectrogram, 40-dimensional MFCCs were extracted using cepstral analysis.
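
The preprocessing pipeline above can be sketched with librosa as follows; the STFT window size, hop length, and number of Mel bands are illustrative assumptions, since the paper reports only the 4 s window, 2 s step, the 0.97 pre-emphasis coefficient, and the 40 MFCC dimensions:

```python
import librosa

def extract_features(path, sr=44100, seg_sec=4, step_sec=2,
                     n_fft=2048, hop_length=512, n_mels=128, n_mfcc=40):
    """Segment a recording, then compute log-Mel spectrograms and MFCCs.

    n_fft, hop_length, and n_mels are illustrative defaults, not values
    taken from the paper.
    """
    y, sr = librosa.load(path, sr=sr)
    seg_len, step = seg_sec * sr, step_sec * sr
    segments = [y[i:i + seg_len] for i in range(0, len(y) - seg_len + 1, step)]
    features = []
    for seg in segments:
        seg = librosa.effects.preemphasis(seg, coef=0.97)      # S'(n) = S(n) - 0.97*S(n-1)
        mel = librosa.feature.melspectrogram(y=seg, sr=sr, n_fft=n_fft,
                                             hop_length=hop_length, n_mels=n_mels)
        log_mel = librosa.power_to_db(mel)                     # power spectrogram -> dB
        mfcc = librosa.feature.mfcc(S=log_mel, n_mfcc=n_mfcc)  # cepstral analysis of log-Mel
        features.append((log_mel, mfcc))
    return features
```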

Deep learning tasks

Convolutional neural networks

The deep feature extractor employed in this study was the convolutional portion of ResNet18 [42]. To accommodate the varying lengths of the fused information, the fully connected layer was reconstructed, producing three additional models. The architectures of the four models are shown in Fig. 2: ResNet18 (input: speech spectrogram), ResNet18_ASE (input: speech spectrogram plus age, sex, and education (ASE)), ResNet18_MFCC (input: speech spectrogram and MFCCs), and ResNet18_ASE_MFCC (input: speech spectrogram, age, sex, education, and MFCCs).

Fig. 2. Architecture of ResNet18, ResNet18_ASE, ResNet18_MFCC, and ResNet18_ASE_MFCC
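
A minimal sketch of this fusion design in PyTorch, assuming the extra vector (ASE values and/or MFCC statistics) is concatenated with the globally pooled ResNet18 features before a rebuilt classifier head; the hidden width and head depth are assumptions, as the paper does not specify the sizes of the reconstructed layers:

```python
import torch
import torch.nn as nn
from torchvision import models

class ResNet18Fusion(nn.Module):
    """ResNet18 convolutional backbone with a rebuilt fully connected head
    that fuses deep spectrogram features with an extra feature vector
    (e.g., age/sex/education and/or MFCC statistics)."""

    def __init__(self, extra_dim: int, hidden: int = 128):
        super().__init__()
        backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
        # Keep everything up to and including global average pooling.
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        # Reconstructed head: 512 deep features + extra_dim fused features.
        self.head = nn.Sequential(
            nn.Linear(512 + extra_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),   # logit; sigmoid gives the SZ probability
        )

    def forward(self, spectrogram: torch.Tensor, extra: torch.Tensor) -> torch.Tensor:
        x = self.features(spectrogram).flatten(1)   # (B, 512)
        x = torch.cat([x, extra], dim=1)            # fuse deep + individual features
        return self.head(x)

# e.g., a ResNet18_ASE_MFCC-style model: 3 demographic values + 40 MFCCs
model = ResNet18Fusion(extra_dim=3 + 40)
```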

Details of training

The deep learning models were trained with PyTorch 1.10.1+cu102 in Python 3.8.0. The ResNet18 convolutional layers were initialized by transfer learning from ImageNet, and all model parameters were subsequently retrained on our dataset. In the data preprocessing stage, all Mel spectrograms were resized to 224 × 224 and normalized using the ImageNet mean and variance. To enhance model stability, time and frequency masking were applied during training with a probability of 0.5, as illustrated in Fig. 2a of the supplementary materials. A focal loss [43] with gamma = 0 and alpha = 0.333 was used to address class imbalance. The stochastic gradient descent (SGD) optimizer was used with warm-up and a cosine annealing schedule. Training ran for 100 epochs, with the first 10 epochs serving as the warm-up phase, followed by 30 epochs of cosine annealing over a learning rate range of 1.0E-4 to 0.1, as shown in Fig. 2b of the supplementary materials. To obtain robust estimates of model performance, we applied individual-based five-fold cross-validation. The numbers of participants and samples in the training, validation, and test sets for five-fold cross-validation are presented in Table 1 of the supplementary materials.
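
The described schedule can be sketched as below; holding the minimum learning rate after the 30 annealing epochs is an assumption, as the paper does not state what happens in the remaining epochs:

```python
import math

def learning_rate(epoch, warmup=10, anneal=30, lr_min=1e-4, lr_max=0.1):
    """Linear warm-up for `warmup` epochs, then cosine annealing over
    `anneal` epochs from lr_max down to lr_min (held thereafter)."""
    if epoch < warmup:
        return lr_min + (lr_max - lr_min) * (epoch + 1) / warmup
    t = min(epoch - warmup, anneal) / anneal       # progress through annealing
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))
```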

Evaluation metrics

The performance metrics were accuracy, balanced accuracy, sensitivity, specificity, and the area under the receiver operating characteristic curve (AUC). Notably, the final test outcome classified speakers rather than individual speech segments. Each participant's recording was segmented into four-second intervals, and the neural network's output denoted the likelihood that a given segment originated from a schizophrenic individual. During training and validation, model performance was gauged on segments. For testing, a speaker-independent strategy was adopted: all segments belonging to each subject in the test set were fed through the model, and each volunteer's final assessment was derived by averaging the probability scores of all their speech segments. This provided an aggregated evaluation of the speech patterns indicative of schizophrenia.
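
A minimal sketch of this speaker-level aggregation, assuming segment probabilities are grouped by subject ID and the averaged probability is thresholded at 0.5 (the threshold value is an assumption):

```python
import numpy as np

def speaker_predictions(segment_probs, threshold=0.5):
    """Aggregate per-segment SZ probabilities into one decision per speaker.

    segment_probs: dict mapping subject ID -> list of segment probabilities.
    Returns: dict mapping subject ID -> (mean probability, predicted label).
    """
    return {
        subject: (float(np.mean(probs)), int(np.mean(probs) >= threshold))
        for subject, probs in segment_probs.items()
    }

# Example: two test subjects with three segments each.
print(speaker_predictions({"s01": [0.9, 0.8, 0.7], "s02": [0.2, 0.4, 0.1]}))
```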

Statistical analysis

The diagnosis of schizophrenia served as the dependent variable, and sex, age, and years of education were treated as independent variables. Years of education were missing for four participants, two in each group. For each participant with missing data, we searched among same-sex members of the same group for individuals of identical age and used their mean years of education as the imputed value. If no match was found at the same age, the allowed age difference was relaxed year by year until a suitable donor was found. For the categorical variable sex, a chi-square test was used for comparative analysis, and independent-sample t-tests were conducted for the continuous variables age and years of education. Additionally, we computed the mean and standard deviation of the age at onset, disease duration, and PANSS scores for the schizophrenia group only. Statistical analyses were performed with IBM SPSS Statistics version 26.0 (IBM Corp., Armonk, NY, USA). A significance level of p < 0.05 was set for all hypothesis tests.
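
This matching-based imputation can be sketched as follows; the DataFrame column names (group, sex, age, education) are illustrative assumptions:

```python
import pandas as pd

def impute_education(df: pd.DataFrame, idx) -> float:
    """Hot-deck imputation: average education of same-group, same-sex
    participants, starting at the same age and widening year by year."""
    row = df.loc[idx]
    donors = df[(df["group"] == row["group"]) &
                (df["sex"] == row["sex"]) &
                df["education"].notna()]
    if donors.empty:
        raise ValueError("no donors available for imputation")
    tolerance = 0
    while True:
        pool = donors[(donors["age"] - row["age"]).abs() <= tolerance]
        if len(pool) > 0:
            return float(pool["education"].mean())
        tolerance += 1   # relax the allowed age difference year by year
```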

Results

General demographic data of the two groups

There were no significant differences in age, education, or sex distribution between the groups (Table 1). In the patient group, the mean age of onset and disease duration were 23.961 ± 7.457 years and 19.313 ± 11.587 years, respectively. The mean PANSS scores were 12.736 ± 5.390 (positive), 18.560 ± 6.528 (negative), 29.120 ± 7.550 (general), and 60.416 ± 14.400 (total). The control group had mean scores of 1.105 ± 1.658 on the GAD-7 and 1.324 ± 2.067 on the PHQ-9.

Classification results of feature fusion

The discrimination results under all conditions are presented in Table 2 of the supplementary materials, which summarizes the speech classification performance of the different models across the emotional texts. Accuracy served as the primary criterion for comparing models: higher accuracy indicated better overall performance on the schizophrenia classification task. Balanced accuracy complemented this primary metric by accounting for class imbalance in the dataset.

Figure 3 illustrates the classification effectiveness of the log-Mel spectrogram combined with different personalized features under each emotional condition and under emotion fusion. For positive emotional stimuli, the ResNet18_MFCC model achieved the highest accuracy, 89.6%. For neutral and negative emotions, the ResNet18_ASE_MFCC model consistently outperformed the others, attaining accuracies of 89.6% and 89.1%, respectively. With emotion fusion, the ResNet18_ASE_MFCC model continued to excel, its accuracy improving to 91.7%, suggesting that integrating individual characteristics enhances classification. Broadly, appending either demographic data or MFCC features significantly boosted classification power across all emotional contexts. Notably, combining MFCCs with demographic information yielded marginally better results than MFCCs alone. Furthermore, spectrograms combined with MFCCs proved more beneficial than spectrograms combined with demographic data alone, which in turn outperformed spectrograms with no additional information.

Fig. 3. Panels a, b, c, and d depict the five evaluation metrics of the classification models under positive, neutral, negative, and emotion-fusion conditions, respectively. The labels 1, 2, 3, and 4 on the horizontal axis correspond to the ResNet18, ResNet18_ASE, ResNet18_MFCC, and ResNet18_ASE_MFCC models, respectively. Panel e highlights the improvements in sensitivity and specificity for schizophrenia classification achieved through feature fusion across the emotional conditions

Moreover, we investigated the impact of feature fusion on sensitivity and specificity. The improvement resulting from feature fusion was calculated using Equation 1: the enhancement achieved through personalized feature fusion was quantified by subtracting the performance of the model without feature fusion from the average performance of the three models incorporating fused features. As shown in Fig. 3e, integrating personalized features played a crucial role in improving specificity, particularly under the negative emotional text condition, where specificity increased by 13.70%. This result highlights that incorporating individual-specific features improves the true negative rate, enhances the model's ability to distinguish patients from non-patients, and ultimately improves diagnostic accuracy.

Classification results of emotion fusion

Figure 4 presents the comparison of classification performance of each model under various emotional conditions and emotion fusion. Without incorporating demographic information or MFCCs, the classifiers achieved their highest accuracy with neutral emotional stimuli, reaching 85.7%. When demographic data were integrated, the model’s performance improved significantly, particularly for neutral stimuli, achieving a peak accuracy of 87.8%. For models incorporating MFCCs, those trained on positive emotional stimuli exhibited the highest accuracy, reaching 89.6%. However, when both demographic attributes and MFCCs were combined, the model again demonstrated superior performance with neutral emotional stimuli, achieving an accuracy of 89.6%, matching the performance of the best positive-stimuli model.

Fig. 4. Panels a, b, c, and d present the classification results of the ResNet18, ResNet18_ASE, ResNet18_MFCC, and ResNet18_ASE_MFCC models under the different emotional text conditions. Panel e illustrates the improvements in schizophrenia classification performance brought about by emotion fusion across the models

To further explore the pivotal role of neutral emotion in classification, three weights (a, b, c) were assigned to the classification probabilities of the positive, neutral, and negative emotions for emotion fusion, as expressed in Equation 2. The results, detailed in Table 2, revealed that combining different emotional stimuli improves classification accuracy, and that the neutral emotional input always received the highest weight in the optimal parameter set of each fusion strategy.

Table 2 Classification results of emotion fusion

Using Equation 3 to calculate the performance improvement brought by emotion fusion (Fig. 4e), we found that emotion fusion improved both sensitivity and specificity. As demographic information and MFCCs were gradually integrated into the model, the sensitivity gains diminished slightly while the specificity gains became more pronounced. This dynamic adjustment effectively balanced sensitivity and specificity, addressing potential disparities between the two metrics.

$$\mathrm{Improvement}_{\mathrm{Feature\ Fusion}}=\frac{\mathrm{ResNet18\_ASE}_{emotion}+\mathrm{ResNet18\_MFCC}_{emotion}+\mathrm{ResNet18\_ASE\_MFCC}_{emotion}}{3}-\mathrm{ResNet18}_{emotion}$$
(1)

Equation 1. Calculation formula for the improvement from feature fusion

$$\left\{\begin{array}{l}probability=a\cdot positive+b\cdot neutral+c\cdot negative\\ label=\mathrm{Sigmoid}\left(probability\right)\\ a+b+c=1\\ 0\le a\le 1\\ 0\le b\le 1\\ 0\le c\le 1\end{array}\right.$$
(2)

Equation 2. Calculation formula for emotion fusion

$$\mathrm{Improvement}_{\mathrm{Emotion\ Fusion}}=\mathrm{Model}_{\mathrm{Emotion\ Fusion}}-\frac{\mathrm{Model}_{positive}+\mathrm{Model}_{neutral}+\mathrm{Model}_{negative}}{3}$$
(3)

Equation 3. Calculation formula for the improvement from emotion fusion
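
A sketch of the weight search implied by Equation 2, assuming an exhaustive grid over the weight simplex and a 0.5 decision threshold on the fused probability (the grid step and the thresholding used in place of the sigmoid labeling are assumptions):

```python
import numpy as np

def fuse_emotions(p_pos, p_neu, p_neg, y_true, step=0.05):
    """Grid-search weights (a, b, c) with a + b + c = 1 that maximize
    accuracy of the fused probability a*pos + b*neu + c*neg (Equation 2)."""
    best_acc, best_w = -1.0, None
    for a in np.arange(0.0, 1.0 + 1e-9, step):
        for b in np.arange(0.0, 1.0 - a + 1e-9, step):
            c = 1.0 - a - b
            fused = a * p_pos + b * p_neu + c * p_neg
            acc = float(np.mean((fused >= 0.5) == y_true))
            if acc > best_acc:
                best_acc, best_w = acc, (float(a), float(b), float(c))
    return best_w, best_acc

# Usage with per-speaker probabilities from the three emotion-specific models:
# weights, acc = fuse_emotions(p_pos, p_neu, p_neg, y_true)
```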

Visualization

We applied the Grad-CAM [44] technique to visualize the activation of log-Mel spectrograms for the three emotional texts in the schizophrenia and healthy control groups. One subject from each group was randomly selected, and activation maps were generated for the middle segments of the emotional texts. The visualization results are presented in Fig. 5, with the log-Mel spectrogram on the left for each group and the activation map on the right. Brighter regions of the activation map indicate higher weight and a greater contribution to the predicted value.

Fig. 5. Activation map

In the log-Mel spectrogram, brighter colors indicate higher energy. Along the frequency axis, SZ exhibited energy concentrated in low-frequency regions (e.g., below 512 Hz), whereas HC showed a balanced energy distribution across frequencies, with sustained brightness above 1024 Hz. This contrast suggests that HC speech encompasses both low-pitched and high-pitched components, reflecting layered emotional expression, while SZ speech is characterized by monotonicity and reduced high-frequency modulation, possibly a sign of emotional flattening and prosodic poverty. Along the temporal axis, SZ spectrograms displayed continuous spectral stripes with blurred transitions between speech segments, whereas HC exhibited distinct blank intervals (Fig. 5). These blank intervals correspond to rhythmic pauses and prosodic variation in HC speech, indicative of natural cadence and emotional modulation. In contrast, the blurred transitions in SZ align with prolonged phonation and reduced rhythmic clarity, consistent with their slower speech rates and extended utterance durations (Table 1). Grad-CAM visualizations revealed that high-frequency blurred regions were more prominently activated in SZ spectrograms (Fig. 5). This activation pattern correlates with SZ-specific vocal traits, such as slurred articulation, monotonous pitch, and diminished emotional prosody. These features likely serve as discriminative cues that allow convolutional neural networks to identify schizophrenia from the spectrogram.
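
For reference, a minimal Grad-CAM sketch for a spectrogram-only ResNet18-style model; the choice of target layer and the use of the raw logit as the backward signal are assumptions:

```python
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, spectrogram):
    """Compute a Grad-CAM heatmap for a single (1, 3, 224, 224) input."""
    acts, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    model.eval()
    logit = model(spectrogram)            # SZ logit for this segment
    model.zero_grad()
    logit.sum().backward()
    h1.remove(); h2.remove()
    A, G = acts[0], grads[0]              # activations and their gradients
    w = G.mean(dim=(2, 3), keepdim=True)  # channel weights: global-avg-pooled grads
    cam = F.relu((w * A).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=spectrogram.shape[2:], mode="bilinear",
                        align_corners=False)
    return (cam / (cam.max() + 1e-8)).squeeze()   # normalized heatmap
```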

Discussion

This study investigated the impact of demographic information, MFCCs, and different emotional stimuli on speech-based discrimination of schizophrenia. A neural network model based on ResNet18 was constructed using deep learning techniques. The base input was the log-Mel spectrum, with demographic information or MFCCs jointly input into the network as individual features. The performance of the discriminative models was analyzed from the classification results to assess the contribution of each type of information. Furthermore, Grad-CAM was used to visualize class activation maps, providing insight into potential vocal patterns for schizophrenia detection. These data-driven methods aimed to enhance the interpretability of the neural network and facilitate a better understanding of the factors underlying schizophrenia detection.

In our study, the classification results of ResNet18 showed that neutral emotion yielded better performance than the other emotional stimuli. Notably, in the emotion fusion experiments, the neutral emotion consistently received the highest weight, further underscoring its discriminative power. It is possible that, under neutral stimuli, the difference between the speech produced by the two groups is greater. SZ show reduced electrophysiological discrimination between pleasant and neutral stimuli [45] and feel more negative emotions during emotional experiences [46]. Previous research has demonstrated that SZ exhibit a relatively strong aversion to processing stimuli perceived as pleasant or neutral by others [47]. Meta-analyses have further revealed increased brain reactivity to neutral stimuli [48] and stronger responses to neutral stimuli compared with HC [49]. We infer that during neutral-text reading, individuals with SZ may unconsciously infuse subtle negative affect into their speech production. This divergence could explain the enhanced classification accuracy observed under neutral conditions, as their vocal patterns may deviate more distinctly from HC than in other emotional contexts.

The utilization of speech patterns for schizophrenia detection is feasible because of the distinctive vocal characteristics exhibited by patients. However, healthy individuals may show similar vocal patterns owing to physiological variation. Age-related vocal fold atrophy leads to decreased pitch and increased hoarseness [27]; accounting for such baseline variation enables models to avoid misclassifying age-dependent acoustic changes as pathological. Anatomical differences between the sexes (e.g., vocal tract length) significantly influence timbre and pitch modulation [28, 29]. Higher education levels correlate with enhanced articulatory fluency [30, 31], potentially reducing verbal disorganization in HC and thereby improving diagnostic specificity. By controlling these confounders, the model focuses on disease-specific deviations rather than natural vocal diversity. MFCCs, manually defined shallow acoustic features derived from log-Mel spectrograms, primarily reflect the timbre properties of the vocal tract. Demographic information and acoustic features are complementary: MFCCs capture physical vocal tract characteristics, while age and sex parameters provide contextual reference for the acoustic features. Consequently, fusing the deep acoustic features extracted from log-Mel spectrograms by the neural network with demographic information or MFCCs enables the model to differentiate disease-related anomalies from normative vocal patterns. This integration strategy effectively identifies schizophrenia-like vocal patterns in healthy populations, explaining the enhanced specificity observed in our experiments.

The superiority of deep learning has been proven in practice in the medical field [50,51,52,53], yet its black-box learning [54] and inference processes still cause confusion and hesitation among users. We therefore visualized the class activation maps. In the visualization, the neural network discriminated schizophrenia using high-frequency, blurred, discontinuous regions of the spectrogram, corresponding to pronunciation patterns with monotone pitch, slurred articulation, and reduced rhythm. This result is consistent with traditional speech analysis [6, 20, 55, 56], supporting the reliability of our interpretation of the class activation maps.

The study demonstrated significant differences in speech patterns between SZ and mentally healthy individuals. The diagnostic accuracy of log-Mel spectrogram analysis with a convolutional neural network ranged from 80.9% to 85.7% for a single emotional stimulus. Emotion fusion improved both sensitivity and specificity, yielding an overall accuracy of 87.0%. Incorporating demographic information or MFCCs further enhanced specificity, with classification accuracies ranging from 86.1% to 89.6%; fusing both features gave the best results. By combining different emotional stimuli and individualized features, the diagnostic model achieved remarkable individual-level discriminative performance, with an accuracy of 91.7% and an AUC of 0.963. Including individual information in an automatic detection model can closely approximate the diagnostic approach of professional physicians, thereby addressing the challenge of heterogeneity and facilitating rapid, comprehensive schizophrenia detection based on patients' specific conditions.

In Table 3, we present a comparative analysis with similar previous studies, several of which achieved remarkable results. Our comparison focused on the format of the speech task, sample size, classification accuracy, and testing methodology. Although clinician-conducted interviews are a prevalent approach that can yield rich insight, they may inadvertently introduce semantic interference into the inherent speech characteristics by overemphasizing differences in content. In contrast, our method used reading of standardized texts to assess participants' implicit emotional expression. This approach minimizes semantic variability, eliminates the need for clinician involvement, and focuses purely on vocal patterns. Furthermore, our study included a larger cohort (156 patients, 74 controls) than prior work, enhancing the stability and reliability of the results. Notably, our test outcome was based on all speech segments from each individual within an independent test set, rather than isolated snippets, making our conclusions more reflective of individual diagnostic profiles.

Table 3 Comparative analysis of previous studies and this paper

While this study aimed to enhance the precision and reliability of schizophrenia detection through personalized approaches, we also recognize the limitations of our work in light of prior research. First, the cross-sectional design and focus on chronic schizophrenia patients (mean disease duration: 19.3 years) limit the generalizability of our findings to early diagnostic scenarios, such as first-episode psychosis (FEP). Chronic patients often exhibit vocal patterns shaped by long-term antipsychotic use, psychosocial stressors, and natural aging, which may differ from those of early-stage or untreated individuals. Future studies should validate our model across diverse clinical stages (e.g., FEP, acute-phase schizophrenia) and symptom severities. Longitudinal designs could further assess the temporal stability of vocal biomarkers and their sensitivity to therapeutic interventions, thereby improving diagnostic specificity and clinical utility. Second, while our use of individual-based five-fold cross-validation aimed to improve model generalizability, recent methodological critiques caution against potential overestimation of feature importance in deep learning frameworks. Repeated parameter tuning or feature selection across cross-validation folds may inflate the perceived contribution of specific features (e.g., MFCCs, demographic variables) through data leakage or spurious correlations. Although our fusion strategy (demographics + MFCCs + spectrogram) significantly improved classification accuracy, the relative importance of individual features requires validation through independent external cohorts or ablation studies. Furthermore, the limited sample size and single-center recruitment may introduce bias, as the symptom profiles of our cohort might not represent the broader schizophrenia population. Future work should validate the model in larger, more diverse populations (e.g., varying geographic regions, languages, and disease stages) and employ interpretability techniques to rigorously disentangle the contributions of heterogeneous features while controlling for confounders. Finally, while our study demonstrated strong discriminative performance between schizophrenia patients and healthy controls, its clinical utility must be interpreted cautiously. In practice, individuals presenting to psychiatric clinics are more likely to exhibit symptoms of depression, anxiety, or bipolar disorder than to be healthy volunteers. For instance, monotonous speech, observed in SZ, is also characteristic of major depressive disorder (MDD). Such overlapping vocal patterns across psychiatric conditions may reduce the model's specificity in real-world applications. Future studies should validate the proposed framework against cohorts with diverse psychiatric diagnoses (e.g., MDD, bipolar disorder) to ensure its ability to distinguish schizophrenia from other disorders. Additionally, integrating multimodal biomarkers (e.g., linguistic content, physiological signals) could enhance diagnostic precision by capturing disease-specific signatures.

Data availability

The datasets generated and/or analyzed during the current study are not publicly available due to privacy restrictions.

References

  1. McCutcheon RA, Reis Marques T, Howes OD. Schizophrenia—an overview. JAMA Psychiatry. 2020;77(2):201–10.

  2. Chan KY, Zhao FF, Meng S, Demaio AR, Reed C, Theodoratou E, Campbell H, Wang W, Rudan I. Prevalence of schizophrenia in China between 1990 and 2010. J Glob Health. 2015;5(1):010410–8.

  3. McCutcheon RA, Pillinger T, Mizuno Y, Montgomery A, Pandian H, Vano L, Marques TR, Howes OD. The efficacy and heterogeneity of antipsychotic response in schizophrenia: a meta-analysis. Mol Psychiatry. 2021;26(4):1310–20.

  4. Lv J, Di Biase M, Cash RFH, Cocchi L, Cropley VL, Klauser P, Tian Y, Bayer J, Schmaal L, Cetin-Karayumak S, et al. Individual deviations from normative models of brain structure in a large cross-sectional schizophrenia cohort. Mol Psychiatry. 2021;26(7):3512–23.

  5. Di Biase MA, Geaghan MP, Reay WR, Seidlitz J, Weickert CS, Pébay A, Green MJ, Quidé Y, Atkins JR, Coleman MJ, et al. Cell type-specific manifestations of cortical thickness heterogeneity in schizophrenia. Mol Psychiatry. 2022;27(4):2052–60.

  6. Oomen PP, de Boer JN, Brederoo SG, Voppel AE, Brand BA, Wijnen FNK, Sommer IEC. Characterizing speech heterogeneity in schizophrenia-spectrum disorders. J Psychopathol Clin Sci. 2022;131(2):172–81.

  7. Skikic M, Arriola JA. First episode psychosis medical workup: evidence-informed recommendations and introduction to a clinically guided approach. Child Adolesc Psychiatr Clin. 2020;29(1):15–28.

  8. Srivastava A, Nair R. Utility of investigations, history, and physical examination in “medical clearance” of psychiatric patients: a meta-analysis. Psychiatr Serv. 2022;73(10):1140–52.

  9. Oskolkova SN. Schizophrenia: a narrative review of etiological and diagnostic issues. Consort Psychiatr. 2022;3(3):20–35.

  10. Espinola CW, Gomes JC, Pereira JMS, dos Santos WP. Vocal acoustic analysis and machine learning for the identification of schizophrenia. Res Biomed Eng. 2021;37(1):33–46.

  11. Figueroa-Barra A, Del Aguila D, Cerda M, Gaspar PA, Terissi LD, Durán M, Valderrama C. Automatic language analysis identifies and predicts schizophrenia in first-episode of psychosis. Schizophrenia. 2022;8(1):53–61.

  12. Tyburski E, Sokołowski A, Chęć M, Pełka-Wysiecka J, Samochowiec A. Neuropsychological characteristics of verbal and non-verbal fluency in schizophrenia patients. Arch Psychiatr Nurs. 2015;29(1):33–8.

  13. Lundin NB, Jones MN, Myers EJ, Breier A, Minor KS. Semantic and phonetic similarity of verbal fluency responses in early-stage psychosis. Psychiatry Res. 2022;309:114404.

  14. Ehlen F, Montag C, Leopold K, Heinz A. Linguistic findings in persons with schizophrenia—a review of the current literature. Front Psychol. 2023;14:1287706–27.

  15. Pallickal M, Deepak P, Abhishek B, Hema N. An exploration of cohesion and coherence skills in neuropsychiatric disorder: speech language pathologist perspective. J Schizophr Res. 2022;8(1):1041–7.

  16. Tang SX, Kriz R, Cho S, Park SJ, Harowitz J, Gur RE, Bhati MT, Wolf DH, Sedoc J, Liberman MY. Natural language processing methods are sensitive to sub-clinical linguistic differences in schizophrenia spectrum disorders. npj Schizophr. 2021;7(1):25–33.

  17. Meyer L, Lakatos P, He Y. Language dysfunction in schizophrenia: assessing neural tracking to characterize the underlying disorder(s)? Front Neurosci. 2021;15:640502–14.

  18. Alonso-Sánchez MF, Ford SD, MacKinley M, Silva A, Limongi R, Palaniyappan L. Progressive changes in descriptive discourse in first episode schizophrenia: a longitudinal computational semantics study. Schizophrenia. 2022;8(1):36–45.

  19. Ziv I, Baram H, Bar K, Zilberstein V, Itzikowitz S, Harel EV, Dershowitz N. Morphological characteristics of spoken language in schizophrenia patients – an exploratory study. Scand J Psychol. 2022;63(2):91–9.

  20. Parola A, Simonsen A, Lin JM, Zhou Y, Wang H, Ubukata S, Koelkebeck K, Bliksted V, Fusaroli R. Voice patterns as markers of schizophrenia: building a cumulative generalizable approach via a cross-linguistic and meta-analysis based investigation. Schizophr Bull. 2022;49(Supplement_2):S125–41.

  21. de Boer JN, Voppel AE, Brederoo SG, Schnack HG, Truong KP, Wijnen FNK, Sommer IEC. Acoustic speech markers for schizophrenia-spectrum disorders: a diagnostic and symptom-recognition tool. Psychol Med. 2023;53(4):1302–12.

  22. Chakraborty D, Yang Z, Tahir Y, Maszczyk T, Dauwels J, Thalmann N, Zheng J, Maniam Y, Amirah N, Tan BL, et al. Prediction of negative symptoms of schizophrenia from emotion related low-level speech signals. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2018. p. 6024–6028.

  23. Xu S, Yang Z, Chakraborty D, Chua YHV, Dauwels J, Thalmann D, Thalmann NM, Tan BL, Keong JLC. Automated verbal and non-verbal speech analysis of interviews of individuals with schizophrenia and depression. In: 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). 2019. p. 225–228.

  24. Voppel AE, de Boer JN, Brederoo SG, Schnack HG, Sommer IEC. Semantic and acoustic markers in schizophrenia-spectrum disorders: a combinatory machine learning approach. Schizophr Bull. 2022;49(Supplement_2):S163–71.

  25. Fu J, Yang S, He F, He L, Li Y, Zhang J, Xiong X. Sch-net: a deep learning architecture for automatic schizophrenia detection. Biomed Eng Online. 2021;20(1):75–83.

  26. He L, Fu J, Li Y, Xiong X, Zhang J. WNSA-Net: an axial-attention-based network for schizophrenia detection using wideband and narrowband spectrograms. IEEE/ACM Trans Audio Speech Lang Process. 2023;31:721–33.

  27. Martins RHG, Benito Pessin AB, Nassib DJ, Branco A, Rodrigues SA, Matheus SMM. Aging voice and the laryngeal muscle atrophy. Laryngoscope. 2015;125(11):2518–21.

  28. McCollum I, Throop A, Badr D, Zakerzadeh R. Gender in human phonation: fluid–structure interaction and vocal fold morphology. Phys Fluids. 2023;35(4):041907–16.

  29. Chen S, Han C, Wang S, Liu X, Wang B, Wei R, Lei X. Hearing the physical condition: the relationship between sexually dimorphic vocal traits and underlying physiology. Front Psychol. 2022;13:983688–98.

  30. Pérez Cañado ML, Lancaster NK. The effects of CLIL on oral comprehension and production: a longitudinal case study. Lang Cult Curric. 2017;30(3):300–16.

  31. Tanner M, Henrichsen L. Pronunciation in varied teaching and learning contexts. Hoboken: Wiley Blackwell Oxford; 2022.

  32. Vasquez-Serrano P, Reyes-Moreno J, Guido RC, Sepúlveda-Sepúlveda A. MFCC parameters of the speech signal: an alternative to formant-based instantaneous vocal tract length estimation. J Voice. 2023.

  33. Tracey B, Volfson D, Glass J, Haulcy RM, Kostrzebski M, Adams J, Kangarloo T, Brodtmann A, Dorsey ER, Vogel A. Towards interpretable speech biomarkers: exploring MFCCs. Sci Rep. 2023;13(1):22787–96.

  34. Astuti Y, Hidayat R, Bejo A. Feature extraction using Gaussian-MFCC for speaker recognition system. In: 2021 IEEE 5th International Conference on Information Technology, Information Systems and Electrical Engineering (ICITISEE). 2021. p. 186–190.

  35. Agurto C, Pietrowicz M, Norel R, Eyigoz EK, Stanislawski E, Cecchi G, Corcoran C. Analyzing acoustic and prosodic fluctuations in free speech to predict psychosis onset in high-risk youths. In: Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC). Montreal: IEEE; 2020. pp. 5575–9.

  36. Zhang J, Yang H, Li W, Li Y, Qin J, He L. Automatic schizophrenia detection using multimodality media via a text reading task. Front Neurosci. 2022;16:933049–66.

  37. Chakraborty D. Automated analysis of non-verbal behaviour of schizophrenia patients. Singapore: Nanyang Technological University; 2019.

  38. Kring AM, Elis O. Emotion deficits in people with schizophrenia. Ann Rev Clin Psych. 2013;9(1):409–33.

  39. Liu Q, Huang Y, Huang XY, Xia XT, Niu XX, Lin L, Chen YW. Dynamic facial features in positive-emotional speech for identification of depressive tendencies. In: Innovation in medicine and healthcare. edn. Singapore: Springer; 2020. pp. 127–34.

  40. Kay SR, Fiszbein A, Opler LA. The positive and negative syndrome scale (PANSS) for schizophrenia. Schizophr Bull. 1987;13(2):261–76.

  41. McFee B, Raffel C, Liang D, Ellis DP, McVicar M, Battenberg E, Nieto O. librosa: audio and music signal analysis in Python. In: Proceedings of the 14th Python in Science Conference. 2015. p. 18–25.

  42. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. p. 770–778.

  43. Lin TY, Goyal P, Girshick R, He K, Dollár P. Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision. 2017. p. 2980–2988.

  44. Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-cam: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE international conference on computer vision. 2017. p. 618–626.

  45. Horan WP, Wynn JK, Kring AM, Simons RF, Green MF. Electrophysiological correlates of emotional responding in schizophrenia. J Abnorm Psychol. 2010;119(1):18–30.

  46. Hermans KSFM, Myin-Germeys I, Gayer-Anderson C, Kempton MJ, Valmaggia L, McGuire P, Murray RM, Garety P, Wykes T, Morgan C, et al. Elucidating negative symptoms in the daily life of individuals in the early stages of psychosis. Psychol Med. 2021;51(15):2599–609.

  47. Liang S, Wu Y, Hanxiaoran L, Greenshaw AJ, Li T. Anhedonia in depression and schizophrenia: brain reward and aversion circuits. Neuropsychiatr Dis Treat. 2022;18:1385–96.

  48. Potvin S, Tikàsz A, Mendrek A. Emotionally neutral stimuli are not neutral in schizophrenia: a mini review of functional neuroimaging studies. Front Psychiatry. 2016;7:115–24.

  49. Dugré JR, Bitar N, Dumais A, Potvin S. Limbic hyperactivity in response to emotionally neutral stimuli in schizophrenia: a neuroimaging meta-analysis of the hypervigilant mind. Am J Psychiatry. 2019;176(12):1021–9.

  50. Nam JG, Ahn C, Choi H, Hong W, Park J, Kim JH, Goo JM. Image quality of ultralow-dose chest CT using deep learning techniques: potential superiority of vendor-agnostic post-processing over vendor-specific techniques. Eur Radiol. 2021;31:5139–47.

  51. Li Y, Zhao J, Lv Z, Li J. Medical image fusion method by deep learning. Int J Cognit Comput Eng. 2021;2:21–9.

  52. Hesamian MH, Jia W, He X, Kennedy P. Deep learning techniques for medical image segmentation: achievements and challenges. J Digit Imaging. 2019;32(4):582–96.

  53. Wang Q, Lyu W, Cheng Z, Yu C. Noninvasive measurement of vital signs with the optical fiber sensor based on deep learning. J Lightwave Technol. 2023;41(13):4452-62.

  54. Ramírez MAM, Benetos E, Reiss JD. Deep learning for black-box modeling of audio effects. Appl Sci. 2020;10(2):638–61.

  55. Lucarini V, Grice M, Cangemi F, Zimmermann JT, Marchesi C, Vogeley K, Tonna M. Speech prosody as a bridge between psychopathology and linguistics: the case of the schizophrenia spectrum. Front Psychiatry. 2020;11:531863–71.

  56. Parola A, Simonsen A, Bliksted V, Fusaroli R. Voice patterns in schizophrenia: a systematic review and Bayesian meta-analysis. Schizophr Res. 2020;216:24–40.

  57. Tahir Y, Yang Z, Chakraborty D, Thalmann N, Thalmann D, Maniam Y, Binte Abdul Rashid NA, Tan BL, Lee Chee Keong J, Dauwels J. Non-verbal speech cues as objective measures for negative symptoms in patients with schizophrenia. PLoS ONE. 2019;14(4):e0214314.

  58. Berardi M, Brosch K, Pfarr J-K, Schneider K, Sültmann A, Thomas-Odenthal F, Wroblewski A, Usemann P, Philipsen A, Dannlowski U, et al. Relative importance of speech and voice features in the classification of schizophrenia and depression. Transl Psychiatry. 2023;13(1):298–306.

Acknowledgements

We want to thank Editage (www.editage.cn) for English language editing.

Funding

This work was supported by the Beijing Hospitals Authority Youth Programme [grant number QML20232004], the Beijing Natural Science Foundation [grant number 7202072], the Beijing Municipal Science & Technology Commission [grant number Z191100006619104, D171100007017002] and Beijing Hospitals Authority’ Ascent Plan [grant number DFL20192001].

Author information

Contributions

Jie Huang: Investigation, Methodology, Project administration, Validation, Visualization, Writing – original draft, Writing – review and editing. Yanli Zhao: Methodology, Writing – review and editing. Zhanxiao Tian: Data curation, Supervision, Writing – review and editing. Wei Qu: Supervision, Writing – review and editing. Xia Du: Data curation, Investigation, Writing – review and editing. Jie Zhang: Data curation, Investigation, Writing – review and editing. Meng Zhang: Funding acquisition, Writing – review and editing. Yunlong Tan: Supervision, Writing – review and editing. Zhiren Wang: Supervision, Writing – review and editing. Shuping Tan: Conceptualization, Funding acquisition, Methodology, Supervision, Writing – review and editing.

Corresponding author

Correspondence to Shuping Tan.

Ethics declarations

Ethics approval and consent to participate

This study was approved by the Ethics Committee of Beijing Huilongguan Hospital (Approval No. Z 141107002514016). Participants signed a broad consent form at the time of their visit to Beijing Huilongguan Hospital. They were told that they could stop participating at any time without giving a reason, that all information they provided would be kept confidential, and that their identities would never be disclosed.

Consent for publication

All participants in the study gave consent for the publishing of the manuscript.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Huang, J., Zhao, Y., Tian, Z. et al. Hearing vocals to recognize schizophrenia: speech discriminant analysis with fusion of emotions and features based on deep learning. BMC Psychiatry 25, 466 (2025). https://doi.org/10.1186/s12888-025-06888-z

  • DOI: https://doi.org/10.1186/s12888-025-06888-z
