Novel Deep Learning-Based Vocal Biomarkers for Stress Detection in Koreans
Article information
Abstract
Objective
Rapid societal changes have underscored the importance of effective stress detection and management. Chronic mental stress contributes significantly to both physical and psychological illness. However, many individuals remain unaware of their stress levels until they face physical health issues, highlighting the need for regular stress monitoring. This study aimed to investigate the effectiveness of vocal biomarkers in detecting stress levels among healthy Korean employees and to contribute to digital healthcare solutions.
Methods
We conducted a multi-center clinical study by collecting voice recordings from 115 healthy Korean employees under both relaxed and stress-induced conditions. Stress was induced using the socially evaluated cold pressor test. The Emphasized Channel Attention, Propagation and Aggregation in Time delay neural network (ECAPA-TDNN) deep learning architecture, renowned for its advanced capabilities in analyzing person-specific voice features, was employed to develop stress prediction scores.
Results
The proposed model achieved a 70% accuracy rate in detecting stress. This performance underscores the potential of vocal biomarkers as a convenient and effective tool for individuals to self-monitor and manage their stress levels within digital healthcare frameworks.
Conclusion
The findings emphasize the promise of voice-based mental stress assessments within the Korean population and the importance of continued research on vocal biomarkers across diverse linguistic demographics.
INTRODUCTION
Although stress can have some positive effects, such as providing energy and motivation and sharpening our focus, chronic stress can be detrimental. Psychological stress has been linked to numerous health issues, including mental disorders [1], cardiovascular diseases [2], cancers [3], and autoimmune diseases [4]. Additionally, work-related stress can lead to employee burnout [5] and negatively impact an organization’s productivity [6]. Therefore, managing stress is essential for enhancing long-term quality of life. Regular stress monitoring can help identify when active self-care or professional assistance is necessary. However, stress diagnosis is subjective, which poses challenges in developing a universally applicable diagnostic model [7]. Clinically, stress levels are typically measured through self-reported, expert-verified questionnaires, such as the Perceived Stress Scale [8], or visual analog scales, such as a stress thermometer [9]. Nonetheless, these methods are subjective and may not provide a consistent measurement scale for stress levels.
Extensive efforts have been made to develop biological markers that objectively and consistently measure stress, including metrics such as blood pressure, heart rate variability, electroencephalogram signals, and finger temperature [7]. Since the early 1980s, speech and voice have also been recognized as valuable biomarkers, providing probabilistic information about mental health [10]. Stress affects both linguistic and non-linguistic speech characteristics [11]. Linguistic features, such as vocabulary diversity, lexical and grammatical complexity, syntactic structures, and semantic skills, can indicate cognitive impairment [11]. However, these features can be influenced by an individual's educational and social background [12]. Non-linguistic features pertain to how speech sounds are produced through the modulation of muscle tension and respiration [13]. Increased stress can raise both muscle tension and respiration rates, thereby altering vocal sounds [14].
Recent speech-based vocal stress models include StressSense, developed by Lu et al. [15]. This system uses pitch features, spectral centroids, high-frequency ratios, speaking rates, mel-frequency cepstral coefficients (MFCCs), and nonlinear features derived from the Teager Energy Operator [16], with classification models trained as Gaussian mixture models (GMMs) on predetermined acoustic feature sets. Similarly, Sondhi et al. [14] analyzed stress using seven acoustic features: pitch, shimmer, four formants, and mean pitch. Traditional methods for building prediction models from voice data have relied on a variety of statistical approaches. For example, GMMs model the distribution of voice feature vectors, including MFCCs, to capture the characteristics of speech sounds, and support vector machines (SVMs) use hyperplanes to separate classes in the high-dimensional space of acoustic features. Because these traditional approaches are limited in handling complex voice data, they often struggle with the variability of real-world speech.
In contrast, deep learning models can identify complex patterns and features in voice data that traditional models often miss, leading to higher accuracy and performance in voice recognition tasks [17]. Deep learning models incorporate various architectures, such as convolutional neural networks (CNNs) [18] and recurrent neural networks (RNNs) [19], to handle diverse and complex voice signals, improving robustness and adaptability across conditions. These models can also better capture temporal and contextual information in voice data, which is difficult to account for with traditional methods. Recent deep learning models, including the Emphasized Channel Attention, Propagation and Aggregation in Time Delay Neural Network (ECAPA-TDNN) [20], use sophisticated feature extraction techniques and attention mechanisms such as Squeeze-and-Excitation blocks [21] to focus on the most relevant parts of the audio signal, improving the model's precision and discriminative power. Despite significant advances in deep learning modeling of speech signals, its application to stress modeling has been relatively rare. Additionally, most models have been developed for Western populations, and few vocal stress models have been tailored to Koreans.
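To make the Squeeze-and-Excitation mechanism concrete, the following minimal PyTorch sketch re-weights the channels of a 1-D feature map by learned relevance scores; the channel count and reduction factor are illustrative and do not reproduce the ECAPA-TDNN configuration.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Minimal Squeeze-and-Excitation block for (batch, channels, time) input."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = x.mean(dim=2)                # squeeze: global average over time
        s = torch.relu(self.fc1(s))      # bottleneck excitation
        s = torch.sigmoid(self.fc2(s))   # per-channel weights in (0, 1)
        return x * s.unsqueeze(-1)       # re-scale channels by relevance

x = torch.randn(2, 256, 100)             # (batch, channels, time)
y = SEBlock(256)(x)                      # same shape, channels re-weighted
```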
In this study, we aimed to develop stress prediction models for vocal biomarkers using an in-house dataset generated from a Korean clinical study. Recently introduced high-performance models based on deep learning structures, capable of speaker recognition from speech recordings, have been explored [20,22-25]. By leveraging these speaker-verification model architectures, we aim to enhance the accuracy of detecting individual stress fluctuations by accurately capturing unique voice characteristics.
METHODS
Participants
This study was part of a project conducted by SK Telecom Co., Ltd. (Seoul, Korea), aimed at developing digital biomarkers for assessing stress in healthy employees through voice and biosensor signals. Participants were recruited through advertisements at Seoul National University Bundang Hospital (SNUBH) and Boramae Medical Center (BMC) between December 2021 and February 2022.
The inclusion criteria were: 1) age between 19 and 65 years, and 2) full-time employment. The exclusion criteria were: 1) age under 19 or over 65 years, 2) cognitive impairments, such as dementia or intellectual disability, 3) serious neurological disorders, including epilepsy and stroke, 4) history of schizophrenia, bipolar I disorder, or other psychotic disorders, 5) current suicidal ideation, 6) history of cardiovascular disease, pulmonary disease, or associated pharmacological therapy that could affect heart rate variability measurements, 7) vocal diseases affecting speech analysis, 8) conditions that could alter cortisol levels, such as Cushing's disease, 9) for female participants, not being in the luteal phase of the menstrual cycle, and 10) sensitivity to temperature changes, such as Raynaud's syndrome.
Psychiatric diagnoses were made during the screening process using the Mini-International Neuropsychiatric Interview (MINI) [26], a brief, structured psychiatric interview designed to identify a broad range of psychiatric disorders according to the Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition. The interview format required responses to be either “yes” or “no.” The Korean version of the MINI is well-validated and reliable [27]. The MINI assessments were conducted by two psychologists, each holding a master’s degree and well-trained in administering the MINI.
Baseline assessments
Demographic information, including age, sex, duration of employment, marital status (such as married or separated), and educational status (college education or higher, less than college education), was collected using a self-reported questionnaire provided in paper format. To assess the baseline mental stress levels of the participants, we utilized the Perceived Stress Scale [8]. This scale is designed to measure the extent to which individuals perceive their life situations as unpredictable, uncontrollable, and overwhelming. It consists of 10 direct questions about experiences that may elicit feelings of upset, nervousness, stress, or irritation; six items are phrased negatively and four positively. Responses were recorded on a 5-point Likert scale, ranging from 0 (never) to 4 (very often), with higher scores indicating greater perceived stress. The Korean version of the Perceived Stress Scale has shown a Cronbach's alpha of 0.74 and a test-retest reliability with an intraclass correlation coefficient of 0.81.
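For illustration, scoring under this scheme reduces to a simple sum with reverse-scored items. The sketch below assumes the conventional PSS-10 ordering, in which items 4, 5, 7, and 8 are the positively phrased, reverse-scored items; the item indices are an assumption not stated in the text.

```python
def pss10_score(responses: list[int]) -> int:
    """Total PSS-10 score; higher values indicate greater perceived stress."""
    assert len(responses) == 10 and all(0 <= r <= 4 for r in responses)
    reverse_items = {4, 5, 7, 8}  # 1-indexed positively phrased items (assumed)
    return sum(4 - r if i in reverse_items else r
               for i, r in enumerate(responses, start=1))
```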
Stress induction and its validation measures
Voice and biological signals were collected from the participants at two time points: in a relaxed state and in a stressed state. In the relaxed state, the participants were instructed to rest in a quiet space for 10 min.
To induce a stressed state, we employed the socially evaluated cold pressor test (SECPT) [28,29]. This widely used method for inducing acute mental stress involves participants immersing their hands in ice-cold water for up to 3 min while under social evaluation, which elicits both physiological and psychological stress responses. Participants recorded their speech both before and after undergoing the SECPT. Stress levels were evaluated using self-reported distress thermometer (DT) scores and salivary cortisol levels before and after stress induction.
The DT is a single-item self-report measure developed by Roth et al. [9], with a cutoff value set at 4, according to National Comprehensive Cancer Network recommendations [30]. This cutoff score has been validated in other studies, including a meta-analysis [31]. The DT uses a vertical visual analog scale ranging from 0 (no distress) to 10 (extreme distress). Although originally developed for cancer patients, it is widely used across various contexts and populations due to its simplicity and ease of use [32].
In addition to self-reported stress scores, salivary cortisol levels were objectively measured to validate participants’ stress states. Salivary cortisol is considered a reliable biomarker of psychological stress [33]. Participants collected saliva samples using a Saliva Bio Oral Swab (Salimetrics, LLC, Carlsbad, CA, USA). The swab was placed under the participant’s tongue for 2 min, extending to 5 min if not sufficiently saturated. Once collected, the swab was stored at -20°C until analysis. Salivary cortisol samples were analyzed by the Green Cross Labs (Yongin, Korea) using a time-resolved immunoassay with fluorometric detection [34]. Both inter- and intra-assay coefficients of variance were maintained below 9%.
Voice recordings
Considering that linguistic features can be influenced by various personal factors such as education and economic status, we aimed to develop acoustic biomarkers for a more universally applicable model. For our input source, we included voice recordings of participants reading a script and responding freely to questions. The script, a Korean essay titled “Autumn,” provides an objective description of the natural environment, minimizing emotional influence. This script consists of 141 words, and participants typically took about 1 min to read it. Following the script reading, participants answered questions posed by the experimenter on topics such as their daily routines, hobbies, and a recently watched movie or book. If their responses to a question were shorter than 1 min, they were prompted to answer another question. However, there was no set upper limit for the length of the speech responses.
Voice recordings were conducted using a Philips Voice Tracer VTR 7100 (Koninklijke Philips N.V., Amsterdam, The Netherlands) and saved in the WAV format. The audio was captured at a sampling rate of 24,000 Hz.
The experimental procedure used to collect participants’ biological data in both relaxed (non-stressed) (T0) and stressed (T1) states is illustrated in Figure 1.
Preprocessing of voice recording data
Raw voice-recorded data were manually reviewed for validity. We isolated the participants' voices by excluding the experimenter's voice and other background noises, using timestamps from GoldWave (version 6.54; GoldWave Inc., St. John's, NL, Canada). The data were divided into three sets: training, validation, and testing. The samples were shuffled to ensure similar age distributions and sex ratios in each subset and randomly rearranged to prevent sequence bias. Within each hospital's data, samples were further divided into four subgroups based on the following criteria: 1) males below the median age, 2) males above the median age, 3) females below the median age, and 4) females above the median age. After shuffling these subgroups, the samples were allocated to the training, validation, and test datasets, which received 75, 20, and 20 samples, respectively.
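A hedged sketch of this stratified allocation is shown below. The bucketing keys (sex and median-age split) follow the text, while the proportional rounding toward the 75/20/20 totals and the data layout are our assumptions.

```python
import random

def stratified_split(participants, seed=0):
    """participants: list of dicts with 'sex' ('M'/'F') and 'age' keys."""
    rng = random.Random(seed)
    median_age = sorted(p["age"] for p in participants)[len(participants) // 2]
    buckets = {}
    for p in participants:
        key = (p["sex"], p["age"] < median_age)   # four sex-by-age subgroups
        buckets.setdefault(key, []).append(p)
    train, valid, test = [], [], []
    for group in buckets.values():
        rng.shuffle(group)
        n = len(group)
        # proportional allocation toward the overall 75/20/20 sample counts
        n_train = round(n * 75 / 115)
        n_valid = round(n * 20 / 115)
        train += group[:n_train]
        valid += group[n_train:n_train + n_valid]
        test += group[n_train + n_valid:]
    return train, valid, test
```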
Classification labels were assigned as 0 for the relaxed state (T0) and 1 for the stressed state (T1). The voice recordings, initially stored as waveforms, were resampled to 16,000 Hz to reduce computational costs, with no noticeable performance degradation attributable to resampling.
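For illustration, resampling with Torchaudio, which the study reports using for this step, could look like the following minimal sketch; the synthetic waveform is a placeholder for a loaded recording.

```python
import torch
import torchaudio

waveform = torch.randn(1, 24_000 * 60)    # placeholder: 1 min at 24 kHz
resampler = torchaudio.transforms.Resample(orig_freq=24_000, new_freq=16_000)
waveform_16k = resampler(waveform)         # shape: (1, 16_000 * 60)
```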
Deep learning models for stress status classification
Using the training and validation datasets, we developed deep learning models to differentiate between two stress conditions, T0 (relaxed state) and T1 (stressed state), using acoustic features from voice-recording segments. The process of constructing these classification models involved three steps: segmenting the voice recordings, extracting the acoustic features, and training the deep neural network models (Figure 2).
For segmentation of the voice recording input data, our dataset consisted of voice recordings collected from participants before (T0) and after (T1) stress induction with the SECPT [28,29]. Recordings were divided into 4-sec segments with 75% overlap between adjacent segments. This overlap aimed to improve information capture by better detecting transient patterns in voice signals and by avoiding boundary issues, such as the loss of information near segment edges, that occur with a fixed, non-overlapping window. Consequently, segment overlapping can enhance the resolution of the extracted features and improve model performance.
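The windowing itself reduces to a few lines. The sketch below assumes a 16 kHz waveform tensor, with the 4-sec window and 75% overlap (a 1-sec hop) taken from the text.

```python
import torch

def segment(waveform: torch.Tensor, sr: int = 16_000,
            win_sec: float = 4.0, overlap: float = 0.75):
    """Slice a waveform into overlapping fixed-length segments."""
    win = int(win_sec * sr)
    hop = int(win * (1 - overlap))   # 75% overlap -> 1-sec hop
    n = waveform.shape[-1]
    return [waveform[..., s:s + win] for s in range(0, n - win + 1, hop)]
```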
For acoustic feature extraction, an 80-band Mel spectrogram was computed for each voice segment. Mel-spectrogram features were selected because they are widely used in speech analysis and effectively represent the acoustic properties of voice signals [34].
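A sketch of this step with Torchaudio might look as follows; the STFT parameters (`n_fft`, `hop_length`) are not reported in the text and are illustrative assumptions.

```python
import torch
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16_000, n_fft=512, hop_length=160, n_mels=80)
segment_waveform = torch.randn(1, 4 * 16_000)   # placeholder 4-sec segment
features = mel(segment_waveform)                 # shape: (1, 80, frames)
```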
For training the deep learning model, the ECAPA-TDNN architecture was chosen for its consistent performance across fields such as speaker recognition [20,35], speaker diarization [22], text-to-speech synthesis [36], spoken language identification [37], and emotion recognition [38]. This model was selected to effectively capture personal vocal characteristics. Using the 80-band Mel spectrogram as input, a deep learning model based on the ECAPA-TDNN architecture [20] was trained to classify the stress status of the study participants. Stress status was classified into two categories, relaxed and stressed, corresponding to the recording times T0 and T1. The ECAPA-TDNN model was trained using the binary cross-entropy loss function and the adaptive moment estimation (Adam) optimizer [39].
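As the study's implementation is not published, the following is a minimal PyTorch sketch of this training setup. `TinyTDNN` is an illustrative stand-in for the actual ECAPA-TDNN encoder, and the synthetic data, batch size, and learning rate are assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class TinyTDNN(nn.Module):
    """Illustrative stand-in for the ECAPA-TDNN encoder (not the real model)."""
    def __init__(self, n_mels=80, embed_dim=192):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3, dilation=2), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(256, embed_dim), nn.ReLU(),
                                  nn.Linear(embed_dim, 1))

    def forward(self, x):                          # x: (batch, 80, frames)
        return self.head(self.conv(x).mean(dim=2)).squeeze(-1)

# toy data standing in for 80-band Mel segments and T0/T1 labels
mels = torch.randn(32, 80, 400)
labels = torch.randint(0, 2, (32,)).float()        # 0 = relaxed, 1 = stressed
loader = DataLoader(TensorDataset(mels, labels), batch_size=8)

model = TinyTDNN()
criterion = nn.BCEWithLogitsLoss()                 # binary cross-entropy on logits
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for mel_batch, y in loader:
    optimizer.zero_grad()
    loss = criterion(model(mel_batch), y)
    loss.backward()
    optimizer.step()
```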
The performance of the final model was evaluated based on its accuracy on the test dataset (Figure 2). Resampling and feature extraction were conducted using Torchaudio version 0.7.0 (https://pytorch.org/audio/stable/index.html), while model optimizations were performed with Torch version 1.7.0 (https://pytorch.org).
Statistical analysis
As described in the preprocessing section, the training, validation, and test datasets consisted of 75, 20, and 20 non-overlapping samples, respectively. Stress score models were trained using the training and validation sets, and their performance was evaluated on the test dataset. Performance was assessed based on the discriminative power of the distributions of stress scores between stress states T0 and T1. The statistical significance of the increase in stress scores at T1 compared with T0 was evaluated using a one-tailed Wilcoxon rank-sum test, with a significance level of 0.05 as the decision cutoff. The Wilcoxon rank-sum test is a nonparametric method that is robust to various distribution shapes.
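As a concrete illustration, this decision rule could be implemented with scipy.stats as below; reading the "Wilcoxon rank test" as a rank-sum comparison of segment-level scores is our interpretation.

```python
from scipy.stats import ranksums

def detect_stress(scores_t0, scores_t1, alpha=0.05):
    """One-sided rank-sum test on segment-level stress scores for one person."""
    stat, p = ranksums(scores_t1, scores_t0, alternative="greater")
    return p < alpha   # True: scores at T1 are significantly higher

stressed = detect_stress([0.32, 0.41, 0.35], [0.52, 0.61, 0.58])
```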
To account for intra-individual score variance as a random component, mixed-effects models were employed to assess the effects of stress induction on changes in prediction scores. The results were then compared between the two voice recording methods: script reading and free speech. All statistical tests for group comparisons of clinical features and stress scores were performed using functions from the scipy.stats library, version 1.10.1, in Python (https://scipy.org).
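A hedged sketch of such a mixed-effects comparison follows: a fixed effect of time point (T0 vs. T1) with a per-participant random intercept, fitted separately for each recording type. The paper cites only scipy.stats, so the use of statsmodels here, along with the column names and synthetic data, is an illustrative assumption.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# synthetic stand-in for per-segment stress scores (columns assumed)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "subject": np.repeat(np.arange(20), 10),
    "time": np.tile(np.repeat([0, 1], 5), 20),   # 0 = T0, 1 = T1
    "rec_type": np.tile([1, 2], 100),            # 1 = script, 2 = free speech
})
df["score"] = 0.1 * df["time"] + rng.normal(0, 0.1, len(df))

for rec_type, sub in df.groupby("rec_type"):
    # random intercept per participant; fixed effect of time point
    fit = smf.mixedlm("score ~ time", data=sub, groups="subject").fit()
    print(rec_type, fit.params["time"], fit.pvalues["time"])
```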
Ethics statement
The study protocol was approved by the Institutional Review Boards of SNUBH (approval number: B-2111-719-301) and BMC (approval number: 10-2021-138). All participants provided written informed consent after receiving a thorough explanation of the study.
RESULTS
In total, 130 participants were recruited for this study. A significant increase in either the stress thermometer score or the cortisol level was considered evidence of successful stress induction, and we retained data only from participants who exhibited such a stress response. Data from 15 individuals who showed no signs of stress after induction were excluded from further analysis. The demographic characteristics of the remaining 115 participants are presented in Table 1. Of these, 33.91% were male. The mean age was 33.56 years (standard deviation [SD], 7.90) at SNUBH and 37.71 years (SD, 10.18) at BMC.
The distribution of stress thermometer scores and cortisol levels is illustrated in Figure 3 at two time points: before (T0) and after (T1) stress induction. The mean stress thermometer scores at T0 and T1 were 1.304 (SD=1.482) and 2.896 (SD=2.253), respectively. The mean cortisol levels were 0.165 (SD=0.083) at T0 and 0.322 (SD=0.21) at T1. Both differences were statistically significant, with a p-value <0.0001 in the Mann–Whitney test.
After completing the stress-induction experiments and isolating participants' voices from the interviewer's voices and background noises, the average length of the voice recordings was 74.86 sec (SD=8.89) for script reading and 125.89 sec (SD=65.17) for interviews. The preprocessed voices were used to train the stress prediction models and test their performance.
In preliminary experiments, we investigated the distribution of prediction scores for each acoustic feature across three model architectures: a CNN [18], a conformer [24], and ECAPA-TDNN. To assess each participant's stress level, we computed stress prediction scores for each individual segment, made a binary decision per segment, and determined the overall stress level from the ratio of segments predicted as stressed. The ECAPA-TDNN model exhibited the best prediction performance at 77.5%, followed by the conformer at 62.5% and the CNN at 60%. Consequently, we selected the ECAPA-TDNN model to compute the stress status prediction scores. These prediction scores, associated with stress status, are hereafter referred to as stress scores.
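A minimal sketch of this segment-level aggregation is shown below; the 0.5 decision threshold and the stand-in scorer are assumptions, as the exact rule is not specified in the text.

```python
import torch

def recording_stress_ratio(model, segments, threshold=0.5):
    """Fraction of segments classified as stressed (threshold assumed)."""
    with torch.no_grad():
        logits = torch.stack(
            [model(s.unsqueeze(0)).squeeze() for s in segments])
    return (torch.sigmoid(logits) > threshold).float().mean().item()

# toy usage with a stand-in scorer over 80-band Mel segments
toy_model = lambda x: x.mean(dim=(1, 2))   # placeholder, not a real model
segments = [torch.randn(80, 400) for _ in range(12)]
ratio = recording_stress_ratio(toy_model, segments)
```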
To determine whether trimming the beginning or end of a voice recording could affect the accuracy of stress status detection, we examined the trend of stress model scores throughout the speech. Scatter plots of the scores against time, supplemented by smoothing splines for 20 samples from the test dataset, are presented in Figure 4 to assess the overall trend along the time axis. Although some participants exhibited a linear trend in their stress scores, we did not observe a consistent pattern that would justify the removal or selection of specific parts of the voice recordings across all participants. These varied patterns of stress score changes likely highlight interindividual variability. Therefore, full-length speech recordings were used in the subsequent stress detection tests.
Additionally, we compared the characteristics of stress scores derived from script reading and free speech in interview responses. We estimated the impact of stress induction on the scores for each type of voice recording and compared the magnitude of these impacts to determine which recording type provided the most distinctive information. When comparing T0 and T1, we used mixed-effects models that accounted for unique score variances among individuals by integrating an individual-specific random-effects term to address individual variability. After adjusting for individual effects by including the individual as a random effects covariate, the mean difference in the distribution of stress scores between the two time points was estimated to be greater for free speech (0.148 for recording type 2) compared to script reading (0.132 for recording type 1). Although both results showed significant p-values (p<0.001), the effect size and z-score of the test statistics were greater for free speech. These results indicate that the differences in stress scores between the two time points within each individual were more pronounced for free speech than for script reading, suggesting that voice recordings of free speech are more effective for distinguishing between relaxed (T0) and stressed (T1) states.
The distribution of scores for free-speech voices at T0 and T1 for the 20 test samples was further explored using box plots (Figure 5). Although the stressed state tended to have higher median scores, there was no complete separation between the relaxed and stressed states in the segment-wise stress scores, with an overlapping range across the two states. Among the 20 participants in the test dataset, BMC032, SNUBH042, and SNUBH048 showed marked differences in median scores between T0 and T1, whereas six participants, including BMC052, SNUBH043, and SNUBH062, showed little difference. The statistical significance of the difference between T0 and T1 for each individual was tested using the one-sided Wilcoxon rank-sum test, a nonparametric method for group comparisons; 14 of the 20 participants showed a significant increase in stress scores. In summary, through a two-step procedure of computing stress scores with a deep learning model for free-speech recordings and then conducting a statistical test for two-group comparisons, we correctly detected the stress status of 70% of the test participants. The test statistics and p-values are listed in Supplementary Table 1.
DISCUSSION
We collected data from 130 participants and obtained 115 valid datasets. We developed a deep learning model to compute stress-associated scores using Mel-spectrogram frequency features and the speaker identification model architecture ECAPA-TDNN [20]. The participants' voices were captured using two methods: reading a script and answering one or two randomly selected neutral questions. In our statistical analyses, the stress scores computed from free-speech voice signals (answering questions) showed a stronger association with an individual's stress status. As we observed that the distributions of individuals' stress prediction scores differed in both shape and location, we implemented a comparative test procedure that analyzed two sets of stress scores from the same individual to determine their stress status.
Many studies on the computational modeling of stress use physiological and behavioral information requiring special sensor devices, or psychological information that can be subjective [7]. Our model uses only the voice, which can be collected using personal devices such as mobile phones, tablets, smartwatches, or computers. In a prior study, Han et al. [40] developed a vocal stress classification model using long short-term memory (LSTM) networks; they trained the model on 25 subsampled voice segments, each 4 sec long, drawn from the same 25 participants, and achieved an accuracy of 66.4% for the Korean language [40]. We adopted a recent deep learning architecture developed for speaker identification, which is sensitive to subtle differences in voice signals. Consequently, our stress detection algorithm achieved an accuracy of 70% when evaluated on a test dataset set aside before model development. Because performance was evaluated on independent samples, this accuracy provides a more rigorous measure of external validity and generalizability.
The coronavirus disease-2019 (COVID-19) pandemic has significantly stressed populations worldwide. Historically, pandemics of infectious disease have had well-documented detrimental effects on mental health [41]. Concurrently, COVID-19-related mandates such as quarantine and social distancing have increased the demand for digital healthcare technologies that support mental health [42]. Monitoring personal health conditions using digital devices enables personalized and precise management through self-administered or remote digital healthcare services. Measurements related to behavior, physiological processes, and other health aspects are referred to as digital biomarkers [43]. Our voice-based stress model is a digital biomarker compatible with various mobile devices, including smartphones and tablets. Using our vocal stress test procedure, mental stress levels can be monitored periodically anywhere. Longitudinal records of the test results can serve as an objective history, providing a relative quantification of stress levels. Unlike traditional clinical questionnaires, which require introspective answers, our method potentially increases compliance by reducing inconvenience.
Our stress score model can be integrated into healthcare applications to assess mental stress. The procedure for testing mental stress levels is illustrated in Figure 6. Upon initial use, the application prompts the user to record their voice in a stress-free and relaxed state to establish a baseline reference. To capture the users’ free-speech voices, they are presented with one or more randomly selected questions from a predetermined list of neutral questions. The recording lasts for at least 1 min. The stress score is computed using the model stored in the application, and these values are set as user references. Subsequently, users can gauge their stress levels by recording their voices in response to questions. Although the score itself correlates with mental stress, the values derived from speech recordings can vary among individuals based on their speaking style. To adjust for this variability, we introduced a statistical test based on user-reference values, which enhances the detection accuracy of elevated stress levels.
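A minimal sketch of this reference-based procedure might look as follows, reusing the one-sided rank-sum decision rule from the statistical analysis; the class structure and the 0.05 cutoff applied in-app are our assumptions.

```python
from scipy.stats import ranksums

class StressMonitor:
    """Sketch of the in-app flow: store a relaxed-state reference, then test."""
    def __init__(self, baseline_scores, alpha=0.05):
        self.baseline = list(baseline_scores)   # segment scores at setup
        self.alpha = alpha                      # cutoff assumed for the app

    def elevated(self, current_scores):
        # one-sided rank-sum test: are current scores higher than the baseline?
        _, p = ranksums(current_scores, self.baseline, alternative="greater")
        return p < self.alpha

monitor = StressMonitor([0.30, 0.35, 0.28, 0.33])   # baseline recording
alert = monitor.elevated([0.55, 0.60, 0.52, 0.58])  # later recording
```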
The development of stress biomarkers and mobile applications offers several advantages: 1) they enable employees to monitor individualized stress patterns encountered in daily life, 2) they enhance the potential to prevent mental disorders related to stress, such as depression, and physiological conditions like cardiovascular disorders at a preclinical stage, and 3) they provide immediate stress-monitoring tools for clinical populations in need.
One concern regarding mobile health solutions is safeguarding personal information and ensuring user privacy. However, as the stress score model can be downloaded directly from the service provider, stress assessment can be performed without transmitting personal data. Using smartphone applications for stress management facilitates cost-effectiveness and consistent tracking.
Our study’s sample size was small. Future studies with larger sample sizes are required to further enhance the proposed model’s performance. Recent research has demonstrated that transfer learning, particularly using self-supervised models trained on public datasets, can improve performance across various tasks, even with a limited number of analysis samples. Thus, such advanced modeling approaches will be adopted in our future studies.
In conclusion, our study highlighted voice biomarkers as potentially effective digital tools for measuring stress levels. Using a deep learning model, we demonstrated that our method can detect stress with considerable accuracy. Moreover, this study introduces a linguistically tailored digital health solution, emphasizing the utility and novelty of Korean-language speech as a reliable stress biomarker. The practical implications of this research are substantial, especially in challenging times, such as the COVID-19 pandemic, when demand for remote mental health monitoring surges.
Integrating this voice-based system into mobile health platforms allows individuals to track their mental stress conveniently and privately. Although our findings are promising, the sample size was limited. Future studies with larger sample sizes are crucial to refine the effectiveness of the system. We also plan to apply advanced modeling techniques, such as transfer learning, to further improve model performance.
Supplementary Materials
The Supplement is available with this article at https://doi.org/10.30773/pi.2024.0131.
Notes
Availability of Data and Material
The datasets generated or analyzed during the study are not publicly available as per the internal policy of the SK Group’s funding organization. However, these datasets can be obtained from the corresponding author upon reasonable request and with the approval of the relevant committee.
Conflicts of Interest
The authors have no potential conflicts of interest to disclose.
Author Contributions
Conceptualization: Junghyun Namkung, Jeong-Hyun Kim, Je-Yeon Yun, So Young Yoo, Beomjun Min, Heyeon Park. Data curation: Ji-Hye Lee, Beomjun Min, Soyoung Baik. Formal analysis: Junghyun Namkung. Funding acquisition: Junghyun Namkung, Jeong-Hyun Kim. Investigation: Jeong-Hyun Kim, So Young Yoo, Junghyun Namkung. Methodology: Seok Min Kim, Won Ik Cho, Junghyun Namkung, Sang Yool Lee, Nam Soo Kim. Project administration: Junghyun Namkung. Resources: Jeong-Hyun Kim, So Young Yoo, Nam Soo Kim. Software: Seok Min Kim, Won Ik Cho, Junghyun Namkung, Sang Yool Lee, Nam Soo Kim. Supervision: Jeong-Hyun Kim, So Young Yoo, Nam Soo Kim. Visualization: Junghyun Namkung. Writing—original draft: Junghyun Namkung. Writing—review & editing: Jeong-Hyun Kim, Je-Yeon Yun.
Funding Statement
This research was supported by the SK Group joint fund through the SK SUPEX Council ICT Committee in South Korea in 2021.
Acknowledgements
None