Novel Deep Learning-Based Vocal Biomarkers for Stress Detection in Koreans

Article information

Psychiatry Investig. 2024;21(11):1228-1237
Publication date (electronic) : 2024 November 18
doi : https://doi.org/10.30773/pi.2024.0131
1BioMedical AI, AI Research Center, SK Telecom, Seongnam, Republic of Korea
2Department of Electrical and Computer Engineering and Institute of New Media and Communications, Seoul National University, Seoul, Republic of Korea
3Department of Psychiatry, Seoul National University College of Medicine, Seoul, Republic of Korea
4Department of Psychiatry, SMG-SNU Boramae Medical Center, Seoul, Republic of Korea
5Department of Public Health Medical Services, Seoul National University Bundang Hospital, Seongnam, Republic of Korea
6Liberal Arts College, Dongduk Women’s University, Seoul, Republic of Korea
7Department of Psychiatry, Seoul National University Bundang Hospital, Seongnam, Republic of Korea
8Seoul National University Hospital, Seoul, Republic of Korea
9Yeongeon Student Support Center, Seoul National University College of Medicine, Seoul, Republic of Korea
10Institute of Human Behavioral Medicine, Seoul National University Medical Research Center, Seoul, Republic of Korea
Correspondence: Jeong-Hyun Kim, MD, PhD Department of Psychiatry, Seoul National University Bundang Hospital, 82 Gumi-ro 173beon-gil, Bundang-gu, Seongnam 13620, Republic of Korea Tel: +82-31-787-2025, Fax: +82-31-787-4050, E-mail: retrial3@hanmail.net
Received 2024 April 22; Revised 2024 July 7; Accepted 2024 August 13.

Abstract

Objective

Rapid societal changes have underscored the importance of effective stress detection and management. Chronic mental stress contributes significantly to both physical and psychological illnesses. However, many individuals remain unaware of their stress levels until they face physical health issues, highlighting the necessity for regular stress monitoring. This study aimed to investigate the effectiveness of vocal biomarkers in detecting stress levels among healthy Korean employees and to contribute to digital healthcare solutions.

Methods

We conducted a multi-center clinical study by collecting voice recordings from 115 healthy Korean employees under both relaxed and stress-induced conditions. Stress was induced using the socially evaluated cold pressor test. The Emphasized Channel Attention, Propagation and Aggregation in Time delay neural network (ECAPA-TDNN) deep learning architecture, renowned for its advanced capabilities in analyzing person-specific voice features, was employed to develop stress prediction scores.

Results

The proposed model achieved a 70% accuracy rate in detecting stress. This performance underscores the potential of vocal biomarkers as a convenient and effective tool for individuals to self-monitor and manage their stress levels within digital healthcare frameworks.

Conclusion

The findings emphasize the promise of voice-based mental stress assessments within the Korean population and the importance of continued research on vocal biomarkers across diverse linguistic demographics.

INTRODUCTION

Although stress can have some positive effects, such as providing energy and motivation and sharpening our focus, chronic stress can be detrimental. Psychological stress has been linked to numerous health issues, including mental disorders [1], cardiovascular diseases [2], cancers [3], and autoimmune diseases [4]. Additionally, work-related stress can lead to employee burnout [5] and negatively impact an organization’s productivity [6]. Therefore, managing stress is essential for enhancing long-term quality of life. Regular stress monitoring can help identify when active self-care or professional assistance is necessary. However, stress diagnosis is subjective, which poses challenges in developing a universally applicable diagnostic model [7]. Clinically, stress levels are typically measured through self-reported, expert-verified questionnaires, such as the Perceived Stress Scale [8], or visual analog scales, such as a stress thermometer [9]. Nonetheless, these methods are subjective and may not provide a consistent measurement scale for stress levels.

Extensive efforts have been made to develop biological markers that objectively and consistently measure stress, including metrics such as blood pressure, heart rate variability, electroencephalogram signals, and finger temperature [7]. Since the early 1980s, speech and voice have also been recognized as valuable biomarkers, providing probabilistic information about mental health [10]. Stress impacts both linguistic and non-linguistic speech characteristics [11]. Linguistic features (such as vocabulary diversity, lexical and grammatical complexity, syntactic structures, and semantic skills) can indicate cognitive impairment [11]. However, these features can be influenced by an individual’s educational and social background [12]. Non-linguistic features pertain to how speech sounds are produced through the modulation of muscle tension and respiration [13]. Increased stress can raise both muscle tension and respiration rates, thereby altering vocal sounds [14]. Recent developments in speech-based vocal stress models include the work of Lu et al. [15], who developed StressSense. This system uses pitch features, spectral centroids, high-frequency ratios, speaking rates, mel-frequency cepstral coefficients (MFCCs), and nonlinear features derived from the Teager Energy Operator [16]. They trained classification models using Gaussian mixture models (GMMs) for predetermined acoustic feature sets. Similarly, Sondhi et al. [14] analyzed stress using seven acoustic features: pitch, shimmer, four formants, and mean pitch. Traditional methods for building prediction models for voice data have relied on a variety of statistical approaches. For example, GMMs model the distribution of voice feature vectors, including MFCCs, to capture the characteristics of speech sounds. Support vector machines (SVMs) use hyperplanes to separate classes in the high-dimensional space of numerous acoustic features.
Traditional approaches are limited in handling the complexity of voice data and often struggle with the variability of real-world speech. In contrast, deep learning models can identify complex patterns and features in voice data that traditional models often miss, leading to higher accuracy and performance in voice recognition tasks [17]. Deep learning models incorporate various architectures, such as Convolutional Neural Networks (CNNs) [18] and Recurrent Neural Networks (RNNs) [19], to handle diverse and complex voice signals, improving robustness and adaptability to various conditions. Additionally, these models can better capture temporal and contextual information in voice data, which is difficult to account for with traditional models. Recent deep learning models, including the Emphasized Channel Attention, Propagation and Aggregation in Time delay neural network (ECAPA-TDNN) [20], use sophisticated feature extraction techniques and attention mechanisms such as Squeeze-and-Excitation blocks [21] to focus on the most relevant parts of the audio signal. This targeted approach improves the model’s precision and discriminative power. Despite significant advances in deep learning modeling of speech signals, its application to stress modeling has been relatively rare. Additionally, most models have been developed for Western populations, and few vocal stress models have been tailored for Koreans.

In this study, we aimed to develop stress prediction models for vocal biomarkers using an in-house dataset generated from a Korean clinical study. Recently introduced high-performance models based on deep learning structures, capable of speaker recognition from speech recordings, have been explored [20,22-25]. By leveraging these speaker-verification model architectures, we aim to enhance the accuracy of detecting individual stress fluctuations by accurately capturing unique voice characteristics.

METHODS

Participants

This study was part of a project conducted by SK Telecom Co., Ltd. (Seoul, Korea), aimed at developing digital biomarkers for assessing stress in healthy employees through voice and biosensor signals. Participants were recruited through advertisements at Seoul National University Bundang Hospital (SNUBH) and Boramae Medical Center (BMC) between December 2021 and February 2022.

The inclusion criteria were: 1) age between 19 and 65 years, and 2) full-time employment. The exclusion criteria were: 1) age under 19 or over 65 years, 2) cognitive impairments, such as dementia or mental retardation, 3) serious neurological disorders, including epilepsy and stroke, 4) history of schizophrenia, bipolar I disorder, or other psychotic disorders, 5) current reports of suicidal ideation, 6) history of cardiovascular disease, pulmonary disease, and associated pharmacological therapy that could affect heart rate variability measurements, 7) vocal diseases affecting speech analysis, 8) conditions that could alter cortisol levels, such as Cushing’s disease, 9) female participants not in the luteal phase of their menstrual cycle, and 10) a sensitivity to temperature changes, such as Raynaud’s syndrome.

Psychiatric diagnoses were made during the screening process using the Mini-International Neuropsychiatric Interview (MINI) [26], a brief, structured psychiatric interview designed to identify a broad range of psychiatric disorders according to the Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition. The interview format required responses to be either “yes” or “no.” The Korean version of the MINI is well-validated and reliable [27]. The MINI assessments were conducted by two psychologists, each holding a master’s degree and well-trained in administering the MINI.

Baseline assessments

Demographic information, including age, sex, duration of employment, marital status (such as married or separated), and educational status (college education or higher, less than college education), was collected using a self-reported questionnaire provided in paper format. To assess the baseline mental stress levels of the participants, we utilized the Perceived Stress Scale [8]. This scale is designed to measure the extent to which individuals perceive their life situations as unpredictable, uncontrollable, and overwhelming. It consists of 10 direct questions about experiences that may elicit feelings of upset, nervousness, stress, or irritation; six items are phrased negatively and four positively. Responses were recorded on a 5-point Likert scale, ranging from 0 (never) to 4 (very often), with higher scores indicating greater perceived stress. The Korean version of the Perceived Stress Scale has shown a Cronbach’s alpha of 0.74 and a test-retest reliability with an intraclass correlation coefficient of 0.81.

Stress induction and its validation measures

Voice and biological signals were collected from the participants at two different times: in a relaxed and stressed state. In the relaxed state, the participants were instructed to rest in a quiet space for 10 min.

To induce a stressed state, we employed the socially evaluated cold pressor test (SECPT) [28,29]. This widely used method for inducing acute mental stress involves participants immersing their hands in ice-cold water for up to 3 minutes while under social evaluation, which elicits both physiological and psychological stress responses. Participants recorded their speech both before and after undergoing the SECPT. Stress levels were evaluated using self-reported distress thermometer (DT) scores and salivary cortisol levels before and after stress induction.

The DT is a single-item self-report measure developed by Roth et al. [9], with a cutoff value set at 4, according to National Comprehensive Cancer Network recommendations [30]. This cutoff score has been validated in other studies, including a meta-analysis [31]. The DT uses a vertical visual analog scale ranging from 0 (no distress) to 10 (extreme distress). Although originally developed for cancer patients, it is widely used across various contexts and populations due to its simplicity and ease of use [32].

In addition to self-reported stress scores, salivary cortisol levels were objectively measured to validate participants’ stress states. Salivary cortisol is considered a reliable biomarker of psychological stress [33]. Participants collected saliva samples using a Saliva Bio Oral Swab (Salimetrics, LLC, Carlsbad, CA, USA). The swab was placed under the participant’s tongue for 2 min, extending to 5 min if not sufficiently saturated. Once collected, the swab was stored at -20°C until analysis. Salivary cortisol samples were analyzed by the Green Cross Labs (Yongin, Korea) using a time-resolved immunoassay with fluorometric detection [34]. Both inter- and intra-assay coefficients of variation were maintained below 9%.

Voice recordings

Considering that linguistic features can be influenced by various personal factors such as education and economic status, we aimed to develop acoustic biomarkers for a more universally applicable model. For our input source, we included voice recordings of participants reading a script and responding freely to questions. The script, a Korean essay titled “Autumn,” provides an objective description of the natural environment, minimizing emotional influence. This script consists of 141 words, and participants typically took about 1 min to read it. Following the script reading, participants answered questions posed by the experimenter on topics such as their daily routines, hobbies, and a recently watched movie or book. If their responses to a question were shorter than 1 min, they were prompted to answer another question. However, there was no set upper limit for the length of the speech responses.

Voice recordings were conducted using a Philips Voice Tracer VTR 7100 (Koninklijke Philips N.V., Amsterdam, The Netherlands) and saved in the WAV format. The audio was captured at a sampling rate of 24,000 Hz.

The experimental procedure used to collect participants’ biological data in both relaxed (non-stressed) (T0) and stressed (T1) states is illustrated in Figure 1.

Figure 1.

Experimental procedure to collect participants’ biological data in both relaxed (non-stressed) (T0) and stressed (T1) states. Voices were recorded in two ways: script reading and responding to questions. Stress levels were evaluated using the salivary cortisol level and the self-reported DT score. The SECPT method was used for stress induction. SECPT, socially evaluated cold pressor test; DT, distress thermometer.

Preprocessing of voice recording data

Raw voice-recorded data were manually reviewed for validity. We isolated the participants’ voices by excluding the experimenter’s voice and other background noises, utilizing timestamps from GoldWave (version 6.54; GoldWave Inc., St. John’s, NL, Canada). The data were divided into three sets: training, validation, and testing. The samples were stratified to ensure similar age distributions and sex ratios in each subset and shuffled to prevent sequence bias. Within each hospital’s data, samples were divided into four subgroups based on the following criteria: 1) males below the median age, 2) males above the median age, 3) females below the median age, and 4) females above the median age. After shuffling within these subgroups, the samples were allocated into training, validation, and test datasets containing 75, 20, and 20 samples, respectively.
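The stratified allocation described above can be sketched as follows. This is a minimal illustration rather than the procedure actually used in the study: the record fields (`id`, `hospital`, `sex`, `age`), the proportional rounding within each subgroup, the handling of participants exactly at the median age, and the demo cohort are all assumptions made for the example.

```python
import random
from collections import defaultdict
from statistics import median


def stratified_split(participants, fractions=(75 / 115, 20 / 115), seed=0):
    """Split participant IDs into train/validation/test within each stratum.

    Strata are hospital x sex x (age above that hospital's median age).
    `fractions` gives the assumed train and validation shares; the rest is test.
    """
    rng = random.Random(seed)

    # The median age is computed separately for each hospital.
    ages_by_hospital = defaultdict(list)
    for p in participants:
        ages_by_hospital[p["hospital"]].append(p["age"])
    median_age = {h: median(a) for h, a in ages_by_hospital.items()}

    # Assumption: participants exactly at the median fall in the "below" group.
    strata = defaultdict(list)
    for p in participants:
        older = p["age"] > median_age[p["hospital"]]
        strata[(p["hospital"], p["sex"], older)].append(p["id"])

    split = {"train": [], "validation": [], "test": []}
    for key in sorted(strata):
        ids = strata[key]
        rng.shuffle(ids)  # random order within the subgroup
        n_train = round(len(ids) * fractions[0])
        n_val = round(len(ids) * fractions[1])
        split["train"] += ids[:n_train]
        split["validation"] += ids[n_train:n_train + n_val]
        split["test"] += ids[n_train + n_val:]
    return split


# Synthetic demo cohort (not study data).
demo = [
    {"id": f"P{i:03d}", "hospital": "SNUBH" if i % 2 else "BMC",
     "sex": "M" if i % 3 == 0 else "F", "age": 20 + (i * 7) % 40}
    for i in range(115)
]
split = stratified_split(demo)
```

Because the counts are rounded within each subgroup, the resulting set sizes only approximate 75, 20, and 20; reproducing the exact sizes would require an additional adjustment step that the text does not specify.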

Classification labels were assigned as 0 for the relaxed state (T0) and 1 for the stressed state (T1). The voice recordings, initially in waveform format, were resampled to a rate of 16,000 Hz to reduce computational costs; resampling caused no noticeable performance degradation.

Deep learning models for stress status classification

Using the training and validation datasets, we developed deep learning models to differentiate between two stress conditions, T0 (relaxed state) and T1 (stressed state), using acoustic features from voice-recording segments. The process of constructing these classification models involved three steps: segmenting the voice recordings, extracting the acoustic features, and training the deep neural network models (Figure 2).

Figure 2.

Classification model construction process. A stress prediction model was fitted using both training and validation datasets. For model construction, type 1 and type 2 voice recordings were incorporated as inputs. The performance of the fitted model is assessed using a test dataset that consists of samples not used for training and validation. ECAPA-TDNN, Emphasized Channel Attention, Propagation and Aggregation in Time delay neural network.

In the segmentation of voice recording input data, our data consisted of voice recordings from participants collected before (T0) and after (T1) stress induction using SECPT [28,29]. To process the data, recordings were divided into 4-sec segments with a 75% overlap with the previous segments. This overlap between two adjacent segments aimed to improve information capture by better detecting transient patterns in voice signals and avoiding boundary issues, including the loss of information near the edges, which occurs in the fixed-window approach. Consequently, segment overlapping can enhance the resolution of the extracted features and improve model performance.
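As a rough sketch of this windowing, the function below computes the sample boundaries of 4-sec segments with 75% overlap at a 16,000 Hz sampling rate, which corresponds to a hop of 1 sec. The function name and the decision to drop a trailing partial window are assumptions; the text does not say how the end of a recording was handled.

```python
def segment_bounds(n_samples, sample_rate=16_000, window_sec=4.0, overlap=0.75):
    """Return (start, end) sample indices of fixed-length overlapping windows."""
    window = int(window_sec * sample_rate)   # 64,000 samples per segment
    hop = int(window * (1.0 - overlap))      # 16,000 samples, i.e. a 1-sec step
    if n_samples < window:
        return []
    # Assumption: a trailing partial window is discarded.
    return [(s, s + window) for s in range(0, n_samples - window + 1, hop)]


bounds = segment_bounds(n_samples=75 * 16_000)  # a 75-sec recording
print(len(bounds), bounds[0], bounds[-1])
```

Under these settings a 75-sec recording yields 72 segments, so each participant contributes many segment-level scores per recording.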

For the extraction of acoustic features, an 80-dimensional Mel spectrogram was calculated from each voice segment. The Mel spectrum feature was selected because it is widely used in speech analysis and effectively represents the acoustic properties of voice signals [34].
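To illustrate what an 80-band Mel representation implies, the sketch below computes the center frequencies of 80 filters spaced equally on the Mel scale. It assumes the feature refers to 80 Mel bands, uses the common HTK-style conversion formula, and takes the upper limit as the 8,000 Hz Nyquist frequency of 16,000 Hz audio; none of these settings is stated in the text.

```python
import math


def hz_to_mel(f_hz):
    """HTK-style Hz-to-Mel conversion (assumed variant)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_mels=80, f_min=0.0, f_max=8000.0):
    """Center frequencies (Hz) of n_mels filters equally spaced on the Mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]


centers = mel_center_frequencies()
```

The spacing between adjacent centers widens with frequency, so the representation is finer at low frequencies, where much of the pitch and formant information lies.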

In the training of the deep learning model, the ECAPA-TDNN model was chosen due to its consistent performance across various fields such as speaker recognition [20,35], speaker diarization [22], text-to-speech synthesis [36], spoken language identification [37], and emotion recognition [38]. This model was selected to effectively capture personal vocal characteristics. Using the 80-dimensional Mel spectrogram as the input, a deep learning model based on the ECAPA-TDNN architecture [20] was trained to classify the stress status of the study participants. Stress status was classified into two categories: relaxed and stressed, corresponding to the recording times T0 and T1. The ECAPA-TDNN model was trained using the binary cross-entropy loss function and the adaptive moment estimation (Adam) optimizer [39].
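For reference, the binary cross-entropy loss for a single segment can be written in a few lines. This is a generic sketch of the loss itself, not the training code used in the study; the function names and the clamping constant `eps` are illustrative choices.

```python
import math


def binary_cross_entropy(p_stressed, label, eps=1e-7):
    """Loss for one segment; label is 0 (T0, relaxed) or 1 (T1, stressed)."""
    p = min(max(p_stressed, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def mean_bce(probabilities, labels):
    """Average loss over a batch of segments."""
    losses = [binary_cross_entropy(p, y) for p, y in zip(probabilities, labels)]
    return sum(losses) / len(losses)
```

The loss is small when the predicted probability agrees with the label and grows sharply for confident wrong predictions, which is what drives the model to separate T0 and T1 segments.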

The performance of the final model was evaluated based on its accuracy on the test dataset (Figure 2). Resampling and feature extraction were conducted using Torchaudio version 0.7.0 (https://pytorch.org/audio/stable/index.html), while model optimizations were performed with Torch version 1.7.0 (https://pytorch.org).

Statistical analysis

As described in the preprocessing section, the training, validation, and test datasets consisted of 75, 20, and 20 non-overlapping samples, respectively. Stress score models were trained using the training and validation sets, and their performances were evaluated on the test dataset. The performance was assessed based on the discriminative power of the distributions of stress scores between the stress status T0 and T1. The statistical significance of the increase in stress scores at T1 compared with T0 was evaluated using a one-tailed Wilcoxon rank-sum (Mann–Whitney) test. A significance level of 0.05 was used as the decision cutoff. The Wilcoxon rank-sum test is a nonparametric method that is robust to the distributional shape of the observations.
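A minimal, dependency-free sketch of such a one-tailed rank-sum comparison is given below. It uses the normal approximation with a continuity correction and no tie correction, which is an assumption about the computation; the study reports using scipy.stats, whose defaults may differ. The score values are made up for illustration.

```python
import math


def rank_sum_greater(stressed, relaxed):
    """One-sided Mann-Whitney U test of H1: stressed scores exceed relaxed scores.

    Returns (U, p) with p from the normal approximation (continuity-corrected).
    """
    n1, n2 = len(stressed), len(relaxed)
    # U counts pairs in which the stressed-state score is larger (ties count 0.5).
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0
            for a in stressed for b in relaxed)
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u - 0.5) / sd_u
    p = 0.5 * math.erfc(z / math.sqrt(2.0))  # upper tail of the standard normal
    return u, p


# Hypothetical segment-wise scores for one participant.
relaxed_scores = [0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.55]
stressed_scores = [0.60, 0.62, 0.65, 0.70, 0.72, 0.75, 0.80, 0.85]
u_stat, p_value = rank_sum_greater(stressed_scores, relaxed_scores)
is_stressed = p_value < 0.05  # decision cutoff stated in the text
```

Note that heavily overlapping segments are not independent observations, so p-values from any such segment-level test are best read as a decision score rather than a strict error rate.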

To account for intra-individual score variances as a random component, mixed-effects models were employed to assess the effects of stress induction on changes in prediction scores. Subsequently, the results were compared between two voice recording methods: script reading and free speech. All statistical tests for group comparisons of clinical features and stress scores were performed using functions from the scipy.stats library version 1.10.1 in Python (https://scipy.org).

Ethics statement

The study protocol was approved by the Institutional Review Boards of SNUBH (approval number: B-2111-719-301) and BMC (approval number: 10-2021-138). All participants provided written informed consent after receiving a thorough explanation of the study.

RESULTS

In total, 130 participants were recruited for this study. We selected data from those who exhibited sustained stress, as indicated by self-reported stress thermometer readings or cortisol levels. Data from 15 individuals who showed no signs of stress after induction were excluded from further analysis. A significant increase in either the stress thermometer score or cortisol level was considered evidence of successful stress induction. The demographic characteristics of the 115 participants are presented in Table 1. Of these, 33.91% were male. The mean age was 33.56 years (standard deviation [SD], 7.90) at SNUBH and 37.71 years (SD, 10.18) at BMC.

Table 1.

Demographic and clinical characteristics of the study participants at baseline

The distribution of stress thermometer scores and cortisol levels is illustrated in Figure 3 at two time points: before (T0) and after (T1) stress induction. The mean stress thermometer scores at T0 and T1 were 1.304 (SD=1.482) and 2.896 (SD=2.253), respectively. The mean cortisol levels were 0.165 (SD=0.083) at T0 and 0.322 (SD=0.21) at T1. Both differences were statistically significant, with a p-value <0.0001 in the Mann–Whitney test.

Figure 3.

Differences in the distributions of the two stress measure scores according to the measured time points prior to the stress induction experiment (T0) and postinduction (T1). Self-reported stress thermometer levels (A) and cortisol levels (B).

After completing the stress-induction experiments and isolating participants’ voices from the interviewer’s voices and background noises, the average length of the voice recordings was 74.86 seconds (SD=8.89) for script reading and 125.89 seconds (SD=65.17) for interviews. The preprocessed voices were used to train the stress prediction models and test their performance.

In preliminary studies, the distribution of prediction scores for each acoustic feature across three different model architectures was investigated: CNN [18], conformer [24], and ECAPA-TDNN. To assess the stress level of the participants, we computed stress prediction scores for each individual segment, made a simple binary decision for each segment, and finally determined the stress level based on the ratio of predicted states from T0 to T1. The ECAPA-TDNN model exhibited the best prediction performance at 77.5%, followed by the conformer at 62.5% and the CNN at 60%. Consequently, we selected the ECAPA-TDNN model to compute the stress status prediction scores. These prediction scores, associated with stress status, are hereafter referred to as the stress score.
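The segment-to-recording aggregation used in these preliminary comparisons can be sketched as a simple vote. Both thresholds below (0.5 on the segment score and 0.5 on the fraction of segments) are assumptions, since the text does not report the decision rule in detail, and the function names are illustrative.

```python
def stressed_fraction(segment_scores, score_threshold=0.5):
    """Fraction of segments whose score is at or above the threshold."""
    decisions = [1 if s >= score_threshold else 0 for s in segment_scores]
    return sum(decisions) / len(decisions)


def predict_state(segment_scores, score_threshold=0.5, ratio_threshold=0.5):
    """Label a recording 'T1' (stressed) when most segments are predicted stressed."""
    frac = stressed_fraction(segment_scores, score_threshold)
    return "T1" if frac > ratio_threshold else "T0"
```

For example, a recording with segment scores of 0.2, 0.7, 0.8, and 0.9 has three of four segments above the threshold and would be labeled T1 under these assumed settings.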

To determine whether trimming the beginning or end of a voice recording could affect the accuracy of stress status detection, we examined the trend of stress model scores throughout the speech. Scatter plots of the scores against time, supplemented by smoothing splines for 20 samples from the test dataset, are presented in Figure 4 to assess the overall trend along the time axis. Although some participants exhibited a linear trend in their stress scores, we did not observe a consistent pattern that would justify the removal or selection of specific parts of the voice recordings across all participants. These varied patterns of stress score changes likely highlight interindividual variability. Therefore, full-length speech recordings were used in the subsequent stress detection tests.

Figure 4.

The segment-wise stress scores computed using our deep learning-based stress prediction model for 20 test samples plotted on the time (seconds) axis. The plot shows the voice recordings of interview responses (type 2). The stress scores computed for voice recordings in the relaxed state (T0) are depicted as blue dots, whereas those in the stressed state (T1) are represented as red dots. Smoothing splines are also displayed to show the overall trend along the time axis. The vertical axis represents the stress model prediction score. BMC, Boramae Medical Center; SNUBH, Seoul National University Bundang Hospital.

Additionally, we compared the characteristics of stress scores derived from script reading and free speech in interview responses. We estimated the impact of stress induction on the scores for each type of voice recording and compared the magnitude of these impacts to determine which recording type provided more distinctive information. When comparing T0 and T1, we used mixed-effects models with an individual-specific random-effects term to account for score variability among individuals. After adjusting for individual effects, the mean difference in stress scores between the two time points was estimated to be greater for free speech (0.148 for recording type 2) than for script reading (0.132 for recording type 1). Although both results were significant (p<0.001), the effect size and z-score of the test statistics were greater for free speech. These results indicate that the differences in stress scores between the two time points within each individual were more pronounced for free speech than for script reading, suggesting that voice recordings of free speech are more effective for distinguishing between relaxed (T0) and stressed (T1) states.
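One way to write down a model of this kind is the random-intercept form below. It is a plausible reading of the description rather than the specification reported by the authors; the symbols are introduced here for illustration only.

```latex
s_{ij} = \beta_0 + \beta_1 T_{ij} + u_i + \varepsilon_{ij},
\qquad u_i \sim \mathcal{N}(0, \sigma_u^2),
\qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

Here s_{ij} is the stress score of segment j from participant i, T_{ij} is 0 for the relaxed state (T0) and 1 for the stressed state (T1), u_i is the participant-specific random intercept, and \beta_1 is the mean score difference between the two time points. Under this reading, the reported estimates of 0.148 (free speech) and 0.132 (script reading) correspond to \beta_1 fitted separately for each recording type.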

The distribution of scores for free-speech voices at T0 and T1 for the 20 test samples was further explored using box plots (Figure 5). Although the stressed state tended to have higher median scores, there was no complete separation between the relaxed and stressed states in the segment-wise stress scores, with an overlapping range across the two states. Among the 20 participants in the test dataset, BMC032, SNUBH042, and SNUBH048 showed significant differences in the median scores between T0 and T1. However, 6 participants, including BMC052, SNUBH043, and SNUBH062, showed little difference. The statistical significance of the differences between T0 and T1 for each individual was tested using the one-sided Wilcoxon rank-sum (Mann–Whitney) test, a nonparametric method for group comparisons. Of the 20 participants, 14 were identified as having increased stress levels at T1. In summary, through a two-step procedure (computing stress scores based on a deep learning model for free speech recordings and then conducting a statistical test for two-group comparisons), we correctly detected the stress status in 70% of the test participants. The test statistics and p-values are listed in Supplementary Table 1.

Figure 5.

Box plots of stress prediction score distributions for 20 test samples from the relaxed state (T0) and the stress state (T1). BMC, Boramae Medical Center; SNUBH, Seoul National University Bundang Hospital.

DISCUSSION

We collected data from 130 participants and obtained 115 datasets. We developed a deep learning model to compute stress-associated scores using Mel-spectrogram features and the speaker identification model architecture, ECAPA-TDNN [20]. The participants’ voices were captured using two methods: reading a script and answering one or two randomly selected neutral questions. In our statistical analyses, the stress scores computed using voice signals from free speech (answering questions) showed a stronger association with an individual’s stress status. As we observed that the distribution of individuals’ stress prediction scores differed in both shape and location, we implemented a comparative test procedure that analyzed two sets of stress scores from the same individual to determine their stress status.

There are many studies on the computational modeling of stress, most of which use physiological and behavioral information requiring special sensor devices, or psychological information that could be subjective [7]. Our model uses only voices, which can be collected using personal devices such as mobile phones, tablets, smartwatches, or computers. In a prior study, Han et al. [40] developed a vocal stress classification model using long short-term memory (LSTM). They trained the model using 25 subsampled voice segments, each 4 sec long, from the same 25 participants and achieved an accuracy of 66.4% for the Korean language [40]. We adopted the latest advanced deep-learning architecture for speech models using speaker identification, which is sensitive to subtle differences in voice signals. Consequently, our algorithm for stress detection achieved an accuracy of 70% when evaluated using a test dataset that had been set aside prior to model development. As performance was evaluated using independent samples, the accuracy results provide a more rigorous measure of external validity or generalizability.

The coronavirus disease-2019 (COVID-19) pandemic has significantly stressed populations worldwide. Historically, the detrimental effects of pandemic infectious diseases on mental health are expected outcomes [41]. Concurrently, COVID-19-related mandates such as quarantine and social distancing have increased the demand for digital healthcare technologies to enhance mental health [42]. Monitoring various personal health conditions using digital devices enables personalized and precise management through self- or remote digital healthcare services. Measurements related to behavior, physiological processes, and other health aspects are referred to as digital biomarkers [43]. Our voice-based stress model is a digital biomarker compatible with various mobile devices, including smartphones and tablets. Using our vocal stress test procedure, periodic monitoring of mental stress levels can be conducted anywhere. Longitudinal records of the stress test results can serve as an objective history to provide a relative quantification of stress levels. Unlike traditional clinical questionnaires, which require introspective answers, our method potentially increases compliance by reducing inconvenience.

Our stress score model can be integrated into healthcare applications to assess mental stress. The procedure for testing mental stress levels is illustrated in Figure 6. Upon initial use, the application prompts the user to record their voice in a stress-free and relaxed state to establish a baseline reference. To capture the users’ free-speech voices, they are presented with one or more randomly selected questions from a predetermined list of neutral questions. The recording lasts for at least 1 min. The stress score is computed using the model stored in the application, and these values are set as user references. Subsequently, users can gauge their stress levels by recording their voices in response to questions. Although the score itself correlates with mental stress, the values derived from speech recordings can vary among individuals based on their speaking style. To adjust for this variability, we introduced a statistical test based on user-reference values, which enhances the detection accuracy of elevated stress levels.

Figure 6.

Flowchart of individual stress test procedure. A pretrained algorithm was used to compute the stress score from the recorded speech, which served as a response to one or more questions randomly selected from a prestored list of neutral questions. To initialize the reference values, the users were asked to record their speech in their most relaxed state upon first use. The stress status of the user is then determined using statistical tests that compare the current test input with the reference values.

The development of stress biomarkers and mobile applications offers several advantages: 1) they enable employees to monitor individualized stress patterns encountered in daily life, 2) they enhance the potential to prevent mental disorders related to stress, such as depression, and physiological conditions like cardiovascular disorders at a preclinical stage, and 3) they provide immediate stress-monitoring tools for clinical populations in need.

One concern regarding mobile health solutions is safeguarding personal information and ensuring user privacy. However, because the stress score model can be downloaded directly from the service provider and run within the application, stress assessment can be performed without transmitting personal data. Smartphone-based stress management is also cost-effective and supports consistent tracking.

Our study’s sample size was small. Future studies with larger sample sizes are required to further enhance the proposed model’s performance. Recent research has demonstrated that transfer learning, particularly using self-supervised models trained on public datasets, can improve performance across various tasks, even with a limited number of analysis samples. Thus, such advanced modeling approaches will be adopted in our future studies.

In conclusion, our study highlighted voice biomarkers as potentially effective digital tools for measuring stress levels. Using a deep learning model, we demonstrated that our method can detect stress with considerable accuracy. Moreover, this study introduces a linguistically tailored digital health solution, emphasizing the utility and novelty of using Korean speech as a reliable stress biomarker. The practical implications of this research are substantial, especially during crises such as the COVID-19 pandemic, when the demand for remote mental health monitoring surges.

Integrating this voice-based system into mobile health platforms allows individuals to track their mental stress conveniently and privately. Although our findings are promising, the sample size was limited. Future studies with larger sample sizes are crucial to refine the effectiveness of the system. We also plan to apply advanced modeling techniques, such as transfer learning, to further improve model performance.

Supplementary Materials

The Supplement is available with this article at https://doi.org/10.30773/pi.2024.0131.

Supplementary Table 1.

Stress status was determined based on Mann–Whitney test p-values from 20 test samples

pi-2024-0131-Supplementary-Table-1.pdf

Notes

Availability of Data and Material

The datasets generated or analyzed during the study are not publicly available as per the internal policy of the SK Group’s funding organization. However, these datasets can be obtained from the corresponding author upon reasonable request and with the approval of the relevant committee.

Conflicts of Interest

The authors have no potential conflicts of interest to disclose.

Author Contributions

Conceptualization: Junghyun Namkung, Jeong-Hyun Kim, Je-Yeon Yun, So Young Yoo, Beomjun Min, Heyeon Park. Data curation: Ji-Hye Lee, Beomjun Min, Soyoung Baik. Formal analysis: Junghyun Namkung. Funding acquisition: Junghyun Namkung, Jeong-Hyun Kim. Investigation: Jeong-Hyun Kim, So Young Yoo, Junghyun Namkung. Methodology: Seok Min Kim, Won Ik Cho, Junghyun Namkung, Sang Yool Lee, Nam Soo Kim. Project administration: Junghyun Namkung. Resources: Jeong-Hyun Kim, So Young Yoo, Nam Soo Kim. Software: Seok Min Kim, Won Ik Cho, Junghyun Namkung, Sang Yool Lee, Nam Soo Kim. Supervision: Jeong-Hyun Kim, So Young Yoo, Nam Soo Kim. Visualization: Junghyun Namkung. Writing—original draft: Junghyun Namkung. Writing—review & editing: Jeong-Hyun Kim, Je-Yeon Yun.

Funding Statement

This research was supported by the SK Group joint fund through the SK SUPEX Council ICT Committee in South Korea in 2021.

Acknowledgements

None

References

1. Cohen S, Janicki-Deverts D, Miller GE. Psychological stress and disease. JAMA 2007;298:1685–1687.
2. Dimsdale JE. Psychological stress and cardiovascular disease. J Am Coll Cardiol 2008;51:1237–1246.
3. Soung NK, Kim BY. Psychological stress and cancer. J Anal Sci Technol 2015;6:30.
4. Goodday SM, Friend S. Unlocking stress and forecasting its consequences with digital technology. NPJ Digit Med 2019;2:75.
5. Friganović A, Selič P, Ilić B, Sedić B. Stress and burnout syndrome and their associations with coping and job satisfaction in critical care nurses: a literature review. Psychiatr Danub 2019;31(Suppl 1):21–31.
6. Kim JH. The relationship between employee’s work-related stress and work ability based on qualitative literature analysis. J Ind Distrib Bus 2021;12:15–25.
7. Sharma S, Singh G, Sharma M. A comprehensive review and analysis of supervised learning and soft computing techniques for stress diagnosis in humans. Comput Biol Med 2021;134:104450.
8. Cohen S, Kamarck T, Mermelstein R. A global measure of perceived stress. J Health Soc Behav 1983;24:385–396.
9. Roth AJ, Kornblith AB, Scher HI, Holland JC. Rapid screening for psychologic distress in men with prostate carcinoma: a pilot study. Cancer 1998;82:1904–1908.
10. Schultebraucks K, Yadav V, Shalev AY, Bonanno GA, Galatzer-Levy IR. Deep learning-based classification of posttraumatic stress disorder and depression following trauma utilizing visual and auditory markers of arousal and mood. Psychol Med 2022;52:957–967.
11. Paulmann S, Furnes D, Bøkenes AM, Cozzolino PJ. How psychological stress affects emotional prosody. PLoS One 2016;11e0165022.
12. Robin J, Harrison JE, Kaufman LD, Rudzicz F, Simpson W, Yancheva M. Evaluation of speech-based digital biomarkers: review and recommendations. Digit Biomark 2020;4:99–108.
13. Titze IR. Principles of voice production Iowa City: National Center for Voice and Speech; 2000.
14. Sondhi S, Munna K, Vijay R, Salhan A. Vocal indicators of emotional stress. Int J Comput Appl 2015;122:38–43.
15. Lu H, Frauendorfer D, Rabbi M, Mast MS, Chittaranjan GT, Campbell AT, et al. StressSense: detecting stress in unconstrained acoustic environments using smartphones. UbiComp’12: Proceedings of the 2012 ACM Conference on Ubiquitous Computing; 2012 Sep 5-8; Pittsburgh, United States. New York: Association for Computing Machinery; 2012. p. 351-360.
16. Teager H. Some observations on oral air flow during phonation. IEEE Trans Acoust Speech Signal Process 1980;28:599–601.
17. Ahamad A, Christian N, Luling, Lodhi AK, Mamodiya U, Khan IR. Evaluating AI system performance by recognition of voice during social conversation. 2022 5th International Conference on Contemporary Computing and Informatics (IC3I); 2022 Dec 14-16; Uttar Pradesh, India. New York: Institute of Electrical and Electronics Engineers (IEEE); 2022. p. 149-154.
18. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification using deep convolutional neural networks. In: Pereira F, Burges CJ, Bottou L, Weinberger KQ, editors. 26th Annual Conference on Neural Information Processing Systems 2012; 2012 Dec 3-6; Lake Tahoe, United States. New York: Institute of Electrical and Electronics Engineers (IEEE); 2012. p. 1-9.
19. Rumelhart DE, Hinton GE, Williams RJ. Learning internal representations by error propagation. Parallel distributed processing: explorations in the microstructure of cognition. Cambridge: The MIT Press, 1986, p. 318-362.
20. Desplanques B, Thienpondt J, Demuynck K. ECAPA-TDNN: emphasizes channel attention, propagation, and aggregation in a TDNNbased speaker verification. INTERSPEECH 2020; 2020 Oct 25-29; Shanghai, China. 2020. p. 3830-3834.
21. Hu J, Shen L, Albanie S, Sun G, and Wu E. Squeeze-and-excitation networks. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2018 Jun 18-22; Salt Lake City, United States. 2018. p. 7132-7141.
22. Dawalatabad N, Ravanelli M, Grondin F, Thienpondt J, Desplanques B, Na H. ECAPA-TDNN embedding for speaker diarisation. In: editors. INTERSPEECH 2021; 2021 Aug 30-Sep 3; Brno, Czechia. 2021. p.3560-3564.
23. Gulati A, Qin J, Chiu CC, Parmar N, Zhang Y, Yu J, et al. Conformer: convolution-augmented transformer for speech recognition. INTERSPEECH 2020; 2020 Oct 25-29; Shanghai, China. 2020. p. 5036-5040.
24. Zhu Y, Ko T, Snyder D, Mak B, Povey D. Self-atentive speaker embedding for text-independent speaker verification. INTERSPEECH 2018; 2018 Sec 2-6; Hyderabad, India. 2018. p. 3573-3577.
25. Sheehan DV, Lecrubier Y, Sheehan KH, Amorim P, Janavs J, Weiller E, et al. The Mini-International Neuropsychiatric Interview (M.I.N.I.): the development and validation of a structured diagnostic psychiatric interview for DSM-IV and ICD-10. J Clin Psychiatry 1998;59(Suppl 20):22–33.
26. Yoo SW, Kim YS, Noh JS, Oh KS, Kim CH, Namkoong K, et al. [Validity of the Korean version of the Mini-International Neuropsychiatric Interview]. Anxiety Mood 2006;2:50–55. Korean.
27. Schwabe L, Dalm S, Schächinger H, Oitzl MS. Chronic stress modulates the use of spatial and stimulus-response learning strategies in mice and man. Neurobiol Learn Mem 2008;90:495–503.
28. Schwabe L, Wolf OT. Stress prompts habit behavior in humans. J Neurosci 2009;29:7191–7198.
29. Jacobsen PB, Donovan KA, Trask PC, Fleishman SB, Zabora J, Baker F, et al. Screening for psychological distress in ambulatory cancer patients: a multicenter evaluation of the Distress Thermometer. Cancer 2005;103:1494–1502.
30. Ma X, Zhang J, Zhong W, Shu C, Wang F, Wen J, et al. The diagnostic role of a short scanning tool—the distress thermometer: a meta-analysis. Support Care Cancer 2014;22:1741–1755.
31. Sousa H, Oliveira J, Figueiredo D, Ribeiro O. The clinical utility of the Distress Thermometers in non-oncological contexts: a scoping review. J Clin Nurs 2021;30:2131–2150.
32. Hellhammer DH, Wüst S, Kudielka BM. Salivary cortisol as a biomarker in stress research. Psychoneuroendocrinology 2009;34:163–171.
33. Dressendorfer RA, Kirschbaum C, Rohde W, Stahl F, Strasburger CJ. Synthesis of a cortisol-biotin conjugate and evaluation as a tracer in an immunoassay for salivary cortisol measurement. J Steroid Biochem Mol Biol 1992;43:683–692.
34. Flanagan JL. Speech analysis, synthesis and perception (2nd ed) New York, Heidelberg, Berlin: Springer Verlag; 1972.
35. Thienpondt J, Desplanques B, Demuynck K. Integrating the frequency translational invariance in TDNNs and frequency positional information in 2D ResNets to enhance speaker verification. INTERSPEECH 2021; 2021 Aug 30-Sep 3; Brno, Czechia. 2021. p. 2302-2306.
36. Xue J, Deng Y, Han Y, Li Y, Sun J, Liang J. ECAPA-TDNN for multispeaker text-to-speech synthesis. In: Lee KA, Lee H, Lu Y, Dong M, editors. 2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP); 2022 Dec 11-14; Singapore. 2021. p. 230-234.
37. Kang W, Alam J, Fathan A. Deep-learning-based end-to-end spoken language identification system for domain-mismatched scenarios. 13th Conference on Language Resources and Evaluation (LREC 2022); 2022 Jun 20-25; Marseille, France. 2022. p. 7339-7343.
38. Kumawat P, Routray A. Applying TDNN architectures to analyze the duration dependencies of speech emotion recognition. INTERSPEECH 2021; 2021 Aug 30-Sep 3; Brno, Czechia. 2021. p. 3410-3414.
39. Yang B, Yih W, He X, Gao J, Deng L. Embedding entities and relations for learning and inference in knowledge bases. The 3rd International Conference on Learning Representations (ICLR) 2015; 2015 May 7-9; San Diego: United States. 2015.
40. Han H, Byun K, Kang HG. A deep learning-based stress-detection algorithm with speech signals Workshop on Audiovisual Scene Understanding for Immersive Multimedia 2018; 2018 Oct 26; Seoul: Koreath ed. New York: United States; 2018.
41. World Health Organization. Mental health and COVID-19: early evidence of the pandemic’s impact [Internet]. Scientific Brief 2 (March). Available at: http://www.jstor.org/stable/resrep44578. Accessed Jun 25, 2024.
42. Murphy JK, Khan A, Sun Q, Minas H, Hatcher S, Ng CH, et al. Needs, gaps, and opportunities for standard and e-mental healthcare among at-risk populations in the Asia-Pacific in the context of COVID-19: a rapid scoping review. Int J Equity Health 2021;20:161.
43. Vasudevan S, Saha A, Tarver ME, Patel B. Digital biomarkers: convergence of digital health technologies and biomarkers. NPJ Digit Med 2022;5:36.

Figure 1.

Experimental procedure for collecting participants’ biological data in both relaxed (non-stressed) (T0) and stressed (T1) states. Voices were recorded in two ways: script reading and responding to questions. Stress levels were evaluated using salivary cortisol levels and self-reported DT scores. The SECPT method was used for stress induction. SECPT, socially evaluated cold pressor test; DT, distress thermometer.

Figure 2.

Classification model construction process. A stress prediction model was fitted using both training and validation datasets. For model construction, type 1 and type 2 voice recordings were incorporated as inputs. The performance of the fitted model was assessed using a test dataset consisting of samples not used for training or validation. ECAPA-TDNN, Emphasized Channel Attention, Propagation and Aggregation in Time Delay Neural Network.

Figure 3.

Differences in the distributions of the two stress measures between the time points before (T0) and after (T1) the stress induction experiment. Self-reported distress thermometer scores (A) and cortisol levels (B).

Figure 4.

Segment-wise stress scores computed using our deep learning-based stress prediction model for 20 test samples, plotted against time (seconds). The scores shown are for the interview-response voice recordings (type 2). Stress scores computed for voice recordings in the relaxed state (T0) are depicted as blue dots, whereas those in the stressed state (T1) are represented as red dots. Smoothing splines are also displayed to show the overall trend along the time axis. The vertical axis represents the stress model prediction score. BMC, Boramae Medical Center; SNUBH, Seoul National University Bundang Hospital.

Figure 5.

Box plots of stress prediction score distributions for 20 test samples in the relaxed state (T0) and the stressed state (T1). BMC, Boramae Medical Center; SNUBH, Seoul National University Bundang Hospital.

Table 1.

Demographic and clinical characteristics of the study participants at baseline

| Characteristics | SNUBH (N=64) | BMC (N=51) | Total (N=115) |
|---|---|---|---|
| Age (yr) | 33.56±7.90 | 37.71±10.18 | 35.40±9.18 |
| Sex (female) | 43 (67.19) | 33 (64.71) | 76 (66.09) |
| Marital status: unmarried | 39 (60.94) | 25 (49.02) | 64 (55.65) |
| Education state (≥college education) | 64 (100.00) | 37 (72.55) | 101 (87.83) |
| Length of work (yr) | 7.27±6.22 | 9.40±8.47 | 8.20±7.37 |
| Perceived Stress Scale score | 16.02±3.80 | 16.49±4.64 | 16.23±4.20 |

Values are presented as mean±standard deviation or number (%). SNUBH, Seoul National University Bundang Hospital; BMC, Boramae Medical Center.