Comparison of Logistic Regression and Machine Learning Approaches in Predicting Depressive Symptoms: A National-Based Study
Abstract
Objective
Machine learning (ML) has been reported to have better predictive capability than traditional statistical techniques. The aim of this study was to assess the efficacy of ML algorithms and logistic regression (LR) for predicting depressive symptoms during the COVID-19 pandemic.
Methods
Analyses were carried out in a national cross-sectional study involving 21,916 participants. The ML algorithms in this study included random forest (RF), support vector machine (SVM), neural network (NN), and gradient boosting machine (GBM) methods. The performance indices were sensitivity, specificity, accuracy, precision, F1-score, and area under the receiver operating characteristic curve (AUC).
Results
LR and NN had the best performance in terms of AUCs. The risk of overfitting was found to be negligible for most ML models except for RF, and GBM obtained the highest sensitivity, specificity, accuracy, precision, and F1-score. Therefore, LR, NN, and GBM models ranked among the best models.
Conclusion
The LR model performed comparably to the ML models in predicting depressive symptoms and identifying potential risk factors, while exhibiting a lower risk of overfitting.
INTRODUCTION
Machine learning (ML) encompasses statistical modeling techniques and automated algorithms for decision-making. These methods are better equipped to analyze large and complex datasets than traditional statistical methods [1,2]. Moreover, the predictive capability of ML has often been emphasized, especially in the modeling of rarely perfectly linear relationships [1,3], and the predictive potential of ML for diseases has been demonstrated in previous studies [1,3-5].
However, the use of ML models for disease prediction presents several challenges that warrant consideration. Imbalanced datasets, in which positive cases are far outnumbered by negative cases, often pose a major challenge for ML approaches [1,6,7]. Overfitting poses another challenge: a model that achieves high performance on the training set fails to generalize to new cases. While advanced sampling techniques and ensemble methods can help mitigate the risks of data imbalance and overfitting, large datasets are required for ML algorithms to reliably identify data patterns [1,8].
While interest in using ML for disease prediction has increased rapidly, it remains unclear whether ML applications outperform traditional regression algorithms. Several studies have reported limited predictive abilities of ML models [9], and others have shown that traditional models surpass ML models on certain statistical metrics, such as area under the receiver operating characteristic curve (AUC) and Akaike information criterion [10-12]. The predictive performance also varies among different ML models [5,13,14]. Therefore, it is crucial to systematically compare and understand the predictive capabilities of both traditional regression techniques and ML algorithms for disease prediction in large populations.
ML algorithms have been applied to identify predictors of psychological distress and depression in previous studies [2]. During the COVID-19 pandemic, higher scores of anxiety and depression were revealed than before the pandemic [15], and the increasing tendency of this subclinical form of depression should not be ignored in public health [16]. Thus, this study evaluated the abilities of ML and logistic regression (LR) to predict the risks of depressive symptoms during the COVID-19 quarantine period in China. Our hypothesis is that ML and LR algorithms will exhibit similar predictive performance on depressive symptoms.
METHODS
Study population
We analyzed data from a national cross-sectional survey on the psychology and behavior of Chinese residents, conducted from 20 June to 31 August 2022 in mainland China and covering 148 cities across all provinces. Approximately 30,000 people participated in the survey with or without the help of investigators (response rate 71.8%, eligibility rate 96.8%), and 21,916 participants (50% men) were included in this cross-sectional study. Ineligible participants included people who were less than 12 years old, who did not hold Chinese nationality, who had resided abroad for more than one month, who were mentally disabled, or who were involved in other similar studies.
The data were also subjected to the following exclusion criteria. First, questionnaires with a response time of less than 4 minutes were excluded. Second, questionnaires with logically inconsistent responses (e.g., a respondent aged less than 18 years with a marital status of married, divorced, or widowed) were excluded. Third, questionnaires with incomplete information were excluded. Fourth, any repeated questionnaires, or questionnaires completed in a regularly repeated manner, were excluded. The final dataset was balanced, with an equal number of men and women.
The survey was approved by the Medical Ethics Committee (No.JKWH-2022-02) and the Clinical Research Ethics Committee of the Second Xiangya Hospital of Central South University (No.2022-K050).
Definitions of outcomes
The definition of depressive symptoms considered in this study was obtained from the 9-item Patient Health Questionnaire (PHQ-9), which follows the diagnostic criteria for depression in the Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition [17]. The PHQ-9 has been widely used in clinical practice, has good reliability, and its validity has been demonstrated in previous studies [16,18,19]. Each item is rated from 0 (never) to 3 (nearly every day); thus, the PHQ-9 score ranges from 0 to 27, with higher scores corresponding to more severe depressive symptoms. The Cronbach’s alpha of the PHQ-9 was 0.921 in the present study. Depressive symptoms were then dichotomized at a score threshold of 5; that is, scores of 4 or less were classified as the absence of depressive symptoms [17].
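The scoring and dichotomization described above can be sketched as follows. The original analyses were conducted in R; this is an illustrative pure-Python sketch, and the function names are our own, not from the study's code.

```python
def phq9_total(items):
    """Sum the nine PHQ-9 item ratings, each rated 0 (never) to 3 (nearly every day)."""
    if len(items) != 9 or any(not 0 <= x <= 3 for x in items):
        raise ValueError("PHQ-9 requires nine items rated 0-3")
    return sum(items)

def has_depressive_symptoms(items, threshold=5):
    """Scores of 4 or less are classified as the absence of depressive symptoms."""
    return phq9_total(items) >= threshold

print(has_depressive_symptoms([0, 1, 0, 1, 0, 1, 0, 0, 1]))  # total 4 -> False
print(has_depressive_symptoms([1, 1, 1, 1, 1, 0, 0, 0, 0]))  # total 5 -> True
```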
Predictors considered
Sociodemographic variables
The following demographic variables were collected: age, gender (man or woman), ethnicity (Han or non-Han nationality), residency (urban or rural), education level (graduate/doctoral, college, middle school, primary school, or illiterate/semiliterate), monthly income (categorized as ≤1,000 Chinese Yuan [CNY] per month, 1,001–3,000 CNY, 3,001–5,000 CNY, or >5,000 CNY), living arrangement (living alone or not), occupation (employed/student or not), obesity status (yes: body mass index [BMI] ≥25 kg/m2; no: BMI <25 kg/m2), and marital status (married or not).
Social environment variables
The participants’ perceptions of their neighborhood (assessed on a scale of 1 to 7), experiences of negative life events (binary response of yes or no), and self-assessed social status (rated on a scale of 1 to 7) were examined.
Subjective indicators
The following measures were used to assess subjective indicators: stress (Perceived Stress Scale-4 items [PSS-4], Cronbach’s alpha=0.668) [20], self-efficacy (New General Self-Efficacy Scale [NGSES], Cronbach’s alpha=0.925) [21], social support (Multidimensional Scale of Perceived Social Support [MSPSS], Cronbach’s alpha=0.888) [22], loneliness (Three-Item Loneliness Scale [T-ILS], Cronbach’s alpha=0.862) [23], and subjective well-being (World Health Organization Well-Being Index [WHO-5], Cronbach’s alpha=0.933) [24].
Individual behavior variables
The following variables were assessed in this study: problematic internet use (Problematic Internet Use Questionnaire-Short Form-6 [PIUQ-SF6], Cronbach’s alpha=0.932) [25-27], subjective sleep quality (Brief Version of the Pittsburgh Sleep Quality Index [B-PSQI], Cronbach’s alpha=0.686) [28], late bedtime (frequency of late bedtime per week, rated from 1 to 7), screen time before bedtime (≤0.5, 0.5–1, or >1 hour), physical activity (7-day recall of the International Physical Activity Questionnaire [IPAQ-7]) [29,30], smoking (yes or no), alcohol drinking (yes or no), coffee drinking (≥3 cups per day, 1–2, or never), tea drinking (yes or no), sugar beverage consumption (7, 4–6, 1–3 bottles per week, or never), takeaway food consumption (≥3, 1–2, ≤1 time per week, or never), water intake (<1,200, 1,200–1,499, 1,500–1,699, or 1,700–2,099 mL per day), chronic disease (yes or no), and presence of eye disease (yes or no).
COVID-19-related variables
The following variables were assessed in relation to the COVID-19 pandemic: home quarantine (yes or no), lockdown (yes or no), community closure (yes or no), and perceived influence on daily life (rated on a scale of 0–100 to indicate the extent of the pandemic’s impact).
Modeling strategy
All analyses were conducted using R software (version 3.5.1; R Core Team, 2019, https://www.R-project.org).
Data preprocessing
Several preprocessing steps were conducted to prepare the data and select features. First, records with missing values were removed. Second, categorical predictors were transformed into indicator variables, and continuous variables were standardized by subtracting the mean and dividing by the standard deviation. Third, for feature selection, a t-test was applied to determine whether a predictor's mean differed significantly between the groups with and without depressive symptoms. Features with p-values larger than 0.05 (obesity and Han ethnicity) were removed from the modeling process. Fourth, pairwise correlations were checked to avoid collinearity. With a coefficient threshold of 0.75, only the pair of community closure and city lockdown was identified as highly correlated, and community closure was removed because of its weaker correlation with depressive symptoms. Therefore, 33 predictors were finally selected for modeling.
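Two of the preprocessing steps described above, z-score standardization and the pairwise-correlation screen at the 0.75 threshold, can be sketched in a few lines. The study's actual pipeline was implemented in R; this pure-Python sketch is illustrative only.

```python
import statistics

def standardize(values):
    """Subtract the mean and divide by the (sample) standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def pearson(x, y):
    """Sample Pearson correlation coefficient between two predictors."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def highly_correlated(x, y, threshold=0.75):
    """Flag a predictor pair whose absolute correlation exceeds the threshold."""
    return abs(pearson(x, y)) > threshold
```

In the study, only the community closure/city lockdown pair crossed this threshold, and the member with the weaker outcome correlation was dropped.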
Models considered
The following five models were considered: LR, random forest (RF), neural network (NN), support vector machine (SVM), and gradient boosting machine (GBM). The LR model was applied in two forms: the traditional type with default configurations, and a version optimized by cross-validation with a Lasso (L1) penalty. LR performance was also compared with that of the ML models, whose best hyperparameters were selected according to the highest AUC values (modeling details in the Supplementary Material).
For GBM modeling, hyperparameter tuning techniques such as early stopping and regularization were used to control model complexity and prevent overfitting. Early stopping is a technique in which model training stops automatically when the validation error has not improved for a certain number of iterations (trees). The key parameter for incorporating this technique in our modeling is n.trees, which specifies the total number of boosting iterations (the number of trees) the model builds; early stopping can be mimicked by using cross-validation to identify the optimal number of trees.
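The early-stopping logic described above amounts to choosing the iteration at which validation error last improved, with a patience window. A minimal language-agnostic sketch (in Python; the error sequence below is hypothetical, not from the paper):

```python
def best_n_trees(val_errors, patience=10):
    """Return the 1-based boosting iteration at which validation error last
    improved, stopping the scan after `patience` non-improving rounds.
    `val_errors[i]` is the cross-validated error after i+1 iterations."""
    best_err = float("inf")
    best_iter = 0
    for i, err in enumerate(val_errors, start=1):
        if err < best_err:
            best_err, best_iter = err, i
        elif i - best_iter >= patience:
            break  # validation error has stalled for `patience` rounds
    return best_iter
```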
In gradient boosting, L1 (Lasso) regularization encourages sparsity by driving some weights (or coefficients) to zero, effectively reducing the number of features. In our modeling, setting the parameter n.minobsinnode (minimum observations in a terminal node) to a higher value acts similarly to L1 regularization: each terminal node (leaf) in a tree must contain at least the specified minimum number of observations, which avoids creating overly small leaves and thus reduces model complexity. Another regularization technique, L2 (Ridge), penalizes the sum of squared coefficients, which reduces the magnitude of the weights and makes the model less sensitive to individual features. In our modeling, lower values of shrinkage (the learning rate) force the model to learn slowly with smaller step sizes, much as L2 regularization limits large changes in weights. A smaller shrinkage value typically requires more boosting rounds (n.trees), but it smooths the learning and prevents extreme responses to individual observations, similar to the coefficient smoothing of L2. In addition, the parameter interaction.depth (tree depth) limits the depth of each tree; shallower trees are less likely to fit noise in the training data, regularizing the model in a manner similar to L2 by reducing complexity.
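A hypothetical tuning grid over the four gbm hyperparameters discussed above (parameter names follow the R gbm package). The values are illustrative rather than the paper's actual grid, except that the Results explore tree numbers from 50 to 300 and interaction depths up to 6.

```python
# Candidate values for each gbm hyperparameter (illustrative).
grid = {
    "n.trees": [50, 100, 150, 200, 250, 300],   # boosting iterations (trees)
    "interaction.depth": [1, 2, 3, 4, 5, 6],    # maximum tree depth
    "shrinkage": [0.01, 0.05, 0.1],             # learning rate (L2-like smoothing)
    "n.minobsinnode": [10, 20, 30],             # minimum observations per leaf (L1-like sparsity)
}

# Number of candidate configurations a full grid search would evaluate:
n_configs = 1
for values in grid.values():
    n_configs *= len(values)
print(n_configs)  # 6 * 6 * 3 * 3 = 324
```

Each configuration would be scored by cross-validated AUC, and the best one refit on the whole training set.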
Resampling strategy
The data were randomly divided into a training set (70%) and a testing set (30%). The optimal hyperparameters were chosen by five-fold cross-validation on the training set. This resampling method split the training data into five equal folds; the model was trained on four of these folds and validated on the remaining one. The process was repeated five times, each time using a different fold as the validation set, and the performance metrics were averaged over the five iterations. After optimization, the model was refit on the entire training set and then used for prediction on the testing set.
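The resampling strategy above can be sketched as index bookkeeping: a 70/30 split followed by five-fold partitioning of the training indices. This is an illustrative pure-Python sketch (the study used R), and the seed is arbitrary.

```python
import random

def train_test_split(indices, test_frac=0.30, seed=42):
    """Randomly split indices into (training, testing) lists."""
    rng = random.Random(seed)
    shuffled = indices[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]

def k_fold(indices, k=5):
    """Yield (train, validation) index lists for each of k folds.
    Each fold serves as the validation set exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val
```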
Predictive performance
Six indices of predictive performance were considered: AUC, accuracy, precision, sensitivity, specificity, and F1-score. AUCs with 95% confidence intervals (CIs) were calculated for both the training and testing sets, and the AUC differences were taken to indicate the potential risk of model overfitting. The AUC values of different models were compared by paired DeLong's tests. The F1-score is the harmonic mean of precision and recall, representing both in a single metric. The model with the highest training AUC was selected to rank variable importance for the risk of depressive symptoms.
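Five of the six indices follow directly from the confusion matrix; only AUC requires the ranked predicted probabilities. As a self-contained sketch (the counts in the test are hypothetical):

```python
def metrics(tp, fp, tn, fn):
    """Compute sensitivity, specificity, accuracy, precision, and F1-score
    from confusion-matrix counts (true/false positives and negatives)."""
    sensitivity = tp / (tp + fn)                 # recall
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity, specificity, accuracy, precision, f1
```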
RESULTS
Characteristics of the data
Table 1 presents the characteristics of the 21,916 participants in this cross-sectional study, including sociodemographic variables, social environmental indicators, subjective indicators, individual behaviors, and COVID-19-related variables. Men represented 50% of the sample, and the mean age was 39.4±18.9 years. Additionally, 9.9% of the participants were adolescents (aged 12–18 years), and 18.8% were older than 60 years. Most participants were urban residents (69.3%), and only a small proportion lived alone (14.3%). The education levels included primary (39.8%), college (40.8%), and postgraduate/doctoral (3.8%) education; 5.4% were illiterate/semiliterate. Monthly household income per capita (in CNY) was considered at four levels: >5,000 (36.6%), 3,001–5,000 (30.4%), 1,001–3,000 (26.7%), and ≤1,000 (6.3%). The prevalence of obesity was 16.8%, and 56.7% of participants were married.
The prevalence of depressive symptoms was 57.6% among the participants in this study. The univariate analysis in Table 1 revealed that younger participants tended to have more depressive symptoms than older participants (p<0.001), and women were more likely to develop depressive symptoms than men (p<0.001). Depressive symptoms were also more common among urban residents and among solitary or unmarried individuals (all p<0.001). Regarding the social environment, depressive symptoms were associated with poorer neighborhood perceptions, negative life-event experiences, and lower social status (all p<0.001). The subjective indicators showed strong correlations with depressive symptoms (all p<0.001), including higher stress, lower self-efficacy, lower social support, greater loneliness, and lower subjective well-being. Furthermore, greater impacts of COVID-19 were reported among participants with depressive symptoms than among those without (p<0.001). Significant correlations were also observed between certain individual behaviors and depressive symptoms. Participants were more likely to experience depressive symptoms (p<0.001) if they went to bed late, had poor sleep quality, or had long screen time before sleep. Depressive symptoms were also more common with more problematic internet use and less physical activity. Participants with depressive symptoms were more likely to report smoking, alcohol intake, more coffee or tea intake, more sugar beverages, more takeaway food consumption, and less water intake (all p<0.001). Additionally, the presence of chronic diseases and eye diseases was associated with a greater likelihood of depressive symptoms (all p<0.001).
Predictive performance
Table 2 presents the predictive performance of all models on the training and testing datasets. The AUC differences between the training and testing cohorts were small for all algorithms except RF, implying that overfitting cannot be neglected for RF. Among the non-overfitting models, the LR, SVM, and NN algorithms had similar AUC values of approximately 0.86 (LR=0.862, 95% CI=0.853–0.871; SVM=0.860, 95% CI=0.851–0.869; NN=0.866, 95% CI=0.857–0.875), while GBM performed slightly better, with the highest AUC (0.874, 95% CI=0.865–0.883), specificity (0.818), accuracy (0.798), precision (0.769), and F1-score (0.771). In terms of the AUC difference between training and testing, LR had the smallest difference (0.69%) and GBM a slightly larger one (2.3%). Furthermore, the performances of the two LR models were almost identical with or without optimization, implying that the traditional LR was sufficient for our data analysis.
The training and testing AUC values of the different models were compared by paired DeLong's tests (Supplementary Table 1). Although there were significant differences among the models on the training set (p<0.01), LR did not differ significantly from the other ML models, except RF, on the testing set. Overall, the performances of LR and the other models were systematically similar, with the exception of RF.
Figure 1 shows the AUC learning curves for the different models, which were trained on increasing subsets of the training set. Model performance was evaluated on the training set (dash-dot-dot green), on the testing set (dashed red), and as the averaged cross-validation results during training (solid blue). There was a large gap between the training and validation AUCs for the RF model, suggesting a variance problem that probably led to overfitting. The three curves of the LR and SVM models almost overlapped at high AUC values, implying that these models generalized well. The GBM curves were slightly higher than those of LR, but the gap between its curves was also larger. All of these observations agreed with the results in Table 2.

The AUC learning curves of the five models (LR, RF, NN, SVM, and GBM) which were optimized by CV. The x-axis was the proportion of training subsets. The blue solid line was the mean AUC from the CV, the green dash-dot-dot was from the prediction on training subsets, and the red dash was from the prediction on the testing set. AUC, area under the receiver operating characteristic curve; LR, logistic regression; RF, random forest; NN, neural network; SVM, support vector machine; GBM, gradient boosting machine; CV, cross-validation.
Variable importance
The variable importance of the 36 predictors from all models is ranked in Table 3. The ranking for LR was based on the absolute value of the z-statistic, and the rankings for the ML models were obtained with the "varImp" function in R. For RF, variable importance was measured by the mean decrease in Gini impurity; for GBM, it was based on the relative influence of each feature, considering both frequency and gain across all trees; and for SVM, it was determined by the absolute values of the model coefficients under the linear kernel we used.
Although the top 10 variables differed somewhat across models, five were always listed: loneliness, stress, problematic internet use, late bedtime, and subjective well-being. This implies a statistical correlation of the subjective indicators (loneliness, stress) and individual behaviors (late bedtime, problematic internet use) with depressive symptoms. Meanwhile, the COVID-19-related variables (home quarantine, lockdown, community closure, and influence on daily life) were ranked relatively low, indicating that their influence on depressive symptoms was limited.
Interactions among predictors
Figure 2 shows GBM performance as the relationship between the mean training AUC and the maximum interaction depth, for tree numbers from 50 to 300. For 50 trees, the AUC was lowest at interaction depth 1, jumped markedly at depth 2, and then increased in smaller steps with deeper interactions. The AUC values for 100 trees were higher than those for 50 trees at low interaction depths, but the gap between the two curves shrank as the interaction depth increased. A similar tendency was found for the larger tree numbers, whose curves drew closer together as the interaction depth increased, reaching a plateau of approximately 0.875 at a tree depth of 6. After cross-validation optimization, the best GBM model had 150 trees and a maximum tree depth of 6.
Nonlinearity
Figure 3 shows the nonlinear effect of age on the prediction of depressive symptoms using a generalized additive model (GAM). On visual inspection, the smooth curve had a minimum at approximately 60 years of age; the degrees of freedom were greater than 1 (2.95), and the p-value associated with the smoothing term was less than 0.001. Thus, both visual examination and statistical tests indicated significant nonlinearity. Considering the nonlinear effect of age, we recalculated variable importance for the age subgroups 18–39, 40–59, and 60–100 years (Supplementary Table 2).
DISCUSSION
This large-scale epidemiological study demonstrated that the traditional LR model is comparable to the ML models for predicting depressive symptoms. The AUC values on the testing set indicated similar performance for the LR, SVM, and NN models, while those of GBM and RF were larger by a small amount. The likelihood of overfitting was investigated by comparing the differences between the AUCs on the training and testing data, which turned out to be smallest for LR (0.69%) and largest for RF (23.3%). Each pair of models was also compared by the paired DeLong's test on these AUCs. There were no significant differences between LR and the other ML models, except for RF, which had an obvious overfitting issue. Furthermore, the traditional LR was much simpler, requiring no optimization, and easier to interpret than complex ML models such as NN and GBM. Thus, our study suggests that ML models may not be warranted, as they did not clearly outperform traditional LR models in predicting depressive symptoms in large populations.
Our results were consistent with those of previous studies on depression, and the AUC was in line with that of previous studies [2]. The results also agreed with an algorithm study that compared twenty-two decision trees, nine statistical algorithms, and two NN algorithms and found that the simple LR method performed as well as more sophisticated models [2]. A systematic review of 71 studies (selected from 927) likewise found no evidence that ML offers better clinical prediction than LR [12]. In addition, previous studies revealed that the LR model could perform well even for low-risk diseases in large investigations, since class imbalance and overfitting challenge ML applications [9-11].
The properties of our data may have facilitated the good performance of LR: interactions between predictors were weak based on the pairwise correlation analysis, and there were no complex nonlinear relationships between depressive symptoms and the predictors other than age. Approximately half of the population in our study had depressive symptoms (57.6%), so data imbalance was not a concern for ML models such as SVM and RF, which are sensitive to this issue according to previous studies [31]. A nonlinear effect of age was found by GAM in the whole dataset, and the analysis was therefore repeated in the age subgroups to eliminate the nonlinearity (Supplementary Table 2). Although the PHQ-9 used in this study has been considered a gold standard for screening depressive symptoms, other approaches to depression prediction have also been investigated in previous model-comparison studies, such as natural language processing (NLP) and electronic health records (EHRs). ML models seemed to perform better at feature extraction in NLP of Twitter datasets [32], but LR proved as effective as ML techniques for EHRs [33].
In our study, loneliness, stress, and problematic internet use were the top risk factors associated with an increase in depression during the pandemic, and these findings are consistent with previous studies [16,34-36]. Loneliness associated with enforced social distancing leads to an increase in depression and anxiety, particularly in children and adolescents [37]. A study involving United Kingdom (UK) adults aged 50 years and older revealed that living alone was among the key risk factors [34]. Stress from lower incomes during the pandemic has also been investigated in studies in the United States (US) [38,39], Malaysia [35], and the UK [34]. Consistent with those studies, our research showed that income had a more negative impact on adults younger than 40 years than on those older than 60 years. One possible explanation is that the greater financial security (higher position and greater stability) of older adults could contribute to better resilience to economic stress and depression [38,39]. A European study offered an explanation for the association between heavy social media use and greater depressive symptoms [36]: negative information (e.g., bulletins about new deaths) from the media increases fear and anxiety, and long durations of problematic internet use can then lead to sleep issues and subsequently harm mental health.
For the general population, lockdowns and COVID-19-related restrictions were associated with worse psychological outcomes [37]. In our study, city lockdown was among the most important predictors for the elderly over 60 years, while it was less important for younger adults. One possible explanation is that the social connections of younger adults depend more on social media and internet usage [36]. Furthermore, older people may need more medical support, which was quite limited during the lockdown. It was also noted that the lockdown seemed to have a protective effect against depression in women older than 65 years due to hormonal influence [36]. These differences between age groups produced a nonlinear age effect, shown by the GAM as a significant slope change at approximately 60 years of age (Figure 3).
Our study revealed that late bedtime was a sensitive predictor of depressive symptoms, as was sleep quality quantified by the B-PSQI [28,40] (including difficulty falling asleep, staying asleep, and experiencing restless and unsatisfying sleep [41,42]). One study demonstrated that normal sleepers were at risk for insomnia during the COVID-19 pandemic, with depression being a significant contributing factor to worsening insomnia. These findings suggested that certain neural markers and functional connectivity between certain brain regions mediate the association between depressive symptoms and insomnia symptoms during the COVID-19 pandemic [43]. It was also noted that different courses of insomnia development (such as lessening, slightly worsening, or developing mild insomnia) could be predicted based on pre-COVID-19 depressive symptoms and brain functional connectivity [43].
The strengths of this study are as follows. First, the study involved more than 20,000 participants from a nationwide population, satisfying the large data requirement of ML algorithms. Second, the nonlinear effect of age was investigated, and the subgroup analyses by age provided more detail on the impact of the predictors. Third, the algorithms in our study covered the typical model classes, namely NNs, tree algorithms, SVMs, and LR. Their performances were compared and discussed in terms of AUCs on the testing set, overfitting, and the details of hyperparameter optimization. The training code is available in the Supplementary Material, including the tuning and grid settings for hyperparameter optimization.
This study has several limitations. First, depressive symptoms were identified by the PHQ-9 instead of clinical diagnosis, and the predictors were limited to individual behaviors and social environments without any clinical predictors, such as metabolites. Our conclusions therefore cannot be generalized to a broader range of clinical predictors of depression. Second, the prevalence of depressive symptoms was more than 50%, so similar conclusions may not hold for the prediction of low-risk diseases. Third, the limitations of the cross-sectional design prevented the determination of causal relationships among the predictors, and longitudinal studies are needed for further investigation.
In summary, this study examined the prediction of depressive symptoms in a large national population during the COVID-19 pandemic. Our findings illustrated that the traditional LR performs comparably to ML models, such as NN, RF, SVM, and GBM models. Several risk factors were shown to be associated with depressive symptoms according to the LR model (age, income, stress, loneliness, late bedtime, and problematic internet use). This study provides a foundation for further investigations into depressive symptoms, and future research is needed to validate these results in diverse populations with varying characteristics.
Supplementary Materials
The Supplement is available with this article at https://doi.org/10.30773/pi.2024.0156.
Codes for model training
Paired DeLong's tests between the AUCs of machine learning models
Predictor importance for three age groups
Notes
Availability of Data and Material
The datasets generated or analyzed during the study are available from the corresponding author on reasonable request.
Conflicts of Interest
The authors have no potential conflicts of interest to disclose.
Author Contributions
Data curation: Yi-Bo Wu. Formal analysis: Chun-Hua Zhao, Dan-Dan Chen. Funding acquisition: Chen-Wei Pan. Investigation: Jian-Hua Liu, Tian-Yang Zhang, Yi-Bo Wu. Writing—original draft: Xing-Xuan Dong. Writing—review & editing: Dan-Dan Chen.
Funding Statement
A Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD).
Acknowledgements
We appreciate the contributions of Bing Shi and Yueqing Huang on our manuscript revisions.