Exploring Informative Items for Bipolar Disorder Classification Using Machine Learning With Anger Coping Styles in Combination With the Mood Disorder Questionnaire and Bipolar Spectrum Diagnostic Scale
Abstract
Objective
This study aimed to develop a machine learning-based classification model to differentiate bipolar disorder from major depressive disorder using self-report scales, including the Mood Disorder Questionnaire (MDQ), Bipolar Spectrum Diagnostic Scale (BSDS), and Anger Coping Scale (ACS).
Methods
A total of 122 bipolar and 67 depressive patients participated. Recursive feature elimination with 10,000 iterations was used to identify the most informative features. Machine learning classifiers assessed combinations of MDQ, BSDS, and ACS items for classification performance.
Results
The AUC values for MDQ and BSDS were 0.8212 and 0.7934, respectively. Combining MDQ and BSDS increased the AUC to 0.8477, which improved further to 0.8548 when ACS was included. For MDQ, the best performance was achieved when all 13 items were included. In contrast, the combined model of MDQ, BSDS, and ACS showed optimal performance when BSDS items 18 (conflicts with colleagues or police), 19 (alcohol or substance use), and ACS item 15 (beating others) were excluded.
Conclusion
Integrating anger coping styles with mood symptoms enhanced diagnostic accuracy, particularly when items related to undesirable behaviors were excluded. This machine learning approach shows potential for effectively evaluating bipolarity and underscores the importance of refining self-report scales to optimize diagnostic tools. Future research should incorporate clinical and objective data to enhance classification models.
INTRODUCTION
Differentiating bipolar disorder from depressive disorder is clinically important. Because bipolar disorder, like depressive disorder, predominantly presents with depressive symptoms over its natural course [1,2], the two disorders are difficult to distinguish [3]. The diagnostic distinction becomes even more important when considering the therapeutic differences between the two diseases: the use of antidepressants in bipolar disorder may provoke manic switching, cycle acceleration, and mood instability [4,5]. Given these risks, diagnostic efforts to distinguish bipolar disorder from depressive disorder are essential.
Several self-report questionnaires have been developed for diagnostic screening of bipolar disorder. The Mood Disorder Questionnaire (MDQ) is known to be useful in screening for bipolar disorder [6,7], and the Bipolar Spectrum Diagnostic Scale (BSDS) is also useful for identifying bipolar disorder among depressive patients [8]. Moreover, the combined use of the MDQ and BSDS improved the diagnostic classification rate compared with either questionnaire alone [9]. A machine learning model for classifying bipolar disorder using both the MDQ and BSDS also showed good performance (area under the receiver operating characteristic curve [AUC]: 0.8762) [10]. Thus, combining self-report scales can be useful in identifying potential bipolar disorder patients among those with depressive disorders. The diagnostic classification model using the two questionnaires has already reached the level of accuracy required for diagnostic screening; however, further improvement in model performance is clinically desirable. When screening for bipolar disorder versus depressive disorder, it may be helpful to utilize clinical differences between the two disorders beyond depressive and manic symptoms.
Numerous studies have investigated the clinical differences between bipolar and depressive disorder [11-15], with particular attention on anger coping styles. Although anger patterns appear in both bipolar disorder and depressive disorder, anger traits and coping styles can differ. Patients with bipolar disorder tend to express their anger externally [16]. Depression with anger has clinical characteristics closer to bipolar II disorder (BP-II) than depression without anger, including an earlier age of onset, more depressive mixed states, and higher familial loading of bipolar disorder [17]. A longitudinal follow-up study indicated that aggression reactivity can be a risk factor for conversion from depression to bipolar disorder [16]. Anger traits in bipolar patients can be observed even in a euthymic state [18] and are more common in the eveningness chronotype, which is related to emotional dysregulation [13]. Anger coping styles may also differ between bipolar and depressive disorder: patients with bipolar disorder showed higher tension-releasing and problem-solving coping than those with depressive disorder [19]. These distinctions in anger coping styles suggest their potential utility in differentiating bipolar disorder from depressive disorder.
While previous studies have identified numerous differences between individuals with and without bipolar disorder, they have often been limited to descriptive findings, without translating this knowledge into practical diagnostic tools. Thus, the development of an effective, data-driven approach to bipolar disorder classification that integrates established clinical knowledge has been underexplored. Machine learning offers a powerful solution by not only integrating multiple diagnostic features but also enabling the selection of the most informative variables [20,21]. By incorporating anger coping styles, which are recognized as distinct in bipolar disorder, alongside the MDQ and BSDS, this approach is anticipated to enhance classification performance. Therefore, this study aimed to establish a machine learning model for bipolar disorder classification using the Anger Coping Scale (ACS) along with the MDQ and BSDS. In particular, the present study focused on exploring informative items for differentiating bipolar disorder from depressive disorder and evaluating the diagnostic performance of machine learning models based on these selected items.
METHODS
Subjects
This study included outpatients who visited the Mood Disorder Clinic at Pusan National University Hospital from November 2011 to February 2021. Initial psychiatric assessments using the Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition, Text Revision (DSM-IV-TR) criteria and the Structured Clinical Interview were performed to identify diagnoses of mood disorders such as bipolar I disorder (BP-I), BP-II, bipolar disorder not otherwise specified (NOS), major depressive disorder, depressive disorder NOS, and dysthymia. The final diagnoses were confirmed through long-term outpatient follow-up, thereby reducing the risk of misdiagnosis due to later manic/hypomanic episodes. The exclusion criteria for the study were: 1) individuals with intellectual disabilities who could not complete the questionnaire, 2) those who were illiterate, and 3) individuals who refused to provide demographic information. A total of 189 participants—122 with a diagnosis of bipolar disorder and 67 with unipolar disorder—completed all three questionnaires (MDQ, BSDS, and ACS) and were included in the final analysis. This study received approval from the Institutional Review Board at Pusan National University Hospital (PNUH IRB: No 2310-003-131).
Measurements
MDQ
The MDQ is a self-report questionnaire for screening bipolar disorder in psychiatric outpatient populations [6]. It consists of 13 questions, each answered with yes (1) or no (0). The 13 questions cover manic or hypomanic symptoms such as elevated mood, irritability, increased self-esteem, decreased need for sleep, talkativeness, racing thoughts, distractibility, increased energy, increased goal-directed activity, increased social activity, increased sexual interest, risky behavior, and excessive involvement in pleasurable activities. The MDQ additionally asks whether these symptoms occurred simultaneously and how much they affected daily functioning. A positive screen on the MDQ typically requires endorsement of at least 7 of the 13 items, with the symptoms appearing during the same period and causing moderate or greater problems [6]. In the Korean validation study, however, an MDQ score of at least 7 was chosen as the optimal cutoff, excluding the two supplementary questions on symptom clustering and functional problems [7]. This study used the Korean version of the MDQ translated by Jon et al. [7]. Individual scores on the 13 items were used to predict the diagnosis of bipolar disorder.
BSDS
The BSDS is a self-report questionnaire designed to screen for bipolar disorder [8]. It consists of two parts with a total of 20 items; each item in the first part is answered with yes (1) or no (0). The first part includes 19 questions covering depressive symptoms (8 items), the euthymic state (1 item), and manic symptoms (10 items). The second part consists of one question that assesses how applicable the first part's items are to the respondent. In the study by Nassir Ghaemi et al. [8], a score of 13 was reported as the optimal cutoff, while the Korean version determined 10 points to be optimal [22]. This study used the Korean version of the BSDS translated by Wang et al. [22] and utilized the individual scores of each of the 20 items to develop the machine learning model.
Anger Coping Scale (ACS)
The ACS is a self-report questionnaire designed to assess coping styles in anger-provoking situations [23]. It consists of 19 items divided into five factors: behavioral aggression (3 items), verbal aggression (4 items), problem-solving (5 items), tension-releasing (4 items), and anger suppression (3 items). Behavioral aggression involves expressing anger through physical actions, such as hitting or engaging in physical altercations with others. Verbal aggression refers to expressing anger through words, such as cursing or engaging in verbal disputes. Problem-solving coping involves objectively analyzing a problematic situation and seeking solutions. Tension-releasing coping refers to calming one's anger through activities such as exercise, humor, taking a bath, or meditating. Anger suppression involves avoiding outward expressions of anger and holding in one's emotions when faced with anger-inducing situations. Each ACS item is rated on a 5-point Likert scale from never (1) to very likely (5). This study used the individual scores of each item to develop the machine learning model.
Classification and feature selection
Classification strategy and validation framework
To evaluate the performance of individual self-report questionnaires and their combinations in classifying bipolar and unipolar disorder, we employed multiple supervised machine learning classifiers and feature selection methods. Models were trained on five different input sets: MDQ only, BSDS only, ACS only, MDQ-BSDS, and MDQ-BSDS-ACS.
This study adopted a repeated random stratified 70/30 holdout strategy (Monte Carlo cross-validation) to evaluate classification performance and feature stability. At each iteration, the dataset was randomly divided into a training set (70%) and a test set (30%), while preserving diagnostic group proportions. Recursive feature elimination (RFE) was then applied within the training set to iteratively remove the least informative features until all were eliminated, yielding the optimal subset for each questionnaire combination. The trained model was subsequently evaluated on the independent test set to avoid information leakage.
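The holdout-and-RFE loop described above can be sketched as follows. This is an illustrative Python/scikit-learn reconstruction, not the authors' MATLAB/SPIDER code: the data are synthetic, the iteration count is reduced from 10,000 to 25, and the fixed 10-feature subset is an arbitrary placeholder (the study searched all subset sizes).

```python
# Illustrative sketch of repeated stratified 70/30 holdout with RFE,
# in Python/scikit-learn rather than the authors' MATLAB/SPIDER setup.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# 189 "patients", 52 "items" (13 MDQ + 20 BSDS + 19 ACS), ~65% class 1
X, y = make_classification(n_samples=189, n_features=52,
                           weights=[0.35, 0.65], random_state=0)

aucs = []
for seed in range(25):  # the study ran 10,000 iterations
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    # RFE is fitted on the training split only, avoiding information leakage;
    # a linear SVM supplies the feature weights used for elimination
    rfe = RFE(SVC(kernel="linear"), n_features_to_select=10).fit(X_tr, y_tr)
    clf = SVC(kernel="rbf").fit(rfe.transform(X_tr), y_tr)
    aucs.append(roc_auc_score(y_te, clf.decision_function(rfe.transform(X_te))))

print(round(float(np.mean(aucs)), 2))
```

Because each test split is held out from both feature selection and model fitting, the averaged AUC estimates out-of-sample performance rather than training fit.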
Classifiers and software
To enhance the robustness of classification, we employed multiple supervised machine learning algorithms, including support vector machine (SVM) with radial basis function (RBF), linear, and polynomial kernels; random forest (RF); gradient boosting (GB); k-nearest neighbors (KNN, with k=2 and 5); linear discriminant analysis (LDA); and logistic regression. These classifiers were selected because they are widely applied in medical and psychiatric research, each providing complementary strengths. SVM is particularly well established for high-dimensional, small-sample data and has shown strong performance in psychiatric disorder classification [24,25]. Moreover, its integration with RFE has become a standard approach in biomedical studies, as feature weights can be leveraged to iteratively remove less informative variables [26,27]. RF and GB, as ensemble tree-based methods, are effective in capturing nonlinear relationships and complex feature interactions while remaining relatively robust to overfitting. KNN offers a nonparametric, instance-based approach, and LDA provides a linear baseline model that is computationally efficient and interpretable. All analyses were conducted in MATLAB (Statistics and Machine Learning Toolbox, MathWorks Inc.), and RFE was implemented using the SPIDER machine learning toolbox (http://people.kyb.tuebingen.mpg.de/spider/).
Preprocessing and normalization
Only participants who completed all items on the MDQ, BSDS, and ACS were included in the analysis (complete-case analysis), and no imputation of missing data was performed. Binary items (MDQ and BSDS) were coded as 0/1, and ACS items retained their original 1–5 Likert scaling. For SVM classifiers, z-score standardization was applied automatically within MATLAB (standardize=1), using the mean and standard deviation of the training set and applying the same parameters to the test set to prevent information leakage. No additional normalization was applied to tree-based classifiers, which are invariant to feature scaling, while KNN and LDA were trained on the same feature representation for consistency.
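The leakage-free standardization described above can be shown in a minimal sketch; the values here are synthetic stand-ins for ACS Likert responses, and the behavior mirrors MATLAB's standardize=1 option, where training-set statistics are reused unchanged on the test set.

```python
# Minimal sketch of leakage-free z-scoring: mean/SD come from the training
# split only and are applied to the test split as-is. Data are synthetic.
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.integers(1, 6, size=(132, 19)).astype(float)  # ~70% of n=189
X_test = rng.integers(1, 6, size=(57, 19)).astype(float)    # ~30% of n=189

mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0, ddof=1)

X_train_z = (X_train - mu) / sigma
X_test_z = (X_test - mu) / sigma   # same parameters: no test-set leakage

print(bool(np.allclose(X_train_z.mean(axis=0), 0.0)))  # True by construction
```

Standardizing with test-set statistics instead would let information about the held-out distribution leak into the model, optimistically biasing performance estimates.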
Feature selection and evaluation
For feature selection, we employed RFE, a wrapper-based algorithm that iteratively removes the least informative features based on classifier weights. RFE is particularly effective with SVM, as weight vectors directly indicate feature importance, and it has been widely adopted in biomedical studies for high-dimensional data analysis [24-27]. Feature selection in this context serves not only to improve generalizability and reduce overfitting but also to enhance interpretability of diagnostic models.
To assess the stability and relative importance of features, we adopted a frequency-based evaluation approach [28,29]. Specifically, the RFE procedure was repeated 10,000 times, each time on a newly stratified 70/30 train-test split. At each iteration, features were ranked according to their contribution to classification, and the lowest-ranking items were sequentially removed until all were eliminated. For every iteration, we recorded which features were retained in the subset that achieved the highest classification performance. By aggregating these results across all iterations, we computed the feature appearance frequency, which quantifies how consistently each item contributed to accurate classification [30,31]. Features selected at high frequencies were interpreted as robust and reliable indicators of diagnostic differentiation, whereas those with low frequencies were considered less informative or unstable. This framework not only provided averaged performance metrics—including accuracy, AUC, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and F1-score—but also offered a stability-driven perspective on the role of individual questionnaire items. For each questionnaire combination, we identified the feature subset yielding the highest average AUC and highlighted the items most frequently selected, thereby providing a nuanced understanding of feature significance across the MDQ, BSDS, and ACS.
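The frequency-based stability procedure can be sketched roughly as follows. This Python version substitutes scikit-learn's RFECV (which keeps the subset size with the best internal cross-validation score) for the SPIDER-based RFE used in the study, runs on synthetic data, and reduces the iteration count from 10,000 to 30.

```python
# Rough sketch of the frequency-based feature stability analysis: across
# repeated training splits, count how often each feature survives into the
# best-performing RFE subset. Synthetic data; iteration count reduced.
import numpy as np
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=189, n_features=12, n_informative=4,
                           random_state=1)

n_iter = 30
counts = Counter()
for seed in range(n_iter):
    X_tr, _, y_tr, _ = train_test_split(X, y, test_size=0.3,
                                        stratify=y, random_state=seed)
    # RFECV retains the subset size that maximizes internal CV accuracy
    sel = RFECV(SVC(kernel="linear"), step=1, cv=3).fit(X_tr, y_tr)
    counts.update(int(i) for i in np.flatnonzero(sel.support_))

# Appearance frequency per feature: high values = stable, informative items
freq = {i: counts[i] / n_iter for i in range(X.shape[1])}
```

Items with appearance frequencies near 1 would correspond to the robust indicators described above, while frequencies near 0 flag unstable or uninformative items.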
Expected confusion matrix construction
To provide a stable visualization of classification performance, we constructed an expected confusion matrix for the final selected model. During model evaluation, classification outcomes were obtained across 10,000 repeated stratified holdout iterations. For each iteration, a confusion matrix was computed and subsequently aggregated across all runs. The resulting average matrix reflects the expected cell counts for true positives, false positives, false negatives, and true negatives, thereby providing a robust summary of classification behavior that accounts for variability across resampling iterations.
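The averaging step can be illustrated with synthetic labels and predictions; 1,000 iterations stand in for the study's 10,000, and the ~80% accuracy of the simulated predictions is an arbitrary placeholder.

```python
# Sketch of the "expected" confusion matrix: compute one confusion matrix
# per holdout iteration, then average cell-wise across all iterations.
import numpy as np
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(42)
mats = []
for _ in range(1000):
    y_true = rng.integers(0, 2, size=57)  # a 30% test split of n=189
    flip = rng.random(57) < 0.2           # ~80% of predictions are correct
    y_pred = np.where(flip, 1 - y_true, y_true)
    mats.append(confusion_matrix(y_true, y_pred, labels=[0, 1]))

expected = np.mean(mats, axis=0)   # average TN, FP, FN, TP cell counts
print(round(float(expected.sum())))  # cells always sum to the test size, 57
```

Averaging over resampling iterations smooths out split-to-split variability, so the expected cell counts summarize typical rather than single-run behavior.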
Item-level endorsement rate analysis
To further examine the contribution of individual items, we calculated endorsement rates for each item in the MDQ, BSDS, and ACS. For MDQ and BSDS items (except item 20), endorsement was defined as a positive response (coded as 1). For BSDS item 20, endorsement was defined as a score ≥4. For ACS items, endorsement was defined as a score ≥1, reflecting any degree of endorsement. Endorsement rates were computed separately for bipolar and unipolar participants and compared descriptively. This approach allowed us to evaluate whether low base rates of endorsement might explain the limited contribution of certain items to the classification models.
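The endorsement-rate computation under these rules might look like the sketch below. The responses are synthetic, and the 0-6 coding assumed for BSDS item 20 is a placeholder (only the ≥4 endorsement rule comes from the text).

```python
# Sketch of group-wise endorsement rates for questionnaire items.
# All responses are synthetic stand-ins for the real data.
import numpy as np

rng = np.random.default_rng(7)
group = np.array([1] * 122 + [0] * 67)      # 1 = bipolar, 0 = unipolar
mdq = rng.integers(0, 2, size=(189, 13))    # binary MDQ responses
bsds20 = rng.integers(0, 7, size=189)       # BSDS item 20 (assumed 0-6)

def endorsement_rates(values, endorsed):
    """Proportion endorsing an item, computed within each diagnostic group."""
    return {g: float(endorsed(values[group == g]).mean()) for g in (0, 1)}

mdq_item1 = endorsement_rates(mdq[:, 0], lambda v: v == 1)   # binary rule
bsds_item20 = endorsement_rates(bsds20, lambda v: v >= 4)    # >=4 rule
```

Comparing the two group-wise rates per item, as in the study, makes it easy to spot items with low base rates in both groups, which carry little discriminative signal.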
Statistical analysis
Independent t-tests were used to compare questionnaire scores and other continuous variables between patients with bipolar disorder and depressive disorder. Sex, as a categorical variable, was compared using the chi-square test. To assess differences in classification performance between questionnaire sets, paired t-tests were conducted. All statistical analyses were performed using MATLAB software.
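The paired comparison of classification performance between questionnaire sets can be sketched as follows; scipy.stats.ttest_rel stands in for MATLAB's paired t-test, and the per-iteration AUC arrays (and their means and spreads) are synthetic.

```python
# Sketch of the paired t-test on per-iteration AUCs: because both models
# are evaluated on the same resampling splits, the comparison is paired.
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(3)
auc_mdq_bsds = rng.normal(0.848, 0.03, size=10_000)          # 32-item model
auc_triple = auc_mdq_bsds + rng.normal(0.007, 0.01, 10_000)  # +ACS model

t_stat, p_value = ttest_rel(auc_triple, auc_mdq_bsds)
print(p_value < 0.001)
```

Pairing on the shared splits removes split-to-split variance from the comparison, so even a small mean AUC difference can be detected reliably across many iterations.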
RESULTS
Sociodemographic and clinical characteristics of the subjects
The sociodemographic and clinical characteristics are described in Table 1. The final sample consisted of 189 patients, including 122 with bipolar disorder (36 BP-I, 24 BP-II, and 62 other specified bipolar and related disorder) and 67 with depressive disorder (42 major depressive disorder, 1 persistent depressive disorder, and 24 other specified depressive disorder). The mean age of patients with bipolar disorder (31.85±11.41 years) was significantly younger than that of patients with depressive disorder (38.09±14.37 years). There were no significant differences in sex distribution or education level between the two groups. Patients with bipolar disorder had a significantly higher MDQ score (8.918±3.236) than those with depressive disorder (4.433±3.306, p<0.001). The BSDS score was also significantly higher in patients with bipolar disorder (13.057±5.433) compared to those with depressive disorder (6.582±4.496, p<0.001). Additionally, patients with bipolar disorder had a higher ACS score (25.385±8.413) than those with depressive disorder (21.627±6.480, p=0.002).
Performance of the classification models using items of each MDQ, BSDS, or ACS
As shown in Figure 1A, the AUC for classification using a single MDQ item ranged from 0.5 to 0.72 depending on the machine learning classifier. The AUC of the classification models improved as the number of cumulative items increased. Using all 13 MDQ items, the AUC ranged from 0.65 to 0.83 across classifiers. The SVM-RBF classifier achieved the best performance (AUC=0.8212).
Performance of the prediction model using single scale of MDQ, BSDS, ACS according to the number of items. A: For the MDQ scale, the performance of the prediction model improved as the number of MDQ items increased. The AUC for classification ranged from 0.5 to 0.72, depending on the machine learning classifiers. B: Similarly, for the BSDS scale, the performance of the prediction model increased with the number of BSDS items. The AUC ranged from 0.63 to 0.79 across classifiers. C: The performance of the prediction model using the ACS scale was relatively low, ranging from 0.47 to 0.50. However, the AUC improved slightly as the number of ACS items increased. MDQ, Mood Disorder Questionnaire; BSDS, Bipolar Spectrum Diagnostic Scale; ACS, Anger Coping Scale; AUC, area under the receiver operating characteristic curve; SVM, support vector machine; RBF, radial basis function; RF, random forest; KNN, k-nearest neighbors; LDA, linear discriminant analysis; GB, gradient boosting.
As shown in Figure 1B, the AUC for classification using a single BSDS item ranged from 0.53 to 0.74 across classifiers. The AUC of the classification models increased as more items were included. When all 20 BSDS items were used, the AUC ranged from 0.63 to 0.79 across classifiers. The 19-item model using SVM-RBF performed best (AUC=0.7934).
Meanwhile, the AUC for classification using a single ACS item ranged from 0.47 to 0.50 (Figure 1C). When all 19 ACS items were used, the AUC of the models ranged from 0.50 to 0.60. The highest performance was observed for the 18-item model using the SVM-RBF classifier (AUC=0.6057).
Performance of the classification models using combination of MDQ, BSDS, and ACS
When using the combination models of MDQ and BSDS, the AUC for classification ranged from 0.70 to 0.84 (Figure 2A). The highest performance was achieved using the 32-item model with the SVM-RBF classifier (AUC=0.8477). For the triple combination of MDQ, BSDS, and ACS, the AUC ranged from 0.67 to 0.86, depending on the classifiers (Figure 2B). The AUC of the 49-item classification model achieved the highest value of 0.8548, exceeding that of the 32-item model (p<0.001, t=29.610).
Performance of the prediction model using combination scales of MDQ, BSDS, ACS according to the number of items. A: When combining MDQ and BSDS scales, the performance of the prediction model improved as the number of items increased. The AUC for classification ranged from 0.7 to 0.84 according to machine learning classifiers. B: When using the triple combination models of MDQ, BSDS, and ACS, the AUC of the prediction models was 0.67 to 0.86 across classifiers. MDQ, Mood Disorder Questionnaire; BSDS, Bipolar Spectrum Diagnostic Scale; ACS, Anger Coping Scale; AUC, area under the receiver operating characteristic curve; SVM, support vector machine; RBF, radial basis function; RF, random forest; KNN, k-nearest neighbors; LDA, linear discriminant analysis; GB, gradient boosting.
To complement these AUC-based results, we additionally summarized the best-performing models for each questionnaire set in Table 2. Overall, the MDQ-BSDS-ACS combination achieved the highest accuracy, AUC, specificity, and PPV, all of which were significantly greater than those of the other questionnaire sets (p<0.001). These results indicate that incorporating anger coping styles into established screening tools provides a substantial improvement in overall discriminative performance, particularly by reducing false positives and enhancing positive case identification.
Analysis of items of MDQ, BSDS, and ACS for the classification models using RFE
Feature selection frequency of MDQ, BSDS, and ACS items was analyzed as a percentage over 10,000 iterations (Figure 3). For MDQ, all 13 items contributed consistently to the classification models. For BSDS, 19 of the 20 items performed well, with the exception of item 19. Similarly, 18 of the 19 ACS items contributed effectively, excluding item 15.
Feature selection frequency of MDQ, BSDS, and ACS items as a percentage over 10,000 iterations. A: All 13 items of the MDQ contributed effectively to the prediction models. B: Nineteen of the 20 items from the BSDS performed well, with the exception of item 19. C: Eighteen of the 19 ACS items contributed consistently, excluding item 15. MDQ, Mood Disorder Questionnaire; BSDS, Bipolar Spectrum Diagnostic Scale; ACS, Anger Coping Scale.
As shown in Figure 4, when combining the MDQ and BSDS, the 32-item model excluding item 19 (alcohol or substance use) from the BSDS improved performance. For the combination of all three questionnaires, excluding items 18 (conflicts with colleagues or police) and 19 (alcohol or substance use) from the BSDS and item 15 (beating others) from the ACS yielded the best performance with a 49-item model.
Frequency of items of the combined MDQ, BSDS, and ACS scales for the prediction model. A: For the combination of MDQ and BSDS, the 32-item model excluding item 19 (alcohol or substance use) from the BSDS improved performance. B: When combining all three questionnaires, the 49-item model excluding items 18 (conflicts with colleagues or police) and 19 (alcohol or substance use) from the BSDS and item 15 (beating others) from the ACS showed the best performance. MDQ, Mood Disorder Questionnaire; BSDS, Bipolar Spectrum Diagnostic Scale; ACS, Anger Coping Scale.
Confusion matrix of the final model
Among all questionnaire sets, the MDQ–BSDS–ACS combination with feature selection achieved the highest overall performance, as summarized in Table 2. To provide an intuitive visualization of this final model, we constructed an expected confusion matrix averaged across 10,000 iterations (Figure 5). This matrix represents the classification outcomes of the feature-selected model, complementing the AUC- and metric-based results. It highlights the distribution of true positives, false positives, true negatives, and false negatives, thereby offering a concise depiction of sensitivity and specificity in practical classification terms.
Expected confusion matrix for the final prediction–classification model (MDQ–BSDS–ACS combination with feature selection). Values represent average counts across 10,000 iterations. MDQ, Mood Disorder Questionnaire; BSDS, Bipolar Spectrum Diagnostic Scale; ACS, Anger Coping Scale; Sp, specificity; FP, false positives; FN, false negatives; Se, sensitivity; Dp, depressive disorder; Bp, bipolar disorder.
Item-level endorsement rate analysis
In the item-level analysis of endorsement rates (Figure 6), BSDS items 17–19 and ACS item 15 exhibited overall endorsement rates below 50% in both bipolar and unipolar groups. However, RFE-based feature selection revealed that only BSDS items 18 and 19 and ACS item 15 were consistently excluded from the final classification models. This suggests that although BSDS item 17 also showed low base rates, it still contributed enough discriminative information to be retained, whereas the other three items did not provide stable predictive utility.
Endorsement rates of individual items in the MDQ, BSDS, and ACS questionnaires, comparing bipolar disorder and unipolar depression groups. Items with endorsement rates below 50% are highlighted (BSDS items 17–19, ACS item 15). MDQ, Mood Disorder Questionnaire; BSDS, Bipolar Spectrum Diagnostic Scale; ACS, Anger Coping Scale.
DISCUSSION
This study explored machine learning classification models for diagnosing bipolar versus depressive disorder using subjective self-report scales, namely the MDQ, BSDS, and ACS. When combining the MDQ and BSDS, the classification model achieved an AUC of 0.8477, consistent with their established utility for screening bipolar disorder. Importantly, while the study by Lee et al. [9] demonstrated the effectiveness of these scales, our approach involved separating the data into training and test sets, which adds robustness to the model by simulating real-world application conditions. This indicates that the multivariate analysis and machine learning classifiers used in this study successfully captured complex feature interactions, yielding comparable diagnostic accuracy. Adding the ACS, which evaluates anger coping styles, further increased the AUC to 0.8548. Although the improvement was modest, it underscores the value of integrating other behavioral aspects, in addition to manic and depressive symptoms, to enhance diagnostic accuracy. Previous studies have shown that individuals with bipolar disorder exhibit distinct anger coping strategies, such as greater tension-releasing and problem-solving behaviors compared to those with depressive disorders [19], which likely contributed to the improved performance.
Additionally, this study explored the contribution of individual questionnaire items to the classification models. When analyzing item-level effects, we found that BSDS items 18 (conflicts with colleagues or police) and 19 (alcohol or substance use), together with ACS item 15 (beating others), were excluded by RFE despite being included in the initial questionnaire set. These items also demonstrated endorsement rates below 50% across both groups, indicating that they were infrequently endorsed regardless of diagnosis, which likely limited their discriminative value. These questions concerned external behaviors or symptoms that could be evaluated negatively by others. These findings may be interpreted through several perspectives. First, individuals with bipolar disorder often have impaired insight into their condition, with factors like cognitive dysfunction and mood instability contributing to this lack of awareness, particularly regarding changes in mood, mental functioning, and social interactions [32-35]. This can lead to challenges in accurately recognizing or reporting their behaviors, particularly those related to interpersonal or socially undesirable actions. Second, individuals with bipolar disorder are often more sensitive to criticism, which may result in underreporting of behaviors that could elicit negative evaluations from others [36]. Such sensitivity could skew self-assessment and reporting reliability. Third, among individuals with milder forms of bipolar disorder, such externalized behaviors may not manifest as prominently, reducing the diagnostic relevance of these items. This underscores the importance of thoughtfully designing and selecting questionnaire content to enhance its effectiveness in predictive models.
Finally, while BSDS item 17 (awkward or annoying behaviors) also showed a relatively low endorsement rate, it appears to have retained sufficient discriminative contribution to remain in the model, underscoring the complex interplay between statistical selection and clinical relevance. Taken together, these findings highlight that not all clinically meaningful items enhance model performance in data-driven classification frameworks, and careful consideration of both endorsement patterns and clinical interpretability is essential when designing or refining predictive questionnaires.
In addition, sociocultural factors should be considered in interpreting these findings. In Korean patients, cultural norms regarding anger expression and social desirability bias may differ from those commonly observed in Western populations [37,38]. Such cultural influences could affect coping strategies according to diagnostic group. For example, previous studies have reported a tendency toward anger suppression among patients with depressive disorder, whereas individuals with bipolar disorder more often employed tension-releasing and problem-solving coping strategies [19]. These cultural and diagnostic differences may have contributed to the observed classification patterns in this study.
Numerous studies have explored the distinct features of bipolar and unipolar disorders, primarily focusing on identifying differences between the two groups. Despite these efforts and the discovery of many distinguishing characteristics, the integration of these insights into formal diagnostic criteria remains limited. Clinicians, however, often rely on a broad spectrum of known information to evaluate patients, employing diverse tools that are not yet systematically unified in practice. Machine learning, as a powerful multivariate analytical approach, offers a solution by synthesizing these fragmented insights into cohesive predictive frameworks [21,39]. This study illustrates how combining established screening tools like the MDQ and BSDS with additional measures, such as ACS items, can lead to enhanced diagnostic models. By selecting specific, clinically meaningful features from these tools, machine learning facilitates the development of models with superior diagnostic performance, mirroring the comprehensive decision-making process of skilled clinicians. Furthermore, this framework offers a scalable methodology for integrating additional well-established characteristics of bipolar disorder, paving the way for progressively improving diagnostic precision and bridging the gap between clinical research and real-world application.
Despite these promising results, an important clinical challenge remains in differentiating major depressive disorder from BP-II or spectrum presentations, where hypomanic symptoms are often subtle and less readily detected. While BP-I is typically easier to identify, the limitations of existing screening tools are more evident in such ambiguous cases. This study, despite its limitations, provides a step toward addressing this challenge by applying a machine learning–based approach to explore informative items. Future studies with larger and subtype-stratified samples will be necessary to refine these models and enhance diagnostic accuracy for BP-II in particular.
Beyond the analysis of individual questionnaire items, it is important to situate these findings within broader dimensional approaches to psychopathology. The Research Domain Criteria (RDoC) framework emphasizes the integration of multiple domains, such as affect regulation, cognition, and social processes, rather than restricting diagnosis to categorical DSM criteria [21,40]. Psychiatric disorders are inherently high-dimensional, and synthesizing this information is essential for advancing diagnostic accuracy. Machine learning is particularly well-suited for this task, as it can integrate heterogeneous data sources and identify complex, non-linear patterns beyond the scope of traditional statistical approaches [24,39]. Our previous study showed that combining MDQ and BSDS improved diagnostic accuracy compared with either scale alone [10]. The present work further extends this integrative approach by incorporating anger coping styles, demonstrating how multivariate models can align with dimensional frameworks such as RDoC and support the refinement of diagnostic tools in psychiatry.
The limitations of this study were as follows. First, this study was conducted at a single university hospital, which may have introduced sample-specific biases and contributed to model overfitting. Validation through multi-center studies with larger and more diverse populations is necessary to confirm the generalizability of these findings. In addition, external validation in independent samples remains a critical next step to ensure the robustness of the proposed model. The relatively modest sample size also limited our ability to perform comprehensive analyses of the relative contributions of individual questionnaire items. While our repeated RFE approach allowed us to identify features with consistently low impact and enhance stability, future studies with larger datasets will be required to quantify item-level importance more precisely. Second, while we adopted a repeated random stratified hold-out strategy to balance stability and generalizability, k-fold cross-validation could also serve as a complementary approach, and future research should examine its utility for further strengthening performance estimates. Third, although this study demonstrated proof-of-concept utility, practical clinical application requires translation into decision-making frameworks. Future studies should investigate how this model could be implemented alongside existing diagnostic workflows to aid screening and early detection. Fourth, reliance on multiple questionnaires poses practical challenges in routine care due to time and resource demands; streamlined instruments or adaptive testing approaches may help address this limitation. Fifth, the classification model relied solely on subjective self-report questionnaires, which are subject to biases such as self-perception errors or variability in interpretation. 
In addition, certain clinically relevant variables—such as age of onset, number of episodes, current medications, and comorbidities—were not included in the present analysis due to the limited sample size and the design focused on self-report data. These variables are important for a more comprehensive clinical characterization and may further improve model performance in future studies. Integrating objective measures, such as clinical evaluations, brain imaging, or electrophysiological tests, could enhance the model’s accuracy and robustness. However, this self-report-based model remains valuable for settings where objective assessments are not feasible, such as community clinics or primary care centers. Sixth, this study used only the MDQ and BSDS for diagnostic screening of bipolar disorder. In future research, additional use of other screening scales, such as the Hypomania Checklist-32, would be helpful in improving the diagnostic classification rate [41]. Seventh, the study did not differentiate between subtypes of bipolar disorder due to the small sample size. Given the distinct characteristics and clinical courses of BP-I and BP-II, future research with larger, subtype-stratified samples may reveal opportunities to optimize diagnostic models further. Finally, this study included patients diagnosed with bipolar disorder NOS within the bipolar group. These patients exhibited clear hypomanic symptoms during the diagnostic assessment; however, the duration and number of symptoms did not meet the full criteria for hypomanic episodes. As such, they could alternatively be classified as major depressive disorder, which means that the diagnostic model in this study may differ from models restricted to BP-I or BP-II patients. To address this limitation, future studies should confirm the diagnosis of bipolar disorder NOS cases through extended longitudinal follow-up to ensure diagnostic accuracy.
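The repeated-RFE strategy described above can be sketched in brief. This is an illustrative outline under stated assumptions, not the study's exact implementation: data are synthetic, the estimator is a logistic regression, and only 50 iterations are run (the study used 1,000). Items eliminated in most iterations correspond to the consistently low-impact items (e.g., BSDS items 18 and 19, ACS item 15) identified in the study.

```python
import numpy as np
from collections import Counter
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data: 10 hypothetical questionnaire items, sample sizes as in
# the study (122 bipolar = 1, 67 depressive = 0).
rng = np.random.default_rng(42)
n_items = 10
y = np.array([1] * 122 + [0] * 67)
X = rng.normal(size=(len(y), n_items))
X[:, 0] += y  # make item 0 genuinely informative

counts = Counter()
n_iter = 50  # the study repeated RFE over 1,000 iterations
for i in range(n_iter):
    # Repeated random stratified hold-out split, as described in the text.
    X_tr, _, y_tr, _ = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=i)
    rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
    rfe.fit(X_tr, y_tr)
    counts.update(np.flatnonzero(rfe.support_))  # indices of retained items

# Items retained across most iterations are the consistently informative
# ones; rarely retained items are candidates for exclusion.
retained = [item for item, c in counts.most_common(5)]
print(retained)
```

Averaging selection frequencies over many stratified splits is what gives the item-level stability the limitation paragraph refers to; a single RFE run on one split would be far more sensitive to sampling noise.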
This study established a machine learning classification model for screening bipolar disorder using the subjective questionnaires MDQ, BSDS, and ACS. In addition to mood symptoms, incorporating clinical characteristics such as anger coping strategies improved the diagnostic classification performance. Analysis of individual questionnaire items revealed that symptoms associated with external behaviors, such as alcohol or substance use, awkward or annoying actions, social conflicts, or physical aggression, did not significantly enhance the model’s diagnostic performance. These findings suggest that such items may be less effective for distinguishing bipolar disorder, particularly in individuals with mild symptoms or limited insight into their condition. This study highlights the clinical implications of refining self-report questionnaires to enhance diagnostic classification models. Future efforts should focus on optimizing item selection to maximize diagnostic utility while considering diverse symptom presentations. Expanding the application of machine learning models in mental health monitoring within community and primary care settings could facilitate early detection and timely intervention for bipolar disorder.
Notes
Availability of Data and Material
The datasets generated or analyzed during the study are available from the corresponding author on reasonable request.
Conflicts of Interest
The authors have no potential conflicts of interest to disclose.
Author Contributions
Conceptualization: Kyungwon Kim, Eunsoo Moon. Data curation: Kyungwon Kim, Hyunju Lim. Formal analysis: Kyungwon Kim. Funding acquisition: Eunsoo Moon. Investigation: Kyungwon Kim, Hyunju Lim, Hwagyu Suh, Hyunji Lee. Methodology: Kyungwon Kim, Hyunju Lim, Young Min Lee. Project administration: Kyungwon Kim. Resources: Hyunju Lim, Hwagyu Suh, Eunsoo Moon. Software: Kyungwon Kim, Eunsoo Moon. Supervision: Byung Dae Lee, Eunsoo Moon. Validation: Young Min Lee, Hyunji Lee. Visualization: Kyungwon Kim. Writing—original draft: Kyungwon Kim, Eunsoo Moon. Writing—review & editing: Eunsoo Moon.
Funding Statement
This work was supported by a 2-Year Research Grant of Pusan National University.
Acknowledgments
None
