Use of a Machine Learning Algorithm to Predict Individuals with Suicide Ideation in the General Population

Article information

Psychiatry Investig. 2018;.pi.2018.08.27
Publication date (electronic) : 2018 October 11
doi :
Department of Mental Health Research, National Center for Mental Health, Seoul, Republic of Korea
Correspondence: Seunghyong Ryu, MD Department of Mental Health Research, National Center for Mental Health, 127 Youngmasan-ro, Gwangjin-gu, Seoul 04933, Republic of Korea Tel: +82-2-2204-0109, Fax: +82-2-2204-0393 E-mail:
Received 2018 June 24; Accepted 2018 August 27.



In this study, we aimed to develop a model predicting individuals with suicide ideation within a general population using a machine learning algorithm.


Among 35,116 individuals aged over 19 years from the Korea National Health & Nutrition Examination Survey, we selected 11,628 individuals via random down-sampling. This included 5,814 suicide ideators and the same number of non-suicide ideators. We randomly assigned the subjects to a training set (n=10,466) and a test set (n=1,162). In the training set, a random forest model was trained with 15 features selected with recursive feature elimination via 10-fold cross validation. Subsequently, the fitted model was used to predict suicide ideators in the test set and among the total of 35,116 subjects. All analyses were conducted in R.


The prediction model achieved good performance [area under receiver operating characteristic curve (AUC)=0.85] in the test set and predicted suicide ideators among the total sample with an accuracy of 0.821, sensitivity of 0.836, and specificity of 0.807.


This study shows the possibility that a machine learning approach can enable screening for suicide risk in the general population. Further work is warranted to increase the accuracy of prediction.


In Korea, suicide has become the fifth leading cause of death following cancer, stroke, cardiovascular disease, and pneumonia [1]. The suicide rate in Korea is the highest among the Organization for Economic Cooperation and Development (OECD) countries, with up to 40 people taking their own life every day [2]. This high suicide rate is significantly linked to the avoidance of psychiatric treatment due to social stigma associated with mental illness in the country [3]. Many studies have indicated that the majority of suicide completers have diagnosable psychiatric illnesses, such as depression and alcohol use disorder [4,5]. However, in Korea, only one-fourth of suicide completers had seen a psychiatrist before taking their own life, whereas a greater number had visited a general physician or a Korean medicine doctor to address symptoms of indigestion or insomnia [6]. This underlines the importance of suicide prevention strategies in community-based settings, including gatekeeper training, screening programs, public education, and restricting access to lethal means [7].

It is estimated that 15.4% of Koreans have thought about suicide at some point in their lives, with 2.9% reporting having engaged in suicide ideation in the previous year [8]. The lifetime prevalence of suicide ideation in Korea is considerably higher than the cross-national lifetime prevalence of suicide ideation (9.2%) [9]. Suicide ideation is regarded as a major predictor of committing suicide and therefore assessing suicide ideation is an important step in suicide prevention strategies. Even though individuals who think about suicide do not all subsequently commit suicide, people experiencing persistent and severe suicide ideation are at increased risk of attempting suicide [10,11]. According to a cross-national study, 60% of transitions from suicide ideation to suicide plan and attempt occur within the first year after onset of suicide ideation [9]. In addition, suicide ideation has been found to be associated with clinically significant symptoms of mental illnesses such as depression and bipolar disorder [12,13].

There are several known socio-demographic, physical, and psychological factors influencing suicide ideation and behaviors [14]. Predicting which individuals are at high risk of suicide by screening risk factors at the population level might be an effective approach to reduce suicide rates [15]. Such an approach requires analytic techniques that can classify individuals at high risk by integrating multiple risk factors. Recently, several studies have applied machine learning to medical and healthcare big data for disease diagnosis, treatment, and prevention [16]. Machine learning is a branch of artificial intelligence in which a computer derives the underlying rules from raw data. We expected that machine learning analysis of public health data could also be used to predict individuals at high risk of suicide in the general population. In this study, we aimed to develop a model predicting individuals with suicide ideation in the general population of Korea by using a machine learning algorithm.


Study population

This study was performed with data from the Korea National Health and Nutrition Examination Survey (KNHANES) conducted between 2007 and 2012 (total n=50,405). The KNHANES is a nationwide survey of the health and nutritional status of non-institutionalized civilians in Korea, and is conducted every year by the Korea Centers for Disease Control and Prevention [17]. Each year, the survey uses a stratified and multistage probability sampling design to include a new sample of about 8,000 individuals. All KNHANES participants provide written consent to participate in the survey and for their personal data to be used.

Among the 38,005 individuals aged over 19 years, 35,116 subjects answered the following survey question about suicide ideation: “During the past year, have you ever felt that you were willing to die?” Among the 35,116 respondents, 5,814 (16.6%) reported experiencing suicide ideation (suicide ideators), while the remaining 29,302 respondents (83.4%) denied any suicide ideation (non-suicide ideators) (Table 1).

Characteristics of suicide ideators (N=5,814) and non-suicide ideators (N=29,302)

The institutional review board of the National Center for Mental Health approved the protocol of this study (IRB approval number: 116271-2018-36).

Set assignment

Training a classifier on all the data would usually bias learning toward the majority class of non-suicide ideators (known as the “class imbalance problem”) [18]. Therefore, to create two classes of the same size, we randomly selected 5,814 individuals from the 29,302 non-suicide ideators via down-sampling. Thus, 11,628 individuals (5,814 suicide ideators and the same number of non-suicide ideators) were finally included in this study. We randomly assigned the 11,628 subjects to a training set (n=10,466) and a test set (n=1,162), an approximately 9:1 split that preserved the 1:1 ratio between the two classes.
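The down-sampling and class-preserving split described above can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the authors' R code (which relied on the caret package). The integer IDs, the seed, the 10% test fraction, and the function name `downsample_and_split` are assumptions made for the example.

```python
import random

def downsample_and_split(minority_ids, majority_ids, test_fraction=0.1, seed=0):
    """Down-sample the majority class to the minority size, then split per class."""
    rng = random.Random(seed)  # arbitrary seed; the paper does not report one
    majority_kept = rng.sample(majority_ids, len(minority_ids))
    train, test = [], []
    for label, ids in ((1, list(minority_ids)), (0, majority_kept)):
        rng.shuffle(ids)
        n_test = int(len(ids) * test_fraction)  # floor: 581 of 5,814 per class
        test += [(i, label) for i in ids[:n_test]]
        train += [(i, label) for i in ids[n_test:]]
    return train, test

# Hypothetical integer IDs standing in for KNHANES respondents.
ideators = list(range(5814))
non_ideators = list(range(5814, 5814 + 29302))
train, test = downsample_and_split(ideators, non_ideators)
print(len(train), len(test))
```

Splitting within each class, rather than over the pooled sample, is what keeps the 1:1 ratio exact in both sets; with these inputs the set sizes match the reported 10,466 and 1,162.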

Data preprocessing and feature selection

We manually selected 47 variables that were likely to be related to suicide risk. Subsequently, we imputed missing data with the Multiple Imputation by Chained Equations (MICE) method and normalized numeric data by z-scoring.
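As a brief illustration of the z-scoring step, the sketch below centers a numeric variable on its mean and scales it by its standard deviation. It is written in Python for illustration only (the study used R). The example values are hypothetical, and the use of the sample standard deviation is an assumption, since the paper does not state which form was used.

```python
from statistics import mean, stdev

def z_score(values):
    """Center values on their mean and scale by the sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # assumption: sample (n-1) standard deviation
    return [(v - mu) / sigma for v in values]

ages = [23, 35, 47, 59, 71]  # hypothetical ages, not KNHANES data
scaled = z_score(ages)
print([round(z, 3) for z in scaled])
```

After this transformation the variable has mean 0 and standard deviation 1, which puts features measured on different scales (for example, age in years and EQ-VAS scores) on a common footing.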

To select the smallest subset of features that most accurately classifies suicide ideators, we performed recursive feature elimination with a random forest on the training set. We observed that a model trained with 39 features achieved the highest value of Kappa. However, to reduce the dimensionality as much as possible, we decided to use a simpler model trained with the last 15 features in the backward selection, for which the Kappa was not much lower than that of the 39-feature model (Figure 1). The 15 selected features, in order of importance, were as follows: “depressed mood over two weeks,” “stress level in daily life,” “EuroQoL-5D (EQ-5D): anxiety/depression,” “EuroQoL-Visual Analogue Scale (EQ-VAS),” “sex,” “education,” “subjective health status,” “age,” “EQ-5D: mobility,” “reasons for unemployment,” “EQ-5D: pain/discomfort,” “days of feeling sick or discomfort,” “EQ-5D: usual activities,” “average work week,” and “limitation of daily life and social activities.”
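Kappa, the criterion used to compare feature subsets, measures agreement between predicted and actual classes beyond what chance alone would produce. The sketch below computes Cohen's Kappa from a 2×2 confusion matrix in Python. As a worked input it uses the test-set counts from Table 2. The Kappa values used during feature elimination were obtained by cross validation on the training set and are not reported numerically in the text, so the figure printed here is only an illustration of the formula.

```python
def cohens_kappa(tp, tn, fp, fn):
    """Cohen's Kappa for a binary confusion matrix."""
    n = tp + tn + fp + fn
    observed = (tp + tn) / n
    # Chance agreement from the row and column marginals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return (observed - expected) / (1 - expected)

# Test-set counts from Table 2 (used only as a worked example).
kappa = cohens_kappa(tp=448, tn=460, fp=121, fn=133)
print(round(kappa, 3))
```

With balanced classes the chance agreement is 0.5, so Kappa reduces to twice the accuracy minus one.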

Figure 1.

A plot of recursive feature elimination with feature selection in the test set.

Machine learning analysis

For the machine learning algorithm, we utilized a random forest model, which is based on ensembles of classification trees. The random forest approach builds numerous trees on bootstrapped samples and aggregates their predictions across trees. For model development, 10-fold cross validation was used to avoid overfitting and to improve the generalization of the model. In 10-fold cross validation, the data in the training set are partitioned into 10 equally sized folds, and each fold is used once as a validation set while the other 9 folds are used for training (Figure 2). Alongside cross validation, we performed hyperparameter optimization using the grid search method. Subsequently, the fitted model was used to predict the classes in the test set, and the predicted classes were compared with the actual classes. The model’s performance in predicting the classes was evaluated by using the area under the receiver operating characteristic (ROC) curve (AUC). We calculated the accuracy, sensitivity, specificity, positive predictive value, and negative predictive value from the confusion matrix. To verify the model’s performance in a real population, we applied the model to the sample of 35,116 subjects who were aged over 19 years and had answered the question about suicide ideation in the KNHANES.
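The fold partitioning behind 10-fold cross validation can be sketched as follows. This Python illustration is not the authors' code (they used caret in R). It only generates the index splits, leaving out model fitting, and the seed and function name `k_fold_indices` are assumptions.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # fold sizes differ by at most one
    for i, val in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, val

# 10,466 is the training-set size reported in the paper.
splits = list(k_fold_indices(10466, k=10))
fold_sizes = [len(val) for _, val in splits]
print(fold_sizes)
```

In a grid search, each candidate hyperparameter setting would be fitted on the nine training folds and scored on the held-out fold for every split, and the setting with the best average score would be retained.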

Figure 2.

Scheme of prediction model development.

All analyses were conducted using R version 3.4.3 and its packages, including “mice” for imputation of missing data and “caret” for down-sampling, feature selection, and cross validation.


The random forest model trained with 15 features showed good performance (AUC=0.85), comparable to that of the model trained with 39 features, in predicting suicide ideators (Figure 3). The confusion matrices are presented in Table 2. In the test set, the 15-feature model correctly predicted 448 of the 581 actual suicide ideators. Meanwhile, among the 581 non-suicide ideators, 460 were classified correctly. Therefore, the model achieved an accuracy of 0.781, sensitivity of 0.771, specificity of 0.792, positive predictive value of 0.787, and negative predictive value of 0.776. When applying the model to the total of 35,116 subjects, the 15-feature model predicted suicide ideators with an accuracy of 0.821, sensitivity of 0.836, specificity of 0.807, positive predictive value of 0.462, and negative predictive value of 0.961.
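The prediction scores follow directly from the confusion matrix counts. The Python sketch below recomputes them from the counts listed in Table 2; the function name `classification_metrics` is an assumption made for the example.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard binary classification scores from confusion matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

# Counts taken from Table 2.
test_set = classification_metrics(tp=448, tn=460, fp=121, fn=133)
population = classification_metrics(tp=4860, tn=23641, fp=5661, fn=954)
for name, scores in (("test set", test_set), ("entire sample", population)):
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

The test-set scores reproduce Table 2 to three decimals. For the entire sample, sensitivity, specificity, and both predictive values also match, whereas the accuracy recomputed from the listed counts is about 0.812 rather than the reported 0.821, which may reflect a transcription difference in either the counts or the reported value.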

Figure 3.

Receiver operating characteristic (ROC) curves. *15-feature model, 39-feature model. AUC: Area under ROC curve.
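The AUC summarized by these curves can be read as the probability that a randomly chosen suicide ideator receives a higher predicted score than a randomly chosen non-suicide ideator. The Python sketch below computes it by direct pairwise comparison. The labels and scores are hypothetical toy values, not model output from the study.

```python
def auc_from_scores(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]               # hypothetical true classes
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]   # hypothetical predicted probabilities
auc = auc_from_scores(labels, scores)
print(round(auc, 3))
```

Because it depends only on the ranking of scores, the AUC does not change with the classification threshold, unlike accuracy, sensitivity, and specificity.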

Confusion matrix and prediction scores


In this study, we applied a machine learning algorithm to public health data to develop a model predicting individuals with suicide ideation in the general population. When predicting suicide ideators in the test set, the machine learning model showed good performance (AUC=0.85) with an accuracy of 78.1%. Moreover, we found that the model could predict suicide ideators among the total sample of about 35,000 with an accuracy of 82%. The predictive ability of the machine learning model is comparable to that of suicide risk assessment tools used in the clinical setting [19,20].

Some studies have been performed to predict suicide risk in clinical settings by using machine learning approaches. Passos et al. [21] distinguished suicide attempters from non-suicide attempters among patients with mood disorders with an accuracy of 65–72%, using machine learning algorithms based on demographic and clinical data. Oh et al. [22] classified individuals with a history of suicide attempts among patients with depression or anxiety disorders by applying artificial neural networks to multiple psychiatric scales and sociodemographic data with an accuracy of 87–91%. Moreover, a recent study investigated the probability of death by suicide using general characteristics and insurance data from the National Health Insurance Service cohort in Korea, showing fair performance (AUC=0.68) of machine learning models in predicting death by suicide [23]. In the present study, we intended to develop a machine learning model predicting suicide risk in the general population. To this end, we analyzed big data from annual nationwide surveys on health and nutrition status in the general population. To ensure the prediction model could learn more information, we chose suicide ideation, rather than the rarer outcome of suicide attempt, as an indicator of potential suicide risk. This study showed that machine learning-based prediction models can successfully classify suicide ideators among the general population by using simple information about physical and mental health status, as well as demographic characteristics.

As the number of features grows, the amount of data we need to generalize accurately grows exponentially [24]. Therefore, to avoid the so-called “curse of dimensionality” and to increase the generalization of our model, we selected as few features as possible via feature selection to train the prediction model for suicide ideators. In the training set, the model trained with 39 features showed the highest value of Kappa. However, we chose the simpler model trained with 15 features for which performance was not worse than that of the 39-feature model. We expected that the simpler model would enable easier interpretation of the results and application to other new population data.

In this study, we used variables related to mental and physical health, as well as demographic characteristics, as features classifying suicide ideators. In our prediction model, depression, anxiety, and stress were the most important features predicting suicide ideators. According to a recent survey of mental disorders in Korea, about 40% of suicide ideators were found to experience mood or anxiety disorders [8]. Several studies have suggested that academic, work, and life event stresses are associated with suicide ideation [25,26]. Physical factors such as having somatic symptoms or medical illnesses can also be important features that distinguish suicide ideators [27]. Indeed, suicide ideation is often accompanied by somatic symptoms in patients with depression [28]. Moreover, the burden of physical health conditions itself is a major risk factor for suicide [29]. In relation to both mental and physical health, the score for quality of life (“EQ-VAS”) played an important role in classifying suicide ideators in our prediction model. Among sociodemographic factors, “sex,” “education,” “age,” “reasons for unemployment,” and “average work week” were included in the set of 15 features for the prediction model. It is known that there are age and gender differences in factors associated with suicide ideation and behaviors [30,31]. Furthermore, some studies have reported an association between educational level and suicide risk [32]. There is also evidence that work-related factors may be related to suicide outcomes [33,34].

This study is subject to some methodological limitations. First, data from the KNHANES included information about suicide ideation and psychological status that was examined by using very simple questions and scales, which might affect the performance of the prediction model. Second, the 1-year prevalence of suicide ideation in this study (16.6%) was much higher than that of an epidemiological survey of mental disorders in Korea in 2016 (2.9%) [8]. This is because the definition of suicide ideation in the KNHANES included mild, fleeting forms. Third, when applying the model to the total sample, the positive predictive value was only 46.2%. This was due to the low proportion of suicide ideators among the total subjects. Fourth, we used only one machine learning algorithm, a random forest model. Additional analyses are warranted to compare the performance of prediction models with other machine learning algorithms, such as support vector machines and artificial neural networks.
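The dependence of the positive predictive value on the proportion of suicide ideators can be made explicit with Bayes' rule. The Python sketch below holds the reported sensitivity (0.836) and specificity (0.807) fixed and varies only the prevalence. The function name `ppv_from_prevalence` is an assumption, and the balanced 50% scenario is included purely for contrast.

```python
def ppv_from_prevalence(sensitivity, specificity, prevalence):
    """Positive predictive value implied by sensitivity, specificity, and prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Balanced classes, as in the down-sampled data (for contrast only).
balanced = ppv_from_prevalence(0.836, 0.807, 0.5)
# Proportion of suicide ideators in the total sample: 5,814 of 35,116.
observed = ppv_from_prevalence(0.836, 0.807, 5814 / 35116)
print(round(balanced, 3), round(observed, 3))
```

At the sample prevalence of about 16.6%, the formula gives a value of roughly 0.46, in line with the 46.2% reported, whereas the same sensitivity and specificity would yield a value above 0.8 if the two classes were equally common.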

In conclusion, this study showed that a machine learning model based on public health data can successfully predict individuals with suicide ideation among the general population. Further studies are needed to apply machine learning techniques to public health data, clinical data, and biomarkers to develop prediction models of more critical suicide risk such as self-harm and suicide attempt.


This research was supported by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HI17C0682).


1. Statistics Korea. Annual Report on the Cause of Death Statistics 2016. Daejeon: Statistics Korea; 2017.
2. OECD. OECD Health Statistics 2017. Available at: Accessed June 17, 2018.
3. Kim WJ, Song YJ, Ryu HS, Ryu V, Kim JM, Ha RY, et al. Internalized stigma and its psychosocial correlates in Korean patients with serious mental illness. Psychiatry Res 2015;225:433–439.
4. Arsenault-Lapierre G, Kim C, Turecki G. Psychiatric diagnoses in 3275 suicides: a meta-analysis. BMC Psychiatry 2004;4:37.
5. Conwell Y, Duberstein PR, Cox C, Herrmann JH, Forbes NT, Caine ED. Relationships of age and axis I diagnoses in victims of completed suicide: a psychological autopsy study. Am J Psychiatry 1996;153:1001–1008.
6. Korea Psychological Autopsy Center. Psychological Autopsy Report. Available at: Accessed June 17, 2018.
7. Zalsman G, Hawton K, Wasserman D, van Heeringen K, Arensman E, Sarchiapone M, et al. Suicide prevention strategies revisited: 10-year systematic review. Lancet Psychiatry 2016;3:646–659.
8. Ministry of Health and Welfare. The Survey of Mental Disorders in Korea 2016. Sejong: Ministry of Health and Welfare; 2017.
9. Nock MK, Borges G, Bromet EJ, Alonso J, Angermeyer M, Beautrais A, et al. Cross-national prevalence and risk factors for suicidal ideation, plans and attempts. Br J Psychiatry 2008;192:98–105.
10. May AM, Klonsky ED, Klein DN. Predicting future suicide attempts among depressed suicide ideators: a 10-year longitudinal study. J Psychiatr Res 2012;46:946–952.
11. Miranda R, Ortin A, Scott M, Shaffer D. Characteristics of suicidal ideation that predict the transition to future suicide attempts in adolescents. J Child Psychol Psychiatry 2014;55:1288–1296.
12. Chellappa SL, Araujo JF. Sleep disorders and suicidal ideation in patients with depressive disorder. Psychiatry Res 2007;153:131–136.
13. Valtonen H, Suominen K, Mantere O, Leppamaki S, Arvilommi P, Isometsa ET. Suicidal ideation and attempts in bipolar I and II disorders. J Clin Psychiatry 2005;66:1456–1462.
14. Jeon HJ, Lee JY, Lee YM, Hong JP, Won SH, Cho SJ, et al. Lifetime prevalence and correlates of suicidal ideation, plan, and single and multiple attempts in a Korean nationwide study. J Nerv Ment Dis 2010;198:643–646.
15. Ribeiro JD, Franklin JC, Fox KR, Bentley KH, Kleiman EM, Chang BP, et al. Letter to the Editor: suicide as a complex classification problem: machine learning and related techniques can advance suicide prediction - a reply to Roaldset (2016). Psychol Med 2016;46:2009–2010.
16. Obermeyer Z, Emanuel EJ. Predicting the future - big data, machine learning, and clinical medicine. N Engl J Med 2016;375:1216–1219.
17. Kweon S, Kim Y, Jang MJ, Kim Y, Kim K, Choi S, et al. Data resource profile: the Korea National Health and Nutrition Examination Survey (KNHANES). Int J Epidemiol 2014;43:69–77.
18. Li DC, Liu CW, Hu SC. A learning method for the class imbalance problem with medical data sets. Comput Biol Med 2010;40:509–518.
19. Bolton JM, Gunnell D, Turecki G. Suicide risk assessment and intervention in people with mental illness. BMJ 2015;351:h4978.
20. Ghasemi P, Shaghaghi A, Allahverdipour H. Measurement scales of suicidal ideation and attitudes: a systematic review article. Health Promot Perspect 2015;5:156–168.
21. Passos IC, Mwangi B, Cao B, Hamilton JE, Wu MJ, Zhang XY, et al. Identifying a clinical signature of suicidality among patients with mood disorders: A pilot study using a machine learning approach. J Affect Disord 2016;193:109–116.
22. Oh J, Yun K, Hwang JH, Chae JH. Classification of suicide attempts through a machine learning algorithm based on multiple systemic psychiatric scales. Front Psychiatry 2017;8:192.
23. Choi SB, Lee W, Yoon JH, Won JU, Kim DW. Ten-year prediction of suicide death using Cox regression and machine learning in a nationwide retrospective cohort study in South Korea. J Affect Disord 2018;231:8–14.
24. Altman N, Krzywinski M. The curse(s) of dimensionality. Nat Methods 2018;15:399–400.
25. Grover KE, Green KL, Pettit JW, Monteith LL, Garza MJ, Venta A. Problem solving moderates the effects of life event stress and chronic stress on suicidal behaviors in adolescence. J Clin Psychol 2009;65:1281–1290.
26. Tsutsumi A, Kayaba K, Ojima T, Ishikawa S, Kawakami N; Jichi Medical School Cohort Study Group. Low control at work and the risk of suicide in Japanese men: a prospective cohort study. Psychother Psychosom 2007;76:177–185.
27. Druss B, Pincus H. Suicidal ideation and suicide attempts in general medical illnesses. Arch Intern Med 2000;160:1522–1526.
28. Jeon HJ, Woo JM, Kim HJ, Fava M, Mischoulon D, Cho SJ, et al. Gender differences in somatic symptoms and current suicidal risk in outpatients with major depressive disorder. Psychiatry Investig 2016;13:609–615.
29. Ahmedani BK, Peterson EL, Hu Y, Rossom RC, Lynch F, Lu CY, et al. Major Physical Health Conditions and Risk of Suicide. Am J Prev Med 2017;53:308–315.
30. Lee H, Seol KH, Kim JW. Age and sex-related differences in risk factors for elderly suicide: differentiating between suicide ideation and attempts. Int J Geriatr Psychiatry 2018;33:e300–e306.
31. Sugawara N, Yasui-Furukori N, Sasaki G, Tanaka O, Umeda T, Takahashi I, et al. Gender differences in factors associated with suicidal ideation and depressive symptoms among middle-aged workers in Japan. Ind Health 2013;51:202–213.
32. Kimura T, Iso H, Honjo K, Ikehara S, Sawada N, Iwasaki M, et al. Educational Levels and Risk of Suicide in Japan: The Japan Public Health Center Study (JPHC) Cohort I. J Epidemiol 2016;26:315–321.
33. Kim SY, Shin DW, Oh KS, Kim EJ, Park YR, Shin YC, et al. Gender differences of occupational stress associated with suicidal ideation among South Korean employees: the Kangbuk Samsung Health Study. Psychiatry Investig 2018;15:156–163.
34. Yoon CG, Bae KJ, Kang MY, Yoon JH. Is suicidal ideation linked to working hours and shift work in Korea? J Occup Health 2015;57:222–229.

Table 1.

Characteristics of suicide ideators (N=5,814) and non-suicide ideators (N=29,302)

| Variable | Suicide ideator* | Non-suicide ideator* | Statistics |
|---|---|---|---|
| Age, years | 54.13 (17.73) | 49.00 (16.26) | T=20.43, p<0.01 |
| Sex | | | χ²=553.10, p<0.01 |
| Male | 1,654 (28.4) | 13,225 (45.1) | |
| Female | 4,160 (71.6) | 16,077 (54.9) | |
| Education | | | χ²=1,345.13, p<0.01 |
| Village school | 41 (0.7) | 72 (0.2) | |
| Uneducated | 834 (14.4) | 1,451 (5.0) | |
| Elementary school | 1,585 (27.4) | 4,896 (16.7) | |
| Middle school | 673 (11.6) | 3,419 (11.7) | |
| High school | 1,342 (23.2) | 8,587 (29.4) | |
| Two- or three-year college | 471 (8.1) | 3,497 (12.0) | |
| Four-year university | 738 (12.7) | 6,221 (21.3) | |
| Graduate school | 107 (1.8) | 1,108 (3.8) | |
| Reasons for unemployment | | | χ²=1,296.16, p<0.01 |
| Do not feel the need | 297 (5.1) | 2,054 (7.0) | |
| Schooling | 119 (2.1) | 774 (2.6) | |
| Retired | 83 (1.4) | 846 (2.9) | |
| Having health problems | 1,471 (25.4) | 2,674 (9.2) | |
| Looking for a job | 350 (6.1) | 1,514 (5.2) | |
| Parenting or nursing | 507 (8.8) | 2,818 (9.6) | |
| etc. | 171 (3.0) | 755 (2.6) | |
| Employed | 2,787 (48.2) | 17,773 (60.8) | |
| Average work week, hours | 24.28 (26.86) | 29.73 (25.88) | T=-14.15, p<0.01 |
| Subjective health status | | | χ²=2,340.76, p<0.01 |
| Very good | 126 (2.2) | 1,432 (4.9) | |
| Good | 1,154 (19.9) | 10,041 (34.3) | |
| Fair | 1,982 (34.2) | 12,571 (43.0) | |
| Poor | 1,857 (32.0) | 4,552 (15.6) | |
| Very poor | 679 (11.7) | 657 (2.2) | |
| Days of feeling sick or discomfort, days | 4.46 (6.08) | 1.91 (4.39) | T=30.43, p<0.01 |
| Limitation of daily life and social activities | | | χ²=1,585.42, p<0.01 |
| Yes | 1,858 (32.1) | 3,399 (11.6) | |
| No | 3,937 (67.9) | 25,854 (88.4) | |
| EQ-5D: mobility | | | χ²=1,574.23, p<0.01 |
| No problems | 3,760 (64.9) | 25,099 (85.8) | |
| Some problems | 1,889 (32.6) | 4,046 (13.8) | |
| Confined to bed | 148 (2.6) | 110 (0.4) | |
| EQ-5D: usual activities | | | χ²=1,910.65, p<0.01 |
| No problems | 4,191 (72.3) | 26,825 (91.7) | |
| Some problems | 1,325 (22.9) | 2,225 (7.6) | |
| Unable to perform | 278 (4.8) | 203 (0.7) | |
| EQ-5D: pain/discomfort | | | χ²=1,812.66, p<0.01 |
| No | 3,148 (54.3) | 22,789 (77.9) | |
| Moderate | 2,058 (35.5) | 5,862 (20.0) | |
| Extreme | 590 (10.2) | 603 (2.1) | |
| EQ-5D: anxiety/depression | | | χ²=3,746.10, p<0.01 |
| No | 3,647 (62.9) | 26,866 (91.8) | |
| Moderate | 1,887 (32.6) | 2,280 (7.8) | |
| Extreme | 262 (4.5) | 109 (0.4) | |
| EQ-VAS | 63.76 (21.81) | 75.03 (16.55) | T=-37.125, p<0.01 |
| Depressed mood over 2 weeks | | | χ²=6,316.11, p<0.01 |
| Yes | 2,802 (48.2) | 2,321 (7.9) | |
| No | 3,011 (51.8) | 26,980 (92.1) | |
| Stress level in daily life | | | χ²=3,295.15, p<0.01 |
| Extremely | 837 (14.4) | 844 (2.9) | |
| Stressful | 2,429 (41.8) | 5,524 (18.9) | |
| Moderately | 2,085 (35.9) | 17,541 (59.9) | |
| Minimally | 457 (7.9) | 5,389 (18.4) | |

*N (%) or mean±SD. Statistics: chi-square test or independent t-test.

EQ-5D: EuroQoL-5D, EQ-VAS: EuroQoL-Visual Analogue Scale

Table 2.

Confusion matrix and prediction scores

| | Test set (N=1,162) | Entire population (N=35,116) |
|---|---|---|
| True positive | 448 | 4,860 |
| True negative | 460 | 23,641 |
| False positive | 121 | 5,661 |
| False negative | 133 | 954 |
| Accuracy | 0.781 | 0.821 |
| Sensitivity | 0.771 | 0.836 |
| Specificity | 0.792 | 0.807 |
| Positive predictive value | 0.787 | 0.462 |
| Negative predictive value | 0.776 | 0.961 |