In this study, we aimed to develop a model that predicts which individuals in a general population have suicide ideation, using a machine learning algorithm.
Among 35,116 individuals aged over 19 years from the Korea National Health & Nutrition Examination Survey, we selected 11,628 individuals via random down-sampling. This included 5,814 suicide ideators and the same number of non-suicide ideators. We randomly assigned the subjects to a training set (n=10,466) and a test set (n=1,162). In the training set, a random forest model was trained with 15 features selected with recursive feature elimination via 10-fold cross validation. Subsequently, the fitted model was used to predict suicide ideators in the test set and among the total of 35,116 subjects. All analyses were conducted in R.
The prediction model achieved a good performance [area under receiver operating characteristic curve (AUC)=0.85] in the test set and predicted suicide ideators among the total samples with an accuracy of 0.821, sensitivity of 0.836, and specificity of 0.807.
This study suggests that a machine learning approach could enable screening for suicide risk in the general population. Further work is warranted to increase the accuracy of prediction.
In Korea, suicide has become the fifth leading cause of death following cancer, stroke, cardiovascular disease, and pneumonia [
It is estimated that 15.4% of Koreans have thought about suicide at some point in their lives, and 2.9% report suicide ideation in the previous year [
There are several known socio-demographic, physical, and psychological factors influencing suicide ideation and behaviors [
This study was performed with data from the Korea National Health and Nutrition Examination Survey (KNHANES) collected between 2007 and 2012 (total n=50,405). The KNHANES is a nationwide survey of the health and nutritional status of non-institutionalized civilians in Korea, and is conducted every year by the Korea Centers for Disease Control and Prevention [
Among the 38,005 individuals aged over 19 years, 35,116 subjects answered the following survey question about suicide ideation: “During the past year, have you ever felt that you were willing to die?” Among the 35,116 respondents, 5,814 (16.6%) reported experiencing suicide ideation (suicide ideators), while the remaining 29,302 respondents (83.4%) denied any suicide ideation (non-suicide ideators) (
The institutional review board of the National Center for Mental Health approved the protocol of this study (IRB approval number: 116271-2018-36).
Inputting all the data into the classifier to build the learning model will usually lead to a learning bias towards the majority class of non-suicide ideators (known as the “class imbalance problem”) [
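Random down-sampling of the majority class, the remedy reported in the abstract, can be sketched as follows. This is a minimal Python illustration, not the authors' R code; the function name `downsample_majority`, the fixed seed, and the toy label vector are assumptions made for the example.

```python
import random


def downsample_majority(labels, seed=0):
    """Return sorted indices of a class-balanced subsample.

    labels: sequence of 0/1 class labels.
    The smaller class is kept whole; an equally sized random subset of
    the larger class is drawn without replacement.
    """
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = sorted((positives, negatives), key=len)
    rng = random.Random(seed)  # fixed seed is our choice, for reproducibility
    kept_majority = rng.sample(majority, len(minority))
    return sorted(minority + kept_majority)


# Toy labels with the same class counts as the survey sample
# (5,814 ideators, 29,302 non-ideators); no real survey data is used.
labels = [1] * 5814 + [0] * 29302
balanced = downsample_majority(labels)
print(len(balanced))  # 11628
```

Down-sampling discards most of the majority class, so the balanced subset no longer reflects the population prevalence; this matters when interpreting predictive values.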
We manually selected 47 variables that were likely to be related to suicide risk. Subsequently, we imputed missing data with the Multiple Imputation by Chained Equations (MICE) method and normalized numeric data by z-scoring.
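The z-scoring step can be illustrated with a short Python sketch. It assumes the sample standard deviation (n-1 denominator) and uses made-up ages; the text does not specify these details, and the MICE imputation step is not shown.

```python
from statistics import mean, stdev


def z_score(values):
    """Standardize values to mean 0 and standard deviation 1."""
    m = mean(values)
    s = stdev(values)  # sample SD (n-1 denominator); an assumption
    if s == 0:
        raise ValueError("cannot z-score a constant variable")
    return [(v - m) / s for v in values]


ages = [23, 35, 47, 59, 71]  # hypothetical values, not survey data
scaled = z_score(ages)
print([round(z, 3) for z in scaled])
```

Whether the scaling statistics were computed before or after the training/test split is not stated in the text; computing them on the training set only is the usual way to avoid information leaking from the test set.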
To select the smallest subset of features that most accurately classifies suicide ideators, we performed recursive feature elimination with a random forest on the training set. A model trained with 39 features achieved the highest Kappa. However, to reduce the dimensionality as much as possible, we chose a simpler model trained with the last 15 features in the backward selection, for which the Kappa was not much lower than that of the 39-feature model (
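For readers unfamiliar with the Kappa statistic used to compare feature subsets, the following Python sketch computes Cohen's kappa from a 2×2 confusion matrix. The counts are the test-set counts from the confusion matrix table and are used purely for illustration; they are not the cross-validated Kappa values from the feature selection, which are not reproduced here.

```python
def cohens_kappa(tp, tn, fp, fn):
    """Cohen's kappa: agreement beyond chance between prediction and truth."""
    total = tp + tn + fp + fn
    observed = (tp + tn) / total
    # Chance agreement: product of the marginal proportions, summed over classes.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total**2
    return (observed - expected) / (1 - expected)


# Test-set counts from the confusion matrix table (illustration only).
kappa = cohens_kappa(tp=448, tn=460, fp=121, fn=133)
print(round(kappa, 3))  # 0.563
```

On a balanced sample the chance agreement is 0.5, so kappa here equals twice the accuracy minus one.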
For the machine learning algorithm, we used a random forest model, which is an ensemble of classification trees. The random forest approach builds numerous trees on bootstrapped samples and aggregates their predictions across trees. For model development, 10-fold cross validation was used to reduce overfitting and to improve the generalization of the model. In 10-fold cross validation, the training set is partitioned into 10 equally sized folds, and each fold is used once as a validation set while the other 9 folds are used for training (
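The fold partitioning described above can be sketched in Python as follows. The helper `k_fold_indices` and the seed are hypothetical, and the model fitting inside each fold is omitted; only the index bookkeeping is shown.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (training, validation) index lists for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


# Training-set size reported in the paper.
splits = list(k_fold_indices(10466, k=10))
print(len(splits), sorted({len(val) for _, val in splits}))
```

Because 10,466 is not divisible by 10, the folds are only approximately equal in size (1,046 or 1,047 samples each). Each sample appears in exactly one validation fold.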
All analyses were conducted in R version 3.4.3 (
The random forest model trained with 15 features showed a good performance (AUC=0.85), comparable to that of the model trained with 39 features, in predicting suicide ideators (
In this study, we applied a machine learning algorithm to public health data to develop a model predicting individuals with suicide ideation in the general population. When predicting suicide ideators in the test set, the machine learning model showed a good performance (AUC=0.85) with an accuracy of 78.1%. Moreover, the model predicted suicide ideators among the total population of about 35,000 with an accuracy of 82%. The predictive ability of the machine learning model is comparable to that of suicide risk assessment tools used in the clinical setting [
Some studies have been performed to predict suicide risk in clinical settings by using machine learning approaches. Passos et al. [
As the number of features grows, the amount of data we need to generalize accurately grows exponentially [
In this study, we used variables related to mental and physical health, as well as demographic characteristics, as features classifying suicide ideators. In our prediction model, depression, anxiety, and stress were the most important features predicting suicide ideators. According to a recent survey of mental disorder in Korea, about 40% of suicide ideators were found to experience mood or anxiety disorders [
This study is subject to some methodological limitations. First, data from the KNHANES included information about suicide ideation and psychological status that was examined by using very simple questions and scales, which might affect the performance of the prediction model. Second, the 1-year prevalence of suicide ideation in this study (16.6%) was much higher than that of an epidemiological survey of mental disorders in Korea in 2016 (2.9%) [
In conclusion, this study showed that a machine learning model based on public health data can successfully predict individuals with suicide ideation among the general population. Further studies are needed to apply machine learning techniques to public health data, clinical data, and biomarkers to develop prediction models of more critical suicide risk such as self-harm and suicide attempt.
This research was supported by a grant from the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HI17C0682).
A plot of recursive feature elimination with feature selection in the training set.
Scheme of prediction model development.
Receiver operating characteristic (ROC) curves. *15-feature model, †39-feature model. AUC: Area under ROC curve.
Characteristics of suicide ideators (N=5,814) and non-suicide ideators (N=29,302)
Variable | Suicide ideator | Non-suicide ideator | Statistics |
---|---|---|---|
Age, years | 54.13 (17.73) | 49.00 (16.26) | T=20.43, p<0.01 |
Sex | | | χ2=553.10, p<0.01 |
Male | 1,654 (28.4) | 13,225 (45.1) | |
Female | 4,160 (71.6) | 16,077 (54.9) | |
Education | | | χ2=1,345.13, p<0.01 |
Village school | 41 (0.7) | 72 (0.2) | |
Uneducated | 834 (14.4) | 1,451 (5.0) | |
Elementary school | 1,585 (27.4) | 4,896 (16.7) | |
Middle school | 673 (11.6) | 3,419 (11.7) | |
High school | 1,342 (23.2) | 8,587 (29.4) | |
Two- or three-year college | 471 (8.1) | 3,497 (12.0) | |
Four-year university | 738 (12.7) | 6,221 (21.3) | |
Graduate school | 107 (1.8) | 1,108 (3.8) | |
Reasons for unemployment | | | χ2=1,296.16, p<0.01 |
Do not feel the need | 297 (5.1) | 2,054 (7.0) | |
Schooling | 119 (2.1) | 774 (2.6) | |
Retired | 83 (1.4) | 846 (2.9) | |
Having health problems | 1,471 (25.4) | 2,674 (9.2) | |
Looking for a job | 350 (6.1) | 1,514 (5.2) | |
Parenting or nursing | 507 (8.8) | 2,818 (9.6) | |
etc. | 171 (3.0) | 755 (2.6) | |
Employed | 2,787 (48.2) | 17,773 (60.8) | |
Average work week, hours | 24.28 (26.86) | 29.73 (25.88) | T=-14.15, p<0.01 |
Subjective health status | | | χ2=2,340.76, p<0.01 |
Very good | 126 (2.2) | 1,432 (4.9) | |
Good | 1,154 (19.9) | 10,041 (34.3) | |
Fair | 1,982 (34.2) | 12,571 (43.0) | |
Poor | 1,857 (32.0) | 4,552 (15.6) | |
Very poor | 679 (11.7) | 657 (2.2) | |
Days of feeling sick or discomfort, days | 4.46 (6.08) | 1.91 (4.39) | T=30.43, p<0.01 |
Limitation of daily life and social activities | | | χ2=1,585.42, p<0.01 |
Yes | 1,858 (32.1) | 3,399 (11.6) | |
No | 3,937 (67.9) | 25,854 (88.4) | |
EQ-5D: mobility | | | χ2=1,574.23, p<0.01 |
No problems | 3,760 (64.9) | 25,099 (85.8) | |
Some problems | 1,889 (32.6) | 4,046 (13.8) | |
Confined to bed | 148 (2.6) | 110 (0.4) | |
EQ-5D: usual activities | | | χ2=1,910.65, p<0.01 |
No problems | 4,191 (72.3) | 26,825 (91.7) | |
Some problems | 1,325 (22.9) | 2,225 (7.6) | |
Unable to perform | 278 (4.8) | 203 (0.7) | |
EQ-5D: pain/discomfort | | | χ2=1,812.66, p<0.01 |
No | 3,148 (54.3) | 22,789 (77.9) | |
Moderate | 2,058 (35.5) | 5,862 (20.0) | |
Extreme | 590 (10.2) | 603 (2.1) | |
EQ-5D: anxiety/depression | | | χ2=3,746.10, p<0.01 |
No | 3,647 (62.9) | 26,866 (91.8) | |
Moderate | 1,887 (32.6) | 2,280 (7.8) | |
Extreme | 262 (4.5) | 109 (0.4) | |
EQ-VAS | 63.76 (21.81) | 75.03 (16.55) | T=-37.125, p<0.01 |
Depressed mood over 2 weeks | | | χ2=6,316.11, p<0.01 |
Yes | 2,802 (48.2) | 2,321 (7.9) | |
No | 3,011 (51.8) | 26,980 (92.1) | |
Stress level in daily life | | | χ2=3,295.15, p<0.01 |
Extremely | 837 (14.4) | 844 (2.9) | |
Stressful | 2,429 (41.8) | 5,524 (18.9) | |
Moderately | 2,085 (35.9) | 17,541 (59.9) | |
Minimally | 457 (7.9) | 5,389 (18.4) | |
Values are N (%) or mean (SD). Statistics are from the chi-square test or the independent t-test.
EQ-5D: EuroQoL-5D, EQ-VAS: EuroQoL-Visual Analogue Scale
Confusion matrix and prediction scores
Measure | Test set (N=1,162) | Entire population (N=35,116) |
---|---|---|
True positive | 448 | 4,860 |
True negative | 460 | 23,641 |
False positive | 121 | 5,661 |
False negative | 133 | 954 |
Accuracy | 0.781 | 0.821 |
Sensitivity | 0.771 | 0.836 |
Specificity | 0.792 | 0.807 |
Positive predictive value | 0.787 | 0.462 |
Negative predictive value | 0.776 | 0.961 |
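The prediction scores in the table follow directly from the confusion-matrix counts, and the lower positive predictive value in the entire population follows from its lower prevalence of suicide ideation. The Python sketch below illustrates both points; the function names are ours, and the second function is the standard Bayes relation between sensitivity, specificity, prevalence, and positive predictive value rather than anything specific to this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard scores derived from a 2x2 confusion matrix."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


def ppv_from_prevalence(sensitivity, specificity, prevalence):
    """Positive predictive value implied by sensitivity, specificity, prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Test-set counts from the table above.
test_set = classification_metrics(tp=448, tn=460, fp=121, fn=133)
for name, value in test_set.items():
    print(f"{name}: {value:.3f}")

# Prevalence in the entire sample: 5,814 ideators among 35,116 respondents.
prevalence = 5814 / 35116
print(round(ppv_from_prevalence(0.836, 0.807, prevalence), 3))  # 0.462
```

With the same sensitivity and specificity, a balanced sample (prevalence 0.5) would give a positive predictive value of about 0.81, which shows how strongly this score depends on prevalence.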