
Suicidality is a complex, multifaceted issue with significant biopsychosocial causes, ranking as a major cause of death in developed nations. This study aims to leverage machine learning (ML) to predict monthly suicide counts in Poland using Google Trends data, contributing to ongoing efforts to improve public health strategies.
Methods
Using data from the Polish National Police (2013–2023), monthly suicide attempt counts were analysed alongside relative search volumes (RSVs) of 40 suicide-related and mental health terms. The Pearson Correlation Coefficient (PCC) was used to identify the strongest predictors. Four ML models (Linear Regression, Random Forest, Support Vector Regression (SVR), and XGBoost Regression) were tested, with PCC and error metrics guiding model selection.
Results
Sixteen terms were identified as the best predictors for the general population and 13 for the adult cohort. Random Forest Regression outperformed the other models, achieving a PCC of 0.909 and a mean absolute percentage error (MAPE) of 6.78% for the general population, compared to SVR's PCC of 0.644 and 14.8% MAPE. For the adult cohort, Random Forest yielded a PCC of 0.853 and MAPE of 7.21%, again outperforming SVR. Key predictors included the terms anxiety disorder and psychiatrist for the general population, with social isolation also being important for adults.
Conclusions
This study presents one of the first ML approaches to predicting suicide attempts at national level, highlighting the utility of Google Trends data. Further research with higher-resolution data is recommended to refine predictive models and enhance suicide prevention strategies.
Suicide is a leading contributor to global mortality, with more than 1.3 % of all deaths in 2019 attributable to suicide and over 700,000 individuals dying due to suicide each year (Suicide Worldwide, 2021). Efforts at national and international levels aim to reduce suicide rates; however, the considerable delay between monitoring and public reporting of suicides poses a challenge for real-time interventions (Ma-Kellams et al., 2016). This becomes a significant problem when factors affecting suicide are not constant, but rather shift rapidly, thereby complicating the association of these factors with elevated suicide risk (Ernst et al., 2024).
Suicide stigma is a significant barrier to suicide prevention, leading to decreased mental health help-seeking, particularly among high-risk groups such as men and minoritized populations (Batterham et al., 2013; Jafari et al., 2024). The internet appears to lack the stigma associated with seeking face-to-face help, making it a more appealing option for individuals experiencing suicidal thoughts (Kauer et al., 2014). Suicide-related internet use (SRIU) is typically defined as the "use of the internet for reasons relating to an individual’s own feelings of suicide," and can range from beneficial activities, such as searching for help and support, to harmful behaviours, such as researching suicide methods online (Mok et al., 2015). Evidence suggests that SRIU is associated with suicidal thoughts and behaviours, although the nature of this association is complex and not fully understood. For example, in a UK survey of mental health patients with recent suicidal thoughts or behaviour, one-third of participants who reported SRIU had attempted suicide in the past year, and 35 % reported experiencing moderately high-intensity suicidal thoughts on a daily basis (Bojanić et al., 2025). These findings indicate that while SRIU is not necessarily causal, it is strongly associated with both the frequency and intensity of suicidal ideation and attempts. In a population-based cohort of 21-year-olds, the prevalence of suicide/self-harm-related internet use was 22.5 %, with a higher proportion accessing sites offering help, advice, or support (8.2 %) than those seeking information on how to harm or kill themselves (3.1 %) (Mars et al., 2015). According to Bojanić et al. (2024), 7.7 % of mental health patients who died by suicide in the UK had engaged in SRIU. Among these individuals, the most common types of SRIU were obtaining information on how to die (68.9 %), visiting pro-suicide websites (32.9 %), and communicating suicidal ideas online (15.9 %).
Furthermore, the digital footprint left by individuals in suicidal crisis can be measured using social media, as demonstrated in numerous studies over recent years (Choi et al., 2020; Robinson et al., 2016; Waszak et al., 2024). Due to the social and environmental underpinnings of suicide, early efforts to examine real-time trends related to suicide have focused on using signals from large-scale social media data, such as Facebook, to harness its potential in suicide-prevention programs. In 2017, Facebook initiated a program using artificial intelligence to detect suicide risk among users, in which flagged posts were sent to human moderators who verified whether a real risk was present (Broer, 2022).
Google Trends is widely used in studies investigating suicide-related internet use, given that Google represented 73.27 % of the global market share in 2023 and provides free public access to its statistics (Search Engine Market Share, 2024). An increasing number of studies use Google Trends to track mental health problems and to accelerate real-time understanding of risk or data trends (Barros et al., 2020; D. Choi et al., 2020; Sumner et al., 2022). The first study using cross-correlation analysis to estimate the temporal relationship between suicide and Google Trends was by Yang et al. (2011). Since then, more advanced statistical approaches have been introduced, such as multivariate Poisson regression, vector autoregressive (VAR) models, autoregressive integrated moving average (ARIMA) models, and machine learning (ML) methods (Barros et al., 2019; Choi et al., 2020; Choi et al., 2023; Tran et al., 2017). However, many of these studies have been limited by inconsistent keyword selection, lack of external validation across different populations, and a focus on large countries such as the USA, Germany, or South Korea. Our study addresses these gaps by applying Google Trends analysis to Poland, a middle-sized European country, and by incorporating a broader group of suicide-related terms together with multiple machine learning models. This approach provides novel evidence and improves the reliability of suicide-related search behaviour as a potential signal for public health monitoring.
ML modelling in public health has gained attention in recent years following its success in influenza forecasting using internet-derived data (Cheng et al., 2020; Lu et al., 2018). In psychiatry, ML modelling based on internet-derived data has been applied to real-time weekly forecasting of opioid overdose deaths and suicide fatalities (D. Choi et al., 2020; Sumner et al., 2022). Compared with traditional statistical models, ML approaches offer advantages that are especially relevant in the context of suicide-related Google Trends data: they can accommodate high-dimensional predictor sets from multiple search terms, capture nonlinear associations and interactions, and improve predictive accuracy. These strengths make ML particularly suitable for our study, which applies an expanded group of suicide-related search terms to model suicide attempts in Poland.
Our aim is to assess different machine learning approaches based on data retrieved from Google Trends, to determine which approach is the most suitable, and to identify which predictors are the most important in modelling suicide attempts in Poland. We focus specifically on suicide attempts rather than suicidal ideation because attempts are systematically recorded by the National Police, whereas reliable population-level data on ideation are rarely available and less consistent than data on suicide attempts (Aluri et al., 2024). Suicide attempts also represent a clinically and public health-relevant outcome with higher immediate risk, making them particularly suitable for predictive modelling. The selection of predictors was guided by known empirical risk factors for suicide identified in systematic reviews and meta-analyses, ensuring that the ML models were grounded in established evidence. These risk factors include suicidal ideation, psychiatric disorders, physical illnesses (e.g., cancer), history of self-harm, alcohol use disorder, drug use disorder, financial debt and unemployment, relationship conflicts, and family-related conflicts (Favril et al., 2022, 2023). Incorporating these evidence-based risk factors improves the interpretability and validity of our models while allowing the machine learning approach to identify the most influential predictors from the broader set of Google Trends search terms. It is crucial to evaluate which terms act as predictors and should therefore be covered in future suicide prevention strategies.
Methods
Suicide attempts data
Our primary outcome of interest was the monthly count of suicide attempts. Data on suicide attempts were provided by the Headquarters of the Polish National Police for the period from 1st January 2013 to 31st December 2023. In the police reporting system, “suicide attempts” encompass both fatal (mortal) and non-fatal suicidal acts; therefore, these were analysed jointly in our study. A total of 128,347 (fatal and non-fatal) attempts were included in the analysis. Reports were generated at one-year resolution, stratified by month, age and location according to the jurisdiction of each Voivodeship Police Headquarters, which closely matches the administrative division of Poland. In our analysis we distinguished two cohorts: all suicide attempts (regardless of age) and adults (≥18 years old and unspecified age).
Google Trends search volume
Our primary input variables were the relative search volumes (RSVs) of several search terms, obtained through Google Trends, an open online tool that reports deidentified data on queries made by Google search engine users. RSV is a value normalised over the selected time frame and location, ranging from 0 to 100 (Google, 2025). A value of 0 indicates a very low number of queries in the selected time frame and location, and 100 indicates the highest interest. Google's adjustment process also excludes repeated queries made over a short time from the same IP address (Mavragani & Ochoa, 2019). All settings followed the guidelines for Google Trends in infodemiology (Mavragani & Ochoa, 2019). For this study, the region of interest was set to “Poland” (the entire country). The date range was set from 1st January 2013 to 31st December 2023, so that it matched the date range of the suicide attempts dataset.
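To make the 0 to 100 scale concrete, the following minimal sketch rescales a series so that its peak equals 100. It is a simplification under stated assumptions: Google Trends does not expose raw query counts, its sampling and privacy thresholds are not reproduced here, and the function name and input numbers are hypothetical.

```python
def relative_search_volume(term_counts, total_counts):
    """Rescale a term's share of all searches so that the peak period equals 100.

    term_counts  -- hypothetical number of queries for the term per period
    total_counts -- hypothetical number of all queries per period
    """
    # Share of all searches that concern the term in each period
    shares = [term / total for term, total in zip(term_counts, total_counts)]
    peak = max(shares)
    if peak == 0:
        return [0 for _ in shares]
    # Express every period relative to the peak period
    return [round(100 * share / peak) for share in shares]


# Made-up counts for three months with equal overall search traffic
example_rsv = relative_search_volume([10, 20, 5], [100, 100, 100])
```

Because each series is scaled to its own peak, RSVs of different terms retrieved in separate queries are not directly comparable in absolute terms, only in their shape over time.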
Firstly, a preliminary list of search terms was compiled, which included terms that had been examined in previous studies on suicide prediction using Google Trends, given their demonstrated predictive utility (Choi et al., 2020; Choi et al., 2023; Son et al., 2023; Sumner et al., 2023; Yang et al., 2011). Next, the timelines of RSVs for these terms were examined in the Polish dataset to ensure that the results were relatively stable over time and not dominated by short-term fluctuations or outliers. To improve conceptual grounding, these terms were then reviewed in light of empirically established suicide risk factors identified in systematic reviews and meta-analyses ensuring that our feature set broadly corresponded to factors known to be associated with suicide risk (Favril et al., 2022, 2023). The final set of 40 terms was divided into five categories reflecting different domains of suicidal ideation and behaviour: suicide-seeking, suicide-prevention, suicide-triggers, suicide-symptoms, and psychosis (Table 1). Translations of all terms are presented in supplementary material (Table S1).
Terms selected for assessment as predictors, divided into five categories related to suicidal ideation: suicide-seeking, suicide-prevention, suicide-triggers, suicide-symptoms, psychosis.
First, machine learning models using RSVs as input variables were used for prediction of the number of suicide attempts (output - primary outcome).
RSVs of all selected terms and categories were tested for Pearson correlation with suicide attempts count to find the category with the highest correlation. The magnitude of correlation was interpreted based on the Pearson correlation coefficient as negligible: 0.00 - 0.09, weak: 0.10 - 0.39, moderate: 0.40 - 0.69, strong: 0.70 - 0.89 or very strong: 0.90 - 1.00 (Schober et al., 2018). The resulting PCC values are presented as heatmaps to improve clarity. Importantly, the categorical grouping of terms (e.g., suicide-seeking, suicide-prevention, triggers, symptoms, psychosis) was applied only for visualization and correlation analysis and had no influence on the machine learning workflow. For the modelling step, we applied a clear thresholding rule: only terms with PCC of ≥ 0.5 with suicide attempt counts were pre-selected as input variables (predictors) for the ML models.
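The correlation screening described above can be sketched in a few lines of plain Python. This is an illustration, not the study's code: the term names and all numbers are made up, and the absolute value of the PCC is used for thresholding, which is an assumption based on the fact that a negatively correlated term (social isolation) appears among the selected predictors.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def magnitude(r):
    """Label |r| with the bands quoted from Schober et al. (2018)."""
    a = abs(r)
    if a < 0.10:
        return "negligible"
    if a < 0.40:
        return "weak"
    if a < 0.70:
        return "moderate"
    if a < 0.90:
        return "strong"
    return "very strong"


def select_predictors(rsv_by_term, attempts, threshold=0.5):
    """Keep terms whose |PCC| with the attempt counts reaches the threshold."""
    selected = {}
    for term, series in rsv_by_term.items():
        r = pearson(series, attempts)
        if abs(r) >= threshold:  # assumption: sign is ignored when thresholding
            selected[term] = r
    return selected


# Made-up monthly data for three hypothetical terms
attempts = [900, 950, 1000, 1100, 1200, 1150]
rsv_by_term = {
    "term_a": [40, 45, 50, 60, 70, 65],  # rises with attempts
    "term_b": [70, 65, 60, 50, 40, 45],  # falls as attempts rise
    "term_c": [50, 52, 49, 51, 50, 51],  # unrelated to attempts
}
selected = select_predictors(rsv_by_term, attempts)
```

In this toy example `term_a` and `term_b` pass the threshold and `term_c` is dropped.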
Four different machine learning algorithms were tested for predicting the suicide attempts count from RSVs: Linear Regression, Random Forest Regression, Support Vector Regression and XGBoost Regression. These four models were selected because they take distinct approaches to model optimization and are widely used in prediction studies. Linear Regression served as a simple and interpretable baseline, assuming a linear relationship between predictors and the outcome, though such assumptions can be limiting when underlying relationships are complex or nonlinear (Kuchibhotla et al., 2019). SVR extends the concept of support vector machines to regression, mapping data into a higher-dimensional space using kernel functions, which enables it to capture nonlinear relationships but at the cost of increased sensitivity to kernel selection and reduced scalability for larger datasets (Kar et al., 2024). Random Forest Regression, in contrast, is a tree-based ensemble method that aggregates predictions from multiple decision trees built on bootstrapped data samples. This approach allows the model to capture nonlinear dependencies, remain robust against noise and outliers, and limit overfitting through random feature selection and bagging. Furthermore, Random Forests provide interpretable measures of feature importance, which is particularly valuable in clinical research contexts (Bentéjac et al., 2021). Finally, XGBoost is a gradient-boosting algorithm that sequentially constructs decision trees, with each tree fitted to the residuals of the previous ones. While XGBoost is efficient and powerful in modelling nonlinearities, it requires careful hyperparameter optimization to balance predictive accuracy with overfitting risk.
In our experiments, Random Forest Regression achieved the best predictive performance, which may be explained by its robustness, ability to model complex patterns in the data, and interpretability through feature importance (Bentéjac et al., 2021).
The modelling was conducted on both age groups: (1) whole cohort; (2) adults. To evaluate model performance, we employed k-fold cross-validation (k = 5) with shuffled data. Specifically, the dataset was randomly shuffled prior to splitting, ensuring that each fold contained a representative sample of the overall distribution. This approach prevented potential biases related to data ordering or temporal trends. For each iteration, k–1 folds were used for model training and the remaining fold for testing, with the process repeated until all folds served as a test set. The final performance metrics were averaged across folds to provide a robust internal validation of model performance. Model performance is reported in terms of mean absolute error (MAE), mean square error (MSE), root mean square error (RMSE), and mean absolute percentage error (MAPE). As the Random Forest models had the best performance of all tested approaches, an additional Feature Importance Analysis was performed to find the keywords that are most predictive of the monthly number of suicide attempts. Feature Importance Analysis in ML refers to techniques that assign scores to input features based on their contribution to the model's predictive performance (Vieira-Manzanera et al., 2025).
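The validation procedure above can be illustrated with a dependency-free sketch of shuffled 5-fold cross-validation and the four error metrics. It is not the study's pipeline: a one-variable least-squares line stands in for the four regressors, the data are synthetic, and all function names are hypothetical.

```python
import random
from math import sqrt


def error_metrics(actual, predicted):
    """MAE, MSE, RMSE and MAPE (in percent) for paired observations."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mape = 100 * sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n
    return {"MAE": mae, "MSE": mse, "RMSE": sqrt(mse), "MAPE": mape}


def shuffled_kfold(n_samples, k=5, seed=0):
    """Yield (train, test) index lists after shuffling; each index is tested once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


def fit_line(x, y):
    """One-variable least squares; a stand-in for the regressors in the study."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    intercept = mean_y - slope * mean_x
    return lambda value: intercept + slope * value


def cross_validate(x, y, k=5, seed=0):
    """Train on k-1 folds, test on the held-out fold, average metrics over folds."""
    averaged = {}
    for train, test in shuffled_kfold(len(x), k, seed):
        model = fit_line([x[i] for i in train], [y[i] for i in train])
        scores = error_metrics([y[i] for i in test],
                               [model(x[i]) for i in test])
        for name, value in scores.items():
            averaged[name] = averaged.get(name, 0.0) + value / k
    return averaged


# Synthetic, perfectly linear data: the errors should be essentially zero
rsv = list(range(20))
counts = [1000 + 10 * v for v in rsv]
cv_scores = cross_validate(rsv, counts)
```

MAPE divides by the observed value, so it is only defined when no observed count is zero, which holds for national monthly totals of the size reported here.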
All statistical analyses and ML modelling were performed in Python 3.10 with the Pandas 2.1.3 and Scikit-learn 1.2.1 libraries, as well as Power BI. The threshold of two-sided statistical significance was set at p < 0.05 (5 %).
Results
General population
Pearson Correlation Coefficients (PCC) for all terms are presented by term category (Figures S1A-E) and for the best predictors (Fig. 1a). Scatter plots for every category group are presented in Supplementary Materials (Figures S3A-E). The category most strongly correlated with the suicide attempts count is the suicide-prevention group (PCC=0.79), and the weakest are the suicide-seeking and psychosis categories (PCC=−0.12) (Figures S3A-E). Terms whose RSVs correlate strongly with the number of suicide attempts are: psychiatrist (PCC=0.84) and antidepressants (PCC=0.80) from the suicide-prevention category, anxiety disorder (PCC=0.81) from the suicide-symptoms category, and alcohol (PCC=0.80) and social isolation (PCC=−0.76) from the suicide-triggers category. Only three terms show a negligible correlation with the number of suicide attempts: bipolar disorder and depression (both with PCC=0.03) and sexual abuse (PCC=0.09). The strongest correlations between two terms within one category are between psychiatrist and antidepressants (PCC=0.89) and between alcoholism and social isolation (PCC=0.87).
Sixteen terms are classified as best predictors for ML modelling of the number of all suicide attempts in the general population (Table 2). These terms are: anxiety disorder, psychiatrist, cannabis, alcohol, antidepressants, separation, social isolation, poison, divorce, psychosis, suicide, burnout, delusion, stress, pain, alcoholism.
Best predictors for each cohort: terms whose RSVs correlate with the number of suicide attempts at a Pearson correlation coefficient (PCC) ≥ 0.5.
Table 3 summarises model performance after 5-fold cross-validation for all suicide attempts regardless of age, and Fig. 2 presents scatter plots for the analysed ML models. The model with the best predictive performance is Random Forest Regression, with a Pearson Correlation Coefficient (PCC) of 0.909 and a MAPE of only 6.78 %, versus the SVR model with a PCC of 0.644 and a MAPE of 14.8 %. The XGBoost and Linear Regression models do not exceed a PCC of 0.900, with values of 0.881 and 0.883 respectively, but still achieve low MAPEs of 7.36 % and 7.53 %. In terms of RMSE, Random Forest Regression scores 76.2, Linear Regression 84.7, XGBoost 86.4 and SVR 173. In the Feature Importance analysis for the Random Forest Regression model, two terms stand out: anxiety disorder and psychiatrist (Fig. 3a).
Presentation of error measures (MAE – mean absolute error, MSE - mean square error, RMSE - root mean square error, MAPE - mean absolute percentage error), PCC - Pearson Correlation Coefficient, predicted values and actual value among different machine learning models of suicidal attempts rate in the general population.
A heatmap presenting the magnitude of correlation between the RSVs of the best predictors and the number of suicide attempts by adults is shown in Fig. 1a, and all terms grouped in categories are shown in Figures S2A-E. Scatter plots for every category group are presented in Supplementary Materials (Figures S4A-E). The category most strongly correlated with the suicide attempts count is the suicide-prevention group (PCC=0.74), and the weakest are the suicide-seeking and psychosis categories (PCC=−0.17) (Figures S4A-E). Terms with a strong correlation are: psychiatrist (PCC=0.74) and antidepressants (PCC=0.70) from the suicide-prevention category, anxiety disorders (PCC=0.70) from the suicide-symptoms category, and social isolation (PCC=−0.75) and alcohol (PCC=0.73) from the suicide-triggers category. A negligible correlation between RSVs and suicide attempts was found for two terms: bipolar disorder and depression (PCC=0.03).
Thirteen terms are marked as best predictors for ML modelling of suicide attempts by adults. All of them meet the PCC ≥ 0.5 threshold (Table 2). These terms are: anxiety disorder, social isolation, psychiatrist, alcohol, cannabis, separation, poison, antidepressants, suicide, pain, delusion, alcoholism, divorce.
For predicting suicide attempts by adults, the model with the highest PCC is Random Forest Regression, achieving a PCC of 0.853 and a MAPE of 7.21 % (Table 4 and Fig. 4). XGBoost and Linear Regression also achieve high PCCs of 0.822 and 0.809, with similarly low MAPEs of 8.01 % for XGBoost and 7.64 % for Linear Regression. SVR performs worst, with a PCC of 0.696, a MAPE of 12.3 % and an RMSE of 129. The remaining models achieve RMSEs of 82.0 or lower: Random Forest Regression 71.1, XGBoost 79.6 and Linear Regression 82.0. The terms with the highest importance for Random Forest Regression are anxiety disorder, social isolation and psychiatrist, each with an importance higher than 0.100.
Presentation of error measures (MAE – mean absolute error, MSE - mean square error, RMSE - root mean square error, MAPE - mean absolute percentage error), PCC - Pearson Correlation Coefficient, predicted values and actual values among different machine learning models of suicidal attempts rate in the adults cohort.
This study presents one of the first ML approaches to predicting suicide attempts for the general population and age-divided cohorts in Poland using Google Trends RSVs. Based on the available data, (1) Random Forest Regression is the most suitable model, with the highest PCCs and MAPEs below 10 % for both the general and adult cohorts; (2) the terms anxiety disorder and psychiatrist are the most important predictors of the suicide attempt rate in the general population, with social isolation additionally important for the adult cohort.
The most accurate ML model is Random Forest Regression, achieving a PCC of 0.909 and a MAPE of 6.78 % for the general population, and a PCC of 0.853 with a MAPE of 7.21 % for adults. These results suggest relatively strong predictive performance. For context, D. Choi et al. (2020) reported lower accuracy (PCC = 0.721; MAPE = 7.85 %) when predicting fatal suicide attempts in the U.S. population using Google Trends RSVs. More recently, Sumner et al. (2023) reported a notably lower error rate (MAPE = 3.86 %), though their models relied on Google Symptom Search data, which provided a broader and more granular set of predictors (422 symptoms and conditions, including mental health–related terms). Importantly, because those studies modelled suicide deaths rather than attempts, and in some cases used distinct data sources, the performance metrics are not directly comparable to ours but can still provide a useful reference point. Classical statistical models such as VAR and ARIMA have also been applied in this field (Barros et al., 2019; Bojanić et al., 2022; Taira et al., 2021). While such approaches differ conceptually from ML and often target different populations and outcomes, they similarly highlight the potential of combining traditional epidemiological indicators with internet-derived data to improve predictive accuracy.
Interestingly, outstanding terms from the Feature Importance Analysis overlap with terms presenting the strongest correlations. In the general population, only 2 terms emerge as significant in the Feature Importance Analysis: anxiety disorder and psychiatrist. Both also stand out in the adult cohort analysis. Additionally, social isolation appears as a significant term in the adult cohort. The intrinsic correlation analysis reveals a strong correlation between RSVs for psychiatrist and antidepressants as well as between alcoholism and social isolation (PCC>0.7). While these correlations may help explain why such terms also appeared as important in the Feature Importance Analysis, it is essential to emphasize that correlation does not imply causation, particularly in the context of internet search behaviour and suicide risk.
The high importance of anxiety disorder in the most effective ML model is in line with a previous correlation analysis between the US suicide rate and the RSVs of Google Trends terms (Lee, 2020). Lee reported generalised anxiety disorder and anxiety disorder to be significantly correlated with monthly suicide rates in the USA (Lee, 2020). Moreover, recent cross-sectional and case-control studies confirmed that individuals with anxiety symptoms had increased odds of suicide attempts (Asnake et al., 2025; Busby Grant et al., 2023; Zhang et al., 2019). Anxiety disorder as a comorbidity also increases suicide risk, as the coexistence of any anxiety disorder with a mood disorder was associated with a higher likelihood of suicide attempts compared to a mood disorder alone (Sareen et al., 2005). The term psychiatrist, from the suicide-prevention category, may indicate a growing tendency among people at risk of suicide to seek psychiatric support, a finding that is relevant to suicide prevention strategies. This trend, monitored through RSVs, could be partly explained by the online environment, which potentially mitigates stigma more effectively than traditional, in-person consultations. Social isolation as a significant risk factor for suicide attempts in ML modelling is also well founded in real-world data. A recent study among Asian American students indicated significantly higher ORs for suicidal ideation, plans or attempts when social isolation was also reported (Oh et al., 2025). Moreover, social isolation can be perceived as a complex psychosocial phenomenon and can be determined by different variables, including increased media consumption, single relationship status, and living alone (Motillon-Toudic et al., 2022).
In these various contexts, social isolation was also identified as a significant suicide risk factor that should be considered in future suicide prevention strategies (Motillon-Toudic et al., 2022).
Nonetheless, the relatively low correlations observed for certain terms traditionally associated with suicide risk, such as depression or bipolar disorder, might reflect differences in search behaviour, underreporting, or cultural and linguistic nuances in term usage, rather than their true absence as risk factors. It may be hypothesised that individuals experiencing anxiety disorders engage with the Internet more frequently and generate higher RSVs than those with depressive disorders. This is likely due to the characteristic symptoms of depression, such as anhedonia and psychomotor slowing. This phenomenon has also been noted in the recent literature: no significant correlation was found for depression in the general populations of the US, Germany, Switzerland and Austria, or in the adolescent population of South Korea (W.-S. Choi et al., 2023; Lee, 2020; Tran et al., 2017).
Future perspectives and limitations
A key strength of our model is its ability to address the unmet need for adequate reporting of suicide attempts. The Polish National Health Programme for 2021–2025, developed by the Polish Ministry of Health, includes the establishment of a suicide prevention strategy as one of its objectives. Specifically, Aim No. 8 of this strategy focuses on creating an effective suicide surveillance system, as data from the National Police are collected for other purposes, and reports from Statistics Poland (https://stat.gov.pl/en/) are presented with significant delays (Polish National Health Programme, 2024a). The effective performance of our ML model, which relies on indirect data such as RSVs and achieves low error measures, suggests its potential use for generating monthly suicide attempt reports. This could enable faster reporting compared to the existing Statistics Poland reports.
Moreover, identifying the best predictors may support healthcare technology assessments by highlighting risk factors that should be monitored in online environments. If the terms anxiety disorders, psychiatrist and social isolation are not currently included among the risk-factor keywords used by search engines, they should be added so that information about suicide helplines is promoted to reduce suicidality. Assessment of factors associated with suicide risk is also included in the suicide prevention strategy of the Polish National Health Programme for 2021–2025 as Aim No. 9 (Polish National Health Programme, 2024b). Observation of continuous trends in predictors may also be used for future corrections of national suicide prevention strategies, and may indicate which mental healthcare areas should be addressed first. Additionally, identifying the most accurate predictors through ML modelling may be valuable for clinicians in detecting red flags, particularly in resource-limited settings. This applies especially to GPs, whose time per patient visit is strictly limited, making it difficult to conduct comprehensive psychiatric interviews focused on suicidal ideation.
However, this study has several limitations. Firstly, the precision of correlations between age-divided cohort rates and RSVs is uncertain, as age-specific RSVs cannot be obtained through Google Trends. Moreover, Google’s data collection algorithm changed twice during the study period (in 2016 and 2022), which may affect data validity. Confounding factors, such as media coverage of celebrity suicides or films containing the analysed terms in their titles, were not examined in this study. Because keyword RSVs were not analysed with lags, trends related to the delay between keyword searches and suicide attempts may also have been overlooked (Bojanić et al., 2022; Tran et al., 2017). There is also ongoing debate about the validity of correlations between RSVs and mental health status, as Knipe et al. found no evidence of an association between self-reported anxiety, self-harm, and Google Trends RSVs (Knipe et al., 2021).
Another confounding factor is the source of suicide attempts data. Police records do not capture the full volume of suicide attempts, as only those officially reported to the police are included in statistics (Waszak et al., 2024). Moreover, the method of reporting by the police in Poland was changed in 2017 to report suicide attempts more accurately, as it was believed that suicide attempts in Poland had been underestimated for many years (Gawliński et al., 2020).
Another limitation of our study concerns the level of population stratification. The models were developed for two aggregated groups: the general population and adults. Adults were selected as a distinct subgroup because they represent the vast majority of recorded suicide attempts, ensuring that the analyses capture most of the relevant cases. However, we were unable to further stratify the analyses by sex or other demographic factors. Given that suicide risk factors may differ between males and females, the absence of gender-specific modelling may limit the applicability of our findings. Moreover, Google Trends does not provide demographic breakdowns of search volumes, which constrains the ability to incorporate such subgroup analyses. Future research should seek access to more granular datasets to enable demographic-specific modelling and provide more tailored insights into suicide risk prediction.
A further limitation of our model is the lack of differentiation between fatal and non-fatal suicide attempts and the absence of finer age-stratified cohorts. These distinctions are crucial, as the underlying risk factors, intent, and clinical implications might differ between these groups (Nielsen et al., 2023; Zuidersma et al., 2025). Failure to account for this variability may lead to models that overgeneralize risk, potentially reducing their effectiveness in guiding targeted interventions and resource allocation. Addressing these limitations through more refined cohort stratification and outcome differentiation could enhance the predictive accuracy and clinical utility of ML-based suicide risk assessments.
Additionally, the resolution of our data may impact modelling accuracy. Suicide attempt rates and RSVs were analysed at a monthly resolution, whereas weekly or daily analyses might yield more precise correlations and enhance ML model performance. Daily analyses could enable near real-time predictions, allowing for rapid intervention in suicide prevention. Our ML models were based solely on RSVs as input variables. Expanding the dataset to include social media trends or economic indicators could improve model performance, as shown in Choi et al., where incorporating diverse input variables improved MAPE by over 3 % compared to models using only Google data (Choi et al., 2020).
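For reference, the MAPE figures compared above follow the standard definition: the mean of the absolute errors divided by the observed values, expressed as a percentage. A minimal sketch is given below; the function name `mape` and the example counts are hypothetical and are not taken from the study data.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an observed value is zero")
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)

# Hypothetical monthly counts, for illustration only.
observed = [400, 500, 450]
forecast = [380, 525, 450]
error_pct = mape(observed, forecast)  # errors of 5 %, 5 % and 0 %, averaged
```

Because the error is scaled by the observed count, MAPE can be compared across cohorts of different sizes, which is why it suits a comparison between the general population and the adult cohort.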
Lastly, our study lacks external validation. While k-fold cross-validation provides reliable internal validation, testing on an independent dataset would be necessary to fully assess the generalizability of the models. Unfortunately, we did not have access to an external dataset. Our data encompassed the entire population of Poland, making internal regional separation unfeasible. Moreover, obtaining data from another country would not have been methodologically appropriate, given that psychiatric care is strongly influenced by country-specific diagnostic practices, healthcare system structures, and cultural contexts. Future research should aim to validate these findings using independent datasets when available.
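To make the internal validation scheme concrete, the sketch below shows how k-fold cross-validation partitions a monthly series so that every observation is used for testing exactly once. It is an illustration under stated assumptions: the helper `k_fold_indices` is hypothetical, k = 5 is an assumed value (the number of folds is not given here), and 132 corresponds to the 11 years of monthly data (2013–2023).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    # The first (n_samples % k) folds receive one extra observation.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

n_months = 132  # 11 years x 12 months
folds = list(k_fold_indices(n_months, k=5))  # k=5 is an assumption

for train_idx, test_idx in folds:
    # A model would be fitted on train_idx and scored on test_idx here.
    assert not set(train_idx) & set(test_idx)
```

All folds are drawn from the same national dataset, so this procedure estimates performance on unseen months from the same source. It cannot replace testing on an independent dataset, which is the gap noted above.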
Conclusions
This study presents one of the first ML approaches to predicting suicide attempts in the general population and the adult cohort based on Google Trends RSVs. Random Forest Regression emerged as the most suitable model, achieving the highest PCCs and MAPEs below 10 % for both cohorts. The terms anxiety disorders and psychiatrist were the most important predictors of the suicide attempt rate in the general population, with social isolation being an additional predictor for the adult cohort. The findings align with the biopsychosocial model, highlighting anxiety and social disconnection as contributing factors to suicidal crises. Therefore, they support the use of these models for psychosocial interventions aimed at addressing anxiety, reducing social disconnection, and enhancing individual resilience. Further research on ML modelling with higher-resolution data, such as weekly or daily trends, is needed to enable near real-time interventions as part of rapid-response suicide prevention strategies.
Funding
The publication costs were covered by the Medical University of Gdansk (“Medical University of Gdansk Excellence Initiative – Research University” program).
Consent statement/Ethical approval
Not required.
Data availability
The data that support the findings of this study are available from the corresponding author, MW, upon reasonable request.
The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: MW, ZK, WN, MS, EP, AK, PW have no conflict of interest. WJC has received within the last three years grants from: Acadia, Angelini, Beckley Psytech, GH Research, HMNC Brain Health, IntraCellular Therapies, Janssen, MSD, Neumora, Novartis, Otsuka, Recognify Life Sciences. He has also received honoraria from: Angelini, GH Research, Janssen, Novartis. He is a member of advisory boards: Douglas Pharmaceuticals, GH Research, Janssen, MSD, Novartis.









