Hepatitis B (HBV) is a prevalent chronic illness affecting approximately 254 million individuals worldwide, with China accounting for nearly one-third of cases. Despite its widespread impact, stigma associated with HBV significantly hinders access to testing, diagnosis, and treatment. This study investigates the relationship between HBV stigma and cognitive distortions among individuals living with HBV by analyzing 35,697 posts from Yiyou Forum, China's largest HBV online community. Utilizing a large language model (LLM) for stigma classification, posts were categorized into stigma-related (S-posts) and non-stigma-related (N-posts). A schema comprising 235 n-grams was employed to identify 12 types of cognitive distortions within these posts. Statistical analyses revealed that S-posts had a prevalence ratio (PR) of 1.824 (95%CI [1.636, 2.074]) for cognitive distortions compared to N-posts, indicating that distorted thinking patterns were approximately 1.8 times more common in stigma-related discussions. Specific distortions such as disqualifying the positive, labeling and mislabeling, mental filtering, and should statements were significantly more prevalent in S-posts. User-level analysis confirmed that individuals engaging in stigma-related posts consistently displayed higher levels of cognitive distortions. These insights underscore the potential of targeted cognitive-behavioral therapy (CBT) interventions to address and mitigate cognitive distortions, thereby alleviating the psychological burden of HBV stigma. Additionally, this study demonstrates the efficacy of advanced computational methods in psychological research.
Hepatitis B (HBV) is a widespread chronic illness affecting millions globally. According to the World Health Organization (WHO), approximately 254 million people are living with chronic HBV infection, with an annual incidence of 1.2 million new cases (WHO, 2022a). If left untreated, it can progress to cirrhosis and liver cancer, posing significant health risks and potentially leading to severe complications or death. Notably, in China, the number of individuals affected by HBV is substantial, with an estimated 87 million individuals, accounting for nearly one-third of the global total (WHO, 2022b). However, less than 25% of these individuals have been accurately diagnosed, and among those who require treatment, only about 10% receive appropriate care (WHO, 2022b).
One major barrier preventing individuals with HBV from accessing testing, diagnosis, and treatment is the stigma surrounding the disease (Tu, 2022). HBV stigma refers to the negative stereotypes and discriminatory behaviors directed toward people living with HBV (Harris et al., 2021). Due to the perceived contagiousness of the condition, individuals with HBV often face stigmatizing experiences, such as restrictions in employment and education, social isolation, and challenges in forming intimate relationships (Lee et al., 2016; Jin et al., 2022). Many individuals hold prejudicial views against people with HBV, regarding them as pathological, unhealthy, or even linking their condition to immoral behaviors (Huang et al., 2016). These stigmatizing attitudes are largely rooted in misunderstandings about HBV transmission and a general lack of knowledge about the disease (Mokaya et al., 2018). For instance, in rural areas of China, more than half of the population fears contracting HBV through casual contact with infected individuals, thus exacerbating discriminatory practices (Yu et al., 2015).
The impact of such stigma is profound, severely affecting the psychological health, quality of life, and overall well-being of people living with HBV (Freeland et al., 2021; Mokaya et al., 2018; Dowsett et al., 2017). Research consistently indicates that, compared to the general population, those with HBV report significantly higher levels of anxiety and depression (Martyn et al., 2023; Li et al., 2020).
To mitigate the psychological effects of stigma on people living with HBV, Cognitive Behavioral Therapy (CBT) has been the primary intervention, demonstrating promising efficacy (Soares & Silva, 2024). For instance, Shareh and colleagues (2017) conducted a Randomized Controlled Trial (RCT) revealing that CBT interventions significantly reduce depressive symptoms in individuals with HBV.
A key mechanism underpinning the effectiveness of CBT lies in its focus on identifying and modifying cognitive distortions, which are deeply rooted patterns of maladaptive thinking. The concept of cognitive distortions was first introduced by Beck (1963). Building on Beck's theory, Burns (1981) proposed ten types of cognitive distortions, such as mental filtering and disqualifying the positive. These cognitive distortion types have been widely applied within the framework of CBT (e.g. Persons et al., 2023; Goodarzi et al., 2023). With the growing body of research into cognitive distortions, two additional types have emerged: personalizing and fortune-telling (e.g., Roberts, 2015; Leon, 2022). Prior studies have demonstrated that these cognitive distortions are closely associated with mental health issues (e.g., Persons et al., 2023). For instance, fortune-telling has been linked to the severity of anxiety and depressive symptoms (Jha et al., 2022), while personalizing has been identified as a maintaining factor in anxiety disorders (Kuru et al., 2018).
In the Chinese context, meta-analyses have consistently confirmed the effectiveness of CBT in treating mental health disorders (Li et al., 2024; Ng & Wong, 2018). Furthermore, culturally adapted CBT has shown greater efficacy compared to non-adapted CBT approaches (Ng & Wong, 2018; Ding et al., 2018). Within Chinese CBT practices, compared to behavioral interventions such as exposure therapy, identifying and restructuring cognitive distortions is considered more culturally compatible (Man et al., 2024). Research suggests that Burns’ cognitive distortion types are largely applicable in the Chinese context, but their relative importance may differ from that in Western cultures (Qian et al., 2021). For example, given the collectivist nature of Chinese society, the prominence of interpersonal relationships means that cognitive distortions related to others, such as mental filtering, may be more pronounced in therapy, whereas avoidance of emotional expression may lead distortions tied to emotions, such as emotional reasoning, to have a lesser impact (Fang & Chung, 2019). Additionally, the cultural emphasis on “mianzi” (face)—a concept referring to an individual's social worth and the respect they receive from others in Chinese culture—may exacerbate tendencies toward social comparison-driven distortions such as labeling and catastrophizing (Lauw, 2016).
The effectiveness of CBT on stigma suggests that stigma induces distorted cognitions in individuals living with HBV, which further affect their emotions, leading to psychological issues such as anxiety and depression (Beck, 2020). For example, an individual ostracized due to HBV might think, “No one wants to interact with me” (overgeneralizing) or “I am an unaccepted person” (personalizing). CBT interventions help individuals identify these cognitive errors and biases, gradually changing these negative thought patterns through cognitive restructuring, thereby improving emotional and behavioral responses (Beck, 2020).
However, the key assumption behind CBT interventions for HBV stigma—that stigma leads to cognitive distortions in individuals—has not yet been directly validated through studies examining the language used by real-world individuals living with HBV. Previous research has demonstrated that language serves as an effective tool for investigating cognitive distortions (Shreevastava & Foltz, 2021). For instance, Bollen and colleagues (2021) conducted a text analysis of books and identified trends in the evolution of human cognitive distortions over the past 125 years. At the individual level, prior studies have employed text analysis methods to identify cognitive distortions in individuals with depression (Bathina et al., 2021), anxiety (Basha, 2015), and substance addiction (Sripada, 2022). In this study, we aim to analyze a large corpus of posts from HBV forums to examine whether stigma leads to increased cognitive distortions among individuals with HBV. The forum posts represent spontaneous expressions from people living with HBV, reflecting their real-life thoughts and feelings. This natural language data can more accurately capture cognitive distortions and negative automatic thoughts. Prior studies have used online text to analyze stigmatization related to dementia (Johnson et al., 2020), depression (Bathina et al., 2021), and HIV (Dong et al., 2019), but few have explored HBV stigma using real-world textual data.
Research focusing on stigma detection has traditionally relied on different methodological approaches. Early work often employed qualitative content analysis of online texts (e.g., Moore et al., 2016; Ho et al., 2017), enabling a rich understanding of stigma manifestations and social dynamics but requiring intensive manual effort and offering limited scalability. More recently, computational strategies have been introduced to process large volumes of textual data, employing various approaches including text classification through support vector machines (e.g., Li et al., 2018), neural network-based text processing (e.g., Saha et al., 2019), BERT-based contextual representations (e.g., Ng et al., 2022), and semantic analysis using word embeddings (e.g., Charlesworth & Hatzenbuehler, 2024). Most recently, large language models (LLMs), such as ChatGPT, have emerged as a novel method for language understanding and processing in health research and have been applied in tasks like depression detection (Pérez et al., 2023) and health outcome prediction (Jin et al., 2024). In previous research, LLMs have demonstrated the ability to capture complex semantic relationships and contextual nuances in text (He et al., 2024), which potentially enables a more nuanced understanding of the language used in stigma-related content. Hence, as illustrated in Fig. 1, our study uses an LLM to detect HBV stigma. In the identification process, HBV stigma is conceptualized primarily as externally manifested stigma stemming from societal attitudes, behaviors, and institutional policies that negatively target individuals living with HBV. While such external stigma may be internalized by people living with HBV over time, evolving into self-stigma (Ho et al., 2017), our operational definition and categorization framework focus on identifying the presence of overt, external stigma-related content within online forum posts.
After detecting stigma in HBV forum texts, we defined 12 common types of cognitive distortions (see Table 1) and used semantic schemas to identify these distortions within the texts. Semantic schemas are defined by a set of context-independent sequences of N words or characters (n-grams), encoding the semantics of thought patterns, i.e., the cognitive distortions hypothesized by CBT, rather than single terms or features. We compared the prevalence of 235 cognitive distortion schemata (CDS) in posts reflecting HBV stigma (S-posts) and non-stigma-related (N-posts) texts to verify whether stigma induces more cognitive distortions among individuals with HBV. Additionally, to validate the robustness of our findings, we also examined whether the results could be explained by factors such as random variations within the text sample, the CDS n-grams selected, and the emotional load of the CDS set, as the emotional tone of the text may influence the detection of cognitive distortions. Emotional content may skew the analysis by either amplifying or dampening the perceived prevalence of certain cognitive distortions (Bathina et al., 2021).
Categories of Cognitive Distortion n-grams.
Note. Number of n-grams indicates the number of n-grams used in the CDS.
The data, analysis code, and research materials are publicly available. Data were analyzed using R (version 4.3.2) and Python (version 3.7.0).
Data
Data were sourced from Yiyou Forum, the largest online platform dedicated to HBV in China. This platform serves as a crucial space where patients share their experiences, seek advice, and discuss their conditions openly, making it an ideal source for examining the psychological and social narratives of living with HBV. Using Python's Scrapy package, we retrieved all available posts from the Yiyou Forum's archive, covering the period from January 2013 to December 2023. The data extraction process was conducted between December 1, 2023, and December 7, 2023, yielding a total of 35,697 posts from 11,247 users.
The demographic information of users is presented in Table 2. We preprocessed the posts by removing duplicate entries and excluding non-textual elements such as URLs, HTML tags, and system-generated texts.
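The preprocessing step described above can be illustrated with a minimal sketch. It assumes posts are available as plain strings; the regular expressions and the function names `clean_post` and `preprocess` are illustrative choices, not the authors' code, and the removal of system-generated text is omitted because its patterns are not described in the text.

```python
import re
from typing import List

# Illustrative patterns (assumptions, not taken from the study's code).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<[^>]+>")


def clean_post(text: str) -> str:
    """Strip HTML tags and URLs, then collapse runs of whitespace."""
    text = TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def preprocess(posts: List[str]) -> List[str]:
    """Clean every post and drop exact duplicates, keeping the first copy."""
    seen = set()
    cleaned = []
    for raw in posts:
        text = clean_post(raw)
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned
```

Deduplicating after cleaning means that two posts differing only in markup or embedded links are treated as the same entry, which may or may not match the study's definition of a duplicate.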
Demographic Information of Forum Users.
Note: "Confidential" means that users chose to keep their personal information private and it is not accessible.
All collected data were publicly available and do not involve any personal or private information about the users. The forum posts are accessible to anyone browsing the website, and our data collection strictly adhered to ethical guidelines to respect the privacy and anonymity of the forum users.
Classification using large language models
To classify the texts, we employed the LLM GLM-4, which is proficient in Chinese language processing (GLM Team, 2024). The model was used to categorize the forum posts into two categories: those that reflect HBV stigma (S-posts) and those that do not (N-posts). In this study, HBV stigma refers to explicit references, narratives, or sentiments within the forum posts that signal societal-level prejudice or discriminatory attitudes toward individuals with HBV. To ensure a systematic and transparent classification process, we employed a predefined prompt that explicitly listed various forms of HBV stigma. The prompt was developed through several iterations. Initially, we designed several versions of the prompt in collaboration with an expert in HBV stigma and a team of doctoral and master's students in health psychology. After conducting preliminary tests on a small sample of posts, we evaluated the effectiveness of each prompt version in identifying stigma-related content in the posts. We refined the prompt based on the results, and this iterative process culminated in the final version of the prompt. A detailed version of the final prompt is provided in Supplementary Material S2. The prompt instructed the model to identify any mention of social prejudice, discriminatory narratives, or restrictive practices directed against individuals with HBV. For instance, the prompt highlighted scenarios including, but not limited to, mentions of denied employment or educational opportunities, distancing or exclusion from social circles, romantic relationship barriers, healthcare discrimination, and difficulties in obtaining insurance due to HBV status. Each post was input into the LLM with the standardized prompt. The model returned a binary label indicating whether the post contained stigma-related content.
To validate the accuracy of the model's classification, we randomly selected 5,292 posts from the dataset by sampling a random number (30–50) of posts per month across the entire time span, ensuring temporal representativeness while maintaining randomness. Two human annotators, both master's students in health and clinical psychology, independently labeled these posts. Before coding, the two annotators collaboratively developed a detailed guideline (Supplementary Material S3) to ensure consistency and accuracy in identifying HBV stigma, including the operational definition of HBV stigma and typical manifestations and examples in online discourse. After coding, inter-rater reliability was assessed using Cohen's κ, which was 0.962, indicating a high level of agreement between the annotators. Any discrepancies during the coding process were resolved through discussion until a consensus was reached. The resulting human-annotated subset served as a reference standard against which we evaluated the model's performance. For evaluation, we used common metrics for model performance: accuracy (proportion of posts correctly classified by the model), precision (proportion of posts labeled as S-posts by the model that were actually stigma-related according to human annotators), recall (proportion of human-identified S-posts that were correctly captured by the model), and F1-score (the harmonic mean of precision and recall for S-posts, summarizing the model's ability to both identify S-posts accurately and minimize missed S-posts). Table 3 presents the contingency table and the model's performance according to these metrics.
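For readers unfamiliar with the agreement statistic used above, the following is a minimal sketch of Cohen's κ for two annotators. The function name `cohens_kappa` and the toy label lists are hypothetical and are not the study's annotations.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement if both raters labeled independently at their own rates.
    freq_a, freq_b = Counter(a), Counter(b)
    labels = set(freq_a) | set(freq_b)
    expected = sum(freq_a[k] * freq_b[k] for k in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels (1 = stigma-related, 0 = not) for ten posts.
rater_1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
rater_2 = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
kappa = cohens_kappa(rater_1, rater_2)
```

In this toy case the raters agree on 9 of 10 posts, yet κ is lower than 0.9 because part of that agreement is expected by chance.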
Contingency Table and Classification Performance Metrics.
| Category | TP | FP | FN | TN | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|---|---|---|
| S-posts | 1360 | 83 | 124 | 3725 | 0.961 | 0.942 | 0.916 | 0.929 |
| N-posts | 3725 | 124 | 83 | 1360 | 0.961 | 0.968 | 0.978 | 0.973 |
Note. For each row, the named category is treated as the positive class. TP (True Positive): The number of positive posts correctly predicted as positive. FP (False Positive): The number of negative posts incorrectly predicted as positive. FN (False Negative): The number of positive posts incorrectly predicted as negative. TN (True Negative): The number of negative posts correctly predicted as negative.
The model's performance metrics demonstrate its reliability in identifying posts that reflect HBV stigma. With an accuracy of 0.961, the model correctly classifies most posts. For S-posts, it achieves a precision of 0.942 and a recall of 0.916, resulting in an F1-score of 0.929. For N-posts, the precision, recall, and F1-score are all above 0.96. This result indicates a strong overall ability to distinguish between the two categories, further supported by an area under the ROC curve (AUC-ROC) of 0.947, which highlights the model's robust discriminatory power.
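The metrics in Table 3 follow directly from the four contingency counts. The sketch below recomputes them; the function name `binary_metrics` is an illustrative choice, while the counts are taken from Table 3.

```python
from typing import Dict


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 from a 2x2 contingency table."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Counts from Table 3, with each category in turn treated as the positive class.
s_posts = binary_metrics(tp=1360, fp=83, fn=124, tn=3725)
n_posts = binary_metrics(tp=3725, fp=124, fn=83, tn=1360)
```

Rounded to three decimals, both dictionaries reproduce the values reported in the table.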
Following the validation, we applied the model to the entire dataset. Out of the 35,697 posts, 12,575 (35.23%) were classified as reflecting stigma (S-posts), while 23,122 posts (64.77%) were classified as not reflecting stigma (N-posts). The average length of all posts is 266.21 words, with S-posts averaging 250.70 words and N-posts averaging 274.64 words. Detailed descriptive statistics for both groups of posts are provided in Supplementary Material S1.
Construction of cognitive distortion schema (CDS) n-grams
Cognitive distortions are irrational or exaggerated thought patterns that contribute to negative emotions and behaviors. For example, "mindreading" involves assuming you know what others are thinking, while "catastrophizing" entails expecting the worst possible outcome. Following previous research (Bathina et al., 2021), we identified 12 common types of cognitive distortions, which are explained in Table 1. To analyze cognitive distortions, we utilized a method involving CDS and n-grams. CDS are sequences of words that represent patterns of thought associated with cognitive distortions. These patterns were extracted using n-grams, which are contiguous sequences of n items from a given sample of text or speech. In the context of CDS, n-grams are sequences of words that frequently occur together and can indicate a specific type of cognitive distortion. For instance, the phrase "everyone believes" (a 2-gram, or bigram) is indicative of the mindreading distortion, where a person assumes the beliefs or thoughts of others without evidence. By employing n-grams, we could effectively capture the semantic structure of thought patterns indicative of cognitive distortions in a large corpus of online posts (Bathina et al., 2021).
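As a minimal sketch of how schema n-grams can be matched against a post, the example below counts phrase occurrences per distortion type. The mini-schema is hypothetical: only "everyone believes" comes from the text, the other phrases are invented for illustration, and the study's actual schema consists of 235 Chinese n-grams. Plain case-insensitive substring matching is also an assumption, not a description of the authors' matcher.

```python
from typing import Dict, List

# Hypothetical mini-schema; only "everyone believes" appears in the paper.
SCHEMA: Dict[str, List[str]] = {
    "mindreading": ["everyone believes", "they all think"],
    "catastrophizing": ["will never end", "nothing will ever"],
}


def count_cds(text: str, schema: Dict[str, List[str]] = SCHEMA) -> Dict[str, int]:
    """Count non-overlapping occurrences of each type's n-grams in one post."""
    lowered = text.lower()
    return {
        dtype: sum(lowered.count(ngram) for ngram in ngrams)
        for dtype, ngrams in schema.items()
    }


def total_cds(text: str, schema: Dict[str, List[str]] = SCHEMA) -> int:
    """Total number of CDS n-gram matches in one post, across all types."""
    return sum(count_cds(text, schema).values())


post = "Everyone believes I am contagious, and this will never end. Everyone believes it."
per_type = count_cds(post)
```

Substring matching suits unsegmented Chinese text but can over-match in English, where a word-boundary check would be the safer choice.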
To identify cognitive distortions within the forum posts, we adapted the schemata constructed by Bathina and colleagues (2021). The n-grams were translated into Chinese by individuals with expertise in psychology and advanced proficiency in English to ensure accuracy and cultural relevance. The translated schema was then reviewed and validated by a senior CBT practitioner to ensure its applicability to the context of HBV stigma. To further enhance comprehensiveness and cultural applicability, we consulted two additional CBT practitioners with expertise in stigma-related interventions and one expert specializing in HBV stigma in the Chinese context. Based on their input, the schema was refined and expanded, resulting in a final set of 235 unique CDS n-grams covering 12 types of cognitive distortions. The constructed n-grams range from 2-grams to 8-grams, with an average length of 3.73. Table 1 provides a detailed breakdown of the CDS n-grams used in this study.
Statistical analysis
All statistical analyses were performed using Python, closely following the methodology outlined by Bathina and colleagues (2021). To compare the differences in the number of CDS between the two groups, we calculated the average number of CDS occurrences per post within both S-posts and N-posts. We then used the prevalence ratio (PR), as conceptualized by Bathina and colleagues (2021), to compare the relative prevalence of each cognitive distortion type between the two groups and thereby quantify the extent to which stigma influences the occurrence of cognitive distortions in individuals living with HBV.
Post-level analysis
1. Dataset Division: The dataset was divided into two categories: “Not Reflecting HBV Stigma” (N-posts) and “Reflecting HBV Stigma” (S-posts). Let $T$ represent the entire set of text data (i.e., forum posts). Then $T_N$ denotes the subset of texts classified as not reflecting stigma, and $T_S$ denotes the subset of texts classified as reflecting stigma.
2. CDS Matching Function: We followed Bathina and colleagues (2021) in defining a CDS matching function $F_C(t)$ that maps each unit of analysis $t$ to a measure of the CDS patterns it contains. However, unlike Bathina and colleagues (2021), who utilized a binary indicator ($F_C(t) \in \{0, 1\}$) to denote the presence or absence of CDS patterns, our function counts the number of CDS occurrences per post. This frequency-based measure allows for a more granular assessment of cognitive distortions by capturing both the presence and intensity of CDS patterns within each post. Specifically, the function maps a text $t$ to a non-negative integer indicating the number of CDS patterns present in that text.
3. Prevalence Calculation: For each category, we calculated the average number of CDS patterns per post, referred to as the prevalence $P_C$. The prevalence for N-posts is calculated using the formula:
$$P_C(T_N) = \frac{\sum_{t \in T_N} F_C(t)}{|T_N|}$$
Here, $\sum_{t \in T_N} F_C(t)$ is the total number of CDS patterns identified across all texts in $T_N$, and $|T_N|$ is the total number of texts in $T_N$. Similarly, the prevalence for S-posts is calculated using the formula:
$$P_C(T_S) = \frac{\sum_{t \in T_S} F_C(t)}{|T_S|}$$
Here, $\sum_{t \in T_S} F_C(t)$ is the total number of CDS patterns identified across all texts in $T_S$, and $|T_S|$ is the total number of texts in $T_S$.
4. Prevalence Ratio (PR) Calculation: The PR quantifies the relative prevalence of CDS patterns between the two categories. It is calculated as follows:
$$PR = \frac{P_C(T_S)}{P_C(T_N)}$$
This ratio provides insight into the extent to which CDS patterns are more frequent in S-posts: a value above 1 indicates more CDS patterns per post in S-posts than in N-posts.
5. Bootstrap Estimation: To enhance the robustness of our results and to calculate confidence intervals (CI), we employed the bootstrap resampling method following Bathina and colleagues (2021). This involved repeatedly sampling from our dataset and recalculating the prevalence ratios for each sample. By doing this, we generated a distribution of PR values for each cognitive distortion type, allowing us to compute the median PR and the 95% CI. Specifically, we performed $B = 1000$ bootstrap iterations. For each iteration $i$, we resampled texts from $T_N$ and $T_S$ and recalculated the PR, yielding a distribution of PR values:
$$\{PR_1, PR_2, \ldots, PR_B\}$$
From this distribution of bootstrapped PR values, we calculated the median to represent the central tendency and the 95% CI to quantify the uncertainty. The 95% CI is determined by the 2.5th and 97.5th percentiles of the bootstrapped PR values. To account for the increased risk of Type I errors due to multiple comparisons, we further applied the Bonferroni correction to adjust our CIs. If the resulting 95% confidence interval does not include 1, it indicates a significant difference between the two groups.
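The five steps above can be sketched end to end as follows. This is an illustration under stated assumptions rather than the authors' code: each group is represented by a list of per-post CDS counts, resampling is done with replacement within each group, and the Bonferroni adjustment is implemented by dividing the significance level by the number of comparisons before taking percentiles. The function names and toy counts are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def prevalence(counts: Sequence[int]) -> float:
    """Average number of CDS matches per post."""
    return sum(counts) / len(counts)


def prevalence_ratio(s_counts: Sequence[int], n_counts: Sequence[int]) -> float:
    """Prevalence in S-posts divided by prevalence in N-posts."""
    return prevalence(s_counts) / prevalence(n_counts)


def percentile(sorted_vals: List[float], q: float) -> float:
    """Linearly interpolated percentile of a sorted list, with q in [0, 1]."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def bootstrap_pr(
    s_counts: Sequence[int],
    n_counts: Sequence[int],
    b: int = 1000,
    alpha: float = 0.05,
    n_comparisons: int = 1,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (median PR, lower CI limit, upper CI limit) from b resamples."""
    rng = random.Random(seed)
    prs = []
    for _ in range(b):
        s_star = rng.choices(s_counts, k=len(s_counts))
        n_star = rng.choices(n_counts, k=len(n_counts))
        if sum(n_star) == 0:
            continue  # PR is undefined when the denominator prevalence is zero
        prs.append(prevalence_ratio(s_star, n_star))
    prs.sort()
    adjusted = alpha / n_comparisons  # Bonferroni: assumed form of the adjustment
    return (
        percentile(prs, 0.5),
        percentile(prs, adjusted / 2),
        percentile(prs, 1 - adjusted / 2),
    )


# Hypothetical per-post CDS counts for the two groups.
s_counts = [4, 3, 5, 2, 6, 4, 3, 5] * 10
n_counts = [2, 1, 3, 2, 2, 1, 3, 2] * 10
median_pr, lower, upper = bootstrap_pr(s_counts, n_counts, n_comparisons=12)
```

With `n_comparisons=12` (one per distortion type), the interval runs between the 0.21st and 99.79th percentiles rather than the 2.5th and 97.5th, which widens it accordingly.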
User-level analysis
To further explore how CDS differs between users, we extended our analysis from the post level to the user level. We categorized users into two groups: those who posted at least one S-post (S-users) and those who never did (N-users). Following the post-level CDS quantification method, we aggregated each user's total CDS counts across all their posts. User-level prevalence was calculated as the mean CDS per user within each group, and the PR between S-users and N-users was derived. Statistical significance was assessed using bootstrapped confidence intervals. The detailed methodology is presented in Supplementary Material S5.
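A minimal sketch of the user-level aggregation follows. The record layout `(user_id, is_s_post, cds_count)` and the function name `user_level_pr` are assumptions made for illustration; the full procedure is given in Supplementary Material S5, which is not reproduced here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Assumed record layout: (user_id, is_s_post, cds_count).
Post = Tuple[str, bool, int]


def user_level_pr(posts: Iterable[Post]) -> float:
    """PR of mean per-user CDS totals: S-users over N-users."""
    totals: Dict[str, int] = defaultdict(int)
    s_users: Set[str] = set()
    for user_id, is_s_post, cds_count in posts:
        totals[user_id] += cds_count  # aggregate over all of a user's posts
        if is_s_post:
            s_users.add(user_id)  # one S-post is enough to be an S-user
    s_totals = [c for u, c in totals.items() if u in s_users]
    n_totals = [c for u, c in totals.items() if u not in s_users]
    s_mean = sum(s_totals) / len(s_totals)
    n_mean = sum(n_totals) / len(n_totals)
    return s_mean / n_mean


# Hypothetical records for four users.
records = [
    ("a", True, 3), ("a", False, 1),
    ("b", False, 1),
    ("c", False, 3),
    ("d", True, 4),
]
pr_users = user_level_pr(records)
```

Because totals are summed rather than averaged per user, users who post more contribute larger totals, so under this reading the user-level PR reflects posting volume as well as distortion density.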
Robustness analysis
To test the robustness of our findings, we conducted a series of additional analyses to explore whether our results were influenced by the specific selection of CDS n-grams, the skewed distribution of user post frequencies (where a minority of users contribute disproportionately many posts), or the emotional load of the CDS sets.
To assess the sensitivity of the results to the specific CDS n-grams selected, we repeatedly resampled (with replacement) sets of 235 CDS n-grams to create alternative CDS sets. For each alternative CDS set, we recalculated the PR between S-posts and N-posts, generating a distribution of PR values for each type of cognitive distortion. We then computed the 95% CIs of the resulting PR value distributions to determine the sensitivity of the results to changes in the CDS sets.
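One way to sketch this n-gram resampling is shown below. It assumes that match totals per n-gram have already been computed for each group, and that an n-gram drawn more than once contributes its matches each time it is drawn; both are assumptions for illustration, as are the function names and toy totals.

```python
import random
from typing import Dict, List, Optional


def resampled_pr(
    s_matches: Dict[str, int],
    n_matches: Dict[str, int],
    n_s_posts: int,
    n_n_posts: int,
    rng: random.Random,
) -> Optional[float]:
    """PR for one alternative CDS set drawn with replacement from the schema."""
    ngrams = list(s_matches)
    sample = rng.choices(ngrams, k=len(ngrams))  # same size as the original set
    s_total = sum(s_matches[g] for g in sample)
    n_total = sum(n_matches[g] for g in sample)
    if n_total == 0:
        return None  # PR undefined for this draw
    return (s_total / n_s_posts) / (n_total / n_n_posts)


def pr_distribution(
    s_matches: Dict[str, int],
    n_matches: Dict[str, int],
    n_s_posts: int,
    n_n_posts: int,
    iterations: int = 1000,
    seed: int = 0,
) -> List[float]:
    """Sorted PR values over many alternative CDS sets."""
    rng = random.Random(seed)
    draws = (
        resampled_pr(s_matches, n_matches, n_s_posts, n_n_posts, rng)
        for _ in range(iterations)
    )
    return sorted(pr for pr in draws if pr is not None)


# Hypothetical match totals per n-gram in S-posts and N-posts.
s_matches = {"ngram_a": 30, "ngram_b": 10, "ngram_c": 20}
n_matches = {"ngram_a": 15, "ngram_b": 5, "ngram_c": 10}
dist = pr_distribution(s_matches, n_matches, n_s_posts=10, n_n_posts=10, iterations=200)
```

In the toy data every n-gram is twice as frequent in S-posts, so every resampled set yields the same PR; real data would produce a spread whose percentiles give the confidence interval.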
The average number of posts per user is 3.173 (SD = 11.074, min = 1, max = 374, median = 1), indicating that most users post relatively infrequently, with a small number of users posting a large number of posts. To address the potential skewness in the data and validate the robustness of our results, we randomly selected one post per user to recalculate the prevalence of CDS to reduce the impact of extreme values and ensure that the results are not overly influenced by a small number of highly active users.
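The one-post-per-user selection can be sketched as follows, assuming each post is a tuple whose first element is the user identifier. The function name `one_post_per_user` and the seed are illustrative.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Post = Tuple[str, bool, int]  # assumed layout: (user_id, is_s_post, cds_count)


def one_post_per_user(posts: Sequence[Post], seed: int = 0) -> List[Post]:
    """Pick one post uniformly at random for every user."""
    rng = random.Random(seed)
    by_user: Dict[str, List[Post]] = defaultdict(list)
    for post in posts:
        by_user[post[0]].append(post)
    return [rng.choice(user_posts) for user_posts in by_user.values()]


# Hypothetical records: user "a" has two posts, user "b" has one.
records = [("a", True, 3), ("a", False, 1), ("b", False, 2)]
subsample = one_post_per_user(records)
```

Each user then carries equal weight in the recalculated prevalence, regardless of how many posts they wrote.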
We also tested whether the emotional load of the CDS sets influenced our findings. We employed the iFLYTEK sentiment analysis method (Zhu & Wang, 2021), which is suitable for analyzing the sentiment of online forum texts. We performed sentiment analysis on all texts in both S-posts and N-posts, obtaining sentiment scores ranging from 0 to 1. By comparing the sentiment scores between the two groups, we assessed whether there were systematic differences in emotional tone between S-posts and N-posts that could potentially confound the observed differences in cognitive distortions.
Results
Between-cohort CDS prevalence
Statistical analysis revealed a significant difference in the prevalence of CDS between S-posts and N-posts. The total number of identified CDS patterns was 48,117 in N-posts and 47,734 in S-posts.
To compare the prevalence of CDS patterns between these two categories, we calculated the CDS prevalence by dividing these counts by the total number of posts in each category. This resulted in a CDS prevalence of 2.081 for N-posts and 3.796 for S-posts. The PR between these two categories was calculated to be 1.824, indicating that CDS patterns are about 1.8 times as prevalent in S-posts as in N-posts. The median PR from the bootstrap samples was 1.825, with a 95% CI of [1.636, 2.074]. Because this interval excludes 1, the result suggests that S-posts contain significantly more cognitive distortions per post than N-posts.
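As a worked check of these figures, writing $P_C(T_N)$ and $P_C(T_S)$ for the per-post prevalence in N-posts and S-posts, and using the post counts reported earlier (23,122 N-posts and 12,575 S-posts):

$$P_C(T_N) = \frac{48{,}117}{23{,}122} \approx 2.081, \qquad P_C(T_S) = \frac{47{,}734}{12{,}575} \approx 3.796, \qquad PR = \frac{P_C(T_S)}{P_C(T_N)} \approx \frac{3.796}{2.081} \approx 1.824$$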
CDS prevalence by cognitive distortion type
The prevalence of CDS was further analyzed by specific distortion types to understand how different forms of distorted thinking contribute to the overall cognitive burden experienced by individuals living with HBV. Fig. 2 presents the PR and the 95% CI for each type of cognitive distortion identified in the forum posts.
As shown in Fig. 2, the median PR for all cognitive distortion types is greater than 1, indicating that each type of cognitive distortion is more prevalent in S-posts compared to N-posts. Except for personalizing, whose interval includes 1, the 95% CIs for all cognitive distortion types lie entirely above 1.
Among the different cognitive distortions, personalizing has the lowest median prevalence ratio (PR = 1.301, 95% CI = [0.718, 2.084]). In contrast, labeling and mislabeling has the highest PR of 2.322 (95% CI = [2.153, 2.515]), followed by disqualifying the positive (PR = 2.254, 95% CI = [2.014, 2.577]), mental filtering (PR = 1.990, 95% CI = [1.090, 3.928]), and should statements (PR = 1.975, 95% CI = [1.737, 2.182]).
CDS prevalence by user type
A total of 5,830 users (51.84%) had posted at least one S-post (S-users), while 5,417 users (48.16%) had not posted any S-posts (N-users). Specifically, S-users contributed a total of 26,857 posts (M = 4.61), among which 12,575 were classified as S-posts and 14,282 as N-posts. In contrast, N-users posted a total of 8,840 posts (M = 1.63).
The comparison between posts by S-users and N-users revealed significant differences in the prevalence of cognitive distortions across the two groups. As shown in Table 4, S-users displayed consistently elevated PRs across all identified types of cognitive distortions compared to N-users. Should statements displayed the highest prevalence ratio (PR = 6.262, 95% CI = [4.773, 8.368]), followed by labeling and mislabeling (PR = 4.058, 95% CI = [3.249, 5.051]) and catastrophizing (PR = 3.673, 95% CI = [2.759, 4.994]).
Results of CDS n-grams analysis.
Note. PRA shows the analysis of all posts by cognitive types. PRU shows the result of user-level analysis. PRB shows the result of bootstrap analysis. PRP shows the result of random post selection. LLCI and ULCI represent the lower and upper limits of the Bonferroni-adjusted 95% confidence intervals.
To assess the sensitivity of our results to the specific selection of CDS n-grams, we repeatedly resampled (with replacement) sets of 235 CDS n-grams to create alternative CDS sets. Table 4 presents the median PR values and their corresponding 95% CIs for each type of cognitive distortion. As shown, all categories exhibit a median PR greater than 1, with the 95% CIs also exceeding 1. Among these, magnification and minimization show the highest values, followed by overgeneralizing and should statements. These findings suggest that these cognitive distortions are more prevalent in S-posts compared to N-posts. Moreover, the results remain consistent regardless of the choice of CDS n-grams, further supporting the robustness and reliability of the study's conclusions.
Robustness analysis: random post selection for CDS prevalence
To assess the robustness of our results with respect to the frequency distribution of user posts, we conducted a random selection of one post per user, irrespective of the total number of posts made by each user. The results are shown in Table 4. As shown, the recalculated CDS prevalence remains consistent with the original findings. The 95% CIs for the prevalence values overlap with the original results, suggesting that the observed CDS patterns are not significantly influenced by extreme post counts or by a small group of highly active users.
Robustness analysis: impact of emotional load
By comparing the sentiment scores between the two groups, we examined whether the emotional tone of the texts affected the prevalence of cognitive distortions.
Given that the data violated the assumptions of normality (Shapiro-Wilk test, W = 0.88, p < 0.001) and homogeneity of variances (Levene's test, F = 6.55, p = 0.011), the non-parametric Mann-Whitney U test was conducted to compare the differences between the two groups. The results indicated no significant difference in sentiment scores between the groups (U = 106,661,488.50, p = 0.058). Additionally, Fig. 3 shows the sentiment distribution for S-posts and N-posts, revealing a consistent sentiment distribution, with both groups' sentiment distributions skewed towards negative emotions.
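For reference, the U statistic behind the test reported above can be sketched directly from its definition. The function name `mann_whitney_u` and the toy sentiment scores are hypothetical. The pairwise loop is quadratic in the sample sizes, so a rank-based library routine such as `scipy.stats.mannwhitneyu`, which also supplies the p-value, would be the practical choice for tens of thousands of posts.

```python
from typing import Sequence


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> float:
    """U statistic for x: number of (x_i, y_j) pairs with x_i > y_j, ties count 0.5."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u


# Hypothetical sentiment scores in [0, 1] for a few S-posts and N-posts.
s_scores = [0.2, 0.4, 0.6]
n_scores = [0.1, 0.4, 0.5]
u_s = mann_whitney_u(s_scores, n_scores)
u_n = mann_whitney_u(n_scores, s_scores)
```

The two statistics always sum to the number of pairs, so a value of U close to half that product indicates little systematic difference between the groups.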
Discussion
The present study examined the relationship between HBV stigma and the prevalence of cognitive distortions in online forum posts. Drawing on a large corpus of real-world, user-generated texts from an HBV forum, we employed an LLM-based classification approach to detect stigma-related content and subsequently identify 12 types of cognitive distortions through semantic schemas. Our findings indicate that posts reflecting HBV stigma (S-posts) contained significantly more cognitive distortions than non-stigma-related posts (N-posts), and this difference remained robust across a range of sensitivity analyses. At the user level, individuals who posted at least one S-post (S-users) exhibited consistently elevated prevalence across all categories compared to those who never posted S-posts (N-users), indicating that individuals who produced stigma-related content were not only more likely to express cognitive distortions in a single post but tended to do so consistently across their contributions.
These results offer empirical support for the theoretical proposition that stigma—embodied in external societal attitudes, institutional policies, and interpersonal behaviors—can infiltrate individual cognitive processes (Beck, 2020; Link & Phelan, 2001). Consistent with the core principles of CBT, such as those proposed by Beck (2020), these distorted cognitions may serve as a conceptual bridge linking external stigma-related cues to internal psychological distress.
Notably, our analysis identified certain cognitive distortions—particularly labeling and mislabeling, disqualifying the positive, overgeneralizing, and should statements—as having stronger and more consistent links with stigma-related discourse. The frequent occurrence of labeling and mislabeling suggests that stigmatized individuals may adopt negative self-definitions that echo the broader societal narratives of marginalization (Tadesse et al., 2024). Similarly, the prevalence of overgeneralizing and disqualifying the positive indicates a tendency to rely on rigid, pessimistic appraisals of social experiences, limiting the incorporation of positive or neutral information into one's self-view (Franklin et al., 2022). The prominence of should statements aligns with previous literature showing that rigid expectations and self-imposed rules often arise in contexts where individuals feel pressured to meet unrealistic social standards (Del Rosal et al., 2021). These findings converge with prior research examining other stigmatized conditions—such as HIV (Tadesse et al., 2024) or mental health disorders (Volkow et al., 2021)—implying that certain distortions are not solely a function of HBV status, but may be broadly characteristic of stigma-induced cognitive frameworks.
Interestingly, the sentiment scores for both groups were skewed in the negative direction. This finding contrasts with previous research showing that human language tends to exhibit a bias towards positive sentiment, known as the Pollyanna effect (Dodds et al., 2015). This discrepancy may arise because the texts primarily come from individuals with HBV and reflect the pressures and emotional challenges they face in their daily lives. This observation aligns with Meyer's minority stress theory (Frost & Meyer, 2023), which posits that individuals with minority status experience additional stress and mental health burdens compared to the general population.
From a theoretical standpoint, these results expand upon existing cognitive models of stigma, suggesting that experiences of HBV stigma do not merely elicit negative affect but may also become entrenched in an individual's cognitive architecture. This aligns with diathesis-stress models, where individuals with a predisposition for certain cognitive patterns are more likely to interpret stressful stimuli (in this case, stigma) in distorted ways, thereby intensifying psychological distress (Broerman, 2020). The findings are also in line with the concept of internalized stigma, where stigma operates not merely as a transient social encounter but as a psychosocial factor that permeates cognitive structures over time, impacting how individuals perceive themselves, interpret others’ behavior, and envision their futures (Li et al., 2020).
Practically, these observations hold implications for interventions aimed at reducing the psychological harm associated with HBV stigma. CBT protocols could be adapted to target the specific cognitive distortions identified in the current study among individuals living with HBV. For instance, interventions might focus on helping individuals with HBV identify and challenge negative labels, integrate positive experiences more fully into their self-appraisals, and adopt more nuanced, evidence-based interpretations of stressful events. By addressing these prevalent distortions, interventions may be more effective in mitigating the psychological burden of HBV stigma, ultimately supporting better mental health outcomes and potentially improving treatment adherence and disease management among people living with HBV.
Moreover, these findings contribute to the literature by providing empirical evidence that stigma influences cognitive processes at the level of natural language expression. While previous research has primarily focused on qualitative analyses or controlled experimental paradigms, the use of computational approaches in the present study offers a scalable and objective method to detect and quantify these cognitive patterns in large-scale, ecologically valid datasets. This methodological advancement allows for more nuanced investigations into the interplay between stigma and cognitive patterns, setting the stage for future research that might explore longitudinal patterns, identify protective factors, or examine the relative effectiveness of different intervention strategies.
It is important, however, to interpret these findings cautiously. While the analyses support an association between stigma and increased cognitive distortions, causality cannot be inferred from this cross-sectional, observational study. It remains unclear whether stigma exposure leads to distorted cognitions, whether individuals prone to certain cognitive distortions are more sensitive to perceived stigma, or whether a mutual interplay perpetuates both. Furthermore, although our conceptualization focused on external stigma, internalized or self-stigma may also shape these cognitive patterns over time (Link & Phelan, 2001). Future research employing longitudinal designs, experimental manipulations, or qualitative interviews could help disentangle these complex, potentially bidirectional relationships and provide deeper insights into the underlying mechanisms.
Additionally, there is growing evidence that human cognitive and emotional responses can exhibit seasonal fluctuations (e.g., Meyer et al., 2016; Winthorst et al., 2020). Although our data were collected over a ten-year period and therefore span multiple seasons, we cannot entirely exclude the possibility that seasonality influenced cognitive or emotional responses. Future research adopting more fine-grained time-series analyses may help clarify whether and how seasonal factors affect the expression of cognitive distortions.
Another limitation is that the bootstrap procedure used in the analysis accounts solely for sampling variability and does not consider potential classification errors by the model. The resulting confidence intervals may therefore be narrower than those that would be obtained if classification error were incorporated, overstating the precision of the estimates and understating their true variability. As such, the confidence intervals derived from the bootstrap procedure should be interpreted with caution, and future work could incorporate classification error to provide more accurate confidence intervals.
In addition, potential cultural factors and the distinct nature of the online forum's user base should be considered when interpreting and generalizing these findings. The demographic profile of users, as presented in Table 2, partially aligns with the broader HBV population in China, with a higher proportion of male users compared to female users (Liang et al., 2023; Wang et al., 2019). However, discrepancies emerge when comparing clinical severity. Epidemiological studies in China indicate that approximately 8% of HBV carriers require treatment (WHO, 2022b), and around 10% of these individuals progress to cirrhosis or liver cancer (Qi, 2010). In contrast, the severity profile of our forum users differs from these general patterns, potentially reflecting a self-selected population more inclined to share experiences, seek advice, or discuss their condition online. These differences suggest that the forum-based sample may not fully represent the entire HBV population. Cultural norms, stigma-related concerns, and the self-selecting nature of online communities could influence both participation and the linguistic manifestation of cognitive distortions. Nonetheless, the large-scale data we analyzed and the alignment of our results with established theoretical and empirical frameworks underscore the value of these findings. Despite the limitations in representativeness, this study provides meaningful insights into the psychological impact of stigma and cognitive distortions among individuals living with HBV, and it could serve as a foundation for future research aimed at more broadly representative samples.
In conclusion, the current study offers empirical evidence that HBV stigma is associated with heightened cognitive distortions, lending support to CBT-based theoretical models positing stigma as a key factor shaping maladaptive thought patterns. By identifying specific, robust distortion categories linked to HBV stigma, this research provides a clearer target for intervention for individuals living with HBV and deepens our understanding of the interplay between social marginalization and mental health. As future studies continue to refine these methods and explore causality and cultural context, we move closer to interventions that not only address the emotional toll of stigma but also dismantle the cognitive structures that sustain it.
Preregistration
This study is not pre-registered.
CRediT authorship contribution statement
Xi Wang: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Writing – original draft. Yujia Zhou: Methodology, Resources, Software, Visualization, Writing – review & editing. Guangyu Zhou: Funding acquisition, Project administration, Resources, Supervision, Validation, Writing – review & editing.
This work was supported by the National Social Science Foundation of China under grant number 21BSH158.