Interstitial lung disease (ILD) is one of the leading causes of mortality in autoimmune diseases. The extent of the disease is a determining factor in prognosis and in the initiation and monitoring of treatment. The Goh method is the most commonly used approach for quantifying it; however, it is subjective. So far, no studies have evaluated the level of agreement among multiple readers.
Objective
The study's objective is to determine the interobserver and intra-observer variability in ILD quantification among physicians from various specialties and levels of experience.
Methods
Chest computed tomography images of patients with rheumatoid arthritis (RA) or systemic sclerosis (SSc) and ILD were collected. The five necessary cuts described by Goh were extracted to be evaluated by pulmonologists, rheumatologists, radiologists, fellows, and a thoracic radiologist (gold standard). Interobserver and intra-observer variability were calculated using the intraclass correlation coefficient (ICC) or Cohen's Kappa, depending on the nature of the variable, between each group of medical specialties and in comparison with the gold standard.
Results
Seventy-nine patients were selected, primarily women, 56% having SSc. A total of 1098 CT readings were performed. The intraclass correlation coefficient was .75 (95% CI: .67–.81), including all nine readers. The best correlation with the gold standard was found among pulmonologists (ICC .83) and rheumatologists (ICC .81). According to severity (greater or less than 20% extension), the Kappa coefficient was .64 among the nine readers. The intraclass correlation coefficient for the average intra-observer correlation of all readers was .89 (95% CI: .81–.93), and the Kappa coefficient was .82.
Conclusion
The Goh method is valuable and shows high agreement among a diverse group of specialties that manage ILD, making it a practical tool for assessing the extent of the disease.
Interstitial lung disease (ILD) is one of the most critical complications of autoimmune diseases, with increased mortality rates, such as in rheumatoid arthritis (RA)1 and systemic sclerosis (SSc).2
In recent years, evidence on therapeutic strategies has expanded, and immunosuppressive and antifibrotic therapies are now recognized as part of the therapeutic arsenal for this condition. This has required guidelines and algorithms that provide practical criteria for determining disease progression and guiding the initiation of these treatments.3
Deciding when to start treatment in these patients requires determining disease severity and risk of progression, for which pulmonary function tests and measurement of disease extent on high-resolution tomography are used.4,5 The degree of extension has been correlated with mortality and deterioration of lung function in pathologies such as RA, SSc, and idiopathic inflammatory myopathies (IIM),5–7 which is why proposed algorithms include these measurements as a crucial part of the decision to start immunosuppressive and antifibrotic therapies.8,9 In one of the most relevant studies, Goh et al. identified an extent greater than 20% as the threshold associated with increased mortality (HR 3.03).6
There are several ways to measure the extent of interstitial lung disease, whether through the manual, non-automated form used in the vast majority of centers or software that allows for extension calculation, employed mainly for clinical studies. The Goh method is the most widely used and recommended manual method currently,4 taking into account that the treatment guidelines are based on the parameters of the original study.
It was described in 2008 as part of developing an algorithm for managing patients with systemic sclerosis and interstitial lung disease, where disease severity, through pulmonary function and disease extent, would determine the benefit of treatment. This was based on the fact that these variables are fundamental in the prognosis of these patients, depending on whether they were classified into the mild or extensive disease group.
This semi-quantitative method relies entirely on the evaluator's subjective quantification of disease extent, resulting from averaging such quantification at five different levels (tomographic slices). Its use has been more relevant as a tool to establish baseline prognosis for mortality and disease progression in interstitial lung disease and, to a lesser extent, as a means to determine longitudinal changes.10
It has been used extensively in patients with SSc, and given its ease of use, broad validation, no requirement for additional software, low cost, and short execution time, it has been extrapolated to other pathologies with interstitial involvement, such as rheumatoid arthritis and hypersensitivity pneumonitis.11,12 However, it has a significant subjective component, which could introduce variability among evaluators. To mitigate this drawback, the authors of the method published a high-resolution tomography guide with an atlas of images that facilitates discrimination and might help to reduce variability.8
Considering the importance of quantifying disease extent on high-resolution tomography, the growing use of the Goh method, and the possibility of measurement variability, it is essential to know the degree of agreement between different observers, because disagreement could affect therapeutic decision-making. This has not been explored until now. For this reason, a study was proposed to determine the inter- and intra-observer variability among groups of physicians from different specialties involved in the care of patients with interstitial lung disease and with varying levels of experience.
Materials and methods
Subjects
Data were collected from patients attending the rheumatology outpatient clinic of a university hospital over a one-year period who underwent chest tomography for the study of interstitial lung disease associated with their autoimmune pathology, either RA according to the 2010 ACR/EULAR criteria or SSc according to the 2013 ACR/EULAR criteria. Tomography was requested for all patients with SSc as part of the annual screening strategy and for patients with RA who reported dyspnea. Patients with other possible causes of autoimmune interstitial lung disease, other concomitant causes of interstitial lung disease, or poor-quality tomographies that did not allow adequate image assessment were excluded.
Images and reading
High-resolution tomography images were obtained using a multidetector system, without contrast, in the inspiratory phase, with thin slices of 1–2 mm maximum thickness, with the minimum possible collimation in axial cuts and with reconstructions. The tomographies were collected over one year; they could not be more than six months old at the time of consultation.
The five cuts described by Goh for quantifying disease extension were extracted: 1. Great vessel outlet; 2. Carina level; 3. The confluence of pulmonary veins; 4. Halfway between the third and fifth cuts; 5. Immediately superior to the right hemidiaphragm.6,8 The interstitial lesions to be quantified were reticulation, traction bronchiectasis, honeycombing, and ground-glass opacity. The extent in each cut was initially estimated to the nearest multiple of 5% for the final calculation. The tomographies were sent every two weeks, anonymously coded and without patient data, to blind the reader. Some studies with different degrees of disease were selected for a second evaluation to determine the level of intra-observer variability.
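The scoring arithmetic described above can be sketched as follows. This is a minimal illustration, assuming each of the five cuts receives a visual estimate already rounded to a multiple of 5% and that the global extent is the plain mean of the five values. The function names and the example scores are hypothetical and are not taken from the study data.

```python
from statistics import mean

# Cut-off for extensive disease (percent), as cited in the text from Goh et al.
SEVERITY_CUTOFF = 20.0

def goh_extent(slice_scores):
    """Return the global disease extent (%) as the mean of five per-cut estimates.

    Assumption: each estimate is a multiple of 5 between 0 and 100.
    """
    if len(slice_scores) != 5:
        raise ValueError("exactly five cuts are required")
    for score in slice_scores:
        if not 0 <= score <= 100 or score % 5 != 0:
            raise ValueError("each estimate must be a multiple of 5 in [0, 100]")
    return mean(slice_scores)

def is_extensive(extent):
    """Classify the extent as severe when it exceeds the 20% cut-off."""
    return extent > SEVERITY_CUTOFF

# Hypothetical reading of one study: the mean of the five cuts is 23%.
example = [10, 15, 20, 30, 40]
print(goh_extent(example), is_extensive(goh_extent(example)))
```

Because the final value is an average, a disagreement of 5% on a single cut shifts the global extent by only 1%, which matters mainly for studies close to the 20% cut-off.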
Readers
Before the tomography readings began, a 60-minute live virtual training session was held with the readers. During this session, the methodology for image evaluation was explained, and questions were answered.
Written and video instructions on using the Goh method were sent to achieve homogeneous and standardized performance. The readers included two rheumatologists and two pulmonologists with more than five years of experience managing patients with autoimmune-associated ILD, one pulmonologist with eight years of experience in general pulmonology, two general radiologists not specialized in chest imaging with five years of experience, one final-year rheumatology fellow, one final-year pulmonology fellow, and a chest radiologist as the gold standard, with more than three years of experience in ILD and one year of specific training in this area.
The main objective is to determine the inter- and intra-observer variability in quantifying interstitial lung disease extension using the Goh method. Specific objectives include determining the inter- and intra-observer variability among different groups of physicians compared to the gold standard, defined as the assessment of tomographies by the chest radiologist with experience in interstitial lung disease, and examining the differences between a group of trainees in related specialties with imaging training and the gold standard.
Statistical analysis
Quantitative variables are described as means with their standard deviations (SD) or medians with their interquartile range (IQR), depending on whether their distribution was normal or non-normal, respectively, and categorical variables with their relative and absolute frequencies. The disease extension variable was analyzed in two ways: as a continuous variable, and dichotomously as severe or non-severe depending on whether the extension was greater or less than 20%. Inter- and intra-observer variability for continuous extension was calculated using individual intraclass correlation coefficients with confidence intervals. A value between 0.4 and 0.6 was considered moderate or acceptable agreement, greater than 0.6 and up to 0.8 good, and greater than 0.8 excellent. Cohen's Kappa was used for the dichotomous extension variable, with a value between 0.4 and 0.6 considered moderate agreement, greater than 0.6 and up to 0.8 good, and greater than 0.8 excellent.13
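To make the continuous analysis concrete, the sketch below computes a single-rater intraclass correlation coefficient and maps it onto the agreement bands given above. The text does not state which ICC form was used; this example assumes the two-way random-effects, absolute-agreement, single-rater form, often written ICC(2,1), and it does not compute confidence intervals. The function names and the ratings matrix are hypothetical.

```python
def icc2_1(ratings):
    """ICC(2,1) from a complete table: one row per study, one column per reader."""
    n = len(ratings)        # number of studies
    k = len(ratings[0])     # number of readers
    if n < 2 or k < 2:
        raise ValueError("need at least two studies and two readers")
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    # Sums of squares of the two-way ANOVA without replication
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

def agreement_label(value):
    """Map a coefficient onto the bands stated in the statistical analysis."""
    if value > 0.8:
        return "excellent"
    if value > 0.6:
        return "good"
    if value >= 0.4:
        return "moderate"
    return "below moderate (no label given in the text)"

# Hypothetical extents (%) for five studies read by three readers.
ratings = [[5, 10, 5], [20, 25, 20], [40, 35, 45], [10, 10, 15], [30, 25, 30]]
coefficient = icc2_1(ratings)
print(round(coefficient, 2), agreement_label(coefficient))
```

The absolute-agreement form penalizes a reader who is systematically higher or lower than the others, which is relevant when a fixed cut-off such as 20% drives the clinical decision.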
This study obtained approval from an ethics committee according to the legal requirements of the site where it was conducted. Patients were informed and provided signed consent for using their data and images.
Results
Data from 85 patients were collected, of whom 79 met image quality standards and were selected. A total of 1098 readings of the tomographies were conducted (each reading includes the selection of 5 images). Each reader initially performed 79 readings and then conducted 43 additional reassessments.
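The total is consistent with nine readers each completing the initial readings and the reassessments, assuming the nine readers include the chest radiologist, as the intra-observer results suggest:

```latex
9 \times (79 + 43) = 9 \times 122 = 1098
```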
The characteristics of the included patients and the general findings of the tomography are summarized in Table 1.
Table 1. Population description.
Variable | Result |
---|---|
Sex (women) | 82% |
Diagnosis | |
RA | 37 (43.5%) |
SSc | 42 (56.5%) |
Age (average) (SD) | 62 (10.2) |
Smokers | 23.6% |
Current smokers | 40% |
Antibodies | |
SSc patients | |
Centromere (patients with SSc) | 26 (61%) |
SCL-70 (patients with SSc) | 7 (16.6%) |
Other antibodies | 9 (21%) |
RA patients | |
Rheumatoid factor (patients with RA) | 32/34 (94%) |
Anti CCP (RA patients) | 28/34 (82.3%) |
Forced vital capacity (% predicted), mean (SD) | 86.9 (16.7) |
Extension of disease in tomography | |
Up to 5% | 45% |
5–20% | 35% |
More than 20% | 20% |
SD: standard deviation; RA: rheumatoid arthritis; SSc: systemic sclerosis.
For inter-observer variability, the extension variable was analyzed in both continuous and dichotomous forms.
The analysis of the variable in its continuous form resulted in an individual intraclass correlation coefficient of 0.75 (95% CI: 0.67–0.81), including all nine readers.
The analysis of the intraclass correlation coefficient among each group of physicians, within the same group members, and in comparison with the gold standard is summarized in Table 2.
Table 2. Intraclass correlation coefficients (95% CI) between groups, among members of the same group, and in comparison with the gold standard.
Group | Rheumatology | Pulmonology | General Radiology | Fellowship | Thoracic Radiology (gold standard) |
---|---|---|---|---|---|
Rheumatology | 0.79 (0.61–0.88) | 0.85 (0.79–0.89) | 0.7 (0.59–0.79) | 0.77 (0.67–0.84) | 0.81 (0.73–0.87) |
Pulmonology | | 0.81 (0.72–0.88) | 0.72 (0.62–0.79) | 0.78 (0.69–0.84) | 0.83 (0.77–0.88) |
General Radiology | | | 0.61 (0.45–0.73) | 0.59 (0.46–0.71) | 0.63 (0.49–0.74) |
Fellowship | | | | 0.69 (0.46–0.81) | 0.75 (0.64–0.83) |
We observed the best correlation between pulmonologists and rheumatologists (ICC 0.85 (95% CI: 0.79–0.89)), the best correlation within the same group among pulmonologists (ICC 0.81 (95% CI: 0.72–0.88)), and the best correlation with the gold standard among pulmonologists (ICC 0.83 (95% CI: 0.77–0.88)), which was very close to the value of the rheumatologists.
When studying the extension dichotomously, classifying it as severe or not severe when it was more or less than 20%, a Kappa index of 0.64 was obtained, including all 9 readers. Again, the analysis was performed among each group of physicians, within the same group of members, and in comparison with the gold standard. The results are summarized in Table 3, showing high correlation values again between rheumatologists and pulmonologists, as well as with the specialized radiologist.
Table 3. Kappa indices for extensive disease between groups, among members of the same group, and in comparison with the gold standard.
Group | Rheumatology | Pulmonology | General Radiology | Fellowship | Thoracic Radiology (gold standard) |
---|---|---|---|---|---|
Rheumatology | 0.8 | 0.81 | 0.62 | 0.69 | 0.71 |
Pulmonology | | 0.8 | 0.6 | 0.71 | 0.69 |
General Radiology | | | 0.58 | 0.49 | 0.44 |
Fellowship | | | | 0.56 | 0.55 |
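The dichotomous analysis can be illustrated with a short sketch of Cohen's Kappa for two readers after applying the 20% cut-off. The text does not specify how the single Kappa for all nine readers was pooled, so this example covers only the two-reader case. The function names and the extents are hypothetical.

```python
def dichotomize(extents, cutoff=20.0):
    """Label each study as severe (True) when its extent exceeds the cut-off."""
    return [extent > cutoff for extent in extents]

def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa for two readers giving binary labels to the same studies."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both readers must label the same non-empty set of studies")
    n = len(labels_a)
    # Observed agreement: proportion of studies with identical labels
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each reader's own proportion of severe labels
    p_a = sum(labels_a) / n
    p_b = sum(labels_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1:
        raise ValueError("Kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)

# Hypothetical global extents (%) for eight studies from two readers.
reader_a = [5, 10, 25, 30, 15, 40, 20, 35]
reader_b = [5, 15, 30, 15, 10, 45, 25, 30]
kappa = cohen_kappa(dichotomize(reader_a), dichotomize(reader_b))
print(round(kappa, 2))
```

In this made-up example the two readers disagree only on studies whose extent is near 20%, which is where dichotomizing a continuous estimate is most sensitive to small differences.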
Each reader conducted 43 repeated assessments to determine intra-observer correlation. The intraclass correlation coefficient for the average intra-observer correlation of the nine evaluators was 0.89 (95% CI: 0.81–0.93). The values for each specialty group of physicians on average were as follows: for rheumatologists 0.94 (95% CI: 0.83–0.98), pulmonologists 0.94 (95% CI: 0.88–0.97), radiologists 0.83 (95% CI: 0.65–0.92), fellows (rheumatology and pneumology) 0.84 (95% CI: 0.68–0.96), and for the chest radiologist 0.91 (95% CI: 0.84–0.95).
For the analysis of tomography extension by severity, the average intra-observer Kappa index for all observers was 0.82, and by groups of physicians the values were as follows: rheumatologists 0.965, pulmonologists 0.8, radiologists 0.77, fellows (rheumatology and pulmonology) 0.75, and chest radiologist 0.85.
Discussion
The Goh method allows for the manual calculation of fibrosis extension in high-resolution tomographies, but it has been assumed to include a significant subjective component and, therefore, high variability.
In this study, we demonstrated high agreement among physicians from different specialties involved in addressing patients with interstitial lung disease, marking the first research in our setting to evaluate this characteristic in a diverse clinical group.
These results show that evaluations are generally homogeneous, with a satisfactory intraclass correlation coefficient of 0.75 for all observers. This coefficient is even higher among the physicians most trained in this field, exceeding 0.8 relative to the gold standard, which is crucial when making decisions regarding treatment initiation.
The Kappa index for assessing severity with a 20% extension cut-off was 0.64, indicating a moderate concordance. Although a higher value would be expected considering the importance of this outcome, the value has likely been reduced by the number of evaluators: this statistic tends to decrease as readers are added, because it becomes more difficult for numerous readers to agree, particularly in a diverse group of physicians with different levels of expertise. Among readers with more experience in using this approach and treating these patients, the Kappa value increases considerably, to 0.8 between rheumatologists and pulmonologists.
Additionally, we observed good agreement (0.7) between the fellows and the gold standard. These evaluators had only initial training and little experience, which indicates the importance of including systematic training in evaluating images of interstitial lung disease within the educational process; this training does not need to be extensive. Similar data were observed in the pulmonologist group, where within-group agreement was the highest even though one participant was highly experienced and another was not. This suggests that the learning curve is short and straightforward.
Intra-observer agreement was also evaluated, with an ICC of 0.89. This shows that this systematic and practical method gives consistent results when images are evaluated at different times, even among readers with less preparation. This is very useful in settings where software is not available to replace manual methods, and it represents good clinical practice without increasing healthcare costs. The current cost of implementing automated systems is a fundamental reason for their limited use, in addition to their not having shown a clear benefit over physician-dependent methods.
The lower correlation values between the radiologists’ readings and the gold standard caught our attention. A possible explanation is that these quantification methods are not used during radiology training, unlike in specialties such as pulmonology or rheumatology, where clinical and treatment decisions are based on them. This is supported by the fact that the values were close to those found in trainee physicians, who had analyzed fewer studies. These differences may be more pronounced considering that the recruited rheumatologists and pulmonologists had extensive experience with patients with interstitial lung disease.
Until now, the concordance in the assessment of disease extent had not been evaluated, although other studies have examined the variability in the diagnosis of ILD among different readers of tomography images, primarily using the Interstitial Lung Fibrosis Reporting and Data System (ILF-RADS), which is used to determine the likelihood of a usual interstitial pneumonia pattern. In 2022, Alnaghy et al.14 found a high degree of agreement (Kappa=0.88) with this measurement, which was also very high for the different abnormalities in the tomography (Kappa=0.91). These results are similar to the current work, where despite relying on subjective assessment, the agreement among various readers is high. However, a later study in 2024, conducted by Elshetry et al.15 using the same instrument, found lower inter-observer agreement (Kappa=0.44) but high intra-observer agreement (Kappa=0.87), values that are closer to those of our study. These studies support the view that the clear subjective component of tomographic evaluation in ILD may not substantially affect the outcome when it is compared among various observers.
Regarding the study's limitations, the time each reader spent per study, which ranged from 5 to 15 min, was not measured precisely. Reading time could have influenced the readings, so this is a potentially relevant missing data point.
During image selection, several patients (5%) were found to have predominantly basal disease located below the last cut recommended by the Goh method, which may lead to underestimation of the extension. Similarly, differences in lung area between the apices and the bases may affect the final extension calculation. So far, we do not have a standardized method for correcting these errors.
Additionally, separate measurements were not made for the different types of abnormalities in the tomography (ground-glass opacities, reticular patterns, honeycombing), so we have no data on the concordance for each of them or on whether differences are more pronounced in any of them.
Conclusion
We found that the estimation of interstitial disease extent by the Goh method correlates well among a diverse group of medical specialties, with higher agreement with the gold standard among more experienced readers. Furthermore, intra-observer correlation is very high for all evaluators. We conclude that it is a practical and highly reproducible method for evaluating the extension of fibrosis findings in high-resolution tomographies.
CRediT authorship contribution statement
The authors contributed to different stages of the study. LC contributed to patient and image collection and statistical analysis. SC, GG, and MC contributed to image collection, analysis, and article writing. MT, DG, YM, AN, JG, CR, WA, DO, and JD contributed to image analysis.
Ethical statement
This study obtained approval from an ethics committee according to the legal requirements of the site where it was conducted. Patients were informed and provided signed consent for using their data and images. The present study does not contain any personal data in any form.
Funding
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Conflicts of interest
The authors declare no conflicts of interest that could be considered to influence the content of the manuscript directly or indirectly.