Radiología (English Edition), Vol. 62, Issue 6, Pages 481–486 (November–December 2020)
Original article
Interobserver and intraobserver variability in determining breast density according to the fifth edition of the BI-RADS® Atlas
Estudio de la variabilidad inter- e intraobservador en la determinación de la densidad mamaria según la 5.a edición del Atlas BI-RADS®
K. Pesce a, M. Tajerian b, M.J. Chico a, M.P. Swiecicki a, B. Boietti b, M.J. Frangella b (corresponding author), S. Benitez b
a Diagnostic Imaging Department, Hospital Italiano de Buenos Aires, Buenos Aires, Argentina
b Health Informatics Department, Hospital Italiano de Buenos Aires, Buenos Aires, Argentina
Abstract
Objective

To determine the level of agreement within and between observers in the categorization of breast density on mammograms in a group of professionals using the fifth edition of the American College of Radiology’s BI-RADS® Atlas and to analyze the concordance between experts’ categorization and automatic categorization by commercial software on digital mammograms.

Methods

Six radiologists categorized breast density on 451 mammograms on two occasions one month apart. We calculated the linear weighted kappa coefficients for inter- and intra-observer agreement for the group of radiologists and between the commercial software and the majority report. We analyzed the results for the four categories of breast density and for dichotomous classification as dense versus not dense.

Results

The interobserver agreement between the radiologists and the majority report was between moderate and nearly perfect for the analysis by category (κ = 0.64 to 0.84) and for the dichotomous classification (κ = 0.63 to 0.84). The intraobserver agreement was between substantial and nearly perfect (κ = 0.68 to 0.85 for the four categories and κ = 0.70 to 0.87 for the dichotomous classification). The agreement between the majority report and the commercial software was moderate both for the four categories (κ = 0.43) and for the dichotomous classification (κ = 0.51).

Conclusion

Agreement on breast density within and between radiologists using the criteria established in the fifth edition of the BI-RADS® Atlas was between moderate and nearly perfect. The level of agreement between the specialists and the commercial software was moderate.

Keywords:
Breast density
Intraobserver variability
Interobserver variability
BI-RADS® fifth edition
Breast density on mammograms
Introduction

Breast density, evaluated by means of mammography, is defined as the relative quantity of radiodense stromal and epithelial tissue compared to radiolucent adipose tissue.1 The 5th edition of the BI-RADS® Atlas of the American College of Radiology (ACR) defines four patterns of breast density: ACR a (almost entirely fatty), ACR b (scattered areas of fibroglandular tissue), ACR c (heterogeneously dense) and ACR d (extremely dense).2

High breast density (categories c and d) is an independent risk factor for developing breast cancer3–5 and a masking factor that decreases mammography’s sensitivity for detecting this disease and other lesions.6–10

Breast density may be visually evaluated by a radiologist. Some authors have found that this method entails considerable intraobserver and interobserver variability.11–13 A systematic literature review conducted for the U.S. Preventive Services Task Force in 2016 reported that, in community settings, 19–22% of mammography examinations initially classified as dense were subsequently classified as non-dense, and 10–16% of examinations initially deemed non-dense were reclassified as dense. Regarding sequential reproducibility, the review reported that the category changed in a subsequent round in 20% of studies when the reading was performed by the same radiologist, and that this proportion rose to 33% when the evaluation was performed by a different radiologist.13,14

Moreover, automated breast density measurement systems are becoming increasingly common.15–19 Consequently, in clinical practice, breast density is increasingly reported based on assessments performed by specialists aided by these systems. Yet few studies have analysed professionals' use of these tools, their concordance with them, or their perception of the tools' usefulness.

The objective of our study was to determine the degree of intraobserver and interobserver agreement in categorising breast density according to the 5th edition of the BI-RADS® Atlas of the ACR among a group of professionals specialised in breast imaging at a highly complex health institution. We also analysed concordance between categorisation by experts and categorisation by an automated categorisation method (a commercial software program for a digital mammography machine).

Methods

Setting

Our study was conducted in the Breast Diagnosis and Intervention Section of the Diagnostic Imaging Department at a tertiary hospital. The department has had digital imaging and an integrated radiology information system/picture archiving and communication system (RIS/PACS) since 2010.20 The section is composed of ten specialists and two fellows, and reports an average of 30,000 mammograms per year. Mammograms are randomly assigned on a daily basis to radiologists for reporting; each receives 200–400 cases per month. Once mammogram reports are written, 10% of studies reported by specialists (approximately 300 studies per month) and all studies reported by fellows are submitted for peer review. In addition, report quality audits are performed by the physicians who order the studies.

Study design

This cross-sectional study was conducted in accordance with the principles of the Declaration of Helsinki and was approved by the Independent Ethics Committee at our institution. Patient consent was also obtained.

Categorisation of breast density according to the 5th edition of the ACR BI-RADS® Atlas

The team of professionals was made up of six physicians from the Breast Diagnosis and Intervention Section. The group had an average of nine years of experience in breast imaging (range: two to eighteen years). A total of 451 mammograms performed at the institution in February 2019 in randomly selected asymptomatic patients aged 40–90 years were included. One of the four acquisitions (craniocaudal or mediolateral oblique view) was drawn from each study. Spot-compression and magnification views were excluded, as were mammograms from patients with a personal history of breast surgery (including breast implants) or gigantomastia, the latter defined as requiring more than one plate per view.

A bioengineer and a radiologist, who did not take part in the subsequent categorisation, extracted the images to be evaluated from the hospital database and removed patient-identifying data. Consecutive sampling was used.

A week before the 451 mammography images were evaluated, the participating professionals reviewed the breast density categorisation criteria and reference images in the 5th edition of the BI-RADS® Atlas.2 Next, the specialists categorised breast density in each mammogram. The images were evaluated at 5-megapixel workstations.

The evaluators were not aware of the patients’ demographic data or the category assigned in the original report for each mammogram. They were also unaware of the density assigned by the other participants in the study or the assessment of the commercial software program. Two readings of the same mammograms were performed a month apart. The order of the studies in each of the two readings was random. All this information was recorded in an electronic database.

For automated categorisation of breast density, we used a commercially available breast density measurement software program for the Fujifilm® AMULET Innovality mammography machine (3000AWS7.0 option). This software calculates breast density as the ratio of fibroglandular tissue to total breast area,21 uses this proportion to estimate breast density on a continuous scale and automatically assigns a category.
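
As a rough illustration of how a quantitative method of this kind can map a continuous density estimate to an ACR-style category, the Python sketch below applies thresholds to a percent-density value. The function name and cut-off values are hypothetical placeholders; the actual cut-offs used by the commercial program are proprietary and are not described in this study.

    # Illustrative only: assign an ACR-style label from a continuous
    # fibroglandular-to-total-area ratio (%). Thresholds are hypothetical.
    def acr_category(percent_density: float) -> str:
        thresholds = [(25.0, "a"), (50.0, "b"), (75.0, "c")]  # placeholder cut-offs
        for upper, label in thresholds:
            if percent_density < upper:
                return label
        return "d"

    print(acr_category(38.2))  # -> "b" under these placeholder thresholds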

Endpoints of interest and statistical methods

We evaluated intraobserver and interobserver variability for each participating professional by calculating linear weighted kappa coefficients, reported with a 95% confidence interval (CI). The coefficient was calculated for the four categories of breast density and for the dichotomous result of dense versus non-dense breast.

We also used the majority report, defined as the statistical mode of the observers' reports, consistent with the pre-existing literature.22 We calculated the level of agreement between each evaluator and the majority report. In the event of a tie (a non-unimodal distribution of categorisations), a seventh imaging specialist categorised the mammogram to break the tie.
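
A minimal sketch of this majority-report logic, assuming the six readers' categories arrive as a list of labels (the function name and the handling of ties are ours, not taken from the study's code):

    from collections import Counter

    def majority_report(categories):
        # The majority report is the statistical mode of the readers' categories.
        top = Counter(categories).most_common()
        # A tie for the most frequent category (non-unimodal distribution)
        # is referred to the tie-breaking seventh reader.
        if len(top) > 1 and top[0][1] == top[1][1]:
            return None
        return top[0][0]

    print(majority_report(["b", "c", "b", "b", "c", "a"]))  # -> "b"
    print(majority_report(["b", "b", "c", "c", "a", "a"]))  # -> None (tie)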

For the calculation of the linear weighted kappa coefficient (κ), we used the method described by Cohen and Fleiss.23,24 We reported kappa coefficients of interobserver and intraobserver agreement, which in turn were classified by level of agreement according to the breakdown proposed by Landis and Koch (<0: "poor"; 0–0.20: "slight"; 0.21–0.40: "fair"; 0.41–0.60: "moderate"; 0.61–0.80: "substantial"; 0.81–1.00: "almost perfect").25
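
As an illustration, the Python sketch below computes a linear weighted kappa for two hypothetical readers, assuming the four categories are encoded as ordered integers (a = 0 through d = 3), and labels the result on the Landis and Koch scale. It uses scikit-learn's cohen_kappa_score; the confidence intervals reported in the study would require an additional procedure (e.g. a bootstrap), not shown here.

    from sklearn.metrics import cohen_kappa_score

    # Made-up example data: categories a-d encoded as 0-3 for two readers.
    reader_1 = [0, 1, 1, 2, 3, 2, 1, 0]
    reader_2 = [0, 1, 2, 2, 3, 1, 1, 0]

    # weights="linear" yields the linearly weighted kappa used in the study.
    kappa = cohen_kappa_score(reader_1, reader_2, weights="linear")

    def landis_koch(k):
        # Agreement labels per Landis and Koch (1977).
        if k < 0:
            return "poor"
        if k <= 0.20:
            return "slight"
        if k <= 0.40:
            return "fair"
        if k <= 0.60:
            return "moderate"
        if k <= 0.80:
            return "substantial"
        return "almost perfect"

    print(f"linear weighted kappa = {kappa:.2f} ({landis_koch(kappa)})")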

Finally, we reported the concordance between the commercial software program and the majority report.

We used the statistics software programs STATA version 14 and R version 3.6.0. A p-value <0.05 was considered significant.


Results

The frequency of the categories assigned by each specialist is shown in Fig. 1. The linear weighted kappa values for overall interobserver agreement (between each specialist and the majority report) at the first evaluation are shown in Table 1. These values ranged from 0.64 (95% CI: 0.58–0.70) to 0.84 (95% CI: 0.80–0.89) for the four-category analysis; in other words, agreement between each observer and the majority report was moderate to almost perfect. Similar values were obtained when the results were analysed dichotomously from a clinical perspective, with images categorised as "dense breast" or "non-dense breast".

Figure 1. Distribution of frequencies of breast density categories by specialist.

Table 1. Linear weighted kappa coefficient between each specialist and the majority report.

                       Linear weighted kappa coefficient (95% CI)
                       Interobserver, 4-category scale (a)   Interobserver, dichotomised (dense/non-dense) (b)
Observer 1             0.66 (0.60–0.72)                      0.63 (0.55–0.70)
Observer 2             0.77 (0.72–0.83)                      0.76 (0.70–0.82)
Observer 3             0.64 (0.58–0.70)                      0.72 (0.66–0.78)
Observer 4             0.84 (0.80–0.89)                      0.84 (0.79–0.89)
Observer 5             0.83 (0.78–0.87)                      0.80 (0.75–0.86)
Observer 6             0.67 (0.61–0.73)                      0.66 (0.59–0.73)
Commercial software    0.46 (0.39–0.52)                      0.51 (0.43–0.59)

(a) Interobserver agreement at the first reading for the 4 BI-RADS® density categories.
(b) Interobserver agreement at the first reading for the dichotomised result (dense/non-dense breast).

The linear weighted kappa values for intraobserver agreement between the first and second observations are shown in Table 2. The results ranged from 0.68 (95% CI: 0.63–0.74) to 0.85 (95% CI: 0.80–0.89) for the four-category analysis and from 0.70 (95% CI: 0.64–0.77) to 0.87 (95% CI: 0.83–0.92) for the dichotomous analysis; these values correspond to a level of agreement between substantial and almost perfect.

Table 2. Linear weighted kappa coefficient for each specialist in the readings a month apart.

                       Linear weighted kappa coefficient (95% CI)
                       Intraobserver, 4-category scale (a)   Intraobserver, dichotomised (dense/non-dense) (b)
Observer 1             0.76 (0.71–0.81)                      0.75 (0.68–0.82)
Observer 2             0.70 (0.64–0.76)                      0.70 (0.64–0.77)
Observer 3             0.85 (0.80–0.89)                      0.87 (0.83–0.92)
Observer 4             0.72 (0.66–0.77)                      0.72 (0.66–0.79)
Observer 5             0.68 (0.63–0.74)                      0.73 (0.67–0.77)
Observer 6             0.73 (0.68–0.79)                      0.77 (0.70–0.83)
Majority report        0.80 (0.76–0.85)                      0.85 (0.80–0.90)

(a) Intraobserver agreement for the six radiologists according to the 4 categories.
(b) Intraobserver agreement for the dichotomised result (dense/non-dense breast).

The agreement between the majority report and the commercial software program was moderate both for the four categories and for the dichotomous analysis, with kappa values of 0.46 (95% CI: 0.39–0.52) and 0.51 (95% CI: 0.43–0.59), respectively.

Discussion

Breast density is an independent risk factor for developing breast cancer. Categorising breast density is essential for performing a personalised risk assessment and for efficiently supplementing population screening with higher-sensitivity studies, such as magnetic resonance imaging and breast ultrasound, in patients with dense patterns.26 This factor can be used to select the most appropriate diagnostic method for each patient, helping clinicians avoid ordering too many or too few complementary tests and preventing delays in access to those tests.

This study evaluated interobserver and intraobserver variability in a diagnostic imaging department in categorising breast density, based on the 5th edition of the BI-RADS® Atlas.

Interobserver agreement on breast density between the six radiologists and the majority report was substantial, both on the four-category scale and for the dichotomous (dense/non-dense) categorisation. Similar studies in the scientific literature report highly variable results. Some studies have shown interobserver concordance between pairs of radiologists ranging from slight to substantial, with kappa values from 0.02 to 0.72 (mean = 0.46; 95% CI: 0.36–0.55).27 Other studies, by contrast, have reported values similar to those presented here.22,27–30 The fact that the participating physicians engage in regular academic activities (case conferences, update sessions, report audits and so on) could account for these results.

In its evaluation of intraobserver variability, our study recorded kappa values corresponding to substantial and almost perfect agreement,31 both on the four-category scale and on the dichotomous scale, with no statistically significant difference between the two. However, we should mention the potential risk of bias due to the brief interval between the two readings.

In relation to the above, although some studies have reported a statistically significant difference for observers with more than 10 years of experience in mammographic diagnosis,28 our results diverged from theirs. This may be explained by the fact that the medical team belongs to a specialised section with uniform diagnostic criteria, regular update meetings and an exclusive focus on breast diagnosis. As mentioned, the evaluators participated in an update session on breast density categorisation criteria before the start of the study. Our results could therefore be generalised to centres with the same characteristics.

The main advantages of an automated diagnostic tool are its consistency over time and its lack of variability.32 Hence, an observer-independent automated system enables reproducible measurements and should be more suitable for a reliable, standardised evaluation. In this study, agreement between the physicians' majority report and the commercially available software program was moderate. This could be attributed, first, to the fact that the two characterisation methods rest on different criteria: visual categorisation is based primarily on professional knowledge and experience, whereas the automated method uses a quantitative strategy to determine the ACR category. Second, the development and validation processes behind a software tool determine its performance in different scenarios. It would be useful to explore the reasons for this difference in future studies.

Although this study was conducted at a single institution, it is a leading institution that receives referrals from all over Argentina on a daily basis; a multicentre study would be advisable for evaluating new technologies. The sample included a limited number of cases with a breast density pattern classified as extremely dense (ACR d), consistent with the prevalence in department reports over the past five years, which hovers around 1–2%. Even so, the overall prevalence of high density (categories c and d) in the sample was 41%, also consistent with the prevalence in our hospital population. The number of mammograms used and the randomised order of the studies avoided recall bias from potential familiarisation with the sample between the two observation periods. Finally, our design ensured that the physicians were blinded to the reports issued by the automated classification software and to the assessments made by all the other evaluators.

Conclusion

Although there is general variability between observers and even within a single observer, qualitative classification of breast density according to the criteria established in the 5th edition of the BI-RADS® Atlas is an acceptable method, with moderate to almost perfect interobserver and intraobserver agreement. We found a moderate level of agreement between the specialists' reports and a commercially available software program. Future studies could examine and characterise agreement between specialists and automated classification methods in greater depth.

Authors

Study integrity: SB, KP.

Study concept: MT, KP, JF.

Study design: BB, JF.

Data acquisition: MJC, MPS.

Data analysis and interpretation: BB, JF, MT, KP.

Statistical processing: BB, MT.

Literature search: JF, MT, KP, MJC.

Drafting of the paper: JF, MT, KP, BB, MPS.

Critical review of the manuscript with intellectually significant contributions: SB.

Approval of the final version: JF, MT, KP, BB, MPS, SB.

Conflicts of interest

The authors declare that they have no conflicts of interest.

References
[1]
N.S. Winkler, S. Raza, M. Mackesy, R.L. Birdwell.
Breast density: clinical implications and assessment methods.
Radiographics, 35 (2015), pp. 316-324
[2]
E.A. Sickles, C.J. D’Orsi, L.W. Bassett, et al.
ACR BI-RADS® Mammography.
ACR BI-RADS® Atlas, Breast Imaging Reporting and Data System. American College of Radiology (2013).
[3]
S. Ciatto, C. Visioli, E. Paci, M. Zappa.
Breast density as a determinant of interval cancer at mammographic screening.
Br J Cancer, 90 (2004), pp. 393-396
[4]
J.O.P. Wanders, K. Holland, N. Karssemeijer, P.H.M. Peeters, W.B. Veldhuis, R.M. Mann, et al.
The effect of volumetric breast density on the risk of screen-detected and interval breast cancers: a cohort study.
Breast Cancer Res, 19 (2017), pp. 67
[5]
F. Strand, E. Azavedo, R. Hellgren, K. Humphreys, M. Eriksson, J. Shepherd, et al.
Localized mammographic density is associated with interval cancer and large breast cancer: a nested case-control study.
Breast Cancer Res, 21 (2019), pp. 8
[6]
V.A. McCormack, I. dos Santos Silva.
Breast density and parenchymal patterns as markers of breast cancer risk: a meta-analysis.
Cancer Epidemiol Biomarkers Prev, 15 (2006), pp. 1159-1169
[7]
N.F. Boyd, H. Guo, L.J. Martin, L. Sun, J. Stone, E. Fishell, et al.
Mammographic density and the risk and detection of breast cancer.
N Engl J Med, 356 (2007), pp. 227-236
[8]
C.A. Swann, D.B. Kopans, K.A. McCarthy, G. White, D.A. Hall.
Mammographic density and physical assessment of the breast.
AJR Am J Roentgenol, 148 (1987), pp. 525-526
[9]
D.S.A.L. Mousa, E.A. Ryan, C. Mello-Thoms, P.C. Brennan.
What effect does mammographic breast density have on lesion detection in digital mammography?.
Clin Radiol, 69 (2014), pp. 333-341
[10]
M.C. Carreira Gómez, M.C. Estrada Blan.
What we need to know about dense breasts: implications for breast cancer screening.
Radiologia, 58 (2016), pp. 421-426
[11]
B.L. Sprague, E.F. Conant, T. Onega, M.P. Garcia, E.F. Beaber, S.D. Herschorn, et al.
Variation in mammographic breast density assessments among radiologists in clinical practice: a multicenter observational study.
Ann Intern Med, 165 (2016), pp. 457-464
[12]
H.-J. Eom, J.H. Cha, J.-W. Kang, W.J. Choi, H.J. Kim, E. Go.
Comparison of variability in breast density assessment by BI-RADS category according to the level of experience.
Acta Radiol, 59 (2018), pp. 527-532
[13]
A. Alikhassi, H. Esmaili Gourabi, M. Baikpour.
Comparison of inter- and intra-observer variability of breast density assessments using the fourth and fifth editions of Breast Imaging Reporting and Data System.
Eur J Radiol Open, 5 (2018), pp. 67-72
[14]
J. Melnikow, J.J. Fenton, E.P. Whitlock, D.L. Miglioretti, M.S. Weyrich, J.H. Thompson, et al.
Supplemental screening for breast cancer in women with dense breasts: a systematic review for the U.S. Preventive Services Task Force.
Ann Intern Med, 164 (2016), pp. 268-278
[15]
A.M. Jeffers, W. Sieh, J.A. Lipson, J.H. Rothstein, V. McGuire, A.S. Whittemore, et al.
Breast cancer risk and mammographic density assessed with semiautomated and fully automated methods and BI-RADS.
Radiology, 282 (2017), pp. 348-355
[16]
S. Ciatto, D. Bernardi, M. Calabrese, M. Durando, M.A. Gentilini, G. Mariscotti, et al.
A first evaluation of breast radiological density assessment by QUANTRA software as compared to visual classification.
[17]
O. Alonzo-Proulx, R.A. Jong, M.J. Yaffe.
Volumetric breast density characteristics as determined from digital mammograms.
Phys Med Biol, 57 (2012), pp. 7443-7457
[18]
I. Martínez Gómez, M. Casals El Busto, J. Antón Guirao, F. Ruiz Perales, R. Llobet Azpitarte.
Semiautomatic estimation of breast density with DM-Scan software.
Radiologia, 56 (2014), pp. 429-434
[19]
J. Gao, R. Warren, H. Warren-Forward, J.F. Forbes.
Reproducibility of visual assessment on mammographic density.
Breast Cancer Res Treat, 108 (2008), pp. 121-127
[20]
D. Luna, F. Plazzotta, C. Otero, F. González Bernaldo de Quirós, A. Baum, S. Benítez.
Incorporación de tecnologías de la información y de las comunicaciones en el Hospital Italiano de Buenos Aires.
[22]
E.U. Ekpo, U.P. Ujong, C. Mello-Thoms, M.F. McEntee.
Assessment of interradiologist agreement regarding mammographic breast density classification using the fifth edition of the BI-RADS Atlas.
AJR Am J Roentgenol, 206 (2016), pp. 1119-1123
[23]
J. Cohen.
A coefficient of agreement for nominal scales.
Educ Psychol Meas, 20 (1960), pp. 37-46
[24]
J.L. Fleiss, J. Cohen.
The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability.
Educ Psychol Meas, 33 (1973), pp. 613-619
[25]
J.R. Landis, G.G. Koch.
The measurement of observer agreement for categorical data.
Biometrics, 33 (1977), pp. 159-174
[26]
W.A. Berg, Z. Zhang, D. Lehrer, R.A. Jong, E.D. Pisano, R.G. Barr, et al.
Detection of breast cancer with addition of annual screening ultrasound or a single screening MRI to mammography in women with elevated breast cancer risk.
JAMA, 307 (2012), pp. 1394-1404
[27]
W.A. Berg, C. Campassi, P. Langenberg, M.J. Sexton.
Breast Imaging Reporting and Data System: inter- and intraobserver variability in feature analysis and final assessment.
AJR Am J Roentgenol, 174 (2000), pp. 1769-1777
[28]
C.C. Gard, E.J. Aiello Bowles, D.L. Miglioretti, S.H. Taplin, C.M. Rutter.
Misclassification of breast imaging reporting and data system (BI-RADS) mammographic density and implications for breast density reporting legislation.
Breast J, 21 (2015), pp. 481-489
[29]
S. Ciatto, N. Houssami, A. Apruzzese, E. Bassetti, B. Brancato, F. Carozzi, et al.
Categorizing breast mammographic density: intra- and interobserver reproducibility of BI-RADS density categories.
[30]
A. Redondo, M. Comas, F. Macià, F. Ferrer, C. Murta-Nascimento, M.T. Maristany, et al.
Inter- and intraradiologist variability in the BI-RADS assessment and breast density categories for screening mammograms.
Br J Radiol, 85 (2012), pp. 1465-1470
[31]
J.R. Landis, G.G. Koch.
The measurement of observer agreement for categorical data.
Biometrics, 33 (1977), pp. 159-174
[32]
G.D. Tourassi, C.E. Floyd.
The effect of data sampling on the performance evaluation of artificial neural networks in medical diagnosis.
Med Decis Making, 17 (1997), pp. 186-192

Please cite this article as: Pesce K, Tajerian M, Chico MJ, Swiecicki MP, Boietti B, Frangella MJ, et al. Estudio de la variabilidad inter- e intraobservador en la determinación de la densidad mamaria según la 5.a edición del Atlas BI-RADS®. Radiología. 2020;62:481–486.

Copyright © 2020. SERAM