This study aims to assess the degree of interobserver agreement in the interpretation of radiological tests used for forensic age estimation in migrant minors without family references, both between the radiologist and the forensic doctor and between the two forensic doctors who evaluate each case.
Materials and methods
This is a prospective study of the 69 cases studied at the Institute of Legal Medicine and Forensic Sciences of Alicante during 2023 and 2024. Each observer assessed and assigned a stage, following the Greulich and Pyle atlas for hand and wrist radiographs, the Demirjian stages for panoramic radiographs, and the Schmeling and Kellinghaus classifications for computed tomography scans of the proximal clavicle.
The IBM SPSS Statistics 28.0 statistical package was used to obtain both the percentage of observed agreement and the Kappa index, which corrects for agreement due to chance.
Results
The degree of agreement between the two forensic doctors was very good for the hand and wrist radiograph (k = 0.931), very good for the panoramic radiograph (k = 0.828), and substantial for the computed tomography (k = 0.671). However, the degree of agreement between the radiologist and the forensic doctor was lower for all three tests: substantial for the hand and wrist radiograph (k = 0.748), moderate for the panoramic radiograph (k = 0.434), and fair for the computed tomography (k = 0.285).
Conclusions
These results highlight the importance of training and instruction in the forensic medical evaluation of these radiological tests.
The medico-forensic age estimation of migrant minors without family references is a frequent expert activity.1
The Forensic Age Estimation Unit for Minors (UEFEM) of the Institute of Legal Medicine and Forensic Sciences (IMLCF) in Alicante aims to study, recognise, and subsequently report on all requests for age estimation received from different judicial bodies.2
The UEFEM has the following protocol for action:
- Identification of the expert and, if necessary, the interpreter; collection of filiation data: place and date of birth, journey to Spain and its duration, involvement of third parties, etc.
- Accompanying persons are allowed to be present during the examination at the discretion of the expert.
- Informed consent is requested for the physical examination and for each of the three radiological tests: hand and wrist X-ray, orthopantomography (OPG), and computed tomography (CT) of the medial clavicular epiphysis (MCE). The person giving consent must be asked to confirm that they have understood the content of the consent forms they are about to sign, and any questions they have must be answered.
- Medical history: pathological history, episodes of malnutrition, illnesses, surgical procedures, strenuous work activities, and competitive sports.
- An investigation is carried out to determine whether there are any physical or psychological injuries due to possible abuse. If so, with prior consent, these are examined.
- Sex, weight, height, body mass index, ethnic origin (important for literature search), constitutional type and general state of maturity are recorded.
- No examination of the genitals, breasts, or full nudity is performed, except in cases where sexual assault or genital mutilation is suspected and always with consent.
- Through the medical history and physical examination, an attempt is made to identify or rule out growth and development disorders. This is because inferring chronological age from biological age (based on skeletal and dental age) can only be done for individuals with no findings in this regard.
- Examination of the oral cavity, with special attention to the presence or absence of third molars, primarily mandibular.
At this point, the interview is concluded, no further examination is necessary, and the requested radiological tests are awaited to finalise the report.
The protocol establishes, as a quality control measure, that all reports must be signed by two forensic doctors (FD), one of whom must belong to the unit. Only in ‘urgent’ cases (usually from courts of first instance on duty) will the signature of one doctor be sufficient.
This protocol also states that the two signatories will independently view and evaluate the radiological tests, and that they will reach a consensus on the final assessment. If they disagree, a third member of the unit will be consulted. The radiological reports will not be examined before or during assessment of the images.
The UEFEM adheres to national and international recommendations, guidelines, and protocols for this type of assessment.3,4 Initially, an X-ray will be performed to assess the state of metaphyseal fusion of certain bones in the hand and wrist. Then, an OPG will be performed to study the degree of maturation of the third molars. If there is a discrepancy between the two lower third molars, the most advanced stage will be chosen. If any of these tests indicate that the maturation process has not been completed, the individual will be considered under 18 years of age, in which case, the minimum age requirement will apply. If both tests (X-ray and OPG) indicate that the maturation process has been completed, a thin-slice (less than 1 mm) CT scan of the MCEs will be performed in axial and coronal projections. Both two-dimensional images and volumetric reconstructions will be analysed. In the event of a discrepancy between the two MCEs, the most advanced stage will be chosen.
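The stepwise logic of this protocol can be sketched in a few lines of Python. This is a minimal illustration, not the unit's actual tooling: the function and constant names are hypothetical, and it assumes that a completed third molar corresponds to Demirjian stage H and that completion of hand and wrist maturation is supplied as a yes/no judgement by the observer.

```python
# Hypothetical sketch of the stepwise protocol described above.
# Stage orderings run from least to most mature.
DEMIRJIAN_ORDER = ["A", "B", "C", "D", "E", "F", "G", "H"]
CLAVICLE_ORDER = ["1", "2a", "2b", "2c", "3a", "3b", "3c", "4", "5"]


def most_advanced(stage_left, stage_right, order=DEMIRJIAN_ORDER):
    """Return the more advanced of two stages (left vs. right side)."""
    return max(stage_left, stage_right, key=order.index)


def needs_clavicle_ct(hand_wrist_complete, molar_left, molar_right):
    """CT of the MCE is requested only when both earlier tests are complete.

    Assumption: 'complete' third-molar maturation means Demirjian stage H.
    """
    opg_complete = most_advanced(molar_left, molar_right) == "H"
    return hand_wrist_complete and opg_complete


# Discrepant lower third molars: the most advanced stage (H) is chosen.
print(needs_clavicle_ct(True, "G", "H"))
# Discrepant clavicles are resolved the same way.
print(most_advanced("3a", "3c", CLAVICLE_ORDER))
```

If either of the first two tests is incomplete, the function returns `False`, which mirrors the rule that the individual is then considered under 18 without proceeding to CT.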
The objective of this study is to establish the degree of agreement between observers who assess the different radiological tests blindly: first, between the two FDs involved in each case and, second, between the radiologist (R) who performs and reports the tests and the FDs' final assessment.
Material and method
Material
Between 2023 and 2024, the IMLCF in Alicante estimated the age of 69 unaccompanied migrant children at the request of the juvenile prosecutor's office. Of these, 23 were assessed in 2023 and 46 in 2024. All were male.
In accordance with the UEFEM protocol, following the initial (and only) assessment, 68 of them underwent X-rays (one did not show up at the hospital after escaping from the centre where they were being held), 65 underwent OPGs, and 46 underwent CT scans of the MCE, all at the same hospital.
Once these tests had been carried out, the hospital sent a DVD containing the images and radiological reports to the juvenile prosecutor's office, which was then forwarded to the IMLCF for assessment by two FDs.
Table 1 shows all the radiological tests carried out, as well as those that could be studied by two observers.
It should be noted that it was not possible to assess the degree of agreement between the R and the FD in all tests, since sometimes, only the images were provided without a radiological report and vice versa.
All the tests performed, however, could be studied and assessed by two FDs, except for 2 X-rays, where this was not possible as the images were not provided in digital format.
Method
This is a prospective study. To establish the degree of agreement between FD1 and FD2, both doctors (one more experienced than the other) assessed the tests separately and independently, without consulting the radiology report. If they disagreed, they reached a final decision by consensus after reviewing and discussing the images together. If they could not reach a consensus, another member of the unit was consulted.
In the study of agreement between R and FD, the final FD assessment described above was compared with the stage reported by the R. Note that the R assessing the images was not always the same person, so the R's experience in assessing and staging OPGs and CT scans of the MCE depended on who was in charge of the study.
X-rays were assessed according to Greulich and Pyle's atlas standards,5 OPGs according to Demirjian et al.'s assessment system,6 and the CT scans of the MCE according to Schmeling et al.’s7 and Kellinghaus et al.'s staging schemes8 (Table 2).
The CT scans of the MCE were evaluated using two-dimensional images (MicroDicom Viewer, version 3.1.4) and volumetric reconstruction (Invesalius, version 3.1). The algorithm developed by Wittschieber et al.9 was followed, posing four specific questions (A–D). Each question has only two possible answers. In addition, each question is accompanied by further descriptions and explanations detailing the differentiation criteria.
The data were processed using the IBM SPSS Statistics 28.0 statistical package.
The percentage of agreement between the two coding doctors was obtained, ranging from 0% (total disagreement) to 100% (full agreement).
However, the disadvantage of using this calculation as an indicator of reproducibility is that even if the two coders used independent criteria, there would still be a certain degree of agreement by chance. In other words, there may be a coincidence in the result due to nothing more than pure chance rather than the application of the same criteria to the decision.10
Goyanes and Piñero-Naval11 state that, as this percentage is biased because it does not take random agreement into account, it should be collected alongside Cohen's Kappa index (k), as is done in this study.
The k index is used to evaluate agreement between two or more evaluators and between two categorical samples12 and corrects for agreement due to chance alone. In other words, it corresponds to the agreement between observers that goes beyond any chance agreement that may exist between them. For this reason, it is used to determine the reliability between the 2 evaluators.
The range of this statistical test is from −1 to +1, with −1 indicating total disagreement (the reviewers did not agree on any observations), 0 indicating that the agreement was no higher than random agreement, and +1 indicating total agreement between the two coders.11
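As an illustration of how the index corrects for chance, the following minimal Python sketch computes the observed agreement and Cohen's kappa from two lists of stage assignments using the standard definition k = (p_o − p_e) / (1 − p_e). The symbols p_o (observed agreement) and p_e (agreement expected by chance), the function name, and the example ratings are illustrative assumptions and are not taken from the study data, which were analysed in SPSS.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Return (observed agreement, Cohen's kappa) for two raters."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("Both raters must score the same non-empty set of cases")
    n = len(ratings_a)
    # Proportion of cases in which both raters assigned the same category.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of each rater's marginal proportions,
    # summed over categories.
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / n**2
    if p_e == 1:
        return p_o, float("nan")  # kappa is undefined if chance agreement is total
    return p_o, (p_o - p_e) / (1 - p_e)


# Made-up Demirjian stages for six hypothetical cases.
fd1 = ["H", "H", "G", "G", "H", "F"]
fd2 = ["H", "H", "G", "H", "H", "F"]
p_o, kappa = cohen_kappa(fd1, fd2)
print(f"observed agreement = {p_o:.3f}, kappa = {kappa:.3f}")
```

In this toy example the raters agree in 5 of 6 cases (83.3%), yet kappa is only about 0.71, because a sizeable share of that agreement would be expected by chance given how often both raters choose stage H.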
The results and significance according to Altman13 are shown in Table 3.
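For readers without the table at hand, the mapping from k to a verbal label can be sketched as below. The cut-offs are an assumption: they follow the commonly cited Altman bands (0.20, 0.40, 0.60, 0.80), and the labels are the ones this article uses in its results, so the function is illustrative rather than a reproduction of Table 3.

```python
def strength_of_agreement(k):
    """Map a kappa value to a verbal label (assumed cut-offs, see lead-in)."""
    if k <= 0.20:
        return "poor"
    if k <= 0.40:
        return "fair"
    if k <= 0.60:
        return "moderate"
    if k <= 0.80:
        return "substantial"
    return "very good"


# Kappa values reported in this study.
for k in (0.931, 0.828, 0.671, 0.748, 0.434, 0.285):
    print(k, strength_of_agreement(k))
```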
The confidence interval and the p-value were calculated for each value of k with a significance level (α) of 0.05.
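A rough idea of how such an interval is built is given by the sketch below, which uses a simple large-sample approximation of the standard error, SE ≈ sqrt(p_o(1 − p_o) / (n(1 − p_e)²)), and a normal (Wald) interval k ± z·SE. This is an assumption made for illustration: SPSS uses its own asymptotic standard error, so the function will not reproduce the reported intervals exactly, and the function name and inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def kappa_wald_ci(p_o, p_e, n, alpha=0.05):
    """Return (kappa, lower, upper) using an approximate standard error."""
    kappa = (p_o - p_e) / (1 - p_e)
    se = sqrt(p_o * (1 - p_o) / (n * (1 - p_e) ** 2))
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return kappa, kappa - z * se, kappa + z * se


# Illustrative inputs, not study data.
k, low, high = kappa_wald_ci(p_o=0.90, p_e=0.50, n=100)
print(f"k = {k:.3f} (95% CI: {low:.3f} to {high:.3f})")
```

Because the interval is symmetric and not truncated, its upper bound can exceed 1 when k is close to 1 and the sample is small, which is how a limit such as 1.027 for the hand and wrist X-ray can arise.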
Results
Agreement between forensic doctor 1 and forensic doctor 2 (Table 4)
Hand and wrist X-ray
A total of 66 X-rays were studied and classified according to Greulich and Pyle.5 Both professionals agreed on the standard of 19 years on 47 occasions, on the standard of 18 years on 9 occasions, on the standard of 17 years on 7 occasions, and on the standard of 15 years on one occasion. They disagreed on only two occasions, regarding the standards of 18 years and 19 years. The percentage of agreement was 96.97%, with a measure of agreement k = 0.931 (95% CI: 0.835–1.027), p < .001, indicating very good interobserver agreement.
Orthopantomography
A total of 65 OPGs were evaluated. On 38 occasions, both doctors agreed on stage H, on 13 occasions on stage G, on 3 occasions on stage F, and on 2 occasions on stage E, with a 90.77% agreement rate and a k = 0.828 (95% CI: 0.695–0.961), p < .001, which indicates a very good degree of agreement.
The G and H stages of Demirjian et al.6 are primarily responsible for the discrepancies.
Computed tomography
A total of 46 CT scans of the MCE were studied, and no developmental abnormalities or anatomical variants were found in any of them. The percentage of agreement observed was 71.74%. To establish the reliability of the two observers, taking into account the probability of random agreement, Cohen's Kappa was used, which gave a k = 0.671 (95% CI: 0.520–0.822), p < .001, suggesting a substantial degree of agreement.
On 3 occasions, the observers agreed on stage 2a, on 4 occasions on stage 2b, on one occasion on stage 2c, on 7 occasions on stage 3a, on 2 occasions on stage 3b, on 6 occasions on stage 3c, on 6 occasions on stage 4, and on 4 occasions on stage 5.
No disagreements were found in this group in stages 1, 4, and 5; when disagreements did occur, they were in the substages of Kellinghaus et al.8
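The observed agreement for the CT scans can be checked directly from the per-stage counts given above, and the chance agreement implied by the reported k can be recovered by rearranging the kappa definition to p_e = (p_o − k) / (1 − k). The sketch below does both; the variable names are illustrative, and the implied p_e is a back-calculation, not a figure reported by the study.

```python
# Cases in which FD1 and FD2 assigned the same clavicular stage (from the text).
agreed_cases_by_stage = {
    "2a": 3, "2b": 4, "2c": 1,
    "3a": 7, "3b": 2, "3c": 6,
    "4": 6, "5": 4,
}
n_scans = 46
reported_kappa = 0.671

n_agreed = sum(agreed_cases_by_stage.values())
p_o = n_agreed / n_scans
# Rearranged kappa definition: chance agreement implied by p_o and k.
implied_p_e = (p_o - reported_kappa) / (1 - reported_kappa)

print(f"{n_agreed} of {n_scans} scans agreed: {100 * p_o:.2f}%")
print(f"implied chance agreement: {implied_p_e:.3f}")
```

The counts sum to 33 agreements out of 46 scans, which matches the reported 71.74%. The implied chance agreement is low (about 0.14) because the nine possible stages spread the ratings thinly, so k stays close to the raw percentage.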
Table 4 shows the results of the k index of the two observers FD1 vs. FD2.
Agreement between radiologist and forensic doctor (Table 5)
Hand and wrist X-ray
A total of 61 X-rays were analysed according to Greulich and Pyle.5 Both doctors agreed in 86.89% of cases as follows: standard of 19 years in 37 cases, 18 years in 8 cases, 17 years in 7 cases, and 15 years in 1 case. There were only 8 disagreements, on the standards of 18 years vs. 17 years and 18 years vs. 19 years.
Table 5. Interobserver agreement between radiologist and forensic doctor.
| | n | % agreement (observed) | k | Strength of agreement |
|---|---|---|---|---|
| X-ray | 61 | 86.89 | 0.748 | Substantial |
| OPG | 54 | 68.5 | 0.434 | Moderate |
| CT | 22 | 40.91 | 0.285 | Fair |
CT: computed tomography; k: kappa index; OPG: orthopantomography; X-ray: hand and wrist X-ray.
For the hand and wrist X-ray, the measure of agreement was k = 0.748 (95% CI: 0.593–0.903), p < .001, indicating substantial interobserver agreement.
Orthopantomography
A total of 54 OPGs were studied. On 25 occasions, both doctors agreed on stage H, on 8 occasions on stage G, on 3 occasions on stage F, and on one occasion on stage E. The percentage of agreement was 68.5% and k = 0.434 (95% CI: 0.214–0.654), p < .001, which expresses a moderate degree of agreement.
Demirjian et al.6 stages G and H are mainly responsible for the disagreements.
Computed tomography
Twenty-two CT scans of the MCE were studied, in which no developmental abnormalities or shape variants were found, with a percentage of agreement of 40.91%. In this case, k = 0.285 (95% CI: 0.099–0.471), p < .001, suggesting a fair strength of agreement.
On one occasion, the observers agreed on stage 1, on 6 on stage 4, and on 2 on stage 5.
The greatest disagreements occurred on the substages of Kellinghaus et al.8 Two cases of disagreements between very different stages are noteworthy: 1 vs. 4 and 1 vs. 5.
Table 5 shows the results of the k index of the 2 observers: R vs. FD.
In all the radiological tests, there was greater agreement between the two FDs than between the R and the FD (Table 6).
Table 6. Comparison of the degree of agreement between the two groups of doctors.
| | FD1 vs. FD2 (k) | R vs. FD (k) |
|---|---|---|
| X-ray | 0.931 (very good) | 0.748 (substantial) |
| OPG | 0.828 (very good) | 0.434 (moderate) |
| CT | 0.671 (substantial) | 0.285 (fair) |
CT: computed tomography; FD: forensic doctor; k: kappa index; OPG: orthopantomography; R: radiologist; X-ray: hand and wrist X-ray.
Discussion
Incorporating a reliability coefficient into the coding process is essential for reproducible analyses and valid statistical inferences in any scientific discipline. In fact, the absence of such a coefficient would be sufficient grounds to call the findings into question and reject the study.11
We used the k index. When the observers were the two FDs, its value in the three tests reveals very good interobserver agreement for X-ray and OPG and substantial agreement for CT (Table 6). This highlights the importance of training and education in interpreting these three radiological tests.
When one of the observers was an R and the other an FD, the degree of agreement decreased, primarily for the two most difficult radiological tests: OPG and CT (Table 6). Assessment of X-ray continues to show a substantial degree of agreement.
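The size of this decrease can be made explicit with a short calculation on the k values from Table 6. The sketch below is purely descriptive arithmetic (no significance test is implied), and the variable names are illustrative.

```python
# Kappa values reported in Table 6: (FD1 vs. FD2, R vs. FD).
kappa_by_test = {
    "X-ray": (0.931, 0.748),
    "OPG": (0.828, 0.434),
    "CT": (0.671, 0.285),
}

# Absolute drop in kappa when one of the observers is the radiologist.
drops = {test: round(fd - r, 3) for test, (fd, r) in kappa_by_test.items()}
for test, drop in drops.items():
    print(f"{test}: kappa falls by {drop:.3f}")
```

The drop is roughly 0.18 for the X-ray but close to 0.39 for both the OPG and the CT, consistent with the observation that the loss of agreement is concentrated in the two more difficult tests.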
The previous experience of each observer in interpreting radiological tests is an important factor.
A significant increase in interobserver and intraobserver agreement has been reported as observers gain experience in applying each method, and agreement between observers increases significantly when training sessions are held to familiarise them with the anatomical region and the classification systems.14
Wittschieber et al.15 state that correctly determining clavicular ossification stages requires a high level of specialised training. In their study of the influence of observer qualification on the staging of MCE using CT, these authors conclude that k values depend directly on the examiners' qualifications. Starting from moderate interobserver agreement (k = 0.470; 44.1% of cases misclassified) over six evaluation sessions, they found that agreement improved by 28%, reaching a final value of k = 0.605 (31.4% of cases misclassified). This level of agreement can be considered good. This confirms the importance of training in this type of imaging technique. In the case of CT, most interpretation errors arise from cases in which inexperienced observers attempt to assign a maturation stage in the presence of developmental abnormalities or other situations that would actually prevent the correct stage from being assigned.15 After these cases, most errors in applying the Schmeling system arise between stages 4 vs. 5, 3 vs. 4, and 4 vs. 3. With the Kellinghaus system, most errors occur between stages 3b vs. 3c and 3c vs. 4.
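The improvement quoted from Wittschieber et al. can be verified with simple arithmetic on the figures cited above. The sketch below only restates those numbers, and the variable names are illustrative.

```python
# Figures cited above from Wittschieber et al.
kappa_first, kappa_last = 0.470, 0.605
misclassified_first, misclassified_last = 44.1, 31.4  # percent of cases

# Relative gain in kappa across the six evaluation sessions.
relative_gain = (kappa_last - kappa_first) / kappa_first
# Fall in the share of misclassified cases, in percentage points.
misclassification_drop = misclassified_first - misclassified_last

print(f"relative gain in kappa: {100 * relative_gain:.1f}%")
print(f"misclassification fell by {misclassification_drop:.1f} points")
```

The relative gain works out to about 28.7%, in line with the 28% quoted, while the proportion of misclassified cases falls by 12.7 percentage points.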
In our study, the disagreements between FDs arose mainly in the assignment of the Kellinghaus substages of stages 2 and 3. There was always agreement in stages 1, 4, and 5.
We hold regular UEFEM clinical sessions at the IMLCF in Alicante to review cases that have already been studied. All professionals specialising in forensic medicine who wish to learn more about this field are invited to participate.
In order to increase the reliability of the determinations of MCE stage, there should be at least two experienced examiners.15
Most of the studies consulted for this paper use at least two observers, and all use a blind evaluation methodology. Chaumoitre et al.16 and Tisè et al.17 each use 2 observers; one of these observers is a forensic pathologist, and the other is a paediatric radiologist. For OPGs, Uys et al.18 and Martin de las Heras et al.19 use 2 evaluators, and Garamendi et al.20 use 3, 2 of whom are expert FDs. In the study of CTs, Milenkovic et al.,21 Ekizoglu et al.,22 El Morsi et al.,23 Franklin and Favel,24 and Pattamapaspong et al.25 also use 2 observers.
This is the protocol at the UEFEM. As previously noted, except in emergency situations such as on-call duties, there are always two FDs as observers and, in case of disagreement, a third FD from the UEFEM intervenes.
Conclusions
The first conclusion of this study is that experience is important for IMLCF observers when interpreting radiological images. This is achieved through continuous training (clinical sessions, courses, attendance at conferences and congresses, etc.) and the transfer of knowledge from more experienced FDs to those who are new to this field.
The low level of agreement between the radiologist (R) and the FD confirms the need for the FD to view the radiological images and assign a stage themselves, rather than simply copying the radiological report.
Funding
No specific support from public sector agencies, commercial sector, or not-for-profit organisations was received for this research study.
The authors have no conflict of interest to declare.
Please cite this article as: Rodes Lloret F, Pérez Campello GI, Galiana Vila P, Alegre Requena A, Pastor Bravo M, Gavilán Turiño E. Interobserver agreement in the interpretation of radiological tests for age estimation in migrant minors without family references. Revista Española de Medicina Legal. 2025. https://doi.org/10.1016/j.remle.2025.500456.







