Scientific reasoning is an essential skill for medical students, helping them to formulate hypotheses, design experiments, and interpret data effectively. This cross-sectional observational study aimed to evaluate scientific reasoning levels among 113 medical students from both a public university and a military university in Mexico, using the Lawson Classroom Scientific Reasoning Test.
Methods: The test included 12 items, each requiring justification of the response. Data analysis involved normality tests (Kolmogorov–Smirnov and Shapiro–Wilk) and an ANOVA with Tukey's post-hoc comparisons.
Results: The findings showed that 67.26% of students demonstrated concrete reasoning, 31.86% were at the transitional formal stage, and only 0.88% attained the post-formal level. No significant differences were observed between the two groups.
Conclusion: These results highlight the importance of enhancing scientific reasoning in medical education. While the Lawson Test is a useful assessment tool, it should be complemented by broader educational strategies to cultivate deeper critical thinking skills during medical training.
Scientific reasoning underpins the capacity to generate hypotheses, derive and test predictions, and discern among competing explanations. In medical education, it enables students to apply theoretical knowledge to make informed diagnostic and therapeutic decisions. Research consistently shows that reasoning skills support academic and clinical performance, yet weaknesses in proportional, probabilistic, and correlational reasoning remain—precisely the areas most associated with errors in decision-making. Studies that utilized the Lawson Classroom Test of Scientific Reasoning (LCTSR) and related assessments confirm this pattern: students perform relatively well on conservation tasks but struggle with proportionality and probability.1,2
A growing body of research identifies metacognition and self-regulation as immediate determinants of reasoning quality in medical students. Structural models and cross-sectional surveys indicate that planning, monitoring, and reflection correlate with better performance in solving clinical problems and with dispositions toward critical thinking; group-level metacognition in collaborative environments also associates with higher satisfaction and perceived learning.3,4 Reflective practices further coordinate knowledge, affect, and strategies in complex tasks, supporting curricular interventions that privilege metacognitive reflection alongside disciplinary mastery.5
Assessment tools are decisive. Two-tier instruments like the LCTSR, which require both answers and justifications, provide profiles of reasoning subskills. However, recent psychometric studies advise caution: inconsistencies between item pairs and sensitivity to scoring methods can distort inferences about subskills, which has motivated proposals to combine strict scoring (response plus justification) with updated measurement models and transparent reporting.6,7 Parallel to this, pedagogical interventions—such as flipped classrooms with clinical scripts, team-based learning, and integrative modules—have shown measurable improvements in reasoning outcomes.8,9 Digital tools now expand the monitoring of decision-making,10 and recent reviews highlight the role of emotions in shaping reasoning.11 Despite these advances, multi-institutional studies frequently show a predominance of concrete or transitional performances, with limited evidence of consolidated formal or post-formal structures among undergraduate and early medical students. This recurring pattern across countries and disciplines12,13 underscores the need for person-centered analyses, such as cluster or latent profile methods, which reveal subgroups with distinct strengths and vulnerabilities as well as identify students situated in liminal zones.14
Within this framework, the study hypothesizes a positive association between metacognition, self-regulation, and scientific reasoning as determined by the Lawson test, and focuses on characterizing scientific reasoning among medical students from two Mexican universities. Using the 12-item Lawson test with justification-based scoring, the design combines distributive comparisons with cluster analysis to classify students into concrete, transitional, formal, and post-formal levels. The objective is to generate evidence for targeted interventions that strengthen proportional and probabilistic reasoning, hypothesis testing, and metacognitive control in early medical training.
Material and methods

Study design

A cross-sectional observational study was conducted to characterize the scientific reasoning levels of medical students through the administration of a scientific reasoning test (Lawson's test). Inclusion criteria were being enrolled in the Bachelor of Medicine program, having voluntarily agreed to participate in the study, providing informed consent, and having completed the reasoning test in its entirety.
Participants

The study population consisted of 133 second-year medical students enrolled in the Bachelor of Medicine program at the Mexican Naval Medical School (Escuela Médico Naval) and the Faculty of Medicine at the Universidad Nacional Autónoma de México (Facultad de Medicina, UNAM). The sample was divided into four groups: three from the first institution (Group I, n = 27; Group II, n = 26; Group III, n = 23) and one from the second (Group IV, n = 37).
Lawson's test

This assessment instrument has been widely validated in the literature. It consists of 12 items, each accompanied by a justification; for an answer to be considered correct, both the selected option and its justification must be correct. Reliability analysis for the 12-item version typically yields a Cronbach's alpha between 0.72 and 0.78; when adjusted with the Spearman–Brown formula, the coefficient rises to approximately 0.86. The test has also been examined using the Rasch model, which indicated that mixed models provide the best fit, with above-average reliability (coefficient of 0.680).6,7 The translated version adapted by Pérez de Landazábal was used, a tool applied in Spanish-speaking countries such as Chile and Spain.15,16 The 12-item format covers the same content as the 24-item format, differing only in the scoring method: each of the 12 answer–reasoning pairs receives one point only if both components are correct, which makes it practical for classroom use and effective at eliminating "false positives" that occur when students guess the correct answer without understanding the underlying reasoning.6 The inclusion criterion required complete responses for both the question and its justification (n = 113); 20 participants with incomplete answers were excluded from the analysis.
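The strict pair-scoring rule and the Spearman–Brown adjustment described above can be sketched in a few lines of Python. This is a minimal illustration with invented answer keys and responses, not the authors' actual scoring procedure:

```python
def lawson_score(answers, justifications, key_a, key_j):
    """Strict pair scoring: one point per item only when BOTH the
    selected answer and its justification match the key."""
    return sum(
        1 for a, j, ka, kj in zip(answers, justifications, key_a, key_j)
        if a == ka and j == kj
    )

def spearman_brown(r, factor=2):
    """Spearman-Brown prophecy formula: projected reliability of a test
    lengthened by `factor`, given current reliability `r`."""
    return factor * r / (1 + (factor - 1) * r)

# Hypothetical student: 12 answer/justification pairs scored against a key.
key_a = list("ABCDABCDABCD")
key_j = list("aaaabbbbcccc")
answers = list("ABCDABCDXXXX")         # first 8 answers correct
justifications = list("aaaabbXXccXX")  # but only 6 matching justifications

score = lawson_score(answers, justifications, key_a, key_j)
print(score)  # 6: only the items where both components are correct count

# Doubling a 12-item alpha of 0.76 gives roughly the reported ~0.86.
print(round(spearman_brown(0.76), 2))  # 0.86
```

The strict rule is what removes guessing-driven "false positives": a correct option with a wrong justification scores zero.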
Reasoning classification

The classification procedure was based on the criteria formulated by Inhelder and Piaget in 195817 and later adapted by Echiburu et al.,16 who distinguished three levels of reasoning according to the number of correct answers on Lawson's test: empirical-inductive or concrete thinking (0–4 correct answers), transitional or formal thinking (5–8 correct answers), and hypothetical-deductive or post-formal thinking (9–12 correct answers). Piaget's theory ends with formal operations as the pinnacle of cognitive development; however, adult cognition often integrates dimensions Piaget did not fully theorize, such as ambiguity, contradiction, and relativism. The post-formal concept was introduced to account for reasoning beyond formal operations, with dialectical thinking, relativistic reasoning, pragmatic problem-solving, and the integration of affect and cognition as key features. In medical and scientific education, post-formal thought is essential to prepare professionals for uncertainty, ethical dilemmas, and evidence evaluation.18 Additionally, the transitional formal level was subdivided into early transition (5–6 correct answers) and late transition (7–8 correct answers). This subdivision captures the variability observed in adults, in whom formal reasoning appears not as an abrupt leap but as a gradual process of consolidation.
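The cutoff scheme above translates directly into a small classification routine. This sketch uses the thresholds from the Echiburu et al. adaptation; the function name and label strings are our own illustration:

```python
def classify_reasoning(correct: int) -> str:
    """Map a Lawson score (0-12 correct answer-justification pairs)
    to the reasoning level used in this study."""
    if not 0 <= correct <= 12:
        raise ValueError("score must be between 0 and 12")
    if correct <= 4:
        return "empirical-inductive (concrete)"
    if correct <= 6:
        return "transitional (early)"
    if correct <= 8:
        return "transitional (late)"
    return "hypothetical-deductive (post-formal)"

print(classify_reasoning(3))  # empirical-inductive (concrete)
print(classify_reasoning(9))  # hypothetical-deductive (post-formal)
```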
Statistical analysis

The statistical analysis was conducted with IBM SPSS Statistics 25.0. Each student's score was determined by the total number of correct responses (ranging from 0 to 12). To assess the distribution of scores within each of the four groups, normality was evaluated using the Kolmogorov–Smirnov and Shapiro–Wilk tests. Subsequently, one-way ANOVA was applied to compare student scores across groups. Cluster analysis and factor analysis were conducted to explore whether test items could be grouped according to underlying dimensions, such as the type of reasoning. The Mann–Whitney U test for independent samples was used to determine whether the differences between the identified clusters were statistically significant. Additionally, a Venn diagram was constructed to visualize the number and overlap of students classified within each reasoning category. The significance level for all statistical tests was set at p < 0.05.
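As a concrete illustration of the group-comparison step, the one-way ANOVA F statistic can be computed from first principles. This is a self-contained didactic sketch with made-up scores, not the SPSS output reported here:

```python
def one_way_anova_f(groups):
    """F = between-group mean square / within-group mean square."""
    k = len(groups)                       # number of groups
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Between-group sum of squares: each group mean vs the grand mean.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    # Within-group sum of squares: each score vs its own group mean.
    ss_within = sum(
        sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups
    )
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Hypothetical Lawson scores for three small groups.
f_stat = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(round(f_stat, 2))  # 13.0
```

In practice a statistics package (here, SPSS) computes the same ratio and its p-value; a large F indicates that between-group variability exceeds within-group variability.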
Results

The distribution of students according to their performance showed that 67.26% fell within concrete reasoning (0–4 correct answers), while 31.86% reached transitional formal reasoning levels (5–8 correct answers). Only 0.88% of the students (one participant) demonstrated performance corresponding to hypothetical-deductive or post-formal reasoning (9 correct answers), and none of the participants achieved the highest scores of 10–12 correct answers (see Table 1).
Table 1. Percentage distribution of students across the three levels of thinking, by group. Only one student (0.88% of the sample) reached the hypothetical-deductive or post-formal level.
| Correct answers | Reasoning level | I (%) | II (%) | III (%) | IV (%) |
|---|---|---|---|---|---|
| 0 | Empirical-inductive or concrete | 0 | 0 | 0 | 2.7 |
| 1 | | 14.8 | 15.4 | 8.7 | 5.4 |
| 2 | | 11.1 | 3.8 | 21.7 | 13.5 |
| 3 | | 7.4 | 26.9 | 17.4 | 29.7 |
| 4 | | 18.5 | 23.1 | 21.7 | 18.9 |
| 5 | Transitional or formal | 14.8 | 15.4 | 8.7 | 8.1 |
| 6 | | 14.8 | 7.7 | 13.0 | 8.1 |
| 7 | | 11.1 | 0.0 | 4.3 | 8.1 |
| 8 | | 7.4 | 7.7 | 4.3 | 2.7 |
| 9 | Hypothetical-deductive or post-formal | 0 | 0 | 0 | 2.7 |
| 10 | | 0 | 0 | 0 | 0 |
| 11 | | 0 | 0 | 0 | 0 |
| 12 | | 0 | 0 | 0 | 0 |
Normality tests (Kolmogorov–Smirnov and Shapiro–Wilk) indicated that Groups I, II, and III followed a normal distribution, with p-values greater than 0.05. In contrast, Group IV had a p-value of less than 0.05, suggesting a potential deviation from normality. The comparison of overall scores among the four groups, conducted via ANOVA and Tukey's post-hoc tests, revealed no statistically significant differences.
A breakdown by test items showed that questions assessing elementary knowledge, such as conservation of mass (Item 1), conservation of displaced volume (Item 2), and identification and control of simple variables (Items 9 and 10), had the highest percentages of correct responses. Item 1 surpassed 90% accuracy, while Items 2, 9, and 10 ranged between 30% and 50% accuracy (see Table 2). In contrast, items requiring advanced reasoning, such as proportional reasoning (Item 4) and advanced probabilistic thinking (Items 6 and 7), had the lowest success rates, with accuracy below 20% in most groups.
Graphical representation of the data revealed a dense clustering of students in the quadrants associated with low performance, supporting the observation of a predominance of concrete reasoning. This pattern was consistent across the four groups, although slight variations were noted: Group I exhibited a marginally higher proportion of students in the transitional formal reasoning level compared to Groups II, III, and IV (see Fig. 1).
Cluster analysis (see Fig. 2), based on regression factor scores (REGR), revealed two orthogonal dimensions and two clearly differentiated clusters.
- Cluster 1 (blue lines) was predominantly located in the negative region of the horizontal axis, with greater dispersion across both axes. This cluster reflects oscillation between concrete and transitional reasoning, with less consistent response patterns, representing participants at the initial levels of reasoning.
- Cluster 2 (pink lines) was located in the positive region of the horizontal axis and showed less internal dispersion. This group represents participants who responded more consistently and thus achieved formal levels of reasoning.
Cluster analysis based on Ward's method revealed two distinct groups, suggesting divergent reasoning profiles among the participants. Because the normality test yielded p < 0.05, the Mann–Whitney U test was applied to determine whether the differences between Cluster 1 and Cluster 2 were statistically significant, with the grouping variable defined by the Ward clustering algorithm. The test result (p < 0.005) indicated that the differences between the two clusters were statistically significant; specifically, Cluster 2 comprised students with higher scores on Lawson's test.
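The cluster-comparison step can be illustrated with a minimal Mann–Whitney U computation using the pair-counting definition of the statistic. The scores below are invented for illustration; the actual analysis was run in SPSS, which also supplies the p-value:

```python
def mann_whitney_u(a, b):
    """U for sample `a`: count of pairs (x, y) with x > y,
    counting ties as 0.5 (pair-counting definition of U)."""
    return sum(
        1.0 if x > y else 0.5 if x == y else 0.0
        for x in a for y in b
    )

# Invented Lawson scores for two clusters, higher scores in cluster 2.
cluster1 = [1, 2, 2, 3, 4]
cluster2 = [5, 6, 6, 7, 8]
u1 = mann_whitney_u(cluster1, cluster2)
u2 = mann_whitney_u(cluster2, cluster1)
print(u1, u2)  # 0.0 25.0: complete separation, and U1 + U2 = n1 * n2
```

An extreme U (near 0 or near n1 * n2) corresponds to strong separation between the groups, which is the pattern reported for the two clusters here.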
When categorizing students by reasoning type, two main sets were observed: those exhibiting concrete reasoning (n = 76; 67.26%) and those classified under formal reasoning (n = 36; 31.86%). The intersection between both sets included 36 students, comprising 25 who displayed predominantly concrete reasoning and 11 who demonstrated predominantly formal reasoning. Only one student (0.88%) reached the post-formal reasoning level, obtaining a score of 9 correct responses on Lawson's test. These distributions are illustrated in Fig. 3.
Discussion

The findings of this study show that nearly seven out of ten students remain at concrete levels of reasoning, with only 0.88% reaching post-formal stages. This pattern, rather than constituting an isolated anomaly, reflects a structural tendency that confirms the lag in the transition toward advanced forms of thought. When analyzed through Piaget's theory of equilibration, this imbalance represents an interruption in the processes of assimilation and accommodation that enable the reorganization of experience within formal logical frameworks.19 Recent studies corroborate that, even in university settings, a large proportion of students display persistent difficulties in variable control, proportionality, and probabilistic reasoning—core components of scientific reasoning.1,20
Low performance on complex items cannot be attributed solely to deficiencies in disciplinary content; it also reflects the absence of flexible cognitive structures that support abstract and reversible inferences. Research in medical education indicates that instructional information overload, without prior cognitive scaffolding, contributes to the crystallization of rigid schemas and resistance to hypothetical-deductive reasoning.21,22 This phenomenon has also been documented in science and engineering programs, where students typically reach late formal levels only under conditions of explicit and sustained instruction.2
Cluster analysis adds an additional dimension to this interpretation. The coexistence of a small group with coherent formal reasoning and a majority with fluctuating trajectories confirms that cognitive development in higher education progresses through heterogeneous and discontinuous pathways. As Cremona et al. argue, these transitions reflect liminal states in which thought oscillates between correct and inconsistent responses, evidencing structural instability rather than linear progression.23 This finding aligns with studies describing the emergence of hybrid reasoning profiles, in which partial acquisition of competencies coexists with profound limitations in probabilistic and proportional inferences.12,24
A critical aspect revealed is the close association between low Lawson test scores and the absence of metacognitive strategies. The theory of self-regulated learning explains this relationship: without planning, monitoring, and evaluating their own cognitive activity, students face greater barriers to advancing toward formal levels.25,26 Studies in medical education reinforce this link, showing that self-reflection and metacognitive awareness correlate with stronger performance in critical thinking and clinical reasoning tests.27–29 Consequently, instruction centered exclusively on content delivery proves insufficient; pedagogy must foster self-regulation, anticipation, and deliberate reflection.
From an epistemological perspective, these findings challenge the implicit assumption in many medical programs that scientific reasoning emerges spontaneously through repeated practice of clinical tasks. On the contrary, the data presented here support a constructivist approach: scientific reasoning must be intentionally cultivated in environments that combine progressive cognitive challenges, timely feedback, and guided reflection spaces. Otherwise, the risk arises of forming professionals who are technically competent yet epistemologically unstable, unable to discriminate between robust evidence, accumulated experience, and faulty heuristics.30,31
Moreover, the different levels of reasoning can be interpreted as transitional states that demarcate boundary zones between cognitive categories. The fact that virtually all students were classified within concrete or formal frameworks supports the ecological validity of the Lawson test, particularly in educational systems where the evaluation of complex competencies is substituted with memory-based metrics. This evidence converges with recommendations to expand assessment instruments toward inferential, symbolic, and metacognitive domains.32
In sum, the evidence presented underscores that the development of scientific reasoning in higher education, particularly in medical training, cannot be left to chance or assumed to arise naturally from exposure to disciplinary content. The predominance of concrete reasoning levels, the instability of transitional cognitive states, and the absence of metacognitive regulation reveal structural barriers that impede progression toward advanced thought. These findings highlight the need for curricular models that scaffold cognitive growth, integrate reflective practices, and expand assessment beyond rote memorization toward inferential and symbolic competencies. Strategies such as adaptive virtual environments, case analysis under conditions of structural uncertainty, metacognitive training, and evaluation rubrics centered on inferential coherence can activate higher-order processes and consolidate formal reasoning.8,9 For instance, one study found that problem-solving in computer-based virtual environment simulations may be more effective for learning clinical reasoning skills than theoretical instruction.33 Another study examined how entry-level students approach clinical problem solving by tracking the types of hypotheses and reasoning errors they made as they progressed through a physical therapy curriculum: students made fewer analytical errors as they advanced, and the findings showed that appropriately designed case-based learning activities help students develop into more adaptive and flexible thinkers.34 Only through intentional pedagogical strategies that cultivate operational intelligence can professionals be formed who navigate complexity, distinguish evidence from heuristics, and exercise sound scientific judgment in practice.
Our findings reflect the cognitive profiles of medical students and may not generalize to other health professions or STEM disciplines with distinct curricula. In addition, sample size constrains estimates of prevalence and effect sizes; therefore, replication with larger, diverse cohorts is required to stabilize inferences. Future work should employ stratified, multi-institutional sampling across academic years to capture developmental trajectories and curricular variability and compare medical students with nursing, engineering, and natural sciences cohorts, for instance, to identify reasoning barriers.
Finally, although this study employed a cross-sectional design and a sample limited to two Mexican institutions, its results open a line of research centered on modeling the development of scientific thought as a dynamic process shaped by contextual, affective, and epistemic variables. To advance in that direction, it becomes necessary to trace individual trajectories using instruments sensitive to qualitative change, with the aim of constructing a more precise cognitive cartography that guides formative reforms toward the integration of critical thinking, self-regulation, and scientific reasoning in medical education.
Contributions

Francisco Estrada-Rojo contributed to Conceptualization, Methodology, Investigation, Data curation, Formal analysis, Writing – original draft, and Writing – review & editing. Alejandro Hernández-Chávez contributed to Methodology, Investigation, Data curation, Formal analysis, and Writing – original draft. Armando Muñoz-Comonfort contributed to Methodology, Formal analysis, Validation, and Writing – review & editing. Gustavo López-Toledo contributed to Formal analysis, Validation, Writing – original draft, and Writing – review & editing. Laura Gómez-Virgilio contributed to Formal analysis, Visualization, Writing – original draft, and Writing – review & editing. Francisco Estrada-Bernal contributed to Methodology, Formal analysis, Validation, and Writing – review & editing. Raúl Sampieri-Cabrera assumed responsibility for Conceptualization, Supervision, Project administration, Formal analysis, Writing – original draft, and Writing – review & editing, with overarching responsibility for the conceptual integration and academic supervision of the manuscript.
Informed consent

Informed consent was obtained from all participants enrolled in the study.
Ethical considerations

This study adhered to the ethical principles of observational educational research in accordance with the Declaration of Helsinki. Participant confidentiality and anonymity were ensured. No incentives were provided to the participants. The study protocol was approved by the Research Ethics Committee of the Faculty of Medicine, UNAM (approval code: FM/DI/049/2023).
Funding

No funding was received for conducting this study.
Use of Artificial Intelligence (AI)

The authors declare that no AI was used at any stage of the research or manuscript development.
Conflicts of interest

The authors have no conflicts of interest to declare.
Acknowledgements

The authors would like to thank the Department of Physiology, Faculty of Medicine, Universidad Nacional Autónoma de México (UNAM), and the Mexican Naval Medical School for their support in this research. LGV is supported by the “Postdoctoral Fellowships in National Institutions” program from CITNOVA.