
Effect of Artificial Intelligence Tutoring vs Expert Instruction on Learning Simulated Surgical Skills Among Medical Students

Authors: Ali M. Fazlollahi, Mohamad Bakhaidar, Ahmad Alsayegh, Recai Yilmaz, Alexander Winkler-Schwartz, Nykan Mirchi, Ian Langleben, Nicole Ledwos, Abdulrahman J. Sabbagh, Khalid Bajunaid, Jason M. Harley, Rolando F. Del Maestro
Journal: JAMA Network Open
Year: 2022
DOI: 10.1001/jamanetworkopen.2021.49008
Citations: 236

TL;DR

AI-based audiovisual feedback (the Virtual Operative Assistant) improved the automated Expertise Score by 0.66 points on a -1.00 to 1.00 scale compared with remote expert instruction and by 0.65 points compared with no feedback, with no significant differences in emotional states or cognitive load. Blinded human raters did not find a significant difference in global OSATS ratings, so the result suggests, but does not establish, that automated, metric-based feedback can outperform human instruction for procedural learning in simulation.

What they tested

The researchers compared three conditions for teaching medical students how to perform a virtual reality brain tumor resection:

**AI tutoring (VOA group):** Students received automated, audiovisual feedback from the Virtual Operative Assistant after each practice session. The VOA analyzed their performance using a validated algorithm (the Intelligent Continuous Expertise Monitoring System, or ICEMS) and provided specific metric-based feedback — e.g., showing where they applied too much force, how efficiently they removed tumor tissue, and where they violated healthy brain tissue.

**Remote expert instruction (instructor group):** Students received synchronous, verbal, scripted debriefing and instruction from a remote expert surgeon via video call. The expert watched their performance and gave real-time feedback, but the feedback was verbal only — no visual overlays or metric displays.

**Control group:** Students received no feedback at all during practice sessions.

All groups completed the same training structure: 5 practice sessions (each followed by 5 minutes of feedback for the treatment groups, or rest for controls), then one final realistic virtual reality brain tumor resection that served as the test of learning and retention.

The primary outcome measures were:

1. **Expertise Score** (from ICEMS, range -1.00 to 1.00) — an automated, algorithm-based assessment of technical skill during each practice resection and the final realistic resection.

2. **Objective Structured Assessment of Technical Skills (OSATS)** — a human-rated scale (range 1–7) used by blinded expert reviewers to assess the final realistic resection video.

Secondary outcomes included:

Self-reported cognitive load after the intervention (using a validated scale)

Self-reported emotional states (positive activating, negative, and deactivating emotions) before, during, and after the intervention

Who was studied

**Sample size:** 70 medical students

**Population:** Undergraduate medical students (years 0–2) from 4 institutions in Canada

**Demographics:** 41 women (59%), 29 men (41%); mean age 21.8 years (SD 2.3 years)

**Setting:** McGill Neurosurgical Simulation and Artificial Intelligence Learning Centre, Montreal, Canada

**Time period:** January to April 2021

**Exclusions:** None reported; all 70 randomized participants were included in the final intention-to-treat analysis

How they measured it

**Expertise Score (ICEMS):** A validated, automated assessment algorithm that analyzes surgical performance from virtual reality simulation data. The score ranges from -1.00 (novice) to 1.00 (expert). It is computed from metrics including: force applied, efficiency of movement, amount of healthy tissue removed, amount of tumor removed, and time to completion. The algorithm was developed and validated in prior work by the same research group.

**OSATS (Objective Structured Assessment of Technical Skills):** A widely used, human-rated scale for surgical skills. The global rating scale ranges from 1 (very poor) to 7 (excellent). It includes subscores for: respect for tissue, time and motion, instrument handling, knowledge of instruments, flow of operation, use of assistants, and knowledge of specific procedure. In this study, two blinded expert surgeons (not involved in the training) rated video recordings of the final realistic resection.

**Cognitive load:** Measured using a validated self-report questionnaire (the NASA Task Load Index, or NASA-TLX, adapted for surgical simulation). Participants rated mental demand, physical demand, temporal demand, performance, effort, and frustration on a 0–100 scale.

**Emotional states:** Measured using a self-report scale adapted from the Achievement Emotions Questionnaire. Participants rated the intensity of positive activating emotions (e.g., enjoyment, pride), negative emotions (e.g., anxiety, shame), and deactivating emotions (e.g., boredom, relaxation) before, during, and after the intervention.

Methodology

**Study design:** This was an instructor-blinded, three-arm, parallel-group randomized clinical trial (RCT). Participants were individually randomized to one of three groups: VOA (AI feedback), instructor (remote expert feedback), or control (no feedback).

**Randomization:** Participants were randomized using a computer-generated random sequence. The allocation was concealed — the researchers enrolling participants did not know which group the next participant would be assigned to. This prevents selection bias.
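The summary says only that a computer-generated random sequence was used. As an illustration of how such a sequence can keep three arms balanced, here is a minimal sketch using permuted blocks of three. The block method, the seed, and the function name `allocation_sequence` are assumptions for illustration, not details from the paper.

```python
import random

ARMS = ("VOA", "instructor", "control")

def allocation_sequence(n_participants, seed):
    """Return arm labels for n_participants, shuffled in balanced blocks of three.

    Hypothetical helper: the trial's actual randomization procedure is not
    described in this summary.
    """
    rng = random.Random(seed)  # a fixed seed makes the sequence reproducible
    sequence = []
    while len(sequence) < n_participants:
        block = list(ARMS)
        rng.shuffle(block)      # each block contains every arm exactly once
        sequence.extend(block)
    return sequence[:n_participants]

sequence = allocation_sequence(70, seed=2021)
counts = {arm: sequence.count(arm) for arm in ARMS}
print(counts)  # group sizes differ by at most one
```

With 70 participants and three arms, any balanced scheme ends with two groups of 23 and one of 24. Allocation concealment is a separate step: the sequence has to be kept from whoever enrolls participants.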

**Blinding:** The instructors (remote experts) were blinded to group assignment: as described here, they did not know whether the student they were instructing was in the instructor group or the control group. Because control students received no feedback, it is unclear what this blinding covered in practice. The outcome assessors (the two expert surgeons who rated OSATS from video) were blinded to group assignment. The participants themselves could not be blinded; they knew whether they were receiving AI feedback, human feedback, or no feedback. This is a limitation because participants' expectations could influence their effort or performance.

**Duration:** The entire experiment took place in a single 75-minute session. This included:

5 practice sessions (each followed by 5 minutes of feedback for treatment groups)

1 final realistic virtual reality brain tumor resection

Pre- and post-intervention questionnaires

**Statistical approach:** The primary analysis used intention-to-treat — all randomized participants were included in the analysis regardless of whether they completed all sessions (though all did). For the primary outcomes (Expertise Score change over practice sessions, and final realistic Expertise Score), the researchers used linear mixed-effects models, which account for repeated measures within participants. They reported mean differences with 95% confidence intervals and p-values. For OSATS ratings, they used one-way ANOVA with post-hoc Tukey tests.
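To make the OSATS analysis concrete, the sketch below computes a one-way ANOVA F statistic by hand for three groups. The ratings are hypothetical toy values, not data from the trial, and the function name `one_way_anova_f` is illustrative. The mixed-effects models and Tukey post-hoc tests are not reproduced here.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of score lists."""
    all_scores = [x for g in groups for x in g]
    grand_mean = sum(all_scores) / len(all_scores)
    means = [sum(g) / len(g) for g in groups]

    # Variation of group means around the grand mean, weighted by group size
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individual scores around their own group mean
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical 1-7 ratings for three small groups (NOT trial data)
toy_ratings = [[4, 5, 5, 6], [4, 4, 5, 5], [3, 4, 4, 5]]
f_stat, df_b, df_w = one_way_anova_f(toy_ratings)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

For a design like this trial's, with 70 participants in three groups, the corresponding degrees of freedom would be 2 and 67.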

**What this design can and cannot prove:**

**Can prove:** That the AI feedback caused better performance on the virtual reality simulation task compared to remote expert instruction or no feedback, in this specific population and setting. The RCT design with randomization controls for confounding variables (e.g., baseline ability, prior experience) because these should be evenly distributed across groups.

**Cannot prove:** That the AI feedback would work better in real surgery (not simulation), or that the skills transfer to the operating room. The study only measured performance on the same simulator used for training — this is near transfer, not far transfer. It also cannot prove long-term retention — the final test was immediate (within the same session), so we don't know if the advantage persists days or weeks later. The single-session design also cannot tell us about the optimal number of practice sessions or the durability of learning over multiple training sessions.

**Major methodological weaknesses:**

1. Single session only — no measure of retention after a delay (e.g., 1 week or 1 month later)

2. No blinding of participants — they knew which feedback they received, which could affect motivation

3. The remote expert instruction was scripted and verbal only — this may not reflect how expert instruction is typically delivered in person (e.g., with gestures, visual demonstrations, or hands-on guidance)

4. The sample was all early-stage medical students (years 0–2) — results may not generalize to more experienced trainees or practicing surgeons

5. The primary outcome (Expertise Score from ICEMS) was developed by the same research group — there is a potential conflict of interest in using their own algorithm as the gold standard

Key findings

**Primary outcome — Expertise Score during practice sessions (learning curve):**

VOA group improved their Expertise Score by an average of 0.66 points (95% CI, 0.55 to 0.77) more than the instructor group across the 5 practice sessions (p < 0.001)

VOA group improved by 0.65 points (95% CI, 0.54 to 0.77) more than the control group (p < 0.001)

The instructor and control groups appear to have improved by similar amounts: the VOA group's advantage over each was nearly identical (0.66 vs 0.65 points). A direct instructor-vs-control comparison is not reported here, so this is an inference rather than a tested result.

**Primary outcome — Expertise Score on the final realistic resection (learning and retention):**

VOA group mean: 0.53 points higher than instructor group (95% CI, 0.40 to 0.67; p < 0.001)

VOA group mean: 0.49 points higher than control group (95% CI, 0.34 to 0.61; p < 0.001)

Instructor group vs control group: not reported as significant
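One way to sanity-check the reported Expertise Score contrasts is to back out an approximate standard error from each 95% confidence interval. The sketch below assumes the intervals are symmetric and based on a normal approximation (critical value 1.96). That is an assumption, since the mixed-effects models may have used t-based intervals.

```python
Z_95 = 1.96  # assumed normal critical value for a two-sided 95% interval

def se_from_ci(lower, upper):
    """Approximate standard error implied by a symmetric 95% CI."""
    return (upper - lower) / (2 * Z_95)

# (mean difference, CI lower, CI upper) as reported in the findings above
contrasts = {
    "practice: VOA vs instructor": (0.66, 0.55, 0.77),
    "practice: VOA vs control": (0.65, 0.54, 0.77),
    "final: VOA vs instructor": (0.53, 0.40, 0.67),
    "final: VOA vs control": (0.49, 0.34, 0.61),
}

for label, (diff, lo, hi) in contrasts.items():
    se = se_from_ci(lo, hi)
    print(f"{label}: SE ~ {se:.3f}, difference/SE ~ {diff / se:.1f}")
```

Every ratio comes out well above the usual 1.96 threshold, which is consistent with the reported p < 0.001 for each contrast.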

**Primary outcome — OSATS global rating (human-rated, final realistic resection):**

VOA group: mean 4.63 (95% CI, 4.06 to 5.20)

Instructor group: mean 4.40 (95% CI, 3.88 to 4.91)

Control group: mean 3.86 (95% CI, 3.44 to 4.27)

No statistically significant differences between any groups on the global OSATS rating (p-values not reported as significant)

**Primary outcome — OSATS subscores:**

VOA group had significantly higher "overall subscore" (a composite of all subscores) compared to control: mean difference 1.04 points (95% CI, 0.13 to 1.96; p = 0.02)

Instructor group had significantly higher "instrument handling" subscore compared to control: mean difference 1.18 points (95% CI, 0.22 to 2.14; p = 0.01)

No other subscores showed significant differences

**Secondary outcomes — Cognitive load:**

No significant differences between groups on any NASA-TLX subscale (mental demand, physical demand, temporal demand, performance, effort, frustration)

Group means are not reported in the abstract; the between-group differences are described only as non-significant

**Secondary outcomes — Emotional states:**

No significant differences between groups on positive activating emotions, negative emotions, or deactivating emotions at any time point (before, during, or after the intervention)

Effect magnitude

The primary effect, a 0.66-point improvement in Expertise Score, needs context. The Expertise Score scale runs from -1.00 (novice) to 1.00 (expert), a total range of 2.00 points, so a 0.66-point improvement represents roughly one-third of the entire scale range (0.66 / 2.00 ≈ 0.33). To put it in practical terms:

If the instructor group improved by approximately 0.20 points across the 5 practice sessions (a figure attributed to the paper, though not in the abstract), then the VOA group, which improved 0.66 points more, improved by about 0.86 points, roughly **4 times larger** than the instructor group's improvement

On the final realistic resection, the VOA group scored about **0.50 points higher** than both other groups — this is roughly equivalent to moving from a "below average" performance to an "above average" performance on the ICEMS scale

The OSATS global ratings showed a trend (VOA 4.63 vs control 3.86, a difference of 0.77 points on a 1–7 scale), but this was not statistically significant — meaning the human raters did not reliably detect the same advantage that the automated algorithm detected

In plain English: The AI feedback produced a **large and consistent improvement** in automated performance metrics, but this advantage was **not obvious to human expert raters** watching the final performance videos. This suggests either that the AI algorithm is more sensitive to subtle differences in skill, or that the AI-trained students learned to "game" the algorithm's metrics without actually improving their overall surgical quality.
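The scale comparisons above can be reproduced with a few lines of arithmetic. This sketch expresses each difference as a fraction of its scale's total range. The helper name `fraction_of_range` is illustrative, and the inputs are the figures quoted in this summary.

```python
def fraction_of_range(difference, scale_min, scale_max):
    """Express a score difference as a fraction of the full scale range."""
    return difference / (scale_max - scale_min)

# Expertise Score: 0.66-point difference on a -1.00 to 1.00 scale
expertise_fraction = fraction_of_range(0.66, -1.00, 1.00)

# OSATS global rating: VOA 4.63 vs control 3.86 on a 1-7 scale
osats_fraction = fraction_of_range(4.63 - 3.86, 1, 7)

print(f"Expertise Score: {expertise_fraction:.0%} of the scale range")
print(f"OSATS global:    {osats_fraction:.0%} of the scale range")
```

Seen this way, the automated score moved by about a third of its range while the non-significant OSATS difference covers about an eighth of its range, which illustrates the gap between the two measures.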

Limitations

**Acknowledged by authors:**

Single-session design — no measure of long-term retention

Participants were early-stage medical students — results may not generalize to residents or practicing surgeons

The remote expert instruction was scripted and verbal only — may not reflect typical in-person teaching

The VOA feedback was audiovisual and metric-based, while the instructor feedback was verbal only — the comparison confounds the modality of feedback (visual + verbal vs verbal only) with the source (AI vs human)

**Additional critical observations:**

**Conflict of interest:** The ICEMS algorithm and VOA system were developed by the same research group that conducted the trial. This creates a potential bias — the algorithm may be optimized to detect improvements that align with its own feedback, rather than general surgical skill.

**No sham control:** The control group received no feedback at all, which means they had 5 minutes of "rest" while the treatment groups received instruction. This extra attention could have motivated the treatment groups to try harder, independent of the feedback content.

**Small sample size:** With 23–24 participants per group, the study may have been underpowered to detect differences in OSATS ratings or secondary outcomes. The non-significant OSATS result could be a false negative (Type II error).

**Single center:** All training occurred at one simulation center, which limits generalizability to other settings or simulators.

**No blinding of participants:** As noted, participants knew whether they were receiving AI feedback, human feedback, or no feedback. This could influence effort, motivation, or anxiety.

**The "instructor" condition may be weak:** The remote expert instruction was scripted and limited to 5 minutes per session. In real surgical training, expert instruction is often more interactive, adaptive, and prolonged. The comparison may stack the deck in favor of the AI system.

**No measure of real-world transfer:** The study only measured performance on the same simulator. We don't know if the AI-trained students would perform better on a different simulator, a physical model, or a real patient.
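The small-sample concern above can be made concrete with a rough power calculation. The sketch below estimates the smallest standardized effect (Cohen's d) that a two-group comparison could detect with 80% power at a two-sided alpha of 0.05. It uses a normal approximation and ignores multiple-comparison corrections. The function name and default values are assumptions for illustration, not the authors' power analysis.

```python
from math import sqrt
from statistics import NormalDist

def minimum_detectable_d(n_per_group, alpha=0.05, power=0.80):
    """Approximate smallest Cohen's d detectable in a two-group comparison.

    Normal-approximation sketch: d = (z_{1-alpha/2} + z_{power}) * sqrt(2 / n).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = z.inv_cdf(power)
    return (z_alpha + z_power) * sqrt(2 / n_per_group)

d_min = minimum_detectable_d(23)
print(f"Minimum detectable d with 23 per group: {d_min:.2f}")
```

Under these assumptions, only effects of roughly 0.8 standard deviations or more would be reliably detected, so a smaller but real difference in OSATS ratings could easily be missed.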

Practical takeaways

For someone running their own n=1 experiment (e.g., learning a procedural skill with AI feedback vs self-study or human coaching):

### What to test

**Intervention:** Use an AI-based feedback system that provides metric-based, audiovisual feedback on your performance. In the study, this feedback came after each practice session rather than during it. For surgical skills, this could be a VR simulator with built-in analytics (e.g., the VOA system or similar). For non-surgical skills (e.g., playing an instrument, coding, public speaking), look for tools that provide automated, quantitative feedback on specific performance metrics (e.g., timing, accuracy, force, efficiency).

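**Dose:** 5 practice sessions of ~10–15 minutes each, with 5 minutes of AI feedback after each session. Total training time: ~75–100 minutes (5 × 15–20 minutes). For reference, the study's entire session, including the final test and questionnaires, took 75 minutes.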

**Comparator:** Either no feedback (self-study) or human coaching (e.g., a tutor, instructor, or peer who watches your performance and gives verbal feedback).

### Minimum meaningful duration

**Single session:** This study shows effects within a single 75-minute session. For your own experiment, you could test the effect of AI feedback in one session, but a more robust test would involve multiple sessions over days or weeks (e.g., 3 sessions per week for 2 weeks) to assess learning curves and retention.

**Retention test:** Include a delayed test (e.g., 1 week after training) to see if the advantage persists.

### What to measure (specific metrics)

**Primary metric:** An automated, objective performance score that captures multiple dimensions of skill (e.g., accuracy, speed, efficiency, error rate). If you're using a VR simulator, use its built-in scoring algorithm. For other skills, define your own composite score and state which direction is better (e.g., for typing: words per minute × accuracy percentage, higher is better; for coding: time to complete task × (1 + number of errors), lower is better, where the added 1 keeps an error-free run from scoring zero regardless of time).

**Secondary metric:** A human-rated assessment of overall quality (e.g., a 1–7 global rating scale completed by a blinded expert or peer). This helps check whether the automated score reflects real-world quality.

**Cognitive load:** Use the NASA-TLX (free online) to measure mental effort. This tells you whether the AI feedback is more or less mentally demanding than human coaching.

**Emotional state:** Use a simple 1–10 self-report scale for enjoyment, anxiety, and frustration before, during, and after training.
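As a minimal sketch of the typing composite suggested under the primary metric, the snippet below scores each practice session and reports the change from first to last. The session values are hypothetical placeholders, and `typing_score` is an illustrative name rather than an established metric.

```python
def typing_score(words_per_minute, accuracy_pct):
    """Composite typing score: speed weighted by accuracy (higher is better)."""
    return words_per_minute * (accuracy_pct / 100)

# Hypothetical (words per minute, accuracy %) for three practice sessions
sessions = [(52, 91.0), (55, 93.5), (58, 96.0)]

scores = [typing_score(wpm, acc) for wpm, acc in sessions]
improvement = scores[-1] - scores[0]

for number, score in enumerate(scores, start=1):
    print(f"Session {number}: {score:.2f}")
print(f"Change from first to last session: {improvement:+.2f}")
```

Logging the same composite under each feedback condition gives a learning curve per condition, which is the n=1 analogue of the per-session Expertise Score in the study.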

### Key confounds to control for

**Baseline skill:** Measure your starting performance before any training (a pre-test). Randomize the order of conditions if you're comparing AI vs human feedback across different skills.

**Practice time:** Keep total practice time
