SpineFairBench: A Counterfactual Benchmark for Auditing Demographic Sensitivity in Spinal Radiology VLM Reports
Preprint, 2026 · Under review · Cited by 0
TL;DR: Paired counterfactual benchmark that audits whether nine frozen vision-language models change their spinal-radiology reports when apparent age and sex are edited while target pathology is preserved. All nine show measurable recommendation drift, and management recommendations are less stable than diagnostic-label overlap under the same demographic edit.