

Evaluating Local LLMs for Structured Extraction from Endometriosis Ultrasound Reports

Jack Wartman

THE STUDY
Researchers evaluated locally deployed large language models (LLMs) for converting unstructured endometriosis transvaginal ultrasound (eTVUS) reports into structured data. The study compared three LLMs of varying sizes (7B, 8B, and 20B parameters) against expert human extraction across 49 eTVUS reports from clinical practice.
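To make the task concrete, here is a minimal sketch of the kind of extraction pipeline the study describes: prompt a locally served model for one JSON object per report, then parse and validate the reply. The field names are illustrative only (the paper's actual schema is not given in this summary), and the local serving runtime is assumed, not specified.

```python
import json

# Hypothetical eTVUS field set -- illustrative, not the paper's schema.
FIELDS = ["exam_date", "left_ovary_mobile", "right_ovary_mobile",
          "pouch_of_douglas_obliterated", "largest_endometrioma_mm"]

def build_prompt(report_text: str) -> str:
    """Ask a locally served LLM to emit one JSON object per report."""
    schema = ", ".join(f'"{f}"' for f in FIELDS)
    return (
        "Extract the following fields from the ultrasound report below and "
        f"reply with a single JSON object containing exactly the keys {schema}. "
        "Use null for any field not stated in the report.\n\n"
        f"REPORT:\n{report_text}"
    )

def parse_response(raw: str) -> dict:
    """Parse the model reply, tolerating prose around the JSON object."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object in model output")
    record = json.loads(raw[start:end + 1])
    # Keep only expected keys; anything the model omitted becomes None.
    return {f: record.get(f) for f in FIELDS}
```

In practice `build_prompt(report)` would be sent to whatever local runtime hosts the model (Ollama, llama.cpp, etc.) and the completion passed to `parse_response`.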

KEY FINDINGS
The 20B-parameter model achieved a mean accuracy of 86.02%, substantially outperforming the smaller 7B and 8B models (specific figures for the comparison models were not reported). Interestingly, the study revealed complementary error profiles: the LLMs excelled at syntactic consistency, including date and numeric formatting, where humans made errors, while human experts provided superior semantic and contextual interpretation of clinical findings.
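A mean accuracy figure like the 86.02% above is typically computed field by field against the expert reference. A minimal sketch under the assumption of exact-match scoring (the paper's actual scoring rules, e.g. for nulls or near-matches, may differ):

```python
def field_accuracy(llm_records, expert_records):
    """Fraction of (report, field) pairs where the LLM value matches
    the expert value exactly, plus a per-field breakdown."""
    assert len(llm_records) == len(expert_records)
    per_field, hits, total = {}, 0, 0
    for llm, gold in zip(llm_records, expert_records):
        for field, truth in gold.items():
            ok = llm.get(field) == truth
            c, n = per_field.get(field, (0, 0))
            per_field[field] = (c + ok, n + 1)
            hits += ok
            total += 1
    mean = hits / total if total else 0.0
    return mean, {f: c / n for f, (c, n) in per_field.items()}
```

The per-field breakdown is what surfaces the complementary error profiles the study reports: syntactic fields (dates, measurements) and semantic fields (free-text findings) can be inspected separately.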

The researchers found that LLM semantic errors represented fundamental limitations that could not be resolved through prompt engineering alone. This suggests inherent boundaries in current model capabilities for complex medical text interpretation.

METHODOLOGY NOTES
This was a comparative study using 49 clinical eTVUS reports, with expert human extraction serving as the reference standard. Strengths include the use of real clinical data and systematic comparison across multiple model sizes, clearly demonstrating the importance of parameter scale in clinical text processing.

Limitations include the relatively small sample size (n=49) and focus on a single imaging modality and anatomical region. The study was conducted on a single institution’s reports, which may limit generalizability to different clinical documentation styles or terminology variations across healthcare systems.

CLINICAL RELEVANCE
The findings strongly support a human-in-the-loop workflow rather than full automation. The complementary error patterns suggest LLMs could serve as collaborative tools, handling routine structural formatting while flagging potential inconsistencies for human review. This approach would allow imaging specialists to focus on high-level semantic validation rather than manual data entry.
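One way to operationalize that division of labor is to auto-accept fields that pass a machine-checkable format test and route everything else to a clinician. The routing table below is hypothetical, assuming fields can be tagged up front as syntactic (dates, measurements) versus semantic (free-text findings):

```python
import re

# Hypothetical routing table: which extracted fields are purely syntactic
# (machine-checkable format) versus semantic (need clinician review).
SYNTACTIC = {
    "exam_date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),          # ISO date
    "largest_endometrioma_mm": re.compile(r"^\d+(\.\d+)?$"),  # number in mm
}

def triage(record: dict) -> dict:
    """Split one extracted record into auto-accepted fields and
    fields flagged for human review."""
    accepted, review = {}, {}
    for field, value in record.items():
        pattern = SYNTACTIC.get(field)
        if pattern and value is not None and pattern.match(str(value)):
            accepted[field] = value   # format check passed
        else:
            review[field] = value     # semantic or malformed -> human
    return {"accepted": accepted, "needs_review": review}
```

This keeps the LLM on the syntactic work it did well in the study while every semantic judgment still crosses a human's desk.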

For practices considering local LLM deployment, the 20B model requirement has real infrastructure implications: a model of that size generally needs a workstation-class GPU, or aggressive quantization, to run locally. The study demonstrates that smaller models may not provide adequate accuracy for clinical text extraction tasks.

https://arxiv.org/abs/2601.09053v1

ALSO TODAY

Multi-modal framework combining Vision Transformer, CNN, and Graph Neural Network achieved 97.8% accuracy in diabetic retinopathy diagnosis using retinal images and temporal biomarkers across five validation datasets. https://arxiv.org/abs/2601.08240v1

Diffusion-based PathoGen model enables controllable lesion synthesis in histopathology images, outperforming GAN and Stable Diffusion baselines while enhancing downstream segmentation performance in data-scarce scenarios. https://arxiv.org/abs/2601.08127v1

Multi-modal dataset SOPHIAS captures 50 oral presentations from 65 students using eight synchronized sensor streams including eye-tracking, physiological monitoring, and rubric-based evaluations for educational AI development. https://arxiv.org/abs/2601.07576v1

The AI Dentist