Manuscript in Preparation

BreastGPT: A Multimodal Large Language Model
for the Full Spectrum of Breast Cancer Clinical Routine

One backbone, three stage-specialist agents — Screening, Diagnosis, and Treatment-Planning — orchestrated end-to-end along the breast cancer care continuum.

Screening · Diagnosis · Treatment Planning
Anonymous Author(s)
Targeting the 40th Conference on Neural Information Processing Systems (NeurIPS 2026); not yet accepted.
BreastStage overview
The BreastStage corpus is aligned with the end-to-end breast cancer clinical workflow: screening → diagnosis → treatment planning, across five imaging modalities.
  • 1.86M Instruction-Following Pairs
  • 17 Sub-Datasets Curated
  • 5 Imaging Modalities
  • 136 Task Templates
  • 89.92% Open-Ended Score

A clinical agent that thinks in stages.

[Animated figure: Clinical Trajectory]
Orchestrator: Step 0 · intake → Step 1 · invoke A1 → Step 2 · invoke A2 → Step 3 · invoke A3 → done
  • Patient: new breast complaint; available inputs BUS, Mammo, CT, MRI, Histo/WSI
  • A1 · Screening Agent (BUS, Mammo, CT) → BI-RADS · risk score. Risk high → diagnose; low → follow-up
  • A2 · Diagnosis Agent (BUS, Mammo, MRI) → lesion characterization. Malignant → treat; benign → monitor
  • A3 · Treatment Agent (MRI, WSI) → subtype · therapy plan
  • Follow-up: alternative path (orchestrator did not select)
  • Example trajectory τ: intake (5 modalities + history) → A1 · risk = high (BI-RADS 5) → A2 · lesion report (IDC suspected) → A3 · care plan (HER2+ NAC plan)
The Orchestrator reads the evolving patient trajectory τ, decides which stage-agent to invoke next based on the current evidence, and aggregates each agent's output back into τ. A single 8-second loop traces one end-to-end screening → diagnosis → treatment handoff.
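The routing described above can be sketched as a small loop. This is a minimal illustration, not the authors' implementation: the `Trajectory` class, the `next_stage` rules, and the stub agents are hypothetical names, and the branch conditions (high risk leads to diagnosis, malignant leads to treatment) are taken from the trajectory figure.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Trajectory:
    """Evolving patient record (tau): intake evidence plus one entry per agent call."""
    intake: Dict[str, str]
    steps: List[Dict[str, str]] = field(default_factory=list)


Agent = Callable[[Trajectory], Dict[str, str]]


def next_stage(tau: Trajectory) -> Optional[str]:
    """Pick the next stage-agent from the latest evidence, or None to stop."""
    if not tau.steps:
        return "screening"
    last = tau.steps[-1]
    if last["stage"] == "screening":
        # Low risk leaves the pipeline for routine follow-up.
        return "diagnosis" if last.get("risk") == "high" else None
    if last["stage"] == "diagnosis":
        # Benign findings are monitored rather than treated.
        return "treatment" if last.get("malignant") == "yes" else None
    return None  # treatment planning is the final stage


def run(tau: Trajectory, agents: Dict[str, Agent], max_steps: int = 3) -> Trajectory:
    """Invoke stage-agents in turn and aggregate each output back into tau."""
    for _ in range(max_steps):
        stage = next_stage(tau)
        if stage is None:
            break
        output = agents[stage](tau)
        tau.steps.append({"stage": stage, **output})
    return tau


def stub_agents(risk: str, malignant: str) -> Dict[str, Agent]:
    """Hypothetical stand-ins for the three stage-agents (no model calls)."""
    return {
        "screening": lambda tau: {"risk": risk},
        "diagnosis": lambda tau: {"malignant": malignant},
        "treatment": lambda tau: {"plan": "placeholder care plan"},
    }
```

With `stub_agents("high", "yes")` the loop visits all three stages; with `stub_agents("low", "yes")` it stops after screening, mirroring the follow-up branch.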

BreastGPT treats breast oncology as a patient-level trajectory rather than a single image task. A shared multimodal backbone is steered by an Orchestrator that decides — based on the current evidence state — which stage-agent (Screening / Diagnosis / Treatment) acts next, and aggregates every agent's output into one evolving clinical record.

1 · One clinical trajectory

An Orchestrator decides which stage-agent should act next, aggregates its output, and preserves the accumulated patient context.

2 · One workflow corpus

BreastStage aligns 1.86M instruction pairs with the real screening-to-treatment pathway.

3 · One deployable model

A shared backbone handles standard radiology and gigapixel pathology without stage-specific models.

Three agents, one workflow.

Screen → Diagnose → Treat (筛 → 诊 → 治)

A single Orchestrator builds the patient-level trajectory: at each step it inspects the current evidence state, selects the stage-agent whose clinical role matches, and aggregates the returned output back into the shared record.

🩺
Screening Agent
筛 · Screening
  • Role: Early triage from population-screening evidence
  • Writes to τ: Recall decision, BI-RADS risk, suspicious findings
🔬
Diagnosis Agent
诊 · Diagnosis
  • Role: Characterize lesions and resolve diagnostic uncertainty
  • Writes to τ: Radiology report, malignancy rationale, next-step recommendation
💊
Treatment-Planning Agent
治 · Treatment
  • Role: Translate pathology and staging evidence into care intent
  • Writes to τ: Subtype evidence, biomarker interpretation, treatment plan
⚙️

Workflow Orchestrator

All three agents share one Qwen3-VL-8B backbone and a single weight checkpoint. The orchestrator changes the stage-conditioned persona and output schema, then appends the result to the same patient trajectory.
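One way to picture "same checkpoint, different persona and schema" is a per-stage configuration that only changes the prompt and the expected output fields. The sketch below is an assumption-laden illustration: the roles and field names are adapted from the agent cards above, while the prompt wording, `STAGES`, `build_prompt`, and `missing_fields` are hypothetical and do not reproduce the actual BreastGPT prompts.

```python
from typing import Dict, List

# Per-stage persona and output schema; the model weights stay the same.
STAGES: Dict[str, Dict[str, object]] = {
    "screening": {
        "role": "Early triage from population-screening evidence",
        "fields": ("recall_decision", "birads_risk", "suspicious_findings"),
    },
    "diagnosis": {
        "role": "Characterize lesions and resolve diagnostic uncertainty",
        "fields": ("radiology_report", "malignancy_rationale", "next_step"),
    },
    "treatment": {
        "role": "Translate pathology and staging evidence into care intent",
        "fields": ("subtype_evidence", "biomarker_interpretation", "treatment_plan"),
    },
}


def build_prompt(stage: str, trajectory_text: str) -> str:
    """Compose a stage-conditioned prompt for the shared backbone."""
    cfg = STAGES[stage]
    fields = ", ".join(cfg["fields"])
    return (
        f"You are the {stage} agent. Role: {cfg['role']}.\n"
        f"Patient trajectory so far:\n{trajectory_text}\n"
        f"Answer with these fields: {fields}"
    )


def missing_fields(stage: str, output: Dict[str, str]) -> List[str]:
    """List schema fields the agent output failed to provide."""
    return [name for name in STAGES[stage]["fields"] if name not in output]
```

Checking `missing_fields` before appending to the trajectory is one simple way to keep the shared record consistent across stages.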

Teaching the agent the full clinical workflow.

BreastStage is a workflow-aligned corpus with ≈662K images, 606K boxes/masks, and 1.86M instruction-following pairs across screening, diagnosis, and treatment planning.

Figure 2 · BreastStage construction pipeline. From multimodal clinical data to expert-verified instruction pairs.
BreastStage stage, modality, and task-category distribution
Figure 6 · Stage-modality-task distribution. The corpus preserves the screening-diagnosis-treatment structure.

Inside the agent: perception, routing, compression.

BreastGPT combines a shared multimodal backbone with stage prompts, modality-aware perception, and compact visual memory.

BreastGPT architecture and performance
Figure 3 · BreastGPT architecture. A unified agent stack for standard radiology, pathology, and stage-conditioned reasoning.

Stage-Aware Role Prompting

The Orchestrator switches the agent's clinical role across screening, diagnosis, and treatment planning while keeping one model checkpoint.

Modality-Aware Visual Router

Standard radiology and gigapixel pathology are routed through different visual branches before entering the same language model.
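As a rough sketch of the routing idea, the dispatch can be as simple as a lookup from modality to visual branch. The branch names and the `route` function are hypothetical; the modality abbreviations follow the ones used on this page, and the real router may rely on learned or metadata-driven signals instead.

```python
RADIOLOGY = {"BUS", "Mammo", "CT", "MRI"}  # standard-resolution imaging
PATHOLOGY = {"WSI", "Histo"}               # gigapixel histopathology slides


def route(modality: str) -> str:
    """Return the name of the visual branch that should encode this input."""
    if modality in RADIOLOGY:
        return "radiology_branch"
    if modality in PATHOLOGY:
        return "pathology_branch"
    raise ValueError(f"unsupported modality: {modality!r}")
```

Both branches would then emit tokens for the same language model, so everything downstream of the router is shared.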

Concept-Based Token Selector

Large visual inputs are compressed to k = 128 clinically relevant tokens, making WSI reasoning practical at inference time.
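The selection step can be illustrated with a plain top-k filter. This sketch assumes tokens are scored by their highest cosine similarity to a set of concept embeddings, which is a guess at the mechanism rather than the documented method; `select_tokens` and its arguments are hypothetical names, and only the budget k = 128 comes from the text.

```python
import heapq
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_tokens(tokens: Sequence[Vector], concepts: Sequence[Vector], k: int = 128) -> List[int]:
    """Return indices of the k tokens most similar to any concept, in input order."""
    if not concepts:
        raise ValueError("at least one concept embedding is required")
    # Score each token by its best-matching concept.
    scores = [max(cosine(tok, c) for c in concepts) for tok in tokens]
    keep = heapq.nlargest(min(k, len(tokens)), range(len(tokens)), key=scores.__getitem__)
    return sorted(keep)  # preserve the original spatial order of the kept tokens
```

Because the output size is capped at k regardless of how many patches a slide produces, the language model sees a fixed visual budget even for gigapixel inputs.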

Workflow Training

A single training recipe aligns visual features first, then teaches the full workflow across modalities and clinical stages.

The agent wins at every stage of the workflow.

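On BreastStage-Bench, BreastGPT (cluster) posts the highest closed-ended (75.66) and open-ended (89.92) averages among proprietary frontier models, open-source VLMs, and medical-specific VLMs. Screening CT is the exception: several baselines score slightly higher there (e.g., Qwen2.5-VL at 79.27 vs. 77.21).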

Closed-ended VQA (accuracy)
  • BreastGPT (cluster): 75.66
  • GPT-5.4: 54.00
  • Gemini-3.1-Pro: 51.32
  • Qwen3-VL-8B: 47.93
  • InternVL3.5: 45.41
Open-ended VQA (normalized score)
  • BreastGPT (cluster): 89.92
  • BreastGPT (learn): 85.95
  • InternVL3.5: 53.64
  • GPT-5.4: 53.58
  • Qwen3-VL-8B: 44.89

Table 1 · VQA performance on BreastStage-Bench

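Model | #P | Screening BUS | Screening CT | Screening Mam | Diagnosis BUS | Diagnosis Mam | Diagnosis MRI | Treatment MRI | Treatment His | Closed Avg | Open Avg
Proprietary Models
GPT-5.4 | - | 64.89 | 78.55 | 68.51 | 53.46 | 53.50 | 38.10 | 32.28 | [one per-task value is missing in the source; per-task column alignment for this row is uncertain] | 54.00 | 53.58
Claude-opus-4-6 | - | 50.21 | 72.00 | 39.57 | 38.83 | 7.66 | 45.10 | 41.27 | 25.94 | 41.23 | 42.97
Gemini-3.1-Pro | - | 68.09 | 73.33 | 50.21 | 47.14 | 23.87 | 43.88 | 44.44 | 46.53 | 51.32 | 46.16
Open-Source Models
Qwen2.5-VL | 7B | 49.15 | 79.27 | 44.47 | 44.76 | 34.31 | 14.41 | 46.85 | 37.30 | 44.24 | 46.55
Qwen3-VL | 8B | 57.87 | 78.55 | 39.68 | 51.43 | 48.14 | 25.83 | 47.90 | 34.92 | 47.93 | 44.89
InternVL3.5 | 8B | 51.91 | 77.70 | 35.85 | 34.76 | 16.22 | 52.27 | 44.44 | 52.98 | 45.41 | 53.64
Medical-Specific Models
Lingshu | 7B | 58.94 | 78.55 | 39.89 | 54.29 | 58.24 | 8.56 | 45.28 | 51.52 | 50.44 | 50.26
HuatuoGPT-V | 7B | 45.74 | 71.39 | 43.09 | 39.05 | 35.11 | 9.61 | 47.73 | 15.24 | 45.04 | 51.71
BreastGPT (cluster) | 8B | 86.81 | 77.21 | 75.00 | 82.86 | 77.13 | 68.32 | 61.11 | 71.38 | 75.66 | 89.92
BreastGPT (learn) | 8B | 84.47 | 71.03 | 68.51 | 75.71 | 78.46 | 55.26 | 73.95 | 71.25 | 70.64 | 85.95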

BUS = Breast Ultrasound · CT = Computed Tomography · Mam = Mammography · His = Histopathology. Best results in bold. BreastGPT rows highlighted.
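The headline margins can be recomputed from the two average columns of Table 1. The snippet below is only a convenience sketch: the numbers are copied from the table, while `BASELINES`, `BREASTGPT_CLUSTER`, and `margin_over_best` are hypothetical names introduced here.

```python
from typing import Dict, Tuple

# (Closed Avg, Open Avg) for each baseline, copied from Table 1.
BASELINES: Dict[str, Tuple[float, float]] = {
    "GPT-5.4": (54.00, 53.58),
    "Claude-opus-4-6": (41.23, 42.97),
    "Gemini-3.1-Pro": (51.32, 46.16),
    "Qwen2.5-VL": (44.24, 46.55),
    "Qwen3-VL": (47.93, 44.89),
    "InternVL3.5": (45.41, 53.64),
    "Lingshu": (50.44, 50.26),
    "HuatuoGPT-V": (45.04, 51.71),
}
BREASTGPT_CLUSTER: Tuple[float, float] = (75.66, 89.92)


def margin_over_best(column: int) -> Tuple[str, float]:
    """Return (strongest baseline, BreastGPT margin in points) for column 0 or 1."""
    best = max(BASELINES, key=lambda name: BASELINES[name][column])
    return best, round(BREASTGPT_CLUSTER[column] - BASELINES[best][column], 2)
```

With these values the margins come to 21.66 points on the closed-ended average (over GPT-5.4) and 36.28 points on the open-ended average (over InternVL3.5).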

Ablations · Token budget & inference cost

Visual token budget sweep. k = 128 gives a compact operating point across task families.
GPU latency breakdown
Inference latency on WSIs. Concept-based selection delivers 33× faster inference than direct patch-token feeding.

BibTeX

@misc{breastgpt2026,
  title     = {BreastGPT: A Multimodal Large Language Model for the Full Spectrum of Breast Cancer Clinical Routine},
  author    = {Anonymous Author(s)},
  year      = {2026},
  note      = {Manuscript in preparation}
}