Manuscript in Preparation

BreastGPT: A Multimodal Large Language Model
for the Full Spectrum of Breast Cancer Clinical Routine

One backbone, three stage-specialist agents — Screening, Diagnosis, and Treatment-Planning — orchestrated end-to-end along the breast cancer care continuum.

Screening · Diagnosis · Treatment Planning

Anonymous Author(s)

Targeting the 40th Conference on Neural Information Processing Systems (NeurIPS 2026); not yet accepted.


The BreastStage corpus is aligned with the end-to-end breast cancer clinical workflow: screening → diagnosis → treatment planning across five imaging modalities.

1.86M Instruction-Following Pairs
Sub-Datasets Curated
5 Imaging Modalities
136 Task Templates
89.92% Open-Ended Score

01 · The Agent

A clinical agent that thinks in stages.

Clinical trajectory (animated diagram). Orchestrator steps: Step 0 · intake → Step 1 · invoke A1 → Step 2 · invoke A2 → Step 3 · invoke A3 → done.

Patient (new breast complaint) → A1 · Screening Agent (BI-RADS, risk score) → if risk is high: A2 · Diagnosis Agent (lesion characterization) → if malignant: A3 · Treatment Agent (subtype, therapy plan). Low-risk cases branch to follow-up and benign cases to monitoring; this is the alternative path that the orchestrator did not select in the example. MRI is shown as the example input at each stage.

Example trajectory: intake (5 modalities + history) → A1 · risk = high (BI-RADS 5) → A2 · lesion report (IDC suspected) → A3 · care plan (HER2+, NAC plan).

The Orchestrator reads the evolving patient trajectory τ, decides which stage-agent to invoke next based on the current evidence, and aggregates each agent's output back into τ. A single 8-second loop traces one end-to-end screening → diagnosis → treatment handoff.

BreastGPT treats breast oncology as a patient-level trajectory rather than a single image task. A shared multimodal backbone is steered by an Orchestrator that decides — based on the current evidence state — which stage-agent (Screening / Diagnosis / Treatment) acts next, and aggregates every agent's output into one evolving clinical record.
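The control flow described above can be sketched as a small routing loop. This is an illustration under stated assumptions, not the BreastGPT implementation: `Trajectory`, `route`, `run`, the record keys, and the stub agents are hypothetical names, and the branching rules (high risk → diagnosis, malignant → treatment) are read off the trajectory diagram.

```python
from typing import Callable, Dict, List, Optional


class Trajectory:
    """Hypothetical stand-in for the patient trajectory tau."""

    def __init__(self, intake: str) -> None:
        self.intake = intake
        self.records: List[dict] = []

    def latest(self, key: str) -> Optional[str]:
        # Return the most recent value written under `key`, if any.
        for record in reversed(self.records):
            if key in record:
                return record[key]
        return None


def route(tau: Trajectory) -> Optional[str]:
    """Choose the next stage-agent from the evidence in tau (None = stop)."""
    risk = tau.latest("risk")
    if risk is None:
        return "screening"
    if risk != "high":
        return None  # low risk: follow-up path, no further agent
    malignant = tau.latest("malignant")
    if malignant is None:
        return "diagnosis"
    if malignant != "yes":
        return None  # benign: monitor
    if tau.latest("plan") is None:
        return "treatment"
    return None  # care plan already recorded


Agent = Callable[[Trajectory], dict]


def run(tau: Trajectory, agents: Dict[str, Agent], max_steps: int = 8) -> List[str]:
    """Invoke agents until route() stops; return the stages that ran."""
    invoked: List[str] = []
    for _ in range(max_steps):  # guard against an agent that never writes its key
        stage = route(tau)
        if stage is None:
            break
        tau.records.append(agents[stage](tau))  # aggregate output back into tau
        invoked.append(stage)
    return invoked


# Stub agents (assumed outputs) that replay the example trajectory.
STUB_AGENTS: Dict[str, Agent] = {
    "screening": lambda tau: {"birads": "5", "risk": "high"},
    "diagnosis": lambda tau: {"finding": "IDC suspected", "malignant": "yes"},
    "treatment": lambda tau: {"subtype": "HER2+", "plan": "NAC"},
}
```

Calling `run(Trajectory("new breast complaint"), STUB_AGENTS)` walks screening, diagnosis, and treatment in order; replacing the screening stub with one that writes `"risk": "low"` stops after the first agent, which corresponds to the follow-up branch.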

One clinical trajectory

An Orchestrator decides which stage-agent should act next, aggregates its output, and preserves the accumulated patient context.

One workflow corpus

BreastStage aligns 1.86M instruction pairs with the real screening-to-treatment pathway.

One deployable model

A shared backbone handles standard radiology and gigapixel pathology without stage-specific models.

02 · Agent System

Three agents, one workflow.

Screen → Diagnose → Treat

A single Orchestrator builds the patient-level trajectory: at each step it inspects the current evidence state, selects the stage-agent whose clinical role matches, and aggregates the returned output back into the shared record.

🩺

Screening Agent

Stage 1 · Screening

Role: Early triage from population-screening evidence
Writes to τ: Recall decision, BI-RADS risk, suspicious findings

→

🔬

Diagnosis Agent

Stage 2 · Diagnosis

Role: Characterize lesions and resolve diagnostic uncertainty
Writes to τ: Radiology report, malignancy rationale, next-step recommendation

→

💊

Treatment-Planning Agent

Stage 3 · Treatment

Role: Translate pathology and staging evidence into care intent
Writes to τ: Subtype evidence, biomarker interpretation, treatment plan

⚙️

Workflow Orchestrator

All three agents share one Qwen3-VL-8B backbone and a single weight checkpoint. The orchestrator changes only the stage-conditioned persona and output schema, then appends the result to the same patient trajectory.
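One way to picture "one checkpoint, three roles" is a lookup table that swaps only the system prompt and the expected output fields. The sketch below is hypothetical: `STAGE_CONFIGS`, `build_request`, `parse_reply`, and the persona wording are illustrative names, the request dictionary is not any real inference API, and the field names simply mirror the "Writes to τ" entries listed for each agent.

```python
import json
from typing import Dict, List

BACKBONE = "Qwen3-VL-8B"  # used here only as a label for the shared checkpoint

# Persona text is invented for illustration; fields follow the agent cards.
STAGE_CONFIGS: Dict[str, Dict[str, object]] = {
    "screening": {
        "persona": "You are a breast screening reader performing early triage.",
        "fields": ["recall_decision", "birads_risk", "suspicious_findings"],
    },
    "diagnosis": {
        "persona": "You are a diagnostic breast radiologist characterizing lesions.",
        "fields": ["radiology_report", "malignancy_rationale", "next_step_recommendation"],
    },
    "treatment": {
        "persona": "You are a breast oncology planner translating evidence into care intent.",
        "fields": ["subtype_evidence", "biomarker_interpretation", "treatment_plan"],
    },
}


def build_request(stage: str, trajectory_summary: str) -> Dict[str, str]:
    """Build a stage-conditioned request; only the prompt changes, not the model."""
    cfg = STAGE_CONFIGS[stage]
    schema = {name: "string" for name in cfg["fields"]}
    system = f"{cfg['persona']} Reply with JSON matching: {json.dumps(schema)}"
    return {"model": BACKBONE, "system": system, "user": trajectory_summary}


def parse_reply(stage: str, reply_text: str) -> Dict[str, str]:
    """Validate a JSON reply against the stage's output schema."""
    data = json.loads(reply_text)
    fields: List[str] = list(STAGE_CONFIGS[stage]["fields"])
    missing = [name for name in fields if name not in data]
    if missing:
        raise ValueError(f"{stage} reply is missing fields: {missing}")
    return {name: data[name] for name in fields}
```

Every request names the same backbone, so switching stages costs a prompt change rather than a model load; the validated dictionary returned by `parse_reply` is what would be appended to the trajectory.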

03 · Training Substrate

Teaching the agent the full clinical workflow.

BreastStage is a workflow-aligned corpus with ≈662K images, 606K boxes/masks, and 1.86M instruction-following pairs across screening, diagnosis, and treatment planning.

**Figure 2 · BreastStage construction pipeline.** From multimodal clinical data to expert-verified instruction pairs.

**Figure 6 · Stage-modality-task distribution.** Stage, modality, and task-category distribution of BreastStage. The corpus preserves the screening-diagnosis-treatment structure.

04 · Agent Internals

Inside the agent: perception, routing, compression.

BreastGPT combines a shared multimodal backbone with stage prompts, modality-aware perception, and compact visual memory.

**Figure 3 · BreastGPT architecture.** A unified agent stack for standard radiology, pathology, and stage-conditioned reasoning.

Stage-Aware Role Prompting

The Orchestrator switches the agent's clinical role across screening, diagnosis, and treatment planning while keeping one model checkpoint.

Modality-Aware Visual Router

Standard radiology and gigapixel pathology are routed through different visual branches before entering the same language model.
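As a rough sketch of the routing idea, the dispatch can be a lookup from modality code to visual branch. The grouping below is an assumption: the modality codes are the abbreviations used in Table 1, but the branch names and the functions `route_modality` and `group_by_branch` are hypothetical and do not describe the actual router.

```python
from typing import Dict, List

# Assumed grouping: histopathology slides take the gigapixel branch,
# every other modality takes the standard radiology branch.
RADIOLOGY_MODALITIES = {"BUS", "CT", "Mam", "MRI"}
PATHOLOGY_MODALITIES = {"His"}


def route_modality(modality: str) -> str:
    """Return the name of the visual branch for a modality code."""
    if modality in RADIOLOGY_MODALITIES:
        return "radiology"
    if modality in PATHOLOGY_MODALITIES:
        return "pathology"
    raise ValueError(f"unknown modality: {modality}")


def group_by_branch(study: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Group the images of one patient study by the branch that would encode them."""
    branches: Dict[str, List[str]] = {"radiology": [], "pathology": []}
    for image in study:
        branches[route_modality(image["modality"])].append(image["path"])
    return branches
```

Both branches would then hand their visual tokens to the same language model, so the split affects only encoding, not reasoning.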

Concept-Based Token Selector

Large visual inputs are compressed to k = 128 clinically relevant tokens, making whole-slide image (WSI) reasoning practical at inference time.
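A minimal sketch of top-k token selection follows. The scoring rule is an assumption made for illustration: each token is scored by its highest dot product with any concept vector, and the k best tokens are kept in their original order. The source states only that selection is concept-based with k = 128; `select_tokens` and its arguments are hypothetical names.

```python
import heapq
from typing import List, Sequence

Vector = Sequence[float]


def select_tokens(tokens: Sequence[Vector], concepts: Sequence[Vector], k: int = 128) -> List[int]:
    """Return indices of the k highest-scoring tokens, in original order.

    Assumed score: max over concepts of the token-concept dot product.
    """
    def score(index: int) -> float:
        token = tokens[index]
        return max(sum(t * c for t, c in zip(token, concept)) for concept in concepts)

    k = min(k, len(tokens))  # small inputs are passed through unchanged
    top = heapq.nlargest(k, range(len(tokens)), key=score)
    return sorted(top)  # restore spatial/sequence order of the kept tokens
```

With this kind of selector, the number of visual tokens entering the language model is capped at k regardless of how many patches the slide produces.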

Workflow Training

A single training recipe aligns visual features first, then teaches the full workflow across modalities and clinical stages.

05 · Evaluation

The agent wins at every stage of the workflow.

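On BreastStage-Bench, BreastGPT (cluster) outperforms proprietary frontier models, open-source VLMs, and medical-specific VLMs on both averages: 75.66 closed-ended accuracy against 54.00 for GPT-5.4, the strongest baseline, and 89.92 open-ended score against 53.64 for InternVL3.5. It leads in every stage-modality column of Table 1 except screening CT, where Qwen2.5-VL scores 79.27 against 77.21.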

Closed-ended VQA (accuracy)

BreastGPT (cluster): 75.66
GPT-5.4: 54.00
Gemini-3.1-Pro: 51.32
Qwen3-VL-8B: 47.93
InternVL3.5: 45.41

Open-ended VQA (normalized score)

BreastGPT (cluster): 89.92
BreastGPT (learn): 85.95
InternVL3.5: 53.64
GPT-5.4: 53.58
Qwen3-VL-8B: 44.89

Table 1 · VQA performance on BreastStage-Bench

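| Model | #P | Screening BUS | Screening CT | Screening Mam | Diagnosis BUS | Diagnosis Mam | Diagnosis MRI | Treatment MRI | Treatment His | Closed Avg | Open Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| **Proprietary Models** | | | | | | | | | | | |
| GPT-5.4 | — | 64.89 | 78.55 | 68.51 | 53.46 | 53.50 | 38.10 | 32.28 | — | 54.00 | 53.58 |
| Claude-opus-4-6 | — | 50.21 | 72.00 | 39.57 | 38.83 | 7.66 | 45.10 | 41.27 | 25.94 | 41.23 | 42.97 |
| Gemini-3.1-Pro | — | 68.09 | 73.33 | 50.21 | 47.14 | 23.87 | 43.88 | 44.44 | 46.53 | 51.32 | 46.16 |
| **Open-Source Models** | | | | | | | | | | | |
| Qwen2.5-VL | 7B | 49.15 | **79.27** | 44.47 | 44.76 | 34.31 | 14.41 | 46.85 | 37.30 | 44.24 | 46.55 |
| Qwen3-VL | 8B | 57.87 | 78.55 | 39.68 | 51.43 | 48.14 | 25.83 | 47.90 | 34.92 | 47.93 | 44.89 |
| InternVL3.5 | 8B | 51.91 | 77.70 | 35.85 | 34.76 | 16.22 | 52.27 | 44.44 | 52.98 | 45.41 | 53.64 |
| **Medical-Specific Models** | | | | | | | | | | | |
| Lingshu | 7B | 58.94 | 78.55 | 39.89 | 54.29 | 58.24 | 8.56 | 45.28 | 51.52 | 50.44 | 50.26 |
| HuatuoGPT-V | 7B | 45.74 | 71.39 | 43.09 | 39.05 | 35.11 | 9.61 | 47.73 | 15.24 | 45.04 | 51.71 |
| BreastGPT (cluster) | 8B | **86.81** | 77.21 | **75.00** | **82.86** | 77.13 | **68.32** | 61.11 | **71.38** | **75.66** | **89.92** |
| BreastGPT (learn) | 8B | 84.47 | 71.03 | 68.51 | 75.71 | **78.46** | 55.26 | **73.95** | 71.25 | 70.64 | 85.95 |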

BUS = Breast Ultrasound · CT = Computed Tomography · Mam = Mammography · His = Histopathology. Best results in bold. BreastGPT rows highlighted.
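The margins over the strongest baseline can be re-derived from Table 1 with a few lines of arithmetic. The sketch below copies the baseline rows and the BreastGPT (cluster) row from the table; the short column labels and the function name `margins` are introduced here for illustration, and `None` marks the one value the table leaves blank.

```python
from typing import Dict, List, Optional, Tuple

# Column order follows Table 1 (Scr = Screening, Dx = Diagnosis, Tx = Treatment).
COLUMNS = ["Scr-BUS", "Scr-CT", "Scr-Mam", "Dx-BUS", "Dx-Mam", "Dx-MRI",
           "Tx-MRI", "Tx-His", "Closed", "Open"]

BASELINES: Dict[str, List[Optional[float]]] = {
    "GPT-5.4":         [64.89, 78.55, 68.51, 53.46, 53.50, 38.10, 32.28, None, 54.00, 53.58],
    "Claude-opus-4-6": [50.21, 72.00, 39.57, 38.83, 7.66, 45.10, 41.27, 25.94, 41.23, 42.97],
    "Gemini-3.1-Pro":  [68.09, 73.33, 50.21, 47.14, 23.87, 43.88, 44.44, 46.53, 51.32, 46.16],
    "Qwen2.5-VL":      [49.15, 79.27, 44.47, 44.76, 34.31, 14.41, 46.85, 37.30, 44.24, 46.55],
    "Qwen3-VL":        [57.87, 78.55, 39.68, 51.43, 48.14, 25.83, 47.90, 34.92, 47.93, 44.89],
    "InternVL3.5":     [51.91, 77.70, 35.85, 34.76, 16.22, 52.27, 44.44, 52.98, 45.41, 53.64],
    "Lingshu":         [58.94, 78.55, 39.89, 54.29, 58.24, 8.56, 45.28, 51.52, 50.44, 50.26],
    "HuatuoGPT-V":     [45.74, 71.39, 43.09, 39.05, 35.11, 9.61, 47.73, 15.24, 45.04, 51.71],
}

BREASTGPT_CLUSTER = [86.81, 77.21, 75.00, 82.86, 77.13, 68.32, 61.11, 71.38, 75.66, 89.92]


def margins() -> Dict[str, Tuple[str, float]]:
    """Per column: (strongest baseline, BreastGPT (cluster) score minus that baseline)."""
    result: Dict[str, Tuple[str, float]] = {}
    for i, column in enumerate(COLUMNS):
        scored = [(name, row[i]) for name, row in BASELINES.items() if row[i] is not None]
        best_name, best_score = max(scored, key=lambda pair: pair[1])
        result[column] = (best_name, round(BREASTGPT_CLUSTER[i] - best_score, 2))
    return result
```

Run on these rows, the closed-ended margin is 21.66 points over GPT-5.4 and the open-ended margin is 36.28 points over InternVL3.5; screening CT is the only column with a negative margin (2.06 points behind Qwen2.5-VL).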

Ablations · Token budget & inference cost

**Visual token budget sweep.** k = 128 gives a compact operating point across task families.

**Inference latency on WSIs.** GPU latency breakdown. Concept-based selection delivers **33× faster** inference than direct patch-token feeding.

06 · Case Studies

Representative clinical reasoning cases.

Representative outputs across screening, diagnosis, and treatment planning.


BUS · Grounded Caption

BI-RADS 4A ultrasound finding

Grounds the suspicious region and summarizes the key imaging finding.

MRI · Structured Report

BI-RADS 6 MRI report

Produces a structured report from multiparametric MRI evidence.

Histo · Pathology Report

Infiltrating ductal carcinoma

Summarizes pathology evidence for downstream treatment planning.

07 · Citation

BibTeX

@misc{breastgpt2026,
  title     = {BreastGPT: A Multimodal Large Language Model for the Full Spectrum of Breast Cancer Clinical Routine},
  author    = {Anonymous Author(s)},
  year      = {2026},
  note      = {Manuscript in preparation}
}