VetMedFineTune0.3

FSDP2 + LoRA · 4× NVIDIA H200 · 72B Parameters

Training Active
Training Progress 0%
Current Step 0 / 0
Epoch 0.00
Tokens Processed 0
Time Remaining Calculating...
Training Loss
Cross-entropy error on next-token prediction. Should decrease steadily during training; lower values indicate a better fit.
Loss
Target: < 1.5
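The loss this panel tracks can be sketched as standard next-token cross-entropy. A minimal version, assuming logits of shape (batch, seq, vocab) and -100 as the label-masking convention (the function name is illustrative, not from the run's code):

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Shift so the logits at position t predict the token at position t+1.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,  # skip masked positions (e.g. prompt tokens)
    )
```

With uniform logits over a vocabulary of size V the loss is ln(V), which is why an untrained or randomly initialized head starts far above the < 1.5 target.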
Token Accuracy
Percentage of correct next-token predictions. Higher values indicate the model is learning patterns effectively.
Token Accuracy
Target: > 65%
Learning Rate
Controls weight update magnitude. Typically follows a warmup → peak → decay schedule for stable training.
Learning Rate
Typical: 1e-6 → 1e-4
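The warmup → peak → decay shape can be sketched as linear warmup followed by cosine decay. This is one common choice, not necessarily the schedule this run uses; `warmup_steps` is an assumption, while the total step count and the 1e-6 → 1e-4 range match the values shown on this page:

```python
import math

def lr_at(step: int, total_steps: int = 1944, warmup_steps: int = 100,
          peak_lr: float = 1e-4, min_lr: float = 1e-6) -> float:
    # Linear warmup from ~0 to peak_lr over the first warmup_steps steps.
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```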
Gradient Norm
L2 norm of the gradients before each weight update. Should remain stable. High values (>2.0) may indicate training instability.
Gradient Norm
Target: < 1.0
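In PyTorch this value usually comes out of `torch.nn.utils.clip_grad_norm_`, which returns the global L2 norm over all trainable parameters and rescales gradients that exceed `max_norm`. A minimal sketch (whether this run clips, and at what threshold, is an assumption):

```python
import torch

def global_grad_norm(params, max_norm: float = 1.0) -> float:
    # Returns the pre-clipping global norm; gradients above max_norm
    # are rescaled in place so their combined norm equals max_norm.
    total_norm = torch.nn.utils.clip_grad_norm_(params, max_norm)
    return float(total_norm)
```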
Loss Ranges
< 1.5 — Excellent
1.5 – 2.5 — Normal
> 2.5 — High
Token Accuracy
> 65% — Excellent
50–65% — Learning
< 50% — Low
Gradient Norm
< 1.0 — Stable
1.0–2.0 — Elevated
> 2.0 — Unstable
Learning Rate
Warmup → Peak → Decay
Fine-tune: 1e-6 to 5e-5
> 1e-4 may destabilize

GPU Cluster

4 GPUs Active
556 GB Total
GPU 0 H200 SXM
—°C
Utilization —%
Memory — / 139 GB
GPU 1 H200 SXM
—°C
Utilization —%
Memory — / 139 GB
GPU 2 H200 SXM
—°C
Utilization —%
Memory — / 139 GB
GPU 3 H200 SXM
—°C
Utilization —%
Memory — / 139 GB

Model Configuration

Qwen2.5-72B-Instruct LoRA
Train Samples 41,506
Batch Size 64 8/GPU × 2 accum × 4 GPUs
Steps / Epoch 648
Trainable Params 1.68B 2.26% of 74.4B
Total Steps 1,944 3 epochs
Framework FSDP2 Full sharding
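The batch and step figures above are mutually consistent and can be checked with the arithmetic below; all values are taken from this configuration table:

```python
# Effective batch size: per-GPU micro-batch x gradient accumulation x GPU count.
per_gpu_batch = 8
grad_accum = 2
num_gpus = 4
effective_batch = per_gpu_batch * grad_accum * num_gpus  # 64

# Optimizer steps: one per effective batch, over 3 epochs.
train_samples = 41_506
steps_per_epoch = train_samples // effective_batch       # 648
epochs = 3
total_steps = steps_per_epoch * epochs                   # 1,944
```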
Metrics Over Time

Training Output

Connecting to training server...