VetMedFineTune0.3
FSDP2 + LoRA · 4× NVIDIA H200 · 72B Parameters
Training Active
Training Progress
0%
Current Step
0 / 0
Epoch
0.00
Tokens Processed
0
Time Remaining
Calculating...
Training Loss
Measures prediction error on the training data. Should decrease steadily during training. Lower values indicate a better fit.
Loss
—
Target:
< 1.5
Token Accuracy
Percentage of correct next-token predictions. Higher values indicate the model is learning patterns effectively.
Token Accuracy
—
Target:
> 65%
Learning Rate
Controls weight update magnitude. Typically follows warmup → peak → decay schedule for optimal training.
Learning Rate
—
Typical:
1e-6 → 1e-4
Gradient Norm
L2 norm of the gradients across all parameters. Should remain stable during training. Sustained high values (>2.0) may indicate instability.
Gradient Norm
—
Target:
< 1.0
Loss Ranges
< 1.5 — Excellent
1.5–2.5 — Normal
> 2.5 — High
Token Accuracy
> 65% — Excellent
50–65% — Learning
< 50% — Low
Gradient Norm
< 1.0 — Stable
1.0–2.0 — Elevated
> 2.0 — Unstable
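The gradient norm reported above is the global L2 norm over every parameter's gradient, and training loops usually clip it when it exceeds a threshold. A minimal sketch in plain Python, using nested lists as a hypothetical stand-in for gradient tensors (in PyTorch this is what `torch.nn.utils.clip_grad_norm_` does):

```python
import math

def global_grad_norm(grads):
    """L2 norm over all gradient tensors, flattened together."""
    total = sum(g * g for tensor in grads for g in tensor)
    return math.sqrt(total)

def clip_grads(grads, max_norm=1.0):
    """Scale every gradient down so the global norm is at most max_norm."""
    norm = global_grad_norm(grads)
    if norm > max_norm:
        scale = max_norm / norm
        grads = [[g * scale for g in tensor] for tensor in grads]
    return grads

# Two parameter gradients of magnitude 3 and 4 -> global norm 5.0,
# clipped back down to the 1.0 threshold.
grads = [[3.0], [4.0]]
clipped = clip_grads(grads, max_norm=1.0)
```

Clipping rescales all gradients by the same factor, so the update direction is preserved; only its magnitude is capped.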
Learning Rate
Warmup → Peak → Decay
Fine-tune: 1e-6 to 5e-5
> 1e-4 may destabilize
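The warmup → peak → decay shape above is commonly implemented as linear warmup followed by cosine decay. A sketch of that schedule; the `peak_lr`, `warmup_steps`, and `min_lr` values are illustrative defaults, not this run's actual hyperparameters:

```python
import math

def lr_at_step(step, total_steps, peak_lr=1e-4, warmup_steps=100, min_lr=1e-6):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp linearly from ~0 up to peak_lr over the warmup phase.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr at the end of warmup to min_lr at total_steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

At the end of warmup the schedule sits exactly at `peak_lr`, and at the final step it has decayed to `min_lr`.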
GPU Cluster
4 GPUs Active
556 GB Total
GPU 0
H200 SXM
—°C
Utilization
—%
Memory
— / 139 GB
GPU 1
H200 SXM
—°C
Utilization
—%
Memory
— / 139 GB
GPU 2
H200 SXM
—°C
Utilization
—%
Memory
— / 139 GB
GPU 3
H200 SXM
—°C
Utilization
—%
Memory
— / 139 GB
Model Configuration
Qwen2.5-72B-Instruct
LoRA
Train Samples
41,506
Batch Size
64
8/GPU × 2 accum × 4 GPUs
Steps / Epoch
648
Trainable Params
1.68B
2.26% of 74.4B
Total Steps
1,944
3 epochs
Framework
FSDP2
Full sharding
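The batch and step figures in this panel follow from a short calculation; a sketch using the values shown above:

```python
# Values from the Model Configuration panel.
per_gpu_batch = 8        # micro-batch per GPU
grad_accum = 2           # gradient accumulation steps
num_gpus = 4
train_samples = 41_506
epochs = 3

# Effective batch size: 8/GPU x 2 accum x 4 GPUs = 64.
effective_batch = per_gpu_batch * grad_accum * num_gpus

# Optimizer steps per epoch (partial final batch dropped): 648.
steps_per_epoch = train_samples // effective_batch

# Total steps over 3 epochs: 1,944.
total_steps = steps_per_epoch * epochs

# LoRA trainable fraction: 1.68B of 74.4B (base weights plus adapters).
pct_trainable = 100 * 1.68e9 / 74.4e9   # ~2.26%
```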
Metrics Over Time
Loss
Accuracy
Grad Norm
LR
Training Output
Connecting to training server...