VetMedFineTune0.3

FSDP2 + LoRA · 4× NVIDIA H200 · 72B Parameters

Training Active
Training Progress 0%
Current Step 0 / 0
Epoch 0.00
Tokens Processed 0
Time Remaining Calculating...
Training Loss
Cross-entropy error on next-token prediction. Should decrease steadily during training; lower values indicate a better fit.
Loss
Target: < 1.5
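The loss this panel tracks can be sketched as standard next-token cross-entropy. A minimal version, assuming logits of shape (batch, seq, vocab) and -100 as the label-masking convention (the function name is illustrative, not from the run's code):

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Shift so the logits at position t predict the token at position t+1.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,  # skip masked positions (e.g. prompt tokens)
    )
```

With uniform logits over a vocabulary of size V the loss is ln(V), which is why an untrained or randomly initialized head starts far above the < 1.5 target.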
Token Accuracy
Percentage of correct next-token predictions. Higher values indicate the model is learning patterns effectively.
Token Accuracy
Target: > 65%
Learning Rate
Controls weight update magnitude. Typically follows a warmup → peak → decay schedule for stable training.
Learning Rate
Typical: 1e-6 → 1e-4
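The warmup → peak → decay shape can be sketched as linear warmup followed by cosine decay. This is one common choice, not necessarily the schedule this run uses; `warmup_steps` is an assumption, while the total step count and the 1e-6 → 1e-4 range match the values shown on this page:

```python
import math

def lr_at(step: int, total_steps: int = 1944, warmup_steps: int = 100,
          peak_lr: float = 1e-4, min_lr: float = 1e-6) -> float:
    # Linear warmup from ~0 to peak_lr over the first warmup_steps steps.
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```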
Gradient Norm
L2 norm of the gradients before each weight update. Should remain stable. High values (>2.0) may indicate training instability.
Gradient Norm
Target: < 1.0
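In PyTorch this value usually comes out of `torch.nn.utils.clip_grad_norm_`, which returns the global L2 norm over all trainable parameters and rescales gradients that exceed `max_norm`. A minimal sketch (whether this run clips, and at what threshold, is an assumption):

```python
import torch

def global_grad_norm(params, max_norm: float = 1.0) -> float:
    # Returns the pre-clipping global norm; gradients above max_norm
    # are rescaled in place so their combined norm equals max_norm.
    total_norm = torch.nn.utils.clip_grad_norm_(params, max_norm)
    return float(total_norm)
```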
Loss Ranges
< 1.5 — Excellent
1.5 – 2.5 — Normal
> 2.5 — High
Token Accuracy
> 65% — Excellent
50–65% — Learning
< 50% — Low
Gradient Norm
< 1.0 — Stable
1.0–2.0 — Elevated
> 2.0 — Unstable
Learning Rate
Warmup → Peak → Decay
Fine-tune: 1e-6 to 5e-5
> 1e-4 may destabilize

GPU Cluster

4 GPUs Active
556 GB Total
GPU 0 H200 SXM
—°C
Utilization —%
Memory — / 139 GB
GPU 1 H200 SXM
—°C
Utilization —%
Memory — / 139 GB
GPU 2 H200 SXM
—°C
Utilization —%
Memory — / 139 GB
GPU 3 H200 SXM
—°C
Utilization —%
Memory — / 139 GB

Model Configuration

Qwen2.5-72B-Instruct LoRA
Train Samples 41,506
Batch Size 64 8/GPU × 2 accum × 4 GPUs
Steps / Epoch 648
Trainable Params 1.68B 2.26% of 74.4B
Total Steps 1,944 3 epochs
Framework FSDP2 Full sharding
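The batch and step figures above are mutually consistent and can be checked with the arithmetic below; all values are taken from this configuration table:

```python
# Effective batch size: per-GPU micro-batch x gradient accumulation x GPU count.
per_gpu_batch = 8
grad_accum = 2
num_gpus = 4
effective_batch = per_gpu_batch * grad_accum * num_gpus  # 64

# Optimizer steps: one per effective batch, over 3 epochs.
train_samples = 41_506
steps_per_epoch = train_samples // effective_batch       # 648
epochs = 3
total_steps = steps_per_epoch * epochs                   # 1,944
```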
Metrics Over Time

Training Output

Connecting to training server...