Fine-tuning LLaMA 3 on agricultural domain data — what 10 failed runs taught me
When I first tried to fine-tune LLaMA 3 on our agricultural dataset, I was confident it would take a weekend. Six weeks and ten failed runs later, I finally had something worth deploying. Here's everything I wish someone had told me.
The core challenge: our data was a mix of English technical docs, Hindi agronomic reports, and Kannada field notes — plus domain-specific terminology the base model had never seen. Getting one model to handle all three languages without catastrophic forgetting took far more than adjusting the learning rate.
The Problem
Base LLaMA 3 8B is strong on general English. But ask it about body condition scoring methodology or kharif crop rotation in semi-arid Karnataka and it hallucinates confidently. We needed domain grounding, not just prompt engineering.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the base model in 4-bit (needs bitsandbytes) so it fits on a single GPU
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    load_in_4bit=True,
    device_map="auto"
)

# LoRA config — attempt 4: rank-16 adapters on the attention projections only
lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
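At r=16 on just the attention projections, model.print_trainable_parameters() reports well under one percent of the 8B weights as trainable, which is what makes running lots of cheap experiments possible.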
Attempt 4 was the first time I used LoRA properly — previous runs tried full fine-tuning on a single A100, which was slow, expensive, and prone to overfitting on our small dataset (~12K examples).
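The training loop itself was plain Hugging Face Trainer code (that's what the plot_loss_curves call below reads from). The sketch here is a reconstruction rather than the exact script: every hyperparameter is a placeholder, and train_ds, val_ds, and tokenizer are assumed to exist already.

from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

# Placeholder hyperparameters -- tune for your own GPU and dataset
training_args = TrainingArguments(
    output_dir="llama3-agri-lora",   # hypothetical output directory
    num_train_epochs=5,
    per_device_train_batch_size=4,   # placeholder, not the run-10 value
    gradient_accumulation_steps=8,   # placeholder
    learning_rate=2e-4,              # a common LoRA starting point
    eval_strategy="epoch",           # "evaluation_strategy" in older transformers versions
    logging_steps=10,
)

trainer = Trainer(
    model=model,                     # the LoRA-wrapped model from above
    args=training_args,
    train_dataset=train_ds,          # assumed pre-tokenized splits
    eval_dataset=val_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()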
plot_loss_curves(trainer.state.log_history)
Epoch 2/5 — train_loss: 1.87 · val_loss: 1.82
Epoch 3/5 — train_loss: 1.21 · val_loss: 1.19 ← diverged in runs 1–3 here
Epoch 4/5 — train_loss: 0.89 · val_loss: 0.84
Epoch 5/5 — train_loss: 0.63 · val_loss: 0.61 ✓ best
Fig 1. Training vs validation loss — run 10 (successful)
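plot_loss_curves isn't a library function, just a thin matplotlib wrapper over trainer.state.log_history. A minimal version, assuming the standard Trainer log format, looks roughly like this:

import matplotlib.pyplot as plt

def plot_loss_curves(log_history):
    # Trainer logs training loss under "loss" and validation loss under "eval_loss"
    train = [(e["epoch"], e["loss"]) for e in log_history if "loss" in e]
    evals = [(e["epoch"], e["eval_loss"]) for e in log_history if "eval_loss" in e]
    plt.plot(*zip(*train), label="train_loss")
    plt.plot(*zip(*evals), label="val_loss")
    plt.xlabel("epoch")
    plt.ylabel("loss")
    plt.legend()
    plt.show()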
The biggest improvement came from a custom tokenizer that included Kannada script tokens. Without it, the model was spending capacity on subword decomposition of agricultural terms rather than learning the domain.
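Extending the stock tokenizer and resizing the embedding matrix is the simplest version of this; the sketch below shows the mechanics, with a few illustrative tokens standing in for the real vocabulary list.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Illustrative examples only, not the actual vocabulary list
new_tokens = ["ರಾಗಿ", "ಮಳೆ", "kharif", "rabi", "vermicompost"]
num_added = tokenizer.add_tokens(new_tokens)

# Give the new token ids trainable embedding rows
model.resize_token_embeddings(len(tokenizer))

Without entries like these, Kannada words fall back to byte-level pieces, which is exactly the subword-decomposition drain described above.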
Takeaways
Tokenizer quality matters as much as model size. If your domain has unusual vocabulary — especially non-Latin scripts — invest in the tokenizer before tuning anything else.
In our runs, LoRA at r=16 on a small, high-quality dataset beat full fine-tuning at every compute budget we tried. The savings let you run more experiments, and that is where the real learning happens.