Fine-tuning LLaMA 3 on agricultural domain data — what 10 failed runs taught me

When I first tried to fine-tune LLaMA 3 on our agricultural dataset, I was confident it would take a weekend. Six weeks and ten failed runs later, I finally had something worth deploying. Here's everything I wish someone had told me.

The core challenge: our data was a mix of English technical docs, Hindi agronomic reports, and Kannada field notes — plus domain-specific terminology that the base model had never seen. Getting it to handle all three without catastrophic forgetting took far more than just tuning the learning rate.

The Problem

Base LLaMA 3 8B is strong on general English. But ask it about body condition scoring methodology or kharif crop rotation in semi-arid Karnataka and it hallucinates confidently. We needed domain grounding, not just prompt engineering.

llama_agri_finetune.ipynb · cell [1]
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model in 4-bit (QLoRA-style) so it fits on a single GPU
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# LoRA config — attempt 4
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

Attempt 4 was the first time I used LoRA properly — previous runs tried full fine-tuning on a single A100, which was slow, expensive, and prone to overfitting on our small dataset (~12K examples).
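Cell [2] below calls `trainer.train(...)`, but the trainer construction never made it into the post. Here's a minimal sketch of how it might be wired up — the prompt template, the names `format_example` and `build_trainer`, the checkpoint path, and every hyperparameter shown are my assumptions for illustration, not the actual values from run 10.

```python
def format_example(ex):
    """Flatten one {instruction, response} record into a single prompt string."""
    return (
        "### Instruction:\n" + ex["instruction"]
        + "\n\n### Response:\n" + ex["response"]
    )

def build_trainer(model, tokenizer, train_ds, eval_ds):
    """Assemble a plain HF Trainer; hyperparameters here are illustrative only."""
    from transformers import Trainer, TrainingArguments  # deferred: needs GPU stack
    args = TrainingArguments(
        output_dir="checkpoints/run10",   # hypothetical path
        num_train_epochs=5,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        eval_strategy="epoch",
        logging_strategy="epoch",
        bf16=True,
    )
    return Trainer(model=model, args=args,
                   train_dataset=train_ds, eval_dataset=eval_ds)
```

With per-device batch 4 and 8 accumulation steps, the effective batch is 32 — small enough for a single A100 at 4-bit while keeping gradient noise manageable on a ~12K-example dataset.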

cell [2]
trainer.train(resume_from_checkpoint=checkpoint_path)
plot_loss_curves(trainer.state.log_history)
OUTPUT [2]
Epoch 1/5 — train_loss: 2.41 · val_loss: 2.38
Epoch 2/5 — train_loss: 1.87 · val_loss: 1.82
Epoch 3/5 — train_loss: 1.21 · val_loss: 1.19 ← diverged in runs 1–3 here
Epoch 4/5 — train_loss: 0.89 · val_loss: 0.84
Epoch 5/5 — train_loss: 0.63 · val_loss: 0.61 ✓ best
[ loss curve chart — train vs val across 5 epochs ]

Fig 1. Training vs validation loss — run 10 (successful)
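`plot_loss_curves` in cell [2] is a small helper I haven't shown; a minimal sketch might look like the following, assuming the standard Trainer `log_history` format (a list of dicts carrying `loss` for training steps and `eval_loss` for evaluations) — the split/plot structure is mine.

```python
def split_history(log_history):
    """Separate HF Trainer's log_history into (epoch, loss) train/eval series."""
    train = [(e["epoch"], e["loss"]) for e in log_history if "loss" in e]
    evals = [(e["epoch"], e["eval_loss"]) for e in log_history if "eval_loss" in e]
    return train, evals

def plot_loss_curves(log_history):
    """Plot train vs val loss across epochs from a Trainer's log_history."""
    import matplotlib.pyplot as plt  # deferred so split_history stays importable
    for series, label in zip(split_history(log_history), ("train", "val")):
        xs, ys = zip(*series)
        plt.plot(xs, ys, marker="o", label=label)
    plt.xlabel("epoch")
    plt.ylabel("loss")
    plt.legend()
    plt.show()
```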

Key insight

The biggest improvement came from a custom tokenizer that included Kannada script tokens. Without it, the model was spending capacity on subword decomposition of agricultural terms rather than learning the domain.
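To see why this matters: every Kannada codepoint costs 3 bytes in UTF-8, so a byte-level BPE with no Kannada merges can burn many tokens on one short word. The sketch below shows the blow-up and the standard Hugging Face pattern for extending a tokenizer — the two Kannada words are illustrative examples, not the actual token list we mined from the corpus.

```python
# Each Kannada codepoint is 3 bytes in UTF-8, so without Kannada-aware merges
# a single short word can decompose into many byte-level subword tokens.
word = "ಬೆಳೆ"  # "crop" — 4 codepoints, 12 UTF-8 bytes
print(len(word), len(word.encode("utf-8")))

# Extending the tokenizer and embedding table (requires the gated model
# weights, so shown un-executed; the terms here are illustrative only):
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
# num_added = tokenizer.add_tokens(["ಬೆಳೆ", "ಮಣ್ಣು"])  # "crop", "soil"
# model.resize_token_embeddings(len(tokenizer))  # grow embeddings to new vocab
```

Forgetting the `resize_token_embeddings` call after `add_tokens` is a classic failure mode: the new token ids index past the end of the embedding matrix.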

Results & Metrics

| Run      | Approach                | Val Loss | Perplexity |
|----------|-------------------------|----------|------------|
| Runs 1–3 | Full fine-tune          | diverged | —          |
| Runs 4–7 | LoRA, base tokenizer    | 1.42     | 12.4       |
| Run 10   | LoRA + custom tokenizer | 0.61 ✓   | 8.1 ✓      |

Takeaways

Tokenizer quality matters as much as model size. If your domain has unusual vocabulary — especially non-Latin scripts — invest in the tokenizer before tuning anything else.

LoRA at r=16 with a small, high-quality dataset beats full fine-tuning at every budget. The compute savings let you run more experiments, which is where the real learning happens.
