Fine-tuning LLaMA 3 on agricultural domain data — what 10 failed runs taught me

When I first tried to fine-tune LLaMA 3 on our agricultural dataset, I was confident it would take a weekend. Six weeks and ten failed runs later, I finally had something worth deploying. Here's everything I wish someone had told me.

The core challenge: our data was a mix of English technical docs, Hindi agronomic reports, and Kannada field notes — plus domain-specific terminology that the base model had never seen. Getting it to handle all three without catastrophic forgetting took far more than just tuning the learning rate.

The Problem

Base LLaMA 3 8B is strong on general English. But ask it about body condition scoring methodology or kharif crop rotation in semi-arid Karnataka and it hallucinates confidently. We needed domain grounding, not just prompt engineering.

llama_agri_finetune.ipynb · cell [1]
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model in 4-bit (QLoRA-style) so it fits on a single GPU
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# LoRA config — attempt 4
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

Attempt 4 was the first time I used LoRA properly — previous runs tried full fine-tuning on a single A100, which was slow, expensive, and prone to overfitting on our small dataset (~12K examples).
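Cell [2] below calls `trainer.train(...)`, but the trainer construction never made it into the post. Here's a minimal sketch of how it might be wired up — the prompt template, the names `format_example` and `build_trainer`, the checkpoint path, and every hyperparameter shown are my assumptions for illustration, not the actual values from run 10.

```python
def format_example(ex):
    """Flatten one {instruction, response} record into a single prompt string."""
    return (
        "### Instruction:\n" + ex["instruction"]
        + "\n\n### Response:\n" + ex["response"]
    )

def build_trainer(model, tokenizer, train_ds, eval_ds):
    """Assemble a plain HF Trainer; hyperparameters here are illustrative only."""
    from transformers import Trainer, TrainingArguments  # deferred: needs GPU stack
    args = TrainingArguments(
        output_dir="checkpoints/run10",   # hypothetical path
        num_train_epochs=5,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        eval_strategy="epoch",
        logging_strategy="epoch",
        bf16=True,
    )
    return Trainer(model=model, args=args,
                   train_dataset=train_ds, eval_dataset=eval_ds)
```

With per-device batch 4 and 8 accumulation steps, the effective batch is 32 — small enough for a single A100 at 4-bit while keeping gradient noise manageable on a ~12K-example dataset.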

cell [2]
trainer.train(resume_from_checkpoint=checkpoint_path)
plot_loss_curves(trainer.state.log_history)
OUTPUT [2]
Epoch 1/5 — train_loss: 2.41 · val_loss: 2.38
Epoch 2/5 — train_loss: 1.87 · val_loss: 1.82
Epoch 3/5 — train_loss: 1.21 · val_loss: 1.19 ← diverged in runs 1–3 here
Epoch 4/5 — train_loss: 0.89 · val_loss: 0.84
Epoch 5/5 — train_loss: 0.63 · val_loss: 0.61 ✓ best
[ loss curve chart — train vs val across 5 epochs ]

Fig 1. Training vs validation loss — run 10 (successful)
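`plot_loss_curves` in cell [2] is a small helper I haven't shown; a minimal sketch might look like the following, assuming the standard Trainer `log_history` format (a list of dicts carrying `loss` for training steps and `eval_loss` for evaluations) — the split/plot structure is mine.

```python
def split_history(log_history):
    """Separate HF Trainer's log_history into (epoch, loss) train/eval series."""
    train = [(e["epoch"], e["loss"]) for e in log_history if "loss" in e]
    evals = [(e["epoch"], e["eval_loss"]) for e in log_history if "eval_loss" in e]
    return train, evals

def plot_loss_curves(log_history):
    """Plot train vs val loss across epochs from a Trainer's log_history."""
    import matplotlib.pyplot as plt  # deferred so split_history stays importable
    for series, label in zip(split_history(log_history), ("train", "val")):
        xs, ys = zip(*series)
        plt.plot(xs, ys, marker="o", label=label)
    plt.xlabel("epoch")
    plt.ylabel("loss")
    plt.legend()
    plt.show()
```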

Key insight

The biggest improvement came from a custom tokenizer that included Kannada script tokens. Without it, the model was spending capacity on subword decomposition of agricultural terms rather than learning the domain.
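To see why this matters: every Kannada codepoint costs 3 bytes in UTF-8, so a byte-level BPE with no Kannada merges can burn many tokens on one short word. The sketch below shows the blow-up and the standard Hugging Face pattern for extending a tokenizer — the two Kannada words are illustrative examples, not the actual token list we mined from the corpus.

```python
# Each Kannada codepoint is 3 bytes in UTF-8, so without Kannada-aware merges
# a single short word can decompose into many byte-level subword tokens.
word = "ಬೆಳೆ"  # "crop" — 4 codepoints, 12 UTF-8 bytes
print(len(word), len(word.encode("utf-8")))

# Extending the tokenizer and embedding table (requires the gated model
# weights, so shown un-executed; the terms here are illustrative only):
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
# num_added = tokenizer.add_tokens(["ಬೆಳೆ", "ಮಣ್ಣು"])  # "crop", "soil"
# model.resize_token_embeddings(len(tokenizer))  # grow embeddings to new vocab
```

Forgetting the `resize_token_embeddings` call after `add_tokens` is a classic failure mode: the new token ids index past the end of the embedding matrix.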

Results & Metrics

| Run      | Approach                | Val Loss | Perplexity |
|----------|-------------------------|----------|------------|
| Runs 1–3 | Full fine-tune          | diverged | —          |
| Runs 4–7 | LoRA, base tokenizer    | 1.42     | 12.4       |
| Run 10   | LoRA + custom tokenizer | 0.61 ✓   | 8.1 ✓      |

Takeaways

Tokenizer quality matters as much as model size. If your domain has unusual vocabulary — especially non-Latin scripts — invest in the tokenizer before tuning anything else.

LoRA at r=16 with a small, high-quality dataset beats full fine-tuning at every budget. The compute savings let you run more experiments, which is where the real learning happens.
