Fine-Tuning Deepseek-R1-Llama with Unsloth and Google Colab


Unleash the Power: Fine-Tuning Deepseek-R1-Llama with Unsloth on Google Colab

The world of large language models (LLMs) is evolving at a breakneck pace, and harnessing their power for specific tasks often requires more than just out-of-the-box prompting. This is where fine-tuning comes in. In this blog post, we'll dive into fine-tuning the impressive Deepseek-R1-Llama model – a powerful open-source alternative – using the blazing-fast Unsloth library, all within the accessible environment of Google Colab.

Why Fine-Tune an LLM?

While base LLMs are incredibly versatile, fine-tuning offers several compelling advantages:

  • Domain Specificity: Tailor the model to excel in niche areas like medical texts, legal documents, or highly specialized code generation.
  • Improved Performance: Achieve higher accuracy and more relevant outputs for your specific use case compared to generic models.
  • Reduced Latency & Cost: A fine-tuned, smaller model can often outperform a much larger, general-purpose model for your task, leading to faster inference and lower computational costs.
  • Instruction Following: Improve the model's ability to follow complex instructions particular to your application.

Deepseek-R1-Llama: A Strong Contender

Deepseek-R1-Llama (more precisely, DeepSeek-R1-Distill-Llama) is an open-source model from DeepSeek AI, produced by distilling the reasoning behavior of the flagship DeepSeek-R1 model into the Llama architecture. It performs strongly on reasoning benchmarks, and because it is Llama-based it benefits from the extensive ecosystem and tooling built around that architecture, making it an excellent candidate for further specialization.

Unsloth: Speeding Up LoRA Fine-Tuning

Fine-tuning large models can be computationally intensive, but Unsloth changes the game. Unsloth is a library designed to make LoRA (Low-Rank Adaptation) fine-tuning up to 2x faster and use significantly less memory (up to 70% less) compared to standard Hugging Face implementations. This efficiency makes it perfect for fine-tuning even on free or low-tier Colab GPUs.
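To see why LoRA is so cheap, compare trainable parameter counts: instead of updating a full d_out × d_in weight matrix W, LoRA learns two small factors B (d_out × r) and A (r × d_in) and adds their product BA to W. A minimal sketch with hypothetical Llama-style dimensions:

```python
# Trainable-parameter comparison: full fine-tuning vs. a rank-16 LoRA adapter
# on a single (hypothetical) 4096x4096 projection matrix.
d_in, d_out, r = 4096, 4096, 16

full_params = d_out * d_in            # every weight of W is trainable
lora_params = d_out * r + r * d_in    # only the factors B and A are trained

print(full_params, lora_params, round(lora_params / full_params, 4))
```

With these (assumed) dimensions, the adapter trains well under 1% of the weights of that matrix, which is where the memory savings come from.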

Getting Started with Google Colab

Google Colab provides free access to GPUs (typically a T4 on the free tier, with faster options on paid tiers), making it an ideal platform for experimentation. Ensure you're running a GPU runtime:

  1. Go to 'Runtime' > 'Change runtime type'.
  2. Select 'GPU' as the hardware accelerator.
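Once the runtime is set, a quick sanity check (a minimal sketch using PyTorch, which Colab preinstalls) confirms the GPU is visible — the bf16 flag also tells you which precision to pick later:

```python
import torch

# Sanity check: confirm a GPU runtime is attached before installing anything.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("bf16 supported:", torch.cuda.is_bf16_supported())  # True on Ampere+ GPUs
```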

Step-by-Step Fine-Tuning

1. Installation and Setup

First, install Unsloth and other necessary libraries. You might need to authenticate with Hugging Face if your model or dataset is private.

!pip install "unsloth[colab-new]" trl transformers peft accelerate bitsandbytes
!pip install --upgrade typing-extensions # Fix potential dependency issues

from unsloth import FastLanguageModel
import torch

# Optional: Login to Hugging Face if you need access to private models/datasets
# from huggingface_hub import login
# login()

2. Loading the Model and Tokenizer

Unsloth's `FastLanguageModel` simplifies loading your model in 4-bit (QLoRA) or 8-bit precision, significantly reducing VRAM usage.

max_seq_length = 2048 # Or your desired context length
dtype = None # None for auto detection. float16 for Tesla T4, bfloat16 for Ampere+ GPUs
load_in_4bit = True # Use 4bit quantization for less memory

# Model ID for the distilled R1 Llama model; check Hugging Face for the exact name.
# Unsloth also hosts pre-quantized variants, e.g. 'unsloth/DeepSeek-R1-Distill-Llama-8B'.
model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_id,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

3. Adding LoRA Adapters

Configure LoRA parameters. Unsloth automatically applies optimizations.

model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # LoRA attention dimension
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"], # Modules to apply LoRA to
    lora_alpha = 16,
    lora_dropout = 0.05,
    bias = "none",    # Options: 'none', 'all', 'lora_only'
    use_gradient_checkpointing = "unsloth", # Unsloth's optimized checkpointing; False is slightly faster but uses more VRAM
    random_state = 3407, # For reproducibility
    use_rslora = False,  # Rank-stabilized LoRA; off since we use a fixed r value
    loftq_config = None, # Optional LoftQ config for quantization-aware initialization
)

4. Data Preparation

You'll need a dataset formatted for instruction tuning. A common format is a JSONL file with 'instruction' and 'output' fields, or a 'text' field containing the combined prompt and response. Unsloth works seamlessly with Hugging Face datasets.
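For concreteness, a single record in such a JSONL file might look like the following (the field values here are made up for illustration; the field names match the formatting function used below):

```python
import json

# A hypothetical instruction-tuning record, one JSON object per line in a .jsonl file.
record = {
    "instruction": "Summarize the main benefit of LoRA fine-tuning.",
    "output": "LoRA trains small low-rank adapter matrices instead of all model "
              "weights, cutting memory and compute costs.",
}

line = json.dumps(record)   # this string would be one line of the .jsonl file
print(line)
```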

from datasets import load_dataset

# Example: A simple instruction-following dataset
dataset_text_field = "text" # Adjust if your dataset has different column names

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    outputs = examples["output"]
    texts = []
    for instruction, output in zip(instructions, outputs):
        # Customize your prompt template here
        text = f"### Instruction:\n{instruction}\n### Response:\n{output}{tokenizer.eos_token}"
        texts.append(text)
    return { "text" : texts, }

# Load your dataset
# For demonstration, load a public instruction-following dataset from Hugging Face
dataset = load_dataset("yahma/alpaca-cleaned", split = "train")
# Apply the formatting function
dataset = dataset.map(formatting_prompts_func, batched = True, remove_columns=dataset.column_names)

print(dataset[0]["text"])

5. Training the Model

Unsloth integrates with TRL's `SFTTrainer` (built on Hugging Face's `Trainer`) for the training loop.

from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = dataset_text_field,
    max_seq_length = max_seq_length,
    dataset_num_proc = 2, # Number of processes to use for data processing
    packing = False, # Can be True for faster training if your data is short
    args = TrainingArguments(
        per_device_train_batch_size = 2, # Adjust based on GPU memory
        gradient_accumulation_steps = 4, # Simulate larger batch size
        warmup_steps = 5,
        num_train_epochs = 3, # Or a specific number of steps
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(), # Use FP16 on T4, BF16 on Ampere+
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit", # Or "paged_adamw_8bit" for even less memory
        weight_decay = 0.01,
        lr_scheduler_type = "constant",
        seed = 3407,
        output_dir = "outputs", # Directory to save checkpoints
    ),
)

# Start training!
trainer.train()
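Before moving on, it's worth checking how close the run came to the GPU's memory limit; a small sketch using PyTorch's CUDA allocator statistics (the numbers you see will vary by GPU and configuration):

```python
import torch

# Report peak GPU memory reserved during the training run.
if torch.cuda.is_available():
    peak_gb = torch.cuda.max_memory_reserved() / 1024**3
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"Peak reserved: {peak_gb:.2f} GB of {total_gb:.2f} GB")
else:
    print("No CUDA device visible")
```

If the peak is near the total, lower per_device_train_batch_size or max_seq_length before trying longer runs.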

6. Inference

After training, you can test your fine-tuned model immediately.

FastLanguageModel.for_inference(model) # Enable Unsloth's optimized inference

instruction = "Write a short poem about the beauty of deep learning."
prompt = f"### Instruction:\n{instruction}\n### Response:\n"

inputs = tokenizer(prompt, return_tensors = "pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 512, use_cache = True)
print(tokenizer.decode(outputs[0], skip_special_tokens = True))

# Or, to stream tokens as they are generated:
# from transformers import TextStreamer
# streamer = TextStreamer(tokenizer)
# _ = model.generate(**inputs, streamer = streamer, max_new_tokens = 512)

7. Saving Your Fine-tuned Model

To use your fine-tuned model outside of this notebook, you need to merge the LoRA adapters back into the base model and save it.

# Merge LoRA adapters to the base model
model.save_pretrained_merged("deepseek_r1_llama_fine_tuned", tokenizer, save_method = "merged_16bit")

# To save in 4bit (if you want to push to HF Hub in 4bit)
# model.save_pretrained_merged("deepseek_r1_llama_fine_tuned_4bit", tokenizer, save_method = "merged_4bit")

# Optional: Push to Hugging Face Hub
# model.push_to_hub("your_username/deepseek-r1-llama-finetuned", tokenizer = tokenizer)

Conclusion

You've just fine-tuned Deepseek-R1-Llama using Unsloth on Google Colab! This powerful combination allows researchers and developers to quickly adapt state-of-the-art LLMs to their specific needs, even with limited resources. The potential applications are limitless – from specialized chatbots and content generation to enhanced code assistants. Experiment with different datasets, LoRA parameters, and prompt formats to unlock the full potential of your specialized Deepseek-R1-Llama model!

To understand how Kusmus AI engineers sovereign intelligence and deploys private agentic workflows for African enterprises, explore our Enterprise Solutions.

Experience the capabilities directly within our Sovereign Sandbox environment.
