dpo_base_model_stage2

This repository provides a DPO LoRA adapter fine-tuned directly from Qwen/Qwen3-4B-Instruct-2507 using QLoRA (4-bit, Unsloth).

Unlike the SFT+DPO pipeline, this adapter applies DPO directly to the base model without a prior SFT adapter.

This repository contains LoRA adapter weights only. The base model must be loaded separately.

Training Objective

This adapter applies Direct Preference Optimization (DPO) to improve structured output accuracy (JSON / YAML / XML / TOML / CSV) by aligning the model with human preferences.

Training Configuration

Base model: Qwen/Qwen3-4B-Instruct-2507
SFT adapter: None (DPO applied directly to base model)
Method: DPO with QLoRA (4-bit)
Max sequence length: 2048
DPO beta: 0.1
Learning rate: 1e-07
LoRA: r=8, alpha=16
Batch size: 2 (grad accum: 4, effective BS=8)
Epochs: 1

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base = "Qwen/Qwen3-4B-Instruct-2507"
adapter = "DLNorb/dpo_base_model_stage2"

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(
    base,
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, adapter)

Sources & Terms (IMPORTANT)

Training data:

https://huggingface.co/datasets/u-10bei/dpo-dataset-qwen-cot: MIT License

Compliance: Users must comply with each dataset's license (including copyright notice) and the base model's original terms of use.

Downloads last month: 1

Model tree for DLNorb/dpo_base_model_stage2

Base model

Qwen/Qwen3-4B-Instruct-2507

Adapter

(5532)

this model

DLNorb
/

dpo_base_model_stage2