Fine-Tuning a Qwen model on AMD Radeon AI PRO R9700 (RDNA4) using LlamaFactory
This documents the end-to-end process of fine-tuning Qwen3 models (0.6B and 1.7B) using LlamaFactory running on AMD Radeon AI PRO R9700 GPUs (gfx1201/RDNA4). For the purpose of this tutorial I will tune the model to detect multi-predicate conditional statements in JavaScript that should be extracted into predicate functions.
The final production setup achieves ~80% detection rate with zero false positives on unseen codebases, served via llama.cpp with Vulkan inference.
Hardware
- AMD Radeon AI PRO R9700 (32GB VRAM, RDNA4/gfx1201) - only one GPU needed for this tutorial. Given the small model sizes (0.6B-1.7B with QLoRA), a GPU with less VRAM would also work.
- AMD Threadripper PRO 3945WX
- 64GB RAM
- Fedora 44 Server Edition
Software
- LlamaFactory - a unified framework for fine-tuning large language models. It supports LoRA, QLoRA, and full fine-tuning across 100+ model architectures, with built-in dataset management, a web UI, and export tools. Used in this tutorial for QLoRA training and merging the adapter back into the base model.
- llama.cpp - a lightweight C/C++ inference engine for running LLMs on consumer hardware. It supports CPU, Vulkan, CUDA, and Metal backends, with its own GGUF model format optimised for fast loading and quantisation. Used in this tutorial to serve the fine-tuned model with Vulkan on the R9700.
- tree-sitter - an incremental parsing library used by editors like VS Code and Neovim for syntax highlighting and code navigation. Used in this tutorial to parse JavaScript files into ASTs and mechanically extract multi-predicate conditionals for training data generation.
The RDNA4 Training Problem
AMD's ROCm training tools (AITER, Flash Attention, etc.) were originally built for CDNA GPUs (MI250/MI300) and RDNA4 support is still being added - often through community patches. At the time of writing, PyTorch training on R9700 has known performance and stability issues (ROCm #5674). Key issues encountered:
- rocBLAS Tensile GEMM kernel crashes: Memory access faults on specific matrix shapes during training. The crash is non-deterministic and depends on sequence length, batch composition, and data ordering.
- Standard LoRA training crashes: Full-precision LoRA hits the broken GEMM kernels. QLoRA (4-bit quantisation) takes different code paths that avoid the crash.
The Fix: ROCm 6.3 + QLoRA + bitsandbytes from Source
Credit to @jd-lo who documented a working QLoRA training setup on R9700 in LlamaFactory issue #10511. Their findings - ROCm 6.3, bitsandbytes compiled from source for gfx1201, and QLoRA 4-bit - were the foundation that made this work.
The solution uses:
- ROCm 6.3 base image (not the vLLM image)
- PyTorch 2.9.1+rocm6.3
- bitsandbytes compiled from source for gfx1201
- QLoRA (4-bit quantisation) which hits different GEMM kernels that work on RDNA4
- Single GPU training (CUDA_VISIBLE_DEVICES=0)
Docker Setup
Dockerfile.llamafactory:
FROM rocm/pytorch:rocm6.3_ubuntu22.04_py3.10_pytorch_release_2.3.0
ENV AMDGPU_TARGETS=gfx1201
ENV HSA_OVERRIDE_GFX_VERSION=12.0.1
ENV PYTORCH_ROCM_ARCH=gfx1201
# Upgrade PyTorch to 2.9.1+rocm6.3
RUN pip install --upgrade torch==2.9.1+rocm6.3 \
--index-url https://download.pytorch.org/whl/rocm6.3
# Build and install bitsandbytes from source for gfx1201
RUN cd /tmp && \
git clone https://github.com/bitsandbytes-foundation/bitsandbytes.git && \
cd bitsandbytes && \
cmake -DCOMPUTE_BACKEND=hip \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_HIP_ARCHITECTURES="gfx1201" \
-S . -B build && \
cmake --build build -j$(nproc) && \
rm -f /opt/conda/envs/py_3.10/compiler_compat/ld && \
pip install . && \
rm -rf /tmp/bitsandbytes
# Install LLaMA Factory
RUN pip install llamafactory
# Fix accelerate unhashable set bug
RUN MODELING_PY=$(python -c "import accelerate.utils.modeling; print(accelerate.utils.modeling.__file__)") && \
sed -i '/elif not isinstance(no_split_module_classes, (list, tuple)):/i\ elif isinstance(no_split_module_classes, set):\n no_split_module_classes = list(no_split_module_classes)' \
"$MODELING_PY" || true
# Remove old torchvision incompatible with torch 2.9
RUN pip uninstall -y torchvision
WORKDIR /app
Key detail: rm -f /opt/conda/envs/py_3.10/compiler_compat/ld removes conda's broken linker, which otherwise prevents bitsandbytes from building during the pip install . step. The ROCm base image ships with conda, which is generally a problematic environment; this linker issue is one of several reasons to avoid it when possible.
docker-compose.yml:
services:
llamafactory:
build:
context: .
dockerfile: Dockerfile.llamafactory
container_name: llamafactory
entrypoint: ""
command: bash
devices:
- /dev/kfd
- /dev/dri
group_add:
- "video"
cap_add:
- SYS_PTRACE
security_opt:
- label=disable
ipc: host
ports:
- "7860:7860"
- "8000:8000"
environment:
- HF_TOKEN=${HF_TOKEN}
- BNB_BACKEND=rocm
volumes:
- ./data:/app/data:Z
- ./output:/app/output:Z
- ./huggingface_cache:/root/.cache/huggingface:Z
stdin_open: true
tty: true
.env:
HF_TOKEN=hf_xxxxxxxxxxxx
Build and Verify
docker compose build
docker compose up -d
docker compose exec llamafactory bash
# Verify GPUs are visible
python -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.device_count())"
# Should print: True and 1 or more
Dataset Generation
Why Predicate Functions?
Multi-predicate conditional statements like if (a > 5 && b < 10 && c !== null) are a seemingly innocent construct, but when duplicated throughout a codebase they can lead to subtle bugs if not kept in sync. Extracting them into predicate functions makes conditions reusable - change the logic in one place and it gets updated everywhere.
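As a minimal illustration of the refactoring this rule asks for, here is the same idea sketched in Python rather than JavaScript, to stay consistent with the scripts later in this tutorial. All function and variable names are hypothetical.

```python
# Before: the three-part condition is written inline and would be
# copy-pasted wherever the same check is needed.
def handle_inline(a, b, c):
    if a > 5 and b < 10 and c is not None:
        return "accepted"
    return "rejected"


# After: the condition lives in one named predicate function.
def is_valid_input(a, b, c):
    """True when all three checks pass; the single place to change the rule."""
    return a > 5 and b < 10 and c is not None


def handle_with_predicate(a, b, c):
    if is_valid_input(a, b, c):
        return "accepted"
    return "rejected"
```

Both handlers behave identically; the difference is that a change to the rule only has to be made inside is_valid_input.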
This tutorial focuses on detecting these patterns, but the broader goal is to build a small, fast MoE model composed of multiple LoRA adapters - each trained on a different code style rule - for use in automated code reviews.
Tool: tree-sitter
Tree-sitter parses JavaScript files into an AST and extracts multi-predicate conditional statements mechanically. This provides 100% accurate ground truth for labelling training data.
Dependencies:
pip install tree-sitter tree-sitter-javascript
Dataset Generator Script (generate_dataset.py)
The generator:
- Walks directories for .js files
- Parses each file with tree-sitter
- Extracts all conditionals (if, else if, while, do...while, for, ternary) with multiple predicates
- Generates full-file training examples (violation files + clean files)
- Generates short snippet examples for each individual violation
- Generates clean snippet examples from single-predicate conditionals
Key implementation details:
- Detects else if by checking if the if_statement node's parent is an else_clause
- Counts logical operators (&&, ||) recursively in the condition subtree
- Skips minified files (lines > 500 chars)
- Skips files > 32KB (won't fit in context)
- Skips node_modules in subdirectories
- No line numbers in output (model hallucinated them)
- Single dash separator in output format
- Uses "all" in the prompt: "Review this code for all predicate function violations:"
import sys
import os
import json
import tree_sitter_javascript as tsjs
from tree_sitter import Language, Parser, Query, QueryCursor
JS = Language(tsjs.language())
parser = Parser(JS)
def count_logical_ops(node):
count = 0
if node.type == "binary_expression":
op = node.child_by_field_name("operator")
if op and op.text.decode() in ("&&", "||"):
count += 1
for child in node.children:
count += count_logical_ops(child)
return count
CONDITION_QUERIES = [
("if", "(if_statement condition: (_) @cond) @stmt"),
("while", "(while_statement condition: (_) @cond) @stmt"),
("do...while", "(do_statement condition: (_) @cond) @stmt"),
("for", "(for_statement condition: (_) @cond) @stmt"),
("ternary", "(ternary_expression condition: (_) @cond) @stmt"),
]
def find_violations(source):
tree = parser.parse(source)
violations = []
for kind, query_str in CONDITION_QUERIES:
q = Query(JS, query_str)
cursor = QueryCursor(q)
for _, captures in cursor.matches(tree.root_node):
cond = captures["cond"][0]
stmt = captures["stmt"][0]
ops = count_logical_ops(cond)
if ops >= 1:
line = cond.start_point[0] + 1
actual_kind = kind
if kind == "if" and stmt.parent and stmt.parent.type == "else_clause":
actual_kind = "else if"
violations.append({
"line": line,
"code": cond.text.decode(),
"predicates": ops + 1,
"kind": actual_kind,
"start_row": stmt.start_point[0],
"end_row": stmt.end_point[0],
})
return violations
def find_clean_ifs(source):
"""Find single-predicate conditional statements for clean snippet examples."""
tree = parser.parse(source)
clean = []
for _, query_str in CONDITION_QUERIES:
q = Query(JS, query_str)
cursor = QueryCursor(q)
for _, captures in cursor.matches(tree.root_node):
cond = captures["cond"][0]
stmt = captures["stmt"][0]
ops = count_logical_ops(cond)
if ops == 0:
clean.append({
"start_row": stmt.start_point[0],
"end_row": stmt.end_point[0],
})
return clean
def extract_snippet(source_text, start_row, end_row, context=3):
"""Extract a few lines around a target range."""
lines = source_text.split("\n")
s = max(0, start_row - context)
e = min(len(lines), end_row + context + 1)
return "\n".join(lines[s:e]), s
def build_example(source_text, violations):
user_content = f"Review this code for all predicate function violations:\n```js\n{source_text}\n```"
if violations:
lines = []
for v in violations:
kind = v.get("kind", "if")
lines.append(f"{kind} ({v['code']}) - should be a predicate function.")
assistant_content = "\n".join(lines)
else:
assistant_content = "No violations found."
return {
"messages": [
{"role": "user", "content": user_content},
{"role": "assistant", "content": assistant_content}
]
}
def find_js_files(path):
js_files = []
for root, dirs, files in os.walk(path):
# skip hidden dirs and node_modules in subdirs
dirs[:] = [d for d in dirs if not d.startswith(".") and d != "node_modules"]
for f in files:
if f.endswith(".js"):
js_files.append(os.path.join(root, f))
return js_files
def main():
if len(sys.argv) < 2:
print(f"Usage: {sys.argv[0]} <path> [path2] [path3] ...")
sys.exit(1)
js_files = []
for path in sys.argv[1:]:
js_files.extend(find_js_files(path))
dataset = []
violations_count = 0
clean_count = 0
skipped = 0
for filepath in js_files:
try:
with open(filepath, "rb") as f:
source = f.read()
# skip huge files that won't fit in context
if len(source) > 32000:
print(f"Skipping {filepath}: too large ({len(source)} bytes)", file=sys.stderr)
skipped += 1
continue
source_text = source.decode("utf-8", errors="replace")
# skip minified files
lines = source_text.split("\n")
if any(len(line) > 500 for line in lines[:10]):
skipped += 1
continue
violations = find_violations(source)
example = build_example(source_text, violations)
dataset.append(example)
if violations:
violations_count += 1
# Add individual snippet examples for each violation
for v in violations:
snippet, offset = extract_snippet(source_text, v["start_row"], v["end_row"])
snippet_violation = {
"line": v["line"] - offset,
"code": v["code"],
"predicates": v["predicates"],
"kind": v.get("kind", "if"),
}
dataset.append(build_example(snippet, [snippet_violation]))
violations_count += 1
else:
clean_count += 1
# Add clean snippet examples from single-predicate ifs
clean_ifs = find_clean_ifs(source)
for ci in clean_ifs[:3]: # max 3 clean snippets per file
snippet, _ = extract_snippet(source_text, ci["start_row"], ci["end_row"])
dataset.append(build_example(snippet, []))
clean_count += 1
except Exception as e:
print(f"Skipping {filepath}: {e}", file=sys.stderr)
skipped += 1
# write dataset
output_path = "dataset.jsonl"
with open(output_path, "w") as f:
for example in dataset:
f.write(json.dumps(example) + "\n")
print(f"Files scanned: {len(js_files)}")
print(f"Examples with violations: {violations_count}")
print(f"Clean examples: {clean_count}")
print(f"Skipped (too large / errors): {skipped}")
print(f"Dataset written to: {output_path}")
if __name__ == "__main__":
main()
Training Data Format
{"messages": [
{"role": "user", "content": "Review this code for all predicate function violations:\n```js\n<full file content>\n```"},
{"role": "assistant", "content": "if (a > 5 && b < 10) - should be a predicate function.\nelse if (x || y) - should be a predicate function."}
]}
{"messages": [
{"role": "user", "content": "Review this code for all predicate function violations:\n```js\n<clean file content>\n```"},
{"role": "assistant", "content": "No violations found."}
]}
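Before training, it can be worth checking that every line of the JSONL file actually has this shape. The following is a minimal sketch using only the standard library; validate_example is a hypothetical helper, and the checks it performs (exactly two messages, user then assistant, prompt prefix present) are assumptions based on the format shown above.

```python
import json

PROMPT_PREFIX = "Review this code for all predicate function violations:"


def validate_example(line):
    """Return a list of problems found in one JSONL line (empty list = valid)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["top-level value is not an object"]

    messages = record.get("messages")
    if not isinstance(messages, list) or len(messages) != 2:
        return ["expected exactly two messages"]

    roles = [m.get("role") if isinstance(m, dict) else None for m in messages]
    if roles != ["user", "assistant"]:
        return [f"unexpected roles: {roles}"]

    problems = []
    user_text = str(messages[0].get("content", ""))
    assistant_text = str(messages[1].get("content", ""))
    if not user_text.startswith(PROMPT_PREFIX):
        problems.append("user prompt does not start with the review instruction")
    if not assistant_text.strip():
        problems.append("assistant response is empty")
    return problems
```

Running it over each line of dataset.jsonl and printing any non-empty result is enough to catch a malformed record before it reaches the trainer.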
Generating the Dataset
Clone open source JavaScript projects for training data:
git clone --depth 1 https://github.com/expressjs/express
git clone --depth 1 https://github.com/fastify/fastify
git clone --depth 1 https://github.com/koajs/koa
git clone --depth 1 https://github.com/socketio/socket.io
git clone --depth 1 https://github.com/webpack/webpack
git clone --depth 1 https://github.com/eslint/eslint
git clone --depth 1 https://github.com/chalk/chalk
git clone --depth 1 https://github.com/nodemailer/nodemailer
git clone --depth 1 https://github.com/sequelize/sequelize
git clone --depth 1 https://github.com/Automattic/mongoose
Scan all repos at once:
python generate_dataset.py ./express ./fastify ./koa ./socket.io ./webpack ./eslint ./chalk ./nodemailer ./sequelize ./mongoose
You can also include your own project directories alongside these.
Balancing the Dataset
An unbalanced dataset causes the model to favour whichever class dominates. Too many clean examples and it defaults to "No violations found" on everything. Too many violations and it hallucinates problems where there are none. A 50/50 split between violation and clean examples gives the model equal exposure to both outcomes.
balance_dataset.py:
import random
import sys
input_file = sys.argv[1] if len(sys.argv) > 1 else "dataset.jsonl"
output_file = sys.argv[2] if len(sys.argv) > 2 else "dataset_balanced.jsonl"
with open(input_file) as f:
lines = f.readlines()
violations = [l for l in lines if "should be a predicate" in l]
clean = [l for l in lines if "No violations found" in l]
# Trim clean examples to match violation count
random.shuffle(clean)
clean = clean[:len(violations)]
# Shuffle so violations and clean examples are interleaved randomly.
# Without this, the model sees all violations first then all clean,
# and learns position-dependent patterns instead of the actual rule.
balanced = violations + clean
random.shuffle(balanced)
with open(output_file, "w") as f:
for l in balanced:
f.write(l)
print(f"Violations: {len(violations)}, Clean: {len(clean)}, Total: {len(balanced)}")
python balance_dataset.py dataset.jsonl dataset_balanced.jsonl
Final dataset: ~9,300 balanced examples from 10+ open source repos.
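Note that balance_dataset.py only trims the clean class, so if violations ever outnumber clean examples the output stays unbalanced. A symmetric variant might look like the sketch below; balance and its seed parameter are hypothetical names, and the fixed seed is an added assumption to make the split reproducible.

```python
import random


def balance(violations, clean, seed=0):
    """Downsample the larger class to the size of the smaller, then shuffle."""
    rng = random.Random(seed)  # local RNG so the result is reproducible
    target = min(len(violations), len(clean))
    kept = rng.sample(violations, target) + rng.sample(clean, target)
    rng.shuffle(kept)  # interleave the two classes randomly
    return kept
```

With 5 violation lines and 3 clean lines this returns 6 lines, 3 from each class.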
dataset_info.json
Place in data/ directory alongside the dataset:
{
"predicate_violations": {
"file_name": "dataset_balanced.jsonl",
"formatting": "sharegpt",
"columns": {
"messages": "messages"
},
"tags": {
"role_tag": "role",
"content_tag": "content",
"user_tag": "user",
"assistant_tag": "assistant"
}
}
}
The tags section is required because LlamaFactory's default ShareGPT format expects from/value keys, but the dataset uses role/content.
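An alternative to configuring tags is to rewrite each record into the default layout. The sketch below assumes the defaults are a conversations column with from/value keys and human/gpt as the role names, which is what LlamaFactory's data documentation described at the time of writing; check the version you have installed. to_default_sharegpt is a hypothetical helper.

```python
# Assumed default ShareGPT role names; verify against your LlamaFactory version.
ROLE_TO_SHAREGPT = {"user": "human", "assistant": "gpt"}


def to_default_sharegpt(record):
    """Convert a {"messages": [{role, content}, ...]} record to from/value form."""
    return {
        "conversations": [
            {"from": ROLE_TO_SHAREGPT[m["role"]], "value": m["content"]}
            for m in record["messages"]
        ]
    }
```

Keeping role/content and adding the tags section, as done here, avoids a second copy of the dataset and matches the format most chat APIs use.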
Training Configuration
I initially aimed for the smallest model possible (Qwen3-0.6B) but after evaluation settled on Qwen3-1.7B. The 0.6B model learns the pattern and catches obvious violations but struggles with longer files - it tends to stop generating after the first violation and misses the rest. The 1.7B model is significantly more thorough, consistently listing multiple violations per file with fewer misses. I ended up choosing the 1.7B for production, but training both is recommended - the 0.6B trains in roughly 70% of the time (~37 vs ~52 minutes in the runs reported below) and is useful for quick iteration when experimenting with dataset changes.
Training YAML (predicate_lora.yaml)
For Qwen3-0.6B:
model_name_or_path: Qwen/Qwen3-0.6B
stage: sft
do_train: true
finetuning_type: lora
quantization_bit: 4
lora_target: all
lora_rank: 16
lora_alpha: 16
dataset: predicate_violations
template: qwen3
cutoff_len: 4096
output_dir: /app/output/predicate_lora
per_device_train_batch_size: 2
gradient_accumulation_steps: 4
num_train_epochs: 1
learning_rate: 0.0002
lr_scheduler_type: cosine
warmup_steps: 5
logging_steps: 10
save_steps: 200
save_total_limit: 2
bf16: true
plot_loss: true
report_to: none
For Qwen3-1.7B (predicate_lora_large.yaml):
Same as above but:
model_name_or_path: Qwen/Qwen3-1.7B
output_dir: /app/output/predicate_lora_large
learning_rate: 0.0001
Lower learning rate (0.0001 vs 0.0002) for the larger model - larger models generally train more stably with a lower learning rate.
Critical Settings
- quantization_bit: 4 - QLoRA. This is what makes training work on RDNA4. Standard LoRA crashes.
- cutoff_len: 4096 - Must be large enough for whole-file examples. Setting this to 1024 caused the model to learn truncated responses and default to "No violations found."
- lora_target: all - Targets all linear modules (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj). Better results than default q_proj/v_proj only.
- num_train_epochs: 1 - Single epoch on 9,300 diverse examples. Multiple epochs on smaller datasets caused overfitting.
- learning_rate: 0.0002 - Written as 0.0002 not 2e-4 because the older transformers version in the ROCm 6.3 image parses scientific notation as a string.
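A small pre-flight check can catch the scientific-notation problem before a run starts. The sketch below assumes the training YAML is a flat list of key: value lines, as in this tutorial, and flags values written in exponent form without a decimal point, a form that YAML 1.1 style loaders such as PyYAML are known to read as a string. find_risky_floats is a hypothetical helper, not part of LlamaFactory.

```python
import re

# Exponent-form number with no decimal point, e.g. "2e-4" or "1E5".
RISKY_FLOAT = re.compile(r"^[-+]?\d+[eE][-+]?\d+$")


def find_risky_floats(yaml_text):
    """Return (key, value) pairs whose value looks like 2e-4 rather than 0.0002."""
    risky = []
    for raw in yaml_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if RISKY_FLOAT.match(value):
            risky.append((key, value))
    return risky
```

Calling find_risky_floats on the contents of predicate_lora.yaml should return an empty list when every learning rate is written in plain decimal form.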
Training Command
# Single GPU, no distributed
FORCE_TORCHRUN=0 CUDA_VISIBLE_DEVICES=0 llamafactory-cli train /app/data/predicate_lora.yaml
# With two GPUs, both models can train simultaneously: run each command in its own shell
FORCE_TORCHRUN=0 CUDA_VISIBLE_DEVICES=0 llamafactory-cli train /app/data/predicate_lora.yaml
FORCE_TORCHRUN=0 CUDA_VISIBLE_DEVICES=1 llamafactory-cli train /app/data/predicate_lora_large.yaml
Training Results
| Model | Examples | Epochs | Steps | Time | Final Loss |
|---|---|---|---|---|---|
| Qwen3-0.6B | 9,300 | 1 | 1,163 | ~37 min | 0.092 |
| Qwen3-1.7B | 9,300 | 1 | 1,163 | ~52 min | 0.105 |
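The step count in the table follows from the batch settings in the YAML. As a rough check, assuming exactly 9,300 examples and that a final partial accumulation window still counts as one step (trainer versions differ on this detail):

```python
import math


def optimizer_steps(num_examples, per_device_batch, grad_accum, num_gpus=1, epochs=1):
    """Approximate number of optimizer steps for a training run."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    return math.ceil(num_examples / effective_batch) * epochs


# per_device_train_batch_size: 2, gradient_accumulation_steps: 4, one GPU
print(optimizer_steps(9300, 2, 4))  # 1163
```

The effective batch size is 2 x 4 = 8, and 9,300 / 8 = 1,162.5, which rounds up to the 1,163 steps reported for both models.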
Export and Conversion
Step 1: Merge LoRA Adapter into Base Model
This step is generic - it produces a standard HuggingFace model directory that can be used with any inference framework (vLLM, llama.cpp, TGI, etc.). Inside the llamafactory container:
llamafactory-cli export \
--model_name_or_path Qwen/Qwen3-1.7B \
--adapter_name_or_path /app/output/predicate_lora_large \
--template qwen3 \
--finetuning_type lora \
--export_dir /app/output/merged_predicate_1.7b
Step 2: Convert to GGUF (llama.cpp specific)
GGUF is llama.cpp's model format. The conversion requires Python dependencies that aren't in the llamafactory container. Add them to your llama.cpp Dockerfile:
FROM ubuntu:24.04
# ... your existing llama.cpp build steps ...
RUN apt-get update && apt-get install -y python3 python3-pip && \
rm -rf /var/lib/apt/lists/*
# GGUF conversion deps (CPU-only torch to keep it small)
RUN pip install --break-system-packages \
gguf transformers sentencepiece protobuf numpy
RUN pip install --break-system-packages \
torch --index-url https://download.pytorch.org/whl/cpu
Then inside the llama.cpp container, clone the conversion script and run it:
git clone --depth 1 https://github.com/ggml-org/llama.cpp /tmp/llama.cpp
python3 /tmp/llama.cpp/convert_hf_to_gguf.py \
/path/to/merged_predicate_1.7b \
--outfile predicate-1.7b-q8_0.gguf \
--outtype q8_0
Make sure the merged model directory from step 1 is accessible to the llama.cpp container via a shared volume.
Step 3: Serve with llama.cpp
llama-server -m predicate-1.7b-q8_0.gguf \
--host 0.0.0.0 --port 8080
Inference Configuration
Critical Sampling Parameters
{
"model": "Qwen3-1.7B",
"messages": [{"role": "user", "content": "Review this code for all predicate function violations:\n```js\n...\n```"}],
"temperature": 0.1,
"min_tokens": 500,
"max_tokens": 4096
}
- temperature: 0.1 - Low temperature for consistent, deterministic output. Higher temperatures cause hallucinations.
- min_tokens: 500 - Critical. Without this, the model generates EOS after the first violation and stops listing the rest. This was the fix for the early-stopping problem. LlamaFactory's API does NOT support this parameter - llama.cpp does.
- max_tokens: 4096 - Ensure long files get complete violation lists.
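For callers that prefer Python over curl, the same request can be assembled with the standard library. This is a sketch rather than a tested client: build_payload and review are hypothetical names, the endpoint and port are taken from the llama-server command above, and the response shape is assumed to follow the OpenAI-style chat completions format that scan.sh also relies on.

```python
import json
import urllib.request

PROMPT = "Review this code for all predicate function violations:"
FENCE = "`" * 3  # three backticks, built indirectly so this example nests cleanly


def build_payload(code, model="Qwen3-1.7B"):
    """Build the chat completions request body with the sampling settings above."""
    return {
        "model": model,
        "messages": [
            {"role": "user", "content": f"{PROMPT}\n{FENCE}js\n{code}\n{FENCE}"}
        ],
        "temperature": 0.1,
        "min_tokens": 500,
        "max_tokens": 4096,
    }


def review(code, api="http://localhost:8080"):
    """POST the payload to the server and return the assistant's reply text."""
    request = urllib.request.Request(
        f"{api}/v1/chat/completions",
        data=json.dumps(build_payload(code)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Keeping the prompt text byte-for-byte identical to the training prompt matters here, since the adapter was only ever trained on that exact instruction.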
Production Scanner (scan.sh)
Bash script that walks a directory, sends each JS file to the model, and optionally shows tree-sitter comparison:
# Model only
./scan.sh ./my-project
# Model + tree-sitter comparison
./scan.sh ./my-project --ts
# Custom API endpoint
./scan.sh ./my-project --api http://localhost:8080
scan.sh:
#!/bin/bash
# Usage: ./scan.sh /path/to/project [--ts] [--api URL]
DIR=""
API="http://localhost:8080"
SHOW_TS=false
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
while [[ $# -gt 0 ]]; do
case "$1" in
--ts) SHOW_TS=true; shift ;;
--api) API="$2"; shift 2 ;;
*) DIR="$1"; shift ;;
esac
done
DIR="${DIR:-.}"
find "$DIR" -name '*.js' -not -path '*/node_modules/*' -not -path '*/.*' | sort | while read -r file; do
# skip minified files
if head -1 "$file" | wc -c | grep -q '[0-9]\{4,\}'; then
continue
fi
# skip files > 32KB
size=$(wc -c < "$file")
if [ "$size" -gt 32000 ]; then
continue
fi
# Model scan
model_result=$(jq -n --arg code "$(cat "$file")" '{
model: "Qwen3-1.7B",
messages: [{
role: "user",
content: ("Review this code for all predicate function violations:\n```js\n" + $code + "\n```")
}],
temperature: 0.1,
min_tokens: 500,
max_tokens: 4096
}' | curl -s "$API/v1/chat/completions" \
-H "Content-Type: application/json" \
-d @- | jq -r '.choices[0].message.content' | sed 's/<think>//g; s/<\/think>//g' | sed '/^$/d')
model_has_violations=false
if [ -n "$model_result" ] && ! echo "$model_result" | grep -q "No violations found"; then
model_has_violations=true
fi
# Tree-sitter scan (only if --ts)
ts_has_violations=false
ts_result=""
if $SHOW_TS; then
ts_result=$(python3 "$SCRIPT_DIR/ts_single.py" "$file" 2>/dev/null)
if [ -n "$ts_result" ] && ! echo "$ts_result" | grep -q "No violations found"; then
ts_has_violations=true
fi
fi
# Only print if either found something
if $model_has_violations || $ts_has_violations; then
echo ""
echo "=== $file ==="
echo "$model_result"
if $SHOW_TS; then
echo "[Tree-sitter]"
echo "$ts_result"
fi
fi
done
The --ts flag requires ts_single.py in the same directory, which uses tree-sitter to provide ground truth comparison. The sed commands strip Qwen3's <think> tags from the output.
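The post-processing that scan.sh does with sed and grep can be expressed in Python as well. The sketch below mirrors the two sed commands (remove the think tags, drop empty lines) and the "No violations found" check; clean_response and has_violations are hypothetical names. Like the sed version, it removes only the tags themselves, not any text between them.

```python
import re

THINK_TAGS = re.compile(r"</?think>")


def clean_response(text):
    """Strip <think> and </think> tags, then drop empty lines."""
    without_tags = THINK_TAGS.sub("", text)
    return "\n".join(line for line in without_tags.split("\n") if line != "")


def has_violations(text):
    """True when the cleaned reply is non-empty and is not the clean verdict."""
    cleaned = clean_response(text)
    return bool(cleaned) and "No violations found" not in cleaned
```

An empty reply is treated as "no violations", matching the -n test in the shell script.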
Results
Detection Accuracy (1.7B model, unseen codebases)
- Overall: ~80% detection rate (94/118 violations caught)
- False positives: 0 (zero hallucinated violations at temp 0.1)
- Perfect on: standalone if statements, validation patterns, guard clauses
- Weak on: ternary expressions, repeated boilerplate near file tops, some duplicate patterns deep in long files
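In standard terms, the detection rate is recall (94 / 118, about 0.797) and zero false positives means a precision of 1.0. One way to compute such figures from a tree-sitter result and a model result is sketched below. It assumes violations can be compared as exact strings, which is a simplification, and detection_metrics is a hypothetical helper.

```python
def detection_metrics(expected, reported):
    """Compare ground-truth violations with model-reported ones (as string sets)."""
    expected, reported = set(expected), set(reported)
    caught = expected & reported
    return {
        "caught": len(caught),
        "missed": len(expected - reported),
        "false_positives": len(reported - expected),
        # Conventions for empty inputs: nothing to find / nothing reported = 1.0
        "recall": len(caught) / len(expected) if expected else 1.0,
        "precision": len(caught) / len(reported) if reported else 1.0,
    }
```

For example, four expected violations with three of them reported and nothing extra gives a recall of 0.75 and a precision of 1.0.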
What the Model Adds Over Tree-sitter
For this specific rule (predicate function violations), tree-sitter achieves 100% accuracy mechanically. The model's value is:
- Natural language output format (human-readable reports)
- Foundation for rules tree-sitter CAN'T handle (naming conventions, comment quality, code readability)
- Proof of concept for the LoRA adapter architecture - add more adapters for more rules
Architecture for Multiple Rules
Each rule gets its own LoRA adapter trained on the same base model:
Base Qwen3-1.7B (frozen)
├── predicate_lora (predicate function violations)
├── naming_lora (camelCase + descriptive names)
├── comments_lora (comment quality)
└── ... more rules
Swap adapters at inference time. llama.cpp and vLLM both support runtime LoRA loading.
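On the client side, this layout could be represented as a small registry that maps each rule to its adapter and prompt. The sketch below is purely illustrative: RuleAdapter, ADAPTERS and plan_review are hypothetical names, the naming_lora path and prompt are invented placeholders, and it does not call any llama.cpp or vLLM API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleAdapter:
    name: str         # adapter name as in the tree above
    adapter_dir: str  # where the trained adapter lives
    prompt: str       # instruction the adapter was trained on


ADAPTERS = {
    "predicate": RuleAdapter(
        "predicate_lora",
        "/app/output/predicate_lora_large",
        "Review this code for all predicate function violations:",
    ),
    # Placeholder entry for a rule that has not been trained yet.
    "naming": RuleAdapter(
        "naming_lora",
        "/app/output/naming_lora",
        "Review this code for all naming violations:",
    ),
}


def plan_review(rules):
    """Return the adapters to apply, in order, failing early on unknown rules."""
    unknown = [rule for rule in rules if rule not in ADAPTERS]
    if unknown:
        raise KeyError(f"no adapter registered for: {unknown}")
    return [ADAPTERS[rule] for rule in rules]
```

A review run would then loop over plan_review(["predicate", "naming"]), activate each adapter in turn, and send the matching prompt.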
Lessons Learned
- QLoRA is required on RDNA4 - standard LoRA hits broken rocBLAS Tensile GEMM kernels. QLoRA uses different code paths that work.
- ROCm 6.3 is more stable than the vLLM image for training. The vLLM image crashed at step 40-50 consistently; ROCm 6.3 ran 1,163 steps clean.
- cutoff_len must match your data - training on whole files but setting cutoff_len: 1024 truncated the assistant responses, teaching the model to default to "No violations found."
- Dataset balance matters - too many clean examples (66%) made the model never flag violations. 50/50 is the sweet spot.
- min_tokens is essential - the model learned to generate EOS after the first violation. Forcing minimum generation length fixed multi-violation detection.
- 1 epoch on diverse data beats 3 epochs on small data - more codebases > more passes over the same code.
- Scientific notation breaks older transformers - write 0.0002 not 2e-4 in the YAML.
- Remove conda's broken linker - rm -f /opt/conda/envs/py_3.10/compiler_compat/ld before pip install . for bitsandbytes.