A Local AI Coding Assistant. How Hard Could It Be? (Pretty Hard, Actually.)

March 2026 | AI Development & Local LLMs

My codebase has never touched a cloud API. That’s a choice, not an accident — and it’s why I spent two afternoons and most of a weekend fighting Ollama, Docker, llama.cpp, and a GitHub issue thread before I got a local AI coding assistant that actually worked.

Here’s what went wrong, what finally fixed it, and what I still haven’t solved.


Why Local at All

Two reasons, in order of importance.

The first is data control. Every prompt you send to a cloud API is a potential leak vector. The model sees your variable names, your architecture patterns, your bugs. Most developers accept this tradeoff implicitly. I stopped accepting it. Services get hacked; my desktop doesn’t. That’s a humble brag, and I mean it.

The second is latency. Once a local model is loaded into memory, generation starts in milliseconds. No network round-trip, no rate limits, no pricing surprises at the end of the month.

I’ve wanted this since 2004 — when I took a year of AI in college, data mining included, back when neural networks were still mostly a theoretical exercise. Twenty years later I can almost achieve what I conceptualized back then. Almost is doing real work in that sentence. More on that at the end.

The tradeoff is setup complexity. That’s what this post actually addresses.


The Model: Jackrong/Qwen3.5-9B

This model is a reasoning-focused distillation of Claude Opus 4.6, compressed to 9B parameters and quantized at Q8_0. The Q8_0 quantization preserves more model capacity than Q4 variants at the cost of roughly 2x disk and RAM usage — worth it for a coding workload where precision matters.

Feature           Details
----------------  ------------------------------
Base model        Distilled from Claude Opus 4.6
Specialization    Reasoning + code generation
Parameter count   9B
Quantization      Q8_0

I started with the official Qwen 3.5 model from the Ollama library. It ran on my GPU. It did not run tools. It returned tool call blocks as plain text and ignored them — and that was the beginning of two very frustrating afternoons.


What Actually Went Wrong First

I Googled everything I could think of. I switched to llama.cpp. I got the models running in Docker without Ollama. Nothing worked. The tool calls kept coming back as text. At no point did the Ollama documentation, the model card, or any forum post I could find explain why.

The answer was buried in a GitHub issue comment. I clicked away from it the first time. Came back. Read it again. Came back a third time before I could make sense of it.

The Modelfile was wrong. The default Modelfile Ollama generates when you pull a model doesn’t include a TEMPLATE block. Without that block, the model has no idea what a tool call is supposed to look like. It just returns the raw syntax as text and moves on.

That one omission — undocumented, unmentioned by Ollama, absent from every guide I found — was the entire problem. The fix is in Step 3 below.


The Hugging Face Problem

While I was deep in the Docker detour, I ran into a second issue: models pulled directly from Hugging Face through Ollama break in two distinct ways.

The first model I tried threw a 404 error downloading the third file in its set — partway through a multi-gigabyte download, no warning. The second model downloaded cleanly and then silently failed to run tools.

The fix for both: use ollama create to rebuild the model from the downloaded weights file. When a Hugging Face pull fails mid-download, Ollama has usually already grabbed the largest blob — the actual weights. You can build a working model from that file directly. This is not documented anywhere I could find. I worked it out myself.
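As a sketch of that recovery path (the blob path and the model name qwen-rebuilt below are placeholders, not the actual values from my setup):

```shell
# Rebuild a model from a weights blob that survived a failed pull.
# Replace BLOB with the real sha256-... path of your largest blob.
BLOB=/usr/share/ollama/.ollama/models/blobs/sha256-REPLACE_ME

# A one-line Modelfile pointing straight at the weights is enough to
# get a runnable model; parameters and a template can be added later.
printf 'FROM %s\n' "$BLOB" > Modelfile
ollama create qwen-rebuilt -f Modelfile
```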


What You Need Before Starting

  • Ollama installed (ollama.com)
  • 10–15 GB of free disk space for the quantized weights
  • A terminal
  • Patience for a slow initial download

Step 1: Pull the Model

ollama pull hf.co/Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-GGUF:Q8_0

This downloads the model weights and generates a default Modelfile. Expect a download of roughly 10–11 GB for the Q8_0 weights. If it fails partway through, go to Step 2 anyway.


Step 2: Find the Model Blob

Ollama stores weights as content-addressed blobs in /usr/share/ollama/.ollama/models/blobs/ on a system-wide Linux install (user installs keep them under ~/.ollama/models/blobs/ instead). You need the path to the largest blob — that’s the weights file.

ollama show --modelfile hf.co/Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-GGUF:Q8_0 \
  | awk '/^FROM/ {print $2}' \
  | xargs -r du -h

Output looks like:

11.0G    /usr/share/ollama/.ollama/models/blobs/sha256-01ab75e862bf...

Copy that path. If the file size shown is under 1 GB, you have a metadata blob — not the weights. The weights file is always the largest one.
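A quicker sanity check, if you’d rather not parse the Modelfile, is to list the blob directory by size (same path caveat as above — this assumes a system-wide Linux install):

```shell
# Largest file first; the top entry should be the multi-gigabyte
# weights blob, with the small metadata blobs below it.
ls -lhS /usr/share/ollama/.ollama/models/blobs/ | head -n 5
```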


Step 3: Configure for Coding Work — Including the Tool Template

This is the step nobody documents. Three parameters matter for coding; the TEMPLATE block matters for everything.

Context window (num_ctx 32768): 32k tokens lets you paste entire files into a single prompt. This is the most important setting for coding tasks.

Temperature (0.6): Lower temperature means more deterministic output. For code generation, you want the model to commit rather than explore.

Repeat penalty (1.0): At 1.0, no penalty for repeating tokens. This sounds counterintuitive — but coding models at higher repeat penalties will avoid reusing variable names and function signatures from earlier in the context. That’s exactly the wrong behavior when you want consistent naming.

FROM /usr/share/ollama/.ollama/models/blobs/sha256-01ab75e862bf...

PARAMETER num_ctx 32768
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER top_k 20
PARAMETER min_p 0.0
PARAMETER presence_penalty 0.0
PARAMETER repeat_penalty 1.0

Replace the FROM path with the blob path from Step 2.

These parameters are tuned for coding work. If you want to use this model more like a general-purpose LLM — longer, more exploratory responses — the original parameter documentation from Hugging Face has the full range of options.

Now add the tool-calling template. Without this block, tools won’t work — the model will return tool call syntax as plain text. This is the fix for everything that was broken:

TEMPLATE """{{- if .Suffix }}<|fim_prefix|>{{ .Prompt }}<|fim_suffix|>{{ .Suffix }}<|fim_middle|>
{{- else -}}
{{- $lastUserIdx := -1 -}}
{{- range $idx, $msg := .Messages -}}
{{- if eq $msg.Role "user" }}{{ $lastUserIdx = $idx }}{{ end -}}
{{- end }}
{{- if or .System .Tools }}<|im_start|>system
{{ if .System }}
{{ .System }}
{{- end }}
{{- if .Tools }}

# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{{- range .Tools }}
{"type": "function", "function": {{ .Function }}}
{{- end }}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
{{- end -}}
<|im_end|>
{{ end }}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 -}}
{{- if eq .Role "user" }}<|im_start|>user
{{ .Content }}
{{ else if eq .Role "assistant" }}<|im_start|>assistant
{{- if .Content }}
{{ .Content }}
{{- else if .ToolCalls }}
{{- range .ToolCalls }}
<tool_call>
{"name": "{{ .Function.Name }}", "arguments": {{ .Function.Arguments }}}
</tool_call>
{{- end }}
{{- end }}{{ if not $last }}<|im_end|>
{{ end }}
{{- else if eq .Role "tool" }}<|im_start|>user
<tool_response>
{{ .Content }}
</tool_response><|im_end|>
{{ end }}
{{- if and (ne .Role "assistant") $last }}<|im_start|>assistant
{{ end }}
{{- end }}
{{- end }}
"""

Step 4: Create the Model

Save the FROM line, the PARAMETER lines, and the TEMPLATE block from Step 3 into a single file named Modelfile, then build the model from it:

ollama create Qwen3.5-Coder-Distilled -f Modelfile

Verify It Works

ollama list | grep qwen3.5
# qwen3.5-coder-distilled:latest      06113f46d78a    9.5 GB

ollama run Qwen3.5-Coder-Distilled "Hello"
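“Hello” only proves the model loads. To confirm the template actually fixed tool calling, hit Ollama’s chat API directly with a dummy tool (the get_time function here is made up for the test; the /api/chat endpoint and its tools field are standard Ollama):

```shell
# Send one user message plus a single fake tool definition.
curl -s http://localhost:11434/api/chat -d '{
  "model": "Qwen3.5-Coder-Distilled",
  "stream": false,
  "messages": [{"role": "user", "content": "What time is it? Use the tool."}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_time",
      "description": "Return the current time",
      "parameters": {"type": "object", "properties": {}}
    }
  }]
}'
```

With a working template, the response carries a structured tool_calls array inside message; with the broken default Modelfile, you get tool-call syntax dumped into message.content as plain text.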

What to Expect on Real Hardware

I run this on a 12 GB Nvidia RTX 4070. The model occupies most of that VRAM; there isn’t much headroom left.

Metric                  What I Saw
----------------------  -----------------------
Initial load time       2–4 minutes
Resident memory         9–11 GB while active
Token generation (GPU)  Usable; faster than CPU

GPU acceleration makes a significant difference on generation speed. The load time is largely fixed regardless of hardware.


What It Can’t Do Yet

I tested this model with three coding agents — Claude Code, Opencode, and Qwen Code — against the same task: Create and compile a simple “Hello world!” program in C++. Compile it and run it to verify your work.

Claude Code one-shotted it consistently. Opencode got there, but needed more than one run. Qwen Code generated code that looked correct and then failed to compile; it stalled trying to invoke g++ and just stopped — no error to chase.
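For reference, the whole task done by hand is a few lines of shell — which is what makes the agent failures informative rather than excusable:

```shell
# Write, compile, and run the same hello-world task given to the agents.
cat > hello.cpp <<'EOF'
#include <iostream>
int main() { std::cout << "Hello world!\n"; }
EOF
g++ hello.cpp -o hello && ./hello   # prints: Hello world!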

That’s not a model indictment. It’s a system indictment. The 32k context window starts to become a constraint as a codebase grows. The 9B Q8_0 model is good, but it’s working at the edge of available VRAM. A smaller model with a larger context window might actually outperform it on real-world coding tasks — I haven’t tested that yet.

The hardware ceiling is visible. More VRAM would help. A server with two or four GPUs would help more. Whether to wait for an AMD Ryzen AI chip or an Nvidia Blackwell at consumer prices — or whether the right move is a smaller model today — is the next question, not a conclusion.


This Post Was Written with the Model It Describes

The provenance of this post is worth being honest about. It started as my own documentation — notes I took while working through the problem. The local model described in this post rewrote those notes into a first draft. I edited that draft. Then Claude rewrote it with some context. I edited it again. Then Claude rewrote it a final time — this version — after an interview process where I answered 25 questions about what actually happened, what failed, and what I still don’t know, followed by another round of editing.

Four passes. Three rewrites. Two different AI systems. One human who had to do the actual work before any of it was worth writing down.

There are probably still sentences in here that a model wrote and I didn’t catch. You’ll recognize them; they’re the ones that couldn’t have come from doing the work.

The ones that did come from doing the work:

“No one knew that anywhere else on the entire internet.”

“I clicked away from the answer three times before I believed it.”


The next post covers what happens when you actually push this in a real coding workflow — context limits, model tradeoffs, and whether the hardware ceiling is the problem or just the most visible one.

Here’s the original technical documentation.