Author: zero4281

  • I Built a Local AI Coding Assistant on Consumer Hardware…and It Works. I think.

    May 2026 | AI Development & Local LLMs

    I used a local LLM to build the software that runs the local LLM. That sentence is either impressive or embarrassing depending on how it went. It went both ways — and the embarrassing half is the part worth talking about.

    The question I started with was simple: can a consumer GPU run a coding assistant well enough to actually use the stupid thing? All the buzz is around Claude Code and MCPs. I wanted to know what happens when you kick the corps out of your workflow and do it yourself. Cost restrictions are real. Capacity constraints are real. While I was writing this post, Claude hit a wall mid-conversation due to capacity limits — the kind of friction that doesn’t exist when the model runs on your own hardware. That’s not a complaint; it’s a data point.

    The answer is yes. It works…with some caveats. I have a slightly buggy MVP v1.0 on GitHub to prove it.


    How It Started

    I started with a test prompt and a repository called opencode-test:

    Write a python script that can search models in the huggingface hub. Save it to ./hf_search.py. Use this documentation as a starting point. https://huggingface.co/docs/huggingface_hub/en/package_reference/hf_api.md

    Then I let it run to see what would happen. I was still running Ollama at the time, and I ran the same prompt against several models.

    Every model eventually succeeded. Different sizes and context windows produced different results, which was expected. What wasn’t expected: every model would randomly stop mid-task and wait. I had to type “continue” every few minutes. It took an hour to finish one test. I spent days on it.

    That turned out to be a known Ollama bug. Reddit confirmed it. But the bug isn’t really the story — the story is what came after.

    Every time I customized Ollama’s service file to expose it on the local network instead of just localhost, the next update would overwrite it. Every single time. I wanted it to run like a server, not a desktop application. Ollama has a huge following and that’s fine; it just couldn’t do what I needed it to do, and it kept undoing the work I’d done to make it work anyway. So I walked away from it.

    I tried vLLM. Probably great on a real server, useless under my constraint of a single 12GB consumer GPU. Then I found out that Ollama is just a wrapper around llama.cpp, which I could run myself.


    The Hardware Math

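    Context windows cost VRAM, and the bill arrives on top of the weights. Every token in the window needs its own slice of KV cache, so memory grows in step with context length; it’s the attention compute that grows quadratically. This matters more than most local AI tutorials will tell you.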

    A roughly 9 billion parameter quantized model with an 87k context window fits on a 12GB graphics card. Reduce that to 4 billion parameters and you can push the context window to 165k. Want a 9 billion parameter model without quantization and the full 256k context window? You immediately need over 20GB of VRAM.
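
    A rough sketch of that math, assuming a standard transformer with grouped-query attention. The layer count, KV head count, and head dimension below are hypothetical placeholders, not the real values for any Qwen model, and the bits-per-weight figure for a Q5 quantization is only an approximation. The point is the shape of the trade-off: in this simplified model the cache costs a fixed number of bytes per token, and it competes with the weights for the same 12GB.

```python
GIB = 1024 ** 3

# Hypothetical architecture numbers, chosen only for illustration.
ARCH = dict(n_layers=32, n_kv_heads=4, head_dim=128)


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Estimate KV cache size: one key and one value vector per layer per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return n_ctx * per_token


def weights_bytes(n_params, bits_per_weight):
    """Estimate the size of the model weights at a given quantization."""
    return n_params * bits_per_weight / 8


if __name__ == "__main__":
    for n_ctx in (32_768, 87_000, 262_144):
        cache = kv_cache_bytes(n_ctx, **ARCH) / GIB
        print(f"{n_ctx:>7} tokens -> {cache:5.2f} GiB of KV cache")
    # Approximate bits per weight; real quantized files vary.
    print(f"9B weights at ~5.5 bits: {weights_bytes(9e9, 5.5) / GIB:.2f} GiB")
    print(f"9B weights at 16 bits:   {weights_bytes(9e9, 16) / GIB:.2f} GiB")
```

    Real numbers depend on the model architecture, the cache data type, and the compute buffers llama.cpp allocates, so treat the output as a budgeting aid rather than a prediction.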

    I have an Nvidia RTX 4070 with 12GB. That’s my constraint. Every model choice in this project runs through that math.


    The Problem With llama.cpp Naked

    llama.cpp crushed it technically. Fast, no connection drops, no forced “continue” interruptions.

    The workflow was not a workflow. Save the launch command to a text file. Copy-paste it every time. Leave a terminal window open to kill the server later — or hunt down the process manually to get your VRAM back. That’s not a coding assistant, that’s a chore. I hate chores. I have a robot vacuum for that reason.

    And then there’s the install problem. llama.cpp has 28 different archives per release, not counting source code. I know which one I need, but I don’t even know what all 28 of them are for. We need AI for everyone, not just the few with the technical background to understand their own hardware before they can download the right file.

    I needed real software. Something that would configure llama.cpp with a config file (for some reason it doesn’t have one), switch between models on the fly from inside OpenCode, and guide a new user through installation without making them read release notes.

    I don’t write Python. The solution was obvious: use a local LLM to write the code that runs the local LLM.

    The solution: llama-server-manager.

    ./llama-server-manager --install-llama
    

    That command walks you through an interactive TUI menu — arrow keys, numbered options, Enter to confirm — that detects your OS and hardware and picks the right archive from those 28 choices. After that, a conf.json handles the rest: host IP, port, idle unload timeout. The server starts as a background process; close the terminal if you want, it keeps running. Stop it with --stop-server. Update llama.cpp itself with --update-llama. Update the wrapper with --self-update.

    The API it exposes is OpenAI-compatible. OpenCode, or anything else that speaks to OpenAI’s endpoint, drops in without modification. The server speaks a language every tool already understands; the wrapper runs the server.
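
    To make “OpenAI-compatible” concrete, here is a minimal sketch of a client that talks to the server using only the Python standard library. The base URL is the one from the opencode.json in this post; the assumption that the server accepts the same model id OpenCode uses is mine, so adjust both to match your setup.

```python
import json
import urllib.request

# Assumption: the address from the opencode.json in this post.
BASE_URL = "http://192.168.1.100:11235/v1"


def build_chat_request(prompt, model, base_url=BASE_URL):
    """Build a POST request for the OpenAI-style chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, model, base_url=BASE_URL):
    """Send the prompt to the server and return the assistant's reply text."""
    request = build_chat_request(prompt, model, base_url)
    with urllib.request.urlopen(request) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

    With the server running, ask("Say hello in one sentence.", "Jackrong/Qwen3.5-9B-Neo-GGUF:Q5_K_M") should return the reply as a string. Nothing here is specific to the wrapper; any client that can POST this JSON shape will work.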


    The Recursive Part

    I wrote several prompts and consistently came up short. The model kept going sideways and I kept wondering why. Then I thought back to what actually works in software development — what the lead developer does that makes everything else possible. The lead writes the requirements. Nobody codes their way out of a missing spec. So I stopped vibe coding and became the lead developer.

    I wrote a requirements outline, handed it to Claude, had it ask me questions, and got Requirements.md out the other end. One document. Every decision the model would need to make, already made.

    That’s the difference between vibe coding and agentic engineering — and it took hitting the wall a few times to figure it out. A digital junior developer; that’s what the local model felt like after that. Handing a spec to someone who could write the code I couldn’t, then watching what came back. OpenCode connected to the local model. I described what I needed. The model wrote Python. I tested it. Things broke. I described the breakage. The model fixed it.

    Here’s the part nobody tells you: doing three things simultaneously with an AI coding assistant is hard. I was building the wrapper, writing unit tests, and building a library of skills — structured prompts in markdown that tell the model exactly how to approach recurring tasks — all at the same time. The skills, or lack thereof, were the thing that almost killed the project.


    When the Tests Ate Themselves

    This is the embarrassing half.

    I wrote a bug-reporting prompt that told the AI to add tests if they didn’t exist. That instruction made perfect sense to me. To the model it meant: there are no tests yet, add them…every single time. The test directory ballooned to two dozen files. Duplicates everywhere. More tests than working code, and the code still didn’t work.

    It wasn’t that the model couldn’t write code. All those tests passed. They were just the wrong tests because I hadn’t defined what done looked like, so it found its own definition. The context window was 87k and the model couldn’t hold the full picture across sessions, so every time it needed to write a test it would check what existed, fail to find anything meaningful, and create a new file.

    That’s not a coding failure. That’s a requirements failure, and the requirements failure was mine.

    The solution wasn’t a better prompt. It was a constraint document: Testing Strategy.md. I used Claude to write it with the same approach as Requirements.md; use the tool that’s good at turning a conversation into a specification. It tells the model exactly which five test files exist, what each one covers, and explicitly forbids creating new ones. It defines the mocking patterns. It sets target test counts per file. It’s a contract written in a language the model can actually follow.

    Then I deleted all the tests and started over. The AI model can’t fix what the model can’t load. The context window forced that decision. I created the /project-update-tests skill to write new, targeted tests against the new document.

    Without the constraint document the model makes decisions. With it, the decisions are already made.
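
    The “no new test files” rule can also be enforced mechanically. This is a sketch of one way to do it, not something the project ships: the five file names are hypothetical stand-ins for whatever Testing Strategy.md actually lists.

```python
from pathlib import Path

# Hypothetical file names; the real list lives in Testing Strategy.md.
ALLOWED_TEST_FILES = {
    "test_cli.py",
    "test_config.py",
    "test_installer.py",
    "test_server.py",
    "test_updater.py",
}


def unexpected_test_files(test_dir, allowed=ALLOWED_TEST_FILES):
    """Return test files in test_dir that the strategy document does not list."""
    found = {path.name for path in Path(test_dir).glob("test_*.py")}
    return sorted(found - set(allowed))
```

    Calling unexpected_test_files("tests") before a commit and failing when the list is non-empty would catch a ballooning test directory on the first extra file instead of the twenty-fourth.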


    Three AIs, One Workflow

    I’m not loyal to a single tool. The goal of this experiment is to limit the amount of data sent to the cloud, to find out how far I can push each tool, what the differences are, and how to test them.

    The workflow that emerged:

    1. Claude handled the meta-work: Requirements.md, Testing Strategy.md, turning conversations into specifications. It’s good at that. It would not, however, write skills for OpenCode. It kept rewriting them for itself instead, telling me “Claude Code doesn’t have that tool.” That’s a real limitation, not a complaint.
    2. Gemini had no such objections. It soared where Claude failed. I’ve never been tied to one AI tool anyway, so switching made sense.
    3. The local model wrote the code.

    That division isn’t something I designed in advance. I backed into it by hitting walls. But it maps cleanly: the expensive cloud model writes the rules; the local model follows them.

    Smaller local models can’t quite handle writing yet. If I had an Nvidia DGX Spark running a 70 billion parameter model, I’d probably use it as a writing assistant. I don’t. Part of the experiment is finding out what a single 12GB GPU can do by itself. It can write code if everything is structured correctly, but it can’t write a blog article.


    The Settings Nobody Talks About

    Everyone knows you need to pick the right model. Here’s what the YouTube tutorials skip: temperature, repeat-penalty, and presence-penalty matter. A lot. The right values are different for every model, and getting them wrong doesn’t just produce worse output — it sends the model into logical loops it can’t escape from. I lost time to that before I understood what was actually happening.

    There’s information out there about what the settings do. Then you tweak them until the behavior changes. No magic formula; just iteration.
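
    For intuition about what four of those knobs do, here is a toy sketch that filters a handful of candidate tokens. It is a simplification under stated assumptions: the order of the steps is one possible order (real sampler chains, including llama.cpp’s, can apply them differently), the logits are made up, and the repeat and presence penalties are not modeled at all.

```python
import math


def filter_candidates(logits, temp=0.7, top_k=20, top_p=0.95, min_p=0.1):
    """Return {token: probability} for the tokens that survive the filters."""
    # Temperature rescales logits before softmax; lower values sharpen the peak.
    scaled = {tok: value / temp for tok, value in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(
        ((tok, weight / total) for tok, weight in weights.items()),
        key=lambda item: item[1],
        reverse=True,
    )

    # top-k: keep only the k most likely tokens.
    ranked = ranked[:top_k]

    # top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    # min-p: drop tokens less likely than min_p times the best token.
    floor = min_p * kept[0][1]
    kept = [(tok, prob) for tok, prob in kept if prob >= floor]

    # Renormalise so the survivors sum to 1.
    norm = sum(prob for _, prob in kept)
    return {tok: prob / norm for tok, prob in kept}


# Made-up logits for four candidate next tokens.
EXAMPLE_LOGITS = {"def": 5.0, "class": 4.0, "import": 2.0, "banana": -3.0}
```

    With the default arguments, filter_candidates(EXAMPLE_LOGITS) keeps only the two strongest candidates; raising temp to 2.0 flattens the distribution enough that a third one gets through. That is the iteration loop in miniature: change one value, watch which options the model is still allowed to pick.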

    Now I’m using Jackrong/Qwen3.5-9B-Neo — fine-tuned, custom settings, context window sized to fit inside 12GB. It’s 6.47GB on disk and can solve some pretty complex problems if you help it break them down into reasonably sized pieces.

    Here’s the section of my OpenCode config that brings the whole thing together:

    "Jackrong/Qwen3.5-9B-Neo-GGUF:Q5_K_M": {
      "name": "Qwen 3.5 Coder Neo (9B Q5)",
      "options": {
        "presence-penalty": 0.2,
        "repeat-penalty": 1.2,
        "temp": 0.7,
        "top-k": 20,
        "top-p": 0.95,
        "min-p": 0.1
      }
    }
    

    The Waiting

    There’s a certain kind of anticipation when you fire off a task to a local model. I keep looking at the screen. Is it done yet? What did it do?

    Then I go back to playing Magic: The Gathering on a tablet while I wait. I also took the dog out. Ran the Braava Jet mop. The model runs on the other screen while I do something else, like write this blog post.

    I’m always a little excited to see what it does, even when it fails. I want to examine the failure as much as I want to see it succeed; the failure is data.

    It makes me want a second graphics card. 24GB of VRAM would change the math considerably. An unquantized model would likely produce better results than the compressed versions I’m running now and I could push the context window to 256k. The Nvidia DGX Spark would be nice, but the price tag is a different kind of constraint.

    If I had one tomorrow, the first thing I’d run is Qwen3.6-27B — supposed to be really, like really really, good at writing code. Better than models bigger than it is. Better than some that are only cloud-hosted. That’s the next test.


    The Workflow That Came Out of It

    Five prompts turned into seven skills. A complete development lifecycle that fits inside a limited context window:

    • /project-plan — reads Requirements.md, writes Plan.md
    • /project-update — gap analysis against the codebase, writes Update.md
    • /project-implement — reads Update.md, writes code
    • /project-create-bug — takes bug reports, saves to Bugs.md
    • /project-bug-fix — reads Bugs.md, lets you select a bug, delegates the fix to an agent
    • /project-commit — runs pytest, creates a branch, pushes the commit
    • /project-update-tests — reads Testing Strategy.md, updates the test suite

    Requirements → Plan → Implementation → Bug tracking → Bug fixing → Testing → Commit. The model doesn’t decide what to do next. The skill does — by reading the documents before executing an agent.

    The insight that made everything manageable: every time the model needed to make a decision it couldn’t make well, I wrote a document that made the decision for it in advance. Every time it couldn’t solve the problem because the context window was full, I had it write a document and then read that same document in the next prompt. This is how you control a context window from inside the model.
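
    The document handoff can be written down as data. This sketch models only what the skill list above states: which document each skill reads and which it writes. Where the list does not name a document, the entry is None; the function names are mine, and the real skills are markdown prompts, not Python.

```python
# Documents each skill reads and writes, taken from the skill list above.
# None means the list does not name a document on that side.
SKILLS = {
    "/project-plan": ("Requirements.md", "Plan.md"),
    "/project-update": (None, "Update.md"),
    "/project-implement": ("Update.md", None),
    "/project-create-bug": (None, "Bugs.md"),
    "/project-bug-fix": ("Bugs.md", None),
    "/project-commit": (None, None),
    "/project-update-tests": ("Testing Strategy.md", None),
}


def runnable_skills(existing_docs):
    """Return the skills whose input document already exists."""
    return [
        name
        for name, (reads, _writes) in SKILLS.items()
        if reads is None or reads in existing_docs
    ]


def run(skill, existing_docs):
    """Pretend to run a skill: record the document it leaves for the next prompt."""
    reads, writes = SKILLS[skill]
    if reads is not None and reads not in existing_docs:
        raise FileNotFoundError(f"{skill} needs {reads} first")
    return set(existing_docs) | ({writes} if writes else set())
```

    Starting from {"Requirements.md"}, running /project-plan adds Plan.md to the set, while /project-bug-fix refuses to run until something has written Bugs.md. The state lives in the files, not in the model’s context.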

    The skills need more work. /project-plan and /project-implement aren’t as well written as the bug skills. I need to circle back to them before the skills become their own repo. The lesson eating its own tail; the skills need their own Requirements.md before they’re ready to ship.


    Vibe Coding Is a Trap

    Write a prompt. Didn’t get what you wanted? Write another prompt. That’s the trap. That’s why developers complain about AI output. The results turn into AI slop as code; code bloat. Code bloat creates bugs.

    Lines of code aren’t supposed to be the metric. We knew that 25 years ago. Why did they become the metric when AI started writing the code? Why should anyone care how many tokens you used this month?

    The software development lifecycle isn’t a revelation. Requirements, planning, implementation, testing, commit. That’s what software development is. Most vibe coders skip it and then wonder why the output is garbage. I’m not a better developer because I used AI, but I am a better AI user because I already understood software development.


    What I Thought This Would Be

    I took AI in college — back when neural networks could only do data mining. I wrote a paper as a junior hypothesizing that we’d eventually have custom hardware for running AI, either on the desktop or on a server. I thought we’d have something like Dr. Know from the movie A.I. — an oracle you could actually query. Something like the Enterprise computer: a system that reasons through problems with you.

    Twenty years later, I have a 9 billion parameter model running on a consumer graphics card, writing Python I don’t know how to write, inside a workflow I designed to work around its limitations.

    It’s not the computer from the U.S.S. Enterprise; it’s a proof of concept that points toward one. The custom hardware I predicted exists now. The Nvidia DGX is real. The AMD Ryzen AI Max+ PRO 495 is coming. What I built works on the hardware I have today.

    I’m breaking the conventional rules by opting out of cloud-hosted models. That’s a choice, not an accident; and the wrapper that makes local AI easier to use might matter more when the hardware gets better than it is right now.

    I feel like this is something I should keep doing. More tests. More code. More publishing. The experiment isn’t finished; the fact that it isn’t finished is the point.


    Try It

    The wrapper is at github.com/zero4281/llama-server-manager.

    My OpenCode config is here: opencode.json

    The initial model tests are at github.com/zero4281/opencode-test.

    It’s v1.0. It’s slightly buggy. It works.

    The better question — now that “can consumer hardware run a local coding assistant” is answered — is what the workflow looks like when the context window doubles and the model isn’t quantized. That’s the next test.

  • opencode.json

    {
      "$schema": "https://opencode.ai/config.json",
      "autoupdate": true,
      "server": {
        "port": 4096
      },
      "provider": {
        "llama-server": {
          "npm": "@ai-sdk/openai-compatible",
          "name": "LLama Server (Local)",
          "options": {
            "baseURL": "http://192.168.1.100:11235/v1",
            "timeout": false,
            "chunkTimeout": 28800000
          },
          "models": {
            "Jackrong/Qwen3.5-4B-Neo-GGUF:Q5_K_M": {
              "name": "Qwen 3.5 Coder Neo (4B Q5)",
              "options": {
                "presence-penalty": 0,
                "repeat-penalty": 1,
                "temp": 0.6,
                "top-k": 20,
                "top-p": 0.95,
                "min-p": 0
              }
            },
            "Jackrong/Qwen3.5-9B-Neo-GGUF:Q5_K_M": {
              "name": "Qwen 3.5 Coder Neo (9B Q5)",
              "options": {
                "presence-penalty": 0.2,
                "repeat-penalty": 1.2,
                "temp": 0.7,
                "top-k": 20,
                "top-p": 0.95,
                "min-p": 0.1
              }
            },
            "Jackrong/Qwopus3.5-4B-v3-GGUF:Q8_0": {
              "name": "Qwopus3.5 v3 (4B Q8)",
              "options": {
                "presence-penalty": 0.2,
                "repeat-penalty": 1.1,
                "temp": 0.7,
                "top-k": 20,
                "top-p": 0.95,
                "min-p": 0.05
              }
            },
            "Jackrong/Qwopus3.5-9B-v3-GGUF:Q5_K_S": {
              "name": "Qwopus3.5 v3 (9B Q5)",
              "options": {
                "presence-penalty": 0,
                "repeat-penalty": 1,
                "temp": 0.6,
                "top-k": 20,
                "top-p": 0.95,
                "min-p": 0
              }
            },
            "Jackrong/Gemopus-4-E4B-it-GGUF:Q8_0": {
              "name": "Gemma 4 (4B Q8)",
              "options": {
                "presence-penalty": 0,
                "repeat-penalty": 1,
                "temp": 0.6,
                "top-k": 20,
                "top-p": 0.95,
                "min-p": 0
              }
            }
          }
        }
      }
    }
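
    When tuning, it helps to see every model’s sampler options side by side. A small sketch that pulls them out of a config shaped like the one above; the embedded sample is a trimmed copy with two models and three options each, kept short for the example.

```python
import json

# A trimmed copy of the config above, kept short for the example.
SAMPLE = """
{
  "provider": {
    "llama-server": {
      "models": {
        "Jackrong/Qwen3.5-4B-Neo-GGUF:Q5_K_M": {
          "options": {"temp": 0.6, "repeat-penalty": 1, "min-p": 0}
        },
        "Jackrong/Qwen3.5-9B-Neo-GGUF:Q5_K_M": {
          "options": {"temp": 0.7, "repeat-penalty": 1.2, "min-p": 0.1}
        }
      }
    }
  }
}
"""


def model_options(config_text, provider="llama-server"):
    """Map each model id to its sampler options."""
    models = json.loads(config_text)["provider"][provider]["models"]
    return {model_id: entry.get("options", {}) for model_id, entry in models.items()}


if __name__ == "__main__":
    for model_id, options in model_options(SAMPLE).items():
        print(f"{model_id}: temp={options.get('temp')} min-p={options.get('min-p')}")
```

    To run it against the real file, pass the text of opencode.json in place of SAMPLE.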
    
  • AI Model from Hugging Face to Ollama

    Modelfile Setup Guide

    To import a model from Hugging Face into Ollama for coding, do the following.

    1. Download the model

    $ ollama pull hf.co/Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-GGUF:Q8_0
    

    2. Find the correct blob

    Hint: It’s the big one.

    $ ollama show --modelfile hf.co/Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-GGUF:Q8_0 \
      | awk '/^FROM/ {print $2}' \
      | xargs -r du -h
    

    3. Create a new Modelfile

    Set the context length to at least 16k, preferably 32k, and add the TEMPLATE to enable tool calling.

    Sources:
    The TEMPLATE is from a GitHub comment.
    The additional Coding parameters for Qwen 3.5 are from hf.co. Parameters for additional uses are also available.

    # Modelfile generated by "ollama show"
    # To build a new Modelfile based on this, replace FROM with:
    # FROM hf.co/Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-GGUF:Q8_0
    
    FROM /usr/share/ollama/.ollama/models/blobs/sha256-01ab75e862bf61c2fd20babc55d396181580722b7af76ec4ebfb83224218c723
    
    PARAMETER num_ctx 32768
    PARAMETER temperature 0.6
    PARAMETER top_p 0.95
    PARAMETER top_k 20
    PARAMETER min_p 0.0
    PARAMETER presence_penalty 0.0
    PARAMETER repeat_penalty 1.0
    
    TEMPLATE """{{- if .Suffix }}<|fim_prefix|>{{ .Prompt }}<|fim_suffix|>{{ .Suffix }}<|fim_middle|>
    {{- else -}}
    {{- $lastUserIdx := -1 -}}
    {{- range $idx, $msg := .Messages -}}
    {{- if eq $msg.Role "user" }}{{ $lastUserIdx = $idx }}{{ end -}}
    {{- end }}
    {{- if or .System .Tools }}<|im_start|>system
    {{ if .System }}
    {{ .System }}
    {{- end }}
    {{- if .Tools }}
    
    # Tools
    
    You may call one or more functions to assist with the user query.
    
    You are provided with function signatures within <tools></tools> XML tags:
    <tools>
    {{- range .Tools }}
    {"type": "function", "function": {{ .Function }}}
    {{- end }}
    </tools>
    
    For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
    <tool_call>
    {"name": <function-name>, "arguments": <args-json-object>}
    </tool_call>
    {{- end -}}
    <|im_end|>
    {{ end }}
    {{- range $i, $_ := .Messages }}
    {{- $last := eq (len (slice $.Messages $i)) 1 -}}
    {{- if eq .Role "user" }}<|im_start|>user
    {{ .Content }}
    {{ else if eq .Role "assistant" }}<|im_start|>assistant
    {{- if .Content }}
    {{ .Content }}
    {{- else if .ToolCalls }}
    {{- range .ToolCalls }}
    <tool_call>
    {"name": "{{ .Function.Name }}", "arguments": {{ .Function.Arguments }}}
    </tool_call>
    {{- end }}
    {{- end }}{{ if not $last }}<|im_end|>
    {{ end }}
    {{- else if eq .Role "tool" }}<|im_start|>user
    <tool_response>
    {{ .Content }}
    </tool_response><|im_end|>
    {{ end }}
    {{- if and (ne .Role "assistant") $last }}<|im_start|>assistant
    {{ end }}
    {{- end }}
    {{- end }}
    """
    

    4. Create a new model from the Modelfile

    $ ollama create Qwen3.5-Coder-Distilled -f ./Modelfile
    
  • A Local AI Coding Assistant. How Hard Could It Be? (Pretty Hard, Actually.)

    March 2026 | AI Development & Local LLMs

    My codebase has never touched a cloud API. That’s a choice, not an accident — and it’s why I spent two afternoons and most of a weekend fighting Ollama, Docker, llama.cpp, and a GitHub issue thread before I got a local AI coding assistant that actually worked.

    Here’s what went wrong, what finally fixed it, and what I still haven’t solved.


    Why Local at All

    Two reasons, in order of importance.

    The first is data control. Every prompt you send to a cloud API is a potential leak vector. The model sees your variable names, your architecture patterns, your bugs. Most developers accept this tradeoff implicitly. I stopped accepting it. Services get hacked; my desktop doesn’t. That’s a humble brag, and I mean it.

    The second is latency. Once a local model is loaded into memory, generation starts in milliseconds. No network round-trip, no rate limits, no pricing surprises at the end of the month.

    I’ve wanted this since 2004 — when I took a year of AI in college, data mining included, back when neural networks were still mostly a theoretical exercise. Twenty years later I can almost achieve what I conceptualized back then. Almost is doing real work in that sentence. More on that at the end.

    The tradeoff is setup complexity. That’s what this post actually addresses.


    The Model: Jackrong/Qwen3.5-9B

    This model is Qwen3.5 9B fine-tuned on reasoning output distilled from Claude Opus 4.6, quantized at Q8_0. The Q8_0 quantization preserves more model capacity than Q4 variants at the cost of roughly 2x the disk and RAM usage, which is worth it for a coding workload where precision matters.

    Feature          Details
    Base model       Qwen3.5 9B, distilled from Claude Opus 4.6
    Specialization   Reasoning + code generation
    Parameter count  9B
    Quantization     Q8_0

    I started with the official Qwen 3.5 model from the Ollama library. It ran on my GPU. It did not run tools. It returned tool call blocks as plain text and ignored them — and that was the beginning of two very frustrating afternoons.


    What Actually Went Wrong First

    I Googled everything I could think of. I switched to llama.cpp. I got the models running in Docker without Ollama. Nothing worked. The tool calls kept coming back as text. At no point did the Ollama documentation, the model card, or any forum post I could find explain why.

    The answer was buried in a GitHub issue comment. I clicked away from it the first time. Came back. Read it again. Came back a third time before I could make sense of it.

    The Modelfile was wrong. The default Modelfile Ollama generates when you pull a model doesn’t include a TEMPLATE block. Without that block, the model has no idea what a tool call is supposed to look like. It just returns the raw syntax as text and moves on.

    That one omission — undocumented, unmentioned by Ollama, absent from every guide I found — was the entire problem. The fix is in Step 3 below.


    The Hugging Face Problem

    While I was deep in the Docker detour, I ran into a second issue: models pulled directly from Hugging Face through Ollama break in two distinct ways.

    The first model I tried threw a 404 error downloading the third file in its set — partway through a multi-gigabyte download, no warning. The second model downloaded cleanly and then silently failed to run tools.

    The fix for both: use ollama create to rebuild the model from the downloaded weights file. When a Hugging Face pull fails mid-download, Ollama has usually already grabbed the largest blob — the actual weights. You can build a working model from that file directly. This is not documented anywhere I could find. I worked it out myself.


    What You Need Before Starting

    • Ollama installed (ollama.com)
    • 10–15 GB of free disk space for the quantized weights
    • A terminal
    • Patience for a slow initial download

    Step 1: Pull the Model

    ollama pull hf.co/Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-GGUF:Q8_0
    

    This downloads the model weights and generates a temporary Modelfile with default settings. Expect a download of roughly 10 GB for the Q8_0 weights. If it fails partway through, go to Step 2 anyway.


    Step 2: Find the Model Blob

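    On a Linux install that runs Ollama as a system service, weights are stored as content-addressed blobs in /usr/share/ollama/.ollama/models/blobs/; other setups typically use ~/.ollama/models/blobs/. You need the path to the largest blob, which is the weights file.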

    ollama show --modelfile hf.co/Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-GGUF:Q8_0 \
      | awk '/^FROM/ {print $2}' \
      | xargs -r du -h
    

    Output looks like:

    11.0G    /usr/share/ollama/.ollama/models/blobs/sha256-01ab75e862bf...
    

    Copy that path. If the file size shown is under 1 GB, you have a metadata blob — not the weights. The weights file is always the largest one.
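
    If the pull failed and ollama show has nothing to report, the same “it’s the big one” rule can be applied to the blob directory directly. A minimal sketch, assuming the blob directory is readable by your user; the function name is mine.

```python
from pathlib import Path


def largest_blob(blob_dir):
    """Return the largest regular file in blob_dir, or None if there are no files."""
    files = [path for path in Path(blob_dir).iterdir() if path.is_file()]
    return max(files, key=lambda path: path.stat().st_size, default=None)
```

    For example, largest_blob("/usr/share/ollama/.ollama/models/blobs") returns a Path you can paste into the FROM line. If more than one model is installed, the largest file may belong to a different model, so check the size against what you expected to download.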


    Step 3: Configure for Coding Work — Including the Tool Template

    This is the step nobody documents. Three parameters matter for coding; the TEMPLATE block matters for everything.

    Context window (num_ctx 32768): 32k tokens lets you paste entire files into a single prompt. This is the most important setting for coding tasks.

    Temperature (0.6): Lower temperature means more deterministic output. For code generation, you want the model to commit rather than explore.

    Repeat penalty (1.0): At 1.0, no penalty for repeating tokens. This sounds counterintuitive — but coding models at higher repeat penalties will avoid reusing variable names and function signatures from earlier in the context. That’s exactly the wrong behavior when you want consistent naming.

    FROM /usr/share/ollama/.ollama/models/blobs/sha256-01ab75e862bf...
    
    PARAMETER num_ctx 32768
    PARAMETER temperature 0.6
    PARAMETER top_p 0.95
    PARAMETER top_k 20
    PARAMETER min_p 0.0
    PARAMETER presence_penalty 0.0
    PARAMETER repeat_penalty 1.0
    

    Replace the FROM path with the blob path from Step 2.

    These parameters are tuned for coding work. If you want to use this model more like a general-purpose LLM — longer, more exploratory responses — the original parameter documentation from Hugging Face has the full range of options.
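
    If you rebuild models often, the Modelfile can be generated instead of hand-edited. This sketch writes the FROM and PARAMETER lines from the values above; the function name and the optional template argument are my own additions, and it does nothing Ollama-specific beyond producing text.

```python
# The coding parameters from this step.
CODING_PARAMS = {
    "num_ctx": 32768,
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
    "presence_penalty": 0.0,
    "repeat_penalty": 1.0,
}


def render_modelfile(blob_path, params=CODING_PARAMS, template=None):
    """Return Modelfile text with a FROM line, PARAMETER lines, and an optional TEMPLATE."""
    lines = [f"FROM {blob_path}", ""]
    lines += [f"PARAMETER {name} {value}" for name, value in params.items()]
    if template is not None:
        lines += ["", f'TEMPLATE """{template}"""']
    return "\n".join(lines) + "\n"
```

    Write the result to a file named Modelfile, passing the blob path from Step 2 and the template text as arguments.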

    Now add the tool-calling template. Without this block, tools won’t work — the model will return tool call syntax as plain text. This is the fix for everything that was broken:

    TEMPLATE """{{- if .Suffix }}<|fim_prefix|>{{ .Prompt }}<|fim_suffix|>{{ .Suffix }}<|fim_middle|>
    {{- else -}}
    {{- $lastUserIdx := -1 -}}
    {{- range $idx, $msg := .Messages -}}
    {{- if eq $msg.Role "user" }}{{ $lastUserIdx = $idx }}{{ end -}}
    {{- end }}
    {{- if or .System .Tools }}<|im_start|>system
    {{ if .System }}
    {{ .System }}
    {{- end }}
    {{- if .Tools }}
    
    # Tools
    
    You may call one or more functions to assist with the user query.
    
    You are provided with function signatures within <tools></tools> XML tags:
    <tools>
    {{- range .Tools }}
    {"type": "function", "function": {{ .Function }}}
    {{- end }}
    </tools>
    
    For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
    <tool_call>
    {"name": <function-name>, "arguments": <args-json-object>}
    </tool_call>
    {{- end -}}
    <|im_end|>
    {{ end }}
    {{- range $i, $_ := .Messages }}
    {{- $last := eq (len (slice $.Messages $i)) 1 -}}
    {{- if eq .Role "user" }}<|im_start|>user
    {{ .Content }}
    {{ else if eq .Role "assistant" }}<|im_start|>assistant
    {{- if .Content }}
    {{ .Content }}
    {{- else if .ToolCalls }}
    {{- range .ToolCalls }}
    <tool_call>
    {"name": "{{ .Function.Name }}", "arguments": {{ .Function.Arguments }}}
    </tool_call>
    {{- end }}
    {{- end }}{{ if not $last }}<|im_end|>
    {{ end }}
    {{- else if eq .Role "tool" }}<|im_start|>user
    <tool_response>
    {{ .Content }}
    </tool_response><|im_end|>
    {{ end }}
    {{- if and (ne .Role "assistant") $last }}<|im_start|>assistant
    {{ end }}
    {{- end }}
    {{- end }}
    """
    

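    Step 4: Create the Model

    Save the FROM, PARAMETER, and TEMPLATE lines from Step 3 in a file named Modelfile, then build the model from that file:

    ollama create Qwen3.5-Coder-Distilled -f ./Modelfile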
    

    Verify It Works

    ollama list | grep -i qwen3.5
    # qwen3.5-coder-distilled:latest      06113f46d78a    9.5 GB
    
    ollama run Qwen3.5-Coder-Distilled "Hello"
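
    A “Hello” confirms the model loads, but it does not exercise the part that was broken. Here is a sketch of a tool-calling check against Ollama’s chat API, assuming the server is on its default local address. The get_weather tool is hypothetical and exists only to see whether the model answers with a structured tool call instead of plain text.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"

# Hypothetical tool, defined only to see whether the model will call it.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_payload(model, question="What is the weather in Paris?"):
    """Build a non-streaming chat request that offers the model one tool."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "tools": [WEATHER_TOOL],
        "stream": False,
    }


def used_tool(response):
    """True if the reply contains structured tool calls rather than plain text."""
    return bool(response.get("message", {}).get("tool_calls"))


def check(model):
    """Send the request to Ollama and report whether a tool call came back."""
    request = urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=json.dumps(build_payload(model)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return used_tool(json.load(resp))
```

    If check("Qwen3.5-Coder-Distilled") returns False and the reply text contains raw tool call markup, the TEMPLATE block is the first thing to look at.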
    

    What to Expect on Real Hardware

    I run this on a 12 GB Nvidia RTX 4070. The model occupies most of that VRAM; there isn’t much headroom left.

    Metric                  What I Saw
    Initial load time       2–4 minutes
    Resident memory         9–11 GB while active
    Token generation (GPU)  Usable; faster than CPU

    GPU acceleration makes a significant difference on generation speed. The load time is largely fixed regardless of hardware.


    What It Can’t Do Yet

    I tested this model with three coding agents (Claude Code, OpenCode, and Qwen Code) against the same task: Create and compile a simple “Hello world!” program in C++. Compile it and run it to verify your work.

    Claude Code one-shotted it consistently. OpenCode got there, but needed more than one run. Qwen Code generated code that looked correct and then failed to compile it; it stalled trying to invoke g++ and just stopped, with no error to chase.

    That’s not a model indictment. It’s a system indictment. The 32k context window starts to become a constraint as a codebase grows. The 9B Q8_0 model is good, but it’s working at the edge of available VRAM. A smaller model with a larger context window might actually outperform it on real-world coding tasks — I haven’t tested that yet.

    The hardware ceiling is visible. More VRAM would help. A server with two or four GPUs would help more. Whether to wait for an AMD Ryzen AI chip or an Nvidia Blackwell at consumer prices — or whether the right move is a smaller model today — is the next question, not a conclusion.


    This Post Was Written with the Model It Describes

    The provenance of this post is worth being honest about. It started as my own documentation — notes I took while working through the problem. The local model, running as an LLM, rewrote those notes into a first draft. I edited that draft. Then Claude rewrote it with some context. I edited it again. Then Claude rewrote it a final time — this version — after an interview process where I answered 25 questions about what actually happened, what failed, and what I still don’t know. Followed by some additional editing.

    Four passes. Three rewrites. Two different AI systems. One human who had to do the actual work before any of it was worth writing down.

    There are probably still sentences in here that a model wrote and I didn’t catch. You’ll recognize them; they’re the ones that couldn’t have come from doing the work.

    The ones that did come from doing the work:

    no one knew that anywhere else on the entire internet.

    I clicked away from the answer three times before I believed it.


    The next post covers what happens when you actually push this in a real coding workflow — context limits, model tradeoffs, and whether the hardware ceiling is the problem or just the most visible one.

    Here’s the original technical documentation.