May 2026 | AI Development & Local LLMs
I used a local LLM to build the software that runs the local LLM. That sentence is either impressive or embarrassing depending on how it went. It went both ways.
The question I started with was simple: can a consumer GPU run a coding assistant well enough to actually use the stupid thing? All the buzz is around Claude Code and MCPs. I wanted to know what happens when you kick the corps out of your workflow and do it yourself.
The answer is yes. It works…with some caveats. I have a slightly buggy MVP v1.0 on GitHub to prove it.
The Test
Initially I set up Ollama and pointed OpenCode at four models with a real task: write a Python script that searches the Hugging Face Hub API. No hand-holding. Just a prompt and documentation. You can find the exact testing parameters in my testing repo.
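For scale, the kind of script the task asks for might look something like this (a minimal sketch using the huggingface_hub client, not any model's actual output):

from huggingface_hub import HfApi

def search_models(query: str, limit: int = 5) -> list[str]:
    # Free-text search against the Hugging Face Hub; each result is a ModelInfo
    api = HfApi()
    return [m.id for m in api.list_models(search=query, limit=limit)]

if __name__ == "__main__":
    for model_id in search_models("gguf coder"):
        print(model_id)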
The models I tested were:
- OmniCoder Qwen3.5 (9B Parameters — 112k Context Window)
- OmniCoder Qwen3.5 Custom (9B Parameters — 112k Context Window)
- Qwen3.5 Coder Neo (4B Parameters — 160k Context Window)
- Qwopus3.5 v3 (4B Parameters — 128k Context Window)
Every model eventually succeeded. Different sizes and context windows produced different results, which was expected. What wasn’t expected was that every one of them hit the same snag: the model would randomly stop mid-task and wait. I had to type “continue” every few minutes. It took a freaking hour to finish one test. I spent days on it.
This turned out to be a known Ollama issue. There were posts on Reddit that confirmed it. So I went looking for something else.
The Hardware Math
Before I go further: the context window has its own VRAM cost. On top of the model weights, the KV cache grows with every token of context you ask for, so a bigger window means a bigger memory bill. This makes a huge difference in how much VRAM you need.
A roughly 9 billion parameter quantized model with a 128k context window fits on a 12GB graphics card. Reduce that to a 4 billion parameter model and you can push the context window to 160k. Want a 9 billion parameter model without quantization and the full 256k context window? You immediately need over 20GB of VRAM.
I have an Nvidia RTX 4070 with 12GB. That’s my constraint. Every model choice in this project runs through that math.
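Here’s a rough version of that math. The dimensions below are illustrative placeholders for a model in the 7 to 9 billion parameter range with grouped-query attention, not measurements of the models above:

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: float = 2.0) -> float:
    # 2x for keys and values, one entry per layer per token of context
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Hypothetical dimensions: 28 layers, 4 KV heads, head size 128, fp16 cache, 128k context
gib = kv_cache_bytes(28, 4, 128, 131072) / 1024**3
print(f"KV cache alone: ~{gib:.1f} GiB on top of the model weights")

Shrink the window, quantize the cache, or shrink the model and that number drops fast. That’s the whole game on a 12GB card.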
I watched YouTube videos, read Reddit posts, and searched for articles. Eventually I decided to give vLLM a try. I got it set up, but I couldn’t get the context window down to a size that I could actually use. Probably great for a server, but useless for me.
Then I found out that Ollama is just a wrapper around llama.cpp. Which I could install and run myself.
The Problem With llama.cpp Naked
llama.cpp crushed it (…technically). It was fast, had no connection drops, and no forced “continue” interruptions.
The issue was that the workflow was not a workflow. It was: save the launch command to a text file, copy-paste it every time I wanted to use it, leave a terminal window open to kill the server later, or hunt down the process manually to get my VRAM back. That’s not a coding assistant. That’s a chore. I hate chores. I have a robot vacuum for that reason.
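A bare llama-server launch looks something like this (model path and numbers are illustrative, not my exact setup):

./llama-server -m models/qwen-coder-q5_k_m.gguf -c 131072 -ngl 99 --host 127.0.0.1 --port 8080

Run it, leave the terminal open, kill it later to get your VRAM back. Every single time.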
And then there’s the install problem. llama.cpp has 28 different archives per release — not counting source code. How would anyone know which one to pick? I know which one I need, but I don’t even know what all 28 of them are for. We need AI for everyone, not just the few with the technical skills necessary to understand their own hardware before they can even download the right file.
I needed real software. Something that would let me configure llama.cpp with a config file (because for some reason it doesn’t have one), switch between models on the fly from inside OpenCode like Ollama but without the overhead, and guide a new user through installation without making them read release notes.
I don’t write Python. I can read it well enough to debug it if I can Google some of it, but writing it from scratch isn’t in my wheelhouse. The solution was painfully obvious: use a local LLM to write the code that runs the local LLM.
The solution? llama-server-wrapper.
Here’s what it actually does:
./llama-server-wrapper --install-llama
That command walks you through an interactive TUI menu — arrow keys, numbered options, Enter to confirm — that detects your OS and hardware and picks the right archive from those 28 choices. You don’t read release notes. You pick the download…sort of. I said it’s a buggy MVP, I’ll work on it, I swear.
After that, a conf.json handles the rest. You can configure the host IP, port, context size, how long before the model unloads when idle. The server starts as a background process; close the terminal if you want, it keeps running. Stop it with --stop-server. Update llama.cpp itself with --update-llama. Update the wrapper with --self-update.
The API it exposes is OpenAI-compatible. Which means OpenCode, or anything else that speaks to OpenAI’s endpoint, drops in without modification. That’s the part that makes the whole stack work — the server speaks a language every tool already understands and the wrapper runs the server.
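To make that concrete, here’s a minimal sketch of a client hitting the local endpoint. The port and model name are placeholders, not the wrapper’s defaults:

from openai import OpenAI

# Any OpenAI-compatible client works; the key is ignored locally but required by the library.
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="local")

resp = client.chat.completions.create(
    model="qwen-coder-local",  # placeholder; use whatever model the server has loaded
    messages=[{"role": "user", "content": "Write a one-line Python hello world."}],
)
print(resp.choices[0].message.content)

OpenCode does the same thing under the hood, which is why it drops in without modification.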
The Recursive Part
I wrote several prompts and I consistently came up short. The model kept going sideways and I kept wondering why. Then I thought back to other projects — what actually worked, what the lead developer did that made everything else possible. The answer was obvious once I asked that question: the lead writes the requirements. Nobody codes their way out of a missing spec. So I stopped vibe coding and became the lead developer. I wrote a requirements outline, handed it to Claude, had it ask me questions, and got Requirements.md out the other end. One document. Every decision the model would need to make, already made.
That’s the difference between vibe coding and agentic engineering — and it took hitting the wall a few times before I figured it out. A digital junior developer. That’s what the local model felt like after that — handing a spec to someone who could write the code I couldn’t, then watching what came back. OpenCode connected to the local model. I described what I needed. The model wrote Python. I tested it. Things broke. I described the breakage. The model fixed it.
Here’s the part nobody tells you: doing three things simultaneously with an AI coding assistant is hard. I was building the wrapper, writing the unit tests, and building a library of skills — structured prompts in markdown that tell the model exactly how to approach recurring tasks — all at the same time. The skills, or lack thereof, were the thing that almost killed the project.
When the Tests Ate Themselves
Lots of developers have a library of prompts they use. They look through it to find a prompt, copy-paste it, maybe tweak it a little, and the model does what they want. That works, but it breaks down the moment the project gets complex enough that the model needs to make decisions across sessions — because the prompt library lives in your head, not in the model’s context window.
My bug-reporting and bug-fixing prompts started generating tests on their own. Technically I told them to, but I hadn’t given them any constraints. The model would just create new test files, over and over. The test directory ballooned to two dozen files; duplicate tests everywhere. The model couldn’t hold the full picture in its 128k context window, so every time it needed to write a test it would check what existed, fail to find anything, and write a new file.
I had more tests than working code. The context window was full of test infrastructure. The model couldn’t load all of it to run the actual project. I took a step back.
The solution wasn’t a better prompt. It was a constraint document: Testing Strategy.md. Claude helped me write it — same approach as Requirements.md: use the tool that’s good at turning a conversation into a specification. It tells the model exactly which five test files exist, what each one covers, and explicitly forbids creating new ones. It defines the mocking patterns. It sets target test counts per file. It’s a contract written in a language the model can actually follow.
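I won’t paste the whole document, but the shape is roughly this (the file names are illustrative, not the repo’s actual layout):

- The test suite is exactly five files, e.g. test_config.py, test_install.py, test_server.py, test_update.py, test_cli.py. Creating a sixth is forbidden.
- Each file’s responsibility is spelled out, so “add a test” always means extending an existing file.
- Mocking patterns are defined up front; in this sketch, that means no test hits the network or launches a real server process.
- Each file has a target test count, which keeps the suite from ballooning again.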
The skills themselves went to Gemini. Sorry Claude, you got skipped on this one. Turns out Claude won’t write skills for a competitor’s tool — it just rewrites the prompt for itself instead. It kept telling me that “Claude Code doesn’t have that tool”. Gemini had no such objections.
Without this new infrastructure the model makes decisions. With it, the decisions are already made. Now I had control over my cute little rogue AI again.
The Settings Nobody Talks About
Everyone knows you need to pick the right model. That’s not a secret. Here’s something the YouTube tutorials skip: temperature, repeat-penalty, and presence-penalty matter. A lot. Like, really a lot — and the right values are different for every model. Getting those wrong doesn’t just produce worse output. It sends the model down rabbit holes it can’t escape from. I lost time to that before I understood what was actually happening.
Now I’m using a model that’s been fine-tuned, with a few custom settings, and a context window that makes the memory footprint fit into 12GB. It’s 6.47GB on disk and can solve some pretty complex problems, if you help it break them down into reasonably sized pieces.
Here’s the section of my OpenCode config that brings the whole thing together:
"Jackrong/Qwen3.5-9B-Neo-GGUF:Q5_K_M": {
"name": "Qwen 3.5 Coder Neo (9B 128k)",
"options": {
"ctx-size": 131072,
"presence-penalty": 0.2,
"repeat-penalty": 1.2,
"temp": 0.7,
"top-k": 20,
"top-p": 0.95,
"min-p": 0.1
}
}
The Waiting
There’s a certain kind of anticipation when you fire off a task to a local model. I keep looking at the screen. Is it done yet? What did it do?
Then I go back to playing Magic the Gathering on a tablet while I wait. I also took the dog out. Ran the Brava Jet mop. The model runs on the other screen while I do something else — like write this blog post.
It makes me want a second graphics card. 24GB of VRAM would change the math considerably. An unquantized model would likely produce better results than the compressed versions I’m running now and I could double the context window to 256k. The Nvidia DGX Spark would be nice, but the price tag is a different kind of constraint. Would it run faster though? I dunno, but I could test that.
The Workflow That Came Out of It
I had five prompts that turned into seven skills. A complete development lifecycle that fits inside a limited context window:
- /project-plan — reads Requirements.md, writes Plan.md
- /project-update — gap analysis against the codebase, writes Update.md
- /project-implement — reads Update.md, writes code
- /project-create-bug — takes bug reports, saves to Bugs.md
- /project-bug-fix — reads Bugs.md, lets you select a bug, delegates the fix to an agent
- /project-commit — runs pytest, creates a branch, pushes the commit
- /project-update-tests — reads Testing Strategy.md, updates the test suite
Requirements → Plan → Implementation → Bug tracking → Bug fixing → Testing → Commit. The model doesn’t decide what to do next. The skill does — by reading the documents before executing an agent.
That’s the insight that made everything else manageable. Every time the model needed to make a decision it couldn’t make well, I wrote a document that made the decision for it in advance. Every time it couldn’t solve the problem because the context window was full, I had it write a document and then read that same document in the next prompt.
This is how you control a context window from inside the model.
What I Thought This Would Be
I took AI in college — back when neural networks could only do data mining. I wrote a paper as a junior hypothesizing that we’d eventually have custom hardware for running AI, either on the desktop or on a server. I thought we’d have something like Dr. Know from the movie A.I. — an oracle you could actually query, but more real. Something like the Enterprise computer: a system that reasons through problems with you.
Twenty years later, I have a 9 billion parameter model running on a consumer graphics card, writing Python I don’t know how to write, inside a workflow I designed to work around its limitations.
It’s not the computer from the U.S.S. Enterprise, but it’s a proof of concept that points toward one; if it can fit in a finite state machine, it can be solved by an AI.
The custom hardware I predicted exists now. The Nvidia DGX is real. The AMD Ryzen AI Max+ PRO 495 is coming. What I built works on the hardware I have today — and the wrapper that makes it easier to use might matter more when the hardware gets better, or I have better hardware to throw at it.
Try It
The wrapper is at github.com/zero4281/llama-server-wrapper.
My OpenCode config is here: opencode.json
The initial model tests are at github.com/zero4281/opencode-test.
It’s v1.0. It’s slightly buggy. It works.
The better question — now that the answer to “can consumer hardware run a local coding assistant” is yes — is what the workflow looks like when the context window doubles and the model isn’t quantized. That’s the next test.