May 2026 | AI Development & Local LLMs
I used a local LLM to build the software that runs the local LLM. That sentence is either impressive or embarrassing depending on how it went. It went both ways — and the embarrassing half is the part worth talking about.
The question I started with was simple: can a consumer GPU run a coding assistant well enough to actually use the stupid thing? All the buzz is around Claude Code and MCPs. I wanted to know what happens when you kick the corps out of your workflow and do it yourself. Cost restrictions are real. Capacity constraints are real. While I was writing this post, Claude hit a wall mid-conversation due to capacity limits — the kind of friction that doesn’t exist when the model runs on your own hardware. That’s not a complaint; it’s a data point.
The answer is yes. It works…with some caveats. I have a slightly buggy MVP v1.0 on GitHub to prove it.
How It Started
I started with a test prompt and a repository called opencode-test:
Write a python script that can search models in the huggingface hub. Save it to ./hf_search.py. Use this documentation as a starting point. https://huggingface.co/docs/huggingface_hub/en/package_reference/hf_api.md
Then I let it run to see what would happen. I was still running Ollama at the time. The models I tested were:
- OmniCoder Qwen3.5 (9B Parameters — 112k Context Window)
- OmniCoder Qwen3.5 Custom (9B Parameters — 112k Context Window)
- Qwen3.5 Coder Neo (4B Parameters — 160k Context Window)
- Qwopus3.5 v3 (4B Parameters — 128k Context Window)
Every model eventually succeeded. Different sizes and context windows produced different results, which was expected. What wasn’t expected: every model would randomly stop mid-task and wait. I had to type “continue” every few minutes. It took an hour to finish one test. I spent days on it.
That turned out to be a known Ollama bug. Reddit confirmed it. But the bug isn’t really the story — the story is what came after.
Every time I customized Ollama’s service file to expose it on the local network instead of just localhost, the next update would overwrite it. Every single time. I wanted it to run like a server, not a desktop application. Ollama has a huge following and that’s fine; it just couldn’t do what I needed it to do, and it kept undoing the work I’d done to make it work anyway. So I walked away from it.
I tried vLLM. Probably great for a server, useless for my constraint. Then I found out that Ollama is just a wrapper around llama.cpp — which I could run myself.
The Hardware Math
Context windows and VRAM aren’t linear, they’re quadratic. This matters more than most local AI tutorials will tell you.
A roughly 9 billion parameter quantized model with an 87k context window fits on a 12GB graphics card. Reduce that to 4 billion parameters and you can push the context window to 165k. Want a 9 billion parameter model without quantization and the full 256k context window? You immediately need over 20GB of VRAM.
I have an Nvidia RTX 4070 with 12GB. That’s my constraint. Every model choice in this project runs through that math.
The Problem With llama.cpp Naked
llama.cpp crushed it technically. Fast, no connection drops, no forced “continue” interruptions.
The workflow was not a workflow. Save the launch command to a text file. Copy-paste it every time. Leave a terminal window open to kill the server later — or hunt down the process manually to get your VRAM back. That’s not a coding assistant, that’s a chore. I hate chores. I have a robot vacuum for that reason.
And then there’s the install problem. llama.cpp has 28 different archives per release, not counting source code. I know which one I need, but I don’t even know what all 28 of them are for. We need AI for everyone, not just the few with the technical background to understand their own hardware before they can download the right file.
I needed real software. Something that would configure llama.cpp with a config file (for some reason it doesn’t have one), switch between models on the fly from inside OpenCode, and guide a new user through installation without making them read release notes.
I don’t write Python. The solution was obvious: use a local LLM to write the code that runs the local LLM.
The solution: llama-server-manager.
./llama-server-manager --install-llama
That command walks you through an interactive TUI menu — arrow keys, numbered options, Enter to confirm — that detects your OS and hardware and picks the right archive from those 28 choices. After that, a conf.json handles the rest: host IP, port, idle unload timeout. The server starts as a background process; close the terminal if you want, it keeps running. Stop it with --stop-server. Update llama.cpp itself with --update-llama. Update the wrapper with --self-update.
The API it exposes is OpenAI-compatible. OpenCode, or anything else that speaks to OpenAI’s endpoint, drops in without modification. The server speaks a language every tool already understands; the wrapper runs the server.
The Recursive Part
I wrote several prompts and consistently came up short. The model kept going sideways and I kept wondering why. Then I thought back to what actually works in software development — what the lead developer does that makes everything else possible. The lead writes the requirements. Nobody codes their way out of a missing spec. So I stopped vibe coding and became the lead developer.
I wrote a requirements outline, handed it to Claude, had it ask me questions, and got Requirements.md out the other end. One document. Every decision the model would need to make, already made.
That’s the difference between vibe coding and agentic engineering — and it took hitting the wall a few times to figure it out. A digital junior developer; that’s what the local model felt like after that. Handing a spec to someone who could write the code I couldn’t, then watching what came back. OpenCode connected to the local model. I described what I needed. The model wrote Python. I tested it. Things broke. I described the breakage. The model fixed it.
Here’s the part nobody tells you: doing three things simultaneously with an AI coding assistant is hard. I was building the wrapper, writing unit tests, and building a library of skills — structured prompts in markdown that tell the model exactly how to approach recurring tasks — all at the same time. The skills, or lack thereof, were the thing that almost killed the project.
When the Tests Ate Themselves
This is the embarrassing half.
I wrote a bug-reporting prompt that told the AI to add tests if they didn’t exist. That instruction made perfect sense to me. To the model it meant: there are no tests yet, add them…every single time. The test directory ballooned to two dozen files. Duplicates everywhere. More tests than working code, and the code still didn’t work.
It wasn’t that the model couldn’t write code. All those tests passed. They were just the wrong tests because I hadn’t defined what done looked like, so it found its own definition. The context window was 87k and the model couldn’t hold the full picture across sessions, so every time it needed to write a test it would check what existed, fail to find anything meaningful, and create a new file.
That’s not a coding failure. That’s a requirements failure, and the requirements failure was mine.
The solution wasn’t a better prompt. It was a constraint document: Testing Strategy.md. I used Claude to write it with the same approach as Requirements.md; use the tool that’s good at turning a conversation into a specification. It tells the model exactly which five test files exist, what each one covers, and explicitly forbids creating new ones. It defines the mocking patterns. It sets target test counts per file. It’s a contract written in a language the model can actually follow.
Then I deleted all the tests and started over. The AI model can’t fix what the model can’t load. The context window forced that decision. I created the /project-update-tests skill to write new, targeted tests against the new document.
Without the constraint document the model makes decisions. With it, the decisions are already made.
Three AIs, One Workflow
I’m not loyal to a single tool. The goal of this experiment is to limit the amount of data sent to the cloud, to find out how far I can push each tool, what the differences are, and how to test them.
The workflow that emerged:
- Claude handled the meta-work: Requirements.md, Testing Strategy.md, turning conversations into specifications. It’s good at that. It would not, however, write skills for OpenCode. It kept rewriting them for itself instead, telling me “Claude Code doesn’t have that tool.” That’s a real limitation, not a complaint.
- Gemini had no such objections. It soared where Claude failed. I’ve never been tied to one AI tool anyway, so switching made sense.
- The local model wrote the code.
That division isn’t something I designed in advance. I backed into it by hitting walls. But it maps cleanly: the expensive cloud model writes the rules; the local model follows them.
Smaller local models can’t quite handle writing yet. If I had an Nvidia DGX Spark running a 70 billion parameter model, I’d probably use it as a writing assistant. I don’t. Part of the experiment is finding out what a single 12GB GPU can do by itself. It can write code if everything is structured correctly, but it can’t write a blog article.
The Settings Nobody Talks About
Everyone knows you need to pick the right model. Here’s what the YouTube tutorials skip: temperature, repeat-penalty, and presence-penalty matter. A lot. The right values are different for every model, and getting them wrong doesn’t just produce worse output — it sends the model into logical loops it can’t escape from. I lost time to that before I understood what was actually happening.
There’s information out there about what the settings do. Then you tweak them until the behavior changes. No magic formula; just iteration.
Now I’m using Jackrong/Qwen3.5-9B-Neo — fine-tuned, custom settings, context window sized to fit inside 12GB. It’s 6.47GB on disk and can solve some pretty complex problems if you help it break them down into reasonably sized pieces.
Here’s the section of my OpenCode config that brings the whole thing together:
"Jackrong/Qwen3.5-9B-Neo-GGUF:Q5_K_M": {
"name": "Qwen 3.5 Coder Neo (9B Q5)",
"options": {
"presence-penalty": 0.2,
"repeat-penalty": 1.2,
"temp": 0.7,
"top-k": 20,
"top-p": 0.95,
"min-p": 0.1
}
}
The Waiting
There’s a certain kind of anticipation when you fire off a task to a local model. I keep looking at the screen. Is it done yet? What did it do?
Then I go back to playing Magic the Gathering on a tablet while I wait. I also took the dog out. Ran the Brava Jet mop. The model runs on the other screen while I do something else — like write this blog post.
I’m always a little excited to see what it does, even when it fails. I want to examine the failure as much as I want to see it succeed; the failure is data.
It makes me want a second graphics card. 24GB of VRAM would change the math considerably. An unquantized model would likely produce better results than the compressed versions I’m running now and I could push the context window to 256k. The Nvidia DGX Spark would be nice, but the price tag is a different kind of constraint.
If I had one tomorrow, the first thing I’d run is Qwen3.6-27B — supposed to be really, like really really, good at writing code. Better than models bigger than it is. Better than some that are only cloud-hosted. That’s the next test.
The Workflow That Came Out of It
Five prompts turned into seven skills. A complete development lifecycle that fits inside a limited context window:
/project-plan— readsRequirements.md, writesPlan.md/project-update— gap analysis against the codebase, writesUpdate.md/project-implement— readsUpdate.md, writes code/project-create-bug— takes bug reports, saves toBugs.md/project-bug-fix— readsBugs.md, lets you select a bug, delegates the fix to an agent/project-commit— runs pytest, creates a branch, pushes the commit/project-update-tests— readsTesting Strategy.md, updates the test suite
Requirements → Plan → Implementation → Bug tracking → Bug fixing → Testing → Commit. The model doesn’t decide what to do next. The skill does — by reading the documents before executing an agent.
The insight that made everything manageable: every time the model needed to make a decision it couldn’t make well, I wrote a document that made the decision for it in advance. Every time it couldn’t solve the problem because the context window was full, I had it write a document and then read that same document in the next prompt. This is how you control a context window from inside the model.
The skills need more work. /project-plan and /project-implement aren’t as well written as the bug skills. I need to circle back to them before the skills become their own repo. The lesson eating its own tail; the skills need their own Requirements.md before they’re ready to ship.
Vibe Coding Is a Trap
Write a prompt. Didn’t get what you wanted? Write another prompt. That’s the trap. That’s why developers complain about AI output. The results turn into AI slop as code; code bloat. Code bloat creates bugs.
Lines of code aren’t supposed to be the metric. We knew that 25 years ago. Why did they become the metric when AI started writing the code? Why should anyone care how many tokens you used this month?
The software development lifecycle isn’t a revelation. Requirements, planning, implementation, testing, commit. That’s what software development is. Most vibe coders skip it and then wonder why the output is garbage. I’m not a better developer because I used AI, but I am a better AI user because I already understood software development.
What I Thought This Would Be
I took AI in college — back when neural networks could only do data mining. I wrote a paper as a junior hypothesizing that we’d eventually have custom hardware for running AI, either on the desktop or on a server. I thought we’d have something like Dr. Know from the movie A.I. — an oracle you could actually query. Something like the Enterprise computer: a system that reasons through problems with you.
Twenty years later, I have a 9 billion parameter model running on a consumer graphics card, writing Python I don’t know how to write, inside a workflow I designed to work around its limitations.
It’s not the computer from the U.S.S. Enterprise; it’s a proof of concept that points toward one. The custom hardware I predicted exists now. The Nvidia DGX is real. The AMD Ryzen AI Max+ PRO 495 is coming. What I built works on the hardware I have today.
I’m breaking the conventional rules by opting out of cloud-hosted models. That’s a choice, not an accident; and the wrapper that makes local AI easier to use might matter more when the hardware gets better than it is right now.
I feel like this is something I should keep doing. More tests. More code. More publishing. The experiment isn’t finished; the fact that it isn’t finished is the point.
Try It
The wrapper is at github.com/zero4281/llama-server-manager.
My OpenCode config is here: opencode.json
The initial model tests are at github.com/zero4281/opencode-test.
It’s v1.0. It’s slightly buggy. It works.
The better question — now that “can consumer hardware run a local coding assistant” is answered — is what the workflow looks like when the context window doubles and the model isn’t quantized. That’s the next test.