A post by Mark Watson encouraged me to try running LLMs locally on my MacBook Pro (M1 Max, 64 GB) for coding with Claude Code. I experimented with running the following models with ollama: deepseek-coder-v2:16b-lite-instruct, qwen2.5-coder:14b-base, nemotron-3-nano, devstral-small-2, glm-4.7-flash - and I was disappointed: with only one of them (I don't remember which) did I manage to have a tiny (and useless) conversation. However, Google recently released its Gemma-4 model, which I tried with llama.cpp, and it was magic compared to the previous experience.
Here I log my experience on this subject, mostly for myself.
Installing llama.cpp
brew install llama.cpp
brew install hf
Download models
Here we download two versions: the Q6_K, which is smaller (21 GB) and so should be quicker while still good enough, and the Q8_0, which is a bit bigger (25 GB) and should be slower but more capable. The idea was to find a good balance between the quality of the LLM output and speed (tokens/sec).
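A quick way to compare the raw speed of the two quantizations (once they are downloaded with the command below) is llama.cpp's llama-bench tool - a minimal sketch, assuming llama-bench was installed together with llama.cpp by brew; compare the prompt-processing and generation tokens/sec it reports:
# benchmark both downloaded quants and compare tokens/sec (pp = prompt processing, tg = generation)
llama-bench -m google_gemma-4-26B-A4B-it-Q6_K.gguf -p 512 -n 128 -ngl 99
llama-bench -m google_gemma-4-26B-A4B-it-Q8_0.gguf -p 512 -n 128 -ngl 99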
hf download bartowski/google_gemma-4-26B-A4B-it-GGUF \
google_gemma-4-26B-A4B-it-Q6_K.gguf \
google_gemma-4-26B-A4B-it-Q8_0.gguf \
--local-dir .
Running LLM
llama-server \
-m ~/Downloads/google_gemma-4-26B-A4B-it-Q6_K.gguf \
--port 1234 --host 0.0.0.0 \
-ngl 99 \
-c 163840 \
--jinja \
--flash-attn on \
-ctk q4_0 -ctv q4_0
Legend:
- -ngl 99 - offload up to 99 transformer layers to the GPU. Since most models have fewer than 99 layers, this effectively means “offload everything possible to the GPU”.
- -c 163840 - set the context window size to 163,840 tokens. This is large: for the 26B model used here it takes a huge amount of RAM. For a small or average code base, 16,384 should be good enough; for a really large one (thousands of files, a million LOC) a bigger value may be necessary.
- --jinja - enable the Jinja chat template embedded in the GGUF metadata. If not set, prompts may be formatted incorrectly and quality can degrade badly. It must be used for Gemma, Qwen, Llama, DeepSeek, Mistral and others.
- --flash-attn on - enable Flash Attention: an optimized attention kernel that reduces memory bandwidth usage and improves speed, which is especially important for long contexts.
- -ctk q4_0 - quantize the K cache to 4-bit: saves memory and allows larger contexts, at the cost of possible quality degradation.
- -ctv q4_0 - quantize the V cache to 4-bit: same idea and effects, but for the V tensors.
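Before connecting Claude Code, it is worth checking that the server actually answers. llama-server exposes a /health endpoint and an OpenAI-compatible chat endpoint; a minimal sanity check from the same machine:
# should return a status object once the model is loaded
curl http://localhost:1234/health
# ask for a short completion to confirm generation works
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 32}'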
Claude Code
For my experiments I used Claude Code connected to the locally running LLM.
ANTHROPIC_BASE_URL="http://<IP>:1234" \
CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1 \
claude --model gemma-4-26b
Check the /status output to make sure it has connected to the local LLM.
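To avoid passing these variables on every launch, they can also be stored in Claude Code's settings - a sketch, assuming your Claude Code version supports the env block in ~/.claude/settings.json:
{
  "env": {
    "ANTHROPIC_BASE_URL": "http://<IP>:1234",
    "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1"
  }
}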
With this setup I got a very good experience. Claude Code with its native API still works better - the local setup cannot even be compared to it - but I was able to use the superpowers:brainstorm skill to solve a coding task successfully (even though it took much more time than the native API would have).
Code index MCP
If you have a big enough code base, it will probably be beneficial to also configure a tool that indexes the code base and exposes it to the LLM through an MCP server for fast searching - which is much smarter and quicker than grepping around trying to identify relevant pieces of code.
For my experiments, I used code-index-mcp. Python and its tooling (uv) must already be installed.
The setup is pretty simple - put the following into the project's section of ~/.claude.json:
{
"mcpServers": {
"code-index": {
"command": "uvx",
"args": ["code-index-mcp", "--project-path", "/absolute/path/to/repo"]
}
}
}
Now start Claude Code and make sure with /mcp that the MCP server is up and running. If everything is OK, tell it to run refresh_index via the code-index MCP - it will index the project and return the number of files it has indexed; after that, tell it to run build_deep_index via the code-index MCP - it will build a deep index of the project, which is useful for deep code analysis.
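Instead of editing ~/.claude.json by hand, the same server can also be registered from the command line - a sketch, assuming the claude mcp add subcommand is available in your Claude Code version:
claude mcp add code-index -- uvx code-index-mcp --project-path /absolute/path/to/repo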
After everything is set up, when giving Claude Code a task, remind it that it can use the MCP - it speeds up code analysis and navigation drastically and significantly shortens the overall time the LLM needs to resolve a task.
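For example, a task prompt can end with a reminder like this (just an illustration, not a required phrasing):
Use the code-index MCP tools to search and navigate the code base instead of grep.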