A post by Mark Watson encouraged me to try running LLMs locally on my MacBook Pro (M1 Max, 64 GB) for coding with Claude Code. I experimented with running the following models with ollama: deepseek-coder-v2:16b-lite-instruct, qwen2.5-coder:14b-base, nemotron-3-nano, devstral-small-2, glm-4.7-flash - and I was disappointed: with only one of them (I don't remember which) did I manage to have a tiny (and useless) conversation. However, Google recently released its Gemma-4 model, which I tried with llama.cpp, and it was magic compared to the previous experience.
Here I log my experience on this subject, mostly for myself.
Installing llama.cpp
brew install llama.cpp
brew install hf
Download models
Here we download two versions: the Q6_K, which is smaller (21 GB) and so should be quicker while still good enough, and the Q8_0, which is a bit bigger (25 GB) and should be slower but more capable. The idea was to find a good balance between the quality of the LLM output and speed (tokens/sec).
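A quick way to compare the raw speed of the two quantizations (once they are downloaded with the command below) is llama.cpp's llama-bench tool - a minimal sketch, assuming llama-bench was installed together with llama.cpp by brew; compare the prompt-processing and generation tokens/sec it reports:
# benchmark both downloaded quants and compare tokens/sec (pp = prompt processing, tg = generation)
llama-bench -m google_gemma-4-26B-A4B-it-Q6_K.gguf -p 512 -n 128 -ngl 99
llama-bench -m google_gemma-4-26B-A4B-it-Q8_0.gguf -p 512 -n 128 -ngl 99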
hf download bartowski/google_gemma-4-26B-A4B-it-GGUF \
google_gemma-4-26B-A4B-it-Q6_K.gguf \
google_gemma-4-26B-A4B-it-Q8_0.gguf \
--local-dir .
Running LLM
llama-server \
-m ~/Downloads/google_gemma-4-26B-A4B-it-Q6_K.gguf \
--port 1234 --host 0.0.0.0 \
-ngl 99 \
-c 163840 \
--jinja \
--flash-attn on \
-ctk q4_0 -ctv q4_0
Legend:
- -ngl 99 - offload up to 99 transformer layers to the GPU. Since most models have fewer than 99 layers, this effectively means “offload everything possible to the GPU”.
- -c 163840 - set the context window size to 163,840 tokens. This is large: for the 26B model used here it takes a huge amount of RAM. For a small or average code base, 16,384 should be good enough; for a really large one (thousands of files, a million LOC) a bigger value may be necessary.
- --jinja - enable the Jinja chat template embedded in the GGUF metadata. If not set, prompts may be formatted incorrectly and quality can degrade badly. It must be used for Gemma, Qwen, Llama, DeepSeek, Mistral and others.
- --flash-attn on - enable Flash Attention: an optimized attention kernel that reduces memory bandwidth usage and improves speed, which is especially important for long contexts.
- -ctk q4_0 - quantize the K cache to 4-bit: saves memory and allows larger contexts, at the cost of possible quality degradation.
- -ctv q4_0 - quantize the V cache to 4-bit: same idea and effects, but for the V tensors.
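Before connecting Claude Code, it is worth checking that the server actually answers. llama-server exposes a /health endpoint and an OpenAI-compatible chat endpoint; a minimal sanity check from the same machine:
# should return a status object once the model is loaded
curl http://localhost:1234/health
# ask for a short completion to confirm generation works
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 32}'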
Claude Code
For my experiments I used Claude Code connected to the locally running LLM.
ANTHROPIC_BASE_URL="http://<IP>:1234" \
CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1 \
claude --model gemma-4-26b
Check the /status output to make sure it has connected to the local LLM.
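To avoid passing these variables on every launch, they can also be stored in Claude Code's settings - a sketch, assuming your Claude Code version supports the env block in ~/.claude/settings.json:
{
  "env": {
    "ANTHROPIC_BASE_URL": "http://<IP>:1234",
    "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1"
  }
}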
With this setup I got a very good experience. Claude Code with its native API still works better - the local setup cannot even be compared to it - but I was able to use the superpowers:brainstorm skill to solve a coding task successfully (even though it took much more time than the native API would have).
Code index MCP
If you have a big enough code base, it will probably be beneficial to also configure a tool that indexes the code base and exposes it to the LLM through an MCP server for fast searching - which is much smarter and quicker than grepping around trying to identify relevant pieces of code.
For my experiments, I used code-index-mcp. Python and its tooling (uv) must already be installed.
The setup is pretty simple - put the following into the project's section of ~/.claude.json:
{
"mcpServers": {
"code-index": {
"command": "uvx",
"args": ["code-index-mcp", "--project-path", "/absolute/path/to/repo"]
}
}
}
Now start Claude Code and make sure with /mcp that the MCP server is up and running. If everything is OK, tell it to run refresh_index via the code-index MCP - it will index the project and return the number of files it has indexed; after that, tell it to run build_deep_index via the code-index MCP - it will build a deep index of the project, which is useful for deep code analysis.
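Instead of editing ~/.claude.json by hand, the same server can also be registered from the command line - a sketch, assuming the claude mcp add subcommand is available in your Claude Code version:
claude mcp add code-index -- uvx code-index-mcp --project-path /absolute/path/to/repo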
After everything is set up, when giving Claude Code a task, remind it that it can use the MCP - it speeds up code analysis and navigation drastically and significantly shortens the overall time the LLM needs to resolve a task.
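For example, a task prompt can end with a reminder like this (just an illustration, not a required phrasing):
Use the code-index MCP tools to search and navigate the code base instead of grep.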