Local models are (not) cope

The imminent release of the shiny and aptly named DGX Spark led to some discussions about what you actually do with this thing. Cause let's be real: No one really wants to run Llama 3.1 405B in 4-bit precision on 8,000 dollars worth of hardware. On social media, people often cite privacy as a reason for local models, as they want to talk to their PDFs hosted on Google Drive. Or they claim to love coding with a 30B parameter model, because they want the rush of re-generating code over and over again instead of hitting Codex once for a working solution. I think those people are weird[1] and willingly make sacrifices to be the cool kid running a local LLM.

However, some people are yearning for deep integration of LLMs into modern operating systems. Why would you want this over having a shortcut to ChatGPT everywhere? There is one area where local models beat any hosted model: End-to-end latency, i.e., the time from sending your prompt to getting the whole response back. It will always be faster to use your own hardware than to send a request over the internet to a datacenter in some rural place, just to generate 20 tokens and send everything back again.

Time and time again I underestimate the effect of latency and speed on the usability of applications. Having instant reactions to what you do unlocks so many new possibilities, because you don't have to sit around awkwardly and wait up to five seconds just for something to appear on your screen.

What does this mean in practice for me? I have started transcribing things. A lot. This text? Transcribed. The mail I've sent? Transcribed. A lengthy Discord message for an argument? Transcribed. Prompts for LLMs like Codex? Believe it or not, transcribed. The latter is really important, as the more context you give to models, especially GPT-5, the better their results are. Especially since Parakeet v3 dropped, which is a very capable and even faster (you get the idea) model than Whisper, my usage has increased substantially, reshaping how I interact with my MacBook.

Another use case is small scripts that use LLMs to do things where latency would kill the flow. I have, for example, a script which translates the content of my clipboard to German or English. With models like Llama 4 Scout hosted on Cerebras, this procedure was really fast, since they serve it at thousands of tokens per second. However, I still needed to wait for the full response to come back, leaving me awkwardly sitting around for seconds, staring at a blank screen. With local models, which I keep loaded in the background all the time, the response is nearly instant. I also have similar scripts for extracting content from unstructured text.
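For illustration, here is a minimal sketch of what such a clipboard-translation script can look like against LM Studio's local OpenAI-compatible server (the setup is described below). The server address, model identifier, and prompt wording are assumptions on my part, not my exact script:

```python
# Minimal sketch: translate the clipboard via a local LM Studio server
# (OpenAI-compatible, default address http://localhost:1234/v1).
import subprocess

from openai import OpenAI

# The key is just a placeholder; the local server does not need a real one.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

# Grab the clipboard contents on macOS.
text = subprocess.run(["pbpaste"], capture_output=True, text=True).stdout

response = client.chat.completions.create(
    model="qwen3-4b",  # assumed identifier; use whatever model you have loaded
    messages=[
        {
            "role": "system",
            "content": "Translate the user's text: English to German, anything else "
                       "to English. Output only the translation.",
        },
        {"role": "user", "content": text},
    ],
)
translated = response.choices[0].message.content

# Put the translation back onto the clipboard and print it for good measure.
subprocess.run(["pbcopy"], input=translated, text=True)
print(translated)
```

Bind something like this to a hotkey and the round trip is dominated by generation speed rather than network latency.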

In the end, we all have to use LLMs more.

# How I set up my local system

Important: I use a beefy MacBook (M4 Max, 64 GB RAM), but even with lesser hardware you get pretty much the same experience. For the speech-to-text software, I have no experience with alternatives outside of macOS.

I have tried a lot of speech-to-text apps; some of them are free, while others ask for 100 dollars a year (or more!) to use your own hardware. I personally think this is ridiculous and have opted for MacWhisper instead, which is a fair one-time purchase and offers all the same features everyone else has. In fact, every closed app uses Argmax under the hood, which makes the quality virtually the same, while open-source solutions have to settle for less capable inference engines.

To enable the aforementioned Parakeet model in MacWhisper, go to Settings > Transcription Models > Pro > Parakeet v3. MacWhisper has a ton of modes and functions, but I exclusively use the Dictation mode. You can set up an AI service to clean up the raw transcription from Parakeet, which I strongly recommend. Of course, you could use any cloud LLM, but that would defeat the point of running local models and would add a ton of latency. Therefore, I use the free (and amazing!) LM Studio.
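To give you an idea of what this cleanup step amounts to, here is a rough sketch of the kind of request that ends up hitting the local server once you point MacWhisper at LM Studio (described further down; I assume its default address here). The prompt wording, sample transcript, and model identifier are mine, not MacWhisper's:

```python
# Rough illustration of the cleanup step; MacWhisper handles this for you.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

raw_transcript = "so um basically what i wanted to say is that uh latency matters a lot"

cleaned = client.chat.completions.create(
    model="qwen3-4b",  # assumed identifier
    messages=[
        {
            "role": "system",
            "content": "Clean up this dictated text: fix punctuation and casing, remove "
                       "filler words, and keep the wording otherwise. Output only the cleaned text.",
        },
        {"role": "user", "content": raw_transcript},
    ],
).choices[0].message.content

print(cleaned)
```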

MacWhisper Settings

LM Studio is a chat app (similar to ChatGPT) which also exposes a web server (with OpenAI-compatible endpoints for scripts, etc.) and allows you to use different runtimes and models. For macOS, they offer Metal llama.cpp (the real one, not a butchered re-implementation) and MLX, Apple's engine for Apple Silicon. Needless to say, you should always use MLX if available. In my own tests, I get 20-50% speedups by using MLX over Metal llama.cpp. Ollama, a very popular alternative, has started implementing its own runtime, resulting in speeds even slower than llama.cpp's.

Downloading the model is really simple, as seen in the screenshot. After clicking on the model name, it gets loaded into memory, allowing you to chat with it right away. As a reference: I get around 100 tok/s on my MacBook with this model, and others report similar numbers.

Why this model and this quantization (6-bit)? Personally, I find the 3-5B parameter size range appealing, as it only takes 3-4 GB of RAM, a sacrifice I'm willing to make. I have tried a lot of open models, ranging from AFM, LFM, and SmolLM3 to IBM's Granite. However, no model is as good as Qwen3 in the two areas that matter for local models: Multilinguality and instruction following. While the former is something that some people care less about, the latter is crucial. For the dictation cleanups, you need a model that adheres to your instructions; otherwise, the system prompt leaks into your dictated text.
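The RAM figure is just arithmetic on the weights. A quick back-of-the-envelope, assuming a model of roughly 4B parameters at 6-bit:

```python
# Memory footprint of the weights alone; KV cache and runtime overhead come on top.
params = 4e9           # roughly 4 billion parameters
bits_per_weight = 6    # 6-bit quantization
gigabytes = params * bits_per_weight / 8 / 1e9
print(f"{gigabytes:.1f} GB")  # -> 3.0 GB
```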

LM Studio Settings

For LM Studio, there are two other crucial settings, which you can find under App Settings > Developer: First, you need to turn on the LLM service so that other apps, including MacWhisper or your scripts, can use the models. The other setting (Max idle TTL) ensures that the Qwen3 model stays loaded in memory for as long as you desire. I have set this to a high value so that the model I use for everything is loaded all the time and I never have to wait for it to load into memory.
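If you want to check from a script that the server is actually reachable, a quick query against the standard OpenAI-style model listing works; I'm assuming LM Studio's default address here, so adjust the port if you changed it:

```python
# Sanity check: is the local server up, and which models does it expose?
import json
import urllib.request

with urllib.request.urlopen("http://localhost:1234/v1/models") as resp:
    payload = json.load(resp)

# The response follows the OpenAI schema: model objects under "data".
for model in payload.get("data", []):
    print(model["id"])
```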

LM Studio Settings


  1. I will quote this blog in a subsequent post called "How I used Qwen4 to vibe-code a RAG app for my documents on my MacBook", stay tuned. ↩︎