Your Own AI: Local, Private, Affordable
- Max Bäumler
- 21 Mar, 2026
- 6 minutes
- Background, AI, En
ChatGPT, Claude, Gemini, all the major AI services have one thing in common: your data ends up on someone else’s servers. For many businesses, that’s fine. But is it really? If you’re working with confidential documents, customer data, or internal know-how, the picture looks quite different.
The alternative: an AI that runs at your place. On your hardware. In your network. Without a single word leaving your premises.
Sounds expensive and complicated? Not anymore. With the right hardware, you can build a fully capable local AI infrastructure today for around €3,000, running models that perform on par with the top cloud models from a few months ago.
What’s Behind It?
The foundation is a relatively new chip from AMD: the Strix Halo APU, found in machines like the Framework Desktop or the GMKtec EVO-X2. What makes this chip special: it combines CPU and GPU in a single package and shares a common pool of 128 GB of memory between them.
That’s the key point. Regular consumer graphics cards top out at around 32 GB of VRAM. The only workstation card with 96 GB costs around €9,000 on its own, before you even build a PC around it. Here you get 128 GB, shared between the processor and the GPU. And that’s exactly what large language models need: lots and lots of memory.
The GPU portion of the chip, the Radeon 8060S, is no enthusiast part, but it fully supports GPU-accelerated AI inference.
How Does It Compare Pricewise?
For context: there are a few other routes to local AI with comparable memory capacity.
| Hardware | Memory | Price (approx.) | Notes |
|---|---|---|---|
| GMKtec EVO-X2 / Framework Desktop (Strix Halo) | 128 GB unified | ~€3,000 | This setup |
| Apple Mac Studio M4 Max | 128 GB unified | ~€5,200 | macOS, Apple Metal instead of ROCm |
| NVIDIA DGX Spark (Asus Ascent GX10) | 128 GB unified | ~€3,300 | OEM variant of the DGX Spark |
| NVIDIA DGX Spark (Founders Edition) | 128 GB unified | ~€4,300 | NVIDIA Blackwell, recently price-hiked |
| NVIDIA RTX 5090 (GPU only, 32 GB) | 32 GB GDDR7 | ~€3,300 | Only sufficient for smaller models; workstation extra |
| NVIDIA RTX Pro 6000 (GPU only, 96 GB) | 96 GB GDDR7 | ~€9,000 | 96 GB fits most 120B models; workstation extra |
A few notes: the Mac Studio is a solid alternative for anyone who prefers macOS and wants to stay away from Linux. The DGX Spark in its OEM form, the Asus Ascent GX10, costs about the same as the Strix Halo setup and runs NVIDIA’s Blackwell architecture, making it a real alternative in this price range. The RTX 5090’s 32 GB is simply not enough memory for the large models described here. The RTX Pro 6000 with 96 GB would be the GPU of choice for maximum performance, but it costs three times as much as the complete Strix Halo system, and that’s for the GPU alone.
What Runs on It?
The server can run a range of models, switching between them automatically. Here are a few examples:
| Model | Size | Strengths |
|---|---|---|
| GPT OSS 120B | ~61 GB | All-rounder, strong reasoning |
| Qwen3 Coder Next 80B | ~79 GB | Code, technical tasks |
| Qwen3.5 122B | ~99 GB | General purpose + image understanding (stronger than GPT OSS 120B) |
The Qwen3.5 122B is the strongest of the three. In independent benchmarks it scores roughly on par with GPT-5 mini, placing it just below the very best models available a few months ago. Running locally doesn’t make it worse, just more private.
All models support a context window of 128,000 to 262,144 tokens, which translates to roughly 90,000 to 180,000 words, or a complete book. You can process long documents, smaller code repositories, or extensive conversation histories without compromise.
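The token-to-word conversion above can be sketched with a quick back-of-the-envelope calculation. The 0.7 words-per-token factor below is a common rule of thumb for English text, not a property of any specific tokenizer:

```python
# Rough token-to-word conversion for English text. The 0.7 factor
# (roughly three quarters of a word per token) is a rule of thumb,
# not a property of any specific tokenizer.
def tokens_to_words(tokens: int, words_per_token: float = 0.7) -> int:
    return round(tokens * words_per_token)

print(tokens_to_words(128_000))   # roughly 90,000 words
print(tokens_to_words(262_144))   # roughly 180,000 words
```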
Since memory is capped at 128 GB, only one model can be loaded at a time. That sounds like a constraint, but in practice it mostly isn’t. The system loads the requested model automatically in the background when you switch.
How Fast Is It?
Here are the real benchmark numbers from the system:
| Model | Prompt processing | Text generation |
|---|---|---|
| GPT OSS 120B | 637 tokens/s | 37 tokens/s |
| Qwen3 Coder Next 80B | 735 tokens/s | 37 tokens/s |
| Qwen3.5 122B | 288 tokens/s | 20 tokens/s |
For context: the average reader manages around 250 words per minute, which is roughly 5 to 6 tokens per second. The system generates at 20 to 37 tokens/s, so 3 to 6 times faster than you can read. Output feels fluid, with no noticeable wait.
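The comparison above can be checked with a short sketch. It assumes roughly 1.33 tokens per English word, the inverse of the rule of thumb used earlier:

```python
# Back-of-the-envelope check of the reading-speed comparison.
# Assumes ~1.33 tokens per English word, a common rule of thumb.
words_per_minute = 250
tokens_per_word = 1.33
reading_tok_s = words_per_minute * tokens_per_word / 60  # about 5.5 tok/s

generation = {"GPT OSS 120B": 37, "Qwen3 Coder Next 80B": 37, "Qwen3.5 122B": 20}
speedup = {model: tok_s / reading_tok_s for model, tok_s in generation.items()}
for model, factor in speedup.items():
    print(f"{model}: {factor:.1f}x faster than reading speed")
```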
Prompt processing is even faster. Even the heaviest model hits nearly 300 tokens/s when reading in long documents or prompts. That matters a lot when working with large documents or big codebases.
For one or two concurrent users this is no problem at all. For a small team of 5 to 10 people who aren’t all active at the same time, it works well too.
To be honest though: for some tasks, especially with large documents or a lot of code, you’ll wait a moment for the response.
The Software Stack
The setup is automated via an Ansible playbook. Anyone who has set up a Linux machine before will be comfortable with it. A playbook is nothing more than a script that runs all the installation steps automatically. I’ve published the complete setup as an open-source project: github.com/schutzpunkt/strix-halo-ai-stack.
Several open-source projects work together in my playbook:
llama.cpp is the actual inference engine. It loads the models and runs the computations, with full GPU acceleration via AMD’s ROCm stack.
llama-swap handles model management. It exposes an OpenAI- and Anthropic-compatible API and takes care of unloading the old model and loading the new one when you switch. Because the API is OpenAI-compatible, common development tools like Continue, Cursor, or Claude Code can connect to it directly: the same tools you’d normally point at a cloud service simply point at your local server instead.
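As a sketch, here is what a minimal client for such an OpenAI-compatible chat endpoint could look like, using only the Python standard library. The base URL and model name are assumptions; adjust them to your own setup:

```python
import json
import urllib.request

# Sketch of a client for an OpenAI-compatible chat endpoint such as the
# one llama-swap exposes. BASE_URL and the model name are assumptions;
# adjust them to your setup. llama-swap loads the model named in the
# request if it is not already in memory.
BASE_URL = "http://localhost:8080/v1"  # assumed local endpoint

def build_payload(model: str, prompt: str) -> dict:
    """Build the request body for /v1/chat/completions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(model: str, prompt: str) -> str:
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Example (requires the server to be running):
# print(chat("gpt-oss-120b", "Summarise the attached meeting notes."))
```

Because the request shape is the standard OpenAI one, any tool that speaks that API only needs the base URL changed to reach the local server.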
Open WebUI is the interface for everyone on the team who doesn’t want to work via an API. It looks and works like ChatGPT: each team member gets their own account with their own conversation history, can switch models from a dropdown, and upload files or images. Admins manage users and access rights centrally. The Qwen3.5 122B can also understand images, so scanned documents, screenshots, or photos can be analysed directly.
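The same image capability is reachable over the API, not just through the web interface. A multimodal message follows the OpenAI chat-completions convention of mixed text and image parts; the sketch below builds such a message, assuming the loaded model (e.g. Qwen3.5 122B) accepts image input:

```python
import base64

# Sketch of a multimodal message for an OpenAI-compatible chat API,
# assuming the loaded model accepts image input. The data-URI shape
# follows the OpenAI chat-completions convention.
def build_image_message(prompt: str, image_bytes: bytes,
                        mime: str = "image/png") -> dict:
    data_uri = f"data:{mime};base64,{base64.b64encode(image_bytes).decode()}"
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": data_uri}},
        ],
    }

# msg = build_image_message("What does this invoice say?",
#                           open("scan.png", "rb").read())
```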
On top of that, an NGINX reverse proxy handles secure HTTPS connections with automatic certificate renewal, so your browser never shows a security warning.
Privacy and Compliance
This is the real reason local AI is interesting for businesses:
- No data leaves your network. No vendor sees your inputs.
- No API contract, no terms of service from external providers.
- Full control over which models you run and how they’re configured.
- No internet connection required, the system runs completely offline.
For law firms, engineering offices, pentesters, accountants, or anyone working with personal or highly sensitive data, this is a significant difference compared to any cloud service.
Limitations: Be Honest With Yourself
It would be dishonest to describe the system without its constraints. Here’s what you need to know:
Only one model at a time. Memory fits exactly one large model, and if someone switches models, everyone waits about 30 seconds while the new one loads. You can avoid this by offering your team a single fixed model; for most regular users that’s enough, and IT can decide internally which models make sense.
Not for many concurrent users. If ten people are actively typing at the same time, the ones at the back of the queue will feel it. For 2 to 3 parallel conversations performance is good, beyond that it gets slower.
No replacement for specialised services. For image generation, real-time speech-to-text, or other specialised AI tasks, you’ll need other tools. This system is designed for text inference.
ROCm on this chip is relatively new. AMD GPU support for Strix Halo is built on community work (thanks to kyuz0) and works well, but it’s not the same as a mature NVIDIA system with years of driver stability. Occasional updates may require adjustments.
Setup requires someone with Linux knowledge. The playbook takes a lot of the work off your hands, but installing Fedora, setting up SSH, and configuring DNS are prerequisites that someone on the team needs to handle.
No GPU virtualisation. The chip doesn’t support virtualising the GPU, so the system has to run directly on the host; running the AI inside a VM isn’t possible.
Who Is This For?
Local AI makes sense when at least one of these applies:
- You work with data that can’t go to the cloud
- You have 5 to 30 internal users working with AI daily
- You want to avoid ongoing API costs (which add up fast with heavy use)
- You want to adapt the AI to your own processes without dependency on external providers
Anyone who occasionally asks AI to rewrite a paragraph is probably fine with a regular cloud subscription. But if you want to seriously integrate AI into your daily work, for document analysis, code reviews, internal knowledge bases, or summaries, the hardware pays for itself quickly.
Conclusion
A mini PC, 128 GB of memory, open source all the way through: the result is a fully capable AI server running models on par with GPT-5 mini. Two years ago this was simply not conceivable. Today it’s reality for around €3,000.
Not perfect. Not infinitely scalable. But for what most smaller businesses need, it’s more than sufficient, and from a data privacy standpoint it’s considerably cleaner than any cloud alternative.