_Total build cost: ~$1,299 | 8 components | Built for inference, fine-tuning, and serious local ML work_

![[IMG_0705.jpg]]

---

Most "build a PC for ML" guides are written by people who picked parts from a Reddit thread and called it a day. This one isn't that.

Every component in this rig was chosen because it solves a specific constraint. I'll tell you what the constraint was, what I picked, and why the alternatives didn't make the cut.

If you're building for LLM work specifically — fine-tuning, inference, RAG pipelines, running 13B+ models locally — this guide is for you.

Let's get into it.

---

## The Philosophy Before the Parts List

Here's the thing nobody says clearly: **LLM training is a memory bandwidth problem, not a compute problem.**

Most builders optimize for CUDA cores and clock speed. For LLM work, the bottleneck is almost always VRAM and the speed at which you can move tensors across the memory bus. A GPU with slower cores but more VRAM will out-train a faster GPU with less.

Keep that principle in your head as you read the rest of this.

Second thing: **don't cheap out on the parts that touch your data.** Storage, RAM, and the CPU's memory controller all determine how fast your training pipeline feeds the GPU. A starved GPU is a wasted GPU.

With that framing — here's the build.

---

## The Parts

### GPU — ASUS Prime GeForce RTX 5070 12GB GDDR7 | $609.99

This is the centerpiece and the hardest decision in any ML build.

The 5070 sits in an interesting spot: it's not the flagship, but at 12GB of GDDR7 it's among the best VRAM-per-dollar options in the current Blackwell lineup. For LLM work, 12GB is the practical floor for running quantized 13B models (GGUF Q4 fits comfortably), fine-tuning smaller models with LoRA/QLoRA, and running multi-modal inference without constant OOM errors.

**Why not the 4070 Ti Super (16GB)?** The 5070 runs its GDDR7 at a much higher per-pin data rate than the 4070 Ti Super's GDDR6X, which keeps total bandwidth competitive despite a narrower bus.
Bandwidth wins over raw capacity for training throughput when you're working with batch sizes that fit in 12GB.

**Why not wait for the 5080/5090?** Because the price jump doesn't translate linearly into training speedup for the workloads I'm running. If you're training foundation models from scratch on massive datasets, get the 5090. If you're fine-tuning, doing inference, and building pipelines — the 5070 is the high-leverage buy.

**CUDA compatibility:** Full support for CUDA 12.x, cuDNN 9.x, and the entire PyTorch/JAX stack. No surprises here — ASUS's Prime series runs cool and quiet, which matters for multi-hour training runs.

---

### CPU — AMD Ryzen 5 9600X 6-Core | $207.99

The CPU's job in an ML rig is to not be the bottleneck. That sounds dismissive, but it's actually a high bar. The 9600X (Zen 5 architecture) clears it cleanly: fast single-core performance for Python data pipelines, an excellent memory controller for feeding the GPU, and 6 cores that handle preprocessing, tokenization, and DataLoader workers without choking.

**Why not more cores?** For LLM training, your GPU is doing the heavy lifting. You need enough CPU cores to saturate GPU utilization — typically 4–8 workers for DataLoaders — and after that, more cores don't translate to faster training. The 9600X is exactly enough, which means no money left on the table for cores that sit idle.

**Why not Intel?** The B650 platform (what this pairs with) is excellent, and AMD's memory controller on Zen 5 is genuinely fast with DDR5-6000 — more on that in the RAM section.

---

### Motherboard — MSI PRO B650M-A WiFi | $156.99

The B650M-A is a Micro-ATX board that does what it needs to do without padding the price for features you won't use. What actually mattered here:

- **PCIe 4.0 x16** — a full-bandwidth slot for the RTX 5070. Don't put a current-gen GPU on PCIe 3.0.
- **DDR5 support up to 7800MHz+** — the memory controller needs to match the RAM's potential.
- **M.2 slots** — two of them, NVMe, for fast dataset access.
- **WiFi built-in** — one less card to buy.

The B650 chipset is the sweet spot for Ryzen 9000-series. X670 adds PCIe 5.0 lanes and costs more — useful if you're stacking NVMe drives that saturate Gen 5 bandwidth. For this build, B650 is the right call.

---

### RAM — Crucial Pro 32GB DDR5 Kit (2×16GB) CL36 6000MHz | $86.99

**32GB is the minimum for serious LLM work.** This isn't a general computing recommendation — it's specific to the workload.

Here's why: large datasets don't fully fit in GPU VRAM. Your training pipeline is constantly moving data between system RAM and GPU memory. If system RAM is the bottleneck, you'll see your GPU utilization drop between batches. That's wasted compute.

**Why 6000MHz specifically?** On AMD Zen 4 and Zen 5, DDR5-6000 is the sweet spot for the memory subsystem: it lets the memory controller clock (UCLK) run 1:1 with the memory clock (MCLK), which maximizes memory bandwidth without instability. Going higher (6400, 6800) gives diminishing returns and often forces a 1:2 controller ratio or relaxed timings that eat back the gains.

CL36 at 6000MHz gives you tight-enough timings for this platform. The Crucial Pro kit is XMP-certified for this config, so it runs at rated speed out of the box.

---

### Storage — Kingston NV3 1TB M.2 NVMe PCIe 4.0 | $62.00

Fast local storage matters more for ML than people realize. Training pipelines that read large datasets from slow storage will starve the GPU, causing low utilization even when VRAM is fine.

The Kingston NV3 tops out around 6,000 MB/s sequential read on PCIe 4.0. For training on datasets that don't fit in RAM (which is most real-world cases), that bandwidth directly affects how long you're waiting between epochs.

**1TB is enough if you're disciplined.** Keep your active datasets and model checkpoints on the NVMe; offload older checkpoints and raw data to a secondary HDD or cloud.
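Before blaming the GPU for low utilization, it's worth measuring what the drive actually delivers. A minimal sketch for timing sequential reads (the `dataset.bin` path is a placeholder; compare the result against the NV3's ~6,000 MB/s rating):

```python
import time
from pathlib import Path

# Placeholder path: point this at a large file on the NVMe drive.
DATA_FILE = Path("dataset.bin")

def sequential_read_mbps(path, chunk_mb=64):
    """Read the whole file in large chunks and report throughput in MB/s.

    Note: the OS page cache will inflate results on a repeat run; use a
    file larger than system RAM (or drop caches) for honest numbers.
    """
    chunk = chunk_mb * 1024 * 1024
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while data := f.read(chunk):
            total += len(data)
    elapsed = time.perf_counter() - start
    return total / 1e6 / elapsed

if DATA_FILE.exists():
    print(f"Sequential read: {sequential_read_mbps(DATA_FILE):.0f} MB/s")
```

If the measured number is far below the drive's rating during training, the bottleneck is more likely the data pipeline (small random reads, decompression on the fly) than the hardware.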
Most people bloat their NVMe with datasets they've already processed — don't.

**For serious dataset work:** add a second NVMe (the B650M-A has two M.2 slots) or a large SATA SSD for bulk storage. That upgrade costs ~$60–80 and removes storage as a bottleneck entirely.

---

### CPU Cooler — Thermalright Peerless Assassin 120 SE | $34.90

The Ryzen 9600X runs hot under sustained load — it boosts aggressively, and it doesn't ship with a stock cooler anyway, so you need something that can handle multi-hour training runs where the CPU is constantly feeding the GPU.

The Peerless Assassin 120 SE is a dual-tower, 6-heat-pipe cooler that keeps the 9600X well under thermal limits even under sustained load. It's a perennial best seller in CPU coolers for a reason: absurd performance-per-dollar.

**Why cooling matters for ML specifically:** thermal throttling mid-training is a silent killer. Your training run appears fine, your loss curves look normal, but your CPU has been running at 70% of its rated speed for the last hour because it's thermally limited. A good cooler is cheap insurance.

---

### Case — Cooler Master MasterBox Q300L Micro-ATX | $39.99

The Q300L is a Micro-ATX case with magnetic dust filters, good airflow routing, and room for the Peerless Assassin cooler (verify clearance — it's tight, but it fits).

**Why Micro-ATX?** This is a workstation, not a showpiece. Smaller footprint, same full-length GPU support, same cooling potential. If you're running this under a desk or in a tight space, Micro-ATX is the right form factor.

The Q300L's mesh front panel is the key feature: good passive airflow means the case fans don't have to work as hard to maintain temperatures, which keeps sustained operation quieter.

---

### PSU — Corsair RM750e (2025) Fully Modular ATX | $99.99

The RTX 5070 has a 250W board power rating. The 9600X has a 65W TDP. Everything else is noise. Budget ~315W for the GPU and CPU under full load, add ~100W of headroom for drives, fans, and transient spikes, and you want a PSU rated for at least 500W.
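That sizing math fits in a few lines. A back-of-envelope sketch (assumes NVIDIA's 250W board-power figure for the RTX 5070 and the 50–80% load band where ATX PSUs are most efficient):

```python
# Back-of-envelope PSU sizing for this build.
GPU_W = 250   # RTX 5070 board power (NVIDIA spec)
CPU_W = 65    # Ryzen 5 9600X TDP
REST_W = 100  # drives, fans, RAM, plus transient margin (assumption)

load_w = GPU_W + CPU_W + REST_W

# A PSU runs most efficiently at roughly 50-80% load, so pick a
# rating that puts full system load inside that band.
min_psu_w = load_w / 0.80
max_psu_w = load_w / 0.50

print(f"Estimated full load: {load_w} W")
print(f"PSU rating sweet spot: {min_psu_w:.0f}-{max_psu_w:.0f} W")
```

A 750W unit lands comfortably inside the resulting range, with room left over for a future GPU upgrade.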
**750W is the right call.** Headroom matters for two reasons: efficiency (PSUs run most efficiently at 50–80% load, so 750W at ~350W actual draw is in the sweet zone) and future-proofing (if you upgrade to a 5080 down the line, 750W still covers you).

The RM750e is 80+ Gold certified, fully modular, and dead silent under normal load — the fan doesn't spin up until the PSU hits significant load, which almost never happens in an ML rig at ~350W draw.

**Don't cheap out on the PSU.** It's the only component whose failure takes everything else with it.

---

## The Build at a Glance

|Component|Part|Price|
|---|---|---|
|GPU|ASUS Prime RTX 5070 12GB GDDR7|$609.99|
|CPU|AMD Ryzen 5 9600X|$207.99|
|Motherboard|MSI PRO B650M-A WiFi|$156.99|
|RAM|Crucial Pro 32GB DDR5-6000|$86.99|
|Storage|Kingston NV3 1TB NVMe PCIe 4.0|$62.00|
|CPU Cooler|Thermalright Peerless Assassin 120 SE|$34.90|
|Case|Cooler Master MasterBox Q300L|$39.99|
|PSU|Corsair RM750e 750W Fully Modular|$99.99|
|**Total**||**~$1,298.84**|

---

## Software Stack (Don't Skip This)

The hardware is half the story. Here's what to install once it's running:

```bash
# CUDA Toolkit (match to your driver version)
# https://developer.nvidia.com/cuda-downloads

# PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

# For LLM fine-tuning
pip install transformers accelerate peft bitsandbytes

# For inference
pip install llama-cpp-python  # CPU+GPU hybrid inference for GGUF models
```

**Key config decisions:**

- **bitsandbytes** enables QLoRA — fine-tune 7B–13B models in 4-bit with this rig's 12GB VRAM. Without it, you're limited to smaller models or full-precision inference only.
- **accelerate** handles mixed-precision training out of the box. Run `accelerate config` before your first training run.
- **VRAM management:** use `torch.cuda.empty_cache()` between runs and set `PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512` to reduce memory fragmentation on longer runs.
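Before kicking off a fine-tune, a quick back-of-envelope check tells you whether a given model and precision even fit in 12GB of VRAM. A rough sketch (the 1.5GB overhead for CUDA context and activations is an assumption, not a measured figure):

```python
def model_vram_gb(n_params_billions, bits_per_weight, overhead_gb=1.5):
    """Rough VRAM needed to hold the weights alone, plus a fixed
    overhead guess for CUDA context and activations (assumption)."""
    weight_gb = n_params_billions * 1e9 * bits_per_weight / 8 / 1e9
    return weight_gb + overhead_gb

for name, params, bits in [
    ("7B @ 4-bit (QLoRA)", 7, 4),
    ("13B @ 4-bit (QLoRA)", 13, 4),
    ("8B @ BF16", 8, 16),
]:
    need = model_vram_gb(params, bits)
    verdict = "fits" if need <= 12 else "does NOT fit"
    print(f"{name}: ~{need:.1f} GB -> {verdict} in 12GB")
```

By this estimate, 7B and 13B models at 4-bit fit with room to spare, while an 8B model in BF16 does not — which is exactly why bitsandbytes is on the install list.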
---

## What This Rig Can Actually Do

To be concrete about capabilities:

- **Inference:** Run Llama 3.1 8B in 8-bit comfortably (BF16 weights alone are ~16GB, which overflows 12GB of VRAM); run Llama 3.1 70B in 4-bit GGUF with heavy CPU offload: slow, and tight against 32GB of system RAM, but workable for batch jobs
- **Fine-tuning:** QLoRA fine-tune 7B models on custom datasets comfortably; 13B with careful batch size management
- **Embeddings:** Generate embeddings at scale for RAG pipelines — this is CPU+GPU parallelism, and this rig handles it well
- **Local development:** Full replacement for cloud GPU instances for anything under ~20B parameters in quantized form

What it won't do: train 70B+ models from scratch, run multiple large models simultaneously, or compete with an A100 on raw FLOPS. For that, you're looking at a different budget tier or a cloud burst strategy.

---

## Final Thought

Under $1,300 for a rig that runs local LLMs, fine-tunes on custom data, and eliminates cloud GPU costs for most research and production workloads — that's a good trade.

The parts in this build aren't the cheapest options. They're the highest-leverage ones. There's a difference.

If you have questions on any component choice or want to talk through adapting this for a specific use case, I'm at [linkedin.com/in/harshmarar](https://linkedin.com/in/harshmarar).

---

_Tags: `#ml-infra` `#llm` `#hardware` `#build-log` `#pytorch`_