Graphics hardware | Technology Tales

Running local Large Language Models on desktop computers and workstations with 8GB VRAM

7^th February 2026

Running large language models locally has shifted from being experimental to practical, but expectations need to match reality. A graphics card with 8 GB of VRAM can support local workflows for certain text tasks, though the results vary considerably depending on what you ask the models to do.

Understanding the Hardware Foundation

The Critical Role of VRAM

The central lesson is that VRAM is the engine of local performance on desktop systems. Whilst abundant system RAM helps avoid crashes and allows larger contexts, it cannot replace VRAM for throughput.

Models that fit in VRAM and keep most of their computation on the GPU respond promptly and maintain a steady pace. Those that overflow to system RAM or the CPU see noticeable slowdowns.

Hardware Limitations and Thresholds

On a desktop GPU with 8 GB of VRAM, this sets a practical ceiling. Models in the 7 billion to 14 billion parameter range fit comfortably enough to exploit GPU acceleration for typical contexts.

Much larger models tend to offload a significant portion of the work to the CPU. This shows up as pauses, lower token rates and lag when prompts become longer.

Monitoring GPU Utilisation

GPU utilisation is a reliable way to gauge whether a setup is efficient. When the GPU is consistently busy, generation is snappy and interactive use feels smooth.

A model like llama3.1:8b can run almost entirely on the GPU at a context length of 4,096 tokens. This translates into sustained responsiveness even with multi-paragraph prompts.

Model Selection and Performance

Choosing the Right Model Size

A frequent instinct is to reach for the largest model available, but size does not equal usefulness when running locally on a desktop or workstation. In practice, models between 7B and 14B parameters are what you can run on this class of hardware, though what they do well is more limited than benchmark scores might suggest.

What These Models Actually Do Well

Models in this range handle certain tasks competently. They can compress and reorganise information, expand brief notes into fuller text, and maintain a reasonably consistent voice across a draft. For straightforward summarisation of documents, reformatting content or generating variations on existing text, they perform adequately.

Where things become less reliable is with tasks that demand precision or structured output. Coding tasks illustrate this gap between benchmarks and practical use. Whilst llama3.1:8b scores 72.6% on the HumanEval benchmark (which tests basic algorithm problems), real-world coding tasks can expose deeper limitations.

Commit message generation, code documentation and anything requiring consistent formatting produce variable results. One attempt might give you exactly what you need, whilst the next produces verbose or poorly structured output.

The gap between solving algorithmic problems and producing well-formatted, professional code output is significant. This inconsistency is why larger local models like gpt-oss-20b (which requires around 16GB of memory) remain worth the wait despite being slower, even when the 8GB models respond more quickly.

Recommended Models for Different Tasks

Llama3.1:8b handles general drafting reasonably well and produces flowing output, though it can be verbose. Benchmark scores place it above average for its size, but real-world use reveals it is better suited to free-form writing than structured tasks.

Phi3:medium is positioned as stronger on reasoning and structured output. In practice, it can maintain logical order better than some alternatives, though the official documentation acknowledges quality of service limitations, particularly for anything beyond standard American English. User reports also indicate significant knowledge gaps and over-censorship that can affect practical use.

Gemma3 at 12B parameters produces polished prose and smooths rough drafts effectively when properly quantised. The Gemma 3 family offers models from 1B to 27B parameters with 128K context windows and multimodal capabilities in the larger sizes, though for 8GB VRAM systems you are limited to the 12B variant with quantisation. Google also offers Gemma 3n, which uses an efficient MatFormer architecture and Per-Layer Embedding to run on even more constrained hardware. These are primarily optimised for mobile and edge devices rather than desktop use.

Very large models remain less efficient on desktop hardware with 8 GB VRAM. Attempting to run them results in heavy CPU offloading, and the performance penalty can outweigh any quality improvement.

Memory Management and Configuration

Managing Context Length

Context length sits alongside model size as a decisive lever. Every extra token of context demands memory, so doubling the window is not a neutral choice.

At around 4,096 tokens, most of the well-matched models stay predominantly on the GPU and hold their speed. Push to 8,192 or beyond, and the memory footprint swells to the point where more of the computation ends up taking place on the CPU and in system RAM.

Ollama's Keep-Alive Feature

Ollama keeps already loaded models resident in VRAM for a short period after use so that the next call does not pay the penalty of a full reload. This is expected behaviour and is governed by a keep_alive parameter that can be adjusted to hold the model for longer if a burst of work is coming, or to release it sooner when conserving VRAM matters.

Practical Memory Strategies

Breaking long jobs into a series of smaller, well-scoped steps helps both speed and stability without constraining the quality of the end result. When writing an article, for instance, it can be more effective to work section by section rather than asking for the entire piece in one pass.

Optimising the Workflow

The Benefits of Streaming

Streaming changes the way output is experienced, rather than the content that is ultimately produced. Instead of waiting for a block of text to arrive all at once, words appear progressively in the terminal or application. This makes longer pieces easier to manage and revise on the fly.

Task-Specific Model Selection

Because each model has distinct strengths and weaknesses, matching the tool to the task matters. A fast, GPU-friendly model like llama3.1:8b works for general writing and quick drafting where perfect accuracy is not critical. Phi3:medium may handle structured content better, though it is worth testing against your specific use case rather than assuming it will deliver.

Understanding Limitations

It is important to be clear about what local models in this size range struggle with. They are weak at verifying facts, maintaining strict factual accuracy over extended passages, and providing up-to-date knowledge from external sources.

They also perform inconsistently on tasks requiring precise structure. Whilst they may pass coding benchmarks that test algorithmic problem-solving, practical coding tasks such as writing commit messages, generating consistent documentation or maintaining formatting standards can reveal deeper limitations. For these tasks, you may find yourself returning to larger local models despite preferring the speed of smaller ones.

Integration and Automation

Using Ollama's Python Interface

Ollama integrates into automated workflows on desktop systems. Its Python package allows calls from scripts to automate summarisation, article generation and polishing runs, with streaming enabled so that logs or interfaces can display progress as it happens. Parameters can be set to control context size, temperature and other behavioural settings, which helps maintain consistency across batches.

Building Production Pipelines

The same interface can be linked into website pipelines or content management tooling, making it straightforward to build a system that takes notes or outlines, expands them, revises the results and hands them off for publication, all locally on your workstation. The same keep_alive behaviour that aids interactive use also benefits automation, since frequently used models can remain in VRAM between steps to reduce start-up delays.

Recommended Configuration

Optimal Settings for 8 GB VRAM

For a desktop GPU with 8 GB of VRAM, an optimal configuration builds around models that remain GPU-efficient whilst delivering acceptable results for your specific tasks. Llama3.1:8b, phi3:medium and gemma3:12b are the models that fit this constraint when properly quantised, though you should test them against your actual workflows rather than relying on general recommendations.

Performance Monitoring

Keeping context windows around 4,096 tokens helps sustain GPU-heavy operation and consistent speeds, whilst streaming smooths the experience during longer outputs. Monitoring GPU utilisation provides an early warning if a job is drifting into a configuration that will trigger CPU fallbacks.

Planning for Resource Constraints

If a task does require more memory, it is worth planning for the associated slowdown rather than assuming that increasing system RAM or accepting a bigger model will compensate for the VRAM limit. Tuning keep_alive to the rhythm of work reduces the frequency of reloads during sessions and helps maintain responsiveness when running sequences of related prompts.

A Practical Content Creation Workflow

Multi-Stage Processing

This configuration supports a division of labour in content creation on desktop systems. You start with a compact model for rapid drafting, switch to a reasoning-focused option for structured expansions if needed, then finish with a model known for adding polish to refine tone and fluency. Insert verification steps between stages to confirm facts, dates and citations before moving on.

Because each stage is local, revisions maintain privacy, with minimal friction between idea and execution. When integrated with automation via Ollama's Python tools, the same pattern can run unattended for batches of articles or summaries, with human review focused on accuracy and editorial style.

In Summary

Desktop PCs and workstations with 8 GB of VRAM can support local LLM workflows for specific tasks, though you need realistic expectations about what these models can and cannot do reliably. They handle basic text generation and reformatting, though prone to hallucinations and misunderstandings. They struggle with precision tasks, structured output and anything requiring consistent formatting. Whilst they may score well on coding benchmarks that test algorithmic problem-solving, practical coding tasks can reveal deeper limitations.

The key is to select models that fit the VRAM envelope, keep context lengths within GPU-friendly bounds, and test them against your actual use cases. For tasks where local models prove inconsistent, such as generating commit messages or producing reliably structured output, larger local models like gpt-oss-20b (which requires around 16GB of memory) may still be worth the wait despite being slower. Local LLMs work best when you understand their limitations and use them for what they genuinely do well, rather than expecting them to replace more capable models across all tasks.

Additional Resources

Ollama Official Documentation
Ollama GitHub Repository
Hugging Face model hub
LM Studio for local model management
Jan AI for local AI deployment
Meta Llama 3.1 Model Card
Microsoft Phi-3 Documentation
Google Gemma 3 Overview
Google Gemma 3n Overview
Artificial Analysis for model benchmarks

Upheaval and miniaturisation

4^th March 2025

The ongoing AI boom got me refreshing my computer assets. One was a hefty upgrade to my main workstation, still powered by Linux. Along the way, I learned a few lessons:

Processing with LLM's only works on a graphics card when everything can remain within its onboard memory. It is all too easy to revert to system memory and CPU usage, given the amount of memory you get on consumer graphics cards. That applies even with the latest and greatest from Nvidia, when the main use case is for gaming. Things become prohibitively expensive when you go on from there.
Even with water cooling, keeping a top of the range CPU cool and its fans running quietly remains a challenge, more so than when I last went for a major upgrade. It takes time for things to settle down.
My Iiyama monitor now feels flaky with input from the latest technology. This is enough to make me look for a replacement, and it is waking up from dormancy that is the real issue. While it was always slow, plugging out from mains electricity and then back in again is a hack that is needed all too often.
KVM switches may need upgrading to work with the latest graphical input. The monitor may have been a culprit with the problems that I was getting, yet things were smoother once I replaced the unit that I had been using with another that is more modern.
AMD Ryzen 9 chips now have onboard graphics, a boon when things are not proceeding too well with a dedicated graphics card. Even though this was not the case when the last major upgrade happened, there were no issues like what I faced this time around.
Having LED's on a motherboard to tell what might be stopping system startup is invaluable. This helped in July 2021 and averted confusion this time around as well. While only four of them were on offer, knowing which of CPU, DRAM, GPU or system boot needs attention is a big help.
Optical drives are not needed any longer. Booting off a USB drive was enough to get Linux Mint installed, once I got the image loaded on there properly. Rufus got used, and I needed to select the low-level writing option before things proceeded as I had hoped.

Just like 2021, the 2025 upgrade cycle needed a few weeks for everything to settle down. The previous cycle was more challenging, and this was not just because of an accompanying heatwave. The latest one was not so bedevilled.

Given the above, one might be tempted to go for a less arduous path, like my acquisition of an iMac last year for another place that I own. After all, a Mac Mini packs in quite a lot of power, and it is not the only miniature option. Now that I have one, I have moved image processing off the workstation and onto it. The images are stored on the Linux machine and edited on the Mac, which has plenty of memory and storage of its own. There is also an M4 chip, so processing power is not lacking either.

It could have been used for work affairs, yet I acquired a Geekom A8 for just that. Though seeking work as I write this, my being an incorporated freelancer means that having a dedicated machine that uses my main monitor has its advantages. Virtualisation can allow drift from business affairs to business matters, that is not so easy when a separate machine is involved. There is no shortage of power either with an AMD Ryzen 9 8945HS and Radeon 780M Graphics on board. Add in 32 GB of memory and 2 TB of storage and all is commodious. It can be surprising what a small package can do.

The Iiyama's travails also pop up with these smaller machines, less so on the Geekom than with the Mac. The latter needs the HDMI cable to be removed and reinserted after a delay to sort out things. Maybe that new monitor may not be such an off the wall idea after all.

A need to update graphics hardware

16^th June 2013

As someone who doesn't play computer games, I rarely prioritise graphics card upgrades. Yet, I recently upgraded graphics cards in two of my PCs despite nothing being broken. My backup machine, built nearly four years ago, has run multiple Linux distributions. It uses an ASRock K10N78 motherboard from MicroDirect with an integrated NVIDIA graphics chip that performs adequately, if not exceptionally. The only issue was slightly poor text rendering in web browsers, but this alone wasn't enough to justify adding a dedicated graphics card.

More recently, I ran into trouble with Sabayon 13.04 with only the 2D variant of the Cinnamon desktop environment working on it and things getting totally non-functional when a full re-installation of the GNOME edition was attempted. Everything went fine until I added the latest updates to the system, when a reboot revealed that it was impossible to boot into a desktop environment. Some will relish this as a challenge, but I need to admit that I am not one of those. In fact, I tried out two Arch-based distros on the same PC and got the same results following a system update on each. So, my explorations of Antergos and Manjaro have to continue in virtual machines instead.

When I tried Linux Mint 15 Cinnamon, it worked perfectly. However, newer distributions with systemd didn't work with my onboard NVIDIA graphics. Since systemd will likely come to Linux Mint eventually, I decided to add a dedicated graphics card. Based on good past experiences with Radeon, I chose an AMD Radeon HD 6450 from PC World, confirming it had Linux driver support. Installation was simple: power off, insert card, close case, power on. Later, I configured the BIOS to prioritise PCI Express graphics, though this step wasn't necessary. I then used Linux Mint's Additional Driver applet to install the proprietary driver and restarted. To improve web browser font rendering, I selected full RGBA hinting in the Fonts applet. The improvement was obvious, though still not as good as on my main machine. Overall, the upgrade improved performance and future-proofed my system.

After upgrading my standby machine, I examined my main PC. It has both onboard Radeon graphics and an added Radeon 4650 card. Ubuntu GNOME 12.10 and 13.04 weren't providing 3D support to VMware Player, which complained when virtual machines were configured for 3D. Installing the latest fglrx driver only made things worse, leaving me with just a command line instead of a graphical interface. The only fix was to run one of the following commands and reboot:

sudo apt-get remove fglrx

sudo apt-get remove fglrx-updates

Looking at the AMD website revealed that they no longer support 2000, 3000 or 4000 series Radeon cards with their latest Catalyst driver, the last version that did not install on my machine since it was built for version 3.4.x of the Linux kernel. A new graphics card then was in order if I wanted 3D graphics in VMware VM's and both GNOME and Cinnamon appear to need this capability. Another ASUS card, a Radeon HD 6670, duly was acquired and installed in a manner similar to the Radeon HD 6450 on the standby PC. Apart from not needing to alter the font rendering (there is a Font tab on the Gnome Tweak Tool where this can be set), the only real exception was to add the Jockey software to my main PC for installation of the proprietary Radeon driver. The following command does this:

sudo apt-get install jockey-kde

After completing installation, I ran the jockey-kde command and selected the first driver option. Upon restart, the system worked properly except for an AMD message in the bottom-right corner warning about unrecognised hardware. Since there were two identical entries in the Jockey list, I tried the second option. After restarting, the incompatibility message disappeared and everything functioned correctly. VMware even ran virtual machines with 3D support without any errors, confirming the upgrade had solved my problem.

Hearing of someone doing two PC graphics card upgrades during a single weekend may make you see them as an enthusiast, but my disinterest in computer gaming belies this. Maybe it highlights that Linux operating systems need 3D more than might be expected. The Cinnamon desktop environment now issues messages if it is operating in 2D mode with software 3D rendering and GNOME always had the tendency to fall back to classic mode, as it had been doing when Sabayon was installed on my standby PC. However, there remain cases where Linux can rejuvenate older hardware and I installed Lubuntu onto a machine with 10-year-old technology on there (an 1100 MHz Athlon CPU, 1GB of RAM and 60GB of hard drive space in a case dating from 1998) and it works surprisingly well too.

It appears that having fancier desktop environments like GNOME Shell and Cinnamon means having the hardware on which it could run. For a while, I have been tempted by the possibility of a new PC, since even my main machine is not far from four years old either. However, I also spied a CPU, motherboard and RAM bundle featuring an Intel Core i5-4670 CPU, 8GB of Corsair Vengeance Pro Blue memory and a Gigabyte Z87-HD3 ATX motherboard included as part of a pre-built bundle (with a heat sink and fan for the CPU) for around £420. Even for someone who has used AMD CPU's since 1998, that does look tempting, but I'll hold off before making any such upgrade decisions. Apart from exercising sensible spending restraint, waiting for Linux UEFI support to mature a little more may be no bad idea either.

Update 2013-06-23: The new graphics card in my main machine works well and has reduced system error messages; Ubuntu GNOME 13.04 likely had issues with my old card. On my standby machine, I found and removed a rogue .fonts.conf file in my home directory, which dramatically improved font display. If you find this file on your system, consider removing or renaming it to see if it helps. Alternatively, adjusting font rendering settings can improve display quality, even on older systems like Debian 6 with GNOME 2. I may test these improvements on Debian 7.1 in the future.