Generating commit messages and summarising text locally with Ollama running on Linux
For generating GitHub commit messages, I use aicommit, which I installed using Homebrew on both macOS and Linux. By default, it needs access to the OpenAI API, using a token for authentication. However, I noticed that its API usage is heavier than that of the Python scripting I use to summarise articles. To cut both the load and the associated cost, I began to look at locally run LLM options. Here, I discuss things mainly from a Linux point of view, particularly since I use Linux Mint for daily work.
Hardware Considerations
That led me to Ollama, which exposes a local API in the mould of what you get from OpenAI, along with a Python interface that has plenty of uses. This experimentation began on an iMac, where macOS can make all the available memory accessible to a model, offering flexibility when it comes to model selection. On a desktop PC or workstation with a discrete graphics card, the architecture is different: you depend on GPU processing for speed, and a model needs to fit within the card's own VRAM. Should the load fall on the CPU, the lag in performance cannot be missed. How a loaded model is being split between GPU and CPU can be checked with this command:
ollama ps
That discovery was made at the end of 2024, prompting me to do a system upgrade that only partially addressed the need, even if a quieter, cooler case was part of the new machine. Before that, I had tried a new Nvidia GeForce RTX 4060 graphics card with 8 GB of VRAM. That card remained in use, though its limited onboard memory meant that larger models overflowed into system memory, bringing the CPU into play and still substantially slowing processing. Though there are some reasonable models like llama3.1:8b that fit within 8 GB of VRAM, limitations became apparent with use. Hallucinations were among them, and they also afflicted the alternative options that I tried.
That led me to upgrade to a GeForce RTX 5060 Ti with 16 GB of VRAM, which meant that larger models could be used. Two of these have become my choices for different tasks: gpt-oss for GitHub commit messages and qwen3:14b for summarising blocks of text (falling back to Anthropic's API when the output is not to my expectations, not that this happens often). Both of these fit within the available memory, allowing for GPU processing without any CPU involvement.
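To see why the 8 GB card struggled while the 16 GB one copes, a rough back-of-envelope sketch helps. This is only a rule of thumb, not anything from Ollama itself: quantised weights occupy roughly the parameter count times the bits per weight divided by eight, and the overhead allowance for the KV cache and runtime buffers is an assumed figure that varies with context length.

```python
def estimate_vram_gb(params_billions, bits_per_weight, overhead_gb=1.5):
    """Rough rule of thumb: weights occupy params * bits/8 gigabytes,
    plus an assumed fixed allowance for KV cache and runtime buffers."""
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb + overhead_gb

# An 8B model at 4-bit quantisation: about 4 GB of weights plus overhead,
# inside 8 GB of VRAM; a 14B model at the same quantisation wants the 16 GB card
# once a decent context window is added.
print(estimate_vram_gb(8, 4))   # 5.5
print(estimate_vram_gb(14, 4))  # 8.5
```

The figures line up with experience: llama3.1:8b squeezes into 8 GB, while qwen3:14b only stays fully on the GPU with the larger card.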
Generating Commit Messages
To use aicommit with Ollama, the command needs to be changed to use the Ollama API, and it is better to define a function like this:
run_aicommit() {
    env OPENAI_BASE_URL="http://localhost:11434/v1" \
        OPENAI_API_KEY="ollama" \
        AICOMMIT_MODEL="gpt-oss" \
        /home/linuxbrew/.linuxbrew/bin/aicommit "$@"
}
This avoids altering any global variables: the env command sets up an ephemeral environment within which these values are available, so nothing clashes with variables that may already be set elsewhere. Using env here may not be strictly necessary, but it makes the intent clearer, and the variable names should be self-explanatory. Since aicommit was installed using Homebrew, the full path is given to avoid any ambiguity for the shell. At the end, "$@" passes along any parameters or modifiers like 2>/dev/null, which redirects stderr output so that it does not appear when the function is called. While you need to watch the volume of what is being passed to it, this approach works well and mostly produces sensible commit messages.
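The same ephemeral-environment idea carries over to Python, where subprocess accepts an env parameter that applies only to the child process. A minimal sketch, using the same variable values as the shell function above; the run_with_ollama_env name is just for illustration:

```python
import os
import subprocess

def run_with_ollama_env(cmd):
    """Run a command with the Ollama-pointing variables set for that process only."""
    env = {
        **os.environ,
        "OPENAI_BASE_URL": "http://localhost:11434/v1",
        "OPENAI_API_KEY": "ollama",
        "AICOMMIT_MODEL": "gpt-oss",
    }
    return subprocess.run(cmd, env=env, capture_output=True, text=True)

# The child process sees the variable; the parent environment is untouched.
result = run_with_ollama_env(["sh", "-c", "echo $OPENAI_BASE_URL"])
print(result.stdout.strip())  # http://localhost:11434/v1
```

As with env in the shell, the overrides vanish once the child process exits.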
Text Summarisation
For summarising text with a Python script, streaming the response helps to keep everything in hand. Here is the core code:
import re
import ollama

chunks = []
for part in ollama.chat(
    model=model,
    messages=[{'role': 'user', 'content': prompt}],
    options={'num_ctx': context, 'temperature': 0.2, 'top_p': 0.9},
    stream=True,
):
    chunks.append(part['message']['content'])
summary = re.sub(r'\s+', ' ', ''.join(chunks)).strip()
Above, a for loop iterates over each streamed chunk as it arrives, extracting the text content from part['message']['content'] and appending it to the chunks list. Once streaming is finished, ''.join(chunks) reassembles all the pieces into a single string. The re.sub(r'\s+', ' ', ...) call then collapses any intermediate sequences of whitespace characters (newlines, tabs, multiple spaces) down to a single space, and .strip() removes any leading or trailing whitespace, storing the cleaned result in summary.
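The cleanup step can be seen in isolation with a made-up string standing in for a model's raw output:

```python
import re

# Streamed model output often contains newlines, tabs and doubled spaces.
raw = "Line one.\n\nLine  two.\t End. "
summary = re.sub(r'\s+', ' ', raw).strip()
print(summary)  # Line one. Line two. End.
```

Every run of whitespace collapses to a single space, and the trailing space goes with .strip(), leaving a single tidy line.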
Within the loop itself, an ollama.chat() call initiates an interaction with the specified model (defined as qwen3:14b earlier in the code), passing the user's prompt as a message. A few options control the behaviour: num_ctx sets the context window size, with 4096 a sensible limit here to ensure that everything remains on the GPU; a temperature of 0.2 keeps the output focussed and close to deterministic; and a top_p value of 0.9 applies nucleus sampling, restricting generation to the smallest set of tokens whose cumulative probability reaches that threshold. Setting stream=True means the model returns its response incrementally as a series of chunks, rather than waiting until generation is complete.
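To make the top_p setting concrete, here is an illustrative pure-Python sketch of nucleus sampling's filtering step, with a toy probability distribution; this is not Ollama's internal code, just the idea behind the parameter:

```python
def nucleus_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for token, p in ranked:
        kept.append(token)
        total += p
        if total >= top_p:
            break
    return kept

# A toy next-token distribution: with top_p=0.9, the unlikely tail is discarded
# before a token is sampled from what remains.
probs = {'the': 0.5, 'a': 0.3, 'an': 0.15, 'zebra': 0.05}
print(nucleus_filter(probs, 0.9))  # ['the', 'a', 'an']
```

Lower top_p values prune the candidate pool more aggressively, which pairs naturally with the low temperature to keep summaries on track.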
A Beneficial Outcome
Most of the time, local LLM usage suffices for my needs, reserving remote models from the likes of OpenAI or Anthropic for when they add real value. The hardware outlay remains a sizeable investment, though, even if it adds significantly to one's personal privacy. For a long time, graphics cards did not interest me beyond basic functions like desktop display, making this a change from how I viewed such devices before the advent of generative AI.