Using Local LLMs to Find High-Performance Algorithms

We need to talk about how we're using AI. Most developers I talk to are just using it to generate boilerplate or write mediocre unit tests. But if you're trying to squeeze every millisecond out of a performance-critical bottleneck, like matrix multiplication in Rust, the standard cloud-based models often fail to grasp the hardware-specific nuances. Using Local LLMs to iterate on low-level algorithms is where the real performance gains are happening right now.

I’ve seen plenty of “AI-optimized” code that looks great but fails under pressure. Recently, Stefano Bosisio shared a fascinating experiment where he used a MacBook Pro M3 and open-source models (specifically Mixtral 8x7B) to discover better matrix multiplication (matmul) algorithms. This wasn’t just a single prompt; it was a multi-agent workflow designed to refine code until it outperformed standard implementations.

Building a Multi-Agent Roundtable with Local LLMs

The core strategy here involves Microsoft Autogen. Instead of asking one model for a solution, you set up a roundtable of agents with specific roles: a Proposer for theory, a Coder for implementation, and a Tester for benchmarking. When you run Local LLMs in this type of agentic loop, you’re essentially automating the “refine and iterate” mantra that every senior dev lives by.
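
To make that concrete, here is a minimal sketch of what that roundtable could look like with AutoGen's Python API. The model name, local endpoint, and system prompts are my own assumptions for illustration; only the Proposer/Coder/Tester split mirrors the setup described above.

# Hypothetical AutoGen roundtable: Proposer -> Coder -> Tester
# Assumes a local OpenAI-compatible endpoint (e.g. Ollama) serving Mixtral 8x7B
import autogen

llm_config = {
    "config_list": [{
        "model": "mixtral-8x7b",                  # assumed local model name
        "base_url": "http://localhost:11434/v1",  # assumed local endpoint
        "api_key": "not-needed",
    }]
}

proposer = autogen.AssistantAgent(
    name="Proposer",
    system_message="Suggest matmul optimization strategies (blocking, SIMD, threading).",
    llm_config=llm_config,
)
coder = autogen.AssistantAgent(
    name="Coder",
    system_message="Implement the proposed strategy as a complete Rust function.",
    llm_config=llm_config,
)
tester = autogen.UserProxyAgent(
    name="Tester",
    human_input_mode="NEVER",
    code_execution_config={"work_dir": "bench", "use_docker": False},
)

chat = autogen.GroupChat(agents=[proposer, coder, tester], messages=[], max_round=12)
manager = autogen.GroupChatManager(groupchat=chat, llm_config=llm_config)
tester.initiate_chat(manager, message="Beat the naive matmul baseline for 1024x1024 f32 matrices.")

The point of the group chat is the loop itself: the Tester runs the Coder's output, feeds the numbers back, and the Proposer reacts to them on the next round.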

Furthermore, this approach sidesteps the "token limit" frustration. By saving state and context into a vector database (like Chroma), the agents can "remember" what worked in previous runs. This keeps the model from repeating the same mistakes, a trap that human-led debugging sessions fall into all the time.
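
Here is a minimal sketch of that memory layer using ChromaDB's Python client. The storage path, collection name, and metadata fields are mine, not from the original write-up.

# Hypothetical persistent memory for the agents, backed by ChromaDB
import chromadb

client = chromadb.PersistentClient(path="./agent_memory")   # assumed storage path
runs = client.get_or_create_collection("matmul_runs")

# After each benchmark, store what was tried and how it performed
runs.add(
    ids=["run-0003"],
    documents=["Loop blocking (64x64 tiles) + Rayon over rows"],
    metadatas=[{"runtime_ms": 512, "compiles": True}],
)

# Before proposing the next strategy, recall the most relevant past attempts
history = runs.query(query_texts=["SIMD vectorization of inner loop"], n_results=3)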

The Messy Reality: Debugging AI Confabulations

Let’s be honest: AI makes mistakes. Stefano’s experiment hit several walls, including what he calls “diagonal fallacies”—where the generated code only calculated diagonal blocks and ignored the rest of the matrix. I’ve seen similar logic gaps in Technical Debt in AI Development, where the code looks mathematically sound but is practically useless.

Specifically, the Local LLMs initially struggled with buffer overwrites and cache misses. By the third or fourth iteration, however, the model successfully implemented NEON SIMD intrinsics and Rayon parallelism, taking the baseline from 760ms down to 359ms. That cuts the runtime by more than half (roughly a 2x speedup), found by a model running on a consumer laptop.
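
For reference, a timing harness along these lines is what the "Tester" role would run between iterations. This is my own sketch, not the original benchmark: the crate layout and binary name are assumptions, and the 760ms/359ms figures come from Stefano's runs, not from this script.

# Hypothetical Tester-side harness: build the candidate, then time the release binary
import statistics
import subprocess
import time

subprocess.run(["cargo", "build", "--release"], check=True)

samples = []
for _ in range(5):
    start = time.perf_counter()
    subprocess.run(["./target/release/matmul_bench"], check=True)  # assumed binary name
    samples.append((time.perf_counter() - start) * 1000.0)

print(f"median runtime: {statistics.median(samples):.1f} ms")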

The Technical Stack: From Naive to NEON

To give you an idea of the jump, look at the difference between a naive Rust matmul and what the agent eventually produced using SIMD (Single Instruction, Multiple Data).

// The Naive Approach (Slow, lacks vectorization)
fn naive_matmul(a: &[f32], b: &[f32], c: &mut [f32], size: usize) {
    for i in 0..size {
        for j in 0..size {
            for k in 0..size {
                c[i * size + j] += a[i * size + k] * b[k * size + j];
            }
        }
    }
}

// The Optimized Approach (Rayon + NEON intrinsics)
// Discovered by Local LLMs during iteration
use std::arch::aarch64::*;
use rayon::prelude::*;

fn optimized_matmul(a: &[f32], b: &[f32], c: &mut [f32], size: usize) {
    c.par_chunks_mut(size).enumerate().for_each(|(i, row)| {
        for k in 0..size {
            // Broadcast a[i][k] across all four SIMD lanes
            let va = unsafe { vdupq_n_f32(a[i * size + k]) };
            // Step through the row four f32 lanes at a time (assumes `size` is a multiple of 4)
            for j in (0..size).step_by(4) {
                unsafe {
                    let vb = vld1q_f32(&b[k * size + j]);
                    let mut vc = vld1q_f32(&row[j]);
                    vc = vfmaq_f32(vc, va, vb);
                    vst1q_f32(&mut row[j], vc);
                }
            }
        }
    });
}

Notice the use of vfmaq_f32 (fused multiply-add) and vld1q_f32 (vector load). These are hardware-specific NEON instructions for ARM processors like the Apple M3. Most developers don't have these intrinsics memorized, but a properly tuned Local LLM agent can pull them from its training data when the "Tester" agent reports poor baseline performance.

Look, if this AI development stuff is eating up your dev hours, let me handle it. I’ve been wrestling with WordPress and custom integrations since the 4.x days.

Refactor and Ship It

We’re entering a phase where the “tiny” models on our laptops are capable of mastering complex optimizations that used to require massive clusters. You don’t need a BLAS-level library for every custom need if you can deploy an agentic workflow to find a specific hack or workaround for your bottleneck. Stop using AI just for chat; start using it as an automated architect.

Ahmad Wael
I'm a WordPress and WooCommerce developer with 15+ years of experience building custom e-commerce solutions and plugins. I specialize in PHP development, following WordPress coding standards to deliver clean, maintainable code. Currently, I'm exploring AI and e-commerce by building multi-agent systems and SaaS products that integrate technologies like Google Gemini API with WordPress platforms, approaching every project with a commitment to performance, security, and exceptional user experience.
