We need to talk about how we handle missing information. In my 14 years of wrestling with WordPress and high-traffic WooCommerce sites, I’ve seen countless devs hit a wall because they lacked real-world data to test against. Whether you’re a grad student facing an exam for a niche course with no past papers or a senior engineer refactoring a legacy checkout flow, the core issue is the same: the Synthetic Training Data problem.
For some reason, the standard advice has become “just wing it” or “test on staging with 10 rows of dummy data.” That’s a recipe for a production disaster. When real-world training data isn’t available—like those non-existent past exams for a niche grad course—you have to get creative. You have to build your own reality. Furthermore, this isn’t just about passing tests; it’s about building resilient systems.
The Human Training Data Problem
I’ve been following the work of researchers who describe “human underfitting.” It’s that feeling of impending doom when you’ve read the documentation (the lecture notes) but can’t solve the actual problem (the exam). This gap exists because our mental models lack sufficient training data. In a dev context, this looks like a junior dev who knows the syntax of WP_Query but can’t architect a performant faceted search.
To bridge this gap, using Synthetic Training Data generated by LLMs like Claude or Gemini has become a genuine lifesaver. By feeding these models your notes, Q&A sessions, and specific constraints, you can generate novel “mock exams” or “test cases” that simulate the complexity of the real world. This approach, which Jonathan Yahav recently discussed in his deep dive into solving the human training data problem, is a game changer for learning.
Generating Synthetic Training Data for WordPress Testing
In the WordPress ecosystem, we often need synthetic data to test edge cases without exposing sensitive customer info. For instance, if you’re building a custom reporting engine for WooCommerce, you need thousands of varied orders. Specifically, you need to simulate different payment gateways, shipping zones, and tax rates to catch those nasty race conditions.
Here’s a “naive” way I’ve seen people try to generate data—usually just looping a single product. It’s too uniform and fails to catch real bugs.
// The Naive Approach - Don't do this
for ( $i = 0; $i < 100; $i++ ) {
    $order = wc_create_order();
    // Every order is identical: one product, default gateway, default everything.
    // Note: get_product() is long deprecated; wc_get_product() is the current API.
    $order->add_product( wc_get_product( 123 ) );
    $order->calculate_totals();
    $order->save();
}
A better senior-level approach is to use an LLM to generate a JSON schema of varied scenarios, then ingest that to create Synthetic Training Data. Consequently, your local environment actually mirrors the “messy” reality of production. Here is how I refactor that logic to be more useful:
<?php
/**
 * Generate varied synthetic WooCommerce orders for testing.
 * Prefixing with bbioon_ to keep the namespace clean.
 *
 * @param array $scenarios LLM-generated scenario definitions.
 */
function bbioon_generate_synthetic_orders( $scenarios ) {
    foreach ( $scenarios as $scenario ) {
        $order = wc_create_order();

        // Add varied products based on the LLM-generated scenario.
        foreach ( $scenario['products'] as $item ) {
            $product = wc_get_product( $item['id'] );
            if ( ! $product ) {
                continue; // Skip product IDs the LLM hallucinated.
            }
            $order->add_product( $product, $item['qty'] );
        }

        $order->set_billing_country( $scenario['country'] );
        $order->set_payment_method( $scenario['gateway'] );

        // Force specific statuses to test reporting hooks.
        $order->set_status( $scenario['status'] );

        $order->calculate_totals();
        $order->save();

        error_log( 'Synthetic Order Created: ' . $order->get_id() );
    }
}
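For reference, here’s the shape of the JSON payload I ask the LLM to produce. The field names line up with what the generator above reads; the product IDs, country codes, and gateway slugs are just illustrative placeholders — swap in whatever exists on your test site.

```json
[
  {
    "products": [ { "id": 123, "qty": 2 }, { "id": 456, "qty": 1 } ],
    "country": "DE",
    "gateway": "stripe",
    "status": "completed"
  },
  {
    "products": [ { "id": 789, "qty": 5 } ],
    "country": "US",
    "gateway": "cod",
    "status": "on-hold"
  }
]
```

I save this as a file and ingest it with `json_decode( file_get_contents( 'scenarios.json' ), true )` before passing the array to the generator. The point is variety: every scenario exercises a different gateway, country, and status combination.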
The Gotchas: Context Rot and Bias
However, you can’t just trust the machine blindly. There are two major risks when relying on Synthetic Training Data: “Context Rot” and subjective bias. Anthropic defines context rot as the model’s decreasing ability to recall information as the context window fills up. If you keep the same chat open for weeks, the synthetic data quality drops. Therefore, start fresh sessions for every new testing phase.
Bias is even more insidious. If you only prompt for “standard” orders, you’ll never see how your plugin handles a botched POST request from a failing payment gateway. You end up overfitting your mental model to a “perfect world” that doesn’t exist. I’ve written about similar issues in my post on Technical Debt in AI Development.
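To fight that bias, I explicitly prompt the LLM for hostile entries too. Here’s a hypothetical example of the kind of “messy” scenario worth including — an unknown product ID, a zero quantity, a blank country, a gateway that no longer exists, and a failed status. A defensive generator (and your reporting code) should survive all of it gracefully rather than assume clean input.

```json
{
  "products": [ { "id": 99999, "qty": 0 } ],
  "country": "",
  "gateway": "gateway_that_went_bankrupt",
  "status": "failed"
}
```

If your plugin only ever sees completed Stripe orders in testing, the first real-world payment failure will find the bugs for you — in production.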
Look, if this Synthetic Training Data stuff is eating up your dev hours, let me handle it. I’ve been wrestling with WordPress since the 4.x days.
Final Takeaway
LLMs are what you make of them. They can be a shortcut that makes you dumber, or a coach that helps you with the heavy lifting. By using Synthetic Training Data intentionally—starting separate chats, keeping an open mind, and augmenting with real-world snippets—you can master complex systems far faster than we could a decade ago. It’s about building a better mental model, one prompt at a time.
If you’re interested in the intersection of AI and development, check out my thoughts on AI in the real world or dive into the official IBM Research on Synthetic Data.