Got a call from a client last week. They’d just moved their WooCommerce store from some ancient hosting setup, and things were… weird. Their European customers were complaining that any reviews or contact forms submitted with special characters—think umlauts, accent marks, the whole lot—were showing up as garbled nonsense. Just a sea of those black diamond question mark symbols (�). Total mess.
It was a classic WordPress UTF-8 encoding problem. My first thought, the gut reaction, was to just find where the data was being saved and slap a utf8_encode() on it. It’s a quick and dirty fix you see plastered all over Stack Overflow. And yeah, it might have worked for some of the characters, but it’s the wrong tool for the job. It assumes the original text is ISO-8859-1, which is a huge gamble and often makes things worse (PHP itself deprecated the function in 8.2). You end up with double-encoded garbage, or you break text that was perfectly fine. Not good.
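To make that gamble concrete, here is a minimal sketch in plain PHP (no WordPress, just the mbstring extension). It uses mb_convert_encoding() from ISO-8859-1, which performs the same conversion as utf8_encode() without the deprecation notice. The helper name is made up for illustration.

```php
<?php
// Hypothetical helper: mimics utf8_encode() by treating every input byte
// as an ISO-8859-1 character and re-encoding it as UTF-8.
function example_blind_latin1_to_utf8( string $bytes ): string {
    return mb_convert_encoding( $bytes, 'UTF-8', 'ISO-8859-1' );
}

$already_utf8 = "caf\xC3\xA9"; // "café", already correctly encoded as UTF-8.
$double       = example_blind_latin1_to_utf8( $already_utf8 );

echo bin2hex( $already_utf8 ), "\n"; // 636166c3a9
echo bin2hex( $double ), "\n";       // 636166c383c2a9, which renders as "cafÃ©"
```

The nasty part is that the double-encoded result is still valid UTF-8, so nothing downstream will flag it. Once it is saved, you are left guessing again.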
Why Most UTF-8 Fixes Are Just Guesswork
Here’s the kicker: for years, WordPress has had functions like seems_utf8() that encouraged this kind of guesswork. The function name itself tells you everything you need to know. It *seems* right? That’s not a solid foundation for handling data reliably, especially for an e-commerce store. The problem is that a string in PHP is just a sequence of bytes. Without knowing the encoding, you have no idea if the byte value 0xA9 is supposed to be © (which it is in ISO-8859-1), the second byte of a UTF-8 character like é, or some other character entirely. Guessing is a recipe for data corruption.
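Here is a small sketch of how many readings a single byte can have, again in plain PHP with mbstring. The choice of ISO-8859-5 as the second encoding is just mine for illustration; any other single-byte encoding would make the same point.

```php
<?php
$byte = "\xA9";

// Read as ISO-8859-1, the byte is U+00A9, the copyright sign.
$as_latin1 = mb_convert_encoding( $byte, 'UTF-8', 'ISO-8859-1' );

// Read as ISO-8859-5 (Cyrillic), the same byte is U+0409, "Љ".
$as_cyrillic = mb_convert_encoding( $byte, 'UTF-8', 'ISO-8859-5' );

// On its own it is not valid UTF-8 at all: 0xA9 is a continuation byte.
$valid_alone = mb_check_encoding( $byte, 'UTF-8' );

// After a 0xC3 lead byte it becomes the second half of "é" (U+00E9).
$valid_pair = mb_check_encoding( "\xC3\xA9", 'UTF-8' );

echo $as_latin1, ' ', $as_cyrillic, "\n";
var_dump( $valid_alone, $valid_pair );
```

Same byte, three different answers, and nothing in the string itself tells you which one the sender meant.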
Thankfully, WordPress 6.9 is finally modernizing its approach to UTF-8 support, and it’s a change I’m genuinely happy to see. The core team has deprecated those old, misleading functions and given us much more reliable tools to work with. This is all laid out in a recent dev note on the Make WordPress Core blog, which you can find at https://make.wordpress.org/core/2025/11/18/modernizing-utf-8-support-in-wordpress-6-9/.
The Right Way to Handle WordPress UTF-8 Encoding
Instead of guessing, the new approach is all about validation. We now have two key functions that are built for the job: wp_is_valid_utf8() and wp_scrub_utf8(). One checks, the other cleans. Simple as that.
- wp_is_valid_utf8( $string ): This does exactly what it says on the tin. It returns true if the string is a valid sequence of UTF-8 bytes, and false if not. No guesswork.
- wp_scrub_utf8( $string ): If you absolutely have to accept a string that might have invalid bytes, this function is your friend. It replaces the invalid parts with the standard Unicode Replacement Character (�), so you can safely save it to the database without causing downstream issues, like breaking an XML feed.
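If you want to poke at the check-versus-clean split outside of WordPress, here is a rough sketch. The example_* wrappers are hypothetical: they call the WordPress 6.9 functions when those exist and otherwise fall back to mbstring, which is my stand-in and may not match WordPress byte for byte (for instance in how many replacement characters a run of bad bytes produces).

```php
<?php
// Hypothetical wrappers: use the WordPress 6.9 functions when available,
// otherwise fall back to mbstring so the sketch runs in plain PHP.
function example_is_valid_utf8( string $bytes ): bool {
    if ( function_exists( 'wp_is_valid_utf8' ) ) {
        return wp_is_valid_utf8( $bytes );
    }
    return mb_check_encoding( $bytes, 'UTF-8' );
}

function example_scrub_utf8( string $bytes ): string {
    if ( function_exists( 'wp_scrub_utf8' ) ) {
        return wp_scrub_utf8( $bytes );
    }
    // mb_scrub() substitutes "?" by default; switch to U+FFFD for this call.
    $previous = mb_substitute_character();
    mb_substitute_character( 0xFFFD );
    $clean = mb_scrub( $bytes, 'UTF-8' );
    mb_substitute_character( $previous );
    return $clean;
}

$good = "Gr\xC3\xBC\xC3\x9Fe"; // "Grüße" encoded as UTF-8.
$bad  = "Gr\xFC\xDFe";         // "Grüße" encoded as ISO-8859-1: invalid as UTF-8.

var_dump( example_is_valid_utf8( $good ) ); // bool(true)
var_dump( example_is_valid_utf8( $bad ) );  // bool(false)

// Scrubbing does not recover the umlauts; it only makes the string safe to store.
$scrubbed = example_scrub_utf8( $bad );
var_dump( example_is_valid_utf8( $scrubbed ) ); // bool(true)
```

Note what scrubbing does not do: it never turns the Latin-1 bytes back into "Grüße". It only guarantees that what you store is well-formed.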
So, for my client’s site, the fix wasn’t to encode anything. It was to validate the input when the form was submitted. Here’s what the logic looks like now:
<?php
// WordPress adds slashes to $_POST, so unslash the raw field first.
$submitted_review = isset( $_POST['customer_review'] )
    ? wp_unslash( $_POST['customer_review'] )
    : '';

// Stop guessing and just validate the input.
// The is_string() check guards against array input like customer_review[].
if ( ! is_string( $submitted_review ) || ! wp_is_valid_utf8( $submitted_review ) ) {
    // The input is not valid UTF-8. Reject it.
    // You could also try to scrub it, but rejecting is often safer.
    wp_die( 'Invalid character encoding detected. Please use UTF-8.' );
}

// If we're here, the string is valid. Proceed with saving it.
$safe_review = sanitize_textarea_field( $submitted_review );

// ... rest of the save logic ...
So, What’s the Point?
The real takeaway here is a shift in mindset. You have to stop trying to magically “fix” broken strings and start enforcing a standard at the point of entry. It’s not about converting character sets on the fly; it’s about deciding what your application will accept. Period.
- Don’t Guess: If you don’t know a string’s encoding for a fact, don’t try to convert it.
- Validate on Input: Use wp_is_valid_utf8() to check data coming from forms or APIs. Reject anything that isn’t valid.
- Scrub, Don’t Strip: If you must clean a string, use wp_scrub_utf8() to replace bad bytes. Never just strip them out, as that can create new, dangerous byte combinations.
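That last point is easier to believe once you see it. Below is a minimal sketch in plain PHP with mbstring: the helper is hypothetical and stands in for both a naive "strip" routine and a scrubber such as wp_scrub_utf8(). The input is a contrived string where an invalid byte hides a script tag from any filter that looks for the literal text.

```php
<?php
// Hypothetical helper: re-encode UTF-8 to UTF-8, handling invalid bytes
// according to $substitute ('none' drops them, 0xFFFD replaces them).
function example_convert_with_substitute( string $bytes, $substitute ): string {
    $previous = mb_substitute_character();
    mb_substitute_character( $substitute );
    $result = mb_convert_encoding( $bytes, 'UTF-8', 'UTF-8' );
    mb_substitute_character( $previous );
    return $result;
}

// 0xFF can never appear in valid UTF-8, so this is not a "<script>" tag yet.
$input = "<scr\xFFipt>alert(1)</scr\xFFipt>";

// Stripping removes the bad byte and glues the two halves together.
$stripped = example_convert_with_substitute( $input, 'none' );

// Scrubbing leaves U+FFFD in its place, so the halves stay apart.
$scrubbed = example_convert_with_substitute( $input, 0xFFFD );

var_dump( $stripped === '<script>alert(1)</script>' ); // bool(true)
var_dump( strpos( $scrubbed, '<script>' ) );           // bool(false)
```

Stripping quietly manufactures the exact string you were trying to keep out. Replacing keeps the damage visible and the pieces separate.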
Look, this stuff gets complicated fast. If you’re tired of debugging someone else’s mess and just want your site to work, drop my team a line. We’ve probably seen it before.
Character encoding might not be the sexiest topic, but getting it right is the difference between a reliable, global-ready website and a ticking time bomb of corrupted data. Trust me on this.