MySQL charset and collation recommendations

Let’s go through a humourous journey when setting up a MySQL database for our application. The story becomes much lesser funnier when you get to know that we had to go through this in production because we started with the wrong MySQL charset and collation.

TLDR: Always use utf8mb4 charset and utf8mb4_unicode_ci collation for your MySQL database if you don’t understand what to use. It will be more or less be for the best. Planning to use emojis, these settings rock, planning to add internationalization, they rock!

We simply created a MySQL database for our application and it was (too) easy.

Everything was beautiful, until alas

One day, we started adding translations. And there our Indic characters were replaced by ???. Damn! We understood more about MySQL, and changed charset of our databases from latin to utf8.

Everything was beautiful again, until alas

We found one day some characters started failing. We couldn’t understand why some characters worked, and some didn’t. Damn! We understood more and realized that utf8 is a misnomer with MySQL supporting only 3 bytes and not 4. So all characters greater than 3 bytes would be broken with utf8 charset. We changed our charset from utf8 to utf8mb4(aka True UTF-8).

Everything was beautiful, until alas

One day we needed to add DB filtering, and Django’s iexact started failing. We simply couldn’t do case insensitive searches. Damn, yet again! We dug through more of Django, and understood more about MySQL to realize that we were using utf8mb4_bin collation which doesn’t support case insensitive searches. We changed to using utf8mb4_unicode_ci

Everything is beautiful again, until next time.

Here’s a little more detail if you are interested

I would recommend this reading for you to understand more about MySQL’s implementation of UTF-8 and why it is broken.
Django’s stance is that the Database is set up much before the project is started, so the users’ would take care of this(which admittedly I didn’t). Which is why you will no recommendations around this. Here’s a ticket around this.
You might face troubles upgrading to utf8mb4 if you run an old version, Indices are limited to 767 bytes in InnoDB in MySQL 5.6. You might get this error:
```

#1071 - Specified key was too long; max key length is 767 bytes
```
This means CharField‘s which are indices can now just have a max_length of 767/4 ~ 191 characters. The solution is to either upgrade your MySQL version to MySQL 5.7 where the index size is now 3072 bytes, or allow longer indexes with InnoDB. Another hacky solution is to use Indices capped at 191 characters without reducing size of the field itself. This comment here helps you to achieve this in Django with MySQL 5.6 and fix the error: 1071 - Specified key was too long; max key length is 767 bytes
Here’s the Django ticket which documents that iexact breaks with binary collation. Why does insensitive search fail for Binary collation? Binary sequences are stored and compared as bytes. Which means that the concept of case breaks as strings are stored as binary. It is going to be extremely fast to do comparisons which match exactly though.

For anyone starting a new application, I would simply recommend to go with MySQL 5.7 and above, with utf8mb4 charset and utf8mb4_unicode_ci.

Source link

MySQL charset and collation recommendations

Here’s a little more detail if you are interested

Like this:

Related