At work we have been trying to ensure that all of our applications correctly support UTF-8. This includes making sure that our REST APIs can handle accented characters (ex: é) and emoji (?) when it makes sense to. This is somewhat complicated by the multiple database platforms we have (MSSQL, MySQL, and Postgres).
MSSQL is fairly straightforward - it usually involves swapping the text column type.
Postgres has given us no problems.
MySQL - well, that one is complicated. It requires using the correct character set and collation (ie: utf8mb4 / utf8mb4_unicode_ci). But we have some rather large databases that would require multiple hours of downtime to convert.
One Java service, though, has been tough to get right. Most of our Java services use Spring Boot, and our REST APIs all explicitly communicate using UTF-8 in the Accept and Content-Type headers. I verified that I could POST emoji to this service, and they would be persisted correctly into the underlying database.
This service has some special stuff going on in the background. There is a process attached to it that pulls in data from a Salesforce long polling event stream using a library called CometD. We verified that data in Salesforce was encoded correctly, but after pulling it out of the event stream, all characters that were not in the ASCII range were converted to question marks.
The developer who had built this originally more or less copied Salesforce's sample code. So, I downloaded the original sample code, went through the painful process of authenticating with Salesforce, and subscribed to a stream on our test instance. On my MacBook, I was seeing emoji come through on the messages. But our production servers are not MacBooks. We are running Docker containers. So, I took my sample code and ran it on top of our base Java Docker image, and sure enough, no emoji, just question marks.
After some more digging, it turns out that some code buried deep inside of CometD uses the Java system default character set, which can be set by the system property file.encoding. It can also come from the environment variable LANG. Sure enough, inside our Docker images this was being set, somewhere outside of our customizations, to en_US.US-ASCII. This never affected our REST APIs since we were always explicit about the encoding, but it would affect any other library that relied on the system default.
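To illustrate why the non-ASCII characters turned into question marks, here is a minimal Java sketch. It is not CometD's actual code; the class name DefaultCharsetDemo and the roundTrip helper are made up for this example. It passes the charset explicitly to mimic what happens implicitly when a library falls back to a US-ASCII system default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetDemo {

    // Hypothetical helper: encodes the text to bytes with the given charset,
    // then decodes those bytes again with the same charset.
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "café" followed by an emoji (U+1F600), written with escapes so the
        // source file's own encoding does not matter.
        String original = "caf\u00e9 \uD83D\uDE00";

        // What the JVM picked up as its default on this machine.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // UTF-8 can represent every character, so the text survives.
        boolean intact = roundTrip(original, StandardCharsets.UTF_8).equals(original);
        System.out.println("UTF-8 round trip intact: " + intact);

        // US-ASCII cannot represent the accented character or the emoji,
        // so String.getBytes substitutes '?' for them.
        System.out.println("US-ASCII round trip: " + roundTrip(original, StandardCharsets.US_ASCII));
    }
}
```

Code that calls text.getBytes() or new String(bytes) without a charset argument behaves like the US-ASCII case whenever the JVM default is US-ASCII, which matches the symptom we saw in the container.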
The fix was a one-line addition to our base Docker images:
ENV LANG=en_US.UTF-8
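One way to confirm the change took effect inside a container is to print what the JVM actually resolved. The sketch below is only an illustration; the class name CharsetStartupCheck and the isUtf8 helper are hypothetical and not part of our services. Note that on JDK 18 and later the default charset is UTF-8 regardless of LANG, so a check like this mostly matters on older runtimes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetStartupCheck {

    // Hypothetical helper: true only when the given charset is UTF-8.
    static boolean isUtf8(Charset charset) {
        return StandardCharsets.UTF_8.equals(charset);
    }

    public static void main(String[] args) {
        Charset actual = Charset.defaultCharset();

        // Show the inputs the JVM may have used to pick its default.
        System.out.println("LANG=" + System.getenv("LANG"));
        System.out.println("file.encoding=" + System.getProperty("file.encoding"));
        System.out.println("Default charset: " + actual);

        // Warn rather than fail, so the check is safe to run anywhere.
        if (!isUtf8(actual)) {
            System.err.println("WARNING: default charset is " + actual
                    + "; libraries relying on it may mangle non-ASCII text");
        }
    }
}
```

Running this on the base image before and after the ENV change should show the default charset switching to UTF-8, assuming nothing else in the image overrides file.encoding.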
Update After Posting
After posting this, I noticed that the emoji I put into this post was showing up as a ?. It turns out that after having the same WordPress site migrated across 3 or 4 servers over the years, the collations on the underlying MySQL tables were all sorts of messed up. There were 6 distinct collations being used, and the core posts table was using a us_latin variation. This has been fixed.