<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>jpcarzolio &#187; memcached</title>
	<atom:link href="http://jpcarzolio.com/tag/memcached/feed/" rel="self" type="application/rss+xml" />
	<link>http://jpcarzolio.com</link>
	<description>Juan&#039;s personal website and blog</description>
	<lastBuildDate>Wed, 25 Mar 2020 18:57:30 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=4.2.38</generator>
	<item>
		<title>Memcached extensions for PHP: some caveats</title>
		<link>http://jpcarzolio.com/2015/memcached-extensions-for-php-some-caveats/</link>
		<comments>http://jpcarzolio.com/2015/memcached-extensions-for-php-some-caveats/#comments</comments>
		<pubDate>Mon, 03 Aug 2015 20:19:32 +0000</pubDate>
		<dc:creator><![CDATA[Juan]]></dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[debugging]]></category>
		<category><![CDATA[memcached]]></category>
		<category><![CDATA[PHP]]></category>

		<guid isPermaLink="false">https://jpcarzolio.com/?p=143</guid>
		<description><![CDATA[There are two PHP extensions to work with Memcached, which go by the somewhat unfortunate names of Memcache and Memcached (note the missing ending &#8216;d&#8217; in the first one). In this post I&#8217;m going to share my experience of using them together, either to migrate from one to the other or use them simultaneously, and...]]></description>
				<content:encoded><![CDATA[<p>There are two PHP extensions to work with Memcached, which go by the somewhat unfortunate names of <a href="http://php.net/manual/en/book.memcache.php">Memcache</a> and <a href="http://php.net/manual/en/book.memcached.php">Memcached</a> (note the missing ending &#8216;d&#8217; in the first one). In this post I&#8217;m going to share my experience of using them together, either to migrate from one to the other or use them simultaneously, and I&#8217;ll also describe a really strange issue I once ran into, and how to avoid it. As in my <a href="http://jpcarzolio.com/2015/memcached-and-careless-preloading/">previous post</a>, I ran into the problems described below while working on a high traffic Facebook app a few years ago.</p>
<p>One day, after some code changes, our memcached servers started behaving strangely. Intermittently, for spans of several minutes, nothing would work: all caching operations &#8212; set, get, add, anything &#8212; would fail. The problem had clearly started after that last code push, but the pushed code looked innocent enough, and the strangest part was that caching operations were failing everywhere, not just within the new code.</p>
<p>Upon further inspection, I found that the problem was not on the servers&#8217; side: all operations worked fine when talking directly to a server via telnet, and some network monitoring revealed that the web servers were not actually talking to the memcached servers at all! So it was a client problem, but a weird one: some innocent caching code in one part of the app &#8212; which, by the way, didn&#8217;t involve any configuration or setting changes &#8212; was somehow breaking cache access across the whole app.</p>
<p>Up to that point, we had been using only the Memcache extension, and I suspected there might be a bug in it that was somehow triggered by our latest code. So I decided to try the other extension, Memcached, to avoid this hypothetical bug.</p>
<p>The extensions&#8217; APIs are pretty similar, since they expose the same underlying functionality, but not identical. That means you can&#8217;t just swap <code>new Memcache()</code> with <code>new Memcached()</code> and leave the rest of the code intact. Among the differences are:</p>
<ul>
<li>Each has different <code>addServer()</code> parameters</li>
<li>Memcached has <code>get()</code> and <code>getMulti()</code>, whereas Memcache only has <code>get()</code>, to which you can pass either a single key or an array of keys</li>
<li>Each has different parameters for <code>set()</code>, <code>add()</code>, etc. because Memcache allows setting flags and Memcached doesn&#8217;t</li>
</ul>
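<p>A rough side-by-side of those differences (hostname and data are illustrative, and optional arguments are omitted):</p>

```php
<?php
$data = array('name' => 'Juan');

// Memcache: set($key, $value, $flags, $expire) -- note the $flags argument
$old = new Memcache();
$old->addServer('cache1.example.com', 11211);
$old->set('user:42', $data, 0, 300);
$one  = $old->get('user:42');                   // single key...
$many = $old->get(array('user:42', 'user:43')); // ...or an array of keys

// Memcached: set($key, $value, $expiration) -- no flags anywhere
$new = new Memcached();
$new->addServer('cache1.example.com', 11211);
$new->set('user:42', $data, 300);
$one  = $new->get('user:42');
$many = $new->getMulti(array('user:42', 'user:43'));
```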
<p>Since Memcache had worked perfectly fine up to that point, and I had no proof of an actual bug in it &#8212; nor any guarantees that its quasi-homonymous replacement would be any better &#8212; I wasn&#8217;t willing to ditch it altogether and rewrite all the code to use the other extension. So we wrote an adapter class to wrap the new extension and use it with our existing code. The idea was to be able to use both interchangeably, switching from one to the other to see whether the bug disappeared when switching to Memcached.</p>
<p>But the two extensions didn&#8217;t play well together with that simple approach, and new problems came up. Sometimes Memcache would read back a chunk of binary garbage when an object had been stored using Memcached, or one extension would fail to deserialize arrays or numbers stored by the other, returning strings instead. I was trying to solve a problem, but was creating new problems instead&#8230;</p>
<p>After some struggle, I finally figured things out. The binary garbage was due to compression: by default, Memcached gzips values bigger than 100 bytes, so some objects were compressed and others were not, causing Memcache to read some of them back as binary garbage. That was easily solved by disabling compression. The serialization issues arose because both extensions use the flags field (or part of it) to indicate data types, but with different conventions (storing the type is needed because strings and ints are stored unmodified, whereas arrays and objects are serialized). The solution was to handle serialization in the adapter (serializing everything except integers) and only pass strings to the extensions&#8217; methods, taking advantage of the fact that both extensions store strings unserialized and with a flags value of zero.</p>
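<p>A minimal sketch of that adapter idea (the class name and the <code>s:</code> marker prefix are mine for illustration, not our actual code):</p>

```php
<?php
// Hypothetical adapter: serialize everything except integers ourselves, so
// both extensions only ever see plain strings stored with flags = 0.
class CacheAdapter
{
    private $backend; // a Memcache or Memcached instance

    public function __construct($backend)
    {
        $this->backend = $backend;
        if ($backend instanceof Memcached) {
            // Avoid the binary-garbage problem: no transparent compression.
            $backend->setOption(Memcached::OPT_COMPRESSION, false);
        }
    }

    private function pack($value)
    {
        // Integers go through untouched so incr/decr keep working on them;
        // everything else is serialized, with a marker so reads can tell.
        return is_int($value) ? (string) $value : 's:' . serialize($value);
    }

    private function unpack($raw)
    {
        if (!is_string($raw)) {
            return false; // miss or error
        }
        return substr($raw, 0, 2) === 's:'
            ? unserialize(substr($raw, 2))
            : (int) $raw;
    }

    public function set($key, $value, $expire = 0)
    {
        return $this->backend instanceof Memcached
            ? $this->backend->set($key, $this->pack($value), $expire)
            : $this->backend->set($key, $this->pack($value), 0, $expire); // flags = 0
    }

    public function get($key)
    {
        return $this->unpack($this->backend->get($key));
    }
}
```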
<p>So, with all those fixes in place, we were finally able to use both extensions interchangeably… only to find that the original bug was also present when using Memcached! The same mysterious behaviour! Back to square one…</p>
<p>I had to look somewhere else. What about the latest code? I scrutinized that &#8220;innocent&#8221; code, the one triggering the bug in the first place, one more time. It performed <code>increment</code> and <code>decrement</code> operations, which we had never used before. That got me thinking, and led me to the <a href="https://raw.githubusercontent.com/memcached/memcached/master/doc/protocol.txt">memcached protocol documentation</a>, where it said the <code>incr</code> and <code>decr</code> commands only worked on decimal representations of <em>unsigned</em> integers. It turned out we were using them on some values initialized with -1. Bingo! There was clearly a problem there, since those operations were guaranteed to fail on -1. But what did those specific failed operations have to do with system-wide malfunction?</p>
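<p>In PHP terms, the failing pattern looked roughly like this (hostname and key name are made up):</p>

```php
<?php
// incr/decr operate on decimal representations of unsigned integers,
// so a counter initialized to -1 can never be incremented.
$mc = new Memcache();
$mc->addServer('cache1.example.com', 11211);

$mc->set('pending_jobs', -1, 0, 0);   // stored as the string "-1"
$ok = $mc->increment('pending_jobs'); // the server replies with an error;
                                      // the extension returns false
```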
<p>Finally, I found the last piece of the puzzle: it appears that both Memcache and Memcached have checks in place to temporarily block a server that is malfunctioning (for around 3 and 15 minutes, respectively). And when the server returns an error message on <code>incr</code>/<code>decr</code> failure, they both seem to misinterpret that as server malfunction, blocking the server for several minutes and causing all operations to fail.</p>
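<p>For reference, these are the knobs that govern that blocking behaviour in each extension (defaults vary by version, so treat the numbers below as illustrative):</p>

```php
<?php
// Memcache: the sixth addServer() argument, retry_interval, is how long
// (in seconds) a server marked as failed is skipped before being retried.
$old = new Memcache();
$old->addServer('cache1.example.com', 11211, true, 1, 1, 15);

// Memcached: how many failures mark a server as dead, and how long to
// wait before talking to it again.
$new = new Memcached();
$new->setOption(Memcached::OPT_SERVER_FAILURE_LIMIT, 5);
$new->setOption(Memcached::OPT_RETRY_TIMEOUT, 15);
$new->addServer('cache1.example.com', 11211);
```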
<p>All that trouble caused by a negative number! Well, no. Actually, the real culprit is the unfortunate error handling in the memcached client libraries, of course.</p>
]]></content:encoded>
			<wfw:commentRss>http://jpcarzolio.com/2015/memcached-extensions-for-php-some-caveats/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Memcached and careless preloading</title>
		<link>http://jpcarzolio.com/2015/memcached-and-careless-preloading/</link>
		<comments>http://jpcarzolio.com/2015/memcached-and-careless-preloading/#comments</comments>
		<pubDate>Thu, 30 Jul 2015 14:32:51 +0000</pubDate>
		<dc:creator><![CDATA[Juan]]></dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[debugging]]></category>
		<category><![CDATA[memcached]]></category>
		<category><![CDATA[optimization]]></category>

		<guid isPermaLink="false">https://jpcarzolio.com/?p=139</guid>
		<description><![CDATA[This is a short story about how I became aware of the dangers of &#8220;careless preloading&#8221;, while learning a bit about memcached internals along the way. A few years ago, while working on a high traffic app on the Facebook platform, I ran across a caching bug. All of a sudden, our memcached servers had...]]></description>
				<content:encoded><![CDATA[<p>This is a short story about how I became aware of the dangers of &#8220;careless preloading&#8221;, while learning a bit about memcached internals along the way.</p>
<p>A few years ago, while working on a high traffic app on the Facebook platform, I ran across a caching bug. All of a sudden, our memcached servers had stopped working &#8212; kind of. Stuff could be read, but nothing could be added to the cache: all <code>set</code> operations failed. Actually, it was stranger than that: <em>a few</em> sets did go through.</p>
<p>I had just fixed a broken preload script (which had never actually worked, but nobody had ever noticed!) shortly before caching hell broke loose, so that script was my primary suspect. I decided to try the app in a test environment with and without that script, and bingo: without that preload, memcached worked normally, whereas running that preload caused the erratic cache behaviour we were experiencing in production.</p>
<p>So, what&#8217;s a <em>preload</em>, anyway? It&#8217;s a script that fills the cache with data beforehand, to avoid poor performance while the cache is gradually fed objects after each miss. I hadn&#8217;t written any preloads, but back then I didn&#8217;t stop to think about whether they were a good idea, either. It turned out that ours were a bad idea, because a preload done without proper consideration always is. Ultimately, it boils down to something that has long been known to be&#8230; <em>the root of all evil</em>. Yep, I&#8217;m talking about <em>premature optimization</em>.</p>
<p>So, when I fixed that preload script (a trivial edit), it&#8230; uhm, started working. And it turned out that its job was to select a whole freakin&#8217; table &#8212; about 1.5M rows &#8212; and load it into memcached. But, hey, that would be, at most, inefficient, right? In the worst case, it <em>might</em> displace useful data, replacing it with useless data, but things would fix themselves with usage, right? Wrong! Enter memcached <em>slabs</em>.</p>
<p>For speed and efficiency reasons, Memcached has a custom memory manager, which consists of &#8220;slabs&#8221;, each of which can be assigned any number of 1MB &#8220;pages&#8221;, which are in turn split into a number of equally sized &#8220;chunks&#8221;, each of which may hold an individual object. Slabs hold objects within a specific size range, starting at 88 bytes (I think) and growing exponentially in steps of 1.25x. So for instance there may be a 1280 byte slab, which contains any number of 1MB pages split into many chunks of 1280 bytes, each holding (unless empty) an object whose size will be under 1280 bytes (including key and flags) and above 1024 bytes (the maximum allowed for the previous slab). When you perform a <code>set</code>, Memcached looks at the object size and determines which slab it belongs in, and looks for a free chunk in one of the pages, assigning the slab a new page if needed (i.e. if all its pages are full, or it has no pages at all because no objects of this size were stored before). And once assigned to a slab, a page stays assigned forever and it can&#8217;t be reassigned.</p>
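<p>The size-class progression is easy to visualize. This sketch ignores the byte alignment the real daemon applies to chunk sizes, so the exact numbers differ slightly from a real server&#8217;s:</p>

```php
<?php
// Sketch of memcached's slab size classes: chunk sizes grow by a factor
// of 1.25 (the daemon's -f option) from the smallest class up to the
// 1MB page size.
$chunk   = 88;   // smallest chunk size, per the description above
$factor  = 1.25;
$classes = array();
while ($chunk <= 1024 * 1024) {
    $classes[] = $chunk;
    $chunk = (int) ceil($chunk * $factor);
}

// An incoming object lands in the smallest class that fits it:
foreach ($classes as $size) {
    if ($size >= 1200) {
        echo "a 1200-byte object goes in the {$size}-byte slab\n"; // 1300
        break;
    }
}
```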
<p>That explains the problem. Our preload scripts were executed after we had to resize our cache pool, or restart servers after security updates, or restart the daemon after a mysterious crash, or migrate to a different EC2 instance type, so they acted on an empty cache (no pages assigned). And this script was storing 1.5M objects with sizes in a rather specific range, causing all or most of the pages to be assigned to a few specific slabs, leaving none or too few available for the rest of the slabs. After a short while, the result was that, unless an incoming object&#8217;s size happened to match one of the few existing slabs, it was discarded. Regardless of the amount of empty space or stale data, those objects didn&#8217;t make it.</p>
<p>So the fix consisted of simply removing that preload. The fact that we had never noticed that this particular preload was broken hinted that it wasn&#8217;t really necessary after all. And after some investigation and testing, it turned out that in normal operation &#8212; that is, caching objects on demand &#8212; only around 35k of the 1.5M objects were stored in the cache during the first hour, and there was little to no performance impact during this ramp-up period.</p>
<p>My point is not that preloading itself is inherently bad. Storing those 1.5M objects upfront could have been a good idea in some situation, but it wasn&#8217;t our case. My point is: before preloading data &#8212; before optimizing <em>anything</em> &#8212; make sure it&#8217;s necessary. If it is, make sure you&#8217;re being selective enough to preload useful data. And keep in mind that careless preloading may not only be useless or inefficient, but possibly harmful, as there may be unforeseen side effects.</p>
]]></content:encoded>
			<wfw:commentRss>http://jpcarzolio.com/2015/memcached-and-careless-preloading/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
