There are two PHP extensions for working with memcached, which go by the somewhat unfortunate names of Memcache and Memcached (note the missing final ‘d’ in the first one). In this post I’m going to share my experience of using them together, either to migrate from one to the other or to use them simultaneously, and I’ll also describe a really strange issue I once ran into, and how to avoid it. As in my previous post, I ran into the problems described below while working on a high-traffic Facebook app a few years ago.
One day, after some code changes, our memcached servers seemed to start behaving strangely. Intermittently, for spans of several minutes, nothing would work. All caching operations (set, get, add, anything) would fail. The problem had clearly started after that last code push, but the pushed code looked innocent enough, and the strangest part was that caching operations were failing everywhere, not just within the new code.
Upon further inspection, I found that the problem was not on the servers’ side: all operations worked fine when talking directly to a server via telnet, and some network monitoring revealed that the webservers were not actually talking to the memcached servers at all! So it was a client problem, but a weird one: some innocent caching code in one part of the app — which, by the way, didn’t involve any configuration or setting changes — was somehow breaking cache access in the whole app.
At that point, we had always been using the Memcache extension only, and I suspected there might be a bug in it that was somehow triggered by our latest code. So I decided to try the other extension, Memcached, to avoid this hypothetical bug.
The extensions’ APIs are pretty similar, since they expose the same underlying functionality, but not identical. That means you can’t just swap `new Memcache()` with `new Memcached()` and leave the rest of the code intact. Among the differences are:
- Each has different `addServer()` parameters
- Memcached has `get()` and `getMulti()`, whereas Memcache only has `get()`, to which you can pass either a single key or an array of keys
- Each has different parameters for `set()`, `add()`, etc. because Memcache allows setting flags and Memcached doesn’t
Since Memcache had worked perfectly fine up to that point, and I had no proof of an actual bug in it — nor any guarantees that its quasi-homonymous replacement would be any better — I wasn’t willing to ditch it altogether and rewrite all the code to use the other extension. So we wrote an adapter class to wrap the new extension and use it with our existing code. The idea was to be able to use both interchangeably, switching from one to the other to see whether the bug disappeared when switching to Memcached.
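A minimal sketch of what the core of such an adapter could look like follows. It is not the original code: the helper names (`looksLikeMemcached`, `cacheSet`, `cacheGetMany`) are made up for illustration, and telling the two clients apart by the presence of `getMulti()` is an assumption chosen so the sketch does not depend on either extension being installed.

```php
<?php
// Hypothetical helpers; $client is expected to be a Memcache or Memcached instance.

function looksLikeMemcached(object $client): bool
{
    // Assumption: only the Memcached extension exposes getMulti().
    return method_exists($client, 'getMulti');
}

function cacheSet(object $client, string $key, string $value, int $ttl = 0): bool
{
    if (looksLikeMemcached($client)) {
        // Memcached::set($key, $value, $expiration): no flags argument.
        return (bool) $client->set($key, $value, $ttl);
    }
    // Memcache::set($key, $var, $flag, $expire): flags come before the TTL.
    return (bool) $client->set($key, $value, 0, $ttl);
}

function cacheGetMany(object $client, array $keys): array
{
    // Memcached has a dedicated multi-get; Memcache accepts an array in get().
    $found = looksLikeMemcached($client)
        ? $client->getMulti($keys)
        : $client->get($keys);

    // Both calls can return false on failure; normalize that to "nothing found".
    return is_array($found) ? $found : [];
}
```

Callers then only ever see `cacheSet()` and `cacheGetMany()`, so switching extensions means changing the object that is passed in and nothing else.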
But they didn’t get on well with each other using that simple approach, and new problems came up. Sometimes Memcache would read back a chunk of binary garbage when an object was stored using Memcached, or they would fail to deserialize arrays or numbers when reading objects stored by each other, returning strings instead. I was trying to solve a problem, but was creating new problems instead…
After some struggle, I finally figured things out. The binary garbage was due to compression: by default, Memcached gzips values bigger than 100 bytes. So some objects were compressed and others were not, and that caused Memcache to return some objects as binary garbage. That was easily solved by disabling compression. The serialization issues were due to the fact that both extensions use the flags field (or part of it) to indicate data types, but they use different conventions (storing the type is needed because strings and ints are stored unmodified, but arrays and objects are serialized). The solution was to handle serialization in the adapter (serializing everything except integers) and only pass strings to the extensions’ methods, taking advantage of the fact that both extensions store strings unserialized and with a flags value of zero.
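The serialization rule described above can be sketched as a pair of pure functions. The names `encodeForCache` and `decodeFromCache` are hypothetical, and this is only one way of implementing "serialize everything except integers", not the adapter that was actually written.

```php
<?php
// Hypothetical adapter helpers; the names are illustrative only.

function encodeForCache($value): string
{
    // Integers are stored as plain decimal strings; everything else,
    // including ordinary strings, goes through serialize().
    return is_int($value) ? (string) $value : serialize($value);
}

function decodeFromCache(string $raw)
{
    // A bare decimal string can only have come from an integer, because
    // a real string such as "123" is stored as s:3:"123";
    if (preg_match('/^-?\d+\z/', $raw) === 1) {
        return (int) $raw;
    }
    // Caveat: unserialize() returns false both for a stored false ("b:0;")
    // and for malformed input, so the two cases are not distinguished here.
    return unserialize($raw);
}
```

Because both functions deal only in strings, neither extension ever has to pick a type flag of its own. On the Memcached side, compression can be switched off with `setOption(Memcached::OPT_COMPRESSION, false)`.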
So, with all those fixes in place, we were finally able to use both extensions interchangeably… only to find that the original bug was also present when using Memcached! The same mysterious behaviour! Back to square one…
I had to look somewhere else. What about the latest code? I scrutinized that “innocent” code, the one triggering the bug in the first place, one more time. It performed `increment` and `decrement` operations, which we had never used before. That got me thinking, and led me to the memcached protocol documentation, where it said the `incr` and `decr` commands only worked on decimal representations of unsigned integers. It turned out we were using them on some values initialized with -1. Bingo! There was clearly a problem there, since those operations were guaranteed to fail on -1. But what did those specific failed operations have to do with system-wide malfunction?
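The protocol's constraint can be expressed as a small check. `isIncrementable` is a hypothetical helper, not part of either extension, and it only tests the format of the stored string; it does not verify that the number fits in the range the server accepts.

```php
<?php
// Hypothetical guard: would incr/decr accept a value stored as this string?

function isIncrementable(string $stored): bool
{
    // incr/decr need a decimal representation of an unsigned integer:
    // digits only, with no sign, no whitespace and no serialized wrapper.
    return preg_match('/^\d+\z/', $stored) === 1;
}
```

With a check like this, a counter initialized with -1 is rejected before any `incr` is sent, whereas one initialized with 0 passes.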
Finally, I found the last piece of the puzzle: it appears that both Memcache and Memcached have checks in place to temporarily block a server that is malfunctioning (around 3 and 15 minutes, respectively). And when the server returns an error message on `incr` / `decr` failure, they both seem to misinterpret that as server malfunction, blocking the server for several minutes and causing all operations to fail.
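To see why one bad operation can take down every other one, here is a toy model of that blocking behaviour. It is emphatically not the extensions' real code: `ToyCacheClient`, its `run()` method and the block duration are all invented for illustration, and the only assumption carried over from above is that any error reply gets treated as a server failure.

```php
<?php
// Toy model only: NOT how either extension is implemented. Names are hypothetical.

final class ToyCacheClient
{
    private int $blockedUntil = 0;
    private int $blockSeconds;

    public function __construct(int $blockSeconds)
    {
        $this->blockSeconds = $blockSeconds;
    }

    /**
     * Runs $operation against an imaginary server unless the server is blocked.
     * $operation returns the server's reply, or null to stand for an error reply.
     * $now is the current time in seconds, passed in to keep the model deterministic.
     */
    public function run(callable $operation, int $now)
    {
        if ($now < $this->blockedUntil) {
            return false; // blocked: fail immediately without contacting the server
        }
        $reply = $operation();
        if ($reply === null) {
            // The flaw: an error reply is read as "the server is down".
            $this->blockedUntil = $now + $this->blockSeconds;
            return false;
        }
        return $reply;
    }
}
```

In this model, a single failing `incr` at second 10 with a 180-second block makes every perfectly valid get or set fail until second 190, which matches the symptoms: nothing works for a few minutes, and no traffic reaches the server.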
All that trouble caused by a negative number! Well, no. Actually, the real culprit is the unfortunate error handling in both memcached client extensions, of course.