Debugging in Production

If you competed in this year’s CSAW CTF, you may have noticed that the site was pretty sluggish until around 1am EST Sunday. This post is a walkthrough of how I went from noticing this sluggishness, to debugging the issue, to putting in a fix which decreased page load times by over 10x.

The stack

Before we get into the problem, it’s useful to know how CTFd was deployed this year. Here’s a quick list of relevant facts:

  1. CTFd itself runs under gunicorn, nominally with 40 worker processes.
  2. Traefik sits in front of it on the same box, handling TLS termination.
  3. Caching goes through flask-caching, backed by Redis running on the same VM.
  4. MySQL is the database, reachable over our internal network without TLS.
  5. Everything is dockerized, which makes it easy to spin the whole stack up locally.

Identifying the problem

Before the CTF had even launched, I noticed the platform felt slower than it should have. Time to first byte (TTFB) numbers were in the hundreds of milliseconds on a <5ms RTT connection. This coincidentally led to us discovering that the CTFd production instance was running with 10x fewer worker processes than it should have been (4 instead of 40), but that turned out to be a separate problem that had nothing to do with these incredibly high TTFB numbers.

Now, while the site was slow during the competition, it was still usable (again, TTFBs in the hundreds of milliseconds aren’t the end of the world). That is, except for the scoreboard page, which repeatedly timed out after more than 30 seconds of loading, and that was definitely a problem.

SSHing into the server and running top showed the box was heavily loaded. The 40 gunicorn workers plus Traefik doing TLS termination were nearly saturating it.

Idea 1 - Request Queueing

At this point, the first theory for what was wrong came up: what if 40 workers is still too few, and requests are being queued behind long-running ones (e.g. for the scoreboard), causing the high TTFBs? This didn’t quite line up with the observed data, since the TTFBs were consistently high rather than sporadic, as you’d expect if requests were being queued. Either way, we wanted to test the theory, so we did two things:

  1. Look at real-time access logs for the site for any significant number of requests to /scoreboard
  2. Increase the number of gunicorn workers to 100

The logs revealed hardly any requests to the scoreboard, and increasing the number of gunicorn workers did nothing for responsiveness. In fact, it made things slightly worse, since the box was now over-saturated (load average of 100).
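
For reference, the worker count is just a setting in gunicorn’s Python config file. A minimal sketch of that kind of config (the file name and values here are illustrative, not our actual production config):

    # gunicorn_conf.py -- illustrative only, not the production config.
    import multiprocessing

    bind = "0.0.0.0:8000"

    # The gunicorn docs suggest (2 x CPU cores) + 1 as a starting point;
    # we were instead experimenting with fixed counts (40, then 100).
    workers = multiprocessing.cpu_count() * 2 + 1

    # Sync workers block for the full duration of a request, so slow
    # requests can queue behind each other -- the behavior we were testing for.
    worker_class = "sync"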

Bumping to 100 workers also caused the box to stall completely while redeploying a few hours later, which brought the site down entirely for about 5 minutes. Oops.

Getting more data

At this point, we needed more data to figure out what was actually causing the high latency.

Given how little processing CTFd itself should be doing, the incredibly high load averages seemed strange. CTFd was using a fair amount of CPU (according to top), but it just shouldn’t have been.

Normally, low CPU utilization combined with high load averages can be a sign of tasks that are IO-wait bound. Installing and running iotop seemed to confirm this: all of the gunicorn workers showed high IO utilization but very little disk IO. That pointed to CTFd stalling on network traffic a lot.
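
If you want to sanity-check that kind of theory without iotop, a few lines of Python with psutil give you the same signals. This is a sketch for illustration, not something we ran at the time:

    # Compare load averages against how CPU time is actually being spent.
    # On Linux, the 'iowait' field is the tell-tale for IO-bound stalls.
    import psutil

    cpu = psutil.cpu_times_percent(interval=5)   # sample over a 5s window
    load1, load5, load15 = psutil.getloadavg()

    print(f"load averages: {load1:.1f} {load5:.1f} {load15:.1f}")
    print(f"user={cpu.user:.1f}% system={cpu.system:.1f}% iowait={cpu.iowait:.1f}%")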

Redis and MySQL both expose simple statistics that are easy to pull, and shelling into both boxes turned up some startling numbers: Redis was handling ~10k ops/second with a 99.98% cache hit ratio, and yet MySQL was also getting hammered with ~5k queries/second, the vast majority of which (~4k/s) were SELECTs.
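
Pulling those numbers boils down to reading Redis’s INFO output and MySQL’s global status counters. A rough sketch of doing the same from Python with redis-py and PyMySQL (connection details are placeholders):

    import redis
    import pymysql

    # Redis INFO, "stats" section: ops/sec and keyspace hit/miss counters.
    r = redis.Redis(host="redis", port=6379)
    stats = r.info("stats")
    hits, misses = stats["keyspace_hits"], stats["keyspace_misses"]
    print("redis ops/sec:", stats["instantaneous_ops_per_sec"])
    print("redis hit ratio: %.2f%%" % (100.0 * hits / (hits + misses)))

    # MySQL exposes cumulative counters; sample twice and diff to get
    # per-second rates. Here we just dump the raw counters.
    db = pymysql.connect(host="mysql", user="ctfd", password="...", database="ctfd")
    with db.cursor() as cur:
        cur.execute("SHOW GLOBAL STATUS WHERE Variable_name IN ('Queries', 'Com_select')")
        for name, value in cur.fetchall():
            print(name, value)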

This made no sense. Just eyeballing the web server logs, I could tell we weren’t even peaking at over 100 req/s (and were probably averaging around 20 req/s), and yet there were 10k ops/sec going to Redis and another 5k to MySQL. This pointed to two issues:

  1. There is massive query fan-out in CTFd.
  2. The caching mechanism almost seems to be broken or bypassed somewhere.

Redis is built to be a super-fast k/v store, so while 10k ops/sec is a lot for our purposes, it’s next to nothing compared to what Redis can handle, so I focused on the MySQL traffic first.

MySQL

MySQL sends query text over the wire in the clear, so simply sniffing network traffic is enough to dump the queries going to it (assuming you don’t use TLS on your MySQL connections, which we (luckily?) don’t, since all communication happens over our internal network).

One tcpdump later, and we had some queries. Scrolling through the raw dump in less, there were three main queries I saw repeated over and over. Two of them were team-info and score-calculation queries, but the third stuck out: the same SELECT, over and over, fetching the single config key freeze.
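
If scrolling in less gets old, a few lines of Python can tally the most common query shapes from an ASCII dump (tcpdump -A) instead. This is a rough sketch and only an approximation, since it ignores MySQL’s packet framing entirely:

    import re
    import sys
    from collections import Counter

    counts = Counter()
    for line in sys.stdin:
        m = re.search(r"(SELECT|INSERT|UPDATE|DELETE)\b.*", line, re.IGNORECASE)
        if m:
            # Replace literal values so identical query shapes group together.
            shape = re.sub(r"('[^']*'|\b\d+\b)", "?", m.group(0))
            counts[shape[:120]] += 1

    for shape, n in counts.most_common(10):
        print(f"{n:8d}  {shape}")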

A brief aside for those not familiar with CTFd’s admin mechanisms: the competition admin can set a freeze time, after which flag submissions will continue to be accepted but won’t count towards a team’s score, effectively freezing each team’s score as of the freeze time.

This made no sense. Queries like that should hit Redis and be done, and yet here was a query being run probably 1k times per second, each one paying intra-VM network overhead. While that overhead is only on the order of tens to hundreds of microseconds per query, if we’re running the query hundreds of times for each request, we’re suddenly spending tens of milliseconds of TTFB on query round trips alone.
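
The back-of-envelope version, with both numbers being order-of-magnitude assumptions rather than measurements:

    # ~100 microseconds per intra-VM round trip x a few hundred queries per
    # request lands squarely in the tens of milliseconds.
    round_trip_s = 100e-6
    queries_per_request = 300
    print(f"{round_trip_s * queries_per_request * 1000:.0f} ms per request in round trips alone")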

With this info in hand, it became pretty obvious where our hundred-odd milliseconds of TTFB were coming from.

Digging into the source

Grepping through the CTFd repo for freeze turns up a place where, for whatever reason, the DB is queried directly instead of going through the cache:

https://github.com/CTFd/CTFd/blob/2bdf7b64d64b40308e64a09b9a215494813c446d/CTFd/models/__init__.py#L323

This happens to be in Users.get_score, which may explain why the scoreboard was timing out!

For a quick fix, I threw a cache.memoize(60) around each of the getters in Users and deployed!
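
For anyone unfamiliar with flask-caching, memoize caches a function’s return value per set of arguments for a given number of seconds. A minimal, self-contained sketch of the idea (the real change decorated the getters on CTFd’s Users model; the app and function below are stand-ins):

    from flask import Flask
    from flask_caching import Cache

    app = Flask(__name__)
    # Production pointed flask-caching at Redis; the URL here is a placeholder.
    cache = Cache(app, config={"CACHE_TYPE": "redis", "CACHE_REDIS_URL": "redis://redis:6379/0"})

    @cache.memoize(60)  # cache the result per distinct arguments for 60 seconds
    def get_score(user_id):
        # Stand-in for the expensive score-calculation query in Users.get_score.
        return 1337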

And nothing changed. Darn.

Testing locally

At this point, I figured I should spin up our stack locally so I could make quick changes without taking down prod. Since everything was dockerized, this took just a few minutes. I grabbed a recent database backup to import into my local DB so I was working with prod data without actually touching the prod DB (just in case I accidentally DELETE FROM teams;).

Wanting to see exactly where time was being spent, I turned to Python’s built-in cProfile module, which can give you exactly this kind of data. A quick Google search later, I found a gunicorn config/plugin that dumps cProfile information for each request: https://gist.github.com/joymax/8ffc63fd901c6de7073c18ae023d5cbc#file-wsgi_profiler_conf-py.
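
The general shape of that approach is a gunicorn config file that starts a profiler in the pre_request hook and dumps the stats in post_request. A sketch of the idea (not the linked gist verbatim):

    import cProfile
    import io
    import pstats

    PROFILE_LIMIT = 30  # lines of profiler output to log per request

    def pre_request(worker, req):
        # Start profiling just before the worker handles the request.
        worker.profile = cProfile.Profile()
        worker.profile.enable()

    def post_request(worker, req, environ, resp):
        worker.profile.disable()
        buf = io.StringIO()
        stats = pstats.Stats(worker.profile, stream=buf)
        stats.sort_stats("cumulative").print_stats(PROFILE_LIMIT)
        worker.log.info("%s %s\n%s", req.method, req.path, buf.getvalue())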

Running this locally showed a bunch of time being spent in Redis calls and socket recv, and at this point I finally realized what was going on: we were simply hitting Redis too much. The network round trip, although small, was being multiplied hundreds of times per request. On top of that, the flask-caching module in use had to pickle and unpickle values every time, adding to the latency.

Changing the flask-caching backend to “simple” showed some improvement, but it was still slower than I’d like, and cProfile showed the remaining time to be almost entirely pickle overhead.
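
The backend switch itself is just a config change (sketched below); the catch is that the simple in-process backend still pickles values on every get and set, which is where that remaining time was going.

    from flask import Flask
    from flask_caching import Cache

    app = Flask(__name__)
    # In-process dict-backed cache: no network hop, but values are still
    # pickled/unpickled on each access.
    cache = Cache(app, config={"CACHE_TYPE": "simple"})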

So, to solve this once and for all, I grabbed some 15-year-old Python memoization code, tossed it in, and pointed the decorator on get_config at this new in-memory memoizer. Testing locally now showed TTFBs in the low milliseconds, just as they should be. Deploying the code to prod showed similar improvements, and even made the scoreboard page actually load!
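
The general idea is the same as any plain in-process memoizer: a dict keyed on the function’s arguments, no pickling, no network hop. A sketch with a TTL bolted on (not the actual code I used; the names and the 60-second timeout are illustrative):

    import functools
    import time

    def memoize(ttl=60):
        def decorator(fn):
            store = {}  # (args, sorted kwargs) -> (expires_at, value)

            @functools.wraps(fn)
            def wrapper(*args, **kwargs):
                key = (args, tuple(sorted(kwargs.items())))
                now = time.monotonic()
                hit = store.get(key)
                if hit is not None and hit[0] > now:
                    return hit[1]
                value = fn(*args, **kwargs)
                store[key] = (now + ttl, value)
                return value

            return wrapper
        return decorator

    @memoize(ttl=60)
    def get_config(key):
        ...  # stand-in for CTFd's get_config, which otherwise hits the cache/DB

The obvious trade-off is that each gunicorn worker keeps its own copy, so an updated config value can take up to the TTL to propagate to every worker.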

Great success!

Wrapping up

Writing this all up, it’s obvious in hindsight that Redis over-use was the issue. 10k ops per second should have set off more mental alarm bells, but since it was on the same VM, and, well, Redis, I hadn’t done the napkin math that would have pretty quickly pointed to it as the problem.

While the exact methods I used here are a bit rudimentary (tcpdumping traffic on a production MySQL server, for instance), the basic principles and the mental process are what I mainly want to show here. Hopefully this helps somebody else debug a slow service in the future.

P.S.

Fun fact: there’s a route in CTFd:master as of the time of this post (commit b8c1970) which causes massive query fanout, basically DoSing whatever DB is being used. I’ll leave it as an exercise to the reader to find it :P