Listen to your pools
At Tuist, we’d been experiencing sporadic database query drops. It became routine for me to start my day sifting through AppSignal errors related to dropped queries, spending hours trying to debug them. At times, I wondered if we were doing something wrong: perhaps our queries were too slow, or our pool wasn’t well tuned for our traffic patterns. But whose responsibility was it? Our database provider’s? Our cloud provider’s? Ours? It was somewhat annoying, but not something that significantly degraded our service quality.
Then things started getting worse. The query drops escalated to HTTP request drops, and we grew increasingly suspicious of our infrastructure. For those unfamiliar with why pools are necessary on the server side: they keep connections warm and ready to execute queries or requests, avoiding the latency of initial handshakes and the overhead of establishing a connection for every request. This is particularly important for Tuist, where we have a highly concurrent API surface, especially our cache endpoints.

After numerous unsuccessful attempts to get our cloud provider to investigate their network infrastructure, everything went sideways. First, the server became unhealthy twice after running out of memory. Later, it became so slow that some client features started to fail. This happened despite Erlang’s fair, preemptive scheduler, which keeps scheduling work even when the CPU is extremely busy (for example, when a process spins in an endless loop).
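To make the pool idea concrete before going further: in an Elixir application like ours, the database pool is mostly a matter of configuration. Here’s a minimal sketch using Ecto’s DBConnection options; the module name and values are illustrative, not our production settings.

```elixir
# config/runtime.exs (hypothetical module name and values, for illustration only)
import Config

config :my_app, MyApp.Repo,
  # Keep ten connections open and authenticated so queries don't pay the
  # TCP/TLS/authentication handshake cost on the hot path.
  pool_size: 10,
  # DBConnection queue tuning: roughly how long callers should wait for a
  # free connection before the pool starts shedding load.
  queue_target: 50,
  queue_interval: 1_000
```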
What could have caused this? Suddenly, it all clicked, and we finally understood the pool issue. Our outbound requests were too slow: not just the queries to the database, but also the uploads to S3. At the database level, where we’d optimized enough that high concurrent API load doesn’t translate into equivalent database load, this only showed up as sporadic drops. In our S3 storage pool, however, it translated into a handful of requests executing while thousands waited to be assigned a connection (consuming memory), with the scheduler struggling to assign them and new requests still being added to the pool. Our server couldn’t respond to requests in time because there were no CPU cycles left.
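The same dynamic applies to outbound HTTP. As a rough sketch (not our exact setup), a Finch pool dedicated to the storage host shows where things back up: once every pooled connection is stuck on a slow upload, new callers queue at checkout, hold memory, and compete for scheduler time.

```elixir
# In the application's supervision tree (names and numbers are hypothetical).
children = [
  {Finch,
   name: MyApp.S3Finch,
   pools: %{
     # A dedicated pool for the storage host. When uploads slow down, all
     # size * count connections stay busy, and every new request sits in
     # the checkout queue, holding memory while it waits its turn.
     "https://my-bucket.s3.amazonaws.com" => [size: 50, count: 2]
   }}
]

Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
```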
The sporadic drops we’d seen long before were already signaling that something was wrong with our network. Sometimes slowness is self-inflicted, but in this case it wasn’t us, though we only realized that once everything fell apart. We had numbers to prove to our cloud provider that something wasn’t working as planned, but we never anticipated that the issue would snowball and take down our entire service.
So yes, when your pool speaks, stop what you’re doing and listen to it. In our case, we switched to another provider following Chris’s suggestion, and things improved instantly. We also configured our pools with lower timeouts so that slow requests fail fast instead of clogging resources, and we’re defining alerts that automatically declare incidents when we detect upward trends in response times. It’s been a huge learning experience, and it’s also a relief to start each day without a handful of dropped connections from the queue.
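To illustrate the timeout change, here’s a hedged sketch using Finch request options; the URL, headers, and payload are placeholders, and it assumes a pool like the one sketched earlier. The point is to fail fast at connection checkout and to bound how long we wait on a slow upstream, instead of letting callers pile up behind it.

```elixir
# Hypothetical upload call; the URL, headers, and payload are placeholders.
url = "https://my-bucket.s3.amazonaws.com/artifacts/build.zip"

Finch.build(:put, url, [{"content-type", "application/zip"}], "payload")
|> Finch.request(MyApp.S3Finch,
  # Fail fast if no pooled connection frees up in time...
  pool_timeout: 5_000,
  # ...and bound how long we wait for the upstream to respond.
  receive_timeout: 30_000
)
```

On the database side, DBConnection’s `queue_target` and `queue_interval` play a similar load-shedding role.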
I feel confident that both our infrastructure and the app running on the amazing Erlang VM are now ready to handle much heavier loads.