When Search Breaks Everything: Lessons from a Client's Unexpected App Outage
Reading time: ~ 4 minutes
A few months ago, a client reached out to let us know their app was behaving strangely. The frontend wasn't loading reliably, and users were hitting errors. It wasn't a full blackout, but it was causing enough friction to warrant a serious look. What followed was a methodical investigation into a cascade of backend failures that traced back to a single service: OpenSearch.
This post walks through what happened, why it happened, and what you can do to make sure it doesn't happen to your application.
What we saw first
The first visible symptom was a flood of 500 errors preventing the frontend from loading. Digging in, we started seeing Faraday connection errors pointing toward OpenSearch (our client's Elasticsearch-compatible search layer), which was temporarily unreachable. Mailers were still functioning, which told us the problem wasn't a total infrastructure collapse, but background jobs and the backend API were both down. Users couldn't do much of anything meaningful in the app. Those initial observations helped our team quickly narrow our focus, and we moved into a systematic root cause analysis.
What actually failed
503 and 500 errors traced back to OpenSearch being unavailable. Digging deeper, we found two main causes: "all shards failed" and "cluster_block_exception" errors, both signs of a cluster that had hit an internal problem and couldn't recover on its own.
The underlying infrastructure was a single-node OpenSearch cluster. It's a common cost-saving choice, and for lower-traffic applications it often works fine. But a single node means no redundancy. When that node has a bad day, there's nothing to fall back on.
To make matters worse, OpenSearch logging had been disabled to reduce costs, leaving us flying blind about what had gone wrong internally. We could clearly see the effects; we just couldn't identify the root cause as quickly as we should have. Two cost-saving decisions, each defensible in isolation, combined into a situation where a service failure was both more likely and harder to diagnose.
The ripple effects
The app's search going down took out more than you might expect:
API endpoints failed. Several key backend routes had OpenSearch baked directly into their request path, with no fallback and no graceful error handling. When OpenSearch became unavailable, those routes returned 500s instead of degraded but usable responses.
Delayed jobs backed up. Product update jobs stalled, though the culprit here was actually ElastiCache, not OpenSearch directly. The two failures were related but distinct, which made the initial diagnosis more complicated. Frontend symptoms rarely map cleanly to backend causes.
Mailer jobs kept running. This turned out to be a useful signal. Because mailers were unaffected, we could confidently isolate the problem to specific queue types and their dependencies.
Recovery came from restarting the cron and app server processes, which re-established healthy connections to the services that had come back online. This wasn't a permanent fix, but it got the app functional while we worked on longer-term improvements.
What we learned
None of these lessons were new, but they're the kind that only show up once a system has been running for a while. That doesn't make them any less worth noting, especially when the cost of skipping them shows up in production.
Graceful degradation matters more than it's given credit for. The frontend was hard-coupled to OpenSearch in ways it didn't need to be. Search unavailability should have resulted in degraded functionality, not a broken application. When non-critical services fail, the user experience shouldn't collapse with them.
Monitoring gaps slow everything down. Without OpenSearch logging enabled, we lost critical visibility into the service's internal state. We could see that it was down. We just couldn't easily see why it had gone down.
Single points of failure are liabilities, not just risks. A one-node cluster is a bet that nothing will go wrong. It wasn't our design decision, but we should have pushed back more forcefully and made the case for additional nodes. The cost savings were real — until the outage made them irrelevant.
Dependency chains are rarely obvious from the outside. Job queue issues linked to ElastiCache rather than directly to OpenSearch were a good reminder that symptom and cause often aren't the same thing. Treating each symptom as a separate diagnostic puzzle helped us avoid over-indexing on one explanation.
How we’d approach this going forward
…with the benefit of hindsight.
Design for Failure
The question to ask is simple: What happens to my app when this service is unavailable? If the answer is "it breaks completely," that's worth fixing.
For search specifically, make the dependency optional. Return fallback content, show a "search temporarily unavailable" message, or degrade gracefully to a basic database query. The goal is to keep the rest of the application usable when one piece has a problem. Apply that same thinking to any non-critical backend dependency. Your app shouldn't be as fragile as its least reliable service.
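A minimal sketch of that fallback pattern in Ruby. `SearchClient`, `PRODUCTS`, and `fallback_products` are illustrative stand-ins; in the real app the client would be the OpenSearch/Faraday layer (raising errors like `Faraday::ConnectionFailed`) and the fallback would be a basic database query.

```ruby
# Stand-in for the product catalog; in practice this would be a DB table.
PRODUCTS = ["red shirt", "blue shirt", "red hat"].freeze

# Stand-in for the search layer, simulating the cluster being down.
module SearchClient
  def self.search(_query)
    raise IOError, "connection refused"
  end
end

# Fallback: a basic substring match instead of full-text search.
def fallback_products(query)
  PRODUCTS.select { |p| p.include?(query) }
end

def search_products(query)
  { results: SearchClient.search(query), degraded: false }
rescue IOError => e
  # Degrade instead of returning a 500: serve a simpler result set and
  # let the frontend show a "search temporarily unavailable" notice.
  warn "search unavailable: #{e.message}"
  { results: fallback_products(query), degraded: true }
end
```

The `degraded: true` flag lets the frontend distinguish a full-quality result from a fallback one, so the rest of the page keeps working either way.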
Improve Monitoring and Alerting
Disabling logging to save costs is understandable. Going completely dark on a critical infrastructure component is not a trade-off we'd recommend.
At a minimum, invest in:
Shard health monitoring for OpenSearch, with alerts on degraded or unassigned shards
Job queue depth dashboards so you can see when background work is backing up before it becomes a user-facing problem
Background worker throughput metrics to catch stalls before they cascade
Alerting on error rate spikes, not just total outages
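The queue-depth alerting above can be sketched as a simple threshold check. The queue names and thresholds here are assumptions for illustration; in practice the depths would come from your job backend (e.g., a Delayed Job table count) and the alert would go to your paging tool.

```ruby
# Per-queue depth thresholds (illustrative values, not recommendations).
THRESHOLDS = { "mailers" => 500, "product_updates" => 1_000 }.freeze

# Returns the names of queues whose depth exceeds their threshold.
# Unknown queues get a conservative default limit of 100.
def backed_up_queues(depths, thresholds = THRESHOLDS)
  depths.select { |queue, depth| depth > thresholds.fetch(queue, 100) }.keys
end
```

Running this on a schedule and alerting on a non-empty result catches backed-up work well before users notice stalled updates.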
You don't need a dashboard that tracks every dependency, but you should have enough visibility that you're never diagnosing production incidents in the dark.
Avoid Single-Node Clusters
Even for lower-traffic applications, a single-node cluster is a liability. Node failures and hardware problems happen, and the real question is whether your system can absorb one. Multi-node clusters with shard replication give you something to fall back to. The cost may seem high for a simple lower-traffic app, but it will almost certainly be cheaper than the engineering time, lost user trust, and support burden that comes with explaining downtime.
Build Resilient Job Queues
Separating queues by function was one of the things that helped us isolate this incident faster. Mailer jobs ran on a different queue than product update jobs, so we could tell at a glance which parts of the system were healthy.
A few practices worth adopting:
Segment queues by function (mailers, cache updaters, API sync, etc.) so failures don't contaminate unrelated work
Set queue depth alerts to catch backed-up work early
Treat service restarts as a documented tool, not a last resort. Restarting cron and app workers should be a known, repeatable step in your incident response playbook
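Queue segmentation can be as simple as an explicit mapping from job function to queue name. This is a hypothetical sketch; in a Rails app the same idea usually lives in each job class's `queue_as` declaration, and the queue names here are illustrative.

```ruby
# Map each job function to its own queue so one dependency's failure
# doesn't contaminate unrelated work (queue names are assumptions).
QUEUE_FOR = {
  mailer:         "mailers",   # SMTP only — kept running during the incident
  product_update: "products",  # depends on the search/cache layer
  api_sync:       "sync"
}.freeze

def queue_for(job_type)
  QUEUE_FOR.fetch(job_type, "default")
end
```

With this split, "mailers healthy, products stalled" becomes something you can see at a glance, which is exactly the signal that shortened our diagnosis.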
Closing thoughts
Outages happen. Systems that depend on external services will, at some point, have those services become unavailable.
The difference between a minor incident and a major one usually comes down to decisions made long before anything went wrong. If your app relies on an outside service like Elasticsearch or OpenSearch, it's worth asking, “What actually happens when that service goes down?” If the whole app breaks, you have some architecture decisions to revisit.
We'd rather help clients ask that question proactively than answer it reactively at 2 am.