Simulate Third-Party Downtime

March 1, 2016 by

I spend most of my time at Heroku working on our support tools and services; help.heroku.com is one such example. Heroku’s help application depends on the Platform API to, amongst other things, authenticate users, authorize or deny access, and fetch user data.

So, what happens to tools and services like help.heroku.com during a platform incident? They must remain available to both agents and customers—regardless of the status of the Platform API. There is simply no substitute for communication during an outage.

To ensure this is the case, we use api-maintenance-sim, an app we recently open-sourced, to regularly simulate Platform API incidents.


Simulating downtime

During a Platform API incident, the API is disabled. All requests receive a 503 (service unavailable) HTTP response. This is a simple behaviour that we can imitate on demand with api-maintenance-sim.

At its core, api-maintenance-sim responds to every request with a 503 HTTP status, as shown below.

run lambda { |env|
  [
    503,
    {"Content-Type"=>"application/json"},
    StringIO.new(%q|{ "id": "maintenance", "message": "Heroku API is temporarily unavailable.\nFor more information, visit: https://status.heroku.com" }|)
  ]
}
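A quick way to see what clients will receive, without deploying anything, is to call the handler directly. The sketch below assumes the same lambda bound to a constant; MAINTENANCE_APP is a name chosen here for illustration and is not part of api-maintenance-sim.

```ruby
require "json"
require "stringio"

# Same handler as above, bound to a constant (hypothetical name) so it can be
# called directly instead of being passed to Rack's `run`.
MAINTENANCE_APP = lambda { |env|
  [
    503,
    {"Content-Type" => "application/json"},
    StringIO.new(%q|{ "id": "maintenance", "message": "Heroku API is temporarily unavailable.\nFor more information, visit: https://status.heroku.com" }|)
  ]
}

# A Rack app returns a [status, headers, body] triple; an empty env is enough
# here because the handler ignores the request entirely.
status, headers, body = MAINTENANCE_APP.call({})
payload = JSON.parse(body.read)

puts status                   # HTTP status code of the response
puts headers["Content-Type"]  # content type clients will see
puts payload["id"]            # error identifier from the JSON body
```

Because the handler never inspects the request, every path and HTTP verb receives the same response.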

Once deployed, we begin the simulation by directing the app we’re testing to use a custom hostname, rather than the default api.heroku.com.

Here’s an example using the platform-api gem.

PlatformAPI.connect_oauth(current_user.oauth_token, url: ENV['PLATFORM_API_URL'])

If PLATFORM_API_URL is not set, ENV['PLATFORM_API_URL'] returns nil, and the gem falls back to the production URL. If it is set, requests go to the hostname of your choice instead.
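The fallback can be pictured with a small sketch. The gem handles this internally, so resolve_api_url and DEFAULT_API_URL below are hypothetical names used only to illustrate the nil-means-default behaviour, and the default value is an assumption based on the api.heroku.com hostname mentioned above.

```ruby
# Assumed production default, for illustration only.
DEFAULT_API_URL = "https://api.heroku.com"

# Returns the configured URL if one is set, otherwise the production default.
# Accepts any hash-like object so it can be exercised without touching ENV.
def resolve_api_url(env = ENV)
  env["PLATFORM_API_URL"] || DEFAULT_API_URL
end

puts resolve_api_url({})
puts resolve_api_url({"PLATFORM_API_URL" => "https://my-simulation-app.herokuapp.com"})
```

Keeping the override in a single config var means the application code is identical in normal operation and during a simulation.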

You can now use config vars to start the simulation.

$ heroku config:set PLATFORM_API_URL=https://my-simulation-app.herokuapp.com
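With the simulation running, every Platform API call from the app under test will fail, which is the point at which you find out whether it degrades gracefully. One possible pattern is sketched below, under stated assumptions: ApiUnavailable is a hypothetical error class standing in for whatever your HTTP client raises on a 503, and with_api_fallback is an illustrative helper, not part of the platform-api gem.

```ruby
# Hypothetical error standing in for the exception your client raises on a 503.
class ApiUnavailable < StandardError; end

# Runs the block; if the API is unavailable, returns the fallback value
# instead of letting the exception take down the request.
def with_api_fallback(fallback)
  yield
rescue ApiUnavailable
  fallback
end

# Stub that behaves like an API call made while the simulation is active.
fetch_account = -> { raise ApiUnavailable, "503 Service Unavailable" }

account = with_api_fallback(nil) { fetch_account.call }

if account.nil?
  puts "Platform API unavailable; showing limited view"
else
  puts "Loaded account"
end
```

To end the simulation, unset the config var (heroku config:unset PLATFORM_API_URL) and the app goes back to the production API.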

Conclusion

It’s not a matter of if an incident will occur; it’s a matter of when. Running regular simulations is an easy way to improve your application’s stability or, at the very least, to understand what failure will mean for your application or service.