Step 1 for any on-call procedure is to get access to respond to an incident. Time to access can make or break your response time, and your team’s morale.
There is no one “right” way to handle on-call access across every organization, or even within a single organization.
Factors like tech stack, release process, time zone distribution, size and experience of your team, area of focus, and security risk all play a big part in determining which pattern is the right tool for the job.
Indent helps you manage the full lifecycle of on-call access, so we’ve seen it handled in many different ways across a spectrum of companies. In this blog post we’ll explore some common patterns we’ve seen from our customers.
# | Pattern | When to use it
---|---|---
1 | Indefinite | Small teams, senior management, or people who are always on-call |
2 | 1 week | Dedicated teams with known surface area, like IT on-call |
3 | 12 hours | Cross-functional engineers with known but dynamic surface area, like product teams or customer access |
4 | Task-based | Teams with higher security risk implementing least privilege |
5 | Semaphore | Sensitive operations that require only one cook in the kitchen |
It can make sense to retain on-call access at all times. This generally applies to certain mission-critical personnel and to smaller teams of under 10 engineers with low-risk data.
Over time, as teams and security risks grow, the risks of permanent access start to outweigh the benefits. It’s important to have a process in place to review and remove access when it’s no longer needed.
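As a rough sketch of what such a review could look like, assume you can export each access grant along with the last time it was used. The `Grant` shape, the `findStaleGrants` helper, and the 30-day default are all hypothetical and not part of Indent; the point is only that a periodic job can flag access nobody has touched recently.

```typescript
// Hypothetical shape of an exported access grant (not an Indent API type).
interface Grant {
  user: string
  resource: string
  lastUsedAt: Date
}

const DAY_MS = 24 * 60 * 60 * 1000

// Return the grants that have been idle for longer than `maxIdleDays` as of `now`.
function findStaleGrants(grants: Grant[], now: Date, maxIdleDays = 30): Grant[] {
  const maxIdleMs = maxIdleDays * DAY_MS
  return grants.filter(g => now.getTime() - g.lastUsedAt.getTime() > maxIdleMs)
}
```

The flagged grants would then go to a human reviewer or be revoked automatically, depending on how much risk the team is willing to accept.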
Once you remove access, it's important to grant it quickly when someone needs it. With Indent, you can always grant it again with a single click from Slack.
A common on-call practice for dedicated teams with a well-defined surface area (like IT teams or customer support teams) is to automate the rotation of on-call status within an incident response solution on a weekly cadence.
Right before the rotation, the person ending their on-call shift is responsible for granting on-call permissions via an on-call group to the person who is about to start the next shift.
For the same reason we automate on-call rotation, we should automate on-call permissions rotation – consistency.
One way you can do this is with Indent. You can auto-approve access for a week based on a user’s on-call status in your incident response provider like PagerDuty, Opsgenie, or Incident.io.
It also comes with the benefit that if someone is not on-call but wants to help with an incident, there’s a fast path for them to get the needed access. They can request on-call permissions through Indent, which can be approved with one click and provisioned automatically. This type of access gives you least privilege by default while accelerating time to resolution.
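The two paths described above (auto-approval for whoever is on-call, one-click review for everyone else) can be sketched as a small decision function. This is an illustration under stated assumptions: the `AccessRequest` and `Decision` types and the `decide` function are hypothetical, and the on-call roster is passed in as a plain set rather than fetched from any real incident response API.

```typescript
// Hypothetical request and decision shapes, for illustration only.
interface AccessRequest {
  requester: string
  requestedAt: Date
}

type Decision =
  | { outcome: 'approve'; expiresAt: Date }
  | { outcome: 'needs_review' }

const ONE_WEEK_MS = 7 * 24 * 60 * 60 * 1000

// Auto-approve for one week if the requester is currently on-call;
// otherwise route the request to a human approver.
function decide(req: AccessRequest, onCallNow: Set<string>): Decision {
  if (onCallNow.has(req.requester)) {
    return {
      outcome: 'approve',
      expiresAt: new Date(req.requestedAt.getTime() + ONE_WEEK_MS)
    }
  }
  return { outcome: 'needs_review' }
}
```

Keeping the decision separate from the roster lookup makes the rule easy to test and lets the same logic work with whichever incident response provider supplies the roster.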
For cross-functional engineers with known but dynamic surface area, like product teams or teams that need customer access, time-limiting on-call access to 12 hours can make a lot of sense.
Shifting to more granular, time-bound access helps prevent accidents, like making changes in production when you meant to make them in testing, and reduces security and compliance risk.
With Indent, you can auto-approve for 6, 12, 24, or a custom number of hours based on factors like Okta Group membership, on-call status, or assignment to an active incident in incident.io. If the clock runs out but access is still needed, you can request access again with a single click from Slack.
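To make the idea of factor-based durations concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the `RequestContext` fields, the `autoApproveHours` and `hasExpired` helpers, and in particular which factor maps to which duration are arbitrary choices, not Indent's configuration format.

```typescript
// Hypothetical signals about the requester, gathered before a decision is made.
interface RequestContext {
  assignedToActiveIncident: boolean
  isOnCall: boolean
  inOnCallGroup: boolean
}

// Return the auto-approved duration in hours, or null if a human must approve.
// The mapping from factor to duration is an arbitrary example.
function autoApproveHours(ctx: RequestContext): number | null {
  if (ctx.assignedToActiveIncident) return 6
  if (ctx.isOnCall) return 12
  if (ctx.inOnCallGroup) return 24
  return null
}

// True once a grant issued at `grantedAt` for `hours` hours has run out.
function hasExpired(grantedAt: Date, hours: number, now: Date): boolean {
  return now.getTime() >= grantedAt.getTime() + hours * 60 * 60 * 1000
}
```

Ordering the rules from the shortest duration to the longest means the most specific signal wins, which keeps grants as short as the situation allows.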
Teams with a higher security risk or that have implemented least privilege should opt for task-based on-call access.
Rather than granting role-based access to a category of tasks (e.g. On-Call Admins), they take the extra step to break down access into groups that mirror domains like:
Using Indent, you can grant time-bound or indefinite access to these groups. This style of access also works for customer access, where engineering, sales, and support staff need temporary access to customer accounts.
Some tasks are so sensitive that you might want only one person at a time to have permission to perform them, e.g. a database migration.
Here’s how you could implement that using the Indent API:
```typescript
import { verify } from '@indent/webhook'
import { Request, Response } from 'express'
import { IndentAPI } from '@indent/api'

const indent = new IndentAPI()

export default async function (req: Request, res: Response) {
  // Assumes a JSON body parser (e.g. express.json()) has already populated req.body
  const body = req.body

  // Check that the request was signed with the webhook secret
  await verify({
    secret: process.env.INDENT_WEBHOOK_SECRET,
    headers: req.headers,
    body: req.body
  })

  const { events } = body
  const { resource } = events[0]

  // Look up anyone who currently holds a grant on this resource
  const existing = await indent.petition.list({
    resourceId: resource.id,
    phase: 'granted'
  })

  // Nobody currently holds access, so don't deny the request
  if (!existing || existing.length === 0) {
    return res.json({ events: [] })
  }

  // Someone already holds access: deny this request
  return res.json({
    events: [
      {
        event: 'access/deny',
        actor: { kind: 'bot', id: 'semaphore' },
        resource: { kind: 'access', id: existing[0].id },
        reason: `semaphore - already granted: ${existing[0].name}`
      }
    ]
  })
}
```
Now, once the first person has been granted access, anyone else who requests it will be denied until the first person’s access expires.
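To see the semaphore behavior in isolation from any API, here is a minimal in-memory sketch. The `AccessSemaphore` class and its `tryAcquire` method are hypothetical names invented for this illustration; a real deployment would keep this state in the access system itself rather than in process memory.

```typescript
// In-memory illustration of "one holder at a time" access. Hypothetical, not an Indent API.
class AccessSemaphore {
  // resource id -> current holder and the time (ms) at which their access expires
  private holders = new Map<string, { user: string; expiresAt: number }>()

  // Try to take exclusive access to a resource for `ttlMs` milliseconds.
  // Returns false if a different user holds unexpired access.
  // A request from the current holder renews their access.
  tryAcquire(resourceId: string, user: string, ttlMs: number, now: number = Date.now()): boolean {
    const current = this.holders.get(resourceId)
    if (current && current.expiresAt > now && current.user !== user) {
      return false
    }
    this.holders.set(resourceId, { user, expiresAt: now + ttlMs })
    return true
  }
}
```

Because expiry is checked at request time, no background cleanup is needed: an expired holder is simply overwritten by the next requester.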
There are many ways to handle on-call access and it’s important to find what fits best for your team.
Have a question about optimizing on-call access? Get a demo and schedule time that works for you — we're happy to help!