Patterns for On-Call Access

December 12, 2023

Written by Indent

Step 1 of any on-call procedure is getting the access you need to respond to an incident. Time to access can make or break your response time and your team’s morale.

There is no one “right” way to handle on-call access across every organization, or even within a single organization.

Factors like tech stack, release process, time zone distribution, size and experience of your team, area of focus, and security risk all play a big part in determining which pattern is the right tool for the job.

Indent helps you manage the full lifecycle of on-call access, so we’ve seen it handled in many different ways across a spectrum of companies. In this blog post we’ll explore some common patterns we’ve seen from our customers.

Patterns for on-call access

	Pattern	When to use it
1	Indefinite	Small teams, senior management, or people who are always on-call
2	1 week	Dedicated teams with known surface area, like IT on-call
3	12 hours	Cross-functional engineers with known but dynamic surface area, like product teams or customer access
4	Task-based	Teams with higher security risk implementing least privilege
5	Semaphore	Sensitive operations that require only one cook in the kitchen

1 Indefinite access for the forever on-call

It can make sense to retain on-call access at all times. This generally applies to certain mission-critical personnel and to smaller teams of under 10 engineers with low-risk data.

CTO
- The CTO is responsible for the overall health and performance of the technology stack, so indefinite on-call access lets them offer immediate guidance during emergencies.
VP of Engineering
- Indefinite on-call access enables the VP of Engineering to support and guide teams during critical incidents, ensuring optimal performance and swift resolution of issues.
Head of Infrastructure
- The Head of Infrastructure focuses on the scalability, security, and efficiency of the infrastructure, so indefinite on-call access lets them respond swiftly to emergent issues and take the measures needed to maintain operational continuity and mitigate risks.
Infrastructure Teams
- Some engineers require indefinite on-call access to address system failures, network outages, and security incidents promptly, ensuring uninterrupted service and minimizing downtime for the entire organization.

Over time, as teams and security risks grow, the risks of permanent access start to outweigh the benefits. It’s important to have a process in place to review and remove access when it’s no longer needed.

Once you remove access, it's important to grant it quickly when someone needs it. With Indent, you can always grant it again with a single click from Slack.
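A periodic review of indefinite access can start as simply as flagging grants that have not been used recently. The sketch below is a minimal illustration and not Indent's API: the `Grant` shape, the `lastUsedAt` field, and the 90-day default threshold are all assumptions made for the example.

```typescript
// Hypothetical shape of an access grant; field names are assumptions for illustration.
interface Grant {
  user: string
  group: string
  expiresAt: Date | null // null means the access is indefinite
  lastUsedAt: Date
}

const DAY_MS = 24 * 60 * 60 * 1000

// Return indefinite grants that have not been used within `maxIdleDays`.
// Time-bound grants are skipped because they expire on their own.
function findStaleGrants(grants: Grant[], now: Date, maxIdleDays = 90): Grant[] {
  return grants.filter(
    (g) =>
      g.expiresAt === null &&
      now.getTime() - g.lastUsedAt.getTime() > maxIdleDays * DAY_MS
  )
}
```

The flagged grants are candidates for review rather than automatic removal, since some roles legitimately hold access they rarely use.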

2 Week-long access for on-call rotation

A common on-call practice for dedicated teams with a well-defined surface area (like IT teams or customer support teams) is to automate the rotation of on-call status within an incident response solution on a weekly cadence.

Right before the rotation, the person ending their on-call shift is responsible for granting on-call permissions via an on-call group to the person who is about to start the next shift.

For the same reason we automate on-call rotation, we should automate on-call permissions rotation – consistency.

One way you can do this is with Indent. You can auto-approve access for a week based on a user’s on-call status in an incident response provider such as PagerDuty, Opsgenie, or incident.io.

It also comes with the benefit that if someone is not on-call but wants to help with an incident, there’s a fast path for them to get the needed access. They can request on-call permissions through Indent, which can be approved with one click and provisioned automatically. This type of access gives you least privilege by default while accelerating time to resolution.
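Automating the permissions rotation described in this section boils down to reconciling the on-call group with the current schedule. Here is a minimal sketch of that reconciliation step. It is not tied to any provider: `planRotation` and `RotationPlan` are hypothetical names, and fetching the schedule and applying the changes are left out.

```typescript
// Hypothetical plan of membership changes for the on-call group.
interface RotationPlan {
  toGrant: string[]
  toRevoke: string[]
}

// Compare who is on-call now (for example, as reported by your incident
// response provider) with who currently holds on-call permissions.
function planRotation(onCallNow: string[], groupMembers: string[]): RotationPlan {
  return {
    // On-call but not yet in the group: needs access
    toGrant: onCallNow.filter((user) => groupMembers.indexOf(user) === -1),
    // In the group but no longer on-call: access should be removed
    toRevoke: groupMembers.filter((user) => onCallNow.indexOf(user) === -1),
  }
}
```

Running this on every schedule change keeps the handoff consistent, instead of relying on the outgoing engineer to remember it.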

3 Work-day or per-incident access

For cross-functional engineers with known but dynamic surface area, like product teams or teams that need customer access, time-limiting on-call access to 12 hours can make a lot of sense.

Shifting to more granular, time-bound access helps prevent accidents like making changes in production when you meant to make them in testing, and reduces security and compliance risk.

With Indent you can auto-approve for 6, 12, 24, or a custom number of hours based on a variety of factors like Okta Group membership, on-call status, or assignment to an active incident in incident.io. If the clock runs out but access is still needed, you can request access again with a single click from Slack.
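To make the idea of factor-based durations concrete, here is a minimal sketch of a rule that picks an approval window. It is an illustration, not Indent's configuration format: the `RequestContext` fields, the `engineering` group name, and which factor maps to which duration are all assumptions.

```typescript
// Hypothetical request context; the field names are assumptions for this sketch.
interface RequestContext {
  oktaGroups: string[]
  isOnCall: boolean
  assignedToActiveIncident: boolean
}

const HOUR_MS = 60 * 60 * 1000

// Return the number of hours to auto-approve, or null to require manual review.
// The mapping below is illustrative; pick durations that match your own risk model.
function autoApproveHours(ctx: RequestContext): number | null {
  if (ctx.assignedToActiveIncident) return 6
  if (ctx.isOnCall) return 12
  if (ctx.oktaGroups.indexOf('engineering') !== -1) return 24
  return null
}

// Compute when an approved grant should expire.
function expiresAt(grantedAt: Date, hours: number): Date {
  return new Date(grantedAt.getTime() + hours * HOUR_MS)
}
```

Returning `null` rather than a default duration keeps unknown requesters on the manual approval path.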

4 Task-based access

Teams with a higher security risk or that have implemented least privilege should opt for task-based on-call access.

Rather than granting role-based access to a category of tasks (e.g. On-Call Admins), they take the extra step to break down access into groups that mirror domains like:

On-Call Logging Viewer
On-Call Server Admin
On-Call Database Admin

With Indent, you can grant time-bound or indefinite access to these groups. This style also works for customer access, where engineering, sales, and support need temporary access to customer accounts.
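One way to picture task-based access is as a lookup from the task at hand to the narrowest group that covers it. In the sketch below the group names come from the list above, while the task names and the mapping itself are hypothetical and only meant to illustrate the idea.

```typescript
// Group names are taken from the list above; task names are assumptions.
type OnCallGroup =
  | 'On-Call Logging Viewer'
  | 'On-Call Server Admin'
  | 'On-Call Database Admin'
type Task = 'read-logs' | 'restart-service' | 'run-migration'

// Hypothetical mapping from a task to the single group that covers it.
const GROUP_FOR_TASK: Record<Task, OnCallGroup> = {
  'read-logs': 'On-Call Logging Viewer',
  'restart-service': 'On-Call Server Admin',
  'run-migration': 'On-Call Database Admin',
}

// Return the distinct groups needed to cover the given tasks, and nothing more.
function groupsForTasks(tasks: Task[]): OnCallGroup[] {
  const groups: OnCallGroup[] = []
  for (const task of tasks) {
    const group = GROUP_FOR_TASK[task]
    if (groups.indexOf(group) === -1) groups.push(group)
  }
  return groups
}
```

An engineer who only needs to read logs during an incident would request one group here, rather than a broad admin role.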

5 Semaphore production access

Some tasks, such as database migrations, are so sensitive that you might want only one person at a time to have permission to perform them.

Here’s how you could implement that using the Indent API:

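import { verify } from '@indent/webhook'
import { Request, Response } from 'express'
import { IndentAPI } from '@indent/api'

const indent = new IndentAPI()

export default async function (req: Request, res: Response) {
  // Assumes a JSON body parser (e.g. express.json()) has already populated req.body
  const body = req.body

  // Verify the webhook signature before trusting the payload
  await verify({
    secret: process.env.INDENT_WEBHOOK_SECRET,
    headers: req.headers,
    body
  })

  const { events } = body
  const { resource } = events[0]

  // Look up grants that are currently active for the requested resource
  const existing = await indent.petition.list({
    resourceId: resource.id,
    phase: 'granted'
  })

  // Nobody holds the resource: return no decision so the normal flow continues.
  // An empty array is truthy, so check the length explicitly.
  if (!existing || existing.length === 0) {
    return res.json({ events: [] })
  }

  // Someone already holds the resource: deny this request
  return res.json({
    events: [{
      event: 'access/deny',
      actor: { kind: 'bot', id: 'semaphore' },
      resource: { kind: 'access', id: existing[0].id },
      reason: `semaphore - already granted: ${existing[0].name}`
    }]
  })
}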