Roulette Technologies had achieved product-market fit. The product was working with real customer data, revenue was increasing and the number of engineers on our team had grown from two founders to seven.
And then we discovered that a former backend engineer who had left weeks earlier still had active AWS credentials.
We found this out when we audited our access controls. Her access keys, issued eight months earlier during an urgent production fix, were still valid. She had full access to our primary RDS instance. She could decrypt secrets from Parameter Store. She could update S3 bucket policies.
Editorial context:
Build Roulette documents production-informed decisions based on a combination of direct experience and observed industry patterns. Specific details are representative, not exhaustive.
The engineer wasn't malicious, and her credentials weren't compromised. But that's exactly the problem: the exposure was real, and we had no idea it existed until we audited our access controls.
This is a class of problem teams routinely discover too late, and it represented a real failure mode in our security posture.

The question we asked ourselves: How do we ensure that access credentials expire, are tied to real identities, and leave an audit trail we can actually trust?
The Constraint Surface
Before we discuss what we built, let's be clear about what we were working with:
Team Reality
- Seven generalist engineers. No security team.
- Everyone deployed to production. Everyone needed AWS access.
- On-call rotations meant debugging at 2am was a regular occurrence.
Technical Constraints
- Services were deployed on ECS Fargate and Lambda
- Infrastructure was managed with Terraform
- CI/CD pipeline was implemented with GitHub Actions
- Primary cloud provider: AWS
- Budget: Tight controls, cost accountability per service
Security Reality
- SOC 2 Type II audit required in six months
- Customer contracts now required security questionnaires
- Past breach at competitor had spooked investors
Operational Reality
- Engineers required production access
- Deployments occurred multiple times per day
- Breaking the deployment pipeline was not acceptable
- Adding 10 minutes to the dev loop would have killed productivity
What We Explicitly Did Not Consider
This problem statement deliberately did not consider the following:
Zero trust architecture with mutual TLS everywhere
- Required platform team we didn't have
- Would have added months to our timeline
- Operational complexity beyond our team's capacity
Hardware security keys for all engineers
- Procurement and distribution logistics
- Remote team with time zone differences
- Cost per engineer not yet justified
Completely locked-down production environment with ticket-based access
- Would have broken our debugging workflow
- Would have created dependency on approval process
- Would have slowed down our incident response times unacceptably
Using a third-party privileged access management platform
- Costs $50K+ per year
- Would have added work to our integration queue, delaying other priorities
- Would have created vendor lock-in risk
We were trying to find something that fit our reality: our team size, the tools we already had, and the fact that our engineers debug production problems at 2am.
The Decision: Short-Lived Credentials Everywhere
We decided to remove all long-lived credentials from production systems. Every authentication would use short-lived credentials with a defined expiry.
For Engineers (Human Access)
- AWS SSO with role assumption
- Maximum session duration: 8 hours
- Re-authentication required daily
- MFA enabled at the SSO layer
For Services (Machine Access)
- ECS Task Roles for containers
- Lambda execution roles for functions
- No IAM access keys in environment variables
- No secrets in code or configuration files
For CI/CD (Automation Access)
- GitHub Actions OIDC provider
- Short-lived credentials issued per workflow run
- Credentials expire when workflow is complete
- No secrets stored in GitHub
Credential Lifetime Policy
- Human sessions: 8 hours maximum
- Service credentials: Automatic rotation by AWS
- CI/CD credentials: Scoped to workflow execution (5-15 minutes)
- Emergency break glass: 1 hour maximum, logged and alerted
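To keep these limits from drifting, a policy table like this can live in code that provisioning tooling consults before issuing a session. A minimal sketch, assuming a helper of our own design (the `MAX_TTL` map and `validate_ttl` function are illustrative, not part of any AWS API):

```python
from datetime import timedelta

# Maximum credential lifetime per principal class, mirroring the policy above.
# These names are illustrative, not a real AWS API.
MAX_TTL = {
    "human": timedelta(hours=8),
    "service": timedelta(hours=1),      # rotated automatically by AWS
    "cicd": timedelta(minutes=15),
    "break_glass": timedelta(hours=1),  # logged and alerted
}

def validate_ttl(principal_class: str, requested: timedelta) -> timedelta:
    """Clamp a requested session duration to the policy maximum."""
    cap = MAX_TTL[principal_class]
    return min(requested, cap)
```

Encoding the policy once means a review of credential lifetimes is a one-file diff rather than an audit of scattered settings.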
Implementation Details
Human Access Pattern
Engineers authenticate through AWS SSO, which is integrated with our Google Workspace identity provider, so they sign in with their existing Google Workspace credentials. When an engineer needs access to production, they assume a role with the necessary permissions.
```bash
# Engineer logs in once per day
aws sso login --profile production

# Credentials are refreshed automatically
aws s3 ls --profile production
```
Behind the scenes:
- AWS SSO verifies identity with Google Workspace
- MFA challenge sent to user (Google Authenticator or physical token)
- Temporary credentials created to assume role
- Credentials stored locally, expire after 8 hours
- Credentials auto-refresh if within session window
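The expiry check at the end of that flow is just a timestamp comparison against the token the CLI caches on disk. A sketch of that check, assuming the cached entry is JSON with an ISO-8601 `expiresAt` field as the AWS CLI writes it (the helper itself is ours, not an AWS API):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def sso_token_valid(cache_file: Path, now=None) -> bool:
    """Check whether a cached AWS SSO token is still within its session window.

    Only inspects the local cache entry; it does not talk to AWS.
    """
    now = now or datetime.now(timezone.utc)
    entry = json.loads(cache_file.read_text())
    # `expiresAt` is ISO-8601 with a trailing "Z" in the CLI's cache files
    expires = datetime.fromisoformat(entry["expiresAt"].replace("Z", "+00:00"))
    return now < expires
```

When the check fails, the CLI prompts for a fresh `aws sso login`; there is no long-lived secret to fall back on.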
The role determines permissions. We created three base roles:
| Role | Permissions | Access Grant |
|---|---|---|
| ReadOnly | View resources, read logs, describe infrastructure | Default for every engineer |
| Developer | Deploy services, modify non-sensitive config | Granted during on-call rotation |
| Admin | Full access, reserved for infrastructure changes | Requires approval and justification |
Service Access Pattern
Previously, our ECS task definitions were structured as follows:
```json
{
  "environment": [
    {
      "name": "AWS_ACCESS_KEY_ID",
      "value": "AKIA..."
    },
    {
      "name": "AWS_SECRET_ACCESS_KEY",
      "value": "..."
    }
  ]
}
```
After:
```json
{
  "taskRoleArn": "arn:aws:iam::ACCOUNT:role/ProductionAPIServiceRole",
  "executionRoleArn": "arn:aws:iam::ACCOUNT:role/ECSTaskExecutionRole"
}
```
The task role grants access to AWS services. The ECS agent supplies the credentials automatically and rotates them every hour. Application code does not change: the AWS SDK handles credential refresh transparently.
For our Python services:
```python
import os
import boto3

# Before: explicit credentials
s3_client = boto3.client(
    's3',
    aws_access_key_id=os.environ['AWS_ACCESS_KEY_ID'],
    aws_secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY']
)

# After: automatic credential resolution
s3_client = boto3.client('s3')
```
The SDK automatically detects the ECS credential provider endpoint; no code changes are needed beyond removing the explicit credential configuration.
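The reason removing the keys is enough comes down to precedence in the SDK's credential provider chain. A simplified model of that ordering (the real chain in botocore has more sources; this toy function only mirrors the precedence relevant here):

```python
# Simplified model of the AWS SDK's credential resolution order. Removing the
# static env-var keys lets the next source in the chain -- the ECS container
# credential endpoint -- take over without any application changes.
def resolve_credential_source(env: dict) -> str:
    if "AWS_ACCESS_KEY_ID" in env and "AWS_SECRET_ACCESS_KEY" in env:
        return "static environment keys"             # what we removed
    if "AWS_CONTAINER_CREDENTIALS_RELATIVE_URI" in env:
        return "ECS task role (container endpoint)"  # what we rely on now
    if "AWS_WEB_IDENTITY_TOKEN_FILE" in env:
        return "web identity (OIDC)"
    return "instance metadata / shared config"
```

This is also why the migration was safe to do incrementally: a task with both keys and a task role keeps working, just on the wrong (static) credentials, until the keys are removed.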
CI/CD Access Pattern
In our GitHub Actions workflow, we were using the following stored secrets:
```yaml
# Before
- name: Deploy to production
  env:
    AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
    AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
  run: |
    terraform apply
```
We migrated to OIDC authentication:
```yaml
# After
- name: Configure AWS credentials
  uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: arn:aws:iam::ACCOUNT:role/GitHubActionsDeployRole
    aws-region: us-east-1

- name: Deploy to production
  run: |
    terraform apply
```
The OIDC flow works as follows:
- GitHub Actions generates a signed token for the workflow run
- The token includes repository, branch and workflow metadata
- AWS verifies the signature of the token using GitHub's public keys
- If valid, AWS provides temporary credentials for the role
- The credentials expire when the workflow completes (5-15 minutes)
The IAM role trust policy controls which repositories can use this role:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::ACCOUNT:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com",
          "token.actions.githubusercontent.com:sub": "repo:roulette-tech/infrastructure:ref:refs/heads/main"
        }
      }
    }
  ]
}
```
Only workflows in the `infrastructure` repository, running on the `main` branch, can assume this role. A compromised developer laptop cannot produce valid tokens. A leaked secret cannot be replayed.
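Conceptually, the condition block is an exact string match on the token's claims, performed by AWS after it verifies the token signature. A toy re-implementation of just that matching step (the cryptography is omitted entirely; this only mirrors the `StringEquals` semantics):

```python
# Claims required by the trust policy above. AWS checks these server-side
# after signature verification; this sketch reproduces only the matching.
REQUIRED_CLAIMS = {
    "aud": "sts.amazonaws.com",
    "sub": "repo:roulette-tech/infrastructure:ref:refs/heads/main",
}

def claims_allowed(claims: dict) -> bool:
    """True only if every required claim is present with the exact value."""
    return all(claims.get(k) == v for k, v in REQUIRED_CLAIMS.items())
```

A token minted for a feature branch carries a different `sub` claim, so it fails this match even though its signature is perfectly valid.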

What This Actually Changed
Security Improvements
Automatic Expiration
- Credentials expire, not through policy, but through design
- No process to forget to rotate
- Limited blast radius in the event of a breach
Identity Binding
- Every action is tied to a real identity
- CloudTrail shows who took each action
- No sharing of credentials among team members
Reduced Attack Surface
- No credentials in environment variables
- No secrets in source control
- No long-lived keys in CI/CD systems
Audit Trail
- SSO login activity is recorded
- Role assumptions recorded
- Session duration is recorded
Operational Impact
Developer Experience
- One login per day instead of managing access keys
- No need to be reminded to rotate credentials
- MFA is handled at the SSO level, not the service level
Incident Response
- Revoke access by disabling SSO account
- Credentials expire after 8 hours
- No key hunting in emergency situations
Onboarding/Offboarding
- New engineer: SSO access and role assignment
- Leaving engineer: Disable SSO account, access immediately terminated
- No credential cleanup checklist
Cost
- AWS SSO: free
- No third-party PAM solution needed
- Engineering time: ~40 hours for full migration
What We Explicitly Accepted
This was a deliberately conservative approach. We did not attempt perfect security; we aimed for measurably more secure, with reasonable operational overhead.
Accepted Limitations:
Role-based access is coarse-grained
- Three roles may be too broad a categorization
- Engineers with Developer role have more access than necessary for a particular task
- Fine-grained access would have meant a role per engineer
8-hour sessions are long
- Credentials remain valid for a full working day
- Compromise during this time window is still possible
- Shorter sessions would have hindered debugging
Break-glass accounts are available
- Emergency root access available
- Stored in 1Password, protected with MFA
- Used twice in six months for SSO outages
No network-level isolation
- Credentials are not sufficient to prevent lateral movement
- We assumed services would have access to each other
- Network segmentation would have meant a VPC overhaul
CloudTrail lag
- Up to 15 minutes before events appear in logs
- No real-time monitoring
- Compromise could be underway for minutes before detection
We made this set of trade-offs because it aligns with the size of our team, our needs and our security concerns. Your mileage may vary depending on your team.
Measurement and Validation
We monitored the following metrics to validate the decision.
Before Migration
- Number of active IAM access keys: 23
- Average age of access keys: 147 days
- Number of keys older than 90 days: 14
- Number of credential rotations performed: 0 (rotation never happened)
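Numbers like these come straight out of the IAM credential report. A sketch of how key ages can be computed from it, assuming the CSV from `aws iam get-credential-report` has been base64-decoded first (only the `access_key_1_*` columns are inspected here for brevity):

```python
import csv
import io
from datetime import datetime, timezone

def key_ages_days(report_csv: str, now=None):
    """Return ages in days of active access keys from an IAM credential report.

    Expects the decoded CSV body of the credential report; rows whose first
    access key is inactive are skipped.
    """
    now = now or datetime.now(timezone.utc)
    ages = []
    for row in csv.DictReader(io.StringIO(report_csv)):
        if row.get("access_key_1_active") == "true":
            rotated = datetime.fromisoformat(
                row["access_key_1_last_rotated"].replace("Z", "+00:00")
            )
            ages.append((now - rotated).days)
    return ages
```

Running something like this before the migration is what produced the 23-key, 147-day-average numbers that made the case to leadership.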
After Migration (6 months)
- Number of active IAM access keys: 0 (except break-glass)
- Average session duration: 4.2 hours
- Number of failed authentications: 312 (all expired sessions)
- Time to revoke access on offboarding employees: < 5 minutes
Security Audit Results
- SOC 2 report: "Adequate credential management controls"
- Previous report: "Recommendation: Implement credential rotation"
- Customer security questionnaire blockers: 0
- Investor security review: Passed without any credential concerns

Developer Feedback
- Time to onboard AWS access: 15 minutes
- "Have to re-auth once per day" (acceptable friction level)
- Number of incidents caused by credential problems: 1 (session expired during deploy, retry solved the issue)
When This Approach Breaks Down
This model works for us at Roulette Technologies. It will not work forever.
Signals that it's time to evolve further:
Team size grows beyond 20 engineers
- Three roles no longer provide enough granularity
- More permission levels become necessary
- The role count explodes and needs active management
Compliance needs change
- 8-hour sessions are too long for PCI DSS
- Just-in-time access provisioning becomes necessary
- Access escalation requires approval workflows
Multi-cloud environment emerges
- AWS SSO alone is not enough; GCP and Azure need coverage
- A unified identity model becomes necessary
- Credentials must be managed across multiple clouds
Service Mesh emerges with Mutual TLS
- Machine identities need more granularity
- Need support for certificate authentication
- SPIFFE, SPIRE, etc., are a natural fit
Third-party services need access to production environment
- Vendors or contractors need temporary access
- Time-based access with automatic expiration
- Audit trail for non-employee access required
The Next Evolution
When these conditions are met, the next evolution might include:
- Dedicated platform team to manage the identity infrastructure
- Third-party PAM solution for just-in-time access
- Policy as code for permission management
- Certificate-based authentication for service-to-service
But these are future problems. For a team of seven engineers with tightly controlled scope, we solved the problem we had with short-lived credentials.
Lessons Learned
Identity, not Credentials
- Binding access to Google Workspace identities solved the problem
- Single source of truth for who has access
- Offboarding is now automatic
Platform Primitives
- AWS SSO, ECS Task Roles and OIDC are free
- Developing a custom solution would have taken us months
- Platform primitives improve over time
Operational Simplicity > Theoretical Security
- Perfect security isolation would have broken debugging
- Engineers need production access to fix production
- Security that prevents work gets bypassed
Measure the Before State
- Understanding we had 23 active keys proved the case
- Average age of 147 days was damning
- Numbers convinced leadership to prioritize migration
Migration is Incremental
- Started with new services on task roles
- Next, migrated CI/CD (highest risk)
- Last, migrated engineer access (most disruptive)
Documentation is more important than you think
- Engineers needed runbook on new auth flow
- "How do I debug production?" became FAQ
- Documentation helped reduce support calls
This Article Is Not About Credentials
This article is really about:
Designing with constraints, not ideals
- We didn't implement zero trust architecture
- We implemented what we could with what we had
- Good enough today, perfect never
Designing security as an operational reality
- Security measures need to work with how we work
- Engineers will circumvent security if it gets in the way
- Usability is a security feature
Designing security with trade-offs
- 8-hour sessions are long, but acceptable
- Coarse-grained roles are imperfect, but workable
- Honesty builds trust
Designing security with a sense of evolution
- What we have today won't scale 10x
- We defined clear signals for change
- Architecture is never finished
Designing security with real-world experiences
- A former engineer's still-active credentials were a wake-up call
- Abstract security threats won't motivate change
- Real-world security exposure drives real-world change
The Final Question
The credential ban wasn't about best practices; it was about solving a real problem.
If you're a three-person startup, chances are you don't need this right now. If you're a 50-person company, you needed this yesterday. Whatever your size, ask yourself this question:
Do you know how many active credentials exist in your production environment right now?
If the answer is "no," you have the same problem we had. And if you find that out at 2am in the middle of an incident, trust us: you'll wish you had solved it sooner.