BR102970
Jan 29, 2026
13 min read
Why We Banned Long-Lived Cloud Credentials in Production
Roulette Technologies had achieved product-market fit. The product was handling real customer data, revenue was growing, and the team had expanded from two founders to seven engineers.
And then we discovered that a former backend engineer who had left weeks earlier still had active AWS credentials.
We found this out when we audited our access controls. Her access keys, issued eight months earlier during an urgent production fix, were still valid. She had full access to our primary RDS instance. She could decrypt secrets from Parameter Store. She could update S3 bucket policies.
Editorial context:
Build Roulette documents production-informed decisions based on a combination of direct experience and observed industry patterns. Specific details are representative, not exhaustive.
The engineer wasn't malicious. Her credentials weren't compromised. But that is exactly the problem: the exposure was real, and we had no idea it existed until we audited our access controls.
This is the class of problem teams routinely discover too late, and it represented a real failure mode in our security posture.
The question we asked ourselves: how do we ensure that access credentials expire, are tied to real identities, and produce an audit trail we can actually trust?
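For concreteness, the kind of audit that surfaced the stale key can be sketched in a few lines. This is a minimal sketch, not our actual tooling: the `find_stale_keys` helper is hypothetical, and the input mimics the `AccessKeyMetadata` entries that IAM's `list_access_keys` API returns.

```python
from datetime import datetime, timedelta, timezone

def find_stale_keys(keys, max_age_days=90, now=None):
    """Flag active access keys older than a threshold -- the check that
    would have caught the eight-month-old key much sooner."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [
        k["AccessKeyId"]
        for k in keys
        if k["Status"] == "Active" and k["CreateDate"] < cutoff
    ]

# Two keys shaped like IAM's AccessKeyMetadata (IDs are made up):
keys = [
    {"AccessKeyId": "AKIA_OLD", "Status": "Active",
     "CreateDate": datetime(2025, 5, 1, tzinfo=timezone.utc)},
    {"AccessKeyId": "AKIA_NEW", "Status": "Active",
     "CreateDate": datetime(2026, 1, 20, tzinfo=timezone.utc)},
]
print(find_stale_keys(keys, now=datetime(2026, 1, 29, tzinfo=timezone.utc)))
# -> ['AKIA_OLD']
```

Running a check like this on a schedule, rather than once during an audit, is what turns a lucky discovery into a control.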
The Constraint Surface
Before we discuss what we built, let's be clear about what we were working with:
Team Reality
Seven generalist engineers. No security team.
Everyone deployed to production. Everyone needed AWS access.
On-call rotations meant debugging at 2am was a regular occurrence.
Technical Constraints
Services were deployed on ECS Fargate and Lambda
Infrastructure was managed with Terraform
CI/CD pipeline was implemented with GitHub Actions
Primary cloud provider: AWS
Budget: Tight controls, cost accountability per service
Security Reality
SOC 2 Type II audit required in six months
Customer contracts now required security questionnaires
Past breach at competitor had spooked investors
Operational Reality
Engineers required production access
Deployments occurred multiple times per day
Breaking the deployment pipeline was not acceptable
Adding 10 minutes to the dev loop would be a productivity killer
What We Explicitly Did Not Consider
This problem statement deliberately did not consider the following:
Zero trust architecture with mutual TLS everywhere
Would have required a platform team we didn't have
Would have added months to our timeline
Operational complexity beyond our team's capacity
Hardware security keys for all engineers
Procurement and distribution logistics
Remote team with time zone differences
Cost per engineer not yet justified
Completely locked-down production environment with ticket-based access
Would have broken our debugging workflow
Would have created dependency on approval process
Would have slowed down our incident response times unacceptably
Using a third-party privileged access management platform
Cost is $50K+ per year
Would have added work to our integration queue, delaying other priorities
Would have created vendor lock-in risk
We were looking for something that fit our reality: our team size, the tools we already had, and the fact that our engineers debug production problems at 2am.
The Decision: Short-Lived Credentials Everywhere
We decided to remove all long-lived credentials from production systems. Every authentication would use short-lived credentials with a bounded lifetime.
For Engineers (Human Access)
AWS SSO with role assumption
Maximum session duration: 8 hours
Re-authentication required daily
MFA enabled at the SSO layer
For Services (Machine Access)
ECS Task Roles for containers
Lambda execution roles for functions
No IAM access keys in environment variables
No secrets in code or configuration files
For CI/CD (Automation Access)
GitHub Actions OIDC provider
Short-lived credentials issued per workflow run
Credentials expire when workflow is complete
No secrets stored in GitHub
Credential Lifetime Policy
Human sessions: 8 hours maximum
Service credentials: Automatic rotation by AWS
CI/CD credentials: Scoped to workflow execution (5-15 minutes)
Emergency break-glass: 1 hour maximum, logged and alerted
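The lifetime policy above can be captured as a small table that rejects any request exceeding its cap. This is purely illustrative; the names and the `validate_lifetime` helper are hypothetical, not our real tooling:

```python
from datetime import timedelta

# Maximum credential lifetimes from the policy above (illustrative).
MAX_LIFETIME = {
    "human": timedelta(hours=8),
    "cicd": timedelta(minutes=15),
    "break_glass": timedelta(hours=1),
}

def validate_lifetime(kind, requested):
    """Reject any credential request that exceeds the policy cap."""
    cap = MAX_LIFETIME[kind]
    if requested > cap:
        raise ValueError(f"{kind} credentials capped at {cap}, requested {requested}")
    return requested

# A 10-minute CI/CD credential is within the 15-minute cap:
print(validate_lifetime("cicd", timedelta(minutes=10)))
```

Encoding the policy as data makes the caps reviewable in one place instead of being scattered across IAM role definitions.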
Implementation Details
Human Access Pattern
Engineers authenticate through AWS SSO, which is integrated with our Google Workspace identity provider, so they sign in with their Google Workspace credentials. When an engineer needs production access, they assume a role with the necessary permissions.
Engineer logs in once per day:

```shell
aws sso login --profile production
```

Credentials are then refreshed automatically on subsequent calls:

```shell
aws s3 ls --profile production
```
Behind the scenes:
AWS SSO verifies identity with Google Workspace
MFA challenge sent to the user (Google Authenticator or physical token)
Temporary credentials created to assume role
Credentials stored locally, expire after 8 hours
Credentials auto-refresh if within the session window
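The local cache can be inspected to see how much session time remains. A minimal sketch, assuming the AWS CLI v2 layout where SSO tokens are cached as JSON files containing an `expiresAt` ISO 8601 timestamp (exact cache format may vary by CLI version; `needs_reauth` is a hypothetical helper):

```python
import json
from datetime import datetime, timezone

def needs_reauth(cache_json, now=None):
    """Given the JSON body of a cached SSO token, report whether the
    engineer must run `aws sso login` again."""
    now = now or datetime.now(timezone.utc)
    expires = datetime.fromisoformat(
        json.loads(cache_json)["expiresAt"].replace("Z", "+00:00")
    )
    return now >= expires

# A token issued this morning, checked at noon, is still valid:
token = '{"expiresAt": "2026-01-29T20:00:00Z"}'
print(needs_reauth(token, now=datetime(2026, 1, 29, 12, 0, tzinfo=timezone.utc)))
# -> False
```

A wrapper like this in a shell prompt or pre-deploy hook avoids mid-command expiry surprises.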
The role determines permissions. We created three base roles:

| Role | Permissions | Access Grant |
| --- | --- | --- |
| ReadOnly | View resources, read logs, describe infrastructure | Each engineer starts with ReadOnly |
| Developer | Deploy services, modify non-sensitive config | Granted during on-call rotation |
| Admin | Full access, reserved for infrastructure changes | Requires approval and justification |
Service Access Pattern
Previously, our ECS task definitions were structured as follows:

```json
{
  "environment": [
    {
      "name": "AWS_ACCESS_KEY_ID",
      "value": "AKIA..."
    },
    {
      "name": "AWS_SECRET_ACCESS_KEY",
      "value": "..."
    }
  ]
}
```
After:
```json
{
  "taskRoleArn": "arn:aws:iam::ACCOUNT:role/ProductionAPIServiceRole",
  "executionRoleArn": "arn:aws:iam::ACCOUNT:role/ECSTaskExecutionRole"
}
```
The task role provides access to AWS services. Credentials are supplied automatically by the ECS agent and rotate every hour. The application code does not change; the AWS SDK takes care of refreshing the credentials.
For our Python services:
```python
# Before: explicit credentials read from the environment
import os
import boto3

s3_client = boto3.client(
    's3',
    aws_access_key_id=os.environ['AWS_ACCESS_KEY_ID'],
    aws_secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY'],
)
```

```python
# After: automatic credential resolution via the task role
import boto3

s3_client = boto3.client('s3')
```
The SDK automatically detects the ECS credential provider endpoint. No code changes are necessary beyond removing the credential configuration.
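What "automatically detects" means here can be made concrete. On ECS, the agent injects an `AWS_CONTAINER_CREDENTIALS_RELATIVE_URI` environment variable, and the SDK fetches short-lived credentials from the task metadata endpoint at `169.254.170.2`. A sketch of that resolution step (the `credentials_url` helper is hypothetical, not SDK API):

```python
import os

# Link-local address of the ECS task metadata / credentials endpoint.
ECS_METADATA_HOST = "http://169.254.170.2"

def credentials_url(environ=os.environ):
    """Build the URL the SDK polls for rotating task-role credentials,
    or None when not running on ECS (other providers are tried next)."""
    uri = environ.get("AWS_CONTAINER_CREDENTIALS_RELATIVE_URI")
    if uri is None:
        return None
    return ECS_METADATA_HOST + uri

# Example with a made-up relative URI as ECS would inject it:
print(credentials_url({"AWS_CONTAINER_CREDENTIALS_RELATIVE_URI": "/v2/credentials/abc123"}))
# -> http://169.254.170.2/v2/credentials/abc123
```

Because the credentials come from a link-local endpoint rather than the environment, nothing long-lived is ever written into the task definition.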
CI/CD Access Pattern
In our GitHub Actions workflows, deployments relied on stored secrets:

```yaml
# Before
- name: Deploy to production
  env:
    AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
    AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
  run: |
    terraform apply
```
We migrated to OIDC authentication (note that the job also needs `permissions: id-token: write` for GitHub to issue the OIDC token):

```yaml
# After
- name: Configure AWS credentials
  uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: arn:aws:iam::ACCOUNT:role/GitHubActionsDeployRole
    aws-region: us-east-1

- name: Deploy to production
  run: |
    terraform apply
```
The OIDC flow works as follows:
GitHub Actions generates a signed token for the workflow run
The token includes repository, branch and workflow metadata
AWS verifies the signature of the token using GitHub's public keys
If valid, AWS provides temporary credentials for the role
The credentials expire when the workflow completes (5-15 minutes)
The IAM role trust policy controls which repositories can use this role:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::ACCOUNT:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com",
          "token.actions.githubusercontent.com:sub": "repo:roulettetech/infrastructure:ref:refs/heads/main"
        }
      }
    }
  ]
}
```
Only workflows in the infrastructure repository, running on the main branch, can assume this role. A compromised developer laptop is unable to produce valid tokens. A leaked secret cannot be replayed.
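The `StringEquals` condition amounts to an exact match on the token's claims. A simplified sketch of the check AWS performs (real policy evaluation also supports wildcards via `StringLike` and other operators; `can_assume` is a hypothetical helper):

```python
# The claim values pinned by the trust policy: only a token whose
# aud and sub claims match exactly can assume the deploy role.
ALLOWED = {
    "aud": "sts.amazonaws.com",
    "sub": "repo:roulettetech/infrastructure:ref:refs/heads/main",
}

def can_assume(claims):
    """Exact-match evaluation of the StringEquals condition."""
    return all(claims.get(key) == value for key, value in ALLOWED.items())

good = {"aud": "sts.amazonaws.com",
        "sub": "repo:roulettetech/infrastructure:ref:refs/heads/main"}
bad = {"aud": "sts.amazonaws.com",
       "sub": "repo:attacker/infrastructure:ref:refs/heads/main"}
print(can_assume(good), can_assume(bad))
# -> True False
```

Pinning the `sub` claim down to repository and branch is what makes a stolen token from any other workflow useless.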
What This Actually Changed
Security Improvements
Automatic Expiration
Credentials expire by design, not by policy
No process to forget to rotate
Limited blast radius in the event of a breach
Identity Binding
Every action is tied to a real identity
CloudTrail shows who took each action
No sharing of credentials among team members
Reduced Attack Surface
No credentials in environment variables
No secrets in source control
No long-lived keys in CI/CD systems
Audit Trail
SSO login activity is recorded
Role assumptions recorded
Session duration is recorded
Operational Impact
Developer Experience
One login per day instead of managing access keys
No need to be reminded to rotate credentials
MFA is handled at the SSO level, not the service level
Incident Response
Revoke access by disabling SSO account
Credentials expire after 8 hours
No key hunting in emergency situations
Onboarding/Offboarding
New engineer: SSO access and role assignment
Leaving engineer: Disable SSO account, access immediately terminated
No credential cleanup checklist
Cost
AWS SSO: free
No thirdparty PAM solution needed
Engineering time: ~40 hours for full migration
What We Explicitly Accepted
This was a deliberately conservative approach. We did not attempt perfect security; we aimed for measurably more secure, with reasonable operational overhead.
Accepted Limitations:
Role-based access is coarse-grained
Three roles may be too broad a categorization
Engineers with the Developer role have more access than any single task requires
Fine-grained access would have meant a role per engineer
8-hour sessions are long
Credentials remain valid for a full working day
Compromise within that window is still possible
Shorter sessions would have hindered debugging
Break-glass accounts are available
Emergency root access available
Stored in 1Password, protected with MFA
Used twice in six months for SSO outages
No network-level isolation
Credential controls alone do not prevent lateral movement
We accepted that services can reach each other
Network segmentation would have meant a VPC overhaul
CloudTrail lag
Up to 15 minutes before events are logged
No real-time monitoring
A compromise could be underway for minutes before detection
We made this set of trade-offs because it fits our team size, our needs and our security concerns. Your mileage may vary depending on your team.
Measurement and Validation
We monitored the following metrics to validate the decision.
Before Migration
Number of active IAM access keys: 23
Average age of access keys: 147 days
Number of keys older than 90 days: 14
Number of credential rotation incidents: 0 (rotation never happened)
After Migration (6 months)
Number of active IAM access keys: 0 (excluding break-glass)
Average session duration: 4.2 hours
Number of failed authentications: 312 (all from expired sessions)
Time to revoke access when offboarding an employee: < 5 minutes
Security Audit Results
SOC 2 report: "Adequate credential management controls"
Previous report: "Recommendation: Implement credential rotation"
Customer security questionnaire blockers: 0
Investor security review: Passed without any credential concerns
Developer Feedback
Time to onboard AWS access: 15 minutes
"Have to re-auth once per day" rated an acceptable level of friction
Incidents caused by credential problems: 1 (a session expired mid-deploy; a retry resolved it)
When This Approach Breaks Down
This model works for us at Roulette Technologies. It will not work forever.
Signals that it's time to evolve further:
Team size grows beyond 20 engineers
Three base roles no longer provide enough granularity
More permission levels are needed
The number of roles required starts to explode
Compliance needs change
An 8-hour session is too long for PCI DSS compliance
Need to support just-in-time access provisioning
Need to support approvals for access escalation
Multi-cloud environment emerges
AWS SSO is not enough; GCP and Azure need support
Need support for a unified identity model
Need support for credential management across multiple clouds
A service mesh with mutual TLS emerges
Machine identities need more granularity
Need support for certificate authentication
SPIFFE, SPIRE, etc., are a natural fit
Third-party services need access to the production environment
Vendors or contractors need temporary access
Time-based access with automatic expiration
Audit trail required for non-employee access
The Next Evolution
When these conditions are met, the next evolution might include:
Dedicated platform team to manage the identity infrastructure
Third-party PAM solution for just-in-time access
Policy as code for permission management
Certificate-based authentication for service-to-service traffic
But these are future problems. For a team of seven engineers with tightly controlled scope, we solved the problem we had with shortlived credentials.
Lessons Learned
Identity, not Credentials
Anchoring access to Google Workspace identities solved the problem
Single source of truth for who has access
Offboarding is now automatic
Platform Primitives
AWS SSO, ECS Task Roles and OIDC are free
Developing a custom solution would have taken us months
Platform primitives improve over time
Operational Simplicity > Theoretical Security
Perfect security isolation would have broken debugging
Engineers need production access to fix production
Security that prevents work gets bypassed
Measure the Before State
Understanding we had 23 active keys proved the case
Average age of 147 days was damning
Numbers convinced leadership to prioritize migration
Migration is Incremental
Started with new services on task roles
Next, migrated CI/CD (highest risk)
Last, migrated engineer access (most disruptive)
Documentation is more important than you think
Engineers needed runbook on new auth flow
"How do I debug production?" became FAQ
Documentation helped reduce support calls
This Article Is Not About Credentials
This engineering decision record (EDR) is about:
Designing with constraints, not ideals
We didn't implement zero trust architecture
We implemented what we could with what we had
Good enough today, perfect never
Designing security as an operational reality
Security measures need to work with how we work
Engineers will circumvent security if it gets in the way
Usability is a security feature
Designing security with tradeoffs
8-hour sessions are long, but acceptable
Coarse-grained roles are imperfect, but workable
Honesty builds trust
Designing security with a sense of evolution
What we have today won't scale 10x
We defined clear signals for change
Architecture is never finished
Designing security with realworld experiences
The departed engineer's still-active credentials were a wake-up call
Abstract security threats won't motivate change
Realworld security exposure drives realworld change
The Final Question
The credential ban wasn't about best practices. The credential ban was about solving a real problem.
If you're a three-person startup, chances are you don't need this right now. If you're a 50-person company, you needed this yesterday. Whatever your size, ask yourself one question:
Do you know how many active credentials exist in your production environment right now?
If the answer is no, you have the same problem we had. And if you discover the answer at 2am in the middle of an incident, trust us: you'll wish you had solved it sooner.
Short-Lived Credentials
AWS Identity
Pragmatic Security