Roulette Technologies had achieved product-market fit. The product was working with real customer data, revenue was increasing and the number of engineers on our team had grown from two founders to seven.
And then we discovered that a former backend engineer who had left weeks earlier still had active AWS credentials.
We found this out when we audited our access controls. Her access keys, issued eight months earlier during an urgent production fix, were still valid. She had full access to our primary RDS instance. She could decrypt secrets from Parameter Store. She could update S3 bucket policies.
Editorial context:
Build Roulette documents production-informed decisions based on a combination of direct experience and observed industry patterns. Specific details are representative, not exhaustive.
The engineer wasn't malicious, and her credentials weren't compromised. But that's exactly the problem: the exposure was real, and we had no idea it existed until we audited our access controls.
This is a class of problem teams routinely discover too late, and it represented a real failure mode in our security posture.

The question we asked ourselves: How do we ensure that access credentials expire, are tied to real identities, and leave an audit trail we can actually trust?
The Constraint Surface
Before we discuss what we built, let's be clear about what we were working with:
Team Reality
- Seven generalist engineers. No security team.
- Everyone deployed to production. Everyone needed AWS access.
- On-call rotations meant debugging at 2am was a regular occurrence.
Technical Constraints
- Services were deployed on ECS Fargate and Lambda
- Infrastructure was managed with Terraform
- CI/CD pipeline was implemented with GitHub Actions
- Primary cloud provider: AWS
- Budget: Tight controls, cost accountability per service
Security Reality
- SOC 2 Type II audit required in six months
- Customer contracts now required security questionnaires
- Past breach at competitor had spooked investors
Operational Reality
- Engineers required production access
- Deployments occurred multiple times per day
- Breaking the deployment pipeline was not acceptable
- Adding 10 minutes to the dev loop would have killed productivity
What We Explicitly Did Not Consider
This problem statement deliberately did not consider the following:
Zero trust architecture with mutual TLS everywhere
- Required platform team we didn't have
- Would have added months to our timeline
- Operational complexity beyond our team's capacity
Hardware security keys for all engineers
- Procurement and distribution logistics
- Remote team with time zone differences
- Cost per engineer not yet justified
Completely locked-down production environment with ticket-based access
- Would have broken our debugging workflow
- Would have created dependency on approval process
- Would have slowed down our incident response times unacceptably
Using a third-party privileged access management platform
- Costs $50K+ per year
- Would have added work to our integration queue, delaying other priorities
- Would have created vendor lock-in risk
We were trying to find something that fit our reality: our team size, the tools we already had, and the fact that our engineers debug production problems at 2am.
The Decision: Short-Lived Credentials Everywhere
We decided to remove all long-lived credentials from production systems. Every authentication would use short-lived credentials with a defined expiry.
For Engineers (Human Access)
- AWS SSO with role assumption
- Maximum session duration: 8 hours
- Re-authentication required daily
- MFA enabled at the SSO layer
For Services (Machine Access)
- ECS Task Roles for containers
- Lambda execution roles for functions
- No IAM access keys in environment variables
- No secrets in code or configuration files
For CI/CD (Automation Access)
- GitHub Actions OIDC provider
- Short-lived credentials issued per workflow run
- Credentials expire when workflow is complete
- No secrets stored in GitHub
Credential Lifetime Policy
- Human sessions: 8 hours maximum
- Service credentials: Automatic rotation by AWS
- CI/CD credentials: Scoped to workflow execution (5-15 minutes)
- Emergency break glass: 1 hour maximum, logged and alerted
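To keep these limits from drifting, a policy table like this can live in code that provisioning tooling consults before issuing a session. A minimal sketch, assuming a helper of our own design (the `MAX_TTL` map and `validate_ttl` function are illustrative, not part of any AWS API):

```python
from datetime import timedelta

# Maximum credential lifetime per principal class, mirroring the policy above.
# These names are illustrative, not a real AWS API.
MAX_TTL = {
    "human": timedelta(hours=8),
    "service": timedelta(hours=1),      # rotated automatically by AWS
    "cicd": timedelta(minutes=15),
    "break_glass": timedelta(hours=1),  # logged and alerted
}

def validate_ttl(principal_class: str, requested: timedelta) -> timedelta:
    """Clamp a requested session duration to the policy maximum."""
    cap = MAX_TTL[principal_class]
    return min(requested, cap)
```

Encoding the policy once means a review of credential lifetimes is a one-file diff rather than an audit of scattered settings.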
Implementation Details
Human Access Pattern
Engineers authenticate through AWS SSO, which is integrated with our Google Workspace identity provider, so they sign in with their existing Google Workspace credentials. When an engineer needs access to production, they assume a role with the necessary permissions.
```bash
# Engineer logs in once per day
aws sso login --profile production

# Credentials are refreshed automatically
aws s3 ls --profile production
```
Behind the scenes:
- AWS SSO verifies identity with Google Workspace
- MFA challenge sent to user (Google Authenticator or physical token)
- Temporary credentials created to assume role
- Credentials stored locally, expire after 8 hours
- Credentials auto-refresh if within session window
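The expiry check at the end of that flow is just a timestamp comparison against the token the CLI caches on disk. A sketch of that check, assuming the cached entry is JSON with an ISO-8601 `expiresAt` field as the AWS CLI writes it (the helper itself is ours, not an AWS API):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def sso_token_valid(cache_file: Path, now=None) -> bool:
    """Check whether a cached AWS SSO token is still within its session window.

    Only inspects the local cache entry; it does not talk to AWS.
    """
    now = now or datetime.now(timezone.utc)
    entry = json.loads(cache_file.read_text())
    # `expiresAt` is ISO-8601 with a trailing "Z" in the CLI's cache files
    expires = datetime.fromisoformat(entry["expiresAt"].replace("Z", "+00:00"))
    return now < expires
```

When the check fails, the CLI prompts for a fresh `aws sso login`; there is no long-lived secret to fall back on.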
The role determines permissions. We created three base roles:
| Role | Permissions | Access Grant |
|---|---|---|
| ReadOnly | View resources, read logs, describe infrastructure | Default for every engineer |
| Developer | Deploy services, modify non-sensitive config | Granted during on-call rotation |
| Admin | Full access, reserved for infrastructure changes | Requires approval and justification |
Service Access Pattern
Previously, our ECS task definitions were structured as follows:
```json
{
  "environment": [
    {
      "name": "AWS_ACCESS_KEY_ID",
      "value": "AKIA..."
    },
    {
      "name": "AWS_SECRET_ACCESS_KEY",
      "value": "..."
    }
  ]
}
```
After:
```json
{
  "taskRoleArn": "arn:aws:iam::ACCOUNT:role/ProductionAPIServiceRole",
  "executionRoleArn": "arn:aws:iam::ACCOUNT:role/ECSTaskExecutionRole"
}
```
The task role grants access to AWS services. The ECS agent supplies the credentials automatically and rotates them every hour. Application code does not change: the AWS SDK handles credential refresh transparently.
For our Python services:
```python
import os
import boto3

# Before: explicit credentials
s3_client = boto3.client(
    's3',
    aws_access_key_id=os.environ['AWS_ACCESS_KEY_ID'],
    aws_secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY']
)

# After: automatic credential resolution
s3_client = boto3.client('s3')
```
The SDK automatically detects the ECS credential provider endpoint; no code changes are needed beyond removing the explicit credential configuration.
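The reason removing the keys is enough comes down to precedence in the SDK's credential provider chain. A simplified model of that ordering (the real chain in botocore has more sources; this toy function only mirrors the precedence relevant here):

```python
# Simplified model of the AWS SDK's credential resolution order. Removing the
# static env-var keys lets the next source in the chain -- the ECS container
# credential endpoint -- take over without any application changes.
def resolve_credential_source(env: dict) -> str:
    if "AWS_ACCESS_KEY_ID" in env and "AWS_SECRET_ACCESS_KEY" in env:
        return "static environment keys"             # what we removed
    if "AWS_CONTAINER_CREDENTIALS_RELATIVE_URI" in env:
        return "ECS task role (container endpoint)"  # what we rely on now
    if "AWS_WEB_IDENTITY_TOKEN_FILE" in env:
        return "web identity (OIDC)"
    return "instance metadata / shared config"
```

This is also why the migration was safe to do incrementally: a task with both keys and a task role keeps working, just on the wrong (static) credentials, until the keys are removed.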
CI/CD Access Pattern
In our GitHub Actions workflow, we were using the following stored secrets:
```yaml
# Before
- name: Deploy to production
  env:
    AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
    AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
  run: |
    terraform apply
```
We migrated to OIDC authentication:
```yaml
# After
- name: Configure AWS credentials
  uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: arn:aws:iam::ACCOUNT:role/GitHubActionsDeployRole
    aws-region: us-east-1

- name: Deploy to production
  run: |
    terraform apply
```
The OIDC flow works as follows:
- GitHub Actions generates a signed token for the workflow run
- The token includes repository, branch and workflow metadata
- AWS verifies the signature of the token using GitHub's public keys
- If valid, AWS provides temporary credentials for the role
- The credentials expire when the workflow completes (5-15 minutes)
The IAM role trust policy controls which repositories can use this role:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::ACCOUNT:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com",
          "token.actions.githubusercontent.com:sub": "repo:roulette-tech/infrastructure:ref:refs/heads/main"
        }
      }
    }
  ]
}
```
Only workflows in the `infrastructure` repository, running on the `main` branch, can assume this role. A compromised developer laptop cannot produce valid tokens. A leaked secret cannot be replayed.
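Conceptually, the condition block is an exact string match on the token's claims, performed by AWS after it verifies the token signature. A toy re-implementation of just that matching step (the cryptography is omitted entirely; this only mirrors the `StringEquals` semantics):

```python
# Claims required by the trust policy above. AWS checks these server-side
# after signature verification; this sketch reproduces only the matching.
REQUIRED_CLAIMS = {
    "aud": "sts.amazonaws.com",
    "sub": "repo:roulette-tech/infrastructure:ref:refs/heads/main",
}

def claims_allowed(claims: dict) -> bool:
    """True only if every required claim is present with the exact value."""
    return all(claims.get(k) == v for k, v in REQUIRED_CLAIMS.items())
```

A token minted for a feature branch carries a different `sub` claim, so it fails this match even though its signature is perfectly valid.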

What This Actually Changed
Security Improvements
Automatic Expiration
- Credentials expire, not through policy, but through design
- No process to forget to rotate
- Limited blast radius in the event of a breach
Identity Binding
- Every action is tied to a real identity
- CloudTrail shows who took each action
- No sharing of credentials among team members
Reduced Attack Surface
- No credentials in environment variables
- No secrets in source control
- No long-lived keys in CI/CD systems
Audit Trail
- SSO login activity is recorded
- Role assumptions recorded
- Session duration is recorded
Operational Impact
Developer Experience
- One login per day instead of managing access keys
- No need to be reminded to rotate credentials
- MFA is handled at the SSO level, not the service level
Incident Response
- Revoke access by disabling SSO account
- Credentials expire after 8 hours
- No key hunting in emergency situations
Onboarding/Offboarding
- New engineer: SSO access and role assignment
- Leaving engineer: Disable SSO account, access immediately terminated
- No credential cleanup checklist
Cost
- AWS SSO: free
- No third-party PAM solution needed
- Engineering time: ~40 hours for full migration
What We Explicitly Accepted
This was a deliberately conservative approach. We did not attempt perfect security; we aimed for measurably more secure, with reasonable operational overhead.
Accepted Limitations:
Role-based access is coarse-grained
- Three roles may be too broad a categorization
- Engineers with Developer role have more access than necessary for a particular task
- Fine-grained access would have meant a role per engineer
8-hour sessions are long
- Credentials remain valid for a full working day
- Compromise during this time window is still possible
- Shorter sessions would have hindered debugging
Break-glass accounts are available
- Emergency root access available
- Stored in 1Password, protected with MFA
- Used twice in six months for SSO outages
No network-level isolation
- Credentials are not sufficient to prevent lateral movement
- We assumed services would have access to each other
- Network segmentation would have meant a VPC overhaul
CloudTrail lag
- Up to 15 minutes before events appear in logs
- No real-time monitoring
- Compromise could be underway for minutes before detection
We made this set of trade-offs because it aligns with the size of our team, our needs and our security concerns. Your mileage may vary depending on your team.
Measurement and Validation
We monitored the following metrics to validate the decision.
Before Migration
- Number of active IAM access keys: 23
- Average age of access keys: 147 days
- Number of keys older than 90 days: 14
- Number of credential rotations performed: 0 (rotation never happened)
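Numbers like these come straight out of the IAM credential report. A sketch of how key ages can be computed from it, assuming the CSV from `aws iam get-credential-report` has been base64-decoded first (only the `access_key_1_*` columns are inspected here for brevity):

```python
import csv
import io
from datetime import datetime, timezone

def key_ages_days(report_csv: str, now=None):
    """Return ages in days of active access keys from an IAM credential report.

    Expects the decoded CSV body of the credential report; rows whose first
    access key is inactive are skipped.
    """
    now = now or datetime.now(timezone.utc)
    ages = []
    for row in csv.DictReader(io.StringIO(report_csv)):
        if row.get("access_key_1_active") == "true":
            rotated = datetime.fromisoformat(
                row["access_key_1_last_rotated"].replace("Z", "+00:00")
            )
            ages.append((now - rotated).days)
    return ages
```

Running something like this before the migration is what produced the 23-key, 147-day-average numbers that made the case to leadership.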
After Migration (6 months)
- Number of active IAM access keys: 0 (except break-glass)
- Average session duration: 4.2 hours
- Number of failed authentications: 312 (all expired sessions)
- Time to revoke access on offboarding employees: < 5 minutes
Security Audit Results
- SOC 2 report: "Adequate credential management controls"
- Previous report: "Recommendation: Implement credential rotation"
- Customer security questionnaire blockers: 0
- Investor security review: Passed without any credential concerns

Developer Feedback
- Time to onboard AWS access: 15 minutes
- "Have to re-auth once per day" (acceptable friction level)
- Number of incidents caused by credential problems: 1 (session expired during deploy, retry solved the issue)
When This Approach Breaks Down
This model works for us at Roulette Technologies. It will not work forever.
Signals that it's time to evolve further:
Team size grows beyond 20 engineers
- Three roles no longer provide enough granularity
- More permission levels become necessary
- The role count explodes and needs active management
Compliance needs change
- 8-hour sessions are too long for PCI DSS
- Just-in-time access provisioning becomes necessary
- Access escalation requires approval workflows
Multi-cloud environment emerges
- AWS SSO alone is not enough; GCP and Azure need coverage
- A unified identity model becomes necessary
- Credentials must be managed across multiple clouds
Service Mesh emerges with Mutual TLS
- Machine identities need more granularity
- Need support for certificate authentication
- SPIFFE, SPIRE, etc., are a natural fit
Third-party services need access to production environment
- Vendors or contractors need temporary access
- Time-based access with automatic expiration
- Audit trail for non-employee access required
The Next Evolution
When these conditions are met, the next evolution might include:
- Dedicated platform team to manage the identity infrastructure
- Third-party PAM solution for just-in-time access
- Policy as code for permission management
- Certificate-based authentication for service-to-service
But these are future problems. For a team of seven engineers with tightly controlled scope, we solved the problem we had with short-lived credentials.
Lessons Learned
Identity, not Credentials
- Binding access to Google Workspace identities solved the problem
- Single source of truth for who has access
- Offboarding is now automatic
Platform Primitives
- AWS SSO, ECS Task Roles and OIDC are free
- Developing a custom solution would have taken us months
- Platform primitives improve over time
Operational Simplicity > Theoretical Security
- Perfect security isolation would have broken debugging
- Engineers need production access to fix production
- Security that prevents work gets bypassed
Measure the Before State
- Understanding we had 23 active keys proved the case
- Average age of 147 days was damning
- Numbers convinced leadership to prioritize migration
Migration is Incremental
- Started with new services on task roles
- Next, migrated CI/CD (highest risk)
- Last, migrated engineer access (most disruptive)
Documentation is more important than you think
- Engineers needed runbook on new auth flow
- "How do I debug production?" became FAQ
- Documentation helped reduce support calls
This Article Is Not About Credentials
This article is really about:
Designing with constraints, not ideals
- We didn't implement zero trust architecture
- We implemented what we could with what we had
- Good enough today, perfect never
Designing security as an operational reality
- Security measures need to work with how we work
- Engineers will circumvent security if it gets in the way
- Usability is a security feature
Designing security with trade-offs
- 8-hour sessions are long, but acceptable
- Coarse-grained roles are imperfect, but workable
- Honesty builds trust
Designing security with a sense of evolution
- What we have today won't scale 10x
- We defined clear signals for change
- Architecture is never finished
Designing security with real-world experiences
- A former engineer's still-active credentials were a wake-up call
- Abstract security threats won't motivate change
- Real-world security exposure drives real-world change
The Final Question
The credential ban wasn't about best practices; it was about solving a real problem.
If you're a three-person startup, chances are you don't need this right now. If you're a 50-person company, you needed this yesterday. Whatever your size, ask yourself this question:
Do you know how many active credentials exist in your production environment right now?
If the answer is "no," you have the same problem we had. And if you find that out at 2am in the middle of an incident, trust us: you'll wish you had solved it sooner.