
BR102970 · Jan 29, 2026 · 13 min read

Why We Banned Long-Lived Cloud Credentials in Production

Roulette Technologies had reached product-market fit. The product handled real customer data, revenue was growing, and the team had expanded from two founders to seven engineers. Then, during an audit of our access controls, we discovered that a former backend engineer who had left weeks earlier still had active AWS credentials. Her access keys, issued eight months earlier during an urgent production fix, were still valid. She had full access to our primary RDS instance. She could decrypt secrets from Parameter Store. She could update S3 bucket policies.

Editorial context: Build Roulette documents production-informed decisions based on a combination of direct experience and observed industry patterns. Specific details are representative, not exhaustive.

The engineer wasn't malicious. Her credentials weren't compromised. But that is exactly the problem: the exposure was real, and we had no idea it existed until we audited our access controls. This is a class of problem teams routinely discover too late, and it represented a real failure mode in our security posture. The question we asked ourselves was: how do we ensure that access credentials expire, are tied to real identities, and produce an audit trail we can trust?

The Constraint Surface

Before we discuss what we built, let's be clear about what we were working with.

Team Reality
- Seven generalist engineers; no security team
- Everyone deployed to production; everyone needed AWS access
- On-call rotations meant debugging at 2am was a regular occurrence
Technical Constraints
- Services deployed on ECS Fargate and Lambda
- Infrastructure managed with Terraform
- CI/CD pipeline implemented with GitHub Actions
- Primary cloud provider: AWS
- Budget: tight controls, cost accountability per service

Security Reality
- SOC 2 Type II audit required in six months
- Customer contracts now required security questionnaires
- A past breach at a competitor had spooked investors

Operational Reality
- Engineers required production access
- Deployments occurred multiple times per day
- Breaking the deployment pipeline was not acceptable
- Adding 10 minutes to the dev loop would be a productivity killer

What We Explicitly Did Not Consider

This problem statement deliberately ruled out the following:

Zero-trust architecture with mutual TLS everywhere
- Required a platform team we didn't have
- Would have added months to our timeline
- Operational complexity beyond our team's capacity

Hardware security keys for all engineers
- Procurement and distribution logistics
- Remote team with time zone differences
- Cost per engineer not yet justified

Completely locked-down production environment with ticket-based access
- Would have broken our debugging workflow
- Would have created a dependency on an approval process
- Would have slowed incident response unacceptably

A third-party privileged access management platform
- Cost: $50K+ per year
- Would have added work to our integration queue, delaying other priorities
- Would have created vendor lock-in risk

We were looking for something that fit our reality: our team size, the tools we already had, and the fact that our engineers debug production problems at 2am.

The Decision: Short-Lived Credentials Everywhere

We decided to remove all long-lived credentials from production systems. Every authentication would use short-lived credentials with a fixed expiry.
For Engineers (Human Access)
- AWS SSO with role assumption
- Maximum session duration: 8 hours
- Re-authentication required daily
- MFA enforced at the SSO layer

For Services (Machine Access)
- ECS task roles for containers
- Lambda execution roles for functions
- No IAM access keys in environment variables
- No secrets in code or configuration files

For CI/CD (Automation Access)
- GitHub Actions OIDC provider
- Short-lived credentials issued per workflow run
- Credentials expire when the workflow completes
- No secrets stored in GitHub

Credential Lifetime Policy
- Human sessions: 8 hours maximum
- Service credentials: automatic rotation by AWS
- CI/CD credentials: scoped to workflow execution (5-15 minutes)
- Emergency break glass: 1 hour maximum, logged and alerted

Implementation Details

Human Access Pattern

Engineers authenticate through AWS SSO, which is integrated with our Google Workspace identity provider. Engineers sign in with their Google Workspace credentials, then assume a role with the access they need in production.

```shell
# Engineer logs in once per day
aws sso login --profile production

# Credentials are refreshed automatically
aws s3 ls --profile production
```

Behind the scenes:
1. AWS SSO verifies identity with Google Workspace
2. MFA challenge sent to the user (Google Authenticator or a physical token)
3. Temporary credentials created for the assumed role
4. Credentials stored locally, expiring after 8 hours
5. Credentials auto-refresh while within the session window

The role determines permissions.
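Logging in with a named profile, as above, assumes the profile is defined in `~/.aws/config`. A minimal sketch of what that might look like; the start URL, account ID, and role name here are placeholders, not our actual values:

```ini
# ~/.aws/config -- illustrative SSO profile (all values are placeholders)
[profile production]
sso_start_url = https://example.awsapps.com/start
sso_region = us-east-1
sso_account_id = 123456789012
sso_role_name = Developer
region = us-east-1
output = json
```

The `sso_role_name` maps to one of the base roles described below, so switching roles is just a matter of selecting a different profile.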
We created three base roles:

| Role | Permissions | Access Grant |
|---|---|---|
| ReadOnly | View resources, read logs, describe infrastructure | Every engineer starts with ReadOnly |
| Developer | Deploy services, modify non-sensitive config | Granted on the on-call rotation |
| Admin | Full access, reserved for infrastructure changes | Requires approval and justification |

Service Access Pattern

Previously, our ECS task definitions embedded static keys:

```json
{
  "environment": [
    { "name": "AWS_ACCESS_KEY_ID", "value": "AKIA..." },
    { "name": "AWS_SECRET_ACCESS_KEY", "value": "..." }
  ]
}
```

After:

```json
{
  "taskRoleArn": "arn:aws:iam::ACCOUNT:role/ProductionAPIServiceRole",
  "executionRoleArn": "arn:aws:iam::ACCOUNT:role/ECSTaskExecutionRole"
}
```

The task role grants access to AWS services. The ECS agent provides the credentials automatically and rotates them every hour. Application code does not change; the AWS SDK takes care of refreshing the credentials.

For our Python services:

```python
import os
import boto3

# Before: explicit credentials
s3_client = boto3.client(
    's3',
    aws_access_key_id=os.environ['AWS_ACCESS_KEY_ID'],
    aws_secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY'],
)

# After: automatic credential resolution
s3_client = boto3.client('s3')
```

The SDK automatically detects the ECS credential provider endpoint. No code changes are necessary beyond removing the credential configuration.
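During the migration it helped to check task definitions for leftover static keys before deploying. A minimal sketch in Python; the helper name is ours, but the `containerDefinitions`/`environment` shape follows the standard ECS task definition format:

```python
# Flag ECS container definitions that still carry static AWS keys
# in environment variables -- the pattern the migration removes.
FORBIDDEN_VARS = {"AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"}

def find_static_credentials(task_definition: dict) -> list:
    """Return the names of offending env vars across all containers."""
    offenders = []
    for container in task_definition.get("containerDefinitions", []):
        for var in container.get("environment", []):
            if var.get("name") in FORBIDDEN_VARS:
                offenders.append(var["name"])
    return offenders

# Illustrative task definition with one leftover key
task_def = {
    "containerDefinitions": [
        {"environment": [{"name": "AWS_ACCESS_KEY_ID", "value": "AKIA..."}]}
    ]
}
print(find_static_credentials(task_def))  # -> ['AWS_ACCESS_KEY_ID']
```

A check like this can run in CI against rendered task definitions, so a reverted or copy-pasted definition cannot quietly reintroduce static keys.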
CI/CD Access Pattern

Our GitHub Actions workflows used stored secrets:

```yaml
# Before
- name: Deploy to production
  env:
    AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
    AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
  run: |
    terraform apply
```

We migrated to OIDC authentication:

```yaml
# After
- name: Configure AWS credentials
  uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: arn:aws:iam::ACCOUNT:role/GitHubActionsDeployRole
    aws-region: us-east-1

- name: Deploy to production
  run: |
    terraform apply
```

The OIDC flow works as follows:
1. GitHub Actions generates a signed token for the workflow run
2. The token includes repository, branch, and workflow metadata
3. AWS verifies the token signature using GitHub's public keys
4. If valid, AWS issues temporary credentials for the role
5. The credentials expire when the workflow completes (5-15 minutes)

The IAM role trust policy controls which repositories can use the role:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::ACCOUNT:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com",
          "token.actions.githubusercontent.com:sub": "repo:roulettetech/infrastructure:ref:refs/heads/main"
        }
      }
    }
  ]
}
```

Only workflows in the infrastructure repository, running on the main branch, can assume this role. A compromised developer laptop cannot produce valid tokens, and a leaked secret cannot be replayed.
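AWS evaluates the trust-policy condition server-side, but the logic amounts to an exact string match on the token's `aud` and `sub` claims. A small Python sketch of the same check; the function is purely illustrative, not part of any AWS SDK:

```python
# Mirror of the trust-policy condition: only the main branch of the
# infrastructure repository may assume the deploy role.
ALLOWED_SUB = "repo:roulettetech/infrastructure:ref:refs/heads/main"

def may_assume_role(token_claims: dict) -> bool:
    """Accept only tokens whose audience and subject match exactly."""
    return (
        token_claims.get("aud") == "sts.amazonaws.com"
        and token_claims.get("sub") == ALLOWED_SUB
    )

print(may_assume_role({"aud": "sts.amazonaws.com",
                       "sub": ALLOWED_SUB}))  # -> True

# A feature branch in the same repository is rejected
print(may_assume_role({
    "aud": "sts.amazonaws.com",
    "sub": "repo:roulettetech/infrastructure:ref:refs/heads/feature-x",
}))  # -> False
```

Because the match is exact, broadening access (for example, to pull-request workflows) requires deliberately widening the `sub` condition rather than anything leaking by default.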
What This Actually Changed

Security Improvements

Automatic expiration
- Credentials expire by design, not by policy
- No rotation process to forget
- Limited blast radius in the event of a breach

Identity binding
- Every action is tied to a real identity
- CloudTrail shows who took each action
- No credential sharing among team members

Reduced attack surface
- No credentials in environment variables
- No secrets in source control
- No long-lived keys in CI/CD systems

Audit trail
- SSO login activity recorded
- Role assumptions recorded
- Session durations recorded

Operational Impact

Developer experience
- One login per day instead of managing access keys
- No reminders to rotate credentials
- MFA handled at the SSO level, not per service

Incident response
- Revoke access by disabling the SSO account
- Credentials expire within 8 hours
- No key hunting in an emergency

Onboarding/offboarding
- New engineer: SSO access and role assignment
- Departing engineer: disable the SSO account; access terminates immediately
- No credential cleanup checklist

Cost
- AWS SSO: free
- No third-party PAM solution needed
- Engineering time: ~40 hours for the full migration

What We Explicitly Accepted

This was a deliberately conservative approach. We did not attempt perfect security; we aimed for measurably more secure, with reasonable operational overhead.
Accepted limitations:

Role-based access is coarse-grained
- Three roles is a broad categorization
- Engineers with the Developer role have more access than a given task requires
- Fine-grained access would have meant a role per engineer

8-hour sessions are long
- Credentials remain valid for a full working day
- Compromise within that window is still possible
- Shorter sessions would have hindered debugging

Break-glass accounts exist
- Emergency root access is available
- Stored in 1Password, protected with MFA
- Used twice in six months, both times for SSO outages

No network-level isolation
- Credentials alone do not prevent lateral movement
- We assumed services would be able to reach each other
- Network segmentation would have meant a VPC overhaul

CloudTrail lag
- Up to 15 minutes before events are logged
- No real-time monitoring
- A compromise could be underway for minutes before detection

We made this set of trade-offs because it fits our team size, our needs, and our security concerns. Your mileage may vary depending on your team.

Measurement and Validation

We monitored the following metrics to validate the decision.
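The before/after numbers that follow come from simple arithmetic over an access-key inventory (pulled via the IAM API). A minimal sketch of that arithmetic; the helper name and inventory are illustrative, not our actual data:

```python
from datetime import date

def key_age_metrics(keys, today):
    """Summarize an inventory of (key_id, created_date) pairs."""
    ages = [(today - created).days for _, created in keys]
    return {
        "active_keys": len(ages),
        "average_age_days": sum(ages) / len(ages) if ages else 0,
        "older_than_90_days": sum(1 for a in ages if a > 90),
    }

# Illustrative inventory -- in practice, built from IAM list-access-keys output
inventory = [
    ("AKIA...1", date(2025, 1, 10)),
    ("AKIA...2", date(2025, 6, 1)),
    ("AKIA...3", date(2025, 7, 20)),
]
print(key_age_metrics(inventory, today=date(2025, 8, 1)))
# -> {'active_keys': 3, 'average_age_days': 92.0, 'older_than_90_days': 1}
```

Running a report like this on a schedule also catches regressions: after the migration, any nonzero `active_keys` count (beyond break-glass) is itself an alert.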
Before migration
- Active IAM access keys: 23
- Average age of access keys: 147 days
- Keys older than 90 days: 14
- Credential rotation incidents: 0 (rotation never happened)

After migration (6 months)
- Active IAM access keys: 0 (except break-glass)
- Average session duration: 4.2 hours
- Failed authentications: 312 (all expired sessions)
- Time to revoke access on offboarding: < 5 minutes

Security audit results
- SOC 2 report: "Adequate credential management controls"
- Previous report: "Recommendation: Implement credential rotation"
- Customer security questionnaire blockers: 0
- Investor security review: passed without any credential concerns

Developer feedback
- Time to onboard AWS access: 15 minutes
- "Have to re-auth once per day": acceptable friction
- Incidents caused by credential problems: 1 (a session expired mid-deploy; a retry resolved it)

When This Approach Breaks Down

This model works for us at Roulette Technologies. It will not work forever.
Signals that it's time to evolve further:

Team size grows beyond 20 engineers
- Three roles no longer provide enough granularity
- More permission levels become necessary
- Role explosion has to be actively managed

Compliance needs change
- An 8-hour session is too long for PCI DSS
- Just-in-time access provisioning becomes necessary
- Approvals for access escalation become necessary

A multi-cloud environment emerges
- AWS SSO alone is not enough; GCP and Azure need coverage
- A unified identity model becomes necessary
- Credential management must span multiple clouds

A service mesh with mutual TLS emerges
- Machine identities need more granularity
- Certificate-based authentication becomes necessary
- SPIFFE/SPIRE become a natural fit

Third parties need access to production
- Vendors or contractors need temporary access
- Time-based access with automatic expiration is required
- An audit trail for non-employee access is required

The Next Evolution

When these conditions are met, the next evolution might include:
- A dedicated platform team to run the identity infrastructure
- A third-party PAM solution for just-in-time access
- Policy as code for permission management
- Certificate-based authentication for service-to-service traffic

But these are future problems. For a team of seven engineers with tightly controlled scope, short-lived credentials solved the problem we had.
Lessons Learned

Identity, not credentials
- Anchoring access to Google Workspace identities solved the problem
- A single source of truth for who has access
- Offboarding is now automatic

Platform primitives
- AWS SSO, ECS task roles, and OIDC are free
- A custom solution would have taken months
- Platform primitives improve over time

Operational simplicity > theoretical security
- Perfect isolation would have broken debugging
- Engineers need production access to fix production
- Security that prevents work gets bypassed

Measure the before state
- Knowing we had 23 active keys proved the case
- An average key age of 147 days was damning
- Numbers convinced leadership to prioritize the migration

Migration is incremental
- Started with new services on task roles
- Next, migrated CI/CD (highest risk)
- Last, migrated engineer access (most disruptive)

Documentation matters more than you think
- Engineers needed a runbook for the new auth flow
- "How do I debug production?" became an FAQ
- Documentation reduced support requests

This Article Is Not About Credentials

This EDR is about:

Designing with constraints, not ideals
- We didn't implement a zero-trust architecture
- We implemented what we could with what we had
- Good enough today beats perfect never

Designing security as an operational reality
- Security measures need to work with how we work
- Engineers will circumvent security that gets in the way
- Usability is a security feature

Designing security with trade-offs
- 8-hour sessions are long, but acceptable
- Coarse-grained roles are imperfect, but workable
- Honesty about trade-offs builds trust

Designing security with a sense of evolution
- What we have today won't scale 10x
- We defined clear signals for change
- Architecture is never finished

Designing security from real-world experience
- Sarah's still-active credentials were a wake-up call
- Abstract security threats don't motivate change
- Real-world exposure drives real-world change

The Final Question

The credential ban wasn't about best practices.
The credential ban was about solving a real problem. If you're a three-person startup, you probably don't need this yet. If you're a 50-person company, you needed it yesterday. Whatever your size, ask yourself one question: do you know how many active credentials exist in your production environment right now?

If the answer is no, you have the same problem we had. And if you discover the answer at 2am in the middle of an incident, trust us: you'll wish you had solved it sooner.

Tags: Short-Lived Credentials · AWS · Identity · Pragmatic Security