Evidence-Based Design. Production Verified.

System Record Intelligence.

Explore our Engineering Decision Records, Architecture Reviews, and Well-Architected case studies. Every document reflects real-world production constraints.

BR102970 · Jan 29, 2026 · 13 min read

Why We Banned Long-Lived Cloud Credentials in Production

Roulette Technologies had reached product–market fit. The product was handling real customer data, revenue was growing, and the engineering team had expanded from two founders to seven. Then, during an audit of our access controls, we discovered that a former backend engineer who had left weeks earlier still had active AWS credentials. Her access keys, issued eight months earlier during an urgent production fix, were still valid. She had full access to our primary RDS instance. She could decrypt secrets from Parameter Store. She could update S3 bucket policies.

Editorial context: Build Roulette documents production-informed decisions based on a combination of direct experience and observed industry patterns. Specific details are representative, not exhaustive.

The engineer wasn't malicious. Her credentials weren't compromised. But that is exactly the problem: the exposure was real, and we had no idea it existed until we audited our access controls. This is a class of problem teams routinely discover too late, and it represented a genuine failure mode in our security posture.

The question we asked ourselves: how do we ensure that access credentials expire, are bound to real identities, and produce an audit trail we can actually trust?

The Constraint Surface

Before discussing what we built, let's be clear about what we were working with.

Team Reality
- Seven generalist engineers
- No security team
- Everyone deployed to production
- Everyone needed AWS access
- On-call rotations meant debugging at 2am was a regular occurrence
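An audit like the one that surfaced the stale credentials can be approximated with a short script. The sketch below is illustrative rather than the team's actual tooling: it assumes key records shaped like entries from IAM's `list_access_keys` response and a roster of current employees, and it flags active keys belonging to anyone no longer on the roster.

```python
from datetime import datetime, timezone

def find_orphaned_keys(access_keys, current_employees):
    """Flag active access keys whose owner is no longer employed.

    `access_keys` holds dicts shaped roughly like IAM list_access_keys
    entries (UserName, AccessKeyId, Status, CreateDate).
    `current_employees` is the set of user names still on the team.
    """
    orphaned = []
    for key in access_keys:
        if key["Status"] == "Active" and key["UserName"] not in current_employees:
            age_days = (datetime.now(timezone.utc) - key["CreateDate"]).days
            orphaned.append((key["AccessKeyId"], key["UserName"], age_days))
    return orphaned

# Hypothetical data illustrating the scenario from the audit.
keys = [
    {"UserName": "alice", "AccessKeyId": "AKIA...01", "Status": "Active",
     "CreateDate": datetime(2025, 5, 1, tzinfo=timezone.utc)},
    {"UserName": "former-engineer", "AccessKeyId": "AKIA...02", "Status": "Active",
     "CreateDate": datetime(2025, 5, 1, tzinfo=timezone.utc)},
]
print(find_orphaned_keys(keys, current_employees={"alice"}))
```

In a real audit the input would come from iterating `iam.list_users()` and `iam.list_access_keys(UserName=...)`; the decision logic stays the same.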
Technical Constraints
- Services deployed on ECS Fargate and Lambda
- Infrastructure managed with Terraform
- CI/CD pipeline implemented with GitHub Actions
- Primary cloud provider: AWS
- Budget: tight controls, cost accountability per service

Security Reality
- SOC 2 Type II audit required in six months
- Customer contracts now required security questionnaires
- A past breach at a competitor had spooked investors

Operational Reality
- Engineers required production access
- Deployments occurred multiple times per day
- Breaking the deployment pipeline was not acceptable
- Adding 10 minutes to the dev loop would be a productivity killer

What We Explicitly Did Not Consider

This problem statement deliberately excluded the following:

Zero-trust architecture with mutual TLS everywhere
- Required a platform team we didn't have
- Would have added months to our timeline
- Operational complexity beyond our team's capacity

Hardware security keys for all engineers
- Procurement and distribution logistics
- Remote team with time zone differences
- Cost per engineer not yet justified

Completely locked-down production environment with ticket-based access
- Would have broken our debugging workflow
- Would have created a dependency on an approval process
- Would have slowed our incident response times unacceptably

A third-party privileged access management platform
- Cost of $50K+ per year
- Would have added work to our integration queue, delaying other priorities
- Would have created vendor lock-in risk

We were looking for something that fit our reality: our team size, the tools we already had, and the fact that our engineers debug production problems at 2am.

The Decision: Short-Lived Credentials Everywhere

We decided to remove all long-lived credentials from production systems. Every authentication would use short-lived credentials that expire after a defined period.
For Engineers (Human Access)
- AWS SSO with role assumption
- Maximum session duration: 8 hours
- Re-authentication required daily
- MFA enforced at the SSO layer

For Services (Machine Access)
- ECS task roles for containers
- Lambda execution roles for functions
- No IAM access keys in environment variables
- No secrets in code or configuration files

For CI/CD (Automation Access)
- GitHub Actions OIDC provider
- Short-lived credentials issued per workflow run
- Credentials expire when the workflow completes
- No secrets stored in GitHub

Credential Lifetime Policy
- Human sessions: 8 hours maximum
- Service credentials: automatic rotation by AWS
- CI/CD credentials: scoped to workflow execution (5-15 minutes)
- Emergency break-glass: 1 hour maximum, logged and alerted

Implementation Details

Human Access Pattern

Engineers authenticate through AWS SSO, which is integrated with our Google Workspace identity provider, so they sign in with their Google Workspace credentials. When an engineer needs access to production, they assume a role with the necessary permissions.

```bash
# Engineer logs in once per day
aws sso login --profile production

# Credentials are refreshed automatically
aws s3 ls --profile production
```

Behind the scenes:
1. AWS SSO verifies identity with Google Workspace
2. An MFA challenge is sent to the user (Google Authenticator or a physical token)
3. Temporary credentials are created for the assumed role
4. Credentials are stored locally and expire after 8 hours
5. Credentials auto-refresh while within the session window

The role determines permissions.
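The 8-hour session policy is simple enough to express as code. The helper below is a hypothetical sketch (not part of the AWS CLI or SDK) that decides whether a session issued at `issued_at` is still usable under the policy above:

```python
from datetime import datetime, timedelta, timezone

# Policy from the article: human sessions last at most 8 hours.
MAX_SESSION = timedelta(hours=8)

def session_is_valid(issued_at, now=None, max_session=MAX_SESSION):
    """Return True while a session issued at `issued_at` is still
    inside the maximum session window."""
    now = now or datetime.now(timezone.utc)
    return now - issued_at < max_session

issued = datetime(2026, 1, 29, 6, 0, tzinfo=timezone.utc)
# 7h59m after login: still inside the window.
print(session_is_valid(issued, now=datetime(2026, 1, 29, 13, 59, tzinfo=timezone.utc)))
# 8h01m after login: expired, re-authentication required.
print(session_is_valid(issued, now=datetime(2026, 1, 29, 14, 1, tzinfo=timezone.utc)))
```

In practice the CLI enforces this for you; the sketch just makes the lifetime boundary explicit.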
We created three base roles:

| Role | Permissions | Access Grant |
| --- | --- | --- |
| ReadOnly | View resources, read logs, describe infrastructure | Every engineer starts with ReadOnly |
| Developer | Deploy services, modify non-sensitive config | Granted during on-call rotation |
| Admin | Full access, reserved for infrastructure changes | Requires approval and justification |

Service Access Pattern

Previously, our ECS task definitions looked like this:

```json
{
  "environment": [
    { "name": "AWS_ACCESS_KEY_ID", "value": "AKIA..." },
    { "name": "AWS_SECRET_ACCESS_KEY", "value": "..." }
  ]
}
```

After:

```json
{
  "taskRoleArn": "arn:aws:iam::ACCOUNT:role/ProductionAPIServiceRole",
  "executionRoleArn": "arn:aws:iam::ACCOUNT:role/ECSTaskExecutionRole"
}
```

The task role grants access to AWS services. The ECS agent supplies the credentials automatically, and they rotate every hour. Application code does not change: the AWS SDK handles credential refresh. For our Python services:

```python
# Before: explicit credentials
s3_client = boto3.client(
    's3',
    aws_access_key_id=os.environ['AWS_ACCESS_KEY_ID'],
    aws_secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY'],
)

# After: automatic credential resolution
s3_client = boto3.client('s3')
```

The SDK automatically detects the ECS credential provider endpoint. No code changes are necessary beyond removing the credential configuration.
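The "no code changes" claim rests on the SDK's default credential provider chain. The sketch below is a simplified, hypothetical model of that resolution order (the real chain in botocore has more steps); it shows why deleting the environment variables is enough to fall through to the container endpoint:

```python
def resolve_credential_source(env, has_shared_config=False, on_ecs=False):
    """Simplified model of the AWS SDK default credential chain.

    Checks sources in priority order and returns the first match.
    The real botocore chain includes more steps; this is illustrative.
    """
    if "AWS_ACCESS_KEY_ID" in env and "AWS_SECRET_ACCESS_KEY" in env:
        return "environment variables"
    if has_shared_config:
        return "shared config/credentials file"
    if on_ecs:
        return "ECS container credential endpoint (task role)"
    return "no credentials found"

# Before the migration: static keys in the task environment win.
print(resolve_credential_source(
    {"AWS_ACCESS_KEY_ID": "AKIA...", "AWS_SECRET_ACCESS_KEY": "..."}, on_ecs=True))
# After: with the env vars removed, resolution falls through to the task role.
print(resolve_credential_source({}, on_ecs=True))
```

This ordering is also why leftover environment variables silently override a task role, which is worth checking for during a migration like ours.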
CI/CD Access Pattern

Our GitHub Actions workflows previously used stored secrets:

```yaml
# Before
- name: Deploy to production
  env:
    AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
    AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
  run: |
    terraform apply
```

We migrated to OIDC authentication:

```yaml
# After
- name: Configure AWS credentials
  uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: arn:aws:iam::ACCOUNT:role/GitHubActionsDeployRole
    aws-region: us-east-1

- name: Deploy to production
  run: |
    terraform apply
```

The OIDC flow works as follows:
1. GitHub Actions generates a signed token for the workflow run
2. The token includes repository, branch, and workflow metadata
3. AWS verifies the token's signature using GitHub's public keys
4. If valid, AWS issues temporary credentials for the role
5. The credentials expire when the workflow completes (5-15 minutes)

The IAM role trust policy controls which repositories can use the role:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::ACCOUNT:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com",
          "token.actions.githubusercontent.com:sub": "repo:roulettetech/infrastructure:ref:refs/heads/main"
        }
      }
    }
  ]
}
```

Only workflows in the infrastructure repository, running on the main branch, can assume this role. A compromised developer laptop cannot produce valid tokens. A leaked secret cannot be replayed.
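The effect of the trust policy's `Condition` block can be sketched as a claim check. The function below is a hypothetical illustration of the `StringEquals` matching AWS performs against the token's `aud` and `sub` claims; it is not how AWS implements it:

```python
# Values from the trust policy above.
TRUSTED_AUD = "sts.amazonaws.com"
TRUSTED_SUB = "repo:roulettetech/infrastructure:ref:refs/heads/main"

def token_may_assume_role(claims):
    """Model of the StringEquals condition: both the audience and the
    subject claim must match the trust policy exactly."""
    return claims.get("aud") == TRUSTED_AUD and claims.get("sub") == TRUSTED_SUB

# A run on main in the infrastructure repository is allowed.
print(token_may_assume_role({
    "aud": "sts.amazonaws.com",
    "sub": "repo:roulettetech/infrastructure:ref:refs/heads/main"}))
# A feature branch, or any other repository, is denied.
print(token_may_assume_role({
    "aud": "sts.amazonaws.com",
    "sub": "repo:roulettetech/infrastructure:ref:refs/heads/feature-x"}))
```

Because the `sub` claim encodes repository and ref, the exact-match condition is what makes a leaked or replayed token from anywhere else worthless.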
What This Actually Changed

Security Improvements

Automatic expiration
- Credentials expire by design, not by policy
- No rotation process to forget
- Limited blast radius in the event of a breach

Identity binding
- Every action is tied to a real identity
- CloudTrail shows who took each action
- No credential sharing among team members

Reduced attack surface
- No credentials in environment variables
- No secrets in source control
- No long-lived keys in CI/CD systems

Audit trail
- SSO login activity recorded
- Role assumptions recorded
- Session durations recorded

Operational Impact

Developer experience
- One login per day instead of managing access keys
- No reminders to rotate credentials
- MFA handled at the SSO level, not per service

Incident response
- Revoke access by disabling the SSO account
- Credentials expire after at most 8 hours
- No key hunting in an emergency

Onboarding/offboarding
- New engineer: SSO access and role assignment
- Departing engineer: disable the SSO account; access terminates immediately
- No credential cleanup checklist

Cost
- AWS SSO: free
- No third-party PAM solution needed
- Engineering time: ~40 hours for the full migration

What We Explicitly Accepted

This was a deliberately conservative approach. We did not attempt perfect security. We aimed for measurably more secure, with reasonable operational overhead.
Accepted limitations:

Role-based access is coarse-grained
- Three roles may be too broad a categorization
- Engineers with the Developer role have more access than any single task requires
- Fine-grained access would have meant a role per engineer

8-hour sessions are long
- Credentials remain valid for a full working day
- Compromise within that window is still possible
- Shorter sessions would have hindered debugging

Break-glass accounts exist
- Emergency root access is available
- Stored in 1Password, protected with MFA
- Used twice in six months, both times for SSO outages

No network-level isolation
- Credentials alone do not prevent lateral movement
- We assumed services would have access to each other
- Network segmentation would have meant a VPC overhaul

CloudTrail lag
- Up to 15 minutes before events are logged
- No real-time monitoring
- A compromise could be underway for minutes before detection

We made this set of trade-offs because it matched our team size, our needs, and our security concerns. Your mileage may vary depending on your team.

Measurement and Validation

We tracked the following metrics to validate the decision.
Before migration
- Active IAM access keys: 23
- Average access key age: 147 days
- Keys older than 90 days: 14
- Credential rotation incidents: 0 (rotation never happened)

After migration (6 months)
- Active IAM access keys: 0 (except break-glass)
- Average session duration: 4.2 hours
- Failed authentications: 312 (all expired sessions)
- Time to revoke access when offboarding: under 5 minutes

Security audit results
- SOC 2 report: "Adequate credential management controls"
- Previous report: "Recommendation: Implement credential rotation"
- Customer security questionnaire blockers: 0
- Investor security review: passed with no credential concerns

Developer feedback
- Time to onboard AWS access: 15 minutes
- "Have to re-auth once per day": an acceptable level of friction
- Incidents caused by credential problems: 1 (a session expired mid-deploy; a retry resolved it)

When This Approach Breaks Down

This model works for us at Roulette Technologies. It will not work forever.
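The before-migration numbers are straightforward to compute from key metadata. A hedged sketch, again assuming records shaped like IAM's `list_access_keys` output rather than the team's actual script:

```python
from datetime import datetime, timezone

def key_age_report(keys, now=None, stale_after_days=90):
    """Compute the audit metrics cited above: active key count,
    average key age in days, and keys older than the staleness bar."""
    now = now or datetime.now(timezone.utc)
    active = [k for k in keys if k["Status"] == "Active"]
    ages = [(now - k["CreateDate"]).days for k in active]
    return {
        "active_keys": len(active),
        "average_age_days": sum(ages) / len(ages) if ages else 0,
        "stale_keys": sum(1 for a in ages if a > stale_after_days),
    }

# Hypothetical sample data with a fixed "now" for reproducibility.
now = datetime(2026, 1, 1, tzinfo=timezone.utc)
keys = [
    {"Status": "Active", "CreateDate": datetime(2025, 12, 1, tzinfo=timezone.utc)},   # 31 days old
    {"Status": "Active", "CreateDate": datetime(2025, 6, 1, tzinfo=timezone.utc)},    # 214 days old
    {"Status": "Inactive", "CreateDate": datetime(2024, 1, 1, tzinfo=timezone.utc)},  # ignored
]
print(key_age_report(keys, now=now))
```

Running a report like this on a schedule is one way to notice the "average age 147 days" problem before an auditor does.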
Signals that it's time to evolve further:

Team size grows beyond 20 engineers
- Three roles no longer offer enough granularity
- More permission levels become necessary
- Role proliferation needs active management

Compliance needs change
- An 8-hour session is too long for PCI DSS
- Just-in-time access provisioning becomes necessary
- Approvals for access escalation become necessary

A multi-cloud environment emerges
- AWS SSO alone is not enough; GCP and Azure need support
- A unified identity model becomes necessary
- Credential management must span multiple clouds

A service mesh with mutual TLS emerges
- Machine identities need more granularity
- Certificate-based authentication becomes necessary
- SPIFFE and SPIRE become a natural fit

Third parties need access to production
- Vendors or contractors need temporary access
- Time-boxed access with automatic expiration is required
- An audit trail for non-employee access is required

The Next Evolution

When these conditions are met, the next evolution might include:
- A dedicated platform team to manage identity infrastructure
- A third-party PAM solution for just-in-time access
- Policy as code for permission management
- Certificate-based authentication for service-to-service traffic

But these are future problems. For a team of seven engineers with tightly controlled scope, short-lived credentials solved the problem we actually had.
Lessons Learned

Identity, not credentials
- Anchoring access to Google Workspace identity solved the problem
- A single source of truth for who has access
- Offboarding is now automatic

Platform primitives
- AWS SSO, ECS task roles, and OIDC are free
- Building a custom solution would have taken months
- Platform primitives improve over time

Operational simplicity > theoretical security
- Perfect isolation would have broken debugging
- Engineers need production access to fix production
- Security that prevents work gets bypassed

Measure the before state
- Knowing we had 23 active keys proved the case
- An average key age of 147 days was damning
- The numbers convinced leadership to prioritize the migration

Migration is incremental
- Started with new services on task roles
- Then migrated CI/CD (highest risk)
- Last, migrated engineer access (most disruptive)

Documentation matters more than you think
- Engineers needed a runbook for the new auth flow
- "How do I debug production?" became an FAQ
- Documentation reduced support requests

This Article Is Not About Credentials

This EDR is really about:

Designing with constraints, not ideals
- We didn't implement a zero-trust architecture
- We implemented what we could with what we had
- Good enough today beats perfect never

Treating security as an operational reality
- Security measures need to fit how we work
- Engineers will circumvent security that gets in the way
- Usability is a security feature

Designing security with trade-offs
- 8-hour sessions are long, but acceptable
- Coarse-grained roles are imperfect, but workable
- Honesty about the gaps builds trust

Designing security to evolve
- What we have today won't scale 10x
- We defined clear signals for change
- Architecture is never finished

Designing security from real-world experience
- Sarah's still-active credentials were a wake-up call
- Abstract threats don't motivate change
- Real exposure drives real change

The Final Question

The credential ban wasn't about best practices.
The credential ban was about solving a real problem. If you're a three-person startup, you probably don't need this right now. If you're a 50-person company, you needed it yesterday. Whatever your size, ask yourself: do you know how many active credentials exist in your production environment right now?

If the answer is "no," you have the same problem we had. And if you end up finding out at 2am in the middle of an incident, trust us: you'll wish you had solved it sooner.

Short-Lived Credentials · AWS · Identity · Pragmatic Security
BR364108 · Jan 16, 2026 · 6 min read

When VPC Segmentation Increases Cost and Operational Risk (A Production Case)

Roulette Technologies is a product-focused software company building and operating customer-facing systems in the cloud. As its platform moved from experimentation into steady production usage, one architectural question surfaced quickly: how should the network be structured so that failures are contained, costs remain predictable, and a small engineering team can still operate the system confidently?

This article walks through how Roulette Technologies evaluated VPC segmentation options, the constraints that shaped the decision, and why a deliberately balanced approach was chosen.

Editorial context: Roulette Technologies documents production-informed decisions based on a combination of direct experience and observed industry patterns. Specific details are representative, not exhaustive.

The Situation Roulette Technologies Found Itself In

Roulette Technologies had reached product–market fit. Traffic was consistent. Customers depended on the system. Failures were no longer theoretical; they were live. At the same time, the organization was intentionally lean:
- A team of three to five generalist engineers
- No dedicated platform or networking team
- A tightly controlled AWS budget
- Revenue that was meaningful, but not tolerant of waste

The system itself followed a familiar shape (a web application with clear application and data layers), but the network design now mattered in ways it hadn't before. Any decision made at this stage would have long-term consequences:
- It would define how far failures could spread
- It would determine how difficult incidents would be to debug
- It would either enable or block future compliance efforts
- Most importantly, it would influence how costs scaled with traffic

Framing the Real Problem

Rather than starting with AWS best practices or reference diagrams, Roulette Technologies framed the problem more narrowly: how can a production VPC be segmented to contain failures and security incidents without exceeding cost limits or operational capacity?
Framed that way, the problem becomes more human and easier to reason about. The framing intentionally excluded several tempting directions:
- Microservices versus monolith debates
- Zero-trust networking
- Service meshes
- Multi-region architectures
- Kubernetes networking abstractions

Those concerns belonged to a future version of the company. The goal here was survivable production infrastructure, not theoretical perfection.

The Constraints That Shaped Every Decision

A few realities dominated the discussion:

Network costs scale with traffic, not intent. The most dangerous costs were not hourly infrastructure charges, but per-GB data processing and transfer.

Operational mistakes are expensive. A design that saves money but amplifies human error is not cost-effective.

Debugging speed matters more than elegance. At 2am, clarity beats cleverness.

Security is about containment, not absolutes. Eliminating all risk was unrealistic, but limiting the blast radius was achievable.

With these constraints in mind, Roulette Technologies evaluated several approaches.

The Simplest Path: Minimal Segmentation

The first option was straightforward:
- Public subnets for ingress
- Private subnets for everything else
- Application and data living side by side

This approach was attractive for its simplicity. Routing was easy to reason about. Costs were minimal at low traffic volumes. But when failure scenarios were examined, the weaknesses were obvious:
- A compromised application had a direct path to data
- A single misconfiguration could expose the entire system
- There was no clean way to isolate damage during an incident

For a production system handling customer data, this fragility was unacceptable.

The Opposite Extreme: Full Isolation Everywhere

At the other end of the spectrum was heavy segmentation:
- Multiple VPCs
- Strict boundaries between environments
- Deep isolation at every layer

On paper, this offered excellent containment.
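The per-GB concern above can be made concrete with simple arithmetic. The sketch below uses illustrative rates (roughly in line with commonly published NAT gateway pricing, but check current rates for your region) to show why data volume, not hourly charges, dominates the bill as traffic grows:

```python
def nat_gateway_monthly_cost(gb_processed,
                             hourly_rate=0.045,
                             per_gb_rate=0.045,
                             hours=730):
    """Estimate a monthly NAT gateway bill.

    Rates are illustrative assumptions, not authoritative pricing:
    a fixed hourly charge plus a per-GB data processing charge.
    """
    fixed = hourly_rate * hours
    variable = per_gb_rate * gb_processed
    return round(fixed + variable, 2)

# At low volume the fixed hourly charge dominates.
print(nat_gateway_monthly_cost(100))      # 100 GB/month
# At 10 TB/month the per-GB charge is the whole story.
print(nat_gateway_monthly_cost(10_000))   # 10,000 GB/month
```

This is why the team treated traffic paths, not instance counts, as the cost surface to design around.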
In practice, it introduced a different kind of risk:
- Complex routing paths
- Higher operational overhead
- Debugging that required specialized expertise
- Costs that scaled non-linearly with complexity

For a small team, this level of isolation created more operational risk than it removed.

The Approach Roulette Technologies Chose

Ultimately, Roulette Technologies settled on a tiered segmentation model:
- A public tier for controlled ingress and egress
- A private application tier for compute and background work
- A private data tier with no direct internet access
- The same structure replicated across two availability zones

This approach was intentionally conservative. It did not aim for maximum isolation. It aimed for predictable failure boundaries, observable cost growth, and operational clarity. Failures in one tier would not automatically cascade into others. Traffic paths were explicit and reviewable. Costs grew primarily with usage, not configuration mistakes. Most importantly, engineers could understand the system under pressure.

Why This Balance Worked

From a security standpoint, sensitive data lived behind explicit network boundaries. From a cost standpoint, the dominant drivers (data processing and transfer) were visible and controllable. From an operational standpoint, incident response followed a clear mental model.

The architecture also left room to evolve:
- Additional segmentation could be introduced later
- Compliance-driven controls could be layered on
- Multi-VPC or multi-region designs could be justified when needed

Nothing about the design closed doors prematurely.

Accepted Risk, Documented Intentionally

Roulette Technologies did not pretend this design eliminated risk. It explicitly accepted:
- Some lateral-movement potential within the application tier
- A single VPC as a shared-fate boundary
- Heavy reliance on security groups and infrastructure-as-code discipline

These risks were documented, monitored, and tied to clear revisit conditions. Undocumented risk creates surprises.
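Part of the tiered model's value is that allowed traffic paths can be written down and reviewed. The sketch below is a hypothetical representation of the three tiers as an explicit allow-list, with a check of the property the segmentation is meant to guarantee: the data tier is never directly reachable from outside the application tier.

```python
# Explicit, reviewable allow-list of tier-to-tier traffic.
# Illustrative model only, not the team's actual configuration.
ALLOWED_FLOWS = {
    ("internet", "public"),      # ingress via the public tier
    ("public", "application"),   # public tier forwards to the app tier
    ("application", "data"),     # only the app tier reaches the data tier
}

def flow_allowed(src, dst):
    """A flow is permitted only if it appears in the allow-list."""
    return (src, dst) in ALLOWED_FLOWS

# The containment property under this model:
print(flow_allowed("application", "data"))  # the one sanctioned path
print(flow_allowed("public", "data"))       # no direct path to data
print(flow_allowed("internet", "data"))     # and none from outside
```

In real infrastructure the same allow-list lives in security group rules managed by Terraform, which is what makes the paths "explicit and reviewable" rather than emergent.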
Documented risk creates options.

When the Decision Would Be Revisited

The company agreed to re-evaluate the architecture if:
- Network costs grew disproportionately relative to traffic
- Compliance requirements became mandatory
- The engineering team grew large enough to support deeper specialization
- A real security incident exposed weaknesses in containment
- The system needed to expand across regions

Until then, this design represented the right trade-off for the company's stage.

What This Story Is Really About

This article is not about VPCs. It is about:
- Designing within constraints instead of ideals
- Treating cost as a first-class architectural concern
- Balancing security with human operability
- Making decisions that can evolve instead of becoming blockers later

That is the mindset Roulette Technologies applied, and the mindset this article is meant to illustrate.

Final takeaway: good infrastructure is not defined by how advanced it looks, but by how well it matches the scale, the risks, and the people responsible for operating it.

Networking · VPC · Small Team