Evidence-Based Design. Production-Verified System Record Intelligence.

Explore our Engineering Decision Records, Architecture Reviews, and Well-Architected case studies. Every document reflects real-world production constraints.

BR102970
Jan 29, 2026 13 min read

Why We Banned Long-Lived Cloud Credentials in Production

Roulette Technologies had achieved product-market fit. The product was handling real customer data, revenue was growing, and the engineering team had expanded from two founders to seven. Then an audit of our access controls revealed that a former backend engineer, gone for weeks, still had active AWS credentials. Her access keys, issued eight months earlier during an urgent production fix, were still valid. She had full access to our primary RDS instance. She could decrypt secrets from Parameter Store. She could update S3 bucket policies.

Editorial context: Build Roulette documents production-informed decisions based on a combination of direct experience and observed industry patterns. Specific details are representative, not exhaustive.

The engineer wasn't malicious. Her credentials weren't compromised. But that is exactly the problem: the exposure was real, and we had no idea it existed until we audited our access controls. Credentials that outlive their owners are a class of problem teams routinely discover too late, and they represented a genuine failure mode in our security posture.

The question we asked ourselves: how do we ensure that access credentials expire, are bound to real identities, and produce an audit trail we can actually trust?

The Constraint Surface

Before discussing what we built, let's be clear about what we were working with.

Team Reality

- Seven generalist engineers; no security team
- Everyone deployed to production and needed AWS access
- On-call rotations meant debugging at 2am was a regular occurrence
Technical Constraints

- Services deployed on ECS Fargate and Lambda
- Infrastructure managed with Terraform
- CI/CD pipeline implemented with GitHub Actions
- Primary cloud provider: AWS
- Budget: tight controls, cost accountability per service

Security Reality

- SOC 2 Type II audit required in six months
- Customer contracts now required security questionnaires
- A past breach at a competitor had spooked investors

Operational Reality

- Engineers required production access
- Deployments occurred multiple times per day
- Breaking the deployment pipeline was not acceptable
- Adding 10 minutes to the dev loop would be a productivity killer

What We Explicitly Did Not Consider

The problem statement deliberately excluded the following:

Zero-trust architecture with mutual TLS everywhere
- Required a platform team we didn't have
- Would have added months to our timeline
- Operational complexity beyond our team's capacity

Hardware security keys for all engineers
- Procurement and distribution logistics
- Remote team with time-zone differences
- Per-engineer cost not yet justified

A completely locked-down production environment with ticket-based access
- Would have broken our debugging workflow
- Would have created a dependency on an approval process
- Would have slowed our incident response unacceptably

A third-party privileged access management platform
- Costs $50K+ per year
- Would have added work to our integration queue, delaying other priorities
- Would have created vendor lock-in risk

We were looking for something that fit our reality: our team size, the tools we already had, and the fact that our engineers debug production problems at 2am.

The Decision: Short-Lived Credentials Everywhere

We decided to remove all long-lived credentials from production systems. Every authentication would use short-lived credentials with a defined expiry.
For Engineers (Human Access)
- AWS SSO with role assumption
- Maximum session duration: 8 hours
- Re-authentication required daily
- MFA enforced at the SSO layer

For Services (Machine Access)
- ECS task roles for containers
- Lambda execution roles for functions
- No IAM access keys in environment variables
- No secrets in code or configuration files

For CI/CD (Automation Access)
- GitHub Actions OIDC provider
- Short-lived credentials issued per workflow run
- Credentials expire when the workflow completes
- No secrets stored in GitHub

Credential Lifetime Policy
- Human sessions: 8 hours maximum
- Service credentials: automatic rotation by AWS
- CI/CD credentials: scoped to workflow execution (5–15 minutes)
- Emergency break-glass: 1 hour maximum, logged and alerted

Implementation Details

Human Access Pattern

Engineers authenticate through AWS SSO, which is integrated with our Google Workspace identity provider. When an engineer needs access to production, they assume a role that carries the necessary permissions.

```shell
# Engineer logs in once per day
aws sso login --profile production

# Credentials are refreshed automatically
aws s3 ls --profile production
```

Behind the scenes:

- AWS SSO verifies identity with Google Workspace
- An MFA challenge is sent to the user (Google Authenticator or a physical token)
- Temporary credentials are created to assume the role
- Credentials are stored locally and expire after 8 hours
- Credentials auto-refresh while within the session window

The role determines permissions.
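The session-window behavior above can be sketched as a small helper. This is hypothetical illustration code, not part of the AWS CLI or our tooling: credentials remain usable while the 8-hour window is open and require re-authentication afterwards.

```python
from datetime import datetime, timedelta, timezone

SESSION_MAX = timedelta(hours=8)  # maximum human session duration

def credential_state(issued_at, now=None):
    """Classify a locally cached SSO credential set.

    Returns 'valid' while the 8-hour session window is open, and
    'reauth' once it has elapsed (at which point the engineer would
    run `aws sso login` again).
    """
    now = now or datetime.now(timezone.utc)
    return "valid" if now - issued_at < SESSION_MAX else "reauth"
```

At 7 hours the credentials still auto-refresh; at 9 hours the engineer must log in again.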
We created three base roles:

| Role | Permissions | Access Grant |
| --- | --- | --- |
| ReadOnly | View resources, read logs, describe infrastructure | Every engineer starts with ReadOnly |
| Developer | Deploy services, modify non-sensitive config | Granted on on-call rotation |
| Admin | Full access, reserved for infrastructure changes | Requires approval and justification |

Service Access Pattern

Previously, our ECS task definitions embedded static credentials:

```json
{
  "environment": [
    { "name": "AWS_ACCESS_KEY_ID", "value": "AKIA..." },
    { "name": "AWS_SECRET_ACCESS_KEY", "value": "..." }
  ]
}
```

After:

```json
{
  "taskRoleArn": "arn:aws:iam::ACCOUNT:role/ProductionAPIServiceRole",
  "executionRoleArn": "arn:aws:iam::ACCOUNT:role/ECSTaskExecutionRole"
}
```

The task role grants access to AWS services. Credentials are supplied automatically by the ECS agent and rotate every hour. Application code does not change; the AWS SDK handles credential refresh. For our Python services:

```python
# Before: explicit credentials
s3_client = boto3.client(
    "s3",
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)

# After: automatic credential resolution
s3_client = boto3.client("s3")
```

The SDK automatically detects the ECS credential provider endpoint. No code changes are necessary beyond removing the credential configuration.
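Under the hood, the SDK's credential chain queries a local metadata endpoint served by the ECS agent. A simplified sketch of that lookup, for illustration only (the SDK does this for you; the link-local address is fixed by ECS):

```python
import json
import os
import urllib.request

ECS_CREDS_HOST = "http://169.254.170.2"  # link-local address served by the ECS agent

def fetch_task_credentials():
    """Fetch temporary task-role credentials the way the SDK's chain does.

    ECS injects AWS_CONTAINER_CREDENTIALS_RELATIVE_URI into the container;
    the agent serves rotating credentials for the task role at that path.
    Raises KeyError outside of an ECS task, where the variable is absent.
    """
    relative_uri = os.environ["AWS_CONTAINER_CREDENTIALS_RELATIVE_URI"]
    with urllib.request.urlopen(ECS_CREDS_HOST + relative_uri) as resp:
        # Response contains AccessKeyId, SecretAccessKey, Token, Expiration
        return json.load(resp)
```

Because the endpoint only answers inside the task's network namespace, the credentials never appear in environment variables or task definitions.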
CI/CD Access Pattern

Our GitHub Actions workflows used stored secrets:

```yaml
# Before
- name: Deploy to production
  env:
    AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
    AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
  run: |
    terraform apply
```

We migrated to OIDC authentication:

```yaml
# After
- name: Configure AWS credentials
  uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: arn:aws:iam::ACCOUNT:role/GitHubActionsDeployRole
    aws-region: us-east-1

- name: Deploy to production
  run: |
    terraform apply
```

The OIDC flow works as follows:

1. GitHub Actions generates a signed token for the workflow run
2. The token includes repository, branch, and workflow metadata
3. AWS verifies the token's signature using GitHub's public keys
4. If valid, AWS issues temporary credentials for the role
5. The credentials expire when the workflow completes (5–15 minutes)

The IAM role trust policy controls which repositories can use this role:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::ACCOUNT:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com",
          "token.actions.githubusercontent.com:sub": "repo:roulettetech/infrastructure:ref:refs/heads/main"
        }
      }
    }
  ]
}
```

Only workflows in the infrastructure repository, running on the main branch, can assume this role. A compromised developer laptop cannot mint valid tokens. A leaked secret cannot be replayed.
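The trust policy's StringEquals conditions amount to exact-match checks on the token's claims. A minimal sketch of that check (illustrative only; AWS performs this server-side after verifying the token signature):

```python
# StringEquals conditions mirrored from the role trust policy
TRUST_CONDITIONS = {
    "aud": "sts.amazonaws.com",
    "sub": "repo:roulettetech/infrastructure:ref:refs/heads/main",
}

def assume_role_allowed(claims):
    """Every condition key must match the corresponding token claim exactly."""
    return all(claims.get(key) == value for key, value in TRUST_CONDITIONS.items())
```

A token minted for a feature branch or a forked repository carries a different `sub` claim and is rejected.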
What This Actually Changed

Security Improvements

Automatic expiration
- Credentials expire by design, not by policy
- No rotation process to forget
- Limited blast radius in the event of a breach

Identity binding
- Every action is tied to a real identity
- CloudTrail shows who took each action
- No credential sharing among team members

Reduced attack surface
- No credentials in environment variables
- No secrets in source control
- No long-lived keys in CI/CD systems

Audit trail
- SSO login activity recorded
- Role assumptions recorded
- Session durations recorded

Operational Impact

Developer experience
- One login per day instead of managing access keys
- No reminders to rotate credentials
- MFA handled at the SSO level, not per service

Incident response
- Revoke access by disabling the SSO account
- Credentials expire within 8 hours
- No key hunting during emergencies

Onboarding/offboarding
- New engineer: SSO access and role assignment
- Departing engineer: disable the SSO account; access terminates immediately
- No credential cleanup checklist

Cost
- AWS SSO: free
- No third-party PAM solution needed
- Engineering time: ~40 hours for the full migration

What We Explicitly Accepted

This was a deliberately conservative approach. We did not attempt perfect security; we aimed for measurably more secure, with reasonable operational overhead.
Accepted Limitations

Role-based access is coarse-grained
- Three roles may be too broad a categorization
- Engineers with the Developer role have more access than any single task requires
- Fine-grained access would have meant a role per engineer

8-hour sessions are long
- Credentials remain valid for a full working day
- Compromise within that window is still possible
- Shorter sessions would have hindered debugging

Break-glass accounts exist
- Emergency root access is available
- Stored in 1Password, protected with MFA
- Used twice in six months, both for SSO outages

No network-level isolation
- Credential controls alone do not prevent lateral movement
- We assumed services would have access to each other
- Network segmentation would have meant a VPC overhaul

CloudTrail lag
- Up to 15 minutes before events are logged
- No real-time monitoring
- A compromise could run for minutes before detection

We made this set of trade-offs because it fit our team size, our needs, and our security concerns. Your mileage may vary depending on your team.

Measurement and Validation

We tracked the following metrics to validate the decision.
Before Migration
- Active IAM access keys: 23
- Average access key age: 147 days
- Keys older than 90 days: 14
- Credential rotation incidents: 0 (rotation never happened)

After Migration (6 months)
- Active IAM access keys: 0 (except break-glass)
- Average session duration: 4.2 hours
- Failed authentications: 312 (all expired sessions)
- Time to revoke access on offboarding: < 5 minutes

Security Audit Results
- SOC 2 report: "Adequate credential management controls"
- Previous report: "Recommendation: Implement credential rotation"
- Customer security questionnaire blockers: 0
- Investor security review: passed without credential concerns

Developer Feedback
- Time to onboard AWS access: 15 minutes
- "Have to re-auth once per day": an acceptable level of friction
- Incidents caused by credential problems: 1 (a session expired mid-deploy; a retry resolved it)

When This Approach Breaks Down

This model works for us at Roulette Technologies. It will not work forever.
Signals that it's time to evolve further:

Team size grows beyond 20 engineers
- Three roles no longer provide enough granularity
- More permission levels are needed
- The resulting explosion of roles must be managed

Compliance needs change
- An 8-hour session is too long for PCI DSS
- Just-in-time access provisioning becomes necessary
- Access escalation requires approvals

A multi-cloud environment emerges
- AWS SSO alone is not enough; GCP and Azure need support
- A unified identity model is required
- Credentials must be managed across multiple clouds

A service mesh with mutual TLS emerges
- Machine identities need more granularity
- Certificate-based authentication is required
- SPIFFE/SPIRE become a natural fit

Third parties need access to production
- Vendors or contractors need temporary access
- Time-boxed access with automatic expiration is required
- An audit trail for non-employee access is required

The Next Evolution

When these conditions are met, the next evolution might include:

- A dedicated platform team to run the identity infrastructure
- A third-party PAM solution for just-in-time access
- Policy-as-code for permission management
- Certificate-based authentication for service-to-service traffic

But these are future problems. For a team of seven engineers with tightly controlled scope, short-lived credentials solved the problem we had.
Lessons Learned

Identity, not credentials
- Anchoring access to Google Workspace identity solved the problem
- A single source of truth for who has access
- Offboarding is now automatic

Platform primitives
- AWS SSO, ECS task roles, and OIDC are free
- Building a custom solution would have taken months
- Platform primitives improve over time

Operational simplicity > theoretical security
- Perfect isolation would have broken debugging
- Engineers need production access to fix production
- Security that prevents work gets bypassed

Measure the before state
- Knowing we had 23 active keys proved the case
- An average key age of 147 days was damning
- The numbers convinced leadership to prioritize the migration

Migration is incremental
- Started with new services on task roles
- Next, migrated CI/CD (highest risk)
- Last, migrated engineer access (most disruptive)

Documentation matters more than you think
- Engineers needed a runbook for the new auth flow
- "How do I debug production?" became an FAQ
- Documentation reduced support requests

This Article Is Not About Credentials

This EDR is about:

Designing with constraints, not ideals
- We didn't implement a zero-trust architecture
- We implemented what we could with what we had
- Good enough today beats perfect never

Treating security as an operational reality
- Security measures need to work with how we work
- Engineers will circumvent security that gets in the way
- Usability is a security feature

Designing security with trade-offs
- 8-hour sessions are long, but acceptable
- Coarse-grained roles are imperfect, but workable
- Honesty builds trust

Designing security to evolve
- What we have today won't scale 10x
- We defined clear signals for change
- Architecture is never finished

Designing security from real-world experience
- The former engineer's still-active credentials were a wake-up call
- Abstract security threats won't motivate change
- Real-world exposure drives real-world change

The Final Question

The credential ban wasn't about best practices.
The credential ban was about solving a real problem. If you're a three-person startup, you probably don't need this right now. If you're a 50-person company, you needed it yesterday. Whatever your size, ask yourself: do you know how many active credentials exist in your production environment right now?

If the answer is "no", you have the same problem we had. And if you find out at 2am in the middle of an incident, trust us: you'll wish you had solved it sooner.

Short-Lived Credentials · AWS Identity · Pragmatic Security
BR725845
Jan 26, 2026 10 min read

Designing a logging and observability strategy that won't bankrupt you

As Roulette Technologies transitioned from controlled beta to live production, the system was behaving predictably. Core workflows were stable and all customer-facing channels performed as expected. Operational incidents were infrequent and unremarkable. The first real concern surfaced outside the application: a routine monthly AWS expenditure review revealed a significant increase in observability costs. CloudWatch usage had grown faster than any other service, accounting for nearly one-third of total cloud spend. The spend wasn't tied to outages, incident volume, or higher error rates. It was tied to AWS default configurations, which favor extensive data collection regardless of cost.

Editorial context: Build Roulette documents production-informed decisions based on a combination of direct experience and observed industry patterns. Specific details are representative, not exhaustive.

All log groups used the Standard ingestion class, even where the log data would rarely, if ever, be queried. Retention policies were set to "Never Expire", letting storage costs accumulate quietly over time. Meanwhile, 1-minute high-resolution metrics were enabled for most of the fleet, turning otherwise free basic monitoring into an ongoing expense. The system was observable in the broadest sense, but much of the telemetry it gathered had little operational relevance. Successful requests, background noise, and routine health checks received the same priority, cost, and storage as failed requests. Monitoring spend was approaching the cost of the compute it monitored.
THE OBJECTIVE: ACHIEVING SUSTAINABLE OBSERVABILITY

The team at Roulette Technologies revisited its observability design with a precise goal: maintain fast detection and effective incident response while eliminating telemetry that carries no signal. The aim was not merely to collect less data but to control data before it entered the system. A filter-first architecture lets observability costs scale with system health instead of traffic volume.

The restructure was guided by three basic constraints:

1. Operational simplicity. The engineering team comprises five generalists and does not operate a dedicated SRE function. Self-hosted observability stacks such as Prometheus and Grafana were evaluated, but their maintenance requirements, infrastructure management, and on-call overhead exceeded what the team could reliably support. Any solution had to remain AWS-managed, low-touch, and operationally predictable.

2. Metrics-first detection. Detection is based entirely on metrics. Logs are retained for focused investigation, and traces support dependency and latency analysis. Automated alerting is deliberately decoupled from log scanning to avoid the constant ingestion and query costs of alerting on detailed logs.

3. Automated governance. The design had to eliminate manual cleanups and reactive cost controls. Filtering, retention, and escalation needed to be enforced automatically, remaining effective as traffic grew and services evolved, without requiring engineers to intervene during incidents.

THE STRATEGY: TECHNICAL IMPLEMENTATION AND TRADE-OFFS

To manage telemetry growth, Roulette Technologies made a set of targeted changes without replacing the existing observability stack. The aim was to preserve incident visibility while constraining ingestion, storage, and analysis costs.
The strategy prefers operational signal to exhaustive history, valuing telemetry by its contribution to detection, investigation, or dependency analysis, on the expectation that not all data provides equal operational value. All changes were implemented with AWS-managed services, avoiding additional system overhead.

Paths Intentionally Avoided

Self-hosted observability platforms such as Prometheus and Grafana were assessed but ruled out. They provide robust cost control but bring infrastructure-management complexity currently beyond the team's capacity. A self-managed stack would have traded cost savings for engineer time, a trade that isn't justified for a team of this size without an SRE function. Because Roulette Technologies is AWS-native, CloudWatch cost controls (ingestion tiers, filters, and other managed features) remain applicable. Application instrumentation uses OpenTelemetry-compatible standards, allowing future adoption of a managed observability platform without significant refactoring.

Method 1: A Tiered Logging Model

Roulette Technologies set up a tiered logging model that aligns log storage and query costs with operational usage patterns. Logs are classified by how frequently they are accessed and their role in incident response.

- Standard class: production logs that power real-time metric filters or need near-immediate visibility (for example, sub-second Live Tail usage) remain in the Standard class.
- Infrequent Access (IA) tier: general-purpose application logs and routine system events are routed here, where ingestion and storage costs are much lower.

Trade-off: the IA tier carries higher query and retrieval costs.
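One way to express the tiering rule is as a helper that builds the CloudWatch Logs API parameters for a group. This is a sketch under assumptions: the group names are hypothetical, and `logGroupClass` must be chosen when the group is created.

```python
def log_group_params(name, powers_metric_filter, retention_days=7):
    """Build create_log_group / put_retention_policy kwargs for a log group.

    Groups that back real-time metric filters stay in the Standard class;
    everything else goes to Infrequent Access with a short retention window.
    """
    log_class = "STANDARD" if powers_metric_filter else "INFREQUENT_ACCESS"
    create = {"logGroupName": name, "logGroupClass": log_class}
    retention = {"logGroupName": name, "retentionInDays": retention_days}
    return create, retention

# Example: a payments API log group drives alarms, so it stays Standard;
# a background worker's routine chatter goes to IA.
api_create, _ = log_group_params("/app/payments-api", powers_metric_filter=True)
```

The returned dicts map directly onto the boto3 `create_log_group` and `put_retention_policy` calls, keeping the tiering decision in one reviewable place.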
That trade-off was acceptable given the operational troubleshooting workflow, which cares more about recent failures and targeted queries than broad historical log analysis.

Method 2: Edge Filtering via the CloudWatch Agent

CloudWatch costs are dominated by per-GB ingestion and analysis. To reduce them, filtering was applied before telemetry reached the AWS-managed endpoints. CloudWatch Agent configurations on EC2 instances used filter expressions (regular expressions) to drop INFO- and DEBUG-level messages such as HTTP 200/300 request logs. Dropping high-volume, low-value telemetry at the instance level cut ingestion costs significantly while preserving the error and failure signals needed for incident response.

Method 3: Metrics for Detection, Logs for Investigation

All automated alerting is driven exclusively by CloudWatch Metrics. Non-critical services rely on free basic monitoring at 5-minute resolution, with detailed 1-minute metrics reserved for the critical transactional path. CloudWatch Dashboards expose the Golden Signals (latency, error rate, throughput) per critical service in real time, without ad-hoc log queries. Metric filters extract structured numerical data from Standard-class logs, surfacing application-level failure patterns without retaining the detailed logs themselves.

Method 4: Adaptive Distributed Tracing

In Roulette Technologies' microservice architecture, distributed tracing is required for dependency and latency analysis, but tracing every request is too costly. AWS X-Ray was deployed with a baseline sampling rate of 5% for healthy requests, while targeted sampling ensures 100% of requests that result in 4xx or 5xx responses are captured.
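X-Ray's behavior here is configured through sampling rules, but the decision logic reduces to something like the following sketch (toy code, not the X-Ray SDK):

```python
import random

BASELINE_RATE = 0.05  # trace 5% of healthy requests

def should_trace(status_code, rng=random):
    """Sample 100% of 4xx/5xx responses and a 5% baseline otherwise."""
    if status_code >= 400:
        return True  # every error response is captured
    return rng.random() < BASELINE_RATE
```

Error traffic is always traced, so failure paths stay fully visible even though only one in twenty successful requests is recorded.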
This sampling strategy preserves visibility into failures and critical paths while avoiding the cost of recording routine, successful traffic.

Method 5: The "Financial Circuit Breaker"

To stop unexpected telemetry spikes from growing out of control, a financial safeguard was introduced. An AWS Budget publishes to an SNS topic that triggers a Lambda function when daily observability spend exceeds 120 percent of forecast; the function updates a global parameter in AWS Systems Manager Parameter Store. Applications watch this parameter and, on a breach, automatically raise their log output level to CRITICAL, limiting further cost exposure until remediation is complete.

Incident Response Workflow

Incident detection is driven by CloudWatch metric alarms. CloudWatch Dashboards support quick service-level triage, and AWS X-Ray traces identify dependency and latency issues. Targeted CloudWatch Logs Insights queries run against Standard-class logs to confirm root cause; Infrequent Access logs are queried only when additional historical context is required. This workflow enables fast investigation while keeping log scanning to a minimum during normal operations.

STRATEGIC OUTCOMES: ALIGNING VISIBILITY WITH VALUE

The filter-first architecture ensures that observability costs scale with actionable events instead of raw traffic. Key outcomes include:

Predictable unit economics
High-volume logs move to the Infrequent Access (IA) tier with a seven-day retention policy, turning storage cost into a predictable unit cost.

Increased signal density
Agent-side filtering eliminates the background noise of routine heartbeats and health checks, leaving a clear stream of useful information for faster root-cause analysis.
Detection over investigation
Alerting is metrics-first, minimizing generic log scans as a detection mechanism. Log collection serves investigation only, keeping investigations streamlined yet sufficiently traceable.

This approach aligns cost with operational value, preserves visibility for key services, and enables rapid response with fiscal discipline.

ACKNOWLEDGED STRATEGIC TRADE-OFFS

The architecture accepts controlled operational gaps in exchange for cost discipline:

Context gap
Dropping INFO/DEBUG messages makes it impossible to review every successful transaction, which can slow the reconstruction of events leading up to an error.

Variable retrieval costs
IA-tier logs lower daily storage costs but can incur higher costs for historical searches, which is acceptable given how rarely those searches run.

Tracing granularity
Sampling 5% of healthy requests through X-Ray provides adequate coverage, but an occasional latency increase in healthy traffic may go untraced.

Instrumentation adjustments
During active investigations, engineers may need to temporarily raise log verbosity or tracing rates, then dial them back to maintain the cost balance.

This approach balances operational effectiveness with predictable costs, ensuring the team can respond efficiently without courting unexpected financial risk.

FUTURE ARCHITECTURAL TRIGGERS

This approach fits Roulette Technologies' current scale. The signals that would justify a more advanced observability stack are:

1. Metric growth. Custom metrics increase to the point that CloudWatch costs approach those of a dedicated Amazon Managed Service for Prometheus cluster.

2. Resolution limits. 5-minute basic monitoring becomes inadequate for high-frequency transactional services requiring sub-minute response times.

3. Audit and compliance needs. Regulatory requirements demand substantial multi-month historical queries, at which point the S3/Athena/IA model becomes less cost-effective than a dedicated OpenSearch cluster.

These triggers define clear, measurable criteria for deciding when a more complex solution is warranted.

FINAL TAKEAWAY

The measure of good observability is not the quantity of data collected but the quality of the decisions that data supports. Roulette Technologies took a disciplined, automated, and cost-conscious approach to ensure it pays only for the insights it needs. By treating observability as a finite resource, the company maintains a high-density signal for operational stability while protecting capital. The strategy provides clear visibility into system health while avoiding unnecessary cost and operational overhead.

Observability Architecture · Cost Optimization · Logging Strategy
BR364108
Jan 16, 2026 6 min read

When VPC Segmentation Increases Cost and Operational Risk (A Production Case)

Roulette Technologies is a product-focused software company building and operating customer-facing systems in the cloud. As its platform moved from experimentation into steady production usage, one architectural question surfaced quickly: how should the network be structured so that failures are contained, costs remain predictable, and a small engineering team can still operate the system confidently? Together we'll walk through how Roulette Technologies evaluated VPC segmentation options, the constraints that shaped the decision, and why a deliberately balanced approach was chosen.

Editorial context: Roulette Technologies documents production-informed decisions based on a combination of direct experience and observed industry patterns. Specific details are representative, not exhaustive.

The Situation Roulette Technologies Found Itself In

Roulette Technologies had reached product–market fit. Traffic was consistent. Customers depended on the system. Failures were no longer theoretical: this was live.

At the same time, the organization was intentionally lean:

- A team of three to five generalist engineers
- No dedicated platform or networking team
- A tightly controlled AWS budget
- Revenue that was meaningful, but not tolerant of waste

The system itself followed a familiar shape (a web application with clear application and data layers) but the network design now mattered in ways it hadn't before. Any decision made at this stage would have long-term consequences:

- It would define how far failures could spread
- It would determine how difficult incidents would be to debug
- It would either enable or block future compliance efforts
- It would influence how costs scaled with traffic

Framing the Real Problem

Rather than starting with AWS best practices or reference diagrams, Roulette Technologies framed the problem more narrowly: how can a production VPC be segmented to contain failures and security incidents without exceeding cost limits or operational capacity?
Framed that way, the problem is more human and easier to reason about. The framing intentionally excluded several tempting directions:

- Microservices-versus-monolith debates
- Zero-trust networking
- Service meshes
- Multi-region architectures
- Kubernetes networking abstractions

Those concerns belonged to a future version of the company. The goal here was survivable production infrastructure, not theoretical perfection.

The Constraints That Shaped Every Decision

A few realities dominated the discussion:

Network costs scale with traffic, not intent. The most dangerous costs were not hourly infrastructure charges, but per-GB data processing and transfer.

Operational mistakes are expensive. A design that saves money but amplifies human error is not cost-effective.

Debugging speed matters more than elegance. At 2am, clarity beats cleverness.

Security is about containment, not absolutes. Eliminating all risk was unrealistic, but limiting the blast radius was achievable.

With these constraints in mind, Roulette Technologies evaluated several approaches.

The Simplest Path: Minimal Segmentation

The first option was straightforward:

- Public subnets for ingress
- Private subnets for everything else
- Application and data living side by side

This approach was attractive for its simplicity. Routing was easy to reason about. Costs were minimal at low traffic volumes. But when failure scenarios were examined, the weaknesses were obvious:

- A compromised application had a direct path to data
- A single misconfiguration could expose the entire system
- There was no clean way to isolate damage during an incident

For a production system handling customer data, this fragility was unacceptable.

The Opposite Extreme: Full Isolation Everywhere

At the other end of the spectrum was heavy segmentation:

- Multiple VPCs
- Strict boundaries between environments
- Deep isolation at every layer

On paper, this offered excellent containment.
In practice, it introduced a different kind of risk:

- Complex routing paths
- Higher operational overhead
- Debugging that required specialized expertise
- Costs that scaled nonlinearly with complexity

For a small team, this level of isolation created more operational risk than it removed.

The Approach Roulette Technologies Chose

Ultimately, Roulette Technologies settled on a tiered segmentation model:

- A public tier for controlled ingress and egress
- A private application tier for compute and background work
- A private data tier with no direct internet access
- The same structure replicated across two availability zones

This approach was intentionally conservative. It did not aim for maximum isolation. It aimed for predictable failure boundaries, observable cost growth and operational clarity. Failures in one tier would not automatically cascade into others. Traffic paths were explicit and reviewable. Costs grew primarily with usage, not configuration mistakes. Most importantly, engineers could understand the system under pressure.

Why This Balance Worked

From a security standpoint, sensitive data lived behind explicit network boundaries. From a cost standpoint, the dominant drivers (data processing and transfer) were visible and controllable. From an operational standpoint, incident response followed a clear mental model. The architecture also left room to evolve:

- Additional segmentation could be introduced later
- Compliance-driven controls could be layered on
- Multi-VPC or multi-region designs could be justified when needed

Nothing about the design closed doors prematurely.

Accepted Risk, Documented Intentionally

Roulette Technologies did not pretend this design eliminated risk. It explicitly accepted:

- Some lateral-movement potential within the application tier
- A single VPC as a shared-fate boundary
- Heavy reliance on security groups and infrastructure-as-code discipline

These risks were documented, monitored and tied to clear revisit conditions. Undocumented risk creates surprises.
Documented risk creates options.

When the Decision Would Be Revisited

The company agreed to reevaluate the architecture if:

- Network costs grew disproportionately relative to traffic
- Compliance requirements became mandatory
- The engineering team grew large enough to support deeper specialization
- A real security incident exposed weaknesses in containment
- The system needed to expand across regions

Until then, this design represented the right tradeoff for the company's stage.

What This Story Is Really About

This article is not about VPCs. It is about:

- Designing within constraints instead of ideals
- Treating cost as a first-class architectural concern
- Balancing security with human operability
- Making decisions that can evolve instead of becoming blockers later

That is the mindset Roulette Technologies applied, and the mindset this article is meant to illustrate.

Final takeaway: good infrastructure is not defined by how advanced it looks, but by how well it matches the scale, risks and people responsible for operating it.
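The tiered model this article describes can be sketched as a concrete subnet plan: three tiers replicated across two availability zones, carved out of one VPC block. This is a minimal illustration using Python's `ipaddress` module; the CIDR ranges, tier names and /24 sizing are assumptions for the sketch, not Roulette Technologies' actual values.

```python
import ipaddress

# Hypothetical tiered layout: public ingress, private app, private data,
# each replicated across two AZs. All CIDRs are illustrative assumptions.
VPC_CIDR = "10.0.0.0/16"
TIERS = ["public", "app", "data"]
AZS = ["a", "b"]

def plan_subnets(vpc_cidr: str, tiers, azs, new_prefix: int = 24):
    """Carve one subnet of size /new_prefix per (tier, AZ) pair from the VPC block."""
    blocks = ipaddress.ip_network(vpc_cidr).subnets(new_prefix=new_prefix)
    return {f"{tier}-{az}": str(next(blocks)) for tier in tiers for az in azs}

plan = plan_subnets(VPC_CIDR, TIERS, AZS)
for name, cidr in plan.items():
    print(f"{name:10s} {cidr}")
```

Generating the plan programmatically keeps the tier boundaries explicit and reviewable, which matches the article's emphasis on traffic paths an engineer can reason about under pressure.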

Networking VPC Small Team
BR833263
Jan 16, 2026 10 min read

Scaling for a 10× Traffic Surge When Infrastructure Has No Time to React

We were hosting our final demo day, an event that had attracted significant public and investor attention. As the live stream begins and announcements are distributed across multiple channels, a large number of users attempt to access the platform simultaneously. Under normal conditions, the system operates under a predictable and well-understood load profile. Within minutes of the broadcast going live, however, traffic increases sharply, reaching a 10× surge relative to baseline usage. This sudden shift pushes the system beyond its typical operating envelope.

Editorial context: Build Roulette documents production-informed decisions based on a combination of direct experience and observed industry patterns. Specific details are representative, not exhaustive.

Monitoring dashboards begin to surface stress indicators. CPU utilization approaches 95%, database connection pools near exhaustion and request latency steadily increases across critical application paths. A platform that was stable moments earlier is now operating close to failure thresholds.

This is a real-world burst scenario: not a synthetic load test or a planned capacity exercise, but a genuine production event where user demand changes faster than infrastructure can react. In these moments, engineering decisions must be made quickly to prevent cascading failures and service downtime. To respond effectively to this type of surge, engineering teams typically evaluate two primary scaling strategies:

- Vertical Scaling (Scaling Up)
- Horizontal Scaling (Scaling Out)

This article examines how each approach behaves under extreme, short-lived traffic spikes, the operational and cost tradeoffs involved and how to determine the most appropriate strategy, or combination of strategies, to maintain system reliability during high-visibility events.
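The core tension, demand moving faster than provisioning, can be made concrete with a toy model before examining the strategies themselves. Every number below (boot time, per-node capacity, thresholds, fleet sizes) is an illustrative assumption, not a measurement from the event described in this article.

```python
# Toy model of unserved demand while new instances boot during a surge.
# All constants are assumed for illustration only.
BOOT_MINUTES = 4       # launch + init + health checks before a node serves
CAP_PER_NODE = 100     # requests/min a single app node can serve (assumed)
TRIGGER_UTIL = 0.60    # scale out when demand exceeds 60% of capacity
SCALE_STEP = 2         # nodes requested per scaling action

def simulate(initial_nodes: int, demand_per_min, max_nodes: int = 20) -> int:
    """Return total unserved requests over a per-minute demand timeline."""
    ready = initial_nodes
    pending = []  # minutes remaining until each launching node is ready
    unserved = 0
    for demand in demand_per_min:
        pending = [m - 1 for m in pending]
        ready += sum(1 for m in pending if m <= 0)   # booted nodes join
        pending = [m for m in pending if m > 0]
        capacity = ready * CAP_PER_NODE
        unserved += max(0, demand - capacity)
        if demand > capacity * TRIGGER_UTIL and ready + len(pending) < max_nodes:
            pending += [BOOT_MINUTES] * SCALE_STEP   # request more nodes
    return unserved

# 10x surge: baseline 200 req/min jumps to 2000 req/min at minute 3.
surge = [200, 200, 200] + [2000] * 12
reactive_only = simulate(initial_nodes=2, demand_per_min=surge)
pre_provisioned = simulate(initial_nodes=8, demand_per_min=surge)
print(reactive_only, pre_provisioned)
```

The model's point is qualitative, not quantitative: a fleet that starts with headroom drops far fewer requests than one that only reacts after the spike, because every reactive node pays the boot delay during the worst minutes of the surge.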
Assumptions

This evaluation assumes a cloud-native AWS environment with a stateless application tier, managed database services, existing observability and the ability to modify instance types or Auto Scaling policies during planned or unplanned traffic surges.

1. VERTICAL SCALING: THE IMMEDIATE CAPACITY EXPANSION

When a high-visibility event triggers a sudden increase in demand, the most direct engineering response is often vertical scaling, commonly referred to as scaling up. Rather than introducing additional servers, vertical scaling increases the capacity of existing infrastructure by upgrading CPU, memory or storage resources. This allows a single instance to process a higher volume of requests within the same architectural footprint. From an operational perspective, it is often the fastest way to create additional headroom during an active incident.

EVALUATING THE BURST RESPONSE

The primary advantage of vertical scaling during a live surge is its low operational complexity. Because the architecture remains unchanged, no new load-balancing logic, service discovery mechanisms or inter-node coordination is required.

- Speed of Relief: Scaling up provides immediate access to additional compute resources. A larger instance can absorb higher concurrency without waiting for new servers to be provisioned or warmed up.
- Data Simplicity: For stateful components such as relational databases, vertical scaling is typically the most straightforward way to increase capacity without introducing data consistency risks or replication complexity.

At peak capacity, a vertically scaled instance can sustain substantial throughput on its own. However, this approach introduces important cost considerations and hard scalability limits.

COST CONSIDERATIONS: PERFORMANCE VS. ECONOMICS

When evaluating vertical scaling, cost analysis must go beyond the hourly price of a larger instance.
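To see why hourly price alone misleads, consider a rough comparison of a burstable t3.large in Unlimited Mode against a fixed-performance m5.large. The prices and baseline below are assumptions based on typical us-east-1 Linux on-demand rates; actual figures vary by region and pricing model and should be checked against current AWS pricing.

```python
# Back-of-the-envelope break-even for burstable vs fixed-performance
# instances. All prices are assumed example rates, not authoritative.
T3_LARGE_HOURLY = 0.0832   # $/hr, 2 vCPUs, Unlimited Mode (assumed rate)
M5_LARGE_HOURLY = 0.0960   # $/hr fixed-performance alternative (assumed rate)
SURPLUS_CREDIT = 0.05      # $ per surplus vCPU-hour on Linux (assumed rate)
T3_BASELINE = 0.30         # t3.large baseline utilization per vCPU
VCPUS = 2

def t3_effective_hourly(avg_utilization: float) -> float:
    """Hourly cost of a t3.large sustaining a given average CPU utilization."""
    surplus = max(0.0, avg_utilization - T3_BASELINE) * VCPUS * SURPLUS_CREDIT
    return T3_LARGE_HOURLY + surplus

# Solve t3_effective_hourly(u) == M5_LARGE_HOURLY for u:
break_even = T3_BASELINE + (M5_LARGE_HOURLY - T3_LARGE_HOURLY) / (VCPUS * SURPLUS_CREDIT)
print(f"break-even sustained utilization ≈ {break_even:.1%}")
```

With these assumed rates the break-even lands in the low-to-mid 40% range: below it, the burstable instance is cheaper; above it, surplus-credit charges make the fixed-performance instance the better deal.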
For burst-heavy workloads, AWS Burstable Performance Instances (the T family) are commonly used; they operate on a CPU credit model.

- The Break-even Threshold: For instances such as t3.large, cost efficiency begins to degrade once sustained average CPU utilization approaches ~42.5%, depending on region, pricing model and workload characteristics.
- Unlimited Mode Costs: When Unlimited Mode is enabled, AWS charges for surplus CPU credits (approximately $0.05 per vCPU-hour for Linux). Under sustained high utilization, this can quickly exceed the cost of fixed-performance instances such as M5 or M6i, eroding the economic advantage of burstable instances.
- Efficiency Gains: In some cases, upgrading to a more powerful instance class allows compute-intensive workloads to complete faster, resulting in fewer total instance-hours consumed than on a smaller, constantly saturated instance.

PRODUCTION CONSTRAINTS

Despite its speed and simplicity, vertical scaling has clear limitations under extreme burst conditions:

- Downtime Risk: Instance resizing often requires a stop-and-start operation. During a live event, even brief downtime can cause significant user drop-off and reputational impact.
- The Hardware Ceiling: Vertical scaling is bounded by the largest available instance types, such as u-24tb1.metal (24 TB RAM) or c5d.metal (96 vCPUs). Once these limits are reached, no further scaling is possible without architectural changes.

Vertical scaling is highly effective for immediate relief, but if a burst transitions into sustained demand, cost inefficiencies and hardware ceilings will eventually necessitate scaling out.

2. HORIZONTAL SCALING: DISTRIBUTED CAPACITY

While vertical scaling focuses on strengthening a single node, horizontal scaling, or scaling out, takes a fundamentally different approach. Instead of relying on one powerful machine, the system adds more identical nodes and distributes traffic across them.
This strategy aligns closely with modern cloud-native design principles and is the foundation of high-availability architectures.

EVALUATING THE BURST RESPONSE

Horizontal scaling is designed to handle unpredictable, high-volume traffic spikes that exceed the capacity of any single machine.

- Failure Resilience: By distributing the workload across multiple instances, the system reduces single points of failure. If one node fails under load, the remaining healthy nodes continue serving traffic.
- Elastic Growth: Unlike vertical scaling, which is constrained by hardware limits, horizontal scaling can expand for as long as additional instances can be provisioned through Auto Scaling Groups (ASGs). This provides a much higher theoretical ceiling for growth.

BUSINESS PERSPECTIVE: COST AND ELASTICITY

From a financial standpoint, horizontal scaling offers a more granular approach to cost control.

- Pay-As-You-Go Efficiency: With EC2 Auto Scaling, capacity can scale in automatically once traffic subsides. This prevents long-term overprovisioning after a short-lived event.
- Use of Spot Instances: Distributed fleets can incorporate Spot Instances to reduce cost significantly. Because the workload is spread across many nodes, occasional interruptions have minimal impact when properly architected.
- Operational Overhead: Horizontal scaling introduces additional complexity. Load balancers, health checks, networking and distributed monitoring require more engineering effort than single-instance deployments.

PRODUCTION CONSTRAINTS

Horizontal scaling introduces its own challenges during sudden bursts:

- "Cold-Start" Latency: New instances need time to launch, initialize and pass health checks. If the surge is extremely sudden, existing capacity may be overwhelmed before new instances become available.
- Load Balancing Dependency: Effective scaling out requires a properly configured load balancer, adding a small but unavoidable layer of latency and cost.
- Database Bottleneck: While application servers can scale horizontally, databases are significantly harder to scale out without architectural changes. Without mitigations such as read replicas or connection pooling, scaling out can simply shift the bottleneck to the data tier.

Horizontal scaling is the preferred strategy for long-term reliability, but its effectiveness during a burst depends on how quickly infrastructure can react.

3. DIAGONAL SCALING: COMBINING SPEED AND RESILIENCE

In high-pressure scenarios such as live broadcasts, choosing between vertical and horizontal scaling is often a false dilemma. Many production systems rely on a hybrid approach commonly referred to as diagonal scaling, which combines the immediate relief of vertical scaling with the resilience and elasticity of horizontal scaling.

THE DIAGONAL WORKFLOW

- Immediate Capacity Expansion (Vertical): Prior to or at the onset of the event, core instances are upgraded to a more powerful class to create instant headroom and avoid cold-start delays (e.g., moving from a t3.medium to a c5.xlarge).
- Fleet Replication (Horizontal): Once utilization reaches predefined thresholds, such as the ~42.5% break-even point, Auto Scaling replicates the optimized instance across multiple Availability Zones to absorb continued growth.

This approach provides rapid response while maintaining a high ceiling for sustained demand.

EXAMPLE CONFIGURATION (SIMPLIFIED)

- Base instance: c5.xlarge, right-sized for sustained efficiency
- Auto Scaling Group: minimum 2, maximum 20 across multiple Availability Zones (AZs)
- Scaling trigger: CPU utilization > 60% for 2 minutes
- Database protection: RDS Proxy enabled prior to scale-out

PROTECTING THE DATA TIER

Even the most resilient application tier can fail if the database is overwhelmed during a connection surge.
- Amazon RDS Proxy: RDS Proxy maintains a pool of established connections, shielding the database from connection storms as new application instances spin up.
- Edge Caching: Using services such as Amazon CloudFront to cache static and frequently accessed content can offload a significant portion of traffic from the application entirely.

These measures ensure that scaling strategies focus on dynamic workloads rather than unnecessary database pressure.

4. COMPARATIVE ANALYSIS: EVALUATING SCALING TRADEOFFS

The team evaluated these strategies against three core drivers: latency, resilience and complexity. While vertical scaling addresses latency through immediate power, horizontal scaling provides long-term resilience. The primary technical challenge identified was the "Reaction Gap": the time between a traffic spike and the infrastructure's ability to respond.

SCALING STRATEGY COMPARISON

- Vertical: immediate relief and low complexity, but resize downtime risk and a hard hardware ceiling
- Horizontal: high resilience and an elastic ceiling, but cold-start latency and greater operational overhead
- Diagonal: fast initial relief plus a high ceiling, at the cost of coordinating both mechanisms

5. THE STRATEGIC CHOICE

For the "Final Demo Day" event, Roulette Investment explicitly chose a diagonal scaling strategy. This decision was driven by the need for zero downtime during the broadcast and a lean engineering team that could not afford complex manual interventions at the height of the surge.

How the Problem Was Solved

- Phase 1, Proactive Vertical Headroom: Thirty minutes before the live stream, the core application nodes were vertically scaled from t3.medium to c5.xlarge instances. This provided immediate "compute insurance" against the initial wave of traffic, bypassing the cold-start latency of horizontal scaling.
- Phase 2, Automated Horizontal Elasticity: An Auto Scaling Group (ASG) was configured with a minimum of 4 and a maximum of 20 instances, triggered when CPU utilization exceeded 60%. Once the vertical nodes reached their efficiency limit (the ~42.5% break-even point), the system automatically replicated the optimized fleet to absorb the surge of simultaneous requests.
- Phase 3, Database Decoupling: To prevent a database bottleneck, Amazon RDS Proxy was enabled.
This shielded the database from "connection storms" as the application fleet expanded, allowing the data tier to remain stable while the web tier grew.

Why This Balance Worked

- Operational Clarity: The team had a clear "preflight" checklist, reducing the risk of human error during the live event.
- Cost Control: By using Spot Instances for the horizontal fleet, the company reduced its surge-related compute costs by nearly 70% compared to a purely on-demand vertical strategy.
- User Experience: The combined approach delivered zero downtime and consistent request latency throughout the 10× surge.

Engineering is the art of choosing the right tradeoffs at the right time. For Roulette Investment, the "right-sized" architecture was not the most complex one, but the one that guaranteed a successful launch under the most intense scrutiny. When the spotlight is on, infrastructure should be the last thing engineers have to worry about.
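As a closing sketch, the Phase 2 trigger described above (scale out when CPU stays above 60% for two minutes, bounded by the ASG's minimum of 4 and maximum of 20) can be expressed as a small decision function. In production this logic lives in an Auto Scaling policy, not application code; the function name and step size here are assumptions for illustration.

```python
# Hand-rolled illustration of the article's ASG trigger; a real deployment
# would configure an EC2 Auto Scaling policy instead.
ASG_MIN, ASG_MAX = 4, 20
CPU_THRESHOLD = 60.0     # percent
SUSTAINED_SAMPLES = 2    # one CPU sample per minute
SCALE_OUT_STEP = 2       # instances added per trigger (assumed step size)

def next_capacity(current: int, cpu_history: list) -> int:
    """Return the desired instance count given recent per-minute CPU samples."""
    recent = cpu_history[-SUSTAINED_SAMPLES:]
    breached = len(recent) == SUSTAINED_SAMPLES and all(
        c > CPU_THRESHOLD for c in recent
    )
    desired = current + SCALE_OUT_STEP if breached else current
    return max(ASG_MIN, min(ASG_MAX, desired))   # clamp to ASG bounds

# One hot minute alone does not trigger; two in a row do.
print(next_capacity(4, [55.0, 95.0]))   # 4 (not sustained)
print(next_capacity(4, [95.0, 95.0]))   # 6 (scale out)
print(next_capacity(19, [95.0, 95.0]))  # 20 (clamped at max)
```

The sustained-breach condition is what keeps a single noisy CPU sample from launching instances, mirroring the "> 60% for 2 minutes" rule in the article's configuration.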

Scalability High Availability Reliability Engineering