Systems engineering · Production-grade · Documented & reusable

System Design
for Real-World
Infrastructure.

We design, implement and review distributed systems under actual production constraints. From Engineering Decision Records to live investigations, Build Roulette connects architectural theory to real engineering outcomes.

Core Competencies: Distributed Systems · Cloud Native · Reliability Engineering · Security Ops · Cost Control
Have a real system or architecture issue?
Submit Problem

Our Services

Build Roulette provides hands-on engineering solutions, rigorous architectural documentation and live insights into production environments. We don't just build; we explain the why behind every decision.

System Design Implementations

We design and implement cloud and distributed systems from real problem statements, reflecting constraints like reliability targets, security posture and cost limits.

Output

Architecture diagrams, implementation walkthroughs and deployment artifacts.

Architecture Evaluations

Structured evaluations of system designs that examine architectural decisions, risks, trade-offs, and alignment with production constraints and recognized frameworks.

Output

Prioritized findings, risk assessments and actionable architectural recommendations.

Live Investigations

We analyze real production issues (outages, performance regressions, cost spikes) and walk through the investigation process in real time.

Output

Recorded investigations showing real-time engineering reasoning and resolution.

Latest Documentation

Recent engineering decisions and architectural reviews.

BR102970
Jan 29, 2026 13 min read

Why We Banned Long-Lived Cloud Credentials in Production

Roulette Technologies had achieved product-market fit. The product was handling real customer data, revenue was increasing and the team had grown from two founders to seven engineers. Then, during an audit of our access controls, we discovered that a former backend engineer who had left weeks earlier still had active AWS credentials. Her access keys, issued eight months earlier during an urgent production fix, were still valid. She had full access to our primary RDS instance. She could decrypt secrets from Parameter Store. She could update S3 bucket policies.

Editorial context: Build Roulette documents production-informed decisions based on a combination of direct experience and observed industry patterns. Specific details are representative, not exhaustive.

The engineer wasn't malicious. Her credentials weren't compromised. But that is exactly the problem: the exposure was real, and we had no idea it existed until we audited our access controls. This is a class of problem teams routinely discover too late, and it represented a real failure mode in our security posture.

The question we asked ourselves was: how do we ensure that access credentials expire, are associated with real identities and leave an audit trail we can actually trust?

The Constraint Surface

Before we discuss what we built, let's be clear about what we were working with.

Team Reality
- Seven generalist engineers. No security team.
- Everyone deployed to production. Everyone needed AWS access.
- On-call rotations meant debugging at 2am was a regular occurrence.
Technical Constraints
- Services deployed on ECS Fargate and Lambda
- Infrastructure managed with Terraform
- CI/CD pipeline implemented with GitHub Actions
- Primary cloud provider: AWS
- Budget: tight controls, cost accountability per service

Security Reality
- SOC 2 Type II audit required in six months
- Customer contracts now required security questionnaires
- A past breach at a competitor had spooked investors

Operational Reality
- Engineers required production access
- Deployments occurred multiple times per day
- Breaking the deployment pipeline was not acceptable
- Adding 10 minutes to the dev loop would be a productivity killer

What We Explicitly Did Not Consider

This problem statement deliberately excluded the following:

Zero-trust architecture with mutual TLS everywhere
- Required a platform team we didn't have
- Would have added months to our timeline
- Operational complexity beyond our team's capacity

Hardware security keys for all engineers
- Procurement and distribution logistics
- Remote team with time zone differences
- Cost per engineer not yet justified

A completely locked-down production environment with ticket-based access
- Would have broken our debugging workflow
- Would have created dependency on an approval process
- Would have slowed our incident response unacceptably

A third-party privileged access management platform
- Cost: $50K+ per year
- Would have added to our integration queue, delaying other priorities
- Would have created vendor lock-in risk

We were looking for something that fit our reality: our team size, the tools we already had and the fact that our engineers debug production problems at 2am.

The Decision: Short-Lived Credentials Everywhere

We decided to remove all long-lived credentials from production systems. Every authentication would use short-lived credentials with a bounded lifetime.
For Engineers (Human Access)
- AWS SSO with role assumption
- Maximum session duration: 8 hours
- Re-authentication required daily
- MFA enforced at the SSO layer

For Services (Machine Access)
- ECS task roles for containers
- Lambda execution roles for functions
- No IAM access keys in environment variables
- No secrets in code or configuration files

For CI/CD (Automation Access)
- GitHub Actions OIDC provider
- Short-lived credentials issued per workflow run
- Credentials expire when the workflow completes
- No secrets stored in GitHub

Credential Lifetime Policy
- Human sessions: 8 hours maximum
- Service credentials: automatic rotation by AWS
- CI/CD credentials: scoped to workflow execution (5-15 minutes)
- Emergency break glass: 1 hour maximum, logged and alerted

Implementation Details

Human Access Pattern

Engineers authenticate through AWS SSO, which is integrated with our Google Workspace identity provider. When an engineer needs production access, they assume a role with the necessary permissions.

    # Engineer logs in once per day
    aws sso login --profile production

    # Credentials are refreshed automatically
    aws s3 ls --profile production

Behind the scenes:
1. AWS SSO verifies identity with Google Workspace
2. An MFA challenge is sent to the user (Google Authenticator or a physical token)
3. Temporary credentials are created via role assumption
4. Credentials are stored locally and expire after 8 hours
5. Credentials auto-refresh while within the session window

The role determines permissions.
We created three base roles:

    Role      | Permissions                                         | Access Grant
    ReadOnly  | View resources, read logs, describe infrastructure  | Every engineer starts here
    Developer | Deploy services, modify non-sensitive config        | Granted for the on-call rotation
    Admin     | Full access, reserved for infrastructure changes    | Requires approval and justification

Service Access Pattern

Previously, our ECS task definitions embedded static keys:

    {
      "environment": [
        { "name": "AWS_ACCESS_KEY_ID", "value": "AKIA..." },
        { "name": "AWS_SECRET_ACCESS_KEY", "value": "..." }
      ]
    }

After:

    {
      "taskRoleArn": "arn:aws:iam::ACCOUNT:role/ProductionAPIServiceRole",
      "executionRoleArn": "arn:aws:iam::ACCOUNT:role/ECSTaskExecutionRole"
    }

The task role grants access to AWS services. Credentials are provided automatically by the ECS agent and rotate every hour. Application code does not change; the AWS SDK takes care of refreshing the credentials. For our Python services:

    # Before: explicit credentials
    s3_client = boto3.client(
        's3',
        aws_access_key_id=os.environ['AWS_ACCESS_KEY_ID'],
        aws_secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY']
    )

    # After: automatic credential resolution
    s3_client = boto3.client('s3')

The SDK automatically detects the ECS credential provider endpoint. No code changes are necessary beyond removing the credential configuration.
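As context for the hourly rotation mentioned above: inside a task, the SDK fetches credentials from the ECS agent's link-local endpoint and refreshes them before they expire. A minimal sketch of that refresh check — the `needs_refresh` helper and its ten-minute window are illustrative assumptions, not the SDK's actual internals:

```python
from datetime import datetime, timedelta, timezone

# Link-local address served by the ECS agent for task-role credentials.
ECS_CREDS_HOST = "169.254.170.2"

def needs_refresh(expiration: datetime, now: datetime,
                  window: timedelta = timedelta(minutes=10)) -> bool:
    """True when credentials are within `window` of expiring (window is hypothetical)."""
    return now >= expiration - window

# Inside a task, the SDK does roughly the following (sketch, not SDK source):
#   import os, json, urllib.request
#   uri = os.environ["AWS_CONTAINER_CREDENTIALS_RELATIVE_URI"]
#   creds = json.load(urllib.request.urlopen(f"http://{ECS_CREDS_HOST}{uri}"))
#   # creds holds AccessKeyId, SecretAccessKey, Token and Expiration
```

The point of the pattern is that none of this appears in application code; boto3 performs the equivalent of this loop transparently.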
CI/CD Access Pattern

Our GitHub Actions workflows used stored secrets:

    # Before
    - name: Deploy to production
      env:
        AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
        AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      run: |
        terraform apply

We migrated to OIDC authentication:

    # After
    - name: Configure AWS credentials
      uses: aws-actions/configure-aws-credentials@v4
      with:
        role-to-assume: arn:aws:iam::ACCOUNT:role/GitHubActionsDeployRole
        aws-region: us-east-1

    - name: Deploy to production
      run: |
        terraform apply

The OIDC flow works as follows:
1. GitHub Actions generates a signed token for the workflow run
2. The token includes repository, branch and workflow metadata
3. AWS verifies the token's signature using GitHub's public keys
4. If valid, AWS issues temporary credentials for the role
5. The credentials expire when the workflow completes (5-15 minutes)

The IAM role trust policy controls which repositories can use the role:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Federated": "arn:aws:iam::ACCOUNT:oidc-provider/token.actions.githubusercontent.com"
          },
          "Action": "sts:AssumeRoleWithWebIdentity",
          "Condition": {
            "StringEquals": {
              "token.actions.githubusercontent.com:aud": "sts.amazonaws.com",
              "token.actions.githubusercontent.com:sub": "repo:roulette-tech/infrastructure:ref:refs/heads/main"
            }
          }
        }
      ]
    }

Only workflows in the infrastructure repository, running on the main branch, can assume this role. A compromised developer laptop cannot produce valid tokens, and a leaked secret cannot be replayed.
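The trust-policy conditions above boil down to AWS comparing two claims in the verified token. A toy check mirroring that comparison — the claims dict is fabricated for illustration; real tokens are signed JWTs that AWS verifies against GitHub's public keys before reading any claims:

```python
# Allowed subject, matching the trust policy's StringEquals condition.
ALLOWED_SUB = "repo:roulette-tech/infrastructure:ref:refs/heads/main"

def sub_allowed(claims: dict) -> bool:
    """Mirror the trust policy: correct audience AND exact repo/branch subject."""
    return (claims.get("aud") == "sts.amazonaws.com"
            and claims.get("sub") == ALLOWED_SUB)
```

A workflow run from any other branch carries a different `sub` value and is rejected before any credentials are issued.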
What This Actually Changed

Security Improvements

Automatic expiration
- Credentials expire by design, not by policy
- No rotation process to forget
- Limited blast radius in the event of a breach

Identity binding
- Every action is tied to a real identity
- CloudTrail shows who took each action
- No credential sharing among team members

Reduced attack surface
- No credentials in environment variables
- No secrets in source control
- No long-lived keys in CI/CD systems

Audit trail
- SSO login activity recorded
- Role assumptions recorded
- Session durations recorded

Operational Impact

Developer experience
- One login per day instead of managing access keys
- No reminders to rotate credentials
- MFA handled at the SSO level, not per service

Incident response
- Revoke access by disabling the SSO account
- Credentials expire after at most 8 hours
- No key hunting during emergencies

Onboarding/offboarding
- New engineer: SSO access and role assignment
- Departing engineer: disable the SSO account; access is terminated immediately
- No credential cleanup checklist

Cost
- AWS SSO: free
- No third-party PAM solution needed
- Engineering time: ~40 hours for the full migration

What We Explicitly Accepted

This was a deliberately conservative approach. We did not attempt perfect security; we aimed for measurably better security with reasonable operational overhead.
Accepted Limitations

Role-based access is coarse-grained
- Three roles may be too broad a categorization
- Engineers with the Developer role have more access than any single task requires
- Fine-grained access would have meant a role per engineer

8-hour sessions are long
- Credentials remain valid for a full working day
- Compromise within that window is still possible
- Shorter sessions would have hindered debugging

Break-glass accounts exist
- Emergency root access is available
- Stored in 1Password, protected with MFA
- Used twice in six months, both for SSO outages

No network-level isolation
- Credential controls alone do not prevent lateral movement
- We assumed services would have access to each other
- Network segmentation would have meant a VPC overhaul

CloudTrail lag
- Up to 15 minutes before events are logged
- No real-time monitoring
- A compromise could be underway for minutes before detection

We made this set of trade-offs because it fit our team size, our needs and our security concerns. Your mileage may vary.

Measurement and Validation

We monitored the following metrics to validate the decision.
Before migration
- Active IAM access keys: 23
- Average access key age: 147 days
- Keys older than 90 days: 14
- Credential rotation incidents: 0 (rotation never happened)

After migration (6 months)
- Active IAM access keys: 0 (break-glass excepted)
- Average session duration: 4.2 hours
- Failed authentications: 312 (all expired sessions)
- Time to revoke access when offboarding: under 5 minutes

Security audit results
- SOC 2 report: "Adequate credential management controls"
- Previous report: "Recommendation: Implement credential rotation"
- Customer security questionnaire blockers: 0
- Investor security review: passed without credential concerns

Developer feedback
- Time to onboard AWS access: 15 minutes
- "Have to re-auth once per day" (an acceptable level of friction)
- Incidents caused by credential problems: 1 (a session expired mid-deploy; a retry resolved it)

When This Approach Breaks Down

This model works for us at Roulette Technologies. It will not work forever.
Signals that it's time to evolve further:

Team size grows beyond 20 engineers
- Three roles no longer provide enough granularity
- More permission levels become necessary
- Role proliferation has to be managed

Compliance needs change
- An 8-hour session may be too long for PCI DSS
- Just-in-time access provisioning becomes necessary
- Access escalation may require approvals

A multi-cloud environment emerges
- AWS SSO is no longer enough; GCP and Azure need support
- A unified identity model is required
- Credentials must be managed across multiple clouds

A service mesh with mutual TLS emerges
- Machine identities need more granularity
- Certificate-based authentication becomes necessary
- SPIFFE/SPIRE are a natural fit

Third parties need production access
- Vendors or contractors need temporary access
- Time-based access with automatic expiration is required
- An audit trail for non-employee access is required

The Next Evolution

When these conditions are met, the next evolution might include:
- A dedicated platform team to manage identity infrastructure
- A third-party PAM solution for just-in-time access
- Policy as code for permission management
- Certificate-based authentication for service-to-service traffic

But these are future problems. For a team of seven engineers with tightly controlled scope, short-lived credentials solved the problem we actually had.
Lessons Learned

Identity, not credentials
- Anchoring access to Google Workspace identities solved the root problem
- A single source of truth for who has access
- Offboarding is now automatic

Platform primitives
- AWS SSO, ECS task roles and OIDC are free
- Building a custom solution would have taken months
- Platform primitives improve over time

Operational simplicity > theoretical security
- Perfect isolation would have broken debugging
- Engineers need production access to fix production
- Security that prevents work gets bypassed

Measure the before state
- Knowing we had 23 active keys proved the case
- An average key age of 147 days was damning
- The numbers convinced leadership to prioritize the migration

Migration is incremental
- Started with new services on task roles
- Next, migrated CI/CD (highest risk)
- Last, migrated engineer access (most disruptive)

Documentation matters more than you think
- Engineers needed a runbook for the new auth flow
- "How do I debug production?" became an FAQ entry
- Documentation reduced support requests

This Article Is Not About Credentials

This EDR is really about:

Designing with constraints, not ideals
- We didn't implement a zero-trust architecture
- We implemented what we could with what we had
- Good enough today beats perfect never

Treating security as an operational reality
- Security measures need to fit how we work
- Engineers will circumvent security that gets in the way
- Usability is a security feature

Designing security with trade-offs
- 8-hour sessions are long, but acceptable
- Coarse-grained roles are imperfect, but workable
- Honesty about trade-offs builds trust

Designing security to evolve
- What we have today won't scale 10x
- We defined clear signals for change
- Architecture is never finished

Designing from real-world experience
- A former engineer's still-active credentials were the wake-up call
- Abstract security threats don't motivate change
- Real exposure drives real change

The Final Question

The credential ban wasn't about best practices.
The credential ban was about solving a real problem. If you're a three-person startup, you probably don't need this yet. If you're a 50-person company, you needed it yesterday. Whatever your size, ask yourself: do you know how many active credentials exist in your production environment right now?

If the answer is no, you have the same problem we had. And if you discover the answer at 2am in the middle of an incident, trust us: you'll wish you had solved it sooner.
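To make that question answerable, here is a small audit sketch in the spirit of the key-age numbers reported above. The summarization logic is pure and testable; the commented boto3 collection loop follows the IAM API shape but is an untested sketch that ignores pagination:

```python
from datetime import datetime, timedelta, timezone

def key_report(keys):
    """Summarize [(user, created, status)] tuples: active keys and stale ones."""
    now = datetime.now(timezone.utc)
    ages = [(now - created).days for user, created, status in keys
            if status == "Active"]
    return {"active_keys": len(ages),
            "older_than_90d": sum(1 for age in ages if age > 90)}

# Collecting the input against a real account (requires AWS credentials):
#   import boto3
#   iam = boto3.client("iam")
#   keys = [(k["UserName"], k["CreateDate"], k["Status"])
#           for u in iam.list_users()["Users"]
#           for k in iam.list_access_keys(UserName=u["UserName"])["AccessKeyMetadata"]]
#   print(key_report(keys))
```

Run against our pre-migration account, a report like this would have surfaced the 23 active keys immediately.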

Short-Lived Credentials · AWS Identity · Pragmatic Security
BR725845
Jan 26, 2026 10 min read

Designing a logging and observability strategy that won't bankrupt you

As Roulette Technologies transitioned from controlled beta to live production, the system was behaving predictably. The fundamental workflows were steady and all customer-facing channels performed as expected. There was no recurring pattern of operational incidents demanding attention. The first real concern surfaced outside the application: a routine monthly AWS expenditure review revealed a significant increase in observability costs. CloudWatch usage had grown faster than any other service, accounting for nearly one-third of total cloud spend.

This expenditure wasn't associated with outages, incident volume or higher error rates. It traced back to AWS default configurations, which favor extensive data collection regardless of cost.

Editorial context: Build Roulette documents production-informed decisions based on a combination of direct experience and observed industry patterns. Specific details are representative, not exhaustive.

All log groups used the Standard ingestion class, even where the log data would rarely, if ever, be queried. Retention policies were set to "Never Expire", allowing storage costs to accumulate quietly over time. Meanwhile, 1-minute detailed metrics were enabled for the majority of the fleet, turning otherwise free basic monitoring into an ongoing expense.

The system was observable in the broadest sense, but much of the telemetry it gathered had little operational relevance. Successful requests, background noise and routine health checks received the same priority, cost and storage as failed requests. Monitoring spend was approaching the cost of the compute it observed.
THE OBJECTIVE: ACHIEVING SUSTAINABLE OBSERVABILITY

The team at Roulette Technologies revisited their observability setup with a precise goal: maintain fast detection and effective incident response while eliminating telemetry that carried no operational value. The point was not merely to collect less data, but to control which data entered the system at all. A filter-first architecture allows observability to scale with system health instead of traffic volume.

The restructure was directed by three basic constraints:

1. Operational simplicity
The engineering team comprises five generalists and does not operate a dedicated SRE function. Self-hosted observability stacks such as Prometheus and Grafana were evaluated, but their ongoing maintenance, infrastructure management and on-call overhead exceeded what the team could reliably support. Any solution had to remain AWS-managed, low-touch and operationally predictable.

2. Metrics-first detection
Detection is based entirely on metrics. Logs are stored for focused investigation, and traces support dependency and latency analysis. Automated alerting is deliberately decoupled from log scanning to avoid the constant ingestion and query costs tied to detailed logging.

3. Automated governance
The design had to eliminate manual cleanups and reactive cost controls. Filtering, retention and escalation needed to be enforced automatically and remain effective as traffic grew and services evolved, without requiring engineers to intervene during incidents.

THE STRATEGY: TECHNICAL IMPLEMENTATION AND TRADE-OFFS

To manage telemetry growth, Roulette Technologies made a set of targeted changes without replacing the existing observability stack. The aim was to preserve incident visibility while constraining data ingestion, storage and analysis expenses.
The strategy prefers operational signals over exhaustive history, valuing telemetry by its contribution to detection, investigation or dependency analysis, on the expectation that not all data provides equal value during operation. All changes were implemented with AWS-managed services, avoiding additional operational overhead.

Paths Intentionally Avoided

Self-hosted observability platforms such as Prometheus and Grafana were assessed but ruled out. They provide robust cost control but carry infrastructure management complexity that is currently beyond the team's capacity. A self-managed stack would have traded cost savings for engineer time, a trade-off that is not justified for a team of this size without an SRE function. Because Roulette Technologies is AWS-native, cost controls within CloudWatch (ingestion tiers, filters and other managed features) remain applicable. Application instrumentation follows OpenTelemetry-compatible standards, allowing future adoption of managed observability platforms without significant refactoring.

Method 1: Implementing a Tiered Logging Model

Roulette Technologies set up a tiered logging model that aligns log storage and query costs with operational usage patterns. Logs are categorized by how frequently they are accessed and their role in incident response.

Standard class: Production logs that power real-time metric filters or need near-immediate visibility (for example, sub-second Live Tail usage) remain in the Standard class.

Infrequent Access (IA) tier: General-purpose application logs and routine system events are routed to this tier, where ingestion and storage costs are much lower.

Trade-off: The IA tier carries higher query and retrieval costs.
This was acceptable given the shape of operational troubleshooting, which cares more about recent failures and targeted queries than broad historical log analysis.

Method 2: Edge-Filtering via the CloudWatch Agent

CloudWatch costs are dominated by per-GB ingestion and analysis. To reduce them, filtering was applied before telemetry reached the AWS-managed endpoints. CloudWatch Agent configurations on EC2 instances use filter expressions (regular expressions) to drop INFO- and DEBUG-level messages such as HTTP 200 and 3xx request logs. By dropping high-volume, low-value telemetry at the instance level, ingestion costs fell significantly while the error and failure signals needed for incident response were preserved.

Method 3: Metrics for Detection, Logs for Investigation

All automated alerting is driven exclusively by CloudWatch Metrics. Non-critical services rely on free basic monitoring at 5-minute resolution, with 1-minute detailed metrics reserved for the critical transactional path. CloudWatch Dashboards expose the Golden Signals (latency, error rate and throughput) per critical service in real time, without relying on ad-hoc log queries. Metric Filters extract structured numerical data from Standard-class logs, surfacing application-level failure patterns without retaining the detailed log content itself.

Method 4: Adaptive Distributed Tracing

In Roulette Technologies' microservice architecture, distributed tracing is required for dependency and latency analysis, but tracing every request is too costly. AWS X-Ray was deployed with a baseline sampling rate of 5% for healthy requests, while targeted sampling ensures 100% of requests resulting in 4xx or 5xx responses are captured.
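Locally, the sampling policy just described reduces to a simple rule: always trace errors, sample a fixed fraction of healthy traffic. A sketch using the article's rates — the function and its injectable random source are illustrative; in production X-Ray enforces this through sampling rules, not application code:

```python
import random

HEALTHY_RATE = 0.05  # 5% baseline for successful requests

def should_trace(status_code: int, rng=random.random) -> bool:
    """Trace every 4xx/5xx response; sample healthy responses at HEALTHY_RATE."""
    if status_code >= 400:
        return True  # errors are always captured
    return rng() < HEALTHY_RATE
```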
This strategy maintains full visibility into failures and critical paths while avoiding the cost of recording routine, successful traffic.

Method 5: The "Financial Circuit Breaker"

To prevent unexpected telemetry spikes from growing out of control, a financial safeguard was introduced. An AWS Budget publishes to an SNS topic, which triggers a Lambda function when daily spend for observability services exceeds 120 percent of forecast. The function updates a global parameter in AWS Systems Manager Parameter Store. Applications watch this parameter and automatically reduce their log output to CRITICAL on breach, limiting further cost exposure until remediation is done.

Incident Response Workflow

Incident detection is driven by CloudWatch Metric Alarms; CloudWatch Dashboards support quick service-level troubleshooting; and AWS X-Ray traces identify dependency and latency issues. Targeted CloudWatch Logs Insights queries are then run against Standard-class logs to confirm root cause. Infrequent Access logs are queried only when additional historical context is required. This workflow enables quick investigation while keeping unnecessary log scanning low during normal operations.

STRATEGIC OUTCOMES: ALIGNING VISIBILITY WITH VALUE

The filter-first architecture ensures that observability costs scale with actionable events instead of raw traffic. Key outcomes include:

Predictable unit economics
High-volume logs move to the Infrequent Access (IA) tier with a seven-day retention policy, turning storage into a predictable unit cost.

Increased signal density
Agent-side filtering removes the background noise of routine heartbeats and health checks, leaving a clear stream of useful information for faster root-cause analysis.
Detection over investigation
Alerting is metrics-first, minimizing generic log scans as a detection mechanism. Logs are used only as a means of investigation, keeping investigations streamlined yet sufficiently traceable.

This approach aligns cost with operational value, preserves visibility for key services, and enables rapid response with fiscal discipline.

ACKNOWLEDGED STRATEGIC TRADE-OFFS

The architecture accepts controlled operational gaps in exchange for cost discipline:

Context gap
Filtering out INFO/DEBUG messages makes it harder to review each successful transaction, which can impede reconstructing the lead-up to an error.

Variable retrieval costs
IA-tier logs lower daily storage costs but can cost more for historical searches, an accepted trade-off.

Tracing granularity
Sampling 5% of healthy requests through X-Ray provides adequate coverage, but an occasional latency increase in healthy traffic may go untraced.

Instrumentation adjustments
Engineers may need to raise log verbosity or tracing rates during investigations, then dial them back to maintain the cost balance.

This approach balances operational effectiveness with predictable costs, ensuring the team can respond efficiently without incurring unexpected financial risk.

FUTURE ARCHITECTURAL TRIGGERS

This approach is appropriate at Roulette Technologies' current scale. Clear signals that a more advanced observability stack is warranted:

1. Metric growth
When custom metrics grow to the point that CloudWatch costs approach those of a dedicated Amazon Managed Service for Prometheus cluster.

2. Resolution limits
When 5-minute basic monitoring cannot serve high-frequency transactional services that require sub-minute detection.

3.
Audit and compliance needs
When regulatory requirements demand significant multi-month historical queries, at which point an S3/Athena/IA model becomes less cost-effective than a dedicated OpenSearch cluster.

These triggers give clear, measurable criteria for deciding when a more complex solution is appropriate.

FINAL TAKEAWAY

The measure of good observability is not the quantity of data collected but the quality of the decisions that data supports. Roulette Technologies took a disciplined, automated and cost-conscious approach to ensure it pays only for the insights it needs. By treating observability as a finite resource, the company maintains a high-density signal for operational stability while protecting capital. The strategy provides clear visibility into system health while avoiding unnecessary cost and operational overhead.
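The "financial circuit breaker" described in Method 5 reduces to one decision plus one Parameter Store write. A sketch using the article's 120% threshold — the parameter path and log-level values are illustrative assumptions:

```python
BREACH_RATIO = 1.20  # trip at 120% of forecast, per the article
PARAM_NAME = "/observability/log-level"  # hypothetical parameter path

def breaker_state(daily_spend: float, forecast: float) -> str:
    """Return the log level applications should adopt given today's observability spend."""
    if forecast > 0 and daily_spend > BREACH_RATIO * forecast:
        return "CRITICAL"  # breach: clamp log output until remediated
    return "INFO"

# The Lambda behind the SNS topic would persist the state (untested sketch):
#   import boto3
#   boto3.client("ssm").put_parameter(Name=PARAM_NAME,
#       Value=breaker_state(spend, forecast), Type="String", Overwrite=True)
```

Applications poll the parameter and adjust their loggers, which is what makes the breaker take effect without a deploy.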

Observability Architecture · Cost Optimization · Logging Strategy
BR364108
Jan 16, 2026 6 min read

When VPC Segmentation Increases Cost and Operational Risk (A Production Case)

Roulette Technologies is a product-focused software company building and operating customer-facing systems in the cloud. As its platform moved from experimentation into steady production usage, one architectural question surfaced quickly: how should the network be structured so that failures are contained, costs remain predictable and a small engineering team can still operate the system confidently?

This article walks through how Roulette Technologies evaluated VPC segmentation options, the constraints that shaped the decision and why a deliberately balanced approach was chosen.

Editorial context: Roulette Technologies documents production-informed decisions based on a combination of direct experience and observed industry patterns. Specific details are representative, not exhaustive.

The Situation Roulette Technologies Found Itself In

Roulette Technologies had reached product-market fit. Traffic was consistent. Customers depended on the system. Failures were no longer theoretical.

At the same time, the organization was intentionally lean:
- A team of three to five generalist engineers
- No dedicated platform or networking team
- A tightly controlled AWS budget
- Revenue that was meaningful, but not tolerant of waste

The system itself followed a familiar shape (a web application with clear application and data layers), but the network design now mattered in ways it hadn't before. Any decision made at this stage would have long-term consequences:
- It would define how far failures could spread
- It would determine how difficult incidents would be to debug
- It would enable or block future compliance efforts
- It would influence how costs scaled with traffic

Framing the Real Problem

Rather than starting from AWS best practices or reference diagrams, Roulette Technologies framed the problem more narrowly: how can a production VPC be segmented to contain failures and security incidents without exceeding cost limits or operational capacity?
This framing intentionally excluded several tempting directions:
- Microservices versus monolith debates
- Zero-trust networking
- Service meshes
- Multi-region architectures
- Kubernetes networking abstractions

Those concerns belonged to a future version of the company. The goal here was survivable production infrastructure, not theoretical perfection.

The Constraints That Shaped Every Decision

A few realities dominated the discussion:

Network costs scale with traffic, not intent. The most dangerous costs were not hourly infrastructure charges, but per-GB data processing and transfer.

Operational mistakes are expensive. A design that saves money but amplifies human error is not cost-effective.

Debugging speed matters more than elegance. At 2am, clarity beats cleverness.

Security is about containment, not absolutes. Eliminating all risk was unrealistic, but limiting the blast radius was achievable.

With these constraints in mind, Roulette Technologies evaluated several approaches.

The Simplest Path: Minimal Segmentation

The first option was straightforward:
- Public subnets for ingress
- Private subnets for everything else
- Application and data living side by side

This approach was attractive for its simplicity. Routing was easy to reason about. Costs were minimal at low traffic volumes. But when failure scenarios were examined, the weaknesses were obvious:
- A compromised application had a direct path to data
- A single misconfiguration could expose the entire system
- There was no clean way to isolate damage during an incident

For a production system handling customer data, this fragility was unacceptable.

The Opposite Extreme: Full Isolation Everywhere

At the other end of the spectrum was heavy segmentation:
- Multiple VPCs
- Strict boundaries between environments
- Deep isolation at every layer

On paper, this offered excellent containment.
In practice, it introduced a different kind of risk:

Complex routing paths
Higher operational overhead
Debugging that required specialized expertise
Costs that scaled non-linearly with complexity

For a small team, this level of isolation created more operational risk than it removed.

The Approach Roulette Technologies Chose

Ultimately, Roulette Technologies settled on a tiered segmentation model:

A public tier for controlled ingress and egress
A private application tier for compute and background work
A private data tier with no direct internet access
The same structure replicated across two availability zones

This approach was intentionally conservative. It did not aim for maximum isolation. It aimed for predictable failure boundaries, observable cost growth and operational clarity. Failures in one tier would not automatically cascade into others. Traffic paths were explicit and reviewable. Costs grew primarily with usage, not configuration mistakes. Most importantly, engineers could understand the system under pressure.

Why This Balance Worked

From a security standpoint, sensitive data lived behind explicit network boundaries. From a cost standpoint, the dominant drivers (data processing and transfer) were visible and controllable. From an operational standpoint, incident response followed a clear mental model.

The architecture also left room to evolve:

Additional segmentation could be introduced later
Compliance-driven controls could be layered on
Multi-VPC or multi-region designs could be justified when needed

Nothing about the design closed doors prematurely.

Accepted Risk, Documented Intentionally

Roulette Technologies did not pretend this design eliminated risk. It explicitly accepted:

Some lateral movement potential within the application tier
A single VPC as a shared fate boundary
Heavy reliance on security groups and infrastructure-as-code discipline

These risks were documented, monitored and tied to clear revisit conditions.
Undocumented risk creates surprises. Documented risk creates options.

When the Decision Would Be Revisited

The company agreed to re-evaluate the architecture if:

Network costs grew disproportionately relative to traffic
Compliance requirements became mandatory
The engineering team grew large enough to support deeper specialization
A real security incident exposed weaknesses in containment
The system needed to expand across regions

Until then, this design represented the right trade-off for the company's stage.

What This Story Is Really About

This article is not about VPCs. It is about:

Designing within constraints instead of ideals
Treating cost as a first-class architectural concern
Balancing security with human operability
Making decisions that can evolve instead of becoming blockers later

That is the mindset Roulette Technologies applied, and the mindset this article is meant to illustrate.

Final takeaway

Good infrastructure is not defined by how advanced it looks, but by how well it matches the scale, risks and people responsible for operating it.
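As a concrete illustration of the tiered model described above, the subnet layout can be sketched as a CIDR plan: one VPC range carved into public, application and data subnets across two availability zones. The 10.0.0.0/16 range and the /20 subnet size are assumptions chosen for illustration, not the company's actual values.

```python
import ipaddress

# Illustrative CIDR plan for a three-tier VPC replicated across two AZs.
# The VPC range and /20 subnet size are assumptions, not real values.
VPC_CIDR = "10.0.0.0/16"
TIERS = ["public", "app", "data"]
AZS = ["a", "b"]

def plan_subnets(vpc_cidr, tiers, azs, new_prefix=20):
    """Assign one non-overlapping subnet per (tier, AZ) pair from the VPC range."""
    blocks = ipaddress.ip_network(vpc_cidr).subnets(new_prefix=new_prefix)
    return {(tier, az): str(next(blocks)) for tier in tiers for az in azs}

plan = plan_subnets(VPC_CIDR, TIERS, AZS)
for (tier, az), cidr in sorted(plan.items()):
    print(f"{tier}-{az}: {cidr}")
```

Because the subnets are carved mechanically from one range, the traffic paths between tiers stay explicit and reviewable, which is the operational clarity the article argues for.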

Networking VPC Small Team
BR833263
Jan 16, 2026 10 min read

Scaling for a 10× Traffic Surge When Infrastructure Has No Time to React

We are hosting our final demo day, an event that has attracted significant public and investor attention. As the live stream begins and announcements are distributed across multiple channels, a large number of users attempt to access the platform simultaneously. Under normal conditions, the system operates under a predictable and well-understood load profile. However, within minutes of the broadcast going live, traffic increases sharply, reaching a 10× surge relative to baseline usage. This sudden shift pushes the system beyond its typical operating envelope.

Editorial context: Build Roulette documents production-informed decisions based on a combination of direct experience and observed industry patterns. Specific details are representative, not exhaustive.

Monitoring dashboards begin to surface stress indicators. CPU utilization approaches 95%, database connection pools near exhaustion and request latency steadily increases across critical application paths. A platform that was stable moments earlier is now operating close to failure thresholds.

This situation represents a real-world burst scenario. It is not a synthetic load test or a planned capacity exercise, but a genuine production event where user demand changes faster than infrastructure can react. In these moments, engineering decisions must be made quickly to prevent cascading failures and service downtime.

To respond effectively to this type of surge, engineering teams typically evaluate two primary scaling strategies:

Vertical Scaling (Scaling Up)
Horizontal Scaling (Scaling Out)

This article examines how each approach behaves under extreme, short-lived traffic spikes, the operational and cost trade-offs involved and how to determine the most appropriate strategy or combination of strategies to maintain system reliability during high-visibility events.
Assumptions

This evaluation assumes a cloud-native AWS environment with a stateless application tier, managed database services, existing observability and the ability to modify instance types or Auto Scaling policies during planned or unplanned traffic surges.

1. VERTICAL SCALING: THE IMMEDIATE CAPACITY EXPANSION

When a high-visibility event triggers a sudden increase in demand, the most direct engineering response is often vertical scaling, commonly referred to as scaling up. Rather than introducing additional servers, vertical scaling increases the capacity of existing infrastructure by upgrading CPU, memory, or storage resources. This approach allows a single instance to process a higher volume of requests within the same architectural footprint. From an operational perspective, it is often the fastest way to create additional headroom during an active incident.

EVALUATING THE BURST RESPONSE

The primary advantage of vertical scaling during a live surge is its low operational complexity. Because the architecture remains unchanged, no new load-balancing logic, service discovery mechanisms, or inter-node coordination is required.

Speed of Relief: Scaling up provides immediate access to additional compute resources. A larger instance can absorb higher concurrency levels without waiting for new servers to be provisioned or warmed up.
Data Simplicity: For stateful components such as relational databases, vertical scaling is typically the most straightforward way to increase capacity without introducing data consistency risks or replication complexity.

At peak capacity, a vertically scaled instance can sustain substantial throughput on its own. However, this approach introduces important cost considerations and hard scalability limits.

COST CONSIDERATIONS: PERFORMANCE VS. ECONOMICS

When evaluating vertical scaling, cost analysis must go beyond the hourly price of a larger instance.
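To make that concrete, here is a minimal sketch of how surplus-credit charges can erode a burstable instance's price advantage as average utilization rises. The on-demand prices, the 30% baseline and the $0.05 surplus rate are illustrative assumptions; actual values vary by region and change over time.

```python
# Compare a burstable t3.large in Unlimited Mode against a fixed-performance
# m5.large as average CPU utilization rises. All prices and the 30% baseline
# are illustrative assumptions, not authoritative AWS pricing.
T3_HOURLY = 0.0832    # assumed t3.large on-demand $/hour
M5_HOURLY = 0.0960    # assumed m5.large on-demand $/hour
T3_BASELINE = 0.30    # assumed baseline utilization per vCPU
T3_VCPUS = 2
SURPLUS_RATE = 0.05   # assumed $ per surplus vCPU-hour (Unlimited Mode)

def t3_hourly_cost(avg_util):
    """Hourly cost of the burstable instance including surplus-credit charges."""
    surplus = max(0.0, avg_util - T3_BASELINE) * T3_VCPUS * SURPLUS_RATE
    return T3_HOURLY + surplus

for util in (0.20, 0.40, 0.45, 0.60):
    cheaper = "t3" if t3_hourly_cost(util) < M5_HOURLY else "m5"
    print(f"avg CPU {util:.0%}: t3 cost ${t3_hourly_cost(util):.4f}/h -> {cheaper} cheaper")
```

Under these assumed numbers, the crossover lands in the low-40% utilization range, which is consistent with the breakeven threshold discussed in this article; the exact figure depends entirely on the real regional prices.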
For burst-heavy workloads, AWS Burstable Performance Instances (T-family) are commonly used, operating on a CPU credit model.

The Breakeven Threshold: For instances such as t3.large, cost efficiency begins to degrade once sustained average CPU utilization approaches ~42.5%, depending on region, pricing model and workload characteristics.
Unlimited Mode Costs: When Unlimited Mode is enabled, AWS charges for surplus CPU credits (for example, approximately $0.05 per vCPU-hour for Linux). Under sustained high utilization, this can quickly exceed the cost of fixed-performance instances such as M5 or M6i, reducing the economic advantage of burstable instances.
Efficiency Gains: In some cases, upgrading to a more powerful instance class allows compute-intensive workloads to complete faster, resulting in fewer total instance-hours consumed compared to a smaller, constantly saturated instance.

PRODUCTION CONSTRAINTS

Despite its speed and simplicity, vertical scaling has clear limitations under extreme burst conditions:

Downtime Risk: Instance resizing often requires a stop-and-start operation. During a live event, even brief downtime can result in significant user drop-off and reputational impact.
The Hardware Ceiling: Vertical scaling is bounded by the largest available instance types like the u-24tb1.metal (24 TB RAM) or c5d.metal (96 vCPUs). Once these limits are reached, no further scaling is possible without architectural changes.

Vertical scaling is highly effective for immediate relief, but if a burst transitions into sustained demand, cost inefficiencies and hardware ceilings will eventually necessitate scaling out.

2. HORIZONTAL SCALING: DISTRIBUTED CAPACITY

While vertical scaling focuses on strengthening a single node, horizontal scaling, or scaling out, takes a fundamentally different approach. Instead of relying on one powerful machine, the system adds more identical nodes and distributes traffic across them.
This strategy aligns closely with modern cloud-native design principles and is the foundation of high-availability architectures.

EVALUATING THE BURST RESPONSE

Horizontal scaling is designed to handle unpredictable, high-volume traffic spikes that exceed the capacity of any single machine.

Failure Resilience: By distributing the workload across multiple instances, the system significantly reduces single points of failure. If one node fails under load, remaining healthy nodes continue serving traffic.
Elastic Growth: Unlike vertical scaling, which is constrained by hardware limits, horizontal scaling can expand as long as additional instances can be provisioned using Auto Scaling Groups (ASG). This provides a much higher theoretical ceiling for growth.

BUSINESS PERSPECTIVE: COST AND ELASTICITY

From a financial standpoint, horizontal scaling offers a more granular approach to cost control.

Pay-As-You-Go Efficiency: With EC2 Auto Scaling, capacity can scale in automatically once traffic subsides. This prevents long-term overprovisioning after a short-lived event.
Use of Spot Instances: Distributed fleets can incorporate Spot Instances to reduce cost significantly. Because the workload is spread across many nodes, occasional interruptions have minimal impact when properly architected.
Operational Overhead: Horizontal scaling introduces additional complexity. Load balancers, health checks, networking and distributed monitoring require more engineering effort than single-instance deployments.

PRODUCTION CONSTRAINTS

Horizontal scaling introduces its own challenges during sudden bursts:

"Cold-Start" Latency: New instances require time to launch, initialize and pass health checks. If the surge is extremely sudden, existing capacity may be overwhelmed before new instances become available.
Load Balancing Dependency: Effective scaling out requires a properly configured load balancer, adding a small but unavoidable layer of latency and cost.
Database Bottleneck: While application servers can scale horizontally, databases are significantly harder to scale horizontally without architectural changes. Without mitigations such as read replicas or connection pooling, scaling out can shift the bottleneck to the data tier.

Horizontal scaling is the preferred strategy for long-term reliability, but its effectiveness during a burst depends on how quickly infrastructure can react.

3. DIAGONAL SCALING: COMBINING SPEED AND RESILIENCE

In high-pressure scenarios such as live broadcasts, choosing between vertical and horizontal scaling is often a false dilemma. Many production systems rely on a hybrid approach commonly referred to as diagonal scaling. Diagonal scaling combines the immediate relief of vertical scaling with the resilience and elasticity of horizontal scaling.

THE DIAGONAL WORKFLOW

Immediate Capacity Expansion (Vertical): Prior to or at the onset of the event, core instances in the application tier are upgraded to a more powerful class to create instant headroom and avoid cold-start delays (e.g., moving from a t3.medium to a c5.xlarge).
Fleet Replication (Horizontal): Once utilization reaches predefined thresholds (like the ~42.5% breakeven point), Auto Scaling replicates the optimized instance across multiple Availability Zones to absorb continued growth.

This approach provides rapid response while maintaining a high ceiling for sustained demand.

EXAMPLE CONFIGURATION (SIMPLIFIED)

Base instance: c5.xlarge (right-sized for sustained efficiency).
Auto Scaling Group: Minimum 2, Maximum 20 across multiple Availability Zones (AZs).
Scaling trigger: CPU utilization > 60% for 2 minutes.
Database protection: RDS Proxy enabled prior to scale-out.

PROTECTING THE DATA TIER

Even the most resilient application tier can fail if the database is overwhelmed during a connection surge.
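The protection a connection proxy provides here can be modeled as a bounded pool sitting between the application fleet and the database: however many application instances spin up, the database only ever sees the pool's fixed number of connections. This is a simplified sketch of the idea, not RDS Proxy's actual implementation, and all numbers are illustrative.

```python
class ConnectionPool:
    """Bounded pool: requests borrow one of `size` shared DB connections."""
    def __init__(self, size):
        self.size = size
        self.in_use = 0

    def acquire(self):
        if self.in_use < self.size:
            self.in_use += 1
            return True
        return False  # caller queues/waits instead of opening a new connection

    def release(self):
        self.in_use -= 1

# Without a proxy: 20 app instances x 50 connections each hit the DB directly.
direct = 20 * 50
# With a proxy-style pool, the database never sees more than the pool size.
pool = ConnectionPool(size=100)
granted = sum(pool.acquire() for _ in range(direct))
print(f"connections without proxy: {direct}")
print(f"connections the database sees with pooling: {granted}")
```

The design point is that the "connection storm" lands on the pool, which absorbs it by queueing, rather than on the database, which would otherwise exhaust its connection limit.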
Amazon RDS Proxy: RDS Proxy maintains a pool of established connections, shielding the database from connection storms as new application instances spin up.
Edge Caching: Using services such as Amazon CloudFront to cache static and frequently accessed content can offload a significant portion of traffic from the application entirely.

These measures ensure that scaling strategies focus on dynamic workloads rather than unnecessary database pressure.

4. COMPARATIVE ANALYSIS: EVALUATING SCALING TRADE-OFFS

The team evaluated these strategies against three core drivers: Latency, Resilience and Complexity. While vertical scaling addresses latency through immediate power, horizontal scaling provides long-term resilience. The primary technical challenge identified was the "Reaction Gap": the time between a traffic spike and the infrastructure's ability to respond.

SCALING STRATEGY COMPARISON

Strategy      Speed of Relief            Scaling Ceiling             Operational Complexity
Vertical      Immediate                  Largest instance type       Low, but resize may require downtime
Horizontal    Delayed by cold starts     Near-unlimited via ASG      Higher (load balancing, monitoring)
Diagonal      Immediate, then elastic    High                        Moderate (staged approach)

5. THE STRATEGIC CHOICE

For the "Final Demo Day" event, Roulette Technologies explicitly chose a Diagonal Scaling strategy. This decision was driven by the need for zero downtime during the broadcast and a lean engineering team that could not afford to manage complex manual interventions at the height of the surge.

How the Problem Was Solved

Phase 1, Proactive Vertical Headroom: Thirty minutes before the live stream, the core application nodes were vertically scaled from t3.medium to c5.xlarge instances. This provided immediate "compute insurance" against the initial wave of traffic, bypassing the cold-start latency of horizontal scaling.
Phase 2, Automated Horizontal Elasticity: An Auto Scaling Group (ASG) was configured with a minimum of 4 and a maximum of 20 instances, triggered when CPU utilization exceeded 60%. Once the vertical nodes reached their efficiency limit (the ~42.5% breakeven point), the system automatically replicated the optimized fleet to absorb the millions of simultaneous hits.
Phase 3, Database Decoupling: To prevent a database bottleneck, Amazon RDS Proxy was enabled. This shielded the database from "connection storms" as the application fleet expanded, allowing the data tier to remain stable while the web tier grew.

Why This Balance Worked

Operational Clarity: The team had a clear "pre-flight" checklist, reducing the risk of human error during the live event.
Cost Control: By using Spot Instances for the horizontal fleet, the company reduced its surge-related compute costs by nearly 70% compared to a purely on-demand vertical strategy.
User Experience: The combined approach resulted in zero downtime and consistent request latency throughout the 10× surge.

Engineering is the art of choosing the right trade-offs at the right time. For Roulette Technologies, the "right-sized" architecture was not the most complex one, but the one that guaranteed a successful launch under the most intense scrutiny. When the spotlight is on, infrastructure should be the last thing engineers have to worry about.
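For reference, the horizontal side of this workflow can be condensed into a simple capacity-decision sketch. The 60% trigger and the 4-to-20 fleet bounds mirror the simplified configuration described in the article; the function itself and its step sizes are illustrative, not the team's actual tooling.

```python
# Illustrative capacity-decision loop for the horizontal phase. Thresholds
# (60% trigger, min 4, max 20) mirror the article's example configuration;
# the step sizes are assumptions for demonstration.
def desired_capacity(avg_cpu, current, minimum=4, maximum=20, scale_out_cpu=0.60):
    """Return the ASG desired capacity for the observed average CPU."""
    if avg_cpu > scale_out_cpu:
        return min(maximum, current + 2)   # step out by two instances
    if avg_cpu < scale_out_cpu / 2:
        return max(minimum, current - 1)   # scale in gently after the surge
    return current

# Surge hits: CPU climbs past 60% and the fleet steps out toward the maximum.
fleet = 4
for cpu in (0.45, 0.70, 0.85, 0.90, 0.30):
    fleet = desired_capacity(cpu, fleet)
    print(f"avg CPU {cpu:.0%} -> desired capacity {fleet}")
```

In practice this decision logic lives in the Auto Scaling policy itself rather than application code; the sketch just makes the reaction behavior explicit.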

Scalability High Availability Reliability Engineering
View Knowledge Base

Built for Rigor, Not Just Demos.

Build Roulette is where engineering theory meets production reality. We prioritize robustness over simplicity.

Production Readiness
Security, observability and scalability aren't afterthoughts; they are the foundation of every build.
Economic Reality
Every design choice is evaluated against cost and operational overhead. We build for long-term sustainability.
Verifiable Artifacts
Our EDRs and implementation walkthroughs are built to be integrated into your team's permanent technical knowledge base.
Live Reasoning
We don't just show the final architecture; we document the "messy middle"—the investigation and trade-offs that lead to the solution.
Not tutorials
Not scripted
Not surface-level
Not guesswork

From Problem to Engineering Outcome.

We transform complex infrastructure challenges into clear, reusable engineering patterns—meticulously documented for production use.

Problem Synthesis

We ingest real-world system challenges and extract the core architectural patterns that matter most.

Live Engineering

We reproduce and resolve the issue, documenting the live investigation process and engineering reasoning.

EDR Creation

Every solution is paired with an Engineering Decision Record, explaining the trade-offs and rationale behind the fix.

Artifact Delivery

We publish the solution with production-ready code, detailed diagrams and comprehensive implementation guides.

Standard Solution Artifacts
Architecture Diagram
Engineering Decision Record (EDR)
Deployment Artifacts (Code/IaC)
Security & Reliability Guide
Cost & Operational Analysis
Environment Isolation All investigations are performed in isolated, production-mirrored environments. No sensitive data is ever requested or stored.

Featured episodes

Watch real experts solve critical production issues.

Look for more
No episodes featured yet. Stay tuned!

Contact Us

Share the core symptom, environment and what you've tried. We'll respond with next steps or prioritize it for a build.

Keep it reproducible (logs, minimal config, expected vs actual).
Do not share secrets, tokens or private customer data.
We can anonymize names and details on request.
View submission guide