Design AWS EC2 Service
A comprehensive high-level design for building a cloud compute service similar to Amazon EC2 (Elastic Compute Cloud) that provides resizable compute capacity in the cloud.
Table of Contents
- Problem Statement
- Functional Requirements
- Non-Functional Requirements
- Capacity Estimation
- System APIs
- Database Design
- High-Level Architecture
- Core Components
- Instance Lifecycle Management
- Networking Architecture
- Storage Architecture
- Security & Access Control
- Monitoring & Logging
- Scalability & Performance
- Disaster Recovery & High Availability
- Trade-offs & Design Decisions
Problem Statement
Design a cloud compute service that allows users to launch and manage virtual servers (instances) on-demand. Users should be able to:
- Launch virtual machines with different configurations (CPU, memory, storage)
- Choose from various operating systems and software stacks
- Scale compute capacity up or down based on demand
- Pay only for the compute capacity they use
- Access instances remotely via SSH/RDP
- Attach persistent storage volumes
- Configure networking and security settings
Key Challenges
- Resource Management: Efficiently allocate and manage physical hardware resources across millions of instances
- Multi-tenancy: Isolate customer instances while maximizing resource utilization
- Elasticity: Handle dynamic scaling with millions of start/stop operations per day
- Networking: Provide secure, isolated network environments for each customer
- Storage: Manage persistent and ephemeral storage for millions of instances
- Availability: Ensure high availability across multiple data centers and regions
- Billing: Accurate metering and billing for compute usage
Functional Requirements
Core Features
FR1: Instance Management
- Launch instances with specified configurations (instance type, OS, storage)
- Start, stop, restart, and terminate instances
- List and describe running instances
- Modify instance attributes (instance type, security groups, etc.)
FR2: Instance Types
- Support multiple instance families (general-purpose, compute-optimized, memory-optimized, storage-optimized, GPU)
- Each instance type with predefined CPU, memory, and network capacity
- Ability to change instance types (resize)
FR3: Operating System Support
- Support multiple OS images (Linux distributions, Windows Server)
- Pre-configured AMIs (Amazon Machine Images) or custom images
- Boot instances from images
FR4: Storage Management
- Ephemeral storage (instance store) - temporary, instance-specific
- Persistent storage (EBS volumes) - attachable/detachable block storage
- Support multiple volume types (SSD, HDD, NVMe)
- Create snapshots of volumes
FR5: Networking
- Virtual Private Cloud (VPC) - isolated network environment
- Subnets - segment networks within VPC
- Security Groups - firewall rules for instances
- Elastic IP addresses - static public IPs
- Network ACLs - subnet-level firewall
FR6: Access Control
- IAM (Identity and Access Management) integration
- Role-based access control
- API authentication via access keys
- SSH key pair management for Linux instances
- Password management for Windows instances
FR7: Monitoring & Metrics
- CloudWatch integration for instance metrics
- CPU, memory, disk, network utilization
- Custom metrics and alarms
- Instance logs
FR8: Auto Scaling
- Launch instances based on demand
- Scale up/down automatically
- Integration with load balancers
FR9: Spot Instances
- Bid for unused capacity at lower prices
- Instances can be terminated when capacity is needed
FR10: Reserved Instances
- Reserve capacity for 1-3 years at discounted rates
- Predictable pricing for long-term workloads
Optional Features
FR11: Dedicated Instances - Single-tenant hardware
FR12: Placement Groups - Control instance placement for low latency
FR13: Elastic GPUs - Attach GPU resources to instances
FR14: Instance Metadata Service - Access instance metadata
FR15: User Data Scripts - Run scripts at instance launch
Non-Functional Requirements
Performance
NFR1: High Availability - 99.99% uptime SLA
NFR2: Low Latency - Instance launch time < 60 seconds
NFR3: High Throughput - Support millions of API calls per second
NFR4: Network Performance - Up to 100 Gbps per instance
Scalability
NFR5: Handle millions of concurrent instances across multiple regions
NFR6: Support thousands of instance types and configurations
NFR7: Scale to petabytes of storage across all instances
Reliability
NFR8: Data Durability - 99.999999999% (11 nines) for EBS volumes
NFR9: Instance Migration - Automatic migration during hardware failures
NFR10: Multi-AZ Support - Distribute instances across availability zones
Security
NFR11: Network Isolation - Complete isolation between customer instances
NFR12: Encryption - Support encryption at rest and in transit
NFR13: Compliance - Support various compliance standards (SOC, PCI-DSS, HIPAA)
Cost
NFR14: Pay-per-use billing model
NFR15: Cost optimization through reserved instances and spot pricing
Capacity Estimation
Traffic Estimates
- Total customers: 1 million
- Active customers: 100,000 (10% active monthly)
- Instances per customer: Average 10 instances
- Total running instances: 1 million
- API calls per day: 1 billion
- Launch: 10 million/day
- Describe: 500 million/day
- Start/Stop: 50 million/day
- Terminate: 5 million/day
- Other operations: 435 million/day
- Read-to-write ratio: 50:1 (more reads than writes)
Storage Estimates
- Instance metadata: 1 KB per instance = 1 GB total
- AMI images: 10,000 images × 10 GB average = 100 TB
- EBS volumes: 1 million instances × 100 GB average = 100 PB
- Snapshots: 10% of volumes = 10 PB
- Logs: 1 MB per instance per day = 1 TB/day
Network Estimates
- API traffic: 1 billion requests × 2 KB average = 2 TB/day
- Instance traffic: 1 million instances × 10 GB/day = 10 PB/day
- Peak bandwidth: 1 Tbps
Compute Estimates
- Physical servers: 100,000 servers
- Average utilization: 60% (for over-provisioning)
- Virtualization ratio: 10:1 (10 VMs per physical server)
- Total capacity: 1 million VMs
System APIs
Instance Management APIs
1. Launch Instance
POST /api/v1/instances
Authorization: AWS4-HMAC-SHA256 Credential=...
Request Body:
{
"imageId": "ami-12345678",
"instanceType": "t3.medium",
"keyName": "my-key-pair",
"securityGroupIds": ["sg-12345678"],
"subnetId": "subnet-12345678",
"userData": "#!/bin/bash\necho 'Hello World'",
"storage": {
"volumes": [
{
"deviceName": "/dev/sda1",
"volumeType": "gp3",
"volumeSize": 20,
"deleteOnTermination": true
}
]
},
"tags": [
{"key": "Name", "value": "web-server-1"},
{"key": "Environment", "value": "production"}
],
"minCount": 1,
"maxCount": 1
}
Response:
{
"instances": [
{
"instanceId": "i-1234567890abcdef0",
"state": "pending",
"instanceType": "t3.medium",
"launchTime": "2024-01-15T10:30:00Z",
"privateIpAddress": "10.0.1.5",
"publicIpAddress": "54.123.45.67"
}
]
}
2. Describe Instances
GET /api/v1/instances?instanceIds=i-1234567890abcdef0,i-0987654321fedcba0
Authorization: AWS4-HMAC-SHA256 Credential=...
Response:
{
"instances": [
{
"instanceId": "i-1234567890abcdef0",
"state": "running",
"instanceType": "t3.medium",
"imageId": "ami-12345678",
"launchTime": "2024-01-15T10:30:00Z",
"privateIpAddress": "10.0.1.5",
"publicIpAddress": "54.123.45.67",
"vpcId": "vpc-12345678",
"subnetId": "subnet-12345678",
"securityGroups": [
{
"groupId": "sg-12345678",
"groupName": "web-servers"
}
],
"tags": [
{"key": "Name", "value": "web-server-1"}
]
}
]
}
3. Start Instance
POST /api/v1/instances/{instanceId}/start
Authorization: AWS4-HMAC-SHA256 Credential=...
Response:
{
"instanceId": "i-1234567890abcdef0",
"previousState": "stopped",
"currentState": "pending"
}
4. Stop Instance
POST /api/v1/instances/{instanceId}/stop
Authorization: AWS4-HMAC-SHA256 Credential=...
Response:
{
"instanceId": "i-1234567890abcdef0",
"previousState": "running",
"currentState": "stopping"
}
5. Terminate Instance
POST /api/v1/instances/{instanceId}/terminate
Authorization: AWS4-HMAC-SHA256 Credential=...
Response:
{
"instanceId": "i-1234567890abcdef0",
"previousState": "running",
"currentState": "shutting-down"
}
6. Modify Instance Attributes
PUT /api/v1/instances/{instanceId}
Authorization: AWS4-HMAC-SHA256 Credential=...
Request Body:
{
"instanceType": "t3.large",
"securityGroupIds": ["sg-12345678", "sg-87654321"]
}
Response:
{
"instanceId": "i-1234567890abcdef0",
"modifications": {
"instanceType": "t3.large",
"securityGroups": ["sg-12345678", "sg-87654321"]
}
}
Image Management APIs
7. Create Image (AMI)
POST /api/v1/images
Authorization: AWS4-HMAC-SHA256 Credential=...
Request Body:
{
"instanceId": "i-1234567890abcdef0",
"name": "my-custom-ami",
"description": "Custom AMI with web server",
"noReboot": false
}
Response:
{
"imageId": "ami-12345678",
"state": "pending",
"name": "my-custom-ami"
}
8. Describe Images
GET /api/v1/images?imageIds=ami-12345678
Authorization: AWS4-HMAC-SHA256 Credential=...
Response:
{
"images": [
{
"imageId": "ami-12345678",
"name": "my-custom-ami",
"description": "Custom AMI with web server",
"state": "available",
"architecture": "x86_64",
"platform": "Linux",
"creationDate": "2024-01-15T10:30:00Z",
"rootDeviceType": "ebs",
"blockDeviceMappings": [
{
"deviceName": "/dev/sda1",
"ebs": {
"volumeId": "vol-12345678",
"deleteOnTermination": true
}
}
]
}
]
}
Volume Management APIs
9. Create Volume
POST /api/v1/volumes
Authorization: AWS4-HMAC-SHA256 Credential=...
Request Body:
{
"availabilityZone": "us-east-1a",
"volumeType": "gp3",
"size": 100,
"encrypted": true,
"kmsKeyId": "arn:aws:kms:us-east-1:123456789012:key/12345678-1234-1234-1234-123456789012"
}
Response:
{
"volumeId": "vol-1234567890abcdef0",
"state": "creating",
"size": 100,
"volumeType": "gp3",
"availabilityZone": "us-east-1a"
}
10. Attach Volume
POST /api/v1/volumes/{volumeId}/attach
Authorization: AWS4-HMAC-SHA256 Credential=...
Request Body:
{
"instanceId": "i-1234567890abcdef0",
"device": "/dev/sdf"
}
Response:
{
"volumeId": "vol-1234567890abcdef0",
"instanceId": "i-1234567890abcdef0",
"device": "/dev/sdf",
"state": "attaching"
}
Database Design
1. Instances Table
CREATE TABLE instances (
instance_id VARCHAR(20) PRIMARY KEY,
account_id VARCHAR(12) NOT NULL,
image_id VARCHAR(20) NOT NULL,
instance_type VARCHAR(50) NOT NULL,
state ENUM('pending', 'running', 'stopping', 'stopped', 'shutting-down', 'terminated') NOT NULL,
availability_zone VARCHAR(20) NOT NULL,
subnet_id VARCHAR(20),
vpc_id VARCHAR(20),
private_ip_address VARCHAR(15),
public_ip_address VARCHAR(15),
elastic_ip_address VARCHAR(15),
key_name VARCHAR(255),
launch_time TIMESTAMP NOT NULL,
termination_time TIMESTAMP,
user_data TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
INDEX idx_account_id (account_id),
INDEX idx_state (state),
INDEX idx_availability_zone (availability_zone),
INDEX idx_launch_time (launch_time)
) PARTITION BY HASH(account_id);
2. Instance Tags Table
CREATE TABLE instance_tags (
instance_id VARCHAR(20) NOT NULL,
tag_key VARCHAR(255) NOT NULL,
tag_value VARCHAR(255),
PRIMARY KEY (instance_id, tag_key),
FOREIGN KEY (instance_id) REFERENCES instances(instance_id) ON DELETE CASCADE,
INDEX idx_tag_key_value (tag_key, tag_value)
);
3. Security Groups Table
CREATE TABLE security_groups (
security_group_id VARCHAR(20) PRIMARY KEY,
account_id VARCHAR(12) NOT NULL,
group_name VARCHAR(255) NOT NULL,
description TEXT,
vpc_id VARCHAR(20),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
UNIQUE KEY unique_account_group (account_id, group_name, vpc_id),
INDEX idx_account_id (account_id)
);
4. Security Group Rules Table
CREATE TABLE security_group_rules (
rule_id VARCHAR(20) PRIMARY KEY,
security_group_id VARCHAR(20) NOT NULL,
type ENUM('ingress', 'egress') NOT NULL,
protocol VARCHAR(10) NOT NULL,
from_port INT,
to_port INT,
cidr_ipv4 VARCHAR(18),
source_security_group_id VARCHAR(20),
description TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (security_group_id) REFERENCES security_groups(security_group_id) ON DELETE CASCADE,
INDEX idx_security_group_id (security_group_id)
);
5. Instance Security Groups Table
CREATE TABLE instance_security_groups (
instance_id VARCHAR(20) NOT NULL,
security_group_id VARCHAR(20) NOT NULL,
PRIMARY KEY (instance_id, security_group_id),
FOREIGN KEY (instance_id) REFERENCES instances(instance_id) ON DELETE CASCADE,
FOREIGN KEY (security_group_id) REFERENCES security_groups(security_group_id) ON DELETE CASCADE
);
6. Images (AMIs) Table
CREATE TABLE images (
image_id VARCHAR(20) PRIMARY KEY,
account_id VARCHAR(12) NOT NULL,
name VARCHAR(255) NOT NULL,
description TEXT,
state ENUM('pending', 'available', 'deregistered', 'failed') NOT NULL,
architecture VARCHAR(20) NOT NULL,
platform VARCHAR(50),
root_device_type VARCHAR(20) NOT NULL,
root_device_name VARCHAR(255),
creation_date TIMESTAMP NOT NULL,
public BOOLEAN DEFAULT FALSE,
INDEX idx_account_id (account_id),
INDEX idx_state (state),
INDEX idx_name (name)
);
7. Volumes Table
CREATE TABLE volumes (
volume_id VARCHAR(20) PRIMARY KEY,
account_id VARCHAR(12) NOT NULL,
availability_zone VARCHAR(20) NOT NULL,
volume_type VARCHAR(50) NOT NULL,
size INT NOT NULL,
state ENUM('creating', 'available', 'in-use', 'deleting', 'deleted', 'error') NOT NULL,
encrypted BOOLEAN DEFAULT FALSE,
kms_key_id VARCHAR(255),
iops INT,
throughput INT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
INDEX idx_account_id (account_id),
INDEX idx_state (state),
INDEX idx_availability_zone (availability_zone)
);
8. Volume Attachments Table
CREATE TABLE volume_attachments (
volume_id VARCHAR(20) NOT NULL,
instance_id VARCHAR(20) NOT NULL,
device VARCHAR(255) NOT NULL,
state ENUM('attaching', 'attached', 'detaching', 'detached') NOT NULL,
attach_time TIMESTAMP,
delete_on_termination BOOLEAN DEFAULT FALSE,
PRIMARY KEY (volume_id, instance_id),
FOREIGN KEY (volume_id) REFERENCES volumes(volume_id) ON DELETE CASCADE,
FOREIGN KEY (instance_id) REFERENCES instances(instance_id) ON DELETE CASCADE,
INDEX idx_instance_id (instance_id)
);
9. Snapshots Table
CREATE TABLE snapshots (
snapshot_id VARCHAR(20) PRIMARY KEY,
account_id VARCHAR(12) NOT NULL,
volume_id VARCHAR(20) NOT NULL,
state ENUM('pending', 'completed', 'error') NOT NULL,
progress INT DEFAULT 0,
size INT NOT NULL,
encrypted BOOLEAN DEFAULT FALSE,
start_time TIMESTAMP NOT NULL,
completed_time TIMESTAMP,
description TEXT,
INDEX idx_account_id (account_id),
INDEX idx_volume_id (volume_id),
INDEX idx_state (state)
);
10. Instance Metrics Table (Time-Series)
CREATE TABLE instance_metrics (
instance_id VARCHAR(20) NOT NULL,
metric_name VARCHAR(50) NOT NULL,
metric_value DECIMAL(10, 2) NOT NULL,
timestamp TIMESTAMP NOT NULL,
PRIMARY KEY (instance_id, metric_name, timestamp),
INDEX idx_timestamp (timestamp)
) PARTITION BY RANGE (YEAR(timestamp));
11. Reservations Table
CREATE TABLE reservations (
reservation_id VARCHAR(20) PRIMARY KEY,
account_id VARCHAR(12) NOT NULL,
instance_type VARCHAR(50) NOT NULL,
availability_zone VARCHAR(20),
instance_count INT NOT NULL,
state ENUM('active', 'payment-pending', 'retired') NOT NULL,
start_date DATE NOT NULL,
end_date DATE NOT NULL,
offering_type ENUM('Heavy Utilization', 'Medium Utilization', 'Light Utilization') NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
INDEX idx_account_id (account_id),
INDEX idx_state (state)
);
12. Billing Records Table
CREATE TABLE billing_records (
billing_id BIGINT PRIMARY KEY AUTO_INCREMENT,
account_id VARCHAR(12) NOT NULL,
instance_id VARCHAR(20),
resource_type VARCHAR(50) NOT NULL,
resource_id VARCHAR(50),
usage_start TIMESTAMP NOT NULL,
usage_end TIMESTAMP NOT NULL,
usage_quantity DECIMAL(10, 4) NOT NULL,
unit_price DECIMAL(10, 4) NOT NULL,
total_cost DECIMAL(10, 2) NOT NULL,
region VARCHAR(20) NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
INDEX idx_account_id (account_id),
INDEX idx_usage_start (usage_start),
INDEX idx_instance_id (instance_id)
) PARTITION BY RANGE (YEAR(usage_start));
High-Level Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Client Layer │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Web UI │ │ CLI Tools │ │ SDK/API │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
└─────────┼──────────────────┼──────────────────┼─────────────────┘
│ │ │
└──────────────────┼──────────────────┘
│
┌────────────────────────────┼────────────────────────────────────┐
│ API Gateway Layer │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Load Balancer (ALB/NLB) │ │
│ └───────────────────────────┬────────────────────────────┘ │
│ │ │
│ ┌────────────────────────────┼────────────────────────────┐ │
│ │ API Gateway (Rate Limiting, Auth, Routing) │ │
│ └────────────────────────────┼────────────────────────────┘ │
└───────────────────────────────┼─────────────────────────────────┘
│
┌───────────────────────────────┼─────────────────────────────────┐
│ Application Layer │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Instance Management Service (Stateless) │ │
│ │ - Launch/Start/Stop/Terminate │ │
│ │ - Describe Instances │ │
│ │ - Modify Attributes │ │
│ └───────────────────────────┬──────────────────────────────┘ │
│ │ │
│ ┌────────────────────────────┼────────────────────────────┐ │
│ │ Image Management Service │ │
│ │ - Create/Describe Images │ │
│ │ - Image Registry │ │
│ └────────────────────────────┼────────────────────────────┘ │
│ │ │
│ ┌────────────────────────────┼────────────────────────────┐ │
│ │ Storage Management Service │ │
│ │ - Volume Create/Attach/Detach │ │
│ │ - Snapshot Management │ │
│ └────────────────────────────┼────────────────────────────┘ │
│ │ │
│ ┌────────────────────────────┼────────────────────────────┐ │
│ │ Networking Service │ │
│ │ - VPC/Subnet Management │ │
│ │ - Security Groups │ │
│ │ - IP Address Management │ │
│ └────────────────────────────┼────────────────────────────┘ │
│ │ │
│ ┌────────────────────────────┼────────────────────────────┐ │
│ │ Orchestration Service │ │
│ │ - Resource Scheduling │ │
│ │ - Capacity Management │ │
│ │ - Placement Decisions │ │
│ └────────────────────────────┼────────────────────────────┘ │
└───────────────────────────────┼─────────────────────────────────┘
│
┌───────────────────────────────┼─────────────────────────────────┐
│ Control Plane Layer │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Hypervisor Management │ │
│ │ - Instance Lifecycle Control │ │
│ │ - Resource Allocation │ │
│ │ - Health Monitoring │ │
│ └───────────────────────────┬──────────────────────────────┘ │
│ │ │
│ ┌────────────────────────────┼────────────────────────────┐ │
│ │ Network Controller │ │
│ │ - Virtual Network Management │ │
│ │ - Security Group Enforcement │ │
│ │ - Routing & NAT │ │
│ └────────────────────────────┼────────────────────────────┘ │
│ │ │
│ ┌────────────────────────────┼────────────────────────────┐ │
│ │ Storage Controller │ │
│ │ - Volume Management │ │
│ │ - Snapshot Management │ │
│ │ - Storage Backend Coordination │ │
│ └────────────────────────────┼────────────────────────────┘ │
└───────────────────────────────┼─────────────────────────────────┘
│
┌───────────────────────────────┼─────────────────────────────────┐
│ Data Layer │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Metadata Database (MySQL/PostgreSQL) │ │
│ │ - Instance Metadata │ │
│ │ - Images, Volumes, Snapshots │ │
│ │ - Security Groups, Tags │ │
│ └───────────────────────────┬──────────────────────────────┘ │
│ │ │
│ ┌────────────────────────────┼────────────────────────────┐ │
│ │ Time-Series DB (InfluxDB/TimescaleDB) │ │
│ │ - Metrics & Monitoring │ │
│ │ - Billing Data │ │
│ └────────────────────────────┼────────────────────────────┘ │
│ │ │
│ ┌────────────────────────────┼────────────────────────────┐ │
│ │ Cache (Redis) │ │
│ │ - Instance State Cache │ │
│ │ - Security Group Rules Cache │ │
│ │ - Rate Limiting │ │
│ └────────────────────────────┼────────────────────────────┘ │
│ │ │
│ ┌────────────────────────────┼────────────────────────────┐ │
│ │ Message Queue (Kafka/RabbitMQ) │ │
│ │ - Event Streaming │ │
│ │ - Async Operations │ │
│ │ - State Change Events │ │
│ └────────────────────────────┼────────────────────────────┘ │
└───────────────────────────────┼─────────────────────────────────┘
│
┌───────────────────────────────┼─────────────────────────────────┐
│ Infrastructure Layer │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Compute Nodes (Hypervisor Hosts) │ │
│ │ - KVM/Xen/Hyper-V │ │
│ │ - Instance Execution │ │
│ │ - Local Storage (Instance Store) │ │
│ └───────────────────────────┬──────────────────────────────┘ │
│ │ │
│ ┌────────────────────────────┼────────────────────────────┐ │
│ │ Storage Backend │ │
│ │ - Block Storage (EBS) │ │
│ │ - Object Storage (S3) for Images │ │
│ │ - Distributed File System │ │
│ └────────────────────────────┼────────────────────────────┘ │
│ │ │
│ ┌────────────────────────────┼────────────────────────────┐ │
│ │ Network Infrastructure │ │
│ │ - Physical Network │ │
│ │ - SDN Controllers │ │
│ │ - Load Balancers │ │
│ └────────────────────────────┴────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Core Components
1. API Gateway
Responsibilities:
- Request routing and load balancing
- Authentication and authorization (IAM integration)
- Rate limiting and throttling
- Request/response transformation
- API versioning
Technology: AWS API Gateway, Kong, or custom gateway
Scalability:
- Stateless design for horizontal scaling
- Multiple instances behind load balancer
- Caching for frequently accessed data
2. Instance Management Service
Responsibilities:
- Handle instance lifecycle operations (launch, start, stop, terminate)
- Validate instance configurations
- Coordinate with orchestration service
- Update instance state in database
- Emit events for state changes
Key Operations:
launchInstance(request)→ Returns instance IDstartInstance(instanceId)→ Starts stopped instancestopInstance(instanceId)→ Stops running instanceterminateInstance(instanceId)→ Terminates instancedescribeInstances(filters)→ Returns instance detailsmodifyInstance(instanceId, attributes)→ Updates instance
State Machine:
pending → running → stopping → stopped
↓ ↓
running → shutting-down → terminated
3. Orchestration Service
Responsibilities:
- Resource scheduling and placement
- Capacity management
- Instance placement decisions
- Health monitoring and auto-recovery
- Migration during hardware failures
Placement Algorithm:
- Check availability zone capacity
- Find suitable hypervisor host
- Check resource availability (CPU, memory, network)
- Consider placement constraints (dedicated, placement groups)
- Reserve resources
- Return placement decision
Capacity Management:
- Track available capacity per AZ
- Predict capacity needs
- Auto-scale hypervisor infrastructure
- Handle capacity constraints
4. Hypervisor Management
Responsibilities:
- Manage hypervisor hosts
- Allocate/deallocate resources
- Monitor host health
- Handle host failures
- Resource isolation and multi-tenancy
Hypervisor Types:
- KVM (Kernel-based Virtual Machine) - Linux
- Xen - Paravirtualization
- Hyper-V - Windows
- VMware ESXi - Enterprise
Resource Isolation:
- CPU pinning and quotas
- Memory ballooning
- Network bandwidth limiting
- I/O throttling
5. Image Management Service
Responsibilities:
- Store and manage AMI images
- Image versioning
- Image sharing (public/private)
- Image metadata management
- Boot image preparation
Image Storage:
- Images stored in object storage (S3)
- Metadata in database
- Image registry for quick lookup
- Support for multiple formats (RAW, QCOW2, VMDK)
Image Lifecycle:
- Create image from instance
- Store in object storage
- Create metadata record
- Make available for launch
6. Storage Management Service
Responsibilities:
- Create and manage EBS volumes
- Volume attachment/detachment
- Snapshot creation and management
- Volume encryption
- Storage backend coordination
Volume Types:
- gp3 - General Purpose SSD (3 IOPS/GB, up to 16,000 IOPS)
- gp2 - General Purpose SSD (3 IOPS/GB baseline)
- io1/io2 - Provisioned IOPS SSD (up to 64,000 IOPS)
- st1 - Throughput Optimized HDD
- sc1 - Cold HDD
Storage Backend:
- Distributed block storage system
- Replication across multiple AZs
- Encryption at rest
- Snapshot storage in object storage
7. Networking Service
Responsibilities:
- VPC and subnet management
- Security group rule enforcement
- IP address allocation
- NAT gateway management
- Network ACL management
Network Architecture:
- VPC - Isolated network per customer
- Subnets - Network segments within VPC
- Security Groups - Instance-level firewall
- Network ACLs - Subnet-level firewall
- Route Tables - Routing configuration
IP Management:
- Private IP allocation from subnet CIDR
- Public IP allocation from public pool
- Elastic IP reservation
- IP address tracking
8. Monitoring Service
Responsibilities:
- Collect instance metrics
- Store metrics in time-series database
- Generate alarms
- Instance health checks
- Log aggregation
Metrics Collected:
- CPU utilization
- Memory utilization
- Disk I/O
- Network I/O
- Instance status checks
Monitoring Architecture:
- Agent on hypervisor host collects metrics
- Metrics sent to monitoring service
- Stored in time-series database
- CloudWatch-like interface for querying
9. Billing Service
Responsibilities:
- Meter instance usage
- Calculate costs
- Generate billing records
- Handle different pricing models (on-demand, reserved, spot)
Pricing Models:
- On-Demand - Pay per hour/second
- Reserved - 1-3 year commitment, discounted
- Spot - Bid for unused capacity
- Dedicated - Single-tenant hardware
Billing Calculation:
- Track instance runtime
- Apply pricing based on instance type
- Factor in data transfer costs
- Generate hourly/daily billing records
Instance Lifecycle Management
Launch Flow
1. Client → API Gateway
POST /api/v1/instances
{
"imageId": "ami-12345678",
"instanceType": "t3.medium",
...
}
2. API Gateway
- Authenticate request
- Rate limit check
- Route to Instance Management Service
3. Instance Management Service
- Validate request
- Check account limits
- Create instance record in DB (state: pending)
- Publish event: InstanceLaunchRequested
4. Orchestration Service (Event Consumer)
- Receive launch event
- Select availability zone
- Find suitable hypervisor host
- Reserve resources
- Return placement decision
5. Hypervisor Management
- Receive placement decision
- Allocate resources on host
- Download image from object storage
- Create virtual machine
- Configure networking
- Start instance
6. Instance Management Service
- Receive instance started notification
- Update DB (state: running)
- Allocate IP addresses
- Apply security groups
- Publish event: InstanceRunning
7. Response to Client
{
"instanceId": "i-1234567890abcdef0",
"state": "running",
"privateIpAddress": "10.0.1.5",
"publicIpAddress": "54.123.45.67"
}
State Transitions
┌─────────┐
│ pending │ ← Instance launch initiated
└────┬────┘
│
│ (resources allocated, VM started)
↓
┌─────────┐
│ running │ ← Instance is operational
└────┬────┘
│
│ (stop command)
↓
┌──────────┐
│stopping │ ← Graceful shutdown initiated
└────┬─────┘
│
│ (shutdown complete)
↓
┌─────────┐
│ stopped │ ← Instance stopped, resources retained
└────┬────┘
│
│ (start command)
↓
┌─────────┐
│ running │
└────┬────┘
│
│ (terminate command)
↓
┌──────────────┐
│shutting-down │ ← Termination initiated
└──────┬───────┘
│
│ (cleanup complete)
↓
┌────────────┐
│ terminated │ ← Instance deleted, resources released
└────────────┘
Health Monitoring
Instance Status Checks:
- System Status Check - Hypervisor host health
- Instance Status Check - Instance reachability
Auto-Recovery:
- If system status check fails → Migrate instance to healthy host
- If instance status check fails → Attempt recovery, then notify user
- Health checks every 1-2 minutes
Networking Architecture
Virtual Private Cloud (VPC)
VPC Structure:
VPC (10.0.0.0/16)
├── Public Subnet (10.0.1.0/24) - AZ: us-east-1a
│ ├── Internet Gateway
│ ├── NAT Gateway
│ └── Instances with public IPs
├── Private Subnet (10.0.2.0/24) - AZ: us-east-1a
│ └── Instances without public IPs
└── Public Subnet (10.0.3.0/24) - AZ: us-east-1b
└── Instances with public IPs
Security Groups
Rule Structure:
{
"securityGroupId": "sg-12345678",
"rules": [
{
"type": "ingress",
"protocol": "tcp",
"fromPort": 80,
"toPort": 80,
"source": "0.0.0.0/0",
"description": "Allow HTTP from anywhere"
},
{
"type": "ingress",
"protocol": "tcp",
"fromPort": 22,
"toPort": 22,
"source": "sg-87654321",
"description": "Allow SSH from bastion"
},
{
"type": "egress",
"protocol": "-1",
"fromPort": 0,
"toPort": 65535,
"destination": "0.0.0.0/0",
"description": "Allow all outbound"
}
]
}
Enforcement:
- Security group rules enforced at hypervisor level
- Stateful firewall (return traffic automatically allowed)
- Rules evaluated in order
- Default deny all inbound, allow all outbound
Network Isolation
Multi-Tenancy:
- Each customer gets isolated VPC
- Network virtualization (VXLAN, GRE tunnels)
- MAC address isolation
- VLAN tagging per customer
Traffic Flow:
Instance → Virtual Switch → Security Group Enforcement →
Physical Network → Internet Gateway / NAT Gateway
Storage Architecture
Storage Types
1. Instance Store (Ephemeral)
Characteristics:
- Temporary storage on hypervisor host
- High performance (NVMe SSD)
- Data lost on instance stop/terminate
- Included with instance at no extra cost
Use Cases:
- Temporary files
- Cache
- Scratch space
- High-performance workloads
2. EBS Volumes (Persistent)
Characteristics:
- Network-attached block storage
- Persistent across instance lifecycle
- Can be attached/detached
- Replicated across multiple AZs
- Encryption support
Volume Lifecycle:
Create Volume → Available → Attach → In-Use →
Detach → Available → Delete → Deleted
Snapshot Process:
- Create snapshot request
- Freeze filesystem (if supported)
- Copy data blocks to object storage
- Unfreeze filesystem
- Mark snapshot as completed
Storage Backend
Architecture:
EBS Volume Request
↓
Storage Controller
↓
Distributed Block Storage
├── Replication (3x across AZs)
├── Encryption (optional)
└── Snapshot to Object Storage
Data Durability:
- 99.999999999% (11 nines) for EBS volumes
- Achieved through replication
- Automatic repair of corrupted blocks
Security & Access Control
Authentication & Authorization
IAM Integration:
- Users authenticate via IAM
- IAM policies control API access
- Resource-level permissions
- Instance role assumption
Example IAM Policy:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"ec2:RunInstances",
"ec2:DescribeInstances",
"ec2:StartInstances",
"ec2:StopInstances"
],
"Resource": "*",
"Condition": {
"StringEquals": {
"ec2:InstanceType": ["t3.micro", "t3.small"]
}
}
}
]
}
Network Security
Security Measures:
- VPC isolation per customer
- Security group enforcement
- Network ACLs
- DDoS protection
- VPN and Direct Connect support
Data Security
Encryption:
- Encryption at rest (EBS volumes, snapshots)
- Encryption in transit (TLS for API, VPN for data)
- KMS integration for key management
- Customer-managed keys (CMK)
Compliance:
- SOC 2 Type II
- PCI-DSS Level 1
- HIPAA
- GDPR
Instance Security
Key Pairs:
- SSH key pairs for Linux instances
- Password management for Windows instances
- Key rotation support
- Secure key storage
Metadata Service:
- Instance metadata available at
http://169.254.169.254 - IAM role credentials
- Instance identity documents
- Secure access only from instance
Monitoring & Logging
Metrics Collection
Instance Metrics:
- CPU utilization (%)
- Memory utilization (%)
- Disk read/write operations
- Network in/out (bytes)
- Status check failures
System Metrics:
- Hypervisor host utilization
- Storage backend performance
- Network throughput
- API latency and errors
Collection Architecture:
Hypervisor Agent → Metrics Collector →
Time-Series DB → Monitoring Dashboard
Logging
Log Types:
- API access logs
- Instance system logs
- Security group rule evaluations
- Billing events
- Error logs
Log Storage:
- Centralized log aggregation
- Long-term storage in object storage
- Search and analysis tools
- Retention policies
Alarms & Notifications
Alarm Types:
- Instance status check failures
- High CPU utilization
- Low disk space
- Security group rule violations
- Billing thresholds
Notification Channels:
- SMS
- SNS topics
- Webhooks
Scalability & Performance
Horizontal Scaling
API Layer:
- Stateless services
- Multiple instances behind load balancer
- Auto-scaling based on traffic
Database:
- Read replicas for read-heavy operations
- Sharding by account_id
- Connection pooling
Compute Infrastructure:
- Add hypervisor hosts as needed
- Distribute instances across hosts
- Auto-scale based on capacity
Caching Strategy
Cache Layers:
- API Gateway Cache - Frequently accessed instance metadata
- Application Cache (Redis) - Instance state, security groups
- Database Query Cache - Common queries
Cache Invalidation:
- TTL-based expiration
- Event-driven invalidation on state changes
- Cache warming for hot data
Performance Optimization
API Performance:
- Connection pooling
- Async processing for long operations
- Batch operations where possible
- Pagination for list operations
Database Optimization:
- Indexes on frequently queried columns
- Partitioning by time/account
- Query optimization
- Read replicas
Network Performance:
- Enhanced networking (SR-IOV)
- Placement groups for low latency
- Bandwidth allocation per instance type
Disaster Recovery & High Availability
Multi-AZ Architecture
Availability Zones:
- Multiple AZs per region
- Independent power, networking, cooling
- Instances distributed across AZs
- Automatic failover
Replication:
- EBS volumes replicated across AZs
- Database replication
- Image storage replicated
Backup & Recovery
Backup Strategy:
- EBS snapshots
- AMI images
- Database backups
- Configuration backups
Recovery Procedures:
- Instance recovery from snapshots
- Volume recovery from snapshots
- Cross-region replication
- RTO: < 1 hour, RPO: < 15 minutes
Failure Handling
Hypervisor Host Failure:
- Detect host failure
- Mark instances as impaired
- Migrate instances to healthy hosts
- Notify customers
Storage Failure:
- Detect storage failure
- Use replicated copies
- Rebuild failed storage
- Verify data integrity
Network Failure:
- Detect network partition
- Route traffic to healthy paths
- Maintain instance connectivity
- Restore full connectivity
Trade-offs & Design Decisions
1. Virtualization vs. Containers
Decision: Use full virtualization (KVM/Xen) for EC2 instances
Rationale:
- Isolation: Complete OS-level isolation between customer instances
- Security: Strong security boundaries, no shared kernel
- Compatibility: Run any OS without modification
- Legacy Support: Support older applications and OS versions
- Resource Guarantees: Better CPU, memory, and I/O guarantees
Trade-offs:
- Higher overhead (~5-10% vs containers' ~1-2%)
- Slower boot times (30-60s vs containers' <1s)
- More resource intensive
Alternative: Containers (Docker/Kubernetes) for specific use cases like serverless or batch processing
2. Hypervisor Choice: KVM vs Xen vs Hyper-V
Decision: Use KVM for Linux hosts, Hyper-V for Windows hosts
Rationale:
| Hypervisor | Pros | Cons | Use Case |
|---|---|---|---|
| KVM | Open source, good performance, Linux native | Linux only | Linux instances ✓ |
| Xen | Mature, paravirtualization support | More complex, declining adoption | Legacy systems |
| Hyper-V | Windows native, good Windows support | Windows only | Windows instances ✓ |
| VMware ESXi | Enterprise features, mature | Proprietary, expensive | Enterprise customers |
Trade-offs:
- KVM: Best performance on Linux, active development
- Hyper-V: Best Windows integration, native Microsoft stack
- Multi-hypervisor adds complexity but provides flexibility
3. Storage Architecture: Network-Attached vs Local
Decision: Hybrid approach - Instance Store (local) + EBS (network-attached)
Rationale:
Instance Store (Local):
- ✅ High performance (NVMe SSD)
- ✅ Low latency
- ✅ No network overhead
- ❌ Ephemeral (data lost on stop/terminate)
- ❌ Limited capacity
EBS (Network-Attached):
- ✅ Persistent across instance lifecycle
- ✅ Can attach/detach
- ✅ Replicated for durability
- ❌ Network latency
- ❌ Additional cost
Trade-offs:
- Offer both to match different use cases
- Instance store for temporary, high-performance workloads
- EBS for persistent, production workloads
4. Networking: Overlay vs Underlay
Decision: Use overlay networking (VXLAN/GRE) with SDN controller
Rationale:
Overlay (VXLAN/GRE):
- ✅ Flexible IP allocation per customer
- ✅ Easy multi-tenancy
- ✅ Scalable (16M VNIs)
- ✅ Works over existing physical network
- ❌ Additional encapsulation overhead (~50 bytes)
Underlay (VLAN):
- ✅ Lower overhead
- ✅ Better performance
- ❌ Limited scalability (4K VLANs)
- ❌ Complex IP management
Trade-offs:
- Overlay provides better isolation and scalability
- Performance overhead acceptable for most workloads
- Can use SR-IOV for high-performance instances
5. Instance Placement: Random vs Intelligent
Decision: Intelligent placement with multiple strategies
Placement Strategies:
Spread Placement - Distribute across distinct hardware
- Use case: High availability
- Reduces correlated failures
Cluster Placement - Pack instances on same hardware
- Use case: Low latency (HPC)
- Higher risk of correlated failures
Partition Placement - Logical partitions
- Use case: Multi-AZ deployments
- Balance between spread and cluster
Trade-offs:
- Intelligent placement improves availability and performance
- Adds complexity to orchestration
- May reduce resource utilization
6. Billing Model: Per-Second vs Per-Hour
Decision: Per-second billing with 1-minute minimum
Rationale:
- Fairer for customers (pay only for what you use)
- Competitive advantage
- More complex billing calculations
- Higher billing system load
Trade-offs:
- Per-second: Fairer, more complex
- Per-hour: Simpler, less fair
- 1-minute minimum balances fairness and complexity
7. Instance State Persistence: Stop vs Hibernate
Decision: Support both Stop and Hibernate
Stop:
- ✅ Fast (saves instance state to EBS)
- ✅ Can change instance type
- ✅ Lower cost (no compute charges)
- ❌ EBS storage costs continue
Hibernate:
- ✅ Fastest resume (RAM saved to EBS)
- ✅ Preserves in-memory state
- ❌ Requires EBS root volume
- ❌ Limited instance types
Trade-offs:
- Offer both options
- Stop for most use cases
- Hibernate for long-running applications with state
8. Security Groups: Stateful vs Stateless
Decision: Stateful security groups
Rationale:
- Return traffic automatically allowed
- Simpler rule management
- Better user experience
- Matches traditional firewall behavior
Trade-offs:
- Stateful: Simpler, but requires connection tracking
- Stateless: More control, but complex rule management
- Stateful is industry standard and user-friendly
9. Image Storage: Object Storage vs Block Storage
Decision: Object storage (S3-like) for images, block storage for volumes
Rationale:
Object Storage for Images:
- ✅ Cost-effective for large, infrequently accessed data
- ✅ High durability (11 nines)
- ✅ Easy sharing and versioning
- ✅ Scales to petabytes
- ❌ Higher latency for reads
Block Storage for Volumes:
- ✅ Low latency
- ✅ Random access
- ✅ Direct attach to instances
- ❌ More expensive
- ❌ Limited scalability per volume
Trade-offs:
- Images are read infrequently → object storage
- Volumes need low latency → block storage
- Hybrid approach optimizes cost and performance
10. Multi-Tenancy: Shared vs Dedicated Hardware
Decision: Shared hardware with strong isolation, dedicated option available
Rationale:
Shared Hardware:
- ✅ Higher resource utilization
- ✅ Lower cost for customers
- ✅ Better for most workloads
- ❌ Potential "noisy neighbor" issues
- ❌ Requires strong isolation
Dedicated Hardware:
- ✅ Complete isolation
- ✅ Compliance requirements
- ✅ Predictable performance
- ❌ Higher cost
- ❌ Lower utilization
Trade-offs:
- Default to shared for cost efficiency
- Offer dedicated for compliance/security needs
- Use hardware-level isolation (CPU pinning, memory isolation)
11. API Design: REST vs GraphQL vs gRPC
Decision: REST API with optional gRPC for high-performance operations
Rationale:
REST:
- ✅ Industry standard
- ✅ Easy to use and understand
- ✅ Good tooling and documentation
- ✅ Works well with HTTP caching
- ❌ Over-fetching/under-fetching
GraphQL:
- ✅ Flexible queries
- ✅ Reduces over-fetching
- ❌ More complex
- ❌ Caching challenges
gRPC:
- ✅ High performance
- ✅ Streaming support
- ✅ Type safety
- ❌ Less familiar to developers
- ❌ Limited browser support
Trade-offs:
- REST for general API access (web, CLI, SDKs)
- gRPC for internal services and high-throughput operations
- GraphQL for specific use cases if needed
12. Database: SQL vs NoSQL for Metadata
Decision: SQL (PostgreSQL) for metadata, NoSQL for metrics
Rationale:
SQL for Metadata:
- ✅ ACID transactions for consistency
- ✅ Complex queries (joins, aggregations)
- ✅ Relational data (instances, volumes, security groups)
- ✅ Mature tooling
- ❌ Scaling challenges at extreme scale
NoSQL for Metrics:
- ✅ High write throughput
- ✅ Time-series optimization
- ✅ Horizontal scaling
- ❌ Eventual consistency
- ❌ Limited query capabilities
Trade-offs:
- SQL for transactional metadata (instances, volumes, images)
- NoSQL (InfluxDB/TimescaleDB) for time-series metrics
- Can shard SQL database if needed
13. Caching Strategy: Write-Through vs Write-Back
Decision: Write-through cache with TTL-based invalidation
Rationale:
Write-Through:
- ✅ Data consistency
- ✅ No data loss on cache failure
- ✅ Simpler implementation
- ❌ Higher write latency
Write-Back:
- ✅ Lower write latency
- ✅ Better write performance
- ❌ Risk of data loss
- ❌ More complex
Trade-offs:
- Write-through for instance state (consistency critical)
- Cache with TTL for frequently read data
- Event-driven invalidation for state changes
14. Monitoring: Push vs Pull
Decision: Hybrid - Push from agents, Pull for health checks
Rationale:
Push (Agent-based):
- ✅ Real-time metrics
- ✅ Works behind firewalls
- ✅ Efficient for high-frequency metrics
- ❌ Requires agent on each host
Pull (Scraping):
- ✅ Centralized configuration
- ✅ No agent required
- ✅ Easy to debug
- ❌ Requires network access
- ❌ Higher latency
Trade-offs:
- Push for instance metrics (high frequency, behind firewall)
- Pull for health checks and service discovery
- Hybrid approach provides best of both
15. Region Strategy: Single vs Multi-Region
Decision: Multi-region deployment from day one
Rationale:
- ✅ Disaster recovery
- ✅ Lower latency for global customers
- ✅ Regulatory compliance (data residency)
- ✅ Higher availability
- ❌ Higher operational complexity
- ❌ Data replication costs
Trade-offs:
- Start with 2-3 regions
- Expand based on customer demand
- Replicate critical data across regions
- Route customers to nearest region
Summary
This AWS EC2 design supports:
- Scale: Millions of instances, billions of API calls per day
- Performance: < 60s instance launch, < 100ms API latency
- Availability: 99.99% uptime with multi-AZ deployment
- Security: Complete network isolation, encryption at rest and in transit
- Flexibility: Multiple instance types, storage options, networking configurations
Key Technologies:
- Hypervisor: KVM (Linux), Hyper-V (Windows)
- Orchestration: Custom scheduler with intelligent placement
- Storage: EBS (block storage), Object Storage (images)
- Networking: VXLAN overlay, SDN controller
- Database: PostgreSQL (metadata), InfluxDB (metrics)
- Cache: Redis Cluster
- Message Queue: Kafka
- Monitoring: Prometheus + Grafana
- API Gateway: Custom gateway with rate limiting
Design Principles:
- Multi-tenancy with strong isolation - Shared infrastructure with complete customer isolation
- Elasticity - Scale compute capacity up/down on demand
- High availability - Multi-AZ, automatic failover, health monitoring
- Security first - Network isolation, encryption, IAM integration
- Cost optimization - Pay-per-use, reserved instances, spot pricing
- Performance - Low latency, high throughput, optimized networking
- Reliability - 11 nines durability, automatic recovery, data replication
Scaling Considerations:
- Horizontal scaling of API services
- Database sharding by account_id
- Distributed storage backend
- Multi-region deployment
- Caching at multiple layers
- Async processing for long operations
This design provides a solid foundation for a cloud compute service that can scale to millions of instances while maintaining high performance, availability, and security.