Design AWS EC2 Service

A comprehensive high-level design for building a cloud compute service similar to Amazon EC2 (Elastic Compute Cloud) that provides resizable compute capacity in the cloud.

Problem Statement
Functional Requirements
Non-Functional Requirements
Capacity Estimation
System APIs
Database Design
High-Level Architecture
Core Components
Instance Lifecycle Management
Networking Architecture
Storage Architecture
Security & Access Control
Monitoring & Logging
Scalability & Performance
Disaster Recovery & High Availability
Trade-offs & Design Decisions

Problem Statement

Design a cloud compute service that allows users to launch and manage virtual servers (instances) on-demand. Users should be able to:

Launch virtual machines with different configurations (CPU, memory, storage)
Choose from various operating systems and software stacks
Scale compute capacity up or down based on demand
Pay only for the compute capacity they use
Access instances remotely via SSH/RDP
Attach persistent storage volumes
Configure networking and security settings

Key Challenges

Resource Management: Efficiently allocate and manage physical hardware resources across millions of instances
Multi-tenancy: Isolate customer instances while maximizing resource utilization
Elasticity: Handle dynamic scaling with millions of start/stop operations per day
Networking: Provide secure, isolated network environments for each customer
Storage: Manage persistent and ephemeral storage for millions of instances
Availability: Ensure high availability across multiple data centers and regions
Billing: Accurate metering and billing for compute usage

Functional Requirements

Core Features

FR1: Instance Management

Launch instances with specified configurations (instance type, OS, storage)
Start, stop, restart, and terminate instances
List and describe running instances
Modify instance attributes (instance type, security groups, etc.)

FR2: Instance Types

Support multiple instance families (general-purpose, compute-optimized, memory-optimized, storage-optimized, GPU)
Each instance type with predefined CPU, memory, and network capacity
Ability to change instance types (resize)

FR3: Operating System Support

Support multiple OS images (Linux distributions, Windows Server)
Pre-configured AMIs (Amazon Machine Images) or custom images
Boot instances from images

FR4: Storage Management

Ephemeral storage (instance store) - temporary, instance-specific
Persistent storage (EBS volumes) - attachable/detachable block storage
Support multiple volume types (SSD, HDD, NVMe)
Create snapshots of volumes

FR5: Networking

Virtual Private Cloud (VPC) - isolated network environment
Subnets - segment networks within VPC
Security Groups - firewall rules for instances
Elastic IP addresses - static public IPs
Network ACLs - subnet-level firewall

FR6: Access Control

IAM (Identity and Access Management) integration
Role-based access control
API authentication via access keys
SSH key pair management for Linux instances
Password management for Windows instances

FR7: Monitoring & Metrics

CloudWatch integration for instance metrics
CPU, memory, disk, network utilization
Custom metrics and alarms
Instance logs

FR8: Auto Scaling

Launch instances based on demand
Scale up/down automatically
Integration with load balancers

FR9: Spot Instances

Bid for unused capacity at lower prices
Instances can be terminated when capacity is needed

FR10: Reserved Instances

Reserve capacity for 1-3 years at discounted rates
Predictable pricing for long-term workloads

Optional Features

FR11: Dedicated Instances - Single-tenant hardware

FR12: Placement Groups - Control instance placement for low latency

FR13: Elastic GPUs - Attach GPU resources to instances

FR14: Instance Metadata Service - Access instance metadata

FR15: User Data Scripts - Run scripts at instance launch

Non-Functional Requirements

Performance

NFR1: High Availability - 99.99% uptime SLA

NFR2: Low Latency - Instance launch time < 60 seconds

NFR3: High Throughput - Support millions of API calls per second

NFR4: Network Performance - Up to 100 Gbps per instance

Scalability

NFR5: Handle millions of concurrent instances across multiple regions

NFR6: Support thousands of instance types and configurations

NFR7: Scale to petabytes of storage across all instances

Reliability

NFR8: Data Durability - 99.999999999% (11 nines) for EBS volumes

NFR9: Instance Migration - Automatic migration during hardware failures

NFR10: Multi-AZ Support - Distribute instances across availability zones

Security

NFR11: Network Isolation - Complete isolation between customer instances

NFR12: Encryption - Support encryption at rest and in transit

NFR13: Compliance - Support various compliance standards (SOC, PCI-DSS, HIPAA)

Cost

NFR14: Pay-per-use billing model

NFR15: Cost optimization through reserved instances and spot pricing

Capacity Estimation

Traffic Estimates

Total customers: 1 million
Active customers: 100,000 (10% active monthly)
Instances per customer: Average 10 instances
Total running instances: 1 million
API calls per day: 1 billion
- Launch: 10 million/day
- Describe: 500 million/day
- Start/Stop: 50 million/day
- Terminate: 5 million/day
- Other operations: 435 million/day
Read-to-write ratio: 50:1 (more reads than writes)

Storage Estimates

Instance metadata: 1 KB per instance = 1 GB total
AMI images: 10,000 images × 10 GB average = 100 TB
EBS volumes: 1 million instances × 100 GB average = 100 PB
Snapshots: 10% of volumes = 10 PB
Logs: 1 MB per instance per day = 1 TB/day

Network Estimates

API traffic: 1 billion requests × 2 KB average = 2 TB/day
Instance traffic: 1 million instances × 10 GB/day = 10 PB/day
Peak bandwidth: 1 Tbps

Compute Estimates

Physical servers: 100,000 servers
Average utilization: 60% (for over-provisioning)
Virtualization ratio: 10:1 (10 VMs per physical server)
Total capacity: 1 million VMs

System APIs

Instance Management APIs

1. Launch Instance

POST /api/v1/instances
Authorization: AWS4-HMAC-SHA256 Credential=...

Request Body:
{
  "imageId": "ami-12345678",
  "instanceType": "t3.medium",
  "keyName": "my-key-pair",
  "securityGroupIds": ["sg-12345678"],
  "subnetId": "subnet-12345678",
  "userData": "#!/bin/bash\necho 'Hello World'",
  "storage": {
    "volumes": [
      {
        "deviceName": "/dev/sda1",
        "volumeType": "gp3",
        "volumeSize": 20,
        "deleteOnTermination": true
      }
    ]
  },
  "tags": [
    {"key": "Name", "value": "web-server-1"},
    {"key": "Environment", "value": "production"}
  ],
  "minCount": 1,
  "maxCount": 1
}

Response:
{
  "instances": [
    {
      "instanceId": "i-1234567890abcdef0",
      "state": "pending",
      "instanceType": "t3.medium",
      "launchTime": "2024-01-15T10:30:00Z",
      "privateIpAddress": "10.0.1.5",
      "publicIpAddress": "54.123.45.67"
    }
  ]
}

2. Describe Instances

GET /api/v1/instances?instanceIds=i-1234567890abcdef0,i-0987654321fedcba0
Authorization: AWS4-HMAC-SHA256 Credential=...

Response:
{
  "instances": [
    {
      "instanceId": "i-1234567890abcdef0",
      "state": "running",
      "instanceType": "t3.medium",
      "imageId": "ami-12345678",
      "launchTime": "2024-01-15T10:30:00Z",
      "privateIpAddress": "10.0.1.5",
      "publicIpAddress": "54.123.45.67",
      "vpcId": "vpc-12345678",
      "subnetId": "subnet-12345678",
      "securityGroups": [
        {
          "groupId": "sg-12345678",
          "groupName": "web-servers"
        }
      ],
      "tags": [
        {"key": "Name", "value": "web-server-1"}
      ]
    }
  ]
}

3. Start Instance

POST /api/v1/instances/{instanceId}/start
Authorization: AWS4-HMAC-SHA256 Credential=...

Response:
{
  "instanceId": "i-1234567890abcdef0",
  "previousState": "stopped",
  "currentState": "pending"
}

4. Stop Instance

POST /api/v1/instances/{instanceId}/stop
Authorization: AWS4-HMAC-SHA256 Credential=...

Response:
{
  "instanceId": "i-1234567890abcdef0",
  "previousState": "running",
  "currentState": "stopping"
}

5. Terminate Instance

POST /api/v1/instances/{instanceId}/terminate
Authorization: AWS4-HMAC-SHA256 Credential=...

Response:
{
  "instanceId": "i-1234567890abcdef0",
  "previousState": "running",
  "currentState": "shutting-down"
}

6. Modify Instance Attributes

PUT /api/v1/instances/{instanceId}
Authorization: AWS4-HMAC-SHA256 Credential=...

Request Body:
{
  "instanceType": "t3.large",
  "securityGroupIds": ["sg-12345678", "sg-87654321"]
}

Response:
{
  "instanceId": "i-1234567890abcdef0",
  "modifications": {
    "instanceType": "t3.large",
    "securityGroups": ["sg-12345678", "sg-87654321"]
  }
}

Image Management APIs

7. Create Image (AMI)

POST /api/v1/images
Authorization: AWS4-HMAC-SHA256 Credential=...

Request Body:
{
  "instanceId": "i-1234567890abcdef0",
  "name": "my-custom-ami",
  "description": "Custom AMI with web server",
  "noReboot": false
}

Response:
{
  "imageId": "ami-12345678",
  "state": "pending",
  "name": "my-custom-ami"
}

8. Describe Images

GET /api/v1/images?imageIds=ami-12345678
Authorization: AWS4-HMAC-SHA256 Credential=...

Response:
{
  "images": [
    {
      "imageId": "ami-12345678",
      "name": "my-custom-ami",
      "description": "Custom AMI with web server",
      "state": "available",
      "architecture": "x86_64",
      "platform": "Linux",
      "creationDate": "2024-01-15T10:30:00Z",
      "rootDeviceType": "ebs",
      "blockDeviceMappings": [
        {
          "deviceName": "/dev/sda1",
          "ebs": {
            "volumeId": "vol-12345678",
            "deleteOnTermination": true
          }
        }
      ]
    }
  ]
}

Volume Management APIs

9. Create Volume

POST /api/v1/volumes
Authorization: AWS4-HMAC-SHA256 Credential=...

Request Body:
{
  "availabilityZone": "us-east-1a",
  "volumeType": "gp3",
  "size": 100,
  "encrypted": true,
  "kmsKeyId": "arn:aws:kms:us-east-1:123456789012:key/12345678-1234-1234-1234-123456789012"
}

Response:
{
  "volumeId": "vol-1234567890abcdef0",
  "state": "creating",
  "size": 100,
  "volumeType": "gp3",
  "availabilityZone": "us-east-1a"
}

10. Attach Volume

POST /api/v1/volumes/{volumeId}/attach
Authorization: AWS4-HMAC-SHA256 Credential=...

Request Body:
{
  "instanceId": "i-1234567890abcdef0",
  "device": "/dev/sdf"
}

Response:
{
  "volumeId": "vol-1234567890abcdef0",
  "instanceId": "i-1234567890abcdef0",
  "device": "/dev/sdf",
  "state": "attaching"
}

Database Design

1. Instances Table

CREATE TABLE instances (
    instance_id VARCHAR(20) PRIMARY KEY,
    account_id VARCHAR(12) NOT NULL,
    image_id VARCHAR(20) NOT NULL,
    instance_type VARCHAR(50) NOT NULL,
    state ENUM('pending', 'running', 'stopping', 'stopped', 'shutting-down', 'terminated') NOT NULL,
    availability_zone VARCHAR(20) NOT NULL,
    subnet_id VARCHAR(20),
    vpc_id VARCHAR(20),
    private_ip_address VARCHAR(15),
    public_ip_address VARCHAR(15),
    elastic_ip_address VARCHAR(15),
    key_name VARCHAR(255),
    launch_time TIMESTAMP NOT NULL,
    termination_time TIMESTAMP,
    user_data TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    INDEX idx_account_id (account_id),
    INDEX idx_state (state),
    INDEX idx_availability_zone (availability_zone),
    INDEX idx_launch_time (launch_time)
) PARTITION BY HASH(account_id);

2. Instance Tags Table

CREATE TABLE instance_tags (
    instance_id VARCHAR(20) NOT NULL,
    tag_key VARCHAR(255) NOT NULL,
    tag_value VARCHAR(255),
    PRIMARY KEY (instance_id, tag_key),
    FOREIGN KEY (instance_id) REFERENCES instances(instance_id) ON DELETE CASCADE,
    INDEX idx_tag_key_value (tag_key, tag_value)
);

3. Security Groups Table

CREATE TABLE security_groups (
    security_group_id VARCHAR(20) PRIMARY KEY,
    account_id VARCHAR(12) NOT NULL,
    group_name VARCHAR(255) NOT NULL,
    description TEXT,
    vpc_id VARCHAR(20),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    UNIQUE KEY unique_account_group (account_id, group_name, vpc_id),
    INDEX idx_account_id (account_id)
);

4. Security Group Rules Table

CREATE TABLE security_group_rules (
    rule_id VARCHAR(20) PRIMARY KEY,
    security_group_id VARCHAR(20) NOT NULL,
    type ENUM('ingress', 'egress') NOT NULL,
    protocol VARCHAR(10) NOT NULL,
    from_port INT,
    to_port INT,
    cidr_ipv4 VARCHAR(18),
    source_security_group_id VARCHAR(20),
    description TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (security_group_id) REFERENCES security_groups(security_group_id) ON DELETE CASCADE,
    INDEX idx_security_group_id (security_group_id)
);

5. Instance Security Groups Table

CREATE TABLE instance_security_groups (
    instance_id VARCHAR(20) NOT NULL,
    security_group_id VARCHAR(20) NOT NULL,
    PRIMARY KEY (instance_id, security_group_id),
    FOREIGN KEY (instance_id) REFERENCES instances(instance_id) ON DELETE CASCADE,
    FOREIGN KEY (security_group_id) REFERENCES security_groups(security_group_id) ON DELETE CASCADE
);

6. Images (AMIs) Table

CREATE TABLE images (
    image_id VARCHAR(20) PRIMARY KEY,
    account_id VARCHAR(12) NOT NULL,
    name VARCHAR(255) NOT NULL,
    description TEXT,
    state ENUM('pending', 'available', 'deregistered', 'failed') NOT NULL,
    architecture VARCHAR(20) NOT NULL,
    platform VARCHAR(50),
    root_device_type VARCHAR(20) NOT NULL,
    root_device_name VARCHAR(255),
    creation_date TIMESTAMP NOT NULL,
    public BOOLEAN DEFAULT FALSE,
    INDEX idx_account_id (account_id),
    INDEX idx_state (state),
    INDEX idx_name (name)
);

7. Volumes Table

CREATE TABLE volumes (
    volume_id VARCHAR(20) PRIMARY KEY,
    account_id VARCHAR(12) NOT NULL,
    availability_zone VARCHAR(20) NOT NULL,
    volume_type VARCHAR(50) NOT NULL,
    size INT NOT NULL,
    state ENUM('creating', 'available', 'in-use', 'deleting', 'deleted', 'error') NOT NULL,
    encrypted BOOLEAN DEFAULT FALSE,
    kms_key_id VARCHAR(255),
    iops INT,
    throughput INT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    INDEX idx_account_id (account_id),
    INDEX idx_state (state),
    INDEX idx_availability_zone (availability_zone)
);

8. Volume Attachments Table

CREATE TABLE volume_attachments (
    volume_id VARCHAR(20) NOT NULL,
    instance_id VARCHAR(20) NOT NULL,
    device VARCHAR(255) NOT NULL,
    state ENUM('attaching', 'attached', 'detaching', 'detached') NOT NULL,
    attach_time TIMESTAMP,
    delete_on_termination BOOLEAN DEFAULT FALSE,
    PRIMARY KEY (volume_id, instance_id),
    FOREIGN KEY (volume_id) REFERENCES volumes(volume_id) ON DELETE CASCADE,
    FOREIGN KEY (instance_id) REFERENCES instances(instance_id) ON DELETE CASCADE,
    INDEX idx_instance_id (instance_id)
);

9. Snapshots Table

CREATE TABLE snapshots (
    snapshot_id VARCHAR(20) PRIMARY KEY,
    account_id VARCHAR(12) NOT NULL,
    volume_id VARCHAR(20) NOT NULL,
    state ENUM('pending', 'completed', 'error') NOT NULL,
    progress INT DEFAULT 0,
    size INT NOT NULL,
    encrypted BOOLEAN DEFAULT FALSE,
    start_time TIMESTAMP NOT NULL,
    completed_time TIMESTAMP,
    description TEXT,
    INDEX idx_account_id (account_id),
    INDEX idx_volume_id (volume_id),
    INDEX idx_state (state)
);

10. Instance Metrics Table (Time-Series)

CREATE TABLE instance_metrics (
    instance_id VARCHAR(20) NOT NULL,
    metric_name VARCHAR(50) NOT NULL,
    metric_value DECIMAL(10, 2) NOT NULL,
    timestamp TIMESTAMP NOT NULL,
    PRIMARY KEY (instance_id, metric_name, timestamp),
    INDEX idx_timestamp (timestamp)
) PARTITION BY RANGE (YEAR(timestamp));

11. Reservations Table

CREATE TABLE reservations (
    reservation_id VARCHAR(20) PRIMARY KEY,
    account_id VARCHAR(12) NOT NULL,
    instance_type VARCHAR(50) NOT NULL,
    availability_zone VARCHAR(20),
    instance_count INT NOT NULL,
    state ENUM('active', 'payment-pending', 'retired') NOT NULL,
    start_date DATE NOT NULL,
    end_date DATE NOT NULL,
    offering_type ENUM('Heavy Utilization', 'Medium Utilization', 'Light Utilization') NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    INDEX idx_account_id (account_id),
    INDEX idx_state (state)
);

12. Billing Records Table

CREATE TABLE billing_records (
    billing_id BIGINT PRIMARY KEY AUTO_INCREMENT,
    account_id VARCHAR(12) NOT NULL,
    instance_id VARCHAR(20),
    resource_type VARCHAR(50) NOT NULL,
    resource_id VARCHAR(50),
    usage_start TIMESTAMP NOT NULL,
    usage_end TIMESTAMP NOT NULL,
    usage_quantity DECIMAL(10, 4) NOT NULL,
    unit_price DECIMAL(10, 4) NOT NULL,
    total_cost DECIMAL(10, 2) NOT NULL,
    region VARCHAR(20) NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    INDEX idx_account_id (account_id),
    INDEX idx_usage_start (usage_start),
    INDEX idx_instance_id (instance_id)
) PARTITION BY RANGE (YEAR(usage_start));

High-Level Architecture

┌─────────────────────────────────────────────────────────────────┐
│                         Client Layer                             │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐          │
│  │   Web UI     │  │   CLI Tools   │  │   SDK/API    │          │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘          │
└─────────┼──────────────────┼──────────────────┼─────────────────┘
          │                  │                  │
          └──────────────────┼──────────────────┘
                             │
┌────────────────────────────┼────────────────────────────────────┐
│                    API Gateway Layer                             │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │              Load Balancer (ALB/NLB)                      │   │
│  └───────────────────────────┬────────────────────────────┘   │
│                               │                                   │
│  ┌────────────────────────────┼────────────────────────────┐   │
│  │         API Gateway (Rate Limiting, Auth, Routing)        │   │
│  └────────────────────────────┼────────────────────────────┘   │
└───────────────────────────────┼─────────────────────────────────┘
                                │
┌───────────────────────────────┼─────────────────────────────────┐
│                    Application Layer                             │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │         Instance Management Service (Stateless)           │   │
│  │  - Launch/Start/Stop/Terminate                            │   │
│  │  - Describe Instances                                     │   │
│  │  - Modify Attributes                                      │   │
│  └───────────────────────────┬──────────────────────────────┘   │
│                               │                                   │
│  ┌────────────────────────────┼────────────────────────────┐   │
│  │         Image Management Service                          │   │
│  │  - Create/Describe Images                                 │   │
│  │  - Image Registry                                         │   │
│  └────────────────────────────┼────────────────────────────┘   │
│                               │                                   │
│  ┌────────────────────────────┼────────────────────────────┐   │
│  │         Storage Management Service                        │   │
│  │  - Volume Create/Attach/Detach                           │   │
│  │  - Snapshot Management                                    │   │
│  └────────────────────────────┼────────────────────────────┘   │
│                               │                                   │
│  ┌────────────────────────────┼────────────────────────────┐   │
│  │         Networking Service                                │   │
│  │  - VPC/Subnet Management                                  │   │
│  │  - Security Groups                                        │   │
│  │  - IP Address Management                                  │   │
│  └────────────────────────────┼────────────────────────────┘   │
│                               │                                   │
│  ┌────────────────────────────┼────────────────────────────┐   │
│  │         Orchestration Service                            │   │
│  │  - Resource Scheduling                                   │   │
│  │  - Capacity Management                                    │   │
│  │  - Placement Decisions                                   │   │
│  └────────────────────────────┼────────────────────────────┘   │
└───────────────────────────────┼─────────────────────────────────┘
                                │
┌───────────────────────────────┼─────────────────────────────────┐
│                    Control Plane Layer                           │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │         Hypervisor Management                            │   │
│  │  - Instance Lifecycle Control                            │   │
│  │  - Resource Allocation                                   │   │
│  │  - Health Monitoring                                     │   │
│  └───────────────────────────┬──────────────────────────────┘   │
│                               │                                   │
│  ┌────────────────────────────┼────────────────────────────┐   │
│  │         Network Controller                               │   │
│  │  - Virtual Network Management                            │   │
│  │  - Security Group Enforcement                            │   │
│  │  - Routing & NAT                                         │   │
│  └────────────────────────────┼────────────────────────────┘   │
│                               │                                   │
│  ┌────────────────────────────┼────────────────────────────┐   │
│  │         Storage Controller                               │   │
│  │  - Volume Management                                     │   │
│  │  - Snapshot Management                                   │   │
│  │  - Storage Backend Coordination                          │   │
│  └────────────────────────────┼────────────────────────────┘   │
└───────────────────────────────┼─────────────────────────────────┘
                                │
┌───────────────────────────────┼─────────────────────────────────┐
│                    Data Layer                                    │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │         Metadata Database (MySQL/PostgreSQL)              │   │
│  │  - Instance Metadata                                     │   │
│  │  - Images, Volumes, Snapshots                            │   │
│  │  - Security Groups, Tags                                 │   │
│  └───────────────────────────┬──────────────────────────────┘   │
│                               │                                   │
│  ┌────────────────────────────┼────────────────────────────┐   │
│  │         Time-Series DB (InfluxDB/TimescaleDB)            │   │
│  │  - Metrics & Monitoring                                  │   │
│  │  - Billing Data                                          │   │
│  └────────────────────────────┼────────────────────────────┘   │
│                               │                                   │
│  ┌────────────────────────────┼────────────────────────────┐   │
│  │         Cache (Redis)                                    │   │
│  │  - Instance State Cache                                 │   │
│  │  - Security Group Rules Cache                           │   │
│  │  - Rate Limiting                                        │   │
│  └────────────────────────────┼────────────────────────────┘   │
│                               │                                   │
│  ┌────────────────────────────┼────────────────────────────┐   │
│  │         Message Queue (Kafka/RabbitMQ)                   │   │
│  │  - Event Streaming                                      │   │
│  │  - Async Operations                                     │   │
│  │  - State Change Events                                  │   │
│  └────────────────────────────┼────────────────────────────┘   │
└───────────────────────────────┼─────────────────────────────────┘
                                │
┌───────────────────────────────┼─────────────────────────────────┐
│                    Infrastructure Layer                          │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │         Compute Nodes (Hypervisor Hosts)                 │   │
│  │  - KVM/Xen/Hyper-V                                       │   │
│  │  - Instance Execution                                    │   │
│  │  - Local Storage (Instance Store)                        │   │
│  └───────────────────────────┬──────────────────────────────┘   │
│                               │                                   │
│  ┌────────────────────────────┼────────────────────────────┐   │
│  │         Storage Backend                                   │   │
│  │  - Block Storage (EBS)                                   │   │
│  │  - Object Storage (S3) for Images                       │   │
│  │  - Distributed File System                              │   │
│  └────────────────────────────┼────────────────────────────┘   │
│                               │                                   │
│  ┌────────────────────────────┼────────────────────────────┐   │
│  │         Network Infrastructure                           │   │
│  │  - Physical Network                                      │   │
│  │  - SDN Controllers                                       │   │
│  │  - Load Balancers                                        │   │
│  └────────────────────────────┴────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘

Core Components

1. API Gateway

Responsibilities:

Request routing and load balancing
Authentication and authorization (IAM integration)
Rate limiting and throttling
Request/response transformation
API versioning

Technology: AWS API Gateway, Kong, or custom gateway

Scalability:

Stateless design for horizontal scaling
Multiple instances behind load balancer
Caching for frequently accessed data

2. Instance Management Service

Responsibilities:

Handle instance lifecycle operations (launch, start, stop, terminate)
Validate instance configurations
Coordinate with orchestration service
Update instance state in database
Emit events for state changes

Key Operations:

launchInstance(request) → Returns instance ID
startInstance(instanceId) → Starts stopped instance
stopInstance(instanceId) → Stops running instance
terminateInstance(instanceId) → Terminates instance
describeInstances(filters) → Returns instance details
modifyInstance(instanceId, attributes) → Updates instance

State Machine:

pending → running → stopping → stopped
   ↓         ↓
running → shutting-down → terminated

3. Orchestration Service

Responsibilities:

Resource scheduling and placement
Capacity management
Instance placement decisions
Health monitoring and auto-recovery
Migration during hardware failures

Placement Algorithm:

Check availability zone capacity
Find suitable hypervisor host
Check resource availability (CPU, memory, network)
Consider placement constraints (dedicated, placement groups)
Reserve resources
Return placement decision

Capacity Management:

Track available capacity per AZ
Predict capacity needs
Auto-scale hypervisor infrastructure
Handle capacity constraints

4. Hypervisor Management

Responsibilities:

Manage hypervisor hosts
Allocate/deallocate resources
Monitor host health
Handle host failures
Resource isolation and multi-tenancy

Hypervisor Types:

KVM (Kernel-based Virtual Machine) - Linux
Xen - Paravirtualization
Hyper-V - Windows
VMware ESXi - Enterprise

Resource Isolation:

CPU pinning and quotas
Memory ballooning
Network bandwidth limiting
I/O throttling

5. Image Management Service

Responsibilities:

Store and manage AMI images
Image versioning
Image sharing (public/private)
Image metadata management
Boot image preparation

Image Storage:

Images stored in object storage (S3)
Metadata in database
Image registry for quick lookup
Support for multiple formats (RAW, QCOW2, VMDK)

Image Lifecycle:

Create image from instance
Store in object storage
Create metadata record
Make available for launch

6. Storage Management Service

Responsibilities:

Create and manage EBS volumes
Volume attachment/detachment
Snapshot creation and management
Volume encryption
Storage backend coordination

Volume Types:

gp3 - General Purpose SSD (3 IOPS/GB, up to 16,000 IOPS)
gp2 - General Purpose SSD (3 IOPS/GB baseline)
io1/io2 - Provisioned IOPS SSD (up to 64,000 IOPS)
st1 - Throughput Optimized HDD
sc1 - Cold HDD

Storage Backend:

Distributed block storage system
Replication across multiple AZs
Encryption at rest
Snapshot storage in object storage

7. Networking Service

Responsibilities:

VPC and subnet management
Security group rule enforcement
IP address allocation
NAT gateway management
Network ACL management

Network Architecture:

VPC - Isolated network per customer
Subnets - Network segments within VPC
Security Groups - Instance-level firewall
Network ACLs - Subnet-level firewall
Route Tables - Routing configuration

IP Management:

Private IP allocation from subnet CIDR
Public IP allocation from public pool
Elastic IP reservation
IP address tracking

8. Monitoring Service

Responsibilities:

Collect instance metrics
Store metrics in time-series database
Generate alarms
Instance health checks
Log aggregation

Metrics Collected:

CPU utilization
Memory utilization
Disk I/O
Network I/O
Instance status checks

Monitoring Architecture:

Agent on hypervisor host collects metrics
Metrics sent to monitoring service
Stored in time-series database
CloudWatch-like interface for querying

9. Billing Service

Responsibilities:

Meter instance usage
Calculate costs
Generate billing records
Handle different pricing models (on-demand, reserved, spot)

Pricing Models:

On-Demand - Pay per hour/second
Reserved - 1-3 year commitment, discounted
Spot - Bid for unused capacity
Dedicated - Single-tenant hardware

Billing Calculation:

Track instance runtime
Apply pricing based on instance type
Factor in data transfer costs
Generate hourly/daily billing records

Instance Lifecycle Management

Launch Flow

1. Client → API Gateway
   POST /api/v1/instances
   {
     "imageId": "ami-12345678",
     "instanceType": "t3.medium",
     ...
   }

2. API Gateway
   - Authenticate request
   - Rate limit check
   - Route to Instance Management Service

3. Instance Management Service
   - Validate request
   - Check account limits
   - Create instance record in DB (state: pending)
   - Publish event: InstanceLaunchRequested

4. Orchestration Service (Event Consumer)
   - Receive launch event
   - Select availability zone
   - Find suitable hypervisor host
   - Reserve resources
   - Return placement decision

5. Hypervisor Management
   - Receive placement decision
   - Allocate resources on host
   - Download image from object storage
   - Create virtual machine
   - Configure networking
   - Start instance

6. Instance Management Service
   - Receive instance started notification
   - Update DB (state: running)
   - Allocate IP addresses
   - Apply security groups
   - Publish event: InstanceRunning

7. Response to Client
   {
     "instanceId": "i-1234567890abcdef0",
     "state": "running",
     "privateIpAddress": "10.0.1.5",
     "publicIpAddress": "54.123.45.67"
   }

State Transitions

┌─────────┐
│ pending │  ← Instance launch initiated
└────┬────┘
     │
     │ (resources allocated, VM started)
     ↓
┌─────────┐
│ running │  ← Instance is operational
└────┬────┘
     │
     │ (stop command)
     ↓
┌──────────┐
│stopping  │  ← Graceful shutdown initiated
└────┬─────┘
     │
     │ (shutdown complete)
     ↓
┌─────────┐
│ stopped │  ← Instance stopped, resources retained
└────┬────┘
     │
     │ (start command)
     ↓
┌─────────┐
│ running │
└────┬────┘
     │
     │ (terminate command)
     ↓
┌──────────────┐
│shutting-down │  ← Termination initiated
└──────┬───────┘
       │
       │ (cleanup complete)
       ↓
┌────────────┐
│ terminated │  ← Instance deleted, resources released
└────────────┘

Health Monitoring

Instance Status Checks:

System Status Check - Hypervisor host health
Instance Status Check - Instance reachability

Auto-Recovery:

If system status check fails → Migrate instance to healthy host
If instance status check fails → Attempt recovery, then notify user
Health checks every 1-2 minutes

Networking Architecture

Virtual Private Cloud (VPC)

VPC Structure:

VPC (10.0.0.0/16)
├── Public Subnet (10.0.1.0/24) - AZ: us-east-1a
│   ├── Internet Gateway
│   ├── NAT Gateway
│   └── Instances with public IPs
├── Private Subnet (10.0.2.0/24) - AZ: us-east-1a
│   └── Instances without public IPs
└── Public Subnet (10.0.3.0/24) - AZ: us-east-1b
    └── Instances with public IPs

Security Groups

Rule Structure:

{
  "securityGroupId": "sg-12345678",
  "rules": [
    {
      "type": "ingress",
      "protocol": "tcp",
      "fromPort": 80,
      "toPort": 80,
      "source": "0.0.0.0/0",
      "description": "Allow HTTP from anywhere"
    },
    {
      "type": "ingress",
      "protocol": "tcp",
      "fromPort": 22,
      "toPort": 22,
      "source": "sg-87654321",
      "description": "Allow SSH from bastion"
    },
    {
      "type": "egress",
      "protocol": "-1",
      "fromPort": 0,
      "toPort": 65535,
      "destination": "0.0.0.0/0",
      "description": "Allow all outbound"
    }
  ]
}

Enforcement:

Security group rules enforced at hypervisor level
Stateful firewall (return traffic automatically allowed)
Rules evaluated in order
Default deny all inbound, allow all outbound

Network Isolation

Multi-Tenancy:

Each customer gets isolated VPC
Network virtualization (VXLAN, GRE tunnels)
MAC address isolation
VLAN tagging per customer

Traffic Flow:

Instance → Virtual Switch → Security Group Enforcement → 
Physical Network → Internet Gateway / NAT Gateway

Storage Architecture

Storage Types

1. Instance Store (Ephemeral)

Characteristics:

Temporary storage on hypervisor host
High performance (NVMe SSD)
Data lost on instance stop/terminate
Included with instance at no extra cost

Use Cases:

Temporary files
Cache
Scratch space
High-performance workloads

2. EBS Volumes (Persistent)

Characteristics:

Network-attached block storage
Persistent across instance lifecycle
Can be attached/detached
Replicated across multiple AZs
Encryption support

Volume Lifecycle:

Create Volume → Available → Attach → In-Use → 
Detach → Available → Delete → Deleted

Snapshot Process:

Create snapshot request
Freeze filesystem (if supported)
Copy data blocks to object storage
Unfreeze filesystem
Mark snapshot as completed

Storage Backend

Architecture:

EBS Volume Request
    ↓
Storage Controller
    ↓
Distributed Block Storage
    ├── Replication (3x across AZs)
    ├── Encryption (optional)
    └── Snapshot to Object Storage

Data Durability:

99.999999999% (11 nines) for EBS volumes
Achieved through replication
Automatic repair of corrupted blocks

Security & Access Control

Authentication & Authorization

IAM Integration:

Users authenticate via IAM
IAM policies control API access
Resource-level permissions
Instance role assumption

Example IAM Policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:RunInstances",
        "ec2:DescribeInstances",
        "ec2:StartInstances",
        "ec2:StopInstances"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "ec2:InstanceType": ["t3.micro", "t3.small"]
        }
      }
    }
  ]
}

Network Security

Security Measures:

VPC isolation per customer
Security group enforcement
Network ACLs
DDoS protection
VPN and Direct Connect support

Data Security

Encryption:

Encryption at rest (EBS volumes, snapshots)
Encryption in transit (TLS for API, VPN for data)
KMS integration for key management
Customer-managed keys (CMK)

Compliance:

SOC 2 Type II
PCI-DSS Level 1
HIPAA
GDPR

Instance Security

Key Pairs:

SSH key pairs for Linux instances
Password management for Windows instances
Key rotation support
Secure key storage

Metadata Service:

Instance metadata available at http://169.254.169.254
IAM role credentials
Instance identity documents
Secure access only from instance

Monitoring & Logging

Metrics Collection

Instance Metrics:

CPU utilization (%)
Memory utilization (%)
Disk read/write operations
Network in/out (bytes)
Status check failures

System Metrics:

Hypervisor host utilization
Storage backend performance
Network throughput
API latency and errors

Collection Architecture:

Hypervisor Agent → Metrics Collector → 
Time-Series DB → Monitoring Dashboard

Logging

Log Types:

API access logs
Instance system logs
Security group rule evaluations
Billing events
Error logs

Log Storage:

Centralized log aggregation
Long-term storage in object storage
Search and analysis tools
Retention policies

Alarms & Notifications

Alarm Types:

Instance status check failures
High CPU utilization
Low disk space
Security group rule violations
Billing thresholds

Notification Channels:

Email
SMS
SNS topics
Webhooks

Scalability & Performance

Horizontal Scaling

API Layer:

Stateless services
Multiple instances behind load balancer
Auto-scaling based on traffic

Database:

Read replicas for read-heavy operations
Sharding by account_id
Connection pooling

Compute Infrastructure:

Add hypervisor hosts as needed
Distribute instances across hosts
Auto-scale based on capacity

Caching Strategy

Cache Layers:

API Gateway Cache - Frequently accessed instance metadata
Application Cache (Redis) - Instance state, security groups
Database Query Cache - Common queries

Cache Invalidation:

TTL-based expiration
Event-driven invalidation on state changes
Cache warming for hot data

Performance Optimization

API Performance:

Connection pooling
Async processing for long operations
Batch operations where possible
Pagination for list operations

Database Optimization:

Indexes on frequently queried columns
Partitioning by time/account
Query optimization
Read replicas

Network Performance:

Enhanced networking (SR-IOV)
Placement groups for low latency
Bandwidth allocation per instance type

Disaster Recovery & High Availability

Multi-AZ Architecture

Availability Zones:

Multiple AZs per region
Independent power, networking, cooling
Instances distributed across AZs
Automatic failover

Replication:

EBS volumes replicated across AZs
Database replication
Image storage replicated

Backup & Recovery

Backup Strategy:

EBS snapshots
AMI images
Database backups
Configuration backups

Recovery Procedures:

Instance recovery from snapshots
Volume recovery from snapshots
Cross-region replication
RTO: < 1 hour, RPO: < 15 minutes

Failure Handling

Hypervisor Host Failure:

Detect host failure
Mark instances as impaired
Migrate instances to healthy hosts
Notify customers

Storage Failure:

Detect storage failure
Use replicated copies
Rebuild failed storage
Verify data integrity

Network Failure:

Detect network partition
Route traffic to healthy paths
Maintain instance connectivity
Restore full connectivity

Trade-offs & Design Decisions

1. Virtualization vs. Containers

Decision: Use full virtualization (KVM/Xen) for EC2 instances

Rationale:

Isolation: Complete OS-level isolation between customer instances
Security: Strong security boundaries, no shared kernel
Compatibility: Run any OS without modification
Legacy Support: Support older applications and OS versions
Resource Guarantees: Better CPU, memory, and I/O guarantees

Trade-offs:

Higher overhead (~5-10% vs containers' ~1-2%)
Slower boot times (30-60s vs containers' <1s)
More resource intensive

Alternative: Containers (Docker/Kubernetes) for specific use cases like serverless or batch processing

2. Hypervisor Choice: KVM vs Xen vs Hyper-V

Decision: Use KVM for Linux hosts, Hyper-V for Windows hosts

Rationale:

Hypervisor	Pros	Cons	Use Case
KVM	Open source, good performance, Linux native	Linux only	Linux instances ✓
Xen	Mature, paravirtualization support	More complex, declining adoption	Legacy systems
Hyper-V	Windows native, good Windows support	Windows only	Windows instances ✓
VMware ESXi	Enterprise features, mature	Proprietary, expensive	Enterprise customers

Trade-offs:

KVM: Best performance on Linux, active development
Hyper-V: Best Windows integration, native Microsoft stack
Multi-hypervisor adds complexity but provides flexibility

3. Storage Architecture: Network-Attached vs Local

Decision: Hybrid approach - Instance Store (local) + EBS (network-attached)

Rationale:

Instance Store (Local):

✅ High performance (NVMe SSD)
✅ Low latency
✅ No network overhead
❌ Ephemeral (data lost on stop/terminate)
❌ Limited capacity

EBS (Network-Attached):

✅ Persistent across instance lifecycle
✅ Can attach/detach
✅ Replicated for durability
❌ Network latency
❌ Additional cost

Trade-offs:

Offer both to match different use cases
Instance store for temporary, high-performance workloads
EBS for persistent, production workloads

4. Networking: Overlay vs Underlay

Decision: Use overlay networking (VXLAN/GRE) with SDN controller

Rationale:

Overlay (VXLAN/GRE):

✅ Flexible IP allocation per customer
✅ Easy multi-tenancy
✅ Scalable (16M VNIs)
✅ Works over existing physical network
❌ Additional encapsulation overhead (~50 bytes)

Underlay (VLAN):

✅ Lower overhead
✅ Better performance
❌ Limited scalability (4K VLANs)
❌ Complex IP management

Trade-offs:

Overlay provides better isolation and scalability
Performance overhead acceptable for most workloads
Can use SR-IOV for high-performance instances

5. Instance Placement: Random vs Intelligent

Decision: Intelligent placement with multiple strategies

Placement Strategies:

Spread Placement - Distribute across distinct hardware
- Use case: High availability
- Reduces correlated failures
Cluster Placement - Pack instances on same hardware
- Use case: Low latency (HPC)
- Higher risk of correlated failures
Partition Placement - Logical partitions
- Use case: Multi-AZ deployments
- Balance between spread and cluster

Trade-offs:

Intelligent placement improves availability and performance
Adds complexity to orchestration
May reduce resource utilization

6. Billing Model: Per-Second vs Per-Hour

Decision: Per-second billing with 1-minute minimum

Rationale:

Fairer for customers (pay only for what you use)
Competitive advantage
More complex billing calculations
Higher billing system load

Trade-offs:

Per-second: Fairer, more complex
Per-hour: Simpler, less fair
1-minute minimum balances fairness and complexity

7. Instance State Persistence: Stop vs Hibernate

Decision: Support both Stop and Hibernate

Stop:

✅ Fast (saves instance state to EBS)
✅ Can change instance type
✅ Lower cost (no compute charges)
❌ EBS storage costs continue

Hibernate:

✅ Fastest resume (RAM saved to EBS)
✅ Preserves in-memory state
❌ Requires EBS root volume
❌ Limited instance types

Trade-offs:

Offer both options
Stop for most use cases
Hibernate for long-running applications with state

8. Security Groups: Stateful vs Stateless

Decision: Stateful security groups

Rationale:

Return traffic automatically allowed
Simpler rule management
Better user experience
Matches traditional firewall behavior

Trade-offs:

Stateful: Simpler, but requires connection tracking
Stateless: More control, but complex rule management
Stateful is industry standard and user-friendly

9. Image Storage: Object Storage vs Block Storage

Decision: Object storage (S3-like) for images, block storage for volumes

Rationale:

Object Storage for Images:

✅ Cost-effective for large, infrequently accessed data
✅ High durability (11 nines)
✅ Easy sharing and versioning
✅ Scales to petabytes
❌ Higher latency for reads

Block Storage for Volumes:

✅ Low latency
✅ Random access
✅ Direct attach to instances
❌ More expensive
❌ Limited scalability per volume

Trade-offs:

Images are read infrequently → object storage
Volumes need low latency → block storage
Hybrid approach optimizes cost and performance

10. Multi-Tenancy: Shared vs Dedicated Hardware

Decision: Shared hardware with strong isolation, dedicated option available

Rationale:

Shared Hardware:

✅ Higher resource utilization
✅ Lower cost for customers
✅ Better for most workloads
❌ Potential "noisy neighbor" issues
❌ Requires strong isolation

Dedicated Hardware:

✅ Complete isolation
✅ Compliance requirements
✅ Predictable performance
❌ Higher cost
❌ Lower utilization

Trade-offs:

Default to shared for cost efficiency
Offer dedicated for compliance/security needs
Use hardware-level isolation (CPU pinning, memory isolation)

11. API Design: REST vs GraphQL vs gRPC

Decision: REST API with optional gRPC for high-performance operations

Rationale:

REST:

✅ Industry standard
✅ Easy to use and understand
✅ Good tooling and documentation
✅ Works well with HTTP caching
❌ Over-fetching/under-fetching

GraphQL:

✅ Flexible queries
✅ Reduces over-fetching
❌ More complex
❌ Caching challenges

gRPC:

✅ High performance
✅ Streaming support
✅ Type safety
❌ Less familiar to developers
❌ Limited browser support

Trade-offs:

REST for general API access (web, CLI, SDKs)
gRPC for internal services and high-throughput operations
GraphQL for specific use cases if needed

12. Database: SQL vs NoSQL for Metadata

Decision: SQL (PostgreSQL) for metadata, NoSQL for metrics

Rationale:

SQL for Metadata:

✅ ACID transactions for consistency
✅ Complex queries (joins, aggregations)
✅ Relational data (instances, volumes, security groups)
✅ Mature tooling
❌ Scaling challenges at extreme scale

NoSQL for Metrics:

✅ High write throughput
✅ Time-series optimization
✅ Horizontal scaling
❌ Eventual consistency
❌ Limited query capabilities

Trade-offs:

SQL for transactional metadata (instances, volumes, images)
NoSQL (InfluxDB/TimescaleDB) for time-series metrics
Can shard SQL database if needed

13. Caching Strategy: Write-Through vs Write-Back

Decision: Write-through cache with TTL-based invalidation

Rationale:

Write-Through:

✅ Data consistency
✅ No data loss on cache failure
✅ Simpler implementation
❌ Higher write latency

Write-Back:

✅ Lower write latency
✅ Better write performance
❌ Risk of data loss
❌ More complex

Trade-offs:

Write-through for instance state (consistency critical)
Cache with TTL for frequently read data
Event-driven invalidation for state changes

14. Monitoring: Push vs Pull

Decision: Hybrid - Push from agents, Pull for health checks

Rationale:

Push (Agent-based):

✅ Real-time metrics
✅ Works behind firewalls
✅ Efficient for high-frequency metrics
❌ Requires agent on each host

Pull (Scraping):

✅ Centralized configuration
✅ No agent required
✅ Easy to debug
❌ Requires network access
❌ Higher latency

Trade-offs:

Push for instance metrics (high frequency, behind firewall)
Pull for health checks and service discovery
Hybrid approach provides best of both

15. Region Strategy: Single vs Multi-Region

Decision: Multi-region deployment from day one

Rationale:

✅ Disaster recovery
✅ Lower latency for global customers
✅ Regulatory compliance (data residency)
✅ Higher availability
❌ Higher operational complexity
❌ Data replication costs

Trade-offs:

Start with 2-3 regions
Expand based on customer demand
Replicate critical data across regions
Route customers to nearest region

Summary

This AWS EC2 design supports:

Scale: Millions of instances, billions of API calls per day
Performance: < 60s instance launch, < 100ms API latency
Availability: 99.99% uptime with multi-AZ deployment
Security: Complete network isolation, encryption at rest and in transit
Flexibility: Multiple instance types, storage options, networking configurations

Key Technologies:

Hypervisor: KVM (Linux), Hyper-V (Windows)
Orchestration: Custom scheduler with intelligent placement
Storage: EBS (block storage), Object Storage (images)
Networking: VXLAN overlay, SDN controller
Database: PostgreSQL (metadata), InfluxDB (metrics)
Cache: Redis Cluster
Message Queue: Kafka
Monitoring: Prometheus + Grafana
API Gateway: Custom gateway with rate limiting

Design Principles:

Multi-tenancy with strong isolation - Shared infrastructure with complete customer isolation
Elasticity - Scale compute capacity up/down on demand
High availability - Multi-AZ, automatic failover, health monitoring
Security first - Network isolation, encryption, IAM integration
Cost optimization - Pay-per-use, reserved instances, spot pricing
Performance - Low latency, high throughput, optimized networking
Reliability - 11 nines durability, automatic recovery, data replication

Scaling Considerations:

Horizontal scaling of API services
Database sharding by account_id
Distributed storage backend
Multi-region deployment
Caching at multiple layers
Async processing for long operations

This design provides a solid foundation for a cloud compute service that can scale to millions of instances while maintaining high performance, availability, and security.

Design AWS EC2 Service

Table of Contents​

Problem Statement​

Key Challenges​

Functional Requirements​

Core Features​

Optional Features​

Non-Functional Requirements​

Performance​

Scalability​

Reliability​

Security​

Cost​

Capacity Estimation​

Traffic Estimates​

Storage Estimates​

Network Estimates​

Compute Estimates​

System APIs​

Instance Management APIs​

1. Launch Instance​

2. Describe Instances​

3. Start Instance​

4. Stop Instance​

5. Terminate Instance​

6. Modify Instance Attributes​

Image Management APIs​

7. Create Image (AMI)​

8. Describe Images​

Volume Management APIs​

9. Create Volume​

10. Attach Volume​

Database Design​

1. Instances Table​

2. Instance Tags Table​

3. Security Groups Table​

4. Security Group Rules Table​

5. Instance Security Groups Table​

6. Images (AMIs) Table​

7. Volumes Table​

8. Volume Attachments Table​

9. Snapshots Table​

10. Instance Metrics Table (Time-Series)​

11. Reservations Table​

12. Billing Records Table​

High-Level Architecture​

Core Components​

1. API Gateway​

2. Instance Management Service​

3. Orchestration Service​

4. Hypervisor Management​

5. Image Management Service​

6. Storage Management Service​

7. Networking Service​

8. Monitoring Service​

9. Billing Service​

Instance Lifecycle Management​

Launch Flow​

State Transitions​

Health Monitoring​

Networking Architecture​

Virtual Private Cloud (VPC)​

Security Groups​

Network Isolation​

Storage Architecture​

Storage Types​

1. Instance Store (Ephemeral)​

2. EBS Volumes (Persistent)​

Storage Backend​

Security & Access Control​

Authentication & Authorization​

Network Security​

Data Security​

Instance Security​

Monitoring & Logging​

Metrics Collection​

Logging​

Alarms & Notifications​

Scalability & Performance​

Horizontal Scaling​

Table of Contents

Problem Statement

Key Challenges

Functional Requirements

Core Features

Optional Features

Non-Functional Requirements

Performance

Scalability

Reliability

Security

Cost

Capacity Estimation

Traffic Estimates

Storage Estimates

Network Estimates

Compute Estimates

System APIs

Instance Management APIs

1. Launch Instance

2. Describe Instances

3. Start Instance

4. Stop Instance

5. Terminate Instance

6. Modify Instance Attributes

Image Management APIs

7. Create Image (AMI)

8. Describe Images

Volume Management APIs

9. Create Volume

10. Attach Volume

Database Design

1. Instances Table

2. Instance Tags Table

3. Security Groups Table

4. Security Group Rules Table

5. Instance Security Groups Table

6. Images (AMIs) Table

7. Volumes Table

8. Volume Attachments Table

9. Snapshots Table

10. Instance Metrics Table (Time-Series)

11. Reservations Table

12. Billing Records Table

High-Level Architecture

Core Components

1. API Gateway

2. Instance Management Service

3. Orchestration Service

4. Hypervisor Management

5. Image Management Service

6. Storage Management Service

7. Networking Service

8. Monitoring Service

9. Billing Service

Instance Lifecycle Management

Launch Flow

State Transitions

Health Monitoring

Networking Architecture

Virtual Private Cloud (VPC)

Security Groups

Network Isolation

Storage Architecture

Storage Types

1. Instance Store (Ephemeral)

2. EBS Volumes (Persistent)

Storage Backend

Security & Access Control

Authentication & Authorization

Network Security

Data Security

Instance Security

Monitoring & Logging

Metrics Collection

Logging

Alarms & Notifications

Scalability & Performance

Horizontal Scaling

Caching Strategy