zoobzio February 17, 2025

Architecture

This document explains how aegis works internally. It's intended for contributors and users who want to understand the implementation.

Component Overview

┌─────────────────────────────────────────────────────────────────────┐
│                              Node                                    │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌──────────────┐    ┌──────────────┐    ┌────────────────────┐    │
│  │  MeshServer  │    │ PeerManager  │    │     Topology       │    │
│  │              │    │              │    │                    │    │
│  │  - Ping      │    │  - AddPeer   │    │  - AddNode         │    │
│  │  - Health    │    │  - Connect   │    │  - RemoveNode      │    │
│  │  - NodeInfo  │    │  - PingPeer  │    │  - GetProviders    │    │
│  │  - Topology  │    │  - SyncTopo  │    │  - Merge           │    │
│  │  - Services  │    │              │    │                    │    │
│  └──────┬───────┘    └──────┬───────┘    └─────────┬──────────┘    │
│         │                   │                      │               │
│         │ gRPC/mTLS         │ gRPC/mTLS           │ version-based │
│         ▼                   ▼                      ▼               │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                        TLSConfig                             │   │
│  │                                                              │   │
│  │  - CA certificate (trust root)                               │   │
│  │  - Node certificate (identity)                               │   │
│  │  - Private key                                               │   │
│  │  - Server config (require client cert)                       │   │
│  │  - Client config (verify server cert)                        │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

mTLS Flow

Certificate Generation

When a node starts with WithCertDir(), aegis checks for existing certificates:

  1. CA exists? Load ca-cert.pem and ca-key.pem
  2. CA missing? Generate new CA (4096-bit RSA, 365-day validity)
  3. Node cert exists? Load {nodeID}-cert.pem and {nodeID}-key.pem
  4. Node cert missing? Generate using CA (2048-bit RSA, 90-day validity)

certs/
├── ca-cert.pem     # CA certificate (shared across nodes)
├── ca-key.pem      # CA private key (keep secure)
├── node-1-cert.pem # Node certificate
└── node-1-key.pem  # Node private key

For production, provide your own CA via WithTLSOptions().
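
The generation steps above can be sketched with the standard library alone. This is a minimal illustration, not the aegis implementation: the helper names (`issue`, `newCA`, `newNodeCert`, `verifyAgainstCA`), the CA subject name, and the key-usage choices are all assumptions. Key size and validity are parameters so the values quoted above (4096-bit/365 days, 2048-bit/90 days) can be plugged in.

```go
package main

import (
	"crypto/rand"
	"crypto/rsa"
	"crypto/x509"
	"crypto/x509/pkix"
	"math/big"
	"time"
)

// issue generates a fresh RSA key and signs tmpl with the parent key.
// A nil parent produces a self-signed certificate (used for the CA).
func issue(tmpl, parent *x509.Certificate, parentKey *rsa.PrivateKey, bits int) (*x509.Certificate, *rsa.PrivateKey, error) {
	key, err := rsa.GenerateKey(rand.Reader, bits)
	if err != nil {
		return nil, nil, err
	}
	serial, err := rand.Int(rand.Reader, new(big.Int).Lsh(big.NewInt(1), 127))
	if err != nil {
		return nil, nil, err
	}
	tmpl.SerialNumber = serial
	if parent == nil {
		parent, parentKey = tmpl, key
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, parent, &key.PublicKey, parentKey)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}

// newCA creates a self-signed CA valid for the given number of days.
func newCA(bits, days int) (*x509.Certificate, *rsa.PrivateKey, error) {
	now := time.Now()
	return issue(&x509.Certificate{
		Subject:               pkix.Name{CommonName: "example-ca"}, // hypothetical name
		NotBefore:             now.Add(-time.Minute),
		NotAfter:              now.AddDate(0, 0, days),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign | x509.KeyUsageCRLSign,
	}, nil, nil, bits)
}

// newNodeCert issues a certificate whose Common Name is the node ID,
// usable for both the server and the client side of an mTLS connection.
func newNodeCert(ca *x509.Certificate, caKey *rsa.PrivateKey, nodeID string, bits, days int) (*x509.Certificate, *rsa.PrivateKey, error) {
	now := time.Now()
	return issue(&x509.Certificate{
		Subject:     pkix.Name{CommonName: nodeID},
		DNSNames:    []string{nodeID}, // assumption: node ID doubles as a SAN
		NotBefore:   now.Add(-time.Minute),
		NotAfter:    now.AddDate(0, 0, days),
		KeyUsage:    x509.KeyUsageDigitalSignature | x509.KeyUsageKeyEncipherment,
		ExtKeyUsage: []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth, x509.ExtKeyUsageClientAuth},
	}, ca, caKey, bits)
}

// verifyAgainstCA checks that leaf chains to ca and permits client auth.
func verifyAgainstCA(ca, leaf *x509.Certificate) error {
	pool := x509.NewCertPool()
	pool.AddCert(ca)
	_, err := leaf.Verify(x509.VerifyOptions{
		Roots:     pool,
		KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
	})
	return err
}
```

Calling `newCA(4096, 365)` followed by `newNodeCert(ca, caKey, "node-1", 2048, 90)` would mirror the defaults listed above; writing the results to PEM files is left out for brevity.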

Connection Establishment

Server side (MeshServer):

tlsConfig := &tls.Config{
    ClientAuth: tls.RequireAndVerifyClientCert,
    ClientCAs:  certPool,  // CA that signed client certs
    Certificates: []tls.Certificate{serverCert},
}

Client side (PeerManager):

tlsConfig := &tls.Config{
    RootCAs:      certPool,  // CA that signed server cert
    Certificates: []tls.Certificate{clientCert},
    ServerName:   peerAddress,  // must match a name in the server cert (host name, no port)
}

Both sides verify certificates against the CA. Connection fails if:

  • Certificate not signed by trusted CA
  • Certificate expired
  • Common Name doesn't match expected identity

Caller Extraction

On every request, the server can extract caller identity:

caller, _ := aegis.CallerFromContext(ctx)
// caller.NodeID = certificate Common Name
// caller.Certificate = full X.509 certificate

This uses gRPC's peer.FromContext() to access TLS state.
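
As a rough illustration of what such an extraction involves: in grpc-go, `peer.FromContext()` returns a peer whose `AuthInfo` is a `credentials.TLSInfo` on TLS connections, and its `State` field is a standard `tls.ConnectionState`. The sketch below starts from that state and uses only the standard library. `Caller` mirrors the two fields named above, but `callerFromTLSState` and `stateWithCN` are hypothetical helpers, not aegis functions.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"errors"
)

// Caller mirrors the fields mentioned in the text; the real aegis type may differ.
type Caller struct {
	NodeID      string
	Certificate *x509.Certificate
}

// callerFromTLSState (hypothetical) derives the caller identity from the
// peer certificate chain of an established TLS connection.
func callerFromTLSState(state tls.ConnectionState) (*Caller, error) {
	if len(state.PeerCertificates) == 0 {
		return nil, errors.New("no client certificate presented")
	}
	leaf := state.PeerCertificates[0] // the leaf certificate comes first
	if leaf.Subject.CommonName == "" {
		return nil, errors.New("client certificate has no Common Name")
	}
	return &Caller{NodeID: leaf.Subject.CommonName, Certificate: leaf}, nil
}

// stateWithCN builds a ConnectionState for demonstration purposes only;
// in a real server the state comes from the TLS handshake.
func stateWithCN(cn string) tls.ConnectionState {
	return tls.ConnectionState{
		PeerCertificates: []*x509.Certificate{{Subject: pkix.Name{CommonName: cn}}},
	}
}
```

Because the server requires and verifies client certificates, the leaf certificate seen here has already been validated against the CA by the time a handler runs.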

Topology Synchronization

Data Model

type Topology struct {
    Nodes     map[string]NodeInfo  // nodeID → info
    Version   int64                 // monotonically increasing
    UpdatedAt time.Time
}

type NodeInfo struct {
    ID        string
    Name      string
    Type      NodeType
    Address   string
    Services  []ServiceInfo
    JoinedAt  time.Time
    UpdatedAt time.Time
}

Sync Protocol

When node A syncs with node B:

  1. A calls B.SyncTopology(version: A.version)
  2. B responds with its full topology and version
  3. If B.version > A.version, A adopts B's topology

if resp.Version > n.Topology.GetVersion() {
    n.Topology.Merge(remoteTopology)
}

This is a simple "highest version wins" model. More sophisticated CRDT-based merging could be added for conflict resolution.
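
A minimal sketch of the comparison, assuming that "adopt" means replacing the local node map wholesale. The source does not show the body of `Merge`, so `AdoptIfNewer` is a hypothetical stand-in, and the types are trimmed to the fields the comparison needs.

```go
package main

import "sync"

// NodeInfo is reduced to two fields for this sketch.
type NodeInfo struct {
	ID      string
	Address string
}

// Topology holds the node map and a version counter behind a lock.
type Topology struct {
	mu      sync.RWMutex
	Nodes   map[string]NodeInfo
	Version int64
}

// GetVersion returns the current version under a read lock.
func (t *Topology) GetVersion() int64 {
	t.mu.RLock()
	defer t.mu.RUnlock()
	return t.Version
}

// AdoptIfNewer (hypothetical) replaces local state when the remote version
// is strictly higher and reports whether anything changed.
func (t *Topology) AdoptIfNewer(remote map[string]NodeInfo, remoteVersion int64) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if remoteVersion <= t.Version {
		return false // equal or older: keep local state
	}
	nodes := make(map[string]NodeInfo, len(remote))
	for id, n := range remote {
		nodes[id] = n // copy so later remote mutations don't leak in
	}
	t.Nodes = nodes
	t.Version = remoteVersion
	return true
}
```

Under this rule, two nodes that each make a local change and reach the same version number will not exchange those changes, which is the kind of conflict the CRDT remark refers to.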

When Sync Happens

  • Manually via node.SyncTopology(ctx, peerID)
  • Bulk via node.SyncTopologyWithAllPeers(ctx)
  • Applications can trigger on timers or events
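
For the timer-driven case, an application might wrap the bulk sync in a ticker loop. The sketch below is one way to do that. `syncEvery` and `exampleRun` are hypothetical names, and the `syncAll` callback stands in for `node.SyncTopologyWithAllPeers(ctx)`, whose return type is assumed here to be a single `error`.

```go
package main

import (
	"context"
	"time"
)

// syncEvery calls syncAll on every tick until ctx is cancelled.
// Errors are passed to onErr (if non-nil) and do not stop the loop.
func syncEvery(ctx context.Context, interval time.Duration, syncAll func(context.Context) error, onErr func(error)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := syncAll(ctx); err != nil && onErr != nil {
				onErr(err)
			}
		}
	}
}

// exampleRun drives syncEvery with a stub for about 100ms and returns
// how many times the stub was invoked.
func exampleRun() int {
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	count := 0
	syncEvery(ctx, 10*time.Millisecond, func(context.Context) error {
		count++
		return nil
	}, nil)
	return count
}
```

In an application this would typically run in its own goroutine, with the context cancelled at shutdown.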

Service Discovery

Registration

Services are declared at node creation:

WithServices(aegis.ServiceInfo{Name: "identity", Version: "v1"})

This adds services to the node's NodeInfo, which propagates via topology sync.

Discovery

Query topology for providers:

providers := topology.GetServiceProviders("identity", "v1")

This scans all nodes, returning those with matching service declarations:

func (t *Topology) GetServiceProviders(name, version string) []NodeInfo {
    var providers []NodeInfo
    for _, node := range t.Nodes {
        for _, svc := range node.Services {
            if svc.Name == name && svc.Version == version {
                providers = append(providers, node)
                break
            }
        }
    }
    return providers
}

Client Load Balancing

ServiceClientPool distributes calls via round-robin:

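type ServiceClientPool struct {
    node     *Node                         // owning node (implied by p.node below; type assumed)
    conns    map[string]*grpc.ClientConn  // address → connection
    counters map[string]*atomic.Uint64     // service → counter
}

func (p *ServiceClientPool) GetConn(ctx context.Context, name, version string) (*grpc.ClientConn, error) {
    providers := p.node.Topology.GetServiceProviders(name, version)
    if len(providers) == 0 {
        // Guard against a divide-by-zero in the modulo below.
        return nil, fmt.Errorf("no providers for %s/%s", name, version)
    }
    key := name + "/" + version        // counter key; exact format is an assumption
    idx := p.counters[key].Add(1) - 1  // assumes the counter for key already exists
    provider := providers[idx % uint64(len(providers))]
    return p.getOrCreateConn(provider.Address)
}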

Connections are pooled and reused across calls.
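
The selection step can be isolated into a small, self-contained sketch. The version below creates per-service counters lazily under a mutex so that concurrent first calls are safe. The `roundRobin` type, its methods, and the `name/version` key format are illustrative assumptions rather than the pool's actual internals.

```go
package main

import (
	"errors"
	"sync"
	"sync/atomic"
)

// roundRobin keeps one atomic counter per service key.
type roundRobin struct {
	mu       sync.Mutex
	counters map[string]*atomic.Uint64
}

func newRoundRobin() *roundRobin {
	return &roundRobin{counters: make(map[string]*atomic.Uint64)}
}

// counter returns the counter for key, creating it on first use.
func (r *roundRobin) counter(key string) *atomic.Uint64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	c, ok := r.counters[key]
	if !ok {
		c = new(atomic.Uint64)
		r.counters[key] = c
	}
	return c
}

// pick returns the next address for the service in rotation.
func (r *roundRobin) pick(name, version string, addrs []string) (string, error) {
	if len(addrs) == 0 {
		return "", errors.New("no providers available")
	}
	idx := r.counter(name+"/"+version).Add(1) - 1
	return addrs[idx%uint64(len(addrs))], nil
}
```

Note that GetServiceProviders ranges over a Go map, whose iteration order is unspecified, so the provider slice may be ordered differently on each call. Sorting it (for example by address) before indexing keeps the rotation even.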

Design Q&A

Why generate certificates automatically?

Developer experience. For local development and testing, manual certificate management adds friction. For production, WithTLSOptions() accepts external certificates.

Why version-based topology sync instead of CRDTs?

Simplicity. Version-based sync works for small clusters where a single source of truth exists. CRDTs add complexity for conflict resolution that isn't needed in most deployments.

Why not use a service mesh like Istio?

Aegis is embedded. No sidecars, no control plane, no Kubernetes dependency. It's a library, not infrastructure.

Why gRPC?

Protocol buffers provide typed contracts. gRPC provides streaming, deadlines, and interceptors. mTLS is built into gRPC's credential system.

Performance Characteristics

Operation                Complexity   Notes
GetServiceProviders      O(n × m)     n nodes, m services per node
Topology sync            O(n)         Full topology transfer
Connection pool lookup   O(1)         Map by address
Round-robin selection    O(1)         Atomic counter

For large meshes (100+ nodes), consider:

  • Caching service provider queries
  • Incremental topology sync
  • Hierarchical topology (regional clusters)
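
As one possible shape for the first suggestion, provider lookups could be cached and invalidated whenever the topology version changes. This sketch assumes the version is bumped on every topology change (the data model only states that it increases monotonically). `providerCache` and its `get` method are hypothetical, and the `lookup` callback stands in for a call to GetServiceProviders that returns addresses.

```go
package main

import "sync"

// providerCache memoizes provider lookups for a single topology version.
type providerCache struct {
	mu      sync.Mutex
	version int64
	entries map[string][]string // "name/version" → provider addresses
}

// get returns cached providers for the service, calling lookup only on a
// miss. A change in topoVersion discards every cached entry.
func (c *providerCache) get(topoVersion int64, name, version string, lookup func() []string) []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.entries == nil || topoVersion != c.version {
		c.entries = make(map[string][]string)
		c.version = topoVersion
	}
	key := name + "/" + version
	if got, ok := c.entries[key]; ok {
		return got
	}
	got := lookup()
	c.entries[key] = got
	return got
}
```

This turns repeated O(n × m) scans into map lookups between topology changes, at the cost of holding a lock during the scan on a miss.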

Next Steps