# Architecture
This document explains how aegis works internally. It's intended for contributors and users who want to understand the implementation.
## Component Overview
```
┌─────────────────────────────────────────────────────────────────────┐
│                                Node                                 │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌──────────────┐   ┌──────────────┐   ┌────────────────────┐       │
│  │  MeshServer  │   │ PeerManager  │   │      Topology      │       │
│  │              │   │              │   │                    │       │
│  │ - Ping       │   │ - AddPeer    │   │ - AddNode          │       │
│  │ - Health     │   │ - Connect    │   │ - RemoveNode       │       │
│  │ - NodeInfo   │   │ - PingPeer   │   │ - GetProviders     │       │
│  │ - Topology   │   │ - SyncTopo   │   │ - Merge            │       │
│  │ - Services   │   │              │   │                    │       │
│  └──────┬───────┘   └──────┬───────┘   └─────────┬──────────┘       │
│         │                  │                     │                  │
│         │ gRPC/mTLS        │ gRPC/mTLS           │ version-based    │
│         ▼                  ▼                     ▼                  │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                          TLSConfig                          │    │
│  │                                                             │    │
│  │  - CA certificate (trust root)                              │    │
│  │  - Node certificate (identity)                              │    │
│  │  - Private key                                              │    │
│  │  - Server config (require client cert)                      │    │
│  │  - Client config (verify server cert)                       │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```
## mTLS Flow

### Certificate Generation

When a node starts with `WithCertDir()`, aegis checks for existing certificates:

- CA exists? Load `ca-cert.pem` and `ca-key.pem`
- CA missing? Generate a new CA (4096-bit RSA, 365-day validity)
- Node cert exists? Load `{nodeID}-cert.pem` and `{nodeID}-key.pem`
- Node cert missing? Generate one signed by the CA (2048-bit RSA, 90-day validity)

```
certs/
├── ca-cert.pem       # CA certificate (shared across nodes)
├── ca-key.pem        # CA private key (keep secure)
├── node-1-cert.pem   # Node certificate
└── node-1-key.pem    # Node private key
```

For production, provide your own CA via `WithTLSOptions()`.
### Connection Establishment

Server side (MeshServer):

```go
tlsConfig := &tls.Config{
	ClientAuth:   tls.RequireAndVerifyClientCert,
	ClientCAs:    certPool, // CA that signed client certs
	Certificates: []tls.Certificate{serverCert},
}
```

Client side (PeerManager):

```go
tlsConfig := &tls.Config{
	RootCAs:      certPool, // CA that signed the server cert
	Certificates: []tls.Certificate{clientCert},
	ServerName:   peerAddress,
}
```
Both sides verify certificates against the CA. Connection fails if:
- Certificate not signed by trusted CA
- Certificate expired
- Common Name doesn't match expected identity
### Caller Extraction

On every request, the server can extract the caller's identity:

```go
caller, _ := aegis.CallerFromContext(ctx)
// caller.NodeID      = certificate Common Name
// caller.Certificate = full X.509 certificate
```

This uses gRPC's `peer.FromContext()` to access the connection's TLS state.
## Topology Synchronization

### Data Model

```go
type Topology struct {
	Nodes     map[string]NodeInfo // nodeID → info
	Version   int64               // monotonically increasing
	UpdatedAt time.Time
}

type NodeInfo struct {
	ID        string
	Name      string
	Type      NodeType
	Address   string
	Services  []ServiceInfo
	JoinedAt  time.Time
	UpdatedAt time.Time
}
```
### Sync Protocol

When node A syncs with node B:

1. A calls `B.SyncTopology(version: A.version)`
2. B responds with its full topology and version
3. If `B.version > A.version`, A adopts B's topology

```go
if resp.Version > n.Topology.GetVersion() {
	n.Topology.Merge(remoteTopology)
}
```
This is a simple "highest version wins" model. More sophisticated CRDT-based merging could be added for conflict resolution.
### When Sync Happens

- Manually via `node.SyncTopology(ctx, peerID)`
- In bulk via `node.SyncTopologyWithAllPeers(ctx)`
- Applications can trigger sync on timers or events
## Service Discovery

### Registration

Services are declared at node creation:

```go
WithServices(aegis.ServiceInfo{Name: "identity", Version: "v1"})
```

This adds services to the node's NodeInfo, which propagates via topology sync.

### Discovery

Query the topology for providers:

```go
providers := topology.GetServiceProviders("identity", "v1")
```
This scans all nodes, returning those with matching service declarations:
```go
func (t *Topology) GetServiceProviders(name, version string) []NodeInfo {
	var providers []NodeInfo
	for _, node := range t.Nodes {
		for _, svc := range node.Services {
			if svc.Name == name && svc.Version == version {
				providers = append(providers, node)
				break
			}
		}
	}
	return providers
}
```
## Client Load Balancing

ServiceClientPool distributes calls across providers via round-robin:

```go
type ServiceClientPool struct {
	node     *Node
	conns    map[string]*grpc.ClientConn // address → connection
	counters map[string]*atomic.Uint64   // service → round-robin counter
}

func (p *ServiceClientPool) GetConn(ctx context.Context, name, version string) (*grpc.ClientConn, error) {
	providers := p.node.Topology.GetServiceProviders(name, version)
	if len(providers) == 0 {
		return nil, fmt.Errorf("no providers for %s/%s", name, version)
	}
	key := name + "/" + version
	counter, ok := p.counters[key]
	if !ok {
		counter = &atomic.Uint64{}
		p.counters[key] = counter
	}
	idx := counter.Add(1) - 1
	provider := providers[idx%uint64(len(providers))]
	return p.getOrCreateConn(provider.Address)
}
```

Connections are pooled and reused across calls.
## Design Q&A

### Why generate certificates automatically?

Developer experience. For local development and testing, manual certificate management is friction. For production, `WithTLSOptions()` accepts external certificates.

### Why version-based topology sync instead of CRDTs?

Simplicity. Version-based sync works for small clusters where a single source of truth exists. CRDTs add complexity for conflict resolution that isn't needed in most deployments.

### Why not use a service mesh like Istio?

Aegis is embedded. No sidecars, no control plane, no Kubernetes dependency. It's a library, not infrastructure.

### Why gRPC?

Protocol buffers provide typed contracts. gRPC provides streaming, deadlines, and interceptors. mTLS is built into gRPC's credential system.
## Performance Characteristics
| Operation | Complexity | Notes |
|---|---|---|
| GetServiceProviders | O(n × m) | n nodes, m services per node |
| Topology sync | O(n) | Full topology transfer |
| Connection pool lookup | O(1) | Map by address |
| Round-robin selection | O(1) | Atomic counter |
For large meshes (100+ nodes), consider:
- Caching service provider queries
- Incremental topology sync
- Hierarchical topology (regional clusters)
## Next Steps
- Services Guide — Registering domain services
- API Reference — Function signatures