zoobzio February 17, 2025

Troubleshooting

This guide covers common errors and how to resolve them.

Connection Errors

"certificate signed by unknown authority"

Cause: Client doesn't trust the CA that signed the server's certificate.

Solutions:

  1. Same certificate directory: Ensure both nodes use certificates from the same CA.
    // Both nodes should generate certs in the same directory
    // OR share the same CA files
    WithCertDir("./shared-certs")
    
  2. Check CA files: Verify ca-cert.pem exists and is the same on both sides.
  3. Custom CA: If using external certificates, ensure the CA is in the trust pool:
    opts := &aegis.TLSOptions{
        Source:   aegis.CertSourceFile,
        CAFile:   "/path/to/ca.crt",
        CertFile: "/path/to/node.crt",
        KeyFile:  "/path/to/node.key",
    }
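
To confirm which side of the trust relationship is broken, it can help to check a node certificate against a CA file outside of aegis. The following is a minimal sketch using only the Go standard library. The helper names (verifyAgainstCA, issue) are hypothetical and not part of aegis, and the demo generates throwaway certificates in memory so it runs without any files. In practice the PEM bytes would come from os.ReadFile on ca-cert.pem and the node certificate.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"errors"
	"fmt"
	"math/big"
	"time"
)

// verifyAgainstCA reports whether the first certificate in certPEM chains to
// one of the CA certificates in caPEM. A nil error means the CA is trusted.
func verifyAgainstCA(caPEM, certPEM []byte) error {
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return errors.New("no CA certificate found in CA data")
	}
	block, _ := pem.Decode(certPEM)
	if block == nil {
		return errors.New("no PEM block found in certificate data")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		return fmt.Errorf("parse certificate: %w", err)
	}
	// Only the chain of trust is checked here, not the hostname.
	_, err = cert.Verify(x509.VerifyOptions{
		Roots:     pool,
		KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageAny},
	})
	return err
}

// issue creates a certificate for demo purposes only. With a nil parent the
// result is a self-signed CA; otherwise it is a leaf signed by parent.
func issue(name string, parent *x509.Certificate, parentKey *ecdsa.PrivateKey) (*x509.Certificate, *ecdsa.PrivateKey, []byte) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(time.Now().UnixNano()),
		Subject:      pkix.Name{CommonName: name},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
	}
	if parent == nil {
		tmpl.IsCA = true
		tmpl.BasicConstraintsValid = true
		tmpl.KeyUsage = x509.KeyUsageCertSign
		parent, parentKey = tmpl, key
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, parent, &key.PublicKey, parentKey)
	if err != nil {
		panic(err)
	}
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		panic(err)
	}
	return cert, key, pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
}

func main() {
	caA, keyA, caAPEM := issue("demo CA A", nil, nil)
	_, _, caBPEM := issue("demo CA B", nil, nil)
	_, _, leafPEM := issue("node-1", caA, keyA)

	fmt.Println("issuing CA:  ", verifyAgainstCA(caAPEM, leafPEM))
	fmt.Println("unrelated CA:", verifyAgainstCA(caBPEM, leafPEM))
}
```

If the check fails against the CA file one node uses but passes against the other node's CA file, the two nodes were most likely initialized from different CAs.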
    

"connection refused"

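Cause: Nothing is listening at the target address, either because the server is not running or because the client is dialing the wrong host or port.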

Solutions:

  1. Server started? Ensure node.StartServer() was called.
  2. Correct address? Check the peer's address matches the server's listening address.
    // Server
    WithAddress("localhost:8443")
    
    // Client peer info must match
    node.AddPeer(aegis.PeerInfo{
        ID:      "server-node",
        Type:    aegis.NodeTypeGeneric,
        Address: "localhost:8443",
    })
    
  3. Firewall: Check if the port is blocked.
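
A quick way to separate "nothing is listening" from TLS or aegis-level problems is to attempt a bare TCP connection to the peer's address. The sketch below uses only the standard library. probeTCP and selfCheck are hypothetical helper names, and the probe performs no TLS handshake, so a successful result only shows that the port accepts connections.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// probeTCP attempts a plain TCP connection to addr. It checks only that
// something accepts connections there; it performs no TLS handshake.
func probeTCP(addr string, timeout time.Duration) error {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return err
	}
	return conn.Close()
}

// selfCheck probes a local listener while it is open and again after it has
// been closed, returning both results.
func selfCheck() (whileOpen, afterClose error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0") // port 0: the OS picks a free port
	if err != nil {
		panic(err)
	}
	addr := ln.Addr().String()
	whileOpen = probeTCP(addr, time.Second)
	ln.Close()
	afterClose = probeTCP(addr, time.Second)
	return whileOpen, afterClose
}

func main() {
	whileOpen, afterClose := selfCheck()
	fmt.Println("listener open:  ", whileOpen)
	fmt.Println("listener closed:", afterClose)

	// The address from the example above; substitute the peer's real address.
	fmt.Println("localhost:8443: ", probeTCP("localhost:8443", 2*time.Second))
}
```

A refused connection usually returns immediately, while a probe that hangs until the timeout more often points at a firewall silently dropping packets.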

"context deadline exceeded"

Cause: The connection attempt or RPC did not finish before the context's deadline.

Solutions:

  1. Network reachable? Verify nodes can reach each other.
  2. Increase timeout:
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()
    
  3. Check server load: Server may be overwhelmed.
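
Before raising a timeout, it is worth confirming that the failure really is a deadline and not some other error. The sketch below shows the general pattern with the standard library only. callWithTimeout and simulatedRPC are hypothetical names, and the operation is a stand-in for a real call. Note that gRPC calls typically report timeouts as a status error with code DeadlineExceeded, so errors.Is against context.DeadlineExceeded may not match errors returned through aegis; that depends on how the library wraps them.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// callWithTimeout runs op under a deadline of d and reports whether a
// failure was caused by that deadline expiring.
func callWithTimeout(parent context.Context, d time.Duration, op func(context.Context) error) (timedOut bool, err error) {
	ctx, cancel := context.WithTimeout(parent, d)
	defer cancel()
	err = op(ctx)
	return errors.Is(err, context.DeadlineExceeded), err
}

// simulatedRPC stands in for a remote call that takes latency to complete
// and gives up early if its context is cancelled.
func simulatedRPC(latency time.Duration) func(context.Context) error {
	return func(ctx context.Context) error {
		select {
		case <-time.After(latency):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// demo calls the same 100ms operation with a deadline that is too short and
// with one that is generous, returning whether each call timed out.
func demo() (tooShort, generous bool) {
	op := simulatedRPC(100 * time.Millisecond)
	tooShort, _ = callWithTimeout(context.Background(), 10*time.Millisecond, op)
	generous, _ = callWithTimeout(context.Background(), 2*time.Second, op)
	return tooShort, generous
}

func main() {
	tooShort, generous := demo()
	fmt.Println("10ms deadline timed out:", tooShort)
	fmt.Println("2s deadline timed out:  ", generous)
}
```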

Service Discovery Errors

"no providers available for service"

Cause: No nodes in topology provide the requested service.

Solutions:

  1. Service declared? Provider must declare the service:
    WithServices(aegis.ServiceInfo{Name: "identity", Version: "v1"})
    
  2. Topology synced? Consumer's topology must include the provider:
    // Add provider as peer
    node.AddPeer(aegis.PeerInfo{...})
    
    // Sync topology
    node.SyncTopology(ctx, providerID)
    
    // Now query should work
    providers := node.Topology.GetServiceProviders("identity", "v1")
    
  3. Version match? Service name AND version must match exactly.
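
Exact matching means that near misses such as "v1" versus "v1.0" or "Identity" versus "identity" produce the same error as a service that is missing entirely. The sketch below illustrates one way to tell these cases apart when debugging. The service type and explainMiss function are local stand-ins written for this example; they are not aegis types and do not describe how aegis performs the lookup internally.

```go
package main

import (
	"fmt"
	"strings"
)

// service is a local stand-in for a declared service (not the aegis type).
type service struct{ Name, Version string }

// explainMiss compares a requested name and version against the declared
// services and describes the closest mismatch it can find.
func explainMiss(declared []service, name, version string) string {
	for _, s := range declared {
		if s.Name == name && s.Version == version {
			return "exact match"
		}
	}
	for _, s := range declared {
		if s.Name == name {
			return fmt.Sprintf("name matches but version differs: declared %q, requested %q", s.Version, version)
		}
	}
	for _, s := range declared {
		if strings.EqualFold(s.Name, name) {
			return fmt.Sprintf("name differs only by case: declared %q, requested %q", s.Name, name)
		}
	}
	return "no service with this name is declared"
}

func main() {
	declared := []service{{Name: "identity", Version: "v1"}}
	fmt.Println(explainMiss(declared, "identity", "v1"))
	fmt.Println(explainMiss(declared, "identity", "v1.0"))
	fmt.Println(explainMiss(declared, "Identity", "v1"))
	fmt.Println(explainMiss(declared, "billing", "v1"))
}
```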

"node has no TLS configuration"

Cause: Attempting to connect without TLS setup.

Solutions:

  1. Use NodeBuilder: Always use NewNodeBuilder() instead of NewNode().
  2. Specify cert source:
    WithCertDir("./certs")
    // OR
    WithTLSOptions(&aegis.TLSOptions{...})
    

Certificate Errors

"certificate has expired"

Cause: Node certificate past validity period.

Solutions:

  1. Regenerate certificates: Delete old certs and restart node.
    rm certs/node-1-cert.pem certs/node-1-key.pem
    
  2. Check CA expiry: CA certificates expire after 365 days by default.
  3. Disable expiry check (development only):
    opts := &aegis.TLSOptions{
        AllowExpired: true,  // NOT for production
    }
    

"certificate does not contain any IP SANs"

Cause: The client connected using an IP address, but the certificate's Subject Alternative Names (SANs) contain no IP entries.

Solutions:

  1. Use hostname: Connect via hostname, not IP, if cert has DNS SANs only.
  2. Regenerate with correct SANs: Delete cert and let aegis regenerate with proper SANs.
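
Both certificate errors in this section can be reproduced and checked from Go before restarting a node. The following is a minimal sketch using the standard library's crypto/x509 package. The helper names (parseCert, checkHost, isExpired, demoCert) are hypothetical and not part of aegis, and the demo generates a throwaway certificate in memory. In practice the PEM bytes would come from os.ReadFile on the node certificate.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"errors"
	"fmt"
	"math/big"
	"time"
)

// parseCert decodes the first PEM block in certPEM as an X.509 certificate.
func parseCert(certPEM []byte) (*x509.Certificate, error) {
	block, _ := pem.Decode(certPEM)
	if block == nil {
		return nil, errors.New("no PEM block found")
	}
	return x509.ParseCertificate(block.Bytes)
}

// checkHost reports whether the certificate's SANs cover host, which may be
// a DNS name or an IP address literal.
func checkHost(certPEM []byte, host string) error {
	cert, err := parseCert(certPEM)
	if err != nil {
		return err
	}
	return cert.VerifyHostname(host)
}

// isExpired reports whether the certificate's validity period has ended.
func isExpired(certPEM []byte) (bool, error) {
	cert, err := parseCert(certPEM)
	if err != nil {
		return false, err
	}
	return time.Now().After(cert.NotAfter), nil
}

// demoCert returns a throwaway self-signed certificate whose only SAN is the
// DNS name "localhost". If expired is true, its validity ended an hour ago.
func demoCert(expired bool) []byte {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	notAfter := time.Now().Add(time.Hour)
	if expired {
		notAfter = time.Now().Add(-time.Hour)
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "node-1"},
		NotBefore:    time.Now().Add(-2 * time.Hour),
		NotAfter:     notAfter,
		DNSNames:     []string{"localhost"},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		panic(err)
	}
	return pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
}

func main() {
	certPEM := demoCert(false)
	fmt.Println("localhost:", checkHost(certPEM, "localhost"))
	fmt.Println("127.0.0.1:", checkHost(certPEM, "127.0.0.1"))

	expired, _ := isExpired(demoCert(true))
	fmt.Println("expired:", expired)
}
```

Running checkHost with the exact host portion of the peer address shows whether connecting by hostname or regenerating the certificate is the right fix.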

Topology Issues

Topology not syncing

Cause: Version comparison or connectivity issues.

Debug steps:

  1. Check versions:
    log.Printf("Local version before sync: %d", node.Topology.GetVersion())
    // After sync
    log.Printf("Local version after sync: %d", node.Topology.GetVersion())
    
  2. Verify peer connection:
    peer, exists := node.GetPeer(peerID)
    if !exists {
        log.Println("Peer not found")
    } else if !peer.IsConnected() {
        // Only inspect the peer once we know it exists
        log.Println("Peer not connected")
    }
    
  3. Manual sync:
    err := node.SyncTopology(ctx, peerID)
    if err != nil {
        log.Printf("Sync failed: %v", err)
    }
    

Stale topology data

Cause: Topology not refreshed after changes.

Solutions:

  1. Periodic sync:
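    // This loop blocks; run it in its own goroutine
    ticker := time.NewTicker(30 * time.Second)
    defer ticker.Stop() // release the ticker if the loop is ever exited
    for range ticker.C {
        node.SyncTopologyWithAllPeers(ctx)
    }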
    
  2. Event-driven sync: Sync when peers connect/disconnect.
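
A loop written as for range ticker.C never exits, which makes clean shutdown difficult. One common alternative is to select on the context as well, so the loop stops when the context is cancelled. The sketch below shows that pattern with the standard library only. runEvery and demo are hypothetical names, and the task is a simulated sync. In a real node the function body would call node.SyncTopologyWithAllPeers(ctx), adapted to whatever that method actually returns.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// runEvery calls fn once per interval until ctx is cancelled, then returns
// the number of calls that failed. A failed call does not stop the loop.
func runEvery(ctx context.Context, interval time.Duration, fn func(context.Context) error) (failures int) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return failures
		case <-ticker.C:
			if err := fn(ctx); err != nil {
				failures++
				fmt.Println("periodic task failed:", err)
			}
		}
	}
}

// demo runs a simulated sync every 10ms for about 100ms. Every second call
// fails, standing in for an unreachable peer.
func demo() (calls, failures int) {
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	failures = runEvery(ctx, 10*time.Millisecond, func(context.Context) error {
		calls++
		if calls%2 == 0 {
			return errors.New("peer unreachable (simulated)")
		}
		return nil
	})
	return calls, failures
}

func main() {
	calls, failures := demo()
	fmt.Printf("%d calls, %d failures\n", calls, failures)
}
```

The same helper shape fits any recurring task in this guide, since it only needs a function that takes a context and returns an error.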

Health Check Issues

Health status not updating

Cause: Health checker not being called.

Solutions:

  1. Call CheckHealth explicitly:
    err := node.CheckHealth(ctx, checker)
    
  2. Periodic health checks:
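    go func() {
        ticker := time.NewTicker(10 * time.Second)
        defer ticker.Stop()
        for range ticker.C {
            // Log failures instead of silently dropping them
            if err := node.CheckHealth(ctx, checker); err != nil {
                log.Printf("Health check failed: %v", err)
            }
        }
    }()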
    

Debugging Strategies

Enable gRPC logging

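import "os"
import "google.golang.org/grpc/grpclog"

// Call before any other gRPC function. Info and warnings go to stdout, errors to stderr.
grpclog.SetLoggerV2(grpclog.NewLoggerV2(os.Stdout, os.Stdout, os.Stderr))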

Inspect certificates

# View certificate details
openssl x509 -in certs/node-1-cert.pem -text -noout

# Verify certificate chain
openssl verify -CAfile certs/ca-cert.pem certs/node-1-cert.pem

Check connection state

peer, exists := node.GetPeer(peerID)
if exists {
    log.Printf("Connection state: %v", peer.Conn.GetState())
    // IDLE, CONNECTING, READY, TRANSIENT_FAILURE, SHUTDOWN
} else {
    log.Println("Peer not found")
}
