Clustered Deployment Architecture

This guide covers architecture considerations for deploying Hydden Discovery in a clustered on-premises configuration. Clustered deployments provide high availability, fault tolerance, and disaster recovery capabilities for enterprise environments.

What Is a Clustered Deployment?

What it is: A clustered deployment runs multiple Hydden server nodes that work together as a unified system. If one node fails, the remaining nodes continue to serve requests. Data is replicated across nodes to prevent data loss.

Why it matters: Single-server deployments create a single point of failure. For organizations with strict uptime requirements, compliance mandates, or large-scale identity data, a clustered deployment ensures continuous availability and data protection.

Diagram description: A top-down flow diagram showing a clustered deployment architecture. A Load Balancer (HAProxy, F5, or AWS ALB) distributes traffic to a Hydden Server Cluster containing Node 1 (Primary), Node 2 (Secondary), and Node 3 (Secondary). All server nodes communicate with a NATS Cluster of three brokers and connect to Shared Storage containing the Identity Graph Database and Key-Value Store.

Key terms:

  • Node — A single Hydden server instance in the cluster.
  • Quorum — The minimum number of nodes required for the cluster to operate (typically N/2 + 1).
  • Failover — Automatic transfer of operations from a failed node to a healthy node.
  • Replication — Copying data between nodes to ensure redundancy.

Deployment Topologies

Choose a deployment topology based on your availability and disaster recovery requirements.

Active-Active Cluster

All nodes actively serve requests. A load balancer distributes traffic across nodes.

| Aspect | Configuration |
| --- | --- |
| Minimum nodes | 3 (for quorum) |
| Load balancing | Required |
| Failover time | Seconds (automatic) |
| Use case | High availability within a single data center |

Advantages:

  • Full resource utilization across all nodes
  • Automatic load distribution
  • No manual intervention during failover

Considerations:

  • All nodes must have identical configuration
  • Network latency between nodes should be under 10ms for optimal performance
  • RPC connections must complete within 10 seconds (configurable via RpcConnectTimeout)
  • Requires shared or replicated storage

Active-Passive Cluster

One node handles all traffic while standby nodes remain idle. Standby nodes activate only during failover.

| Aspect | Configuration |
| --- | --- |
| Minimum nodes | 2 |
| Load balancing | Optional (for health checks) |
| Failover time | 30–60 seconds |
| Use case | Cost-sensitive deployments with moderate uptime requirements |

Advantages:

  • Lower resource utilization
  • Simpler configuration
  • Standby node can serve as warm backup

Considerations:

  • Active node is a single point of failure until failover completes
  • Manual or scripted failover may be required
  • Standby nodes require regular testing

Multi-Site (Disaster Recovery)

Nodes are distributed across geographically separated data centers for disaster recovery. Hydden uses a leaf node architecture where each site operates as a leaf cluster connected to a central hub.

| Aspect | Configuration |
| --- | --- |
| Minimum nodes | 5 total (2 per site + 1 witness node) |
| Sites | Primary + DR site(s) |
| Replication | Asynchronous (cross-site) via NATS leaf connections |
| Failover time | 60–90 seconds (automatic detection + promotion) |
| Use case | Business continuity and regulatory compliance |

Advantages:

  • Protection against site-wide failures
  • Geographic redundancy
  • Compliance with data residency requirements
  • Each site can operate independently during network partitions

Considerations:

  • Network timeout between sites is 300 seconds (configurable via NetworkTimeout)
  • Asynchronous replication may result in data lag during partitions
  • DNS or global load balancer required for site failover
  • Witness node ensures quorum during site failures

Why 5 Nodes for Multi-Site?

For multi-site deployments, deploy 5 nodes minimum (2 per site plus a witness):

| Configuration | Fault Tolerance | Quorum Maintained If... |
| --- | --- | --- |
| 3 nodes (single site) | 1 node failure | 2 of 3 nodes available |
| 4 nodes (2 per site) | 1 node failure | 3 of 4 nodes available (loses quorum if an entire site fails) |
| 5 nodes (2+2+1) | 2 node failures OR 1 site failure | 3 of 5 nodes available |

The quorum formula is ⌊n/2⌋ + 1 (integer division), so a 5-node cluster requires 3 nodes for quorum. This allows an entire 2-node site to fail while maintaining operations.

Witness node placement: Deploy the witness node in a third location (cloud region, separate data center, or colocation facility) to break ties during site partitions.
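The quorum arithmetic above can be sketched in a few lines (an illustration of the formula only; the node counts are examples):

```python
# Minimal sketch of the quorum math described above.
# quorum(n) = floor(n/2) + 1; a cluster survives any failure that
# leaves at least quorum(n) nodes reachable.

def quorum(n: int) -> int:
    """Minimum nodes required for the cluster to operate."""
    return n // 2 + 1

def survives(total: int, failed: int) -> bool:
    """True if the cluster keeps quorum after `failed` nodes drop out."""
    return total - failed >= quorum(total)

print(quorum(5))       # 3 nodes needed in a 5-node cluster
print(survives(5, 2))  # True: a whole 2-node site can fail
print(survives(4, 2))  # False: a 4-node (2+2) layout loses quorum
```

This is why the 4-node layout in the table fails the site-loss case: losing one site removes 2 of 4 nodes, leaving fewer than the 3 required for quorum.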

Leaf Node Architecture

Multi-site deployments use NATS leaf node connections:

Diagram description: A top-down flow diagram showing the multi-site leaf node architecture. Site A (Primary) has Node 1 and Node 2 communicating with each other. Site B (DR) has Node 3 and Node 4 communicating with each other. Site C has a Witness node (Node 5). Both sites connect to the Witness via leaf connections, and Site A and Site B are linked via cross-site replication.

Within each site, nodes communicate directly with low latency. Between sites, leaf connections handle replication and cluster coordination.
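A leaf connection from a site node out to the witness might look like the following sketch, in the same style as the cluster configuration shown later in this guide (the hostname and port are placeholders; the Hydden installer generates the real configuration):

```yaml
# Hypothetical leaf node configuration for a Site A node (sketch only)
leafnodes:
  remotes:
    - url: nats://witness.site-c.internal:7422   # placeholder hostname and port
```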

Single Region with Multiple Data Centers

A single-region multi-datacenter deployment works well when:

  • Network latency between data centers is under 50ms (round-trip)
  • Network operations complete within the 300-second timeout threshold
  • Dedicated network links exist between data centers

For same-region deployments, synchronous replication is possible if latency permits, reducing RPO to near-zero.

Server Node Requirements

Hardware Specifications

| Component | Minimum (per node) | Recommended (per node) |
| --- | --- | --- |
| CPU | 4 cores | 8+ cores |
| Memory | 16 GB RAM | 32 GB RAM |
| Storage | 100 GB SSD | 500 GB NVMe SSD |
| Network | 1 Gbps | 10 Gbps |

Operating System

Hydden server supports:

  • Windows Server 2019 and 2022
  • Linux: Ubuntu 20.04+, RHEL 8+, Rocky Linux 8+

All nodes in a cluster should run the same operating system and version.

Network Configuration

| Port | Protocol | Purpose |
| --- | --- | --- |
| TCP 22100 | NATS | Stream broker (node-to-node) |
| TCP 22101 | HTTP | Bootstrap (initial setup only) |
| TCP 22103 | NATS | SMB client connections |
| TCP 22104 | NATS | Message broker gateway |
| TCP 443 | HTTPS | Web interface and API |

Firewall rules:

  • All nodes must allow inbound/outbound traffic on ports 22100, 22103, and 22104 between cluster members
  • Load balancer must reach TCP 443 on all nodes
  • Clients must reach the load balancer VIP on TCP 443
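Before joining nodes, you can sanity-check that the cluster ports are reachable with a short script (illustrative only; the hostname in the example is a placeholder):

```python
# Quick TCP reachability probe for the cluster ports listed above.
import socket

CLUSTER_PORTS = (22100, 22103, 22104)

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Attempt a TCP connection; True if the port accepts connections."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_node(host: str) -> dict:
    """Map each required cluster port to its reachability from this host."""
    return {port: port_open(host, port) for port in CLUSTER_PORTS}

# Example (placeholder hostname):
# print(check_node("node2.internal"))
```

Run it from each node against every other node; any False entry points to a firewall rule that still needs to be opened.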

NATS Cluster Configuration

Hydden uses NATS as its message broker. In a clustered deployment, NATS runs in cluster mode for high availability.

Cluster Sizing

| Cluster Size | Fault Tolerance | Recommended For |
| --- | --- | --- |
| 3 nodes | 1 node failure | Standard HA |
| 5 nodes | 2 node failures | High fault tolerance |
| 7 nodes | 3 node failures | Maximum fault tolerance |

IMPORTANT

Always use an odd number of nodes to ensure quorum can be established during network partitions.

NATS Configuration

Each node's NATS configuration includes cluster peer definitions:

```yaml
# Example NATS cluster configuration
cluster:
  name: hydden-cluster
  listen: 0.0.0.0:6222
  routes:
    - nats://node1.internal:6222
    - nats://node2.internal:6222
    - nats://node3.internal:6222
```

The Hydden installer configures NATS automatically. For manual cluster setup, contact Hydden support.

Timeout and Health Check Settings

Hydden uses the following default timeouts for cluster operations:

| Setting | Default Value | Description |
| --- | --- | --- |
| StartupTimeout | 10 seconds | Maximum time for a node to start and join the cluster |
| ShutdownTimeout | 30 seconds | Graceful shutdown period before forced termination |
| RpcConnectTimeout | 10 seconds | Timeout for establishing RPC connections between nodes |
| NetworkTimeout | 300 seconds | Maximum time for network operations (cross-site replication) |
| ProbeInterval | 15 seconds | Health check interval between nodes (with jittering) |
| OfflineThreshold | 60 seconds | Time without response before a node is marked offline |

Failure detection timing: When a node fails, the cluster detects the failure within 60 seconds (4 missed probe intervals). Automatic failover begins immediately after detection.
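The detection math follows directly from the defaults above: a node is declared offline once OfflineThreshold elapses without a probe response, i.e. after four missed 15-second probes. A simulation sketch (not Hydden's internal implementation):

```python
# Simulate health-probe failure detection using the default settings above.
PROBE_INTERVAL = 15      # seconds between probes
OFFLINE_THRESHOLD = 60   # seconds without a response before "offline"

def missed_probes_to_offline() -> int:
    """Consecutive missed probes before a node is marked offline."""
    return OFFLINE_THRESHOLD // PROBE_INTERVAL

def node_state(last_response_at: float, now: float) -> str:
    """Classify a node given the time of its last successful probe."""
    silent_for = now - last_response_at
    return "offline" if silent_for >= OFFLINE_THRESHOLD else "healthy"

print(missed_probes_to_offline())   # 4 missed probes
print(node_state(0.0, 59.9))        # healthy (just under the threshold)
print(node_state(0.0, 60.0))        # offline (threshold reached)
```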

TIP

For environments with higher latency, adjust NetworkTimeout accordingly. Cross-site deployments may require values up to 600 seconds.

JetStream Persistence

NATS JetStream handles message persistence and replication across the cluster. JetStream provides:

  • Stream replication: Messages are replicated to multiple nodes before acknowledgment
  • Merkle tree verification: Data integrity verified using Merkle tree hashes
  • Automatic recovery: Nodes automatically sync missed messages after reconnection

Load Balancer Configuration

A load balancer distributes incoming requests across cluster nodes and provides health checking.

Health Check Endpoint

Configure your load balancer to check:

| Check | Endpoint | Expected Response |
| --- | --- | --- |
| HTTP health | GET /health | HTTP 200 |
| Readiness | GET /ready | HTTP 200 |
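The check is a plain HTTP GET; anything other than a 200 should take the node out of rotation after a few consecutive failures. A sketch of that decision logic (the fall/rise counts are illustrative load-balancer defaults, not Hydden settings):

```python
# Sketch of a load balancer's health-check decision: a node leaves
# rotation after `fall` consecutive non-200 responses and returns
# after `rise` consecutive 200s (counts are illustrative).

class HealthTracker:
    def __init__(self, fall: int = 3, rise: int = 2):
        self.fall, self.rise = fall, rise
        self.failures = 0
        self.successes = 0
        self.healthy = True

    def observe(self, status_code: int) -> bool:
        """Feed one GET /health result; returns the current health state."""
        if status_code == 200:
            self.successes += 1
            self.failures = 0
            if self.successes >= self.rise:
                self.healthy = True
        else:
            self.failures += 1
            self.successes = 0
            if self.failures >= self.fall:
                self.healthy = False
        return self.healthy

node = HealthTracker()
for code in (200, 503, 503, 503):   # three consecutive failures
    state = node.observe(code)
print(state)   # False: node removed from rotation
```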

Sample HAProxy Configuration

```haproxy
frontend hydden_https
    bind *:443 ssl crt /etc/ssl/hydden.pem
    default_backend hydden_servers

backend hydden_servers
    balance roundrobin
    option httpchk GET /health
    http-check expect status 200
    server node1 192.168.1.10:443 check ssl verify none
    server node2 192.168.1.11:443 check ssl verify none
    server node3 192.168.1.12:443 check ssl verify none
```

Session Persistence

Hydden uses JWT tokens for authentication. Session persistence (sticky sessions) is not required because tokens are validated by any cluster node.
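Because every node holds the verification key material, any node can check a token locally without a shared session store. A minimal sketch of that stateless check using an HMAC-signed token (illustrative only; Hydden's actual token format and key handling are not shown here):

```python
# Why sticky sessions aren't needed: token validation is stateless.
# Any node holding the verification key can check a signature locally.
import base64
import hashlib
import hmac

def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign(payload: str, key: bytes) -> str:
    """Produce a compact '<payload>.<signature>' token (HS256-style)."""
    sig = hmac.new(key, payload.encode(), hashlib.sha256).digest()
    return f"{b64url(payload.encode())}.{b64url(sig)}"

def verify(token: str, key: bytes) -> bool:
    """Any node with `key` can validate the token -- no session lookup."""
    payload_b64, _ = token.rsplit(".", 1)
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    payload = base64.urlsafe_b64decode(padded).decode()
    return hmac.compare_digest(token, sign(payload, key))

key = b"shared-cluster-key"            # placeholder key
token = sign('{"sub":"alice"}', key)   # issued by node 1
print(verify(token, key))              # node 2 validates: True
print(verify(token, b"wrong-key"))     # wrong key: False
```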

Data Storage and Replication

Identity Graph Database

The Identity Graph stores all discovered identity data. In clustered deployments:

| Storage Option | Replication | Recommended For |
| --- | --- | --- |
| Shared storage (SAN/NAS) | Storage-level | Active-passive clusters |
| Replicated storage | Application-level | Active-active clusters |
| External database | Database clustering | Large-scale deployments |

Key-Value Store

The KV store maintains configuration and session data. Options include:

  • Embedded (default): Replicated via NATS JetStream
  • External Redis: Redis Cluster or Redis Sentinel for HA

Backup Strategy

| Data Type | Backup Method | Frequency |
| --- | --- | --- |
| Identity Graph | Database dump or snapshot | Daily |
| Configuration | File backup or export | After changes |
| Credentials vault | Encrypted backup | Daily |

NOTE

Hydden provides built-in backup APIs. See Backup & Restore API for automation options.

High Availability Procedures

Adding a Node to the Cluster

Purpose: Expand cluster capacity or replace a failed node.

Prerequisites:

  • New server meets hardware requirements
  • Network connectivity to existing nodes
  • Hydden installer package

Steps:

  1. Install Hydden on the new server. Run the installer as described in Linux Server or Windows Server.

  2. Select Join existing cluster during bootstrap. The bootstrap wizard detects cluster mode.

  3. Enter the cluster join token. Obtain the token from an existing node's admin interface under Settings > Cluster.

  4. Wait for synchronization. The new node synchronizes data from existing nodes. This may take several minutes depending on data size.

  5. Verify node health. Check the cluster status page to confirm the new node shows Healthy.

Result: The new node joins the cluster and begins serving requests after synchronization completes.

Removing a Node from the Cluster

Purpose: Decommission a node for maintenance or replacement.

Steps:

  1. Drain the node. Navigate to Settings > Cluster > [Node Name] and click Drain. The node stops accepting new requests.

  2. Wait for active connections to complete. Monitor the connection count until it reaches zero.

  3. Remove the node. Click Remove from Cluster. The node disconnects from the NATS cluster.

  4. Update load balancer configuration. Remove the node from the backend server pool.

Result: The node is safely removed without service disruption.

Failover Testing

Test failover procedures regularly to ensure they work correctly during actual failures.

Expected timing: Node failure detection takes up to 60 seconds (OfflineThreshold). Load balancer failover depends on your health check configuration (typically 15–30 seconds additional).

  1. Schedule a maintenance window. Allow at least 15 minutes for the full test cycle.

  2. Simulate node failure.

    • Option A: Stop the Hydden service on one node (systemctl stop hydden or Stop-Service Hydden)
    • Option B: Block the cluster ports (22100, 22103, and 22104) with firewall rules
  3. Start timing and monitor.

    • Note the exact time of simulated failure
    • Watch the cluster status page for node state changes
    • Expected: Node marked "Unhealthy" within 60 seconds
  4. Verify automatic failover.

    • Confirm remaining nodes continue serving requests (test API calls)
    • Check that the load balancer marks the failed node as unhealthy
    • Verify client connections are redirected to healthy nodes
    • Monitor cluster logs for error messages: journalctl -u hydden -f (Linux) or Event Viewer (Windows)
  5. Restore the node.

    • Restart the Hydden service or remove firewall rules
    • Verify the node rejoins the cluster automatically
    • Expected: Node marked "Healthy" within 30 seconds of restart
  6. Document results. Record actual failover time and compare to expected thresholds:

    | Metric | Expected | Actual |
    | --- | --- | --- |
    | Failure detection | ≤ 60 seconds | |
    | Load balancer failover | ≤ 30 seconds | |
    | Total client impact | ≤ 90 seconds | |
    | Node rejoin time | ≤ 30 seconds | |
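The timing measurements in step 6 can be automated with a small polling harness (a sketch; get_node_state is a stand-in for whatever status source you use, such as the cluster status page or API):

```python
# Poll a node-status source and measure how long a state change takes.
# `get_node_state` below is a placeholder for your real status check
# (cluster status API, load balancer stats, etc.).
import time

def time_until(state_fn, expected: str, timeout: float = 120.0,
               poll: float = 1.0):
    """Seconds until state_fn() returns `expected`, or None on timeout."""
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        if state_fn() == expected:
            return time.monotonic() - start
        time.sleep(poll)
    return None

# Example wiring (placeholder status function):
# detection = time_until(lambda: get_node_state("node2"), "Unhealthy")
# print(f"Failure detection took {detection:.0f}s (expected <= 60s)")
```

Start the timer at the moment you simulate the failure, then record the returned value in the Actual column of the results table.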

Disaster Recovery

Recovery Point Objective (RPO)

| Replication Type | Typical RPO |
| --- | --- |
| Synchronous (same site) | 0 (no data loss) |
| Asynchronous (cross-site) | 1–5 minutes |
| Backup-based | Last backup interval |

Recovery Time Objective (RTO)

| Scenario | Typical RTO |
| --- | --- |
| Single node failure (HA cluster) | Seconds |
| Site failure (DR site available) | 5–30 minutes |
| Full restore from backup | 1–4 hours |

DR Failover Procedure

Purpose: Activate the disaster recovery site when the primary site is unavailable.

Prerequisites:

  • DR site is operational and synchronized
  • DNS or global load balancer access

Steps:

  1. Confirm primary site is unavailable. Verify the outage is not a network issue before initiating DR failover.

  2. Update DNS or global load balancer. Point traffic to the DR site. Example:

    • Update hydden.yourcompany.com to resolve to DR site IPs
    • Or update global load balancer to prefer DR backend
  3. Verify DR site is serving requests. Access the Hydden portal and confirm functionality.

  4. Notify stakeholders. Inform users that the system is operating from the DR site.

  5. Plan failback. Once the primary site is restored, schedule failback during a maintenance window.

Result: Operations continue from the DR site with minimal data loss (based on replication lag).

Monitoring and Alerting

Key Metrics

| Metric | Warning Threshold | Critical Threshold |
| --- | --- | --- |
| Node health | Any node unhealthy | Quorum lost |
| CPU utilization | > 70% | > 90% |
| Memory utilization | > 75% | > 90% |
| Disk utilization | > 70% | > 85% |
| Replication lag | > 30 seconds | > 5 minutes |
| API response time | > 500 ms | > 2000 ms |
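The numeric thresholds above translate naturally into alert rules. A sketch of the evaluation logic (threshold values copied from the table; the metric names are illustrative, not Hydden's exported metric names):

```python
# Evaluate a metric sample against the warning/critical thresholds above.
# Values are (warning, critical) in the metric's native unit.
THRESHOLDS = {
    "cpu_percent":       (70, 90),
    "memory_percent":    (75, 90),
    "disk_percent":      (70, 85),
    "replication_lag_s": (30, 300),    # 30 s warning, 5 min critical
    "api_response_ms":   (500, 2000),
}

def severity(metric: str, value: float) -> str:
    """Return 'ok', 'warning', or 'critical' for one sample."""
    warn, crit = THRESHOLDS[metric]
    if value > crit:
        return "critical"
    if value > warn:
        return "warning"
    return "ok"

print(severity("cpu_percent", 65))         # ok
print(severity("disk_percent", 80))        # warning
print(severity("replication_lag_s", 400))  # critical
```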

Integration with Monitoring Tools

Hydden exposes metrics in Prometheus format at /metrics. Configure your monitoring stack to scrape this endpoint:

```yaml
# Example Prometheus scrape config
scrape_configs:
  - job_name: 'hydden'
    static_configs:
      - targets:
        - 'node1.internal:443'
        - 'node2.internal:443'
        - 'node3.internal:443'
    scheme: https
    tls_config:
      insecure_skip_verify: true  # Use proper CA in production
```

Troubleshooting

| Issue | Possible Cause | Resolution |
| --- | --- | --- |
| Node fails to join cluster | Firewall blocking NATS ports | Open TCP 22100, 22103, and 22104 between nodes |
| Node join times out | StartupTimeout (10 s) exceeded | Check network connectivity; increase StartupTimeout if needed |
| Split-brain condition | Network partition | Restore network connectivity; the cluster will auto-heal after 60 s |
| Slow replication (same site) | High network latency | Ensure < 10 ms latency between nodes; check network utilization |
| Slow replication (cross-site) | NetworkTimeout threshold | Increase NetworkTimeout beyond 300 s for high-latency links |
| RPC connection failures | RpcConnectTimeout (10 s) exceeded | Check inter-node connectivity; verify no packet loss |
| Load balancer marks all nodes unhealthy | Health check misconfigured | Verify the /health endpoint returns HTTP 200 |
| Node marked offline unexpectedly | Missed 4 consecutive probes (60 s) | Check node health; verify network stability |
| Data inconsistency after failover | Asynchronous replication lag | Accept data from the most recent node; reconcile manually if needed |
| Quorum lost in multi-site | Fewer than (n/2)+1 nodes available | Restore connectivity to additional nodes; check the witness node |
