Clustered Deployment Architecture
This guide covers architecture considerations for deploying Hydden Discovery in a clustered on-premises configuration. Clustered deployments provide high availability, fault tolerance, and disaster recovery capabilities for enterprise environments.
What Is a Clustered Deployment?
What it is: A clustered deployment runs multiple Hydden server nodes that work together as a unified system. If one node fails, the remaining nodes continue to serve requests. Data is replicated across nodes to prevent data loss.
Why it matters: Single-server deployments create a single point of failure. For organizations with strict uptime requirements, compliance mandates, or large-scale identity data, a clustered deployment ensures continuous availability and data protection.
Key terms:
- Node — A single Hydden server instance in the cluster.
- Quorum — The minimum number of nodes required for the cluster to operate (a majority: ⌊N/2⌋ + 1).
- Failover — Automatic transfer of operations from a failed node to a healthy node.
- Replication — Copying data between nodes to ensure redundancy.
Deployment Topologies
Choose a deployment topology based on your availability and disaster recovery requirements.
Active-Active Cluster
All nodes actively serve requests. A load balancer distributes traffic across nodes.
| Aspect | Configuration |
|---|---|
| Minimum nodes | 3 (for quorum) |
| Load balancing | Required |
| Failover time | Seconds (automatic) |
| Use case | High availability within a single data center |
Advantages:
- Full resource utilization across all nodes
- Automatic load distribution
- No manual intervention during failover
Considerations:
- All nodes must have identical configuration
- Network latency between nodes should be under 10ms for optimal performance
- RPC connections must complete within 10 seconds (configurable via RpcConnectTimeout)
- Requires shared or replicated storage
Active-Passive Cluster
One node handles all traffic while standby nodes remain idle. Standby nodes activate only during failover.
| Aspect | Configuration |
|---|---|
| Minimum nodes | 2 |
| Load balancing | Optional (for health checks) |
| Failover time | 30–60 seconds |
| Use case | Cost-sensitive deployments with moderate uptime requirements |
Advantages:
- Lower resource utilization
- Simpler configuration
- Standby node can serve as warm backup
Considerations:
- Active node is a single point of failure until failover completes
- Manual or scripted failover may be required
- Standby nodes require regular testing
Multi-Site (Disaster Recovery)
Nodes are distributed across geographically separated data centers for disaster recovery. Hydden uses a leaf node architecture where each site operates as a leaf cluster connected to a central hub.
| Aspect | Configuration |
|---|---|
| Minimum nodes | 5 total (2 per site + 1 witness node) |
| Sites | Primary + DR site(s) |
| Replication | Asynchronous (cross-site) via NATS leaf connections |
| Failover time | 60–90 seconds (automatic detection + promotion) |
| Use case | Business continuity and regulatory compliance |
Advantages:
- Protection against site-wide failures
- Geographic redundancy
- Compliance with data residency requirements
- Each site can operate independently during network partitions
Considerations:
- Network timeout between sites is 300 seconds (configurable via NetworkTimeout)
- Asynchronous replication may result in data lag during partitions
- DNS or global load balancer required for site failover
- Witness node ensures quorum during site failures
Why 5 Nodes for Multi-Site?
For multi-site deployments, deploy 5 nodes minimum (2 per site plus a witness):
| Configuration | Fault Tolerance | Quorum Maintained If... |
|---|---|---|
| 3 nodes (single site) | 1 node failure | 2 of 3 nodes available |
| 4 nodes (2 per site) | 1 node failure | 3 of 4 nodes available (loses quorum if entire site fails) |
| 5 nodes (2+2+1) | 2 node failures OR 1 site failure | 3 of 5 nodes available |
The quorum formula is (n/2) + 1, meaning a 5-node cluster requires 3 nodes for quorum. This allows an entire 2-node site to fail while maintaining operations.
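The quorum arithmetic can be checked with a quick shell loop; bash integer division gives the floor automatically, matching the ⌊n/2⌋ + 1 rule and the table above:

```shell
#!/usr/bin/env bash
# Quorum = floor(n/2) + 1; fault tolerance = n - quorum.
# Bash integer division discards the remainder, giving the floor.
for n in 3 5 7; do
  quorum=$(( n / 2 + 1 ))
  echo "${n}-node cluster: quorum=${quorum}, tolerates $(( n - quorum )) failures"
done
```

For n=5 this prints quorum=3 with tolerance for 2 failures, which is why a 2-node site can drop without losing quorum.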
Witness node placement: Deploy the witness node in a third location (cloud region, separate data center, or colocation facility) to break ties during site partitions.
Leaf Node Architecture
Multi-site deployments use NATS leaf node connections:
Within each site, nodes communicate directly with low latency. Between sites, leaf connections handle replication and cluster coordination.
Single Region with Multiple Data Centers
A single-region multi-datacenter deployment works well when:
- Network latency between data centers is under 50ms (round-trip)
- Network operations complete within the 300-second timeout threshold
- Dedicated network links exist between data centers
For same-region deployments, synchronous replication is possible if latency permits, reducing RPO to near-zero.
Server Node Requirements
Hardware Specifications
| Component | Minimum (per node) | Recommended (per node) |
|---|---|---|
| CPU | 4 cores | 8+ cores |
| Memory | 16 GB RAM | 32 GB RAM |
| Storage | 100 GB SSD | 500 GB NVMe SSD |
| Network | 1 Gbps | 10 Gbps |
Operating System
Hydden server supports:
- Windows Server 2019, 2022
- Linux Ubuntu 20.04+, RHEL 8+, Rocky Linux 8+
All nodes in a cluster should run the same operating system and version.
Network Configuration
| Port | Protocol | Purpose |
|---|---|---|
| TCP 22100 | NATS | Stream broker (node-to-node) |
| TCP 22101 | HTTP | Bootstrap (initial setup only) |
| TCP 22103 | NATS | SMB client connections |
| TCP 22104 | NATS | Message broker gateway |
| TCP 443 | HTTPS | Web interface and API |
Firewall rules:
- All nodes must allow inbound/outbound traffic on ports 22100, 22103, and 22104 between cluster members
- Load balancer must reach TCP 443 on all nodes
- Clients must reach the load balancer VIP on TCP 443
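The firewall rules above can be staged with a short dry-run script. This is an illustration only, assuming a Linux host running firewalld; the commands are printed rather than executed so they can be reviewed before applying (translate to ufw or nftables as needed):

```shell
#!/usr/bin/env bash
# Dry run: print the firewall-cmd invocations for the Hydden cluster ports.
# Assumes firewalld (not a Hydden requirement); review output before running.
for port in 22100 22103 22104 443; do
  echo "firewall-cmd --permanent --add-port=${port}/tcp"
done
echo "firewall-cmd --reload"
```

After review, the printed commands can be run on each cluster node.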
NATS Cluster Configuration
Hydden uses NATS as its message broker. In a clustered deployment, NATS runs in cluster mode for high availability.
Cluster Sizing
| Cluster Size | Fault Tolerance | Recommended For |
|---|---|---|
| 3 nodes | 1 node failure | Standard HA |
| 5 nodes | 2 node failures | High fault tolerance |
| 7 nodes | 3 node failures | Maximum fault tolerance |
IMPORTANT
Always use an odd number of nodes to ensure quorum can be established during network partitions.
NATS Configuration
Each node's NATS configuration includes cluster peer definitions:
# Example NATS cluster configuration
cluster:
  name: hydden-cluster
  listen: 0.0.0.0:6222
  routes:
    - nats://node1.internal:6222
    - nats://node2.internal:6222
    - nats://node3.internal:6222

The Hydden installer configures NATS automatically. For manual cluster setup, contact Hydden support.
Timeout and Health Check Settings
Hydden uses the following default timeouts for cluster operations:
| Setting | Default Value | Description |
|---|---|---|
| StartupTimeout | 10 seconds | Maximum time for a node to start and join the cluster |
| ShutdownTimeout | 30 seconds | Graceful shutdown period before forced termination |
| RpcConnectTimeout | 10 seconds | Timeout for establishing RPC connections between nodes |
| NetworkTimeout | 300 seconds | Maximum time for network operations (cross-site replication) |
| ProbeInterval | 15 seconds | Health check interval between nodes (with jitter) |
| OfflineThreshold | 60 seconds | Time without response before a node is marked offline |
Failure detection timing: When a node fails, the cluster detects the failure within 60 seconds (4 missed probe intervals). Automatic failover begins immediately after detection.
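The 60-second figure follows directly from the defaults above (ProbeInterval × 4 missed probes = OfflineThreshold). A quick sketch of the worst-case client impact; the 30-second load balancer window is an assumption for illustration, not a Hydden default:

```shell
#!/usr/bin/env bash
# Worst-case failure detection and client impact from the default timeouts.
PROBE_INTERVAL=15   # seconds between health probes (ProbeInterval)
MISSED_PROBES=4     # consecutive misses before a node is marked offline
LB_FAILOVER=30      # assumed load balancer health check window (illustrative)

DETECTION=$(( PROBE_INTERVAL * MISSED_PROBES ))
echo "Failure detection: ${DETECTION}s"
echo "Worst-case client impact: $(( DETECTION + LB_FAILOVER ))s"
```

This yields 60 seconds for detection and 90 seconds of total client impact, consistent with the thresholds used in the failover testing procedure below.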
TIP
For environments with higher latency, adjust NetworkTimeout accordingly. Cross-site deployments may require values up to 600 seconds.
JetStream Persistence
NATS JetStream handles message persistence and replication across the cluster. JetStream provides:
- Stream replication: Messages are replicated to multiple nodes before acknowledgment
- Merkle tree verification: Data integrity verified using Merkle tree hashes
- Automatic recovery: Nodes automatically sync missed messages after reconnection
Load Balancer Configuration
A load balancer distributes incoming requests across cluster nodes and provides health checking.
Health Check Endpoint
Configure your load balancer to check:
| Check | Endpoint | Expected Response |
|---|---|---|
| HTTP health | GET /health | HTTP 200 |
| Readiness | GET /ready | HTTP 200 |
Sample HAProxy Configuration
frontend hydden_https
    bind *:443 ssl crt /etc/ssl/hydden.pem
    default_backend hydden_servers

backend hydden_servers
    balance roundrobin
    option httpchk GET /health
    http-check expect status 200
    server node1 192.168.1.10:443 check ssl verify none
    server node2 192.168.1.11:443 check ssl verify none
    server node3 192.168.1.12:443 check ssl verify none

Session Persistence
Hydden uses JWT tokens for authentication. Session persistence (sticky sessions) is not required because tokens are validated by any cluster node.
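Why sticky sessions are unnecessary can be illustrated generically: any node holding the shared verification material performs the same signature check, so no node-local session state exists. A minimal HMAC sketch follows; this is not Hydden's actual token format, and the secret and payload are made up for illustration:

```shell
#!/usr/bin/env bash
# Illustration: a token signed with a shared secret verifies identically on
# any node, so no per-node session state is required. (Not Hydden's real JWT.)
SECRET="cluster-shared-secret"          # hypothetical shared verification key
PAYLOAD='{"sub":"alice"}'               # hypothetical token payload

sign() { printf '%s' "$1" | openssl dgst -sha256 -hmac "$SECRET" -hex | awk '{print $NF}'; }

TOKEN_SIG=$(sign "$PAYLOAD")            # "node1" issues the signature
# "node2" recomputes with the same shared secret and compares:
[ "$(sign "$PAYLOAD")" = "$TOKEN_SIG" ] && echo "valid on any node"
```

The same property lets the load balancer use plain round-robin, as in the HAProxy sample above.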
Data Storage and Replication
Identity Graph Database
The Identity Graph stores all discovered identity data. In clustered deployments:
| Storage Option | Replication | Recommended For |
|---|---|---|
| Shared storage (SAN/NAS) | Storage-level | Active-passive clusters |
| Replicated storage | Application-level | Active-active clusters |
| External database | Database clustering | Large-scale deployments |
Key-Value Store
The KV store maintains configuration and session data. Options include:
- Embedded (default): Replicated via NATS JetStream
- External Redis: Redis Cluster or Redis Sentinel for HA
Backup Strategy
| Data Type | Backup Method | Frequency |
|---|---|---|
| Identity Graph | Database dump or snapshot | Daily |
| Configuration | File backup or export | After changes |
| Credentials vault | Encrypted backup | Daily |
NOTE
Hydden provides built-in backup APIs. See Backup & Restore API for automation options.
High Availability Procedures
Adding a Node to the Cluster
Purpose: Expand cluster capacity or replace a failed node.
Prerequisites:
- New server meets hardware requirements
- Network connectivity to existing nodes
- Hydden installer package
Steps:
Install Hydden on the new server. Run the installer as described in Linux Server or Windows Server.
Select Join existing cluster during bootstrap. The bootstrap wizard detects cluster mode.
Enter the cluster join token. Obtain the token from an existing node's admin interface under Settings > Cluster.
Wait for synchronization. The new node synchronizes data from existing nodes. This may take several minutes depending on data size.
Verify node health. Check the cluster status page to confirm the new node shows Healthy.
Result: The new node joins the cluster and begins serving requests after synchronization completes.
Removing a Node from the Cluster
Purpose: Decommission a node for maintenance or replacement.
Steps:
Drain the node. Navigate to Settings > Cluster > [Node Name] and click Drain. The node stops accepting new requests.
Wait for active connections to complete. Monitor the connection count until it reaches zero.
Remove the node. Click Remove from Cluster. The node disconnects from the NATS cluster.
Update load balancer configuration. Remove the node from the backend server pool.
Result: The node is safely removed without service disruption.
Failover Testing
Test failover procedures regularly to ensure they work correctly during actual failures.
Expected timing: Node failure detection takes up to 60 seconds (OfflineThreshold). Load balancer failover depends on your health check configuration (typically 15–30 seconds additional).
Schedule a maintenance window. Allow at least 15 minutes for the full test cycle.
Simulate node failure.
- Option A: Stop the Hydden service on one node (systemctl stop hydden or Stop-Service Hydden)
- Option B: Block network ports using firewall rules (ports 22100, 22103, 22104)
Start timing and monitor.
- Note the exact time of simulated failure
- Watch the cluster status page for node state changes
- Expected: Node marked "Unhealthy" within 60 seconds
Verify automatic failover.
- Confirm remaining nodes continue serving requests (test API calls)
- Check that the load balancer marks the failed node as unhealthy
- Verify client connections are redirected to healthy nodes
- Monitor cluster logs for error messages: journalctl -u hydden -f (Linux) or Event Viewer (Windows)
Restore the node.
- Restart the Hydden service or remove firewall rules
- Verify the node rejoins the cluster automatically
- Expected: Node marked "Healthy" within 30 seconds of restart
Document results. Record actual failover time and compare to expected thresholds:
| Metric | Expected | Actual |
|---|---|---|
| Failure detection | ≤ 60 seconds | |
| Load balancer failover | ≤ 30 seconds | |
| Total client impact | ≤ 90 seconds | |
| Node rejoin time | ≤ 30 seconds | |
Disaster Recovery
Recovery Point Objective (RPO)
| Replication Type | Typical RPO |
|---|---|
| Synchronous (same site) | 0 (no data loss) |
| Asynchronous (cross-site) | 1–5 minutes |
| Backup-based | Last backup interval |
Recovery Time Objective (RTO)
| Scenario | Typical RTO |
|---|---|
| Single node failure (HA cluster) | Seconds |
| Site failure (DR site available) | 5–30 minutes |
| Full restore from backup | 1–4 hours |
DR Failover Procedure
Purpose: Activate the disaster recovery site when the primary site is unavailable.
Prerequisites:
- DR site is operational and synchronized
- DNS or global load balancer access
Steps:
Confirm primary site is unavailable. Verify the outage is not a network issue before initiating DR failover.
Update DNS or global load balancer. Point traffic to the DR site. Example:
- Update hydden.yourcompany.com to resolve to DR site IPs
- Or update the global load balancer to prefer the DR backend
Verify DR site is serving requests. Access the Hydden portal and confirm functionality.
Notify stakeholders. Inform users that the system is operating from the DR site.
Plan failback. Once the primary site is restored, schedule failback during a maintenance window.
Result: Operations continue from the DR site with minimal data loss (based on replication lag).
Monitoring and Alerting
Key Metrics
| Metric | Warning Threshold | Critical Threshold |
|---|---|---|
| Node health | Any node unhealthy | Quorum lost |
| CPU utilization | > 70% | > 90% |
| Memory utilization | > 75% | > 90% |
| Disk utilization | > 70% | > 85% |
| Replication lag | > 30 seconds | > 5 minutes |
| API response time | > 500ms | > 2000ms |
Integration with Monitoring Tools
Hydden exposes metrics in Prometheus format at /metrics. Configure your monitoring stack to scrape this endpoint:
# Example Prometheus scrape config
scrape_configs:
  - job_name: 'hydden'
    static_configs:
      - targets:
          - 'node1.internal:443'
          - 'node2.internal:443'
          - 'node3.internal:443'
    scheme: https
    tls_config:
      insecure_skip_verify: true  # Use proper CA in production

Troubleshooting
| Issue | Possible Cause | Resolution |
|---|---|---|
| Node fails to join cluster | Firewall blocking NATS ports | Open TCP 22100, 22103, 22104 between nodes |
| Node join times out | StartupTimeout (10s) exceeded | Check network connectivity; increase StartupTimeout if needed |
| Split-brain condition | Network partition | Restore network connectivity; cluster will auto-heal after 60s |
| Slow replication (same site) | High network latency | Ensure < 10ms latency between nodes; check network utilization |
| Slow replication (cross-site) | NetworkTimeout threshold | Increase NetworkTimeout beyond 300s for high-latency links |
| RPC connection failures | RpcConnectTimeout (10s) exceeded | Check inter-node connectivity; verify no packet loss |
| Load balancer marks all nodes unhealthy | Health check misconfigured | Verify /health endpoint returns HTTP 200 |
| Node marked offline unexpectedly | Missed 4 consecutive probes (60s) | Check node health; verify network stability |
| Data inconsistency after failover | Asynchronous replication lag | Accept data from most recent node; reconcile manually if needed |
| Quorum lost in multi-site | Fewer than (n/2)+1 nodes available | Restore connectivity to additional nodes; check witness node |
Related Topics
- On-prem Deployment Overview — Single-node deployment instructions
- Linux Server Installation — Linux server setup
- Windows Server Installation — Windows server setup
- Architecture — Hydden Discovery architecture overview
- Backup & Restore API — Automated backup endpoints
