Clustered Deployment Architecture

This guide covers architecture considerations for deploying Hydden Discovery in a clustered on-premises configuration. Clustered deployments provide high availability, fault tolerance, and disaster recovery capabilities for enterprise environments.

What Is a Clustered Deployment?

What it is: A clustered deployment runs multiple Hydden server nodes that work together as a unified system. If one node fails, the remaining nodes continue to serve requests. Data is replicated across nodes to prevent data loss.

Why it matters: Single-server deployments create a single point of failure. For organizations with strict uptime requirements, compliance mandates, or large-scale identity data, a clustered deployment ensures continuous availability and data protection.

Diagram description: A top-down flow diagram showing a clustered deployment architecture. A Load Balancer (HAProxy, F5, or AWS ALB) distributes traffic to a Hydden Server Cluster containing Node 1 (Primary), Node 2 (Secondary), and Node 3 (Secondary). All server nodes communicate with a NATS Cluster of three brokers and connect to Shared Storage containing the Identity Graph Database and Key-Value Store.

Key terms:

  • Node — A single Hydden server instance in the cluster.
  • Quorum — The minimum number of nodes required for the cluster to operate (typically N/2 + 1).
  • Failover — Automatic transfer of operations from a failed node to a healthy node.
  • Replication — Copying data between nodes to ensure redundancy.

Deployment Topologies

Choose a deployment topology based on your availability and disaster recovery requirements.

Active-Active Cluster

All nodes actively serve requests. A load balancer distributes traffic across nodes.

| Aspect | Configuration |
| --- | --- |
| Minimum nodes | 3 (for quorum) |
| Load balancing | Required |
| Failover time | Seconds (automatic) |
| Use case | High availability within a single data center |

Advantages:

  • Full resource utilization across all nodes
  • Automatic load distribution
  • No manual intervention during failover

Considerations:

  • All nodes must have identical configuration
  • Network latency between nodes should be under 10ms for optimal performance
  • RPC connections must complete within 10 seconds (configurable via RpcConnectTimeout)
  • Requires shared or replicated storage

Active-Passive Cluster

One node handles all traffic while standby nodes remain idle. Standby nodes activate only during failover.

| Aspect | Configuration |
| --- | --- |
| Minimum nodes | 2 |
| Load balancing | Optional (for health checks) |
| Failover time | 30–60 seconds |
| Use case | Cost-sensitive deployments with moderate uptime requirements |

Advantages:

  • Lower resource utilization
  • Simpler configuration
  • Standby node can serve as warm backup

Considerations:

  • Active node is a single point of failure until failover completes
  • Manual or scripted failover may be required
  • Standby nodes require regular testing

Multi-Site (Disaster Recovery)

Nodes are distributed across geographically separated data centers for disaster recovery. Hydden uses a leaf node architecture where each site operates as a leaf cluster connected to a central hub.

| Aspect | Configuration |
| --- | --- |
| Minimum nodes | 5 total (2 per site + 1 witness node) |
| Sites | Primary + DR site(s) |
| Replication | Asynchronous (cross-site) via NATS leaf connections |
| Failover time | 60–90 seconds (automatic detection + promotion) |
| Use case | Business continuity and regulatory compliance |

Advantages:

  • Protection against site-wide failures
  • Geographic redundancy
  • Compliance with data residency requirements
  • Each site can operate independently during network partitions

Considerations:

  • Network timeout between sites is 300 seconds (configurable via NetworkTimeout)
  • Asynchronous replication may result in data lag during partitions
  • DNS or global load balancer required for site failover
  • Witness node ensures quorum during site failures

Why 5 Nodes for Multi-Site?

For multi-site deployments, deploy 5 nodes minimum (2 per site plus a witness):

| Configuration | Fault Tolerance | Quorum Maintained If... |
| --- | --- | --- |
| 3 nodes (single site) | 1 node failure | 2 of 3 nodes available |
| 4 nodes (2 per site) | 1 node failure | 3 of 4 nodes available (loses quorum if an entire site fails) |
| 5 nodes (2+2+1) | 2 node failures OR 1 site failure | 3 of 5 nodes available |

The quorum formula is ⌊n/2⌋ + 1 (integer division), so a 5-node cluster requires 3 nodes for quorum. This allows an entire 2-node site to fail while maintaining operations.

Witness node placement: Deploy the witness node in a third location (cloud region, separate data center, or colocation facility) to break ties during site partitions.
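The quorum arithmetic above can be sketched in a few lines (an illustration of the formula only; the node counts are examples):

```python
# Minimal sketch of the quorum math described above.
# quorum(n) = floor(n/2) + 1; a cluster survives any failure that
# leaves at least quorum(n) nodes reachable.

def quorum(n: int) -> int:
    """Minimum nodes required for the cluster to operate."""
    return n // 2 + 1

def survives(total: int, failed: int) -> bool:
    """True if the cluster keeps quorum after `failed` nodes drop out."""
    return total - failed >= quorum(total)

print(quorum(5))       # 3 nodes needed in a 5-node cluster
print(survives(5, 2))  # True: a whole 2-node site can fail
print(survives(4, 2))  # False: a 4-node (2+2) layout loses quorum
```

This is why the 4-node layout in the table fails the site-loss case: losing one site removes 2 of 4 nodes, leaving fewer than the 3 required for quorum.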

Leaf Node Architecture

Multi-site deployments use NATS leaf node connections:

Diagram description: A top-down flow diagram showing the multi-site leaf node architecture. Site A (Primary) has Node 1 and Node 2 communicating with each other. Site B (DR) has Node 3 and Node 4 communicating with each other. Site C has a Witness node (Node 5). Both sites connect to the Witness via leaf connections, and Site A and Site B are linked via cross-site replication.

Within each site, nodes communicate directly with low latency. Between sites, leaf connections handle replication and cluster coordination.
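A leaf connection from a site node out to the witness might look like the following sketch, in the same style as the cluster configuration shown later in this guide (the hostname and port are placeholders; the Hydden installer generates the real configuration):

```yaml
# Hypothetical leaf node configuration for a Site A node (sketch only)
leafnodes:
  remotes:
    - url: nats://witness.site-c.internal:7422   # placeholder hostname and port
```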

Single Region with Multiple Data Centers

A single-region multi-datacenter deployment works well when:

  • Network latency between data centers is under 50ms (round-trip)
  • Network operations complete within the 300-second timeout threshold
  • Dedicated network links exist between data centers

For same-region deployments, synchronous replication is possible if latency permits, reducing RPO to near-zero.

Server Node Requirements

Hardware Specifications

| Component | Minimum (per node) | Recommended (per node) |
| --- | --- | --- |
| CPU | 4 cores | 8+ cores |
| Memory | 16 GB RAM | 32 GB RAM |
| Storage | 100 GB SSD | 500 GB NVMe SSD |
| Network | 1 Gbps | 10 Gbps |

Operating System

Hydden server supports:

  • Windows Server 2019 and 2022
  • Linux: Ubuntu 20.04+, RHEL 8+, Rocky Linux 8+

All nodes in a cluster should run the same operating system and version.

Network Configuration

| Port | Protocol | Purpose |
| --- | --- | --- |
| TCP 22100 | NATS | Stream broker (node-to-node) |
| TCP 22101 | HTTP | Bootstrap (initial setup only) |
| TCP 22103 | NATS | SMB client connections |
| TCP 22104 | NATS | Message broker gateway |
| TCP 443 | HTTPS | Web interface and API |

Firewall rules:

  • All nodes must allow inbound/outbound traffic on ports 22100, 22103, and 22104 between cluster members
  • Load balancer must reach TCP 443 on all nodes
  • Clients must reach the load balancer VIP on TCP 443
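Before joining nodes, you can sanity-check that the cluster ports are reachable with a short script (illustrative only; the hostname in the example is a placeholder):

```python
# Quick TCP reachability probe for the cluster ports listed above.
import socket

CLUSTER_PORTS = (22100, 22103, 22104)

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Attempt a TCP connection; True if the port accepts connections."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_node(host: str) -> dict:
    """Map each required cluster port to its reachability from this host."""
    return {port: port_open(host, port) for port in CLUSTER_PORTS}

# Example (placeholder hostname):
# print(check_node("node2.internal"))
```

Run it from each node against every other node; any False entry points to a firewall rule that still needs to be opened.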

NATS Cluster Configuration

Hydden uses NATS as its message broker. In a clustered deployment, NATS runs in cluster mode for high availability.

Cluster Sizing

| Cluster Size | Fault Tolerance | Recommended For |
| --- | --- | --- |
| 3 nodes | 1 node failure | Standard HA |
| 5 nodes | 2 node failures | High fault tolerance |
| 7 nodes | 3 node failures | Maximum fault tolerance |

IMPORTANT

Always use an odd number of nodes to ensure quorum can be established during network partitions.

NATS Configuration

Each node's NATS configuration includes cluster peer definitions:

```yaml
# Example NATS cluster configuration
cluster:
  name: hydden-cluster
  listen: 0.0.0.0:6222
  routes:
    - nats://node1.internal:6222
    - nats://node2.internal:6222
    - nats://node3.internal:6222
```

The Hydden installer configures NATS automatically. For manual cluster setup, contact Hydden support.

Timeout and Health Check Settings

Hydden uses the following default timeouts for cluster operations:

| Setting | Default Value | Description |
| --- | --- | --- |
| StartupTimeout | 10 seconds | Maximum time for a node to start and join the cluster |
| ShutdownTimeout | 30 seconds | Graceful shutdown period before forced termination |
| RpcConnectTimeout | 10 seconds | Timeout for establishing RPC connections between nodes |
| NetworkTimeout | 300 seconds | Maximum time for network operations (cross-site replication) |
| ProbeInterval | 15 seconds | Health check interval between nodes (with jittering) |
| OfflineThreshold | 60 seconds | Time without response before a node is marked offline |

Failure detection timing: When a node fails, the cluster detects the failure within 60 seconds (4 missed probe intervals). Automatic failover begins immediately after detection.
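The detection math follows directly from the defaults above: a node is declared offline once OfflineThreshold elapses without a probe response, i.e. after four missed 15-second probes. A simulation sketch (not Hydden's internal implementation):

```python
# Simulate health-probe failure detection using the default settings above.
PROBE_INTERVAL = 15      # seconds between probes
OFFLINE_THRESHOLD = 60   # seconds without a response before "offline"

def missed_probes_to_offline() -> int:
    """Consecutive missed probes before a node is marked offline."""
    return OFFLINE_THRESHOLD // PROBE_INTERVAL

def node_state(last_response_at: float, now: float) -> str:
    """Classify a node given the time of its last successful probe."""
    silent_for = now - last_response_at
    return "offline" if silent_for >= OFFLINE_THRESHOLD else "healthy"

print(missed_probes_to_offline())   # 4 missed probes
print(node_state(0.0, 59.9))        # healthy (just under the threshold)
print(node_state(0.0, 60.0))        # offline (threshold reached)
```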

TIP

For environments with higher latency, adjust NetworkTimeout accordingly. Cross-site deployments may require values up to 600 seconds.

JetStream Persistence

NATS JetStream handles message persistence and replication across the cluster. JetStream provides:

  • Stream replication: Messages are replicated to multiple nodes before acknowledgment
  • Merkle tree verification: Data integrity verified using Merkle tree hashes
  • Automatic recovery: Nodes automatically sync missed messages after reconnection

Load Balancer Configuration

A load balancer distributes incoming requests across cluster nodes and provides health checking.

Health Check Endpoint

Configure your load balancer to check:

| Check | Endpoint | Expected Response |
| --- | --- | --- |
| HTTP health | GET /health | HTTP 200 |
| Readiness | GET /ready | HTTP 200 |
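The check is a plain HTTP GET; anything other than a 200 should take the node out of rotation after a few consecutive failures. A sketch of that decision logic (the fall/rise counts are illustrative load-balancer defaults, not Hydden settings):

```python
# Sketch of a load balancer's health-check decision: a node leaves
# rotation after `fall` consecutive non-200 responses and returns
# after `rise` consecutive 200s (counts are illustrative).

class HealthTracker:
    def __init__(self, fall: int = 3, rise: int = 2):
        self.fall, self.rise = fall, rise
        self.failures = 0
        self.successes = 0
        self.healthy = True

    def observe(self, status_code: int) -> bool:
        """Feed one GET /health result; returns the current health state."""
        if status_code == 200:
            self.successes += 1
            self.failures = 0
            if self.successes >= self.rise:
                self.healthy = True
        else:
            self.failures += 1
            self.successes = 0
            if self.failures >= self.fall:
                self.healthy = False
        return self.healthy

node = HealthTracker()
for code in (200, 503, 503, 503):   # three consecutive failures
    state = node.observe(code)
print(state)   # False: node removed from rotation
```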

Sample HAProxy Configuration

```haproxy
frontend hydden_https
    bind *:443 ssl crt /etc/ssl/hydden.pem
    default_backend hydden_servers

backend hydden_servers
    balance roundrobin
    option httpchk GET /health
    http-check expect status 200
    server node1 192.168.1.10:443 check ssl verify none
    server node2 192.168.1.11:443 check ssl verify none
    server node3 192.168.1.12:443 check ssl verify none
```

Session Persistence

Hydden uses JWT tokens for authentication. Session persistence (sticky sessions) is not required because tokens are validated by any cluster node.
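Because every node holds the verification key material, any node can check a token locally without a shared session store. A minimal sketch of that stateless check using an HMAC-signed token (illustrative only; Hydden's actual token format and key handling are not shown here):

```python
# Why sticky sessions aren't needed: token validation is stateless.
# Any node holding the verification key can check a signature locally.
import base64
import hashlib
import hmac

def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign(payload: str, key: bytes) -> str:
    """Produce a compact '<payload>.<signature>' token (HS256-style)."""
    sig = hmac.new(key, payload.encode(), hashlib.sha256).digest()
    return f"{b64url(payload.encode())}.{b64url(sig)}"

def verify(token: str, key: bytes) -> bool:
    """Any node with `key` can validate the token -- no session lookup."""
    payload_b64, _ = token.rsplit(".", 1)
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    payload = base64.urlsafe_b64decode(padded).decode()
    return hmac.compare_digest(token, sign(payload, key))

key = b"shared-cluster-key"            # placeholder key
token = sign('{"sub":"alice"}', key)   # issued by node 1
print(verify(token, key))              # node 2 validates: True
print(verify(token, b"wrong-key"))     # wrong key: False
```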

Data Storage and Replication

Identity Graph Database

The Identity Graph stores all discovered identity data. In clustered deployments:

| Storage Option | Replication | Recommended For |
| --- | --- | --- |
| Shared storage (SAN/NAS) | Storage-level | Active-passive clusters |
| Replicated storage | Application-level | Active-active clusters |
| External database | Database clustering | Large-scale deployments |

Key-Value Store

The KV store maintains configuration and session data. Options include:

  • Embedded (default): Replicated via NATS JetStream
  • External Redis: Redis Cluster or Redis Sentinel for HA

Backup Strategy

| Data Type | Backup Method | Frequency |
| --- | --- | --- |
| Identity Graph | Database dump or snapshot | Daily |
| Configuration | File backup or export | After changes |
| Credentials vault | Encrypted backup | Daily |

NOTE

Hydden provides built-in backup APIs. See Backup & Restore API for automation options.

High Availability Procedures

Adding a Node to the Cluster

Purpose: Expand cluster capacity or replace a failed node.

Prerequisites:

  • New server meets hardware requirements
  • Network connectivity to existing nodes
  • Hydden installer package

Steps:

  1. Install Hydden on the new server. Run the installer as described in Linux Server or Windows Server.

  2. Select Join existing cluster during bootstrap. The bootstrap wizard detects cluster mode.

  3. Enter the cluster join token. Obtain the token from an existing node's admin interface under Settings > Cluster.

  4. Wait for synchronization. The new node synchronizes data from existing nodes. This may take several minutes depending on data size.

  5. Verify node health. Check the cluster status page to confirm the new node shows Healthy.

Result: The new node joins the cluster and begins serving requests after synchronization completes.

Removing a Node from the Cluster

Purpose: Decommission a node for maintenance or replacement.

Steps:

  1. Drain the node. Navigate to Settings > Cluster > [Node Name] and click Drain. The node stops accepting new requests.

  2. Wait for active connections to complete. Monitor the connection count until it reaches zero.

  3. Remove the node. Click Remove from Cluster. The node disconnects from the NATS cluster.

  4. Update load balancer configuration. Remove the node from the backend server pool.

Result: The node is safely removed without service disruption.

Failover Testing

Test failover procedures regularly to ensure they work correctly during actual failures.

Expected timing: Node failure detection takes up to 60 seconds (OfflineThreshold). Load balancer failover depends on your health check configuration (typically 15–30 seconds additional).

  1. Schedule a maintenance window. Allow at least 15 minutes for the full test cycle.

  2. Simulate node failure.

    • Option A: Stop the Hydden service on one node (systemctl stop hydden or Stop-Service Hydden)
    • Option B: Block the cluster ports (22100, 22103, and 22104) with firewall rules
  3. Start timing and monitor.

    • Note the exact time of simulated failure
    • Watch the cluster status page for node state changes
    • Expected: Node marked "Unhealthy" within 60 seconds
  4. Verify automatic failover.

    • Confirm remaining nodes continue serving requests (test API calls)
    • Check that the load balancer marks the failed node as unhealthy
    • Verify client connections are redirected to healthy nodes
    • Monitor cluster logs for error messages: journalctl -u hydden -f (Linux) or Event Viewer (Windows)
  5. Restore the node.

    • Restart the Hydden service or remove firewall rules
    • Verify the node rejoins the cluster automatically
    • Expected: Node marked "Healthy" within 30 seconds of restart
  6. Document results. Record actual failover time and compare to expected thresholds:

    | Metric | Expected | Actual |
    | --- | --- | --- |
    | Failure detection | ≤ 60 seconds | |
    | Load balancer failover | ≤ 30 seconds | |
    | Total client impact | ≤ 90 seconds | |
    | Node rejoin time | ≤ 30 seconds | |
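The timing measurements in step 6 can be automated with a small polling harness (a sketch; get_node_state is a stand-in for whatever status source you use, such as the cluster status page or API):

```python
# Poll a node-status source and measure how long a state change takes.
# `get_node_state` below is a placeholder for your real status check
# (cluster status API, load balancer stats, etc.).
import time

def time_until(state_fn, expected: str, timeout: float = 120.0,
               poll: float = 1.0):
    """Seconds until state_fn() returns `expected`, or None on timeout."""
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        if state_fn() == expected:
            return time.monotonic() - start
        time.sleep(poll)
    return None

# Example wiring (placeholder status function):
# detection = time_until(lambda: get_node_state("node2"), "Unhealthy")
# print(f"Failure detection took {detection:.0f}s (expected <= 60s)")
```

Start the timer at the moment you simulate the failure, then record the returned value in the Actual column of the results table.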

Disaster Recovery

Recovery Point Objective (RPO)

| Replication Type | Typical RPO |
| --- | --- |
| Synchronous (same site) | 0 (no data loss) |
| Asynchronous (cross-site) | 1–5 minutes |
| Backup-based | Last backup interval |

Recovery Time Objective (RTO)

| Scenario | Typical RTO |
| --- | --- |
| Single node failure (HA cluster) | Seconds |
| Site failure (DR site available) | 5–30 minutes |
| Full restore from backup | 1–4 hours |

DR Failover Procedure

Purpose: Activate the disaster recovery site when the primary site is unavailable.

Prerequisites:

  • DR site is operational and synchronized
  • DNS or global load balancer access

Steps:

  1. Confirm primary site is unavailable. Verify the outage is not a network issue before initiating DR failover.

  2. Update DNS or global load balancer. Point traffic to the DR site. Example:

    • Update hydden.yourcompany.com to resolve to DR site IPs
    • Or update global load balancer to prefer DR backend
  3. Verify DR site is serving requests. Access the Hydden portal and confirm functionality.

  4. Notify stakeholders. Inform users that the system is operating from the DR site.

  5. Plan failback. Once the primary site is restored, schedule failback during a maintenance window.

Result: Operations continue from the DR site with minimal data loss (based on replication lag).

Monitoring and Alerting

Key Metrics

| Metric | Warning Threshold | Critical Threshold |
| --- | --- | --- |
| Node health | Any node unhealthy | Quorum lost |
| CPU utilization | > 70% | > 90% |
| Memory utilization | > 75% | > 90% |
| Disk utilization | > 70% | > 85% |
| Replication lag | > 30 seconds | > 5 minutes |
| API response time | > 500 ms | > 2000 ms |
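The numeric thresholds above translate naturally into alert rules. A sketch of the evaluation logic (threshold values copied from the table; the metric names are illustrative, not Hydden's exported metric names):

```python
# Evaluate a metric sample against the warning/critical thresholds above.
# Values are (warning, critical) in the metric's native unit.
THRESHOLDS = {
    "cpu_percent":       (70, 90),
    "memory_percent":    (75, 90),
    "disk_percent":      (70, 85),
    "replication_lag_s": (30, 300),    # 30 s warning, 5 min critical
    "api_response_ms":   (500, 2000),
}

def severity(metric: str, value: float) -> str:
    """Return 'ok', 'warning', or 'critical' for one sample."""
    warn, crit = THRESHOLDS[metric]
    if value > crit:
        return "critical"
    if value > warn:
        return "warning"
    return "ok"

print(severity("cpu_percent", 65))         # ok
print(severity("disk_percent", 80))        # warning
print(severity("replication_lag_s", 400))  # critical
```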

Integration with Monitoring Tools

Hydden exposes metrics in Prometheus format at /metrics. Configure your monitoring stack to scrape this endpoint:

```yaml
# Example Prometheus scrape config
scrape_configs:
  - job_name: 'hydden'
    static_configs:
      - targets:
        - 'node1.internal:443'
        - 'node2.internal:443'
        - 'node3.internal:443'
    scheme: https
    tls_config:
      insecure_skip_verify: true  # Use proper CA in production
```

Troubleshooting

| Issue | Possible Cause | Resolution |
| --- | --- | --- |
| Node fails to join cluster | Firewall blocking NATS ports | Open TCP 22100, 22103, and 22104 between nodes |
| Node join times out | StartupTimeout (10 s) exceeded | Check network connectivity; increase StartupTimeout if needed |
| Split-brain condition | Network partition | Restore network connectivity; the cluster will auto-heal after 60 s |
| Slow replication (same site) | High network latency | Ensure < 10 ms latency between nodes; check network utilization |
| Slow replication (cross-site) | NetworkTimeout threshold | Increase NetworkTimeout beyond 300 s for high-latency links |
| RPC connection failures | RpcConnectTimeout (10 s) exceeded | Check inter-node connectivity; verify no packet loss |
| Load balancer marks all nodes unhealthy | Health check misconfigured | Verify the /health endpoint returns HTTP 200 |
| Node marked offline unexpectedly | Missed 4 consecutive probes (60 s) | Check node health; verify network stability |
| Data inconsistency after failover | Asynchronous replication lag | Accept data from the most recent node; reconcile manually if needed |
| Quorum lost in multi-site | Fewer than (n/2)+1 nodes available | Restore connectivity to additional nodes; check the witness node |
