Resolving the 'Witness server unreachable' HA Alert

Overview

The ScaleArc health monitor reports that the witness server is unreachable by logging the following alert:

HA Alert: Witness server unreachable

The ScaleArc logs also contain errors such as the following, usually with the same error repeated several times:

ERROR FAILED: Witness server setup (65280, 'ssh_exchange_identification: read: Connection reset by peer\r').
ERROR FAILED: Witness server setup (65280, 'ssh_exchange_identification: Connection closed by remote host\r').
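To gauge how often the failure recurs, the matching lines can be counted in a copy of the ScaleArc log. The following is a minimal sketch only: the function name count_witness_errors is hypothetical, and the log file path must be supplied by the caller because this article does not state where the log is stored.

```shell
# Count the lines in a log file that report a failed witness server setup.
# The path is an argument on purpose: the log location is not given in this
# article, so none is assumed here.
count_witness_errors() {
    local logfile="$1"
    # -c prints the number of matching lines; -F matches the text literally.
    grep -cF 'ERROR FAILED: Witness server setup' "$logfile"
}
```

For example, count_witness_errors /path/to/exported.log prints the number of matching lines in that file.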

Environment

  • MSSQL Server with an Always On cluster
  • HA enabled with 'ScaleArc Cluster' configured as the fencing option
  • HA enabled with 'SSH Server' configured as the fencing option

Solution

This alert can be raised in MSSQL HA environments that use a ScaleArc Cluster as the fencing option, or in environments where the SSH Server configured as the witness server experiences a network outage.

The 'Witness server unreachable' alert is expected only when the cluster is down in an HA setup that relies on a ScaleArc cluster to resolve split-brain situations.

If the alert persists while the cluster is up, ensure that both ScaleArc nodes have full access to the SQL Servers. HA creates a small database whose name is prefixed with SA_ (e.g. SA_024571ec_6b8), along with a table in that database, and uses it to continuously update and query the HA status.

This database should be made part of the AlwaysON Availability Group in SQL Server. Refer to this external article for detailed instructions on Creating AlwaysOn Availability Groups in SQL Server.

Tip: The Always On listener should be reachable from ScaleArc if the SQL Browser service is running on the database servers. In an Always On cluster, connectivity to the listener is required so that ScaleArc can track the cluster status at all times. When removing a database server, it is strongly recommended to remove it from the ScaleArc cluster first and only then from the Always On cluster. When adding a server, add it to the Always On cluster first; ScaleArc will automatically report that a new server was added and prompt you to add it to the cluster.

ScaleArc uses the HA cluster name to name this database when HA fencing is configured to use a ScaleArc Cluster, which is the default and recommended option. You can find the HA cluster name by running the following command in an SSH session on the Primary or Secondary node:

# pcs status | grep name
Cluster name: SA_aeaa3fae_30e
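If only the name itself is needed, for instance to compare it with the SA_* database name on the SQL Server side, the output can be reduced to that single value. This is a small sketch; the helper name cluster_name_from_status is hypothetical, and it assumes the "Cluster name: ..." line format shown above.

```shell
# Print the cluster name from `pcs status` output supplied on stdin.
# Expects a line of the form "Cluster name: SA_aeaa3fae_30e".
cluster_name_from_status() {
    # Split on the colon and any following spaces, then stop at the first match.
    awk -F': *' '/^Cluster name:/ { print $2; exit }'
}

# On a ScaleArc node (requires pcs, as in the command above):
#   pcs status | cluster_name_from_status
```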

Alternatively, to achieve HA independent of the cluster status, you can configure one of the other two supported fencing options, an External database or SSH fencing, as documented in Set Up High Availability and shown below:

Fencing_options2.png

The alert can also occur in the SSH Server fencing scenario. In that case, the root cause may be rate limiting at the SSH server or at network/firewall equipment between the ScaleArc servers and the SSH witness server.

Further troubleshooting requires the customer to provide the operating system running on the SSH witness server, as well as the following log and configuration files:

/var/log/messages
/var/log/auth.log
/etc/ssh/sshd_config
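When reviewing sshd_config, one place to look is OpenSSH's own connection throttling: the MaxStartups directive limits concurrent unauthenticated connections, and connections dropped because of it can show up on the client as 'Connection closed by remote host'. Treat this as one possibility to rule out, not as the confirmed cause. The sketch below lists the relevant directives; the function name show_sshd_limits is hypothetical, and it assumes an OpenSSH server.

```shell
# Print the uncommented throttling-related directives from an sshd_config file.
# Directive names are case-insensitive in sshd_config, hence grep -i.
show_sshd_limits() {
    local config="${1:-/etc/ssh/sshd_config}"
    grep -iE '^[[:space:]]*(MaxStartups|MaxSessions|LoginGraceTime)[[:space:]]' "$config"
}
```

If the function prints nothing, none of these directives is set explicitly in that file, so the server's built-in defaults apply unless another included file overrides them.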

If the above is insufficient to isolate the root cause, further investigation can be carried out by taking tcpdump traffic captures with the help of the ScaleArc Support team.

Note: To avoid these alerts and ensure correct, fast HA failover when SSH Server is selected for fencing, review the network path between the ScaleArc instances and the SSH witness server for both stability and availability.
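A first stability check can be a repeated probe that confirms the witness server still answers with its SSH identification banner, since the errors in this article occur during that exchange. The following is a rough sketch, not a ScaleArc tool: the function name probe_ssh_banner and its parameters are hypothetical, and it assumes bash (for /dev/tcp) and the coreutils timeout command on the machine running the probe.

```shell
# Open ATTEMPTS connections to HOST:PORT, one at a time, and report how many
# did not return an SSH identification banner (a line starting with "SSH-").
# Arguments: host [port=22] [attempts=10] [delay_seconds=1]
probe_ssh_banner() {
    local host="$1" port="${2:-22}" attempts="${3:-10}" delay="${4:-1}"
    local failures=0 i=0 banner
    while [ "$i" -lt "$attempts" ]; do
        # Inside the inner script, $0 and $1 are the host and port passed below.
        banner="$(timeout 5 bash -c \
            'exec 3<>"/dev/tcp/$0/$1" && IFS= read -r line <&3 && printf "%s" "$line"' \
            "$host" "$port" 2>/dev/null)" || banner=""
        case "$banner" in
            SSH-*) ;;                          # banner received
            *) failures=$((failures + 1)) ;;   # refused, reset, closed, or timed out
        esac
        i=$((i + 1))
        sleep "$delay"
    done
    echo "failed=${failures} attempts=${attempts}"
}
```

For example, probe_ssh_banner witness.example.com 22 20 (with a placeholder hostname) makes 20 attempts one second apart. Intermittent failures would point toward instability or throttling on the path and would justify the deeper capture-based investigation described above.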

Testing

The 'Witness server unreachable' alert should clear after the SA_* database is added to the AlwaysOn Availability Group in MSSQL, or after either of the other two supported fencing options described in the Solution section is configured.
