GRID Health

GRID health checks

A GRID health monitoring feature and its corresponding settings were added to ISL Conference Proxy in version 4.4.2120.112. GRID health status can be checked here:

Health Check

The settings are configurable in multiple parts of the ISL Conference Proxy configuration page:

Security
GRID settings
Load balancing

Health Check

The GRID health page becomes available at <server-address>/users/main/network_status.html (image above) after the setting "Enable GRID health api in /health/grid" is enabled in the Security settings section of the ICP Administration (<server-address>/conf).

Security

The following settings can be found on your ISL Conference Proxy configuration page in Configuration -> Security.

Enable GRID health api in /health/grid (Enable this setting to expose the health status through the HTTP API at /health/grid)
GRID health api secret (Set a secret to protect the API; if the secret is not empty, HTTP API requests must provide it as a URL parameter)
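
As a rough illustration of how a monitoring script might address the secret-protected API, the sketch below builds a request URL for one of the check locations listed under "GRID health check errors". This page does not state the name of the URL parameter that carries the secret, so the name `secret` used here is only a placeholder assumption, as are the helper `grid_health_url` and the server address in the usage line.

```python
from urllib.parse import urlencode


def grid_health_url(server_address, check="overall", secret="", secret_param="secret"):
    """Build the URL of a GRID health check such as /health/grid/overall.

    ASSUMPTION: the documentation only says the secret is passed as a URL
    parameter; the parameter name ("secret") is a guess and may differ.
    """
    url = f"{server_address.rstrip('/')}/health/grid/{check}"
    if secret:
        # urlencode escapes characters that are not safe in a query string.
        url += "?" + urlencode({secret_param: secret})
    return url


if __name__ == "__main__":
    # Hypothetical server address and secret, for illustration only.
    print(grid_health_url("https://icp.example.com", "servers/tags", "my-secret"))
```

If the secret setting is left empty, the helper simply returns the bare check URL.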

GRID settings

The following settings can be found on your ISL Conference Proxy configuration page in Configuration -> GRID -> Settings.

GRID server tags (Specify tags for each server. These tags are then used when checking the number of servers and services per tag. A server can have multiple tags (format: $TAG1$,$TAG2$))
Delay before first GRID health check (secs) (Specify the delay before the first GRID health check is performed after a server restart; until this delay has elapsed, GRID health issues are neither detected nor exposed through the HTTP API)
Minimum number of connected servers in GRID (Specify the minimum number of connected servers in GRID with enabled load)
Minimum number of connected servers per tags (Specify the minimum number of connected servers in GRID with enabled load per tag (format: $TAG$=$MIN_SERVERS$))
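
To make the tag formats more concrete, here is a minimal sketch of how a per-tag minimum could be evaluated. It is not ISL Conference Proxy's implementation: the function names, the server IDs and the tag names are all hypothetical, and the input is assumed to contain only connected servers with enabled load.

```python
from collections import Counter


def parse_tags(value):
    """Split a '$TAG1$,$TAG2$' style server tag string into tag names."""
    return [tag.strip() for tag in value.split(",") if tag.strip()]


def parse_tag_limit(value):
    """Parse one '$TAG$=$MIN_SERVERS$' entry into a (tag, minimum) pair."""
    tag, _, minimum = value.partition("=")
    return tag.strip(), int(minimum)


def check_tag_minimums(server_tags, limits):
    """Return one error entry per tag that has fewer servers than its limit.

    server_tags maps a server ID to its list of tags; limits maps a tag to
    the minimum number of servers required for that tag.
    """
    # Count each server once per tag, even if a tag was listed twice.
    counts = Counter(tag for tags in server_tags.values() for tag in set(tags))
    return [
        {"count": counts[tag], "limit": limit, "name": tag}
        for tag, limit in limits.items()
        if counts[tag] < limit
    ]


if __name__ == "__main__":
    # Hypothetical setup: three servers, two tag limits.
    servers = {1: parse_tags("eu,prod"), 2: parse_tags("eu"), 3: parse_tags("us")}
    limits = dict([parse_tag_limit("eu=2"), parse_tag_limit("us=2")])
    print(check_tag_minimums(servers, limits))
```

The error entries mirror the shape of the /health/grid/servers/tags error response so that the relationship between the settings and the reported values is easier to see.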

Load balancing

The following settings can be found on your ISL Conference Proxy configuration page in Configuration -> GRID -> Load balancing -> Service settings.

Minimum number of servers providing service in GRID (Specify the minimum number of connected servers that provide the given service in GRID)
Minimum number of servers providing service per tags (Specify the minimum number of connected servers that provide the given service per tag (format: $TAG$=$MIN_SERVERS$))

GRID health check errors

The GRID health check runs periodically: once every minute, all checks are performed. If any check fails, log entries are created and the error is shown on the configuration page. The GRID health check also runs whenever GRID settings are changed, so the user gets immediate feedback if the setup no longer meets the configured limits.

Each of the checks listed below is exposed as a simple web page. The response code is 200 if the check passes and 500 otherwise. If the "no_err" query argument is set in the request, 200 is returned on error as well.
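
A monitoring client could interpret these responses along the following lines. This is only a sketch: `fetch_check` and `is_healthy` are hypothetical helper names, and it is assumed (not confirmed here) that a request with "no_err" still returns the same JSON body, which is why the body's "status" field is checked in addition to the response code.

```python
import json
from urllib.error import HTTPError
from urllib.request import urlopen


def fetch_check(url, timeout=10):
    """Fetch a health check URL and return (status_code, body_text)."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8")
    except HTTPError as err:
        # urllib raises HTTPError for a 500 response; the body is still readable.
        return err.code, err.read().decode("utf-8")


def is_healthy(status_code, body):
    """Return True only for a 200 response whose status is grid_health_ok.

    Checking the body as well covers requests made with "no_err", where a
    failing check is reported with response code 200.
    """
    try:
        payload = json.loads(body)
    except ValueError:
        return False  # Not valid JSON, so the check cannot be trusted.
    return status_code == 200 and payload.get("status") == "grid_health_ok"
```

For example, `is_healthy(*fetch_check(url))` would yield a single boolean for a given check URL.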

Description	Location	All OK response (200)	Error response (500)
Check all	/health/grid/overall	{"status":"grid_health_ok"}	{"status":"grid_health_degraded"}
Minimum connected servers with load in GRID	/health/grid/servers/connected	{"status":"grid_health_ok"}	{"errors":[{"count":$COUNT$,"limit":$LIMIT$}],"status":"grid_health_degraded"}
Minimum connected servers with load per tags	/health/grid/servers/tags	{"status":"grid_health_ok"}	{"errors":[{"count":$COUNT$,"limit":$LIMIT$,"name":"$TAG$"}, ...],"status":"grid_health_degraded"}
Disconnected servers with load in GRID	/health/grid/servers/disconnected	{"status":"grid_health_ok"}	{"errors":[{"dtime":$TIME$,"server":$SERVER$}, ...],"status":"grid_health_degraded"}
Minimum servers with service in GRID	/health/grid/services/total	{"status":"grid_health_ok"}	{"errors":[{"count":$COUNT$,"limit":$LIMIT$,"name":"$SERVICE$"}, ...],"status":"grid_health_degraded"}
Minimum servers with service per tags	/health/grid/services/tags	{"status":"grid_health_ok"}	{"errors":[{"count":$COUNT$,"limit":$LIMIT$,"service":"$SERVICE$","tag":"$TAG$"}, ...],"status":"grid_health_degraded"}

Legend:

$SERVER$ - Server ID number
$TAG$ - Name of the tag as specified in settings
$SERVICE$ - Name of the service (core_login, api, ...)
$TIME$ - Number of seconds the server has been disconnected
$COUNT$ - Number of servers
$LIMIT$ - Current limit for this check
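
The error responses in the table can be turned into readable messages with a few lines of code. The sketch below relies only on the field names shown in the table; the function name `describe_errors`, the message wording and the sample values in the usage lines are illustrative assumptions.

```python
import json


def describe_errors(body):
    """Turn a GRID health error response into a list of readable messages."""
    payload = json.loads(body)
    messages = []
    for err in payload.get("errors", []):
        if "dtime" in err:
            # Entry from /health/grid/servers/disconnected.
            messages.append(f"server {err['server']} disconnected for {err['dtime']} s")
        else:
            # Count/limit entries; name, service and tag are present only
            # for some of the checks, so use whichever fields exist.
            label = "/".join(str(err[key]) for key in ("service", "tag", "name") if key in err)
            where = f" for {label}" if label else ""
            messages.append(f"{err['count']} server(s){where}, minimum is {err['limit']}")
    return messages


if __name__ == "__main__":
    # Sample values substituted for the $...$ placeholders, for illustration only.
    sample = '{"errors":[{"count":1,"limit":2,"service":"api","tag":"eu"}],"status":"grid_health_degraded"}'
    for line in describe_errors(sample):
        print(line)
```

An all-OK response has no "errors" field, so the function returns an empty list for it.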