Support for Galera requires a lot of features to be added to ProxySQL , from checking if a node is in sync state, if it is a garbd process, auto-detect of new nodes, send writes to only one node, remote DC support, and more.
Furthermore, as Galera is a clustering solution and not all setups have the same requirements, to support all the possible configurations seemed to be a too complex task.
Support for Galera in all its possible configurations wasn't the only request that seems to be complex to implement in the core of ProxySQL, but also additional requests not related to Galera, like changing the weight on a node while overloaded (as measured by different metrics! CPU usage, disk IO, number of threads) or removing a node from a pool while a backup was running.
I always advised that ProxySQL is easily configurable, and anybody can write a script that configures it at runtime through the Admin interface. The script can have any complex logic or algorithm, and then configures ProxySQL accordingly. In short, I always advised that ProxySQL should be a core component, and tools can be built around it.
But there was still a missing part: a glue to link ProxySQL and the external script(s).
For this reason, recently ProxySQL supported a new feature: jobs Scheduler.
Scheduler is a cron-like solution embedded directly inside ProxySQL. For more details, please have a look at the documentation.
Scheduler can run any sort of external executable, therefore we can write a script that monitor all backends of a specific hostgroup (a Galera cluster), remove hosts that are not in sync and re-add them once in sync.
A ready to use script is distributed with ProxySQL itself: proxysql_galera_checker.sh .
The script is inspired by Percona clustercheck, and accepts 3 arguments mandatory and one optional:
$ ./proxysql_galera_checker.sh
Usage: ./proxysql_galera_checker.sh <hostname> <port> <hostgroup_id> <log_file>
When running the script we need to pass ProxySQL's hostname and the port where the Admin interface is running. The script will connect (using hardcoded credential) to the Admin interface, it will retrieves credentials to connect to the backends and a list of backends for a specific hostgroup, it will check the value of wsrep_local_state on all these backends, and if they change status will automatically reconfigure ProxySQL.
As an example, we have a Galera cluster with 3 nodes:
Admin> SELECT hostgroup_id,hostname,port,status FROM mysql_servers;
+--------------+-------------+------+--------+
| hostgroup_id | hostname | port | status |
+--------------+-------------+------+--------+
| 0 | 10.10.10.11 | 3306 | ONLINE |
| 0 | 10.10.10.12 | 3306 | ONLINE |
| 0 | 10.10.10.13 | 3306 | ONLINE |
+--------------+-------------+------+--------+
3 rows in set (0.00 sec)
Next, we need to configure the scheduler to run the script at regular interval. In this example, we will run the job every 10000 milliseconds (10 seconds):
Admin> SHOW CREATE TABLE scheduler\G
*************************** 1. row ***************************
table: scheduler
Create Table: CREATE TABLE scheduler (
id INTEGER NOT NULL,
interval_ms INTEGER CHECK (interval_ms>=100 AND interval_ms<=100000000) NOT NULL,
filename VARCHAR NOT NULL,
arg1 VARCHAR,
arg2 VARCHAR,
arg3 VARCHAR,
arg4 VARCHAR,
arg5 VARCHAR,
PRIMARY KEY(id))
1 row in set (0.00 sec)
Admin> INSERT INTO scheduler(id,interval_ms,filename,arg1,arg2,arg3,arg4) VALUES
(1,'10000','/var/lib/proxysql/proxysql_galera_checker.sh','127.0.0.1','6032','0',
'/var/lib/proxysql/proxysql_galera_checker.log');
Query OK, 1 row affected (0.00 sec)
Admin> select * from scheduler\G
*************************** 1. row ***************************
id: 1
interval_ms: 10000
filename: /var/lib/proxysql/proxysql_galera_checker.sh
arg1: 127.0.0.1
arg2: 6032
arg3: 0
arg4: /var/lib/proxysql/proxysql_galera_checker.log
arg5: NULL
1 row in set (0.00 sec)
Admin> SELECT * FROM runtime_scheduler;
Empty set (0.00 sec)
At this stage the job is not loaded at runtime yet. So next we must run LOAD SCHEDULER TO RUNTIME:
Admin> LOAD SCHEDULER TO RUNTIME;
Query OK, 0 rows affected (0.00 sec)
Admin> select * from runtime_scheduler\G
*************************** 1. row ***************************
id: 1
interval_ms: 10000
filename: /var/lib/proxysql/proxysql_galera_checker.sh
arg1: 127.0.0.1
arg2: 6032
arg3: 0
arg4: /var/lib/proxysql/proxysql_galera_checker.log
arg5: NULL
1 row in set (0.00 sec)
The first test we will run is to shutdown a node:
vagrant@n3:~$ sudo service mysql stop
* Stopping MySQL (Percona XtraDB Cluster) mysqld
From the error log of this check, we can see that the script detected that a node is not reachable, and set it to offline:
vagrant@n1:~$ sudo tail /var/lib/proxysql/proxysql_galera_checker.log -n 20
Sat Jul 16 18:41:33 UTC 2016 Check server 10.10.10.12:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:41:34 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:41:44 UTC 2016 Check server 10.10.10.11:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:41:44 UTC 2016 Check server 10.10.10.12:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:41:44 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:41:54 UTC 2016 Check server 10.10.10.11:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:41:54 UTC 2016 Check server 10.10.10.12:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:41:54 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:42:04 UTC 2016 Check server 10.10.10.11:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:42:04 UTC 2016 Check server 10.10.10.12:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:42:04 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:42:14 UTC 2016 Check server 10.10.10.11:3306 , status ONLINE , wsrep_local_state 4
ERROR 2003 (HY000): Can't connect to MySQL server on '10.10.10.12' (111)
Sat Jul 16 18:42:14 UTC 2016 Check server 10.10.10.12:3306 , status ONLINE , wsrep_local_state
Sat Jul 16 18:42:14 UTC 2016 Changing server 10.10.10.12:3306 to status OFFLINE_SOFT
Sat Jul 16 18:42:14 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:42:25 UTC 2016 Check server 10.10.10.11:3306 , status ONLINE , wsrep_local_state 4
ERROR 2003 (HY000): Can't connect to MySQL server on '10.10.10.12' (111)
Sat Jul 16 18:42:25 UTC 2016 Check server 10.10.10.12:3306 , status OFFLINE_SOFT , wsrep_local_state
Sat Jul 16 18:42:25 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Also mysql_servers confirms the change:
Admin> SELECT hostgroup_id,hostname,port,status FROM mysql_servers;
+--------------+-------------+------+--------------+
| hostgroup_id | hostname | port | status |
+--------------+-------------+------+--------------+
| 0 | 10.10.10.11 | 3306 | ONLINE |
| 0 | 10.10.10.12 | 3306 | OFFLINE_SOFT |
| 0 | 10.10.10.13 | 3306 | ONLINE |
+--------------+-------------+------+--------------+
3 rows in set (0.00 sec)
To continue our tests, we will set a host in desync mode:
mysql> show global variables like 'wsrep_desync';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| wsrep_desync | OFF |
+---------------+-------+
1 row in set (0.00 sec)
mysql> set global wsrep_desync=ON;
Query OK, 0 rows affected (0.05 sec)
mysql> show global variables like 'wsrep_desync';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| wsrep_desync | ON |
+---------------+-------+
1 row in set (0.00 sec)
Also in this case the script will notice it:
vagrant@n1:~$ sudo tail /var/lib/proxysql/proxysql_galera_checker.log -n 20
ERROR 2003 (HY000): Can't connect to MySQL server on '10.10.10.12' (111)
Sat Jul 16 18:48:01 UTC 2016 Check server 10.10.10.12:3306 , status OFFLINE_SOFT , wsrep_local_state
Sat Jul 16 18:48:01 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:48:11 UTC 2016 Check server 10.10.10.11:3306 , status ONLINE , wsrep_local_state 4
ERROR 2003 (HY000): Can't connect to MySQL server on '10.10.10.12' (111)
Sat Jul 16 18:48:11 UTC 2016 Check server 10.10.10.12:3306 , status OFFLINE_SOFT , wsrep_local_state
Sat Jul 16 18:48:11 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:48:21 UTC 2016 Check server 10.10.10.11:3306 , status ONLINE , wsrep_local_state 4
ERROR 2003 (HY000): Can't connect to MySQL server on '10.10.10.12' (111)
Sat Jul 16 18:48:21 UTC 2016 Check server 10.10.10.12:3306 , status OFFLINE_SOFT , wsrep_local_state
Sat Jul 16 18:48:21 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:48:31 UTC 2016 Check server 10.10.10.11:3306 , status ONLINE , wsrep_local_state 2
Sat Jul 16 18:48:31 UTC 2016 Changing server 10.10.10.11:3306 to status OFFLINE_SOFT
ERROR 2003 (HY000): Can't connect to MySQL server on '10.10.10.12' (111)
Sat Jul 16 18:48:31 UTC 2016 Check server 10.10.10.12:3306 , status OFFLINE_SOFT , wsrep_local_state
Sat Jul 16 18:48:31 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:48:41 UTC 2016 Check server 10.10.10.11:3306 , status OFFLINE_SOFT , wsrep_local_state 2
ERROR 2003 (HY000): Can't connect to MySQL server on '10.10.10.12' (111)
Sat Jul 16 18:48:42 UTC 2016 Check server 10.10.10.12:3306 , status OFFLINE_SOFT , wsrep_local_state
Sat Jul 16 18:48:42 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
And mysql_servers will confirm the change:
Admin> SELECT hostgroup_id,hostname,port,status FROM mysql_servers;
+--------------+-------------+------+--------------+
| hostgroup_id | hostname | port | status |
+--------------+-------------+------+--------------+
| 0 | 10.10.10.11 | 3306 | OFFLINE_SOFT |
| 0 | 10.10.10.12 | 3306 | OFFLINE_SOFT |
| 0 | 10.10.10.13 | 3306 | ONLINE |
+--------------+-------------+------+--------------+
3 rows in set (0.00 sec)
To continue this proof of concept, we will now put the last node in sync:
mysql> show global variables like 'wsrep_desync';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| wsrep_desync | ON |
+---------------+-------+
1 row in set (0.00 sec)
mysql> set global wsrep_desync=OFF;
Query OK, 0 rows affected (0.00 sec)
mysql> show global variables like 'wsrep_desync';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| wsrep_desync | OFF |
+---------------+-------+
1 row in set (0.00 sec)
The script will detect it that the node is back ONLINE:
vagrant@n1:~$ sudo tail /var/lib/proxysql/proxysql_galera_checker.log -n 20
ERROR 2003 (HY000): Can't connect to MySQL server on '10.10.10.12' (111)
Sat Jul 16 18:49:53 UTC 2016 Check server 10.10.10.12:3306 , status OFFLINE_SOFT , wsrep_local_state
Sat Jul 16 18:49:53 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:50:03 UTC 2016 Check server 10.10.10.11:3306 , status OFFLINE_SOFT , wsrep_local_state 2
ERROR 2003 (HY000): Can't connect to MySQL server on '10.10.10.12' (111)
Sat Jul 16 18:50:03 UTC 2016 Check server 10.10.10.12:3306 , status OFFLINE_SOFT , wsrep_local_state
Sat Jul 16 18:50:03 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:50:13 UTC 2016 Check server 10.10.10.11:3306 , status OFFLINE_SOFT , wsrep_local_state 2
ERROR 2003 (HY000): Can't connect to MySQL server on '10.10.10.12' (111)
Sat Jul 16 18:50:13 UTC 2016 Check server 10.10.10.12:3306 , status OFFLINE_SOFT , wsrep_local_state
Sat Jul 16 18:50:13 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:50:23 UTC 2016 Check server 10.10.10.11:3306 , status OFFLINE_SOFT , wsrep_local_state 2
ERROR 2003 (HY000): Can't connect to MySQL server on '10.10.10.12' (111)
Sat Jul 16 18:50:24 UTC 2016 Check server 10.10.10.12:3306 , status OFFLINE_SOFT , wsrep_local_state
Sat Jul 16 18:50:24 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:50:34 UTC 2016 Check server 10.10.10.11:3306 , status OFFLINE_SOFT , wsrep_local_state 4
Sat Jul 16 18:50:34 UTC 2016 Changing server 10.10.10.11:3306 to status ONLINE
ERROR 2003 (HY000): Can't connect to MySQL server on '10.10.10.12' (111)
Sat Jul 16 18:50:34 UTC 2016 Check server 10.10.10.12:3306 , status OFFLINE_SOFT , wsrep_local_state
Sat Jul 16 18:50:34 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
And correctly reconfigure mysql_servers:
Admin> SELECT hostgroup_id,hostname,port,status FROM mysql_servers;
+--------------+-------------+------+--------------+
| hostgroup_id | hostname | port | status |
+--------------+-------------+------+--------------+
| 0 | 10.10.10.11 | 3306 | ONLINE |
| 0 | 10.10.10.12 | 3306 | OFFLINE_SOFT |
| 0 | 10.10.10.13 | 3306 | ONLINE |
+--------------+-------------+------+--------------+
3 rows in set (0.01 sec)
Finally, we will restart the node shutdown previously:
vagrant@n3:~$ sudo service mysql start
* Starting MySQL (Percona XtraDB Cluster) database server mysqld [ OK ]
Once again, the script will detect that a node is now healthy:
vagrant@n1:~$ sudo tail /var/lib/proxysql/proxysql_galera_checker.log -n 20
Sat Jul 16 18:51:15 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:51:25 UTC 2016 Check server 10.10.10.11:3306 , status ONLINE , wsrep_local_state 4
ERROR 2003 (HY000): Can't connect to MySQL server on '10.10.10.12' (111)
Sat Jul 16 18:51:25 UTC 2016 Check server 10.10.10.12:3306 , status OFFLINE_SOFT , wsrep_local_state
Sat Jul 16 18:51:25 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:51:35 UTC 2016 Check server 10.10.10.11:3306 , status ONLINE , wsrep_local_state 4
ERROR 2003 (HY000): Can't connect to MySQL server on '10.10.10.12' (111)
Sat Jul 16 18:51:35 UTC 2016 Check server 10.10.10.12:3306 , status OFFLINE_SOFT , wsrep_local_state
Sat Jul 16 18:51:35 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:51:45 UTC 2016 Check server 10.10.10.11:3306 , status ONLINE , wsrep_local_state 4
ERROR 2003 (HY000): Can't connect to MySQL server on '10.10.10.12' (111)
Sat Jul 16 18:51:45 UTC 2016 Check server 10.10.10.12:3306 , status OFFLINE_SOFT , wsrep_local_state
Sat Jul 16 18:51:45 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:51:55 UTC 2016 Check server 10.10.10.11:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:51:55 UTC 2016 Check server 10.10.10.12:3306 , status OFFLINE_SOFT , wsrep_local_state 4
Sat Jul 16 18:51:55 UTC 2016 Changing server 10.10.10.12:3306 to status ONLINE
Sat Jul 16 18:51:55 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:52:06 UTC 2016 Check server 10.10.10.11:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:52:06 UTC 2016 Check server 10.10.10.12:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:52:06 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
And correctly reconfigure mysql_servers:
Admin> SELECT hostgroup_id,hostname,port,status FROM mysql_servers;
+--------------+-------------+------+--------+
| hostgroup_id | hostname | port | status |
+--------------+-------------+------+--------+
| 0 | 10.10.10.11 | 3306 | ONLINE |
| 0 | 10.10.10.12 | 3306 | ONLINE |
| 0 | 10.10.10.13 | 3306 | ONLINE |
+--------------+-------------+------+--------+
3 rows in set (0.00 sec)
Finally, do not forget to save the job to disk:
Admin> SAVE SCHEDULER TO DISK;
Query OK, 0 rows affected (0.01 sec)
Conclusion
Complete Support for Galera Cluster in ProxySQL is still not ready.Although, the ability of ProxySQL to be easily reconfigurable at runtime, together with its Scheduler, will quickly extend ProxySQL capabilities, include a better integration with Galera in complex setups.
No comments:
Post a Comment