Wednesday, June 3, 2015

SQL Load Balancing Benchmark - Comparing Performance of ProxySQL vs MaxScale

In the MySQL ecosystem there are several load balancers that are also open source, and ProxySQL is one of the few proxies that works at the application layer and is therefore SQL-aware.
In this blog post we will benchmark ProxySQL against MaxScale, another popular proxy for MySQL.
The idea to compare ProxySQL vs MaxScale came after reading an interesting blog post by Krzysztof Książek, SQL Load Balancing Benchmark, comparing the performance of MaxScale vs HAProxy.

Disclaimer: ProxySQL is not GA yet, therefore please do not use it in production.



Sysbench setup

I wanted a sysbench setup similar to the one Krzysztof used in his benchmark, but it is slightly different:
a) instead of using a MySQL cluster with Galera, I set up a cluster with 1 master and 3 slaves. Since the workload was meant to be completely read-only and in-memory, the 2 setups are functionally identical;
b) instead of using AWS instances I used 4 physical servers: server A was running as the master, and servers B, C and D were running as slaves. Since the master was idle (remember, this is a read-only workload that uses only the slaves), I used the same box to also run sysbench and all the various proxies.

The benchmarks were executed running the following command:
./sysbench \
--test=./tests/db/oltp.lua \
--num-threads=$THREADS \
--max-requests=0 \
--max-time=600 \
--mysql-user=rcannao \
--mysql-password=rcannao \
--mysql-db=test \
--db-driver=mysql \
--oltp-tables-count=128 \
--oltp-read-only=on \
--oltp-skip-trx=on  \
--report-interval=1 \
--oltp-point-selects=100 \
--oltp-table-size=400000 \
--mysql-host=127.0.0.1 \
--mysql-port=$PORT \
run

The versions used are:
Percona Server 5.6.22
sysbench 0.5
ProxySQL at commit a47136e with debugging disabled
MaxScale 1.0.5 GA
HAProxy 1.4.15

ProxySQL and MaxScale: a few design differences


In the benchmark executed by Krzysztof, MaxScale was configured to listen on port 4006 where the service "RW Split Router" was running, and on port 4008 where the service "Read Connection Router" was running.
To my understanding:
a) RW Split Router performs read/write split, parsing the queries and tracking the state of the transaction;
b) Read Connection Router performs simple network forwarding, connecting clients to backends;
c) the two services need to listen on different ports to operate.

ProxySQL is, by design, different.

ProxySQL and RW split


ProxySQL performs a very simple query analysis to determine where each query needs to be sent.
ProxySQL decides where a query needs to be forwarded based on a user-configurable chain of rules, in which a DBA can specify various matching criteria like username, schemaname, whether there is an active transaction (a feature not completely implemented yet), and a regular expression to match the query.
Matching against a regular expression is faster than building a syntax tree, and having a chain of rules that match on either regexes or other attributes allows a great degree of flexibility compared to hardcoded routing policies.
Therefore, to implement a basic read/write split, ProxySQL was configured with the following rules:
a) all queries matching '^SELECT.*FOR UPDATE$' were sent to the master;
b) all queries not matching the previous rule but matching '^SELECT.*' were sent to the slaves;
c) by default, all traffic not matching any of the previous rules was sent to the master.

Considering the 3 rules listed above, all traffic generated by sysbench was always sent to the slaves.
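For illustration, the rules above map quite directly onto ProxySQL's admin interface. This is a minimal sketch: the admin port, credentials, table and column names shown here are those of later 1.x releases and may differ in the pre-GA build benchmarked in this post; hostgroup 0 is assumed to hold the master and hostgroup 1 the slaves.

# connect to the ProxySQL admin interface (assumed to listen on port 6032)
mysql -u admin -padmin -h 127.0.0.1 -P 6032 <<'EOF'
-- rule 1: SELECT ... FOR UPDATE is sent to the master (hostgroup 0)
INSERT INTO mysql_query_rules (rule_id, active, match_pattern, destination_hostgroup, apply)
VALUES (1, 1, '^SELECT.*FOR UPDATE$', 0, 1);
-- rule 2: any other SELECT is sent to the slaves (hostgroup 1)
INSERT INTO mysql_query_rules (rule_id, active, match_pattern, destination_hostgroup, apply)
VALUES (2, 1, '^SELECT.*', 1, 1);
-- anything not matched falls back to the user's default hostgroup (the master)
LOAD MYSQL QUERY RULES TO RUNTIME;
EOF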

Additionally, even though ProxySQL doesn't perform any syntax parsing to determine the target of a query, no matter what routing rules are in place it always performs a very simple query analysis to determine what type of statement is being executed, and generates statistics based on it. That is, ProxySQL counts the types of statements it is executing, and this information is accessible through ProxySQL itself.
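These per-statement counters can be queried from the admin interface; a minimal sketch (the stats table shown is the one exposed by later releases, and is assumed here):

# top statement types seen by ProxySQL, as counted by its query analysis
mysql -u admin -padmin -h 127.0.0.1 -P 6032 \
  -e 'SELECT Command, Total_cnt FROM stats_mysql_commands_counters ORDER BY Total_cnt DESC LIMIT 5;'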

As already pointed out in previous articles, one of the main ideas behind ProxySQL is that the DBA is now the one controlling and defining query routing rules, making routing completely transparent to the developers; this eliminates the politics of DBAs depending on developers for such tweaking of the setup, and therefore speeds up iteration.

ProxySQL and Fast Forwarding

I think that the way MaxScale implements different modules listening on different ports is a very interesting approach, yet it forces the developers to enforce some sort of read/write split in the application: connect to port 4006 if you want R/W split, or port 4008 if you want read-only load balancing.
My aim with ProxySQL is that the application should have a single connection point, ProxySQL, and the proxy should determine what to do with the incoming requests. In other words, the application should just connect to ProxySQL, and ProxySQL takes care of the rest according to its configuration.
To do so, ProxySQL must always authenticate the client before applying any rule. Therefore I thought that a quick feature to implement was Fast Forwarding based on username: when a specific user connects, all its requests are forwarded to the backends without any query processing or connection pooling.
In other words, ProxySQL's Fast Forwarding is a concept similar to MaxScale's Read Connection, but it uses the same port as the R/W split module, and the matching criterion is the client's username instead of the listener port.
Note that ProxySQL already supports multiple listeners, but the same rules apply to all ports; in future versions, ProxySQL will support matching criteria based on the listener's port as well, behaving in a way similar to MaxScale, but will also add further matching criteria like the source of the connection.
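As a sketch of what enabling Fast Forwarding for a user could look like through the admin interface (the mysql_users table and its per-user fast_forward flag are as in later 1.x releases, and are assumed here):

# all traffic from user "rcannao" bypasses the query processor and the
# connection pool, and is forwarded straight to its default hostgroup
mysql -u admin -padmin -h 127.0.0.1 -P 6032 <<'EOF'
UPDATE mysql_users SET fast_forward=1 WHERE username='rcannao';
LOAD MYSQL USERS TO RUNTIME;
EOF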



Performance benchmarks


As said previously, on the same host where sysbench was running I also configured ProxySQL, MaxScale and HAProxy.
In the blog post published by Severalnines, one of the comments states that MaxScale was very slow with few connections on physical hardware.
Therefore, the first benchmark I wanted to run was exactly at a low number of connections, progressively increasing their number.
ProxySQL and MaxScale were both configured with just 1 worker thread, and HAProxy was configured with only 1 process.
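For reference, these are the knobs involved (the HAProxy and MaxScale settings come from their documentation; the exact name of the corresponding variable in the pre-GA ProxySQL build may differ):

# haproxy.cfg (global section), 1 worker process:
#   nbproc 1
# MaxScale config ([maxscale] section), 1 worker thread:
#   threads=1
# ProxySQL admin variable, 1 worker thread (applied at process start):
#   mysql-threads=1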

Please note that in the following benchmarks worker threads and connections are two completely different entities:
1) a connection is defined as a client connection;
2) a worker thread is a thread inside the proxy, either ProxySQL, MaxScale or HAProxy (even if HAProxy uses processes and not threads).
What could cause confusion is the fact that in sysbench a thread is a connection: from a proxy's perspective, it is just a connection.

Benchmark with 1 worker thread

Legend:
maxscale rw = MaxScale with RW Split Router
maxscale rr = MaxScale with Read Connection Router
proxysql rw = ProxySQL with query routing enabled
proxysql ff = ProxySQL with fast forwarding enabled

Average throughput in QPS:

Connections    HAProxy   MaxScale RW   MaxScale RR   ProxySQL RW   ProxySQL FF
          1    3703.36        709.99        722.27       3534.92       3676.04
          4   14506.45        2815.7       2926.44      13125.67      14275.66
          8   26628.44       5690.22       5833.77      23000.98      24514.94
         32   54570.26      14722.97      22969.73      41072.51      51998.35
        256   53715.79      13902.92      42227.46      45348.59      58210.93

In the above graphs we can easily spot that:
a) indeed, MaxScale performance is very low when running with just a few connections (more details below);
b) for any proxy, performance becomes quite unstable when the number of connections increases;
c) proxysql-ff is very close to the performance of haproxy;
d) with only 1 or 4 client connections, ProxySQL provides 5 times more throughput than MaxScale in both modules; with 8 client connections ProxySQL provides 4 times more throughput than MaxScale in R/W split mode, and 4.3 times more in fast forward mode;
e) at 32 client connections, proxysql-rw provides 2.8x more throughput than maxscale-rw, and proxysql-ff provides 2.3x more than maxscale-rr;
f) four proxy configurations (haproxy, maxscale-rw, proxysql-rw, proxysql-ff) behave similarly at 32 and 256 client connections, while maxscale-rr almost doubles its throughput at 256 connections vs 32 connections: in other words, when the number of connections is high, some bottleneck goes away.

Below are also the graphs of average throughput, and average and 95th percentile response time, at a low number of connections.

Fortunately, I have access to physical hardware (not AWS instances) and I was able to reproduce the issue reported in that comment: MaxScale seems to be very slow when running with just a few connections.
For comparison, I also tried a simple benchmark on AWS, and I found that MaxScale doesn't behave as badly there as on physical servers.
After these interesting results, I tried running the same benchmark connecting to MaxScale and ProxySQL not through TCP but through a Unix Domain Socket (UDS), with further interesting results.
Unfortunately, I didn't have a version of HAProxy that accepted connections via UDS, so I ran the benchmarks against HAProxy using TCP connections.
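For reference, the sysbench invocation over UDS is identical to the TCP one, except that the host/port pair is replaced by the socket path ($SOCKET here stands for the proxy's socket file; sysbench 0.5 accepts --mysql-socket for this):
./sysbench \
--test=./tests/db/oltp.lua \
--num-threads=$THREADS \
--max-requests=0 \
--max-time=600 \
--mysql-user=rcannao \
--mysql-password=rcannao \
--mysql-db=test \
--db-driver=mysql \
--oltp-tables-count=128 \
--oltp-read-only=on \
--oltp-skip-trx=on \
--report-interval=1 \
--oltp-point-selects=100 \
--oltp-table-size=400000 \
--mysql-socket=$SOCKET \
run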

Average throughput in QPS:
Connections    HAProxy   MaxScale RW   MaxScale RR   ProxySQL RW   ProxySQL FF
          1    3703.36       3276.85       3771.15       3716.19       3825.81
          4   14506.45      11780.27      14807.45      13333.03      14729.59
          8   26628.44      15203.93      27068.81      24504.42      25538.57
         32   54570.26      16370.69      44711.25      46846.04      58016.03
        256   53715.79      14689.73      45108.54      54229.29      71981.32

In the above graphs we can easily spot that:
a) MaxScale is no longer slow when running with just a few connections: the performance bottleneck at a low number of connections is not present when using UDS instead of TCP;
b) again, for any proxy, performance becomes quite unstable when the number of connections increases;
c) maxscale-rw is the slowest configuration at any number of connections;
d) with an increased number of client connections, MaxScale reaches its limits with an average of 16.4k reads/s peaking at 32 connections for maxscale-rw, and an average of 45.1k reads/s peaking at 256 connections for maxscale-rr;
e) with an increased number of client connections, ProxySQL reaches its limits with an average of 54.2k reads/s peaking at 256 connections for proxysql-rw, and an average of 72.0k reads/s peaking at 256 connections for proxysql-ff.

As pointed out already, with an increased number of connections performance becomes quite unstable, although it is easy to spot that:
1) in R/W split mode, ProxySQL can reach a throughput over 3 times higher than MaxScale;
2) ProxySQL in Fast Forward mode can reach a throughput 33% higher than MaxScale in Read Connection Router mode;
3) ProxySQL in R/W split mode is faster than MaxScale in simple Read Connection Router mode.

The above shows that while MaxScale's readconnroute module has low latency, neither of the two MaxScale modules scales very well. The bottleneck seems to be that MaxScale uses a lot of CPU, as already pointed out by Krzysztof in his blog post, therefore it quickly saturates its CPU resources without being able to scale.

Of course, it is possible to scale by adding more threads: more results below!


MaxScale and TCP

At this stage I knew that, on physical hardware:
- ProxySQL was running well when clients were connecting via TCP or UDS at any number of connections;
- MaxScale was running well when clients were connecting via UDS at any number of connections;
- MaxScale was running well when clients were connecting via TCP with a high number of connections;
- MaxScale was not running well when clients were connecting via TCP with a low number of connections.

My experience with network programming quickly pointed me to where the bottleneck could be.
This search returns no results:
https://github.com/mariadb-corporation/MaxScale/search?utf8=%E2%9C%93&q=TCP_NODELAY
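The same check can be reproduced locally against a source checkout:

# local equivalent of the GitHub search above: look for any place where
# MaxScale sets TCP_NODELAY (i.e. disables Nagle's algorithm) on its sockets
grep -rn "TCP_NODELAY" MaxScale/
# no output means the option is never set, so Nagle's algorithm stays enabled

The fix should be a one-line setsockopt() call enabling TCP_NODELAY on every client socket.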

In other words, MaxScale never disables Nagle's algorithm, adding latency to any communication with the client. The problem is noticeable only at a low number of connections because, at a high number of connections, the latency introduced by Nagle's algorithm becomes small compared to the overall latency caused by processing multiple clients. For reference:
http://en.wikipedia.org/wiki/Nagle%27s_algorithm

I will also soon open a bug report against MaxScale.

What I can't understand, and I would appreciate it if someone else could comment on this, is why Nagle's algorithm doesn't seem to have any effect on AWS or other virtualized environments.
In any case, this is a very interesting example of how software behaves differently on physical hardware and in virtualized environments.

Because MaxScale performs, on average, 5x more slowly at a low number of connections via TCP, the following graphs only use UDS for ProxySQL and MaxScale: the performance of MaxScale over TCP was too low to be worth considering.

Benchmark with 2 worker threads

Because MaxScale performs really badly at a low number of connections via TCP on physical hardware due to Nagle's algorithm, I decided to run all the following benchmarks connecting to MaxScale and ProxySQL only through UDS. HAProxy is still used for comparison, even if its connections go through TCP sockets.
I know it is not fair to compare the performance of connections via TCP (HAProxy) against connections via UDS (ProxySQL and MaxScale), but HAProxy is used only for reference.

Average throughput in QPS:
Connections    HAProxy   MaxScale RW   MaxScale RR   ProxySQL RW   ProxySQL FF
          4   14549.61      11627.16      14185.88      13697.03      14795.74
          8   27492.41      21865.39      27863.94      25540.61       27747.1
         32   81301.32      29602.84      63553.77      62350.89      77449.45
        256  109867.66       28329.8      73751.24      81663.75     125717.18
        512  105999.84       26696.6      69488.71      81734.18     128512.32
       1024  103654.97      27340.47      63446.61      74747.25     118992.24


Notes with 2 worker threads (for MaxScale and ProxySQL) or 2 worker processes (HAProxy):
a) once again, for any proxy, performance becomes quite unstable when the number of connections increases. Perhaps this is not a bug in the proxies, but a result of how the kernel schedules processes;
b) up to 32 client connections, the performance with 2 workers is very similar to the performance with 1 worker, no matter the proxy. Each proxy configuration has its own performance profile, but it performs the same with either 1 or 2 workers;
c) maxscale-rw reaches its average peak at 32 connections, with 29.6k reads/s;
d) maxscale-rr reaches its average peak at 256 connections, with 73.8k reads/s;
e) proxysql-rw reaches its average peak at 512 connections, with 81.7k reads/s;
f) proxysql-ff reaches its average peak at 512 connections, with 128.5k reads/s.

As pointed out already, with an increased number of connections performance becomes quite unstable, but as in the workload with just one worker thread it is easy to spot that:
1) in R/W split mode, ProxySQL can reach a throughput nearly 3 times higher than MaxScale;
2) ProxySQL in Fast Forward mode can reach a throughput 74% higher than MaxScale in Read Connection Router mode;
3) ProxySQL in R/W split mode is faster than MaxScale in simple Read Connection Router mode.

The above confirms what was said previously: ProxySQL uses fewer CPU resources, therefore it is able to scale much better than MaxScale as the number of client connections increases.

Benchmark with 4 worker threads

I ran more benchmarks using 4 worker threads for ProxySQL and MaxScale, and 4 worker processes for HAProxy.


Average throughput in QPS:
Connections    HAProxy   MaxScale RW   MaxScale RR   ProxySQL RW   ProxySQL FF
         16   50258.21       41939.8      50621.74      46265.65      51280.99
         32   89501.33      50339.81      87192.58      70321.17      85846.94
        256  174666.09       52294.7      117709.3      115056.5      183602.6
        512  176398.33      46777.17     114743.73     112982.78     188264.03
       2048  157304.08             0     107052.01     102456.38     187906.29


What happens with 4 worker threads/processes?
a) as with 1 or 2 workers, for any proxy, performance becomes quite unstable when the number of connections increases, but this time the fluctuations seem smoother. Still, ProxySQL seems the most stable proxy at a high number of connections;
b) at 32 connections, ProxySQL and HAProxy give similar throughput with either 2 or 4 workers;
c) at 32 connections, MaxScale provides more throughput with 4 workers than with 2, showing that MaxScale needs more CPU power to provide better throughput;
d) at 32 connections, HAProxy, ProxySQL and MaxScale provide similar reads/s when they do not analyze traffic (89.5k, 85.8k and 87.2k);
e) using the R/W functionality, at 16 connections ProxySQL provides 10% more reads/s than MaxScale (46.3k vs 41.9k), and at 32 connections ProxySQL provides 40% more reads/s than MaxScale (70.3k vs 50.3k);
f) MaxScale in R/W mode wasn't able to handle 2048 client connections;
g) maxscale-rw reaches its average peak at 256 connections, with 52.3k reads/s;
h) maxscale-rr reaches its average peak at 256 connections, with 117.7k reads/s;
i) proxysql-rw reaches its average peak at 256 connections, with 115.1k reads/s;
j) proxysql-ff reaches its average peak at 512 connections, with 188.3k reads/s.
 
A few more notes on scalability with 4 threads:
1) in R/W split mode, ProxySQL can reach a throughput over 2 times higher than MaxScale;
2) ProxySQL in Fast Forward mode can reach a throughput 60% higher than MaxScale in Read Connection Router mode;
3) ProxySQL in R/W split mode is, for the first time, slightly slower than MaxScale in simple Read Connection Router mode (115.1k vs 117.7k).


Note on transport layer load balancing

I consider only the benchmarks related to R/W split important, because only R/W split provides SQL load balancing; HAProxy, ProxySQL in fast forward mode and MaxScale with the readconnroute module do not provide SQL load balancing, but are present in the benchmarks above to give a reference for the overhead caused by processing SQL traffic.
Furthermore, the performance of MaxScale's readconnroute doesn't compare well with that of HAProxy or ProxySQL: from a user's perspective, I would prefer HAProxy because it provides far better performance.


Conclusions

One of the main goals while developing ProxySQL is that it must be a very fast proxy, introducing almost no latency. This goal seems to be very well achieved: ProxySQL is able to process MySQL traffic with very little overhead, and it is able to scale very well.
In all the benchmarks listed above, ProxySQL scales easily.
In fact, in R/W split mode (highly configurable in ProxySQL, but hardcoded in MaxScale), ProxySQL is able to provide up to 5 times more throughput than MaxScale, depending on the workload.

Since ProxySQL in query processing mode (R/W split) provides more throughput than MaxScale's readconnroute in the majority of cases, I would always use ProxySQL's query processing, which implements important features like query routing, query rewriting, query caching, statistics, connection pooling, etc.
As of today, the only reason why I wouldn't use ProxySQL in production is that ProxySQL is not GA ... yet!


3 comments:

  1. Hi,
     I'm just trying a master-slave combo to load balance the SELECTs, but I can't find the correct value in mysql_query_rules to do that. What's the correct way to set it up?

     Thanks

     Reply:
     Hi!
     Please have a look at https://github.com/sysown/proxysql/blob/v1.1.2/doc/configuration_howto.md for a guide on how to configure rules.
     Binaries are available at: https://github.com/sysown/proxysql/releases/tag/v1.1.2

     Thanks