Sunday, April 22, 2018

MyRocks, malloc and fragmentation -- a strong case for jemalloc

While trying to reproduce a MyRocks performance problem I ran a test using a 4gb block cache and tried both jemalloc and glibc malloc. The test server uses Ubuntu 16.04 which has glibc 2.23 today. The table below lists the VSZ and RSS values for the mysqld process after a test table has been loaded. RSS with glibc malloc is 2.6x larger than with jemalloc. MyRocks and RocksDB are much harder on an allocator than InnoDB and this test shows the value of jemalloc.

VSZ(gb) RSS(gb) malloc  
 7.9     4.8    jemalloc-3.6.0
13.6    12.4    glibc-2.23

I am not sure that it is possible to use a large RocksDB block cache with glibc malloc, where large means that it gets about 80% of RAM.
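One way to switch allocators for this kind of test is the malloc-lib option for mysqld_safe. A sketch, assuming Ubuntu 16.04 package names and library paths (the path below is an assumption -- adjust for your install):

# the libjemalloc1 package on Ubuntu 16.04 provides jemalloc 3.6.0
sudo apt-get install libjemalloc1

# in my.cnf, have mysqld_safe preload jemalloc before starting mysqld
[mysqld_safe]
malloc-lib = /usr/lib/x86_64-linux-gnu/libjemalloc.so.1

# after loading the test table, compare VSZ and RSS for the mysqld process
ps -o vsz,rss -p $(pidof mysqld)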

I previously shared results for MySQL and for MongoDB. There have been improvements over the past few years to make glibc malloc perform better on many-core servers. I don't know whether that work also made it better at avoiding fragmentation.

Friday, April 20, 2018

Fun with caching_sha2_password in MySQL 8.0.11


I want to get benchmark numbers with MySQL 8.0.11. This is my first impression. The default auth method was changed to caching_sha2_password. See this post for more details. There will be some confusion with this change. By confusion I mean the difference between "error" and "OK because cached" below. I am not alone. See the experience that an expert had with replication.

Fun with caching_sha2_password occurs even with clients compiled as part of 8.0.11:

  1. install MySQL 8.0.11, disable SSL but use mostly default my.cnf
  2. bin/mysql -u... -p... -h127.0.0.1 -e ... -> error
  3. bin/mysql -u... -p... -e ... -> OK
  4. bin/mysql -u... -p... -h127.0.0.1 -e ... -> OK because cached

The error in step 2 is: ERROR 2061 (HY000): Authentication plugin 'caching_sha2_password' reported error: Authentication requires secure connection.

From show global variables I see the default is caching_sha2_password:

default_authentication_plugin   caching_sha2_password
Setting default_authentication_plugin=mysql_native_password in my.cnf after the user has been created doesn't fix the problem. Setting it before creating the user is one fix:

default_authentication_plugin=mysql_native_password

I did not test whether changing the value of user.plugin to "mysql_native_password" for an existing account is another workaround.
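A sketch of that untested per-account workaround, using ALTER USER rather than editing mysql.user directly (the account name and password below are placeholders):

bin/mysql -u root -p -e "ALTER USER 'someuser'@'%' IDENTIFIED WITH mysql_native_password BY 'somepassword'"
bin/mysql -u someuser -psomepassword -h127.0.0.1 -e "SELECT 1"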
The error when using an old mysql client will also be a source of confusion:
$ ~/b/orig5635/bin/mysql -u... -p.. -h127.0.0.1
ERROR 2059 (HY000): Authentication plugin 'caching_sha2_password' cannot be loaded: /home/mdcallag/b/orig5635/lib/plugin/caching_sha2_password.so: cannot open shared object file: No such file or directory
 

Thursday, April 5, 2018

Index Structures, Access Methods, whatever

I prefer to use Index Structures when writing about algorithms for indexing data stored on disk and SSD but Access Methods is another popular name. Recently I have been working on a high-level comparison (lots of hand waving) of index structures in terms of read, write, space and cache amplification.

I started by dividing the index structures into tree-based and hash-based. Tree-based structures support range queries while hash-based do not. Most of my experience is with tree-based approaches. There was a lot of research on hash-based structures in the 80s (extendible, dynamic and linear hashing); they are in production today, though not as prominent as b-trees, and there is recent research on them. This paper by Seltzer and Yigit is a great overview of hash-based index structures.

The next classification is by algorithm - update-in-place, index+log and LSM. I use this for both tree-based and hash-based so LSM means Log Structured Merge rather than Log Structured Merge Tree. Most DBMS implement one or two of these. Tarantool has more algorithmic diversity, but I don't have much experience with it.

For tree-based:
  • update-in-place - an update-in-place b-tree is the best example, including clustered (InnoDB) and non-clustered (PostgreSQL) variants. Even though it isn't update-in-place, I expect WiredTiger to be close enough given that it is copy-on-write-random (CoW-R).
  • LSM - leveled and tiered compaction are examples (RocksDB, LevelDB, Cassandra). Note that the original LSM design by O'Neil et al didn't have an L0 and I suspect that the L0 might be a mistake, but that is for another post. I don't want to insult the LevelDB authors -- the L0 makes sense when you want to limit memory consumption for a database embedded in client apps, but I am not sure it makes sense for serious OLTP with RocksDB. O'Neil also did fundamental work on bitmap indexes. I worked on both so he made my career possible (thank you).
  • index+log - use a tree-based index with data in log segments. The index points into the log segments. GC is required. It scans the log segments and probes the index to determine whether a record is live or can be deleted. There are fewer examples of this approach but interesting systems include WiscKey and ForestDB. Range queries will suffer unless the index is covering (this needs more discussion, but not here).

For hash-based:
  • update-in-place - see dynamic, extendible and linear hashing. TIL that ZFS implements extendible hashing. BerkeleyDB supports linear hashing. The dbm implementation on many Linux/Unix systems also implements some variant of update-in-place persistent hash tables.
  • LSM - today I only know of one example of this - SILT. That is a great paper (read it). I include it as an example of hash-based even though one of the levels is ordered.
  • index+log - BitCask is one example but the index wasn't durable and it took a long time (scan all log segments) to rebuild it on process start. Faster might be another example, but I am still reading the paper. I hope someone names a future system Efficienter or Greener.

Finally, I have a few comments on performance. I will be brief; maybe there will be another post with a lot more detail on my opinions:
  • b-tree - provides better read efficiency at the cost of write efficiency. The worst case for write efficiency is writing back a page for every modified row. Cache efficiency is better for a clustered index than a non-clustered index -- for a clustered index you should cache at least one key/pointer per leaf block but for a non-clustered index you need the entire index in RAM or there will be extra storage reads. Space efficiency for a b-tree is good (not great, not bad) -- the problem is fragmentation.
  • LSM - provides better write efficiency at the cost of read efficiency. Leveled compaction provides amazing space efficiency. Tiered compaction gets better write efficiency at the cost of space efficiency. Compaction does many key comparisons and this should be considered as part of the CPU overhead for an insert (maybe 4X more comparisons/insert than a b-tree).
  • index+log - provides better write efficiency. Depending on the choice of index structure this doesn't sacrifice read efficiency like an LSM. But the entire index must remain in RAM (just like a non-clustered b-tree) or GC will fall behind and/or do too many storage reads. GC does many index probes and the cost of this is greatly reduced by using a hash-based solution; these probes should be considered as part of the CPU overhead of an insert. There is also a write vs space efficiency tradeoff: if more dead data can be tolerated in log segments then GC runs less often and write-amplification improves, but space-amplification suffers (see the worked example after this list). There are variants that don't need GC, but they are not general purpose.
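As a rough worked example of that tradeoff, ignoring writes done for the index and assuming segments are cleaned exactly at the threshold: if log segments are cleaned once they are 50% dead then reclaiming one unit of space copies one unit of live data, so write-amplification from GC is about 2 and the log can be up to 2x larger than the live data. If segments are cleaned once they are 90% dead then write-amplification from GC drops to about 1.1, but the log can be up to 10x larger than the live data.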


Wednesday, March 28, 2018

Missing documentation

In my time with Linux there are some things that would benefit from better documentation.

  1. The use of per-inode mutexes by some members of the ext family for files opened with O_DIRECT
  2. PTHREAD_MUTEX_ADAPTIVE_NP - a glibc mutex type (set via pthread_mutexattr_settype) that enables some busy-waiting when trying to lock a contended mutex. It isn't mentioned in the man pages.

Friday, March 9, 2018

Cache amplification

How much of the database must be in cache so that a point query does at most one read from storage? I call this cache-amplification (cache-amp). The answer depends on the index structure (b-tree, LSM, something else). Cache amplification can join read, write and space amplification. Given that RWS was renamed RUM by the excellent RUM Conjecture, we now have CRUM, which is close to crummy. I briefly wrote about this in a previous post.

To do at most 1 storage read for a point query:
  • clustered b-tree - everything above the leaf level must be in cache. This is a key/pointer pair per leaf block. The InnoDB primary key index is an example.
  • non-clustered b-tree - the entire index must be in cache. This is a key/pointer pair per row, which is much more memory than the cache-amplification for a clustered b-tree (see the worked example after this list). Non-covering secondary indexes with InnoDB are an example, although in that case you must also consider the cache-amplification for the PK index.
  • LSM - I assume there is a bloom filter per SST. Bloom filters for all levels but the max level should be in cache. Block indexes for all levels should be in cache. Data blocks don't have to be in cache. I assume there are no false positives from the bloom filter so at most one data block will be read. Note that with an LSM, more space amplification means more cache amplification. So cache-amp is worse (higher) for tiered compaction than for leveled.
  • something else - there have a been a few interesting variants on the same theme that I call index+log -- BitCask, ForestDB and WiscKey. These are similar to a non-clustered b-tree in that the entire index must be in cache so that the storage read can be spent on reading the data from the log.
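A worked example with made-up numbers: assume 1 billion rows, a 16-byte key, an 8-byte pointer and leaf blocks that hold about 100 rows. A non-clustered b-tree needs a key/pointer pair per row in cache, which is about 24gb. A clustered b-tree needs a key/pointer pair per leaf block, and with ~10 million leaf blocks that is about 240mb. The difference is roughly the number of rows per leaf block.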
I have ignored hash-based solutions for now but eventually they will be important. SILT is a great example of a solution with excellent cache-amplification.

Updated to correct what should be in cache for the LSM.

Friday, February 16, 2018

Sharded replica sets - MySQL and MongoDB

MongoDB has had a great story for sharded replica sets, but the storage engine, sharding and replica management code had significant room for improvement. Over the last few releases they made remarkable progress on that and the code is starting to match the story. I continue to be impressed by the rate at which they paid off their tech debt, and transactions coming to MongoDB 4.0 is one more example.

It is time for us to do the same in the MySQL community.

I used to be skeptical about the market for sharded replica sets with MySQL. This is popular with the web-scale crowd but that is a small market. Today I am less skeptical and assume the market extends far beyond web-scale, even though the market for replica sets without sharding is much larger.


The market for replica sets is huge. For most users, if you need one instance of MySQL then you also need HA and disaster recovery. So you must manage failover and for a long time (before crash-proof slaves and GTID) that was a lousy experience. It is better today thanks to cloud providers and DIY solutions even if some assembly is required. Upstream is finally putting a solution together with MySQL Group Replication and other pieces.


But sharded replica sets are much harder, and even more so if you want to do cross-shard queries and transactions. While there have been many attempts at sharding solutions for the MySQL community, it is difficult to provide something that works across customers. Fortunately Vitess has shown this can be done and already has many customers in production.

ProxySQL and Orchestrator might also be vital pieces of this stack. I am curious to see how the traditional vendors (MySQL, MariaDB, Percona) respond to this progress.

Updates:

I think binlog server should be part of the solution. But for that to happen we need a GPLv2 binlog server and that has yet to be published.

Wednesday, January 17, 2018

Meltdown vs storage

tl;dr - sysbench fileio throughput for ext4 drops by more than 20% from Linux 4.8 to 4.13

I shared results from sysbench with a cached database to show a small impact from the Meltdown patch in Ubuntu 16.04. Then I repeated the test for an IO-bound configuration using a 200mb buffer pool for InnoDB and a database that is ~1.5gb.

The results for read-only tests looked similar to what I saw previously so I won't share them. The results for write-heavy tests were odd as QPS for the kernel without the patch (4.8.0-36) were much better than for the kernel with the patch (4.13.0-26).

The next step was to use sysbench fileio to determine whether storage performance was OK and it was similar for 4.8 and 4.13 with read-only and write-only tests. But throughput with 4.8 was better than 4.13 for a mixed test that does reads and writes.

Configuration


I used a NUC7i5bnh server with a Samsung 960 EVO SSD that uses NVMe. The OS is Ubuntu 16.04 with the HWE kernels -- either 4.13.0-26 that has the Meltdown fix or 4.8.0-36 that does not. For the 4.13 kernel I repeat the test with PTI enabled and disabled. The test uses sysbench with one 2gb file, O_DIRECT and 4 client threads. The server has 2 cores and 4 HW threads. The filesystem is ext4.
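PTI is toggled at boot. A sketch of how to disable it, assuming a GRUB-based setup and a kernel that accepts the nopti parameter (the dmesg string to check varies by kernel version):

# add nopti (or pti=off) to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then:
sudo update-grub
sudo reboot

# after reboot, check whether page table isolation is active
dmesg | grep -i isolation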

I used these command lines for sysbench:
sysbench fileio --file-num=1 --file-test-mode=rndrw --file-extra-flags=direct \
    --max-requests=0 --num-threads=4 --max-time=60 prepare
sysbench fileio --file-num=1 --file-test-mode=rndrw --file-extra-flags=direct \
    --max-requests=0 --num-threads=4 --max-time=60 run

And I see this:
cat /sys/block/nvme0n1/queue/write_cache
write back

Results

The next step was to understand the impact of the filesystem mount options. I used ext4 for these tests and don't have much experience with it. The table below has the throughput in MB/s from the sysbench fileio test that does reads and writes. I noticed a few things:
  1. Throughput is much worse with the nobarrier mount option. I don't know whether this is expected.
  2. There is a small difference in performance from enabling the Meltdown fix - about 3% for three of the four mount option combinations
  3. There is a big difference in performance between the 4.8 and 4.13 kernels, whether or not PTI is enabled for the 4.13 kernel. I get about 25% more throughput with the 4.8 kernel.

4.13    4.13    4.8    mount options
pti=on  pti=off no-pti
100     104     137     nobarrier,data=ordered,discard,noauto,dioread_nolock
 93     119     128     nobarrier,data=ordered,discard,noauto
226     235     275     data=ordered,discard,noauto
233     239     299     data=ordered,discard,noauto,dioread_nolock
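For reference, the mount options in the last row map to an /etc/fstab entry like the one below (the device name and mount point are placeholders):

/dev/nvme0n1  /data  ext4  data=ordered,discard,noauto,dioread_nolock  0  0

# noauto means the filesystem isn't mounted at boot, so mount it by hand before the test
sudo mount /data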

Is it the kernel?

I am curious about what happened between 4.8 and 4.13 to explain the 25% loss of IO throughput.

I have another set of Intel NUC servers that use Ubuntu 16.04 without the HWE kernels -- 4.4.0-109 with the Meltdown fix and 4.4.0-38 without the Meltdown fix. These servers still use XFS. I get ~2% more throughput with the 4.4.0-38 kernel than the 4.4.0-109 kernel (whether or not PTI is enabled).

The loss in sysbench fileio throughput does not reproduce for XFS. The filesystem mount options are "noatime,nodiratime,discard,noauto" and tests were run with /sys/block/nvme0n1/queue/write_cache set to write back and write through. The table below has MB/s of IO throughput.

4.13    4.13    4.8
pti=on  pti=off no-pti
225     229     232     write_cache="write back"
125     168     138     write_cache="write through"

More debugging

This is vmstat output from the sysbench test. The id (idle) values are over 40 for the 4.13 kernel but less than 10 for the 4.8 kernel, and the ratio of cs per unit of IO is similar for 4.13 and 4.8 (see the arithmetic after the output).

# vmstat from 4.13 with pti=off

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  4      0 15065620 299600 830564    0    0 64768 43940 7071 21629  1  6 42 51  0
 0  4      0 15065000 300168 830512    0    0 67728 45972 7312 22816  1  3 44 52  0
 2  2      0 15064380 300752 830564    0    0 69856 47516 7584 23657  1  5 43 51  0
 0  2      0 15063884 301288 830524    0    0 64688 43924 7003 21745  0  4 43 52  0

# vmstat from 4.8 with pti=on

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  4      0 14998364 384536 818532    0    0 142080 96484 15538 38791  1  6  9 84  0
 0  4      0 14997868 385132 818248    0    0 144096 97788 15828 39576  1  7 10 83  0
 1  4      0 14997248 385704 818488    0    0 151360 102796 16533 41417  2  9  9 81  0
 0  4      0 14997124 385704 818660    0    0 140240 95140 15301 38219  1  7 11 82  0
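Rough arithmetic from the samples above, assuming vmstat reports bi and bo in KB/s: 4.13 does about 110 MB/s of IO with about 22,000 context switches per second, while 4.8 does about 240 MB/s with about 39,000 -- roughly 0.20 cs per KB of IO for 4.13 and 0.16 for 4.8.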

Output from Linux perf for 4.8 and for 4.13.