Monthly Archives: November 2014

XtraBackup causing swap despite numa_interleave on mysqld

Jeremy Cole has an excellent article on why a mysqld instance will swap despite proper memory settings on modern hardware here. A quick hand-wave of a summary goes: in modern architecture, every process is, by default, limited to a 1/Nth of the available memory on a machine, where N is the number of cores, and will start to swap on disk when all the available memory in that 1/Nth is consumed.

 

The eventual solution for a large, single-instance mysqld-running machine is to enable numa_interleave in the mysqld_safe section of the config:

Screen Shot 2014-11-18 at 10.09.03 AM

Implicitly, when that flag is set, mysqld is being started by mysqld_safe with a numactl flag that will spread the memory allocation evenly across all nodes (“numactl –interleave=all [… rest of mysqld start command]”).

 

There is a subtlety, though, that must be considered: mysqld was just started up and allocated most of the memory on each of the nodes. When all nodes on a machine are mostly full, no locally running script, such as XtraBackup, can consume significant memory without starting to cause swap since that script, by default, will only consume memory in its hosting node. Also, that swap, generally, will be the mysqld process.

 

The solution, then, becomes to interleave all processes that lead to swap or any process that can reasonably be expected to consume a significant amount of memory before it leads to swap.

 

Before, Xtrabackup could look like this:

innobackupex –defaults-extra-file=/path/to/auth –tmpdir=/backup/partition –sla[…]

Now, Xtrabackup will look like this:

numactl –interleave=all innobackupex –defaults-extra-file=/path/to/auth –tmpdi[…]

 

I hope this helps resolve any additional swap issues even after mysqld’s numa_interleave flag was enabled. Let me know if there are clarifications I can make.

 

Update Aug 2015: I’ve learned about a few more variables related to numa management. Will provide corrections soon.

Preventing Max Connections Errors with InnoDB

Stop increasing max_connections every time there’s a 1040: Too Many Connections error. Every additional connection is another share to further divide the available memory.

Instead, while it would be best to manage the workload, it is also reasonable to properly utilize the available hardware with good server configuration.

There are three relevant server configuration options for managing connection counts as they relate to satisfying web requests.

  1. max_connections – the queue depth
  2. innodb_thread_concurrency – the count of queue consumers
  3. innodb_concurrency_tickets – the amount of work a consumer can do on a query before switching to the next query request

Correctly configuring these three variables, and controlling your workload of course, can prevent 1040 Too many connections errors, assuming, of course, that your server is not beyond its configurable capacity.

Taken together, these variables have a number of interesting behaviors.

The first typical issue, resource starvation due to large reporting queries, can be mitigated with proper innodb_concurrency_tickets when the hardware is too weak to increase the query consumer count (innodb_thread_concurrency). When innodb_concurrency_tickets equals 500, which is the default, then five hundred InnoDB rows may be touched before the InnoDB thread moves onto the next query in the queue. A ticket is consumed on any row touch, without regard to whether or not it’s a row-read, a uniqueness check, or a foreign-key lookup. Accordingly, setting innodb_concurrency_tickets to 1 means the InnoDB consumer threads will spend a large portion of their cycles context switching between queries, whereas setting innodb_concurrency_tickets to max will starve other potentially much cheaper queries of resources. Personally, the default value of innodb_concurrency_tickets works just fine for my workload and I’ve never moved it from the default value of 500. A few cpu resources will be spent context switching, but the fairness will be worthwhile for most web apps.

The second typical issue, stampedes of cheap queries causing 1040 errors, can be mitigated with an appropriately sized max_connection value when the hardware is too weak to increase the query consumer count. It’s not really an ideal solution, though. Every additional max_connection value means there’s a risk of another query concurrently doing joins, temporary tables, and the piles of complicated, memory-consuming behavior that leads to swap, and even potentially the OOM-Killer. Managing workload with good caching, potentially even proactive caching to manage the Thundering Herd problem, and throttled asynchronous application behavior is a positive solution that scales well.

The combination of these issues, as well as modern, multi-core hardware with powerful IO backends, pulls in the third system configuration option, innodb_thread_concurrency. The count of ‘query consumers,’ innodb_thread_concurrency will allow InnoDB to work on separate queries at the same time. By default, innodb_thread_concurrency is set to 8. Modern OS’s are just fine at scheduling, so I like to keep this value sufficiently high that I have at least twice as many consumers available as the number of possible concurrent, expensive queries. Having a large count of query consumers available keeps the cheap queries processed and back out of the queue, allowing the connection count to stay low.

So, what’s wrong with just opening up the max_connection and/or the innodb_thread_concurrency values to max? It doesn’t matter how powerful your hardware is, too many queries all simultaneously consuming an equal share of resources will prevent any of them from completing. Queries not completing and closing their connections will, of course, lead to a backlog of the cheap queries, and potentially a whole new batch of the cheap queries if the web requests retry the unreturned cheap queries. Eventually, the server goes unresponsive, customers get angry, the business crashes, and you’re looking for a new job.

Hopefully it’s clear: the right number of query consumers (innodb_thread_concurrency), the right amount of work per focus (innodb_concurrency_tickets), and the right amount of time for the consumers to close incoming query requests (max_connections), and you’ll be able to avoid 1040 Too many connections exceptions, business closings, and job hunts.