This post is about keeping it simple when it comes to managing a company’s databases. Simplicity lets me easily diagnose problems, quickly set up new databases, readily validate my actions while configuring replication, and identify the right course of action even at two in the morning.
Simplicity in my infrastructure allows me to diagnose problems. It is difficult to tell whether a past network spike came from a Memcached instance or a Postfix instance when both are hosted on the same machine. Service segregation, meaning that a single host is responsible for a single heavy process, significantly aids diagnosis: the standard, rapidly deployed graphs from any of a number of RRDTool implementations show host-wide performance statistics, and those statistics can only have come from the one heavy process that machine is responsible for running.
Simplicity in my instance configuration has made it easy to add another instance. The current time from request to implementation is about ten minutes plus a pair of code reviews for Puppet. Every configuration file is identical save for the last two lines, which are host specific and related to replication.
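A minimal sketch of that layout, written in Python rather than Puppet. It assumes the two host-specific lines are `server-id` and `log-bin`; this post does not say which options they actually are. `SHARED_CONFIG`, `render_config`, the settings shown, and the host names are all hypothetical.

```python
# Hypothetical shared settings; every host gets exactly this text.
SHARED_CONFIG = """\
[mysqld]
datadir = /var/lib/mysql
innodb_file_per_table = 1
"""


def render_config(hostname: str, server_id: int) -> str:
    """Return the shared config followed by the two host-specific lines.

    Assumption: the host-specific lines are server-id and log-bin.
    """
    host_lines = (
        f"server-id = {server_id}\n"
        f"log-bin = {hostname}-bin\n"
    )
    return SHARED_CONFIG + host_lines


if __name__ == "__main__":
    # Two hypothetical hosts; only the last two lines of output differ.
    print(render_config("db01", 101))
    print(render_config("db02", 102))
```

Because everything above the last two lines is shared, reviewing a new host comes down to checking two values.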
Standardizing replication log file names so that they are based on the hostname provides two excellent benefits. The first benefit is that I am able to rapidly validate my connection within the same command line prompt where I am making replication changes. Switching between windows and tabs to validate that I’m connected to the correct machine creates an unnecessary risk that I will switch back to the wrong machine. Having the log file name tied to the host I am working on allows for easy validation. The second benefit is that I am able to provide programmatic protection against human error. Even when I am connected to the wrong machine, a host-specific file name means the replication session will try to read files that do not exist, ensuring that improper replication events will not be applied on the wrong host.
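The validation step can be sketched as a small check. It assumes a naming convention of `<hostname>-bin.<sequence>`, which is an illustration and not necessarily the exact pattern used here; `LOG_NAME` and `log_belongs_to` are hypothetical names.

```python
import re

# Assumed convention: <hostname>-bin.<numeric sequence>, e.g. db01-bin.000042
LOG_NAME = re.compile(r"^(?P<host>.+)-bin\.(?P<seq>\d+)$")


def log_belongs_to(log_file: str, hostname: str) -> bool:
    """Return True only if log_file follows the convention for hostname."""
    match = LOG_NAME.match(log_file)
    return match is not None and match.group("host") == hostname


if __name__ == "__main__":
    # Hypothetical values: the log file I intend to replicate from,
    # and the host I believe I am pointing the replica at.
    print(log_belongs_to("db01-bin.000042", "db01"))
    print(log_belongs_to("db02-bin.000042", "db01"))
```

A generic name such as `mysql-bin.000042` would exist on every host, so a mistaken connection would go unnoticed; a host-based name makes the mismatch visible before any events are read.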
Simplicity in my application configuration allows me to fail over in five minutes and shove cleanup tasks into the surrounding half hour. I’d be happier with a faster failover and cleanup, but my team finds the current system acceptable because we (knock on wood) haven’t had cause to perform an immediate, emergency failover in well over a year. At the time of this writing, we maintain a list of IPs in a file inside the application code base. The list of IPs is always available wherever the code exists, does not require a database connection, and is easy to change because nothing needs to be cached. It’s a simple solution when compared to keeping shard information in a database and writing a complex caching layer for that IP data.
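A minimal sketch of what such a file could look like, assuming a Python code base and a mapping keyed by shard name. The shard names, the addresses (taken from the documentation-only 192.0.2.0/24 range), and the `host_for` helper are all hypothetical.

```python
# Hypothetical shard-to-IP map kept in a file inside the application code base.
# A failover is a one-line change here, followed by a normal code deploy.
DATABASE_HOSTS = {
    "shard1": "192.0.2.11",
    "shard2": "192.0.2.12",
}


def host_for(shard: str) -> str:
    """Look up the database IP for a shard; no database or cache involved."""
    try:
        return DATABASE_HOSTS[shard]
    except KeyError:
        raise KeyError(f"no database host configured for shard {shard!r}") from None


if __name__ == "__main__":
    print(host_for("shard1"))
```

Since the lookup is a plain in-process dictionary read, there is no cache to invalidate when an address changes.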
Keeping it simple is a goal I’ve strived for in my production environments. Failovers are an IP change, replication settings are as simple as I can safely make them, setup is a new file in Puppet, and database hosts follow service segregation by running only MySQL/Percona Server. It keeps my nights pretty free of Nagios alerts. 🙂