In the beginning, there were some tables. Then came a few more tables, but the initial tables remained and they were growing. This is a (very) boiled down version of the story of any database-backed API that continues to grow in users and features. With tables of even a moderate size, a near-zero-downtime migration run directly on the live database becomes impossible. This is hardly news. Thankfully, some lovely folks around the world have created tools that help with exactly this problem. Say hello to online-schema-change tooling!
For quite some time at WeTransfer, we had been using 'pt-online-schema-change' from the Percona Toolkit. In a nutshell, this tool creates an empty copy of the table to be migrated, applies the schema change to that copy, then copies data over from the "old" table to the "new" one. The data copy is a combination of simply copying existing rows and using a set of MySQL triggers to propagate write events from the "old" table onto the "new" one.
On one hand, using triggers feels like a very elegant solution. They carry the benefit of using built-in SQL commands and remaining declarative (telling the database what to do, not how to do it). On the other hand, triggers can have some (very serious) side effects. The primary side effect that caused us to look around was the potential for deadlocks. Queries and triggers simultaneously compete for locks on the same table rows, unfortunately not in a coordinated fashion. This has resulted in deadlocks on entire tables, which we'd really like to avoid. Thankfully, we've found a tool that doesn't require us to change our migration procedure, but manages to avoid the deadlock problem. Let me introduce you to gh-ost!
What is gh-ost?
'gh-ost' is another online schema change tool, developed by GitHub. It takes a different approach to the problem from many other tools by avoiding MySQL triggers entirely. It is able to achieve this by acting like a replica and connecting to the binary log stream. It prefers to read from the log stream of a replica, which means that there is no per-write overhead on the main database and no competition for locks. This comes with some trade-offs, most notably the added complexity of having to implement all of this logic within the tool rather than relying on built-in MySQL functionality. With these trade-offs, though, come a whole lot of benefits: no more deadlocks, significantly faster migrations (so far, likely due to not competing for resources), the ability to truly pause a migration mid-way, and many more. Rather than continue to rewrite the book on this tool, I'll just point you to the official documentation explaining the trigger-less design.
Using gh-ost out in production
Now that I've spent some time outlining why we originally explored using 'gh-ost', how about we dig into what the experience has been like actually using it. If you've installed the tool, or read through some of the documentation, you've seen how many configuration options are available. Here is an example of a command that we would run (some values changed):
gh-ost \
  --max-load=Threads_running=150 \
  --critical-load=Threads_running=4000 \
  --chunk-size=1000 \
  --throttle-control-replicas="main.host.database.com" \
  --max-lag-millis=3000 \
  --user="db-user" \
  --password="really-great-db-password" \
  --host="read-replica.host.database.com" \
  --database="wetransfer-database-name" \
  --table="objects" \
  --dml-batch-size=1000 \
  --verbose \
  --alter="ADD COLUMN description TEXT" \
  --assume-rbr \
  --cut-over=default \
  --exact-rowcount \
  --concurrent-rowcount \
  --default-retries=1200 \
  --cut-over-lock-timeout-seconds=9 \
  --panic-flag-file=/tmp/ghost.panic.objects.flag \
  --assume-master-host="main.host.database.com" \
  --postpone-cut-over-flag-file=/tmp/ghost.postpone.objects.flag \
  --initially-drop-ghost-table \
  --execute
There's a lot going on there. I'm not going to go through each option, as they're explained quite well in the docs as well as in the '--help' output of the tool.
This looks great, but it sort of assumes that we're accustomed to writing raw SQL commands to migrate our database. This may be the case for many, and if so, then the rest of this post may not interest you. For those of us that use ActiveRecord (or something similar) to manage our database migrations and state, even writing this command requires the additional step of converting a migration written in Ruby to SQL. This isn't a massive problem, as one can easily run the migration locally, copy the SQL that ActiveRecord generated, and paste it into the '--alter' argument. Unfortunately, that's not the only manual step required. ActiveRecord also keeps track of our current database state by recording which migrations have been run, which means we also need to bring the two back in sync. This can be resolved manually by inserting a row into the 'schema_migrations' table. Adding two manual steps to every migration wasn't good enough for us, so we decided it was time to build some tooling to help out.
With the goal of removing the additional manual steps for migrations, we decided to first try a relatively small-investment solution: a rake task to automate the SQL generation and migration version handling. The task worked primarily by overriding the default 'execute' method that ActiveRecord calls for every query. We hard-coded most of the above configuration options and inserted the rest based on the incoming alter table query. With the new task, we were again able to run migrations with a single command, 'bundle exec rake ghost:migrate', while also using the new tool.
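The core idea of that rake task can be sketched in plain Ruby. To be clear, the class and flag handling below are illustrative, not the actual WeTransfer implementation: intercept ALTER TABLE statements and turn them into a gh-ost invocation instead of executing them directly.

```ruby
# Illustrative sketch: detect ALTER TABLE statements and build the
# corresponding gh-ost command. GhostMigrator is a hypothetical name;
# the real task wires this into ActiveRecord's 'execute'.
class GhostMigrator
  def initialize(database:, host:, user:)
    @database = database
    @host = host
    @user = user
  end

  # Returns the gh-ost argv for an ALTER TABLE statement, or nil for
  # any other SQL (which should be executed normally).
  def command_for(sql)
    match = /\AALTER TABLE [`"]?(?<table>\w+)[`"]?\s+(?<alter>.+)\z/im.match(sql.strip)
    return nil unless match

    [
      'gh-ost',
      "--database=#{@database}",
      "--host=#{@host}",
      "--user=#{@user}",
      "--table=#{match[:table]}",
      "--alter=#{match[:alter]}",
      '--execute'
    ]
  end
end
```

In the real task, a method like this would sit inside an overridden connection-adapter 'execute', shelling out to gh-ost for ALTER TABLE statements and falling through to the normal path for everything else.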
On top of getting us back to a single command for migrations, this solution also gave us the benefit of being invisible to the production Rails app. Since this code lives exclusively in a rake task, the database connections in our production app function without any knowledge of our database migration strategy. This was an important difference between our new way of running migrations and our previous way, using adapters like departure. That adapter is always present, running a regular expression against each SQL command to check if it is an ALTER TABLE statement. With these learnings and a few successful migrations behind us, we decided it was time to build out the tooling a bit more with ghost_adapter.
Let me be the first to introduce you to our new tool, Ghost Adapter. This tool builds on our learnings from running 'gh-ost' migrations through custom rake tasks by adding just a little convenience and mostly staying out of the way. There were two primary goals for the first release of this tool: (1) be every bit as configurable as the command-line tool without requiring users to input the same arguments on every migration and (2) stay out of the way unless specifically asked.
We took care of the first goal by allowing multiple layers of configuration. We currently use the tool by setting up common configuration in a Rails initializer, adding a few environment-specific options in our config/environments/* files, and (potentially, but not yet) using environment variables to override configuration for a specific run. We've written documentation about configuration here, so I won't bore you by explaining everything in this post. By encouraging configuration to live in code, users can be sure that every migration runs under the same conditions and can check who set a given option and why (assuming you use git).
For comparison's sake, to configure something similar to the command above, our initializer looks like:
GhostAdapter.setup do |config|
  config.max_load = "Threads_running=150"
  config.chunk_size = 1000
  config.critical_load = "Threads_running=4000"
  config.max_lag_millis = 3000
  config.throttle_control_replicas = "main.host.database.com"
  config.password = "really-great-db-password"
  config.user = "db-user"
  config.database = "wetransfer-database-name"
  config.host = "read-replica.host.database.com"
  config.dml_batch_size = 1000
  config.verbose = true
  config.cut_over = "default"
  config.assume_rbr = true
  config.concurrent_rowcount = true
  config.exact_rowcount = true
  config.cut_over_lock_timeout_seconds = 9
  config.default_retries = 1200
  config.assume_master_host = "main.host.database.com"
  config.panic_flag_file = "/tmp/ghost.panic.objects.flag"
  config.initially_drop_ghost_table = true
  config.postpone_cut_over_flag_file = "/tmp/ghost.postpone.objects.flag"
end
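The layering itself can be thought of as a series of merges where later layers win: initializer defaults, then environment-specific settings, then per-run environment variables. Here's a hypothetical sketch of that precedence (the 'GHOST_' variable-naming convention and the function are illustrative, not ghost_adapter's actual internals):

```ruby
# Hypothetical sketch of layered configuration resolution. Later
# layers override earlier ones; ENV values arrive as strings with an
# assumed 'GHOST_' prefix, e.g. 'GHOST_CHUNK_SIZE' => '2000'.
def resolve_config(initializer, environment, env_vars)
  overrides = env_vars.each_with_object({}) do |(key, value), acc|
    next unless key.start_with?('GHOST_')
    # 'GHOST_CHUNK_SIZE' becomes the option name :chunk_size
    acc[key.delete_prefix('GHOST_').downcase.to_sym] = value
  end
  initializer.merge(environment).merge(overrides)
end
```

The point of this ordering is that the broadest, most stable settings live in code under version control, while the narrowest layer (environment variables) exists only for one-off runs.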
The second goal came out of the principle that a tool installed specifically to migrate a database has absolutely no reason to be present when you're not running a migration. With this in mind, we've designed 'ghost_adapter' so the default behavior is to use ActiveRecord's Mysql2Adapter unless some condition indicates a migration is happening. We've created a Rails generator that will set up a rake task to ensure 'ghost_adapter' is used for all of the 'db:migrate'-like tasks. This can always be overridden using the GHOST_MIGRATE environment variable.
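That stay-out-of-the-way behavior boils down to a single decision point at connection time. A minimal sketch, assuming the symbols stand in for the real adapter classes and that truthy values for GHOST_MIGRATE look like the ones below:

```ruby
# Illustrative sketch: choose the database adapter based on the
# GHOST_MIGRATE environment variable. :mysql2 and :ghost_mysql2 are
# placeholders for ActiveRecord's Mysql2Adapter and the gh-ost-backed
# adapter; the accepted truthy values here are an assumption.
def adapter_for(env)
  ghost_enabled = %w[1 true yes].include?(env['GHOST_MIGRATE'].to_s.downcase)
  ghost_enabled ? :ghost_mysql2 : :mysql2
end
```

Because the default branch is the plain Mysql2Adapter, the production app never touches the migration machinery unless a migration task (or an operator) explicitly opts in.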
Development of Ghost Adapter is active, and we hope to build out more features that make this tool even easier to configure and safer to run. Any feature requests, bug reports, or general feedback are more than welcome on our GitHub repo.
If you made it this far, well, thanks for reading. So far, we're very happy with our usage of 'gh-ost' and Ghost Adapter for migrations. Since switching over, we've yet to see any table locks or database traffic spikes. We already feel substantially more confident before starting a database migration at any time of the day. If any of the problems I've described sound familiar, we would definitely encourage giving 'gh-ost' and Ghost Adapter a try!