Efficiently deleting one million things per hour

For ten years people have known WeTransfer as the go-to service for getting their stuff from A to B. Simply, safely, and (dare we say so ourselves?) beautifully. But what about when the files reach B — what happens after that?

Transfer Konmari

About 55 million transfers are created every single month on wetransfer.com. Most of these transfers expire after a week which means they need to be deleted. Sounds pretty straightforward but, with an average of 20 transfers uploaded every second, that’s 20 transfers we need to delete per second just to keep up.

And sadly, no, we can’t just hit ‘delete all’.

We keep track of uploaded transfers, their senders and their intended recipients in a centralized MySQL database. The database tables have a simple structure with several child tables (containing things like download details and file metadata) belonging to a single parent table called transfers.
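
To make that concrete, here's a minimal sketch of what those relations might look like in ActiveRecord (the child model names match the ones used in the deletion code below; the exact associations are an assumption):

# Hypothetical sketch of the schema described above: one parent table,
# several child tables pointing back at it via a transfer_id foreign key.
class Transfer < ActiveRecord::Base
  has_many :recipients
  has_many :downloads
  has_many :file_entries
end

class Recipient < ActiveRecord::Base # Download and FileEntry follow the same pattern
  belongs_to :transfer
end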

Transfers often contain hundreds of files and can rack up millions of downloads. So when it comes to deleting 20 per second, that's a serious number of rows to take out in one go.

And we’ve got to stay on top of it. Not deleting the metadata would be a violation not only of people’s trust and expectations but also of regulations like the GDPR. Plus, these tables grow to multiple terabytes in size making them problematic to store.

Deleting so much in such a short space of time and at such a regular cadence was taking its toll.

For a long time, we would actually take the site offline for a couple of hours throughout the year and remove the rows manually, since trying to do it while the site was up would result in mysterious downtimes.

So, as WeTransfer continued to expand and develop, we decided to get to the bottom of the issue once and for all.

Problem #1 Too slow

Knowing which transfers we needed to delete was straightforward, since the tables have an indexed delete_at column. All our database tables are backed by ActiveRecord models, which already had the relevant associations defined between them. Naturally, our first attempt was something like this:

until finished() do
  some_transfers_to_delete = Transfer.where("delete_at < ?", Time.now).limit(BATCH_LIMIT)
  some_transfers_to_delete.destroy_all # instantiates every record, runs callbacks, then deletes row by row
end

It worked, but not as well as we'd hoped. Our logging showed that roughly 66% of the database time taken to destroy an object was spent in the COMMIT phase. A large percentage of the time was also spent allocating ActiveRecord objects and then immediately garbage collecting them.
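
We won't reproduce our logging setup here, but as a rough illustration, this kind of breakdown can be pulled from Rails' standard sql.active_record instrumentation (the aggregation below is purely a sketch, not our production code):

# Illustrative only: tally the time spent per statement type (SELECT, DELETE, COMMIT, …).
sql_time = Hash.new(0.0)
ActiveSupport::Notifications.subscribe("sql.active_record") do |_name, started, finished, _id, payload|
  verb = payload[:sql].to_s.split.first.to_s.upcase
  sql_time[verb] += (finished - started)
end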

So our next iteration deleted multiple transfers per transaction and skipped allocating ActiveRecord models altogether:

until finished() do
  some_transfers_ids_to_delete = Transfer.where("delete_at < ?", Time.now).limit(BATCH_LIMIT).pluck(:id)
  ActiveRecord::Base.transaction do
    Recipient.where(transfer_id: some_transfers_ids_to_delete).delete_all # generates DELETE FROM recipients WHERE transfer_id IN (1,2,3 …);
    Download.where(transfer_id: some_transfers_ids_to_delete).delete_all
    FileEntry.where(transfer_id: some_transfers_ids_to_delete).delete_all
    Transfer.where(id: some_transfers_ids_to_delete).delete_all
  end
end

This helped us reach an acceptable speed, deleting just under 6000 rows per second across all the tables. Sadly, it didn't last, and we ran into an unpleasant surprise when running the script for longer periods of time.

Problem #2 Too fast

InnoDB, the storage engine for MySQL, returns a positive response to a COMMIT as soon as the transaction has been written to its redo log. At that point, even if the server crashes, the transaction can be recovered from the redo log. The engine also marks the relevant pages in its in-memory cache of disk pages (the buffer pool) as 'dirty'.

One of InnoDB's background threads is responsible for writing these dirty pages to the underlying storage, after which the corresponding redo log space can be reclaimed. That works fine - right up until the checkpoint age gets too large.

The checkpoint age is the amount of redo log, in bytes, holding changes that have not yet been flushed to the data files. If it reaches its maximum (bounded by the size of the redo log), InnoDB stops accepting new transactions and your database becomes unresponsive while background threads work to bring the checkpoint age back down as fast as they can. If your application depends on the database (like ours does), this usually means downtime. When issuing commands that change many different pages (like deleting millions of rows), it's quite easy to push the checkpoint age up faster than the background threads can bring it back down. Clearly, we had to slow down our delete scripts, but by how much?

Problem #3 Just right?

Running the script too fast would risk downtime if we misjudged our traffic at any point. But leaving it slow enough to be reasonably sure we wouldn't hit the maximum checkpoint age meant running much slower than we wanted to. To tackle the problem we resorted to a very straightforward feedback controller, limiting the checkpoint age to about 66% of its maximum value:

until finished() do
  some_transfers_ids_to_delete = Transfer.where("delete_at < ?", Time.now).limit(BATCH_LIMIT).pluck(:id)
  ActiveRecord::Base.transaction do
    Recipient.where(transfer_id: some_transfers_ids_to_delete).delete_all # generates DELETE FROM recipients WHERE transfer_id IN (1,2,3 …);
    Download.where(transfer_id: some_transfers_ids_to_delete).delete_all
    FileEntry.where(transfer_id: some_transfers_ids_to_delete).delete_all
    Transfer.where(id: some_transfers_ids_to_delete).delete_all
  end
  sleep 10 if checkpoint_age() > (2.0 / 3) * MAX_ALLOWED_CHECKPOINT_AGE # 2/3 would be integer division (zero) in Ruby
end
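
The checkpoint_age() call above is just a placeholder for however you read the value from MySQL. As a rough sketch (assuming stock MySQL and the mysql2 adapter; the parsing is our own illustration), it can be approximated as the gap between the current log sequence number and the last checkpoint reported by SHOW ENGINE INNODB STATUS:

# Hypothetical helper: derive the checkpoint age (in bytes) by parsing
# SHOW ENGINE INNODB STATUS. The labels below match stock MySQL output,
# but can differ between versions, so treat this as a sketch.
def checkpoint_age
  innodb_status = ActiveRecord::Base.connection.select_one("SHOW ENGINE INNODB STATUS").fetch("Status")
  log_sequence_number = innodb_status[/Log sequence number\s+(\d+)/, 1].to_i
  last_checkpoint_at  = innodb_status[/Last checkpoint at\s+(\d+)/, 1].to_i
  log_sequence_number - last_checkpoint_at
end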

After this, the checkpoint age no longer reached unsafe levels - success. But there was one problem left.

While most transfers are only downloaded a handful of times, some are downloaded in the millions. Deleting all those rows in one go could still exceed the maximum checkpoint age with a single query, rendering our protective check pretty useless.

We found this out the hard way when a single transfer (uploaded by a certain mononymous musician) was downloaded over 19 million times. Luckily MySQL supports LIMIT clauses on DELETE queries. So instead of having a single large query of an unknown length, we can chop it up into an unknown (but finite) number of transactions of a known length. We can also verify our checkpoint age after each of those queries to make sure we're not overloading the database:

num_deleted = MAXIMUM_NUMBER_OF_ROWS_PER_TRANSACTION
until num_deleted < MAXIMUM_NUMBER_OF_ROWS_PER_TRANSACTION do
  ActiveRecord::Base.transaction do
    # connection.delete returns the number of rows removed by the statement
    num_deleted = ActiveRecord::Base.connection.delete("DELETE FROM … WHERE … LIMIT …")
  end
  sleep 10 if checkpoint_age() > MAX_ALLOWED_CHECKPOINT_AGE
end

With this final addition, the problem appeared to be solved. The code has been up and running unattended without a single problem yet (knock on wood).

Conclusion

Going from too slow to too fast, we finally found the Goldilocks zone of transfer-deleting speeds. And though it's taken several iterations to get here, the amount of technical debt it's cleared is well worth it. We've also been able to apply automated checkpoint age monitoring to other write-heavy operations, proving that investing in a deep understanding of our tools certainly pays off (and deleting one million things per hour is unbelievably satisfying).