Migrating millions of users to Auth0 without downtime
We were offering our products under the "by WeTransfer" slogan, but technically they shared very few components: each had its own infrastructure and database. A user could have an account in all three products, but those accounts would have nothing in common, except of course the email address.
Our goal was to allow a user with an account in one product to sign in automatically to the rest of the products using Single Sign On (SSO). This would not only improve the user experience and the value proposition, but would also allow us to work on a unified billing system and other product initiatives.
Just how did we do that with no downtime and with 80 million monthly active users across the three products?
Identity and Authentication Provider
Our first step in the journey of implementing SSO for our suite of products was to find the identity and authentication provider that worked for us. First, we listed some of the technical aspects that our provider had to support:
Migration of users without downtime
Migrate users with their current passwords
Allow our users to migrate their 2FA setup
Email customisations and translations
Hosting of PII in Europe
Excellent APIs, SDKs and up-to-date documentation
After implementing some Proof of Concepts with different providers and analysing the pros and cons, we decided to go with Auth0.
We knew we could fulfil our objective with Auth0, but only with some custom implementation, since our authentication flows were already quite complex and different from each other. Our first interaction with Auth0 was a workshop with one of their Professional Services Architects, where we discussed all the different flows and learned which of their features, and which workarounds, fit our needs. That was a key part of the implementation, since we were able to map all our flows to the different services they offer.
When we started thinking about implementing SSO for all our products, we knew it was going to be challenging, but, as usually happens, more open questions kept appearing during the implementation. For example: how do we support sign in with Slack, Google and Apple for existing users?
Our main goal was for our users to sign in without noticing that a lot of things changed in the authentication flow. How was that going to be possible, when we had 3 products, with web and mobile apps? We wanted to release the best UX for our current and future users.
We decided to divide the release into 2 different phases:
Phase 1: Lazy migration
Phase 2: Bulk import
Phase 1: Lazy migration
The objective of this phase was to allow users to sign in automatically with Auth0 using their existing known password. That was very straightforward for users with an account in only one product. But for those users with an account in more than one product, we needed a way to validate that it was the same person owning those accounts. For that, we implemented what we called the migration and merging flow.
Auth0 has something super powerful called custom databases. With a custom database enabled in your account, if a user doesn't exist in Auth0, you can implement get_user(email, callback) and login(email, password, callback) scripts that will be called when a user signs in or resets a password. In the login script, we checked if the user existed in any of our products with that password. If there was a match, we allowed the user to sign in and continue. If a user had more than one account, we redirected the user to a self-hosted service using Auth0 Rules.
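To make the flow concrete, here is a minimal sketch of what such a custom-database login script can look like. The product list, the `checkProductCredentials` helper and the in-memory credentials are hypothetical stand-ins for the HTTPS calls a real script would make to each product's backend.

```javascript
// Hypothetical stand-in for an HTTPS call to a product's credentials API.
// A real implementation would verify a password hash, never a plain password.
function checkProductCredentials(product, email, password) {
  const demoUsers = { transfer: { 'ana@example.com': 'hunter2' } };
  return Boolean(demoUsers[product] && demoUsers[product][email] === password);
}

// Auth0 calls login(email, password, callback) when the user is not in Auth0 yet.
function login(email, password, callback) {
  const products = ['transfer', 'paste', 'collect']; // hypothetical product list
  const matches = products.filter((p) => checkProductCredentials(p, email, password));

  if (matches.length === 0) {
    // In a real script this would be Auth0's WrongUsernameOrPasswordError.
    return callback(new Error('wrong username or password'));
  }

  // Return a profile; a Rule can later inspect app_metadata and redirect
  // users with accounts in multiple products into the merging flow.
  return callback(null, {
    user_id: `${matches[0]}|${email}`,
    email,
    app_metadata: { products: matches },
  });
}
```

The key design point is that the script only answers "do these credentials match somewhere?"; the decision about what to do with multi-product users is deferred to a Rule.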
In that service, we forced the user to reset their password and re-authenticate with the new generated password. By doing that, we made sure that the user had access to their inbox and that they owned that email address. Once the user finished that flow, the user would be redirected back to Auth0, where the Rule would resume and mark the user as "migrated".
Those rules were the most challenging part of the migration, since they had to handle errors properly, and support those scenarios where a user might exit the flow early and resume later. We didn't want a user to be blocked in that flow forever.
Auth0 Rules are implemented in NodeJS, and you can include any of the npm packages Auth0 supports in the Rules environment. It is also crucial to handle all exceptions properly: an unhandled exception can make the containers running your scripts start failing, causing downtime in your whole authentication flow.
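A heavily simplified sketch of such a Rule follows. The function name, the redirect URL and the `products`/`migrated` metadata fields are illustrative assumptions; what it shows is the resumable shape — redirect on the first pass, mark the user migrated when they return — with every error routed through the callback.

```javascript
// Sketch of an Auth0-style Rule: (user, context, callback) signature.
function multiAccountMigrationRule(user, context, callback) {
  try {
    const meta = user.app_metadata || {};

    // Already migrated: nothing to do, resume the login immediately.
    if (meta.migrated === true) {
      return callback(null, user, context);
    }

    const products = meta.products || [];
    if (context.protocol !== 'redirect-callback' && products.length > 1) {
      // First pass for a multi-product user: send them to the self-hosted
      // reset-and-merge service (hypothetical URL).
      context.redirect = { url: 'https://merge.example.com/start' };
      return callback(null, user, context);
    }

    // Single-product user, or the user is returning from the redirect:
    // mark them as migrated and let the login continue.
    meta.migrated = true;
    user.app_metadata = meta;
    return callback(null, user, context);
  } catch (err) {
    // Never let an exception escape: unhandled errors can crash the
    // containers running the Rules and take authentication down.
    return callback(err);
  }
}
```

Checking `context.protocol` against `redirect-callback` is what makes the flow resumable: a user who abandons the merge service simply hits the redirect branch again on their next sign-in instead of being stuck.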
We ran the lazy migration for a few months, which migrated our most active users.
Phase 2: Bulk import
Once we saw that the number of users being migrated per day was under a certain threshold, we moved into phase 2 of the release.
Auth0's Management API has an endpoint that allows you to import users in bulk. Our idea was to run this job while still running the lazy migration, because we still had thousands of users going through the lazy migration every day and the import script could take days to finish.
Our first task was to make sure that both migrations could co-exist. We did that by updating the rules and bulk importing the users with a boolean bulkImported: true as part of their app_metadata. We used that flag to skip the lazy migration for bulk imported users.
The bulk import job accepts several password hash formats, so your users can keep signing in with their old known password. To simplify the import, and to make sure a user with accounts in multiple products actually owned those accounts, we wanted to force these users to reset their password. One way to do that with Auth0 is to set an invalid bcrypt hash as part of the user JSON. For example:
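A user entry in the import file can look roughly like this (field values are illustrative; the point is that the bcrypt hash is garbage that can never match any password, so the only way forward is a reset):

```json
[
  {
    "email": "user@example.com",
    "email_verified": true,
    "password_hash": "$2b$10$invalidinvalidinvalidinvalidinvalidinvalidinvalid",
    "app_metadata": { "bulkImported": true }
  }
]
```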
Once such a user tries to sign in with their old password, the hash never matches and they see a customisable error message asking them to reset their password.
Now, there was another challenge: the bulk import endpoint only accepts JSON files of up to 500 KB and only processes 2 jobs concurrently. We also had to work within the rate limits Auth0 sets on its API, so we had to build a script that could run for days and retry if something failed.
We separated the script into 2 parts:
Exporting those users without an Auth0 user ID from our 3 databases into an AWS S3 Bucket. We exported our users from our database and ran a Python script to build the JSON files using the schema accepted by Auth0.
We built a NodeJS script that would run in our Kubernetes cluster. The script would read the files from S3, save the file names in a MySQL database and then call the Auth0 Management API endpoint to import the users.
We used the MySQL database to keep the state of each file processed. If a file failed to be imported to Auth0, we marked it to be retried later.
We finally used a semaphore to make sure we were processing at most 2 jobs at a time. Once a job was created, we polled until it finished. It is important to keep the API rate limits in mind when polling.
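The steps above can be sketched as a small concurrency-limited loop. The `createJob` and `getJobStatus` helpers are hypothetical wrappers around the Management API's job-creation and job-status endpoints, and the MySQL bookkeeping is reduced to a `failed` list:

```javascript
// Run bulk-import jobs for a list of files, at most `maxConcurrent` at a time.
// Returns the files that should be retried later.
async function runImport(files, { createJob, getJobStatus, maxConcurrent = 2, pollMs = 2000 }) {
  const queue = [...files];
  const failed = [];

  async function worker() {
    while (queue.length > 0) {
      const file = queue.shift();
      try {
        const jobId = await createJob(file);
        let status;
        do {
          // Poll until the job leaves "pending"; keep pollMs high enough
          // to stay within the Management API rate limits.
          await new Promise((resolve) => setTimeout(resolve, pollMs));
          status = await getJobStatus(jobId);
        } while (status === 'pending');
        if (status !== 'completed') failed.push(file);
      } catch (err) {
        // In the real script, the file would be marked in MySQL for retry.
        failed.push(file);
      }
    }
  }

  // Two workers draining one queue act as a simple semaphore:
  // at most two import jobs are in flight at any moment.
  await Promise.all(Array.from({ length: maxConcurrent }, () => worker()));
  return failed;
}
```

Because the state lives outside the process (in our case the MySQL table, here the returned list), the script can be killed and restarted at any point during its multi-day run without losing track of which files still need importing.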
Once the bulk import to Auth0 finished, we turned off the custom database import and we refactored the rules since the lazy migration was not needed anymore.
Since it would take months, or even years, for all users to become active again, we decided to backfill the Auth0 user ID in each product database once all users had been migrated to Auth0. Once that was done, we had completed the migration of millions of users to Auth0, with no downtime.
What did we learn?
Migrating millions of users with no downtime was quite challenging and full of learnings. Here are the main things we learned from the process:
Supporting social identities while doing a lazy migration is hard. Each identity works differently and a user could use different identities with the same email. Auth0 supports account linking to be able to merge users with multiple identities with the same email. If social is not key for your business, leave the social identities for another phase, or start with only one. You can add the rest later once you get more familiar with the challenges of supporting social identities.
If you have a big user base with a high volume of requests, pay attention to the Management API rate limits. It is very easy to reach them, at which point requests start being dropped, affecting both the users being migrated and all authentication flows. You can reduce the number of requests by caching information in the JWT ID token and keeping only authentication-related information in Auth0. Avoid storing business-related information, such as billing details, in Auth0.
Think about account ownership across products. Is it possible for an attacker to take control of another account with the same email? How do you prevent that? We decided to ask for credentials, or require a password reset, in those scenarios where we couldn't guarantee account ownership.
Before starting your migration, a good preventive measure is to generate a shared unique identifier for each user across products beforehand. You can run a script that generates the ID and updates the users whose email matches across databases. This might help you later if you have trouble syncing emails after the lazy migration, or if something else goes wrong and you need to match users between databases.
Prepare your support team for this release. There will always be bugs or scenarios you couldn't have anticipated. Some users might have very old devices you couldn't test. A tool for the support team to manually fix users might come in handy.
Testing multiple products while supporting the lazy migration was not an easy task. Build scripts to populate users, and make sure you can set up different social accounts. Slack workspaces are a good way of creating multiple social accounts for testing.
Add external logging and monitoring to your rules. We used AppSignal and Kibana. In AppSignal, we created different dashboards to monitor what was happening during the lazy migration. It was also a good way to understand user behaviour, which comes in handy when you need to release a fix, for example.
You can read more about our Auth0 implementation in their Case Study.