The past week has been extremely exciting and nerve-wracking. My team has finally completed the migration from on-premises to the cloud. It's the first time I've done anything like this, and I'm blessed to have had someone senior lead us through the migration period.
PS: I wrote this but forgot to post it, so this actually happened 2-3 weeks ago.
I'm a part of the MyCareersFutureSG team, so our users are the working population of Singapore and we host hundreds of thousands of job postings, so there were definitely some challenges in migrating the data.
It's the first time I've handled such huge amounts of data when migrating across platforms, and the validation and verification process is really scary, especially when we couldn't get the two checksums to match. It's also the first time I've done multiple Kubernetes cluster base image upgrade rollovers. There were multiple occasions where we were scared the cluster would completely crash, but it managed to survive the transition.
Let me sum up the things I've learnt over the course of the migration.
- When faced with large amounts of data, divide and conquer. Split the data into smaller subsets so that you have enough resources to compute (there's a sketch of this after the list).
- When rolling nodes, having two separate auto scaling groups lets you test the new image before rolling every single node.
- If you want to tweak the ASG itself, detach all the nodes first so that you have an "unmanaged" cluster; then no matter what you do to the existing ASG, your cluster will at least stay up (sketch below).
- When your database tells you that the checksums don't match, make sure that when you dump the data, it's in the right collation or encoding format (small illustration below).
- Point your error pages at a static provider like S3, because if you point them at a live resource, there's a chance that a misconfiguration will show an ugly 503 message (something that happened briefly for us; sketch below).
- Data sets under 100GB are somewhat reasonable to migrate over the internet these days.
- Running checksum hashes on thousands and thousands of files is quite computationally and memory intensive, so provision enough resources for it (the first sketch below keeps memory bounded by hashing in chunks).
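
To make the divide-and-conquer and checksum-hashing points concrete, here's a minimal sketch of how you might hash a large tree of files: each file is read in fixed-size chunks so memory stays bounded, and the files are split across worker processes. The paths, chunk size, and worker count are placeholders rather than what we actually ran.

```python
import hashlib
from pathlib import Path
from concurrent.futures import ProcessPoolExecutor

CHUNK_SIZE = 8 * 1024 * 1024  # read 8 MB at a time so memory stays bounded


def file_sha256(path: Path) -> tuple[str, str]:
    """Hash one file in fixed-size chunks instead of loading it whole."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            digest.update(chunk)
    return str(path), digest.hexdigest()


def hash_tree(root: str, workers: int = 8) -> dict[str, str]:
    """Divide and conquer: spread the files across worker processes."""
    files = [p for p in Path(root).rglob("*") if p.is_file()]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(file_sha256, files))


if __name__ == "__main__":
    # Run the same thing on the old and new environments, then diff the output.
    for path, digest in sorted(hash_tree("/data/exports").items()):
        print(f"{digest}  {path}")
```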
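
For the "detach the nodes first" tip, this is roughly what the call looks like with boto3, purely as a sketch: the region and ASG name are placeholders, and for larger groups you may need to batch the instance IDs.

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="ap-southeast-1")

ASG_NAME = "worker-nodes-old"  # placeholder, not our real ASG name

# Find the instances the ASG currently manages.
groups = autoscaling.describe_auto_scaling_groups(AutoScalingGroupNames=[ASG_NAME])
instance_ids = [i["InstanceId"] for i in groups["AutoScalingGroups"][0]["Instances"]]

# Detach without terminating: the EC2 nodes (and the pods on them) keep
# running, but the ASG no longer manages their lifecycle, so you can tweak
# or replace the ASG without taking down the live cluster.
autoscaling.detach_instances(
    AutoScalingGroupName=ASG_NAME,
    InstanceIds=instance_ids,
    ShouldDecrementDesiredCapacity=True,
)
```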
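
On the collation/encoding point, the exact dump flags depend on your database, but the underlying gotcha is easy to reproduce: the same text serialised with two different encodings hashes to two different values, so the checksums will never match even though the data is "the same". A tiny illustration (the record here is made up):

```python
import hashlib

# The same logical record, serialised with two different encodings.
record = "Résumé for 陈小明"

utf8_dump = record.encode("utf-8")
utf16_dump = record.encode("utf-16")

print(hashlib.sha256(utf8_dump).hexdigest())
print(hashlib.sha256(utf16_dump).hexdigest())
# Two different digests for the "same" data, so the checksums can't match.
# Pin the encoding (and collation) on both the source and target dumps.
```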
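
And for the static error pages, one option is to keep the page as a plain object on S3. This is a hedged sketch with a made-up bucket name; you'd still point your load balancer or CDN's custom error response at the object's URL.

```python
import boto3

s3 = boto3.client("s3", region_name="ap-southeast-1")
BUCKET = "my-error-pages"  # placeholder bucket name

# Keep the error page as a static object so it can't be taken down by
# whatever is misbehaving behind the load balancer.
s3.put_object(
    Bucket=BUCKET,
    Key="503.html",
    Body=b"<h1>We'll be right back shortly</h1>",
    ContentType="text/html",
)

# The CDN / load balancer's custom error response then points at this
# object's URL instead of at a live backend.
```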
Overall, the migration actually went quite well and we completed it ahead of time. Of course, the testing afterwards is where we found bugs that had never surfaced before, because it was the first time in years that so many eyes were on the system at once.
The smoothness is also thanks to the team, who carefully planned the steps required to migrate the data over and set up streaming backups to the new infrastructure, so half of the data was already in place and we just needed to verify that the streamed data was bit-perfect.
Since it's been a couple of weeks since this happened, I've realized how lucky I am to be blessed with the opportunity to do something like this. I've just caught up with my friends, and most of the time their job scopes don't really allow them to do something that far out of scope. Which, depending on your stage of life, could be viewed as a pro or a con. I'm definitely viewing this 4-day migration effort over a public holiday weekend as a positive, because it's something not everyone gets to experience so early in their career!