Brazil Datacenter Migration Adventure
Preamble (the story will be funnier if you know this): if you are not familiar with Brazil’s roadway system, specifically BR-101, the road that connects Rio de Janeiro with São Paulo, here is how it works: it is a long, winding coastal road, not the kind of highway you would want to trust with a truck full of servers.
The year was 2012. I was working at a company that had every single piece of hardware co-located in a datacenter in Rio de Janeiro. I was the infrastructure manager, responsible for maintaining the zoo and making sure everything ran smoothly; part of my job was also hunting for new hardware deals and finding better places to run the machines.
If you have ever been to Rio de Janeiro, you know that services operating down there are far from good, and when the subject is data centers, it is even worse. I had every single issue you could imagine in a data center, except fire. The fleet was composed of about 50 servers and some networking equipment spread across 4 racks, mostly running JBOD (just a bunch of disks) on cheap chassis; nothing fancy, no redundant power supplies, nothing. We were using a chassis brand called "Nilko", made in Brazil, of such poor quality that I once went straight from the server assembly room to the hospital after cutting my thumb on a chassis, but that is a story for another post...
As I said, part of my job was also looking for better datacenter deals, and I found a datacenter in São Paulo with better service at lower cost, so we decided to migrate. If you think migrating a cloud computing infrastructure from one region to another is complex nowadays, imagine carrying the actual servers from one premises to another in a moving truck overnight and having to re-assemble everything in the new datacenter while service availability was partially degraded. Just imagine the pressure!
Wait, a moving truck? Yes. That's what we could afford at the time, and without insurance! I don't think I need to spell out the risk of that; I'll leave it to your imagination :)
After working out the logistics, we decided to move 50% of the servers overnight. A typical drive from Rio to São Paulo takes approximately 6 hours, and we would start at midnight for obvious reasons: 12 AM to 6 AM was the window of lowest fleet utilization. But as soon as the sun rises, users wake up and start using their computers, demanding more capacity from my servers. If anything went wrong we were doomed, and we were racing against the clock from the first unscrewed bolt.
Migrate 4 racks of servers between 12 AM and 12 PM? Challenge accepted. My team was at the IDC at midnight and it was a clockwork operation. I started de-registering webservers from the load balancers and copying VLAN and routing configuration while part of the team removed the servers from the racks, another part wrapped them in bubble wrap, and another guy removed the rails, making sure to note which rail corresponded to which server and its assembly order in the rack. I wanted to copy the exact infrastructure footprint, just in another DC.
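For flavor, the de-registration step can be sketched against a 2012-era HAProxy config. The backend name, server names, and addresses below are invented for illustration; in a live setup you would drain servers via the admin socket or a config reload rather than editing a scratch file like this:

```shell
#!/bin/sh
# Hypothetical sketch: write a minimal HAProxy backend to a scratch file.
cat > haproxy.cfg <<'EOF'
backend www_backend
    server web01 10.0.0.11:80 check
    server web02 10.0.0.12:80 check
EOF

# "De-register" every server by appending HAProxy's `disabled` keyword,
# so the servers take no traffic while they are being unracked.
sed -i 's/check$/check disabled/' haproxy.cfg

# Show the result: both server lines are now marked disabled.
grep disabled haproxy.cfg
```

The same idea applies to the VLAN and routing configuration: snapshot it to files before unscrewing anything, so the footprint can be reproduced exactly at the destination.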
Total disassembly time: 1 hour. At 1 AM the truck driver had backed his truck up to the data center dock and we started loading the servers. At 1:30 AM he left for São Paulo via BR-101, a trip that would take him 6 hours.
My team of 5 guys and I went back home because we had a 6 AM flight to São Paulo departing from SDU (Santos Dumont Airport). We landed at 7 AM, took taxis to the data center, and arrived 10 minutes before the moving truck.
I couldn't believe what I was seeing when the back door of the truck opened. The servers were there, intact, perfectly delivered, in the same positions, ready and waiting to be installed in the new racks.
We resumed assembly by doing the reverse of the operation we did in Rio. One of the guys started mounting the rails on the racks, others were adding the servers, and we turned them on one by one to check that they were healthy. We decided to do all the network cabling last.
One mistake that delayed the assembly was not preparing the racks' networking in advance. Network cables were the majority of the cables in the racks, and I couldn't have two staff deploying servers and networking at the same time, so we had to install all the servers first and only then run the network cables.
Only two machines didn't power on: one had fallen to the floor (I believe that was the cause), and the other had power supply issues; it had already been malfunctioning before the migration.
It was a very interesting operation that could have cost us dearly.