My name is Matthew Cianfarani, and I am the Vice President of Web Services for VATSIM. Yesterday, during the launch of bookings for one of our most significant events, we hit some technical errors that caused stress for some of our users.
My team and I would like to take a few minutes to offer some transparency into what happened: what we did amazingly well, and what we can improve on for the future.
First, I would like to acknowledge that I have a fantastic team. From my Assistant, Aidan Stevens (who was the mastermind of this entire project), through to contributing developers like Alex Long, Harrison Scott, Andrew Ogden, and Eoin Motherway, they have spent the past weeks pouring their hearts and souls into the redevelopment of the Cross the Pond website. The project consists of over 26,000 lines of code. I have watched my team working at all hours of the night to bring you, the members, an experience we can all be proud of.
Bookings opened at 1700z, and we eagerly watched the site come to life. Within five minutes, 51 percent of all slots had been booked. It was a fantastic feeling for my team to watch weeks of hard work come to fruition.
However, shortly after, we began to see something strange across the fleet of servers that run VATSIM’s Web Services. A system we have run for years to manage our Single Sign-On (SSO) service had started to malfunction: the load on the SSO system was simply too high. This began a spiraling effect, causing other systems across the network to function suboptimally.
These errors within SSO were the cause of the trouble some of our members experienced.
Let’s take a moment to look at some data on how the actual CTP website performed, and share some information on how we architected the site. At this point, I am handing the next few paragraphs over to my Assistant, Aidan Stevens.
The Cross The Pond website and infrastructure were architected with stability and scalability at the forefront. This includes a brand new server stack, eons ahead of our prior setup. VATSIM Web Services deployed two application servers behind an application-layer load balancer, with a separate database (MySQL) server as well as a separate queue and cache (Redis) server. These servers all sit within the same VLAN, allowing all database, queue, and cache traffic to travel over private bandwidth, optimizing performance and security. This stack is complemented by AWS S3 storage and Pusher for WebSocket functionality. All of this sits behind CloudFlare Argo, optimizing routing, requests, and caching. In short, it rocks. And it held up under pressure.
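As a rough illustration of the front of that stack (this is a sketch, not our actual load balancer configuration; the server names and health-check logic below are invented), an application-layer balancer conceptually hands each incoming request to the next healthy application server in round-robin order:

```python
from itertools import cycle

# Hypothetical backend names; the real hosts and health checks differ.
APP_SERVERS = ["app-1.internal", "app-2.internal"]

class RoundRobinBalancer:
    """Minimal round-robin balancer over a pool of application servers."""

    def __init__(self, servers):
        self.servers = list(servers)
        self._pool = cycle(self.servers)   # endless round-robin iterator
        self.healthy = set(self.servers)   # servers currently passing checks

    def mark_down(self, server):
        self.healthy.discard(server)

    def mark_up(self, server):
        self.healthy.add(server)

    def pick(self):
        """Return the next healthy server, skipping any that are down."""
        for _ in range(len(self.servers)):
            server = next(self._pool)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy application servers")

balancer = RoundRobinBalancer(APP_SERVERS)
targets = [balancer.pick() for _ in range(4)]  # alternates between the two servers
```

Because neither application server saw downtime, the pool stayed at full strength throughout the event; the `mark_down` path is the safety valve for the case that did not happen.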
As soon as we opened bookings on the site, we began to see the requests flow in. At the peak, our load balancer was handling 114 requests per second; that’s almost 7,000 requests per minute! All of these database-intensive requests were serviced with no downtime on the application servers, the database server, or the cache and queue server. We saw in excess of 600,000 requests flow through CloudFlare Argo, with nearly 200,000 of those coming between 1700z and 1800z.
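Those figures are easy to sanity-check. The peak rate scales up to the per-minute number quoted above, and the hourly total implies a much lower average rate, which shows just how bursty the opening minutes were:

```python
peak_rps = 114
per_minute = peak_rps * 60        # 6,840: "almost 7,000 requests per minute"

# Average rate implied by ~200,000 requests in the 1700z-1800z hour:
hour_requests = 200_000
avg_rps = hour_requests / 3600    # roughly 55.6 requests per second on average

print(per_minute, round(avg_rps, 1))
```

The peak (114 req/s) running at roughly double the hourly average is the signature of a booking rush concentrated in the first few minutes.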
As bookings began to be reserved and released, our WebSocket functionality kicked in. This is the system that made the slots automagically disappear and reappear in real-time, right before your very eyes. Overall, we saw just below 10,500 messages flow through our WebSocket service, Pusher.
That sure is a lot of reservations and releases! Each of these messages was triggered, queued, sent, received by each client, and then handled appropriately whenever a slot was removed from or re-added to the list.
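The reserve-and-release flow can be sketched with a toy in-memory channel (production uses Pusher over WebSockets; the class, event names, and slot ID below are illustrative only, not our actual code):

```python
class SlotChannel:
    """Toy stand-in for a WebSocket channel: reserving or releasing a
    slot publishes an event that every subscribed client receives."""

    def __init__(self):
        self.subscribers = []   # handlers invoked for each published event
        self.available = set()  # slot IDs currently bookable

    def subscribe(self, handler):
        self.subscribers.append(handler)

    def _publish(self, event, slot_id):
        for handler in self.subscribers:
            handler(event, slot_id)

    def add_slot(self, slot_id):
        self.available.add(slot_id)
        self._publish("slot-added", slot_id)

    def reserve(self, slot_id):
        if slot_id in self.available:
            self.available.remove(slot_id)
            self._publish("slot-removed", slot_id)  # slot vanishes from every client's list

    def release(self, slot_id):
        self.available.add(slot_id)
        self._publish("slot-added", slot_id)        # slot reappears in real time

# One "client" keeping its local slot list in sync ("EGLL-1730" is a made-up slot ID):
events = []
channel = SlotChannel()
channel.subscribe(lambda event, slot: events.append((event, slot)))
channel.add_slot("EGLL-1730")
channel.reserve("EGLL-1730")
channel.release("EGLL-1730")
```

Every booking action fans out one small message to every connected browser, which is why roughly 10,500 messages covered an entire event's worth of reservations and releases.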
What about the gremlins in SSO?
If you have made it this far into the post, thank you! SSO is an antiquated system, but it works and performs admirably for the vast majority of our use cases. Without it, VATSIM would be stuck in a web-based stone age.
However, today we learned that this system does not perform well under load, especially record-breaking load. We honestly could not have predicted that it would fail today.
Our promise to our members is that we will continue to work hard to upgrade, replace, and revamp our Web Systems. We have already done a lot of this work behind the scenes, and CTP was our first “public-facing” project.
We can, and will do better.
Vice President - Web Services