The journey to 90% serverless at Comic Relief
The 90% figure is plucked out of thin air; it’s just me trying to pick a really big number without accurately calculating a percentage.
What is Red Nose Day?
Since its launch in 1988, Red Nose Day has become something of a British institution. It’s the day when people across the land can get together and raise money at home, school and work to support vulnerable people and communities in the UK and internationally.
Now that Red Nose Day 2019 is over, I want to give you a bit of a rundown of what I learned, the tooling that we used and the path we took on what turned out to be nearly a full cloud migration from a containerised/EC2 ecosystem into the world of serverless.
Upon starting at Comic Relief in April 2017, I joined a well-developed team that were already working in the cloud-native microservice world, with some legacy products operating on a fleet of EC2 servers (such as the website) and the rest either running on Pivotal Web Services (Cloud Foundry) or outsourced (Donation Platform).
Comic Relief was generally using Concourse CI (awesome tool) to deploy to Cloud Foundry or was using Jenkins for its legacy infrastructure. We frequently used Travis CI to run unit tests and do some code style tests, but have now moved over to CircleCI — thanks to a thorough comparison by Carlos Jimenez.
Nearly all of the code was written in PHP, using either Symfony or Slim Framework. I had quite a bit of experience with NodeJS from my previous role at Zodiac Media, and was a big fan of running the same language on the front end as on the back end and not having to context switch when swapping between the two.
The project was my first introduction to Lambda and the Serverless Framework, and also my first lesson in what works well in serverless. I created a serverless backend in NodeJS that accepted a POST request and then forwarded it on to RabbitMQ.
My colleague Andy Phipps (Senior Frontender) created a frontend in React, which we dropped into S3 with a CloudFront distribution in front of it. We then built a Concourse CI pipeline that ran the sls deploy command, ran some necessary Mocha tests against the staging deploy and then deployed to production. We did much the same for the frontend.
It was a liberating experience and became the foundation for everything that we would do after.
The next thing we wanted to do was to send an email from the contact service to our email service provider. I created another Serverless service that accepted messages either from RabbitMQ or via HTTP, formatted them for our email service provider and forwarded them on.
We then realized that we were using the same mailer code in our Symfony fundraising paying-in application, so we stripped out the code from payin and pointed it to our new mailer service. At this point, I started to realize that the majority of web development was some mix of presenting data and handling forms.
We then halted active development and steamed into Sport Relief 2018; this allowed us to test our assumptions about Serverless and gain some real-world experience under heavy load.
The revelations were as follows:
- CloudWatch is a pain to debug with quickly, but a good final source of truth.
- Self-hosted RabbitMQ was OK but took a lot to manage for a small team. Why weren’t we using SQS?
- We were duplicating a lot of the boilerplate code across the two projects that we had created.
- Our APIs needed to be documented automatically as part of our pipelines and in code.
- Serverless Framework was the future.
We then went into a significant business restructure, which meant that a lot of our web-ops team ended up leaving. The result was a remit to simplify the infrastructure that we were using so that a smaller group could manage it, bringing ownership, responsibility and control to the development team. The approach was championed by our Engineering Lead Peter Vanhee; without that, we would never have got to where we are now.
The next obvious target was our Gift Aid application; the most basic description of it is a form through which users submit their details so that we can claim Gift Aid on SMS submissions. The traffic hitting this application is generally very spiky off the back of a call to action on a BBC broadcast channel, ramping up from zero to tens of thousands of requests in a matter of seconds.
We traditionally had a vast fleet of application and Varnish servers to back this (150+ servers on EC2). As one of the most significant revenue sources, this gets a lot of traffic in the 5 hours that we are on mainstream TV and also in the lead up to it, so there was very little wiggle room to get it wrong.
At this point, we diverged from RabbitMQ and started deploying SQS queues from serverless.yml. My colleague Heleen Mol built a React app using create-react-app and react-router, hosted again on S3 with CloudFront in front of it. This was and is the foundation of every public-facing application we make; it can handle copious levels of load and takes zero maintenance.
At this point, it was apparent that we needed a good way to document our APIs alongside the code. We had previously been using Swagger on our giving pages. However, it seemed a bit of a pain to set up, and I wanted something static that could be chucked into S3 and forgotten. We settled on apiDoc as it looked like it would be quick to integrate and was targeted at RESTful JSON APIs.
The primary donation system was previously outsourced to a company called Armakuni, who had built an ultra-resilient multi-cloud architecture across AWS and Google Cloud Platform.
It really seemed like the next logical step to bring the donation system in-house. This allowed us to share components and styling from our Storybook and Pattern Lab across our products, significantly reducing the amount of duplication.
It should be noted that at this time we already had a payment service layer that had been built in previous years in Slim Framework which ran the Sport Relief app donation journey, our giving pages and shop.
As Peter (Engineering Lead) was heading away on paternity leave, the suggestion arose that if there was any time left after moving the Gift Aid backend over to Serverless, I could create a proof of concept for the donation system in Serverless Framework. We agreed that as long as I had my main tasks covered, I would be able to give it a go with any time I had left. I then went about smashing out all of my tasks quicker than I was used to, to get onto the fun stuff ASAP!
After talking to the super knowledgeable guys at Armakuni after the wrap-up from Sport Relief, it was clear that we needed to recreate the highly redundant and resilient architecture that Armakuni had created, but in a serverless world.
Users would trigger deltas as they passed through the donation steps on the platform, and these would go into an SQS queue. An SQS fan-out on the backend would then read the number of messages in the queue and trigger enough Lambdas to consume the messages but, most importantly, not overwhelm the backend services/database.
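To make the fan-out idea concrete, here is a minimal sketch of the sizing decision only: given the queue depth, work out how many consumers to trigger while staying under a cap that protects the database. The names (FanOutConfig, consumersToInvoke) and the numbers are made up for illustration and are not taken from our codebase.

```typescript
// Hypothetical tuning values; the real ones are not given in this post.
interface FanOutConfig {
  messagesPerConsumer: number; // messages one consumer Lambda is expected to drain
  maxConsumers: number; // hard cap that protects the backend services/database
}

// Decide how many consumer Lambdas to trigger for the current queue depth.
function consumersToInvoke(queueDepth: number, config: FanOutConfig): number {
  if (queueDepth <= 0) {
    return 0;
  }
  const needed = Math.ceil(queueDepth / config.messagesPerConsumer);
  // Never exceed the cap, however deep the queue gets.
  return Math.min(needed, config.maxConsumers);
}

const fanOutConfig: FanOutConfig = { messagesPerConsumer: 10, maxConsumers: 50 };
console.log(consumersToInvoke(25, fanOutConfig));
```

The cap is the important part: a deep queue means the backlog takes longer to drain, not that the database gets more connections.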
The API would load balance across the payment service providers (Stripe, Worldpay, Braintree & PayPal), allowing us to gain redundancy and reach the required 150 donations per second that would safely get us through the night of TV (it can handle much more than this).
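One simple way to balance across providers is a weighted random pick, where setting a weight to zero takes a provider out of rotation. The sketch below assumes that approach; the post does not say which strategy the real API uses, and ProviderWeight, pickProvider and the weights are hypothetical.

```typescript
type Provider = "stripe" | "worldpay" | "braintree" | "paypal";

interface ProviderWeight {
  provider: Provider;
  weight: number; // relative share of traffic; 0 removes the provider from rotation
}

// Pick a provider in proportion to its weight. `random` is injectable so the
// choice can be made deterministic in tests; it must return a value in [0, 1).
function pickProvider(
  weights: ProviderWeight[],
  random: () => number = Math.random
): Provider {
  const active = weights.filter((entry) => entry.weight > 0);
  const total = active.reduce((sum, entry) => sum + entry.weight, 0);
  let remaining = random() * total;
  let chosen: Provider | undefined;
  for (const entry of active) {
    chosen = entry.provider;
    remaining -= entry.weight;
    if (remaining < 0) {
      break;
    }
  }
  if (chosen === undefined) {
    throw new Error("No payment service provider is available");
  }
  return chosen;
}

// Hypothetical weights: PayPal is switched off here, as if it had an incident.
const providerWeights: ProviderWeight[] = [
  { provider: "stripe", weight: 2 },
  { provider: "worldpay", weight: 1 },
  { provider: "braintree", weight: 1 },
  { provider: "paypal", weight: 0 },
];
console.log(pickProvider(providerWeights));
```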
I initially put in AWS Parameter Store to store payment service provider configuration. This was free and therefore very attractive in a serverless world, but it proved woefully incapable under load and was swapped out for storing configuration in S3.
I then created a basic frontend that would serve up the relevant payment service provider frontend based on which provider the backend had selected. I imported all of the styles from the Comic Relief Pattern Lab and was ready to demo it to Peter and the team on his return.
Upon Peter returning, we went through the system, discussed its viability and did some necessary load tests using Serverless Artillery, concluding that we could do what we thought we couldn’t!
A business case was put together by Peter and our Product Lead Caroline Rennie, and away we went. At this point, Heleen Mol and Sidney Barrah came on board and added meat to the bones, getting the system ready to go live for the ever-impending night of TV.
Due to the nature of Red Nose Day, you don’t get many chances to test the system under peak load. We were struggling to get observability of what was going on in our functions using CloudWatch.
At this point, Peter recommended that we try a tool that he had come across, which was IOPipe. IOPipe gave us unbelievable observability over our functions and how a user is interacting with them; it changed how we used Serverless and increased our confidence levels substantially.
At this point we also integrated Sentry, which alongside IOPipe gave us the killer one-two punch of being able to get a 360 view of errors within our system, allowing us to quantify bugs for our QA team (led by Krupa Pammi) and trace the activity that caused them quickly and efficiently. I can’t think of a time when I have been able to have such an overview of everything going wrong. Pretty scary, but excellent.
The next big part of the puzzle was the realization that we were copying way too much code between our Serverless projects. I had a look at Middy based on a recommendation from Peter, but at the time there weren’t many plugins for it, so I decided to spin out our own Lambda wrapper rather than having to learn and write plugins for a new framework and possibly run into Middy’s limitations (probably none).
I am still not sure how bad an idea this was. However, it seems to work at scale, is easy to develop with and makes it simple to onboard new developers, which is enough for it to stay for the time being.
Lambda wrapper encompasses all of the code to handle API Gateway requests, connect and send messages to SQS queues and a load of other common functionality. Lambda wrapper resulted in a massive code reduction across all of our Serverless projects. It also meant that the integration of Sentry & IOPipe was common and simple across all of our projects.
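For a feel of what a wrapper like this takes off your hands, here is a minimal sketch of the idea. It is not the actual Lambda wrapper API: ApiEvent, ApiResponse, HandlerResult, jsonResponse and wrap are hypothetical names, and the event and response shapes are cut-down stand-ins for the API Gateway proxy format.

```typescript
// Cut-down stand-ins for the API Gateway proxy event and response.
interface ApiEvent {
  body: string | null;
}

interface ApiResponse {
  statusCode: number;
  headers: Record<string, string>;
  body: string;
}

interface HandlerResult {
  statusCode: number;
  payload: unknown;
}

type Handler<T> = (input: T) => Promise<HandlerResult>;

// Build a JSON response in one place so every service formats it the same way.
function jsonResponse(statusCode: number, payload: unknown): ApiResponse {
  return {
    statusCode,
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  };
}

// Wrap business logic with the boilerplate every endpoint needs:
// body parsing, error handling and response formatting.
function wrap<T>(handler: Handler<T>): (event: ApiEvent) => Promise<ApiResponse> {
  return async (event) => {
    let input: T;
    try {
      input = JSON.parse(event.body ?? "{}") as T;
    } catch {
      return jsonResponse(400, { message: "Request body is not valid JSON" });
    }
    try {
      const result = await handler(input);
      return jsonResponse(result.statusCode, result.payload);
    } catch {
      // A real wrapper would report the error to monitoring here.
      return jsonResponse(500, { message: "Internal error" });
    }
  };
}

// Hypothetical endpoint: the handler only contains business logic.
const contactHandler = wrap<{ email: string }>(async (input) => ({
  statusCode: 200,
  payload: { received: input.email },
}));
```

Because the error reporting hook lives in one place, wiring in monitoring tools once covers every service that uses the wrapper.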
To add extra redundancy to the project, we introduced an additional region and created a traffic routing policy based on a health check from a status endpoint. We figured the chance of losing two geographically separate AWS regions was very low.
We also backed up all deltas to S3 on a retention policy of 5 days, to ensure that we could replay all deltas in the event of an SQS or RDS failure. We added timing code to all outbound dependencies using IOPipe and also created a dependency management system so that we could quickly pull out dependencies (such as Google Tag Manager or reCAPTCHA) from external providers at speed.
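The timing idea can be sketched without any particular monitoring tool. The helper below is a generic stand-in, not IOPipe’s API: timed, Timing and the record callback are hypothetical names, and the clock is injectable purely so the behaviour can be tested.

```typescript
interface Timing {
  name: string;
  durationMs: number;
  ok: boolean;
}

// Run an outbound call and record how long it took, whether or not it succeeded.
async function timed<T>(
  name: string,
  call: () => Promise<T>,
  record: (timing: Timing) => void,
  now: () => number = Date.now
): Promise<T> {
  const start = now();
  try {
    const result = await call();
    record({ name, durationMs: now() - start, ok: true });
    return result;
  } catch (error) {
    record({ name, durationMs: now() - start, ok: false });
    throw error; // the caller still sees the original failure
  }
}

// Hypothetical usage: time a fake reCAPTCHA verification call.
const timings: Timing[] = [];
timed("recaptcha", async () => "verified", (timing) => {
  timings.push(timing);
});
```

With timings recorded per dependency, a slow external provider shows up by name, which makes the decision to pull it out much easier.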
Based on a suggestion from AWS, we also added a regional AWS Web Application Firewall (WAF) to all of our endpoints. This introduced some basic protections, including stuff we already had covered, but higher up the chain, before API Gateway was even touched.
Another piece of the puzzle was to get decent insights into our delta publishes and processing, which gives us another way to get a good overview of what is happening with our system. We used InfluxDB to do this and treat it as an optional dependency of our system.
It was important for us to understand what our application’s critical dependencies were, as these form our application health check status and determine whether we fail over to our backup region. InfluxDB is fantastic; however, it is self-hosted. When AWS Timestream comes along, this will be out the door.
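The split between critical and optional dependencies can be sketched as a small function behind the status endpoint. This is an illustration under assumptions: DependencyCheck and healthStatus are hypothetical names, and treating SQS and RDS as critical is a guess for the example (only InfluxDB is described as optional above).

```typescript
interface DependencyCheck {
  name: string;
  critical: boolean; // only critical dependencies affect the overall status
  healthy: boolean;
}

interface HealthStatus {
  statusCode: 200 | 503;
  failing: string[]; // every unhealthy dependency, for visibility
}

// Report unhealthy only when a critical dependency is down, so that an
// optional dependency failing never triggers a region failover.
function healthStatus(checks: DependencyCheck[]): HealthStatus {
  const failing = checks.filter((check) => !check.healthy).map((check) => check.name);
  const criticalDown = checks.some((check) => check.critical && !check.healthy);
  return { statusCode: criticalDown ? 503 : 200, failing };
}

// Hypothetical snapshot: the optional metrics store is down.
const snapshot: DependencyCheck[] = [
  { name: "sqs", critical: true, healthy: true },
  { name: "rds", critical: true, healthy: true },
  { name: "influxdb", critical: false, healthy: false },
];
console.log(healthStatus(snapshot));
```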
So the night of TV came and went on the 15th of March, and the system performed almost exactly as expected. The one unexpected, but now apparent, weak point was the amount of reporting that we were trying to pull from the RDS read replica using Grafana and our live income reporting. We lowered our reporting requirements and were back on track in no time.
We originally used RDS so that we could achieve compatibility with our legacy payment service layer. In the future we will probably replace this with something more Serverless, relying on AWS Timestream for more real-time analysis (when it arrives).
So to sum up this epic and overly long rundown of the journey to 90% Serverless:
- Try to get everything Serverless if you can; our highest monetary cost is RDS. It’s still nice to be able to run the SQL queries that we know and love, but Athena and S3 are probably a solid replacement.
- Try to ingest data and work on it away from user interaction. You can provide the user with an endpoint where they can check on the status of processing. Manage as much state as you can with your frontend. This will hopefully give you redundancy and protection as a default.
- Lambda allows you to load test at a significant scale, so do it often and make it part of your deployment/feature release strategy. Serverless pushes the load down the line and has a habit of finding weak points in your chain, so make sure you know where your weak points are going to be. Serverless Artillery is the way forward; do better than us and do it as part of your pipeline to production for the win!
- Continuously deploy, deploy on a Friday at 5 pm, don’t let fear stop you, create the tooling and automation tests to allow you not to worry. We use Nightwatch, Cypress and Mocha to significant effect. It should be noted that you need decent logging and a fast way to rollback code to be able to do this in a manageable way (Concourse CI).
- Serverless infrastructure cost is dependent on usage, so why not deploy your entire infrastructure at a pull request level and run tests in the PR against it? We do this, and it means that developers can be sure that before their code is merged, it works in real life and on our real-world infrastructure.
- Don’t host anything if you don’t have to; everything as a service. I am physically averse to calls about infrastructure outages at any time of the day. Also, go multi-region if you can. Serverless makes this a doddle, and it reduces the voice of the crowd who will remind you that S3 took out US-EAST-1 in February 2017.
- Pick a piece of your architecture, migrate it to Serverless, get comfortable with it, rinse and repeat.
- The best system is the one that allows me to be in the pub after 17:30 or be at home with my family not checking my laptop. Serverless for the backend and a React application stored in S3 for the frontend give you this.
- Concourse CI is probably one of the most expensive pieces of infrastructure that we are running, and it doesn’t fit in with our fully Serverless headspace. Replacing it would be great. However, the power and flexibility it gives us to deploy reliably and continuously are unmatched. Sometimes in life you can’t be all one thing; in this case, Serverless. We use Concourse UP to simplify its deployment and management, meaning that we don’t have to mess around with BOSH.
- Don’t try to optimize/abstract your services too early when it comes to Serverless. I remember at my first job where all the servers had names, they were cared for and loved and were then quickly replaced with EC2 when AWS entered the fray. Serverless brings the same down to your code and services; they should perform a function, be replaced with ease and doted over just the right amount. Compose small but relevant services and be ready for the day where you type sls remove on that much-loved service!
The biggest lesson for me is that in my day job, I exist to solve business issues. I think sometimes as technologists we forget this. Serverless is the fastest way to decouple oneself from rubbish problems, move up the stack and move on to the next issue.
Be sure to watch this presentation by our Engineering Lead Peter Vanhee talking through the current architecture at Serverless Computing London, as well as this presentation featuring our Product Lead Caroline Rennie around the previous donations platform and the problem space.