The journey to 90% Serverless @ Comic Relief

The 90% figure is plucked out of thin air; it’s just me trying to pick a really big number without accurately calculating a percentage.

Now that Red Nose Day 2019 is over, I want to give you a bit of a rundown of my learnings, the tooling that we used and the path taken on what turned out to be nearly a full cloud migration from a containerised/EC2 ecosystem into the world of Serverless.

Upon starting at Comic Relief in April 2017, I joined a well-developed team that was already working in the cloud-native microservice world, with some legacy products operating on a fleet of EC2 servers (such as the website) and the rest either running on Pivotal Web Services (Cloud Foundry) or outsourced (Donation Platform). Comic Relief was generally using Concourse CI (awesome tool) to deploy to Cloud Foundry, or Jenkins for the legacy infrastructure. We frequently used Travis CI to run unit tests and do some code style tests, but have now moved over to CircleCI (thanks to a thorough comparison by Carlos Jimenez). Nearly all of the code was written in PHP, using either Symfony or Slim Framework. I had quite a bit of experience with NodeJS from my previous role at Zodiac Media and was a big fan of running the same language on the front as on the back and not having to context switch when swapping between the two.

Pretty much the first job that I was tasked with on arrival was to bring the contact form, which was part of the Drupal site, into the Serverless world. The project was my first introduction to Lambda and Serverless Framework and also my first lesson as to what works well in Serverless. I created a Serverless backend in NodeJS that accepted a POST request and then forwarded it on to RabbitMQ. My colleague Andy Phipps (Senior Frontender) created a frontend in React, which we dropped into S3 with a CloudFront distribution in front of it. We then built a Concourse CI pipeline that ran the sls deploy command, ran some necessary Mocha tests against the staging deploy and then deployed to production, and did much the same for the frontend. It was a liberating experience and became the foundation for everything that we would do after.

The contact service backend
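To make the shape of that backend concrete, here is a minimal sketch of a handler that accepts a POST and hands the message to a queue. It is not the code we shipped: the event and response shapes, the field names (email, message) and the injected publish callback are all assumptions made for illustration, and the callback stands in for a real RabbitMQ client, which would be asynchronous.

```typescript
// Hypothetical shapes: only the fields this sketch needs from an API Gateway event.
interface HttpEvent {
  httpMethod: string;
  body: string | null;
}

interface HttpResponse {
  statusCode: number;
  body: string;
}

// Stand-in for the queue client; a real publisher would talk to RabbitMQ asynchronously.
type Publish = (message: string) => void;

function jsonResponse(statusCode: number, data: unknown): HttpResponse {
  return { statusCode, body: JSON.stringify(data) };
}

function handleContactPost(event: HttpEvent, publish: Publish): HttpResponse {
  if (event.httpMethod !== "POST") {
    return jsonResponse(405, { error: "Method not allowed" });
  }

  // Reject anything that is not valid JSON before touching the queue.
  let payload: unknown;
  try {
    payload = JSON.parse(event.body ?? "");
  } catch {
    return jsonResponse(400, { error: "Body must be valid JSON" });
  }

  // Assumed contact form fields; the real form may carry different ones.
  const fields = (payload ?? {}) as { email?: unknown; message?: unknown };
  const { email, message } = fields;
  if (typeof email !== "string" || typeof message !== "string" || message.trim() === "") {
    return jsonResponse(422, { error: "email and message are required" });
  }

  // Forward the validated message and acknowledge immediately.
  publish(JSON.stringify({ email, message }));
  return jsonResponse(202, { status: "queued" });
}
```

Returning 202 rather than 200 is a design choice in this sketch: it signals that the message has been accepted for processing, not that the email has been sent.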

The next thing we wanted to do was to send an email from the contact service to our email service provider. I created another Serverless service that accepted messages either from RabbitMQ or via HTTP, formatted them for our email service provider and forwarded them on. We then realised that we were using the same mailer code in our Symfony fundraising paying-in application, so we stripped out the code from payin and pointed it to our new mailer service. At this point, I started to realise that the majority of web development was some mix of presenting data and handling forms.

The mailer pipeline

We then halted active development and steamed into Sport Relief 2018; this allowed us to test our assumptions about Serverless and gain some real-world experience under heavy load. The revelations were as follows:

  • CloudWatch is a pain to debug with quickly, but a good final source of truth.
  • Self-hosted RabbitMQ was OK but took a lot to manage for a small team. Why weren’t we using SQS?
  • We were duplicating a lot of the boilerplate code across the two projects that we had created.
  • Our APIs needed to be documented automatically as part of our pipelines and in code.
  • Serverless Framework was the future.

We then went into a significant business restructure, which meant that a lot of our web-ops team ended up leaving. The result was a remit to simplify the infrastructure that we were using so that a smaller group could manage it, bringing ownership, responsibility and control to the development team. The approach was championed by our Engineering Lead Peter Vanhee; without that, where we are now would never have happened.

The next obvious target was our Gift Aid application; the most basic description of it is a form where users submit their details so that we can claim Gift Aid on SMS submissions. The traffic hitting this application is generally very spiky off the back of a call to action on a BBC broadcast channel, ramping up from zero to tens of thousands of requests in a matter of seconds. We traditionally had a vast fleet of application and Varnish servers to back this. As one of the most significant revenue sources, this gets a lot of traffic in the 5 hours that we are on mainstream TV and also in the lead up to it, so there was very little wiggle room to get it wrong. At this point, we diverged from RabbitMQ and started deploying SQS queues from serverless.yml. My colleague Heleen Mol built a React app using create-react-app and react-router, hosted again on S3 with CloudFront in front of it. This was and is the foundation of every public-facing application we make; it can handle copious levels of load and takes zero maintenance.

The Gift Aid Site

At this point, it was apparent that we needed a good way to document our APIs alongside the code. We had previously been using Swagger on our giving pages. However, it seemed a bit of a pain to set up, and I wanted something static that could be chucked into S3 and forgotten. We settled on apiDoc as it looked like it would be quick to integrate and was targeted at RESTful JSON APIs.

Comment Block from Payments API
apiDoc documentation from Payments API
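For anyone who has not seen apiDoc, the documentation lives in a comment block next to the handler, and the static site is generated from those comments. The block below is an illustrative sketch only: the /donations endpoint, its fields and the createDonation function are hypothetical and are not taken from our Payments API.

```typescript
// Hypothetical request/response shapes for the documented endpoint.
interface DonationRequest {
  amount: number; // in pence
  currency: string;
}

interface DonationResponse {
  id: string;
  status: "pending";
}

/**
 * @api {post} /donations Create a donation
 * @apiName CreateDonation
 * @apiGroup Donations
 * @apiVersion 1.0.0
 *
 * @apiParam {Number} amount Donation amount in pence.
 * @apiParam {String} currency ISO 4217 currency code, e.g. GBP.
 *
 * @apiSuccess {String} id Identifier of the created donation.
 * @apiSuccess {String} status Always "pending" until the payment provider confirms.
 *
 * @apiError (400) InvalidAmount The amount was not a positive whole number.
 */
function createDonation(request: DonationRequest, id: string): DonationResponse {
  // Mirrors the documented InvalidAmount error.
  if (!Number.isInteger(request.amount) || request.amount <= 0) {
    throw new Error("InvalidAmount");
  }
  return { id, status: "pending" };
}
```

Because the comment sits directly above the code it describes, a change to the handler and a change to its documentation end up in the same pull request.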

The primary donation system was previously outsourced to a company called Armakuni, who had built an ultra-resilient multi-cloud architecture across AWS and Google Cloud Platform.

With our clear remit to consolidate our stack and the successes that we had already had in building a donation system for our Sport Relief app in Symfony (a fork of our Paying In application), it really seemed like the next logical step to bring the donation system in-house. This allowed us to share components and styling from our Storybook and Pattern Lab across our products, greatly reducing the amount of duplication.

It should be noted that at this time we already had a payment service layer that had been built in previous years in Slim Framework which ran the Sport Relief app donation journey, our giving pages and shop.

As Peter (Engineering Lead) was heading away on paternity leave, the suggestion arose that if there was any time after moving the Gift Aid backend over to Serverless, I could create a proof of concept for the donation system in Serverless Framework. We agreed that as long as I had my main tasks covered, I would be able to give it a go with any time I had left. I then went about smashing out all of my tasks quicker than I was used to, to get onto the fun stuff ASAP!

After talking to the super knowledgeable guys at Armakuni after the wrap-up from Sport Relief, it was clear that we needed to recreate the highly redundant and resilient architecture that Armakuni had created, but in a Serverless world. Users would trigger deltas as they passed through the donation steps on the platform; these would go into an SQS queue, and then an SQS fan-out on the backend would read the number of messages in the queue and trigger enough Lambdas to consume the messages, but most importantly not overwhelm the backend services/database. The API would load balance the payment service providers (Stripe, Worldpay, Braintree & PayPal), allowing us to gain redundancy and reach the required 150 donations per second that would safely get us through the night of TV (it can handle much more than this). I initially used AWS Parameter Store to hold the payment service provider configuration; this was free and therefore very attractive in a serverless world, but it proved woefully incapable under load and was swapped out for storing configuration in S3.
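The core of the fan-out is a small sizing decision: given the queue depth, how many consumers should be started without exceeding what the database can take? A minimal sketch of that decision follows. The function name, the batchSize and maxConcurrency parameters and the numbers used in the usage note are assumptions for illustration; in practice the queue depth could come from the SQS ApproximateNumberOfMessages attribute.

```typescript
interface FanOutConfig {
  batchSize: number;      // messages each consumer invocation is expected to handle
  maxConcurrency: number; // ceiling that protects the backend services/database
}

// Decide how many consumer Lambdas to trigger for the current queue depth.
function consumersToInvoke(visibleMessages: number, config: FanOutConfig): number {
  if (visibleMessages <= 0) {
    return 0;
  }
  // Enough consumers to drain the queue in one pass...
  const needed = Math.ceil(visibleMessages / config.batchSize);
  // ...but never more than the backend can safely absorb.
  return Math.min(needed, config.maxConcurrency);
}
```

With a batch size of 10 and a ceiling of 50, a queue of 25 messages starts 3 consumers, while a queue of 10,000 messages still starts only 50; the rest of the backlog waits in SQS for the next pass instead of hitting the database all at once.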

I then created a basic frontend that would serve up the payment service provider frontend based on which provider the backend selected. I imported all of the styles over from the Comic Relief Pattern Lab and was ready to demo it to Peter and the team on his return.
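One simple way to balance load across providers is a weighted pick that skips any provider that has been switched off. The sketch below shows that idea only; the weights, the enabled flag and the pickProvider function are assumptions for illustration and do not describe how our API actually chooses a provider. The random number is passed in so the behaviour is easy to test.

```typescript
interface Provider {
  name: string;
  weight: number;   // relative share of traffic (assumed, not a real configuration)
  enabled: boolean; // lets a failing provider be pulled out of rotation
}

// Pick a provider in proportion to its weight. `random` must be in the range [0, 1).
function pickProvider(providers: Provider[], random: number): Provider {
  const candidates = providers.filter((p) => p.enabled && p.weight > 0);
  const total = candidates.reduce((sum, p) => sum + p.weight, 0);

  // Walk the candidates, subtracting weights until the threshold is used up.
  let threshold = random * total;
  let chosen: Provider | undefined;
  for (const candidate of candidates) {
    chosen = candidate;
    threshold -= candidate.weight;
    if (threshold < 0) {
      break;
    }
  }

  if (chosen === undefined) {
    throw new Error("No payment provider available");
  }
  return chosen;
}
```

In use, the caller would pass Math.random() and return the chosen provider's name to the frontend, which then renders the matching payment form.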

The donation platform

Upon Peter returning, we went through the system, discussed its viability and did some necessary load tests using Serverless Artillery, concluding that we could do what we thought we couldn’t! A business case was put together by Peter and our Product Lead Caroline Rennie, and away we went. At this point, Heleen Mol and Sidney Barrah came on board and added meat to the bones, getting the system ready to go live for the ever-impending night of TV.

Serverless Artillery load testing reporting to InfluxDB and viewed in Grafana

Due to the nature of Red Nose Day, you don’t get many chances to test the system under peak load. We were struggling to get observability of what was going on in our functions using CloudWatch. At this point, Peter recommended that we try a tool that he had come across, which was IOPipe. IOPipe gave us unbelievable observability over our functions and how a user is interacting with them; it changed how we used Serverless and increased our confidence levels substantially.

IOPipe function overview

At this point we also integrated Sentry, which alongside IOPipe gave us the killer one-two punch of being able to get a 360-degree view of errors within our system, allowing us to quantify bugs for our QA team (led by Krupa Pammi) and trace the activity that caused them quickly and efficiently. I can’t think of a time when I have been able to have such an overview of everything going wrong; pretty scary, but excellent.

A Sentry bug from the Donation Platform

The next big part of the puzzle was the realisation that we were copying way too much code between our Serverless projects. I had a look at Middy based on a recommendation from Peter, but at the time there wasn’t a vast number of plugins for it, so I decided to spin out our own Lambda wrapper rather than having to learn and make plugins for a new framework and possibly run into Middy’s limitations (probably none). I am still not sure how bad an idea this was; however, it seems to work at scale, is easy to develop with and makes it simple to onboard new developers, which is enough for it to stay for the time being. Lambda wrapper encompasses all of the code to handle API Gateway requests, connect and send messages to SQS queues and a load of other common functionality. Lambda wrapper resulted in a massive code reduction across all of our Serverless projects. It also meant that the integration of Sentry & IOPipe was common and simple across all of our projects.
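The idea behind a wrapper like this is that each function only contains its business logic, while parsing, response formatting and error reporting live in one shared place. The sketch below illustrates that pattern under stated assumptions; it is not the Lambda wrapper API. The wrap function, the HandlerResult shape and the reportError hook (where something like Sentry could be plugged in) are hypothetical names, and real handlers would be asynchronous.

```typescript
interface ApiEvent {
  body: string | null;
}

interface ApiResponse {
  statusCode: number;
  headers: Record<string, string>;
  body: string;
}

// What a business-logic handler returns in this sketch.
interface HandlerResult {
  statusCode: number;
  data: unknown;
}

type Handler = (payload: unknown) => HandlerResult;
type ReportError = (error: unknown) => void;

function respond(statusCode: number, data: unknown): ApiResponse {
  return {
    statusCode,
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(data),
  };
}

// Wrap a handler with the boilerplate every function would otherwise repeat.
function wrap(handler: Handler, reportError: ReportError = () => {}): (event: ApiEvent) => ApiResponse {
  return (event) => {
    // Shared request parsing.
    let payload: unknown = {};
    if (event.body) {
      try {
        payload = JSON.parse(event.body);
      } catch {
        return respond(400, { error: "Body must be valid JSON" });
      }
    }

    // Shared error handling: report once, never leak internals to the caller.
    try {
      const result = handler(payload);
      return respond(result.statusCode, result.data);
    } catch (error) {
      reportError(error);
      return respond(500, { error: "Internal server error" });
    }
  };
}
```

A function then shrinks to something like wrap((payload) => ({ statusCode: 200, data: { received: payload } })), and every project gets the same error reporting for free.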

To add extra redundancy to the project, we introduced an additional region and created a traffic routing policy based on a health check from a status endpoint. We figured the chance of losing two geographically separate AWS regions was very low. We also backed up all deltas to S3 on a retention policy of 5 days, to ensure that we could replay all deltas in the event of an SQS or RDS failure. We added timing code to all outbound dependencies using IOPipe and also created a dependency management system so that we could quickly pull out dependencies (such as Google Tag Manager or reCAPTCHA) from external providers at speed.

IOPipe tracing

Based on a suggestion from AWS, we also added a regional AWS Web Application Firewall (WAF) to all of our endpoints. This introduced some basic protections, including stuff we already had covered, but higher up the chain, before API Gateway was even touched.

Another piece of the puzzle was to get decent insights into our delta publishes and processing, which gives us another way to get a good overview of what is happening with our system. We used InfluxDB to do this and consider it an optional dependency of our system. It was important for us to understand what our application’s critical dependencies were, as these form our application health check status and determine whether we would fail over to our backup region. InfluxDB is fantastic, but it is self-hosted. When AWS Timestream comes along, this will be out the door.
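The distinction between critical and optional dependencies maps naturally onto a status endpoint: only a failing critical dependency should make the endpoint report unhealthy. A minimal sketch follows, assuming that the health check treats any non-200 response as a failure; the dependency names, the 503 status code and the buildStatus function are illustrative assumptions rather than our actual implementation.

```typescript
interface DependencyCheck {
  name: string;
  critical: boolean; // true if the application cannot take donations without it
  healthy: boolean;
}

interface StatusReport {
  statusCode: number; // what the status endpoint would return to the health check
  healthy: boolean;
  failingCritical: string[];
  degradedOptional: string[];
}

// Only critical dependencies decide whether the region is reported as unhealthy.
function buildStatus(checks: DependencyCheck[]): StatusReport {
  const failingCritical = checks.filter((c) => c.critical && !c.healthy).map((c) => c.name);
  const degradedOptional = checks.filter((c) => !c.critical && !c.healthy).map((c) => c.name);
  const healthy = failingCritical.length === 0;
  return {
    statusCode: healthy ? 200 : 503,
    healthy,
    failingCritical,
    degradedOptional,
  };
}
```

With this split, an InfluxDB outage shows up in degradedOptional but leaves the endpoint on 200, whereas losing the queue or the database flips it to 503 and lets the routing policy move traffic to the other region.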

So the night of TV came and went on the 15th of March and the system performed nearly exactly as expected. The one unexpected, but now apparent, weak point was the amount of reporting that we were trying to pull from the RDS read replica using Grafana and our live income reporting; we lowered our reporting requirements and were back on track in no time. We originally used RDS so that we could achieve compatibility with our legacy payment service layer. In the future we will probably replace this with something more Serverless, relying on AWS Timestream for more real-time analysis (when it arrives).

So to sum up this epic and overly long rundown of the journey to 90% Serverless:

  • Try to get everything Serverless if you can; our highest monetary cost is RDS. It’s still nice to be able to run the SQL queries that we know and love, but Athena and S3 are probably a solid replacement.
  • Try to ingest data and work on it away from user interaction. You can provide the user with an endpoint where they can check on the status of processing. Manage as much state as you can with your frontend. This will hopefully give you redundancy and protection as a default.
  • Lambda allows you to load test at a significant scale; do it often and make it part of your deployment/feature release strategy. Serverless pushes the load down the line and has a habit of finding weak points in your chain, so make sure you know where your weak points are going to be. Serverless Artillery is the way forward; do better than us and do it as part of your pipeline to production for the win!
  • Continuously deploy, deploy on a Friday at 5 pm, don’t let fear stop you, and create the tooling and automation tests to allow you not to worry. We use Nightwatch, Cypress and Mocha to significant effect. It should be noted that you need decent logging and a fast way to roll back code to be able to do this in a manageable way (Concourse CI).
  • Serverless infrastructure cost is dependent on usage, so why not deploy your entire infrastructure at a pull request level and run tests in the PR against it? We do this, and it means that developers can be sure that before their code is merged, it works in real life and on our real-world infrastructure.
  • Don’t host anything if you don’t have to; everything as a service. I am physically averse to calls about infrastructure outages at any time of the day. Also, go multi-region if you can; serverless makes this a doddle, and it reduces the voice of the crowd who will remind you that S3 took out US-EAST-1 in September 2017.
  • Pick a piece of your architecture, migrate it to Serverless, get comfortable with it, rinse and repeat.
  • The best system is the one that allows me to be in the pub after 17:30 or be at home with my family not checking my laptop; Serverless for the backend and a React application stored in S3 for the frontend gives you this.
  • Concourse CI is probably one of the most expensive pieces of infrastructure that we are running, and it doesn’t fit in with our fully Serverless headspace. Replacing it would be great. However, the power and flexibility it gives us to deploy reliably and continuously are unmatched. Sometimes in life, you can’t be all one thing, in this case, Serverless. We use Concourse UP to simplify its deployment and management, meaning that we don’t have to mess around with BOSH.
  • Don’t try to optimise/abstract your services too early when it comes to Serverless. I remember at my first job where all the servers had names, they were cared for and loved and were then quickly replaced with EC2 when AWS entered the fray. Serverless brings the same down to your code and services; they should perform a function, be replaced with ease and doted over just the right amount. Compose small but relevant services!

Check out these articles for more information on our journey to serverless,

Below is a presentation by our Engineering Lead Peter Vanhee talking through the current architecture at Serverless Computing London.

And another presentation featuring our Product Lead Caroline Rennie around the previous donations platform and the problem space.
