
https://governmenttechnology.blog.gov.uk/2015/12/10/guest-post-how-automation-and-testing-reduce-risk-in-moving-cloud-vendors/

Guest post: how automation and testing reduce risk in moving cloud vendors

Categories: Central Architecture

Choosing and building a production environment for a large transactional service can be difficult and time-consuming.

The MOT testing service is one of these large transactional services. Every week there are close to 1 million MOTs performed in garages all over the country, 24 hours a day, 7 days a week.

The test results are an important part of a related service: paying car tax. Without a valid MOT, you cannot tax your car.

The Driver and Vehicle Standards Agency (DVSA) has recently moved to the cloud - in this case using Amazon Web Services (AWS) - to build a scaled-out, highly available platform, and to extensively test and transition the MOT testing service in 10 weeks.

Switching vendors involves risk, but thorough testing can mitigate it. In the transition to cloud services, automated testing had a big role to play. The result was increased confidence, and a fast deployment for a service that should be straightforward to maintain. And by automating the provisioning of our infrastructure, we’ve embedded a powerful tool to reduce vendor lock-in, helping future-proof our service and ensuring we can be nimble and adaptive to future needs.

Here’s what we did.

Automation, then some more automation

To ensure we could adequately support the service, we knew we had to build the new technical environments in an agile way. Too many of us had seen designed-up-front production environments that couldn’t easily be iterated on and didn’t work at scale. We needed automation to allow us to provision new infrastructure; to repeatedly deploy the application; and to scale up or down as needed.

We started with automation first created for other departments. Then we added some new Terraform automation for AWS to allow automated provisioning of networks, security groups, firewalls, virtual machines, Elastic Block Store (EBS) volumes and the operating system.
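
To give a flavour of what that looks like, here is a minimal Terraform sketch of the kinds of resources involved. The names, region, CIDR ranges, AMI and sizes are invented for illustration rather than taken from the real MOT configuration.

    provider "aws" {
      region = "eu-west-1"    # illustrative region
    }

    resource "aws_vpc" "main" {
      cidr_block = "10.0.0.0/16"    # placeholder network range
    }

    resource "aws_subnet" "app" {
      vpc_id     = aws_vpc.main.id
      cidr_block = "10.0.1.0/24"
    }

    resource "aws_security_group" "app" {
      name   = "mot-app"    # hypothetical name
      vpc_id = aws_vpc.main.id

      ingress {
        from_port   = 443
        to_port     = 443
        protocol    = "tcp"
        cidr_blocks = ["10.0.0.0/16"]    # only reachable from inside the VPC
      }
    }

    resource "aws_instance" "app" {
      ami                    = "ami-12345678"    # placeholder AMI
      instance_type          = "m4.xlarge"       # illustrative size
      subnet_id              = aws_subnet.app.id
      vpc_security_group_ids = [aws_security_group.app.id]
    }

    resource "aws_ebs_volume" "app_data" {
      availability_zone = aws_instance.app.availability_zone
      size              = 100    # GiB
    }

    resource "aws_volume_attachment" "app_data" {
      device_name = "/dev/xvdf"
      volume_id   = aws_ebs_volume.app_data.id
      instance_id = aws_instance.app.id
    }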

On top of this we extended our Puppet configuration management so that the application’s deployment and configuration were also fully automated. This means we can rebuild any node (or all nodes) in a known state from code in our git repos. This is what some call Infrastructure as Code, since the full stack - virtual infrastructure, application and configuration - is controlled by code, tested and automated.

This effort to automate means that we now have environment consistency and certainty about state changes. We can even ask the tools to indicate state changes before they are applied, so production changes can be checked if necessary. This is vital for us to be able to respond quickly to changes in scale.
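
As an illustration of that workflow - and assuming, which the post does not say, that the fleet size is simply a Terraform variable - a scale-up becomes a one-line change whose effect can be previewed with terraform plan before terraform apply makes it real:

    # Illustrative sketch: the variable and resource names are invented.
    variable "app_node_count" {
      default = 4
    }

    resource "aws_instance" "app_node" {
      count         = var.app_node_count
      ami           = "ami-12345678"    # placeholder AMI
      instance_type = "m4.xlarge"       # illustrative size
    }

    # Bumping app_node_count and running `terraform plan` shows exactly which
    # resources would be added before `terraform apply` changes anything.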

Testing the thing

Extensive non-functional testing for any beta service is a non-negotiable feature. This was strongly supported for the MOT testing service by the Senior Responsible Owner and Director of Digital Services and Technology at DVSA, James Munson. The performance testing regime in particular involved setting low, medium, high and extreme peak targets - for example the type of peak usage that happens twice a year during new car registration months - and then repeatedly testing and re-testing the service against these on the new platform in AWS.

This allowed the team to identify soft limits in the configuration, bottlenecks in the scaling and application constraints that could be addressed and re-tested. This process happened throughout the whole 10 weeks, and continues for new features today as the service is improved. Testing is not trivial, especially for a complex transactional service, and analysing the results is difficult without deep insight. To find these limits we built first-class monitoring and alerting using Prometheus and Kibana side-by-side with AWS CloudWatch monitoring. This allowed the team to see bottlenecks tier-by-tier, and it continues to allow operational issues to be identified and investigated proactively.
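
The post doesn’t show the alert definitions, but on the CloudWatch side an alarm managed in the same Terraform code might look something like this sketch; the metric, threshold and names are illustrative assumptions rather than the real configuration.

    resource "aws_cloudwatch_metric_alarm" "db_cpu_high" {
      alarm_name          = "mot-db-cpu-high"    # hypothetical name
      namespace           = "AWS/RDS"
      metric_name         = "CPUUtilization"
      statistic           = "Average"
      comparison_operator = "GreaterThanThreshold"
      threshold           = 80     # percent
      period              = 300    # seconds
      evaluation_periods  = 2

      dimensions = {
        DBInstanceIdentifier = "mot-db"    # placeholder instance identifier
      }
    }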

Underpinning the performance work were the platform choices available in AWS. We continually selected and tweaked configuration for standard features such as RDS Provisioned IOPS, ElastiCache clusters, and VM sizes up to 32 vCPUs. It is this immediate elasticity that allowed us to continue iterating the service at pace.
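
For example, RDS Provisioned IOPS and an ElastiCache cluster are each a single resource to declare and tweak in Terraform. The sizes, engine choice and identifiers below are invented; the post only names the features used.

    resource "aws_db_instance" "mot" {
      identifier        = "mot-db"           # hypothetical identifier
      engine            = "mysql"
      instance_class    = "db.r3.4xlarge"    # illustrative size
      allocated_storage = 1024               # GiB
      storage_type      = "io1"              # Provisioned IOPS storage
      iops              = 10000              # illustrative figure
      username          = "admin"
      password          = "change-me"        # use a secrets mechanism in practice
    }

    resource "aws_elasticache_cluster" "sessions" {
      cluster_id      = "mot-sessions"       # hypothetical identifier
      engine          = "memcached"          # engine choice is an assumption
      node_type       = "cache.r3.large"     # illustrative size
      num_cache_nodes = 2
    }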

There were, of course, some teething issues with the new service in the first few days, all of which were identified and resolved quickly. Given the 10-week duration I was surprised there weren’t more issues.

Deploying change in a pipeline

We insisted from the beginning that the path from feature development by a development team to the production environment should be an automated end-to-end pipeline. So we built a new pre-production environment, an accept environment, an integration environment and a CI build process to continuously build, test and deploy versioned RPMs to integration. These are then promoted through the pipeline to accept and from there on to production once signed off.

To improve the sign-off process we introduced mandatory firebreaks for the automated deployment to the pre-production and production environments. The promotion copies the already-built RPMs and performs a dry run. Once this goes green and has been accepted, a button can be pressed in Jenkins to run puppet apply against the environment and deploy the state changes.

The new pipeline has dramatically reduced the time spent on packaging and deployment, in addition to reducing the risk of change. It has also allowed regular releases to the production MOT service, increasing the benefits for garages.

Cutting over

The previous MOT mainframe service has recently been switched off after 10 years of use. All garages have been using the new MOT digital service since August, during one of the two peak months of the year. This demonstrates that the confidence DVSA had in the AWS platform, and in the Kainos WebOps team to deliver it, was well-placed.

The cutover from MOT1 was planned and executed over a single weekend in August. This was made possible by the hard work of the team, but would not have been completed in this timeframe without the investment in automation and the decision to replicate the 1TB of MySQL data from the previous cloud platform to an RDS replica in advance of the cutover weekend.
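
The post doesn’t describe how that replication was wired up. As a sketch only, the pre-seeded RDS target might be declared like this, with replication from the external MySQL source on the old platform configured separately (for example using MySQL’s native replication); every value below is an assumption.

    resource "aws_db_instance" "mot_cutover_target" {
      identifier          = "mot-cutover-target"    # hypothetical identifier
      engine              = "mysql"
      instance_class      = "db.r3.4xlarge"         # illustrative size
      allocated_storage   = 1500                    # GiB - headroom over the ~1TB dataset
      storage_type        = "io1"
      iops                = 10000                   # illustrative figure
      multi_az            = true                    # assumption: highly available target
      username            = "admin"
      password            = "change-me"             # use a secrets mechanism in practice
      skip_final_snapshot = true
    }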

The team is working hard to further improve the MOT service and bring it to live status. And I’m super excited to see how other government departments and agencies will use public clouds like AWS to improve their digital services and reduce costs in the future.

Don't forget to sign up for email alerts from this blog. 


3 comments

  1. Comment by Vinay posted on

    Sounds really cool!

  2. Comment by Karenlouiseb posted on

    Interested to know more about how you managed the actual cutover. Did you use the old spreadsheet method or employ one of the deployment tools to help smooth the way?

  3. Comment by Keith posted on

    We are doing similar things in Ordnance Survey. Would like to link up and compare notes. Find me on LinkedIn Keith Watson if you are interested.