Behind the Scenes: Infrastructure Upgrade

A little backstory

We have been using Amazon Web Services since we moved off of our own servers back in 2011. The platform has come a long way since then, adding Virtual Private Cloud (VPC), a number of new instance types, Aurora, CodeDeploy, and a plethora of other services (most of which we don't have a use for right now). Prior to November 2016, the public marketing site (what sits on the tave.com root domain) and the application (manager and client access) were all in the same codebase, running on Elastic Beanstalk in EC2-Classic, with Memcached on ElastiCache and MySQL on RDS. Elastic Beanstalk handled our provisioning and deployment, and overall it served us well.

November 2016 Updates

One of the reasons we had been holding off on further upgrades to our existing infrastructure was that we really wanted to move it into Amazon's Virtual Private Cloud to take advantage of its better network security, the latest instance types (which are only available inside a VPC), and Aurora (also VPC-only). We had also wanted to break the public website apart from the application codebase for a while, and when we redid the marketing site on top of WordPress, we finally had enough reason to go ahead and make the split.

In doing so, some infrastructure changes had to happen. Since the new WordPress-based site would need its own set of servers, completely segregated from the application, yet both are served from the tave.com root domain, we had to introduce a layer 7 proxy to parse the incoming URL and route each request to the appropriate server pool. For example, anything under /app goes to the application pool, whereas anything on / or /blog goes to the public site pool. AWS has a service that can do this, the Application Load Balancer (aka ALB, or ELBv2), but it comes with some huge caveats. The biggest one is that it can't route to anything outside the VPC it sits in, and since the application pool was still in EC2-Classic, we had to create our own proxy server pool.
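For the curious, path-based routing on an ALB looks roughly like the CloudFormation snippet below. The resource names and target groups here are illustrative placeholders rather than our actual templates, since our proxy pool ultimately had to do this routing itself:

    AppListenerRule:
      Type: AWS::ElasticLoadBalancingV2::ListenerRule
      Properties:
        ListenerArn: !Ref HttpsListener          # listener on the shared load balancer
        Priority: 10
        Conditions:
          - Field: path-pattern
            Values: ["/app/*"]                   # manager and client access
        Actions:
          - Type: forward
            TargetGroupArn: !Ref AppTargetGroup  # the application pool

    # Anything that doesn't match (/, /blog, ...) falls through to the
    # listener's default action, which forwards to the public site pool.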

So back in November, as the first step of moving us into a VPC and off of EC2-Classic, we created the proxy server pool along with the public site pool and its database, all inside a VPC and provisioned with CloudFormation. We didn't need any downtime for this, since we just updated the DNS entry to point at the new proxy pool inside the VPC, and the proxies routed to the existing application pool when necessary.

April 2017 Upgrades

In doing the updates back in November, we quickly became concerned with the size of the CloudFormation template file for what represents such a small portion of our overall infrastructure. We have a public site pool, an application pool, a scheduled tasks server, and a worker pool (which runs tasks behind the scenes like automations, calendar feed fetching and generation, etc.). In addition to that, we wanted to break the worker pool out into three separate pools: one for email sending, one for calendar feed generation and fetching, and one for everything else. The template file was already becoming unwieldy, and all it contained was the core VPC networking, the proxy pool, and the public site pool. We wanted to rethink this before adding the application pools (app servers, workers, and the scheduled tasks server).

So we started looking at provisioning tools like Ansible, Chef, and Puppet to see if they could provide what we were looking for: a structured way of composing templates hierarchically. We ended up sticking with CloudFormation for a couple of reasons (which I won't go into here), but this time we wrote a quick Node script that pre-parses the templates, uploads them to S3, and replaces the stack references with the uploaded locations. So we went from one monolithic template file to seven template files:

  • The VPC core networking, which references the templates below
  • Bastion server layer
  • Proxy layer
  • Public site layer
  • Application layer (one template for the app itself and one for the background worker/task queue servers)
  • A generic instance pool template that the others reference

This took us from one stack to 14 stacks, which break down like so:

  • Parent stack that has the core VPC networking and references the other stacks.
  • Bastion stack that sets up networking for the bastion server and its child stack for the instance pool.
  • Proxy stack that sets up networking for the proxy servers and its child stack for the load balancer and instance pool.
  • Public site stack that sets up networking for the marketing site and its child stack for the load balancer and instance pool.
  • Application stack, which sets up networking for the application web servers along with the load balancer, worker servers, and worker queue servers.
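The pattern behind all of this is CloudFormation's nested stacks: the parent template pulls in each child template from S3 via an AWS::CloudFormation::Stack resource, and that TemplateURL is exactly the reference our pre-parse script rewrites to the uploaded S3 location. A minimal sketch, with illustrative resource, bucket, and parameter names rather than our actual templates:

    ProxyLayer:
      Type: AWS::CloudFormation::Stack
      Properties:
        # The Node script swaps a local file reference for the uploaded
        # S3 location before the parent stack is created or updated.
        TemplateURL: https://s3.amazonaws.com/example-templates-bucket/proxy-layer.yaml
        Parameters:
          VpcId: !Ref Vpc
          PublicSubnets: !Join [",", [!Ref PublicSubnetA, !Ref PublicSubnetB]]
        TimeoutInMinutes: 30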

Since we created everything (except for the database and cache servers) in CloudFormation, we didn't need Elastic Beanstalk anymore. This allowed us to combine our seven layers (each of which had its own way of deploying) into one common, consistent deployment process on CodeDeploy. And finally, everything would now be inside a VPC.
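With CodeDeploy, each layer's deployment is driven by an appspec.yml in the deployment bundle, so every pool deploys the same way. A minimal sketch of what one looks like, with placeholder paths and hook scripts rather than our actual ones:

    version: 0.0
    os: linux
    files:
      - source: /
        destination: /var/www/app            # placeholder install path
    hooks:
      BeforeInstall:
        - location: deploy/stop_workers.sh   # placeholder hook scripts
          timeout: 300
      AfterInstall:
        - location: deploy/build.sh
          timeout: 300
      ApplicationStart:
        - location: deploy/start_workers.sh
          timeout: 300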

We had been holding off on any more reserved instance purchases until we were ready to move everything inside the VPC, which is why we were just band-aiding things as they came up. We knew our database was becoming overloaded at times, so we figured we would upgrade everything at once (mostly because we hate taking downtime).

Database Upgrade

The core of our data storage sits on MySQL on RDS in a Multi-AZ setup. Aurora offers a number of advantages we want to take advantage of, so we are migrating to it. Along with this, we are increasing the database instance size roughly eightfold. This should provide much quicker reads and writes and a better experience overall. Aurora's replication is also much faster than MySQL's, so adding additional read replicas as needed becomes trivial.
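The reason read replicas become trivial is that an Aurora replica is just another instance attached to the same cluster, reading from the shared storage volume instead of replaying binlogs. In CloudFormation terms it is roughly the sketch below (identifiers and instance sizes are illustrative placeholders):

    AuroraCluster:
      Type: AWS::RDS::DBCluster
      Properties:
        Engine: aurora                       # MySQL-compatible Aurora
        MasterUsername: !Ref DbUser
        MasterUserPassword: !Ref DbPassword
        DBSubnetGroupName: !Ref DbSubnetGroup

    AuroraWriter:
      Type: AWS::RDS::DBInstance
      Properties:
        Engine: aurora
        DBClusterIdentifier: !Ref AuroraCluster
        DBInstanceClass: db.r3.2xlarge       # placeholder size

    # Adding a read replica is just one more instance in the same cluster.
    AuroraReader:
      Type: AWS::RDS::DBInstance
      Properties:
        Engine: aurora
        DBClusterIdentifier: !Ref AuroraCluster
        DBInstanceClass: db.r3.2xlarge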

Application Webserver Upgrade

We are moving to the newer c4 instance types, which are slightly faster than what we currently have in production. On top of that, we doubled the size of the app webserver pool, so not only are the servers faster, there are about twice as many of them. We generally don't use automatic scaling, since getting it right requires a lot of tedious testing that we honestly don't have time for. Instead, we scale out manually when we see page load times increasing; everything is already in place for this.
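Assuming the pool is a plain Auto Scaling group with no scaling policies attached (the names and sizes below are illustrative, not our actual values), scaling out manually is just a matter of bumping the desired count:

    AppServerGroup:
      Type: AWS::AutoScaling::AutoScalingGroup
      Properties:
        LaunchConfigurationName: !Ref AppLaunchConfig   # c4-class instances
        VPCZoneIdentifier: !Ref PrivateSubnets          # list of private subnet IDs
        MinSize: "2"
        MaxSize: "16"
        DesiredCapacity: "8"    # bumped by hand when page load times creep up
        TargetGroupARNs:
          - !Ref AppTargetGroup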

Worker Pool Upgrade

Some of the slowdowns in background tasks were caused by certain workers monopolizing an instance's CPU, leaving fewer CPU resources for other (and sometimes more critical) workers, like email sending. We decided it would be best to distribute these workers into separate pools so they can't affect the more critical background tasks. Instead of one common pool of workers, we are starting with three separate pools: email, calendar generation and remote calendar fetching, and the rest of the workers. With this change, we are also doubling the total number of instances across the worker pools. More importantly, we implemented the ability to break out individual workers into their own pools, which lets us move other CPU- or memory-intensive workers into a pool of their own so they don't affect the rest.
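This is where the generic instance pool template pays off: each worker pool can be just another child stack instantiating that template with a parameter saying which worker group it runs. A sketch of the idea, with hypothetical bucket, template, and parameter names:

    EmailWorkerPool:
      Type: AWS::CloudFormation::Stack
      Properties:
        TemplateURL: https://s3.amazonaws.com/example-templates-bucket/instance-pool.yaml
        Parameters:
          WorkerGroup: email        # hypothetical parameter selecting which workers run here
          DesiredCapacity: "2"

    CalendarWorkerPool:
      Type: AWS::CloudFormation::Stack
      Properties:
        TemplateURL: https://s3.amazonaws.com/example-templates-bucket/instance-pool.yaml
        Parameters:
          WorkerGroup: calendar     # feed generation and remote fetching
          DesiredCapacity: "2"

    # ...and a third pool with WorkerGroup set to a catch-all group for everything else.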

Standby Site Upgrade

The standby site was moved into a VPC as well, in the us-west-2 region. We increased the database instance size there and doubled the capacity of the app server pool.

Public Site Upgrade

Since we like keeping our infrastructure consistent, we also went ahead and migrated the WordPress database that the public site uses to Aurora. Along with this change, we are doubling the capacity of that pool as well. We also moved the blog from the support site to the public site infrastructure; the old support.tave.com is now hosted on Intercom's Help Center.

Conclusion

With all these changes, we are finally fully within a VPC, and page loads and general app usage should feel quicker. Of course, if you have any questions about our infrastructure or why we chose one thing over another, feel free to message us in-app and we'll be happy to share!
