GrubHub’s infrastructure-as-code feeds pandemic resiliency

GrubHub’s adoption of SRE methods paid off quickly as the COVID-19 pandemic struck the U.S., and it continued to expand along with the company’s customer base over the past year.

The pandemic sparked a major increase in customer traffic on the online food delivery service as consumers increasingly stayed out of public places such as restaurants and instead ordered takeout online. In the first quarter of 2020, GrubHub reported 23.9 million active diners, an increase of 24% over the first quarter of 2019.

That number reached 30 million in the third quarter, according to the company’s earnings releases. The company’s first-quarter gross food sales rose 8% compared with the first quarter of 2019, to $1.6 billion; by the third quarter, it reported gross food sales of $2.4 billion, up 68% over the third quarter of 2019.

“Before the pandemic… we would see higher order volume in, you know, city centers, less so suburban [areas],” said Alex Trevino, technical lead on GrubHub’s SRE team. “Now there is higher suburban-originating traffic.”

This surge in activity represented an equally large increase in demand for back-end IT services, which GrubHub uses to connect users to more than 300,000 restaurants in the US and London through web and mobile apps. It serves those apps from a mix of two AWS regions, US-East and US-West, as well as a legacy self-owned data center, and GrubHub engineers have written their own container orchestration tooling to manage about 9,000 Docker containers in the cloud.

However, engineering teams had already designed GrubHub’s AWS infrastructure and 300 application microservices to scale automatically to accommodate large growth.

“Preparing for higher traffic and building resiliency is a continuous exercise, part of daily life here,” Trevino said. “One of the things that we do, because our business is more active when the weather is colder, leading up to Labor Day, we go through the exercise of reviewing all of our systems to make sure that we’re scaled appropriately.”

So, instead of having to respond to day-to-day scalability issues, GrubHub site reliability engineers (SREs) were able to focus their efforts more strategically amid the pandemic, on initiatives such as expanding DevOps teams’ use of infrastructure-as-code tools from Pulumi.

Infrastructure-as-code tool speaks developers’ language

GrubHub’s IT staff did not experience drastic upheavals during the pandemic, but furthering the use of infrastructure-as-code, in which infrastructure resources are provisioned and updated alongside application code through CI/CD pipelines, helped them accommodate some of the changes that did occur.

“Documentation being relevant and updated more frequently has become immensely important, especially because we have people that are no longer working in the same time zone,” said Andrew Blum, a senior SRE at GrubHub who led the infrastructure-as-code rollout. “We [cannot] stop by each other’s desks and pick each other’s brains.”

Infrastructure-as-code centralizes both infrastructure provisioning and documentation within the company’s Git source control system, where SREs have also built in enforcement for documentation updates alongside pull requests.

“We store our documentation with the code,” Blum said. “When you go to make a change … you also have a question in the [pull request] that says, ‘Did you change documentation for this?’”
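As a rough illustration of how a documentation prompt like this could be backed by an automated check, the Python sketch below flags a change set that touches infrastructure code without touching any documentation. It is not GrubHub’s tooling: the `infra/` and `docs/` directory names, the file suffixes and the `docs_check` function are all assumptions made up for this example.

```python
from pathlib import PurePosixPath
from typing import Iterable

# Hypothetical repository conventions (assumptions for this sketch):
# infrastructure code lives under "infra/", and documentation is any
# Markdown/reST file or anything under "docs/".
INFRA_PREFIX = "infra/"
DOC_SUFFIXES = {".md", ".rst"}


def is_doc(path: str) -> bool:
    """Return True if the path looks like documentation."""
    suffix = PurePosixPath(path).suffix.lower()
    return suffix in DOC_SUFFIXES or path.startswith("docs/")


def docs_check(changed_files: Iterable[str]) -> bool:
    """Pass unless infrastructure code changed with no documentation change."""
    files = list(changed_files)
    touches_infra = any(
        f.startswith(INFRA_PREFIX) and not is_doc(f) for f in files
    )
    touches_docs = any(is_doc(f) for f in files)
    return touches_docs or not touches_infra
```

A CI job could feed the list of files changed in a pull request to `docs_check` and surface a reminder when it returns `False`, leaving the final judgment to the human reviewer.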

Before Pulumi, GrubHub SREs used custom Python scripts to automate infrastructure. Pulumi offered a more systematic alternative to this custom scripting while preserving the Python interface, as opposed to using a domain-specific language (DSL), which is the approach taken by competitors such as HashiCorp’s Terraform.

“It’s very nice to be able to use our own paradigms, and a natural programming language as opposed to some specific DSL,” Blum said.

GrubHub SREs first adopted infrastructure-as-code tools from Pulumi in 2019. In mid-2020, they began to steer developers toward using Pulumi themselves instead of requesting infrastructure resources from the SRE team through help desk tickets.

There was some initial resistance to this change among developers, Blum said, but the familiar programming language helped ease the transition.

“They’re using their same tools and workflows to interact with this,” he said. “And it gives them control and power to do the things they need to do to push their features and products out.”

GrubHub’s shift to infrastructure-as-code also helped SRE teams delegate repetitive infrastructure management work to developers while improving system reliability through repeatable, automated deployments that were subject to quality checks and other tests in CI/CD pipelines.

Infrastructure-as-code improves network management, reliability

One of the most significant systems to make the transition to infrastructure-as-code last year was the company’s NS1 Domain Name System (DNS), which translates human-readable web addresses to the IP addresses associated with the back-end infrastructure.

In the past, under a previous DNS provider, SREs created and updated DNS servers and records through a help desk ticketing system and a traditional console UI, rather than through infrastructure-as-code. Using Pulumi to update DNS cut down on manual errors and provided consistent centralized management, improving the system’s reliability.

Before adopting infrastructure-as-code, a select number of engineers had access to the previous DNS provider’s console, but not all of them knew the full context of DNS changes, Blum said in a 2020 NS1 conference presentation.

Under the Pulumi system, by contrast, every change to DNS must go through peer review. This encourages collaboration between application developers and those more steeped in the intricacies of DNS, improving the accuracy and efficiency of updates.
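To make the peer-review idea concrete, here is a minimal Python sketch of the general technique behind declarative tools: compare the records declared in code with the records that currently exist, and produce a readable list of proposed changes for a reviewer. This is not Pulumi’s or NS1’s API; the `plan` function, the record layout, the `example.com` names and the `192.0.2.x` documentation addresses are all hypothetical.

```python
from typing import Dict, List, Tuple

# A record is keyed by (name, type) and maps to a single value.
# This flat layout is an assumption chosen to keep the sketch short.
RecordKey = Tuple[str, str]


def plan(current: Dict[RecordKey, str], desired: Dict[RecordKey, str]) -> List[str]:
    """Describe the changes needed to turn `current` into `desired`."""
    steps = []
    for key in sorted(current.keys() | desired.keys()):
        name, rtype = key
        if key not in current:
            steps.append(f"create {rtype} {name} -> {desired[key]}")
        elif key not in desired:
            steps.append(f"delete {rtype} {name}")
        elif current[key] != desired[key]:
            steps.append(f"update {rtype} {name}: {current[key]} -> {desired[key]}")
    return steps


# Hypothetical state: what exists today versus what the code declares.
current_records = {
    ("www.example.com", "A"): "192.0.2.10",
    ("old.example.com", "A"): "192.0.2.99",
}
desired_records = {
    ("www.example.com", "A"): "192.0.2.20",
    ("api.example.com", "A"): "192.0.2.30",
}

for step in plan(current_records, desired_records):
    print(step)
```

Because the desired records live in a file under version control, the commit history answers who changed a record and when, and the pull request discussion records why.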

“There [are] no surprise DNS changes,” Trevino said in an interview. “We have a process that lets us see who did it, why and when.”

Infrastructure-as-code has also made security certificate updates and other changes to the company’s AWS load balancers more consistent, Blum added in the interview. As with DNS, only a handful of engineers have access to the production AWS console. Updating through Pulumi requires a change to a single file that corresponds to a grouping of resources that may cross AWS regions, rather than multiple manual changes to each resource through the console.
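The single-file idea can be sketched in plain Python: one definition describes a group of load balancer listeners, and per-region resources are derived from it, so a certificate change is made once and applies everywhere. This is only an illustration of the pattern, not GrubHub’s code or Pulumi’s API; the class names, the region identifiers and the certificate IDs are assumptions.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class LoadBalancerGroup:
    """One declared grouping of resources that may span regions."""
    name: str
    certificate_id: str
    regions: Tuple[str, ...]


@dataclass(frozen=True)
class Listener:
    """A per-region resource derived from the group definition."""
    region: str
    name: str
    certificate_id: str


def expand(group: LoadBalancerGroup) -> List[Listener]:
    """Derive one listener per region from the single group definition."""
    return [
        Listener(region=r, name=f"{group.name}-{r}", certificate_id=group.certificate_id)
        for r in group.regions
    ]


# Hypothetical group covering two regions.
web = LoadBalancerGroup(
    name="web",
    certificate_id="cert-old",
    regions=("us-east-1", "us-west-2"),
)

# Rotating the certificate is one edit to the definition,
# and every derived listener picks it up.
rotated = expand(replace(web, certificate_id="cert-new"))
```

The contrast with the console workflow is that no region can be forgotten: the set of listeners is computed from the definition rather than edited one by one.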

“There’s a lot of uniformity and other things that [infrastructure-as-code] brings to the table that you can definitely miss when you’re doing it by hand,” Blum said.

GrubHub SREs aspire to a full-fledged GitOps approach to application and infrastructure updates, the purist definition of which requires any updates to Git code repositories to be immediately and automatically deployed.

That’s still a work in progress, Blum said.

“There’s a lot to cover; we have a lot of resources that were originally created by hand [and] part of this project is to create tooling to import everything, plus creating new resources,” he said. “This is a long project that we are undertaking.”