Thuan Pham talks about signs of mounting technical debt he saw in 2016.
So how does an organization like Uber pay down technical debt? There are many tactics for retiring technical debt; what many have in common is a sense of slowing things down.
Retiring debt can take days, weeks, months or even years. For example, Uber held its first “Fix-It Week” in October 2017, during which engineers, data scientists and developers stopped building new features and instead repaired code and tools that were broken, sloppy or poorly documented.
Whole teams may pause their projects for months or longer to address accumulated debts. In other cases, new projects are initiated or new organizational processes put in place. Often these projects and processes have an integrating effect, bringing people together rather than fragmenting them into a fleet of speedboats.
Paying down technical debt can save time, but it can also save money. While retiring debt may mean some projects move more slowly in the short term, doing so can be crucial for a company like Uber that is looking to shore up its finances ahead of an IPO.
One way Pham built support for his efforts to retire technical debt was to analyze just how much the company was spending on engineering and developer time simply to fix things.
“We have charts and graphs that predict that if we don’t do anything with the amount of debt that we have right now, the amount of interest will continue to increase to the point where by the end of 2019, we will have to hire 1,000 engineers just to run in place. And so that will be like a breaking point or bankruptcy kind of thing.”
—Thuan Pham, in early 2018
Let’s look at some examples of how Uber sought to pay off technical debt.
UberEats, Uber’s food-delivery service, went from idea to product in less than a year, and product to multi-billion-dollar business in about two years.
“This is one of the fastest-growing companies in humankind. Two years ago, it was virtually non-existent,” said Mengerink. “When you look at this and you look at the data structures underneath, it was really just truly hacked into place. Hackathon was a perfect way of looking at it.”
Uber had hived off a small team to experiment with food delivery ideas. (Some of the first failed miserably, including one trial where Uber drivers picked up a bunch of pre-made burritos and drove around, hoping someone would order one. “A burrito in 5 minutes? Magical, you just push a button!” remembered Pham. “Didn’t work because people don’t want to eat burritos every day.”)
After hitting upon restaurant delivery, the team built some food-delivery features into Uber’s rider app and found strong product-market fit. But by August 2015, the Uber app was getting very complex; its screen was crowded with features. A suggestion arose to build a standalone UberEats app.
Ryan Sokol, who was then not part of the UberEats project, recalled the situation: “I was consulted about how long it would take, and I told them anywhere from three to six months, and I would err on six. And they were like, ‘Okay, we have three months. We're going to go do this.’”
UberEats launched in Toronto in late 2015, and in January 2016, Sokol moved over to head the UberEats team. Expansion to U.S. cities was imminent. But what he found was a precarious situation. He said:
In three months, you don't build solid software; you build prototype software. It's meant for beta customers, it's really janky. UberEats was built on three or four different databases, three or four different languages; whatever was available, they just sort of glued it all together in that three-month period. … When I got here in January, if you wanted to push new code, it was basically going to crash all the systems. We didn't have any processes in place; we didn't have any sort of rhyme or reason to how we had done things. There was a lot of tribal knowledge, and it was almost impossible to keep the system up if we wanted to do anything. If you didn't touch it, it was fine. But if you blew on it, it would fall down.
Just six weeks into UberEats’ launch, Sokol decided on a radical move — a move based both on past experience and intuition. He asked his team to pause for six weeks and essentially re-do the app, using one language (Go) instead of four, and one database as well. The suggestion to slow down didn’t go over well at first.
I got a lot of griping about it. People were like, ‘Why?’ … They wanted to still work in Python, they still wanted to work in Node.js, they wanted to work in all these languages, but I said no. If we become one unified voice [with Go], we'll be able to pass on the tribal knowledge better, we'll be able to get it documented better, we'll be able to train new people to come in to have to learn less to get productive. … I convinced people to go into a war room with me, and we just came out when we were done.
The effort didn’t pay off immediately with the launch of the service in Los Angeles in March. Rolling out UberEats in L.A., he recalled, was almost as hard as launching Toronto because there were few efficiencies gained; much of the work to prepare for L.A. had been undoing hard-coded features that only worked in Toronto. But as UberEats moved into Chicago, Houston, San Francisco and other cities in rapid-fire succession after that, the benefits started to accrue. He recalled:
The third city was launched in a matter of weeks, and then it started to be “Oh, we can launch cities in a matter of days.” Today, we can launch cities in a matter of minutes. So we've progressed a lot since those early days. But I do credit the decision for us to slow down and harden what we had built, really sort of shore up the foundation, as foundational to success.
The move to microservices was one key to Uber’s growth. But it also came with costs; namely, redundancies and complexities that might not have seeped into a system that was managed in a more top-down or centralized fashion.
By 2018, Uber’s software was a matrix of more than 3,000 microservices. Many of these services, though, performed overlapping functions. For example, Uber offered promotions to entice drivers to drive. The company had promotions to incentivize riders to ride. There were more promotions for restaurateurs, and more yet for diners using UberEats.
If another Uber team wanted to create an incentive system for another product or customer segment, it was not unusual for them to take some code from the other incentive teams as a baseline, then modify that for their own purposes. The result, said Pham, was “you've got five or six systems that do incentives that are 75 percent similar to one another.”
“It would be fine if all six [systems] were amazing. It’s okay to have six amazing cars in your driveway. But if you have six 30-year-old cars, and one is polluting badly, one’s brakes are broken down and stuff, it’s not useful. You would rather have one car that works right.”
— Ganesh Srinivasan
In October 2017, Pham tasked Srinivasan with corralling these close cousins by developing a “Product Platform,” essentially a set of common tools for engineers and developers to do their work. Besides improving efficiency and reducing duplication, the goals included reducing security vulnerabilities and UI inconsistencies. Six hundred people were assigned to Product Platform.
Pham also challenged Mengerink and other senior leaders in the engineering organization to clean up Uber’s core infrastructure, such as unifying the compute platform, simplifying the storage stack, and reducing the number of “line [network] protocols” required for all the microservices to interoperate with one another. Pham believed there should be just one or two line protocols, but over time, Uber had developed about a dozen.
“That happens because all of these hundreds of teams make their own independent decisions without checking with one another,” said Pham. “Because they have to go really fast and get things done. They did whatever they felt was workable for them, and they did that with the best of intentions. But as a result, we have this haphazard web of things that are not uniform or standardized. So, the system becomes clunky and hard to manage.”
Convincing people to focus on tasks like reducing the number of line protocols, he admitted, wasn’t easy.
“Doing this back-end work takes time, and is the kind of work that is unsexy,” said Pham. “It doesn’t really shift the functionality. You're just making it more efficient. It's very hard to get an engineer to do that kind of work; work that is not visible. So, that is a challenge we have.”
Product Platform wasn’t the only new organizational structure put in place to address technical debt. Pham also backed the creation of Project Ark, which was the brainchild of M. Waleed Kadous, engineering strategy lead for the Office of the CTO at Uber.
Hatched in the summer of 2017, Project Ark brought together about 20 senior representatives of various “organizations” within the Engineering division — for example, UberEats, Maps, Uber Freight, Storage, Data, Finance and more. The idea was to convene a representative body to tackle issues of technical debt. Kadous recalled:
I had started to see a lot of inefficiency. I would compare … what I thought would be reasonable at my previous employer, Google, and my engineers [at Uber] were only capable of maybe 60 percent or 70 percent of what an equivalent engineer at Google would do. Not because they were any less talented. The talent levels are more or less the same. But it was because the systems and tools and approaches and methodology that Google had built … had really empowered engineers to be efficient. Get stuff out the door without doing a lot of what we call "yak shaving." That means you've got to do these weird contraptions and weird steps. You want to put a service out and there's this document that has 13 steps in it and running this script and doing this and that. It was very reliant on organizational memory. The systems weren't elegant to use. There were frequently issues that were unexpected.
In September 2017, the Project Ark members got together for a day and decided on six major areas to focus on:
1. Engineer productivity: Engineers spent a lot of time working through unnecessarily complex processes and dealing with unstable systems.
2. Engineer alignment across teams: Teams often did things in different ways and when it came time to connect them things didn't work out. If they talked first, things would run more smoothly.
3. Knowledge access and documentation: It was hard to find out how to do things, and it was sometimes hard to even find out who was responsible for a service.
4. Duplication: There was often more than one way to do a thing at Uber.
5. Unmaintained critical systems: There were some key components at Uber that everyone relied on that were maintained on more or less a volunteer basis.
6. Culture: Uber’s “Let Builders Build” culture prioritized shipping stuff out the door with little thought to long-term maintainability. We needed to address that.
Project Ark members thought about their task in terms of three time horizons: quick wins, things that could be tackled in time to have meaningful input into the 2018 planning cycle, and long-term issues.
Quick wins would be important, said Kadous, in part because there was skepticism around the Ark effort. A previous cross-team project known as Panama, he said, “had mixed success.”
Kadous sought to understand why Panama had fallen short. One reason, he said, was the group was not representative of all stakeholders, so getting buy-in on its decisions was difficult. Another reason was that there was little follow-through.
“Once a decision was made – like, for example, ‘these are the standard set of languages that should be used at Uber’ – nobody followed through to actually make sure that the set of languages we practically used was reduced,” he said of Panama.
Project Ark team members met once a week; there was also an “Ark Admin” group of three to four people who met three times a week to check on the progress of things. Asked why the project was christened “Ark,” Kadous said with a laugh: “I think it was because I felt we were sinking. It was meant to help us get to safer ground. I think maybe that’s the best way to put it.”
Among the initiatives Project Ark launched: eliminating the programming languages Python and JavaScript from Uber products, and setting up a three-year plan to reduce the number of code repositories from 12,000 down to four, and ultimately to one.
Not everyone was excited about Project Ark, admitted Kadous.
“Uber had built a culture of ‘Let Builders Build’ and ‘do whatever you want to get the job done,’ and we were coming in and saying, ‘no, for the greater good you're going to have to give up some of your autonomy.’ And not all people were happy with that. And so, there is resistance to change.”
Kadous also struggled with the fact that Project Ark was, essentially, a volunteer effort. Some participants were more involved than others.
“The problem with a representative structure is that people have their own organizations, and the dynamics are such that they get pulled back to their primary responsibilities,” he said. “And so, there's always been a struggle to work out how to keep people engaged. Of those 20 representatives, there were six or seven that actually were really, really committed and the other 13 were kind of on the margins. This is still a problem I haven't solved completely, and it's one of the problems of a representative body. You can either choose to be representative, or choose to have a committed group. It's really hard to do both at the same time.”
Slowly, though, Project Ark started to make a difference.