- 24 July, 2024
In Part 3 we discussed scaling technical teams in general, while in Part 4 we focused on the Product team. In this final Part, we discuss the specifics of growing your Engineering team.
Focus on the customer is key for a great Engineering team
Engineering is not just the “code-writing arm of Product”: it needs to be deeply involved in understanding how your company delivers value to its customers. Successful companies don’t use Engineering as a “feature factory” (an embodiment of the dreaded waterfall model), but give it a role that is just as important as Product’s in delivering value to customers.
As mentioned in Parts 2 and 4, company-wide metrics around customer value (focusing on the fundamentals) should be put in place and jointly owned by the various functions, including Engineering. Quarterly or annual targets should be expressed in terms of impact on these metrics. This is key for engineers: to be effective, they need to understand why they’re doing what they’re doing, and be empowered to tackle it in whatever way they see fit.
Additionally, in order to develop a deep intuition for the value the product is delivering, engineers need to regularly put themselves in the shoes of their customers. Consider setting up an “icecreaming environment” (formerly called “dogfooding”) and promoting regular use of your product internally by engineers. In some cases this can be done as part of the normal development process, e.g. if your company is developing a B2C product. If that’s not possible, you may want to set up recurring sessions (1 to 2 hours weekly is a good target) during which engineers use your product the way regular users would. A sandbox environment pre-populated with realistic user data is a good investment to enable internal use of your product.
There needs to be a healthy tension between Product and Engineering. A not-so-obvious advantage of having engineers deeply understand the fundamentals is that Engineering can act as a first line of critique for new ideas from Product. This is a significantly cheaper and faster way of weeding out or improving ideas that won’t add customer value.
Good engineering needs a culture of excellence and learning
Promote a culture of doing things right, and at the right time. For instance, if your engineers find themselves rewriting the same systems over and over, this will sap their capacity, and each rewrite will become more difficult and painful as your customer base grows. At this stage of growth, systems need to be architected to scale and to last multiple years, even as requirements change over time.
You need rigorous design and careful planning. Technical designs should be documented and reviewed by other teams (or at least by your most senior engineers) via an internal RFC (Request for Comments) process. A technology roadmap for long-haul efforts should be put in place, with sufficient bandwidth for the Engineering team to pursue it in addition to product roadmap work.
When operating at scale, you need to make judicious technology choices and avoid the proliferation of different technologies. This proliferation can generate huge costs at scale since it will be very difficult – if not practically impossible, in most cases – to retire technologies once you’ve introduced them, meaning you’ll have to support them all “forever”. Remember that you’re optimising your portfolio of tools for the long run, not on a project-by-project basis.
Once you’ve hit PMF, it’s a good idea to start investing in developer productivity tools. This can be a small team (e.g. 2 people for a team of 100 engineers) that will own, among other things, putting in place a fast and reliable CI/CD pipeline.
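As an illustration, here is a minimal sketch of the kind of glue such a team might own: a single fail-fast script that developers and CI both run, so local checks and the pipeline stay consistent. The tool choices (ruff, pytest) are assumptions; substitute whatever fits your stack.

```python
# Hypothetical pre-merge check runner a developer-productivity team might own.
# The tools invoked (ruff, pytest) are assumptions; adapt them to your stack.
import subprocess
import sys
import time

STEPS = [
    ("lint", ["ruff", "check", "."]),
    ("unit tests", ["pytest", "-q"]),
]

def main() -> int:
    for name, cmd in STEPS:
        start = time.monotonic()
        result = subprocess.run(cmd)
        elapsed = time.monotonic() - start
        status = "ok" if result.returncode == 0 else "FAILED"
        print(f"{name}: {status} ({elapsed:.1f}s)")
        if result.returncode != 0:
            return result.returncode  # fail fast to keep feedback loops short
    return 0

if __name__ == "__main__":
    sys.exit(main())
```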
Be proactive and learn from your mistakes
It is important to anticipate problems instead of reacting to them. A good practice for a fast-growing company is to run periodic “10x litmus tests”: for every important part of your product and infrastructure, engineers ask themselves whether it will continue to function if your customer base, traffic and/or data grow by an order of magnitude. If the answer is “no”, time should be allocated to fix the scalability issues before that system runs out of steam. In most cases this is far cheaper to do proactively than once things start to break, and it ensures you keep providing the best possible experience to your customers.
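In practice, the litmus test can be as lightweight as a shared worksheet or a small script that compares projected load against estimated capacity. A toy sketch follows; the component names and figures are invented for the example.

```python
# A toy "10x litmus test" worksheet. Component names and figures are purely
# illustrative; replace them with numbers from your own capacity reviews.
GROWTH_FACTOR = 10  # order of magnitude of growth each component should survive

components = {
    # name: (current peak load, estimated capacity, unit)
    "api-gateway":    (1_200, 20_000, "req/s"),
    "orders-db":      (450,   3_000,  "writes/s"),
    "search-cluster": (80,    500,    "queries/s"),
}

for name, (peak, capacity, unit) in components.items():
    projected = peak * GROWTH_FACTOR
    verdict = "OK" if projected <= capacity else "NEEDS WORK"
    print(f"{name:15s} {peak:>6} -> {projected:>6} {unit} "
          f"(capacity {capacity}): {verdict}")
```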
Failure to do the above may lead to life-threatening situations for your company. If you need to reimplement something because it has reached its scalability limits, this usually cannot happen overnight. Consider the impact on your business if that component fails completely or cannot handle the increased load. We’ve had experiences in the past where critical systems reached their scalability limits and threatened to stall the company’s growth; the heroics needed to handle those situations are something we’d gladly avoid in the future.
The above is part of paying back tech debt regularly. The cost of this debt increases dramatically as your company scales. We’ve experienced this several times in the past: in one case, we had to rewrite a system and migrate over 15,000 customers, which took over a year. Had we invested earlier, when we had fewer customers, it would have been a matter of 1-2 months.
Production incidents are an unavoidable part of running a service. While nobody likes them (least of all your customers), you need to turn them into assets by putting in place a learning and improvement process. Run a post-mortem following each (major) incident, decide on corrective actions, and track and prioritise these actions to make sure the same type of incident is unlikely to recur. (Companies sometimes do well on all of these except tracking and implementing the actions, which makes the post-mortems essentially useless.)
You should strive to always be the first one to know when there’s an issue, rather than waiting to be told about it by your customers. To achieve this, you need excellent visibility over your system, which means investing in your monitoring and logging infrastructure and improving it as part of the post-mortem process described above.
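To make the idea concrete, here is a minimal sketch of the kind of check that sits behind “be the first to know”, assuming you already collect per-request success/failure signals. The metrics feed and the paging hook are hypothetical placeholders; a real setup would lean on your monitoring stack rather than hand-rolled code.

```python
# Minimal sketch: track recent request outcomes and page the on-call engineer
# when the error rate crosses a threshold. The paging hook is a placeholder.
from collections import deque
from typing import Callable

WINDOW_SECONDS = 300     # look at the last 5 minutes of traffic
ERROR_RATE_LIMIT = 0.02  # alert above 2% failed requests

_recent: deque[tuple[float, bool]] = deque()  # (timestamp, succeeded) pairs

def record(timestamp: float, succeeded: bool) -> None:
    """Feed one request outcome from your request logging."""
    _recent.append((timestamp, succeeded))
    while _recent and _recent[0][0] < timestamp - WINDOW_SECONDS:
        _recent.popleft()

def error_rate() -> float:
    if not _recent:
        return 0.0
    failures = sum(1 for _, ok in _recent if not ok)
    return failures / len(_recent)

def check_and_alert(page_oncall: Callable[[str], None]) -> None:
    rate = error_rate()
    if rate > ERROR_RATE_LIMIT:
        page_oncall(f"error rate {rate:.1%} over the last {WINDOW_SECONDS}s")
```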
All of the above is part of the best practices of a mature engineering team. If you start with mostly junior engineers, you can slowly and painfully discover how best to implement these practices yourself. Or you can inject the right kind of senior talent, i.e. engineers who have seen how things get built well at scale, to avoid avoidable mistakes and raise the bar internally. Clearly, the latter approach is preferable, since you want to spend your time and energy on adding value to your customers.
Control how you spend your Engineering bandwidth
Post-PMF you need to “industrialise” what you’re doing, which means making sure you spend your Engineering bandwidth on the things that will have the maximum impact for your customers, across the board.
We’ve lived through many situations where Engineering teams got overwhelmed with tailor-made requests for specific customers, to the point where they spent half of their time on requests that benefited only one or a very small number of customers. Unless you control this, there’s a natural tendency for it to happen, since every one of these requests will be presented as a “must” for closing that deal. If this is allowed to happen at zero cost for the commercial teams, they will naturally put pressure on Engineering instead of selling smarter and harder. Commercial teams should be selling “what’s on the truck”.
One highly effective way to control this is to pre-allocate part of the engineering bandwidth (10-15% would be a good target) to custom requests, while the rest remains dedicated to roadmap and technical work. Custom requests can be managed in a Kanban lane, with commercial teams owning priorities in this lane and knowing that they will not be able to get more bandwidth.
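As a rough worked example (the numbers are illustrative), for a 100-engineer organisation with a 12% allocation the split looks like this:

```python
# Back-of-the-envelope view of the pre-allocation; figures are illustrative.
engineers = 100
custom_share = 0.12  # within the suggested 10-15% band

custom_lane = round(engineers * custom_share)  # ~12 engineer-equivalents
roadmap_and_tech = engineers - custom_lane     # ~88 engineer-equivalents

print(f"Custom-request lane capacity: {custom_lane} engineers")
print(f"Roadmap and technical work:   {roadmap_and_tech} engineers")
```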
Don’t spend time on organisational experiments
Some companies are tempted to experiment with new or unproven team structures in order to “increase engineering productivity”. For instance, the Spotify Model for scaling Agile (Scaling Agile @ Spotify) has been touted as a way to decouple teams and increase productivity. In reality, this model has not even worked at Spotify itself! (Based on conversations we’ve had with a number of Spotify executives over the years.)
In many cases there’s a naive assumption underlying this, namely that Engineering is the main productivity bottleneck. While it’s very often true that there are Engineering bottlenecks and inefficiencies, keep in mind that Engineering productivity can only be improved linearly. Unless Product is very focused on customer value (the fundamentals we discussed in Part 4), it can generate an exponential increase in complexity and work, negating any Engineering improvements you may have achieved.
We recommend keeping a “standard and boring” hierarchical organisation for Engineering, supplemented with two things:
- An easy way to form and dissolve ephemeral teams (possibly without changes in reporting for members of these teams), to be able to respond to new challenges.
- Pushing some dependencies back onto the teams requesting them, in order to avoid certain teams (in particular, platform teams) becoming bottlenecks.
While you should strive to allow teams to operate as autonomously as possible, at some point – typically once you get to 30-50 engineers – you will need to start introducing platform teams. For instance, teams for devops, security, and application-level infrastructure. As mentioned above, you should strive to have a flexible allocation of responsibilities to avoid the platform teams turning into bottlenecks.
Increase your release agility as your team grows
The last point we’d like to mention here is that, as your team grows, you should strive to increase your release agility. This may seem counterintuitive but is an important and achievable goal while you’re scaling your Engineering team.
In one of our past jobs, we measured release failure rates and found that they grow with the number of previously unreleased commits in each release. Teams that released incrementally and very frequently were thus significantly more productive than teams that only did infrequent “Big Bang” releases. Using feature flags and gradual rollouts further reduces the risk carried by new code when it reaches production.
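To illustrate the gradual-rollout point, here is a minimal sketch of a percentage-based feature flag. The feature name and the in-memory flag table are invented for the example; production systems typically use a dedicated flag service or config store.

```python
# Minimal percentage-based rollout sketch. The flag table is a plain dict here;
# a real setup would read flags from a flag service or config store.
import hashlib

FLAGS = {"new-checkout-flow": 5}  # feature name -> % of users who see it

def is_enabled(feature: str, user_id: str) -> bool:
    rollout_pct = FLAGS.get(feature, 0)
    # Hash (feature, user) so every user lands in a stable bucket per feature.
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_pct

# Usage: ship the new path dark, then raise the percentage as confidence grows.
if is_enabled("new-checkout-flow", "user-4711"):
    ...  # new code path
else:
    ...  # existing behaviour
```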
One of the main enablers for frequent releases is putting in place good automated tests. To slow down your release frequency, nothing beats a 1-2 week manual test cycle!
The best way to achieve good automation is to get rid of your Quality Assurance (QA) team, if you ever started building one. Separate QA teams may make sense for companies shipping packaged software, since the cost of defects can be extremely high. However, in today’s SaaS world, a separate QA team or discipline creates overhead and quickly becomes a bottleneck. Engineers should own QA end to end; since they dislike manual testing, this provides a strong incentive to automate it over time.
Conclusion
You’ve now reached the end of this series in which we covered the following:
- In Part 1, we discussed what it means to move from pre- to post-PMF, and what new challenges you’ll need to face.
- In Part 2, we provided insights into how you can scale your company culture to deal with those challenges.
- In Part 3, we discussed scaling aspects that are common to both Product and Engineering.
- In Part 4, we focused specifically on how you need to evolve your Product discipline.
- This Part detailed scaling aspects that are specific to your Engineering team.
We hope you’ve enjoyed this series and that this material has provided you with food for thought and further exploration.