How Many Bugs Are Too Many? A Data-Driven Approach to Quality Debt at Harness
Speed isn't everything: quality and bugs must be addressed, unless you really, *really* enjoy technical debt.
As a startup, one of our unfair advantages is speed. But speed also brings quality challenges, a critical issue in the software delivery space. Our platform is mission-critical for our customers. Imagine getting hit with a product bug when you’re trying to deploy an urgent hotfix to your customers. Not a good state to be in!
Startups also tend to add significant technical debt to the codebase early in their journey, which turns speed and quality into a zero-sum game. Harness faced its share of these challenges a couple of years ago:
- To begin with, Devs were testing features & were focusing more on positive scenarios.
- Although we supported many platforms & integrations, we weren’t testing all the combinations. Our initial focus was more on common customer use cases. Negative testing, edge cases, and performance and scale testing didn’t get the desired level of attention.
- We did not have good automation coverage across automation layers like unit testing, integration testing, API testing, and End-to-End (E2E) scenarios, resulting in higher quality debt.
- At one end, we were rapidly scaling with more customers and releasing new features and at the other end, we were releasing every day without the requisite amount of automation for daily sanity/regression testing.
Customers were finding key gaps/defects in production. We were also introducing regression issues, which we failed to find within our development & release cycles.
Harness was growing faster than the pace at which our quality debt was being addressed.
We acknowledged that this approach had its limitations and came up with a clear quality strategy for both new products and products with higher customer adoption.
The strategy needed to help us keep up the pace of delivering new features while also improving the quality of existing features.
Quality Improvement Strategy, Phase 1
When we started this initiative, we had limited bandwidth (as you do, in any high-growth startup!) so we decided to improve quality on two key dimensions in phase 1:
- Create a sanity and regression suite covering all critical E2E scenarios, and have the related tests set up in the QA environment for regular execution.
- Improve the quality of all new features so that we don’t keep adding debt to the existing backlog.
For all new features, we set stringent feature sign-off criteria:
- For critical features, QEs are involved from the design stage and provide P0/P1 scenarios for Devs to test as part of Dev testing cycle (“shift left” testing).
- Test scenarios are reviewed by QE, Devs, and Product Managers.
- We set automation coverage criteria for backend/frontend unit tests (~75-80%) and E2E automation (100% of P0/P1 scenarios).
- If a scenario can’t be automated at any layer, we perform an additional risk assessment and implement fixes.
- All P0/P1 and P2 regression bugs are fixed.
- Critical features go through Alpha/Beta testing behind a feature flag before the feature is made available for all customers.
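The sign-off criteria above can be read as a simple gate. The following is a minimal sketch, not our actual tooling: the `FeatureStatus` fields, the `signoff_blockers` function, and the choice of 75% as the cut-off (the lower bound of the ~75-80% target) are all assumptions made for illustration. It also ignores the risk-assessment exception for scenarios that cannot be automated.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FeatureStatus:
    """Hypothetical snapshot of a feature's quality data at sign-off time."""
    unit_coverage_pct: float        # backend/frontend unit test coverage
    p0_p1_scenarios: int            # total P0/P1 E2E scenarios
    p0_p1_automated: int            # P0/P1 E2E scenarios that are automated
    open_p0_p1_bugs: int            # unresolved P0/P1 bugs
    open_p2_regressions: int        # unresolved P2 regression bugs
    is_critical: bool               # critical features need Alpha/Beta testing
    beta_tested_behind_flag: bool   # Alpha/Beta was run behind a feature flag


# Assumed threshold: lower bound of the ~75-80% unit test target.
MIN_UNIT_COVERAGE_PCT = 75.0


def signoff_blockers(feature: FeatureStatus) -> List[str]:
    """Return the unmet sign-off criteria; an empty list means ready."""
    blockers = []
    if feature.unit_coverage_pct < MIN_UNIT_COVERAGE_PCT:
        blockers.append("unit test coverage below target")
    if feature.p0_p1_automated < feature.p0_p1_scenarios:
        blockers.append("P0/P1 E2E scenarios not fully automated")
    if feature.open_p0_p1_bugs > 0 or feature.open_p2_regressions > 0:
        blockers.append("open P0/P1 or P2 regression bugs")
    if feature.is_critical and not feature.beta_tested_behind_flag:
        blockers.append("critical feature not Alpha/Beta tested behind a flag")
    return blockers
```

Returning the list of blockers, rather than a single pass/fail flag, makes it easier to see which criterion is holding up a release.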
This strategy resulted in:
- A decrease in the number of Customer-Found Defects (CFDs), critical CFDs (P0/P1), and regression defects.
- A commensurate increase in Internally-Found Defects (IFDs) that could potentially have been a CFD (refer to rows 2 & 3 in the below table).
Please note: We began this exercise at the start of the 2nd quarter of 2020-2021, so the values for that year are derived from 3 quarters. 2021-2022 is still in progress, and its data also covers 3 quarters (the first quarter starts in February). We have been adding multiple products every year without impacting quality goals.
Quality Improvement Strategy, Phase 2
We continued to review the CFDs and observed a pattern: while we had achieved a significant reduction in CFD count for all new features (roughly under 2%), we were still finding many CFDs in existing features that had been developed 12-24 months earlier (let’s call these legacy features).
This was impacting our customer experience. As such, we started with an approach that could give us clear and consistent results across legacy features as well.
We picked Product A, a legacy feature that had very high customer usage. We thoroughly analyzed it and came up with a strategy to make a stronger impact on the quality of our product.
Please note that we still had bandwidth constraints, hence our quality efforts were focused on specific components in each quarter (from mid-Q4 of 2020). This was extended to other components in our product over the next few quarters.
Key Constraints and Comparative Data
- Team size increased from earlier quarters to Q4 of 2020.
- Marked increase in number of new features being delivered.
- New team members needed 4-6 weeks to gain good knowledge in the product.
- A few of the existing team members were moved to a different project area.
- The strategy was initiated in the 2nd month of the quarter, and the automation efforts showed results only in the 3rd month of the quarter.
Product A has two project modules (let us call them A1 & A2).
While we considered Product A as a whole for our quality improvements, we were able to allocate the desired bandwidth to module A1 but not to module A2.
Key Results & Comparative Data
Product A - Module A1
While the above numbers cover Product A and Module A1 as a whole, we achieved a reduction of more than 50% in CFDs for the components chosen in specific modules in their respective following quarters.
We understand that the quality problem we encountered is very common in the startup space, so we are sharing our approach and learnings in the hope that they help others pursuing a similar journey.
Here’s what we used as a sort of ‘quality playbook’, and what we recommend for your organization:
- Continue stringent quality focus for all new features even when you have high quality debt for existing features.
- Perform detailed root cause analysis (RCA) on CFDs, and pick the top 3-5 components with the most CFDs to get the highest ROI for your efforts.
- Focus on improving test coverage and automation coverage for respective components.
- For the chosen components, assign Dev & QE owners.
- Owners must come up with elaborate test cases similar to how we test new features covering negative/edge scenarios.
- Get the test cases reviewed by Devs, PMs, and Customer Success Managers as each of them can provide different dimensions.
- Capture current automation coverage across backend and frontend unit tests, integration tests, E2E scenarios, and API scenarios.
- Target automation coverage:
  - Backend/Frontend unit tests (~75-80%).
  - E2E automation (100% of P0/P1/critical P2 scenarios).
  - 100% of API scenarios.
- Add QA pipelines that would be executed for all release builds.
- Act on critical action items from every CFD analysis.
- Developers & QE should jointly analyze the code and find opportunities to improve code/implementation logic, improve error handling, etc.
- Code walkthroughs help QE come up with additional scenarios and/or eliminate certain use cases which are not valid.
- Conduct bug bashes for thorough & ad hoc testing.
- Start by picking your top 10 customers and analyzing their setup, configurations, and how they implement specific features. This helps you improve test coverage and cover these scenarios in regression testing.
- Fix all P0/P1 & critical P2 bugs.
- Set up a regular review process and accommodate required course corrections for better outcomes.
- Perform continuous RCA for CFDs and IFD regressions, and add the key action items to the backlog for further consideration.
- Add and automate test cases for all the CFDs and IFD regressions.
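As a rough illustration of the "pick the top 3-5 components" step in the playbook above, the sketch below ranks components by CFD count. It assumes each CFD has already been tagged with a component during RCA; the function name `top_cfd_components` and the component names in the example are hypothetical and not taken from our data.

```python
from collections import Counter


def top_cfd_components(cfd_components, n=5):
    """Rank components by CFD count.

    cfd_components: iterable with one component name per CFD.
    Returns up to n tuples of (component, cfd_count, percent_of_all_cfds).
    """
    counts = Counter(cfd_components)
    total = sum(counts.values())
    return [
        (name, count, round(100.0 * count / total, 1))
        for name, count in counts.most_common(n)
    ]


# Hypothetical example: 10 CFDs tagged with made-up component names.
cfds = ["pipeline"] * 5 + ["delegate"] * 3 + ["ui"] * 2
for name, count, share in top_cfd_components(cfds, n=2):
    print(f"{name}: {count} CFDs ({share}% of total)")
```

The share column gives a quick sense of how much of the CFD backlog a focused effort on those components could address.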
For complex enterprise SaaS products under continuous development, where about 50% of the work is new features, Defect Detection Efficiency (DDE) is usually between 85% and 98%.
- Defects found in production are considered for the first 90 days after production release.
- In projects where we do not capture defects found in feature/PR testing, the measured DDE will be very low, because most defects tend to be found in that phase and go uncounted.
- ROI needs to be analyzed for efforts to capture all the defects in the feature testing phase vs measuring DDE of 95% and above.
- A DDE of 95% and above is a great metric for new products.
- A Defect Removal Efficiency (DRE) of 90% is also a strong metric, as we are fixing 90% of all the defects found and eliminating the opportunity for these to be found in production.
- A sub-goal of 100% DRE for P0/P1 defects and 90% for P2s is practical.
- Regression issues are usually between 1-2% of all the defects. Focus should be on ensuring that these are non-P0/P1 issues.
- Adhering to the test automation pyramid model will help us measure defect leakage across each phase of development and release.
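To make the two metrics above concrete, here is a minimal sketch of how they could be computed. It assumes the common definition of DDE (defects found before release as a share of all defects, with production defects counted over the first 90 days) and reads DRE as defects fixed over defects found, as described in the list above. The function names and the example counts are hypothetical.

```python
from typing import Optional


def dde(found_internally: int, found_in_production: int) -> Optional[float]:
    """Defect Detection Efficiency, in percent.

    found_internally: defects found before production release (IFDs).
    found_in_production: defects found in the first 90 days after release.
    Returns None when no defects were recorded at all.
    """
    total = found_internally + found_in_production
    if total == 0:
        return None
    return 100.0 * found_internally / total


def dre(fixed: int, found: int) -> Optional[float]:
    """Defect Removal Efficiency, in percent: defects fixed over defects found."""
    if found == 0:
        return None
    return 100.0 * fixed / found


# Hypothetical counts: 190 defects found internally, 10 found in production,
# and 171 of the 190 internally found defects fixed before release.
print(dde(190, 10))   # 95.0
print(dre(171, 190))  # 90.0
```

Tracking the same two functions per priority (for example, P0/P1 separately from P2) would support the sub-goals mentioned in the list.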
A quality strategy in a startup should let the team keep up the required pace in the initial phase of product development while reaching 90-95% DDE as the customer base scales.
Harness is a mission-critical SaaS platform with many enterprise customers doing north of a million deployments on a monthly basis.
We looked at the issues around quality and put together dashboards to identify where the bug clusters are, how to classify and prioritize them, and where the gaps in our processes lie. The dashboards also show what we need to do to make sure any new code added to the codebase goes through the correct PR checks, testing, automation, and so on.
For further reading and for more background on the topic, read Product Quality — Measure what Matters. You can also download my Template for Actionable Root Cause Analysis of Software Defects.
Happy shipping quality products at speed!