Managing Emergence Within a Sprint

Executive Summary

Some test anomalies are clearly defects in that they fail to meet a stated need, or they break something that worked before.  Defects are outside the scope of this paper.  We will be discussing non-defect anomalies, which represent a form of emergence.

When an anomaly is discovered during exploratory testing, and that anomaly does not map to an existing scenario within a story, some teams debate over how to treat it. 

Background


Emergence, Defects, and Technical Debt

One of the development teams I support has a primary set of Product Owners and proxies who approve delivered software.  After approval, however, they have another set of subject matter experts who perform exploratory testing to see if actual usage inspires tests they would not have considered during backlog grooming.

If we know it’s not a defect, do we treat it as a modification to an existing story, or as a new story?  And how urgently should we address the emerging need?  The following steps can help resolve these questions:

·        Evaluate whether the “emergence” is actually a defect that should be corrected immediately; if it is, skip the rest of this paper.  A defect would involve failure to successfully develop expected functionality or non-functional characteristics.

·        If it isn’t a defect, evaluate whether resolving the emergence is trivial, potentially a couple of hours or less.

·        Evaluate how soon the developed functionality is likely to be deployed into Production.

·        Evaluate whether the new scenario has a high probability of resurfacing in Production

·        Evaluate whether the new scenario would have a high impact upon the user, data integrity, or other important entities (in other words, risk assessment).

·        If the emergence is trivial, note it in an existing story and resolve it in the current iteration if possible.

·        If non-trivial to address, it will merit a new story or sub-story.  Normal backlog/iteration management will determine whether to address it in the current iteration, or in a future one.

A new story should be placed into the current iteration only if release is imminent, the emerged scenario is highly likely to create issues in Production, and those issues would have high impact.  Since the scenario was not initially defined in the Acceptance Criteria, the team must negotiate with the Product Owner; usually the new scenario can only be addressed right away if something else is allowed to be deprioritized out of the iteration.

An Analysis Flow for Emergence

The following process flow is not rigid.  It is not a required checklist.  You may wish to alter it, and you probably should. 

However you tailor this flow, please do not hold a separate meeting for each box in the flow.  Usually all the activities can be completed in a few minutes by the team, but the thought processes described in the diagram should be present whenever a team asserts that a previously unidentified scenario requires expanding a completed User Story.  The flow assumes that a User Story has been developed, demonstrated, and accepted.

[Figure: Managing Post-Acceptance Emergence process flow]

Most of the time, an emerged scenario will simply result in a new story, created in the next iteration (or later).  When this occurs, however, the iteration retrospective should re-evaluate whether story elaboration is adequate, whether the emergence represents a normal and manageable aspect of the work, and whether the effort to anticipate or head off some types of emergence is justified.

A Quick Review of Verification and Validation

Bugs and Defects


Verification ensures that a delivered piece of functionality meets what was requested in the User Story.  Verification always checks functionality, but also checks stated non-functional requirements:  security, Section 508 compliance, stated performance needs, and others.  Frequent communication, daily involvement of the Product Owner, and well-selected stories for each sprint should minimize verification issues.  Verification not only checks that the story works as intended (unit test), but also checks that the new functionality does not negatively impact existing units (integration/regression test).

Validation checks that a delivered piece of functionality is fit for use in the target environment.  In other words, it is what was needed, and it runs fast enough and reliably enough without disturbing (or breaking) the previously shipped code or other systems.  Validation doesn’t confine itself to stated parameters, but can be used to refine them.  Validation also doesn't start at the end of the development work (such as "wheels-up testing" in aviation).  Validation starts the minute the first Feature or User Story is drafted.

Validation activities check for unexpected behaviors as well as normally expected behaviors.  Many techniques may be used:  performance tests, usability trials, exploratory tests, load tests, stress tests, edge tests, corner tests and others.  Validation not only addresses individual stories, it can also flush out issues with architecture, environment configuration, and others.  Thinking about validation early, just as stringently as we think about verification, can help us anticipate issues.  We can learn to avoid issues by early and frequent validation, or mitigate them through preparedness.

[Figure: Verification and Validation]

Understanding whether the emergence relates to verification (developed what was intended) or validation (how well it behaves) can speed up discussions about what an emergence is, and what to do with it.


Check Whether the Story Was Actually Completed

Does the anomaly result from skipping elements of the standard Definition of Done?

If so, then the story’s status should be pulled back from “Accepted” to “In Progress.”  It never should have been accepted.  Remember, the Agile Center of Excellence (CoE) Definition of Done may be augmented by teams, but no item may be deleted from it.

Keep Validation Legitimate

I once encountered testers who would roll a keyboard on their heads – literally – to see if the application would lock up. Under that version of the OS they were using, it invariably would lock up.

What was the probability of something like that happening in the field? Greater than zero, actually, since the users would be in mobile field offices where coats, books, and other things could easily get dropped on a keyboard.

What was the impact? Very small; the users could just reboot their notebook computers if this happened.

Scope of control? Zero. The only fix would be for the organization to re-write MS-Windows or change out the operating system. Both options would have been ridiculous.

Despite the negligible risk, one tester refused to release the system, essentially holding it hostage for weeks over this “defect.”

Does the anomaly result from skipping elements of the standard Definition of Ready?

If you can’t seem to get your stories together fully before starting work, you may have a story splitting situation on your hands.  (By “story splitting,” we are referring to the industry discipline of defining stories of manageable size before or during an iteration.  We are not referring to Agile Central’s electronic technique to handle work teams couldn’t finish.) 

Splitting may be appropriate because the story was written with far too many scenarios and parameters.  The story may have failed to meet the “Independent” criterion, having many related stories buried in one.  Such stories should be split so that they are manageable, logical, can be reasonably defined for development, and can be completed in a relatively short period of time.  (See the story splitting topic under https://www.scaledagileframework.com/story/)

Does the anomaly violate functional statements in the User Story or Acceptance Criteria? 

If so, then it is a defect and should be handled as such.  The story should be placed back into “In Progress.”  Skip the rest of this paper.

Does the anomaly violate express non-functional requirements (referred to as NFRs)?

·        Does it violate standard NFRs mandated by policy (Security, Section 508, others)?

·        Does it violate agreed-to, general NFRs for the release or feature?

·        Does it represent failure to address NFRs stated within the User Story?  They may include:

o   performance,

o   stability,

o   reliability,

o   portability,

o   usability,

o   quality,

o   and others.

(Based on the list above, you can see why NFRs are often referred to as "ilities.")

If none of these situations are true, then the emergence may represent true discovery.  That is a good thing!  Discovery of unanticipated functional or non-functional needs is less expensive mid-iteration than six months after the code is put into Production. 

Emergence also can represent slack approaches to design, overburdening of the team, or mistaking haste for good velocity.  That is not a good thing.  You should evaluate where emergence is coming from.  Doing so may feel like sacrificing velocity now, but it will pay off in velocity later.

Verification and validation both address NFRs, but validation particularly considers the sum of all factors to ensure that what was asked for and built will actually work as intended.

Early Steps Should be Obvious, But…

If you’re working in an agile environment, checking for whether Definition of Done has been met should be obvious.  So should checking to see if the work met stated functionality.  Checking for failure to meet stated non-functional requirements may not be so obvious.

"Feature Creature" focus can drown out important "ilities"


One problem with agile-as-practiced is the rise of the “Feature Creatures,” teams so focused on functionality that they ignore critical considerations such as performance, security, usability (including standards for persons with disabilities), capacity, testability, maintainability, and others.  Some teams hate these non-functional requirements because they screw up perceived velocity.

Well, too bad.  They matter.  Failure to consider them will lead to expensive retrofits.  In the case of security, that always leaves holes.  For many other NFRs, it is ghastly expensive, which may serve the need of a contract development team, but does not serve the needs of the people who sponsored the system development in the first place.  So write your stories to consider both functional and non-functional needs early, and check them often.

Emerging New Needs

Emergence often contains diamonds in the rough.


Exploratory testing will surface new functional needs, and possibly even more non-functional ones.  Skilled exploratory testers can be valuable for catching such emergence early and often.  Once their discoveries have been determined to not actually be defects, they may be defined as new attributes or scenarios within stories, or broken out into new stories.

If the emerged need isn’t a clearly, obviously, critically important aspect of the functionality, it may belong in a new story.  Make that story a dependent or related one, and track it.  Don’t obsess about having to fix it in the current iteration, especially if your iterations are under 3 weeks in length.  You’ve discovered an important need, and did it within days of the functionality being developed.  And you’ve planned for a degree of emergence anyway.  Right?  RIGHT?

Technical Debt


Some emergence will represent technical debt: the realization that design decisions we made under constraints now have to be addressed. (For those not familiar with technical debt, it is different from defects.) Occasionally we have to accept and mitigate some technical debt because we don’t have the ability to resolve it yet.  But we should anticipate the possible consequences of not resolving it now.  If we do that well, tech debt-related emergence shouldn’t baffle us.  We should have some mitigation in place, such as calendar reminders to circle back and address the emergence, test cases to evaluate when it’s coming back to bite us, code commentary to remind us how we planned to address it, and so on.

Tech debt - a "sword of Damocles"


Technical debt should be incurred based on an informed understanding of risk.  Deliberate, responsible technical debt is often necessary in order to deal with time constraints, technical shifts, and other situations outside the direct control of the development team.  Responsible technical debt is paid down as soon as possible, however, as part of risk mitigation.  Casually accepting the risks of technical debt is irresponsible.

Ignored and neglected technical debt will create hard-to-manage emergence.  It will cause defects, sometimes of critical severity, and usually costly to correct.  When we see “emergence” caused by Technical Debt, we should stop calling it an emergence.  It is a design defect, outright.  It didn’t “emerge.”

Address Emergence Based on Risk

Neglected technical debt represents high risk.  Defect-based issues represent a range of risks, as does true emergence.  We should evaluate the relative risk of both defects and emergence.  We should decide how we will deal with that risk on a factual basis. 

We will focus on true emergence for now, and discuss pure defects at another time.

 [The following set of activities looks complicated and time-consuming when written out, except to those with skills in risk management.  Once risk evaluation becomes habitual, however, such risk assessments can be done quickly.  At first, you may mark off risk checklists, follow flow charts, or fill out risk profiles.  Those tools are aids to get started; they can’t replace mastery of this basic, necessary skill.]

Assess the Risk Represented by the Emerged Scenario

As stated above, an emerged need or scenario discovered after acceptance represents risk.  There are risks to resolving certain ones right away, and risks to waiting to resolve others.  These risks should be analyzed against a few parameters. 

·        Probability – how likely is this situation to arise in actual use? (for example, 1 out of 10 times would be higher probability than 1 out of 1,000 times)

·        Proximity – how soon is this situation likely to occur in actual use?  If it won’t come up for six months, we have more time to deal with it than if it is likely to show up in 3 days.

·        Impact – what will happen when this situation occurs?  How much cost or inconvenience to system users is likely?  How much damage to data or system stability is likely?

·        Scope of Control – do we have the ability to develop, sponsor, or negotiate a solution to this situation? If we can’t prevent the scenario from happening, how can we lessen its negative impacts by actions that are within our control?
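These four parameters can be captured as a lightweight record the team fills in during triage. Here is a minimal Python sketch; the class name, field names, and the 1-to-5 scales are illustrative assumptions, not a mandated format:

```python
from dataclasses import dataclass

@dataclass
class EmergedScenarioRisk:
    """Hypothetical record of the four risk parameters discussed above."""
    description: str
    probability: int   # 1-5: how likely the situation is in actual use
    proximity: int     # 1-5: how soon it is likely to occur in actual use
    impact: int        # 1-5: likely cost or damage when it occurs
    in_scope: bool     # can we develop, sponsor, or negotiate a solution?

# Example: the keyboard-rolling scenario from the sidebar earlier
keyboard = EmergedScenarioRisk(
    description="Objects dropped on keyboard lock up the OS",
    probability=2, proximity=3, impact=1, in_scope=False)
```

Writing the parameters down this way keeps the triage conversation focused on the same four questions every time.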


Here’s an example:  a meteor falling on our hosting site would have catastrophic impact, but very low probability.  It’s also out of our direct scope of control; we can’t prevent it.  Therefore, we may try to mitigate the possible damage by minimizing irrecoverable data loss.  We may not have the resources to meteor-proof a building (preventive mitigation), but we could reduce recovery time by requiring off-site storage and disaster recovery protocols.  Such protocols are part of contingency planning.  The more frequently we are able to update redundant storage within our budget, the more we are able to mitigate potential data loss.  We would want to get these contingency and mitigation capabilities in place with an urgency proportional to the risks they mitigate.  How soon (within the next quarter or within the next 24 hours) would depend on those factors.

By contrast, consider a situation in which a flawed acceptance criterion could result in an accounting error.  In the real world, this has relatively high Probability, the Impact still could be serious, and Proximity (when it might crop up) could be pretty soon after release.  The scope, however, is within our control.  We would want to increase the quality of our stories and acceptance criteria, we’d want to perform solid system architecture and design work, run high quality tests, have a strong configuration management and deployment pipeline, and ensure robust feedback loops all along the way.  Those are all things we could start working on during the current sprint.  We wouldn’t wait until the next quarter, we would be diligent about them now.

Probability and Impact are generally used to create a risk rating.   A common approach is to use a scale between 1 and 5; 1 represents the lowest Probability or Impact possible, 5 represents the greatest. 

·        If an item has low Probability (1) and fairly low Impact (2), then the rating would be 1*2=2. 

·        If a risk has a high Probability (5) and a fairly high impact (4), then the rating would be 5*4=20. 

On this scale, the issue with a rating of 2 would represent lower risk than the one with a rating of 20.  Some organizations portray the range of rated risks along a color band in order to get a sense of their relative importance, as shown below.

[Figure: risk profile color band]
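The rating arithmetic above is simple enough to express in a few lines of Python. This is a sketch for illustration; the band thresholds are assumptions, not an organizational standard:

```python
def risk_rating(probability: int, impact: int) -> int:
    """Rate a risk on the 1-5 x 1-5 scale; results range from 1 to 25."""
    if not (1 <= probability <= 5 and 1 <= impact <= 5):
        raise ValueError("probability and impact must each be 1-5")
    return probability * impact

def risk_band(rating: int) -> str:
    """Map a rating onto a simple color band (thresholds are illustrative)."""
    if rating >= 15:
        return "red"      # act now
    if rating >= 8:
        return "amber"    # plan mitigation
    return "green"        # watch

# The two examples from the text:
print(risk_rating(1, 2), risk_band(risk_rating(1, 2)))  # → 2 green
print(risk_rating(5, 4), risk_band(risk_rating(5, 4)))  # → 20 red
```

The point is not the arithmetic itself but that everyone on the team applies the same scale the same way.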

While the risk rating gives us a sense of the emergence’s (or defect’s) importance, considering Proximity and Scope of Control tells us how soon we’d have an issue, and whether we can control it.

Remember that these ratings are relative at the project level.  Rarely does an agile development team need statistical cost, time, and resource data to pinpoint a precise risk profile; that is more in the domain of accountants and others concerned with number crunching.  Agile development teams need to understand the relative importance and risks of any emerging functionality or quality needs.  They only need to prioritize, not create projections accurate out to several decimal places.

Based on Risk, Decide What to Do About the Emerged Needs

First, analyze the tradeoff – does the cost or delay of a solution outweigh the possible impact, probability, and severity?  In other words, is addressing the new scenario even worth the effort?  Use your sizing techniques to do this (Planning Poker©, t-shirt sizing, simplified function points, etc.). 

If addressing the emergence is important (high risk and worth the investment), decide how to manage it.  The following criteria give ideas as to how to make tradeoff decisions, as do the process flow at the top of this paper and the decision table in Appendix A.  These appear complex until one becomes skilled in risk management; then they become so simple that following a step-wise procedure becomes unnecessary.

(The next four bullets were shown in the Process Flow earlier in this blog)

·        Identify the anomaly as a defect to fix in the current iteration (remediate the risk.)   (Do this if the shipped code does not conform to the User Stories and existing scenarios or if it violates definition of done, AND it has high severity, high probability, mid-high proximity.)

·        Identify the anomaly as a defect and defer it for later (remediate the risk.)   (Do this if it does not conform to the User Stories and existing scenarios or violates definition of done, AND has only moderate severity, impact, or proximity.)

·        Identify the anomaly as emergence, place it in a story, and fix it now, possibly pushing out other stories (remediate the risk.)   (Do this if the emerging need/scenario was not anticipated by Acceptance Criteria, AND it has high impact, high probability, and high proximity.)

·       Identify the anomaly as emergence, place it in a story, and fix it later (remediate the risk.)   (Do this if the emerging need/scenario was not anticipated by Acceptance Criteria, AND it has low-mid impact/probability/proximity.)

(The next bullets were not shown in the Process Flow earlier in this blog)

·        Avoid the risks related to not addressing the emergence.  (Do this if addressing the emergence is not feasible.)  This may involve adjusting the scope of the planned work to eliminate risky scenarios.

·        Watch the risks related to not addressing the emergence (Do this if there is a very low or uncertain probability or impact, since fixing it will push out other stories that all probably have higher priority.  Document the emergence as a low-priority story in case it surfaces later and poses higher risk.)

·        Mitigate the risks if the emergence is apparently out of our scope of control to correct.  For example, if changes to external systems/environments are impacting our system, then a mitigation would be to invite representatives of those components to our IP sprints or backlog grooming.  That way we may be at least aware of emerging changes and impacts on our work.

·         Accept the risk if the emergence has extremely low probability and impact, and fixing the emergence will push out other stories of higher value.  Also accept the risk if the cause of the emergence is completely out of our scope of control, is rare, and can’t be anticipated.  Be careful that this does not become the default!  Technical debt often accumulates because its risks are simply accepted without understanding their impact.
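The decision bullets above can be condensed into a single triage function. This is a hedged sketch, not a prescribed procedure: the function name, the threshold values ("high" means 4 or more, "low" means 2 or less on the 1-to-5 scale), and the exact ordering of the checks are all assumptions for illustration.

```python
HIGH, LOW = 4, 2  # illustrative cutoffs on the 1-5 rating scale

def triage(is_defect: bool, impact: int, probability: int,
           proximity: int, in_scope: bool, feasible: bool) -> str:
    """Recommend a response for an anomaly found after story acceptance."""
    if is_defect:
        # Conformance failures are remediated; urgency depends on risk.
        if impact >= HIGH and probability >= HIGH and proximity >= 3:
            return "fix as defect this iteration"
        return "fix as defect in a later iteration"
    if not in_scope:
        # Out of our control: mitigate, or accept if rare and unforeseeable.
        return "accept the risk" if probability <= LOW else "mitigate the risk"
    if not feasible:
        return "avoid the risk (adjust scope)"
    if impact <= LOW and probability <= LOW:
        return "watch (log a low-priority story)"
    if impact >= HIGH and probability >= HIGH and proximity >= HIGH:
        return "new story, fix this iteration"
    return "new story, fix in a later iteration"
```

Even if you never automate this, walking the same branches in conversation keeps the team's tradeoff decisions factual rather than arbitrary.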

Both Mitigating and Accepting the risks may call for Contingency planning in case the worst happens.

Summary

Whether or not you strictly follow the evaluation steps, and whether or not you memorize all the components of risk rating and management, you must address emergence on a factual basis.  The flows and steps are tools to help you progress to that point.  Otherwise, the total cost of a decision to address emergence will remain arbitrary, and emergence can overwhelm your projects.

Along with developing these skills among your team, work on the bigger picture.  Cooperate with your project management roles to figure out time and cost tradeoffs.  Work with the development team and business sponsors to determine how the rate of emergence impacts your Feature mapping and Value Stream mapping.  Work across teams to figure out how to get a handle on emergence collectively, and how to keep it from disrupting interdependencies and cooperation across teams.  You may not eliminate all emergence.  You can control your reactions to emergence, and hopefully diminish uncontrolled emergence.