More than a decade after the publication of Nudge, the organizational relevance of insights from Behavioural Science is no longer questioned. But in terms of applying these insights, two challenges remain. The first relates to creating organizations that truly understand the science and embed it deeply within their structures and methods. The second challenge is one of scaling: how can we ensure the results of experiments and pilots successfully scale into interventions that reliably improve consumer welfare? In this article we will focus on the latter challenge and identify three specific scaling challenges.
First, one of the biggest themes from the research on judgment and decision-making is the notion of context dependence. We now know that numerous elements in a given context (e.g. the medium of the message, the time of day, information framing and ambient factors, to name a few) influence people’s choices. A rich literature on preference reversals has shown, for example, that consumer preferences can reverse with seemingly irrelevant changes to the context. For business leaders and policymakers, the implication is clear: the fact that a particular intervention worked well in one particular context does not guarantee its success in a completely different context.
Second, the lack of diversity of pilot-study participants poses a real challenge. While it is important to understand the efficacy of an intervention among a representative group, it is equally important to understand the heterogeneity of the population. Small-scale studies may focus on the set of people for whom the treatment effects are believed to be the most significant, but a goal for large-scale replications should be to figure out the effects for a wide array of segments of the population. The fact is, in some cases interventions may not scale well to the entire population, but can be very effective on a subgroup. After all, successful business models and policies do not have to be one-size-fits-all.
Third, there is often a temptation to adopt a ‘kitchen sink’ approach whereby multiple interventions that have been successfully tested independently are deployed simultaneously. Unfortunately, multiple-insight interventions can interact in complex and unpredictable ways, and can actually backfire.
Following are three examples that illustrate the challenges of scaling behavioural interventions.
EXHIBIT A: Credit Card Reminders
Research shows that consumers tend to spend more when using a credit card rather than when paying with cash. One reason for this ‘credit card premium’ is the fact that continued use of the credit card weakens memory of past expenses. Consequently, giving people feedback (reminders) on how much they have spent on prior expenses should improve mental tracking and mitigate spending differences.
In response to growing concerns about credit card debt, in 2010 the government in South Korea mandated that credit card providers introduce a text messaging service to remind consumers about recent transactions. In addition to preventing the fraudulent use of credit cards, the policy was designed to improve rational spending behaviour. The expectation was that the text-alert system would help people control their spending and hence reduce credit card balances.
The result: The policy had the intended outcome for only about 12 per cent of the population—the heaviest spenders. For the remaining 88 per cent, it actually resulted in increased spending. This backfiring effect can be attributed to an important difference in the manner in which the reminder was delivered. In the pilot studies, it was available on the same screen where spending decisions were being made; in the scaled-up intervention, it appeared separately on the user’s mobile device. This created a degree of ‘digital dependency’ whereby consumers believed that they could easily access their past spending if they needed to, thereby reducing the motivation to track it.
EXHIBIT B: A Retirement Savings Intervention
Mexico is facing a poverty crisis among its elderly citizens. All salaried employees in Mexico must make a 6.5 per cent mandatory contribution to their pensions, but projections show that they need to make an additional five per cent voluntary contribution in order to retire comfortably. Unfortunately, the voluntary contribution rate is abysmally low.
Working with CONSAR, the Mexican pension authority, and ideas42, Rotman professors Avni Shah, Matthew Osborne and one of the authors (Dilip Soman) redesigned the quarterly statement that every salaried employee receives. Research shows that simplifying communication and making it more engaging increases the likelihood that recipients will consume the information and act on it. Accordingly, the redesigned statement was made significantly more engaging than the original, in two ways. First, it provided a simplified visual illustration (in the form of a categorical thermometer) showing whether the recipient’s current savings were adequate for retirement.
Second, the statement included one of several interventions that had been shown to be successful elsewhere: gain versus loss framing, a wallet cut-out to increase implementation intentions, an appeal that made the family’s welfare salient and a fresh-start intervention encouraging recipients to start saving after a particular temporal landmark.
The result: In a large-scale trial with members of two pension funds, the intervention was a success with one of the funds. It increased the contribution incidence, the contribution amounts and the contribution frequency. However, it backfired in the other fund. Why? Because of a specific design feature unique to the Mexican pension system. Mexicans need to first choose a pension fund and then make their contribution decisions. The quarterly pension statements displayed the performance of each fund in a tabular form. By making these statements more engaging, the researchers increased the attention that was paid to the table. If the fund was high performing, then the engaging statement ended up improving voluntary contributions because it increased the motivation to save. If, on the other hand, the recipient had chosen a low-performing fund, the intervention resulted in demotivation because it drew attention to the fund’s lower performance.
EXHIBIT C: Heterogeneity in Text Message Reminders
In ongoing work with Mexican pension contributors, the same group of researchers compared a control condition with one in which recipients received the redesigned quarterly statement. In previous research, text reminders have been shown to successfully convert intentions into action. Thus, in a third condition, subjects additionally received one of a variety of call-to-action text messages.
The result: Compared to the control group, a condition in which recipients received a text message emphasizing their family’s financial security in addition to the redesigned statement increased contribution rates significantly. However, did both men and women care equally about their families? What about people who didn’t have children? Did age matter?
Using a machine learning technique, the researchers identified heterogeneous treatment effects across the scaled-up population. Their results were unsurprising, but they showed exactly for which sub-segments of the population the intervention was particularly successful and where it was not. For instance, they found that the ‘family security text message’ intervention increased contribution rates for people aged between 28 and 42. It did not have a significant effect for those above the age of 43, and it had a backfiring effect for people under the age of 27.
Knowing exactly where the family-security text message (as well as each of the other messages tested) was most effective would allow the Pension Authority to effectively target sub-segments of the population.
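To make the targeting idea concrete, here is a minimal sketch of one way heterogeneous treatment effects can be estimated and then summarized by sub-segment. It uses simulated, hypothetical data and a simple ‘T-learner’ (separate outcome models for treated and control units); the article does not specify which machine learning method the researchers actually used, so the variable names, age bands and effect sizes below are purely illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical data: age, a treatment flag (received the family-security
# text message) and the subsequent voluntary pension contribution.
age = rng.integers(22, 60, n)
treated = rng.integers(0, 2, n)
# Simulated "true" effect, loosely mimicking the pattern described above:
# positive for ages 28-42, roughly zero for 43+, negative for the youngest.
true_effect = np.where((age >= 28) & (age <= 42), 50.0,
                       np.where(age <= 27, -20.0, 0.0))
contribution = 200 + 2 * age + treated * true_effect + rng.normal(0, 40, n)

df = pd.DataFrame({"age": age, "treated": treated, "y": contribution})
X = df[["age"]]

# T-learner: fit separate outcome models for treated and control units,
# then estimate each person's conditional average treatment effect (CATE)
# as the difference between the two model predictions.
m1 = GradientBoostingRegressor().fit(X[df.treated == 1], df.y[df.treated == 1])
m0 = GradientBoostingRegressor().fit(X[df.treated == 0], df.y[df.treated == 0])
df["cate"] = m1.predict(X) - m0.predict(X)

# Average the estimated effects within age bands to see for which
# sub-segments the intervention helps, does nothing, or backfires.
bands = pd.cut(df["age"], bins=[21, 27, 42, 59], labels=["<=27", "28-42", "43+"])
print(df.groupby(bands, observed=True)["cate"].mean().round(1))
```

In practice, a summary like this is what would let a pension authority send a given message only to the sub-segments where it helps and withhold it where it backfires.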
How might these scaling challenges play out for an intervention based on one of the most robust findings from Behavioural Science: the framing of monetary outcomes as either gains or losses?
In a field study in a high-tech manufacturing facility, one of us (Tanjim Hossain) manipulated the manner in which a productivity bonus was presented to workers. In particular, the same bonus was presented to a random subset of workers as a potential gain and to another subset as a potential loss. The authors found that both the gain and loss framing increased productivity compared to a control condition (in which there were no incentives). Incentives clearly worked; more importantly, framing the incentive as a potential loss led to a small but significant increase in productivity over presenting it as a potential gain.
It is quite important that this small increase in productivity happened in a real work setting. If this small gain in productivity could be scaled to large populations of workers, it would indeed have a significant impact on welfare. The question remains whether the effect would scale up at exactly the same rate or whether there would be a ‘voltage drop’. Could the intervention actually backfire? In this study, the intervention was implemented for only about a month. While the impact was evident throughout the experiment, a negative framing of incentives might be less likely to succeed if the intervention is permanent or much longer in duration.
We believe that the effect of loss aversion on productivity depends on the interaction between the framing (gain vs. loss) and the underlying incentive scheme. Hence, one needs to be careful in choosing the economic part of the intervention while scaling up to ensure that it doesn’t backfire. Moreover, the effectiveness of loss framing may also depend on the size of the economic incentives. In the study mentioned, the size of the bonus was above 20 per cent of any of the workers’ base salaries. The treatment effect for smaller-sized incentives may be insignificant or opposite in direction.
Our views of why scaling challenges exist are grounded in organizational realities. In particular, we will highlight three organizational reasons, as well as one related to the nature of evidence.
The first reason is solution-mindedness. Unlike scientists, governments and businesses are typically under enormous time pressures to solve problems. Second, as applications of behavioural insights have spread, organizations are increasingly relying on non-specialists to design and deploy behavioural interventions. In an effort to help non-specialists, a number of heuristic frameworks have been developed. Frameworks are elegant in that they allow a non-specialist to try to design interventions based on the learnings of others. However, they can often be counterproductive because they change the process of intervention design from one that begins with a careful audit of the context to a checklist-based approach.
Third, most organizations that we have worked with operate in silos of capability. Behavioural scientists are typically located in a different department from data scientists and design teams. Some of the scaling challenges described herein will be best addressed if we have large and diverse teams working on these projects. Perhaps most importantly, the cost of experimentation—collecting and analyzing data, time commitment and organizational buy-in, among others—is high; therefore, many organizations embrace off-the-shelf results.
The nature of insights in the behavioural sciences is significantly different from evidence in other sciences. Consider fields such as Physics or Medicine. As an example, ‘objects released from a height will fall to the ground, irrespective of who releases them, the height of release, their colour and whether there are other objects being released at the same time.’ The theory of gravity has a large bracket of context surrounding it, and how the theory works in foreseeable but uncommon contexts is also well understood. For example, releasing the same object on the surface of the moon (or in outer space) will not result in the same sort of drop. Most people properly understand the difference between the two contexts.
Unfortunately, the field of Behavioural Science has very narrow brackets of context. Therefore, it is important to emphasize that the documented robust demonstration of an effect in a particular context with a particular type of end user at a particular point in time does not guarantee that it will successfully replicate across contexts.
Therefore, unlike the natural or medical sciences, where a rigorous meta-analysis of available evidence is typically enough to help predict what would happen in a scaled-up scenario, behavioural interventions often call for a sequential approach to evidence. A meta-analysis can serve as a helpful starting point for developing interventions. These interventions would then probably need to be tested in low-cost and quick-win environments such as laboratory studies or online panels. The successful interventions in these domains might then be scaled into pilot studies – and then into larger pilots – before being enshrined in policy.
We therefore caution practitioners to avoid the temptation of simply using an off-the-shelf solution from the proverbial Nudgestore, and instead to use a more customized and tailored approach based on the mantra popularized by The Behavioural Insights Team – Test, Learn and Adapt!
In particular, it is of utmost importance that pilots are done in the context in which the intervention is going to be scaled up (in situ evidence). In situ testing in South Korea could have flagged the potential backfiring effect of reminders at a stage where it could have been corrected. Finally, scaling up must be accompanied by continuous testing, learning and adapting in order to detect and respond to potential backfiring due to context changes, interactions between different interventions and heterogeneity.
In closing
We will end with a few proposals. First, rather than viewing researchers purely as producers and organizations purely as consumers of research, we should strive to co-create evidence. Second, researchers should be incentivized to articulate all of the dimensions of the context (or features of the situation) under which their documented effects hold. They should further be incentivized to test for the effect under different contexts.
Third, when in situ testing is not possible, we should document the dimensions of the context of the scaled-up situation that might be different from the testing phase and invite the original researcher to speculate on whether these differences might change the results. Fourth, we should test for heterogeneous treatment effects of interventions and use a family of appropriately targeted interventions to scale solutions.
Finally, we should encourage researchers and organizations to create project teams of behavioural scientists, designers and data scientists to collectively design, test, scale and monitor interventions. And we should strive to reduce the cost of experimentation within organizations. All of these efforts will be worthwhile to bring the power of behavioural insights into organizations of all types.
Dilip Soman is Canada Research Chair in Behavioural Science and Economics, Founding Director of the Behavioural Economics in Action Research Centre at Rotman [BEAR] and Professor of Marketing at the Rotman School of Management. He is the author of The Last Mile: Creating Social and Economic Value from Behavioural Insights (Rotman-UTP Publishing, 2015). His most recent book is Behavioural Insights in the Wild (Rotman-UTP, 2022), which he co-edited.
Tanjim Hossain is a Professor of Marketing in the Department of Management at the University of Toronto Mississauga, where he serves as the Chair of the Department, with a cross-appointment to the Marketing Area at the Rotman School. He also serves as a chief scientist at the BEAR centre.
[This article has been reprinted, with permission, from Rotman Management, the magazine of the University of Toronto’s Rotman School of Management]