Literature Review

Methodological Reflexivity in the Applied Behavioural Sciences

A critical review of the replication, publication-bias, and open-science reforms (1995-2025) - and what they mean for practitioners who train on the canon.

Stefan Novic - Founder, NOVI
June 2026

Submitted as a graduate critical-methodological literature review in applied consumer psychology and behavioural economics. Companion paper to the multi-method review of the practitioner canon.

Abstract

The previous review of the applied behavioural-science practitioner canon identified a structural gap: the canon's confident operational register is calibrated to a higher effect-size expectation than the bias-corrected evidence base supports, and the canon largely omits the methodological-critical literature that would close this gap. This paper fills that gap. Drawing on three decades of methodological development - with particular attention to the six-year acceleration between 2019 and 2025 - the review traces the field's responses to the replication crisis, examines the rise of publication-bias-aware meta-analytic methods, characterises the institutional adoption of open-science infrastructure, and identifies which methodological practices currently produce the most replicable behavioural-science research. Four conclusions are advanced. First, the replication crisis is best understood not as a single event but as a multi-decade reform process whose 2015 watershed merely made visible methodological problems that had been theoretically identified since the 1960s. Second, the post-2020 development of publication-bias correction methods - particularly robust Bayesian meta-analysis (RoBMA) and the convergent findings of DellaVigna and Linos (2022), Maier et al. (2022), and Camerer et al. (2018) - provides the empirical basis for revising estimates of effect-size inflation across applied behavioural domains substantially upward. Third, registered reports, preregistration, and the Open Science Framework now constitute a reproducible infrastructure whose adoption explains the replicability gradient between disciplines and labs. Fourth, the most recent literature (2023-2025), including the Protzko et al. four-laboratory consortium and the OECD's LOGIC behavioural-policy framework, demonstrates that high replicability is achievable when rigour-enhancing practices are adopted as a system rather than piecemeal. Implications for practitioners trained from the trade-press canon are specified.

Keywords: replication crisis, publication bias, preregistration, registered reports, behavioural science, applied psychology, open science, meta-analysis

1. Introduction

The companion review to this paper examined the practitioner canon of applied behavioural science through four analytic lenses - narrative, systematic, scoping, and meta-analytic - and concluded that the canon, while internally coherent and operationally useful, treats the underlying empirical science as more settled than current evidence supports. The meta-analytic synthesis there reported a roughly six-fold gap between effect sizes published in academic journals and those obtained in real-world implementation by government nudge units (DellaVigna and Linos, 2022). This gap, the review argued, can only be closed by the practitioner who pairs the canon with the methodological-critical literature it omits.

The present paper undertakes that pairing. It treats methodological reflexivity not as a niche concern of academic psychology but as the practitioner's principal defence against systematic over-prediction of intervention effects. The review's central premise is that an applied behavioural scientist working in 2025 must operate with a working knowledge of the past three decades of methodological development - and especially the six years between 2019 and 2025, during which the field's response to its replication crisis matured from defensive denial into institutional infrastructure.

Four substantive sections follow. Section 2 traces the historical arc from the methodological warnings of the 1960s through the 2015 watershed and into the post-2020 reform phase. Section 3 examines the technical apparatus of publication-bias correction and the substantive findings it has produced. Section 4 characterises the institutional infrastructure of open science as it now operates. Section 5 synthesises what the most recent evidence suggests about which methodological practices produce the most replicable applied behavioural science. Section 6 then returns to the practitioner question, specifying what the canon-trained reader must add to remain credible in 2025 and beyond, and Section 7 sets out the review's limitations.

2. Three Decades of Methodological Development

2.1 The pre-crisis period (1995-2010): warnings unheeded

The methodological foundations of the replication crisis were laid not in 2015 but in the late 1950s. Sterling's (1959) demonstration that the psychological journals he surveyed published almost exclusively significant results - a publication-selection pattern with predictable consequences for the literature's reliability - established the structural problem fifty years before its consequences became undeniable. Cohen's (1962, 1988) work on statistical power demonstrated that the modal psychology study was substantially underpowered to detect the effects it claimed to find. Meehl (1978) argued that the soft sciences' theoretical apparatus was sufficiently flexible that almost any data pattern could be made to support almost any hypothesis. None of these critiques produced reform.

The 1995-2010 period saw further accumulation of warnings without structural response. Rosenthal's (1979) file-drawer problem had already formalised publication bias as a quantifiable threat to meta-analytic validity. Ioannidis (2005), in his widely cited paper "Why Most Published Research Findings Are False," provided a probabilistic model showing that, under realistic assumptions about effect-size distributions, publication bias, and analytic flexibility, the majority of published findings in many empirical fields would be expected to be false. Just after the period's close, Simmons, Nelson and Simonsohn (2011) demonstrated, in a paper whose subtitle stated that undisclosed flexibility allows presenting anything as significant, that the combined effect of routine flexibility in data collection and analysis ("researcher degrees of freedom") was sufficient to produce false-positive rates far in excess of the nominal five per cent.
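The logic of the Ioannidis model can be made concrete with a few lines of arithmetic. The sketch below computes the positive predictive value of a "significant" finding from the pre-study odds that a tested relationship is true, the study's power, the significance threshold, and a bias term. The function name and the input values are illustrative assumptions chosen for this review, not estimates for any particular field.

```python
def ppv(prior_odds: float, power: float = 0.8, alpha: float = 0.05,
        bias: float = 0.0) -> float:
    """Positive predictive value of a statistically significant claim.

    prior_odds: ratio of true to null relationships among those tested (R).
    power:      probability of detecting a true relationship (1 - beta).
    alpha:      nominal false-positive rate.
    bias:       share of analyses that would not have been significant
                but are reported as significant anyway (u).
    """
    beta = 1.0 - power
    # True relationships reported as significant (honestly or through bias).
    true_pos = power * prior_odds + bias * beta * prior_odds
    # Null relationships reported as significant (by chance or through bias).
    false_pos = alpha + bias * (1.0 - alpha)
    return true_pos / (true_pos + false_pos)


# Hypothetical inputs: one true relationship for every four nulls tested.
print(round(ppv(0.25, power=0.8), 2))  # 0.8
print(round(ppv(0.25, power=0.2), 2))  # 0.5
```

Under these assumed inputs, dropping power from 80 to 20 per cent turns a significant result from a four-in-five bet into a coin flip, and any non-zero bias term lowers the figure further.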

These contributions framed the conditions under which a crisis could become visible. They did not produce one. The behavioural-science publication ecosystem continued to reward novel, surprising, and significant findings; the trade-press canon (Kahneman, 2011; Thaler and Sunstein, 2008; Cialdini, 2007; Ariely, 2008) was, during this period, drawing extensively on a literature whose methodological foundations the technical critique had already shown to be unstable.

2.2 The 2015 watershed

The Open Science Collaboration's (2015) Reproducibility Project marked the moment at which the gap between the warnings and the institutional response became impossible to ignore. The collaboration attempted direct replications of 100 studies published in three leading psychology journals in 2008. The headline finding - that only 36 to 47 per cent of the original findings replicated, depending on the criterion applied (36 per cent on the criterion of a statistically significant effect in the same direction), and that replication effect sizes were on average approximately half the size of the originals (Open Science Collaboration, 2015) - was widely reported and widely contested.

The 2015 finding was not, strictly, the discovery that psychology had a replication problem. It was the demonstration that the problem could be measured, was severe, and varied systematically across subfields. Replication rates differed by journal and methodological tradition: cognitive psychology fared markedly better than social psychology, and within social psychology, "priming" findings - the proposition that subtle environmental cues exert measurable effects on subsequent behaviour - fared worst of all (Doyen et al., 2012; Harris et al., 2013; Rohrer, Pashler and Harris, 2015).

The implications cascaded across the canonical literature. Specific findings that had become staples of the trade-press synthesis began to falter under replication. Money-priming effects on political attitudes failed to replicate (Rohrer et al., 2015). The much-cited ego-depletion finding, central to popular accounts of self-control, failed to replicate in a 23-laboratory consortium (Hagger et al., 2016). Power-posing - the proposition that two minutes in an expansive posture would meaningfully alter hormone levels and behaviour - was substantially disowned by one of its original authors (Carney, 2016). The combined effect was to retire a generation of findings that had been treated, in the practitioner literature, as robust empirical facts.

2.3 The post-2015 expansion (2015-2019)

Camerer et al. (2018) extended the Reproducibility Project's logic to the highest-impact venues. Attempting to replicate 21 experimental social-science studies published between 2010 and 2015 in Nature and Science, the team found that 13 replicated successfully - a 62 per cent rate that was higher than the broader OSC sample but still notably below the implicit reliability the journals' prestige had suggested.

The reproducibility findings were soon joined by a series of research-integrity incidents that further eroded confidence in particular bodies of work. The most prominent case during this period was that of Brian Wansink, whose Cornell Food and Brand Lab produced findings on environmental cues and eating behaviour that became staples of popular behavioural science. A 2017 audit identified statistical inconsistencies across multiple Wansink papers; Cornell University's investigation, concluded in 2018, found academic misconduct including misreporting of research data, problematic statistical techniques, failure to properly document and preserve research results, and inappropriate authorship; thirteen of Wansink's papers were retracted, with at least fifteen further corrections issued (Cornell University, 2018; Lee, 2018). The case was particularly consequential because Wansink's findings had been heavily cited in popular books on choice architecture and applied behavioural design.

The post-2015 period also saw the consolidation of the first systematic responses, some of which had begun shortly before it. The Many Labs replication projects (Klein et al., 2014; Klein et al., 2018) tested whether classic effects replicated across diverse samples and conditions. Some replicated robustly; others did not. The Center for Open Science, established in 2013, expanded the Open Science Framework as institutional infrastructure for preregistration, materials sharing, and registered reports.

2.4 The reform consolidation phase (2019-2025)

The six years since 2019 have shifted the field from documenting the problem to institutionalising its solution. Four developments are particularly important.

The first is the maturation of publication-bias-correction methods, treated in detail in Section 3. The second is the comparison of effect sizes between academic and government-implementation samples (DellaVigna and Linos, 2022), which provided for the first time a precise estimate of the publication-selection inflation factor. The third is the post-2020 expansion of registered reports as a publishing format - from a small number of journals in 2014 to over three hundred adopting the format by 2024 (Chambers and Tzavella, 2022). The fourth is the demonstration, in the Protzko et al. (2023) four-laboratory consortium, that when modern rigour-enhancing practices are adopted as an integrated system, replication effect sizes approaching 100 per cent of the original estimates are achievable.

The six-year window is significant because it is the period during which the field has moved from a critical-diagnostic posture to a constructive-reform one. The trade-press canon's confident register, calibrated to the pre-2015 evidentiary climate, has during this period become increasingly out of alignment with the technical literature.

3. The Technical Apparatus of Publication-Bias Correction

A core difficulty in interpreting any meta-analysis of behavioural-science interventions is that the studies the analyst can find - those that have been published - are not a representative sample of the studies that have been conducted. The selection mechanism is well documented (Franco, Malhotra and Simonovits, 2014): significant results are far more likely to be submitted, accepted, and cited than null results. The effect, at the level of an aggregated literature, is to inflate apparent effect sizes by an amount that depends on the strength of selection and the statistical power of the underlying studies.
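The dependence of inflation on selection strength and statistical power can be illustrated with a small simulation. The sketch below is a deliberately simplified model built for this review: every study estimates the same true standardised effect, sampling error is normal, and only results that are significant in the expected direction are "published". The true effect, sample sizes, and the all-or-nothing selection rule are assumptions, not estimates from any real literature.

```python
import random
import statistics


def simulate_published_mean(true_d, n_per_arm, n_studies=5000, seed=1):
    """Return (mean of all estimates, mean of published estimates, share published)."""
    rng = random.Random(seed)
    se = (2.0 / n_per_arm) ** 0.5  # approximate standard error of a two-arm d
    all_d, published = [], []
    for _ in range(n_studies):
        d_hat = rng.gauss(true_d, se)  # one study's observed effect
        all_d.append(d_hat)
        if d_hat / se > 1.96:  # significant in the expected direction
            published.append(d_hat)
    share = len(published) / n_studies
    return statistics.mean(all_d), statistics.mean(published), share


# Same hypothetical true effect (d = 0.1), two different sample sizes.
for n in (50, 2000):
    mean_all, mean_pub, share = simulate_published_mean(0.1, n)
    print(n, round(mean_all, 3), round(mean_pub, 3), round(share, 3))
```

With small samples, few studies clear the significance bar, and those that do must have overestimated the effect several-fold to get there. With large samples, most studies are published and the published mean stays close to the truth. The inflation is therefore a joint product of selection and low power, as the paragraph above states.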

The recent literature offers several methodological responses. This section examines three.

3.1 The Mertens versus Maier exchange

Mertens et al. (2022), publishing in Proceedings of the National Academy of Sciences, conducted a meta-analysis of 212 nudge studies and reported a pooled effect size of Cohen's d = 0.43 - a figure widely cited as evidence that the average nudge intervention produces a moderate behaviour change. The methodology was conventional: random-effects meta-analysis across heterogeneous interventions, domains, and outcome measures.

Maier et al. (2022), publishing in the same journal six months later, applied four publication-bias correction methods to the Mertens et al. dataset: robust Bayesian meta-analysis with model-averaging (RoBMA-PSMA), the Andrews-Kasy estimator, the weighted average of adequately powered studies (WAAP), and the trim-and-fill procedure (Stanley and Doucouliagos, 2014; Bartoš et al., 2022; Maier, Bartoš and Wagenmakers, 2022). Each method addresses publication bias by treating the visible portion of the literature as a censored sample of a wider, partially unobserved distribution.
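Of the four methods, WAAP is the simplest to show in code. The sketch below follows the usual description of the procedure: compute an inverse-variance weighted mean of all studies, keep only the studies precise enough to have roughly 80 per cent power against that mean (standard error no larger than the mean divided by 2.8), and re-average the survivors. It is a minimal illustration with hypothetical effect sizes and standard errors. It is not the RoBMA model, and it does not reproduce the computations reported by Maier et al.

```python
def inverse_variance_mean(effects, ses):
    """Weighted mean with weights 1 / SE^2."""
    weights = [1.0 / se ** 2 for se in ses]
    return sum(w * e for w, e in zip(weights, effects)) / sum(weights)


def waap(effects, ses, power_z=2.8):
    """Weighted average of the adequately powered studies.

    power_z = 1.96 + 0.84: a study has about 80% power at alpha = .05
    when its standard error is at most |effect| / 2.8.
    Returns None if no study is adequately powered.
    """
    pooled = inverse_variance_mean(effects, ses)
    keep = [(e, se) for e, se in zip(effects, ses)
            if se <= abs(pooled) / power_z]
    if not keep:
        return None
    return inverse_variance_mean([e for e, _ in keep], [se for _, se in keep])


# Hypothetical literature: four small, imprecise studies with large effects
# and two large, precise studies with small effects.
effects = [0.60, 0.55, 0.45, 0.50, 0.08, 0.05]
ses = [0.25, 0.22, 0.20, 0.24, 0.02, 0.02]

print(round(sum(effects) / len(effects), 3))  # unweighted mean
print(round(inverse_variance_mean(effects, ses), 3))
print(round(waap(effects, ses), 3))  # 0.065
```

In this made-up example only the two precise studies pass the power filter, so the corrected estimate is driven entirely by them. Real applications differ in detail, but the mechanism is the same: small, noisy studies with large effects lose their influence on the average.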

The results were dramatic. After correction, the effect-size estimate fell to a small fraction of the headline figure, and the authors concluded that no clear evidence remained that the average nudge intervention is effective (Maier et al., 2022). The authors emphasised that this is a statement about the aggregate literature rather than about specific intervention classes - some interventions, notably defaults, retain robust effects under correction - but the implication for the practitioner is sobering: the literature on which much practitioner training rests appears to be substantially shaped by publication selection.

The Maier et al. paper provoked formal responses (Bakdash and Marusich, 2022; Szaszi et al., 2022), drawing attention to extreme heterogeneity in the underlying dataset as a complicating factor. The Data Colada research-integrity blog mounted a sustained critique on similar grounds, arguing that the meta-analytic average across such incommensurable interventions was uninformative. Maier and colleagues responded in turn, accepting the heterogeneity point but maintaining that the publication-bias finding was robust to it (Bayesian Spectacles, 2022).

The substantive question - whether nudging "works" on average - is therefore less settled than the popular-press synthesis suggests. What is settled is that the academic-journal effect-size distribution overstates the in-field distribution by a substantial margin.

3.2 The DellaVigna-Linos benchmark

A complementary line of evidence comes from comparing effect sizes between studies published in academic journals and those conducted at scale by government nudge units. DellaVigna and Linos (2022) assembled a dataset of 126 randomised controlled trials covering approximately 23 million individuals, comprising all trials run during the period of analysis by two of the largest nudge units in the United States - the federal Office of Evaluation Sciences and the Behavioural Insights Team's North American operations.

The contrast with the academic-journal literature was stark. In the academic-journal sample, the average impact of a nudge was approximately 8.7 percentage points on the relevant take-up outcome, representing a 33 per cent relative increase over the control condition. In the nudge-units sample, the average impact was approximately 1.4 percentage points, an 8 per cent relative increase (DellaVigna and Linos, 2022).

The authors decompose the difference into three sources: statistical power, intervention characteristics, and publication selection. A meta-analysis model incorporating these dimensions attributed approximately 70 per cent of the gap to selective publication exacerbated by low statistical power in the underlying academic studies. The remaining 30 per cent was attributable to intervention-mix differences - academic studies tended to test more aggressive interventions on more responsive populations than the nudge units' broader implementation portfolios.

The DellaVigna-Linos finding provides the most directly actionable estimate for the practitioner. The figure that should anchor real-world implementation expectations is the 1.4-percentage-point number, not the 8.7. A practitioner forecasting the impact of a planned intervention against the academic-published distribution will, on average, over-predict by approximately a factor of six.
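The arithmetic behind the factor of six, and one crude way of applying it, can be sketched as follows. The two benchmark figures are those reported above. The `rescale_forecast` helper is a hypothetical convenience introduced here, and it assumes, simplistically, that the whole gap applies uniformly to any journal-anchored forecast, which the decomposition in the preceding paragraphs suggests is only partly true.

```python
ACADEMIC_PP = 8.7    # average nudge effect in the academic-journal sample (pp)
NUDGE_UNIT_PP = 1.4  # average nudge effect in the nudge-unit sample (pp)


def overprediction_factor(academic_pp=ACADEMIC_PP, field_pp=NUDGE_UNIT_PP):
    """Ratio of the journal benchmark to the in-field benchmark."""
    return academic_pp / field_pp


def rescale_forecast(journal_based_pp):
    """Shrink a journal-anchored forecast (in pp) towards the in-field benchmark.

    Assumption: the full gap applies uniformly, regardless of intervention type.
    """
    return journal_based_pp / overprediction_factor()


print(round(overprediction_factor(), 1))  # 6.2
# A hypothetical client forecast of a 5-point lift, anchored on journal effects.
print(round(rescale_forecast(5.0), 2))
```

A practitioner with intervention-specific field evidence should prefer that evidence to a blanket rescaling; the helper is a starting point for the conversation, not a forecast.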

3.3 The convergent picture

Taken together, the Maier et al. correction and the DellaVigna-Linos comparison produce a convergent estimate of the inflation factor. The technical-statistical correction reduces the apparent average effect by approximately an order of magnitude; the empirical comparison of academic versus in-field implementation reduces the average effect by approximately a factor of six. The two figures cannot be directly combined - they measure related but distinct quantities - but their convergence supports the qualitative conclusion that pre-2020 effect-size literature systematically over-represents the magnitude of behavioural intervention effects.

This is the empirical finding that the practitioner canon's confident register cannot accommodate without revision.

4. The Open-Science Infrastructure

The methodological-critical literature has, since approximately 2015, increasingly converged on a set of practices that, when adopted systematically, produce more reliable empirical findings. This section characterises the four most consequential.

4.1 Preregistration

Preregistration involves the public time-stamped specification of hypotheses, design, sample size, and analysis plan before data collection commences. The procedure restricts researcher degrees of freedom and converts what would otherwise be exploratory analysis into confirmatory analysis (Nosek et al., 2018; Simmons, Nelson and Simonsohn, 2011).
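The value of fixing the analysis plan in advance can be seen in a short simulation of one researcher degree of freedom: measuring several outcomes and reporting whichever reaches significance. The sketch below assumes a true effect of zero and statistically independent outcomes, which is a simplification (outcomes in real studies are usually correlated, which softens but does not remove the inflation). The function name and settings are illustrative.

```python
import random


def false_positive_rate(n_outcomes, n_sims=20000, seed=7, z_crit=1.96):
    """Share of null studies that report at least one 'significant' outcome."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # Under the null, each outcome's test statistic is standard normal.
        zs = [rng.gauss(0.0, 1.0) for _ in range(n_outcomes)]
        if any(abs(z) > z_crit for z in zs):
            hits += 1
    return hits / n_sims


for k in (1, 2, 3):
    # Simulated rate next to the analytic value 1 - 0.95^k.
    print(k, round(false_positive_rate(k), 3), round(1 - 0.95 ** k, 3))
```

A preregistered study that names a single primary outcome corresponds to the first row; each additional undeclared outcome moves the analyst down the table.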

Adoption has expanded substantially. Preregistration counts on the Open Science Framework approximately doubled annually between 2012 and 2017 (Spitzer, Mueller and Naudet, 2023), and by 2022, surveys suggested that approximately 42 per cent of psychology researchers reported having preregistered at least one study in their most recent work (Nosek, 2022). The discrepancy between self-reported preregistration and the proportion of published papers that actually contain preregistration statements - Hardwicke et al. (cited in Spitzer et al., 2023) reported only 3 per cent in a random 2014-2017 sample - suggests the practice is more widely endorsed than uniformly executed.

Adoption is uneven across subfields. Economics has adopted preregistration via the American Economic Association's RCT Registry; psychology via the Open Science Framework and the Leibniz Institute for Psychology's PreReg in Psychology registry; medicine has longer-standing structures via the WHO's International Clinical Trials Registry Platform. The applied behavioural-science literature has lagged the methodological-academic literature in adoption - a gap with predictable consequences for the reliability of practitioner-relevant findings.

4.2 Registered Reports

Registered Reports go further than preregistration. The publishing format requires peer review of the study protocol before data collection. If the protocol is judged sound, the journal commits to publishing the resulting paper regardless of whether the results confirm or contradict the hypothesis. This structural feature directly addresses publication bias: the publication decision is made before the results are known and therefore cannot be conditioned on them (Chambers, 2013; Chambers and Tzavella, 2022).

The format was proposed in 2012 and was offered by a small number of journals in its first years. By 2024, over 300 journals offered registered reports as a publishing format (Center for Open Science, 2024). Empirical evaluation has been favourable: registered-report literatures show meaningfully higher rates of null findings and meaningfully smaller average effect sizes than equivalent conventional-publication literatures (Scheel, Schijen and Lakens, 2021), which is consistent with the format successfully reducing publication-selection bias.

For the applied behavioural scientist, the practical implication is that registered-report literatures should be weighted more heavily than equivalent conventional literatures when forming expectations about real-world intervention effects. Where a finding has been replicated through a registered report, it is substantially more reliable than the same finding established through conventional publication.

4.3 The Open Science Framework

The Open Science Framework (OSF), developed by the Center for Open Science since 2013, provides the institutional infrastructure on which much of the reform agenda operates: preregistration templates and storage, materials and data sharing, version-controlled workflows, and badging for journals that adopt open-science practices. The framework's growth has been substantial: by 2024, OSF hosted preregistrations and materials for several hundred thousand projects across psychology, economics, and adjacent fields (Center for Open Science, 2024).

The substantive significance of the OSF is not its technical capacity but its convergence with the technical-statistical reforms. A finding produced under preregistration, stored with materials on OSF, published as a registered report, and subjected to subsequent independent replication, sits in a fundamentally different evidentiary position from a finding produced under exploratory analysis, published conventionally, and not externally replicated. The practitioner who treats both findings as equivalent is making a systematic error of methodological standards.

4.4 TOP guidelines and transparency standards

The Transparency and Openness Promotion (TOP) guidelines, developed by the Center for Open Science (Nosek et al., 2015), provide a framework for journals to adopt graded open-science practices across eight modular standards: citation, data transparency, materials transparency, code transparency, design and analysis transparency, preregistration of studies, preregistration of analysis plans, and replication. By 2024, over a thousand journals and societies had endorsed the TOP guidelines at one of three levels of engagement (disclose, require, or verify).

The TOP framework is methodologically significant because it operationalises reform at the publishing-venue level rather than the individual-researcher level. A journal that requires data and materials transparency makes deception structurally more difficult; a journal that requires preregistration of analysis plans restricts post-hoc flexibility; a journal that verifies what it claims to require creates an audit trail. Reform at the system level scales in a way that individual-researcher reform does not.

5. What Currently Works: The 2023-2025 Evidence

The most recent literature offers an answer to the question that should most concern a practitioner trained from the trade-press canon: under what conditions is applied behavioural-science research currently reliable?

5.1 The Protzko et al. four-laboratory consortium

The most consequential single piece of evidence on this question comes from Protzko et al. (2023), a six-year prospective replication project conducted by four social-behavioural research laboratories at the University of California Santa Barbara, Stanford, UC Berkeley, and the University of Virginia. The participating laboratories committed in advance to a defined set of rigour-enhancing practices: confirmatory pre-tests, formal replication protocols, large sample sizes, transparency in reporting, and direct cross-laboratory replication of each other's novel findings.

The findings, when reported in 2023, were striking. Across sixteen novel discoveries, the consortium achieved an average replication effect size of approximately 97 per cent of the original confirmatory test (Protzko et al., 2023). This contrasts with the approximately 50 per cent effect-size ratio observed in the post-hoc replication projects of the previous decade (Open Science Collaboration, 2015; Camerer et al., 2018).

The interpretive lesson is direct: low replicability is not inherent to social-behavioural science. It is a product of the methodological practices conventionally used. When labs adopt rigour-enhancing practices as a system - confirmatory testing, adequate power, preregistration, transparency, cross-laboratory replication - they produce findings that replicate at rates comparable to the natural sciences. The reform programme works when it is treated as a programme rather than as a piecemeal set of optional add-ons.

5.2 The COVID-19 evidence synthesis

A second consequential development is the Ruggeri et al. (2024) synthesis of behavioural-science evidence generated during the COVID-19 pandemic. The team examined 747 pandemic-related research articles testing 19 behavioural-policy recommendations originally proposed in April 2020 (Van Bavel et al., 2020). Two independent review teams, comprising 72 reviewers in total, evaluated the evidence for each of the original claims.

Their assessment was more positive than the public discourse around the replication crisis would have predicted. Evidence was found for 18 of the 19 claims, with both teams finding empirical support for 16 of those 18 (89 per cent). Crucially, the supported claims tended to be characterised by small-to-medium effect sizes, were typically replicated in real-world contexts across multiple populations, and were grounded in pre-existing well-established literatures rather than novel-but-fragile findings (Ruggeri et al., 2024).

This is consistent with the broader picture emerging from the post-2020 literature: well-established mechanisms tested at realistic scale with modest expected effect sizes are the conditions under which applied behavioural science is most reliable. Novel findings, large claimed effects, and small-sample laboratory demonstrations are the conditions under which it is least reliable.

5.3 The implementation-science turn

A third development is the shift in the policy literature from "what works" questions toward "what works under implementation" questions. The OECD's (2024) LOGIC framework - explicitly built on Michie, van Stralen and West's (2011) COM-B model of capability, opportunity, and motivation - emphasises that the implementation context determines whether an intervention shown to work in trial conditions will work when deployed at scale.

The implementation-science turn reframes the practitioner question. The relevant question is no longer "does this intervention work?" but "under what conditions, with what supporting infrastructure, with what fidelity of execution, does this intervention work?" This framing accommodates the empirical reality that the same intervention produces different effects in different contexts - which the publication-bias-corrected literature suggests is the norm rather than the exception.

The Chater and Loewenstein (2023) "i-frame versus s-frame" critique extends this line of argument. Their thesis is that behavioural public policy has focused excessively on individual-level interventions ("i-frame") when many of the targeted problems are structurally determined and require system-level interventions ("s-frame"). The implication is that even where individual-level interventions replicate reliably, their welfare significance may be smaller than the structural changes they are sometimes invoked to substitute for.

5.4 The convergent recommendation

The 2023-2025 literature converges on a clear set of practical recommendations for the practitioner. Treat new findings as provisional until independently replicated; weight registered-report findings more heavily than conventional-publication findings; prefer interventions with multiple in-field replications over novel laboratory demonstrations; expect in-field effect sizes roughly six times smaller than the headline academic-journal figures, and possibly smaller still where statistical bias correction applies; design implementations around the COM-B capability-opportunity-motivation diagnostic rather than around isolated behavioural mechanisms; consider whether the targeted problem is amenable to individual-level intervention at all.

These are not radical recommendations. They are simply the current methodological standard.

6. Implications for the Practitioner

The practitioner trained from the trade-press canon - Kahneman, Cialdini, Thaler and Sunstein, Ariely, Sutherland, Halpern - is operating with a knowledge base whose central claims are mostly correct but whose calibration is materially off. The gap between the canon's confident register and the post-2020 evidence base has six concrete consequences for practice.

First, effect-size forecasting. A practitioner who anchors expected effect sizes to canonical figures will systematically over-promise. The applied benchmark should be the DellaVigna-Linos in-field figure - approximately 1.4 percentage points on take-up outcomes - not the academic-journal headline figure of 8.7. Clients should be briefed on this distinction explicitly. The competitive advantage lies with the practitioner who under-promises against credible benchmarks and over-delivers against them, not with the one who promises against inflated benchmarks and disappoints.

Second, finding-specific reliability. Not all findings in the canon are equally robust. Defaults, social norms in tax compliance, and Save More Tomorrow-style commitment devices have been replicated under registered-report conditions and operate at scale. Priming-based effects, ego depletion, and the broader category of "social priming" interventions have not replicated reliably. A practitioner who treats both as equivalently solid is making an inferential error. The companion review identified specific texts in the practitioner canon - most notably Cialdini's Pre-Suasion (2016), which relies substantially on priming-based research - that should be read with this distinction in mind.

Third, literature evaluation as a working skill. The practitioner must be able to assess the methodological standing of a finding before drawing on it. The relevant questions are: Is this a registered report? Is it preregistered? Is it adequately powered? Has it been independently replicated, and under what conditions? What is the publication-selection environment of the venue in which it appeared? These questions are not academic-research questions. They are practitioner questions because the answer determines whether the finding will hold under client conditions.
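The "adequately powered" question has a direct planning counterpart: how many participants a trial needs in order to detect an effect of realistic size. The sketch below uses the standard normal-approximation formula for comparing two proportions. The 20 per cent control take-up rate is a hypothetical value chosen for illustration; the two lifts are the in-field and academic-journal benchmarks discussed in Section 3.2.

```python
import math
from statistics import NormalDist


def n_per_arm(p_control, lift_pp, alpha=0.05, power=0.8):
    """Approximate sample size per arm for a two-sided two-proportion test."""
    p_treat = p_control + lift_pp / 100.0
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, two-sided
    z_power = NormalDist().inv_cdf(power)          # quantile for target power
    variance = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    return math.ceil((z_alpha + z_power) ** 2 * variance
                     / (p_treat - p_control) ** 2)


# Hypothetical control take-up of 20%; lifts in percentage points.
print(n_per_arm(0.20, 8.7))  # planning against the journal benchmark
print(n_per_arm(0.20, 1.4))  # planning against the in-field benchmark
```

Under these assumptions a trial sized for the journal benchmark needs a few hundred participants per arm, whereas one sized for the in-field benchmark needs more than ten thousand. A study designed around the larger figure would be badly underpowered for the effect a practitioner should actually expect.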

Fourth, client communication. The post-2020 evidentiary climate creates a communication opportunity rather than a liability. Clients, particularly sophisticated ones, increasingly know that behavioural-science claims must be treated sceptically. A practitioner who acknowledges this openly - who is candid about what is robust, what is uncertain, and what is structurally improbable - is more credible than a practitioner who maintains the confident register of the canon. Methodological honesty is a competitive position.

Fifth, the implementation-science orientation. The COM-B framework and the i-frame-versus-s-frame distinction suggest that the practitioner's value-add often lies not in the intervention but in diagnosing whether the targeted problem is structurally amenable to individual-level behavioural intervention at all. Some problems - the ones where the conditioning environment, infrastructure, or regulatory context is the binding constraint - cannot be solved by behavioural intervention, however robust the underlying mechanism is. A practitioner who can distinguish these cases is more useful to a client than one who proposes a behavioural intervention by default.

Sixth, the ongoing-learning posture. The methodological-critical literature is not a one-time addition to the practitioner's knowledge base. It is an evolving body of work whose findings shift the calibration of what the practitioner should expect. A practitioner who reads the canon and stops is reading the literature of 2015. A practitioner who reads the methodological-critical literature alongside it - Maier, DellaVigna, Linos, Protzko, Chater and Loewenstein, Ruggeri - is reading the literature of 2025. The difference is a working epistemic posture rather than a curriculum, and it is the posture that the practitioner has to maintain for the duration of a career.

7. Limitations and Conclusion

This review has four limitations that warrant explicit acknowledgement. First, the literature surveyed is predominantly Western and anglophone, inheriting the WEIRD-sample concerns the methodological-critical literature itself has identified (Henrich, Heine and Norenzayan, 2010). Second, the publication-bias-correction methods discussed in Section 3 are themselves contested in their assumptions (Bakdash and Marusich, 2022; Szaszi et al., 2022); the precise inflation factor remains uncertain even if the qualitative direction is robust. Third, the implementation-science turn discussed in Section 5 is itself a developing literature whose long-run findings cannot yet be evaluated. Fourth, the practitioner implications drawn in Section 6 are necessarily provisional and will require revision as the underlying literature continues to develop.

These limitations do not undermine the central conclusion. The applied behavioural sciences have, over the past three decades, undergone a sustained methodological reform whose post-2019 acceleration has produced a substantively different evidentiary landscape from the one on which the trade-press practitioner canon was constructed. The practitioner who works without engaging this reform is systematically miscalibrated. The practitioner who engages it is, paradoxically, more useful to clients than the practitioner who maintains the confident register the canon encourages - because the methodological discipline produces forecasts that hold, recommendations that survive scrutiny, and interventions that achieve the modest-but-real effects the literature actually supports.

The gap identified in the companion review is therefore closeable, but only by deliberate pairing of the operational canon with the methodological-critical literature it omits. This review has specified the literature that closes the gap. The work of closing it is, finally, the practitioner's own.

References

Bakdash, J. Z., and Marusich, L. R. (2022). Left-truncated effects and overestimated meta-analytic means. Proceedings of the National Academy of Sciences, 119(31), e2203616119.

Bartoš, F., Maier, M., Wagenmakers, E.-J., Doucouliagos, H., and Stanley, T. D. (2022). No need to choose: Model-averaging across complementary publication bias adjustment methods. Research Synthesis Methods, advance online publication.

Bayesian Spectacles. (2022). Rejoinder - No evidence for nudging after adjusting for publication bias. Available at: https://www.bayesianspectacles.org/rejoinder-no-evidence-for-nudging/

Camerer, C. F., Dreber, A., Holzmeister, F., Ho, T. H., Huber, J., Johannesson, M., et al. (2018). Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour, 2(9), 637-644.

Carney, D. R. (2016). My position on power poses. Available at: https://faculty.haas.berkeley.edu/dana_carney/

Center for Open Science. (2024). Registered Reports: Participating journals list. Available at: https://www.cos.io/initiatives/registered-reports

Chambers, C. D. (2013). Registered Reports: A new publishing initiative at Cortex. Cortex, 49(3), 609-610.

Chambers, C. D., and Tzavella, L. (2022). The past, present and future of Registered Reports. Nature Human Behaviour, 6(1), 29-42.

Chater, N., and Loewenstein, G. (2023). The i-frame and the s-frame: How focusing on individual-level solutions has led behavioral public policy astray. Behavioral and Brain Sciences, 46, e147.

Cialdini, R. B. (2007). Influence: The psychology of persuasion (Revised edition). Harper Business.

Cialdini, R. B. (2016). Pre-suasion: A revolutionary way to influence and persuade. Simon and Schuster.

Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65(3), 145-153.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Lawrence Erlbaum.

Cornell University. (2018). Statement of Cornell University Provost Michael I. Kotlikoff. Statement on the Wansink investigation.

DellaVigna, S., and Linos, E. (2022). RCTs to scale: Comprehensive evidence from two nudge units. Econometrica, 90(1), 81-116.

Doyen, S., Klein, O., Pichon, C.-L., and Cleeremans, A. (2012). Behavioral priming: It's all in the mind, but whose mind? PLOS ONE, 7(1), e29081.

Franco, A., Malhotra, N., and Simonovits, G. (2014). Publication bias in the social sciences: Unlocking the file drawer. Science, 345(6203), 1502-1505.

Hagger, M. S., Chatzisarantis, N. L. D., Alberts, H., Anggono, C. O., Batailler, C., Birt, A. R., et al. (2016). A multilab preregistered replication of the ego-depletion effect. Perspectives on Psychological Science, 11(4), 546-573.

Harris, C. R., Coburn, N., Rohrer, D., and Pashler, H. (2013). Two failures to replicate high-performance-goal priming effects. PLOS ONE, 8(8), e72467.

Henrich, J., Heine, S. J., and Norenzayan, A. (2010). The weirdest people in the world? Behavioral and Brain Sciences, 33(2-3), 61-83.

Ioannidis, J. P. A. (2005). Why most published research findings are false. PLOS Medicine, 2(8), e124.

Klein, R. A., Ratliff, K. A., Vianello, M., Adams, R. B. Jr., Bahník, Š., Bernstein, M. J., et al. (2014). Investigating variation in replicability: A "Many Labs" replication project. Social Psychology, 45(3), 142-152.

Klein, R. A., Vianello, M., Hasselman, F., Adams, B. G., Adams, R. B. Jr., Alper, S., et al. (2018). Many Labs 2: Investigating variation in replicability across samples and settings. Advances in Methods and Practices in Psychological Science, 1(4), 443-490.

Lee, S. M. (2018). Cornell finds that food researcher Brian Wansink committed academic misconduct. BuzzFeed News, 20 September.

Maier, M., Bartoš, F., Stanley, T. D., Shanks, D. R., Harris, A. J. L., and Wagenmakers, E.-J. (2022). No evidence for nudging after adjusting for publication bias. Proceedings of the National Academy of Sciences, 119(31), e2200300119.

Maier, M., Bartoš, F., and Wagenmakers, E.-J. (2022). Robust Bayesian meta-analysis: Addressing publication bias with model-averaging. Psychological Methods, advance online publication.

Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46(4), 806-834.

Mertens, S., Herberz, M., Hahnel, U. J. J., and Brosch, T. (2022). The effectiveness of nudging: A meta-analysis of choice architecture interventions across behavioral domains. Proceedings of the National Academy of Sciences, 119(1), e2107346118.

Michie, S., van Stralen, M. M., and West, R. (2011). The Behaviour Change Wheel: A new method for characterising and designing behaviour change interventions. Implementation Science, 6, 42.

Nosek, B. A. (2022). Open science: My insights into data sharing, preregistration, and replication. Canadian Journal of Experimental Psychology, in press.

Nosek, B. A., Alter, G., Banks, G. C., Borsboom, D., Bowman, S. D., Breckler, S. J., et al. (2015). Promoting an open research culture. Science, 348(6242), 1422-1425.

Nosek, B. A., Ebersole, C. R., DeHaven, A. C., and Mellor, D. T. (2018). The preregistration revolution. Proceedings of the National Academy of Sciences, 115(11), 2600-2606.

OECD. (2024). LOGIC: Good practice principles for mainstreaming behavioural public policy. OECD Publishing.

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.

Protzko, J., Krosnick, J., Nelson, L., Nosek, B. A., Axt, J., Berent, M., et al. (2023). High replicability of newly discovered social-behavioural findings is achievable. Nature Human Behaviour, advance online publication.

Rohrer, D., Pashler, H., and Harris, C. R. (2015). Do subtle reminders of money change people's political views? Journal of Experimental Psychology: General, 144(4), e73-e85.

Rosenthal, R. (1979). The "file drawer problem" and tolerance for null results. Psychological Bulletin, 86(3), 638-641.

Ruggeri, K., Stock, F., Haslam, S. A., Capraro, V., Boggio, P., Ellemers, N., et al. (2024). A synthesis of evidence for policy from behavioural science during COVID-19. Nature, 625(7993), 134-147.

Salvi, F., Horta Ribeiro, M., Gallotti, R., and West, R. (2024). On the conversational persuasiveness of large language models: A randomized controlled trial. arXiv preprint, arXiv:2403.14380.

Scheel, A. M., Schijen, M. R. M. J., and Lakens, D. (2021). An excess of positive results: Comparing the standard psychology literature with Registered Reports. Advances in Methods and Practices in Psychological Science, 4(2), 1-12.

Simmons, J. P., Nelson, L. D., and Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359-1366.

Spitzer, L., Mueller, S., and Naudet, F. (2023). Registered report: Survey on attitudes and experiences regarding preregistration in psychological research. PLOS ONE, 18(3), e0281086.

Stanley, T. D., and Doucouliagos, H. (2014). Meta-regression approximations to reduce publication selection bias. Research Synthesis Methods, 5(1), 60-78.

Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance - or vice versa. Journal of the American Statistical Association, 54(285), 30-34.

Szaszi, B., Higney, A., Charlton, A., Gelman, A., Ziano, I., Aczel, B., et al. (2022). No reason to expect large and consistent effects of nudge interventions. Proceedings of the National Academy of Sciences, 119(31), e2200732119.

Thaler, R. H., and Sunstein, C. R. (2008). Nudge: Improving decisions about health, wealth, and happiness. Yale University Press.

Van Bavel, J. J., Baicker, K., Boggio, P. S., Capraro, V., Cichocka, A., Cikara, M., et al. (2020). Using social and behavioural science to support COVID-19 pandemic response. Nature Human Behaviour, 4(5), 460-471.

Word count: approximately 5,400 words (main text, excluding abstract and references).