do popular methods deliver on their promises? – Bank Underground

Ivona Cickovic and Andrea Serafino

Machine learning models are increasingly used in organisational decision-making, yet their inner workings often remain opaque. When these systems influence real world outcomes, knowing what they predict is not enough – we also need to understand why. Explainability methods aim to illuminate this ‘black box,’ and feature attribution tools that link predictions to individual inputs are especially popular. They feel intuitive but rely on strict data assumptions that rarely hold, making their outputs unreliable. The 2019 Apple Card case illustrates why this matters: despite gender not being an explicit input, women appeared to receive lower credit limits than men with similar profiles – an outcome attribution methods struggle to explain. This post examines a key assumption underpinning these tools and how it distorts explanations.

The limitations of popular explainability methods 

Machine learning (ML) models are often sufficiently complex that it is difficult to understand how changes in the data going in lead to changes in the predictions coming out. This has driven the development of various explainability methods that claim to see through this opacity and summarise the relationship between a model’s inputs and outputs.

Common examples include SHapley Additive exPlanations (SHAP), a method that assigns each feature its average marginal contribution across all possible subsets of features; Local Interpretable Model-agnostic Explanations (LIME), which explains individual predictions by fitting a simple, interpretable model locally around the observation of interest; Partial Dependence Plots (PDPs), visual tools that show how a model’s average prediction changes as one feature varies while the effects of others are averaged out; and Permutation Feature Importance (PFI), a performance‑based approach that assesses feature relevance by randomly shuffling a feature’s values and measuring the resulting loss in accuracy. However, a growing body of research has highlighted limitations in these widely used methods (eg, Salih et al (2024); Bordt et al (2022); Velmurugan et al (2023); and Ragodos et al (2024)).
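To make the last of these concrete, here is a minimal sketch of permutation feature importance written from scratch with NumPy. It is an illustration under stated assumptions, not the implementation used in this post: the function name `permutation_importance`, the toy three-feature data set and the least-squares model are all hypothetical, and mean squared error stands in for 'accuracy'.

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Average increase in mean squared error when one column is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = np.mean((y - predict(X)) ** 2)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling breaks the link between feature j and the target.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            increases.append(np.mean((y - predict(X_perm)) ** 2) - baseline)
        importances[j] = np.mean(increases)
    return importances

# Hypothetical toy data: three independent features with known coefficients.
rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 3))
true_coefs = np.array([3.0, 1.0, 0.0])
y = X @ true_coefs + rng.normal(scale=0.1, size=2000)

# Ordinary least squares fit used as the model to be explained.
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)

def predict(data):
    return data @ coefs

pfi = permutation_importance(predict, X, y)
print(pfi)  # larger values mean shuffling that feature hurts the fit more
```

Note that shuffling one column at a time treats that feature as if it could vary independently of the others, which is exactly the assumption discussed in the rest of this post.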

A major concern is that these approaches implicitly assume that model inputs – typically referred to as features in ML – are independent, an assumption that rarely holds in real‑world data sets. Although textbooks and practitioner guides (eg, Molnar (2025)) warn about the violation of these assumptions, the caveats are often overlooked in practical applications. While some features in financial models may be largely independent (for example, the number of standing orders versus a mobile phone bill), many others are naturally correlated, such as loan amount and monthly repayment. When such dependencies are present, attribution methods produce distorted or misleading explanations, obscuring the true drivers of a model’s behaviour. As highlighted in earlier Bank Underground work on AI fairness, opaque or biased model behaviour can amplify yet conceal discriminatory decision patterns.

A controlled experiment: independent versus correlated data 

To illustrate how much this matters, we run a simple experiment using two large synthetic data sets (50,000 rows × 50 features): one with independent features (or predictors) and one in which the predictors are correlated. In both data sets, the target is a linear combination of features plus noise. For the correlated‑features data set, Chart 1 shows the pairwise correlation heatmap (with red and blue marking positive and negative relationships, respectively; darker colours indicate stronger correlations, while paler colours show weaker ones), and Chart 2 shows the distribution of absolute pairwise correlations. Together, these charts show a pattern typical of many credit‑risk or economic data sets: most feature relationships are weak – with a median absolute correlation of about 0.20 – while a smaller number exhibit stronger associations, closely mirroring what we observe in real‑world modelling (for example, Stock and Watson (2017) or Laloux et al (1999)).
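The two kinds of data set can be sketched in a few lines of NumPy. This is an illustration only: the post does not describe how its correlated features were generated, so the latent-factor construction, the helper names `make_features` and `abs_pairwise_correlations`, and the smaller dimensions (5,000 rows × 20 features) are assumptions made here.

```python
import numpy as np

def make_features(n_rows, n_features, n_factors=0, seed=0):
    """Return standardised features; n_factors=0 gives independent columns."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_rows, n_features))
    if n_factors > 0:
        # Assumed construction: shared latent factors induce correlation.
        loadings = rng.normal(scale=0.5, size=(n_features, n_factors))
        factors = rng.normal(size=(n_rows, n_factors))
        X = X + factors @ loadings.T
    return (X - X.mean(axis=0)) / X.std(axis=0)

def abs_pairwise_correlations(X):
    """Absolute correlations for every distinct pair of columns."""
    corr = np.corrcoef(X, rowvar=False)
    upper = np.triu_indices_from(corr, k=1)
    return np.abs(corr[upper])

X_ind = make_features(5000, 20)
X_cor = make_features(5000, 20, n_factors=3)

median_ind = np.median(abs_pairwise_correlations(X_ind))
median_cor = np.median(abs_pairwise_correlations(X_cor))
print(f"median |corr|, independent: {median_ind:.3f}")
print(f"median |corr|, correlated:  {median_cor:.3f}")
```

Changing the scale of the loadings or the number of factors shifts the distribution of pairwise correlations, so a summary like the median would need to be checked rather than assumed.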

On each data set, we fitted four common models – linear regression, random forest, gradient boosting, and a neural network – and applied the four explainability methods mentioned above. We then compared the feature rankings assigned by these methods with the true rankings implied by the data‑generating process (ie, the coefficients we used to generate the synthetic data). We measured the rank agreement between the two rankings – that is, the extent to which they place features in the same order – using Spearman’s Rho (ρ) as a rank-agreement coefficient. This was repeated 500 times to see how stable the results are. 
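The rank-agreement step can be sketched as follows. The code is a simplified stand-in rather than the experiment itself: importance scores are taken to be absolute least-squares coefficients instead of the output of SHAP, LIME, PDP or PFI, features are independent, ties are not handled, and the names `ranks`, `spearman_rho` and `one_run` are hypothetical.

```python
import numpy as np

def ranks(values):
    """Rank values from 1 (smallest) to n (largest); ties are not handled."""
    values = np.asarray(values)
    order = np.argsort(values)
    result = np.empty(len(values))
    result[order] = np.arange(1, len(values) + 1)
    return result

def spearman_rho(a, b):
    """Spearman's rho as the Pearson correlation of the two rank vectors."""
    return np.corrcoef(ranks(a), ranks(b))[0, 1]

def one_run(true_coefs, n_rows, rng):
    """Simulate one data set, fit a linear model, and score rank agreement."""
    X = rng.normal(size=(n_rows, len(true_coefs)))
    y = X @ true_coefs + rng.normal(size=n_rows)
    fitted, *_ = np.linalg.lstsq(X, y, rcond=None)
    # Stand-in importance score: absolute fitted coefficient.
    return spearman_rho(np.abs(true_coefs), np.abs(fitted))

rng = np.random.default_rng(0)
true_coefs = np.linspace(0.5, 5.0, 10)  # assumed data-generating coefficients
rhos = np.array([one_run(true_coefs, 2000, rng) for _ in range(50)])
print(np.median(rhos), rhos.min())
```

A value of 1 means the two rankings are identical and −1 means they are exactly reversed; repeating the run many times shows how much that agreement varies from one simulated data set to the next.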


Chart 1: Pairwise feature correlation heatmap



Chart 2: A representative distribution of pairwise feature correlations (absolute values) 


What the results show

Explainability methods are reliable when features are independent, but their performance deteriorates sharply once features become even mildly correlated (Chart 3). The chart shows the distribution of rank-agreement coefficients between estimated and true feature-importance rankings across 500 repeated simulation runs. Each panel corresponds to an explainability method, with separate boxplots for the models used.

Blue boxplots represent simulations with independent features, while orange boxplots show results when features are correlated. Each box shows the interquartile range (the middle 50% of outcomes), with the median indicated by the horizontal line. When features are independent, all methods recover the true ranking with high accuracy and low variability, as reflected in the narrow blue boxplots clustered near one.

By contrast, once correlation is introduced, ranking performance worsens substantially. The orange boxplots are much wider, median rank agreement coefficients fall (typically to between 0.3 and 0.8), and some runs even exhibit negative agreement, meaning genuinely important features are ranked lower than unimportant ones. In real world settings, where only a single data set is typically observed rather than hundreds of simulations, this implies that feature importance explanations from a single model run can be highly misleading. This is especially concerning in high stakes contexts like credit scoring, where decisions carry real consequences.

Chart 3: Boxplots of rank-agreement coefficients between the true feature rankings implied by the data-generating process and the rankings implied by a range of explainability methods for a set of models (across 500 simulations), for the top 10 features


To unpack what the coefficients shown in the charts mean in practice, it is helpful to think about what happens in an individual model run. In our simulations, although the data-generating process is a simple, fully known linear system, explainability methods often struggle to recover the true ordering of feature importance once features are correlated.

Two broad patterns stand out. First, even genuinely important predictors can be severely misrepresented. In many runs, features that are among the top three true drivers of the outcome are pushed far down the ranking produced by explainability methods or disappear from the top ten altogether. This illustrates how easily real drivers of a model’s behaviour can be obscured once features exhibit even mild dependence.

Second, features with little or no true importance are frequently promoted into the top ranks. This type of mis-ranking is particularly problematic in practice. It encourages users to build interpretive narratives around variables that played no real role in generating the outcome, leading to a false sense of understanding of how the model actually works.

Where does this leave us?

This post argues that feature attribution explainability methods perform poorly in modern ML settings, where large data sets and mutually dependent features are the norm. The results presented indicate that even modest and realistic levels of feature correlation – around 0.20 on average – can meaningfully reduce the accuracy and stability of common attribution methods. In our simulations, rank-agreement that is close to perfect in independent settings often fell sharply once correlations were introduced, with important predictors moving down the list and low relevance features moving up. This matters because tools such as SHAP, LIME, PDPs and permutation importance are frequently used to support model interpretation. Under realistic data conditions, however, their outputs become unreliable, making it harder to identify which features are genuinely driving a model’s behaviour. If these methods struggle to recover the top features in a clean, fully specified linear system, it raises serious questions about their suitability for explaining high dimensional models used in real world decisioning. Rather than clarifying model behaviour, they risk reinforcing misleading narratives, discouraging deeper investigation, and creating unwarranted confidence – ultimately setting the stage for misguided decisions.

Making feature attribution genuinely insightful would require much more structure than most ML pipelines support. That would mean introducing disciplined feature construction – explicitly mapping correlation structure, grouping variables into interpretable clusters (eg, socioeconomic status, credit behaviour, stability, demographics), and reporting explanations at the group level rather than for individual features.
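As a rough sketch of what group-level reporting could look like, the code below groups features by absolute correlation and then shuffles each group jointly. Everything here is an assumption made for illustration: the 0.6 threshold, the connected-components grouping rule, the toy data and the function names `correlation_groups` and `grouped_permutation_importance` are hypothetical, and in practice groups would more plausibly be defined with domain knowledge.

```python
import numpy as np

def correlation_groups(X, threshold=0.6):
    """Group features linked by absolute correlation above the threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    n = corr.shape[0]
    labels = list(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if corr[i, j] > threshold:
                # Merge the group containing j into the group containing i.
                old, new = labels[j], labels[i]
                labels = [new if lab == old else lab for lab in labels]
    groups = {}
    for idx, lab in enumerate(labels):
        groups.setdefault(lab, []).append(idx)
    return list(groups.values())

def grouped_permutation_importance(predict, X, y, groups, seed=0):
    """Increase in mean squared error when a whole group is shuffled together."""
    rng = np.random.default_rng(seed)
    baseline = np.mean((y - predict(X)) ** 2)
    scores = []
    for cols in groups:
        X_perm = X.copy()
        # One row permutation per group keeps within-group dependence intact.
        shuffled_rows = rng.permutation(X.shape[0])
        X_perm[:, cols] = X[shuffled_rows][:, cols]
        scores.append(np.mean((y - predict(X_perm)) ** 2) - baseline)
    return np.array(scores)

# Hypothetical toy data: two correlated pairs plus one independent feature.
rng = np.random.default_rng(1)
n = 3000
z1, z2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([
    z1 + 0.3 * rng.normal(size=n),
    z1 + 0.3 * rng.normal(size=n),
    z2 + 0.3 * rng.normal(size=n),
    z2 + 0.3 * rng.normal(size=n),
    rng.normal(size=n),
])
y = X @ np.array([2.0, 2.0, 0.5, 0.5, 0.0]) + rng.normal(scale=0.1, size=n)

coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
groups = correlation_groups(X)
scores = grouped_permutation_importance(lambda data: data @ coefs, X, y, groups)
print(groups, scores)
```

Shuffling a group as a block avoids creating unrealistic combinations within the group, although dependence between groups is still ignored.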

While this kind of structured organisation is standard in classical statistics, many contemporary ML pipelines rely instead on large sets of raw or automatically engineered features. In such settings, models are often trained on whatever variables are available in the data set, with the expectation that the learning algorithm will discover useful structure without extensive manual grouping by domain. As a result, explicit feature grouping is rarely part of modern ML workflows, and with many correlated variables, even defining meaningful groups can become a research task in its own right.

It is worth noting that there are attribution methods designed to relax independence assumptions – such as Conditional SHAP and Causal SHAP – but these are very difficult to scale. Conditional SHAP requires estimating the joint feature distribution in order to compute conditional expectations; Causal SHAP needs a well specified causal graph, which most practical ML projects do not have. Both are computationally very expensive and fragile in high dimensions. So, although these alternatives address some of the theoretical shortcomings of classical feature attribution methods, they remain largely impractical for routine ML use. This leaves a noticeable gap between what explainability methods promise in principle and what they can realistically deliver today.
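The difference between the classical and conditional variants can be written compactly. The notation below is an assumption introduced here for illustration, not taken from the post: f is the fitted model, x the observation being explained, S a subset of the p features with complement \bar{S}, and v(S) the value assigned to knowing only the features in S.

```latex
% Shapley value of feature j for observation x
\phi_j(x) = \sum_{S \subseteq \{1,\dots,p\} \setminus \{j\}}
  \frac{|S|!\,(p-|S|-1)!}{p!}\,\bigl[v(S \cup \{j\}) - v(S)\bigr]

% Marginal value function: averages over X_{\bar S} ignoring its dependence on x_S
v_{\text{marginal}}(S) = \mathbb{E}_{X_{\bar S}}\bigl[f(x_S, X_{\bar S})\bigr]

% Conditional value function: averages over X_{\bar S} given the observed x_S
v_{\text{conditional}}(S) = \mathbb{E}\bigl[f(x_S, X_{\bar S}) \mid X_S = x_S\bigr]
```

The two value functions coincide when features are independent. When features are dependent, the marginal version evaluates the model at feature combinations that may rarely or never occur in the data, while the conditional version needs the conditional distribution of the missing features given the observed ones, which is what makes it costly to estimate.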

Rather than treating feature attribution as the primary means of understanding a model, these findings point to a need to rethink how ML models are assessed. One way to move beyond attribution is to examine model behaviour by exploring how outputs change under structured ‘what if’ variations in inputs. A fuller exploration of this and other approaches is beyond the scope of this post.


Ivona Cickovic and Andrea Serafino work in the Bank’s Model Review and Development Division.

If you want to get in touch, please email us at bankunderground@bankofengland.co.uk or leave a comment below.

Comments will only appear once approved by a moderator, and are only published where a full name is supplied. Bank Underground is a blog for Bank of England staff to share views that challenge – or support – prevailing policy orthodoxies. The views expressed here are those of the authors, and are not necessarily those of the Bank of England, or its policy committees.
