by Piotr Kaminski and Kate Robu
Machine learning (ML) methods have been around for decades, but the big-data revolution and the plummeting cost of computing power are now making them practical and highly effective analytical tools in banking across a variety of use cases, including credit risk.
ML algorithms may sound complex and futuristic, but the way they work is quite simple. Essentially, the algorithms discussed here combine a massive set of decision trees (decision-making models that break out individual decisions and their possible consequences, also known as “learners”) to create an accurate model. By churning through these learners at high speed, ML models are able to find “hidden” patterns that common statistical tools miss, particularly in unstructured data.
Overfitting (when a model describes random error rather than underlying relationships) is a typical concern about ML. It can be mitigated by carefully choosing input variables and specific algorithms. One way to guard against overfitting is to use the popular Random Forest algorithm, an ensemble of many intentionally “weakened” decision trees: each tree is built using only a partial set of variables, which reduces the model’s reliance on any specific variable. Another safeguard is to test model performance on a holdout sample not used during model development. If performance on the holdout sample is significantly degraded, it’s a sign of overfitting.
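The two safeguards described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the model used in the project: the data are synthetic, the "trees" are one-split stumps rather than full decision trees, and the function names, the 25-tree ensemble size, and the 0.10 gap threshold are all hypothetical choices made for the example.

```python
import random

def fit_stump(rows, labels, features):
    """Find the one-split rule (feature, threshold, flip) with the fewest training errors."""
    best = None
    for f in features:
        for t in sorted({r[f] for r in rows}):
            # Errors if we predict 1 whenever the value exceeds the threshold.
            errors = sum((r[f] > t) != bool(y) for r, y in zip(rows, labels))
            # Also consider the inverted rule (predict 1 at or below the threshold).
            for err, flip in ((errors, False), (len(rows) - errors, True)):
                if best is None or err < best[0]:
                    best = (err, f, t, flip)
    return best[1:]

def fit_forest(rows, labels, n_trees=25, n_features=2, seed=0):
    """Train each 'weakened' learner on a bootstrap sample and a random subset of variables."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]            # bootstrap sample of rows
        feats = rng.sample(range(len(rows[0])), n_features)   # partial set of variables
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feats))
    return forest

def predict(forest, row):
    """Majority vote across all learners."""
    votes = sum((row[f] > t) != flip for f, t, flip in forest)
    return int(2 * votes > len(forest))

def accuracy(forest, rows, labels):
    return sum(predict(forest, r) == y for r, y in zip(rows, labels)) / len(rows)

# Synthetic data (assumption): four variables, only the first two drive the outcome.
data_rng = random.Random(42)
rows = [[data_rng.random() for _ in range(4)] for _ in range(300)]
labels = [int(r[0] + r[1] + data_rng.gauss(0, 0.2) > 1.0) for r in rows]

# Hold out the last 100 rows; they are never seen during training.
train_rows, hold_rows = rows[:200], rows[200:]
train_labels, hold_labels = labels[:200], labels[200:]

forest = fit_forest(train_rows, train_labels)
train_acc = accuracy(forest, train_rows, train_labels)
holdout_acc = accuracy(forest, hold_rows, hold_labels)

# A large drop from training to holdout performance suggests overfitting.
gap = train_acc - holdout_acc
overfit_flag = gap > 0.10  # hypothetical threshold
print(f"train={train_acc:.2f} holdout={holdout_acc:.2f} gap={gap:.2f} overfit={overfit_flag}")
```

In practice a bank would more likely rely on an established library implementation and a validation metric of its own choosing; the point here is only the shape of the check, namely comparing performance on data the model has seen with data it has not.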
ML excels at analyzing long-tail data, which typically account for half of a bank’s portfolio but are not well understood through traditional statistical methods. Think of accounts with low share of wallet. We usually know little about them, and strategies to engage them tend to be quite reactive. But ML can generate insights into their behaviors, making it possible to actively target the accounts that are potentially profitable.
Let’s take as an example an ML project focused on optimizing credit-line decisions in a credit-card business. The company wanted to make better decisions about where to increase and decrease credit lines.
The existing models were performing well and already had very respectable predictive power. We used the existing traditional account data and set up our ML model as a challenger to the existing credit-line strategies. We also accounted for all the policy-mandated eligibility constraints in place.
Still, the ML model (which used Random Forest and AdaBoost) dramatically outperformed the existing models, improving predictive power by a factor of 1.6. This improvement can translate into significantly increased revenue from the less risky accounts that, based on existing models, would get a credit-line decrease. It can also help avoid losses from the accounts that are given credit-line increases but subsequently are most likely to charge off.
So what prevents banks from adopting ML tools more broadly? Typically, there are three key concerns. First, the scale of variables would tax the bank’s current capacity-constrained systems. Second is the issue of compliance: an ML model can be a black box, which makes it hard to explain outcomes and ensure compliance with regulations such as those governing adverse action. Finally, model risk validation can be challenging given the increased complexity, and it requires a different set of validation techniques from those commonly used by the industry today.
While the broader industry and regulatory bodies are still getting up to speed on the application of ML models, there are practical ways to address these three concerns in the near term. First, start modeling with all the available variables (e.g., 100+), but quickly prioritize them based on their contribution to the model, leaving a manageable number (e.g., 30–40) that won’t sacrifice the model’s predictive power. Second, “prune the branches” of the ML decision tree to get to a set of core linear rules that use an even smaller number of variables (e.g., 5–12) while still retaining 70 to 80 percent of the original ML model’s predictive power. This approach delivers a simplified set of new ‘decision strategies’ that banks can deploy quickly on top of existing rules, thus assuring compliance with regulatory requirements while making minimal changes to existing systems.
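The first step above, prioritizing variables by their contribution, can be sketched as follows. This is an illustration only: the variable names and contribution scores are invented, the `prioritize` helper is hypothetical, and the rule of keeping variables until a chosen share of total contribution is covered is one assumed way to make the cut, not necessarily the one used in the project.

```python
def prioritize(contributions, keep_share=0.90, max_vars=40):
    """Keep the highest-contributing variables until they cover `keep_share`
    of the total contribution, or until `max_vars` variables are kept."""
    ranked = sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(contributions.values())
    kept, running = [], 0
    for name, score in ranked:
        if len(kept) >= max_vars or running >= keep_share * total:
            break
        kept.append(name)
        running += score
    return kept, running / total

# Hypothetical contribution scores (e.g., taken from a fitted model's
# variable-importance output); the names and numbers are made up.
contributions = {
    "utilization": 40,
    "months_on_book": 25,
    "payment_ratio": 20,
    "delinquency_count": 10,
    "branch_visits": 3,
    "statement_day": 2,
}

kept, share = prioritize(contributions)
print(kept)   # variables retained, most important first
print(share)  # fraction of total contribution they account for
```

With these made-up scores, four of the six variables cover 95 percent of the total contribution. Whether the reduced set actually preserves predictive power should be confirmed by refitting the model on the kept variables and checking it against a holdout sample.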
Is it possible to capture more value with a more sophisticated ML model? Yes, and that’s most certainly the future. But this approach can help banks start capturing value from ML immediately by addressing regulatory and system constraints and making the best use of readily available ‘small’ data. The key implication for banks is that their current models are leaving a lot of value on the table, and ML offers a way to capture it in a practical way.
Piotr Kaminski is a senior partner in McKinsey’s New York office. Kate Robu is a partner in McKinsey’s Chicago office.