Real-world data quality: What are the opportunities and challenges?


Growth in the availability and variety of real-world data (RWD)¹ is creating new opportunities for real-world evidence (RWE) at a pace not seen before.² Nonhealth data, such as consumer credit-card spending, geospatial data, and web-harvested data (as used in an appropriate context and adhering to stringent privacy standards), also present new possibilities in RWE to gain a more holistic understanding of patient behaviors and outcomes (see sidebar “Types and sources of real-world data”).

¹ The US Food and Drug Administration (FDA) defines RWE as “Healthcare information derived from multiple sources outside of typical clinical research settings, including electronic medical records, claims and billing data, product and disease registries, and data gathered by personal devices and health applications.” The data used to inform RWE (real-world data) traditionally come from four sources: clinical data, administrative and claims data, patient-generated/reported data, and emerging data sources such as social media and cross-industry data collaborations.
² “Creating value from next-generation real-world evidence,” McKinsey, July 23, 2020.

This wealth of data underpins the ability of life sciences companies to move ahead on a number of key issues, such as increasing patient centricity, accelerating the pace of scientific innovation, addressing rising development costs, and intensifying their focus on value. Advanced analytics open up opportunities to incorporate RWD along the pharma value chain: for example, to inform research decisions, support market access, sharpen product strategy, improve pharmacovigilance, and enable adherence.³ A recent McKinsey article estimated that, over the next three to five years, an average top-20 pharmaceutical company could unlock more than $300 million a year by adopting RWE across its value chain.⁴

³ “Creating value from next-generation real-world evidence,” McKinsey, July 23, 2020.
⁴ “Generating real-world evidence at scale using advanced analytics,” McKinsey, March 15, 2022.

Types and sources of real-world data

Data sets that can be applied to healthcare are expanding and becoming more linked, enabling the creation of increasingly detailed pictures of patients, their lifestyles, and every aspect of their health. Data types can include:

Clinical (electronic medical records, labs, imaging, genomic, proteomic, metabolomic, tissue, patient-reported outcomes)
Administrative (insurance claims, employment records)
Attitudinal (patient experience and sentiment)
Behavioral (diet, lifestyle, physical activity)
Demographic (age, education, environmental factors, income, geographic location)
Financial (credit-card spending, income, purchases)
Social (employment, family, household, and social networks)

However, as data complexity grows, multiple data-quality challenges, which could limit the application and usefulness of RWE, also arise. Biases in data and other underlying quality issues, for example, can be hard to detect and could limit the insights derived from such data. Moreover, the advent of advanced analytics raises its own specific issues. Although there is growing recognition of the importance of interpretability in the context of advanced-analytics methods, some approaches are, by their very nature, “black box” (for example, deep learning, convolutional neural networks, or generative adversarial networks). Today, more than ever, high-quality data sets are essential to enable robust analyses and insights.

Unfortunately, most healthcare data sets are more suited to administrative and billing purposes than research and RWE. As such, they are often incomplete and may include built-in biases. Indeed, there are numerous additional data-quality issues that lie beyond the scope of this article. Additionally, visibility into RWD is often limited until after the data set is purchased, which makes a comprehensive appraisal of the data challenging, time-consuming, and potentially expensive.

Hence, for RWE to achieve its potential over the next ten years, there needs to be an evolution in the way real-world data sets are assessed. When addressing data-quality issues, there are several dimensions to consider, including the depth, breadth, coverage, timeliness, and potential impact of the data. While some biopharma companies have invested in enhanced capabilities to appraise and acquire RWD, it remains a challenge for many. This article explores lessons from other industries that can help point the way for pharma, sets out a process and framework for evaluating real-world data quality, and considers practical next steps for organizations looking to expand their use of RWD.

Lessons from other industries

RWE leaders can learn from other industries’ use of external data both in terms of approaches to data-quality assessment and the use of nonhealthcare data to support creation of RWE. External data are used by various industries and functions for applications such as customer analytics, risk management, strategic analysis, and forecasting. Moreover, data-quality issues can sometimes be addressed with AI and machine learning. Looking beyond data-quality issues, a number of best practices from outside the healthcare sector can be applied to life sciences, particularly those that address how to deal with data quality, access richer data, and optimize data procurement:

1. Dealing with data quality

Using machine learning techniques to detect outliers at a commercial-loan data player. Implementing kernel-based analysis for longitudinal outlier-detection algorithms can address data-quality anomalies. Data were obtained from commercial-mortgage-backed securities issuers and underwriters, including financial terms (rate, term, payment), property description (property type, address, year built, number of units, top three tenants), and financial measures (net operating income, debt-service coverage ratio, appraised value, and loan-to-value ratio). Advanced outlier-detection techniques were based on distinguishing isolated observations and isolated clusters from crowded or main clusters. Identified isolated observations were then classified by cluster proximity. Implementing machine learning algorithms, therefore, can help identify outliers in data sets automatically.
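To make the idea of "isolated" versus "crowded" observations concrete, the following is a minimal sketch of a kernel-based density score. It is not the method the data player used. The function names, the sample debt-service coverage ratios, the bandwidth, and the threshold are all assumptions chosen for illustration, and the sketch handles a single variable rather than a full longitudinal data set.

```python
import math


def kernel_density_scores(values, bandwidth):
    """Leave-one-out average Gaussian kernel value for each observation.

    The kernel is left unnormalized, so each score lies between 0 and 1:
    close to 1 in a crowded region, close to 0 for an isolated point.
    """
    scores = []
    for i, x in enumerate(values):
        total = 0.0
        for j, y in enumerate(values):
            if i != j:
                total += math.exp(-0.5 * ((x - y) / bandwidth) ** 2)
        scores.append(total / (len(values) - 1))
    return scores


def flag_isolated(values, bandwidth, threshold):
    """Return the positions of observations whose density score is below threshold."""
    scores = kernel_density_scores(values, bandwidth)
    return [i for i, score in enumerate(scores) if score < threshold]


# Hypothetical debt-service coverage ratios reported for one loan over time.
dscr = [1.25, 1.30, 1.28, 1.27, 4.90, 1.31, 1.26]

# Bandwidth and threshold are illustrative assumptions, not recommended values.
isolated = flag_isolated(dscr, bandwidth=0.1, threshold=0.05)
print(isolated)
```

In practice the bandwidth and threshold would need tuning per field, and flagged observations would still require review, since an isolated value can be a genuine event rather than a data error.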

Using automated gap analyses and dashboards to detect missing data points at a health insurance company. Automated gap analyses with dashboards can be used to better understand the potential source of missing data points. By using such analyses, the company realized that most gaps could not be explained. This resulted in significant savings for the company, as it addressed the annual allocation losses and thus improved the process of fulfilling available units. This shows that automated gap analyses and dashboards are effective tools for detecting missing data deliveries early on.
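A gap analysis of this kind can be sketched in a few lines. The example below is an illustration under stated assumptions, not the insurer's actual tooling: the record layout, the field names (`period`, `diagnosis_code`, `paid_amount`), and the 25 percent tolerance are all hypothetical. It computes the share of missing values per field and reporting period, which is the kind of table a dashboard would then display.

```python
from collections import defaultdict


def missing_share_by_period(records, fields):
    """Return {period: {field: share of records in which the field is None}}."""
    missing = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for record in records:
        period = record["period"]
        totals[period] += 1
        for field in fields:
            if record.get(field) is None:
                missing[period][field] += 1
    return {
        period: {field: missing[period][field] / totals[period] for field in fields}
        for period in sorted(totals)
    }


def flag_gaps(shares, max_missing):
    """List (period, field) pairs whose missing share exceeds the tolerance."""
    return [
        (period, field)
        for period, by_field in shares.items()
        for field, share in by_field.items()
        if share > max_missing
    ]


# Hypothetical claims records; None marks a value that was not delivered.
records = [
    {"period": "2023-01", "diagnosis_code": "E11", "paid_amount": 120.0},
    {"period": "2023-01", "diagnosis_code": "I10", "paid_amount": 80.0},
    {"period": "2023-02", "diagnosis_code": None, "paid_amount": 95.0},
    {"period": "2023-02", "diagnosis_code": None, "paid_amount": None},
]

shares = missing_share_by_period(records, ["diagnosis_code", "paid_amount"])
gaps = flag_gaps(shares, max_missing=0.25)  # tolerance is an assumption
print(gaps)
```

Running the check on every delivery, rather than once at purchase, is what allows missing data to be caught early.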

2. Accessing richer data

Deploying NLP to expand data types. Advances in natural-language processing (NLP) and semantic ontology are being used to cleanse unstructured data types. A growing number of companies are using cloud platforms that cleanse data through NLP algorithms and can host data from multiple sources. The platforms’ self-service tools allow end users to access and use the data, but this approach requires monitoring of data quality over time, as the data are continually changing. In addition, completeness of information is a major pain point in the context of RWD quality: electronic medical record (EMR) data sets are particularly challenging due to their unstructured format and the inability of current NLP technologies to handle the wide variability in physician shorthand.

Using data marketplaces to enable access to a broader external data ecosystem. The increasing use of multiple data sets has enabled data marketplaces to mature and improve value for customers by allowing access to several sources of data, tailored to specific needs.

3. Optimizing data procurement

Setting up a central data-procurement team. Central data-procurement teams enable resourcing efficiency and help to eliminate silos and prevent duplicate purchasing. Different groups within an organization often purchase a data set multiple times, not realizing that it was already available. To prevent continued duplication of effort, some companies create an internal marketplace where anyone in the organization can access data. Others have created procurement teams to centralize thinking about data sourcing and purchasing.

Establishing a process for testing data prior to use. By adopting purchasing models that allow data to be tested prior to use, organizations can maximize efficiency by paying only for data that provide value for the required use case.

Framework for data-quality components

How data quality and transparency can boost health equity

The role that data can play in health equity¹ is a natural follow-on from assessing and strengthening data quality. Having the right representation and data transparency is critical to understanding and advancing health-equity initiatives. Moreover, the issue is a growing priority for real-world-evidence (RWE) leaders, with over 80 percent of participants in a recent survey reporting that health equity is or will be a top priority within the next three years.²

Representation of populations is critical to inform analyses and actions but has been a significant challenge historically. Such challenges arise in multiple ways: in the context of participation in data-generating events, the capture of demographic information, and transparency of aggregate data sets.

From a participation perspective, we continue to have significant gaps among minority populations. For example, Black Americans make up 13 percent of the population but only 8 percent of trial participants.³ Lack of diversity in studies and product design can have direct health consequences, such as pulse oximetry disparities between Black and White patients.⁴ Addressing participation requires very early planning for study design, clear representation goals, and dedicated strategies for enrolling patients.

As companies make progress on participation, limited standardization guidelines on how data are captured still leave significant gaps. For collecting personal information, particularly for patient-reported outcomes, inputs may be optional, unclear, or not inclusive—which can lead to insufficient or useless data sources⁵ (exhibit). With growing hesitancy around how personal information is used, more education for patients is needed to highlight the importance of this information for advancing equity in our health system.

In the absence of standardized guidelines in the near term, transparency is critical for data users to assess the representation of a data set. There are inherent black-box issues for third-party data sources and significant variability across these sources. Increasing visibility into specific demographic and socioeconomic attributes, while maintaining strict privacy standards, is needed to advance the understanding of equity in our data sets. One important example is the redaction of ethnicity or gender information, which, although considered best practice until recently, is now coming under review: other variables in the model can still proxy the sensitive attribute and introduce bias, which can no longer be measured and controlled for once the sensitive attribute has been redacted. The move away from “fairness via unawareness” offers instead transparency and explicit measures of fairness that can be optimized alongside traditional objectives such as predictive accuracy.⁶

A framework to assess “data equity” can be used to assess bias across data sets and inform users how they may need to adjust for biases in the data.
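One simple building block for such a framework is a representation ratio: a group's share in the data set divided by its share in the reference population. The sketch below uses the participation figures quoted above (13 percent of the population, 8 percent of trial participants). The function name and the 0.8 cutoff are assumptions for illustration, not an established standard for data equity.

```python
def representation_ratios(dataset_shares, population_shares):
    """Share in the data set divided by share in the reference population.

    A ratio of 1.0 means proportional representation; values below 1.0
    indicate that the group is underrepresented in the data set.
    """
    return {
        group: dataset_shares.get(group, 0.0) / population_share
        for group, population_share in population_shares.items()
    }


# Figures quoted in the text; further groups would be added in the same way.
population = {"Black Americans": 0.13}
trial_participants = {"Black Americans": 0.08}

ratios = representation_ratios(trial_participants, population)

# The 0.8 cutoff is a hypothetical threshold for flagging a group for review.
underrepresented = [group for group, ratio in ratios.items() if ratio < 0.8]
print(ratios, underrepresented)
```

A ratio like this only works when the demographic field is actually captured, which is why optional or redacted fields limit the ability to measure equity at all.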

Exhibit: Data equity is impacted by quality of personal-data fields.

Incorporating data sets from the many different sources described above—which may include information of varying quality, suitability, and completeness—inevitably implies a need for a robust approach to data evaluation. Indeed, this is a message amplified by best practices derived from other sectors. It is also important to note that data quality has significant implications in the context of health equity (see sidebar “How data quality and transparency can boost health equity”). In response, we have constructed a data-quality evaluation framework to assist researchers in their selection and assessment of RWD. While there are numerous frameworks that relate to clinical trials and trial standards, this framework is rooted in the practicalities of RWE: accessing a variety of data sources and the need to make pragmatic choices about which data sources are appropriate for different applications in a world where none of them is perfect and everything is messy.

The framework appraises data quality across four dimensions (volume, reliability, usability, and compliance) and considers the relative importance of each in the context of how the data will be used (exhibit). The framework next considers how to ensure data are “fit for purpose.”

Exhibit: A scorecard looking across multiple categories can help with data set comparisons.

Volume can be measured across three dimensions: length (how recent are the data and what time frame do they cover), representativeness (what is included in the data in terms of demographics, geographic coverage, and so on), and depth (how many patient records do the data include).
Reliability looks at two factors: the quality of the data points and the completeness of the data. The extent to which a data set is reliable depends on what it is used for. For example, causal inference on observational data is faced with a necessarily incomplete picture, since it is not possible to observe what would have happened had treatment been randomized. This gap cannot be bridged by any data vendor, though vendors can help capture additional information to complement the evidence (such as the reasons why a physician would prescribe a certain treatment).
Usability takes account of four factors: the “generalizability” of the data (how well do the data support analyses that can be generalized), their “linkability” (how easily can the data be combined with other sources), “reusability” (can the data be shared or reused), and format (are the data structured or unstructured).
Compliance evaluates whether data are compliant with the requirements of regulatory bodies and meet accepted industry standards such as those put forward by the Clinical Data Interchange Standards Consortium (CDISC).
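The four dimensions above can be combined into a simple weighted scorecard. The sketch below is one possible way to do this, not a definitive implementation of the framework: the data set names, the 1-to-5 scores, and the use-case weights are all hypothetical. It shows how the same two data sets can rank differently once the relative importance of each dimension changes with the use case.

```python
DIMENSIONS = ("volume", "reliability", "usability", "compliance")


def weighted_score(scores, weights):
    """Weighted average of the four dimension scores for one data set."""
    total_weight = sum(weights[d] for d in DIMENSIONS)
    return sum(scores[d] * weights[d] for d in DIMENSIONS) / total_weight


def rank_data_sets(scorecards, weights):
    """Return data set names ordered from best to worst weighted score."""
    return sorted(
        scorecards,
        key=lambda name: weighted_score(scorecards[name], weights),
        reverse=True,
    )


# Hypothetical 1-to-5 scores assigned by reviewers.
scorecards = {
    "claims_set": {"volume": 5, "reliability": 3, "usability": 4, "compliance": 4},
    "emr_set": {"volume": 2, "reliability": 5, "usability": 3, "compliance": 4},
}

# Hypothetical weights: one use case values volume, the other reliability.
market_access_weights = {"volume": 4, "reliability": 1, "usability": 2, "compliance": 1}
safety_weights = {"volume": 1, "reliability": 4, "usability": 1, "compliance": 2}

print(rank_data_sets(scorecards, market_access_weights))
print(rank_data_sets(scorecards, safety_weights))
```

Keeping the weights separate from the scores means a data set needs to be appraised only once and can then be re-ranked for each new use case.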

Four-step process for evaluating data

Researchers can follow a four-step approach to evaluating and acquiring data for purposes of data quality optimization.

1. Define use cases

Start by clearly defining the features required for a given use case based on business drivers to thoroughly understand the question you are trying to answer. This process may raise multiple questions, such as whether you are prepared to work with raw data, work on another organization’s platform, accept data that only cover a static point in time, link multiple data sets, or accept a requirement to link to your own data.

2. Determine the requirements

It is important to determine the criteria for selecting the data sources based on requirements for use cases. As we have seen, there is likely no perfect fit between available data and a specific use case. For example, we can expect to encounter variances in depth, coverage, type, and quality of data between therapeutic areas among the major data vendors, which are not well advertised. Because no data sets are likely to deliver all the necessary features for a specific use case, informed trade-offs should be considered. However, it is impossible to know exactly what is contained in a data set and what it is like to work with until you actually see it.

3. Approach vendors

Include multiple potential data vendors in the evaluation process (informed by those that would work best for the desired use case) to ensure sufficient sample size, coverage, completeness, and richness of data. It is also important to determine whether direct access to data for analysis is possible and what is available now versus what is in the road map for the future. This group can then be narrowed down to a short list of data sets for more detailed evaluation.

In addition, approach data vendors with a robust set of scoring criteria to understand how closely the data match your needs, and seek answers to specific questions such as: “For what period are the EMR data available?” and “For what share of the patients do the linked EMR data contain content (like drug taken)?” Then rapidly run confirmatory analyses to test the data.
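The two example questions can be checked quickly against a vendor's sample extract. The following sketch assumes a hypothetical record layout (`patient_id`, `visit_date`, `drug`) and hypothetical function names; a real extract would have its own schema. It reports the period covered by the EMR records and the share of patients with at least one populated drug field.

```python
from datetime import date


def emr_period(records):
    """Return the earliest and latest visit dates in the sample."""
    dates = [record["visit_date"] for record in records]
    return min(dates), max(dates)


def share_of_patients_with_field(records, field):
    """Share of distinct patients with at least one non-empty value for field."""
    all_patients = {record["patient_id"] for record in records}
    with_content = {record["patient_id"] for record in records if record.get(field)}
    return len(with_content) / len(all_patients)


# Hypothetical sample extract; None and "" both count as missing content.
sample = [
    {"patient_id": "p1", "visit_date": date(2019, 3, 1), "drug": "metformin"},
    {"patient_id": "p1", "visit_date": date(2021, 6, 15), "drug": None},
    {"patient_id": "p2", "visit_date": date(2020, 1, 10), "drug": None},
    {"patient_id": "p3", "visit_date": date(2022, 11, 5), "drug": "lisinopril"},
    {"patient_id": "p4", "visit_date": date(2020, 8, 20), "drug": ""},
]

first_visit, last_visit = emr_period(sample)
drug_share = share_of_patients_with_field(sample, "drug")
print(first_visit, last_visit, drug_share)
```

Comparing these figures with the vendor's own answers is a fast way to confirm whether the sample behaves as described.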

4. Compare data sets

Compare data sets to determine which one can best support your needs. A scorecard that looks across multiple categories—availability of physician details, richness and depth of data, types of insurance coverage (for example, public, private), and several other factors—may be a useful tool for assessing data quality.

Significant probing of each feature under consideration may also be required. For instance, assessing the quality of EMR data requires us to query the source and time period covered by the data (among other questions) and to receive detailed answers in response. There may be some surprises, such as when preconceived notions about the best data set are not borne out.

The time investment for conducting a thorough assessment and evaluation across data sets is likely to be significant, so it is important to plan ample time and resources. Ultimately, no data set has all the desired features, so this process will necessitate informed trade-offs between various relevant factors: breadth of coverage; ability to link to proprietary data, taking into account data already licensed; and completeness of data points for each individual. If looking at a health data set, other factors could include clinical outcomes or individual healthcare professional specificity.

Practical next steps for RWE users

What, then, are the practical steps that biopharma leaders can take to maximize data quality and robustness, as well as the value derived from incorporating RWD into programs? The following is our perspective on a number of actions for leaders to take in order to enhance the appraisal, acquisition, and application of RWD.

Establish a group of in-house data experts to streamline decision making around purchasing in order to “empower the edge without causing fragmentation.” Although data purchasing could be carried out centrally, there is no need for this group of in-house data-purchasing experts to be centralized—for example, the group could comprise representatives from each of the key data users. Thus, the business unit makes the decision to purchase (or makes the purchase case upstream within that part of the organization), while a central team negotiates licensing, financials, and related details.

Assess the suitability of data sets for particular studies or use cases—for example, by prototyping analyses on a data sample prior to purchasing. The same group of in-house experts described above could also undertake assessment of data suitability. However, different team members are likely to be involved, because the individuals making purchasing decisions are probably too senior to undertake the actual analyses (although they may help shape those decisions). Under this setup, the functional team responsible for data quality may be employed full-time by a data center of excellence, or they could be specialists attached to business units who follow centrally defined standards.

It should be noted that this setup creates some potential ambiguity when it comes to reuse: that is, if business unit A buys data X, can business unit B then also use those data? Again, the central team is responsible for reuse or, when it comes to big enterprise licenses, for collecting enough buy-in from different business units to justify the purchase.

Cultivate a network or ecosystem of relationships with potential data partners. Recognizing that there may be no one perfect source, flexibility is required to find fit-for-purpose data for each need. Also consider precompetitive collaboration to create a universally accessible ranking system of data sets and quality to increase confidence in purchasing.

Note the importance of assessing and enhancing data quality in the context of current priorities, such as health equity.
