Measuring impact: methods, challenges and biases
What are the main approaches for measuring impact of research and the most important methodological challenges in such measurements? Magnus Gulbrandsen and Richard Woolley discuss main methods, historical examples and interesting recent developments in this blog post.
“Impact” is a relatively new concept in Norwegian science policy – rarely heard a decade ago, now ubiquitous. It normally refers to the broader, longer-term effects of research, and it is thereby an expression of one of the central goals of society’s research efforts. Impact is a term imported from science policy debates in the UK and the EU, and in many contexts it has replaced other and similar Norwegian words such as “nytte” (utility/usefulness) and “samfunnsnytte” (societal usefulness/benefit).
The term is often tied to evaluations of “what society gets back” from research. Attempts to measure what policymakers now call impact have been made for nearly 50 years. There are substantial conceptual and methodological challenges, but new measurement and evaluation methods are continuously being developed.
Two main quantitative methods
The most common approaches to impact measurement involve quantifying the possible economic effects of investments in research and development (R&D). Through looking at complex indicators such as innovation, growth, turnover and productivity in private firms, the effects of public and privately funded R&D are investigated. Findings are then typically used in so-called summative evaluations. This means that impact measurements are a basis for decisions about terminating or continuing particular policy schemes or funding arrangements, and give policymakers indications of whether specific types and fields of research lead to the results they want. This form of evaluation usually means that the relationship between science and impact is studied in terms of a mathematical relationship between indicators of effort/input and indicators of (desired) outputs.
There are two main approaches to such investigations. The first involves the use of different types of databases, where statistical relationships between indicators for research on the one hand, and indicators of effects on the other, are investigated. Examples of this kind of research are studies of the extent to which commercially successful patents are based on published research, and of the extent to which firms under particular public R&D funding schemes score better on various economic indicators than firms that do not receive the type of R&D funding in question.
The second approach uses questionnaires/panel data, sometimes combined with data from relevant databases, to learn more about the experiences of organisation, implementation and use of research in firms or other types of organisations. The large-scale Community Innovation Survey (CIS) in Norway and many other countries is an important example of such an approach. Databases and questionnaires/panel data are particularly relevant for estimating the benefits of research for a specific firm, but can also be used to estimate wider societal benefits.
Common to both approaches is that they often observe high rates of return on investment, frequently 20 percent or more, for the firms that invest in or get funding to do research. Many investigations find an even greater benefit for society as a whole: quantitative studies indicating figures in the 50-100 percent range are common. One example is the recent (2018) economic footprint study of nine leading technological research institutes in Europe, which found that one job at a research institute results in four jobs elsewhere, and that governments get three euros back for each euro they invest in institutes in the form of basic grants. Such numbers are impressive but often controversial, not least due to the huge methodological challenges associated with them.
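To make the arithmetic behind such headline figures concrete, here is a minimal Python sketch. It is only an illustration of how a "rate of return" and a "euros back per euro" figure relate to each other; the function names and the firm-level numbers are our own assumptions, and the sketch says nothing about how the studies mentioned above actually arrived at their estimates.

```python
def rate_of_return(cost: float, benefit: float) -> float:
    """Net return as a fraction of the investment: (benefit - cost) / cost."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (benefit - cost) / cost


def benefit_cost_ratio(cost: float, benefit: float) -> float:
    """Gross benefit per unit invested: benefit / cost."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return benefit / cost


# Hypothetical firm: invests 100 in R&D and gains 120 in measured benefit.
firm_return = rate_of_return(cost=100.0, benefit=120.0)

# "Three euros back for each euro", read literally as a gross ratio.
institute_ratio = benefit_cost_ratio(cost=1.0, benefit=3.0)
institute_return = rate_of_return(cost=1.0, benefit=3.0)

print(f"firm rate of return: {firm_return:.0%}")
print(f"institute benefit-cost ratio: {institute_ratio:.1f}")
print(f"institute net rate of return: {institute_return:.0%}")
```

The point of separating the two functions is that a gross ratio and a net rate of return are different measures, so figures reported in different studies are not necessarily comparable without checking which one is meant.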
Many challenges of measurement
Attribution is a central and complex methodological challenge. Research usually does not lead to impact on its own or automatically. It is the combination of research with a range of other factors (social/organisational, political and often not directly related to the research itself) that makes the difference for a firm or for society. The measurement challenge is to figure out how much credit should be given, or attributed, to research or to the individuals and groups under evaluation. If the other factors necessary to make an impact are not included in measurements, it is easy to get the flawed impression that impact relies more or less on the contributions of research alone.
To take one example: in numerous interviews, George Lucas has stated that the research writings by Joseph Campbell in comparative religion were paramount for the creation of the Star Wars universe. But how much of this creation should be attributed to Campbell’s book on the monomyth and the hero’s journey? Clearly George Lucas and his partners were essential for Star Wars too. And should some of the credit go to researchers like Freud, Jung and Maslow, whose shoulders Campbell’s work partly rests on?
This example also illustrates the major problem of finding a good indicator for the effect: what should be counted as the “size” of Star Wars? Movie ticket sales or any kind of spin-off products? What about the wider cultural and economic effects? Under headings like “StarMetrics” and “AltMetrics”, initiatives are currently under way to develop new and alternative effect indicators using, for example, data on researchers’ careers or data from various social media platforms. This has increased the number of available effect indicators. Although supplementing economic indicators is necessary, the new ones share many of the same challenges.
Latency is another crucial measurement issue – often the time between research discoveries/input and measurable output can be substantial. Systematic investigations of agricultural research, which is probably the most frequently studied subject area in terms of impact, indicate that the average time lag between R&D and (measurable) outcome can span decades. In most cases it is unrealistic to expect major impacts within the time span of a regular research project. The problem of latency will depend on what is measured, of course, whether it is an intermediary form of knowledge exchange or an ultimate economic or other effect. Latency still creates substantial challenges for all systems of measurement and indicators – and for the question of attribution.
One final disputed issue is causality. Many have argued that the relationship between societal effects and research is more complicated than the latter leading to the former. In many instances what initiates and influences specific research efforts are the needs and challenges in society. Impact, in turn, is a result of the mutual influence between research and those who use it, translate it, transfer it and develop it further. Even though there are many examples of scientific breakthroughs or research-based inventions that have led to the creation of a concrete product or other effects, impact is better framed as a more complex, collaborative and indirect process.
Investigations of impact based on case studies have often aimed to handle these measurement challenges in different ways compared to the broad quantitative approaches. Typically case studies have been concerned with evaluating a certain research unit, field, programme or other effort. Evaluators start with this research and attempt to map what it has led to. Although the method may involve historical data, i.e. starting with research that was carried out some time ago, impact is perceived as something that can be traced forward in time.
Several of the more recently developed methods for measuring impact are based on this kind of approach. Examples would be the British Payback Framework, used particularly in medical research, and the French ASIRPA Framework, developed to study the effects of publicly funded agricultural research. The aim of both these methods is to map the pathways through which research influences various groups and sectors in different ways. For example, an agricultural research project may have had some effects on farming and other effects on policy or on the environment. The SIAMPI approach is similar; it assumes that the final impact cannot be measured in a meaningful way, and instead aims to map the various direct and indirect interactions between users and researchers.
There are also methods that take as their starting point the societal values that research aims to support. Such values (for example a safe society, clean air and water, and efficient transportation) are often an important justification for the funding of research but are frequently left out of the measurements of economic impact. Public value mapping involves a systematic and case-based approach to understanding the wider values relevant for research.
Forward-tracing case investigations of the pathway from research to impact are mostly used in so-called formative evaluations where the purpose is to help research environments and funders to improve the way they work rather than to provide them with a score. Compared to the purely quantitative approaches described earlier, the cases may allow for a better understanding of the relationship between input and output and the characteristics of the often-lengthy process through which research demonstrates impact. Forward-tracing cases frequently combine quantitative data with qualitative assessments, and they have been developed by organisations that are more often involved in scientific research than in evaluations.
Alternatively, case-based mapping can be retrospective and look at the process from the impact instead of the research. Starting with a certain impact, for example a medical product, evaluators attempt to trace back in time the kind of research that was important in developing the product, and how. This approach is often very similar to popular methods used in research on innovations and new technologies, and there may be few differences between a thorough impact measurement of this type and a regular scientific investigation. Two of the oldest and most well-known systematic evaluations of impact used this backward-looking approach.
In the 1960s, the U.S. Army sought to map the costs and benefits of different kinds of research and it defined 20 of its most important and advanced weapon technologies as the starting point for an extended evaluation. This measurement project, called Hindsight, showed that less than one percent of the around 700 “research events” that were identified behind the technologies could be classified as basic research. The driving force in the processes was usually an identified practical need and applied technological research and development.
As a direct response to these findings, the U.S. National Science Foundation (NSF) initiated its own impact measurement project called Traces, using five civilian technological innovations it had supported as its starting point. Perhaps not surprisingly, given a stricter definition of a “research event” (a publication rather than a new idea) and a longer timescale (Traces went back 50 years to search for events, compared to 20 in Hindsight), the evaluation found that around two out of three events could be classified as basic research.
These examples show that the challenges of measuring impact do not disappear through the use of case-based methods, and that relatively small differences in practical choices, such as how research events are defined and how far back they can go, can have substantial effects on the observed outcome. Of course, these choices may be even more consequential in quantitative than in case methods, but perhaps more visible in the latter. The enormous difference in conclusions and numbers between the Hindsight and Traces projects was due to technical definitions that in evaluations will mostly be hidden away far beyond any “executive summary”. The same point can be made with respect to the various assumptions made in most quantitative studies of impact discussed earlier. This indicates why it is easy to question the validity of single investigations that proclaim a certain number for the return on investment in research.
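How much the definitions matter can be shown with a small Python sketch. Everything in it is an assumption made for illustration: the list of events is invented and is not data from Hindsight or Traces, and the `ResearchEvent` fields and the `basic_share` function are hypothetical names. The sketch only shows the general mechanism, namely that the same set of events can yield very different shares of basic research depending on the look-back window and on what counts as an event.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResearchEvent:
    years_before_innovation: int  # how long before the innovation the event occurred
    is_basic: bool                # classified as basic (vs. applied) research
    is_publication: bool          # the event is documented as a publication


def basic_share(events, max_lookback_years, publications_only=False):
    """Share of counted events that are classified as basic research."""
    counted = [
        e for e in events
        if e.years_before_innovation <= max_lookback_years
        and (e.is_publication or not publications_only)
    ]
    if not counted:
        return 0.0
    return sum(e.is_basic for e in counted) / len(counted)


# Invented events behind one hypothetical innovation (NOT real project data).
events = [
    ResearchEvent(5, False, False),
    ResearchEvent(8, False, False),
    ResearchEvent(10, False, True),
    ResearchEvent(12, False, False),
    ResearchEvent(15, True, True),
    ResearchEvent(18, False, False),
    ResearchEvent(25, True, True),
    ResearchEvent(30, True, True),
    ResearchEvent(35, True, True),
    ResearchEvent(40, False, True),
    ResearchEvent(45, True, True),
]

# Short window, any kind of event counted.
short_window_all = basic_share(events, max_lookback_years=20)

# Long window, only publications counted.
long_window_pubs = basic_share(events, max_lookback_years=50, publications_only=True)

print(f"20-year window, all events:    {short_window_all:.2f}")
print(f"50-year window, publications:  {long_window_pubs:.2f}")
```

With this invented data the short window gives a small basic-research share and the long, publication-only window gives a large one, even though nothing about the underlying events has changed.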
Impact in science policy
Hindsight and Traces also demonstrate the political nature of impact measurement. Demonstrating impact through strong case narratives or through impressive numbers has seemingly become an essential argument for (more) funding to science in the general debate. Furthermore, different actors can use different impact evaluations to argue for support for their preferred type or field of research. This happens even though it is conspicuous that the U.S. Army’s evaluation approach supported its preferred user-oriented technological activities, while the approach of the NSF, a basic research funding agency, produced arguments for basic research. Interestingly, the recent Norwegian report on how to reorganise the research council referred to the Hindsight/Traces debate as part of an argument for more emphasis on thematically unrestricted basic research.
This shows how impact is a political term, and evaluations of it are initiated and requested by powerful actors in the research system. Although many researchers and their organisations may be interested in promoting and understanding the use of their research, “impact” is primarily constructed and debated within science policy. As such it is no surprise that the term easily becomes enrolled in dominant perspectives on why and how science should be supported. Traces and Hindsight may be seen to express a distinction between an emphasis on fundamental research with few strings attached and an emphasis on societal needs and the users of research.
Impact measurement is a contested issue in itself. Today’s most extensive impact measurement exercise is the UK Research Excellence Framework (REF), a national evaluation of research in the higher education sector. The REF, which was conducted in 2014 (new round in 2021), combined an emphasis on scientific publishing with cases of impact. All research units were required to submit one or more so-called “impact cases”: short descriptions documenting how a concrete example of use could be attributed to a concrete research outcome. The exercise, which has been copied in several recent evaluations in Norway, resulted in a database of almost 7000 cases that have been given a score on a scale of one to four.
It is easy to criticise the REF impact approach as only distinct types of impact (the “exceptional” and highly visible or easily measurable ones) are reported, the method is very time-consuming and expensive, and its subsequent use for ranking purposes is problematic. However, proponents argue that the REF increases attention to questions of usefulness and societal needs within research units, maybe in particular in those for which earlier terms like innovation and commercialisation were alien.
Towards a broader concept of impact
An interesting aspect of the UK REF is the broad definition of impact: “an effect on, change or benefit to the economy, society, culture, public policy or services, health, the environment or quality of life, beyond academia”. This is in line with many of the forward-tracing case-based approaches. Because the definition emphasises more than economic effects, impact becomes a relevant concept for all scientific disciplines, including social science and the humanities. In addition, the definition opens up the possibility that not all effects of research are “positive” or “beneficial”, although the self-reported impact case approach is probably a poor method for getting at those types of effects.
Nevertheless, widening the concept also contributes to increasing the already severe measurement problems discussed earlier. By including a wider range of effects across a wider range of sectors, the challenges of attribution, effect indicators and causality are made more difficult rather than less. While improved measurement methods are developed based on a contemporary broad understanding of impact and its underlying values, there are still few signs of standardisation, and the difficulties in comparing numbers across different measures of impact remain substantial. It should be added that even with these challenges, recent approaches to measuring impact have made great progress in promoting a more nuanced and inclusive notion of impact.
However, despite their increased sophistication, most types of impact evaluations do not involve the society or user side to any major extent. The phenomena in the spotlight are the research units, the researchers’ efforts to engage with non-researchers, and the like. Users can be called upon to confirm impact cases or to answer questionnaires about how useful they found certain projects and publications, but little else. A weak (or non-existent) score on impact thereby first and foremost indicates failure on the research side, and that actions need to be taken in order to improve the usefulness, packaging and dissemination of the research.
In OSIRIS we believe this remains a flawed perspective. More likely is that the central preconditions for impact are intimately tied to the specific context of use, which in many cases will be beyond the control of researchers. It is therefore essential to understand the context of use of research, and these contexts are likely to differ significantly between societal sectors, types of impact and more. The OSIRIS approach to understanding the impact process from this perspective will be the topic of a later blog post.
This blog post has sought to present briefly the main traditional approaches for evaluating the impact of research beyond academia (for academic impact, bibliometrics has developed a number of powerful measurement tools). Our overall impression is that individual methods have become increasingly sophisticated and nuanced, but that the field of impact measurement remains fragmented with little communication between the main types of measurement. A deeper understanding of impact may require freeing the concept from the evaluation context, as is the goal of OSIRIS.
Written by Magnus Gulbrandsen and Richard Woolley, based on a short article that appeared in the Norwegian-language Indikatorrapporten 2017.