
Project
TReMorS: Transparent and Reproducible Modern Science
Mapping the presence and impact of open source software in research with large data and AI.
To what extent are software creation and code sharing practices promoting or hindering R&D in academia, and what can be done to fully harness them? IGL’s new research project, funded by the UKRI Metascience Research programme, will answer these questions using large data gathered from publications and code repositories.
Background
Researchers are increasingly using data and computational methods to power their work. From natural sciences to humanities, creating and analysing data using digital tools has become a daily component of research. Applications like spreadsheets or graphing software have played an important role in this transition, but code is steadily rising to become the most important software tool that academics have at their disposal.
For researchers, programming languages such as Python or R offer a number of advantages over graphical software designed for specific uses. Code offers more flexibility, providing access to a multitude of methods, packages and interfaces for the entire research pipeline. It also offers the ability to handle large-scale data and modelling. A researcher could, in theory, program an array of sensors to monitor a natural environment, use machine learning to detect important signals in the data collected, and then crunch the data with routine analysis to produce the results of their study.
Code is also theoretically well-suited for sharing with others. This means that code isn’t just a tool to perform a study. When used to power a research project, it also becomes a valuable output of the work for the academic community and beyond. In particular, it can be useful for:
- Scrutiny – code can be examined by other researchers to check for mistakes in the execution of a methodology.
- Reproducibility – code can be tweaked and modified, or run with new data, to investigate the reproducibility of results.
- Learning – code made available to other researchers provides them with knowledge on how to use and execute methods that are new to them.
- Reuse – code can be adapted to new data or research problems by the original author or other researchers without the need to start from scratch.
The importance of publishing open source code is emphasised by the requirement of some journals and funders that researchers publish their code alongside their findings.
However, despite the potential for open source code to advance the quality, efficiency and creative aspects of science, the reality is that this potential is not being reached. Codebases from research projects are not always published, and when they are, their quality is highly variable. This creates a drag on research as a whole.
Research
Understanding the availability and quality of open source code in academia is an important step towards developing policies that support its production and boost open research more broadly.
The existing research in this area strongly suggests that practices vary widely across fields, but it is less clear about the impact of that heterogeneity.
In this project we are asking three questions:
- What are the characteristics of the landscape of open source software in research?
- How do the availability and quality of open source code impact research productivity?
- What are the factors associated with open source code publishing practices in research?
We plan to answer these questions through a data-led study, collecting publication information from OpenAlex and code from repositories such as GitHub. This approach will complement existing studies, which are mostly small in scale, focusing on a limited number of publications and code repositories in a specific field. Using natural language processing and LLMs, we will link papers and code, and extract indicators of code quality. These will be used to map activity and draw links between policies of funders and publishers and code outputs.
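To give a sense of what linking papers and code can involve, here is a minimal sketch of one simple baseline: scanning a paper's text for GitHub repository links. This is an illustration only and not the project's method, which will rely on natural language processing and LLMs. The pattern `GITHUB_REPO` and the function `extract_repos` are hypothetical names introduced for this example, and the sample text is made up.

```python
import re

# Hypothetical pattern (an assumption for this sketch): matches
# github.com/<owner>/<repo>, with or without a URL scheme.
GITHUB_REPO = re.compile(
    r"(?:https?://)?(?:www\.)?github\.com/([A-Za-z0-9-]+)/([A-Za-z0-9._-]+)",
    re.IGNORECASE,
)


def extract_repos(text: str) -> list[str]:
    """Return unique 'owner/repo' slugs found in text, in order of appearance."""
    seen: list[str] = []
    for owner, repo in GITHUB_REPO.findall(text):
        repo = repo.rstrip(".")      # drop a sentence-ending full stop
        if repo.endswith(".git"):    # drop a clone-URL suffix
            repo = repo[:-4]
        slug = f"{owner}/{repo}".lower()  # lowercase so duplicates collapse
        if slug not in seen:
            seen.append(slug)
    return seen


# Made-up example text, for illustration only.
sample = (
    "Code is available at https://github.com/Example-Lab/climate-model. "
    "See also github.com/example-lab/climate-model.git and "
    "https://github.com/other/tool_kit."
)
print(extract_repos(sample))
```

A pattern like this only finds explicit links. Papers that mention software by name, or that host code elsewhere, would need the richer language-based approaches described above.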
Project team
- George Richardson, Head of Data Science and Technology
- David Ampudia, Senior Data Scientist
- Christopher Edgar, Data Scientist