Researchers have developed a new way to identify and correct for a type of systematic statistical error in large datasets.
The error, known as a batch effect, crops up when research data is generated in batches – for example, at different times, using different machines, or in different labs.
Because conditions are unlikely to be exactly the same from batch to batch, this can introduce differences into the data that can be mistaken for biologically significant results, or mask them entirely.
The team of researchers, from The Institute of Cancer Research, London, created a package of statistical software, called exploBATCH, that can evaluate a dataset for these batch effects and correct for them – making it the first systematic method for quantifying batch effects.
They have made exploBATCH available to download for free on the website GitHub.
Three-step system
In a paper published in the journal Scientific Reports, the research team outlined how exploBATCH works.
Firstly, each dataset is pre-processed to check for errors, and to make sure that individual data points are labelled and recorded consistently from one batch to the next.
Then, a process called findBATCH looks for batch effects in the combined datasets, and works out how strong the effects are. This process uses a mathematical technique called probabilistic principal component and covariates analysis (PPCCA), which the paper’s first author, Dr Gift Nyamundanda, previously helped to develop.
Finally, if batch effects are found, another process called correctBATCH removes the effect from each affected point.
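The detect-then-correct idea behind these steps can be illustrated with a much simpler sketch in Python. This is not the exploBATCH code and it does not implement PPCCA: it assumes simulated data, uses ordinary principal component analysis to look for batch structure, and corrects by centring each batch on the overall mean. All function names here are hypothetical and chosen only for illustration.

```python
import numpy as np


def simulate_batches(n_per_batch=30, n_genes=50, shift=2.0, seed=0):
    """Simulated data (an assumption): two batches, the second shifted on every gene."""
    rng = np.random.default_rng(seed)
    batch = np.repeat([0, 1], n_per_batch)
    data = rng.normal(size=(2 * n_per_batch, n_genes))
    data[batch == 1] += shift  # additive batch effect
    return data, batch


def pc_scores(data, n_components=2):
    """Project samples onto their leading principal components (plain PCA via SVD)."""
    centred = data - data.mean(axis=0)
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def variance_explained_by_batch(scores, batch):
    """Per component: share of variance explained by batch membership (between / total)."""
    grand_mean = scores.mean(axis=0)
    total = ((scores - grand_mean) ** 2).sum(axis=0)
    between = np.zeros(scores.shape[1])
    for b in np.unique(batch):
        group = scores[batch == b]
        between += len(group) * (group.mean(axis=0) - grand_mean) ** 2
    return between / total


def centre_batches(data, batch):
    """Naive correction: move every batch's per-gene mean onto the overall mean."""
    corrected = data.astype(float).copy()
    overall = data.mean(axis=0)
    for b in np.unique(batch):
        mask = batch == b
        corrected[mask] += overall - data[mask].mean(axis=0)
    return corrected


data, batch = simulate_batches()
before = variance_explained_by_batch(pc_scores(data), batch)
after = variance_explained_by_batch(pc_scores(centre_batches(data, batch)), batch)
print("share of PC variance explained by batch, before:", before)
print("share of PC variance explained by batch, after: ", after)
```

In this toy setting the first principal component is dominated by the batch shift before correction, and the batch-explained share falls to essentially zero afterwards. Simple centring of this kind would also remove any biological signal that happens to differ between batches, which is the problem a covariate-aware approach is meant to avoid.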
Explore batch effect (exploBATCH) is a package for discovering and correcting batch effects. It is based on probabilistic principal component and covariates analysis (PPCCA).
Vital test for error correction
The team demonstrated their technique by applying it to data from studies on breast cancer and bowel cancer.
The breast cancer data came from three different studies totalling 70 samples. When combined into one dataset, the data points clustered according to the study they came from, but exploBATCH was able to identify and correct for this batch effect.
With the bowel cancer data, the researchers showed that the software could distinguish between batch effects and ‘biological effects’ – where data from tumour samples tends to cluster together away from data from ‘normal’ cells.
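One way to picture the difference between a batch effect and a biological effect is a linear model that estimates both at once and then subtracts only the batch term. The Python sketch below is an illustration under stated assumptions, not the method used in exploBATCH: the data are simulated, the design is deliberately balanced (equal numbers of tumour and normal samples in each batch), and the function names are hypothetical.

```python
import numpy as np


def simulate_two_effects(n=80, n_genes=20, batch_shift=3.0, tumour_shift=2.0, seed=1):
    """Simulated data (an assumption): a batch shift on all genes, a tumour shift on five."""
    rng = np.random.default_rng(seed)
    batch = np.tile([0, 0, 1, 1], n // 4)   # balanced: each batch holds both sample types
    tumour = np.tile([0, 1, 0, 1], n // 4)
    data = rng.normal(size=(n, n_genes))
    data += batch_shift * batch[:, None]               # technical effect
    data[:, :5] += tumour_shift * tumour[:, None]      # biological effect
    return data, batch, tumour


def group_gap(data, labels):
    """Per-gene difference in means between the two groups coded 1 and 0."""
    return data[labels == 1].mean(axis=0) - data[labels == 0].mean(axis=0)


def remove_batch_keep_biology(data, batch, tumour):
    """Fit intercept + batch + tumour per gene, then subtract only the fitted batch term."""
    design = np.column_stack([np.ones(len(batch)), batch, tumour]).astype(float)
    coef, *_ = np.linalg.lstsq(design, data, rcond=None)
    return data - np.outer(batch, coef[1])  # coef[1] holds the per-gene batch coefficients


data, batch, tumour = simulate_two_effects()
corrected = remove_batch_keep_biology(data, batch, tumour)
print("largest batch gap after correction:", np.abs(group_gap(corrected, batch)).max())
print("mean tumour gap in the five affected genes:", group_gap(corrected, tumour)[:5].mean())
```

Because the simulated design is balanced, the batch gap disappears while the tumour-versus-normal gap is left untouched. Real datasets are rarely this tidy, and when sample type is unevenly spread across batches the two effects become harder to separate.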
This ability to differentiate statistical artefacts from scientifically interesting results is vital for any error-correcting software.
In both cases, the exploBATCH software was as good as, or better than, two of the current standard techniques for correcting batch effects – ComBat and principal component analysis.
Dr Anguraj Sadanandam, Leader of the ICR’s Systems and Precision Cancer Medicine Team, led the research. He said: “As genome projects continue to generate large-scale datasets, often produced by different teams in different laboratories, the importance of correcting for batch effects has never been greater.
“exploBATCH is a new approach to this – able to correct for batch effects while, crucially, retaining biological effects.
“We’re pleased to make it available on GitHub so that others can use it in their own research.”