Benchmark of single-cell RNA-seq analysis workflows: evaluating scalability across R and Python
Author(s): Ilaria Billato, Gabriele Sales, Chiara Romualdi, Davide Risso
Affiliation(s): Department of Biology, University of Padova
The rapid growth of single-cell RNA-seq data has made analysis workflows increasingly computationally intensive, making it crucial to adopt more efficient algorithms and out-of-memory data representations. This study compares several workflows for single-cell data analysis, evaluating their performance and efficacy in the R and Python programming environments. We assess Seurat and Bioconductor in R, and Scanpy and rapids_singlecell in Python, focusing on their utilisation of Central Processing Units (CPUs) and Graphics Processing Units (GPUs) to optimise computational efficiency. Workflow scalability is assessed on real single-cell RNA-seq datasets: a mouse brain dataset of approximately 1.3 million cells, BE1, sc_mixology, and cord blood. We also assess concordance between workflows using the Rand index, computed from cell annotations and clustering results. The results reveal substantial differences in computational time across workflows and datasets, with GPU-accelerated approaches consistently outperforming CPU-based methods. The Rand indices show high concordance between workflows, supporting their reliability and consistency in identifying cell groups across diverse datasets. To guide users in setting up a pipeline within the Bioconductor framework, we also provide a vignette featuring the most efficient methods in terms of scalability, computational time, and memory usage, intended as a resource for researchers working with large single-cell RNA-seq datasets. This research emphasises the importance of efficient computational frameworks in handling the escalating demands of single-cell RNA-seq data analysis and offers guidance on selecting a suitable workflow.
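As an illustration of the concordance measure mentioned above, the following is a minimal sketch of the (unadjusted) Rand index between two labelings of the same cells, for example a reference cell annotation and the clusters produced by one workflow. The abstract does not state which implementation was used, or whether the plain or the adjusted Rand index was computed, so the function name `rand_index` and the toy labels are hypothetical and serve only to show the calculation.

```python
from collections import Counter
from math import comb


def rand_index(labels_a, labels_b):
    """Unadjusted Rand index between two labelings of the same cells.

    Hypothetical helper for illustration only; not the authors' code.
    Returns the fraction of cell pairs on which the two labelings agree,
    i.e. pairs grouped together in both or separated in both.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("Both labelings must cover the same cells.")
    n = len(labels_a)
    if n < 2:
        raise ValueError("At least two cells are needed to form a pair.")

    total_pairs = comb(n, 2)
    # Pairs placed in the same group by both labelings (contingency table cells).
    same_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed in the same group by each labeling separately.
    same_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    same_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    # Pairs separated by both labelings, by inclusion-exclusion.
    diff_both = total_pairs - same_a - same_b + same_both

    return (same_both + diff_both) / total_pairs


# Toy example: six cells, annotation vs. clusters from one workflow (made-up labels).
annotation = ["T", "T", "B", "B", "NK", "NK"]
clusters = [0, 0, 1, 2, 2, 2]
score = rand_index(annotation, clusters)
print(f"Rand index: {score:.2f}")
```

A value of 1 means the two labelings induce identical groupings, regardless of how the groups are named. Because the unadjusted index is not corrected for chance agreement, comparisons across datasets with different numbers of clusters should be read with that in mind.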