Defining cell identity beyond the premise of differential gene expression

Kim, Hani Jieun; Tam, Patrick P. L.; Yang, Pengyi

doi:10.1186/s13619-021-00083-7

Opinion
Open access
Published: 01 May 2021

Hani Jieun Kim^1,2,3,
Patrick P. L. Tam^4,5 &
Pengyi Yang^1,2,3,5

Cell Regeneration volume 10, Article number: 20 (2021)

Abstract

Identifying genes that define cell identity is a requisite step for characterising cell types and cell states and predicting cell fate choices. By far the most widely used approach for this task is based on differential expression (DE) of genes, whereby shifts in mean expression are used as the primary statistic for identifying gene transcripts that are specific to cell types and states. While DE-based methods are useful for pinpointing genes that discriminate cell types, their reliance on measuring differences in mean expression may not reflect the biological attributes of cell identity genes. Here, we highlight the quest for non-DE methods and provide an overview of these methods and their applications to identify genes that define cell identity and functionality.

Main text

Defining the identity of a cell is fundamental to cell biology research (Kotliar et al. 2019; Morris 2019; Wagner et al. 2016; Weinreb et al. 2020). Traditionally, histological and morphological assessment of cells, overlaid with immunohistochemical information, has enabled us to identify cell types with confidence. Bulk RNA-sequencing (RNA-seq) preceded by fluorescence-activated cell sorting (FACS) has further unveiled the global molecular characteristics of cell populations of interest. However, these approaches have been restricted to cell types with known marker genes, and bulk measurements have masked the underlying cellular heterogeneity. Recent technological advances in genome-wide profiling of single cells have enabled the unbiased exploration of cell identity, allowing discovery of known and unknown cell types at single-cell resolution. Yet inferring the identity of cells has become a renewed challenge as the expanding breadth and depth of single-cell omics data now provide an unprecedented lens into the complexities and nuances of cellular identities.

For some cell types, the computational task to infer cell identity on the basis of omics profiles alone may be relatively straightforward, requiring the evaluation of the expression of known marker genes. For rare or previously unknown cell types, defining the gene set that uniquely identifies the cell is a challenge in the absence of any prior knowledge. This raises an important question of how we could select genes that mark a cell’s identity, henceforth referred to as cell identity genes (CIGs).

Many methods have been devised to identify CIGs, among which the most popular approach is based on differential expression (DE) of genes. A host of tools have been developed for DE analysis of bulk RNA-seq data, such as DESeq2 (Love et al. 2014), edgeR (Robinson et al. 2009), and Limma (Ritchie et al. 2015), and many of them have been successfully applied to single-cell data. Recent methods designed for mining single-cell gene expression data (Delmans and Hemberg 2016; Finak et al. 2015; Kharchenko et al. 2014; Pierson and Yau 2015; Qiu et al. 2017; Vallejos et al. 2015) address some confounding aspects of the analysis of scRNA-seq data, such as technical noise arising from variation in cellular detection rate, and attempt to capture more nuanced differences in cell-to-cell heterogeneity. However, whether these approaches faithfully capture CIGs remains unknown.

A common feature among most current DE methods is their reliance on a specific model of gene expression, which overlooks the heterogeneous nature of gene expression between cells and limits the discovery of CIGs by placing restrictions on the distribution of the genes selected. Approaches based on the t-test (such as Limma) and MAST (Finak et al. 2015) assume a Gaussian distribution of gene expression; BASiCS, a Poisson distribution (Vallejos et al. 2015); and SCDE, a mixture of Poisson and negative binomial distributions (Kharchenko et al. 2014). To illustrate the potential caveat, DE methods based on the Student’s t-test, such as Limma (Ritchie et al. 2015), prioritise genes that are stably expressed (i.e., conform to a Gaussian distribution) in both the cell type of interest and other cell types, as long as there is a shift in mean expression. This means that any gene that does not follow this distribution is penalised, irrespective of whether it is critical to the identity of the cell, and that many marker genes identified by DE methods are simply more highly expressed in the cell type of interest than in the rest.
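To make this caveat concrete, the following minimal Python sketch computes Welch's t-statistic for two toy genes. It is an illustration under stated assumptions, not the procedure implemented by Limma or any other tool cited here: the function name `welch_t` and all expression values are hypothetical, and only the Python standard library is used.

```python
import math
import statistics


def welch_t(group_a, group_b):
    """Welch's t-statistic: difference in means divided by its standard error.

    Assumes each group has at least two cells and that the two sample
    variances are not both zero.
    """
    mean_diff = statistics.mean(group_a) - statistics.mean(group_b)
    std_err = math.sqrt(
        statistics.variance(group_a) / len(group_a)
        + statistics.variance(group_b) / len(group_b)
    )
    return mean_diff / std_err


# Hypothetical log-expression values, 10 cells per group (toy data).
# "Stable" gene: tightly distributed in both groups, small shift in mean.
stable_target = [5.0, 5.2, 4.8, 5.1, 4.9, 5.0, 5.2, 4.8, 5.1, 4.9]
stable_other = [4.0, 4.2, 3.8, 4.1, 3.9, 4.0, 4.2, 3.8, 4.1, 3.9]

# "Switch-like" gene: on in 40% of target cells, never detected elsewhere.
switch_target = [0.0] * 6 + [8.0] * 4
switch_other = [0.0] * 10

stable_shift = statistics.mean(stable_target) - statistics.mean(stable_other)
switch_shift = statistics.mean(switch_target) - statistics.mean(switch_other)
t_stable = welch_t(stable_target, stable_other)
t_switch = welch_t(switch_target, switch_other)

print(f"stable gene: mean shift {stable_shift:.2f}, t = {t_stable:.2f}")
print(f"switch gene: mean shift {switch_shift:.2f}, t = {t_switch:.2f}")
```

On these toy values the stable gene scores a t-statistic of roughly 15, whereas the switch-like gene scores roughly 2.4, even though the latter has the larger difference in mean expression (3.2 versus 1.0) and is never detected outside the cell type of interest. A ranking based on the t-statistic would therefore place the stable gene first.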

Recently, methods based on new statistical metrics have been developed. These methods break away from detecting genes on the basis of shifts in means and attempt to capture more subtle differences in gene expression. For example, scDD is a blanket approach that detects differential proportion (DP), differential modality (DM), and bimodal distribution (BD), as well as DE (Korthauer et al. 2015). Non-parametric approaches are rarer, with model-free methods developed to find differential genes (Li and Tibshirani 2013; Tiberi et al. 2020). These metrics prioritise genes that are differentially distributed (DD) as opposed to those that are differentially expressed (Fig. 1). Whilst these methods are yet to be vetted in terms of their ability to prioritise genes that are robust in defining cell identities, they present new avenues for researchers to go beyond finding genes that are most distinctively expressed in cells to those that may be more relevant to the identity of the cell and its phenotype.
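The distinction between differential expression and differential distribution can be sketched with a toy gene whose mean is identical in two groups of cells but whose distribution is not. The example below is a hedged illustration only: it uses the two-sample Kolmogorov-Smirnov statistic as a simple model-free stand-in and does not reproduce the algorithms of scDD or distinct. The helper names `ks_statistic` and `fraction_expressing` and all expression values are hypothetical.

```python
from bisect import bisect_right
import statistics


def ks_statistic(group_a, group_b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest absolute gap between the empirical cumulative
    distribution functions of the two groups.
    """
    sorted_a, sorted_b = sorted(group_a), sorted(group_b)
    largest_gap = 0.0
    for x in sorted_a + sorted_b:
        cdf_a = bisect_right(sorted_a, x) / len(sorted_a)
        cdf_b = bisect_right(sorted_b, x) / len(sorted_b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


def fraction_expressing(values, threshold=0.0):
    """Proportion of cells whose expression exceeds the threshold."""
    return sum(v > threshold for v in values) / len(values)


# Hypothetical log-expression values for one gene, 10 cells per group.
# Both groups have a mean of exactly 4.0.
unimodal = [3.0, 3.5, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.5, 5.0]
bimodal = [0.0] * 5 + [8.0] * 5

mean_shift = statistics.mean(unimodal) - statistics.mean(bimodal)
ks = ks_statistic(unimodal, bimodal)
proportion_gap = fraction_expressing(unimodal) - fraction_expressing(bimodal)

print(f"shift in mean:           {mean_shift}")
print(f"KS statistic:            {ks}")
print(f"gap in expressing cells: {proportion_gap}")
```

For these toy values the shift in mean is 0.0, so a test based on means alone has nothing to detect, while the KS statistic and the gap in the proportion of expressing cells are both 0.5. The gene is differentially distributed without being differentially expressed in the mean-shift sense.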

A biological read-out that accurately captures cellular attributes would not only enhance our ability to assign cellular identities but also open up a plethora of possibilities to investigate complex systems where assigning cellular identities is inherently more challenging. First, the availability of a comprehensive set of cell identity read-outs encompassing a wide range of cell types would enable a data-driven approach to accurately predict and quantitatively investigate new cell types. This kind of computational approach would not be limited to analysing new cell types but may be used to analyse how cell states are affected by disease or perturbations, capturing the nuanced changes in the omics that would affect the overall phenotype of the cell. Second, assignment of cell identities to discrete cellular states, whilst of great importance, provides only a partial answer towards the greater goal of mapping all cellular states. Cells dynamically transition between discrete cell states or cell types, and this developmental landscape, as depicted in Waddington’s model, illustrates the spectrum of states in which a cell may lie. Identifying the CIGs that define these transitional cell states will help us perform a much deeper analysis of cell identity characterisation and lineage differentiation.

In conclusion, with rapidly advancing single-cell technologies, the development of new computational methods that faithfully capture the CIGs most relevant to the identity of cells is critical to advancing our knowledge of cellular identity. The selection of CIGs has major implications for a range of downstream single-cell computational applications, and oftentimes the biological interpretation hinges on the outcome of these downstream analyses. We hope that enhancing our ability to identify CIGs will invigorate new research into the factors of cell identity and help realise the potential of single-cell analytics technologies to pinpoint functional attributes that are relevant to the cellular phenotype.

Abbreviations

scRNA-seq: Single-cell RNA sequencing
CIG: Cell identity gene
DD: Differential distribution
DE: Differential expression
DP: Differential proportion
DM: Differential modality

References

Delmans M, Hemberg M. Discrete distributional differential expression (D3E) - a tool for gene expression analysis of single-cell RNA-seq data. BMC Bioinformatics. 2016;17(1):110. https://doi.org/10.1186/s12859-016-0944-6.
Finak G, McDavid A, Yajima M, Deng J, Gersuk V, Shalek AK, et al. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biol. 2015;16(1):278. https://doi.org/10.1186/s13059-015-0844-5.
Kharchenko PV, Silberstein L, Scadden DT. Bayesian approach to single-cell differential expression analysis. Nat Methods. 2014;11(7):740–2. https://doi.org/10.1038/nmeth.2967.
Korthauer K, Chu L-F, Newton M, Li Y, Thomson J, Stewart R, et al. scDD: a statistical approach for identifying differential distributions in single-cell RNA-seq experiments. Genome Biol. 2015;17:222.
Kotliar D, Veres A, Nagy MA, Tabrizi S, Hodis E, Melton DA, et al. Identifying gene expression programs of cell-type identity and cellular activity with single-cell RNA-Seq. Elife. 2019;8:e43803. https://doi.org/10.7554/eLife.43803.
Li J, Tibshirani R. Finding consistent patterns: a nonparametric approach for identifying differential expression in RNA-Seq data. Stat Methods Med Res. 2013;22(5):519–36. https://doi.org/10.1177/0962280211428386.
Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15(12):550. https://doi.org/10.1186/s13059-014-0550-8.
Morris SA. The evolving concept of cell identity in the single cell era. Dev. 2019;146:dev169748.
Pierson E, Yau C. ZIFA: dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome Biol. 2015;16:241.
Qiu X, Hill A, Packer J, Lin D, Ma YA, Trapnell C. Single-cell mRNA quantification and differential analysis with census. Nat Methods. 2017;14(3):309–15. https://doi.org/10.1038/nmeth.4150.
Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, et al. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015;43(7):e47. https://doi.org/10.1093/nar/gkv007.
Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2009;550:139–40.
Tiberi S, Crowell HL, Weber LM, Samartsidis P, Robinson MD. distinct: a novel approach to differential distribution analyses. bioRxiv. 2020. https://doi.org/10.1101/2020.11.24.394213.
Vallejos CA, Marioni JC, Richardson S. BASiCS: Bayesian analysis of single-cell sequencing data. PLoS Comput Biol. 2015;11(6):e1004333. https://doi.org/10.1371/journal.pcbi.1004333.
Wagner A, Regev A, Yosef N. Revealing the vectors of cellular identity with single-cell genomics. Nat Biotechnol. 2016;34(11):1145–60. https://doi.org/10.1038/nbt.3711.
Weinreb C, Rodriguez-Fraticelli A, Camargo FD, Klein AM. Lineage tracing on transcriptional landscapes links state to fate during differentiation. Science. 2020;367:eaaw3381.

Acknowledgments

The authors thank all their colleagues, particularly at the School of Mathematics and Statistics, The University of Sydney, and Sydney Precision Bioinformatics Alliance for their support and intellectual engagement.

Funding

This work is supported by a National Health and Medical Research Council (NHMRC) Investigator Grant (1173469) to P.Y., an NHMRC Research Fellowship (1110751) to P.T., and an Australian Research Council (ARC) Postgraduate Research Scholarship and Children’s Medical Research Institute Postgraduate Scholarship to H.J.K.

Author information

Authors and Affiliations

School of Mathematics and Statistics, The University of Sydney, Sydney, NSW, 2006, Australia
Hani Jieun Kim & Pengyi Yang
Computational Systems Biology Group, Children’s Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW, 2145, Australia
Hani Jieun Kim & Pengyi Yang
Charles Perkins Centre, The University of Sydney, Sydney, NSW, 2006, Australia
Hani Jieun Kim & Pengyi Yang
Embryology Unit, Children’s Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW, 2145, Australia
Patrick P. L. Tam
School of Medical Science, Faculty of Medicine and Health, The University of Sydney, Sydney, NSW, 2006, Australia
Patrick P. L. Tam & Pengyi Yang

Contributions

P.Y. and H.J.K. conceptualised the work. H.J.K., P.Y., and P.P.L.T. wrote and approved the manuscript.

Corresponding author

Correspondence to Pengyi Yang.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Kim, H.J., Tam, P.P.L. & Yang, P. Defining cell identity beyond the premise of differential gene expression. Cell Regen 10, 20 (2021). https://doi.org/10.1186/s13619-021-00083-7

