Data Loading and Description:
Our simulated dataset consists of 100 samples per group (case/control) and 2,000 total features (genes). The bulk level data comprises of six distinct cell types (Celltype1 - Celltype6), each has specific underlying cell-specific gene expression patterns. Additionally, 5% of the cell-specific Differentially Expressed Genes (csDEGs) are simulated for each cell type. This results in 100 csDEGs per cell type and a total of 600 csDEGs overall. The Log-Fold-Changes (LFC) for these csDEGs are simulated from a normal distribution with a mean of 1.5 and standard deviation of 0.2, i.e., LFC ~N(1.5, 0.2).
The RNAseq_final_count
dataset is a matrix containing
simulated raw RNA-sequencing count data with 2000 genes and 200
samples.
The sample_CT_prop
and est_CT_prop
datasets
represent sample-specific true cell type abundance and estimated cell
type abundance, respectively. The true cell type abundance is generated
from a Dirichlet distribution, while the estimated cell type
abundance is derived from CIBERSORT.
In this session, we will demonstrate how to conduct csDEGs analysis using eight different methods: TOAST, CellDMC, TCA, CARseq, DESeq2, CeDAR, csSAM, and LRCDE. For a detailed comparison of these eight methods, please refer to this comprehensive benchmark study: A comprehensive assessment of cell type-specific differential expression methods in bulk data
For your reference, we have included gene_CT_DE_connect
,
which contains the true label information for csDEGs. Obtaining such
ground-truth information is not always feasible in a real research
setting. However, if you are interested in evaluating the performance of
your models based on metrics such as sensitivities and true discovery
rates after classification, this dataframe can be helpful. The
gene.id
variable indicates the gene index (from gene1 to
gene2000), while the csDE.True
variable shows the locations
of DEGs (Celltype1-Celltype6). A value of 1 indicates that the DEG is
located in Celltype 1, a value of 2 indicates Celltype 2, and so on. A
value of 0 means that the gene is not a DEG.
library(TOAST) #TOAST/CeDAR
library(EpiDISH) #CellDMC
library(TCA) # TCA
library(CARseq) #CARseq
library(DESeq2) #DESeq2
library("lrcde")
library(csSAM)
library(dplyr)
library(lubridate)
library(tidyr)
library("recount")
library(data.table)
library(GEOquery)
library(stringr)
library(biomaRt)
library(org.Hs.eg.db)
library(annotate)
library(EnsDb.Hsapiens.v86) # 1nd method match gene symbols
library(annotables) # 2nd method match gene symbols
library(AnnotationDbi) # 3rd method match gene symbols
library(immunedeconv)
library(IOBR)
# Data Simulated Load ----
load("Sim_rawRNAseq_N100_LFC1.5_0.2.Rda")
Here lets take a glance of the RNA-seq count data:
# The first 10 genes and 10 samples
print(RNAseq_final_count[1:10,1:10])
## control1 case1 control2 case2 control3 case3 control4 case4 control5 case5
## 1 305 442 255 588 235 456 228 617 298 446
## 2 131 205 137 261 120 222 109 210 114 245
## 3 434 325 399 290 603 292 466 260 426 283
## 4 73 67 40 93 53 67 56 94 47 82
## 5 26 32 21 54 22 59 26 53 22 41
## 6 81 135 88 125 102 96 106 98 78 102
## 7 5 4 14 9 11 13 12 11 13 13
## 8 11 12 9 26 12 19 13 13 6 21
## 9 146 155 131 159 155 135 141 159 131 161
## 10 158 139 208 323 111 133 180 121 172 130
True csDEGs labels information:
# True DEG label and expression strata information
gene_CT_DE_connect <- gene_CT_DE_connect %>% dplyr::rename(gene.id = Gene_ID) %>%
dplyr::select(gene.id, csDE.True)
print(gene_CT_DE_connect[c(1:3,101:103,201:203,301:303,401:403,501:503,601:603),])
## gene.id csDE.True
## 1 1 1
## 2 2 1
## 3 3 1
## 101 101 2
## 102 102 2
## 103 103 2
## 201 201 3
## 202 202 3
## 203 203 3
## 301 301 4
## 302 302 4
## 303 303 4
## 401 401 5
## 402 402 5
## 403 403 5
## 501 501 6
## 502 502 6
## 503 503 6
## 601 601 0
## 602 602 0
## 603 603 0
TOAST
Before running the algorithm, it’s important to set up the working pipelines, which should include sample phenotypical information, covariates information, cell proportions, and other relevant factors. If available and meaningful for the study, users can also customize their input by including additional information such as age, sex, and gender in the model. However, for our purposes, we will focus solely on detecting csDEGs in case and control groups.
makeDesign: generates linear model design matrix and prepare for model fitting and statistical testing. It requires two inputs: design and Prop. design is a samples by phenotypes (age, gender, disease, etc.) matrix. Prop is a samples by cell type proportion matrix.
fitModel: fits the model including all cell type proportion matrix and its interaction with phenotypes. This function requires two inputs: Design_out and Y. Design_out is the output from makeDesign and Y is a gene by sample raw RNAseq count data matrix.
csTest: conducts statistical tests for specified phenotype with respect to each/all cell type(s). It requires fitted_model input which is the output from fitModel. Users could specify phenotype names (e.g. “disease”,“gender”) into coef. If cell_type is set to “NULL” or specified as “ALL”, compound effect of coefficients in all cell types will be tested; or user could specify one specific cell type (e.g. “cell1”) to be tested.
sim.design <- data.frame(disease = factor(rep(c(1,2),100))
#,gender = as.factor(rep(c("M","F"),each = 100))
)
sim.Design_out <- makeDesign(design = sim.design, Prop = est_CT_prop)
sim.fitted_model_strata <- fitModel(Design_out = sim.Design_out,
Y = as.matrix(RNAseq_final_count))
sim.res.TOAST_strata <- csTest(fitted_model = sim.fitted_model_strata,
coef = "disease",
cell_type = NULL, verbose = F, sort = T)
# Return a list and each element contains a matrix including the results from
# testing the phenotype in specified cell type(s). Below show the results for
# celltype 1 for the first 10 genes ordered by adjusted p value.
names(sim.res.TOAST_strata)
## [1] "Celltype1" "Celltype2" "Celltype3" "Celltype4" "Celltype5" "Celltype6"
## [7] "joint"
sim.res.TOAST_strata$Celltype1[1:10,]
## beta beta_var mu effect_size f_statistics p_value
## 83 10843.656 18101.836 2076.3503 1.446172 6495.743 9.758826e-148
## 60 5915.421 9510.757 1261.6126 1.401983 3679.224 2.139365e-125
## 16 9373.780 24728.101 1563.0064 1.499830 3553.356 4.802484e-124
## 38 3828.761 4496.013 766.1246 1.428373 3260.535 1.022123e-120
## 89 8307.551 21485.076 2841.6309 1.187572 3212.249 3.848863e-120
## 80 4162.086 7668.555 919.4395 1.387139 2258.960 1.050431e-106
## 35 2962.700 4520.129 570.1162 1.444187 1941.889 4.891123e-101
## 19 2260.829 3369.388 884.2274 1.122199 1516.996 6.000874e-92
## 85 3266.375 7198.074 590.2584 1.469060 1482.231 4.166486e-91
## 10 1544.094 1777.648 360.6128 1.363246 1341.225 1.670974e-87
## fdr
## 83 1.951765e-144
## 60 2.139365e-122
## 16 3.201656e-121
## 38 5.110613e-118
## 89 1.539545e-117
## 80 3.501436e-104
## 35 1.397464e-98
## 19 1.500219e-89
## 85 9.258858e-89
## 10 3.341947e-85
The outcomes consist of matrices that contain the results of testing the phenotype (disease) in specific cell types. We then extract and summarize the cell type-specific testing results, including the adjusted p-values, from the output. Researchers have the option to customize the p-value threshold for identifying csDEGs for each cell type.
aa <- lapply(seq(1:6), function(i,x) {
x[[i]]$gene.id <- (rownames(x[[i]]))
x[[i]] <- x[[i]] %>% dplyr::select(gene.id,fdr)
names(x[[i]]) <- c("gene.id",names(x)[i])
assign(names(x)[i],x[[i]], envir=.GlobalEnv)
},x=sim.res.TOAST_strata)
remove(aa)
TOAST.outputs <- Reduce(function(x,y) merge(x = x, y = y, by = "gene.id",all.x =T),
list(Celltype1, Celltype2, Celltype3,
Celltype4, Celltype5, Celltype6,
gene_CT_DE_connect))
TOAST.outputs = TOAST.outputs[order(as.numeric(TOAST.outputs$gene.id)),]
rownames(TOAST.outputs) = TOAST.outputs$gene.id
print(TOAST.outputs[c(1:5,101:105,601:605),])
## gene.id Celltype1 Celltype2 Celltype3 Celltype4 Celltype5
## 1 1 1.029246e-16 7.813553e-01 0.727374367 1.132910e-01 2.711014e-35
## 2 2 2.019897e-31 4.016448e-03 0.229712533 1.415307e-02 1.048191e-01
## 3 3 1.514625e-07 2.205288e-14 0.076707649 2.829022e-01 5.457391e-06
## 4 4 9.394208e-02 5.410599e-01 0.886704734 5.180522e-01 1.645431e-03
## 5 5 3.736756e-06 5.314524e-01 0.814041361 2.465622e-01 1.580457e-01
## 101 101 2.803460e-01 2.027197e-06 0.143076203 2.987690e-01 2.253781e-02
## 102 102 3.204534e-02 6.070946e-18 0.719043851 3.131572e-02 1.073527e-01
## 103 103 7.071813e-01 6.080519e-14 0.996174877 9.862629e-01 3.742248e-13
## 104 104 5.911943e-01 1.692564e-11 0.125043947 1.017404e-01 2.535746e-01
## 105 105 8.059386e-01 1.374500e-49 0.010819419 1.905337e-02 8.970970e-03
## 601 601 9.652161e-01 4.806782e-01 0.855788854 6.466708e-06 3.822834e-01
## 602 602 2.755434e-02 3.314457e-01 0.007801243 6.015588e-01 1.261172e-03
## 603 603 7.997659e-01 3.001327e-01 0.982863589 1.825545e-03 1.785221e-03
## 604 604 4.176650e-01 8.729935e-01 0.725130106 1.806278e-04 1.091866e-01
## 605 605 6.951382e-01 9.038723e-01 0.946732381 6.547847e-01 5.795991e-01
## Celltype6 csDE.True
## 1 0.8162655314 1
## 2 0.0020267402 1
## 3 0.6134432745 1
## 4 0.9993615717 1
## 5 0.5492940034 1
## 101 0.0846483423 2
## 102 0.2165157114 2
## 103 0.5125835458 2
## 104 0.5806145448 2
## 105 0.5125835458 2
## 601 0.3341435694 0
## 602 0.5558696323 0
## 603 0.0002083174 0
## 604 0.5829995797 0
## 605 0.5572015028 0
CellDMC
CellDMC is a tool specifically designed for identifying cell-type-specific differentially methylated (csDM) genes in Epigenome-Wide Association Studies (EWAS). To use the tool, users must specify three input files: beta.m (a matrix with genes in rows and samples in columns), pheno.v (a vector with the phenotype information, e.g., 0 for healthy and 1 for disease), and frac.m (a matrix with cell abundances for each sample, where the row sums should be close to or equal to 1). The row order in frac.m must be identical to the column order in beta.m
# Run Model
pheno.v <- rep(c(0, 1), times = 100)
celldmc.o <- CellDMC(beta.m = RNAseq_final_count, pheno.v = pheno.v,
frac.m = est_CT_prop)
# CellDMC returns a list. Results for each cell type are nested under "coe"
The output of the analysis is a list that includes two components: “dmct” and “coe.” The “dmct” component is a matrix that shows which genes are DE or not and indicates whether a gene is a csDEG for a specific cell type. The first column of “dmct” indicates whether a gene is a DEG (1) or not (0). The remaining columns correspond to each cell type, and a value of 1 indicates that the gene is over-expressed compared to the control group, while a value of -1 indicates that the gene is under-expressed. A value of 0 indicates that the gene is not a csDEG. The rows of “dmct” are ordered in the same way as those in the input beta.m.
names(celldmc.o)
## [1] "dmct" "coe"
print(celldmc.o$dmct[c(1:5,101:105,301:305,701:710),])
## DMC Celltype1 Celltype2 Celltype3 Celltype4 Celltype5 Celltype6
## 1 1 1 0 0 0 1 0
## 2 1 1 -1 0 -1 0 1
## 3 1 1 -1 0 0 -1 0
## 4 1 0 0 0 0 1 0
## 5 1 1 0 0 0 0 0
## 101 1 0 1 0 0 1 0
## 102 1 1 1 0 1 0 0
## 103 1 0 1 0 0 -1 0
## 104 1 0 1 0 0 0 0
## 105 1 0 1 1 1 1 0
## 301 1 0 0 0 1 0 0
## 302 1 0 0 0 1 -1 0
## 303 1 0 0 0 1 0 0
## 304 1 0 0 0 1 0 0
## 305 1 0 0 0 1 0 0
## 701 0 0 0 0 0 0 0
## 702 0 0 0 0 0 0 0
## 703 0 0 0 0 0 0 0
## 704 1 0 0 0 0 1 0
## 705 1 0 0 -1 0 1 0
## 706 0 0 0 0 0 0 0
## 707 1 0 0 0 -1 -1 1
## 708 0 0 0 0 0 0 0
## 709 1 0 0 0 0 1 -1
## 710 1 0 0 0 0 1 0
The second component of the output, “coe,” is a list that contains multiple data frames, each corresponding to a specific cell type. Each data frame includes the estimated RNAseq changes (“Estimate”), the standard error (“SE”), multiple hypothesis-corrected p-values, and other relevant information.
names(celldmc.o$coe)
## [1] "Celltype1" "Celltype2" "Celltype3" "Celltype4" "Celltype5" "Celltype6"
head(celldmc.o$coe$Celltype1,10)
## Estimate SE t p adjP
## 1 720.89235 74.72128 9.647752 3.808212e-18 1.015523e-16
## 2 647.38874 44.00735 14.710922 4.241783e-33 2.019897e-31
## 3 370.35494 61.95166 5.978128 1.113249e-08 1.494294e-07
## 4 52.18125 23.12261 2.256719 2.517648e-02 9.307385e-02
## 5 89.53427 16.89504 5.299440 3.232294e-07 3.694050e-06
## 6 70.88153 30.16143 2.350072 1.980672e-02 7.722269e-02
## 7 28.55231 10.44652 2.733189 6.871234e-03 3.327474e-02
## 8 61.52042 9.88528 6.223437 3.092587e-09 4.482011e-08
## 9 151.73516 37.81378 4.012695 8.660082e-05 6.661602e-04
## 10 1544.09394 42.16216 36.622742 1.670974e-87 3.341947e-85
Pull out and summarize cs-testing results (adjusted p-value and csDEGs):
CellDMC.outputs <- as.data.frame(
cbind(celldmc.o[["coe"]][["Celltype1"]][["adjP"]],
celldmc.o[["coe"]][["Celltype2"]][["adjP"]],
celldmc.o[["coe"]][["Celltype3"]][["adjP"]],
celldmc.o[["coe"]][["Celltype4"]][["adjP"]],
celldmc.o[["coe"]][["Celltype5"]][["adjP"]],
celldmc.o[["coe"]][["Celltype6"]][["adjP"]]))
colnames(CellDMC.outputs)[1:6] <- paste0("Celltype",seq(1:6))
CellDMC.outputs$gene.id <- rownames(celldmc.o[["coe"]][["Celltype1"]])
CellDMC.outputs <- CellDMC.outputs[,c(7,1:6)]
CellDMC.outputs <- merge(x = CellDMC.outputs, y = gene_CT_DE_connect,
by = "gene.id",all.x =T)
###
CellDMC.outputs = CellDMC.outputs[order(as.numeric(CellDMC.outputs$gene.id)),]
rownames(CellDMC.outputs) = CellDMC.outputs$gene.id
head(CellDMC.outputs,10)
## gene.id Celltype1 Celltype2 Celltype3 Celltype4 Celltype5
## 1 1 1.015523e-16 7.663098e-01 0.71444741 0.11105202 2.711014e-35
## 2 2 2.019897e-31 3.963368e-03 0.22519853 0.01404364 1.034232e-01
## 3 3 1.494294e-07 2.205288e-14 0.07587987 0.27402185 5.441754e-06
## 4 4 9.307385e-02 5.283095e-01 0.87677458 0.50927162 1.638604e-03
## 5 5 3.694050e-06 5.180646e-01 0.80012095 0.23975269 1.556115e-01
## 6 6 7.722269e-02 6.271143e-01 0.85896731 0.24238137 9.971349e-01
## 7 7 3.327474e-02 9.958517e-01 0.89451333 0.58985647 2.582935e-01
## 8 8 4.482011e-08 5.104551e-01 0.72590599 0.02171998 6.948185e-01
## 9 9 6.661602e-04 1.386175e-02 0.42131743 0.73418276 8.911668e-01
## 10 10 3.341947e-85 6.417007e-01 0.38704273 0.34808716 1.433807e-04
## Celltype6 csDE.True
## 1 0.802875687 1
## 2 0.002026740 1
## 3 0.599142748 1
## 4 0.999361572 1
## 5 0.535134425 1
## 6 0.001053128 1
## 7 0.536671252 1
## 8 0.455505373 1
## 9 0.981017167 1
## 10 0.414298281 1
TCA
Tensor Composition Analysis (TCA) is a method designed for DNA methylation data that enables the deconvolution of 2-dimensional data (genes by samples) derived from a mixture of K sources (cell types) into a 3-dimensional matrix of signals. TCA also allows for the testing of specific genes in the data for different statistical relations with an outcome of interest while modeling source-specific effects (TCA regression). In particular, it enables the identification of statistical relations between source-specific signals and an outcome of interest. For further details, please refer to the following link:.
tca enables the modeling of the RNAseq of each sample as a mixture of cell-type-specific RNAseq that are unique to the sample. It also allows for the statistical testing of the effects of covariates and phenotypes at the cell type level.
X a gene by sample data matrix, which should include row and column names. Please note that NA values are not supported, and features cannot be constant across all observations (sometimes it may be necessary to trim the dataset in real data analysis).
W input is a sample by cell type matrix (proportion matrix), and each cell must be positive. The row sums should be 1 across cell types for all samples. W should also include row and column names, and NA values are not supported.
C1 input is a sample by phenotypical covariate matrix, and the covariate may affect the hidden source-specific values. C1 must include row and column names as well.
sim.design <- as.data.frame(rep(c(1,2),100))
colnames(sim.design) <- "disease"
rownames(sim.design) <- rownames(est_CT_prop)
### Run model
tca.mdl.hannum.rnaseq <- tca(X = RNAseq_final_count,
W = est_CT_prop, refit_W = FALSE,
C1 = sim.design, verbose = FALSE)
## [1] "W" "mus_hat" "sigmas_hat"
## [4] "tau_hat" "deltas_hat" "gammas_hat"
## [7] "deltas_hat_pvals" "gammas_hat_pvals" "gammas_hat_pvals.joint"
## [10] "C1" "C2"
The function returns a list of estimated model parameters. “W” is a sample by cell type proportion matrix, and if “refit_w==TRUE”, then this is the re-estimated “W”, otherwise it is the input W. “gamma_hat” is a gene by cell types*phenotypical covariates (interactions) matrix of the estimated effects of the covariates in C1 on each genetic feature in X. “gamma_hat_pvals” is a gene by cell types*phenotypical covariates matrix of p-values for the estimates in “gamma_hat” based on a T-test. “gammas_hat_pvals.joint” is a gene by phenotypical covariates matrix of p-values for the joint effects of each phenotypical covariate in C1 on each genetic feature in X, across all cell types.
tca.mdl.hannum.rnaseq$gammas_hat[1:10,]
## Celltype1.disease Celltype2.disease Celltype3.disease Celltype4.disease
## 1 784.22306 30.2847840 -97.906578 -113.485828
## 2 656.13090 -276.6065342 84.915921 -98.655051
## 3 374.71811 -1110.7511702 183.839474 54.506216
## 4 51.60151 61.3370465 -15.591577 -20.571400
## 5 88.96086 40.1912526 -11.565838 -21.617242
## 6 71.67337 42.1897088 -8.808229 27.706714
## 7 28.08200 0.2764637 -5.469976 -7.822591
## 8 61.60371 25.6929457 -8.722172 -22.520150
## 9 146.44423 -244.9431658 69.896175 18.729986
## 10 1533.08346 -88.3231045 -75.967708 -26.777691
## Celltype5.disease Celltype6.disease
## 1 958.492382 34.719485
## 2 99.414460 127.490024
## 3 -312.645371 -40.333640
## 4 82.464520 1.090204
## 5 29.727019 -14.487409
## 6 -14.580426 -84.444257
## 7 -12.784730 9.548939
## 8 7.132471 8.669879
## 9 -7.164853 -6.917363
## 10 168.214846 -47.677302
Here we extract p-values for the estimates in gammas_hat:
TCA.outputs <- as.data.frame(tca.mdl.hannum.rnaseq[["gammas_hat_pvals"]])
TCA.outputs$gene.id <- (rownames(TCA.outputs))
colnames(TCA.outputs)[1:6] <- paste0("Celltype",seq(1:6))
TCA.outputs <- TCA.outputs[,c(7,1:6)]
TCA.outputs <- merge(x = TCA.outputs, y = gene_CT_DE_connect,
by = "gene.id",all.x =T)
TCA.outputs = TCA.outputs[order(as.numeric(TCA.outputs$gene.id)),]
rownames(TCA.outputs) = TCA.outputs$gene.id
head(TCA.outputs,10)
## gene.id Celltype1 Celltype2 Celltype3 Celltype4 Celltype5
## 1 1 9.912568e-23 9.712202e-01 0.21439015 0.0195782032 2.849345e-33
## 2 2 9.295390e-10 9.354472e-04 0.08586113 0.0006870254 5.483400e-02
## 3 3 1.850755e-09 2.235645e-11 0.01444532 0.2190124740 1.836655e-07
## 4 4 1.004905e-01 4.448617e-01 0.61779720 0.2058142064 1.166022e-04
## 5 5 9.994407e-05 2.483426e-01 0.55567238 0.1334958334 6.734446e-02
## 6 6 7.489067e-03 2.500240e-01 0.77233266 0.3851258227 3.362693e-01
## 7 7 5.234669e-02 9.111096e-01 0.83322621 0.5212428286 1.894777e-01
## 8 8 7.970820e-08 2.033746e-01 0.20980690 0.0026481470 4.741161e-01
## 9 9 5.218804e-03 2.088747e-03 0.13211519 0.5097266004 8.091741e-01
## 10 10 1.397981e-74 1.402615e-01 0.08312005 0.4680263507 3.599102e-05
## Celltype6 csDE.True
## 1 0.6499155206 1
## 2 0.0001536946 1
## 3 0.6189171146 1
## 4 0.7910781596 1
## 5 0.4826790774 1
## 6 0.0002578700 1
## 7 0.8241765316 1
## 8 0.1638095538 1
## 9 0.9345749350 1
## 10 0.9442792351 1
Here is an alternative way to fit TCA model depending on a slightly different model assumption. However, results are similar based on different assumptions. For details please refer TCA Manual
### Run model/another way to fit TCA
# tca.mdl.hannum.rnaseq <- tca(X = RNAseq_final_count,
# W = est_CT_prop, refit_W = FALSE,
# verbose = FALSE)
# tca.reg = tcareg(RNAseq_final_count,tca.mdl.hannum.rnaseq,sim.design)
# TCA.outputs2 <- as.data.frame(tca.reg[["pvals"]])
# TCA.outputs2$gene.id <- as.numeric(rownames(TCA.outputs2))
# colnames(TCA.outputs2)[1:6] <- paste0("Celltype",seq(1:6))
# TCA.outputs2 <- TCA.outputs2[,c(7,1:6)]
# TCA.outputs2 <- merge(x = TCA.outputs2, y = gene_CT_DE_connect,
# by = "gene.id",all.x =T)
CARseq
CARseq is the most rigorous statistical model in cell type aware analysis of RNA-sequencing data. The inputs are bulk RNA-seq data (count_matrix) and cell type fraction estimates (cellular_proportions) for a set of samples, and relevant variables of these samples (groups), include a variable of interest for differential expression analysis as well as covariates.
# This method requires approximately 10 minutes to complete.
res.CARseq = run_CARseq(count_matrix = RNAseq_final_count,
cellular_proportions = est_CT_prop,
groups = as.factor(rep(c(1,2),100)),
shrunken_lfc = TRUE,
cores = 1,
fix_overdispersion = FALSE,
useSocket = TRUE)
The function returns a list of mostly matrices. “padj” is a genes by cell types matrix of adjusted p-values by Benjamini & Hochberg (BH) method. “shrunken_lfc” is a genes by cell types matrix of shrunken cell-specific LFC effect between groups and “lfc” is non-shrunken LFC effects.
names(res.CARseq)
## [1] "p" "padj"
## [3] "shrunken_lfc" "shrunken_lfcSE"
## [5] "shrunken_coefficients" "shrunken_coefficientsSE"
## [7] "lfc" "lfcSE"
## [9] "coefficients" "coefficientsSE"
## [11] "overdispersion" "lambda"
## [13] "elapsed_time"
head(res.CARseq$padj)
## Celltype1 Celltype2 Celltype3 Celltype4 Celltype5 Celltype6
## 1 6.202366e-20 8.833948e-01 0.6042996 0.109648084 1.902985e-38 1.000000e+00
## 2 6.178139e-33 1.666764e-01 0.4737238 0.004323851 3.361490e-01 8.093820e-06
## 3 5.827273e-10 1.609231e-14 0.2168014 0.648206023 1.462221e-03 3.961132e-01
## 4 1.450619e-01 5.074261e-01 0.7887477 0.545012720 4.203383e-03 9.448435e-01
## 5 4.458972e-05 5.957874e-01 0.8197820 0.215714744 1.935064e-01 6.198377e-01
## 6 7.416990e-02 7.112412e-01 0.8377204 0.245918659 1.000000e+00 1.170872e-03
head(res.CARseq$shrunken_lfc)
## 2_vs_1:Celltype1 2_vs_1:Celltype2 2_vs_1:Celltype3 2_vs_1:Celltype4
## 1 1.4284198 -0.1093055 -0.12588312 -0.2405679
## 2 1.8838144 -0.7674251 0.21158909 -0.7490455
## 3 1.2517946 -1.1667454 0.09199791 0.1630767
## 4 1.1499341 0.4789599 -0.09534685 -0.2457471
## 5 1.4596118 0.6860330 -0.09578678 -0.7749907
## 6 0.5806996 0.4140480 -0.14246080 0.1737755
## 2_vs_1:Celltype5 2_vs_1:Celltype6
## 1 0.5642154064 0.03450534
## 2 0.0008674781 0.62648925
## 3 -0.0010523342 -0.27048491
## 4 0.3567752641 0.06114347
## 5 0.7731222294 -0.17263066
## 6 -0.0915354251 -0.76483178
head(res.CARseq$lfc)
## 2_vs_1:Celltype1 2_vs_1:Celltype2 2_vs_1:Celltype3 2_vs_1:Celltype4
## 1 1.5705257 -0.06288303 -2.7299993 -0.2970589
## 2 1.9047142 -Inf 0.2439680 -0.9970980
## 3 1.5401461 -1.47182438 0.2168885 0.2809325
## 4 1.9516849 0.73475923 -0.4760905 -0.3760920
## 5 1.5512505 1.26164776 -0.6250052 -1.1897895
## 6 0.6363702 0.77308748 -0.2939310 0.1712822
## 2_vs_1:Celltype5 2_vs_1:Celltype6
## 1 0.56733958 0.7544951
## 2 Inf 0.6120254
## 3 -Inf -0.1787770
## 4 0.36233572 0.1032475
## 5 1.04053205 -0.1709291
## 6 -0.05493189 -0.7772930
Here we extract adjusted p-values for the estimates:
CARseq.outputs <- as.data.frame(res.CARseq$padj)
CARseq.outputs$gene.id <- rownames(res.CARseq$padj)
CARseq.outputs <- CARseq.outputs[,c(7,1:6)]
CARseq.outputs <- merge(x = CARseq.outputs, y = gene_CT_DE_connect,
by = "gene.id",all.x =T)
CARseq.outputs = CARseq.outputs[order(as.numeric(CARseq.outputs$gene.id)),]
rownames(CARseq.outputs) = CARseq.outputs$gene.id
head(CARseq.outputs,10)
## gene.id Celltype1 Celltype2 Celltype3 Celltype4 Celltype5
## 1 1 6.202366e-20 8.833948e-01 0.6042996 0.109648084 1.902985e-38
## 2 2 6.178139e-33 1.666764e-01 0.4737238 0.004323851 3.361490e-01
## 3 3 5.827273e-10 1.609231e-14 0.2168014 0.648206023 1.462221e-03
## 4 4 1.450619e-01 5.074261e-01 0.7887477 0.545012720 4.203383e-03
## 5 5 4.458972e-05 5.957874e-01 0.8197820 0.215714744 1.935064e-01
## 6 6 7.416990e-02 7.112412e-01 0.8377204 0.245918659 1.000000e+00
## 7 7 8.535865e-03 9.181916e-01 0.7997128 0.561945232 3.361074e-01
## 8 8 8.070407e-07 6.192936e-01 0.7183726 0.032954096 8.026218e-01
## 9 9 5.762145e-05 2.082161e-02 0.4870788 0.821984401 1.000000e+00
## 10 10 1.041724e-86 4.663815e-01 0.8207524 0.242055902 7.932625e-04
## Celltype6 csDE.True
## 1 1.000000e+00 1
## 2 8.093820e-06 1
## 3 3.961132e-01 1
## 4 9.448435e-01 1
## 5 6.198377e-01 1
## 6 1.170872e-03 1
## 7 5.471940e-01 1
## 8 3.908968e-01 1
## 9 1.000000e+00 1
## 10 5.525047e-01 1
DESeq2
The goal of DESeq2 is to estimate variance-mean dependence in count data from high-throughput sequencing assays and test for DEG based on a model using the negative binomial distribution. Although DESeq2 is not specifically designed for csDEGs analysis, it could achieve this goal by incorporating the cell type proportions as interaction covariates in the model.
DESeqDataSetxxxx is used to store the input values, intermediate calculations, and results of an analysis of differential expression. The DESeqDataSet class enforces non-negative integer values in the “counts” matrix stored as countData for the first input below. colData is a matrix (or dataframe) which should includes all covariates to be tested in the design. A formula which specifies the design of the experiment must be provided into design.
DESeq is the function conducting differential expression analysis based on the negative binomial distribution. DESeq function returns a DESeqDataSet object, results tables (log2 fold changes and p-values) can be generated using the results function. Shrunken LFC can then be generated using the lfcShrink function.
# create coldata for model fitting
coldata = cbind(disease = rep(c(0,1),times=100), est_CT_prop)
#be sure to delete one CT to make the design matrix full rank
dds.bl <- DESeqDataSetFromMatrix(countData = RNAseq_final_count,
colData = coldata,
design= ~ 0 + Celltype1 + Celltype2 + Celltype3 +
Celltype4 + Celltype5 + Celltype6 +
disease:Celltype1 + disease:Celltype2 +
disease:Celltype3 + disease:Celltype4 +
disease:Celltype5 + disease:Celltype6)
tic()
dds.bl <- DESeq(dds.bl)
toc()
dds.bl
## class: DESeqDataSet
## dim: 2000 200
## metadata(1): version
## assays(4): counts mu H cooks
## rownames(2000): 1 2 ... 1999 2000
## rowData names(62): baseMean baseVar ... deviance maxCooks
## colnames(200): control1 case1 ... control100 case100
## colData names(8): disease Celltype1 ... Celltype6 sizeFactor
Here we extract adjusted p-values for the estimates by results function:
# Celltype1
res1.bl <- as.data.frame(results(dds.bl, name = "Celltype1.disease"))
res1.bl$padj <- ifelse(is.na(res1.bl$padj), 1, res1.bl$padj)
res1.bl$gene.id <- (rownames(res1.bl))
Celltype1.bl <- res1.bl %>% dplyr::select(gene.id, padj) %>%
dplyr::rename(Celltype1 = padj)
# Celltype2
res2.bl <- as.data.frame(results(dds.bl, name = "Celltype2.disease"))
res2.bl$padj <- ifelse(is.na(res2.bl$padj), 1, res2.bl$padj)
res2.bl$gene.id <- (rownames(res2.bl))
Celltype2.bl <- res2.bl %>% dplyr::select(gene.id, padj) %>%
dplyr::rename(Celltype2 = padj)
# Celltype3
res3.bl <- as.data.frame(results(dds.bl, name = "Celltype3.disease"))
res3.bl$padj <- ifelse(is.na(res3.bl$padj), 1, res3.bl$padj)
res3.bl$gene.id <- (rownames(res3.bl))
Celltype3.bl <- res3.bl %>% dplyr::select(gene.id, padj) %>%
dplyr::rename(Celltype3 = padj)
# Celltype4
res4.bl <- as.data.frame(results(dds.bl, name = "Celltype4.disease"))
res4.bl$padj <- ifelse(is.na(res4.bl$padj), 1, res4.bl$padj)
res4.bl$gene.id <- (rownames(res4.bl))
Celltype4.bl <- res4.bl %>% dplyr::select(gene.id, padj) %>%
dplyr::rename(Celltype4 = padj)
# Celltype5
res5.bl <- as.data.frame(results(dds.bl, name = "Celltype5.disease"))
res5.bl$padj <- ifelse(is.na(res5.bl$padj), 1, res5.bl$padj)
res5.bl$gene.id <- (rownames(res5.bl))
Celltype5.bl <- res5.bl %>% dplyr::select(gene.id, padj) %>%
dplyr::rename(Celltype5 = padj)
# Celltype6
res6.bl <- as.data.frame(results(dds.bl, name = "Celltype6.disease"))
res6.bl$padj <- ifelse(is.na(res6.bl$padj), 1, res6.bl$padj)
res6.bl$gene.id <- (rownames(res6.bl))
Celltype6.bl <- res6.bl %>% dplyr::select(gene.id, padj) %>%
dplyr::rename(Celltype6 = padj)
DESeq2.outputs <- Reduce(function(x,y) merge(x = x, y = y, by = "gene.id",all.x =T),
list(Celltype1.bl, Celltype2.bl, Celltype3.bl,
Celltype4.bl, Celltype5.bl, Celltype6.bl,
gene_CT_DE_connect))
DESeq2.outputs = DESeq2.outputs[order(as.numeric(DESeq2.outputs$gene.id)),]
rownames(DESeq2.outputs) = DESeq2.outputs$gene.id
head(DESeq2.outputs,10)
## gene.id Celltype1 Celltype2 Celltype3 Celltype4 Celltype5
## 1 1 4.477931e-18 8.406315e-01 0.90677407 0.04164153 6.762314e-08
## 2 2 1.487284e-26 2.142558e-02 0.76992896 0.02819519 4.522998e-06
## 3 3 2.206552e-04 1.092036e-13 0.02245169 0.61624271 4.483315e-11
## 4 4 1.190654e-01 7.094800e-01 0.77012465 0.62513041 2.433349e-01
## 5 5 5.432354e-04 6.297119e-01 0.84516203 0.36336180 1.406307e-01
## 6 6 2.905903e-01 7.891463e-01 0.71505031 0.61100180 9.520691e-01
## 7 7 5.060034e-02 9.909691e-01 0.79018056 0.51686683 2.855007e-01
## 8 8 1.124914e-05 6.524908e-01 0.65949497 0.02815057 5.774959e-01
## 9 9 1.047052e-02 4.549209e-03 0.79609517 0.71503956 8.736590e-01
## 10 10 2.243256e-102 7.234885e-02 0.94103752 0.84391555 1.821115e-02
## Celltype6 csDE.True
## 1 0.003242970 1
## 2 0.090274614 1
## 3 0.902579287 1
## 4 0.822409279 1
## 5 0.220237822 1
## 6 0.001884162 1
## 7 0.531998812 1
## 8 0.519704799 1
## 9 0.947377225 1
## 10 0.545832319 1
CeDAR
CeDAR is used to test csDE signals for phenotypical covariates by considering DE state correlation between cell types. This method provides posterior probability of whether a genetic feature is DE in certain cell types given observed bulk data. When sample size is small or technical noise is large, this tree structure is recommended. We could also incorporate such correlation into csDE detection to improve the power, especially in cell types with low abundance.
cedar is incorporated as a function in TOAST package.
Y_raw is a genes by samples matrix of observed bulk data. prop is a samples by cell types matrix of cell type abundance. design.1 includes covariates with cell type specific effect; while design.2 is for covarites without cell type specific effect (sample by covariate vector/matrix). factor.to.test specifies which covariates to be tested (from design.1/design2).
sim.design <- as.data.frame(factor(rep(c(1,2),100)))
colnames(sim.design) <- "disease"
# about 30 secs
res_cedar <- cedar(Y_raw = as.matrix(RNAseq_final_count),
prop = est_CT_prop,
design.1 = sim.design,
factor.to.test = 'disease')
The output is a list. “toast_res” is results of csTest() function from TOAST package. “tree_res” is a matrix of cell-type specific posterior probability of each genetic feature.
names(res_cedar)
## [1] "toast_res" "tree_res" "time_used"
We can have posterior probability of DE for each genetic feature in each cell type from a more complex or from a simpler tree structure:
head(res_cedar$tree_res$full$pp)
## Celltype1 Celltype2 Celltype3 Celltype4 Celltype5 Celltype6
## 1 1.0000000 0.7145110 0.5871622 0.9198576 1.0000000 0.4404046
## 2 1.0000000 0.9975011 0.8970120 0.9875514 0.9764860 0.9989232
## 3 1.0000000 1.0000000 0.9696668 0.6877684 0.9999990 0.4554661
## 4 0.8326793 0.5400587 0.3627337 0.5200247 0.9994420 0.3784331
## 5 0.9999986 0.5984155 0.4308709 0.7304736 0.8177801 0.4906431
## 6 0.8924257 0.5519430 0.3726420 0.7820650 0.8134018 0.9993236
head(res_cedar$tree_res$single$pp)
## Celltype1 Celltype2 Celltype3 Celltype4 Celltype5 Celltype6
## 1 1.0000000 0.4080102 0.3695901 0.9110021 1.0000000 0.3028812
## 2 1.0000000 0.9842010 0.6068900 0.9789479 0.9300748 0.9985262
## 3 1.0000000 1.0000000 0.8391173 0.5949058 0.9999947 0.2821854
## 4 0.7956898 0.3122701 0.1770390 0.4087894 0.9993220 0.2334202
## 5 0.9999981 0.2938729 0.1826718 0.6802124 0.7147263 0.3794544
## 6 0.8792442 0.3147759 0.1739759 0.7244160 0.5026382 0.9993225
We further use \(1-post.Prob\) to get the norminal p-value.
# Returns a posterior probability of whether a feature (gene) is a DEG in
# a certain cell type
cedar.outputs <- as.data.frame(1- res_cedar$tree_res$full$pp)
cedar.outputs$gene.id <- (rownames(cedar.outputs))
cedar.outputs <- cedar.outputs[,c(7,1:6)]
cedar.outputs <- merge(x = cedar.outputs, y = gene_CT_DE_connect,
by = "gene.id",all.x =T)
# aa <- as.data.frame(table(gene.CT.FDR$Celltype3))
cedar.outputs = cedar.outputs[order(as.numeric(cedar.outputs$gene.id)),]
rownames(cedar.outputs) = cedar.outputs$gene.id
head(cedar.outputs,10)
## gene.id Celltype1 Celltype2 Celltype3 Celltype4 Celltype5
## 1 1 0.000000e+00 0.285489044 0.4128378 0.08014244 0.000000e+00
## 2 2 0.000000e+00 0.002498894 0.1029880 0.01244864 2.351405e-02
## 3 3 1.254557e-08 0.000000000 0.0303332 0.31223160 9.766730e-07
## 4 4 1.673207e-01 0.459941290 0.6372663 0.47997527 5.579680e-04
## 5 5 1.423306e-06 0.401584517 0.5691291 0.26952642 1.822199e-01
## 6 6 1.075743e-01 0.448056980 0.6273580 0.21793499 1.865982e-01
## 7 7 5.388047e-02 0.523084342 0.6433422 0.54743357 2.097714e-01
## 8 8 7.291874e-09 0.375509923 0.5718931 0.04840874 2.563183e-01
## 9 9 1.045304e-04 0.009656311 0.1905728 0.63540905 4.782149e-01
## 10 10 0.000000e+00 0.155473134 0.1500476 0.29992566 2.645339e-04
## Celltype6 csDE.True
## 1 0.5595954313 1
## 2 0.0010768204 1
## 3 0.5445338768 1
## 4 0.6215669384 1
## 5 0.5093568965 1
## 6 0.0006763656 1
## 7 0.6044155459 1
## 8 0.5096116278 1
## 9 0.7566018023 1
## 10 0.3463708761 1
LRCDE
LRCDE a cell type-specific differential expression package. lrcde applies to any type of high-throughput data (methylation profiles, RNA-seq data) and it requires three components.
het.sub is a samples by genes heterogeneous gene expression data matrix. Input should be non-median-centered and non-standardized. cell.props is a cell types by samples matrix. The sum across cell types should be ~1 for each sample. groups is a vector of 1’s and 2’s indicating group assignment for each sample. Also the sample order should be the same across het.sub and cell.props.
pheno.v <- rep(c(1, 2), 100)
# Run Model
power.analysis.df <- lrcde(het.sub = t(RNAseq_final_count),
cell.props = est_CT_prop ,
groups = pheno.v ,
output.file = "aa.csv",
VERBOSE = T,
method = "dual",
direction = "two.sided")
# The warning message you received when running the lrcde function in R may indicate that the function was not able to achieve full numerical precision in some calculations. This can happen due to the limitations of floating-point arithmetic used in computers, which can sometimes lead to rounding errors and loss of precision when working with very small or very large numbers.
A list with three elements. The first contains a data.frame of analysis results. The second contains a list of parameters supplied to the lrcde function.
print(power.analysis.df[[1]][c(1:3,2001:2003,4001:4003),])
## site base case diff.est mse.control mse.case cell
## 1 1 200.60797 921.50032 720.89235 584.6699 674.1086 Celltype1
## 2 2 132.71330 780.10204 647.38874 223.3617 213.2660 Celltype1
## 3 3 103.47018 473.82512 370.35494 408.6066 456.6934 Celltype1
## 2001 1 117.38873 24.62962 -92.75912 584.6699 674.1086 Celltype2
## 2002 2 74.39143 -244.36096 -74.39143 223.3617 213.2660 Celltype2
## 2003 3 1237.86152 125.08931 -1112.77221 408.6066 456.6934 Celltype2
## 4001 1 93.37761 25.21199 -68.16562 584.6699 674.1086 Celltype3
## 4002 2 211.36005 305.54773 94.18768 223.3617 213.2660 Celltype3
## 4003 3 478.54139 653.37801 174.83662 408.6066 456.6934 Celltype3
## cell.sd kappa.1 kappa.2 t.crit t.stat p.val.t se1
## 1 0.12128222 119.6352 1102.016 1.983066 78.202534 1.951995e-94 14.29189
## 2 0.12128222 119.6352 1102.016 1.981997 127.333387 1.420163e-120 11.08339
## 3 0.12128222 119.6352 1102.016 1.979126 60.529567 9.266536e-95 21.00371
## 2001 0.09744163 119.6352 1102.016 1.983673 -4.842907 9.999977e-01 20.28405
## 2002 0.09744163 119.6352 1102.016 1.983140 -7.087687 1.000000e+00 15.73032
## 2003 0.09744163 119.6352 1102.016 1.981540 -89.863451 1.000000e+00 29.80992
## 4001 0.04713714 119.6352 1102.016 1.979904 -7.639704 1.000000e+00 27.86769
## 4002 0.04713714 119.6352 1102.016 1.977008 18.471909 1.054239e-39 21.61145
## 4003 0.04713714 119.6352 1102.016 1.972843 25.952864 6.240336e-64 40.95501
## se2 se.p t.power
## 1 91.06810 9.218273 1.0000000
## 2 49.61925 5.084203 1.0000000
## 3 57.46778 6.118579 1.0000000
## 2001 190.45894 19.153603 0.9999977
## 2002 103.77322 10.495868 1.0000000
## 2003 120.18755 12.382923 1.0000000
## 4001 84.76189 8.922548 1.0000000
## 4002 46.18326 5.098969 1.0000000
## 4003 53.48829 6.736699 1.0000000
Here we extract T-test p-values for the estimates:
LRCDE.outputs = as.data.frame(power.analysis.df[[1]] %>%
dplyr::select(site, cell, p.val.t) %>%
pivot_wider(id_cols = NULL, names_from = cell,values_from = p.val.t))
names(LRCDE.outputs)[1] <- c("gene.id")
# Correct for multiple comparisons.
for( zz in 1:6){
LRCDE.outputs[,1+zz] = p.adjust(LRCDE.outputs[,1+zz], method = "BH")
}
LRCDE.outputs <- merge(x = LRCDE.outputs, y = gene_CT_DE_connect,
by = "gene.id",all.x =T)
LRCDE.outputs = LRCDE.outputs[order(as.numeric(LRCDE.outputs$gene.id)),]
rownames(LRCDE.outputs) = LRCDE.outputs$gene.id
head(LRCDE.outputs,10)
## gene.id Celltype1 Celltype2 Celltype3 Celltype4 Celltype5
## 1 1 4.539523e-93 1.000000e+00 1.000000e+00 1.000000e+00 5.025893e-174
## 2 2 4.814112e-119 1.000000e+00 7.075428e-39 1.000000e+00 3.053245e-08
## 3 3 2.232900e-93 1.000000e+00 7.564043e-63 9.543993e-36 1.000000e+00
## 4 4 2.670971e-39 5.266458e-17 1.000000e+00 1.000000e+00 4.677184e-90
## 5 5 9.862755e-79 2.467829e-18 1.000000e+00 1.000000e+00 9.174420e-45
## 6 6 2.393608e-45 4.178652e-15 1.000000e+00 3.867938e-38 8.375209e-01
## 7 7 4.762983e-50 9.881423e-01 1.000000e+00 1.000000e+00 1.000000e+00
## 8 8 2.368155e-87 5.088581e-19 1.000000e+00 1.000000e+00 4.998919e-05
## 9 9 3.477816e-73 1.000000e+00 7.434881e-33 9.181229e-08 1.000000e+00
## 10 10 1.577104e-175 1.000000e+00 1.000000e+00 1.000000e+00 4.838563e-93
## Celltype6 csDE.True
## 1 1.000000e+00 1
## 2 2.394650e-99 1
## 3 1.000000e+00 1
## 4 1.000000e+00 1
## 5 1.000000e+00 1
## 6 1.000000e+00 1
## 7 3.034906e-20 1
## 8 1.253472e-25 1
## 9 1.000000e+00 1
## 10 1.000000e+00 1
csSAM
The first model designed for csDEGs calling process. csSAM implements the method that computes cell-specific differential expression from measured cell proportions using SAM.
csSamWrapper performs the entire functionality of csSAM. G is a sample by gene gene expression matrix. cc is a sample by cell type cell abundance matrix. y is a numeric vector of group assignment for each sample (1 or 2).
res.csSAM = csSamWrapper(G = t(RNAseq_final_count),
cc = est_CT_prop,
y = as.factor(rep(c(1,2),100)),
nperms = 50,
alternative = "two.sided",
standardize = TRUE,
medianCenter = TRUE)
The function returns a list. “sigGene.csSAM” is a genes by cell types matrix object with the result of contrasting the average cell-specific expression profile of the two groups, per cell-type.
names(res.csSAM)
## [1] "deconv" "fdr.csSAM" "fdr.SAM" "sigGene.csSAM"
## [5] "fileName"
head(t(res.csSAM$sigGene.csSAM))
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.0000000 0.7446222 0.6274246 0.1825442 0.00000000 0.99062000
## [2,] 0.0000000 0.7446222 0.3084977 0.1825442 0.98211419 0.02761905
## [3,] 0.0000000 0.0000000 0.1219841 0.5040726 0.07919283 0.68638152
## [4,] 0.4883158 0.7446222 0.9827167 0.9988200 0.28979253 0.99062000
## [5,] 0.1757495 0.7446222 0.8556053 0.9988200 0.98211419 0.99062000
## [6,] 0.4883158 0.7446222 0.8556053 0.5040726 0.98211419 0.11173554
csSAM.outputs <- as.data.frame(t(res.csSAM$sigGene.csSAM))
csSAM.outputs$gene.id <- (rownames(RNAseq_final_count))
colnames(csSAM.outputs)[1:6] <- paste0("Celltype",seq(1:6))
csSAM.outputs <- csSAM.outputs[,c(7,1:6)]
csSAM.outputs <- merge(x = csSAM.outputs, y = gene_CT_DE_connect,
by = "gene.id",all.x =T)
csSAM.outputs = csSAM.outputs[order(as.numeric(csSAM.outputs$gene.id)),]
rownames(csSAM.outputs) = csSAM.outputs$gene.id
head(csSAM.outputs,10)
## gene.id Celltype1 Celltype2 Celltype3 Celltype4 Celltype5 Celltype6
## 1 1 0.00000000 0.7446222 0.6274246 0.1825442 0.00000000 0.99062000
## 2 2 0.00000000 0.7446222 0.3084977 0.1825442 0.98211419 0.02761905
## 3 3 0.00000000 0.0000000 0.1219841 0.5040726 0.07919283 0.68638152
## 4 4 0.48831579 0.7446222 0.9827167 0.9988200 0.28979253 0.99062000
## 5 5 0.17574953 0.7446222 0.8556053 0.9988200 0.98211419 0.99062000
## 6 6 0.48831579 0.7446222 0.8556053 0.5040726 0.98211419 0.11173554
## 7 7 0.95581000 0.9277755 0.9827167 0.9988200 0.98211419 0.99062000
## 8 8 0.17574953 0.8367708 0.8556053 0.9988200 0.98211419 0.68638152
## 9 9 0.04919753 0.2451613 0.5213170 0.9988200 0.98211419 0.99062000
## 10 10 0.00000000 0.6372567 0.5213170 0.5040726 0.07919283 0.68638152
## csDE.True
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
## 7 1
## 8 1
## 9 1
## 10 1
Example of Real Data Analysis:
Preprosessing: Load into Datasets, Structure Manipulations:
The dataset used in this analysis is the bulk RNA-seq data from The Cancer Genome Atlas Lung Adenocarcinoma TCGA-LUAD. The goal is to predict the stages of pathologic cancer, which are categorized into a binary variable: stage-I (n=330) versus stage-II/III/IV (n=271). To ensure data quality, we filtered out genes with a 75% quantile expression level lower than 20 across all study samples.
load("rse_gene_lung.Rdata")
## sample variable hdle ----
covar <- as.data.frame(rse_gene@colData)
table(covar$gdc_cases.project.project_id)
# Select TCGA-LUAD dataset
covar.luad <- covar[covar$gdc_cases.project.project_id == "TCGA-LUAD",]
# Select TCGA-LUAD phenotypical information
covar.luad <- covar.luad[,c('gdc_cases.diagnoses.tumor_stage',
'cgc_case_pathologic_stage')]
#table(covar.luad$cgc_case_pathologic_stage)
# Stage I Stage IA Stage IB Stage II Stage IIA Stage IIB Stage IIIA
# 5 154 171 1 58 83 85
#Stage IIIB Stage IV
# 12 29
# Create binary pathologic stage variable. Combine I and IA stage cancer:
covar.luad$cgc_case_patho_stage_binary <-
ifelse(covar.luad$cgc_case_pathologic_stage %in% c("Stage I","Stage IA",
"Stage IB"), 0, 1)
#table(covar.luad$cgc_case_patho_stage_binary)
# 0 1
# 330 271
## count data hdle ----
count <- rse_gene@assays$data$counts
count.luad <- count[,rownames(covar.luad)]
# dim(count.luad) # [1] 58037 601
# Filter genes based on 75% quantile
count.luad.filter <- count.luad[apply(count.luad,1,
function(x) quantile(x, 0.75)) >= 20,]
# dim(count.luad.filter) # [1] 43975 features 601 samples
count.luad.filter <- cbind(count.luad.filter,
gene_id = gsub("\\.\\d+$", "", rownames(count.luad.filter)))
count.luad.filter <- count.luad.filter[,c(602,1:601)]
# Check no duplicated genes: which(duplicated(count.luad.filter[,'gene_id']))
Genes without Ensembl Annotations and with replications are also removed, which led to a total of 31,294 genes.
## Gene Annotations ----
# Filter genes based on gene annotation
ensembl.con <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# check for gene stable ID in the description
G_list <- getBM(attributes = c("ensembl_gene_id","hgnc_symbol", # "external_gene_name",
"start_position","end_position"),
filters = "ensembl_gene_id",
values = as.vector(count.luad.filter[,1]),
mart = ensembl.con)
G_list <- G_list[G_list$hgnc_symbol != '',]
# Check for duplicated genes
duplicated.id = which(duplicated(G_list$hgnc_symbol))
G_list = G_list[-duplicated.id,] # [1] 31296 4
duplicated.id = which(duplicated(G_list$ensembl_gene_id))
G_list = G_list[-duplicated.id,]
# Creat gene size
G_list$gene_size <- G_list$end_position - G_list$start_position
# Merge gene name with raw count data
count.luad.filter <- left_join(G_list, as.data.frame(count.luad.filter),
by = c("ensembl_gene_id" = "gene_id"))
# Data structure manipulations
numeric.count <- t(apply(count.luad.filter[,6:606],1,function(x)
as.numeric(unlist(x))))
colnames(numeric.count) <- colnames(count.luad.filter[,6:606])
count.luad.filter[,6:606] <- numeric.count
# Final data dimension: 606 patients and ~30k genes.
# dim(count.luad.filter) 31294 606
# Clear memory.
rm(numeric.count,ensembl.con,rse_gene,duplicated.id,count,
covar,count.luad)
Deconvolution:
Because tumor-infiltrating immune cells are crucial in cancer patients and the compositions of immune cells in tumor samples play important roles in tumor heterogeneities, we use TIMER2.0 from immunedeconv package to deconvolute cell proportions. The cell composition includes six immune cell types are B cell, T cell CD4, T cell CD8, Neutrophil, Macrophage, and Myeloid Dendritic.
We need to first convert RNA sequencing count data to TMP using the gene length information provided by Ensembl because TIMER2.0 deconvolution algorithms require standardized TMP inputs. The cell type composition matrix needs to be further normalized to 1 across six cells to reach the constrained criteria.
## Deconvolute ----
# Create TPMs tandardized data for cell type composition analysis
TPM_luad.filter <- count.luad.filter[,-c(3,4)]
TPM_luad.filter$gene_size <- TPM_luad.filter$gene_size/1000
names(TPM_luad.filter)[3] <- "gene_size_kb"
TPM_luad.filter[,4:604] <- sweep(TPM_luad.filter[,4:604], 1,TPM_luad.filter[,3],
FUN = '/')
test <- as.matrix(apply(TPM_luad.filter[,4:604],2,sum)/1000000)
TPM_luad.filter[,4:604] <- sweep(TPM_luad.filter[,4:604], 2, test, FUN = "/")
rm(test)
TPM_luad.filter.timer <- TPM_luad.filter[,c(2,4:604)]
rownames(TPM_luad.filter.timer) <- TPM_luad.filter.timer$hgnc_symbol
TPM_luad.filter.timer <- TPM_luad.filter.timer[,-1]
TPM_luad.filter.timer <- round(TPM_luad.filter.timer,digits = 2)
# Use TIMER for deconvolution, from immunedeconv package
CT.Prop.luad.TIMER <- deconvolute(TPM_luad.filter.timer, "timer",
indications=as.vector(c(rep("luad",601))))
# Other deconvolution methods: from IOBR package
# CT.Prop.luad.CIBERSORT = deconvo_tme(eset = count_luad.timer,
# method = "cibersort",
# arrays = FALSE, perm = 200 )
# CT.Prop.luad.CIBERSORT = deconvo_tme(eset = TPM_luad.filter.timer,
# method = "cibersort",
# arrays = FALSE, perm = 200)
# CT.Prop.luad.TIMER2 = deconvo_tme(eset = count_luad.timer,
# method = "timer",
# group_list = rep("luad",dim(CT.Prop.luad.TIMER)[1]))
## CT prop normalize----
rownames(count.luad.filter) <- count.luad.filter$ensembl_gene_id
count.luad.filter <- count.luad.filter[,6:606]
CT.Prop.luad.TIMER <- as.data.frame(CT.Prop.luad.TIMER)
rownames(CT.Prop.luad.TIMER) <- CT.Prop.luad.TIMER$cell_type
CT.Prop.luad.TIMER <- CT.Prop.luad.TIMER[,-1]
CT.Prop.luad.TIMER <- t(CT.Prop.luad.TIMER)
colnames(CT.Prop.luad.TIMER) <- c("B_cell","T_cell_CD4", "T_cell_CD8",
"Neutrophil","Macrophage","Myeloid_dentritic_c")
CT.Prop.luad.TIMER <- sweep(CT.Prop.luad.TIMER,1,
1/apply(CT.Prop.luad.TIMER,1,sum),
FUN = "*")
head(apply(CT.Prop.luad.TIMER,1,sum),10)
## 672AFEB7-E9AA-4A44-AA9C-EF6344AE5C5C 670D8333-6723-4B4F-B533-D2BFF803A9BF
## 1 1
## 4B3D8B07-8B44-45F3-BE30-A0B6EDBF8265 05E2E228-C0C5-4D6D-8FE1-DA25D78B0EC1
## 1 1
## AE818BC4-05BF-4E85-A4AD-F3E43DD7D37A 1002C658-3E4D-4F5A-BCB2-E81193CF209A
## 1 1
## 2634796D-5545-49C3-AA21-939D1EE443D6 8085F165-9AE3-4079-9F7F-711716D68BF0
## 1 1
## 3E6800B3-B156-46ED-A949-1CF9C79F2C88 A6CC4EB8-BB74-4CBD-85BD-F60FD9D4DC50
## 1 1
TCA
Notice here that the clinical outcome is treated as an independent covariate in csDEGs calling process.
## TCA ----
tca.design <- as.matrix(as.numeric(covar.luad$cgc_case_patho_stage_binary))
colnames(tca.design) <- "stage"
rownames(tca.design) <- rownames(covar.luad)
### Run model
count.luad.filter2 <- count.luad.filter[apply(count.luad.filter,1,sd) <= 35000,]
tca.model <- tca(X = count.luad.filter2,
W = as.matrix(CT.Prop.luad.TIMER), refit_W = F,
C1 = tca.design, verbose = T)
# extract p-values for the estimates in gammas_hat (based on a T-test)
tca.outputs <- as.data.frame(tca.model[["gammas_hat_pvals"]])
tca.outputs$gene.id <- rownames(tca.outputs)
colnames(tca.outputs)[1:6] <- colnames(CT.Prop.luad.TIMER)
tca.outputs <- tca.outputs[,c(7,1:6)]
tca.smry <- tca.outputs
tca.smry.symbol <- merge(tca.smry,G_list, by = "gene.id")
head(tca.smry.symbol,15)
## gene.id Astro Exc Inh Micro Oligo
## 1 ENSG00000000003 0.75610643 0.929799835 0.183145345 0.10130346 0.003612741
## 2 ENSG00000000419 0.09143639 0.042033683 0.607801011 0.79984218 0.302678077
## 3 ENSG00000000457 0.90941863 0.740706553 0.392678985 0.34839534 0.312184203
## 4 ENSG00000000460 0.31948279 0.762589104 0.478549522 0.25237920 0.084136204
## 5 ENSG00000000938 0.80405838 0.223628958 0.477214695 0.83075836 0.884505706
## 6 ENSG00000000971 0.61009856 0.999088525 0.838252895 0.40056510 0.421305082
## 7 ENSG00000001036 0.96399096 0.404510558 0.569663484 0.17241321 0.426463478
## 8 ENSG00000001084 0.64027595 0.426597331 0.841178132 0.17810117 0.419323416
## 9 ENSG00000001167 0.90197934 0.771422207 0.700177593 0.19859443 0.375012411
## 10 ENSG00000001460 0.40404415 0.127584520 0.432182333 0.51997939 0.335826679
## 11 ENSG00000001461 0.32984070 0.706440349 0.773923427 0.37606696 0.350318294
## 12 ENSG00000001497 0.99376041 0.188302333 0.556366441 0.53821038 0.597548629
## 13 ENSG00000001561 0.91833879 0.754994661 0.196644583 0.06687532 0.513414944
## 14 ENSG00000001617 0.50483991 0.001345846 0.009221129 0.74192914 0.657240921
## 15 ENSG00000001626 0.31276473 0.563097410 0.128776499 0.52039704 0.114163183
## OPC hgnc_symbol
## 1 0.1487570 TSPAN6
## 2 0.3557776 DPM1
## 3 0.5518572 SCYL3
## 4 0.6204813 C1orf112
## 5 0.3586093 FGR
## 6 0.8481829 CFH
## 7 0.3325489 FUCA2
## 8 0.8463933 GCLC
## 9 0.6392501 NFYA
## 10 0.9200947 STPG1
## 11 0.1865906 NIPAL3
## 12 0.6760131 LAS1L
## 13 0.5740340 ENPP4
## 14 0.9364709 SEMA3F
## 15 0.6154911 CFTR
TOAST
## TOAST ----
toast.design <- as.data.frame(factor(covar.luad$cgc_case_patho_stage_binary))
colnames(toast.design) <- "stage"
sim.Design_out <- makeDesign(toast.design,CT.Prop.luad.TIMER)
Toast.model <- fitModel(sim.Design_out, as.matrix(count.luad.filter))
Toast.out <- csTest(Toast.model, coef = "stage",cell_type = NULL,
verbose = F, sort = T)
aa <- lapply(seq(1:6), function(i,x) {
x[[i]]$gene.id <- rownames(x[[i]])
x[[i]] <- x[[i]] %>% dplyr::select(gene.id,fdr)
names(x[[i]]) <- c("gene.id",names(x)[i])
assign(names(x)[i],x[[i]], envir=.GlobalEnv)
},x=Toast.out)
remove(aa)
Toast.smry <- Reduce(function(x,y) merge(x = x, y = y, by = "gene.id",all.x =T),
list(B_cell,T_cell_CD4,T_cell_CD8,Neutrophil,Macrophage,
Myeloid_dentritic_c))
remove(B_cell,T_cell_CD4,T_cell_CD8,Neutrophil,Macrophage,
Myeloid_dentritic_c, sim.Design_out,Toast.model,Toast.out,toast.design)
Toast.smry.symbol <- merge(Toast.smry,G_list, by = "gene.id")
head(Toast.smry.symbol,15)
## gene.id Astro Exc Inh Micro Oligo OPC
## 1 ENSG00000000003 0.9998135 0.9284587 0.9999893 0.999998 0.8319204 0.9953115
## 2 ENSG00000000419 0.9998135 0.4043824 0.9999893 0.999998 0.8319204 0.9953115
## 3 ENSG00000000457 0.9998135 0.8286659 0.9999893 0.999998 0.8330658 0.9992906
## 4 ENSG00000000460 0.9998135 0.7965821 0.9999893 0.999998 0.8319204 0.9992906
## 5 ENSG00000000938 0.9998135 0.5274979 0.9999893 0.999998 0.9928465 0.9953115
## 6 ENSG00000000971 0.9998135 0.7772749 0.9999893 0.999998 0.9474317 0.9962820
## 7 ENSG00000001036 0.9998135 0.6333875 0.9999893 0.999998 0.9439384 0.9953115
## 8 ENSG00000001084 0.9998135 0.7680022 0.9999893 0.999998 0.8319204 0.9992906
## 9 ENSG00000001167 0.9999913 0.9615638 0.9999893 0.999998 0.8319204 0.9953115
## 10 ENSG00000001460 0.9998135 0.3383280 0.9999893 0.999998 0.8782956 0.9997742
## 11 ENSG00000001461 0.9998135 0.8985756 0.9999893 0.999998 0.8319204 0.9953115
## 12 ENSG00000001497 0.9998135 0.4777003 0.9999893 0.999998 0.9660650 0.9968641
## 13 ENSG00000001561 0.9998135 0.5172285 0.9999893 0.999998 0.8319204 0.9953115
## 14 ENSG00000001617 0.9998135 0.2475045 0.9999893 0.999998 0.9737621 0.9992906
## 15 ENSG00000001626 0.9998135 0.6690331 0.9999893 0.999998 0.8319204 0.9953115
## hgnc_symbol
## 1 TSPAN6
## 2 DPM1
## 3 SCYL3
## 4 C1orf112
## 5 FGR
## 6 CFH
## 7 FUCA2
## 8 GCLC
## 9 NFYA
## 10 STPG1
## 11 NIPAL3
## 12 LAS1L
## 13 ENPP4
## 14 SEMA3F
## 15 CFTR
DESeq2
## DESeq2 ----
# create coldata for model fitting
covar.luad2 <- covar.luad[order(covar.luad$cgc_case_patho_stage_binary),]
CT.Prop.luad.TIMER2 <- CT.Prop.luad.TIMER[rownames(covar.luad2),]
count.luad.filter2 <- count.luad.filter[,rownames(covar.luad2)]
coldata = cbind(stage = as.factor(covar.luad2$cgc_case_patho_stage_binary),
CT.Prop.luad.TIMER2)
dds.bl <- DESeqDataSetFromMatrix(countData = count.luad.filter2,
colData = coldata,
design = ~ 0 + B_cell + T_cell_CD4 +
T_cell_CD8 + Neutrophil + Macrophage +
Myeloid_dentritic_c +
stage:B_cell + stage:T_cell_CD4 +
stage:T_cell_CD8 + stage:Neutrophil +
stage:Macrophage + stage:Myeloid_dentritic_c)
dds.bl <- DESeq(dds.bl)
DESeq2.out <- dds.bl
# resultsNames(dds.bl)
deseq2resultname <- resultsNames(dds.bl)[7:12]
DESeq2.smry <-data.frame(gene.id = rownames(count.luad.filter2))
for(i in 1:length(deseq2resultname)){
res1.bl <- as.data.frame(results(dds.bl, name = deseq2resultname[i]))
res1.bl$padj <- ifelse(is.na(res1.bl$padj), 1, res1.bl$padj)
res1.bl$gene.id <- rownames(res1.bl)
Celltype1.bl <- res1.bl %>%
dplyr::select(gene.id, padj) %>%
dplyr::rename(!!substr(deseq2resultname[i],1,nchar(deseq2resultname[i])-6):=padj)
DESeq2.smry <- merge(DESeq2.smry, Celltype1.bl, by = "gene.id",all.x =T)
}
DESeq2.smry.symbol <- merge(DESeq2.smry,G_list, by = "gene.id")
head(DESeq2.smry.symbol,15)
## gene.id Astro Exc Inh Micro Oligo OPC
## 1 ENSG00000000003 0.9611865 0.6541721 0.3762072 0.9992306 1.0000000 0.8083559
## 2 ENSG00000000419 0.3467613 0.8499556 0.8312811 0.9992306 0.2480934 0.9880205
## 3 ENSG00000000457 0.8457842 0.6363731 0.8186414 0.9992306 0.4901586 0.7434262
## 4 ENSG00000000460 0.3434591 0.6474127 0.5434581 0.9992306 0.4973274 0.7442209
## 5 ENSG00000000938 0.7337655 0.9891557 0.9184069 0.9992306 1.0000000 0.9912205
## 6 ENSG00000000971 0.8418209 0.8775300 0.9876620 0.9992306 0.2688641 0.9068843
## 7 ENSG00000001036 0.9871490 0.9691142 0.4801493 0.9992306 1.0000000 0.8564742
## 8 ENSG00000001084 0.5442480 0.6675758 0.5266777 0.9992306 0.2386741 0.8118223
## 9 ENSG00000001167 0.6784735 0.4183163 0.2360247 0.9992306 1.0000000 0.8160445
## 10 ENSG00000001460 0.3909547 0.5685522 0.3156943 0.9992306 1.0000000 0.7033200
## 11 ENSG00000001461 0.4630501 0.4835969 0.4355589 0.9992306 0.1053649 0.7399374
## 12 ENSG00000001497 0.7062219 0.9807327 0.7865962 0.9992306 0.2039998 0.9424297
## 13 ENSG00000001561 0.7721005 0.4062672 0.2631234 0.9992306 0.1012494 0.8411948
## 14 ENSG00000001617 0.6417434 0.3953291 0.2421134 0.9992306 1.0000000 0.8477374
## 15 ENSG00000001626 0.8991963 0.4487335 0.3472886 0.9992306 1.0000000 0.9334153
## hgnc_symbol
## 1 TSPAN6
## 2 DPM1
## 3 SCYL3
## 4 C1orf112
## 5 FGR
## 6 CFH
## 7 FUCA2
## 8 GCLC
## 9 NFYA
## 10 STPG1
## 11 NIPAL3
## 12 LAS1L
## 13 ENPP4
## 14 SEMA3F
## 15 CFTR
CellDMC
## CellDMC ----
pheno.v <- as.numeric(covar.luad$cgc_case_patho_stage_binary)
celldmc.model <- CellDMC(as.matrix(count.luad.filter), pheno.v, CT.Prop.luad.TIMER)
celldmc.smry <- as.data.frame(cbind(celldmc.model[["coe"]][["B_cell"]][["adjP"]],
celldmc.model[["coe"]][["T_cell_CD4"]][["adjP"]],
celldmc.model[["coe"]][["T_cell_CD8"]][["adjP"]],
celldmc.model[["coe"]][["Neutrophil"]][["adjP"]],
celldmc.model[["coe"]][["Macrophage"]][["adjP"]],
celldmc.model[["coe"]][["Myeloid_dentritic_c"]][["adjP"]]))
colnames(celldmc.smry) <- colnames(CT.Prop.luad.TIMER)
celldmc.smry$gene.id <- as.matrix(rownames(celldmc.model[["coe"]][["B_cell"]]))
celldmc.smry <- celldmc.smry[,c(7,1:6)]
celldmc.smry.symbol <- merge(celldmc.smry,G_list, by = "gene.id")
head(celldmc.smry.symbol,15)
## gene.id Astro Exc Inh Micro Oligo OPC
## 1 ENSG00000000003 0.9985790 0.9213590 0.9999137 0.9999787 0.7994531 0.9656499
## 2 ENSG00000000419 0.9985790 0.3765622 0.9999137 0.9999787 0.7994531 0.9656499
## 3 ENSG00000000457 0.9985790 0.8144335 0.9999137 0.9999787 0.8006029 0.9916577
## 4 ENSG00000000460 0.9985790 0.7795219 0.9999137 0.9999787 0.7994531 0.9913214
## 5 ENSG00000000938 0.9985790 0.5010179 0.9999137 0.9999787 0.9860303 0.9656499
## 6 ENSG00000000971 0.9985790 0.7590710 0.9999137 0.9999787 0.9184396 0.9723666
## 7 ENSG00000001036 0.9994527 0.6079064 0.9999137 0.9999787 0.9144004 0.9656499
## 8 ENSG00000001084 0.9985790 0.7491604 0.9999137 0.9999787 0.7994531 0.9892898
## 9 ENSG00000001167 0.9999891 0.9585184 0.9999137 0.9999787 0.7994531 0.9656499
## 10 ENSG00000001460 0.9985790 0.3118733 0.9999137 0.9999787 0.8435897 0.9992170
## 11 ENSG00000001461 0.9985790 0.8897347 0.9999137 0.9999787 0.7994531 0.9656499
## 12 ENSG00000001497 0.9985790 0.4507549 0.9999137 0.9999787 0.9439301 0.9776327
## 13 ENSG00000001561 0.9985790 0.4903285 0.9999137 0.9999787 0.7994531 0.9656499
## 14 ENSG00000001617 0.9985790 0.2025085 0.9999137 0.9999787 0.9549264 0.9942869
## 15 ENSG00000001626 0.9985790 0.6461350 0.9999137 0.9999787 0.7994531 0.9656499
## hgnc_symbol
## 1 TSPAN6
## 2 DPM1
## 3 SCYL3
## 4 C1orf112
## 5 FGR
## 6 CFH
## 7 FUCA2
## 8 GCLC
## 9 NFYA
## 10 STPG1
## 11 NIPAL3
## 12 LAS1L
## 13 ENPP4
## 14 SEMA3F
## 15 CFTR
CARseq
## CARseq ----
carseq.model = run_CARseq(count_matrix = count.luad.filter,
cellular_proportions = CT.Prop.luad.TIMER,
groups = as.factor(covar.luad$cgc_case_patho_stage_binary),
shrunken_lfc = TRUE,
cores = 1,
fix_overdispersion = FALSE,
useSocket = TRUE)
carseq.smry <- as.data.frame(carseq.model$padj)
carseq.smry$gene.id <- rownames(carseq.model$padj)
carseq.smry <- carseq.smry[,c(7,1:6)]
carseq.smry.symbol <- merge(carseq.smry,G_list, by = "gene.id")
head(carseq.smry.symbol,15)
## gene.id Astro Exc Inh Micro Oligo OPC hgnc_symbol
## 1 ENSG00000000003 1 0.18029575 1 1 1 1 TSPAN6
## 2 ENSG00000000419 1 0.08717086 1 1 1 1 DPM1
## 3 ENSG00000000457 1 0.69350773 1 1 1 1 SCYL3
## 4 ENSG00000000460 1 0.21816729 1 1 1 1 C1orf112
## 5 ENSG00000000938 1 0.24611833 1 1 1 1 FGR
## 6 ENSG00000000971 1 0.47869053 1 1 1 1 CFH
## 7 ENSG00000001036 1 0.09292355 1 1 1 1 FUCA2
## 8 ENSG00000001084 1 0.22411438 1 1 1 1 GCLC
## 9 ENSG00000001167 1 0.36186032 1 1 1 1 NFYA
## 10 ENSG00000001460 1 0.55750753 1 1 1 1 STPG1
## 11 ENSG00000001461 1 0.81757233 1 1 1 1 NIPAL3
## 12 ENSG00000001497 1 0.24136670 1 1 1 1 LAS1L
## 13 ENSG00000001561 1 0.54540356 1 1 1 1 ENPP4
## 14 ENSG00000001617 1 0.18288062 1 1 1 1 SEMA3F
## 15 ENSG00000001626 1 0.28204673 1 1 1 1 CFTR