csDEG Application Demo

Leslie Meng

2023-03-16

Data Loading and Description:

Our simulated dataset consists of 100 samples per group (case/control) and 2,000 total features (genes). The bulk level data comprises of six distinct cell types (Celltype1 - Celltype6), each has specific underlying cell-specific gene expression patterns. Additionally, 5% of the cell-specific Differentially Expressed Genes (csDEGs) are simulated for each cell type. This results in 100 csDEGs per cell type and a total of 600 csDEGs overall. The Log-Fold-Changes (LFC) for these csDEGs are simulated from a normal distribution with a mean of 1.5 and standard deviation of 0.2, i.e., LFC ~N(1.5, 0.2).

The RNAseq_final_count dataset is a matrix containing simulated raw RNA-sequencing count data with 2000 genes and 200 samples.

The sample_CT_prop and est_CT_prop datasets represent sample-specific true cell type abundance and estimated cell type abundance, respectively. The true cell type abundance is generated from a Dirichlet distribution, while the estimated cell type abundance is derived from CIBERSORT.

In this session, we will demonstrate how to conduct csDEGs analysis using eight different methods: TOAST, CellDMC, TCA, CARseq, DESeq2, CeDAR, csSAM, and LRCDE. For a detailed comparison of these eight methods, please refer to this comprehensive benchmark study: A comprehensive assessment of cell type-specific differential expression methods in bulk data

For your reference, we have included gene_CT_DE_connect, which contains the true label information for csDEGs. Obtaining such ground-truth information is not always feasible in a real research setting. However, if you are interested in evaluating the performance of your models based on metrics such as sensitivities and true discovery rates after classification, this dataframe can be helpful. The gene.id variable indicates the gene index (from gene1 to gene2000), while the csDE.True variable shows the locations of DEGs (Celltype1-Celltype6). A value of 1 indicates that the DEG is located in Celltype 1, a value of 2 indicates Celltype 2, and so on. A value of 0 means that the gene is not a DEG.

library(TOAST) #TOAST/CeDAR
library(EpiDISH) #CellDMC
library(TCA) # TCA
library(CARseq) #CARseq
library(DESeq2) #DESeq2
library("lrcde")
library(csSAM)
library(dplyr)
library(lubridate)
library(tidyr)
library("recount")
library(data.table)
library(GEOquery)
library(stringr)
library(biomaRt)
library(org.Hs.eg.db)
library(annotate)
library(EnsDb.Hsapiens.v86) # 1nd method match gene symbols
library(annotables) # 2nd method match gene symbols
library(AnnotationDbi) # 3rd method match gene symbols
library(immunedeconv)
library(IOBR)
# Data Simulated Load ----
load("Sim_rawRNAseq_N100_LFC1.5_0.2.Rda")

Here lets take a glance of the RNA-seq count data:

# The first 10 genes and 10 samples
print(RNAseq_final_count[1:10,1:10])

##    control1 case1 control2 case2 control3 case3 control4 case4 control5 case5
## 1       305   442      255   588      235   456      228   617      298   446
## 2       131   205      137   261      120   222      109   210      114   245
## 3       434   325      399   290      603   292      466   260      426   283
## 4        73    67       40    93       53    67       56    94       47    82
## 5        26    32       21    54       22    59       26    53       22    41
## 6        81   135       88   125      102    96      106    98       78   102
## 7         5     4       14     9       11    13       12    11       13    13
## 8        11    12        9    26       12    19       13    13        6    21
## 9       146   155      131   159      155   135      141   159      131   161
## 10      158   139      208   323      111   133      180   121      172   130

True csDEGs labels information:

# True DEG label and expression strata information
gene_CT_DE_connect <- gene_CT_DE_connect %>% dplyr::rename(gene.id = Gene_ID) %>% 
  dplyr::select(gene.id, csDE.True)
print(gene_CT_DE_connect[c(1:3,101:103,201:203,301:303,401:403,501:503,601:603),])

##     gene.id csDE.True
## 1         1         1
## 2         2         1
## 3         3         1
## 101     101         2
## 102     102         2
## 103     103         2
## 201     201         3
## 202     202         3
## 203     203         3
## 301     301         4
## 302     302         4
## 303     303         4
## 401     401         5
## 402     402         5
## 403     403         5
## 501     501         6
## 502     502         6
## 503     503         6
## 601     601         0
## 602     602         0
## 603     603         0

TOAST

Before running the algorithm, it’s important to set up the working pipelines, which should include sample phenotypical information, covariates information, cell proportions, and other relevant factors. If available and meaningful for the study, users can also customize their input by including additional information such as age, sex, and gender in the model. However, for our purposes, we will focus solely on detecting csDEGs in case and control groups.

makeDesign: generates linear model design matrix and prepare for model fitting and statistical testing. It requires two inputs: design and Prop. design is a samples by phenotypes (age, gender, disease, etc.) matrix. Prop is a samples by cell type proportion matrix.

fitModel: fits the model including all cell type proportion matrix and its interaction with phenotypes. This function requires two inputs: Design_out and Y. Design_out is the output from makeDesign and Y is a gene by sample raw RNAseq count data matrix.

csTest: conducts statistical tests for specified phenotype with respect to each/all cell type(s). It requires fitted_model input which is the output from fitModel. Users could specify phenotype names (e.g. “disease”,“gender”) into coef. If cell_type is set to “NULL” or specified as “ALL”, compound effect of coefficients in all cell types will be tested; or user could specify one specific cell type (e.g. “cell1”) to be tested.

sim.design <- data.frame(disease = factor(rep(c(1,2),100))
                         #,gender = as.factor(rep(c("M","F"),each = 100))
                         )

sim.Design_out <- makeDesign(design = sim.design, Prop = est_CT_prop)
sim.fitted_model_strata <- fitModel(Design_out = sim.Design_out, 
                                    Y = as.matrix(RNAseq_final_count))
sim.res.TOAST_strata <- csTest(fitted_model = sim.fitted_model_strata, 
                               coef = "disease",
                               cell_type = NULL, verbose = F, sort = T)
# Return a list and each element contains a matrix including the results from 
# testing the phenotype in specified cell type(s). Below show the results for 
# celltype 1 for the first 10 genes ordered by adjusted p value.
names(sim.res.TOAST_strata)

## [1] "Celltype1" "Celltype2" "Celltype3" "Celltype4" "Celltype5" "Celltype6"
## [7] "joint"

sim.res.TOAST_strata$Celltype1[1:10,]

##         beta  beta_var        mu effect_size f_statistics       p_value
## 83 10843.656 18101.836 2076.3503    1.446172     6495.743 9.758826e-148
## 60  5915.421  9510.757 1261.6126    1.401983     3679.224 2.139365e-125
## 16  9373.780 24728.101 1563.0064    1.499830     3553.356 4.802484e-124
## 38  3828.761  4496.013  766.1246    1.428373     3260.535 1.022123e-120
## 89  8307.551 21485.076 2841.6309    1.187572     3212.249 3.848863e-120
## 80  4162.086  7668.555  919.4395    1.387139     2258.960 1.050431e-106
## 35  2962.700  4520.129  570.1162    1.444187     1941.889 4.891123e-101
## 19  2260.829  3369.388  884.2274    1.122199     1516.996  6.000874e-92
## 85  3266.375  7198.074  590.2584    1.469060     1482.231  4.166486e-91
## 10  1544.094  1777.648  360.6128    1.363246     1341.225  1.670974e-87
##              fdr
## 83 1.951765e-144
## 60 2.139365e-122
## 16 3.201656e-121
## 38 5.110613e-118
## 89 1.539545e-117
## 80 3.501436e-104
## 35  1.397464e-98
## 19  1.500219e-89
## 85  9.258858e-89
## 10  3.341947e-85

The outcomes consist of matrices that contain the results of testing the phenotype (disease) in specific cell types. We then extract and summarize the cell type-specific testing results, including the adjusted p-values, from the output. Researchers have the option to customize the p-value threshold for identifying csDEGs for each cell type.

aa <- lapply(seq(1:6), function(i,x) {
  x[[i]]$gene.id  <- (rownames(x[[i]]))
  x[[i]] <- x[[i]] %>% dplyr::select(gene.id,fdr) 
  names(x[[i]]) <- c("gene.id",names(x)[i])
  assign(names(x)[i],x[[i]], envir=.GlobalEnv)
},x=sim.res.TOAST_strata) 
remove(aa)

TOAST.outputs <- Reduce(function(x,y) merge(x = x, y = y, by = "gene.id",all.x =T),
                        list(Celltype1, Celltype2, Celltype3, 
                             Celltype4, Celltype5, Celltype6, 
                             gene_CT_DE_connect))
TOAST.outputs = TOAST.outputs[order(as.numeric(TOAST.outputs$gene.id)),]
rownames(TOAST.outputs) = TOAST.outputs$gene.id
print(TOAST.outputs[c(1:5,101:105,601:605),])

##     gene.id    Celltype1    Celltype2   Celltype3    Celltype4    Celltype5
## 1         1 1.029246e-16 7.813553e-01 0.727374367 1.132910e-01 2.711014e-35
## 2         2 2.019897e-31 4.016448e-03 0.229712533 1.415307e-02 1.048191e-01
## 3         3 1.514625e-07 2.205288e-14 0.076707649 2.829022e-01 5.457391e-06
## 4         4 9.394208e-02 5.410599e-01 0.886704734 5.180522e-01 1.645431e-03
## 5         5 3.736756e-06 5.314524e-01 0.814041361 2.465622e-01 1.580457e-01
## 101     101 2.803460e-01 2.027197e-06 0.143076203 2.987690e-01 2.253781e-02
## 102     102 3.204534e-02 6.070946e-18 0.719043851 3.131572e-02 1.073527e-01
## 103     103 7.071813e-01 6.080519e-14 0.996174877 9.862629e-01 3.742248e-13
## 104     104 5.911943e-01 1.692564e-11 0.125043947 1.017404e-01 2.535746e-01
## 105     105 8.059386e-01 1.374500e-49 0.010819419 1.905337e-02 8.970970e-03
## 601     601 9.652161e-01 4.806782e-01 0.855788854 6.466708e-06 3.822834e-01
## 602     602 2.755434e-02 3.314457e-01 0.007801243 6.015588e-01 1.261172e-03
## 603     603 7.997659e-01 3.001327e-01 0.982863589 1.825545e-03 1.785221e-03
## 604     604 4.176650e-01 8.729935e-01 0.725130106 1.806278e-04 1.091866e-01
## 605     605 6.951382e-01 9.038723e-01 0.946732381 6.547847e-01 5.795991e-01
##        Celltype6 csDE.True
## 1   0.8162655314         1
## 2   0.0020267402         1
## 3   0.6134432745         1
## 4   0.9993615717         1
## 5   0.5492940034         1
## 101 0.0846483423         2
## 102 0.2165157114         2
## 103 0.5125835458         2
## 104 0.5806145448         2
## 105 0.5125835458         2
## 601 0.3341435694         0
## 602 0.5558696323         0
## 603 0.0002083174         0
## 604 0.5829995797         0
## 605 0.5572015028         0

CellDMC

CellDMC is a tool specifically designed for identifying cell-type-specific differentially methylated (csDM) genes in Epigenome-Wide Association Studies (EWAS). To use the tool, users must specify three input files: beta.m (a matrix with genes in rows and samples in columns), pheno.v (a vector with the phenotype information, e.g., 0 for healthy and 1 for disease), and frac.m (a matrix with cell abundances for each sample, where the row sums should be close to or equal to 1). The row order in frac.m must be identical to the column order in beta.m

# Run Model
pheno.v <- rep(c(0, 1), times = 100)
celldmc.o <- CellDMC(beta.m = RNAseq_final_count, pheno.v = pheno.v, 
                     frac.m = est_CT_prop)
# CellDMC returns a list. Results for each cell type are nested under "coe"

The output of the analysis is a list that includes two components: “dmct” and “coe.” The “dmct” component is a matrix that shows which genes are DE or not and indicates whether a gene is a csDEG for a specific cell type. The first column of “dmct” indicates whether a gene is a DEG (1) or not (0). The remaining columns correspond to each cell type, and a value of 1 indicates that the gene is over-expressed compared to the control group, while a value of -1 indicates that the gene is under-expressed. A value of 0 indicates that the gene is not a csDEG. The rows of “dmct” are ordered in the same way as those in the input beta.m.

names(celldmc.o)

## [1] "dmct" "coe"

print(celldmc.o$dmct[c(1:5,101:105,301:305,701:710),])

##     DMC Celltype1 Celltype2 Celltype3 Celltype4 Celltype5 Celltype6
## 1     1         1         0         0         0         1         0
## 2     1         1        -1         0        -1         0         1
## 3     1         1        -1         0         0        -1         0
## 4     1         0         0         0         0         1         0
## 5     1         1         0         0         0         0         0
## 101   1         0         1         0         0         1         0
## 102   1         1         1         0         1         0         0
## 103   1         0         1         0         0        -1         0
## 104   1         0         1         0         0         0         0
## 105   1         0         1         1         1         1         0
## 301   1         0         0         0         1         0         0
## 302   1         0         0         0         1        -1         0
## 303   1         0         0         0         1         0         0
## 304   1         0         0         0         1         0         0
## 305   1         0         0         0         1         0         0
## 701   0         0         0         0         0         0         0
## 702   0         0         0         0         0         0         0
## 703   0         0         0         0         0         0         0
## 704   1         0         0         0         0         1         0
## 705   1         0         0        -1         0         1         0
## 706   0         0         0         0         0         0         0
## 707   1         0         0         0        -1        -1         1
## 708   0         0         0         0         0         0         0
## 709   1         0         0         0         0         1        -1
## 710   1         0         0         0         0         1         0

The second component of the output, “coe,” is a list that contains multiple data frames, each corresponding to a specific cell type. Each data frame includes the estimated RNAseq changes (“Estimate”), the standard error (“SE”), multiple hypothesis-corrected p-values, and other relevant information.

names(celldmc.o$coe)

## [1] "Celltype1" "Celltype2" "Celltype3" "Celltype4" "Celltype5" "Celltype6"

head(celldmc.o$coe$Celltype1,10)

##      Estimate       SE         t            p         adjP
## 1   720.89235 74.72128  9.647752 3.808212e-18 1.015523e-16
## 2   647.38874 44.00735 14.710922 4.241783e-33 2.019897e-31
## 3   370.35494 61.95166  5.978128 1.113249e-08 1.494294e-07
## 4    52.18125 23.12261  2.256719 2.517648e-02 9.307385e-02
## 5    89.53427 16.89504  5.299440 3.232294e-07 3.694050e-06
## 6    70.88153 30.16143  2.350072 1.980672e-02 7.722269e-02
## 7    28.55231 10.44652  2.733189 6.871234e-03 3.327474e-02
## 8    61.52042  9.88528  6.223437 3.092587e-09 4.482011e-08
## 9   151.73516 37.81378  4.012695 8.660082e-05 6.661602e-04
## 10 1544.09394 42.16216 36.622742 1.670974e-87 3.341947e-85

Pull out and summarize cs-testing results (adjusted p-value and csDEGs):

CellDMC.outputs <- as.data.frame(
  cbind(celldmc.o[["coe"]][["Celltype1"]][["adjP"]],
        celldmc.o[["coe"]][["Celltype2"]][["adjP"]],
        celldmc.o[["coe"]][["Celltype3"]][["adjP"]],
        celldmc.o[["coe"]][["Celltype4"]][["adjP"]],
        celldmc.o[["coe"]][["Celltype5"]][["adjP"]],
        celldmc.o[["coe"]][["Celltype6"]][["adjP"]]))

colnames(CellDMC.outputs)[1:6] <- paste0("Celltype",seq(1:6))
CellDMC.outputs$gene.id <- rownames(celldmc.o[["coe"]][["Celltype1"]])

CellDMC.outputs <- CellDMC.outputs[,c(7,1:6)]

CellDMC.outputs <- merge(x = CellDMC.outputs, y = gene_CT_DE_connect, 
                         by = "gene.id",all.x =T)
###
CellDMC.outputs = CellDMC.outputs[order(as.numeric(CellDMC.outputs$gene.id)),]
rownames(CellDMC.outputs) = CellDMC.outputs$gene.id
head(CellDMC.outputs,10)

##    gene.id    Celltype1    Celltype2  Celltype3  Celltype4    Celltype5
## 1        1 1.015523e-16 7.663098e-01 0.71444741 0.11105202 2.711014e-35
## 2        2 2.019897e-31 3.963368e-03 0.22519853 0.01404364 1.034232e-01
## 3        3 1.494294e-07 2.205288e-14 0.07587987 0.27402185 5.441754e-06
## 4        4 9.307385e-02 5.283095e-01 0.87677458 0.50927162 1.638604e-03
## 5        5 3.694050e-06 5.180646e-01 0.80012095 0.23975269 1.556115e-01
## 6        6 7.722269e-02 6.271143e-01 0.85896731 0.24238137 9.971349e-01
## 7        7 3.327474e-02 9.958517e-01 0.89451333 0.58985647 2.582935e-01
## 8        8 4.482011e-08 5.104551e-01 0.72590599 0.02171998 6.948185e-01
## 9        9 6.661602e-04 1.386175e-02 0.42131743 0.73418276 8.911668e-01
## 10      10 3.341947e-85 6.417007e-01 0.38704273 0.34808716 1.433807e-04
##      Celltype6 csDE.True
## 1  0.802875687         1
## 2  0.002026740         1
## 3  0.599142748         1
## 4  0.999361572         1
## 5  0.535134425         1
## 6  0.001053128         1
## 7  0.536671252         1
## 8  0.455505373         1
## 9  0.981017167         1
## 10 0.414298281         1

TCA

Tensor Composition Analysis (TCA) is a method designed for DNA methylation data that enables the deconvolution of 2-dimensional data (genes by samples) derived from a mixture of K sources (cell types) into a 3-dimensional matrix of signals. TCA also allows for the testing of specific genes in the data for different statistical relations with an outcome of interest while modeling source-specific effects (TCA regression). In particular, it enables the identification of statistical relations between source-specific signals and an outcome of interest. For further details, please refer to the following link:.

tca enables the modeling of the RNAseq of each sample as a mixture of cell-type-specific RNAseq that are unique to the sample. It also allows for the statistical testing of the effects of covariates and phenotypes at the cell type level.

X a gene by sample data matrix, which should include row and column names. Please note that NA values are not supported, and features cannot be constant across all observations (sometimes it may be necessary to trim the dataset in real data analysis).

W input is a sample by cell type matrix (proportion matrix), and each cell must be positive. The row sums should be 1 across cell types for all samples. W should also include row and column names, and NA values are not supported.

C1 input is a sample by phenotypical covariate matrix, and the covariate may affect the hidden source-specific values. C1 must include row and column names as well.

sim.design <- as.data.frame(rep(c(1,2),100))
colnames(sim.design) <- "disease"
rownames(sim.design) <- rownames(est_CT_prop)
### Run model
tca.mdl.hannum.rnaseq <- tca(X = RNAseq_final_count,
                             W = est_CT_prop, refit_W = FALSE,
                             C1 = sim.design, verbose = FALSE)

##  [1] "W"                      "mus_hat"                "sigmas_hat"            
##  [4] "tau_hat"                "deltas_hat"             "gammas_hat"            
##  [7] "deltas_hat_pvals"       "gammas_hat_pvals"       "gammas_hat_pvals.joint"
## [10] "C1"                     "C2"

The function returns a list of estimated model parameters. “W” is a sample by cell type proportion matrix, and if “refit_w==TRUE”, then this is the re-estimated “W”, otherwise it is the input W. “gamma_hat” is a gene by cell types*phenotypical covariates (interactions) matrix of the estimated effects of the covariates in C1 on each genetic feature in X. “gamma_hat_pvals” is a gene by cell types*phenotypical covariates matrix of p-values for the estimates in “gamma_hat” based on a T-test. “gammas_hat_pvals.joint” is a gene by phenotypical covariates matrix of p-values for the joint effects of each phenotypical covariate in C1 on each genetic feature in X, across all cell types.

tca.mdl.hannum.rnaseq$gammas_hat[1:10,]

##    Celltype1.disease Celltype2.disease Celltype3.disease Celltype4.disease
## 1          784.22306        30.2847840        -97.906578       -113.485828
## 2          656.13090      -276.6065342         84.915921        -98.655051
## 3          374.71811     -1110.7511702        183.839474         54.506216
## 4           51.60151        61.3370465        -15.591577        -20.571400
## 5           88.96086        40.1912526        -11.565838        -21.617242
## 6           71.67337        42.1897088         -8.808229         27.706714
## 7           28.08200         0.2764637         -5.469976         -7.822591
## 8           61.60371        25.6929457         -8.722172        -22.520150
## 9          146.44423      -244.9431658         69.896175         18.729986
## 10        1533.08346       -88.3231045        -75.967708        -26.777691
##    Celltype5.disease Celltype6.disease
## 1         958.492382         34.719485
## 2          99.414460        127.490024
## 3        -312.645371        -40.333640
## 4          82.464520          1.090204
## 5          29.727019        -14.487409
## 6         -14.580426        -84.444257
## 7         -12.784730          9.548939
## 8           7.132471          8.669879
## 9          -7.164853         -6.917363
## 10        168.214846        -47.677302

Here we extract p-values for the estimates in gammas_hat:

TCA.outputs <- as.data.frame(tca.mdl.hannum.rnaseq[["gammas_hat_pvals"]])
TCA.outputs$gene.id <- (rownames(TCA.outputs))
colnames(TCA.outputs)[1:6] <- paste0("Celltype",seq(1:6))
TCA.outputs <- TCA.outputs[,c(7,1:6)]

TCA.outputs <- merge(x = TCA.outputs, y = gene_CT_DE_connect, 
                     by = "gene.id",all.x =T)
TCA.outputs = TCA.outputs[order(as.numeric(TCA.outputs$gene.id)),]
rownames(TCA.outputs) = TCA.outputs$gene.id
head(TCA.outputs,10)

##    gene.id    Celltype1    Celltype2  Celltype3    Celltype4    Celltype5
## 1        1 9.912568e-23 9.712202e-01 0.21439015 0.0195782032 2.849345e-33
## 2        2 9.295390e-10 9.354472e-04 0.08586113 0.0006870254 5.483400e-02
## 3        3 1.850755e-09 2.235645e-11 0.01444532 0.2190124740 1.836655e-07
## 4        4 1.004905e-01 4.448617e-01 0.61779720 0.2058142064 1.166022e-04
## 5        5 9.994407e-05 2.483426e-01 0.55567238 0.1334958334 6.734446e-02
## 6        6 7.489067e-03 2.500240e-01 0.77233266 0.3851258227 3.362693e-01
## 7        7 5.234669e-02 9.111096e-01 0.83322621 0.5212428286 1.894777e-01
## 8        8 7.970820e-08 2.033746e-01 0.20980690 0.0026481470 4.741161e-01
## 9        9 5.218804e-03 2.088747e-03 0.13211519 0.5097266004 8.091741e-01
## 10      10 1.397981e-74 1.402615e-01 0.08312005 0.4680263507 3.599102e-05
##       Celltype6 csDE.True
## 1  0.6499155206         1
## 2  0.0001536946         1
## 3  0.6189171146         1
## 4  0.7910781596         1
## 5  0.4826790774         1
## 6  0.0002578700         1
## 7  0.8241765316         1
## 8  0.1638095538         1
## 9  0.9345749350         1
## 10 0.9442792351         1

Here is an alternative way to fit TCA model depending on a slightly different model assumption. However, results are similar based on different assumptions. For details please refer TCA Manual

### Run model/another way to fit TCA
# tca.mdl.hannum.rnaseq <- tca(X = RNAseq_final_count,
#                              W = est_CT_prop, refit_W = FALSE,
#                              verbose = FALSE)
# tca.reg = tcareg(RNAseq_final_count,tca.mdl.hannum.rnaseq,sim.design)
# TCA.outputs2 <- as.data.frame(tca.reg[["pvals"]])
# TCA.outputs2$gene.id <- as.numeric(rownames(TCA.outputs2))
# colnames(TCA.outputs2)[1:6] <- paste0("Celltype",seq(1:6))
# TCA.outputs2 <- TCA.outputs2[,c(7,1:6)]
# TCA.outputs2 <- merge(x = TCA.outputs2, y = gene_CT_DE_connect, 
#                      by = "gene.id",all.x =T)

CARseq

CARseq is the most rigorous statistical model in cell type aware analysis of RNA-sequencing data. The inputs are bulk RNA-seq data (count_matrix) and cell type fraction estimates (cellular_proportions) for a set of samples, and relevant variables of these samples (groups), include a variable of interest for differential expression analysis as well as covariates.

# This method requires approximately 10 minutes to complete.
res.CARseq = run_CARseq(count_matrix = RNAseq_final_count,
                        cellular_proportions = est_CT_prop,
                        groups = as.factor(rep(c(1,2),100)),
                        shrunken_lfc = TRUE,
                        cores = 1,
                        fix_overdispersion = FALSE,
                        useSocket = TRUE)

The function returns a list of mostly matrices. “padj” is a genes by cell types matrix of adjusted p-values by Benjamini & Hochberg (BH) method. “shrunken_lfc” is a genes by cell types matrix of shrunken cell-specific LFC effect between groups and “lfc” is non-shrunken LFC effects.

names(res.CARseq)

##  [1] "p"                       "padj"                   
##  [3] "shrunken_lfc"            "shrunken_lfcSE"         
##  [5] "shrunken_coefficients"   "shrunken_coefficientsSE"
##  [7] "lfc"                     "lfcSE"                  
##  [9] "coefficients"            "coefficientsSE"         
## [11] "overdispersion"          "lambda"                 
## [13] "elapsed_time"

head(res.CARseq$padj)

##      Celltype1    Celltype2 Celltype3   Celltype4    Celltype5    Celltype6
## 1 6.202366e-20 8.833948e-01 0.6042996 0.109648084 1.902985e-38 1.000000e+00
## 2 6.178139e-33 1.666764e-01 0.4737238 0.004323851 3.361490e-01 8.093820e-06
## 3 5.827273e-10 1.609231e-14 0.2168014 0.648206023 1.462221e-03 3.961132e-01
## 4 1.450619e-01 5.074261e-01 0.7887477 0.545012720 4.203383e-03 9.448435e-01
## 5 4.458972e-05 5.957874e-01 0.8197820 0.215714744 1.935064e-01 6.198377e-01
## 6 7.416990e-02 7.112412e-01 0.8377204 0.245918659 1.000000e+00 1.170872e-03

head(res.CARseq$shrunken_lfc)

##   2_vs_1:Celltype1 2_vs_1:Celltype2 2_vs_1:Celltype3 2_vs_1:Celltype4
## 1        1.4284198       -0.1093055      -0.12588312       -0.2405679
## 2        1.8838144       -0.7674251       0.21158909       -0.7490455
## 3        1.2517946       -1.1667454       0.09199791        0.1630767
## 4        1.1499341        0.4789599      -0.09534685       -0.2457471
## 5        1.4596118        0.6860330      -0.09578678       -0.7749907
## 6        0.5806996        0.4140480      -0.14246080        0.1737755
##   2_vs_1:Celltype5 2_vs_1:Celltype6
## 1     0.5642154064       0.03450534
## 2     0.0008674781       0.62648925
## 3    -0.0010523342      -0.27048491
## 4     0.3567752641       0.06114347
## 5     0.7731222294      -0.17263066
## 6    -0.0915354251      -0.76483178

head(res.CARseq$lfc)

##   2_vs_1:Celltype1 2_vs_1:Celltype2 2_vs_1:Celltype3 2_vs_1:Celltype4
## 1        1.5705257      -0.06288303       -2.7299993       -0.2970589
## 2        1.9047142             -Inf        0.2439680       -0.9970980
## 3        1.5401461      -1.47182438        0.2168885        0.2809325
## 4        1.9516849       0.73475923       -0.4760905       -0.3760920
## 5        1.5512505       1.26164776       -0.6250052       -1.1897895
## 6        0.6363702       0.77308748       -0.2939310        0.1712822
##   2_vs_1:Celltype5 2_vs_1:Celltype6
## 1       0.56733958        0.7544951
## 2              Inf        0.6120254
## 3             -Inf       -0.1787770
## 4       0.36233572        0.1032475
## 5       1.04053205       -0.1709291
## 6      -0.05493189       -0.7772930

Here we extract adjusted p-values for the estimates:

CARseq.outputs <- as.data.frame(res.CARseq$padj)
CARseq.outputs$gene.id <- rownames(res.CARseq$padj)
CARseq.outputs <- CARseq.outputs[,c(7,1:6)]
CARseq.outputs <- merge(x = CARseq.outputs, y = gene_CT_DE_connect, 
                        by = "gene.id",all.x =T)
CARseq.outputs = CARseq.outputs[order(as.numeric(CARseq.outputs$gene.id)),]
rownames(CARseq.outputs) = CARseq.outputs$gene.id
head(CARseq.outputs,10)

##    gene.id    Celltype1    Celltype2 Celltype3   Celltype4    Celltype5
## 1        1 6.202366e-20 8.833948e-01 0.6042996 0.109648084 1.902985e-38
## 2        2 6.178139e-33 1.666764e-01 0.4737238 0.004323851 3.361490e-01
## 3        3 5.827273e-10 1.609231e-14 0.2168014 0.648206023 1.462221e-03
## 4        4 1.450619e-01 5.074261e-01 0.7887477 0.545012720 4.203383e-03
## 5        5 4.458972e-05 5.957874e-01 0.8197820 0.215714744 1.935064e-01
## 6        6 7.416990e-02 7.112412e-01 0.8377204 0.245918659 1.000000e+00
## 7        7 8.535865e-03 9.181916e-01 0.7997128 0.561945232 3.361074e-01
## 8        8 8.070407e-07 6.192936e-01 0.7183726 0.032954096 8.026218e-01
## 9        9 5.762145e-05 2.082161e-02 0.4870788 0.821984401 1.000000e+00
## 10      10 1.041724e-86 4.663815e-01 0.8207524 0.242055902 7.932625e-04
##       Celltype6 csDE.True
## 1  1.000000e+00         1
## 2  8.093820e-06         1
## 3  3.961132e-01         1
## 4  9.448435e-01         1
## 5  6.198377e-01         1
## 6  1.170872e-03         1
## 7  5.471940e-01         1
## 8  3.908968e-01         1
## 9  1.000000e+00         1
## 10 5.525047e-01         1

DESeq2

The goal of DESeq2 is to estimate variance-mean dependence in count data from high-throughput sequencing assays and test for DEG based on a model using the negative binomial distribution. Although DESeq2 is not specifically designed for csDEGs analysis, it could achieve this goal by incorporating the cell type proportions as interaction covariates in the model.

DESeqDataSetxxxx is used to store the input values, intermediate calculations, and results of an analysis of differential expression. The DESeqDataSet class enforces non-negative integer values in the “counts” matrix stored as countData for the first input below. colData is a matrix (or dataframe) which should includes all covariates to be tested in the design. A formula which specifies the design of the experiment must be provided into design.

DESeq is the function conducting differential expression analysis based on the negative binomial distribution. DESeq function returns a DESeqDataSet object, results tables (log2 fold changes and p-values) can be generated using the results function. Shrunken LFC can then be generated using the lfcShrink function.

# create coldata for model fitting
coldata = cbind(disease = rep(c(0,1),times=100), est_CT_prop)
#be sure to delete one CT to make the design matrix full rank
dds.bl <- DESeqDataSetFromMatrix(countData = RNAseq_final_count, 
                                 colData = coldata,  
                                 design= ~ 0 + Celltype1 + Celltype2 + Celltype3 + 
                                   Celltype4 + Celltype5 + Celltype6 +
                                   disease:Celltype1 + disease:Celltype2 + 
                                   disease:Celltype3 + disease:Celltype4 + 
                                   disease:Celltype5 + disease:Celltype6)
tic()
dds.bl <- DESeq(dds.bl)
toc()

dds.bl

## class: DESeqDataSet 
## dim: 2000 200 
## metadata(1): version
## assays(4): counts mu H cooks
## rownames(2000): 1 2 ... 1999 2000
## rowData names(62): baseMean baseVar ... deviance maxCooks
## colnames(200): control1 case1 ... control100 case100
## colData names(8): disease Celltype1 ... Celltype6 sizeFactor

Here we extract adjusted p-values for the estimates by results function:

# Celltype1
res1.bl <- as.data.frame(results(dds.bl, name = "Celltype1.disease"))
res1.bl$padj <- ifelse(is.na(res1.bl$padj), 1, res1.bl$padj)
res1.bl$gene.id <- (rownames(res1.bl))
Celltype1.bl <- res1.bl %>% dplyr::select(gene.id, padj) %>% 
  dplyr::rename(Celltype1 = padj)

# Celltype2
res2.bl <- as.data.frame(results(dds.bl, name = "Celltype2.disease"))
res2.bl$padj <- ifelse(is.na(res2.bl$padj), 1, res2.bl$padj)
res2.bl$gene.id <- (rownames(res2.bl))
Celltype2.bl <- res2.bl %>% dplyr::select(gene.id, padj) %>% 
  dplyr::rename(Celltype2 = padj)

# Celltype3
res3.bl <- as.data.frame(results(dds.bl, name = "Celltype3.disease"))
res3.bl$padj <- ifelse(is.na(res3.bl$padj), 1, res3.bl$padj)
res3.bl$gene.id <- (rownames(res3.bl))
Celltype3.bl <- res3.bl %>% dplyr::select(gene.id, padj) %>% 
  dplyr::rename(Celltype3 = padj)

# Celltype4
res4.bl <- as.data.frame(results(dds.bl, name = "Celltype4.disease"))
res4.bl$padj <- ifelse(is.na(res4.bl$padj), 1, res4.bl$padj)
res4.bl$gene.id <- (rownames(res4.bl))
Celltype4.bl <- res4.bl %>% dplyr::select(gene.id, padj) %>% 
  dplyr::rename(Celltype4 = padj)

# Celltype5
res5.bl <- as.data.frame(results(dds.bl, name = "Celltype5.disease"))
res5.bl$padj <- ifelse(is.na(res5.bl$padj), 1, res5.bl$padj)
res5.bl$gene.id <- (rownames(res5.bl))
Celltype5.bl <- res5.bl %>% dplyr::select(gene.id, padj) %>% 
  dplyr::rename(Celltype5 = padj)

# Celltype6
res6.bl <- as.data.frame(results(dds.bl, name = "Celltype6.disease"))
res6.bl$padj <- ifelse(is.na(res6.bl$padj), 1, res6.bl$padj)
res6.bl$gene.id <- (rownames(res6.bl))
Celltype6.bl <- res6.bl %>% dplyr::select(gene.id, padj) %>% 
  dplyr::rename(Celltype6 = padj)

DESeq2.outputs <- Reduce(function(x,y) merge(x = x, y = y, by = "gene.id",all.x =T),
                         list(Celltype1.bl, Celltype2.bl, Celltype3.bl, 
                              Celltype4.bl, Celltype5.bl, Celltype6.bl, 
                              gene_CT_DE_connect))
DESeq2.outputs = DESeq2.outputs[order(as.numeric(DESeq2.outputs$gene.id)),]
rownames(DESeq2.outputs) = DESeq2.outputs$gene.id
head(DESeq2.outputs,10)

##    gene.id     Celltype1    Celltype2  Celltype3  Celltype4    Celltype5
## 1        1  4.477931e-18 8.406315e-01 0.90677407 0.04164153 6.762314e-08
## 2        2  1.487284e-26 2.142558e-02 0.76992896 0.02819519 4.522998e-06
## 3        3  2.206552e-04 1.092036e-13 0.02245169 0.61624271 4.483315e-11
## 4        4  1.190654e-01 7.094800e-01 0.77012465 0.62513041 2.433349e-01
## 5        5  5.432354e-04 6.297119e-01 0.84516203 0.36336180 1.406307e-01
## 6        6  2.905903e-01 7.891463e-01 0.71505031 0.61100180 9.520691e-01
## 7        7  5.060034e-02 9.909691e-01 0.79018056 0.51686683 2.855007e-01
## 8        8  1.124914e-05 6.524908e-01 0.65949497 0.02815057 5.774959e-01
## 9        9  1.047052e-02 4.549209e-03 0.79609517 0.71503956 8.736590e-01
## 10      10 2.243256e-102 7.234885e-02 0.94103752 0.84391555 1.821115e-02
##      Celltype6 csDE.True
## 1  0.003242970         1
## 2  0.090274614         1
## 3  0.902579287         1
## 4  0.822409279         1
## 5  0.220237822         1
## 6  0.001884162         1
## 7  0.531998812         1
## 8  0.519704799         1
## 9  0.947377225         1
## 10 0.545832319         1

CeDAR

CeDAR is used to test csDE signals for phenotypical covariates by considering DE state correlation between cell types. This method provides posterior probability of whether a genetic feature is DE in certain cell types given observed bulk data. When sample size is small or technical noise is large, this tree structure is recommended. We could also incorporate such correlation into csDE detection to improve the power, especially in cell types with low abundance.

cedar is incorporated as a function in TOAST package.

Y_raw is a genes by samples matrix of observed bulk data. prop is a samples by cell types matrix of cell type abundance. design.1 includes covariates with cell type specific effect; while design.2 is for covarites without cell type specific effect (sample by covariate vector/matrix). factor.to.test specifies which covariates to be tested (from design.1/design2).

sim.design <- as.data.frame(factor(rep(c(1,2),100)))
colnames(sim.design) <- "disease"
# about 30 secs
res_cedar <- cedar(Y_raw = as.matrix(RNAseq_final_count), 
                   prop = est_CT_prop, 
                   design.1 = sim.design,
                   factor.to.test = 'disease')

The output is a list. “toast_res” is results of csTest() function from TOAST package. “tree_res” is a matrix of cell-type specific posterior probability of each genetic feature.

names(res_cedar)

## [1] "toast_res" "tree_res"  "time_used"

We can have posterior probability of DE for each genetic feature in each cell type from a more complex or from a simpler tree structure:

head(res_cedar$tree_res$full$pp)

##   Celltype1 Celltype2 Celltype3 Celltype4 Celltype5 Celltype6
## 1 1.0000000 0.7145110 0.5871622 0.9198576 1.0000000 0.4404046
## 2 1.0000000 0.9975011 0.8970120 0.9875514 0.9764860 0.9989232
## 3 1.0000000 1.0000000 0.9696668 0.6877684 0.9999990 0.4554661
## 4 0.8326793 0.5400587 0.3627337 0.5200247 0.9994420 0.3784331
## 5 0.9999986 0.5984155 0.4308709 0.7304736 0.8177801 0.4906431
## 6 0.8924257 0.5519430 0.3726420 0.7820650 0.8134018 0.9993236

head(res_cedar$tree_res$single$pp)

##   Celltype1 Celltype2 Celltype3 Celltype4 Celltype5 Celltype6
## 1 1.0000000 0.4080102 0.3695901 0.9110021 1.0000000 0.3028812
## 2 1.0000000 0.9842010 0.6068900 0.9789479 0.9300748 0.9985262
## 3 1.0000000 1.0000000 0.8391173 0.5949058 0.9999947 0.2821854
## 4 0.7956898 0.3122701 0.1770390 0.4087894 0.9993220 0.2334202
## 5 0.9999981 0.2938729 0.1826718 0.6802124 0.7147263 0.3794544
## 6 0.8792442 0.3147759 0.1739759 0.7244160 0.5026382 0.9993225

We further use \(1-post.Prob\) to get the norminal p-value.

# Returns a posterior probability of whether a feature (gene) is a DEG in 
# a certain cell type
cedar.outputs <- as.data.frame(1- res_cedar$tree_res$full$pp)
cedar.outputs$gene.id <- (rownames(cedar.outputs))
cedar.outputs <- cedar.outputs[,c(7,1:6)]

cedar.outputs <- merge(x = cedar.outputs, y = gene_CT_DE_connect, 
                       by = "gene.id",all.x =T)
# aa <- as.data.frame(table(gene.CT.FDR$Celltype3))

cedar.outputs = cedar.outputs[order(as.numeric(cedar.outputs$gene.id)),]
rownames(cedar.outputs) = cedar.outputs$gene.id
head(cedar.outputs,10)

##    gene.id    Celltype1   Celltype2 Celltype3  Celltype4    Celltype5
## 1        1 0.000000e+00 0.285489044 0.4128378 0.08014244 0.000000e+00
## 2        2 0.000000e+00 0.002498894 0.1029880 0.01244864 2.351405e-02
## 3        3 1.254557e-08 0.000000000 0.0303332 0.31223160 9.766730e-07
## 4        4 1.673207e-01 0.459941290 0.6372663 0.47997527 5.579680e-04
## 5        5 1.423306e-06 0.401584517 0.5691291 0.26952642 1.822199e-01
## 6        6 1.075743e-01 0.448056980 0.6273580 0.21793499 1.865982e-01
## 7        7 5.388047e-02 0.523084342 0.6433422 0.54743357 2.097714e-01
## 8        8 7.291874e-09 0.375509923 0.5718931 0.04840874 2.563183e-01
## 9        9 1.045304e-04 0.009656311 0.1905728 0.63540905 4.782149e-01
## 10      10 0.000000e+00 0.155473134 0.1500476 0.29992566 2.645339e-04
##       Celltype6 csDE.True
## 1  0.5595954313         1
## 2  0.0010768204         1
## 3  0.5445338768         1
## 4  0.6215669384         1
## 5  0.5093568965         1
## 6  0.0006763656         1
## 7  0.6044155459         1
## 8  0.5096116278         1
## 9  0.7566018023         1
## 10 0.3463708761         1

LRCDE

LRCDE a cell type-specific differential expression package. lrcde applies to any type of high-throughput data (methylation profiles, RNA-seq data) and it requires three components.

het.sub is a samples by genes heterogeneous gene expression data matrix. Input should be non-median-centered and non-standardized. cell.props is a cell types by samples matrix. The sum across cell types should be ~1 for each sample. groups is a vector of 1’s and 2’s indicating group assignment for each sample. Also the sample order should be the same across het.sub and cell.props.

pheno.v <- rep(c(1, 2), 100)
# Run Model
power.analysis.df <- lrcde(het.sub = t(RNAseq_final_count), 
                           cell.props = est_CT_prop , 
                           groups = pheno.v , 
                           output.file = "aa.csv", 
                           VERBOSE = T, 
                           method = "dual", 
                           direction = "two.sided")
# The warning message you received when running the lrcde function in R may indicate that the function was not able to achieve full numerical precision in some calculations. This can happen due to the limitations of floating-point arithmetic used in computers, which can sometimes lead to rounding errors and loss of precision when working with very small or very large numbers.

A list with three elements. The first contains a data.frame of analysis results. The second contains a list of parameters supplied to the lrcde function.

print(power.analysis.df[[1]][c(1:3,2001:2003,4001:4003),])

##      site       base       case    diff.est mse.control mse.case      cell
## 1       1  200.60797  921.50032   720.89235    584.6699 674.1086 Celltype1
## 2       2  132.71330  780.10204   647.38874    223.3617 213.2660 Celltype1
## 3       3  103.47018  473.82512   370.35494    408.6066 456.6934 Celltype1
## 2001    1  117.38873   24.62962   -92.75912    584.6699 674.1086 Celltype2
## 2002    2   74.39143 -244.36096   -74.39143    223.3617 213.2660 Celltype2
## 2003    3 1237.86152  125.08931 -1112.77221    408.6066 456.6934 Celltype2
## 4001    1   93.37761   25.21199   -68.16562    584.6699 674.1086 Celltype3
## 4002    2  211.36005  305.54773    94.18768    223.3617 213.2660 Celltype3
## 4003    3  478.54139  653.37801   174.83662    408.6066 456.6934 Celltype3
##         cell.sd  kappa.1  kappa.2   t.crit     t.stat       p.val.t      se1
## 1    0.12128222 119.6352 1102.016 1.983066  78.202534  1.951995e-94 14.29189
## 2    0.12128222 119.6352 1102.016 1.981997 127.333387 1.420163e-120 11.08339
## 3    0.12128222 119.6352 1102.016 1.979126  60.529567  9.266536e-95 21.00371
## 2001 0.09744163 119.6352 1102.016 1.983673  -4.842907  9.999977e-01 20.28405
## 2002 0.09744163 119.6352 1102.016 1.983140  -7.087687  1.000000e+00 15.73032
## 2003 0.09744163 119.6352 1102.016 1.981540 -89.863451  1.000000e+00 29.80992
## 4001 0.04713714 119.6352 1102.016 1.979904  -7.639704  1.000000e+00 27.86769
## 4002 0.04713714 119.6352 1102.016 1.977008  18.471909  1.054239e-39 21.61145
## 4003 0.04713714 119.6352 1102.016 1.972843  25.952864  6.240336e-64 40.95501
##            se2      se.p   t.power
## 1     91.06810  9.218273 1.0000000
## 2     49.61925  5.084203 1.0000000
## 3     57.46778  6.118579 1.0000000
## 2001 190.45894 19.153603 0.9999977
## 2002 103.77322 10.495868 1.0000000
## 2003 120.18755 12.382923 1.0000000
## 4001  84.76189  8.922548 1.0000000
## 4002  46.18326  5.098969 1.0000000
## 4003  53.48829  6.736699 1.0000000

Here we extract T-test p-values for the estimates:

LRCDE.outputs = as.data.frame(power.analysis.df[[1]] %>% 
  dplyr::select(site, cell, p.val.t) %>% 
  pivot_wider(id_cols = NULL, names_from = cell,values_from = p.val.t))

names(LRCDE.outputs)[1] <- c("gene.id")
# Correct for multiple comparisons. 
for( zz in 1:6){
  LRCDE.outputs[,1+zz] =  p.adjust(LRCDE.outputs[,1+zz], method = "BH")
}

LRCDE.outputs <- merge(x = LRCDE.outputs, y = gene_CT_DE_connect, 
                       by = "gene.id",all.x =T)
LRCDE.outputs = LRCDE.outputs[order(as.numeric(LRCDE.outputs$gene.id)),]
rownames(LRCDE.outputs) = LRCDE.outputs$gene.id
head(LRCDE.outputs,10)

##    gene.id     Celltype1    Celltype2    Celltype3    Celltype4     Celltype5
## 1        1  4.539523e-93 1.000000e+00 1.000000e+00 1.000000e+00 5.025893e-174
## 2        2 4.814112e-119 1.000000e+00 7.075428e-39 1.000000e+00  3.053245e-08
## 3        3  2.232900e-93 1.000000e+00 7.564043e-63 9.543993e-36  1.000000e+00
## 4        4  2.670971e-39 5.266458e-17 1.000000e+00 1.000000e+00  4.677184e-90
## 5        5  9.862755e-79 2.467829e-18 1.000000e+00 1.000000e+00  9.174420e-45
## 6        6  2.393608e-45 4.178652e-15 1.000000e+00 3.867938e-38  8.375209e-01
## 7        7  4.762983e-50 9.881423e-01 1.000000e+00 1.000000e+00  1.000000e+00
## 8        8  2.368155e-87 5.088581e-19 1.000000e+00 1.000000e+00  4.998919e-05
## 9        9  3.477816e-73 1.000000e+00 7.434881e-33 9.181229e-08  1.000000e+00
## 10      10 1.577104e-175 1.000000e+00 1.000000e+00 1.000000e+00  4.838563e-93
##       Celltype6 csDE.True
## 1  1.000000e+00         1
## 2  2.394650e-99         1
## 3  1.000000e+00         1
## 4  1.000000e+00         1
## 5  1.000000e+00         1
## 6  1.000000e+00         1
## 7  3.034906e-20         1
## 8  1.253472e-25         1
## 9  1.000000e+00         1
## 10 1.000000e+00         1

csSAM

The first model designed for csDEGs calling process. csSAM implements the method that computes cell-specific differential expression from measured cell proportions using SAM.

csSamWrapper performs the entire functionality of csSAM. G is a sample by gene gene expression matrix. cc is a sample by cell type cell abundance matrix. y is a numeric vector of group assignment for each sample (1 or 2).

res.csSAM = csSamWrapper(G = t(RNAseq_final_count), 
                         cc = est_CT_prop,
                         y = as.factor(rep(c(1,2),100)), 
                         nperms = 50, 
                         alternative = "two.sided",
                         standardize = TRUE,
                         medianCenter = TRUE)

The function returns a list. “sigGene.csSAM” is a genes by cell types matrix object with the result of contrasting the average cell-specific expression profile of the two groups, per cell-type.

names(res.csSAM)

## [1] "deconv"        "fdr.csSAM"     "fdr.SAM"       "sigGene.csSAM"
## [5] "fileName"

head(t(res.csSAM$sigGene.csSAM))

##           [,1]      [,2]      [,3]      [,4]       [,5]       [,6]
## [1,] 0.0000000 0.7446222 0.6274246 0.1825442 0.00000000 0.99062000
## [2,] 0.0000000 0.7446222 0.3084977 0.1825442 0.98211419 0.02761905
## [3,] 0.0000000 0.0000000 0.1219841 0.5040726 0.07919283 0.68638152
## [4,] 0.4883158 0.7446222 0.9827167 0.9988200 0.28979253 0.99062000
## [5,] 0.1757495 0.7446222 0.8556053 0.9988200 0.98211419 0.99062000
## [6,] 0.4883158 0.7446222 0.8556053 0.5040726 0.98211419 0.11173554

csSAM.outputs <- as.data.frame(t(res.csSAM$sigGene.csSAM))
csSAM.outputs$gene.id <- (rownames(RNAseq_final_count))
colnames(csSAM.outputs)[1:6] <- paste0("Celltype",seq(1:6))
csSAM.outputs <- csSAM.outputs[,c(7,1:6)]
csSAM.outputs <- merge(x = csSAM.outputs, y = gene_CT_DE_connect, 
                       by = "gene.id",all.x =T)
csSAM.outputs = csSAM.outputs[order(as.numeric(csSAM.outputs$gene.id)),]
rownames(csSAM.outputs) = csSAM.outputs$gene.id
head(csSAM.outputs,10)

##    gene.id  Celltype1 Celltype2 Celltype3 Celltype4  Celltype5  Celltype6
## 1        1 0.00000000 0.7446222 0.6274246 0.1825442 0.00000000 0.99062000
## 2        2 0.00000000 0.7446222 0.3084977 0.1825442 0.98211419 0.02761905
## 3        3 0.00000000 0.0000000 0.1219841 0.5040726 0.07919283 0.68638152
## 4        4 0.48831579 0.7446222 0.9827167 0.9988200 0.28979253 0.99062000
## 5        5 0.17574953 0.7446222 0.8556053 0.9988200 0.98211419 0.99062000
## 6        6 0.48831579 0.7446222 0.8556053 0.5040726 0.98211419 0.11173554
## 7        7 0.95581000 0.9277755 0.9827167 0.9988200 0.98211419 0.99062000
## 8        8 0.17574953 0.8367708 0.8556053 0.9988200 0.98211419 0.68638152
## 9        9 0.04919753 0.2451613 0.5213170 0.9988200 0.98211419 0.99062000
## 10      10 0.00000000 0.6372567 0.5213170 0.5040726 0.07919283 0.68638152
##    csDE.True
## 1          1
## 2          1
## 3          1
## 4          1
## 5          1
## 6          1
## 7          1
## 8          1
## 9          1
## 10         1

Example of Real Data Analysis:

Preprosessing: Load into Datasets, Structure Manipulations:

The dataset used in this analysis is the bulk RNA-seq data from The Cancer Genome Atlas Lung Adenocarcinoma TCGA-LUAD. The goal is to predict the stages of pathologic cancer, which are categorized into a binary variable: stage-I (n=330) versus stage-II/III/IV (n=271). To ensure data quality, we filtered out genes with a 75% quantile expression level lower than 20 across all study samples.

load("rse_gene_lung.Rdata")
## sample variable hdle ----
covar <- as.data.frame(rse_gene@colData)
table(covar$gdc_cases.project.project_id)
# Select TCGA-LUAD dataset
covar.luad <- covar[covar$gdc_cases.project.project_id == "TCGA-LUAD",]
# Select TCGA-LUAD phenotypical information
covar.luad <- covar.luad[,c('gdc_cases.diagnoses.tumor_stage',
                            'cgc_case_pathologic_stage')]
#table(covar.luad$cgc_case_pathologic_stage)
#   Stage I   Stage IA   Stage IB   Stage II  Stage IIA  Stage IIB Stage IIIA 
#         5        154        171          1         58         83         85 
#Stage IIIB   Stage IV 
#        12         29 

# Create binary pathologic stage variable. Combine I and IA stage cancer: 
covar.luad$cgc_case_patho_stage_binary <- 
  ifelse(covar.luad$cgc_case_pathologic_stage %in% c("Stage I","Stage IA",
                                                     "Stage IB"), 0, 1)
#table(covar.luad$cgc_case_patho_stage_binary)
#   0   1 
# 330 271 

## count data hdle ----
count <- rse_gene@assays$data$counts
count.luad <- count[,rownames(covar.luad)]
# dim(count.luad) # [1] 58037   601
# Filter genes based on 75% quantile
count.luad.filter <- count.luad[apply(count.luad,1,
                                      function(x) quantile(x, 0.75)) >= 20,]
# dim(count.luad.filter) # [1] 43975 features   601 samples

count.luad.filter <- cbind(count.luad.filter, 
                           gene_id = gsub("\\.\\d+$", "", rownames(count.luad.filter)))
count.luad.filter <- count.luad.filter[,c(602,1:601)]
# Check no duplicated genes: which(duplicated(count.luad.filter[,'gene_id']))

Genes without Ensembl Annotations and with replications are also removed, which led to a total of 31,294 genes.

## Gene Annotations ----
# Filter genes based on gene annotation
ensembl.con <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# check for gene stable ID in the description     
G_list <- getBM(attributes = c("ensembl_gene_id","hgnc_symbol", # "external_gene_name", 
                               "start_position","end_position"),
                filters = "ensembl_gene_id",
                values = as.vector(count.luad.filter[,1]),
                mart = ensembl.con)
G_list <- G_list[G_list$hgnc_symbol != '',]
# Check for duplicated genes
duplicated.id = which(duplicated(G_list$hgnc_symbol))
G_list = G_list[-duplicated.id,] # [1] 31296     4
duplicated.id = which(duplicated(G_list$ensembl_gene_id))
G_list = G_list[-duplicated.id,]
# Creat gene size
G_list$gene_size <- G_list$end_position - G_list$start_position

# Merge gene name with raw count data
count.luad.filter <- left_join(G_list, as.data.frame(count.luad.filter), 
                               by = c("ensembl_gene_id" = "gene_id"))
# Data structure manipulations
numeric.count <- t(apply(count.luad.filter[,6:606],1,function(x) 
  as.numeric(unlist(x))))
colnames(numeric.count) <- colnames(count.luad.filter[,6:606])
count.luad.filter[,6:606] <- numeric.count
# Final data dimension: 606 patients and ~30k genes. 
# dim(count.luad.filter) 31294   606
# Clear memory. 
rm(numeric.count,ensembl.con,rse_gene,duplicated.id,count,
   covar,count.luad)

Deconvolution:

Because tumor-infiltrating immune cells are crucial in cancer patients and the compositions of immune cells in tumor samples play important roles in tumor heterogeneities, we use TIMER2.0 from immunedeconv package to deconvolute cell proportions. The cell composition includes six immune cell types are B cell, T cell CD4, T cell CD8, Neutrophil, Macrophage, and Myeloid Dendritic.

We need to first convert RNA sequencing count data to TMP using the gene length information provided by Ensembl because TIMER2.0 deconvolution algorithms require standardized TMP inputs. The cell type composition matrix needs to be further normalized to 1 across six cells to reach the constrained criteria.

## Deconvolute ----
# Create TPMs tandardized data for cell type composition analysis
TPM_luad.filter <- count.luad.filter[,-c(3,4)]
TPM_luad.filter$gene_size <- TPM_luad.filter$gene_size/1000
names(TPM_luad.filter)[3] <- "gene_size_kb"
TPM_luad.filter[,4:604] <- sweep(TPM_luad.filter[,4:604], 1,TPM_luad.filter[,3], 
                                 FUN = '/')
test <- as.matrix(apply(TPM_luad.filter[,4:604],2,sum)/1000000)
TPM_luad.filter[,4:604] <- sweep(TPM_luad.filter[,4:604], 2, test, FUN = "/")
rm(test)
TPM_luad.filter.timer <- TPM_luad.filter[,c(2,4:604)]
rownames(TPM_luad.filter.timer) <- TPM_luad.filter.timer$hgnc_symbol
TPM_luad.filter.timer <- TPM_luad.filter.timer[,-1]
TPM_luad.filter.timer <- round(TPM_luad.filter.timer,digits = 2)
# Use TIMER for deconvolution, from immunedeconv package
CT.Prop.luad.TIMER <- deconvolute(TPM_luad.filter.timer, "timer",
                                  indications=as.vector(c(rep("luad",601))))

# Other deconvolution methods: from IOBR package
# CT.Prop.luad.CIBERSORT = deconvo_tme(eset = count_luad.timer, 
#                                      method = "cibersort", 
#                                      arrays = FALSE, perm = 200 )
# CT.Prop.luad.CIBERSORT = deconvo_tme(eset = TPM_luad.filter.timer, 
#                                      method = "cibersort", 
#                                      arrays = FALSE, perm = 200)
# CT.Prop.luad.TIMER2 = deconvo_tme(eset = count_luad.timer, 
#                                   method = "timer",
#                                   group_list = rep("luad",dim(CT.Prop.luad.TIMER)[1]))

## CT prop normalize----
rownames(count.luad.filter) <- count.luad.filter$ensembl_gene_id
count.luad.filter <- count.luad.filter[,6:606]

CT.Prop.luad.TIMER <- as.data.frame(CT.Prop.luad.TIMER)
rownames(CT.Prop.luad.TIMER) <- CT.Prop.luad.TIMER$cell_type
CT.Prop.luad.TIMER <- CT.Prop.luad.TIMER[,-1]
CT.Prop.luad.TIMER <- t(CT.Prop.luad.TIMER)
colnames(CT.Prop.luad.TIMER) <- c("B_cell","T_cell_CD4", "T_cell_CD8",
                                  "Neutrophil","Macrophage","Myeloid_dentritic_c")
CT.Prop.luad.TIMER <- sweep(CT.Prop.luad.TIMER,1,
                            1/apply(CT.Prop.luad.TIMER,1,sum),
                            FUN = "*")
head(apply(CT.Prop.luad.TIMER,1,sum),10)

## 672AFEB7-E9AA-4A44-AA9C-EF6344AE5C5C 670D8333-6723-4B4F-B533-D2BFF803A9BF 
##                                    1                                    1 
## 4B3D8B07-8B44-45F3-BE30-A0B6EDBF8265 05E2E228-C0C5-4D6D-8FE1-DA25D78B0EC1 
##                                    1                                    1 
## AE818BC4-05BF-4E85-A4AD-F3E43DD7D37A 1002C658-3E4D-4F5A-BCB2-E81193CF209A 
##                                    1                                    1 
## 2634796D-5545-49C3-AA21-939D1EE443D6 8085F165-9AE3-4079-9F7F-711716D68BF0 
##                                    1                                    1 
## 3E6800B3-B156-46ED-A949-1CF9C79F2C88 A6CC4EB8-BB74-4CBD-85BD-F60FD9D4DC50 
##                                    1                                    1

TCA

Notice here that the clinical outcome is treated as an independent covariate in csDEGs calling process.

## TCA ----
tca.design <- as.matrix(as.numeric(covar.luad$cgc_case_patho_stage_binary))
colnames(tca.design) <- "stage"
rownames(tca.design) <- rownames(covar.luad)
### Run model
count.luad.filter2 <- count.luad.filter[apply(count.luad.filter,1,sd) <= 35000,]
tca.model <- tca(X = count.luad.filter2, 
                 W = as.matrix(CT.Prop.luad.TIMER), refit_W = F,
                 C1 = tca.design, verbose = T)
#  extract p-values for the estimates in gammas_hat (based on a T-test)
tca.outputs <- as.data.frame(tca.model[["gammas_hat_pvals"]])
tca.outputs$gene.id <- rownames(tca.outputs)
colnames(tca.outputs)[1:6] <- colnames(CT.Prop.luad.TIMER)
tca.outputs <- tca.outputs[,c(7,1:6)]
tca.smry <- tca.outputs
tca.smry.symbol <- merge(tca.smry,G_list, by = "gene.id")
head(tca.smry.symbol,15)

##            gene.id      Astro         Exc         Inh      Micro       Oligo
## 1  ENSG00000000003 0.75610643 0.929799835 0.183145345 0.10130346 0.003612741
## 2  ENSG00000000419 0.09143639 0.042033683 0.607801011 0.79984218 0.302678077
## 3  ENSG00000000457 0.90941863 0.740706553 0.392678985 0.34839534 0.312184203
## 4  ENSG00000000460 0.31948279 0.762589104 0.478549522 0.25237920 0.084136204
## 5  ENSG00000000938 0.80405838 0.223628958 0.477214695 0.83075836 0.884505706
## 6  ENSG00000000971 0.61009856 0.999088525 0.838252895 0.40056510 0.421305082
## 7  ENSG00000001036 0.96399096 0.404510558 0.569663484 0.17241321 0.426463478
## 8  ENSG00000001084 0.64027595 0.426597331 0.841178132 0.17810117 0.419323416
## 9  ENSG00000001167 0.90197934 0.771422207 0.700177593 0.19859443 0.375012411
## 10 ENSG00000001460 0.40404415 0.127584520 0.432182333 0.51997939 0.335826679
## 11 ENSG00000001461 0.32984070 0.706440349 0.773923427 0.37606696 0.350318294
## 12 ENSG00000001497 0.99376041 0.188302333 0.556366441 0.53821038 0.597548629
## 13 ENSG00000001561 0.91833879 0.754994661 0.196644583 0.06687532 0.513414944
## 14 ENSG00000001617 0.50483991 0.001345846 0.009221129 0.74192914 0.657240921
## 15 ENSG00000001626 0.31276473 0.563097410 0.128776499 0.52039704 0.114163183
##          OPC hgnc_symbol
## 1  0.1487570      TSPAN6
## 2  0.3557776        DPM1
## 3  0.5518572       SCYL3
## 4  0.6204813    C1orf112
## 5  0.3586093         FGR
## 6  0.8481829         CFH
## 7  0.3325489       FUCA2
## 8  0.8463933        GCLC
## 9  0.6392501        NFYA
## 10 0.9200947       STPG1
## 11 0.1865906      NIPAL3
## 12 0.6760131       LAS1L
## 13 0.5740340       ENPP4
## 14 0.9364709      SEMA3F
## 15 0.6154911        CFTR

TOAST

## TOAST ----
toast.design <- as.data.frame(factor(covar.luad$cgc_case_patho_stage_binary)) 
colnames(toast.design) <- "stage"
sim.Design_out <- makeDesign(toast.design,CT.Prop.luad.TIMER)
Toast.model <- fitModel(sim.Design_out, as.matrix(count.luad.filter))
Toast.out <- csTest(Toast.model, coef = "stage",cell_type = NULL, 
                    verbose = F, sort = T)
aa <- lapply(seq(1:6), function(i,x) {
  x[[i]]$gene.id  <- rownames(x[[i]])
  x[[i]] <- x[[i]] %>% dplyr::select(gene.id,fdr) 
  names(x[[i]]) <- c("gene.id",names(x)[i])
  assign(names(x)[i],x[[i]], envir=.GlobalEnv)
},x=Toast.out) 
remove(aa)
Toast.smry <- Reduce(function(x,y) merge(x = x, y = y, by = "gene.id",all.x =T),
                     list(B_cell,T_cell_CD4,T_cell_CD8,Neutrophil,Macrophage,
                          Myeloid_dentritic_c))
remove(B_cell,T_cell_CD4,T_cell_CD8,Neutrophil,Macrophage,
       Myeloid_dentritic_c, sim.Design_out,Toast.model,Toast.out,toast.design)
Toast.smry.symbol <- merge(Toast.smry,G_list, by = "gene.id")
head(Toast.smry.symbol,15)

##            gene.id     Astro       Exc       Inh    Micro     Oligo       OPC
## 1  ENSG00000000003 0.9998135 0.9284587 0.9999893 0.999998 0.8319204 0.9953115
## 2  ENSG00000000419 0.9998135 0.4043824 0.9999893 0.999998 0.8319204 0.9953115
## 3  ENSG00000000457 0.9998135 0.8286659 0.9999893 0.999998 0.8330658 0.9992906
## 4  ENSG00000000460 0.9998135 0.7965821 0.9999893 0.999998 0.8319204 0.9992906
## 5  ENSG00000000938 0.9998135 0.5274979 0.9999893 0.999998 0.9928465 0.9953115
## 6  ENSG00000000971 0.9998135 0.7772749 0.9999893 0.999998 0.9474317 0.9962820
## 7  ENSG00000001036 0.9998135 0.6333875 0.9999893 0.999998 0.9439384 0.9953115
## 8  ENSG00000001084 0.9998135 0.7680022 0.9999893 0.999998 0.8319204 0.9992906
## 9  ENSG00000001167 0.9999913 0.9615638 0.9999893 0.999998 0.8319204 0.9953115
## 10 ENSG00000001460 0.9998135 0.3383280 0.9999893 0.999998 0.8782956 0.9997742
## 11 ENSG00000001461 0.9998135 0.8985756 0.9999893 0.999998 0.8319204 0.9953115
## 12 ENSG00000001497 0.9998135 0.4777003 0.9999893 0.999998 0.9660650 0.9968641
## 13 ENSG00000001561 0.9998135 0.5172285 0.9999893 0.999998 0.8319204 0.9953115
## 14 ENSG00000001617 0.9998135 0.2475045 0.9999893 0.999998 0.9737621 0.9992906
## 15 ENSG00000001626 0.9998135 0.6690331 0.9999893 0.999998 0.8319204 0.9953115
##    hgnc_symbol
## 1       TSPAN6
## 2         DPM1
## 3        SCYL3
## 4     C1orf112
## 5          FGR
## 6          CFH
## 7        FUCA2
## 8         GCLC
## 9         NFYA
## 10       STPG1
## 11      NIPAL3
## 12       LAS1L
## 13       ENPP4
## 14      SEMA3F
## 15        CFTR

DESeq2

## DESeq2 ----
# create coldata for model fitting
covar.luad2 <- covar.luad[order(covar.luad$cgc_case_patho_stage_binary),]
CT.Prop.luad.TIMER2 <- CT.Prop.luad.TIMER[rownames(covar.luad2),]
count.luad.filter2 <- count.luad.filter[,rownames(covar.luad2)]
coldata = cbind(stage = as.factor(covar.luad2$cgc_case_patho_stage_binary), 
                CT.Prop.luad.TIMER2)
dds.bl <- DESeqDataSetFromMatrix(countData = count.luad.filter2, 
                                 colData = coldata,  
                                 design = ~ 0 + B_cell + T_cell_CD4 + 
                                   T_cell_CD8 + Neutrophil + Macrophage + 
                                   Myeloid_dentritic_c +
                                   stage:B_cell + stage:T_cell_CD4 +  
                                   stage:T_cell_CD8 + stage:Neutrophil + 
                                   stage:Macrophage + stage:Myeloid_dentritic_c)
dds.bl <- DESeq(dds.bl)
DESeq2.out <- dds.bl


# resultsNames(dds.bl)
deseq2resultname <- resultsNames(dds.bl)[7:12]
DESeq2.smry <-data.frame(gene.id = rownames(count.luad.filter2))
for(i in 1:length(deseq2resultname)){
  res1.bl <- as.data.frame(results(dds.bl, name = deseq2resultname[i]))
  res1.bl$padj <- ifelse(is.na(res1.bl$padj), 1, res1.bl$padj)
  res1.bl$gene.id <- rownames(res1.bl)
  Celltype1.bl <- res1.bl %>% 
    dplyr::select(gene.id, padj) %>% 
    dplyr::rename(!!substr(deseq2resultname[i],1,nchar(deseq2resultname[i])-6):=padj)
  DESeq2.smry <- merge(DESeq2.smry, Celltype1.bl, by = "gene.id",all.x =T)
}

DESeq2.smry.symbol <- merge(DESeq2.smry,G_list, by = "gene.id")
head(DESeq2.smry.symbol,15)

##            gene.id     Astro       Exc       Inh     Micro     Oligo       OPC
## 1  ENSG00000000003 0.9611865 0.6541721 0.3762072 0.9992306 1.0000000 0.8083559
## 2  ENSG00000000419 0.3467613 0.8499556 0.8312811 0.9992306 0.2480934 0.9880205
## 3  ENSG00000000457 0.8457842 0.6363731 0.8186414 0.9992306 0.4901586 0.7434262
## 4  ENSG00000000460 0.3434591 0.6474127 0.5434581 0.9992306 0.4973274 0.7442209
## 5  ENSG00000000938 0.7337655 0.9891557 0.9184069 0.9992306 1.0000000 0.9912205
## 6  ENSG00000000971 0.8418209 0.8775300 0.9876620 0.9992306 0.2688641 0.9068843
## 7  ENSG00000001036 0.9871490 0.9691142 0.4801493 0.9992306 1.0000000 0.8564742
## 8  ENSG00000001084 0.5442480 0.6675758 0.5266777 0.9992306 0.2386741 0.8118223
## 9  ENSG00000001167 0.6784735 0.4183163 0.2360247 0.9992306 1.0000000 0.8160445
## 10 ENSG00000001460 0.3909547 0.5685522 0.3156943 0.9992306 1.0000000 0.7033200
## 11 ENSG00000001461 0.4630501 0.4835969 0.4355589 0.9992306 0.1053649 0.7399374
## 12 ENSG00000001497 0.7062219 0.9807327 0.7865962 0.9992306 0.2039998 0.9424297
## 13 ENSG00000001561 0.7721005 0.4062672 0.2631234 0.9992306 0.1012494 0.8411948
## 14 ENSG00000001617 0.6417434 0.3953291 0.2421134 0.9992306 1.0000000 0.8477374
## 15 ENSG00000001626 0.8991963 0.4487335 0.3472886 0.9992306 1.0000000 0.9334153
##    hgnc_symbol
## 1       TSPAN6
## 2         DPM1
## 3        SCYL3
## 4     C1orf112
## 5          FGR
## 6          CFH
## 7        FUCA2
## 8         GCLC
## 9         NFYA
## 10       STPG1
## 11      NIPAL3
## 12       LAS1L
## 13       ENPP4
## 14      SEMA3F
## 15        CFTR

CellDMC

## CellDMC ----
pheno.v <- as.numeric(covar.luad$cgc_case_patho_stage_binary)
celldmc.model <- CellDMC(as.matrix(count.luad.filter), pheno.v, CT.Prop.luad.TIMER)




celldmc.smry <- as.data.frame(cbind(celldmc.model[["coe"]][["B_cell"]][["adjP"]],
                                    celldmc.model[["coe"]][["T_cell_CD4"]][["adjP"]],
                                    celldmc.model[["coe"]][["T_cell_CD8"]][["adjP"]],
                                    celldmc.model[["coe"]][["Neutrophil"]][["adjP"]],
                                    celldmc.model[["coe"]][["Macrophage"]][["adjP"]],
                                    celldmc.model[["coe"]][["Myeloid_dentritic_c"]][["adjP"]]))

colnames(celldmc.smry) <- colnames(CT.Prop.luad.TIMER)
celldmc.smry$gene.id <- as.matrix(rownames(celldmc.model[["coe"]][["B_cell"]]))
celldmc.smry <- celldmc.smry[,c(7,1:6)]
celldmc.smry.symbol <- merge(celldmc.smry,G_list, by = "gene.id")
head(celldmc.smry.symbol,15)

##            gene.id     Astro       Exc       Inh     Micro     Oligo       OPC
## 1  ENSG00000000003 0.9985790 0.9213590 0.9999137 0.9999787 0.7994531 0.9656499
## 2  ENSG00000000419 0.9985790 0.3765622 0.9999137 0.9999787 0.7994531 0.9656499
## 3  ENSG00000000457 0.9985790 0.8144335 0.9999137 0.9999787 0.8006029 0.9916577
## 4  ENSG00000000460 0.9985790 0.7795219 0.9999137 0.9999787 0.7994531 0.9913214
## 5  ENSG00000000938 0.9985790 0.5010179 0.9999137 0.9999787 0.9860303 0.9656499
## 6  ENSG00000000971 0.9985790 0.7590710 0.9999137 0.9999787 0.9184396 0.9723666
## 7  ENSG00000001036 0.9994527 0.6079064 0.9999137 0.9999787 0.9144004 0.9656499
## 8  ENSG00000001084 0.9985790 0.7491604 0.9999137 0.9999787 0.7994531 0.9892898
## 9  ENSG00000001167 0.9999891 0.9585184 0.9999137 0.9999787 0.7994531 0.9656499
## 10 ENSG00000001460 0.9985790 0.3118733 0.9999137 0.9999787 0.8435897 0.9992170
## 11 ENSG00000001461 0.9985790 0.8897347 0.9999137 0.9999787 0.7994531 0.9656499
## 12 ENSG00000001497 0.9985790 0.4507549 0.9999137 0.9999787 0.9439301 0.9776327
## 13 ENSG00000001561 0.9985790 0.4903285 0.9999137 0.9999787 0.7994531 0.9656499
## 14 ENSG00000001617 0.9985790 0.2025085 0.9999137 0.9999787 0.9549264 0.9942869
## 15 ENSG00000001626 0.9985790 0.6461350 0.9999137 0.9999787 0.7994531 0.9656499
##    hgnc_symbol
## 1       TSPAN6
## 2         DPM1
## 3        SCYL3
## 4     C1orf112
## 5          FGR
## 6          CFH
## 7        FUCA2
## 8         GCLC
## 9         NFYA
## 10       STPG1
## 11      NIPAL3
## 12       LAS1L
## 13       ENPP4
## 14      SEMA3F
## 15        CFTR

CARseq

## CARseq ----
carseq.model = run_CARseq(count_matrix = count.luad.filter,
                          cellular_proportions = CT.Prop.luad.TIMER,
                          groups = as.factor(covar.luad$cgc_case_patho_stage_binary),
                          shrunken_lfc = TRUE,
                          cores = 1,
                          fix_overdispersion = FALSE,
                          useSocket = TRUE)
carseq.smry <- as.data.frame(carseq.model$padj)
carseq.smry$gene.id <- rownames(carseq.model$padj)
carseq.smry <- carseq.smry[,c(7,1:6)]
carseq.smry.symbol <- merge(carseq.smry,G_list, by = "gene.id")
head(carseq.smry.symbol,15)

##            gene.id Astro        Exc Inh Micro Oligo OPC hgnc_symbol
## 1  ENSG00000000003     1 0.18029575   1     1     1   1      TSPAN6
## 2  ENSG00000000419     1 0.08717086   1     1     1   1        DPM1
## 3  ENSG00000000457     1 0.69350773   1     1     1   1       SCYL3
## 4  ENSG00000000460     1 0.21816729   1     1     1   1    C1orf112
## 5  ENSG00000000938     1 0.24611833   1     1     1   1         FGR
## 6  ENSG00000000971     1 0.47869053   1     1     1   1         CFH
## 7  ENSG00000001036     1 0.09292355   1     1     1   1       FUCA2
## 8  ENSG00000001084     1 0.22411438   1     1     1   1        GCLC
## 9  ENSG00000001167     1 0.36186032   1     1     1   1        NFYA
## 10 ENSG00000001460     1 0.55750753   1     1     1   1       STPG1
## 11 ENSG00000001461     1 0.81757233   1     1     1   1      NIPAL3
## 12 ENSG00000001497     1 0.24136670   1     1     1   1       LAS1L
## 13 ENSG00000001561     1 0.54540356   1     1     1   1       ENPP4
## 14 ENSG00000001617     1 0.18288062   1     1     1   1      SEMA3F
## 15 ENSG00000001626     1 0.28204673   1     1     1   1        CFTR