Authors: Sophie Le Bars, Mohamed Soudy, and Enrico Glaab
This vignette provides a guide for using the XYomics package to analyze sex-related patterns in single-cell RNA-sequencing (scRNA-seq) data. We will walk through a complete workflow, from data simulation to network analysis, demonstrating how to identify and interpret sex-specific gene expression changes at the cell-type level.
The tutorial covers: 1. Simulating a scRNA-seq dataset. 2. Standard preprocessing using the Seurat package. 3. Performing differential expression analysis using different strategies: sex-stratified analysis and interaction analysis. 4. Detecting and categorizing sex-specific DEGs for each cell type. 5. Visualizing gene expression using dot plots. 6. Performing pathway enrichment analysis to uncover biological functions. 7. Constructing and visualizing protein-protein networks.
We start by simulating a scRNA-seq dataset containing 10 mock samples. This creates a Seurat object with a count matrix ready for analysis.
library(Seurat)
library(org.Hs.eg.db)
library(stringr)
library(clusterProfiler)
library(igraph)
library(XYomics)
library(ggrepel)
library(ggraph)
library(dplyr)
library(tidyr)
set.seed(123)
gene_sample <- keys(org.Hs.eg.db, keytype = "SYMBOL") %>%
as.data.frame() %>%
setNames("genes") %>%
filter(!str_starts(genes, "LOC"))
gene_sample <- sample(gene_sample$genes, 300, replace = FALSE)
matrices <- lapply(1:10, function(i) {
batch_effect <- rgamma(300, shape = 2, scale = 1.5)
m <- matrix(
ifelse(runif(300*100) < 0.1, 0, rnbinom(300*100, size = 1/0.5, mu = 10 * batch_effect)),
nrow = 300,
dimnames = list(gene_sample, paste("Sample", i, ": Cell", 1:100))
)
CreateSeuratObject(counts = m, meta.data = data.frame(sample = paste("Sample", i)))
})
sim.seurat <- Reduce(merge, matrices)
sim.seurat@meta.data$sample <- sapply(strsplit(rownames(sim.seurat@meta.data), ":"), "[", 1)
We perform standard scRNA-seq preprocessing, including normalization, feature selection, and scaling. We then apply dimensionality reduction (PCA and UMAP) and clustering to identify cell populations.
sim.seurat <- NormalizeData(sim.seurat)
sim.seurat <- FindVariableFeatures(sim.seurat, selection.method = "vst", nfeatures = 2000)
sim.seurat <- ScaleData(sim.seurat)
sim.seurat <- RunPCA(sim.seurat)
sim.seurat <- FindNeighbors(sim.seurat, dims = 1:10)
sim.seurat <- FindClusters(sim.seurat)
sim.seurat <- RunUMAP(sim.seurat, dims = 1:10)
DimPlot(sim.seurat)
For the analysis, the Seurat object must contain metadata columns for
cell type (cell_type), sex (sex), and
condition (status). Here, we add mock annotations to our
simulated data.
# Assign mock cell types
cellTypes <- c("cell type 1", "cell type 2", "cell type 3", "cell type 4", "cell type 5")
sim.seurat@meta.data$cell_type <- sample(cellTypes, nrow(sim.seurat@meta.data), replace = TRUE)
Idents(sim.seurat) <- "cell_type"
# Assign mock status (WT/TG) and sex (M/F)
samples <- sim.seurat@meta.data$sample
sim.seurat@meta.data$status <- ifelse(grepl("1|3|5|8|9", samples), "TG", "WT")
sim.seurat@meta.data$sex <- ifelse(grepl("1|2|4|8|10", samples), "M", "F")
When analyzing sex differences, researchers must be aware of the pitfalls of associated statistical analyses, including the limitations of sex-stratified analyses and the challenges of analyzing interactions between sex and disease state.
Sex-stratified analyses use standard statistical tests for differential molecular abundance analysis to test for disease-associated changes in each sex separately. However, a pure sex-stratified analysis may misclassify a change as sex-specific if it uses a standard significance threshold to assess both the presence and absence of an effect. Stochastic variation in significance scores around a chosen threshold may lead to the erroneous detection of significance specific to only one sex, especially if the p-value in the other sex marginally exceeds the chosen threshold. In addition, such an analysis may miss sex-modulated changes, where significant changes in both sexes share the same direction but differ significantly in magnitude; these changes require cross-sex comparisons for accurate detection.
Interaction analysis (using a multiplicative term
like sex*disease) formally tests whether the relationship
between disease and molecular changes differs significantly between
males and females. Not only can such interaction terms reveal
complexities in disease mechanisms that might otherwise be obscured in
analyses that do not consider SABV, but they also have the potential to
detect changes that are limited to the magnitude of an effect, an aspect
that sex-stratified analyses do not capture. Nevertheless, robust
estimation of interaction effects requires large sample sizes, which are
often not available due to the costs associated with advanced molecular
profiling techniques such as single-cell RNA sequencing.
The XYomics package provides functions for both types of analyses.
We use sex_stratified_analysis_sc() to identify DEGs
between conditions separately for males and females within each cell
type.
# Run for all cell types
sim.seurat <- JoinLayers(sim.seurat) #this is for the simulated object but not always required for your own object
results <- sex_stratified_analysis_sc(sim.seurat, min_logfc = -Inf)
sex_degs <- lapply(levels(as.factor(sim.seurat@meta.data$cell_type)), function(cell) {
list(
male = results$male_DEGs[[cell]],
female = results$female_DEGs[[cell]]
)
})
names(sex_degs) <- levels(as.factor(sim.seurat@meta.data$cell_type))
result_categories <- lapply(sex_degs, function(degs) categorize_sex_sc(degs$male, degs$female))
# Example for one cell type
result_one <- result_categories$`cell type 1`
cat("\nTop categorized DEGs for 'cell type 1' (from stratified analysis):\n")
##
## Top categorized DEGs for 'cell type 1' (from stratified analysis):
head(result_one)
## DEG_Type Gene_Symbols Male_avg_logFC Male_FDR
## male-specific1 male-specific CTSH -1.965567 1.139671e-06
## male-specific2 male-specific GOLGA6L24 -1.938910 5.096954e-06
## male-specific3 male-specific TRAV41 2.146682 7.169862e-06
## male-specific4 male-specific FAN1 -2.102872 8.528492e-06
## male-specific5 male-specific CUTC 2.173129 1.440704e-05
## male-specific6 male-specific SCML2P1 1.785135 4.082302e-05
## Female_avg_logFC Female_FDR
## male-specific1 0.22074818 0.7717056
## male-specific2 0.35324695 0.9197194
## male-specific3 0.55890099 0.8394288
## male-specific4 -0.09405884 0.6653794
## male-specific5 -0.21954437 0.7529602
## male-specific6 0.19363191 0.9925615
We use sex_interaction_analysis_sc() to perform a formal
interaction analysis, directly comparing the condition effect between
sexes.
# Example for one cell type (e.g., "cell type 1")
target_cell <- "cell type 1"
interaction_results_one_cell <- sex_interaction_analysis_sc(sim.seurat, target_cell_type = target_cell)
cat(paste0("\nSummary of Sex-Phenotype Interaction analysis results for '", target_cell, "':\n"))
##
## Summary of Sex-Phenotype Interaction analysis results for 'cell type 1':
print(interaction_results_one_cell$summary_stats)
## cell_type n_total_genes n_sig_genes
## 1 cell type 1 300 86
cat(paste0("\nTop genes from Sex-Phenotype Interaction analysis for '", target_cell, "':\n"))
##
## Top genes from Sex-Phenotype Interaction analysis for 'cell type 1':
print(head(interaction_results_one_cell$all_results[[1]]))
## logFC AveExpr t P.Value adj.P.Val B
## CACYBP 3.695025 10.29579 6.019264 4.235834e-09 5.428263e-07 10.427515
## CSF2 3.887220 10.07678 6.017174 4.285904e-09 5.428263e-07 10.416950
## PLA2G2E -3.571534 10.30650 -5.975030 5.428263e-09 5.428263e-07 10.194008
## SPATA31D2P -3.569100 10.69454 -5.867508 9.861416e-09 7.396062e-07 9.631411
## GSX1 -3.408116 10.52532 -5.495808 7.273736e-08 4.001010e-06 7.754312
## TMEM116 -3.371443 10.20957 -5.477563 8.002019e-08 4.001010e-06 7.664884
## cell_type gene
## CACYBP cell type 1 CACYBP
## CSF2 cell type 1 CSF2
## PLA2G2E cell type 1 PLA2G2E
## SPATA31D2P cell type 1 SPATA31D2P
## GSX1 cell type 1 GSX1
## TMEM116 cell type 1 TMEM116
Dot plots are an effective way to visualize gene expression in scRNA-seq data. Here, we show the expression of the top male-specific, female-specific, and sex-dimorphic genes across all cell types, split by sex and condition.
# Get top gene from each category for 'cell type 1'
top_male_gene <- result_one %>% filter(DEG_Type == "male-specific") %>% top_n(1, -Male_FDR) %>% pull(Gene_Symbols)
top_female_gene <- result_one %>% filter(DEG_Type == "female-specific") %>% top_n(1, -Female_FDR) %>% pull(Gene_Symbols)
top_dimorphic_gene <- result_one %>% filter(DEG_Type == "sex-dimorphic") %>% top_n(1, -Male_FDR) %>% pull(Gene_Symbols)
top_genes <- c(top_male_gene, top_female_gene, top_dimorphic_gene)
# Create dot plot
if (length(top_genes) > 0) {
sim.seurat$group_plot <- paste(sim.seurat$cell_type, sim.seurat$sex, sim.seurat$status, sep = "_")
Idents(sim.seurat) <- "group_plot"
# Seurat v5 requires specifying assay for DotPlot
DotPlot(sim.seurat, features = top_genes, cols = c("blue", "red"), assay = "RNA") +
coord_flip() +
labs(title = "Expression of Top Sex-Specific and Dimorphic Genes") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
}
We perform pathway enrichment analysis to identify biological
pathways over-represented in our DEG categories. The
categorized_enrich_sc function automates this for all
categories.
# Run for all cell types
pathway_category <- lapply(result_categories, function(cell) categorized_enrich_sc(cell, enrichment_db = "REACTOME"))
cat("\nTop pathway results for 'cell type 1':\n")
##
## Top pathway results for 'cell type 1':
print(head(pathway_category$`cell type 1`))
## $male_specific
## [1] category pvalue p.adjust qvlaue term_size
## [6] ID Description Count intersection GeneRatio
## [11] geneID
## <0 rows> (or 0-length row.names)
##
## $female_specific
## category pvalue p.adjust
## 1 Acrosome Reaction and Sperm:Oocyte Membrane Binding 0.03114528 0.03114528
## 2 ER to Golgi Anterograde Transport 0.03114528 0.03114528
## 3 Transport to the Golgi and subsequent modification 0.03114528 0.03114528
## 4 Asparagine N-linked glycosylation 0.03518879 0.03518879
## 5 Mitochondrial iron-sulfur cluster biogenesis 0.03518879 0.03518879
## 6 Physiological factors 0.03518879 0.03518879
## 7 YAP1- and WWTR1 (TAZ)-stimulated gene expression 0.03518879 0.03518879
## 8 Fertilization 0.04998135 0.04998135
## qvlaue term_size ID
## 1 0.03114528 6 REAC:R-HSA-1300645
## 2 0.03114528 155 REAC:R-HSA-199977
## 3 0.03114528 186 REAC:R-HSA-948021
## 4 0.03518879 305 REAC:R-HSA-446203
## 5 0.03518879 11 REAC:R-HSA-1362409
## 6 0.03518879 14 REAC:R-HSA-5578768
## 7 0.03518879 14 REAC:R-HSA-2032785
## 8 0.04998135 26 REAC:R-HSA-1187000
## Description Count intersection
## 1 Acrosome Reaction and Sperm:Oocyte Membrane Binding 1 ACR
## 2 ER to Golgi Anterograde Transport 2 PREB,ANK3
## 3 Transport to the Golgi and subsequent modification 2 PREB,ANK3
## 4 Asparagine N-linked glycosylation 2 PREB,ANK3
## 5 Mitochondrial iron-sulfur cluster biogenesis 1 SLC25A37
## 6 Physiological factors 1 NPPA
## 7 YAP1- and WWTR1 (TAZ)-stimulated gene expression 1 NPPA
## 8 Fertilization 1 ACR
## GeneRatio geneID
## 1 0.166666667 ACR
## 2 0.012903226 PREB/ANK3
## 3 0.010752688 PREB/ANK3
## 4 0.006557377 PREB/ANK3
## 5 0.090909091 SLC25A37
## 6 0.071428571 NPPA
## 7 0.071428571 NPPA
## 8 0.038461538 ACR
##
## $sex_dimorphic
## [1] category pvalue p.adjust qvlaue term_size
## [6] ID Description Count intersection GeneRatio
## [11] geneID
## <0 rows> (or 0-length row.names)
##
## $sex_neutral
## [1] category pvalue p.adjust qvlaue term_size
## [6] ID Description Count intersection GeneRatio
## [11] geneID
## <0 rows> (or 0-length row.names)
# Run for one specific cell type
pathway_category_one <- categorized_enrich_sc(result_one)
cat("\nTop pathway results for 'cell type 1' (re-run separately):\n")
##
## Top pathway results for 'cell type 1' (re-run separately):
print(head(pathway_category_one))
## $male_specific
## category
## hsa03460 Genetic Information Processing
## hsa04142 Cellular Processes
## hsa04210 Cellular Processes
## hsa04514 Environmental Information Processing
## subcategory ID
## hsa03460 Replication and repair hsa03460
## hsa04142 Transport and catabolism hsa04142
## hsa04210 Cell growth and death hsa04210
## hsa04514 Signaling molecules and interaction hsa04514
## Description GeneRatio BgRatio pvalue
## hsa03460 Fanconi anemia pathway 1/3 55/9521 0.01723199
## hsa04142 Lysosome 1/3 133/9521 0.04132896
## hsa04210 Apoptosis 1/3 137/9521 0.04255397
## hsa04514 Cell adhesion molecule (CAM) interaction 1/3 160/9521 0.04957752
## p.adjust qvalue geneID Count
## hsa03460 0.04957752 NA 22909 1
## hsa04142 0.04957752 NA 1512 1
## hsa04210 0.04957752 NA 1512 1
## hsa04514 0.04957752 NA 80381 1
##
## $female_specific
## [1] category subcategory ID Description GeneRatio BgRatio
## [7] pvalue p.adjust qvalue geneID Count
## <0 rows> (or 0-length row.names)
##
## $sex_dimorphic
## category subcategory ID Description
## hsa04978 Organismal Systems Digestive system hsa04978 Mineral absorption
## GeneRatio BgRatio pvalue p.adjust qvalue geneID Count
## hsa04978 2/7 61/9521 0.0008306107 0.03654687 0.02622981 7018/491 2
##
## $sex_neutral
## [1] category subcategory ID Description GeneRatio BgRatio
## [7] pvalue p.adjust qvalue geneID Count
## <0 rows> (or 0-length row.names)
Finally, we construct a protein-protein interaction network to explore the interactions between our DEGs.
We first fetch a protein-protein interaction network from the STRING
database. Then, we use the PCSF algorithm via
construct_ppi_pcsf to extract a relevant subnetwork based
on “prizes” derived from our DEG p-values.
# Fetch STRING network (can be replaced with a custom network)
# g <- get_string_network(organism = "9606", score_threshold = 900)
# Load a pre-existing network from a file
g <- readRDS(system.file("extdata", "string_example_network.rds", package = "XYomics")) # This should load the 'g' variable
### Run for all cell type
network_results <- list()
for (cell_type in names(result_categories)) {
# Extract DEG results for the current cell type
cell_results <- result_categories[[cell_type]]
# Filter DEGs by type
male_specific <- cell_results[cell_results$DEG_Type == "male-specific", ]
female_specific <- cell_results[cell_results$DEG_Type == "female-specific", ]
sex_dimorphic <- cell_results[cell_results$DEG_Type == "sex-dimorphic", ]
sex_neutral <- cell_results[cell_results$DEG_Type == "sex-neutral", ]
# Convert to log-transformed prizes
male_prizes <- -log10(male_specific$Male_FDR)
names(male_prizes) <- male_specific$Gene_Symbols
female_prizes <- -log10(female_specific$Female_FDR)
names(female_prizes) <- female_specific$Gene_Symbols
dimorphic_prizes <- -log10((sex_dimorphic$Male_FDR + sex_dimorphic$Female_FDR) / 2)
names(dimorphic_prizes) <- sex_dimorphic$Gene_Symbols
neutral_prizes <- -log10((sex_neutral$Male_FDR + sex_neutral$Female_FDR) / 2)
names(neutral_prizes) <- sex_neutral$Gene_Symbols
# Construct ppis
male_network <- construct_ppi_pcsf(g = g, prizes = male_prizes)
female_network <- construct_ppi_pcsf(g = g, prizes = female_prizes)
dimorphic_network <- construct_ppi_pcsf(g = g, prizes = dimorphic_prizes)
neutral_network <- construct_ppi_pcsf(g = g, prizes = neutral_prizes)
# Store all results per cell type
network_results[[cell_type]] <- list(
male_network = male_network,
female_network = female_network,
dimorphic_network = dimorphic_network,
neutral_network = neutral_network
)
}
network_results
## $`cell type 1`
## $`cell type 1`$male_network
## IGRAPH fe0f794 UNWB 0 0 --
## + attr: name (v/c), prize (v/n), type (v/c), weight (e/n)
## + edges from fe0f794 (vertex names):
##
## $`cell type 1`$female_network
## IGRAPH fe103c9 UNWB 0 0 --
## + attr: name (v/c), prize (v/n), type (v/c), weight (e/n)
## + edges from fe103c9 (vertex names):
##
## $`cell type 1`$dimorphic_network
## IGRAPH fe1207f UNWB 5 4 --
## + attr: name (v/c), prize (v/n), type (v/c), weight (e/n)
## + edges from fe1207f (vertex names):
## [1] CACYBP--GSK3A DVL3 --GSK3A GSK3B --POLR1E CACYBP--GSK3B
##
## $`cell type 1`$neutral_network
## IGRAPH fe13ae8 UNWB 0 0 --
## + attr: name (v/c), prize (v/n), type (v/c), weight (e/n)
## + edges from fe13ae8 (vertex names):
##
##
## $`cell type 2`
## $`cell type 2`$male_network
## IGRAPH fe14b8a UNWB 3 2 --
## + attr: name (v/c), prize (v/n), type (v/c), weight (e/n)
## + edges from fe14b8a (vertex names):
## [1] CCL2--CXCR2 CCL2--CSF2
##
## $`cell type 2`$female_network
## IGRAPH fe16406 UNWB 0 0 --
## + attr: name (v/c), prize (v/n), type (v/c), weight (e/n)
## + edges from fe16406 (vertex names):
##
## $`cell type 2`$dimorphic_network
## IGRAPH fe1805b UNWB 3 2 --
## + attr: name (v/c), prize (v/n), type (v/c), weight (e/n)
## + edges from fe1805b (vertex names):
## [1] FBXO43--SKP1 CACYBP--SKP1
##
## $`cell type 2`$neutral_network
## IGRAPH fe192e6 UNWB 0 0 --
## + attr: name (v/c), prize (v/n), type (v/c), weight (e/n)
## + edges from fe192e6 (vertex names):
##
##
## $`cell type 3`
## $`cell type 3`$male_network
## IGRAPH fe19f3c UNWB 0 0 --
## + attr: name (v/c), prize (v/n), type (v/c), weight (e/n)
## + edges from fe19f3c (vertex names):
##
## $`cell type 3`$female_network
## IGRAPH fe1ae68 UNWB 0 0 --
## + attr: name (v/c), prize (v/n), type (v/c), weight (e/n)
## + edges from fe1ae68 (vertex names):
##
## $`cell type 3`$dimorphic_network
## IGRAPH fe1c417 UNWB 7 6 --
## + attr: name (v/c), prize (v/n), type (v/c), weight (e/n)
## + edges from fe1c417 (vertex names):
## [1] CACYBP--GSK3B CACYBP--GSK3A DVL3 --GSK3A POLR1E--POLR2K POLR2K--SUZ12
## [6] GSK3B --POLR1E
##
## $`cell type 3`$neutral_network
## IGRAPH fe1d8eb UNWB 0 0 --
## + attr: name (v/c), prize (v/n), type (v/c), weight (e/n)
## + edges from fe1d8eb (vertex names):
##
##
## $`cell type 4`
## $`cell type 4`$male_network
## IGRAPH fe1e722 UNWB 0 0 --
## + attr: name (v/c), prize (v/n), type (v/c), weight (e/n)
## + edges from fe1e722 (vertex names):
##
## $`cell type 4`$female_network
## IGRAPH fe1f516 UNWB 0 0 --
## + attr: name (v/c), prize (v/n), type (v/c), weight (e/n)
## + edges from fe1f516 (vertex names):
##
## $`cell type 4`$dimorphic_network
## IGRAPH fe20d40 UNWB 0 0 --
## + attr: name (v/c), prize (v/n), type (v/c), weight (e/n)
## + edges from fe20d40 (vertex names):
##
## $`cell type 4`$neutral_network
## IGRAPH fe2249d UNWB 0 0 --
## + attr: name (v/c), prize (v/n), type (v/c), weight (e/n)
## + edges from fe2249d (vertex names):
##
##
## $`cell type 5`
## $`cell type 5`$male_network
## IGRAPH fe233b7 UNWB 0 0 --
## + attr: name (v/c), prize (v/n), type (v/c), weight (e/n)
## + edges from fe233b7 (vertex names):
##
## $`cell type 5`$female_network
## IGRAPH fe2439c UNWB 3 2 --
## + attr: name (v/c), prize (v/n), type (v/c), weight (e/n)
## + edges from fe2439c (vertex names):
## [1] SEC13--SUMO3 PREB --SEC13
##
## $`cell type 5`$dimorphic_network
## IGRAPH fe253b9 UNWB 3 2 --
## + attr: name (v/c), prize (v/n), type (v/c), weight (e/n)
## + edges from fe253b9 (vertex names):
## [1] POLR1E--POLR2K POLR2K--SUZ12
##
## $`cell type 5`$neutral_network
## IGRAPH fe26310 UNWB 0 0 --
## + attr: name (v/c), prize (v/n), type (v/c), weight (e/n)
## + edges from fe26310 (vertex names):
# Example for one specific cell type
# Create prizes from dimorphic DEGs in 'cell type 1'
dimorphic_specific <- result_one[result_one$DEG_Type == "sex-dimorphic", ]
dimorphic_prizes <- -log10((dimorphic_specific$Male_FDR + dimorphic_specific$Female_FDR) / 2)
names(dimorphic_prizes) <-dimorphic_specific$Gene_Symbols
# Construct the PCSF subnetwork
dimorphic_network <- construct_ppi_pcsf(g = g, prizes = dimorphic_prizes)
We use plot_network() to visualize the network, highlighting nodes based on their degree (i.e., significance). Each node also includes a barplot showing logFC values, blue for males and pink for females.
if (!is.null(dimorphic_network) && igraph::vcount(dimorphic_network) > 0) {
#Generate network visualization
plot_network(dimorphic_network, "Cell type 1", "sex-dimorphic", result_one)
} else {
cat("No network could be constructed for this category.")
}
The generate_cat_report function can be used to compile
all results into a single HTML report.
# This command generates a comprehensive HTML report
network_results <- lapply(network_results , function(cell) {
Filter(Negate(is.null), cell)
})
generate_cat_report(result_categories, pathway_category, network_results)