Generate Simulated Pseudo-Bulk Data for Two Phenotypes
Source: R/generate_simulated_bulk_data.R
Simulates pseudo-bulk data from single-cell or spatial transcriptomics input by aggregating data across regions/phenotypes, with optional perturbation.
Usage
generate_simulated_bulk_data(
  input_data,
  region_labels,
  phenotypes,
  perturbation_percent = 0.1,
  num_samples = 50,
  mode = c("proportion", "expression"),
  seed = 123
)
Arguments
- input_data
A matrix or data.frame. For mode = "expression", it should be a numeric matrix with rows as genes and columns as spots. For mode = "proportion", it can be a vector of cell types.
- region_labels
A named vector with phenotype labels. Names must match column names (spots) of input_data.
- phenotypes
A character vector of two phenotype labels to compare, e.g., c("A", "B").
- perturbation_percent
Numeric; amount of random noise to add, given as a fraction between 0 and 1 (so 0.1 means 10%). Default is 0.1.
- num_samples
Integer; number of pseudo-bulk samples to generate for each phenotype. Default is 50.
- mode
Character; either "expression" or "proportion". Default is "proportion".
- seed
Integer; random seed for reproducibility. Default is 123.
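The requirement that the names of region_labels match the spots in input_data can be checked before calling the function. The helper below is a minimal sketch for expression-mode input; check_region_labels is a hypothetical name used only for illustration and is not part of the package, and the package may perform different or additional checks internally.

```r
# Hypothetical helper (not part of the package): verifies that region_labels
# lines up with the columns (spots) of an expression matrix and that both
# requested phenotypes actually occur among the labels.
check_region_labels <- function(input_data, region_labels, phenotypes) {
  stopifnot(length(phenotypes) == 2)
  if (is.null(names(region_labels))) {
    stop("region_labels must be a named vector")
  }
  missing_spots <- setdiff(colnames(input_data), names(region_labels))
  if (length(missing_spots) > 0) {
    stop("spots without a label: ", paste(missing_spots, collapse = ", "))
  }
  missing_pheno <- setdiff(phenotypes, unique(region_labels))
  if (length(missing_pheno) > 0) {
    stop("phenotypes not found in region_labels: ",
         paste(missing_pheno, collapse = ", "))
  }
  invisible(TRUE)
}

# Toy input: 4 genes x 10 spots, five spots per phenotype
gene_mat <- matrix(runif(40), nrow = 4, ncol = 10)
colnames(gene_mat) <- paste0("Spot", 1:10)
labels <- setNames(rep(c("A", "B"), each = 5), colnames(gene_mat))
ok <- check_region_labels(gene_mat, labels, c("A", "B"))
```

Running a check like this first can make a mismatch between labels and spots easier to diagnose than an error raised during aggregation.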
Value
A list with two data.frames:
First element: pseudo-bulk samples for phenotype I
Second element: pseudo-bulk samples for phenotype II
Details
The function supports two modes:
"expression": Uses a numeric expression matrix (e.g., genes × spots).
"proportion": Uses categorical input (e.g., cell type labels) and converts it to proportions.
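To make the "proportion" conversion concrete, the sketch below turns per-cell type labels into cell type proportions for each phenotype. The toy vectors and the helper proportions_for are made up for illustration; the package's internal conversion may differ in detail.

```r
# Toy data (made up for illustration): one cell type label and one
# phenotype/region label per cell, both named by cell ID.
cell_types <- c(c1 = "Neuron", c2 = "Astrocyte", c3 = "Neuron",
                c4 = "Microglia", c5 = "Neuron", c6 = "Astrocyte")
regions <- c(c1 = "A", c2 = "A", c3 = "A", c4 = "B", c5 = "B", c6 = "B")

# Fix the factor levels so every phenotype reports the same cell types,
# including those with zero cells in that phenotype.
levels_all <- sort(unique(cell_types))

# Hypothetical helper: cell type proportions within one phenotype
proportions_for <- function(phenotype) {
  in_group <- names(regions)[regions == phenotype]
  counts <- table(factor(cell_types[in_group], levels = levels_all))
  counts / sum(counts)
}

prop_A <- proportions_for("A")
prop_B <- proportions_for("B")
```

Here phenotype "A" contains two neurons and one astrocyte, so its proportions are 2/3 and 1/3, with 0 for microglia.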
For each phenotype group, the function randomly perturbs the data and averages columns to simulate multiple pseudo-bulk samples. This is useful for benchmarking or downstream analysis.
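The perturb-and-average step for "expression" mode can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: the noise model (scaling each entry by a random factor in [1 - p, 1 + p]), the helper name simulate_pseudo_bulk, and the genes × samples layout of the result are all assumptions.

```r
# Hypothetical sketch; the package's actual perturbation scheme may differ.
simulate_pseudo_bulk <- function(expr, labels, phenotype,
                                 perturbation_percent = 0.1,
                                 num_samples = 50) {
  # Select the spots (columns) that belong to this phenotype
  spots <- names(labels)[labels == phenotype]
  group <- expr[, spots, drop = FALSE]

  samples <- vapply(seq_len(num_samples), function(i) {
    # Assumed noise model: scale each entry by a factor in [1 - p, 1 + p]
    noise <- matrix(
      runif(length(group), 1 - perturbation_percent, 1 + perturbation_percent),
      nrow = nrow(group)
    )
    # Average across spots to obtain one pseudo-bulk profile per gene
    rowMeans(group * noise)
  }, numeric(nrow(group)))

  # Keep a genes x samples matrix even when there is a single gene
  samples <- matrix(samples, nrow = nrow(group))
  rownames(samples) <- rownames(expr)
  colnames(samples) <- paste0(phenotype, "_sample", seq_len(num_samples))
  as.data.frame(samples)
}

set.seed(123)
gene_mat <- matrix(runif(1000), nrow = 100, ncol = 10,
                   dimnames = list(paste0("Gene", 1:100), paste0("Spot", 1:10)))
labels <- setNames(rep(c("A", "B"), each = 5), colnames(gene_mat))
bulk_A <- simulate_pseudo_bulk(gene_mat, labels, "A", num_samples = 20)
```

Under this noise model, each simulated value stays within perturbation_percent of the unperturbed group mean for non-negative expression values, which gives a simple sanity check when benchmarking.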
Examples
if (FALSE) { # \dontrun{
  # Simulate from a gene expression matrix (100 genes x 10 spots)
  gene_mat <- matrix(runif(1000), nrow = 100, ncol = 10)
  colnames(gene_mat) <- paste0("Spot", 1:10)
  labels <- setNames(rep(c("A", "B"), each = 5), colnames(gene_mat))
  result <- generate_simulated_bulk_data(
    input_data = gene_mat,
    region_labels = labels,
    phenotypes = c("A", "B"),
    mode = "expression"
  )

  # Simulate from cell type labels using the osmFISH datasets loaded via data()
  data("osmFISH_metadata_cellType")
  data("osmFISH_metadata_region")
  data("osmFISH_phenotype_simu")
  pseudo_bulk_simi <- generate_simulated_bulk_data(
    input_data = osmFISH_metadata_cellType,
    region_labels = osmFISH_metadata_region,
    phenotypes = osmFISH_phenotype_simu,
    perturbation_percent = 0.1,
    num_samples = 50,
    mode = "proportion"
  )

  # Inspect the pseudo-bulk samples for the first phenotype
  head(pseudo_bulk_simi[[1]])
} # }