Simulates pseudo-bulk data from single-cell or spatial transcriptomics input by aggregating data across regions/phenotypes, with optional perturbation.

Usage

generate_simulated_bulk_data(
  input_data,
  region_labels,
  phenotypes,
  perturbation_percent = 0.1,
  num_samples = 50,
  mode = c("proportion", "expression"),
  seed = 123
)

Arguments

input_data

A matrix, data.frame, or vector. For mode = "expression", it should be a numeric matrix with genes as rows and spots as columns. For mode = "proportion", it can be a vector of cell type labels.

region_labels

A named vector with phenotype labels. Names must match column names (spots) of input_data.

phenotypes

A character vector of two phenotype labels to compare, e.g., c("A", "B").

perturbation_percent

Numeric; amount of random noise to add, given as a fraction between 0 and 1 (0.1 means 10%). Default is 0.1.

num_samples

Integer; number of pseudo-bulk samples to generate for each phenotype. Default is 50.

mode

Character; either "expression" or "proportion". Default is "proportion".

seed

Integer; random seed for reproducibility. Default is 123.

Value

A list with two data.frames:

  • First element: pseudo-bulk samples for the first phenotype in phenotypes

  • Second element: pseudo-bulk samples for the second phenotype in phenotypes

Details

The function supports two modes:

  • "expression": Uses numeric expression matrix (e.g., genes × spots).

  • "proportion": Uses categorical input (e.g., cell type labels) and converts to proportions.

For each phenotype group, the function randomly perturbs the data and averages columns, repeating this num_samples times to produce that many pseudo-bulk samples per phenotype. The simulated samples can be used for benchmarking or as input to downstream analysis.
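The perturb-and-average idea can be illustrated with a minimal sketch. This is not the package's implementation: the helper names sketch_bulk_expression() and sketch_bulk_proportion() are hypothetical, and the noise models (uniform multiplicative scaling for expression, random label reassignment for proportions) are assumptions chosen only for illustration.

```r
# Hypothetical helpers for illustration; NOT the package's own code.

# Expression mode: perturb a genes x spots matrix, then average over spots.
sketch_bulk_expression <- function(expr, perturbation_percent = 0.1,
                                   num_samples = 5, seed = 123) {
  set.seed(seed)
  samples <- replicate(num_samples, {
    # Assumed noise model: scale each entry by a factor in [1 - p, 1 + p].
    noise <- matrix(
      runif(length(expr), 1 - perturbation_percent, 1 + perturbation_percent),
      nrow = nrow(expr)
    )
    rowMeans(expr * noise)
  })
  samples <- matrix(samples, nrow = nrow(expr))  # keep matrix shape for 1 gene
  dimnames(samples) <- list(rownames(expr),
                            paste0("sample", seq_len(num_samples)))
  as.data.frame(samples)
}

# Proportion mode: perturb cell type labels, then convert to proportions.
sketch_bulk_proportion <- function(cell_types, perturbation_percent = 0.1,
                                   num_samples = 5, seed = 123) {
  set.seed(seed)
  cell_types <- factor(cell_types)
  n_swap <- round(perturbation_percent * length(cell_types))
  samples <- replicate(num_samples, {
    perturbed <- cell_types
    # Assumed noise model: reassign a random subset of labels.
    idx <- sample.int(length(cell_types), n_swap)
    perturbed[idx] <- sample(levels(cell_types), n_swap, replace = TRUE)
    as.vector(prop.table(table(perturbed)))
  })
  samples <- matrix(samples, nrow = nlevels(cell_types))
  dimnames(samples) <- list(levels(cell_types),
                            paste0("sample", seq_len(num_samples)))
  as.data.frame(samples)
}

# Toy input: 8 genes x 5 spots, with the first three spots in region "A".
set.seed(1)
expr <- matrix(runif(40), nrow = 8,
               dimnames = list(paste0("Gene", 1:8), paste0("Spot", 1:5)))
region <- setNames(c("A", "A", "A", "B", "B"), colnames(expr))
expr_A <- expr[, names(region)[region == "A"], drop = FALSE]
bulk_A <- sketch_bulk_expression(expr_A)

# Toy cell type labels for the spots of one region.
types_A <- c("Neuron", "Neuron", "Astrocyte", "Microglia", "Neuron",
             "Astrocyte", "Neuron", "Microglia", "Neuron", "Astrocyte")
props_A <- sketch_bulk_proportion(types_A)
```

In this sketch each column of the returned data.frame is one pseudo-bulk sample; the orientation of the data.frames returned by generate_simulated_bulk_data() itself may differ.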

Author

Bin Duan

Examples

if (FALSE) { # \dontrun{
# Simulate from gene expression matrix
gene_mat <- matrix(runif(1000), nrow = 100, ncol = 10)
colnames(gene_mat) <- paste0("Spot", 1:10)
labels <- setNames(rep(c("A", "B"), each = 5), colnames(gene_mat))
result <- generate_simulated_bulk_data(
  input_data = gene_mat,
  region_labels = labels,
  phenotypes = c("A", "B"),
  mode = "expression"
)

# Simulate from cell type labels using the bundled osmFISH example data
data("osmFISH_metadata_cellType")
data("osmFISH_metadata_region")
data("osmFISH_phenotype_simu")
pseudo_bulk_simi <- generate_simulated_bulk_data(
  input_data = osmFISH_metadata_cellType,
  region_labels = osmFISH_metadata_region,
  phenotypes = osmFISH_phenotype_simu,
  perturbation_percent = 0.1,
  num_samples = 50,
  mode = "proportion"
)

# Inspect the pseudo-bulk samples for the first phenotype
head(pseudo_bulk_simi[[1]])
} # }