Is the Copula Independent of Batch Effects in scDesign3?

Mar 22, 2025 · 6 min read

The Question

Is the copula independent of batch effects in scDesign3?

This is an excellent and insightful question that goes to the heart of why you would separate the marginal and copula fitting steps in a single-cell simulation workflow.

The Answer

The Vine Copula structure you see in scDesign3 (e.g., cp$copula_list$techctrl) is NOT theoretically independent of the batch effect (e.g., orig.ident), but in your scDesign3 implementation, it is modeled to be independent of the linear effects of that batch.

Here is a breakdown of the dependence:

How Batch Effects Influence the Copula

A. Direct Influence (The Marginals)

Batch effects strongly influence the marginal distributions (the mean μ and dispersion σ) of the genes. Your gamlss model specifically accounts for this by including orig.ident in the formulas:

μ_formula = orig.ident + cellType.v1 + offset(log(nCount_RNA))
σ_formula = orig.ident + cellType.v1

B. Residual Influence (The Copula)

The copula is fitted to the residuals or quantile-transformed values of the data, after the effects of the marginal model (μ and σ) have been removed.

$$\text{Quantile-transformed value} = F_i(Y_i \mid \text{Covariates})$$

Where $F_i$ is the estimated cumulative distribution function (CDF) for gene $i$.

Because the marginal model includes orig.ident, the variations due to batch are mostly absorbed by the $\hat{\beta}_{\text{orig.ident}}$ coefficients in the μ and σ models. The copula, therefore, is fitted to the remaining, unexplained dependency between genes.

Conclusion: The copula models the dependency structure that exists after controlling for the linear, predictable effects of the batch and other covariates.

Why You Can Use This Copula for Simulation

When you use this copula in your simulation step (simu_new), you are leveraging a key assumption:

The batch effect primarily scales and shifts the mean and variance (the marginals), but does not fundamentally alter the intrinsic biological relationships (the correlations) between genes.

By using the original copula structure with your adjusted mean matrix (mean_mat_adjusted) where the orig.ident effect has been neutralized (i.e., the batch term is set to zero or a reference level), you achieve the desired outcome:

  • New Marginal Means: Expression levels are set to the desired, batch-corrected values.
  • Preserved Correlation: The simulation imposes the original, biologically realistic pattern of gene-gene dependencies onto the new mean values.

In summary, the copula is independent of the specific level of the batch effect (e.g., whether orig.ident is Batch 1 or Batch 2), but is absolutely dependent on the underlying biological and technical noise that shapes the gene co-expression patterns in your data.

Why is the Copula Fitted to the Residuals?

The copula is fitted to the residuals (or, more precisely, the probability integral transform values, often called u-values) for a fundamental reason related to the separation of a joint distribution into two independent parts: marginals and dependence structure.

This separation is the core principle of Sklar’s Theorem, which underlies all copula modeling.

Here is why the fitting is done on the transformed values:

1. Separating the Signal: Marginals vs. Dependence

A. The Marginal Model (μ and σ)

The gamlss model (which fits μ, σ, and potentially ν) captures the univariate properties of each gene. That is, for Gene $i$, it tells you its expected count ($\mu_i$) and its variability ($\sigma_i$) based on the observed covariates (batch, cell type, library size).

This marginal model effectively removes the major, predictable shifts in expression caused by covariates.

B. The Copula Model

The copula is only interested in the correlation or co-movement between genes that remains after the marginal effects have been accounted for.

If you fitted the copula directly to the raw, noisy, count data (or log-transformed counts), the strong effects of μ and σ would contaminate the estimate of the correlation. The correlation observed might simply be due to two genes both being expressed highly in a high-nCount_RNA cell, rather than an intrinsic biological link.

2. Achieving the Uniform Scale

The probability integral transform (PIT) converts the original data $Y$ into a uniform distribution $U \sim U(0,1)$. This is the key step:

$$U_i = F_i(Y_i \mid \text{Covariates})$$

Where $F_i$ is the estimated cumulative distribution function (CDF) for gene $i$.

Why Uniformity is Necessary for Copulas:

  • Standardization: The PIT process standardizes the data by removing the influence of the specific marginal distribution (Negative Binomial in your case) and its parameters (μ, σ). The u-values represent the percentile rank of the observation within its own estimated distribution.

  • Copula Definition: By definition, a copula is a multivariate CDF of a vector of uniform random variables. It exists on the standardized unit hypercube $[0,1]^d$.

  • Isolating Dependence: By operating on the uniform scale, the copula is forced to only model the structure of dependence (how the genes co-move) because the individual properties (the shape and location of the Negative Binomial distribution) have been stripped away.

In essence, fitting the copula to the u-values allows you to assume that the dependence structure is independent of the marginal distributions, making the model far more flexible and robust for simulation and counterfactual analysis (like your batch correction).

The Critical Correction: Workflow Order Matters

This is statistically critical for the scDesign3 workflow.

The copula models the dependency structure on the original data’s quantiles/residuals. It must be calculated before the mean parameters are modified for simulation. Modifying the means before calculating the copula would mean the copula is based on a distribution that does not reflect the original data’s correlation, leading to a meaningless dependence structure.

Since the goal is to produce a simulated count matrix (newcount_no_origident) that uses the adjusted mean matrix and the original copula structure, the standard scDesign3 workflow is:

  1. Fit Marginals (fit_marginal).
  2. Fit Copula (fit_copula) using the unadjusted marginals. (This is the critical step that must remain early.)
  3. Extract Parameters (extract_para) for the no-batch effect scenario.
  4. Adjust Means (to align tech_ctrl to real_sample).
  5. Simulate (simu_new) using the adjusted means and the original copula.

Why This Order is Essential

If you calculate the copula after modifying the means, you would be fitting it to a distribution that never existed in your original data. This would corrupt the biological signal—the gene-gene relationships you’re trying to preserve.

The copula is like capturing a snapshot of the correlation structure in your actual observed data. That snapshot must be taken from the real data, not from an artificially adjusted version.

Summary

The copula in scDesign3:

  • Is NOT theoretically independent of batch effects
  • Is modeled to be independent of the linear effects of batch after marginal fitting
  • Models the dependency structure after controlling for predictable batch effects
  • Must be fitted to the original data before any mean adjustments
  • Can then be applied to adjusted marginals to preserve biological correlations while removing batch effects

This workflow preserves the intrinsic biological relationships between genes while achieving the desired batch correction in simulation.