🧬 Dynamic RNA velocity model-- (3) post hoc velocity graph

Introduction

In this third installment of our blog series on effectively applying the dynamic model to infer RNA velocity from single-cell RNA-seq, we start our deep dive into the post hoc computations in scVelo that shape both the visualization and interpretation of RNA velocity.

Here specifically looks into the two key components that are computed post hoc in scVelo to derive the velocity graph:

cosine similarity in section A)
exponential kernel transformation in section B)

And how velocity graph are projected onto an embedding, such as UMAP or t-SNE, which is the common way to visualize the inferred RNA velocity in section C)

Finally, the reconstructability score $r$ in section D), which quantifies how well a subset of genes can recapitulate the overall RNA velocity dynamics inferred from the full set of genes.

A) Cosine Similarity to compute velocity graph

Example Setup

We have two cells $ i $ and $ j $, with gene expression vectors:

$$ s_i = \begin{bmatrix} 2.0 \\ 3.5 \\ 1.0 \end{bmatrix}, \quad s_j = \begin{bmatrix} 3.5 \\ 2.5 \\ 1.5 \end{bmatrix} $$

We also have the velocity vector for cell $ i $:

$$ v_i = \begin{bmatrix} 1.2 \\ -0.5 \\ 0.3 \end{bmatrix} $$

First, we compute the difference between the expression values of the two cells:

$$ \delta_{ij} = s_j - s_i = \begin{bmatrix} 3.5 - 2.0 \\ 2.5 - 3.5 \\ 1.5 - 1.0 \end{bmatrix} = \begin{bmatrix} 1.5 \\ -1.0 \\ 0.5 \end{bmatrix} $$

1. Standard Cosine Similarity

The standard cosine similarity is given by:

$$ \pi_{ij} = \cos(\delta_{ij}, v_i) = \frac{\delta_{ij}^T v_i}{\|\delta_{ij}\| \|v_i\|} $$

Step 1: Compute the dot product $$ \delta_{ij}^T v_i = (1.5)(1.2) + (-1.0)(-0.5) + (0.5)(0.3) = 1.8 + 0.5 + 0.15 = 2.45 $$

Step 2: Compute the magnitudes $$ \|\delta_{ij}\| = \sqrt{(1.5)^2 + (-1.0)^2 + (0.5)^2} = \sqrt{2.25 + 1.00 + 0.25} = \sqrt{3.5} \approx 1.87 $$

$$ \|v_i\| = \sqrt{(1.2)^2 + (-0.5)^2 + (0.3)^2} = \sqrt{1.44 + 0.25 + 0.09} = \sqrt{1.78} \approx 1.34 $$

Step 3: Compute cosine similarity $$ \pi_{ij} = \frac{2.45}{(1.87)(1.34)} = \frac{2.45}{2.51} \approx 0.98 $$

The vectors are strongly aligned in direction.

2. Variance-Stabilized Transformation

Instead of using raw values, we apply:

$$ \text{sign}(\delta_{ij}) \sqrt{|\delta_{ij}|}, \quad \text{sign}(v_i) \sqrt{|v_i|} $$

Step 1: Transform $ \delta_{ij} $

$$ \text{sign}(\delta_{ij}) \sqrt{|\delta_{ij}|} = \begin{bmatrix} \text{sign}(1.5) \sqrt{1.5} \\ \text{sign}(-1.0) \sqrt{1.0} \\ \text{sign}(0.5) \sqrt{0.5} \end{bmatrix} = \begin{bmatrix} 1.22 \\ -1.00 \\ 0.71 \end{bmatrix} $$

Step 2: Transform $ v_i $

$$ \text{sign}(v_i) \sqrt{|v_i|} = \begin{bmatrix} \text{sign}(1.2) \sqrt{1.2} \\ \text{sign}(-0.5) \sqrt{0.5} \\ \text{sign}(0.3) \sqrt{0.3} \end{bmatrix} = \begin{bmatrix} 1.10 \\ -0.71 \\ 0.55 \end{bmatrix} $$

Step 3: Compute transformed cosine similarity $$ \pi_{ij}^{\text{stabilized}} = \frac{\left( 1.22, -1.00, 0.71 \right) \cdot \left( 1.10, -0.71, 0.55 \right)}{\| \text{sign}(\delta_{ij}) \sqrt{|\delta_{ij}|}\| \|\text{sign}(v_i) \sqrt{|v_i|}\|} $$

Dot product $$ 1.22(1.10) + (-1.00)(-0.71) + (0.71)(0.55) = 1.342 + 0.71 + 0.391 = 2.443 $$

Compute magnitudes $$ \sqrt{(1.22)^2 + (-1.00)^2 + (0.71)^2} = \sqrt{1.488 + 1.000 + 0.504} = \sqrt{2.992} \approx 1.73 $$

$$ \sqrt{(1.10)^2 + (-0.71)^2 + (0.55)^2} = \sqrt{1.210 + 0.504 + 0.302} = \sqrt{2.016} \approx 1.42 $$

Final computation $$ \pi_{ij}^{\text{stabilized}} = \frac{2.443}{(1.73)(1.42)} = \frac{2.443}{2.46} \approx 0.99 $$

B) Exponential Kernel Transformation of Cosine Similarity

The raw cosine similarity $ \pi_{ij} $ measures directional alignment between velocity vectors, but to obtain meaningful transition probabilities, we apply an exponential kernel:

$$ \tilde{\pi}_{ij} = \frac{1}{z_i} \exp \left(\frac{\pi_{ij}}{\sigma^2} \right) $$

where:

$ \pi_{ij} $ is the cosine similarity between cell $ i $ and $ j $ .
$ \sigma $ is the kernel width parameter, which controls the spread of probabilities.
$ z_i $ is a row normalization factor, ensuring probabilities sum to 1:

$$ z_i = \sum_j \exp \left(\frac{\pi_{ij}}{\sigma^2} \right) $$

1. Why Exponential Transformation

The exponential function ensures:

Amplification of strong correlations: Large values of $ \pi_{ij} $ get boosted.
Suppression of weak correlations: Small or negative values decay rapidly, reducing their transition influence.
Nonlinear scaling: Instead of treating all similarity scores equally, this transformation sharpens distinctions between stronger and weaker connections.

Think of it as a softmax-like weighting that enhances directional flow probability based on cosine similarity.

2. Row Normalization Ensures Probabilities Sum to 1

The normalization factor $ z_i $ guarantees valid probability distributions:

$$ \sum_j \tilde{\pi}_{ij} = 1 $$

Without this step, raw exponentiated values might grow arbitrarily large, leading to improper probability distributions.

3. Role of Kernel Width σ

Large $ \sigma $ → Softens probability differences, making transitions smoother.
Small $ \sigma $ → Sharper transitions, favoring strong directional alignments.

Choosing an appropriate $ \sigma $ ensures biological interpretability in applications like RNA velocity and cell fate prediction.

C) Project RNA Velocities into an Embedding (e.g., UMAP)

In scVelo, you have:

High-dimensional RNA velocities (in gene expression space)
Low-dimensional embeddings (e.g., UMAP, t-SNE)

The challenge is to project velocities into the embedding so that arrows in UMAP space reflect true dynamics.

1. Key Variables

Let

$\vec{s}_i$ be the embedding position of cell $i$ (e.g., in 2D UMAP space)
$\vec{\delta}_{ij} = \frac{\vec{s}_j - \vec{s}_i}{\|\vec{s}_j - \vec{s}_i\|}$ be the normalized direction vector from cell $i$ to cell $j$
$\pi_{ij}$ be the transition probability from cell $i$ to $j$ in the velocity-derived transition matrix
$\vec{\nu}_i$ be the velocity vector in the embedding space for cell $i$

2. Formula: Projected Velocity Vector

The projected velocity for cell $i$ in the embedding space is:

$$\vec{\nu}_i = E_{\vec{\pi}_i}[\vec{\delta}_{ij}] = \sum_{j \neq i} \pi_{ij} \vec{\delta}_{ij} - \frac{1}{n} \sum_{j \neq i} \vec{\delta}_{ij}$$

Or in compact form:

$$\vec{\nu}_i = \vec{\delta}_{i \cdot}^{\top} \vec{\pi}_i - \frac{1}{n} \sum_{j \neq i} \vec{\delta}_{ij}$$

3. Intuition

The first term is the expected movement direction in embedding space, weighted by the transition probabilities derived from RNA velocity.

The second term, $\frac{1}{n} \sum \vec{\delta}_{ij}$ , corrects for non-uniform density in the embedding. Without it, areas with more cells would bias the velocity field.

This ensures that the resulting velocity arrows:

Follow the dynamics inferred from spliced/unspliced RNA
Are not artifacts of cell density in UMAP
Represent true directionality in cellular state transitions

4. In Practice

This velocity vector $\vec{\nu}_i$ is:

Computed for each cell
Overlaid as arrows on UMAP or t-SNE plots using scv.pl.velocity_embedding()

5. Summary

Term	Meaning
$\vec{s}_i$	Position of cell $i$ in embedding (e.g., UMAP)
$\vec{\delta}_{ij}$	Normalized direction vector from $i$ to $j$
$\pi_{ij}$	Velocity-based transition probability from $i$ to $j$
$\vec{\nu}_i$	Estimated velocity vector in embedding for cell $i$
$\frac{1}{n} \sum \vec{\delta}_{ij}$	Density correction term

6. Mathematical Breakdown

Step 1: Compute Direction Vectors

For each cell $i$ and all its neighbors $j$ :

$$\vec{\delta}_{ij} = \frac{\vec{s}_j - \vec{s}_i}{\|\vec{s}_j - \vec{s}_i\|_2}$$

Step 2: Get Transition Probabilities

From the velocity graph computed by scVelo:

$$\pi_{ij} = P(\text{cell } i \rightarrow \text{cell } j | \text{RNA velocity})$$

Step 3: Weighted Average Direction

$$\text{Expected direction} = \sum_{j \neq i} \pi_{ij} \vec{\delta}_{ij}$$

Step 4: Density Correction

$$\text{Uniform correction} = \frac{1}{n} \sum_{j \neq i} \vec{\delta}_{ij}$$

Step 5: Final Velocity

$$\vec{\nu}_i = \text{Expected direction} - \text{Uniform correction}$$

7. Alternative Formulations

Kernel-Based Approach

Some implementations use a kernel-weighted version:

$$\vec{\nu}_i = \frac{\sum_j K(\vec{s}_i, \vec{s}_j) \pi_{ij} \vec{\delta}_{ij}}{\sum_j K(\vec{s}_i, \vec{s}_j)}$$

where $K(\vec{s}_i, \vec{s}_j)$ is a distance-based kernel (e.g., Gaussian).

Confidence Weighting

Incorporate velocity confidence scores:

$$\vec{\nu}_i = \frac{\sum_j c_{ij} \pi_{ij} \vec{\delta}_{ij}}{\sum_j c_{ij}}$$

where $c_{ij}$ represents the confidence in the transition from $i$ to $j$ .

8. Implementation Notes

The projected velocities are typically stored in:

adata.obsm['velocity_umap'] for UMAP projections
adata.obsm['velocity_tsne'] for t-SNE projections

These can be accessed and visualized using scVelo’s plotting functions to show the directional flow of cellular development in low-dimensional space.

D) The Reconstructability Score r

1. Mathematical Definition

$$r = \text{median}_i \, \text{corr}(\pi_i, \pi_i')$$

Where:

$\pi_i$ : The row vector of the full velocity graph for cell $i$ , representing its outgoing transition probabilities to all other cells.
$\pi_i'$ : The row vector of the reduced velocity graph for cell $i$ , representing its outgoing transition probabilities based only on the selected genes.
$\text{corr}(\pi_i, \pi_i')$ : This calculates the correlation between the full outgoing transition probabilities for cell $i$ and the outgoing transition probabilities derived from the subset of genes for cell $i$ . A high correlation means that the subset of genes largely reproduces the directional information provided by all genes for that particular cell.
$\text{median}_i$ : The score is then computed as the median of these per-cell correlations across all cells $i$ . Taking the median makes the score robust to outliers (e.g., a few cells where the subset of genes might perform poorly due to local noise).

2. Meaning of Reconstructability Score

The reconstructability score quantifies how well a specific subset of genes can “reconstruct” or recapitulate the overall RNA velocity dynamics inferred from the full set of genes.

High Score (close to 1)

If $r$ is high (e.g., $> 0.8$ ), it suggests that the selected subset of genes (e.g., top likelihood genes) are indeed the primary drivers of the observed cellular transitions. Their dynamics alone are largely sufficient to explain the overall directionality and speed of differentiation. This validates their “driver” status and indicates that focusing on these genes provides a good summary of the system’s dynamics.

Low Score

If $r$ is low, it means that the selected subset of genes does not adequately capture the full dynamic picture. The overall cellular transitions are likely influenced by a broader set of genes not included in your selection, or the selected genes might not be truly representative of the overall dynamics.

3. Mathematical Breakdown

Transition Probability Vectors

For each cell $i$ , the transition probabilities are derived from the velocity graph:

$$\pi_i = [\pi_{i,1}, \pi_{i,2}, \ldots, \pi_{i,N}]$$

where $\pi_{i,j}$ represents the probability of cell $i$ transitioning to cell $j$ , and $N$ is the total number of cells.

Correlation Calculation

The correlation between full and reduced transition probabilities:

$$\text{corr}(\pi_i, \pi_i') = \frac{\text{cov}(\pi_i, \pi_i')}{\sigma_{\pi_i} \sigma_{\pi_i'}}$$

where:

$\text{cov}(\pi_i, \pi_i')$ is the covariance
$\sigma_{\pi_i}$ and $\sigma_{\pi_i'}$ are the standard deviations

Median Aggregation

The final score uses median aggregation for robustness:

$$r = \text{median}\{c_1, c_2, \ldots, c_N\}$$

where $c_i = \text{corr}(\pi_i, \pi_i')$ for each cell $i$ .

4. Usage and Biology

Validation of Driver Gene Selection

It’s commonly used to validate that the “top likelihood genes” (which scVelo identifies as dynamically strong) are indeed the key players shaping the global trajectory.

Identifying Essential Gene Sets

You can test specific gene modules or pathways to see if they are collectively responsible for a particular fate decision or developmental progression.

Dimensionality Reduction/Summarization

If a small subset of genes yields a high reconstructability score, it suggests that these genes form a powerful, low-dimensional representation of the system’s dynamics.

Biological Interpretation

High reconstructability from a specific gene set points to their significant biological role in driving the observed cellular changes.