LocalModule

Guide Ranking by Cluster-Dependent Distribution

This module provides two main statistical approaches for ranking guides in spatial transcriptomics data based on their distribution across clusters: PERMANOVA-based ranking and Aitchison distance-based ranking. Both methods compare each guide’s distribution to that of a reference (usually a negative control guide, e.g., “sgNon-targeting”) and support permutation testing for empirical significance assessment.

PERMANOVA-Based Clustering Guide Ranking

permanova(gdata, cluster_field, result_field='permanova_f_value', reference_guide='sgNon-targeting', library_key=None, count_bins=10, n_permutations=1000, show_progress=True, copy=False)

Ranks guides with a PERMANOVA (Permutational Multivariate Analysis of Variance) test that compares each guide’s distribution across clusters with that of a specified reference guide. The method is adapted to spatial transcriptomics: the cluster-count distributions of each guide and of the reference are summarized by kernel density estimation and compared using the Bray-Curtis distance.

Mathematical Principle:

Let \(X_g = \{x_{gc}\}\) be the vector of cell counts (or expression) for guide \(g\) across each cluster \(c\). For guide \(g\) and the reference guide, kernel density estimation is performed over the binned counts. The two group distributions are compared using Bray-Curtis dissimilarity. The PERMANOVA test pseudo F-statistic for the two groups is then computed as

\[F = \frac{SS_B}{SS_W + \epsilon}\]

where \(SS_B\) is the between-group mean distance, \(SS_W\) is the within-group mean distance, and \(\epsilon\) is a small constant for numerical stability.

Permutation testing is used to assess significance by randomly swapping cluster densities between the guide group and the reference across permutation runs.
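
The statistic and permutation scheme described above can be sketched as follows. This is a minimal illustration, not the module’s implementation. It assumes each group is an array holding one density vector per cluster (rows paired by cluster), that paired rows are swapped with probability 0.5, and that the empirical p-value uses the common (hits + 1) / (n + 1) estimator. The function names pseudo_f and permutation_p_value are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist


def pseudo_f(guide, reference, eps=1e-10):
    """Mean between-group over mean within-group Bray-Curtis distance."""
    between = cdist(guide, reference, metric="braycurtis").mean()
    within = np.concatenate([
        pdist(guide, metric="braycurtis"),
        pdist(reference, metric="braycurtis"),
    ]).mean()
    return between / (within + eps)


def permutation_p_value(guide, reference, n_permutations=1000, seed=0):
    """Empirical p-value from swapping cluster rows between the two groups."""
    rng = np.random.default_rng(seed)
    observed = pseudo_f(guide, reference)
    hits = 0
    for _ in range(n_permutations):
        # Assumption: each cluster's pair of rows is swapped with probability 0.5.
        swap = rng.random(guide.shape[0]) < 0.5
        perm_guide = np.where(swap[:, None], reference, guide)
        perm_reference = np.where(swap[:, None], guide, reference)
        if pseudo_f(perm_guide, perm_reference) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_permutations + 1)


# Toy data: 10 clusters x 5 bins, with the two groups skewed in opposite directions.
rng = np.random.default_rng(1)
profile = np.array([5.0, 4.0, 3.0, 1.0, 0.5])
guide = profile + 0.1 * rng.random((10, 5))
reference = profile[::-1] + 0.1 * rng.random((10, 5))

f_value, p_value = permutation_p_value(guide, reference, n_permutations=200)
print(f"pseudo-F = {f_value:.2f}, empirical p = {p_value:.4f}")
```

A large pseudo-F with a small empirical p-value indicates that the guide’s cluster densities differ from the reference by more than random swapping would explain.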

Parameters:

gdata – AnnData object with .obs (for cell metadata), .var (for guide info), and .X (expression/count matrix).
cluster_field – Key in .obs indicating cluster labels.
result_field – Field name in .var to write PERMANOVA F statistic (default: “permanova_f_value”).
reference_guide – The guide name used as the negative control/reference (default: “sgNon-targeting”).
library_key – If set, analysis will be performed separately per batch/sample group (default: None).
count_bins – Number of bins for cluster-wise density estimation (default: 10).
n_permutations – Number of permutations for empirical p-value estimation (default: 1000).
show_progress – Whether to display progress during permutations (default: True).
copy – If True, return a new modified AnnData object; else, modify in-place (default: False).

Returns:

None if copy=False (results are written to gdata.var in place); otherwise a new AnnData object with results written to .var.

Example:

import tardis as td
td.stats.permanova(adata, cluster_field='leiden')
adata.var.sort_values('permanova_f_value', ascending=False)

After running, the field “permanova_f_value” in .var contains F statistics for each guide, and “permanova_f_value.p_value” contains empirical p-values from permutation tests.

Note: The method compares the shape of cluster abundance distributions between guides and the reference, independently for each guide. Clustering assignment is required in advance.

Aitchison Distance-Based Guide Ranking

aitchison_distance(gdata, cluster_field, result_field='aitchison_dist', reference_guide='sgNon-targeting', library_key=None, n_permutations=1000, show_progress=True, p_swap=0.1, copy=False)

Computes the Aitchison distance between each guide and the reference guide based on their compositional (cluster-wise) abundances, ranking guides by the resulting distance. The method supports permutation testing by swapping cluster values between the guide and reference with probability p_swap.

Mathematical Principle:

Given the abundance composition of each guide \(g\) over clusters \(c\) (counts \(x_{gc}\)), the composition vector is transformed as:

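\[v_{g,c} = \log(x_{gc} + 1) - \frac{1}{C} \sum_{c'=1}^{C} \log(x_{gc'} + 1)\]

where \(C\) is the number of clusters. This is the centered log-ratio (CLR) transform applied with a pseudocount of 1, so clusters with zero counts remain defined.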

The Aitchison distance is then:

\[d(g, r) = \sqrt{ \sum_c (v_{g,c} - v_{r,c})^2 }\]

Permutation testing proceeds by swapping counts between guide and reference in each cluster with probability p_swap and recomputing Aitchison distances.
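
The transform, the distance, and the swap step can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the module’s code: the function names clr, aitchison, and swap_null_distances are hypothetical, and the sketch only generates the permuted distances without assuming how the module turns them into a p-value.

```python
import numpy as np


def clr(counts):
    """Centered log-ratio transform with a pseudocount of 1."""
    logs = np.log(np.asarray(counts, dtype=float) + 1.0)
    return logs - logs.mean()


def aitchison(guide_counts, ref_counts):
    """Euclidean distance between the CLR-transformed count vectors."""
    return float(np.linalg.norm(clr(guide_counts) - clr(ref_counts)))


def swap_null_distances(guide_counts, ref_counts, n_permutations=1000,
                        p_swap=0.1, seed=0):
    """Distances recomputed after swapping each cluster with probability p_swap."""
    rng = np.random.default_rng(seed)
    guide = np.asarray(guide_counts, dtype=float)
    ref = np.asarray(ref_counts, dtype=float)
    distances = np.empty(n_permutations)
    for i in range(n_permutations):
        swap = rng.random(guide.shape[0]) < p_swap
        distances[i] = aitchison(np.where(swap, ref, guide),
                                 np.where(swap, guide, ref))
    return distances


# Toy per-cluster cell counts for one guide and the reference (invented numbers).
guide_counts = [30, 5, 0, 12]
ref_counts = [10, 10, 10, 10]

observed = aitchison(guide_counts, ref_counts)
null = swap_null_distances(guide_counts, ref_counts, n_permutations=200)
print(f"observed distance = {observed:.3f}, permuted mean = {null.mean():.3f}")
```

Swapping every cluster at once leaves the distance unchanged because the distance is symmetric, so only partial swaps contribute variation to the permuted values.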

Parameters:

gdata – AnnData object containing .obs, .var, .X.
cluster_field – Key in .obs denoting cluster assignment for each cell.
result_field – Output field name in .var for storing Aitchison distances (default: “aitchison_dist”).
reference_guide – Reference guide name (default: “sgNon-targeting”).
library_key – Key in .obs for performing per-sample (library) analysis (default: None).
n_permutations – Number of permutations for p-value estimation (default: 1000).
show_progress – Display progress bar for permutations (default: True).
p_swap – Probability of swapping a cluster’s counts between the guide and the reference in each permutation (default: 0.1).
copy – Return new AnnData if True; operate in-place if False (default: False).

Returns:

None if copy=False (results are written to gdata.var in place); otherwise a new AnnData object with results written to .var.

Example:

import tardis as td
td.stats.aitchison_distance(adata, cluster_field='leiden')
adata.var.sort_values('aitchison_dist', ascending=False)

After running, “aitchison_dist” and “aitchison_dist.p_value” fields in .var will contain distance and corresponding empirical p-values.

Note: This method treats guide cluster abundance as a composition and measures divergence from the reference using Aitchison geometry (Euclidean distance after a log-ratio transform of the composition). The permutation null, built by swapping counts with the reference, provides an empirical p-value for each distance. Clustering assignment is required in advance.

Result Storage & Usage

All ranking and statistical results (scores and p-values) are written to fields in adata.var as specified by result_field.
Use the result to filter, rank, or further visualize guides that drive distinct cluster distributions compared to controls.
Both methods optionally allow per-sample or per-library group analysis via library_key, writing per-group results.
Permutation-based p-values give an empirical measure of significance for each guide without assuming a parametric distribution for the test statistic.
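
As an illustration of the filtering and ranking step, the sketch below uses a small hand-made pandas DataFrame in place of adata.var. The guide names and numbers are invented for the example, and the 0.05 threshold is an arbitrary choice. Only the column names follow the defaults described above.

```python
import pandas as pd

# Stand-in for adata.var after running the PERMANOVA ranking (values are made up).
var = pd.DataFrame(
    {
        "permanova_f_value": [4.2, 2.5, 3.1, 0.9],
        "permanova_f_value.p_value": [0.001, 0.20, 0.01, 0.80],
    },
    index=["sgGeneA", "sgGeneB", "sgGeneC", "sgGeneD"],  # hypothetical guides
)

# Keep guides below the chosen p-value threshold, strongest effect first.
significant = var[var["permanova_f_value.p_value"] < 0.05]
ranked = significant.sort_values("permanova_f_value", ascending=False)
print(ranked)
```

The same pattern applies to the “aitchison_dist” and “aitchison_dist.p_value” fields.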