NMF

Gene program NMF Principle

Principle: Gene program NMF is a dimensionality reduction technique that factorizes a non-negative data matrix \( X \) into two lower-rank non-negative matrices \( W \) and \( H \), such that:

\[ X \approx W H \]

where:

- \( X \) is the original data matrix (shape: \( n \times m \)), with all elements \( x_{ij} \geq 0 \),
- \( W \) (shape: \( n \times k \), \( k \ll m \)) is the basis matrix (or "bin"/"gene program" matrix),
- \( H \) (shape: \( k \times m \)) is the coefficient matrix (encodes the representation of original features in the lower-dimensional space).

Optimization Problem: Gene program NMF minimizes the difference between \( X \) and \( W H \) subject to non-negativity constraints:

\[ \min_{W, H} \|X - WH\|_F^2 \]

\[ \text{subject to: } W \geq 0, \ H \geq 0 \]

where \( \|\cdot\|_F \) indicates the Frobenius norm.

Multiplicative Update Rules: A common algorithm for gene program NMF updates \( H \) and \( W \) alternately with the following multiplicative rules (Lee and Seung, 2001):

\[ H_{kj} \leftarrow H_{kj} \cdot \frac{(W^T X)_{kj}}{(W^T W H)_{kj}} \]

\[ W_{ik} \leftarrow W_{ik} \cdot \frac{(X H^T)_{ik}}{(W H H^T)_{ik}} \]
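The update rules can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used by tardis_spac: the function name `nmf_multiplicative`, the small constant `eps` added to the denominators, and the synthetic input matrix are all assumptions made for the example.

```python
import numpy as np

def nmf_multiplicative(X, k, n_iter=200, eps=1e-10, seed=0):
    """Factorize non-negative X (n x m) into W (n x k) and H (k x m)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    # Strictly positive random start so that no entry is stuck at zero.
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    losses = []
    for _ in range(n_iter):
        # H <- H * (W^T X) / (W^T W H); eps guards against division by zero.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        # W <- W * (X H^T) / (W H H^T)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        # Squared Frobenius norm of the residual after this iteration.
        losses.append(np.linalg.norm(X - W @ H, "fro") ** 2)
    return W, H, losses

# Synthetic non-negative matrix of rank 4 (illustrative data only).
rng = np.random.default_rng(1)
X = rng.random((30, 4)) @ rng.random((4, 20))
W, H, losses = nmf_multiplicative(X, k=4)
```

Because each update multiplies the current value by a non-negative ratio, \( W \) and \( H \) stay non-negative without any explicit projection step.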

Interpretation:

- Each column of \( W \) defines a "gene program" or latent pattern.
- Each column of \( H \) describes the mixture coefficients (i.e., how much of each gene program is present) for the original samples/cells/spots.
- All matrices contain only non-negative values, enhancing interpretability for non-negative biological data (e.g., gene expression).
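As a small illustration of this reading of \( W \) and \( H \), the sketch below takes hand-written factor matrices, lists the highest-weighted genes of each program, and finds the dominant program of each spot. The gene names and all numeric values are made up for the example.

```python
import numpy as np

genes = np.array(["G1", "G2", "G3", "G4", "G5"])  # hypothetical gene names
# W: genes x programs (each column is one gene program).
W = np.array([
    [0.9, 0.0],
    [0.7, 0.1],
    [0.1, 0.8],
    [0.0, 0.6],
    [0.2, 0.2],
])
# H: programs x spots (each column holds one spot's mixture coefficients).
H = np.array([
    [2.0, 0.1, 1.0],
    [0.2, 1.5, 1.0],
])

# Top 2 genes of each program: sort every column of W in descending order.
top_idx = np.argsort(-W, axis=0)[:2]
top_genes = [genes[top_idx[:, p]].tolist() for p in range(W.shape[1])]

# Relative program usage per spot: columns of H rescaled to sum to 1.
usage = H / H.sum(axis=0, keepdims=True)
# Dominant program per spot (ties resolve to the lowest program index).
dominant = usage.argmax(axis=0)
```

Here the first program is driven by G1 and G2 and the second by G3 and G4; the third spot uses both programs equally, which is the kind of mixed membership a hard cluster label hides.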

\[ \begin{aligned} &\mathbf{X} \approx \mathbf{W} \mathbf{H} \\ &\text{with} \quad \mathbf{X} \in \mathbb{R}_{\geq 0}^{n \times m}, \quad \mathbf{W} \in \mathbb{R}_{\geq 0}^{n \times k}, \quad \mathbf{H} \in \mathbb{R}_{\geq 0}^{k \times m} \\ &\Rightarrow \quad \underset{\mathbf{W},\,\mathbf{H}\,\geq\,0}{\mathrm{argmin}} \ \| \mathbf{X} - \mathbf{W} \mathbf{H} \|_F^2 \end{aligned} \]

Gene program NMF clustering using tardis_spac.utils.nmf_clustering()

tardis_spac.utils.nmf_clustering(adata, n_components=10, random_state=42, max_iter=1000, verbose=0, n_top_genes=2000)

Perform NMF-based clustering on highly variable genes.

Parameters:

adata – AnnData object containing the raw expression matrix.
n_components – Number of NMF components/clusters.
random_state – Random seed for reproducibility.
max_iter – Maximum number of NMF iterations.
verbose – Verbosity level for fitting NMF.
n_top_genes – Number of highly variable genes to use.

Returns:

AnnData object containing NMF results in obsm['X_nmf'] and uns['X_nmf_components'].

Usage

With NMF clustering, the highly variable genes that dominantly affect the tumor landscape are identified. These genes are then used to cluster the spots into different gene programs.
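The gene selection step that precedes the factorization can be sketched as follows. How nmf_clustering() actually ranks genes is not stated here, so ranking by plain variance of log-transformed counts is an assumption, as are the helper name `top_variable_genes` and the synthetic counts. The matrix is laid out as spots x genes, the orientation AnnData uses.

```python
import numpy as np

def top_variable_genes(X, n_top_genes):
    """Return column indices of the n_top_genes highest-variance genes.

    X is spots x genes. Ranking by plain variance is an assumption made for
    illustration; dedicated HVG methods also account for mean expression.
    """
    variances = X.var(axis=0)
    # argsort is ascending: take the last n_top_genes, then restore column order.
    return np.sort(np.argsort(variances)[-n_top_genes:])

rng = np.random.default_rng(42)
# Synthetic spots x genes count matrix (illustrative data only).
counts = rng.poisson(1.0, size=(200, 50)).astype(float)
# Make the first 5 genes strongly expressed in half of the spots.
counts[:100, :5] += rng.poisson(20.0, size=(100, 5))

X = np.log1p(counts)                  # log-transform keeps all values non-negative
hvg = top_variable_genes(X, n_top_genes=5)
X_hvg = X[:, hvg]                     # reduced matrix that would be passed to NMF
```

Restricting the factorization to these columns keeps the input non-negative, which NMF requires, while removing genes that carry little spot-to-spot variation.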

Gene program NMF Consensus Clustering using tardis_spac.utils.nmf_consensus()

tardis_spac.utils.nmf_consensus(adata, min_clusters=4, max_clusters=10, n_resamples=100, resample_frac=0.8, random_state=42, n_cluster_genes=50)

Perform consensus clustering using NMF results and compute cluster gene scores.

Parameters:

adata – AnnData object where NMF results have already been computed.
min_clusters – Minimum number of clusters to test.
max_clusters – Maximum number of clusters to test.
n_resamples – Number of resamplings for consensus clustering.
resample_frac – Fraction of samples used in each resample.
random_state – Random seed.
n_cluster_genes – Number of top genes per cluster to use for scoring.

Returns:

AnnData object with consensus clustering and normalized gene scores added to .obs.

Usage

With NMF consensus clustering, the gene programs are further refined by computing the consensus clustering of the NMF results.

This is done by resampling the spots and computing the NMF results for each resample.

The consensus clustering is then computed by averaging the NMF results across the resamples.

The gene scores are then computed by averaging the NMF results across the resamples for each gene.
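The resampling idea can be sketched as a co-clustering consensus matrix. This is an illustrative reading of the description above, not the implementation of nmf_consensus(): the helper names `nmf_labels` and `consensus_matrix`, the assignment of each spot to its highest-loading component, the fixed number of clusters, and the synthetic data are all assumptions.

```python
import numpy as np

def nmf_labels(X, k, rng, n_iter=100, eps=1e-10):
    """Cluster rows of X (spots x genes) by their dominant NMF component."""
    W = rng.random((X.shape[0], k)) + eps
    H = rng.random((k, X.shape[1])) + eps
    for _ in range(n_iter):
        # Multiplicative updates for the squared Frobenius objective.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    # Put components on a common scale before comparing loadings.
    W = W * np.linalg.norm(H, axis=1)
    return W.argmax(axis=1)

def consensus_matrix(X, k, n_resamples=30, resample_frac=0.8, seed=42):
    """Fraction of resamples in which each pair of spots shares a cluster."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    together = np.zeros((n, n))   # times a pair was clustered together
    sampled = np.zeros((n, n))    # times a pair was drawn in the same resample
    size = int(round(resample_frac * n))
    for _ in range(n_resamples):
        idx = rng.choice(n, size=size, replace=False)
        labels = nmf_labels(X[idx], k, rng)
        same = (labels[:, None] == labels[None, :]).astype(float)
        together[np.ix_(idx, idx)] += same
        sampled[np.ix_(idx, idx)] += 1.0
    # Pairs never drawn together keep a consensus of 0.
    return np.divide(together, sampled, out=np.zeros_like(together),
                     where=sampled > 0)

# Synthetic spots x genes matrix with two groups of spots (illustrative only).
rng = np.random.default_rng(0)
X = rng.random((40, 20)) * 0.05
X[:20, :10] += 1.0
X[20:, 10:] += 1.0

C = consensus_matrix(X, k=2)
within = (C[:20, :20].mean() + C[20:, 20:].mean()) / 2
between = C[:20, 20:].mean()
```

Entries of `C` close to 1 mark pairs of spots that are grouped together in nearly every resample, so stable clusters show up as blocks of high consensus. Repeating this for each candidate number of clusters between min_clusters and max_clusters is one way to compare their stability.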