Preprocessing

Note

This page is an API reference for the functions available in TARDIS. For a more detailed walkthrough, please visit our Build your first project site.

Visium HD

With Visium HD data, you should run Spaceranger with your guide reference and probe set reference.

For building a customized Visium HD reference, see the following tutorial:

- Customized guide reference building

Once this is set up successfully, you should find the guide reads in the filtered_bc_matrix.mtx file in the Spaceranger output directory.

Read Spaceranger output

You can simply apply scanpy.read_10x_mtx to read the guide reads in your filtered_bc_matrix.mtx file.

Remember to set gex_only to False for proper guide read assignment; with the default (True), scanpy keeps only features of type "Gene Expression".
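A minimal sketch of this step, assuming scanpy is installed and that your guide features share a common name prefix. The helper name read_guide_counts and the "sg" prefix are illustrative assumptions, not part of TARDIS; only scanpy.read_10x_mtx and its gex_only argument come from scanpy itself.

```python
from pathlib import Path


def read_guide_counts(matrix_dir, guide_prefix="sg"):
    """Read a Spaceranger matrix directory and keep only guide features.

    `read_guide_counts` and the default "sg" prefix are hypothetical;
    adapt them to the guide names used in your own reference.
    """
    matrix_dir = Path(matrix_dir)
    if not matrix_dir.is_dir():
        raise FileNotFoundError(f"Matrix directory not found: {matrix_dir}")

    # Imported lazily so the path check above works even without scanpy.
    import scanpy as sc

    # gex_only=False keeps every feature type, not only "Gene Expression".
    adata = sc.read_10x_mtx(matrix_dir, var_names="gene_symbols", gex_only=False)

    # Keep the columns (features) whose names start with the guide prefix.
    is_guide = adata.var_names.str.startswith(guide_prefix)
    return adata[:, is_guide].copy()
```

Call it with the directory that holds the matrix, features, and barcodes files. The returned AnnData contains one column per guide.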

Stereoseq

For BGI Stereoseq data, follow the SAW instructions to process the data.

Use the SPAC-seq script SeekSeq to process your enriched guide FASTQ files.

Note

SeekSeq is a tool for processing Stereoseq data. It is not part of the TARDIS toolkit. Refer to the official SPAC-seq code repository for more details.

In SeekSeq, we filter the guide reads by the following criteria:

- R1 contains the constant region “TTGTCTTCCTAAGAC”, allowing at most 1 mismatched base.
- R2 contains the scaffold region “GTTTTAGA”, allowing at most 1 mismatched base.

See our paper for more details: Uncovering Spatially Resolved Functional Genomics with CRISPR Screen Sequencing.
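The two criteria can be illustrated with a small substitution-tolerant search. This is a sketch, not SeekSeq's actual implementation: it assumes that "1 base mismatch" means one substitution (no insertions or deletions) and that a read pair must pass both checks. The function names are hypothetical.

```python
R1_CONSTANT = "TTGTCTTCCTAAGAC"
R2_SCAFFOLD = "GTTTTAGA"


def contains_with_mismatches(read, pattern, max_mismatches=1):
    """Return True if `pattern` occurs in `read` with at most
    `max_mismatches` substituted bases (Hamming distance per window)."""
    width = len(pattern)
    for start in range(len(read) - width + 1):
        window = read[start:start + width]
        mismatches = sum(a != b for a, b in zip(window, pattern))
        if mismatches <= max_mismatches:
            return True
    return False


def is_guide_read_pair(r1, r2):
    """Assumption: a pair is kept only if R1 and R2 both pass."""
    return (contains_with_mismatches(r1, R1_CONSTANT)
            and contains_with_mismatches(r2, R2_SCAFFOLD))
```

The sliding-window scan is quadratic in the worst case, which is acceptable for short reads and short anchors like these. A production tool would more likely rely on an optimized aligner or trimmer.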

Once processing succeeds, you should find the guide reads in the guide.gem file in the SAW output directory.

Read SAW output

Read in SAW output from BGI stereoseq data with tardis_spac.utils.load_bin()

tardis_spac.utils.load_bin(gem_file, bin_size)

Load a SAW GEM file from BGI Stereoseq data and aggregate its counts into spatial bins.

Parameters:

gem_file – Path to the input SAW GEM file.
bin_size – The size of the bin.

Returns:

AnnData object with the guide reads mapped to the bin.

Note

For Stereoseq data only.

Function Explanation

The tardis_spac.utils.load_bin() function reads a SAW GEM file from BGI Stereoseq data and aggregates UMI (Unique Molecular Identifier) counts into spatial bins of a specified size. It generates an AnnData object whose spots represent these bins, making downstream analysis (such as integration with gene expression/transcriptome data) more convenient.

Choosing the right bin_size is important:

- A small bin_size (20-50 µm) provides higher spatial resolution but may leave many bins with few reads.
- A large bin_size (50-100 µm) increases read counts per bin but reduces spatial resolution.
- For integrative analysis (e.g., with spatial transcriptome data), select a bin_size consistent with the transcriptome binning.

Tip: You can read and bin the guide GEM file and transcriptome GEM file together using the same bin_size for compatible spatial mapping.

Tip: For BGI Stereoseq data, a bin is approximately 500 nm in size.

from tardis_spac.utils import load_bin
guide_adata = load_bin("guide.gem", bin_size=50)
gene_adata = load_bin("transcriptome.gem", bin_size=50)
# Now guide_adata and gene_adata are spatially comparable
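To make the binning idea concrete, here is a minimal pure-Python sketch of aggregating counts into square bins. It is not the load_bin implementation. The record layout (guide, x, y, count) and the name bin_counts are assumptions for illustration, and the result is a plain dict rather than an AnnData object.

```python
from collections import Counter


def bin_counts(records, bin_size):
    """Sum (guide, x, y, count) records into square bins.

    Each coordinate is snapped to the origin corner of its bin, so all
    spots with the same (x // bin_size, y // bin_size) share one bin.
    """
    binned = Counter()
    for guide, x, y, count in records:
        bin_x = (x // bin_size) * bin_size
        bin_y = (y // bin_size) * bin_size
        binned[(guide, bin_x, bin_y)] += count
    return dict(binned)


records = [
    ("sgA_1", 10, 12, 2),
    ("sgA_1", 48, 3, 1),   # falls in the same 50x50 bin as the record above
    ("sgA_1", 51, 3, 4),   # x crosses into the next bin
    ("sgB_1", 10, 12, 5),
]
binned = bin_counts(records, bin_size=50)
```

Because bin coordinates depend only on bin_size, two files binned with the same value end up on the same grid, which is why matching bin_size between guide and transcriptome data matters.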

TARDIS preprocessing functions

Quality Control of Guide Bins using tardis_spac.utils.qc_guide_bins()

Note

For Stereoseq data only. For Visium HD data, you can apply scanpy.pp.calculate_qc_metrics to calculate the quality metrics of the guide bins.

tardis_spac.utils.qc_guide_bins(gem_path, guide_prefix=None, fig_path=None)

Summarize and visualize the quality of guide bins in the provided GEM file.

Parameters:

gem_path – Path to the input GEM file.
guide_prefix – Optional string. If provided, filter guide IDs starting with this prefix.
fig_path – Optional string. Path to save the output plot image. If None, the plot will be displayed instead.

Returns:

None. Shows or saves a figure displaying summary statistics of the guide bins.

Remove Mitochondrial, Ribosomal, Housekeeping, and lncRNA Genes using tardis_spac.utils.remove_mito_ribo_hk_lnc_genes()

Note

The housekeeping gene list is taken from He2020Nature_mouseHK.txt. Refer to the Nature paper for more details.

tardis_spac.utils.remove_mito_ribo_hk_lnc_genes(adata, housekeeping_list='He2020Nature_mouseHK.txt')

Remove genes related to mitochondria, ribosomes, housekeeping, or annotated as lncRNAs from the dataset.

Parameters:

adata – AnnData object. The input data matrix with genes in adata.var_names.
housekeeping_list – Path to a text file containing housekeeping gene names. Default: 'He2020Nature_mouseHK.txt'

Returns:

AnnData object with unwanted genes removed.
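The matching rules used by remove_mito_ribo_hk_lnc_genes are not spelled out on this page. The sketch below shows one common way such a filter can work on mouse gene symbols. The "mt-", "Rps", and "Rpl" prefixes, the function names, and the example housekeeping and lncRNA sets are all assumptions for illustration, not the TARDIS rules or the contents of He2020Nature_mouseHK.txt.

```python
# Assumed naming conventions for mouse gene symbols.
MITO_PREFIXES = ("mt-",)
RIBO_PREFIXES = ("Rps", "Rpl")


def keep_gene(name, housekeeping, lncrna):
    """Return False for genes that should be removed."""
    if name.startswith(MITO_PREFIXES) or name.startswith(RIBO_PREFIXES):
        return False
    return name not in housekeeping and name not in lncrna


def filter_genes(gene_names, housekeeping=frozenset(), lncrna=frozenset()):
    """Keep genes that are not mito, ribo, housekeeping, or lncRNA."""
    return [g for g in gene_names if keep_gene(g, housekeeping, lncrna)]


genes = ["mt-Co1", "Rps6", "Rpl13", "Actb", "Malat1", "Cd8a"]
# Hypothetical user-supplied sets, used only to exercise the filter.
kept = filter_genes(genes, housekeeping={"Actb"}, lncrna={"Malat1"})
```

With an AnnData object, the same boolean decision per gene in adata.var_names could be used as a column mask.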

Filter and Process Guide Reads using tardis_spac.utils.filter_guide_reads()

This step is essential for removing noisy guide reads and keeping the major signals for more accurate analysis.

Note

For Stereoseq data only. For Visium HD data, we recommend running TARDIS directly.

tardis_spac.utils.filter_guide_reads(gem_path, guide_prefix=None, output_path=None, binarilize=False, assign_pattern='max', filter_threshold=None)

Filter and process guide reads from a GEM file.

Parameters:

gem_path – Path to the input GEM file.
guide_prefix – Optional string. Filter guide names with this prefix.
output_path – Path to save filtered results. If None, returns the filtered DataFrame.
binarilize – If True, binarize all bin counts to 1 (recommended for high-resolution/high-library datasets).
assign_pattern – How to handle bins with multiple guides: 'max' (keep guide with max count), 'drop' (remove such bins), or 'all' (keep all).
filter_threshold – Integer. Minimum MIDCount threshold for bin filtering.

Returns:

Filtered pandas DataFrame if output_path is None, otherwise None.

from tardis_spac.utils import filter_guide_reads

# With output_path set, the filtered result is written to disk and None is returned
filter_guide_reads(
    "guide.gem",
    guide_prefix="sg",
    output_path="filtered_guide.gem",
    binarilize=True,        # Recommended for tumor / immune cell perturbation datasets
    assign_pattern="max",   # Recommended for cleaner guide assignment
    filter_threshold=10,    # Minimum MIDCount (UMI) threshold for bin filtering
)
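The three assign_pattern options can be illustrated on a toy dictionary of bins. This sketch is not the filter_guide_reads implementation. It assumes the threshold applies to the total count of a bin, that thresholding happens before guide assignment, and that ties under 'max' go to the first guide encountered. The name filter_bins and its binarize argument are hypothetical.

```python
def filter_bins(bin_counts, assign_pattern="max", filter_threshold=None,
                binarize=False):
    """Filter {bin_id: {guide: count}} following the documented options."""
    if assign_pattern not in {"max", "drop", "all"}:
        raise ValueError(f"Unknown assign_pattern: {assign_pattern!r}")

    result = {}
    for bin_id, guides in bin_counts.items():
        # Assumption: the threshold is compared with the bin's total count.
        if filter_threshold is not None and sum(guides.values()) < filter_threshold:
            continue
        if len(guides) > 1:
            if assign_pattern == "drop":
                continue  # discard bins that contain several guides
            if assign_pattern == "max":
                top = max(guides, key=guides.get)
                guides = {top: guides[top]}  # keep only the dominant guide
        # Optionally collapse every remaining count to 1.
        result[bin_id] = {g: (1 if binarize else c) for g, c in guides.items()}
    return result


bins = {
    (0, 0): {"sgA": 8, "sgB": 3},
    (50, 0): {"sgA": 12},
    (0, 50): {"sgB": 2},
}
```

'max' favors a single clean label per bin, 'drop' is the most conservative choice, and 'all' keeps every guide for analyses that tolerate mixed bins.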

Combine Guide Replicates using tardis_spac.utils.combine_guide_replicates()

Note

Guide replicates are guide probes targeting the same gene and are typically labeled with an underscore and a number.

For example, a guide probe targeting gene A with replicate number 1 would be labeled A_1.

Similarly, a guide probe targeting gene B with replicate number 2 would be labeled B_2.

Multiple guides per gene are commonly used in CRISPR screens so that the same target is perturbed by independent guides.

In TARDIS, we collapse these guide probes into a single guide name by summing the counts of the replicate guides.

tardis_spac.utils.combine_guide_replicates(gdata)

Merge guide replicates in the input AnnData object by collapsing underscore-delimited replicates into a single guide name.

Parameters:

gdata – AnnData object of guide count data with var_names containing replicates.

Returns:

AnnData object with guides collapsed (replicates summed per guide).
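The collapsing rule described in the note can be sketched on a plain dictionary of counts. This is an illustration only, not the combine_guide_replicates implementation: it assumes a replicate suffix is an underscore followed by digits at the end of the name, and the helper names are hypothetical.

```python
from collections import defaultdict


def base_guide_name(name):
    """Strip a trailing "_<number>" replicate suffix, if present."""
    base, sep, suffix = name.rpartition("_")
    if base and sep and suffix.isdigit():
        return base
    return name  # no replicate suffix: keep the name unchanged


def combine_replicates(counts):
    """Sum counts of replicate guides under their shared base name."""
    combined = defaultdict(int)
    for name, count in counts.items():
        combined[base_guide_name(name)] += count
    return dict(combined)


collapsed = combine_replicates({"A_1": 3, "A_2": 4, "B_2": 5, "NTC": 1})
```

Only the last underscore is considered, so a name such as sg_Cd8a_2 collapses to sg_Cd8a rather than sg.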