Analysis Pipeline

The following tutorial gives an introduction to the basic analysis functions of the velocyto library.

Velocyto Loom

Let’s start with loading the content of the .loom file into an interactive session of python.

import velocyto as vcy
vlm = vcy.read_loom("YourData.loom")

Different steps of analysis can be carried on by calling the velocyto methods on this AnnData object. New variables, normalized versions of the data matrices and other parameters will be stored as attributes of the AnnData object. They are organized by the dimensions they occupy: Number of observations (cells, samples) and number of variables (genes). Layers (vlm.layers) have dimension n_obs × n_vars and contain expression values at different steps of processing. Observations (vlm.obs) are annotations specific to each cell, while variables (vlm.var) contain information about each gene. Unstructured annotation like constants are saved to vlm.uns. vlm.obsm and vlm.varm save attributes of dimension n_obs × m and n_vars x m, respectively. The whole object can be sliced like a DataFrame. For more information on the AnnData type, read these docs. To view the object and its current shape and attributes, you can at any moment simply enter:

vlm

Normalization and log transformation can be performed by calling the normalize method:

vcy.normalize(vlm, which='S', size=True, log=True)
vlm.layers['S_norm']  # contains log normalized

The docstring of every function specifies which attributes will be generated or modified at each method call.

The velocyto object supports some ready-made plotting functions. For example, one of the first checks of spliced/unspliced fractions of the dataset can be done by calling:

vcy.plot_fractions(vlm)

You can save the results of your analysis in a serialized object at any time by running:

vlm.write('my_velocyto_analysis.h5ad')

In another session you can reload the vlm object by running:

vlm = vcy.read_h5ad('my_velocyto_analysis.h5ad')

This is similar to what the pickle module in python standard library is doing but here the attributes of the AnnData object are saved and stored as a h5ad file. Notice that the size on disk of the serialized file can change depending on the step of the analysis the object is saved (e.g. pre/post filtering or before/after calculating distance matrices).

Note

velocyto methods operate on the object attributes performing filtering, normalization and other calcualtion. Therefore, the order in which they are run is important to get a meaningful output from velocyto. We suggest calling these functions in the order shown in this tutorial or in the example notebooks.

Start a new analysis - Preliminary Filtering

A good first step is to clean up the data a bit. Let’s remove the cells with extremely low unspliced detection:

vcy.filter_cells(vlm, bool_array=vlm.obs['initial_Ucell_size'] > np.percentile(vlm.obs['initial_Ucell_size'], 0.5))

Let’s try now to select relevant features for the downstream analysis. Let’s make velocyto aware of the clustering annotation, if we have some:

vcy.set_clusters(vlm, vlm.obsm['ClusterName'])

Now using the clustering annotation select the genes that are expressed above a threshold of total number of molecules in any of the clusters:

vcy.score_detection_levels(vlm, min_expr_counts=40, min_cells_express=30)
vlm = vcy.filter_genes(vlm, by_detection_levels=True)

We can perform feature selection:

vcy.score_cv_vs_mean(vlm, 3000, plot=True, max_expr_avg=35)
vlm = vcy.filter_genes(vlm, by_cv_vs_mean=True)

Finally we can normalize our data by size (total molecule count):

vcy.normalize(vlm,
            which='S',
            relative_size=vlm.layers['spliced'].sum(1),
            target_size=(vlm.layers['spliced'].sum(1).mean(),0))
vcy.normalize(vlm,
            which='U',
            relative_size=vlm.layers['unspliced'].sum(1),
            target_size=(0,vlm.layers['unspliced'].sum(1).mean()))

For a better understend how to fine-tune parameters please consult the API page or just inspect the docstring of each function.

Preparation for gamma fit

For the preparation of the gamma fit we smooth the data using a kNN neighbors pooling approach. kNN neighbors can be calculated directly in gene expression space or reduced PCA space, using either correlation distance or Euclidean distance. One example of set of parameters is provided below:

vcy.perform_PCA(vlm)
vcy.knn_imputation(vlm, n_pca_dims=20, k=500, balanced=True, b_sight=3000, b_maxl=1500, n_jobs=16)

Gamma fit and extrapolation

To fit gamma to every gene that survived the filtering step run:

vcy.fit_gammas(vlm)

The fit can be visualized by calling plot_phase_portraits and listing the gene names:

vcy.plot_phase_portraits(vlm, ['Igfbpl1', 'Pdgfra'])

Then calculate velocity and extrapolate the future state of the cells:

vcy.predict_U(vlm)
vcy.calculate_velocity(vlm)
vcy.calculate_shift(vlm, assumption = 'constant_velocity')
vcy.extrapolate_cell_at_t(vlm, delta_t = 1.)

An alternative extrapolation can be performed using the constant unspliced assumption (for more information consult our preprint):

vcy.calculate_shift(vlm, assumption= 'constant_unspliced', delta_t = 10)
vcy.extrapolate_cell_at_t(vlm, delta_t = 1.)

Projection of velocity onto embeddings

The extrapolated cell state is a vector in expression space (available as the attribute vlm.layers['Sx_sz_t']). One of the most convenient ways to visualize the extrapolated state is to project it onto a low dimensional embedding that appropriately summarizes the variability of the data that is of interest. The embedding can be calculated with your favorite method or external package as soon as it is saved as an attribute of the AnnData object. For example, let’s use the scikit-learn TSNE implementation and make it available as ts attribute as follows:

from sklearn.manifold import TSNE
bh_tsne = TSNE()
vlm.obsm['ts'] = bh_tsne.fit_transform(vlm.obsm['pcs'][:, :25])

Now we can project onto vlm.obsm['ts'] by calling estimate_transition_prob.

Warning

For big datasets this code can take long time to run! We suggest to run it on multicore machines (since the implementation is fully multithreaded).

vcy.estimate_transition_prob(vlm, hidim='Sx_sz', embed='ts', transform='sqrt', psc=1,
                             neighs_embed=3500, knn_random=True, frac=0.5)
vcy.calculate_embedding_shift(vlm, sigma_corr = 0.05, expression_scaling=True)

In case of very big datasets a good way to summarize the velocity is to visualize it as a velocity field calculated on a grid.

vcy.calculate_grid_arrows(vlm, smooth=0.8, steps=(40, 40), n_neighbors=300)
plt.figure(None,(20,10))
vcy.plot_grid_arrows(vlm, quiver_scale=0.6,
                    scatter_kwargs_dict={'alpha':0.35, 'lw':0.35, 'edgecolor':'0.4', 's':38, 'rasterized':True}, min_mass=24, angles='xy', scale_units='xy',
                    headaxislength=2.75, headlength=5, headwidth=4.8, minlength=1.5,
                    plot_random=True, scale_type="absolute")