Analysis Pipeline¶
The following tutorial gives an introduction to the basic analysis functions of the velocyto library.
Velocyto Loom¶
Let’s start with loading the content of the .loom file into an interactive session of python.
import velocyto as vcy
vlm = vcy.read_loom("YourData.loom")
Different steps of analysis can be carried on by calling the velocyto
methods on this AnnData
object.
New variables, normalized versions of the data matrices and other parameters will be stored as attributes of the AnnData
object. They are organized by the dimensions they occupy: Number of observations (cells, samples) and number of variables (genes). Layers (vlm.layers
) have dimension n_obs × n_vars
and contain expression values at different steps of processing. Observations (vlm.obs
) are annotations specific to each cell, while variables (vlm.var
) contain information about each gene. Unstructured annotation like constants are saved to vlm.uns
. vlm.obsm
and vlm.varm
save attributes of dimension n_obs × m
and n_vars x m
, respectively. The whole object can be sliced like a DataFrame
. For more information on the AnnData
type, read these docs. To view the object and its current shape and attributes, you can at any moment simply enter:
vlm
Normalization and log transformation can be performed by calling the normalize
method:
vcy.normalize(vlm, which='S', size=True, log=True)
vlm.layers['S_norm'] # contains log normalized
The docstring of every function specifies which attributes will be generated or modified at each method call.
The velocyto
object supports some ready-made plotting functions.
For example, one of the first checks of spliced/unspliced fractions of the dataset can be done by calling:
vcy.plot_fractions(vlm)
You can save the results of your analysis in a serialized object at any time by running:
vlm.write('my_velocyto_analysis.h5ad')
In another session you can reload the vlm
object by running:
vlm = vcy.read_h5ad('my_velocyto_analysis.h5ad')
This is similar to what the pickle
module in python standard library is doing but here the attributes of the AnnData
object are saved and stored as a h5ad file.
Notice that the size on disk of the serialized file can change depending on the step of the analysis the object is saved (e.g. pre/post filtering or before/after calculating distance matrices).
Note
velocyto
methods operate on the object attributes performing filtering, normalization and other calcualtion. Therefore, the order in which they are run is important to get a meaningful output from velocyto
.
We suggest calling these functions in the order shown in this tutorial or in the example notebooks.
Start a new analysis - Preliminary Filtering¶
A good first step is to clean up the data a bit. Let’s remove the cells with extremely low unspliced detection:
vcy.filter_cells(vlm, bool_array=vlm.obs['initial_Ucell_size'] > np.percentile(vlm.obs['initial_Ucell_size'], 0.5))
Let’s try now to select relevant features for the downstream analysis.
Let’s make velocyto
aware of the clustering annotation, if we have some:
vcy.set_clusters(vlm, vlm.obsm['ClusterName'])
Now using the clustering annotation select the genes that are expressed above a threshold of total number of molecules in any of the clusters:
vcy.score_detection_levels(vlm, min_expr_counts=40, min_cells_express=30)
vlm = vcy.filter_genes(vlm, by_detection_levels=True)
We can perform feature selection:
vcy.score_cv_vs_mean(vlm, 3000, plot=True, max_expr_avg=35)
vlm = vcy.filter_genes(vlm, by_cv_vs_mean=True)
Finally we can normalize our data by size (total molecule count):
vcy.normalize(vlm,
which='S',
relative_size=vlm.layers['spliced'].sum(1),
target_size=(vlm.layers['spliced'].sum(1).mean(),0))
vcy.normalize(vlm,
which='U',
relative_size=vlm.layers['unspliced'].sum(1),
target_size=(0,vlm.layers['unspliced'].sum(1).mean()))
For a better understend how to fine-tune parameters please consult the API page or just inspect the docstring of each function.
Preparation for gamma fit¶
For the preparation of the gamma fit we smooth the data using a kNN neighbors pooling approach. kNN neighbors can be calculated directly in gene expression space or reduced PCA space, using either correlation distance or Euclidean distance. One example of set of parameters is provided below:
vcy.perform_PCA(vlm)
vcy.knn_imputation(vlm, n_pca_dims=20, k=500, balanced=True, b_sight=3000, b_maxl=1500, n_jobs=16)
Gamma fit and extrapolation¶
To fit gamma to every gene that survived the filtering step run:
vcy.fit_gammas(vlm)
The fit can be visualized by calling plot_phase_portraits
and listing the gene names:
vcy.plot_phase_portraits(vlm, ['Igfbpl1', 'Pdgfra'])
Then calculate velocity and extrapolate the future state of the cells:
vcy.predict_U(vlm)
vcy.calculate_velocity(vlm)
vcy.calculate_shift(vlm, assumption = 'constant_velocity')
vcy.extrapolate_cell_at_t(vlm, delta_t = 1.)
An alternative extrapolation can be performed using the constant unspliced assumption (for more information consult our preprint):
vcy.calculate_shift(vlm, assumption= 'constant_unspliced', delta_t = 10)
vcy.extrapolate_cell_at_t(vlm, delta_t = 1.)
Projection of velocity onto embeddings¶
The extrapolated cell state is a vector in expression space (available as the attribute vlm.layers['Sx_sz_t']
).
One of the most convenient ways to visualize the extrapolated state is to project it onto a low dimensional embedding that appropriately summarizes the variability of the data that is of interest.
The embedding can be calculated with your favorite method or external package as soon as it is saved as an attribute of the AnnData
object.
For example, let’s use the scikit-learn
TSNE implementation and make it available as ts
attribute as follows:
from sklearn.manifold import TSNE
bh_tsne = TSNE()
vlm.obsm['ts'] = bh_tsne.fit_transform(vlm.obsm['pcs'][:, :25])
Now we can project onto vlm.obsm['ts']
by calling estimate_transition_prob
.
Warning
For big datasets this code can take long time to run! We suggest to run it on multicore machines (since the implementation is fully multithreaded).
vcy.estimate_transition_prob(vlm, hidim='Sx_sz', embed='ts', transform='sqrt', psc=1,
neighs_embed=3500, knn_random=True, frac=0.5)
vcy.calculate_embedding_shift(vlm, sigma_corr = 0.05, expression_scaling=True)
In case of very big datasets a good way to summarize the velocity is to visualize it as a velocity field calculated on a grid.
vcy.calculate_grid_arrows(vlm, smooth=0.8, steps=(40, 40), n_neighbors=300)
plt.figure(None,(20,10))
vcy.plot_grid_arrows(vlm, quiver_scale=0.6,
scatter_kwargs_dict={'alpha':0.35, 'lw':0.35, 'edgecolor':'0.4', 's':38, 'rasterized':True}, min_mass=24, angles='xy', scale_units='xy',
headaxislength=2.75, headlength=5, headwidth=4.8, minlength=1.5,
plot_random=True, scale_type="absolute")