Dynamical Modeling

Here, we use the generalized dynamical model to solve the full transcriptional dynamics.

That yields several additional insights such as latent time and identification of putative driver genes.

As in the previous tutorial, it is illustratively applied to endocrine development in the pancreas.

[ ]:

# update to the latest version, if not done yet.
!pip install scvelo --upgrade --quiet

[1]:

import scvelo as scv
scv.logging.print_version()

Running scvelo 0.2.0 (python 3.8.2) on 2020-05-15 00:27.

[2]:

scv.settings.verbosity = 3  # show errors(0), warnings(1), info(2), hints(3)
scv.settings.presenter_view = True  # set max width size for presenter view
scv.settings.set_figure_params('scvelo')  # for beautified visualization

Prepare the Data

Processing consists of gene selection, normalizing by total size, logarithmizing X, and computing moments for velocity estimation. See the previous tutorial for further explanation.

[3]:

adata = scv.datasets.pancreas()

[4]:

scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000)
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)

Filtered out 20801 genes that are detected in less than 20 counts (shared).
Normalized count data: X, spliced, unspliced.
Logarithmized X.
computing neighbors
    finished (0:00:03) --> added
    'distances' and 'connectivities', weighted adjacency matrices (adata.obsp)
computing moments based on connectivities
    finished (0:00:00) --> added
    'Ms' and 'Mu', moments of spliced/unspliced abundances (adata.layers)

Dynamical Model

We run the dynamical model to learn the full transcriptional dynamics of splicing kinetics.

It is solved in a likelihood-based expectation-maximization framework, by iteratively estimating the parameters of reaction rates and latent cell-specific variables, i.e. transcriptional state and cell-internal latent time. It thereby aims to learn the unspliced/spliced phase trajectory for each gene.

[5]:

scv.tl.recover_dynamics(adata)

recovering dynamics
    finished (0:13:31) --> added
    'fit_pars', fitted parameters for splicing dynamics (adata.var)

[6]:

scv.tl.velocity(adata, mode='dynamical')
scv.tl.velocity_graph(adata)

computing velocities
    finished (0:00:04) --> added
    'velocity', velocity vectors for each individual cell (adata.layers)
computing velocity graph
    finished (0:00:08) --> added
    'velocity_graph', sparse matrix with cosine correlations (adata.uns)

Running the dynamical model can take a while. Hence, you may want to store the results for re-use, with adata.write('data/pancreas.h5ad', which can later be read with adata = scv.read('data/pancreas.h5ad').

[7]:

#adata.write('data/pancreas.h5ad', compression='gzip')
#adata = scv.read('data/pancreas.h5ad')

[8]:

scv.pl.velocity_embedding_stream(adata, basis='umap')

computing velocity embedding
    finished (0:00:00) --> added
    'velocity_umap', embedded velocity vectors (adata.obsm)

Kinetic rate paramters

The rates of RNA transcription, splicing and degradation are estimated without the need of any experimental data.

They can be useful to better understand the cell identity and phenotypic heterogeneity.

[9]:

df = adata.var
df = df[(df['fit_likelihood'] > .1) & df['velocity_genes'] == True]

kwargs = dict(xscale='log', fontsize=16)
with scv.GridSpec(ncols=3) as pl:
    pl.hist(df['fit_alpha'], xlabel='transcription rate', **kwargs)
    pl.hist(df['fit_beta'] * df['fit_scaling'], xlabel='splicing rate', xticks=[.1, .4, 1], **kwargs)
    pl.hist(df['fit_gamma'], xlabel='degradation rate', xticks=[.1, .4, 1], **kwargs)

scv.get_df(adata, 'fit*', dropna=True).head()

[9]:

	fit_r2	fit_alpha	fit_beta	fit_gamma	fit_t_	fit_scaling	fit_std_u	fit_std_s	fit_likelihood	fit_u0	fit_s0	fit_pval_steady	fit_steady_u	fit_steady_s	fit_variance	fit_alignment_scaling
index
Sntg1	0.401981	0.015726	0.005592	0.088792	23.404254	42.849447	1.029644	0.030838	0.406523	0.0	0.0	0.159472	2.470675	0.094304	0.149138	5.355590
Sbspon	0.624803	0.464865	2.437113	0.379645	3.785993	0.154771	0.058587	0.178859	0.252135	0.0	0.0	0.182088	0.164805	0.430623	0.674312	1.193015
Mcm3	0.292389	3.096367	39.995796	0.638543	2.049463	0.013943	0.016253	0.673142	0.228207	0.0	0.0	0.467683	0.051432	1.927742	0.687468	0.887607
Fam135a	0.384662	0.172335	0.118088	0.204538	11.239574	1.124040	0.356525	0.149868	0.283343	0.0	0.0	0.387921	1.345830	0.393197	0.671096	3.390194
Adgrb3	0.384552	0.046828	0.006750	0.196856	6.992542	71.850736	2.153206	0.030417	0.250195	0.0	0.0	0.068851	5.214500	0.093570	0.556878	1.893389

The estimated gene-specific parameters comprise rates of transription (fit_alpha), splicing (fit_beta), degradation (fit_gamma), switching time point (fit_t_), a scaling parameter to adjust for under-represented unspliced reads (fit_scaling), standard deviation of unspliced and spliced reads (fit_std_u, fit_std_s), the gene likelihood (fit_likelihood), inferred steady-state levels (fit_steady_u, fit_steady_s) with their corresponding p-values (fit_pval_steady_u, fit_pval_steady_s), the overall model variance (fit_variance), and a scaling factor to align the gene-wise latent times to a universal, gene-shared latent time (fit_alignment_scaling).

Latent time

The dynamical model recovers the latent time of the underlying cellular processes. This latent time represents the cell’s internal clock and approximates the real time experienced by cells as they differentiate, based only on its transcriptional dynamics.

[10]:

scv.tl.latent_time(adata)
scv.pl.scatter(adata, color='latent_time', color_map='gnuplot', size=80)

computing terminal states
    identified 2 regions of root cells and 1 region of end points
    finished (0:00:00) --> added
    'root_cells', root cells of Markov diffusion process (adata.obs)
    'end_points', end points of Markov diffusion process (adata.obs)
computing latent time
    finished (0:00:02) --> added
    'latent_time', shared time (adata.obs)

[11]:

top_genes = adata.var['fit_likelihood'].sort_values(ascending=False).index[:300]
scv.pl.heatmap(adata, var_names=top_genes, sortby='latent_time', col_color='clusters', n_convolve=100)

Top-likelihood genes

Driver genes display pronounced dynamic behavior and are systematically detected via their characterization by high likelihoods in the dynamic model.

[12]:

top_genes = adata.var['fit_likelihood'].sort_values(ascending=False).index
scv.pl.scatter(adata, basis=top_genes[:15], ncols=5, frameon=False)

[13]:

var_names = ['Actn4', 'Ppp3ca', 'Cpe', 'Nnat']
scv.pl.scatter(adata, var_names, frameon=False)
scv.pl.scatter(adata, x='latent_time', y=var_names, frameon=False)

Cluster-specific top-likelihood genes

Moreover, partial gene likelihoods can be computed for a each cluster of cells to enable cluster-specific identification of potential drivers.

[14]:

scv.tl.rank_dynamical_genes(adata, groupby='clusters')
df = scv.get_df(adata, 'rank_dynamical_genes/names')
df.head(5)

ranking genes by cluster-specific likelihoods
    finished (0:00:03) --> added
    'rank_dynamical_genes', sorted scores by group ids (adata.uns)

[14]:

	Ductal	Ngn3 low EP	Ngn3 high EP	Pre-endocrine	Beta	Alpha	Delta	Epsilon
0	Dcdc2a	Dcdc2a	Rbfox3	Abcc8	Pcsk2	Cpe	Pcsk2	Tox3
1	Top2a	Adk	Mapre3	Tmem163	Ank	Gnao1	Rap1b	Rnf130
2	Nfib	Mki67	Btbd17	Gnao1	Tmem163	Pak3	Pak3	Meis2
3	Wfdc15b	Rap1gap2	Sulf2	Ank	Tspan7	Pim2	Abcc8	Adk
4	Cdk1	Top2a	Tcp11	Tspan7	Map1b	Map1b	Klhl32	Rap1gap2

[15]:

for cluster in ['Ductal', 'Ngn3 high EP', 'Pre-endocrine', 'Beta']:
    scv.pl.scatter(adata, df[cluster][:5], ylabel=cluster, frameon=False)