ENCODE Long Read Sequel2 dataset

This tutorial guides you through a typical Long Read Transcriptome Sequencing (LRTS) analysis workflow with isotools, using ENCODE Iso-Seq Sequel II data. It demonstrates the analysis of alternative splicing events within and between sample groups.

The original analysis integrates quite a large number of samples and consequently runs for several hours. If you want to reproduce the notebook, consider restricting the data to the samples you are interested in.

Preparation

In this notebook, “long read RNA-seq” samples are downloaded from ENCODE to the “encode” subdirectory. Alternatively, you can manually download the aligned .bam files from the ENCODE data portal (https://www.encodeproject.org/). Here I use all leukemia and B-cell Sequel II samples available at the time of writing, but you can choose to process a subset.

Further, you need a reference annotation and a genome fasta file. Please see the Alzheimer tutorial for how to obtain these files.
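
If the reference files are not present yet, a minimal sketch for fetching them is shown below. The GENCODE URLs are an assumption on my part, and the downloaded annotation still needs to be coordinate-sorted, bgzip-compressed and tabix-indexed (yielding the *_sorted.gff3.gz used below), as described in the Alzheimer tutorial.

from pathlib import Path
from urllib.request import urlretrieve
import gzip, shutil

Path('reference').mkdir(exist_ok=True)
gencode_base='https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_36' #assumed URL
for fn in ['GRCh38.p13.genome.fa.gz', 'gencode.v36.chr_patch_hapl_scaff.annotation.gff3.gz']:
    if not (Path('reference')/fn).is_file():
        urlretrieve(f'{gencode_base}/{fn}', f'reference/{fn}')

#the genome is accessed via pysam, which needs a plain or bgzip-compressed fasta (not plain gzip)
if not Path('reference/GRCh38.p13.genome.fa').is_file():
    with gzip.open('reference/GRCh38.p13.genome.fa.gz','rb') as f_in, \
         open('reference/GRCh38.p13.genome.fa','wb') as f_out:
        shutil.copyfileobj(f_in, f_out)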

Data import

[1]:

from isotools import Transcriptome
from isotools import __version__ as isotools_version
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os
from pathlib import Path
import logging
from collections import Counter
from urllib.request import urlretrieve
import pysam

logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.INFO)
logger=logging.getLogger('isotools')
logger.info(f'This is isotools version {isotools_version}')

INFO:This is isotools version 0.2.7
[2]:
#I assume the encode data (e.g. the bam files) are in a subdirectory "encode"
#If not, they will be downloaded
Path('./encode').mkdir(parents=True, exist_ok=True)

#first, check what samples are present
metadata_link="https://www.encodeproject.org/metadata/?type=Experiment&assay_title=long+read+RNA-seq&replicates.library.biosample.donor.organism.scientific_name=Homo+sapiens&files.file_type=bam&files.file_type=fastq&advancedQuery=date_released%3A%5B2009-02-01+TO+2021-12-31%5D&searchTerm=long+read+RNA-seq"
metadata=pd.read_csv(metadata_link, sep='\t')
platform=metadata.set_index('Experiment accession').Platform.dropna().to_dict()
for pf,count in Counter(platform.values()).items():
    logging.info(f'{pf}: In total {count} samples from ENCODE')

#Select the samples from the metadata, making sure only bam files are in the table
samples=metadata[(metadata['Output type']=='alignments') & (metadata['File Status']=='released')].copy()
samples['Platform']=[platform.get(ea,'unknown') for ea in samples['Experiment accession']] #This info is missing for some files
samples=samples[['File accession','Output type','Biosample term name','Biosample type','Technical replicate(s)','Platform']].reset_index(drop=True)

#we select all leukemia and B-cell Sequel II files - adjust this as needed
karyotype={"K562":'leukemia', "GM12878":'normal', "HL-60":'leukemia'}
samples=samples.query('Platform == "Pacific Biosciences Sequel II" and `Biosample term name` in ["K562", "GM12878", "HL-60"]')
samples=samples.sort_values('Biosample term name').reset_index(drop=True)
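
# To keep runtime down, the sample table could be restricted further here, e.g.
# (hypothetical example) keeping only the first replicate per cell line:
# samples = samples.groupby('Biosample term name').head(1).reset_index(drop=True)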


# define the paths for reference and annotation files
genome='reference/GRCh38.p13.genome.fa'
anno='reference/gencode.v36.chr_patch_hapl_scaff.annotation_sorted'
chrom=[f'chr{i+1}' for i in range(22)]+['chrX','chrY']

#download and index the selected files, if not present
for accession in samples['File accession']:
    bam=f"encode/{accession}.bam"
    bai=bam+'.bai'
    if not os.path.isfile(bam):
        logger.info(f'downloading bam file for {accession}')
        url=metadata.loc[metadata['File accession']==accession,'File download URL'].values[0]
        file=os.path.split(url)[1]
        urlretrieve(url, f"encode/{file}")
    if not os.path.isfile(bai) or os.path.getmtime(bam)>os.path.getmtime(bai):
        logger.info(f'indexing {accession}')
        pysam.index(bam)


#print the sample table
samples
INFO:Oxford Nanopore PromethION: In total 4 samples from ENCODE
INFO:Pacific Biosciences Sequel: In total 16 samples from ENCODE
INFO:Pacific Biosciences Sequel II: In total 58 samples from ENCODE
INFO:Oxford Nanopore MinION: In total 11 samples from ENCODE
INFO:downloading bam file for ENCFF648NAR
INFO:indexing ENCFF648NAR
INFO:downloading bam file for ENCFF225CCJ
INFO:indexing ENCFF225CCJ
INFO:downloading bam file for ENCFF219UJG
INFO:indexing ENCFF219UJG
INFO:downloading bam file for ENCFF810FRP
INFO:indexing ENCFF810FRP
INFO:downloading bam file for ENCFF600MGT
INFO:indexing ENCFF600MGT
INFO:downloading bam file for ENCFF322UJU
INFO:indexing ENCFF322UJU
INFO:downloading bam file for ENCFF661OEY
INFO:indexing ENCFF661OEY
INFO:downloading bam file for ENCFF645UVN
INFO:indexing ENCFF645UVN
[2]:
File accession Output type Biosample term name Biosample type Technical replicate(s) Platform
0 ENCFF648NAR alignments GM12878 cell line 1_1 Pacific Biosciences Sequel II
1 ENCFF225CCJ alignments GM12878 cell line 2_1 Pacific Biosciences Sequel II
2 ENCFF219UJG alignments GM12878 cell line 1_1 Pacific Biosciences Sequel II
3 ENCFF810FRP alignments HL-60 cell line 1_1 Pacific Biosciences Sequel II
4 ENCFF600MGT alignments HL-60 cell line 2_1 Pacific Biosciences Sequel II
5 ENCFF322UJU alignments K562 cell line 1_1 Pacific Biosciences Sequel II
6 ENCFF661OEY alignments K562 cell line 1_1 Pacific Biosciences Sequel II
7 ENCFF645UVN alignments K562 cell line 2_1 Pacific Biosciences Sequel II
[3]:
try:
    isoseq=Transcriptome.load('encode/encode_isotools.pkl')
except FileNotFoundError:
    try:
        isoseq=Transcriptome.from_reference(anno+'.isotools.pkl')
    except FileNotFoundError:
        isoseq=Transcriptome.from_reference(anno+'.gff3.gz')
        isoseq.save_reference(anno+'.isotools.pkl')
    for i,row in samples.iterrows():
        sname=f"{row['Biosample term name']}_{row['Technical replicate(s)']}".replace(' ','_')
        if sname in isoseq.samples:
            l=ord('b')
            while f'{sname}_{l:c}' in isoseq.samples:
                l+=1
            sname=f'{sname}_{l:c}'
        grp=row['Biosample term name']
        isoseq.add_sample_from_bam(f"encode/{row['File accession']}.bam",
                                   use_satag=True,
                                   sample_name=sname ,
                                   group=grp,
                                   karyotype=karyotype[grp],
                                   sample_type=row['Biosample type'],
                                   platform=row['Platform'])
    isoseq.add_qc_metrics(genome) #3:41 min
    isoseq.make_index()
    isoseq.save('encode/encode_isotools.pkl')

INFO:loading transcriptome from encode/encode_isotools.pkl
INFO:importing reference from pkl file reference/gencode.v36.chr_patch_hapl_scaff.annotation_sorted.isotools.pkl
INFO:importing reference from gff3 file reference/gencode.v36.chr_patch_hapl_scaff.annotation_sorted.gff3.gz
3270946it [01:59, 27434.88it/s]
INFO:skipped the following categories: {'three_prime_UTR', 'CDS', 'five_prime_UTR', 'stop_codon_redefined_as_selenocysteine'}
INFO:saving reference to reference/gencode.v36.chr_patch_hapl_scaff.annotation_sorted.isotools.pkl
INFO:adding sample GM12878_1_1 from file encode/ENCFF648NAR.bam
100%|██████████| 2039578.0/2039578 [05:39<00:00, 6016.14reads/s, chr=ERCC-00171]
WARNING:skipped 84 reads aligned fraction of less than 0.75.
INFO:ignoring 11145 chimeric alignments with less than 2 reads
INFO:imported 2029866 nonchimeric reads (including  1879 chained chimeric alignments) and 362 chimeric reads with coverage of at least 2.
INFO:adding sample GM12878_2_1 from file encode/ENCFF225CCJ.bam
100%|██████████| 2442854.0/2442854 [05:38<00:00, 7219.94reads/s, chr=ERCC-00171]
WARNING:skipped 118 reads aligned fraction of less than 0.75.
INFO:ignoring 11609 chimeric alignments with less than 2 reads
INFO:imported 2432521 nonchimeric reads (including  1862 chained chimeric alignments) and 468 chimeric reads with coverage of at least 2.
INFO:adding sample GM12878_1_1_b from file encode/ENCFF219UJG.bam
100%|██████████| 1526058.0/1526058 [05:04<00:00, 5010.93reads/s, chr=ERCC-00171]
WARNING:skipped 302 reads aligned fraction of less than 0.75.
INFO:ignoring 36704 chimeric alignments with less than 2 reads
INFO:imported 1490038 nonchimeric reads (including  5416 chained chimeric alignments) and 4430 chimeric reads with coverage of at least 2.
INFO:adding sample HL-60_1_1 from file encode/ENCFF810FRP.bam
100%|██████████| 2784296.0/2784296 [08:21<00:00, 5552.21reads/s, chr=ERCC-00171]
WARNING:skipped 128 reads aligned fraction of less than 0.75.
INFO:ignoring 11037 chimeric alignments with less than 2 reads
INFO:imported 2774419 nonchimeric reads (including  2009 chained chimeric alignments) and 721 chimeric reads with coverage of at least 2.
INFO:adding sample HL-60_2_1 from file encode/ENCFF600MGT.bam
100%|██████████| 2374443.0/2374443 [07:59<00:00, 4954.91reads/s, chr=ERCC-00171]
WARNING:skipped 118 reads aligned fraction of less than 0.75.
INFO:ignoring 8611 chimeric alignments with less than 2 reads
INFO:imported 2366965 nonchimeric reads (including  1637 chained chimeric alignments) and 386 chimeric reads with coverage of at least 2.
INFO:adding sample K562_1_1 from file encode/ENCFF661OEY.bam
100%|██████████| 2617489.0/2617489 [09:06<00:00, 4788.78reads/s, chr=ERCC-00171]
WARNING:skipped 599 reads aligned fraction of less than 0.75.
INFO:ignoring 78475 chimeric alignments with less than 2 reads
INFO:imported 2515223 nonchimeric reads (including  1137 chained chimeric alignments) and 24329 chimeric reads with coverage of at least 2.
INFO:adding sample K562_2_1 from file encode/ENCFF645UVN.bam
100%|██████████| 2729961.0/2729961 [07:55<00:00, 5737.59reads/s, chr=ERCC-00171]
WARNING:skipped 626 reads aligned fraction of less than 0.75.
INFO:ignoring 88468 chimeric alignments with less than 2 reads
INFO:imported 2619912 nonchimeric reads (including  1308 chained chimeric alignments) and 22263 chimeric reads with coverage of at least 2.
INFO:adding sample K562_1_1_b from file encode/ENCFF322UJU.bam
100%|██████████| 1958209.0/1958209 [06:10<00:00, 5278.58reads/s, chr=ERCC-00171]
WARNING:skipped 388 reads aligned fraction of less than 0.75.
INFO:ignoring 52227 chimeric alignments with less than 2 reads
INFO:imported 1896354 nonchimeric reads (including  1125 chained chimeric alignments) and 10365 chimeric reads with coverage of at least 2.
WARNING:148 contigs are not contained in genome, affecting 1004 genes. Some metrics cannot be computed: {'chrUn_KI270579v1', 'ERCC-00035', 'chr14_KI270724v1_random', 'chrUn_KI270538v1', 'ERCC-00031', 'SIRV2', 'chrUn_GL000219v1', 'chr17_GL000205v2_random', 'ERCC-00157', 'ERCC-00116', 'chr9_KI270718v1_random', 'ERCC-00092', 'ERCC-00028', 'ERCC-00160', 'chrUn_KI270751v1', 'chr1_KI270711v1_random', 'ERCC-00062', 'chrUn_KI270754v1', 'chrUn_GL000216v2', 'chrUn_KI270438v1', 'ERCC-00075', 'chrUn_GL000220v1', 'ERCC-00144', 'chrEBV', 'ERCC-00085', 'ERCC-00108', 'ERCC-00095', 'chr1_KI270710v1_random', 'chr14_GL000009v2_random', 'chr1_KI270713v1_random', 'ERCC-00051', 'chr1_KI270707v1_random', 'ERCC-00145', 'ERCC-00099', 'chrUn_KI270374v1', 'chrUn_KI270747v1', 'chrUn_KI270442v1', 'SIRV7', 'chr9_KI270719v1_random', 'ERCC-00111', 'chr4_GL000008v2_random', 'chr22_KI270737v1_random', 'chrUn_KI270411v1', 'chrUn_GL000213v1', 'ERCC-00014', 'ERCC-00034', 'chrUn_KI270750v1', 'ERCC-00044', 'chrUn_KI270466v1', 'chr11_KI270721v1_random', 'ERCC-00096', 'chr5_GL000208v1_random', 'chrUn_KI270589v1', 'chrUn_KI270745v1', 'chr14_KI270722v1_random', 'ERCC-00150', 'ERCC-00170', 'chrUn_KI270748v1', 'ERCC-00136', 'chr22_KI270733v1_random', 'chr22_KI270731v1_random', 'chr1_KI270712v1_random', 'ERCC-00131', 'chr9_KI270720v1_random', 'chrUn_KI270337v1', 'ERCC-00074', 'ERCC-00130', 'ERCC-00113', 'chrUn_GL000195v1', 'ERCC-00162', 'ERCC-00078', 'ERCC-00002', 'chr3_GL000221v1_random', 'chrUn_GL000224v1', 'SIRV4', 'chr1_KI270708v1_random', 'ERCC-00054', 'ERCC-00109', 'ERCC-00004', 'chrUn_KI270465v1', 'ERCC-00019', 'chr15_KI270727v1_random', 'SIRV5', 'chr1_KI270706v1_random', 'ERCC-00077', 'ERCC-00022', 'chrUn_GL000214v1', 'chrUn_KI270743v1', 'ERCC-00168', 'SIRV1', 'chr1_KI270714v1_random', 'chrUn_KI270516v1', 'ERCC-00059', 'chr14_GL000225v1_random', 'chrUn_KI270590v1', 'ERCC-00073', 'chr14_KI270726v1_random', 'chr2_KI270715v1_random', 'ERCC-00071', 'chrUn_KI270468v1', 'chr14_GL000194v1_random', 'ERCC-00143', 'chr22_KI270738v1_random', 'ERCC-00043', 'ERCC-00165', 'ERCC-00126', 'ERCC-00003', 'chr2_KI270716v1_random', 'ERCC-00171', 'ERCC-00079', 'chrUn_KI270741v1', 'ERCC-00025', 'chrUn_KI270467v1', 'chrUn_KI270333v1', 'chrUn_GL000226v1', 'ERCC-00076', 'SIRV6', 'chr22_KI270739v1_random', 'chr1_KI270709v1_random', 'chrUn_KI270742v1', 'chrUn_KI270584v1', 'ERCC-00046', 'ERCC-00163', 'chr22_KI270735v1_random', 'chrUn_KI270512v1', 'ERCC-00009', 'chr17_KI270729v1_random', 'chrUn_KI270544v1', 'chrUn_KI270749v1', 'SIRV3', 'ERCC-00042', 'ERCC-00033', 'chrUn_GL000218v1', 'chrUn_KI270448v1', 'chrUn_KI270744v1', 'chrUn_KI270757v1', 'ERCC-00084', 'chrUn_KI270435v1', 'chrUn_KI270581v1', 'chrUn_KI270756v1', 'chrUn_KI270511v1', 'chr16_KI270728v1_random', 'ERCC-00053', 'ERCC-00060', 'ERCC-00112', 'chr22_KI270734v1_random', 'chrUn_KI270519v1', 'chrUn_KI270746v1'}
100%|██████████| 309557/309557 [12:23<00:00, 416.46it/s]
INFO:saving transcriptome to encode/encode_isotools.pkl
[5]:
#add gencode specific filters

isoseq.add_filter( "HIGH_SUPPORT", 'transcript_support_level=="1"', context='reference')
isoseq.add_filter( "PROTEIN_CODING", 'transcript_type=="protein_coding"', context='reference')

Saturation Analysis

To estimate the saturation of discovered isoforms, IsoTools offers two complementary figures:

* The saturation plot models the expected probability of discovering a transcript
    * at a given level of confidence (number of supporting long reads),
    * which is present at a given abundance level in the sample (in transcripts per million, TPM),
    * depending on the total number of long reads.
* The rarefaction analysis depicts the number of different discovered transcripts when subsampling the long reads.

The saturation analysis reveals that at the given sequencing depth, transcripts expressed at 2 TPM have a > 80% probability of being covered by at least two reads in the individual samples, while transcripts expressed at 1 TPM have a 40% to 75% probability. The slope of the rarefaction curve at its right end reflects the number of novel transcripts that could be discovered if sequencing depth were increased.
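
The intuition behind the saturation plot can be approximated with a simple binomial model (a sketch of my own, not necessarily the exact isotools implementation): each of the N long reads stems from a transcript expressed at t TPM with probability t·1e-6, so the probability of observing at least cov_th supporting reads is the binomial survival function.

from scipy.stats import binom

def discovery_probability(tpm, n_reads, cov_th=2):
    #P(at least cov_th of the n_reads long reads support a transcript expressed at tpm TPM)
    return binom.sf(cov_th-1, n_reads, tpm*1e-6)

for n_reads in (int(5e5), int(1e6), int(2e6), int(5e6)):
    print(f'{n_reads:>8} reads: P(cov>=2 | 1 TPM)={discovery_probability(1, n_reads):.2f}, '
          f'P(cov>=2 | 2 TPM)={discovery_probability(2, n_reads):.2f}')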

[6]:
from isotools.plots import plot_saturation, plot_rarefaction
import matplotlib
font = {'family' : 'DejaVu Sans',
        'weight' : 'bold',
        'size'   : 15}
matplotlib.rc('font', **font)
plt.rcParams["figure.figsize"] = (14,7)
fig, axs=plt.subplots(1,2)

plot_saturation(isoseq,cov_th=2, x_range=(1e4,5e6,1e4), ax=axs[0])

rarefaction, total=isoseq.rarefaction(min_coverage=2, tr_filter={'query':'FSM'})
plot_rarefaction(rarefaction, total=total, ax=axs[1])

fig.tight_layout()
../_images/notebooks_ENCODE_S2_7_0.png

Quality control and filtering

First, we will depict some quality metrics and define filtering criteria. For more information on technical artifacts, quality metrics, and filtering, see the Alzheimer tutorial.

[7]:
#compute some summary statistics on technical artifacts.
# For this analysis samples are grouped by 'Biosample term name'
tr_stats=[
    isoseq.transcript_length_hist(groups=isoseq.groups(), add_reference=True, min_coverage=2,tr_filter=dict( query='not NOVEL_GENE'), ref_filter=dict(query='HIGH_SUPPORT')),
    isoseq.downstream_a_hist(groups=isoseq.groups(), tr_filter=dict( query='not NOVEL_TRANSCRIPT'), ref_filter=dict(query='not REF_UNSPLICED')),
    isoseq.downstream_a_hist(groups=isoseq.groups(), tr_filter=dict(query='NOVEL_GENE and UNSPLICED')),
    isoseq.direct_repeat_hist(groups=isoseq.groups(),bins=np.linspace(-.5,10.5,12))]

tr_stats.append((pd.concat([tr_stats[2][0].add_suffix(' novel unspliced'),tr_stats[1][0].add_suffix(' known multiexon')],axis=1),tr_stats[2][1]))

#statistic on the filter flags
f_stats=isoseq.filter_stats(tags=['RTTS','FRAGMENT','INTERNAL_PRIMING'], groups=isoseq.groups(), weight_by_coverage=True,min_coverage=1)
f_stats[0].index=f_stats[0].index.str.replace('_','\n')
[8]:
#QC plot
from isotools.plots import plot_bar, plot_distr

plt.rcParams["figure.figsize"] = (30,25)
plt.rcParams.update({'font.size': 14})

fig, axs = plt.subplots(3,2)
#A) transcript length
plot_distr(tr_stats[0][0],smooth=3,ax=axs[0,0],legend=True,**tr_stats[0][1])
#B) frequency of artifacts
plot_bar(f_stats[0],ax=axs[0,1],drop_categories=['PASS'],bar_width=.9,legend=True,**f_stats[1])
#C/D) internal priming
plot_distr(tr_stats[4][0][[c for c in tr_stats[4][0].columns if 'known' in c]],smooth=3,ax=axs[1,0],legend=True,density=True,fill=True, title= 'A content at known multiexon transcripts', xlabel='fraction of A downstream the transcript')
plot_distr(tr_stats[4][0][[c for c in tr_stats[4][0].columns if 'novel' in c]],smooth=3,ax=axs[1,1],legend=True,density=True,fill=True, title= 'A content at novel unspliced transcripts', xlabel='fraction of A downstream the transcript')
#E/F) RTTS
plot_distr(tr_stats[3][0][[c for c in tr_stats[3][0].columns if 'known' in c]],ax=axs[2,0],legend=True,density=True,title='direct repeats at known splice sites',ylabel='density',xlabel= 'length of direct repeats at splice junctons')
plot_distr(tr_stats[3][0][[c for c in tr_stats[3][0].columns if 'novel' in c]],ax=axs[2,1],legend=True,density=True,title='direct repeats at novel noncanonical splice sites',ylabel='density',xlabel= 'length of direct repeats at splice junctons')

for letter,ax in zip('ABCDEF',axs.flatten()):
    ax.set_title(letter,{'fontsize':20}, loc='left', pad=20)
fig.tight_layout()
axs[0,1].legend(fontsize='large', ncol=1,handleheight=2.4, labelspacing=0.05, bbox_to_anchor=(1.01, 1))

[8]:
<matplotlib.legend.Legend at 0x7fcec0a38c70>
../_images/notebooks_ENCODE_S2_11_1.png
[9]:
f_stats_sample=isoseq.filter_stats(tags=['RTTS','FRAGMENT','INTERNAL_PRIMING'], weight_by_coverage=True,min_coverage=1)

Quality metrics vary substantially between the individual samples, in particular for internal priming artifacts. On the other hand, RTTS artifacts seem to be almost absent in this data, indicating that the alignments may have been filtered. The fractions of affected reads for the different artifacts can be depicted as a bar chart.

[10]:
#plot technical artifacts for individual samples
plt.rcParams["figure.figsize"] = (20,7)
plot_bar(f_stats_sample[0],drop_categories=['PASS'],legend=True,rot=0, bar_width=.9,**f_stats_sample[1])
#plt.legend(fontsize='small', ncol=5,handleheight=2.4, labelspacing=0.05, bbox_to_anchor=(0, 1.1), loc='lower left')
[10]:
<AxesSubplot:title={'center':'Expressed Transcripts'}, ylabel='fraction of reads'>
../_images/notebooks_ENCODE_S2_14_1.png

Data exploration

To explore how the samples relate to one another with respect to splicing, we look at PCA and UMAP embeddings based on alternative splicing events.
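
Conceptually, these embeddings operate on a PSI-like matrix of samples by splicing events. A minimal, self-contained sketch of that idea (random toy data, not the isotools internals) looks like this:

from sklearn.decomposition import PCA
import numpy as np

rng=np.random.default_rng(0)
psi=rng.random((8,500))                       #hypothetical PSI matrix: 8 samples x 500 events
psi-=psi.mean(axis=0)                         #center each event
coords=PCA(n_components=2).fit_transform(psi)
print(coords.shape)                           #(8, 2): the coordinates drawn in the scatter plot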

[11]:
#Compute alternative splicing events by finding "bubbles" in the segment graphs
splice_events=isoseq.alternative_splicing_events()
[12]:
#plot PCA embedding
from isotools.plots import plot_embedding
plt.rcParams["figure.figsize"] = (15,30)

pca={}

f,axs=plt.subplots(4,2)
for ax,t in zip(axs.flatten(),['all','3AS','5AS','ES','IR','ME', 'TSS', 'PAS']):
    pca[t]=plot_embedding(splice_events, ax=ax, labels=True, groups=isoseq.groups(), splice_types=t)


#axs[0,0].legend(fontsize='medium', ncol=4,handleheight=2.4, labelspacing=0.05, bbox_to_anchor=(0, 1.1), loc='lower left')
#plt.tight_layout()

/home/lienhard/.local/lib/python3.9/site-packages/isotools/plots.py:153: FutureWarning: Support for multi-dimensional indexing (e.g. `obj[:, None]`) is deprecated and will be removed in a future version.  Convert to a numpy array before indexing instead.
  p = ((k.values + scaled_mean[:, np.newaxis]) / (n.values + prior_count)).T
../_images/notebooks_ENCODE_S2_17_1.png
[13]:
#plot UMAP embedding
# the clustering of samples very much depends on the n_neighbors parameter
plt.rcParams["figure.figsize"] = (15,25)
umap={}
f,axs=plt.subplots(4,2)
for ax,t in zip(axs.flatten(),['all','3AS','5AS','ES','IR','ME', 'TSS', 'PAS']):
    umap[t]=plot_embedding(splice_events, method='UMAP',n_neighbors=6, ax=ax, labels=True, groups=isoseq.groups(), splice_types=t)


#axs[0,0].legend(fontsize='medium', ncol=4,handleheight=2.4, labelspacing=0.05, bbox_to_anchor=(0, 1.1), loc='lower left')

/home/lienhard/.local/lib/python3.9/site-packages/numba/core/cpu.py:97: UserWarning: Numba extension module 'sparse._numba_extension' failed to load due to 'FileNotFoundError([Errno 2] No such file or directory: '/home/lienhard/.local/lib/python3.9/site-packages/numba-0.54.0.dist-info/METADATA')'.
  numba.core.entrypoints.init_all()
2021-12-02 14:26:16.740711: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
/home/lienhard/.local/lib/python3.9/site-packages/isotools/plots.py:153: FutureWarning: Support for multi-dimensional indexing (e.g. `obj[:, None]`) is deprecated and will be removed in a future version.  Convert to a numpy array before indexing instead.
  p = ((k.values + scaled_mean[:, np.newaxis]) / (n.values + prior_count)).T
../_images/notebooks_ENCODE_S2_18_1.png

Novel alternative splicing classification

[14]:

from isotools.transcriptome import SPLICE_CATEGORY

#for each novelty subcategory, record the lowest novelty category index it occurs with;
#this is used below to annotate the subcategory labels with their parent category name
cnr={}
for g, trid, tr in isoseq.iter_transcripts():
    for anno in tr['annotation'][1]:
        cnr[anno]=min(cnr.get(anno,5),tr['annotation'][0])
del cnr['FSM']

#novel transcript category statistics per group, excluding likely technical artifacts,
#at different minimum coverage thresholds
altsplice=[ isoseq.altsplice_stats(groups=isoseq.groups(), weight_by_coverage=True, min_coverage=1, tr_filter=dict( query='not (RTTS or FRAGMENT or INTERNAL_PRIMING)')),
            isoseq.altsplice_stats(groups=isoseq.groups(), weight_by_coverage=True, min_coverage=2, tr_filter=dict( query='not (RTTS or FRAGMENT or INTERNAL_PRIMING)')),
            isoseq.altsplice_stats(groups=isoseq.groups(), weight_by_coverage=False, min_coverage=20, tr_filter=dict( query='not (RTTS or FRAGMENT or INTERNAL_PRIMING)'))]
#add the parent category name to the index labels and break long labels over two lines
for i in range(3):
    altsplice[i][0].index=altsplice[i][0].index+[f'\n({SPLICE_CATEGORY[cnr[subcat]]})' if subcat in cnr else '' for subcat in altsplice[i][0].index]
    altsplice[i][0].index=altsplice[i][0].index.str.replace('splice ','\nsplice ')
[15]:
from isotools.plots import plot_bar, plot_distr

plt.rcParams["figure.figsize"] = (30,40)
fig, axs = plt.subplots( 2)
for i,ax in enumerate(axs):
    cat=altsplice[0][0].index[i*10+2:(i+1)*10+2]
    plot_bar(altsplice[0][0],bar_width=.9,ax=ax,ylabel='fraction of reads [%]', legend=False, rot=0,drop_categories=[c for c in altsplice[0][0].index if c not in cat])
axs[-1].legend(fontsize='large', ncol=7,handleheight=2.4, labelspacing=0.05, bbox_to_anchor=(0, -.2), loc='upper left')
_=axs[0].set_title('Novel Transcript Categories')
../_images/notebooks_ENCODE_S2_21_0.png
[16]:
#The same for individual samples
altsplice_sample= isoseq.altsplice_stats( weight_by_coverage=True, min_coverage=1, tr_filter=dict(  query='not (RTTS or FRAGMENT or INTERNAL_PRIMING)'))
altsplice_sample[0].index=altsplice_sample[0].index+[f'\n({SPLICE_CATEGORY[cnr[subcat]]})' if subcat in cnr else '' for subcat in altsplice_sample[0].index]

[17]:
plt.rcParams["figure.figsize"] = (30,20)
plt.rcParams.update({'font.size': 12})
fig, axs = plt.subplots( 4)


for i,ax in enumerate(axs.flatten()):
    cat=altsplice_sample[0].index[i*5+2:(i+1)*5+2]
    plot_bar(altsplice_sample[0],bar_width=.9,ax=ax,ylabel='fraction of reads [%]', legend=False, rot=0,drop_categories=[c for c in altsplice_sample[0].index if c not in cat])
handles, labels = axs[0].get_legend_handles_labels()

axs[-1].legend(fontsize='large', ncol=8,handleheight=2.4, labelspacing=0.05, bbox_to_anchor=(0, -.2), loc='upper left')
_=axs[0].set_title('Novel Transcript Categories')
../_images/notebooks_ENCODE_S2_23_0.png

Differential splicing

To test for differential splicing events between groups of samples, IsoTools implements a likelihood ratio test based on the beta-binomial distribution. For each alternative event, the number of reads supporting the alternative is modeled with a binomial distribution whose probability parameter is beta distributed. For the null model, we fit one set of maximum likelihood parameters of the beta distribution for all samples, whereas for the alternative model, a separate set of parameters is fitted for each group. Under the null hypothesis, the model deviance is assumed to follow a chi-squared distribution with two degrees of freedom.
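
For illustration, here is a minimal, self-contained sketch of such a test with scipy (an illustrative reimplementation, not the isotools internals): k[i] reads supporting the alternative out of n[i] total reads per sample, with one shared beta-binomial fitted under the null and one per group under the alternative.

import numpy as np
from scipy.stats import betabinom, chi2
from scipy.optimize import minimize

def fit_ll(k, n):
    #maximum log-likelihood of a beta-binomial with shared parameters (a,b) over all samples
    def nll(log_ab):
        a, b = np.exp(log_ab)
        return -betabinom.logpmf(k, n, a, b).sum()
    return -minimize(nll, x0=[0.0, 0.0], method='Nelder-Mead').fun

def betabinom_lr_test(k, n, group):
    k, n, group = map(np.asarray, (k, n, group))
    ll_null = fit_ll(k, n)
    ll_alt = sum(fit_ll(k[group == g], n[group == g]) for g in np.unique(group))
    deviance = 2 * (ll_alt - ll_null)
    return chi2.sf(deviance, df=2)  #the alternative model has two additional free parameters

#toy example with a clear PSI difference between the two groups
print(betabinom_lr_test(k=[90, 85, 10, 12], n=[100, 100, 100, 100], group=['A', 'A', 'B', 'B']))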

Each comparison results in a table of alternative splicing events, sorted by significance. To explore the results, we can plot the fitted maximum likelihood beta distributions of the alternative model for the most significant events. We can also look at individual examples and depict the event as a sashimi plot. To this end, we define a helper function that automatically identifies the relevant region and the involved junctions and creates the sashimi plot:

[18]:
#first we define a function to depict the results
def plot_diff_example(row, isoseq, groups):
    #sashimi plot for differential spliced genes
    plt.rcParams["figure.figsize"] = (15,10)
    #select gene object
    g=isoseq[row.gene_id]
    #make the title
    gn=list(groups)
    novel='novel' if row.novel else 'known'
    title=f'{g.name} {row.splice_type} ({novel}) FDR={row.padj:.2e}: {gn[0]}={row[gn[0]+"_PSI"]*100:.1f} vs {gn[1]}={row[gn[1]+"_PSI"]*100:.1f} PSI'
    print(title)
    start=row.start
    end=row.end
    #select the junctions of interest (which will be marked purple in the plot)
    joi=[(start,end)]
    if row.splice_type=='ME' and g.is_annotated:
        try:
            sg=g.ref_segment_graph
            n1=next(n for n in sg if n[1]==start)
            n2=next(n for n in sg if n[0]==end)
            joi={(start,sg[suc].start) for suc in n1.suc.values() if sg[suc].start<end }
            joi.update({(sg[pre].end,end) for pre in n2.pre.values() if sg[pre].end>start })
        except StopIteration:
            pass
    #draw the plot for the two sample groups from the comparison
    f,axs=g.sashimi_figure(samples=groups, x_range=(start-500, end+500), junctions_of_interest=joi)
    axs[0].set_title(title)
    f.tight_layout()
[19]:
from isotools.plots import plot_diff_results
#We are mostly interested in differential splicing. Alternative TSS and PAS are ignored for now.
types_of_interest=['ES','ME','5AS','3AS']
#dict to store the results in
res={}

Leukemia cell lines

Both leukemia cell lines, HL-60 and K562, should carry mutations in the splicing machinery (e.g. in the SRSF2 gene), which is expected to lead to aberrant splicing compared to the B-cell-derived cell line GM12878.

See Urbanski et al., Alternative-splicing defects in cancer: splicing regulators and their downstream targets, guiding the way to novel cancer therapeutics (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6002934/).

[20]:
diff_cmp='GM12878/HL-60'
groups={k:isoseq.groups()[k] for k in diff_cmp.split('/')}
#perform the test:
res[diff_cmp]=isoseq.altsplice_test(groups).sort_values('pvalue').reset_index()
#write result table
res[diff_cmp].to_csv(f'encode_diff_betabinomial_{"_".join(groups)}.csv', index=False)
#res[diff_cmp]=pd.read_csv(f'encode_diff_betabinomial_{"_".join(groups)}.csv')
sig=res[diff_cmp].padj<.1
print(f'{sum(sig)} differential splice sites in {len(res[diff_cmp].loc[sig,"gene"].unique())} genes for {" vs ".join(groups)}')
res[diff_cmp][res[diff_cmp].splice_type.isin(types_of_interest)].head(10)
INFO:testing differential splicing for GM12878 (3) vs HL-60 (2) using betabinom_lr test
100%|██████████| 309557/309557 [09:40<00:00, 533.66genes/s]
5659 differential splice sites in 3329 genes for GM12878 vs HL-60
[20]:
index gene gene_id chrom strand start end splice_type novel padj ... GM12878_1_1_GM12878_in_cov GM12878_1_1_GM12878_total_cov GM12878_2_1_GM12878_in_cov GM12878_2_1_GM12878_total_cov GM12878_1_1_b_GM12878_in_cov GM12878_1_1_b_GM12878_total_cov HL-60_1_1_HL-60_in_cov HL-60_1_1_HL-60_total_cov HL-60_2_1_HL-60_in_cov HL-60_2_1_HL-60_total_cov
3 13046 EIF4A2 ENSG00000156976.17 chr3 + 186787882 186789124 ES False 0.003859 ... 234 708 333 966 49 132 47.0 1748.0 26.0 1205.0
4 13047 EIF4A2 ENSG00000156976.17 chr3 + 186787882 186789124 3AS True 0.004204 ... 210 684 250 883 39 122 25.0 1726.0 14.0 1193.0
9 3869 ANAPC5 ENSG00000089053.13 chr12 - 121328458 121330582 3AS False 0.006753 ... 895 1340 1169 1727 201 305 470.0 1206.0 404.0 989.0
13 15216 MCM3 ENSG00000112118.20 chr6 - 52264786 52266074 ES False 0.006753 ... 47 2403 31 1979 22 704 519.0 2363.0 443.0 1955.0
14 10047 CBWD2 ENSG00000136682.15 chr2 + 113438022 113441350 ES False 0.006753 ... 24 209 37 326 12 88 135.0 199.0 83.0 117.0
17 16292 METTL2B ENSG00000165055.16 chr7 + 128476875 128479157 ES False 0.006753 ... 5 88 9 144 3 44 122.0 123.0 67.0 68.0
20 1711 SRP9 ENSG00000143742.14 chr1 + 225783368 225789239 ES False 0.007374 ... 95 1453 173 2368 24 266 305.0 1134.0 208.0 749.0
23 7847 WSB1 ENSG00000109046.15 chr17 + 27306882 27309099 3AS False 0.007374 ... 25 403 21 621 6 63 101.0 108.0 144.0 155.0
24 5145 ZNF410 ENSG00000119725.20 chr14 + 73921105 73923394 ES False 0.007374 ... 332 494 497 772 88 134 350.0 353.0 283.0 283.0
26 1284 LGALS8 ENSG00000116977.19 chr1 + 236542787 236543559 ES False 0.007374 ... 82 111 107 150 13 17 14.0 154.0 9.0 120.0

10 rows × 29 columns
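
The per-group PSI values referenced below (e.g. in the sashimi plot titles) relate to the coverage columns shown above. A simple way to pool them per group is sketched here as a hypothetical helper; the *_PSI columns reported by isotools may be aggregated differently.

#sketch: recompute a pooled per-group PSI from the *_in_cov and *_total_cov columns
def group_psi(df, group):
    in_cols=[c for c in df.columns if c.endswith(f'_{group}_in_cov')]
    total_cols=[c for c in df.columns if c.endswith(f'_{group}_total_cov')]
    return df[in_cols].sum(axis=1)/df[total_cols].sum(axis=1)

pooled=pd.DataFrame({f'{gn} pooled PSI': group_psi(res[diff_cmp], gn) for gn in groups})
pooled.head()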

[21]:

plt.rcParams["figure.figsize"] = (20,20)
f,axs,plotted=plot_diff_results(res[diff_cmp], min_diff=.1,grid_shape=(7,6),min_support=2, splice_types=types_of_interest)
f.savefig(f'encode_diff_betabinomial_{"_".join(groups)}.png')
plotted.to_csv(f'encode_diff_betabinomial_plot_{"_".join(groups)}.csv', index=False)
../_images/notebooks_ENCODE_S2_30_0.png
[22]:
#sashimi plot for differential spliced genes
plt.rcParams["figure.figsize"] = (15,20)
#1) select gene of interest: row 3 of the result table, exon skipping of EIF4A2
row=res[diff_cmp].iloc[3]
plot_diff_example(row, isoseq, groups)
EIF4A2 ES (known) FDR=3.86e-03: GM12878=34.1 vs HL-60=2.5 PSI
../_images/notebooks_ENCODE_S2_31_1.png
[23]:
diff_cmp='GM12878/K562'
groups={k:isoseq.groups()[k] for k in diff_cmp.split('/')}

res[diff_cmp]=isoseq.altsplice_test(groups).sort_values('pvalue').reset_index()
res[diff_cmp].to_csv(f'encode_diff_betabinomial_{"_".join(groups)}.csv', index=False)
#res[diff_cmp]=pd.read_csv(f'encode_diff_betabinomial_{"_".join(groups)}.csv')
sig=res[diff_cmp].padj<.1
print(f'{sum(sig)} differential splice sites in {len(res[diff_cmp].loc[sig,"gene"].unique())} genes for {" vs ".join(groups)}')
res[diff_cmp][res[diff_cmp].splice_type.isin(types_of_interest)].head(10)
INFO:testing differential splicing for GM12878 (3) vs K562 (3) using betabinom_lr test
100%|██████████| 309557/309557 [11:48<00:00, 437.09genes/s]
4665 differential splice sites in 3050 genes for GM12878 vs K562
[23]:
index gene gene_id chrom strand start end splice_type novel padj ... GM12878_2_1_GM12878_in_cov GM12878_2_1_GM12878_total_cov GM12878_1_1_b_GM12878_in_cov GM12878_1_1_b_GM12878_total_cov K562_1_1_K562_in_cov K562_1_1_K562_total_cov K562_2_1_K562_in_cov K562_2_1_K562_total_cov K562_1_1_b_K562_in_cov K562_1_1_b_K562_total_cov
0 1124 NDUFS5 ENSG00000168653.11 chr1 + 39026398 39028722 5AS False 0.001004 ... 7 634 3 199 208 372 336 580 136 249
5 20075 RIPK2 ENSG00000104312.8 chr8 + 89780160 89786592 ES True 0.004049 ... 224 225 76 76 34 104 33 112 41 152
8 616 PSMB2 ENSG00000126067.12 chr1 - 35631344 35641341 ES True 0.005460 ... 954 954 257 258 681 893 595 765 262 341
9 4998 MYL6 ENSG00000092841.19 chr12 + 56160320 56161386 ES False 0.005460 ... 24 1827 22 911 188 722 235 940 182 611
13 15657 CASP3 ENSG00000164305.19 chr4 - 184638468 184649394 ES True 0.005645 ... 197 958 26 111 71 105 75 119 71 95
21 5028 RAD51AP1 ENSG00000111247.15 chr12 + 4553147 4558856 ES False 0.006496 ... 44 48 34 38 5 34 7 39 7 41
22 6326 HNRNPC ENSG00000092199.18 chr14 - 21230366 21230996 5AS False 0.006782 ... 387 2092 168 799 431 1287 480 1475 361 1094
24 19040 METTL2B ENSG00000165055.16 chr7 + 128476875 128479157 ES False 0.006782 ... 9 144 3 44 68 71 61 63 71 71
33 20288 CBWD1 ENSG00000172785.18 chr9 - 175784 178815 ES False 0.007260 ... 99 561 25 138 107 227 125 248 50 112
39 1415 MPZL1 ENSG00000197965.12 chr1 + 167773368 167787819 ES False 0.009935 ... 0 4 6 65 168 273 144 239 124 190

10 rows × 31 columns

[24]:

plt.rcParams["figure.figsize"] = (20,20)
f,axs,plotted=plot_diff_results(res[diff_cmp], min_diff=.1,grid_shape=(7,8),min_support=2, splice_types=types_of_interest)
f.savefig(f'encode_diff_betabinomial_{"_".join(groups)}.png')
plotted.to_csv(f'encode_diff_betabinomial_plot_{"_".join(groups)}.csv', index=False)
../_images/notebooks_ENCODE_S2_33_0.png
[25]:
#sashimi plot for differential spliced genes
plt.rcParams["figure.figsize"] = (15,10)
#1) select gene of interest
row=res[diff_cmp].iloc[5]
plot_diff_example(row, isoseq, groups)
RIPK2 ES (novel) FDR=4.05e-03: GM12878=99.4 vs K562=29.4 PSI
../_images/notebooks_ENCODE_S2_34_1.png
[26]:
gr=isoseq.groups()
groups={'leukemia':gr['K562']+gr['HL-60'], 'GM12878':gr['GM12878']}
diff_cmp='/'.join(groups)

#takes about 30 min
res[diff_cmp]=isoseq.altsplice_test(groups).sort_values('pvalue').reset_index()
res[diff_cmp].to_csv(f'encode_diff_betabinomial_{"_".join(groups)}.csv', index=False)
#res[diff_cmp]=pd.read_csv(f'encode_diff_betabinomial_{"_".join(groups)}.csv')
sig=res[diff_cmp].padj<.1
print(f'{sum(sig)} differential splice sites in {len(res[diff_cmp].loc[sig,"gene"].unique())} genes for {" vs ".join(groups)}')
res[diff_cmp][res[diff_cmp].splice_type.isin(types_of_interest)].head(10)
INFO:testing differential splicing for leukemia (5) vs GM12878 (3) using betabinom_lr test
100%|██████████| 309557/309557 [11:50<00:00, 435.65genes/s]
1587 differential splice sites in 1181 genes for leukemia vs GM12878
[26]:
index gene gene_id chrom strand start end splice_type novel padj ... HL-60_1_1_leukemia_in_cov HL-60_1_1_leukemia_total_cov HL-60_2_1_leukemia_in_cov HL-60_2_1_leukemia_total_cov GM12878_1_1_GM12878_in_cov GM12878_1_1_GM12878_total_cov GM12878_2_1_GM12878_in_cov GM12878_2_1_GM12878_total_cov GM12878_1_1_b_GM12878_in_cov GM12878_1_1_b_GM12878_total_cov
3 20601 METTL2B ENSG00000165055.16 chr7 + 128476875 128479157 ES False 0.001373 ... 122 123 67 68 5 88 9.0 144.0 3.0 44.0
7 21875 CBWD1 ENSG00000172785.18 chr9 - 175784 178815 ES False 0.002528 ... 192 347 120 228 60 302 99.0 561.0 25.0 138.0
9 1232 NDUFS5 ENSG00000168653.11 chr1 + 39026398 39028722 5AS False 0.002679 ... 17 44 13 43 5 393 7.0 634.0 3.0 199.0
10 7656 INTS14 ENSG00000138614.15 chr15 - 65607442 65611097 5AS False 0.003423 ... 135 224 108 178 35 158 43.0 222.0 18.0 54.0
14 922 C1orf43 ENSG00000143612.21 chr1 - 154214574 154220341 ES False 0.003568 ... 1157 1627 768 1053 159 349 237.0 497.0 49.0 106.0
18 1402 ACADM ENSG00000117054.14 chr1 + 75724817 75732643 ES False 0.003568 ... 264 325 206 256 802 836 1104.0 1159.0 155.0 164.0
28 19864 MRM2 ENSG00000122687.19 chr7 - 2235564 2239417 ES False 0.005016 ... 94 370 70 301 29 400 25.0 460.0 9.0 107.0
29 14268 DIDO1 ENSG00000101191.17 chr20 - 62914406 62926438 5AS True 0.005016 ... 175 175 156 156 38 76 28.0 58.0 29.0 50.0
30 1685 PTPRC ENSG00000081237.20 chr1 + 198692373 198703297 ES False 0.005016 ... 67 297 58 331 186 227 259.0 307.0 57.0 62.0
32 19563 ECHDC1 ENSG00000093144.19 chr6 - 127314896 127327001 ES False 0.005041 ... 148 272 109 210 320 363 441.0 502.0 66.0 88.0

10 rows × 35 columns

[27]:

plt.rcParams["figure.figsize"] = (20,20)
f,axs,plotted=plot_diff_results(res[diff_cmp], min_diff=.1,grid_shape=(7,6),min_support=2, splice_types=types_of_interest)
f.savefig(f'encode_diff_betabinomial_{"_".join(groups)}.png')
plotted.to_csv(f'encode_diff_betabinomial_plot_{"_".join(groups)}.csv', index=False)
../_images/notebooks_ENCODE_S2_36_0.png
[28]:
#sashimi plot for differential spliced genes
plt.rcParams["figure.figsize"] = (15,10)
#1) select gene of interest
row=res[diff_cmp].iloc[3]
plot_diff_example(row, isoseq, groups)
METTL2B ES (known) FDR=1.37e-03: leukemia=98.2 vs GM12878=6.2 PSI
../_images/notebooks_ENCODE_S2_37_1.png