ENCODE Long Read Sequel2 dataset

This tutorial guides you through a typical Long Read Transcriptome Sequencing (LRTS) analysis workflow with isotools, using ENCODE Iso-Seq Sequel II data. It demonstrates the analysis of alternative splicing events within and between sample groups.

The original analysis integrates quite a large number of samples and consequently runs for several hours. If you want to reproduce the notebook, consider restricting the data to the samples you are interested in.

Preparation

In this notebook, “long read RNA-seq” samples are downloaded from ENCODE to the “encode” subdirectory. Alternatively, you can manually download the aligned .bam files from the ENCODE data portal (https://www.encodeproject.org/). Here I use all leukemia and B-cell Sequel II samples available at the time of writing, but you can choose to process a subset.

Further, you need a reference annotation and a genome fasta file. Please see the Alzheimer tutorial for how to obtain these files.
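
If the reference files are not present yet, a minimal sketch for fetching them is shown below. The GENCODE URLs are an assumption on my part, and the downloaded annotation still needs to be coordinate-sorted, bgzip-compressed and tabix-indexed (yielding the *_sorted.gff3.gz used below), as described in the Alzheimer tutorial.

from pathlib import Path
from urllib.request import urlretrieve
import gzip, shutil

Path('reference').mkdir(exist_ok=True)
gencode_base='https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_36' #assumed URL
for fn in ['GRCh38.p13.genome.fa.gz', 'gencode.v36.chr_patch_hapl_scaff.annotation.gff3.gz']:
    if not (Path('reference')/fn).is_file():
        urlretrieve(f'{gencode_base}/{fn}', f'reference/{fn}')

#the genome is accessed via pysam, which needs a plain or bgzip-compressed fasta (not plain gzip)
if not Path('reference/GRCh38.p13.genome.fa').is_file():
    with gzip.open('reference/GRCh38.p13.genome.fa.gz','rb') as f_in, \
         open('reference/GRCh38.p13.genome.fa','wb') as f_out:
        shutil.copyfileobj(f_in, f_out)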

Data import

[1]:

from isotools import Transcriptome
from isotools import __version__ as isotools_version
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os
from pathlib import Path
import logging
from collections import Counter
from urllib.request import urlretrieve
import pysam

logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.INFO)
logger=logging.getLogger('isotools')
logger.info(f'This is isotools version {isotools_version}')

INFO:This is isotools version 0.2.7
[2]:
#I assume the encode data (e.g. the bam files) are in a subdirectory "encode"
#If not, they will be downloaded
Path('./encode').mkdir(parents=True, exist_ok=True)

#first, check what samples are present
metadata_link="https://www.encodeproject.org/metadata/?type=Experiment&assay_title=long+read+RNA-seq&replicates.library.biosample.donor.organism.scientific_name=Homo+sapiens&files.file_type=bam&files.file_type=fastq&advancedQuery=date_released%3A%5B2009-02-01+TO+2021-12-31%5D&searchTerm=long+read+RNA-seq"
metadata=pd.read_csv(metadata_link, sep='\t')
platform=metadata.set_index('Experiment accession').Platform.dropna().to_dict()
for pf,count in Counter(platform.values()).items():
    logging.info(f'{pf}: In total {count} samples from ENCODE')

#Select the samples from the metadata, making sure only bam files are in the table
samples=metadata[(metadata['Output type']=='alignments') & (metadata['File Status']=='released')].copy()
samples['Platform']=[platform.get(ea,'unknown') for ea in samples['Experiment accession']] #This info is missing for some files
samples=samples[['File accession','Output type','Biosample term name','Biosample type','Technical replicate(s)','Platform']].reset_index(drop=True)

#we select all leukemia and B-cell Sequel II files - adjust this as needed
karyotype={"K562":'leukemia', "GM12878":'normal', "HL-60":'leukemia'}
samples=samples.query('Platform == "Pacific Biosciences Sequel II" and `Biosample term name` in ["K562", "GM12878", "HL-60"]')
samples=samples.sort_values('Biosample term name').reset_index(drop=True)
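
# To keep runtime down, the sample table could be restricted further here, e.g.
# (hypothetical example) keeping only the first replicate per cell line:
# samples = samples.groupby('Biosample term name').head(1).reset_index(drop=True)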


# define the paths for reference and annotation files
genome='reference/GRCh38.p13.genome.fa'
anno='reference/gencode.v36.chr_patch_hapl_scaff.annotation_sorted'
chrom=[f'chr{i+1}' for i in range(22)]+['chrX','chrY']

#download and index the selected files, if not present
for accession in samples['File accession']:
    bam=f"encode/{accession}.bam"
    bai=bam+'.bai'
    if not os.path.isfile(bam):
        logger.info(f'downloading bam file for {accession}')
        url=metadata.loc[metadata['File accession']==accession,'File download URL'].values[0]
        file=os.path.split(url)[1]
        urlretrieve(url, f"encode/{file}")
    if not os.path.isfile(bai) or os.path.getmtime(bam)>os.path.getmtime(bai):
        logger.info(f'indexing {accession}')
        pysam.index(bam)


#print the sample table
samples
INFO:Oxford Nanopore PromethION: In total 4 samples from ENCODE
INFO:Pacific Biosciences Sequel: In total 16 samples from ENCODE
INFO:Pacific Biosciences Sequel II: In total 58 samples from ENCODE
INFO:Oxford Nanopore MinION: In total 11 samples from ENCODE
INFO:downloading bam file for ENCFF648NAR
INFO:indexing ENCFF648NAR
INFO:downloading bam file for ENCFF225CCJ
INFO:indexing ENCFF225CCJ
INFO:downloading bam file for ENCFF219UJG
INFO:indexing ENCFF219UJG
INFO:downloading bam file for ENCFF810FRP
INFO:indexing ENCFF810FRP
INFO:downloading bam file for ENCFF600MGT
INFO:indexing ENCFF600MGT
INFO:downloading bam file for ENCFF322UJU
INFO:indexing ENCFF322UJU
INFO:downloading bam file for ENCFF661OEY
INFO:indexing ENCFF661OEY
INFO:downloading bam file for ENCFF645UVN
INFO:indexing ENCFF645UVN
[2]:
File accession Output type Biosample term name Biosample type Technical replicate(s) Platform
0 ENCFF648NAR alignments GM12878 cell line 1_1 Pacific Biosciences Sequel II
1 ENCFF225CCJ alignments GM12878 cell line 2_1 Pacific Biosciences Sequel II
2 ENCFF219UJG alignments GM12878 cell line 1_1 Pacific Biosciences Sequel II
3 ENCFF810FRP alignments HL-60 cell line 1_1 Pacific Biosciences Sequel II
4 ENCFF600MGT alignments HL-60 cell line 2_1 Pacific Biosciences Sequel II
5 ENCFF322UJU alignments K562 cell line 1_1 Pacific Biosciences Sequel II
6 ENCFF661OEY alignments K562 cell line 1_1 Pacific Biosciences Sequel II
7 ENCFF645UVN alignments K562 cell line 2_1 Pacific Biosciences Sequel II
[3]:
try:
    isoseq=Transcriptome.load('encode/encode_isotools.pkl')
except FileNotFoundError:
    try:
        isoseq=Transcriptome.from_reference(anno+'.isotools.pkl')
    except FileNotFoundError:
        isoseq=Transcriptome.from_reference(anno+'.gff3.gz')
        isoseq.save_reference(anno+'.isotools.pkl')
    for i,row in samples.iterrows():
        sname=f"{row['Biosample term name']}_{row['Technical replicate(s)']}".replace(' ','_')
        if sname in isoseq.samples:
            l=ord('b')
            while f'{sname}_{l:c}' in isoseq.samples:
                l+=1
            sname=f'{sname}_{l:c}'
        grp=row['Biosample term name']
        isoseq.add_sample_from_bam(f"encode/{row['File accession']}.bam",
                                   use_satag=True,
                                   sample_name=sname ,
                                   group=grp,
                                   karyotype=karyotype[grp],
                                   sample_type=row['Biosample type'],
                                   platform=row['Platform'])
    isoseq.add_qc_metrics(genome) #3:41 min
    isoseq.make_index()
    isoseq.save('encode/encode_isotools.pkl')

INFO:loading transcriptome from encode/encode_isotools.pkl
INFO:importing reference from pkl file reference/gencode.v36.chr_patch_hapl_scaff.annotation_sorted.isotools.pkl
INFO:importing reference from gff3 file reference/gencode.v36.chr_patch_hapl_scaff.annotation_sorted.gff3.gz
3270946it [01:59, 27434.88it/s]
INFO:skipped the following categories: {'three_prime_UTR', 'CDS', 'five_prime_UTR', 'stop_codon_redefined_as_selenocysteine'}
INFO:saving reference to reference/gencode.v36.chr_patch_hapl_scaff.annotation_sorted.isotools.pkl
INFO:adding sample GM12878_1_1 from file encode/ENCFF648NAR.bam
100%|██████████| 2039578.0/2039578 [05:39<00:00, 6016.14reads/s, chr=ERCC-00171]
WARNING:skipped 84 reads aligned fraction of less than 0.75.
INFO:ignoring 11145 chimeric alignments with less than 2 reads
INFO:imported 2029866 nonchimeric reads (including  1879 chained chimeric alignments) and 362 chimeric reads with coverage of at least 2.
INFO:adding sample GM12878_2_1 from file encode/ENCFF225CCJ.bam
100%|██████████| 2442854.0/2442854 [05:38<00:00, 7219.94reads/s, chr=ERCC-00171]
WARNING:skipped 118 reads aligned fraction of less than 0.75.
INFO:ignoring 11609 chimeric alignments with less than 2 reads
INFO:imported 2432521 nonchimeric reads (including  1862 chained chimeric alignments) and 468 chimeric reads with coverage of at least 2.
INFO:adding sample GM12878_1_1_b from file encode/ENCFF219UJG.bam
100%|██████████| 1526058.0/1526058 [05:04<00:00, 5010.93reads/s, chr=ERCC-00171]
WARNING:skipped 302 reads aligned fraction of less than 0.75.
INFO:ignoring 36704 chimeric alignments with less than 2 reads
INFO:imported 1490038 nonchimeric reads (including  5416 chained chimeric alignments) and 4430 chimeric reads with coverage of at least 2.
INFO:adding sample HL-60_1_1 from file encode/ENCFF810FRP.bam
100%|██████████| 2784296.0/2784296 [08:21<00:00, 5552.21reads/s, chr=ERCC-00171]
WARNING:skipped 128 reads aligned fraction of less than 0.75.
INFO:ignoring 11037 chimeric alignments with less than 2 reads
INFO:imported 2774419 nonchimeric reads (including  2009 chained chimeric alignments) and 721 chimeric reads with coverage of at least 2.
INFO:adding sample HL-60_2_1 from file encode/ENCFF600MGT.bam
100%|██████████| 2374443.0/2374443 [07:59<00:00, 4954.91reads/s, chr=ERCC-00171]
WARNING:skipped 118 reads aligned fraction of less than 0.75.
INFO:ignoring 8611 chimeric alignments with less than 2 reads
INFO:imported 2366965 nonchimeric reads (including  1637 chained chimeric alignments) and 386 chimeric reads with coverage of at least 2.
INFO:adding sample K562_1_1 from file encode/ENCFF661OEY.bam
100%|██████████| 2617489.0/2617489 [09:06<00:00, 4788.78reads/s, chr=ERCC-00171]
WARNING:skipped 599 reads aligned fraction of less than 0.75.
INFO:ignoring 78475 chimeric alignments with less than 2 reads
INFO:imported 2515223 nonchimeric reads (including  1137 chained chimeric alignments) and 24329 chimeric reads with coverage of at least 2.
INFO:adding sample K562_2_1 from file encode/ENCFF645UVN.bam
100%|██████████| 2729961.0/2729961 [07:55<00:00, 5737.59reads/s, chr=ERCC-00171]
WARNING:skipped 626 reads aligned fraction of less than 0.75.
INFO:ignoring 88468 chimeric alignments with less than 2 reads
INFO:imported 2619912 nonchimeric reads (including  1308 chained chimeric alignments) and 22263 chimeric reads with coverage of at least 2.
INFO:adding sample K562_1_1_b from file encode/ENCFF322UJU.bam
100%|██████████| 1958209.0/1958209 [06:10<00:00, 5278.58reads/s, chr=ERCC-00171]
WARNING:skipped 388 reads aligned fraction of less than 0.75.
INFO:ignoring 52227 chimeric alignments with less than 2 reads
INFO:imported 1896354 nonchimeric reads (including  1125 chained chimeric alignments) and 10365 chimeric reads with coverage of at least 2.
WARNING:148 contigs are not contained in genome, affecting 1004 genes. Some metrics cannot be computed: {'chrUn_KI270579v1', 'ERCC-00035', 'chr14_KI270724v1_random', 'chrUn_KI270538v1', 'ERCC-00031', 'SIRV2', 'chrUn_GL000219v1', 'chr17_GL000205v2_random', 'ERCC-00157', 'ERCC-00116', 'chr9_KI270718v1_random', 'ERCC-00092', 'ERCC-00028', 'ERCC-00160', 'chrUn_KI270751v1', 'chr1_KI270711v1_random', 'ERCC-00062', 'chrUn_KI270754v1', 'chrUn_GL000216v2', 'chrUn_KI270438v1', 'ERCC-00075', 'chrUn_GL000220v1', 'ERCC-00144', 'chrEBV', 'ERCC-00085', 'ERCC-00108', 'ERCC-00095', 'chr1_KI270710v1_random', 'chr14_GL000009v2_random', 'chr1_KI270713v1_random', 'ERCC-00051', 'chr1_KI270707v1_random', 'ERCC-00145', 'ERCC-00099', 'chrUn_KI270374v1', 'chrUn_KI270747v1', 'chrUn_KI270442v1', 'SIRV7', 'chr9_KI270719v1_random', 'ERCC-00111', 'chr4_GL000008v2_random', 'chr22_KI270737v1_random', 'chrUn_KI270411v1', 'chrUn_GL000213v1', 'ERCC-00014', 'ERCC-00034', 'chrUn_KI270750v1', 'ERCC-00044', 'chrUn_KI270466v1', 'chr11_KI270721v1_random', 'ERCC-00096', 'chr5_GL000208v1_random', 'chrUn_KI270589v1', 'chrUn_KI270745v1', 'chr14_KI270722v1_random', 'ERCC-00150', 'ERCC-00170', 'chrUn_KI270748v1', 'ERCC-00136', 'chr22_KI270733v1_random', 'chr22_KI270731v1_random', 'chr1_KI270712v1_random', 'ERCC-00131', 'chr9_KI270720v1_random', 'chrUn_KI270337v1', 'ERCC-00074', 'ERCC-00130', 'ERCC-00113', 'chrUn_GL000195v1', 'ERCC-00162', 'ERCC-00078', 'ERCC-00002', 'chr3_GL000221v1_random', 'chrUn_GL000224v1', 'SIRV4', 'chr1_KI270708v1_random', 'ERCC-00054', 'ERCC-00109', 'ERCC-00004', 'chrUn_KI270465v1', 'ERCC-00019', 'chr15_KI270727v1_random', 'SIRV5', 'chr1_KI270706v1_random', 'ERCC-00077', 'ERCC-00022', 'chrUn_GL000214v1', 'chrUn_KI270743v1', 'ERCC-00168', 'SIRV1', 'chr1_KI270714v1_random', 'chrUn_KI270516v1', 'ERCC-00059', 'chr14_GL000225v1_random', 'chrUn_KI270590v1', 'ERCC-00073', 'chr14_KI270726v1_random', 'chr2_KI270715v1_random', 'ERCC-00071', 'chrUn_KI270468v1', 'chr14_GL000194v1_random', 'ERCC-00143', 'chr22_KI270738v1_random', 'ERCC-00043', 'ERCC-00165', 'ERCC-00126', 'ERCC-00003', 'chr2_KI270716v1_random', 'ERCC-00171', 'ERCC-00079', 'chrUn_KI270741v1', 'ERCC-00025', 'chrUn_KI270467v1', 'chrUn_KI270333v1', 'chrUn_GL000226v1', 'ERCC-00076', 'SIRV6', 'chr22_KI270739v1_random', 'chr1_KI270709v1_random', 'chrUn_KI270742v1', 'chrUn_KI270584v1', 'ERCC-00046', 'ERCC-00163', 'chr22_KI270735v1_random', 'chrUn_KI270512v1', 'ERCC-00009', 'chr17_KI270729v1_random', 'chrUn_KI270544v1', 'chrUn_KI270749v1', 'SIRV3', 'ERCC-00042', 'ERCC-00033', 'chrUn_GL000218v1', 'chrUn_KI270448v1', 'chrUn_KI270744v1', 'chrUn_KI270757v1', 'ERCC-00084', 'chrUn_KI270435v1', 'chrUn_KI270581v1', 'chrUn_KI270756v1', 'chrUn_KI270511v1', 'chr16_KI270728v1_random', 'ERCC-00053', 'ERCC-00060', 'ERCC-00112', 'chr22_KI270734v1_random', 'chrUn_KI270519v1', 'chrUn_KI270746v1'}
100%|██████████| 309557/309557 [12:23<00:00, 416.46it/s]
INFO:saving transcriptome to encode/encode_isotools.pkl
[5]:
#add gencode specific filters

isoseq.add_filter( "HIGH_SUPPORT", 'transcript_support_level=="1"', context='reference')
isoseq.add_filter( "PROTEIN_CODING", 'transcript_type=="protein_coding"', context='reference')

Saturation Analysis

To estimate the saturation of discovered isoforms, IsoTools offers two complementary figures:

* The saturation plot models the expected probability of discovering a transcript
    * at a given level of confidence (number of supporting long reads),
    * which is present at a given abundance level in the sample (in transcripts per million, TPM),
    * depending on the total number of long reads.
* The rarefaction analysis depicts the number of different discovered transcripts when subsampling the long reads.

The saturation analysis reveals that at the given sequencing depth, transcripts expressed at 2 TPM have a > 80% probability of being covered by at least two reads in the individual samples, while transcripts expressed at 1 TPM have a 40% to 75% probability. The slope of the rarefaction curve at its right end reflects the number of novel transcripts that could be discovered if sequencing depth were increased.
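
The intuition behind the saturation plot can be approximated with a simple binomial model (a sketch of my own, not necessarily the exact isotools implementation): each of the N long reads stems from a transcript expressed at t TPM with probability t·1e-6, so the probability of observing at least cov_th supporting reads is the binomial survival function.

from scipy.stats import binom

def discovery_probability(tpm, n_reads, cov_th=2):
    #P(at least cov_th of the n_reads long reads support a transcript expressed at tpm TPM)
    return binom.sf(cov_th-1, n_reads, tpm*1e-6)

for n_reads in (int(5e5), int(1e6), int(2e6), int(5e6)):
    print(f'{n_reads:>8} reads: P(cov>=2 | 1 TPM)={discovery_probability(1, n_reads):.2f}, '
          f'P(cov>=2 | 2 TPM)={discovery_probability(2, n_reads):.2f}')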

[6]:
from isotools.plots import plot_saturation, plot_rarefaction
import matplotlib
font = {'family' : 'DejaVu Sans',
        'weight' : 'bold',
        'size'   : 15}
matplotlib.rc('font', **font)
plt.rcParams["figure.figsize"] = (14,7)
fig, axs=plt.subplots(1,2)

plot_saturation(isoseq,cov_th=2, x_range=(1e4,5e6,1e4), ax=axs[0])

rarefaction, total=isoseq.rarefaction(min_coverage=2, tr_filter={'query':'FSM'})
plot_rarefaction(rarefaction, total=total, ax=axs[1])

fig.tight_layout()
../_images/notebooks_ENCODE_S2_7_0.png

Quality control and filtering

First, we will depict some quality metrics and define filtering criteria. For more information on technical artifacts, quality metrics, and filtering, see the Alzheimer tutorial.

[7]:
#compute some summary statistics on technical artifacts.
# For this analysis samples are grouped by 'Biosample term name'
tr_stats=[
    isoseq.transcript_length_hist(groups=isoseq.groups(), add_reference=True, min_coverage=2,tr_filter=dict( query='not NOVEL_GENE'), ref_filter=dict(query='HIGH_SUPPORT')),
    isoseq.downstream_a_hist(groups=isoseq.groups(), tr_filter=dict( query='not NOVEL_TRANSCRIPT'), ref_filter=dict(query='not REF_UNSPLICED')),
    isoseq.downstream_a_hist(groups=isoseq.groups(), tr_filter=dict(query='NOVEL_GENE and UNSPLICED')),
    isoseq.direct_repeat_hist(groups=isoseq.groups(),bins=np.linspace(-.5,10.5,12))]

tr_stats.append((pd.concat([tr_stats[2][0].add_suffix(' novel unspliced'),tr_stats[1][0].add_suffix(' known multiexon')],axis=1),tr_stats[2][1]))

#statistic on the filter flags
f_stats=isoseq.filter_stats(tags=['RTTS','FRAGMENT','INTERNAL_PRIMING'], groups=isoseq.groups(), weight_by_coverage=True,min_coverage=1)
f_stats[0].index=f_stats[0].index.str.replace('_','\n')
[8]:
#QC plot
from isotools.plots import plot_bar, plot_distr

plt.rcParams["figure.figsize"] = (30,25)
plt.rcParams.update({'font.size': 14})

fig, axs = plt.subplots(3,2)
#A) transcript length
plot_distr(tr_stats[0][0],smooth=3,ax=axs[0,0],legend=True,**tr_stats[0][1])
#B) frequency of artifacts
plot_bar(f_stats[0],ax=axs[0,1],drop_categories=['PASS'],bar_width=.9,legend=True,**f_stats[1])
#C/D) internal priming
plot_distr(tr_stats[4][0][[c for c in tr_stats[4][0].columns if 'known' in c]],smooth=3,ax=axs[1,0],legend=True,density=True,fill=True, title= 'A content at known multiexon transcripts', xlabel='fraction of A downstream the transcript')
plot_distr(tr_stats[4][0][[c for c in tr_stats[4][0].columns if 'novel' in c]],smooth=3,ax=axs[1,1],legend=True,density=True,fill=True, title= 'A content at novel unspliced transcripts', xlabel='fraction of A downstream the transcript')
#E/F) RTTS
plot_distr(tr_stats[3][0][[c for c in tr_stats[3][0].columns if 'known' in c]],ax=axs[2,0],legend=True,density=True,title='direct repeats at known splice sites',ylabel='density',xlabel= 'length of direct repeats at splice junctons')
plot_distr(tr_stats[3][0][[c for c in tr_stats[3][0].columns if 'novel' in c]],ax=axs[2,1],legend=True,density=True,title='direct repeats at novel noncanonical splice sites',ylabel='density',xlabel= 'length of direct repeats at splice junctons')

for letter,ax in zip('ABCDEF',axs.flatten()):
    ax.set_title(letter,{'fontsize':20}, loc='left', pad=20)
fig.tight_layout()
axs[0,1].legend(fontsize='large', ncol=1,handleheight=2.4, labelspacing=0.05, bbox_to_anchor=(1.01, 1))

[8]:
<matplotlib.legend.Legend at 0x7fcec0a38c70>
../_images/notebooks_ENCODE_S2_11_1.png
[9]:
f_stats_sample=isoseq.filter_stats(tags=['RTTS','FRAGMENT','INTERNAL_PRIMING'], weight_by_coverage=True,min_coverage=1)

Quality metrics vary substantially between the individual samples, in particular for internal priming artifacts. On the other hand, RTTS artifacts seem to be almost absent in this data, indicating that the alignments may have been filtered. The fractions of affected reads for the different artifacts can be depicted as a bar chart.

[10]:
#plot technical artifacts for individual samples
plt.rcParams["figure.figsize"] = (20,7)
plot_bar(f_stats_sample[0],drop_categories=['PASS'],legend=True,rot=0, bar_width=.9,**f_stats_sample[1])
#plt.legend(fontsize='small', ncol=5,handleheight=2.4, labelspacing=0.05, bbox_to_anchor=(0, 1.1), loc='lower left')
[10]:
<AxesSubplot:title={'center':'Expressed Transcripts'}, ylabel='fraction of reads'>
../_images/notebooks_ENCODE_S2_14_1.png

Data exploration

To explore how the samples relate to one another with respect to splicing, we look at PCA and UMAP embeddings based on alternative splicing events.
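
Conceptually, these embeddings operate on a PSI-like matrix of samples by splicing events. A minimal, self-contained sketch of that idea (random toy data, not the isotools internals) looks like this:

from sklearn.decomposition import PCA
import numpy as np

rng=np.random.default_rng(0)
psi=rng.random((8,500))                       #hypothetical PSI matrix: 8 samples x 500 events
psi-=psi.mean(axis=0)                         #center each event
coords=PCA(n_components=2).fit_transform(psi)
print(coords.shape)                           #(8, 2): the coordinates drawn in the scatter plot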

[11]:
#Compute alternative splicing events by finding "bubbles" in the segment graphs
splice_events=isoseq.alternative_splicing_events()
[12]:
#plot PCA embedding
from isotools.plots import plot_embedding
plt.rcParams["figure.figsize"] = (15,30)

pca={}

f,axs=plt.subplots(4,2)
for ax,t in zip(axs.flatten(),['all','3AS','5AS','ES','IR','ME', 'TSS', 'PAS']):
    pca[t]=plot_embedding(splice_events, ax=ax, labels=True, groups=isoseq.groups(), splice_types=t)


#axs[0,0].legend(fontsize='medium', ncol=4,handleheight=2.4, labelspacing=0.05, bbox_to_anchor=(0, 1.1), loc='lower left')
#plt.tight_layout()

/home/lienhard/.local/lib/python3.9/site-packages/isotools/plots.py:153: FutureWarning: Support for multi-dimensional indexing (e.g. `obj[:, None]`) is deprecated and will be removed in a future version.  Convert to a numpy array before indexing instead.
  p = ((k.values + scaled_mean[:, np.newaxis]) / (n.values + prior_count)).T
../_images/notebooks_ENCODE_S2_17_1.png
[13]:
#plot UMAP embedding
# the clustering of samples very much depends on the n_neighbors parameter
plt.rcParams["figure.figsize"] = (15,25)
umap={}
f,axs=plt.subplots(4,2)
for ax,t in zip(axs.flatten(),['all','3AS','5AS','ES','IR','ME', 'TSS', 'PAS']):
    umap[t]=plot_embedding(splice_events, method='UMAP',n_neighbors=6, ax=ax, labels=True, groups=isoseq.groups(), splice_types=t)


#axs[0,0].legend(fontsize='medium', ncol=4,handleheight=2.4, labelspacing=0.05, bbox_to_anchor=(0, 1.1), loc='lower left')

/home/lienhard/.local/lib/python3.9/site-packages/numba/core/cpu.py:97: UserWarning: Numba extension module 'sparse._numba_extension' failed to load due to 'FileNotFoundError([Errno 2] No such file or directory: '/home/lienhard/.local/lib/python3.9/site-packages/numba-0.54.0.dist-info/METADATA')'.
  numba.core.entrypoints.init_all()
2021-12-02 14:26:16.740711: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
/home/lienhard/.local/lib/python3.9/site-packages/isotools/plots.py:153: FutureWarning: Support for multi-dimensional indexing (e.g. `obj[:, None]`) is deprecated and will be removed in a future version.  Convert to a numpy array before indexing instead.
  p = ((k.values + scaled_mean[:, np.newaxis]) / (n.values + prior_count)).T
../_images/notebooks_ENCODE_S2_18_1.png

Novel alternative splicing classification

[14]:

from isotools.transcriptome import SPLICE_CATEGORY

#for each novelty subcategory, record the lowest novelty category index it occurs with;
#this is used below to annotate the subcategory labels with their parent category name
cnr={}
for g, trid, tr in isoseq.iter_transcripts():
    for anno in tr['annotation'][1]:
        cnr[anno]=min(cnr.get(anno,5),tr['annotation'][0])
del cnr['FSM']

#novel transcript category statistics per group, excluding likely technical artifacts,
#at different minimum coverage thresholds
altsplice=[ isoseq.altsplice_stats(groups=isoseq.groups(), weight_by_coverage=True, min_coverage=1, tr_filter=dict( query='not (RTTS or FRAGMENT or INTERNAL_PRIMING)')),
            isoseq.altsplice_stats(groups=isoseq.groups(), weight_by_coverage=True, min_coverage=2, tr_filter=dict( query='not (RTTS or FRAGMENT or INTERNAL_PRIMING)')),
            isoseq.altsplice_stats(groups=isoseq.groups(), weight_by_coverage=False, min_coverage=20, tr_filter=dict( query='not (RTTS or FRAGMENT or INTERNAL_PRIMING)'))]
#add the parent category name to the index labels and break long labels over two lines
for i in range(3):
    altsplice[i][0].index=altsplice[i][0].index+[f'\n({SPLICE_CATEGORY[cnr[subcat]]})' if subcat in cnr else '' for subcat in altsplice[i][0].index]
    altsplice[i][0].index=altsplice[i][0].index.str.replace('splice ','\nsplice ')
[15]:
from isotools.plots import plot_bar, plot_distr

plt.rcParams["figure.figsize"] = (30,40)
fig, axs = plt.subplots( 2)
for i,ax in enumerate(axs):
    cat=altsplice[0][0].index[i*10+2:(i+1)*10+2]
    plot_bar(altsplice[0][0],bar_width=.9,ax=ax,ylabel='fraction of reads [%]', legend=False, rot=0,drop_categories=[c for c in altsplice[0][0].index if c not in cat])
axs[-1].legend(fontsize='large', ncol=7,handleheight=2.4, labelspacing=0.05, bbox_to_anchor=(0, -.2), loc='upper left')
_=axs[0].set_title('Novel Transcript Categories')
../_images/notebooks_ENCODE_S2_21_0.png
[16]:
#The same for individual samples
altsplice_sample= isoseq.altsplice_stats( weight_by_coverage=True, min_coverage=1, tr_filter=dict(  query='not (RTTS or FRAGMENT or INTERNAL_PRIMING)'))
altsplice_sample[0].index=altsplice_sample[0].index+[f'\n({SPLICE_CATEGORY[cnr[subcat]]})' if subcat in cnr else '' for subcat in altsplice_sample[0].index]

[17]:
plt.rcParams["figure.figsize"] = (30,20)
plt.rcParams.update({'font.size': 12})
fig, axs = plt.subplots( 4)


for i,ax in enumerate(axs.flatten()):
    cat=altsplice_sample[0].index[i*5+2:(i+1)*5+2]
    plot_bar(altsplice_sample[0],bar_width=.9,ax=ax,ylabel='fraction of reads [%]', legend=False, rot=0,drop_categories=[c for c in altsplice_sample[0].index if c not in cat])
handles, labels = axs[0].get_legend_handles_labels()

axs[-1].legend(fontsize='large', ncol=8,handleheight=2.4, labelspacing=0.05, bbox_to_anchor=(0, -.2), loc='upper left')
_=axs[0].set_title('Novel Transcript Categories')
../_images/notebooks_ENCODE_S2_23_0.png

Differential splicing

To test for differential splicing events between groups of samples, IsoTools implements a likelihood ratio test based on the beta-binomial distribution. For each alternative event, the number of reads supporting the alternative is modeled with a binomial distribution whose probability parameter is beta distributed. For the null model, we fit one set of maximum likelihood parameters of the beta distribution for all samples, whereas for the alternative model, a separate set of parameters is fitted for each group. Under the null hypothesis, the model deviance is assumed to follow a chi-squared distribution with two degrees of freedom.
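
For illustration, here is a minimal, self-contained sketch of such a test with scipy (an illustrative reimplementation, not the isotools internals): k[i] reads supporting the alternative out of n[i] total reads per sample, with one shared beta-binomial fitted under the null and one per group under the alternative.

import numpy as np
from scipy.stats import betabinom, chi2
from scipy.optimize import minimize

def fit_ll(k, n):
    #maximum log-likelihood of a beta-binomial with shared parameters (a,b) over all samples
    def nll(log_ab):
        a, b = np.exp(log_ab)
        return -betabinom.logpmf(k, n, a, b).sum()
    return -minimize(nll, x0=[0.0, 0.0], method='Nelder-Mead').fun

def betabinom_lr_test(k, n, group):
    k, n, group = map(np.asarray, (k, n, group))
    ll_null = fit_ll(k, n)
    ll_alt = sum(fit_ll(k[group == g], n[group == g]) for g in np.unique(group))
    deviance = 2 * (ll_alt - ll_null)
    return chi2.sf(deviance, df=2)  #the alternative model has two additional free parameters

#toy example with a clear PSI difference between the two groups
print(betabinom_lr_test(k=[90, 85, 10, 12], n=[100, 100, 100, 100], group=['A', 'A', 'B', 'B']))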

Each comparison results in a table of alternative splicing events, sorted by significance. To explore the results, we can plot the fitted maximum likelihood beta distributions of the alternative model for the most significant events. We can also look at individual examples and depict the event as a sashimi plot. To this end, we define a helper function that automatically identifies the relevant region and the involved junctions and creates the sashimi plot:

[18]:
#first we define a function to depict the results
def plot_diff_example(row, isoseq, groups):
    #sashimi plot for differential spliced genes
    plt.rcParams["figure.figsize"] = (15,10)
    #select gene object
    g=isoseq[row.gene_id]
    #make the title
    gn=list(groups)
    novel='novel' if row.novel else 'known'
    title=f'{g.name} {row.splice_type} ({novel}) FDR={row.padj:.2e}: {gn[0]}={row[gn[0]+"_PSI"]*100:.1f} vs {gn[1]}={row[gn[1]+"_PSI"]*100:.1f} PSI'
    print(title)
    start=row.start
    end=row.end
    #select the junctions of interest (which will be marked purple in the plot)
    joi=[(start,end)]
    if row.splice_type=='ME' and g.is_annotated:
        try:
            sg=g.ref_segment_graph
            n1=next(n for n in sg if n[1]==start)
            n2=next(n for n in sg if n[0]==end)
            joi={(start,sg[suc].start) for suc in n1.suc.values() if sg[suc].start<end }
            joi.update({(sg[pre].end,end) for pre in n2.pre.values() if sg[pre].end>start })
        except StopIteration:
            pass
    #draw the plot for the two sample groups from the comparison
    f,axs=g.sashimi_figure(samples=groups, x_range=(start-500, end+500), junctions_of_interest=joi)
    axs[0].set_title(title)
    f.tight_layout()
[19]:
from isotools.plots import plot_diff_results
#We are mostly interested in differential splicing. Alternative TSS and PAS are ignored for now.
types_of_interest=['ES','ME','5AS','3AS']
#dict to store the results in
res={}

Leukemia cell lines

Both leukemia cell lines, HL-60 and K562, should carry mutations in the splicing machinery (e.g. in the SRSF2 gene), which is expected to lead to aberrant splicing compared to the B-cell-derived cell line GM12878.

See Urbanski et al., Alternative-splicing defects in cancer: splicing regulators and their downstream targets, guiding the way to novel cancer therapeutics (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6002934/).

[20]:
diff_cmp='GM12878/HL-60'
groups={k:isoseq.groups()[k] for k in diff_cmp.split('/')}
#perform the test:
res[diff_cmp]=isoseq.altsplice_test(groups).sort_values('pvalue').reset_index()
#write result table
res[diff_cmp].to_csv(f'encode_diff_betabinomial_{"_".join(groups)}.csv', index=False)
#res[diff_cmp]=pd.read_csv(f'encode_diff_betabinomial_{"_".join(groups)}.csv')
sig=res[diff_cmp].padj<.1
print(f'{sum(sig)} differential splice sites in {len(res[diff_cmp].loc[sig,"gene"].unique())} genes for {" vs ".join(groups)}')
res[diff_cmp][res[diff_cmp].splice_type.isin(types_of_interest)].head(10)
INFO:testing differential splicing for GM12878 (3) vs HL-60 (2) using betabinom_lr test
100%|██████████| 309557/309557 [09:40<00:00, 533.66genes/s]
5659 differential splice sites in 3329 genes for GM12878 vs HL-60
[20]:
index gene gene_id chrom strand start end splice_type novel padj ... GM12878_1_1_GM12878_in_cov GM12878_1_1_GM12878_total_cov GM12878_2_1_GM12878_in_cov GM12878_2_1_GM12878_total_cov GM12878_1_1_b_GM12878_in_cov GM12878_1_1_b_GM12878_total_cov HL-60_1_1_HL-60_in_cov HL-60_1_1_HL-60_total_cov HL-60_2_1_HL-60_in_cov HL-60_2_1_HL-60_total_cov
3 13046 EIF4A2 ENSG00000156976.17 chr3 + 186787882 186789124 ES False 0.003859 ... 234 708 333 966 49 132 47.0 1748.0 26.0 1205.0
4 13047 EIF4A2 ENSG00000156976.17 chr3 + 186787882 186789124 3AS True 0.004204 ... 210 684 250 883 39 122 25.0 1726.0 14.0 1193.0
9 3869 ANAPC5 ENSG00000089053.13 chr12 - 121328458 121330582 3AS False 0.006753 ... 895 1340 1169 1727 201 305 470.0 1206.0 404.0 989.0
13 15216 MCM3 ENSG00000112118.20 chr6 - 52264786 52266074 ES False 0.006753 ... 47 2403 31 1979 22 704 519.0 2363.0 443.0 1955.0
14 10047 CBWD2 ENSG00000136682.15 chr2 + 113438022 113441350 ES False 0.006753 ... 24 209 37 326 12 88 135.0 199.0 83.0 117.0
17 16292 METTL2B ENSG00000165055.16 chr7 + 128476875 128479157 ES False 0.006753 ... 5 88 9 144 3 44 122.0 123.0 67.0 68.0
20 1711 SRP9 ENSG00000143742.14 chr1 + 225783368 225789239 ES False 0.007374 ... 95 1453 173 2368 24 266 305.0 1134.0 208.0 749.0
23 7847 WSB1 ENSG00000109046.15 chr17 + 27306882 27309099 3AS False 0.007374 ... 25 403 21 621 6 63 101.0 108.0 144.0 155.0
24 5145 ZNF410 ENSG00000119725.20 chr14 + 73921105 73923394 ES False 0.007374 ... 332 494 497 772 88 134 350.0 353.0 283.0 283.0
26 1284 LGALS8 ENSG00000116977.19 chr1 + 236542787 236543559 ES False 0.007374 ... 82 111 107 150 13 17 14.0 154.0 9.0 120.0

10 rows × 29 columns
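
The per-group PSI values referenced below (e.g. in the sashimi plot titles) relate to the coverage columns shown above. A simple way to pool them per group is sketched here as a hypothetical helper; the *_PSI columns reported by isotools may be aggregated differently.

#sketch: recompute a pooled per-group PSI from the *_in_cov and *_total_cov columns
def group_psi(df, group):
    in_cols=[c for c in df.columns if c.endswith(f'_{group}_in_cov')]
    total_cols=[c for c in df.columns if c.endswith(f'_{group}_total_cov')]
    return df[in_cols].sum(axis=1)/df[total_cols].sum(axis=1)

pooled=pd.DataFrame({f'{gn} pooled PSI': group_psi(res[diff_cmp], gn) for gn in groups})
pooled.head()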

[21]:

plt.rcParams["figure.figsize"] = (20,20)
f,axs,plotted=plot_diff_results(res[diff_cmp], min_diff=.1,grid_shape=(7,6),min_support=2, splice_types=types_of_interest)
f.savefig(f'encode_diff_betabinomial_{"_".join(groups)}.png')
plotted.to_csv(f'encode_diff_betabinomial_plot_{"_".join(groups)}.csv', index=False)
../_images/notebooks_ENCODE_S2_30_0.png
[22]:
#sashimi plot for differential spliced genes
plt.rcParams["figure.figsize"] = (15,20)
#1) select gene of interest: row 3 of the result table, exon skipping of EIF4A2
row=res[diff_cmp].iloc[3]
plot_diff_example(row, isoseq, groups)
EIF4A2 ES (known) FDR=3.86e-03: GM12878=34.1 vs HL-60=2.5 PSI
../_images/notebooks_ENCODE_S2_31_1.png
[23]:
diff_cmp='GM12878/K562'
groups={k:isoseq.groups()[k] for k in diff_cmp.split('/')}

res[diff_cmp]=isoseq.altsplice_test(groups).sort_values('pvalue').reset_index()
res[diff_cmp].to_csv(f'encode_diff_betabinomial_{"_".join(groups)}.csv', index=False)
#res[diff_cmp]=pd.read_csv(f'encode_diff_betabinomial_{"_".join(groups)}.csv')
sig=res[diff_cmp].padj<.1
print(f'{sum(sig)} differential splice sites in {len(res[diff_cmp].loc[sig,"gene"].unique())} genes for {" vs ".join(groups)}')
res[diff_cmp][res[diff_cmp].splice_type.isin(types_of_interest)].head(10)
INFO:testing differential splicing for GM12878 (3) vs K562 (3) using betabinom_lr test
100%|██████████| 309557/309557 [11:48<00:00, 437.09genes/s]
4665 differential splice sites in 3050 genes for GM12878 vs K562
[23]:
index gene gene_id chrom strand start end splice_type novel padj ... GM12878_2_1_GM12878_in_cov GM12878_2_1_GM12878_total_cov GM12878_1_1_b_GM12878_in_cov GM12878_1_1_b_GM12878_total_cov K562_1_1_K562_in_cov K562_1_1_K562_total_cov K562_2_1_K562_in_cov K562_2_1_K562_total_cov K562_1_1_b_K562_in_cov K562_1_1_b_K562_total_cov
0 1124 NDUFS5 ENSG00000168653.11 chr1 + 39026398 39028722 5AS False 0.001004 ... 7 634 3 199 208 372 336 580 136 249
5 20075 RIPK2 ENSG00000104312.8 chr8 + 89780160 89786592 ES True 0.004049 ... 224 225 76 76 34 104 33 112 41 152
8 616 PSMB2 ENSG00000126067.12 chr1 - 35631344 35641341 ES True 0.005460 ... 954 954 257 258 681 893 595 765 262 341
9 4998 MYL6 ENSG00000092841.19 chr12 + 56160320 56161386 ES False 0.005460 ... 24 1827 22 911 188 722 235 940 182 611
13 15657 CASP3 ENSG00000164305.19 chr4 - 184638468 184649394 ES True 0.005645 ... 197 958 26 111 71 105 75 119 71 95
21 5028 RAD51AP1 ENSG00000111247.15 chr12 + 4553147 4558856 ES False 0.006496 ... 44 48 34 38 5 34 7 39 7 41
22 6326 HNRNPC ENSG00000092199.18 chr14 - 21230366 21230996 5AS False 0.006782 ... 387 2092 168 799 431 1287 480 1475 361 1094
24 19040 METTL2B ENSG00000165055.16 chr7 + 128476875 128479157 ES False 0.006782 ... 9 144 3 44 68 71 61 63 71 71
33 20288 CBWD1 ENSG00000172785.18 chr9 - 175784 178815 ES False 0.007260 ... 99 561 25 138 107 227 125 248 50 112
39 1415 MPZL1 ENSG00000197965.12 chr1 + 167773368 167787819 ES False 0.009935 ... 0 4 6 65 168 273 144 239 124 190

10 rows × 31 columns

[24]:

plt.rcParams["figure.figsize"] = (20,20)
f,axs,plotted=plot_diff_results(res[diff_cmp], min_diff=.1,grid_shape=(7,8),min_support=2, splice_types=types_of_interest)
f.savefig(f'encode_diff_betabinomial_{"_".join(groups)}.png')
plotted.to_csv(f'encode_diff_betabinomial_plot_{"_".join(groups)}.csv', index=False)
../_images/notebooks_ENCODE_S2_33_0.png
[25]:
#sashimi plot for differential spliced genes
plt.rcParams["figure.figsize"] = (15,10)
#1) select gene of interest
row=res[diff_cmp].iloc[5]
plot_diff_example(row, isoseq, groups)
RIPK2 ES (novel) FDR=4.05e-03: GM12878=99.4 vs K562=29.4 PSI
../_images/notebooks_ENCODE_S2_34_1.png
[26]:
gr=isoseq.groups()
groups={'leukemia':gr['K562']+gr['HL-60'], 'GM12878':gr['GM12878']}
diff_cmp='/'.join(groups)

#takes about 30 min
res[diff_cmp]=isoseq.altsplice_test(groups).sort_values('pvalue').reset_index()
res[diff_cmp].to_csv(f'encode_diff_betabinomial_{"_".join(groups)}.csv', index=False)
#res[diff_cmp]=pd.read_csv(f'encode_diff_betabinomial_{"_".join(groups)}.csv')
sig=res[diff_cmp].padj<.1
print(f'{sum(sig)} differential splice sites in {len(res[diff_cmp].loc[sig,"gene"].unique())} genes for {" vs ".join(groups)}')
res[diff_cmp][res[diff_cmp].splice_type.isin(types_of_interest)].head(10)
INFO:testing differential splicing for leukemia (5) vs GM12878 (3) using betabinom_lr test
100%|██████████| 309557/309557 [11:50<00:00, 435.65genes/s]
1587 differential splice sites in 1181 genes for leukemia vs GM12878
[26]:
index gene gene_id chrom strand start end splice_type novel padj ... HL-60_1_1_leukemia_in_cov HL-60_1_1_leukemia_total_cov HL-60_2_1_leukemia_in_cov HL-60_2_1_leukemia_total_cov GM12878_1_1_GM12878_in_cov GM12878_1_1_GM12878_total_cov GM12878_2_1_GM12878_in_cov GM12878_2_1_GM12878_total_cov GM12878_1_1_b_GM12878_in_cov GM12878_1_1_b_GM12878_total_cov
3 20601 METTL2B ENSG00000165055.16 chr7 + 128476875 128479157 ES False 0.001373 ... 122 123 67 68 5 88 9.0 144.0 3.0 44.0
7 21875 CBWD1 ENSG00000172785.18 chr9 - 175784 178815 ES False 0.002528 ... 192 347 120 228 60 302 99.0 561.0 25.0 138.0
9 1232 NDUFS5 ENSG00000168653.11 chr1 + 39026398 39028722 5AS False 0.002679 ... 17 44 13 43 5 393 7.0 634.0 3.0 199.0
10 7656 INTS14 ENSG00000138614.15 chr15 - 65607442 65611097 5AS False 0.003423 ... 135 224 108 178 35 158 43.0 222.0 18.0 54.0
14 922 C1orf43 ENSG00000143612.21 chr1 - 154214574 154220341 ES False 0.003568 ... 1157 1627 768 1053 159 349 237.0 497.0 49.0 106.0
18 1402 ACADM ENSG00000117054.14 chr1 + 75724817 75732643 ES False 0.003568 ... 264 325 206 256 802 836 1104.0 1159.0 155.0 164.0
28 19864 MRM2 ENSG00000122687.19 chr7 - 2235564 2239417 ES False 0.005016 ... 94 370 70 301 29 400 25.0 460.0 9.0 107.0
29 14268 DIDO1 ENSG00000101191.17 chr20 - 62914406 62926438 5AS True 0.005016 ... 175 175 156 156 38 76 28.0 58.0 29.0 50.0
30 1685 PTPRC ENSG00000081237.20 chr1 + 198692373 198703297 ES False 0.005016 ... 67 297 58 331 186 227 259.0 307.0 57.0 62.0
32 19563 ECHDC1 ENSG00000093144.19 chr6 - 127314896 127327001 ES False 0.005041 ... 148 272 109 210 320 363 441.0 502.0 66.0 88.0

10 rows × 35 columns

[27]:

plt.rcParams["figure.figsize"] = (20,20)
f,axs,plotted=plot_diff_results(res[diff_cmp], min_diff=.1,grid_shape=(7,6),min_support=2, splice_types=types_of_interest)
f.savefig(f'encode_diff_betabinomial_{"_".join(groups)}.png')
plotted.to_csv(f'encode_diff_betabinomial_plot_{"_".join(groups)}.csv', index=False)
../_images/notebooks_ENCODE_S2_36_0.png
[28]:
#sashimi plot for differential spliced genes
plt.rcParams["figure.figsize"] = (15,10)
#1) select gene of interest
row=res[diff_cmp].iloc[3]
plot_diff_example(row, isoseq, groups)
METTL2B ES (known) FDR=1.37e-03: leukemia=98.2 vs GM12878=6.2 PSI
../_images/notebooks_ENCODE_S2_37_1.png