Data access and Filtering

In this tutorial, we will learn to

  • access genes by gene_name/gene_id.

  • iterate over genes/ transcripts, and filter them by their properties, genomic location or coverage.

  • use IsoTools query syntax to filter genes and transcripts.

  • define custom tags and filter expressions, to tailor filter queries.

This tutorial depends on the transcriptome file PacBio_isotools_substantial_isotools.pkl, which can be obtained with this download link

[1]:
from isotools import Transcriptome
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

path='demonstration_dataset'
isoseq=Transcriptome.load(f'{path}/PacBio_isotools_substantial_isotools.pkl')
This is isotools version 0.3.2rc3, but data has been pickled with version 0.3.2rc2, which may be incompatible

Access genes by name or id

Gene objects can be retrieved from the transcriptome object using the square bracket operator. This works with both the gene name and gene id. The gene object contains all information of the gene, like the coverage and the list of transcripts, which in turn contain the list of exons, comparison with reference annotation and much more.

[2]:
g=isoseq['ENSG00000104312.8']
if g==isoseq['RIPK2']:
    print('found the same gene by name and id')

#string representation
print(g)
# obtain the sum coverage over all transcripts and samples
total_cov=g.coverage.sum()
print(f"{total_cov} reads in total")
found the same gene by name and id
Gene RIPK2 chr8:89757805-89791064(+), 4 reference transcripts, 52 expressed transcripts
920 reads in total
[3]:
#This gene has 52 expressed transcripts, only 4 of them are in the reference.
#However, most are supported by few reads only
print(f'{sum([cov>total_cov *.01 for cov in g.coverage.sum(0)])} transcripts contribute at least 1% to that gene')

7 transcripts contribute at least 1% to that gene
[4]:
#lets look at the primary transcript
max_i=np.argmax(g.coverage.sum(0))
max_contribution=g.coverage.sum(0)[max_i]
print(f'The primary transcript is number {max_i} and contributes {max_contribution/total_cov:.2%}  ({max_contribution}/{total_cov} reads)')

The primary transcript is number 2 and contributes 40.33%  (371/920 reads)
[5]:
#all the information for this transcript are stored in this dict:
primary=g.transcripts[max_i]
print(f'\nThese are the infos for this transcript:')
for k,v in primary.items():
    print(f'{k}: {str(v)[:100]}{"..." if len(str(v))>100 else ""}')



These are the infos for this transcript:
exons: [[89757792, 89758233], [89762828, 89762982], [89765340, 89765496], [89769771, 89769929], [89771740, ...
strand: +
coverage: {'GM12878_a': 48, 'GM12878_b': 135, 'GM12878_c': 138, 'K562_a': 19, 'K562_b': 18, 'K562_c': 13}
TSS: {'GM12878_a': {89757763: 1, 89757766: 1, 89757778: 1, 89757782: 3, 89757783: 4, 89757784: 4, 8975778...
PAS: {'GM12878_a': {89790462: 3, 89791057: 3, 89790608: 5, 89790939: 3, 89790606: 3, 89790425: 1, 8979046...
annotation: (0, {'FSM': [1]})
TSS_unified: {'GM12878_a': {89757792: 48}, 'GM12878_b': {89757792: 135}, 'GM12878_c': {89757792: 138}, 'K562_a': ...
PAS_unified: {'GM12878_a': {89790463: 20, 89790983: 10, 89790606: 11, 89790940: 3, 89790525: 4}, 'GM12878_b': {89...
ORF: (89758060, 89790416, {'start': 268, 'length': 1623, 'start_codon': 'ATG', 'stop_codon': 'TAA', 'NMD'...
direct_repeat_len: [3, 3, 4, 5, 7, 3, 5, 4, 6, 4]
downstream_A_content: 0.2
[6]:

# this 'annotation' line reveals that it is a FSM with reference transcript nr 1:
# annotation: (0, {'FSM': [1]})

print(f'\nThe corresponding reference transcript: ')
for k,v in g.ref_transcripts[primary["annotation"][1]["FSM"][0]].items():
    print(f'{k}: {str(v)[:100]}{"..." if len(str(v))>100 else ""}')

The corresponding reference transcript:
transcript_id: ENST00000220751.5
transcript_type: protein_coding
transcript_name: RIPK2-201
transcript_support_level: 1
exons: [(89757815, 89758233), (89762828, 89762982), (89765340, 89765496), (89769771, 89769929), (89771740, ...
CDS: (89758060, 89790416)
downstream_A_content: 0.06666666666666667

Iterating genes and transcripts

To iterate genes and transcripts, the transcriptome object provides the functions iter_transcripts and iter_genes. Both have the option to filter by genomic region or coverage, and with queries. The gene iterator yield the gene, while the transcript iterator function yields a 3 tuple with gene, transcript index, and transcript dictionary.

[7]:
for g in isoseq.iter_genes(region='chr8:89000000-90000000', min_coverage=100):
    print(g)
Gene OSGIN2 chr8:89901848-89927888(+), 4 reference transcripts, 48 expressed transcripts
Gene RIPK2 chr8:89757805-89791064(+), 4 reference transcripts, 52 expressed transcripts
Gene NBN chr8:89924514-90003228(-), 38 reference transcripts, 327 expressed transcripts

Filtering tags and queries

Isotools allows to filter genes and transcripts based on TAGS (single word in ALLCAPS).

Each TAG is defined by a corresponding expression, that gets evaluated on the properties of the gene or transcript.

A TAG is defined in a specific context, either gene, transcript, or reference transcript context, in which the expression gets evaluated. Expressions in gene context depend on properties of the gene, while in transcript context, the properties of the transcript are relevant.

We already used tags in the previous tutorial, for the definition of the sequencing artifacts.’INTERNAL_PRIMING’,’FRAGMENT’,’RTTS’ all have corresponding expressions, that define them. For example, the expression for INTERNAL_PRIMING tag is ‘len(exons)==1 and downstream_A_content and downstream_A_content>.5’, e.g. it selects (e.g. returns True) mono exon genes with more than 50% A downstream of the transcript.

As additional examples, we print the default definitions for all defined tags.

[8]:

#print all defined filter expressions
for context in isoseq.filter:
    print(f'\n{context}\n{"="*len(context)}')
    for tag,expression in isoseq.filter[context].items():
        print(f'- {tag}:\t{expression}')


gene
====
- NOVEL_GENE:   not reference
- EXPRESSED:    transcripts
- CHIMERIC:     chimeric

transcript
==========
- INTERNAL_PRIMING:     len(exons)==1 and downstream_A_content and downstream_A_content>.5
- RTTS: noncanonical_splicing is not None and novel_splice_sites is not None and         any(2*i in novel_splice_sites and 2*i+1 in novel_splice_sites for i,_ in noncanonical_splicing)
- NONCANONICAL_SPLICING:        noncanonical_splicing
- NOVEL_TRANSCRIPT:     annotation[0]>0
- FRAGMENT:     fragments and any("novel exonic " in a or "fragment" in a for a in annotation[1])
- UNSPLICED:    len(exons)==1
- MULTIEXON:    len(exons)>1
- SUBSTANTIAL:  g.coverage.sum() * .01 < g.coverage[:,trid].sum()
- ANTISENSE:    "antisense" in annotation[1]
- INTERGENIC:   "intergenic" in annotation[1]
- GENIC_GENOMIC:        "genic genomic" in annotation[1]
- NOVEL_EXONIC_PAS:     "novel exonic PAS" in annotation[1]
- NOVEL_INTRONIC_PAS:   "novel intronic PAS" in annotation[1]
- READTHROUGH_FUSION:   "readthrough fusion" in annotation[1]
- NOVEL_EXON:   "novel exon" in annotation[1]
- NOVEL_3_SPLICE_SITE:  "novel 3' splice site" in annotation[1]
- INTRON_RETENTION:     "intron retention" in annotation[1]
- NOVEL_5_SPLICE_SITE:  "novel 5' splice site" in annotation[1]
- EXON_SKIPPING:        "exon skipping" in annotation[1]
- NOVEL_COMBINATION:    "novel combination" in annotation[1]
- NOVEL_INTRONIC_TSS:   "novel intronic TSS" in annotation[1]
- NOVEL_EXONIC_TSS:     "novel exonic TSS" in annotation[1]
- MONO_EXON:    "mono-exon" in annotation[1]
- NOVEL_JUNCTION:       "novel junction" in annotation[1]
- _5_FRAGMENT:  "5' fragment" in annotation[1]
- _3_FRAGMENT:  "3' fragment" in annotation[1]
- INTRONIC:     "intronic" in annotation[1]
- FSM:  annotation[0]==0
- ISM:  annotation[0]==1
- NIC:  annotation[0]==2
- NNC:  annotation[0]==3
- NOVEL:        annotation[0]==4
- COVERED:      any(g.coverage[:,trid] > 2)
- HIGH_COVERED: any(g.coverage[:,trid] > 5)

reference
=========
- REF_UNSPLICED:        len(exons)==1
- REF_MULTIEXON:        len(exons)>1
- REF_INTERNAL_PRIMING: downstream_A_content>.5

Custom Tags

Users can modify existing criteria, for example to adjust thresholds, or define additional criteria, based on custom properties.

The following example shows how the user can define additional flags, in this case “HIGH_SUPPORT” and “PROTEIN_CODING” for the reference transcripts, which is based on the GENCODE annotation information on “transcript_support_level” and “transcript_type”.

[9]:
#add /modify custom filter

isoseq.add_filter( "HIGH_SUPPORT", 'transcript_support_level=="1"', context='reference')
isoseq.add_filter( "PROTEIN_CODING", 'transcript_type=="protein_coding"', context='reference')

Filter Queries

The tags can be combined to boolian expressions, to query transcripts of interest. For example, to find novel exon skipping events, that contribute substantially to the genes total expression:

[10]:

i=0
for g,trnr,tr in isoseq.iter_transcripts(query='EXON_SKIPPING and SUBSTANTIAL', min_coverage=50):
    print(f'Transcript nr {trnr} of "{g}" with a coverage of {g.coverage.sum(0)[trnr]}')
    print(f"  -> {tr['annotation'][1]['exon skipping']}")

Transcript nr 5 of "Gene ARMC1 chr8:65602457-65634217(-), 6 reference transcripts, 59 expressed transcripts" with a coverage of 51
  -> [[65605421, 65605538]]
Transcript nr 3 of "Gene ZNF7 chr8:144827517-144847509(+), 18 reference transcripts, 85 expressed transcripts" with a coverage of 52
  -> [[144828007, 144828053], [144828728, 144828950], [144829042, 144829604], [144830930, 144831029]]
Transcript nr 0 of "Gene CIBAR1 chr8:93698560-93731527(+), 18 reference transcripts, 29 expressed transcripts" with a coverage of 296
  -> [[93707201, 93707281], [93708010, 93708016]]
Transcript nr 8 of "Gene RIPK2 chr8:89757805-89791064(+), 4 reference transcripts, 52 expressed transcripts" with a coverage of 115
  -> [[89784049, 89784139]]
Transcript nr 16 of "Gene RIPK2 chr8:89757805-89791064(+), 4 reference transcripts, 52 expressed transcripts" with a coverage of 124
  -> [[89784049, 89784139]]