Transcriptome Reconstruction¶
This tutorial processes the example dataset, based on PacBio Isoseq samples of hematopoetic cells from ENCODE. This dataset contains only a subset of genomic regions, allowing for fast processing of the demonstration tutorials. All required data files to run the tutorials can be obtained here: (download link).
You will need:
sample description file ‘encode_samples.tsv’
six .bam alignment files for the six samples
six corresponding .bam.bai indices
referernce annotation file gencode.v42.chr_patch_hapl_scaff.annotation_sorted_chr8.gff3.gz
corresponding .gff3.gz.tbi index file
genomic reference file GRCh38.p13.genome_chr8.fa
cooresponding .fai index file
All files are assumed to be stored in a subfolder called ‘demonstration_dataset’
In this tutorial, we import the reference annotation, specify the alignment files for the samples, and integrate the data into a common data structure. During this step, the transcriptome is reconstructed, and quality control metrics are computed.
[1]:
# preperation: import the libraries
from isotools import Transcriptome
from isotools import __version__ as isotools_version
import pandas as pd
import matplotlib.pyplot as plt
import logging
# set up logging
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.INFO)
logger=logging.getLogger('isotools')
logger.info(f'This is isootools version {isotools_version}')
path='demonstration_dataset'
INFO:This is isootools version 0.3.2rc3
Import of reference annotation¶
The first step is to import the reference annotation from a gff or gtf file. It should be sorted and indexed with tabix.
[2]:
annotation_fn=f'{path}/gencode.v42.chr_patch_hapl_scaff.annotation_sorted_chr8.gff3.gz'
#create the IsoTools transcriptome object from the reference annotation
isoseq=Transcriptome.from_reference(annotation_fn)
INFO:importing reference from gff3 file demonstration_dataset/gencode.v42.chr_patch_hapl_scaff.annotation_sorted_chr8.gff3.gz
100%|█████████▉| 2.82M/2.82M [00:02<00:00, 1.43MB/s]
INFO:skipped the following categories: {'three_prime_UTR', 'five_prime_UTR', 'CDS'}
WARNING:Missing genes! Found gene information in categories ['gene'] for 2540/5080 genes
Import of sequencing data¶
Next the sample table is imported, and sequencing information is added for each sample. After importing the alignments in bam format, quality control metrics are calculated, and the object with all data is stored on disk in a pickle file for later use.
[3]:
sample_fn=f'{path}/encode_samples.tsv'
genome_fn=f'{path}/GRCh38.p13.genome_chr8.fa'
samples=pd.read_csv(sample_fn, sep='\t')
samples.file_name=path+'/'+samples.file_name
samples
[3]:
sample_name | file_name | group | |
---|---|---|---|
0 | GM12878_a | demonstration_dataset/ENCFF417VHJ_aligned_mm2_... | GM12878 |
1 | GM12878_b | demonstration_dataset/ENCFF450VAU_aligned_mm2_... | GM12878 |
2 | GM12878_c | demonstration_dataset/ENCFF694DIE_aligned_mm2_... | GM12878 |
3 | K562_a | demonstration_dataset/ENCFF429VVB_aligned_mm2_... | K562 |
4 | K562_b | demonstration_dataset/ENCFF696GDL_aligned_mm2_... | K562 |
5 | K562_c | demonstration_dataset/ENCFF634YSN_aligned_mm2_... | K562 |
[4]:
# integrate the samples
for i,row in samples.iterrows():
# this step takes about 5-30 seconds per sample
isoseq.add_sample_from_bam(row.file_name, sample_name=row.sample_name, group=row.group)
# the sample table of the transcriptome object contains the number of imported reads
isoseq.sample_table
INFO:adding sample GM12878_a from file demonstration_dataset/ENCFF417VHJ_aligned_mm2_chr8.bam
100%|██████████| 53.0k/53.0k [00:13<00:00, 4.05kreads/s, chr=KI270757.1]
INFO:skipped 110 reads aligned fraction of less than 0.75.
INFO:skipped 10972 secondary alignments (0x100), alignment that failed quality check (0x200) or PCR duplicates (0x400)
WARNING:ignored 533 chimeric alignments with only one part aligned to specified chromosomes.
INFO:ignoring 2231 chimeric alignments with less than 2 reads
INFO:imported 40182 nonchimeric reads (including 14 chained chimeric alignments) and 73 chimeric reads with coverage of at least 2.
INFO:adding sample GM12878_b from file demonstration_dataset/ENCFF450VAU_aligned_mm2_chr8.bam
100%|██████████| 68.4k/68.4k [00:12<00:00, 5.36kreads/s, chr=KI270757.1]
INFO:skipped 71 reads aligned fraction of less than 0.75.
INFO:skipped 12700 secondary alignments (0x100), alignment that failed quality check (0x200) or PCR duplicates (0x400)
WARNING:ignored 484 chimeric alignments with only one part aligned to specified chromosomes.
INFO:ignoring 1273 chimeric alignments with less than 2 reads
INFO:imported 54853 nonchimeric reads (including 12 chained chimeric alignments) and 7 chimeric reads with coverage of at least 2.
INFO:adding sample GM12878_c from file demonstration_dataset/ENCFF694DIE_aligned_mm2_chr8.bam
100%|██████████| 90.7k/90.7k [00:15<00:00, 5.89kreads/s, chr=KI270757.1]
INFO:skipped 85 reads aligned fraction of less than 0.75.
INFO:skipped 17261 secondary alignments (0x100), alignment that failed quality check (0x200) or PCR duplicates (0x400)
WARNING:ignored 455 chimeric alignments with only one part aligned to specified chromosomes.
INFO:ignoring 1410 chimeric alignments with less than 2 reads
INFO:imported 72451 nonchimeric reads (including 38 chained chimeric alignments) and 12 chimeric reads with coverage of at least 2.
INFO:adding sample K562_a from file demonstration_dataset/ENCFF429VVB_aligned_mm2_chr8.bam
100%|██████████| 107k/107k [00:20<00:00, 5.21kreads/s, chr=KI270757.1]
INFO:skipped 297 reads aligned fraction of less than 0.75.
INFO:skipped 23990 secondary alignments (0x100), alignment that failed quality check (0x200) or PCR duplicates (0x400)
WARNING:ignored 2160 chimeric alignments with only one part aligned to specified chromosomes.
INFO:ignoring 7445 chimeric alignments with less than 2 reads
INFO:imported 76692 nonchimeric reads (including 57 chained chimeric alignments) and 415 chimeric reads with coverage of at least 2.
INFO:adding sample K562_b from file demonstration_dataset/ENCFF696GDL_aligned_mm2_chr8.bam
100%|██████████| 78.0k/78.0k [00:16<00:00, 4.87kreads/s, chr=KI270757.1]
INFO:skipped 165 reads aligned fraction of less than 0.75.
INFO:skipped 15026 secondary alignments (0x100), alignment that failed quality check (0x200) or PCR duplicates (0x400)
WARNING:ignored 1142 chimeric alignments with only one part aligned to specified chromosomes.
INFO:ignoring 4530 chimeric alignments with less than 2 reads
INFO:imported 59118 nonchimeric reads (including 43 chained chimeric alignments) and 284 chimeric reads with coverage of at least 2.
INFO:adding sample K562_c from file demonstration_dataset/ENCFF634YSN_aligned_mm2_chr8.bam
100%|██████████| 117k/117k [00:19<00:00, 5.94kreads/s, chr=KI270757.1]
INFO:skipped 294 reads aligned fraction of less than 0.75.
INFO:skipped 30231 secondary alignments (0x100), alignment that failed quality check (0x200) or PCR duplicates (0x400)
WARNING:ignored 2528 chimeric alignments with only one part aligned to specified chromosomes.
INFO:ignoring 8019 chimeric alignments with less than 2 reads
INFO:imported 80343 nonchimeric reads (including 46 chained chimeric alignments) and 371 chimeric reads with coverage of at least 2.
[4]:
name | file | group | nonchimeric_reads | chimeric_reads | |
---|---|---|---|---|---|
0 | GM12878_a | demonstration_dataset/ENCFF417VHJ_aligned_mm2_... | GM12878 | 40182 | 73 |
0 | GM12878_b | demonstration_dataset/ENCFF450VAU_aligned_mm2_... | GM12878 | 54853 | 7 |
0 | GM12878_c | demonstration_dataset/ENCFF694DIE_aligned_mm2_... | GM12878 | 72451 | 12 |
0 | K562_a | demonstration_dataset/ENCFF429VVB_aligned_mm2_... | K562 | 76692 | 415 |
0 | K562_b | demonstration_dataset/ENCFF696GDL_aligned_mm2_... | K562 | 59118 | 284 |
0 | K562_c | demonstration_dataset/ENCFF634YSN_aligned_mm2_... | K562 | 80343 | 371 |
In the next step, we compute several qc metrics for the transcripts: * downstream A content, * direct repeat length at junctions, * noncanonical splicing, * potential fragments
[5]:
# compute qc metrics
isoseq.add_qc_metrics(genome_fn)
100%|██████████| 10803/10803 [01:16<00:00, 141.86genes/s]
[6]:
# export the transcriptome object for later use.
isoseq.save('PacBio_isotools_substantial_isotools.pkl')
INFO:saving transcriptome to PacBio_isotools_substantial_isotools.pkl
[ ]: