{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Transcriptome Reconstruction\n", "This tutorial processes the example dataset, based on PacBio Isoseq samples of hematopoetic cells from ENCODE. This dataset contains only a subset of genomic regions, allowing for fast processing of the demonstration tutorials. \n", "All required data files to run the tutorials can be obtained here: ([download link](https://oc-molgen.gnz.mpg.de/owncloud/s/gjG9EPiQwpRAyg3)). \n", "\n", "You will need:\n", "\n", "* sample description file 'encode_samples.tsv'\n", "* six .bam alignment files for the six samples\n", "* six corresponding .bam.bai indices\n", "* referernce annotation file gencode.v42.chr_patch_hapl_scaff.annotation_sorted_chr8.gff3.gz\n", "* corresponding .gff3.gz.tbi index file\n", "* genomic reference file GRCh38.p13.genome_chr8.fa\n", "* cooresponding .fai index file\n", "\n", "All files are assumed to be stored in a subfolder called 'demonstration_dataset'\n", "\n", "In this tutorial, we import the reference annotation, specify the alignment files for the samples, and integrate the data into a common data structure. During this step, the transcriptome is reconstructed, and quality control metrics are computed. \n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:This is isootools version 0.3.2rc3\n" ] } ], "source": [ "# preperation: import the libraries\n", "from isotools import Transcriptome\n", "from isotools import __version__ as isotools_version\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import logging\n", "# set up logging\n", "logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.INFO)\n", "logger=logging.getLogger('isotools')\n", "logger.info(f'This is isootools version {isotools_version}')\n", "\n", "\n", "path='demonstration_dataset'\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import of reference annotation\n", "The first step is to import the reference annotation from a gff or gtf file. It should be sorted and indexed with tabix. " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:importing reference from gff3 file demonstration_dataset/gencode.v42.chr_patch_hapl_scaff.annotation_sorted_chr8.gff3.gz\n", "100%|█████████▉| 2.82M/2.82M [00:02<00:00, 1.43MB/s]\n", "INFO:skipped the following categories: {'three_prime_UTR', 'five_prime_UTR', 'CDS'}\n", "WARNING:Missing genes! Found gene information in categories ['gene'] for 2540/5080 genes\n" ] } ], "source": [ "annotation_fn=f'{path}/gencode.v42.chr_patch_hapl_scaff.annotation_sorted_chr8.gff3.gz'\n", "#create the IsoTools transcriptome object from the reference annotation\n", "isoseq=Transcriptome.from_reference(annotation_fn)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import of sequencing data\n", "Next the sample table is imported, and sequencing information is added for each sample. After importing the alignments in bam format, quality control metrics are calculated, and the object with all data is stored on disk in a pickle file for later use." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sample_namefile_namegroup
0GM12878_ademonstration_dataset/ENCFF417VHJ_aligned_mm2_...GM12878
1GM12878_bdemonstration_dataset/ENCFF450VAU_aligned_mm2_...GM12878
2GM12878_cdemonstration_dataset/ENCFF694DIE_aligned_mm2_...GM12878
3K562_ademonstration_dataset/ENCFF429VVB_aligned_mm2_...K562
4K562_bdemonstration_dataset/ENCFF696GDL_aligned_mm2_...K562
5K562_cdemonstration_dataset/ENCFF634YSN_aligned_mm2_...K562
\n", "
" ], "text/plain": [ " sample_name file_name group\n", "0 GM12878_a demonstration_dataset/ENCFF417VHJ_aligned_mm2_... GM12878\n", "1 GM12878_b demonstration_dataset/ENCFF450VAU_aligned_mm2_... GM12878\n", "2 GM12878_c demonstration_dataset/ENCFF694DIE_aligned_mm2_... GM12878\n", "3 K562_a demonstration_dataset/ENCFF429VVB_aligned_mm2_... K562\n", "4 K562_b demonstration_dataset/ENCFF696GDL_aligned_mm2_... K562\n", "5 K562_c demonstration_dataset/ENCFF634YSN_aligned_mm2_... K562" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample_fn=f'{path}/encode_samples.tsv'\n", "genome_fn=f'{path}/GRCh38.p13.genome_chr8.fa'\n", "\n", "samples=pd.read_csv(sample_fn, sep='\\t')\n", "samples.file_name=path+'/'+samples.file_name\n", "samples" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:adding sample GM12878_a from file demonstration_dataset/ENCFF417VHJ_aligned_mm2_chr8.bam\n", "100%|██████████| 53.0k/53.0k [00:13<00:00, 4.05kreads/s, chr=KI270757.1]\n", "INFO:skipped 110 reads aligned fraction of less than 0.75.\n", "INFO:skipped 10972 secondary alignments (0x100), alignment that failed quality check (0x200) or PCR duplicates (0x400)\n", "WARNING:ignored 533 chimeric alignments with only one part aligned to specified chromosomes.\n", "INFO:ignoring 2231 chimeric alignments with less than 2 reads\n", "INFO:imported 40182 nonchimeric reads (including 14 chained chimeric alignments) and 73 chimeric reads with coverage of at least 2.\n", "INFO:adding sample GM12878_b from file demonstration_dataset/ENCFF450VAU_aligned_mm2_chr8.bam\n", "100%|██████████| 68.4k/68.4k [00:12<00:00, 5.36kreads/s, chr=KI270757.1]\n", "INFO:skipped 71 reads aligned fraction of less than 0.75.\n", "INFO:skipped 12700 secondary alignments (0x100), alignment that failed quality check (0x200) or PCR duplicates (0x400)\n", "WARNING:ignored 484 chimeric alignments with only one part aligned to specified chromosomes.\n", "INFO:ignoring 1273 chimeric alignments with less than 2 reads\n", "INFO:imported 54853 nonchimeric reads (including 12 chained chimeric alignments) and 7 chimeric reads with coverage of at least 2.\n", "INFO:adding sample GM12878_c from file demonstration_dataset/ENCFF694DIE_aligned_mm2_chr8.bam\n", "100%|██████████| 90.7k/90.7k [00:15<00:00, 5.89kreads/s, chr=KI270757.1]\n", "INFO:skipped 85 reads aligned fraction of less than 0.75.\n", "INFO:skipped 17261 secondary alignments (0x100), alignment that failed quality check (0x200) or PCR duplicates (0x400)\n", "WARNING:ignored 455 chimeric alignments with only one part aligned to specified chromosomes.\n", "INFO:ignoring 1410 chimeric alignments with less than 2 reads\n", "INFO:imported 72451 nonchimeric reads (including 38 chained chimeric alignments) and 12 chimeric reads with coverage of at least 2.\n", "INFO:adding sample K562_a from file demonstration_dataset/ENCFF429VVB_aligned_mm2_chr8.bam\n", "100%|██████████| 107k/107k [00:20<00:00, 5.21kreads/s, chr=KI270757.1]\n", "INFO:skipped 297 reads aligned fraction of less than 0.75.\n", "INFO:skipped 23990 secondary alignments (0x100), alignment that failed quality check (0x200) or PCR duplicates (0x400)\n", "WARNING:ignored 2160 chimeric alignments with only one part aligned to specified chromosomes.\n", "INFO:ignoring 7445 chimeric alignments with less than 2 reads\n", "INFO:imported 76692 nonchimeric reads (including 57 chained chimeric alignments) and 415 chimeric reads with coverage of at least 2.\n", "INFO:adding sample K562_b from file demonstration_dataset/ENCFF696GDL_aligned_mm2_chr8.bam\n", "100%|██████████| 78.0k/78.0k [00:16<00:00, 4.87kreads/s, chr=KI270757.1]\n", "INFO:skipped 165 reads aligned fraction of less than 0.75.\n", "INFO:skipped 15026 secondary alignments (0x100), alignment that failed quality check (0x200) or PCR duplicates (0x400)\n", "WARNING:ignored 1142 chimeric alignments with only one part aligned to specified chromosomes.\n", "INFO:ignoring 4530 chimeric alignments with less than 2 reads\n", "INFO:imported 59118 nonchimeric reads (including 43 chained chimeric alignments) and 284 chimeric reads with coverage of at least 2.\n", "INFO:adding sample K562_c from file demonstration_dataset/ENCFF634YSN_aligned_mm2_chr8.bam\n", "100%|██████████| 117k/117k [00:19<00:00, 5.94kreads/s, chr=KI270757.1]\n", "INFO:skipped 294 reads aligned fraction of less than 0.75.\n", "INFO:skipped 30231 secondary alignments (0x100), alignment that failed quality check (0x200) or PCR duplicates (0x400)\n", "WARNING:ignored 2528 chimeric alignments with only one part aligned to specified chromosomes.\n", "INFO:ignoring 8019 chimeric alignments with less than 2 reads\n", "INFO:imported 80343 nonchimeric reads (including 46 chained chimeric alignments) and 371 chimeric reads with coverage of at least 2.\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namefilegroupnonchimeric_readschimeric_reads
0GM12878_ademonstration_dataset/ENCFF417VHJ_aligned_mm2_...GM128784018273
0GM12878_bdemonstration_dataset/ENCFF450VAU_aligned_mm2_...GM12878548537
0GM12878_cdemonstration_dataset/ENCFF694DIE_aligned_mm2_...GM128787245112
0K562_ademonstration_dataset/ENCFF429VVB_aligned_mm2_...K56276692415
0K562_bdemonstration_dataset/ENCFF696GDL_aligned_mm2_...K56259118284
0K562_cdemonstration_dataset/ENCFF634YSN_aligned_mm2_...K56280343371
\n", "
" ], "text/plain": [ " name file group \\\n", "0 GM12878_a demonstration_dataset/ENCFF417VHJ_aligned_mm2_... GM12878 \n", "0 GM12878_b demonstration_dataset/ENCFF450VAU_aligned_mm2_... GM12878 \n", "0 GM12878_c demonstration_dataset/ENCFF694DIE_aligned_mm2_... GM12878 \n", "0 K562_a demonstration_dataset/ENCFF429VVB_aligned_mm2_... K562 \n", "0 K562_b demonstration_dataset/ENCFF696GDL_aligned_mm2_... K562 \n", "0 K562_c demonstration_dataset/ENCFF634YSN_aligned_mm2_... K562 \n", "\n", " nonchimeric_reads chimeric_reads \n", "0 40182 73 \n", "0 54853 7 \n", "0 72451 12 \n", "0 76692 415 \n", "0 59118 284 \n", "0 80343 371 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# integrate the samples\n", "for i,row in samples.iterrows():\n", " # this step takes about 5-30 seconds per sample\n", " isoseq.add_sample_from_bam(row.file_name, sample_name=row.sample_name, group=row.group)\n", "# the sample table of the transcriptome object contains the number of imported reads\n", "isoseq.sample_table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the next step, we compute several qc metrics for the transcripts: \n", "* downstream A content, \n", "* direct repeat length at junctions, \n", "* noncanonical splicing, \n", "* potential fragments\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 10803/10803 [01:16<00:00, 141.86genes/s]\n" ] } ], "source": [ "# compute qc metrics\n", "isoseq.add_qc_metrics(genome_fn)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:saving transcriptome to PacBio_isotools_substantial_isotools.pkl\n" ] } ], "source": [ "# export the transcriptome object for later use. \n", "isoseq.save('PacBio_isotools_substantial_isotools.pkl')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.8" } }, "nbformat": 4, "nbformat_minor": 4 }