{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# ENCODE Long Read Sequel2 dataset\n", "\n", "This tutorial walks through a typical Long Read Transcriptome Sequencing (LRTS) analysis workflow with isotools, \n", "using ENCODE Iso-Seq Sequel II data. It demonstrates the analysis of alternative splicing events within and between sample groups. \n", "\n", "The original analysis integrates a large number of samples and consequently runs for several hours. Users who want to reproduce the notebook should consider restricting the data to the samples they are interested in.\n", "\n", "## Preparation\n", "In this notebook, \"long read RNA-seq\" samples are downloaded from ENCODE to the \"encode\" subdirectory. You can also download the aligned .bam files manually from the data portal (https://www.encodeproject.org/). Here I use all Sequel II leukemia and B-cell samples available at the time of writing, but you can choose to process a subset. \n", "\n", "Further, you need a reference annotation and a genome FASTA file. Please see the Alzheimer tutorial for how to obtain these files. 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data import" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:This is isootools version 0.2.7\n" ] } ], "source": [ "\n", "from isotools import Transcriptome\n", "from isotools import __version__ as isotools_version\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "import os\n", "from pathlib import Path\n", "import logging\n", "from collections import Counter\n", "from urllib.request import urlretrieve\n", "import pysam\n", "\n", "logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.INFO)\n", "logger=logging.getLogger('isotools')\n", "logger.info(f'This is isootools version {isotools_version}')\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:Oxford Nanopore PromethION: In total 4 samples from ENCODE\n", "INFO:Pacific Biosciences Sequel: In total 16 samples from ENCODE\n", "INFO:Pacific Biosciences Sequel II: In total 58 samples from ENCODE\n", "INFO:Oxford Nanopore MinION: In total 11 samples from ENCODE\n", "INFO:downloading bam file for ENCFF648NAR\n", "INFO:indexing ENCFF648NAR\n", "INFO:downloading bam file for ENCFF225CCJ\n", "INFO:indexing ENCFF225CCJ\n", "INFO:downloading bam file for ENCFF219UJG\n", "INFO:indexing ENCFF219UJG\n", "INFO:downloading bam file for ENCFF810FRP\n", "INFO:indexing ENCFF810FRP\n", "INFO:downloading bam file for ENCFF600MGT\n", "INFO:indexing ENCFF600MGT\n", "INFO:downloading bam file for ENCFF322UJU\n", "INFO:indexing ENCFF322UJU\n", "INFO:downloading bam file for ENCFF661OEY\n", "INFO:indexing ENCFF661OEY\n", "INFO:downloading bam file for ENCFF645UVN\n", "INFO:indexing ENCFF645UVN\n" ] }, { "data": { "text/html": [ "
| | File accession | Output type | Biosample term name | Biosample type | Technical replicate(s) | Platform |
|---|---|---|---|---|---|---|
| 0 | ENCFF648NAR | alignments | GM12878 | cell line | 1_1 | Pacific Biosciences Sequel II |
| 1 | ENCFF225CCJ | alignments | GM12878 | cell line | 2_1 | Pacific Biosciences Sequel II |
| 2 | ENCFF219UJG | alignments | GM12878 | cell line | 1_1 | Pacific Biosciences Sequel II |
| 3 | ENCFF810FRP | alignments | HL-60 | cell line | 1_1 | Pacific Biosciences Sequel II |
| 4 | ENCFF600MGT | alignments | HL-60 | cell line | 2_1 | Pacific Biosciences Sequel II |
| 5 | ENCFF322UJU | alignments | K562 | cell line | 1_1 | Pacific Biosciences Sequel II |
| 6 | ENCFF661OEY | alignments | K562 | cell line | 1_1 | Pacific Biosciences Sequel II |
| 7 | ENCFF645UVN | alignments | K562 | cell line | 2_1 | Pacific Biosciences Sequel II |
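
The log messages above ("downloading bam file for ...", "indexing ...") correspond to a download-and-index step per file accession. The following is a minimal sketch of such a step, not the notebook's own code. The URL pattern in `ENCODE_URL` and the helper names `bam_url`, `bam_path` and `fetch_and_index` are assumptions made for illustration; only `urlretrieve` and `pysam.index` are existing library calls.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Assumption: ENCODE serves files under this URL pattern; verify it on the data portal.
ENCODE_URL = "https://www.encodeproject.org/files/{acc}/@@download/{acc}.bam"


def bam_url(accession: str) -> str:
    """Build the (assumed) download URL for an ENCODE file accession."""
    return ENCODE_URL.format(acc=accession)


def bam_path(accession: str, out_dir: str = "encode") -> Path:
    """Local target path for the alignment file of one accession."""
    return Path(out_dir) / f"{accession}.bam"


def fetch_and_index(accession: str, out_dir: str = "encode") -> Path:
    """Download one BAM file if it is missing, then index it with pysam."""
    import pysam  # imported lazily so the helpers above work without pysam installed

    target = bam_path(accession, out_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        urlretrieve(bam_url(accession), str(target))
    # pysam.index wraps "samtools index" and writes <file>.bam.bai next to the BAM file
    index = Path(str(target) + ".bai")
    if not index.exists():
        pysam.index(str(target))
    return target
```

Calling `fetch_and_index("ENCFF648NAR")` would place the file in the "encode" subdirectory. Skipping files that already exist keeps reruns of a long analysis cheap.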
| | index | gene | gene_id | chrom | strand | start | end | splice_type | novel | padj | ... | GM12878_1_1_GM12878_in_cov | GM12878_1_1_GM12878_total_cov | GM12878_2_1_GM12878_in_cov | GM12878_2_1_GM12878_total_cov | GM12878_1_1_b_GM12878_in_cov | GM12878_1_1_b_GM12878_total_cov | HL-60_1_1_HL-60_in_cov | HL-60_1_1_HL-60_total_cov | HL-60_2_1_HL-60_in_cov | HL-60_2_1_HL-60_total_cov |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 13046 | EIF4A2 | ENSG00000156976.17 | chr3 | + | 186787882 | 186789124 | ES | False | 0.003859 | ... | 234 | 708 | 333 | 966 | 49 | 132 | 47.0 | 1748.0 | 26.0 | 1205.0 |
| 4 | 13047 | EIF4A2 | ENSG00000156976.17 | chr3 | + | 186787882 | 186789124 | 3AS | True | 0.004204 | ... | 210 | 684 | 250 | 883 | 39 | 122 | 25.0 | 1726.0 | 14.0 | 1193.0 |
| 9 | 3869 | ANAPC5 | ENSG00000089053.13 | chr12 | - | 121328458 | 121330582 | 3AS | False | 0.006753 | ... | 895 | 1340 | 1169 | 1727 | 201 | 305 | 470.0 | 1206.0 | 404.0 | 989.0 |
| 13 | 15216 | MCM3 | ENSG00000112118.20 | chr6 | - | 52264786 | 52266074 | ES | False | 0.006753 | ... | 47 | 2403 | 31 | 1979 | 22 | 704 | 519.0 | 2363.0 | 443.0 | 1955.0 |
| 14 | 10047 | CBWD2 | ENSG00000136682.15 | chr2 | + | 113438022 | 113441350 | ES | False | 0.006753 | ... | 24 | 209 | 37 | 326 | 12 | 88 | 135.0 | 199.0 | 83.0 | 117.0 |
| 17 | 16292 | METTL2B | ENSG00000165055.16 | chr7 | + | 128476875 | 128479157 | ES | False | 0.006753 | ... | 5 | 88 | 9 | 144 | 3 | 44 | 122.0 | 123.0 | 67.0 | 68.0 |
| 20 | 1711 | SRP9 | ENSG00000143742.14 | chr1 | + | 225783368 | 225789239 | ES | False | 0.007374 | ... | 95 | 1453 | 173 | 2368 | 24 | 266 | 305.0 | 1134.0 | 208.0 | 749.0 |
| 23 | 7847 | WSB1 | ENSG00000109046.15 | chr17 | + | 27306882 | 27309099 | 3AS | False | 0.007374 | ... | 25 | 403 | 21 | 621 | 6 | 63 | 101.0 | 108.0 | 144.0 | 155.0 |
| 24 | 5145 | ZNF410 | ENSG00000119725.20 | chr14 | + | 73921105 | 73923394 | ES | False | 0.007374 | ... | 332 | 494 | 497 | 772 | 88 | 134 | 350.0 | 353.0 | 283.0 | 283.0 |
| 26 | 1284 | LGALS8 | ENSG00000116977.19 | chr1 | + | 236542787 | 236543559 | ES | False | 0.007374 | ... | 82 | 111 | 107 | 150 | 13 | 17 | 14.0 | 154.0 | 9.0 | 120.0 |

10 rows × 29 columns
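
To read the coverage columns of this table, it can help to turn them into a proportion per group. The sketch below assumes that `<sample>_<group>_in_cov` counts the reads supporting the alternative event and `<sample>_<group>_total_cov` counts all reads covering it; that reading of the column names is an assumption, and `pooled_proportion` is a hypothetical helper, not an isotools function. The values are the METTL2B exon skipping (ES) row from the table above.

```python
def pooled_proportion(row: dict, group: str) -> float:
    """Pool reads over all samples of a group and return in_cov / total_cov."""
    in_cov = sum(v for k, v in row.items() if k.endswith(f"_{group}_in_cov"))
    total_cov = sum(v for k, v in row.items() if k.endswith(f"_{group}_total_cov"))
    # Without any covering reads the proportion is undefined
    return in_cov / total_cov if total_cov else float("nan")


# Coverage values of the METTL2B ES event, copied from the table
mettl2b = {
    "GM12878_1_1_GM12878_in_cov": 5, "GM12878_1_1_GM12878_total_cov": 88,
    "GM12878_2_1_GM12878_in_cov": 9, "GM12878_2_1_GM12878_total_cov": 144,
    "GM12878_1_1_b_GM12878_in_cov": 3, "GM12878_1_1_b_GM12878_total_cov": 44,
    "HL-60_1_1_HL-60_in_cov": 122, "HL-60_1_1_HL-60_total_cov": 123,
    "HL-60_2_1_HL-60_in_cov": 67, "HL-60_2_1_HL-60_total_cov": 68,
}

prop_gm12878 = pooled_proportion(mettl2b, "GM12878")  # 17 of 276 reads
prop_hl60 = pooled_proportion(mettl2b, "HL-60")       # 189 of 191 reads
print(f"GM12878: {prop_gm12878:.3f}, HL-60: {prop_hl60:.3f}")
```

Pooling reads before dividing weights each sample by its coverage; averaging per-sample proportions instead would give low-coverage replicates the same weight as deep ones.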
| | index | gene | gene_id | chrom | strand | start | end | splice_type | novel | padj | ... | GM12878_2_1_GM12878_in_cov | GM12878_2_1_GM12878_total_cov | GM12878_1_1_b_GM12878_in_cov | GM12878_1_1_b_GM12878_total_cov | K562_1_1_K562_in_cov | K562_1_1_K562_total_cov | K562_2_1_K562_in_cov | K562_2_1_K562_total_cov | K562_1_1_b_K562_in_cov | K562_1_1_b_K562_total_cov |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1124 | NDUFS5 | ENSG00000168653.11 | chr1 | + | 39026398 | 39028722 | 5AS | False | 0.001004 | ... | 7 | 634 | 3 | 199 | 208 | 372 | 336 | 580 | 136 | 249 |
| 5 | 20075 | RIPK2 | ENSG00000104312.8 | chr8 | + | 89780160 | 89786592 | ES | True | 0.004049 | ... | 224 | 225 | 76 | 76 | 34 | 104 | 33 | 112 | 41 | 152 |
| 8 | 616 | PSMB2 | ENSG00000126067.12 | chr1 | - | 35631344 | 35641341 | ES | True | 0.005460 | ... | 954 | 954 | 257 | 258 | 681 | 893 | 595 | 765 | 262 | 341 |
| 9 | 4998 | MYL6 | ENSG00000092841.19 | chr12 | + | 56160320 | 56161386 | ES | False | 0.005460 | ... | 24 | 1827 | 22 | 911 | 188 | 722 | 235 | 940 | 182 | 611 |
| 13 | 15657 | CASP3 | ENSG00000164305.19 | chr4 | - | 184638468 | 184649394 | ES | True | 0.005645 | ... | 197 | 958 | 26 | 111 | 71 | 105 | 75 | 119 | 71 | 95 |
| 21 | 5028 | RAD51AP1 | ENSG00000111247.15 | chr12 | + | 4553147 | 4558856 | ES | False | 0.006496 | ... | 44 | 48 | 34 | 38 | 5 | 34 | 7 | 39 | 7 | 41 |
| 22 | 6326 | HNRNPC | ENSG00000092199.18 | chr14 | - | 21230366 | 21230996 | 5AS | False | 0.006782 | ... | 387 | 2092 | 168 | 799 | 431 | 1287 | 480 | 1475 | 361 | 1094 |
| 24 | 19040 | METTL2B | ENSG00000165055.16 | chr7 | + | 128476875 | 128479157 | ES | False | 0.006782 | ... | 9 | 144 | 3 | 44 | 68 | 71 | 61 | 63 | 71 | 71 |
| 33 | 20288 | CBWD1 | ENSG00000172785.18 | chr9 | - | 175784 | 178815 | ES | False | 0.007260 | ... | 99 | 561 | 25 | 138 | 107 | 227 | 125 | 248 | 50 | 112 |
| 39 | 1415 | MPZL1 | ENSG00000197965.12 | chr1 | + | 167773368 | 167787819 | ES | False | 0.009935 | ... | 0 | 4 | 6 | 65 | 168 | 273 | 144 | 239 | 124 | 190 |

10 rows × 31 columns
| | index | gene | gene_id | chrom | strand | start | end | splice_type | novel | padj | ... | HL-60_1_1_leukemia_in_cov | HL-60_1_1_leukemia_total_cov | HL-60_2_1_leukemia_in_cov | HL-60_2_1_leukemia_total_cov | GM12878_1_1_GM12878_in_cov | GM12878_1_1_GM12878_total_cov | GM12878_2_1_GM12878_in_cov | GM12878_2_1_GM12878_total_cov | GM12878_1_1_b_GM12878_in_cov | GM12878_1_1_b_GM12878_total_cov |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 20601 | METTL2B | ENSG00000165055.16 | chr7 | + | 128476875 | 128479157 | ES | False | 0.001373 | ... | 122 | 123 | 67 | 68 | 5 | 88 | 9.0 | 144.0 | 3.0 | 44.0 |
| 7 | 21875 | CBWD1 | ENSG00000172785.18 | chr9 | - | 175784 | 178815 | ES | False | 0.002528 | ... | 192 | 347 | 120 | 228 | 60 | 302 | 99.0 | 561.0 | 25.0 | 138.0 |
| 9 | 1232 | NDUFS5 | ENSG00000168653.11 | chr1 | + | 39026398 | 39028722 | 5AS | False | 0.002679 | ... | 17 | 44 | 13 | 43 | 5 | 393 | 7.0 | 634.0 | 3.0 | 199.0 |
| 10 | 7656 | INTS14 | ENSG00000138614.15 | chr15 | - | 65607442 | 65611097 | 5AS | False | 0.003423 | ... | 135 | 224 | 108 | 178 | 35 | 158 | 43.0 | 222.0 | 18.0 | 54.0 |
| 14 | 922 | C1orf43 | ENSG00000143612.21 | chr1 | - | 154214574 | 154220341 | ES | False | 0.003568 | ... | 1157 | 1627 | 768 | 1053 | 159 | 349 | 237.0 | 497.0 | 49.0 | 106.0 |
| 18 | 1402 | ACADM | ENSG00000117054.14 | chr1 | + | 75724817 | 75732643 | ES | False | 0.003568 | ... | 264 | 325 | 206 | 256 | 802 | 836 | 1104.0 | 1159.0 | 155.0 | 164.0 |
| 28 | 19864 | MRM2 | ENSG00000122687.19 | chr7 | - | 2235564 | 2239417 | ES | False | 0.005016 | ... | 94 | 370 | 70 | 301 | 29 | 400 | 25.0 | 460.0 | 9.0 | 107.0 |
| 29 | 14268 | DIDO1 | ENSG00000101191.17 | chr20 | - | 62914406 | 62926438 | 5AS | True | 0.005016 | ... | 175 | 175 | 156 | 156 | 38 | 76 | 28.0 | 58.0 | 29.0 | 50.0 |
| 30 | 1685 | PTPRC | ENSG00000081237.20 | chr1 | + | 198692373 | 198703297 | ES | False | 0.005016 | ... | 67 | 297 | 58 | 331 | 186 | 227 | 259.0 | 307.0 | 57.0 | 62.0 |
| 32 | 19563 | ECHDC1 | ENSG00000093144.19 | chr6 | - | 127314896 | 127327001 | ES | False | 0.005041 | ... | 148 | 272 | 109 | 210 | 320 | 363 | 441.0 | 502.0 | 66.0 | 88.0 |

10 rows × 35 columns
\n", "