{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Transcriptome Reconstruction\n",
    "This tutorial processes the example dataset, based on PacBio Isoseq samples of hematopoetic cells from ENCODE. This dataset contains only a subset of genomic regions, allowing for fast processing of the demonstration tutorials. \n",
    "All required data files to run the tutorials can be obtained here: ([download link](https://oc-molgen.gnz.mpg.de/owncloud/s/gjG9EPiQwpRAyg3)). \n",
    "\n",
    "You will need:\n",
    "\n",
    "* sample description file 'encode_samples.tsv'\n",
    "* six .bam alignment files for the six samples\n",
    "* six corresponding .bam.bai indices\n",
    "* referernce annotation file gencode.v42.chr_patch_hapl_scaff.annotation_sorted_chr8.gff3.gz\n",
    "* corresponding .gff3.gz.tbi index file\n",
    "* genomic reference file GRCh38.p13.genome_chr8.fa\n",
    "* cooresponding .fai index file\n",
    "\n",
    "All files are assumed to be stored in a subfolder called 'demonstration_dataset'\n",
    "\n",
    "In this tutorial, we import the reference annotation, specify the alignment files for the samples, and integrate the data into a common data structure. During this step, the transcriptome is reconstructed, and quality control metrics are computed. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO:This is isootools version 0.3.2rc3\n"
     ]
    }
   ],
   "source": [
    "# preperation: import the libraries\n",
    "from  isotools import Transcriptome\n",
    "from isotools import __version__ as isotools_version\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import logging\n",
    "# set up logging\n",
    "logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.INFO)\n",
    "logger=logging.getLogger('isotools')\n",
    "logger.info(f'This is isootools version {isotools_version}')\n",
    "\n",
    "\n",
    "path='demonstration_dataset'\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Import of reference annotation\n",
    "The first step is to import the reference annotation from a gff or gtf file. It should be sorted and indexed with tabix. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO:importing reference from gff3 file demonstration_dataset/gencode.v42.chr_patch_hapl_scaff.annotation_sorted_chr8.gff3.gz\n",
      "100%|█████████▉| 2.82M/2.82M [00:02<00:00, 1.43MB/s]\n",
      "INFO:skipped the following categories: {'three_prime_UTR', 'five_prime_UTR', 'CDS'}\n",
      "WARNING:Missing genes! Found gene information in categories ['gene'] for 2540/5080 genes\n"
     ]
    }
   ],
   "source": [
    "annotation_fn=f'{path}/gencode.v42.chr_patch_hapl_scaff.annotation_sorted_chr8.gff3.gz'\n",
    "#create the IsoTools transcriptome object from the reference annotation\n",
    "isoseq=Transcriptome.from_reference(annotation_fn)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Import of sequencing data\n",
    "Next the sample table is imported, and sequencing information is added for each sample. After importing the alignments in bam format, quality control metrics are calculated, and the object with all data is stored on disk in a pickle file for later use."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sample_name</th>\n",
       "      <th>file_name</th>\n",
       "      <th>group</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>GM12878_a</td>\n",
       "      <td>demonstration_dataset/ENCFF417VHJ_aligned_mm2_...</td>\n",
       "      <td>GM12878</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>GM12878_b</td>\n",
       "      <td>demonstration_dataset/ENCFF450VAU_aligned_mm2_...</td>\n",
       "      <td>GM12878</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>GM12878_c</td>\n",
       "      <td>demonstration_dataset/ENCFF694DIE_aligned_mm2_...</td>\n",
       "      <td>GM12878</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>K562_a</td>\n",
       "      <td>demonstration_dataset/ENCFF429VVB_aligned_mm2_...</td>\n",
       "      <td>K562</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>K562_b</td>\n",
       "      <td>demonstration_dataset/ENCFF696GDL_aligned_mm2_...</td>\n",
       "      <td>K562</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>K562_c</td>\n",
       "      <td>demonstration_dataset/ENCFF634YSN_aligned_mm2_...</td>\n",
       "      <td>K562</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  sample_name                                          file_name    group\n",
       "0   GM12878_a  demonstration_dataset/ENCFF417VHJ_aligned_mm2_...  GM12878\n",
       "1   GM12878_b  demonstration_dataset/ENCFF450VAU_aligned_mm2_...  GM12878\n",
       "2   GM12878_c  demonstration_dataset/ENCFF694DIE_aligned_mm2_...  GM12878\n",
       "3      K562_a  demonstration_dataset/ENCFF429VVB_aligned_mm2_...     K562\n",
       "4      K562_b  demonstration_dataset/ENCFF696GDL_aligned_mm2_...     K562\n",
       "5      K562_c  demonstration_dataset/ENCFF634YSN_aligned_mm2_...     K562"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sample_fn=f'{path}/encode_samples.tsv'\n",
    "genome_fn=f'{path}/GRCh38.p13.genome_chr8.fa'\n",
    "\n",
    "samples=pd.read_csv(sample_fn, sep='\\t')\n",
    "samples.file_name=path+'/'+samples.file_name\n",
    "samples"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO:adding sample GM12878_a from file demonstration_dataset/ENCFF417VHJ_aligned_mm2_chr8.bam\n",
      "100%|██████████| 53.0k/53.0k [00:13<00:00, 4.05kreads/s, chr=KI270757.1]\n",
      "INFO:skipped 110 reads aligned fraction of less than 0.75.\n",
      "INFO:skipped 10972 secondary alignments (0x100), alignment that failed quality check (0x200) or PCR duplicates (0x400)\n",
      "WARNING:ignored 533 chimeric alignments with only one part aligned to specified chromosomes.\n",
      "INFO:ignoring 2231 chimeric alignments with less than 2 reads\n",
      "INFO:imported 40182 nonchimeric reads (including  14 chained chimeric alignments) and 73 chimeric reads with coverage of at least 2.\n",
      "INFO:adding sample GM12878_b from file demonstration_dataset/ENCFF450VAU_aligned_mm2_chr8.bam\n",
      "100%|██████████| 68.4k/68.4k [00:12<00:00, 5.36kreads/s, chr=KI270757.1]\n",
      "INFO:skipped 71 reads aligned fraction of less than 0.75.\n",
      "INFO:skipped 12700 secondary alignments (0x100), alignment that failed quality check (0x200) or PCR duplicates (0x400)\n",
      "WARNING:ignored 484 chimeric alignments with only one part aligned to specified chromosomes.\n",
      "INFO:ignoring 1273 chimeric alignments with less than 2 reads\n",
      "INFO:imported 54853 nonchimeric reads (including  12 chained chimeric alignments) and 7 chimeric reads with coverage of at least 2.\n",
      "INFO:adding sample GM12878_c from file demonstration_dataset/ENCFF694DIE_aligned_mm2_chr8.bam\n",
      "100%|██████████| 90.7k/90.7k [00:15<00:00, 5.89kreads/s, chr=KI270757.1]\n",
      "INFO:skipped 85 reads aligned fraction of less than 0.75.\n",
      "INFO:skipped 17261 secondary alignments (0x100), alignment that failed quality check (0x200) or PCR duplicates (0x400)\n",
      "WARNING:ignored 455 chimeric alignments with only one part aligned to specified chromosomes.\n",
      "INFO:ignoring 1410 chimeric alignments with less than 2 reads\n",
      "INFO:imported 72451 nonchimeric reads (including  38 chained chimeric alignments) and 12 chimeric reads with coverage of at least 2.\n",
      "INFO:adding sample K562_a from file demonstration_dataset/ENCFF429VVB_aligned_mm2_chr8.bam\n",
      "100%|██████████| 107k/107k [00:20<00:00, 5.21kreads/s, chr=KI270757.1]\n",
      "INFO:skipped 297 reads aligned fraction of less than 0.75.\n",
      "INFO:skipped 23990 secondary alignments (0x100), alignment that failed quality check (0x200) or PCR duplicates (0x400)\n",
      "WARNING:ignored 2160 chimeric alignments with only one part aligned to specified chromosomes.\n",
      "INFO:ignoring 7445 chimeric alignments with less than 2 reads\n",
      "INFO:imported 76692 nonchimeric reads (including  57 chained chimeric alignments) and 415 chimeric reads with coverage of at least 2.\n",
      "INFO:adding sample K562_b from file demonstration_dataset/ENCFF696GDL_aligned_mm2_chr8.bam\n",
      "100%|██████████| 78.0k/78.0k [00:16<00:00, 4.87kreads/s, chr=KI270757.1]\n",
      "INFO:skipped 165 reads aligned fraction of less than 0.75.\n",
      "INFO:skipped 15026 secondary alignments (0x100), alignment that failed quality check (0x200) or PCR duplicates (0x400)\n",
      "WARNING:ignored 1142 chimeric alignments with only one part aligned to specified chromosomes.\n",
      "INFO:ignoring 4530 chimeric alignments with less than 2 reads\n",
      "INFO:imported 59118 nonchimeric reads (including  43 chained chimeric alignments) and 284 chimeric reads with coverage of at least 2.\n",
      "INFO:adding sample K562_c from file demonstration_dataset/ENCFF634YSN_aligned_mm2_chr8.bam\n",
      "100%|██████████| 117k/117k [00:19<00:00, 5.94kreads/s, chr=KI270757.1]\n",
      "INFO:skipped 294 reads aligned fraction of less than 0.75.\n",
      "INFO:skipped 30231 secondary alignments (0x100), alignment that failed quality check (0x200) or PCR duplicates (0x400)\n",
      "WARNING:ignored 2528 chimeric alignments with only one part aligned to specified chromosomes.\n",
      "INFO:ignoring 8019 chimeric alignments with less than 2 reads\n",
      "INFO:imported 80343 nonchimeric reads (including  46 chained chimeric alignments) and 371 chimeric reads with coverage of at least 2.\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>file</th>\n",
       "      <th>group</th>\n",
       "      <th>nonchimeric_reads</th>\n",
       "      <th>chimeric_reads</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>GM12878_a</td>\n",
       "      <td>demonstration_dataset/ENCFF417VHJ_aligned_mm2_...</td>\n",
       "      <td>GM12878</td>\n",
       "      <td>40182</td>\n",
       "      <td>73</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>GM12878_b</td>\n",
       "      <td>demonstration_dataset/ENCFF450VAU_aligned_mm2_...</td>\n",
       "      <td>GM12878</td>\n",
       "      <td>54853</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>GM12878_c</td>\n",
       "      <td>demonstration_dataset/ENCFF694DIE_aligned_mm2_...</td>\n",
       "      <td>GM12878</td>\n",
       "      <td>72451</td>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>K562_a</td>\n",
       "      <td>demonstration_dataset/ENCFF429VVB_aligned_mm2_...</td>\n",
       "      <td>K562</td>\n",
       "      <td>76692</td>\n",
       "      <td>415</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>K562_b</td>\n",
       "      <td>demonstration_dataset/ENCFF696GDL_aligned_mm2_...</td>\n",
       "      <td>K562</td>\n",
       "      <td>59118</td>\n",
       "      <td>284</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>K562_c</td>\n",
       "      <td>demonstration_dataset/ENCFF634YSN_aligned_mm2_...</td>\n",
       "      <td>K562</td>\n",
       "      <td>80343</td>\n",
       "      <td>371</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        name                                               file    group  \\\n",
       "0  GM12878_a  demonstration_dataset/ENCFF417VHJ_aligned_mm2_...  GM12878   \n",
       "0  GM12878_b  demonstration_dataset/ENCFF450VAU_aligned_mm2_...  GM12878   \n",
       "0  GM12878_c  demonstration_dataset/ENCFF694DIE_aligned_mm2_...  GM12878   \n",
       "0     K562_a  demonstration_dataset/ENCFF429VVB_aligned_mm2_...     K562   \n",
       "0     K562_b  demonstration_dataset/ENCFF696GDL_aligned_mm2_...     K562   \n",
       "0     K562_c  demonstration_dataset/ENCFF634YSN_aligned_mm2_...     K562   \n",
       "\n",
       "  nonchimeric_reads chimeric_reads  \n",
       "0             40182             73  \n",
       "0             54853              7  \n",
       "0             72451             12  \n",
       "0             76692            415  \n",
       "0             59118            284  \n",
       "0             80343            371  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# integrate the samples\n",
    "for i,row in samples.iterrows():\n",
    "    # this step takes about 5-30 seconds per sample\n",
    "     isoseq.add_sample_from_bam(row.file_name, sample_name=row.sample_name, group=row.group)\n",
    "# the sample table of the transcriptome object contains the number of imported reads\n",
    "isoseq.sample_table"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the next step, we compute several qc metrics for the transcripts: \n",
    "* downstream A content, \n",
    "* direct repeat length at junctions, \n",
    "* noncanonical splicing, \n",
    "* potential fragments\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 10803/10803 [01:16<00:00, 141.86genes/s]\n"
     ]
    }
   ],
   "source": [
    "# compute qc metrics\n",
    "isoseq.add_qc_metrics(genome_fn)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO:saving transcriptome to PacBio_isotools_substantial_isotools.pkl\n"
     ]
    }
   ],
   "source": [
    "# export the transcriptome object for later use. \n",
    "isoseq.save('PacBio_isotools_substantial_isotools.pkl')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}