{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Setup and Data Preparation\n",
    "\n",
    "This tutorial provides guidelines for the preprocessing of the long read transcripome sequencing data as well as the preparation of reference annotation data, required for isotools analysis. **Subsequent tutorials on the isotools workflow do not depend on executing these steps, as we compiled a small preprocessed demonstration data set, based on a subset of the genome** ([download link](https://oc-molgen.gnz.mpg.de/owncloud/s/gjG9EPiQwpRAyg3)). Below, we also document how this example dataset was produced. \n",
    "\n",
    "In brief, the isotools analysis depends on the following input files:\n",
    "\n",
    "* Reference transcript annotation in gtf or gff3 format (and corresponding index)\n",
    "* Corresponding reference genome sequence in fasta format\n",
    "* Aligned long read transcriptome sequencing data in bam format (and corresponding index)\n",
    "\n",
    "In order to prepare these files, the following tools are required:\n",
    "\n",
    "* samtools (http://www.htslib.org/) for indexing of gtf/gff and bam files.\n",
    "* long read alignment tool, such as minimap2 (https://lh3.github.io/minimap2/) for genomic alignment of the long reads"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reference Genome and Annotation\n",
    "Here, we download the reference transcript annotation and genome seqeunce provided by the [GENCODE](https://www.gencodegenes.org/) project, release 42. Similar files can be obtained from other sources, such as [UCSC (RefSeq)](https://ftp.ncbi.nlm.nih.gov/refseq/) and [Ensembl](https://ftp.ensembl.org/pub/). \n",
    "The transcript annotation needs to be sorted and indexed, for efficient processing, using the samtools tabix command (http://www.htslib.org/doc/tabix.html).\n",
    "\n",
    "``` bash\n",
    "    # create a directory for the reference files\n",
    "    refdir=reference\n",
    "    mkdir -p $refdir\n",
    "    \n",
    "    # download gencode reference annotation (62 MB)\n",
    "    gff='gencode.v42.chr_patch_hapl_scaff.annotation'\n",
    "    annotation_link=ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_42/${gff}.gff3.gz\n",
    "    wget -P $refdir ${annotation_link} \n",
    " \n",
    "    # sort by chromosome and position\n",
    "    (zcat $refdir/${gff}.gff3.gz| grep ^\"#\" ; zcat $refdir/${gff}.gff3.gz|grep -v ^\"#\"| sort -k1,1 -k4,4n)|bgzip  > $refdir/${gff}_sorted.gff3.gz\n",
    "    \n",
    "    # create index\n",
    "    tabix -p gff $refdir/${gff}_sorted.gff3.gz\n",
    "    \n",
    "    # download the reference genome (849 MB)\n",
    "    genome_link='ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_42/GRCh38.p13.genome.fa.gz'\n",
    "    wget -P $refdir ${genome_link} \n",
    "    gunzip $refdir/GRCh38.p13.genome.fa.gz\n",
    "    \n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## ENCODE long read data\n",
    "\n",
    "The ENCODE project provides PacBio isoseq as well as ONT data of several cell lines and tissues.\n",
    "The data can be downloaded as reads in fastq format, or aligned bam files.\n",
    "Here we demonstrate how this resource can be accessed from within python, and how to select and download preprocessed data. You can also manually download the files using the data portal from the encode project web page (https://www.encodeproject.org/) and download aligned .bam files. \n",
    "In the following snippet, we select all PacBio Sequel II leukemia and B-cell samples available, but of course you can adapt the sample selection to your needs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Pacific Biosciences Sequel II: In total 82 experiments from ENCODE\n",
      "Oxford Nanopore PromethION: In total 4 experiments from ENCODE\n",
      "Oxford Nanopore MinION: In total 14 experiments from ENCODE\n",
      "Pacific Biosciences Sequel: In total 16 experiments from ENCODE\n"
     ]
    }
   ],
   "source": [
    "import requests\n",
    "from pathlib import Path\n",
    "from collections import Counter\n",
    "import os\n",
    "import pandas as pd\n",
    "\n",
    "\n",
    "# We prepare a subdirectory for the encode files\n",
    "Path('./encode').mkdir(parents=True, exist_ok=True) \n",
    "\n",
    "#first, check what samples are present\n",
    "\n",
    "\n",
    "data=[('type','Experiment'),\n",
    "      ('assay_title','long read RNA-seq'),\n",
    "      ('replicates.library.biosample.donor.organism.scientific_name','Homo sapiens'),\n",
    "      ('files.file_type','bam'),\n",
    "      ('files.file_type','fastq')\n",
    "     ]\n",
    "resp=requests.get(\"https://www.encodeproject.org/metadata/\", params=data)\n",
    "header, data=resp.content.decode('utf-8').split('\\n',1)\n",
    "metadata=pd.DataFrame([row.split('\\t') for row in data.split('\\n')], columns=header.split('\\t')).replace(\"\", pd.NA)\n",
    "\n",
    "# Some information, as the instrument model used for sequencing,\n",
    "# are not available for processed files, so it needs to be copied from the raw data.\n",
    "platform=metadata.set_index('Experiment accession').Platform.dropna().to_dict()\n",
    "# Number of experiments per instrument model\n",
    "for pf,count in Counter(platform.values()).items():\n",
    "    print(f'{pf}: In total {count} experiments from ENCODE')\n",
    "\n",
    "\n",
    "#Select the reads from the metadata, make sure there are only read fastq files are in the table\n",
    "all_samples=metadata[(metadata['Output type']=='reads') & (metadata['File Status']=='released')].set_index('Experiment accession').copy()\n",
    "all_samples['Platform']=[platform.get(ea,'unknown') for ea in all_samples.index] #This info is missing for some files\n",
    "col_select=['File accession','Biosample term name','Biosample type','Technical replicate(s)','Platform']\n",
    "all_samples=all_samples[col_select].sort_values(['Biosample type','Biosample term name','Technical replicate(s)'] )\n",
    "all_samples=all_samples.rename({'File accession':'reads accession'},axis=1)    \n",
    "#Add the alignments from the metadata\n",
    "alignment=metadata.set_index('Experiment accession').query('`Output type`==\"alignments\"')['File accession'].to_dict()\n",
    "all_samples.insert(1,'alignment accession', [alignment.get(idx, 'NA') for idx in all_samples.index ])\n",
    "all_samples.to_csv('encode/all_samples.csv', index=False)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>reads accession</th>\n",
       "      <th>alignment accession</th>\n",
       "      <th>Biosample term name</th>\n",
       "      <th>Biosample type</th>\n",
       "      <th>Technical replicate(s)</th>\n",
       "      <th>Platform</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>ENCFF417VHJ</td>\n",
       "      <td>ENCFF219UJG</td>\n",
       "      <td>GM12878</td>\n",
       "      <td>cell line</td>\n",
       "      <td>1_1</td>\n",
       "      <td>Pacific Biosciences Sequel II</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>ENCFF450VAU</td>\n",
       "      <td>ENCFF225CCJ</td>\n",
       "      <td>GM12878</td>\n",
       "      <td>cell line</td>\n",
       "      <td>1_1</td>\n",
       "      <td>Pacific Biosciences Sequel II</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>ENCFF694DIE</td>\n",
       "      <td>ENCFF225CCJ</td>\n",
       "      <td>GM12878</td>\n",
       "      <td>cell line</td>\n",
       "      <td>2_1</td>\n",
       "      <td>Pacific Biosciences Sequel II</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>ENCFF429VVB</td>\n",
       "      <td>ENCFF645UVN</td>\n",
       "      <td>K562</td>\n",
       "      <td>cell line</td>\n",
       "      <td>1_1</td>\n",
       "      <td>Pacific Biosciences Sequel II</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>ENCFF696GDL</td>\n",
       "      <td>ENCFF322UJU</td>\n",
       "      <td>K562</td>\n",
       "      <td>cell line</td>\n",
       "      <td>1_1</td>\n",
       "      <td>Pacific Biosciences Sequel II</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>ENCFF634YSN</td>\n",
       "      <td>ENCFF645UVN</td>\n",
       "      <td>K562</td>\n",
       "      <td>cell line</td>\n",
       "      <td>2_1</td>\n",
       "      <td>Pacific Biosciences Sequel II</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  reads accession alignment accession Biosample term name Biosample type  \\\n",
       "0     ENCFF417VHJ         ENCFF219UJG             GM12878      cell line   \n",
       "1     ENCFF450VAU         ENCFF225CCJ             GM12878      cell line   \n",
       "2     ENCFF694DIE         ENCFF225CCJ             GM12878      cell line   \n",
       "3     ENCFF429VVB         ENCFF645UVN                K562      cell line   \n",
       "4     ENCFF696GDL         ENCFF322UJU                K562      cell line   \n",
       "5     ENCFF634YSN         ENCFF645UVN                K562      cell line   \n",
       "\n",
       "  Technical replicate(s)                       Platform  \n",
       "0                    1_1  Pacific Biosciences Sequel II  \n",
       "1                    1_1  Pacific Biosciences Sequel II  \n",
       "2                    2_1  Pacific Biosciences Sequel II  \n",
       "3                    1_1  Pacific Biosciences Sequel II  \n",
       "4                    1_1  Pacific Biosciences Sequel II  \n",
       "5                    2_1  Pacific Biosciences Sequel II  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "\n",
    "#we select alignment files of hematopoetic samples sequenced on Sequel II instruments - adjust this as needed\n",
    "group={\"K562\":'CML', \"GM12878\":'B-cell'}\n",
    "selected_samples=all_samples.query('Platform == \"Pacific Biosciences Sequel II\" and `Biosample term name` in @group')\n",
    "selected_samples=selected_samples.sort_values(['Biosample term name','Technical replicate(s)'] ).reset_index(drop=True) \n",
    "selected_samples.to_csv('encode/encode_samples.csv', index=False)\n",
    "selected_samples"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "selected 6\n",
      "downloading reads file for ENCFF417VHJ\n",
      "downloading reads file for ENCFF450VAU\n",
      "downloading reads file for ENCFF694DIE\n",
      "downloading reads file for ENCFF429VVB\n",
      "downloading reads file for ENCFF696GDL\n",
      "downloading reads file for ENCFF634YSN\n"
     ]
    }
   ],
   "source": [
    "from urllib.request import urlretrieve\n",
    "import pysam\n",
    "\n",
    "\n",
    "\n",
    "#print the sample table\n",
    "print(f'selected {len(selected_samples)} samples')\n",
    "\n",
    "download='reads' # or 'alignments'\n",
    "\n",
    "\n",
    "#download and index the selected files, if not present\n",
    "for accession in selected_samples[f'{download} accession']:\n",
    "    url=metadata.loc[metadata['File accession']==accession,'File download URL'].values[0]\n",
    "    file=os.path.split(url)[1]\n",
    "    if not os.path.isfile(f\"encode/{file}\"):\n",
    "        print(f'downloading {download} file for {accession}')        \n",
    "        urlretrieve(url, f\"encode/{file}\")\n",
    "    else:\n",
    "        print(f'{download} file {accession} already found')\n",
    "    if download=='alignment':\n",
    "        bai=f\"encode/{file}.bai\"\n",
    "        if not os.path.isfile(bai) or os.path.getmtime(f\"encode/{file}\")>os.path.getmtime(bai):\n",
    "            logger.info(f'indexing {accession}')\n",
    "            pysam.index(f\"encode/{file}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Alignment\n",
    "\n",
    "There are several tools for the alignment of long reads, including minimap2 (https://github.com/lh3/minimap2), gmap (http://research-pub.gene.com/gmap/), and uLTRA (https://github.com/ksahlin/ultra).\n",
    "Here, we use minimap2, but isotools is independent of the alignment tool.\n",
    "When following the PacBio isoseq3 pipeline (https://github.com/ylipacbio/IsoSeq3), resulting sequencing reads are stored in unaligned .bam files, and need to be formatted in fastq before alignment with minimap2. The following command uses the recommended parameters for PacBio isoseq reads (with additional --MD for mutation information, which is optional), and sorts the resulting alignment by genomic position.\n",
    "\n",
    "``` bash\n",
    "  ref=${refdir}/GRCh38.p13.genome\n",
    "  ubam=/path/to/sampleX_long_read_sequences.bam\n",
    "  out=/path/to/sampleX_aligned\n",
    "  # prepare the reference\n",
    "  minimap2 -x splice:hq -d ${ref}_hq.mmi ${ref}.fa\n",
    "  # align the reads and sort (calculating MD tag is optional)\n",
    "  samtools fastq $ubam| minimap2 -t 80 -ax splice:hq -uf --MD ${ref}_hq.mmi - |samtools sort -O BAM -o ${out}.bam -\n",
    "  # index alignment file\n",
    "  samtools index ${out}.bam\n",
    "  \n",
    "```\n",
    "  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Demonstration Data\n",
    "\n",
    "To create an demonstration dataset, we aligned the encode fastq files with minimap2, and subselected reads mapping to chromosome 8 only. All resulting files (~270 mb) [can be downloaded here](https://oc-molgen.gnz.mpg.de/owncloud/s/gjG9EPiQwpRAyg3).\n",
    "``` bash\n",
    "\n",
    "\n",
    "    ref=${refdir}/GRCh38.p13.genome\n",
    "    chr_select=chr8\n",
    "\n",
    "    for fq in encode/*fastq.gz; do\n",
    "        fn=$(basename $fq)\n",
    "        id=${fn%*.fastq.gz}\n",
    "        out=encode/${id}_aligned_mm2\n",
    "        if [ ! -e ${out}.bam ];\n",
    "            # align the fastq\n",
    "            minimap2 -t 40 -ax splice:hq -uf --MD -a ${ref}_hq.mmi $fq |samtools sort -O BAM -o ${out}.bam -\n",
    "            samtools index ${out}.bam   \n",
    "            # subset the alignment\n",
    "            samtools view -b  --write-index -o ${out}_${chr_select}.bam##idx##${out}_${chr_select}.bam.bai ${out}.bam  $chr_select\n",
    "        done\n",
    "        \n",
    "    done\n",
    "    #subset the genome\n",
    "    samtools faidx ${refdir}/GRCh38.p13.genome.fa $chr_select> ${refdir}/GRCh38.p13.genome_${chr_select}.fa\n",
    "    samtools faidx ${refdir}/GRCh38.p13.genome_${chr_select}.fa\n",
    "    #subset the annotation\n",
    "    tabix -h $refdir/${gff}_sorted.gff3.gz ${chr_select}|bgzip > $refdir/${gff}_sorted_${chr_select}.gff3.gz\n",
    "    tabix -p gff $refdir/${gff}_sorted_${chr_select}.gff3.gz\n",
    "``` "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "python3.9",
   "language": "python",
   "name": "python3.9"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}