{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Setup and Data Preparation\n", "\n", "This tutorial provides guidelines for the preprocessing of the long read transcripome sequencing data as well as the preparation of reference annotation data, required for isotools analysis. **Subsequent tutorials on the isotools workflow do not depend on executing these steps, as we compiled a small preprocessed demonstration data set, based on a subset of the genome** ([download link](https://oc-molgen.gnz.mpg.de/owncloud/s/gjG9EPiQwpRAyg3)). Below, we also document how this example dataset was produced. \n", "\n", "In brief, the isotools analysis depends on the following input files:\n", "\n", "* Reference transcript annotation in gtf or gff3 format (and corresponding index)\n", "* Corresponding reference genome sequence in fasta format\n", "* Aligned long read transcriptome sequencing data in bam format (and corresponding index)\n", "\n", "In order to prepare these files, the following tools are required:\n", "\n", "* samtools (http://www.htslib.org/) for indexing of gtf/gff and bam files.\n", "* long read alignment tool, such as minimap2 (https://lh3.github.io/minimap2/) for genomic alignment of the long reads" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reference Genome and Annotation\n", "Here, we download the reference transcript annotation and genome seqeunce provided by the [GENCODE](https://www.gencodegenes.org/) project, release 42. Similar files can be obtained from other sources, such as [UCSC (RefSeq)](https://ftp.ncbi.nlm.nih.gov/refseq/) and [Ensembl](https://ftp.ensembl.org/pub/). \n", "The transcript annotation needs to be sorted and indexed, for efficient processing, using the samtools tabix command (http://www.htslib.org/doc/tabix.html).\n", "\n", "``` bash\n", " # create a directory for the reference files\n", " refdir=reference\n", " mkdir -p $refdir\n", " \n", " # download gencode reference annotation (62 MB)\n", " gff='gencode.v42.chr_patch_hapl_scaff.annotation'\n", " annotation_link=ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_42/${gff}.gff3.gz\n", " wget -P $refdir ${annotation_link} \n", " \n", " # sort by chromosome and position\n", " (zcat $refdir/${gff}.gff3.gz| grep ^\"#\" ; zcat $refdir/${gff}.gff3.gz|grep -v ^\"#\"| sort -k1,1 -k4,4n)|bgzip > $refdir/${gff}_sorted.gff3.gz\n", " \n", " # create index\n", " tabix -p gff $refdir/${gff}_sorted.gff3.gz\n", " \n", " # download the reference genome (849 MB)\n", " genome_link='ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_42/GRCh38.p13.genome.fa.gz'\n", " wget -P $refdir ${genome_link} \n", " gunzip $refdir/GRCh38.p13.genome.fa.gz\n", " \n", "```\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ENCODE long read data\n", "\n", "The ENCODE project provides PacBio isoseq as well as ONT data of several cell lines and tissues.\n", "The data can be downloaded as reads in fastq format, or aligned bam files.\n", "Here we demonstrate how this resource can be accessed from within python, and how to select and download preprocessed data. You can also manually download the files using the data portal from the encode project web page (https://www.encodeproject.org/) and download aligned .bam files. \n", "In the following snippet, we select all PacBio Sequel II leukemia and B-cell samples available, but of course you can adapt the sample selection to your needs." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Pacific Biosciences Sequel II: In total 82 experiments from ENCODE\n", "Oxford Nanopore PromethION: In total 4 experiments from ENCODE\n", "Oxford Nanopore MinION: In total 14 experiments from ENCODE\n", "Pacific Biosciences Sequel: In total 16 experiments from ENCODE\n" ] } ], "source": [ "import requests\n", "from pathlib import Path\n", "from collections import Counter\n", "import os\n", "import pandas as pd\n", "\n", "\n", "# We prepare a subdirectory for the encode files\n", "Path('./encode').mkdir(parents=True, exist_ok=True) \n", "\n", "#first, check what samples are present\n", "\n", "\n", "data=[('type','Experiment'),\n", " ('assay_title','long read RNA-seq'),\n", " ('replicates.library.biosample.donor.organism.scientific_name','Homo sapiens'),\n", " ('files.file_type','bam'),\n", " ('files.file_type','fastq')\n", " ]\n", "resp=requests.get(\"https://www.encodeproject.org/metadata/\", params=data)\n", "header, data=resp.content.decode('utf-8').split('\\n',1)\n", "metadata=pd.DataFrame([row.split('\\t') for row in data.split('\\n')], columns=header.split('\\t')).replace(\"\", pd.NA)\n", "\n", "# Some information, as the instrument model used for sequencing,\n", "# are not available for processed files, so it needs to be copied from the raw data.\n", "platform=metadata.set_index('Experiment accession').Platform.dropna().to_dict()\n", "# Number of experiments per instrument model\n", "for pf,count in Counter(platform.values()).items():\n", " print(f'{pf}: In total {count} experiments from ENCODE')\n", "\n", "\n", "#Select the reads from the metadata, make sure there are only read fastq files are in the table\n", "all_samples=metadata[(metadata['Output type']=='reads') & (metadata['File Status']=='released')].set_index('Experiment accession').copy()\n", "all_samples['Platform']=[platform.get(ea,'unknown') for ea in all_samples.index] #This info is missing for some files\n", "col_select=['File accession','Biosample term name','Biosample type','Technical replicate(s)','Platform']\n", "all_samples=all_samples[col_select].sort_values(['Biosample type','Biosample term name','Technical replicate(s)'] )\n", "all_samples=all_samples.rename({'File accession':'reads accession'},axis=1) \n", "#Add the alignments from the metadata\n", "alignment=metadata.set_index('Experiment accession').query('`Output type`==\"alignments\"')['File accession'].to_dict()\n", "all_samples.insert(1,'alignment accession', [alignment.get(idx, 'NA') for idx in all_samples.index ])\n", "all_samples.to_csv('encode/all_samples.csv', index=False)\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | reads accession | \n", "alignment accession | \n", "Biosample term name | \n", "Biosample type | \n", "Technical replicate(s) | \n", "Platform | \n", "
---|---|---|---|---|---|---|
0 | \n", "ENCFF417VHJ | \n", "ENCFF219UJG | \n", "GM12878 | \n", "cell line | \n", "1_1 | \n", "Pacific Biosciences Sequel II | \n", "
1 | \n", "ENCFF450VAU | \n", "ENCFF225CCJ | \n", "GM12878 | \n", "cell line | \n", "1_1 | \n", "Pacific Biosciences Sequel II | \n", "
2 | \n", "ENCFF694DIE | \n", "ENCFF225CCJ | \n", "GM12878 | \n", "cell line | \n", "2_1 | \n", "Pacific Biosciences Sequel II | \n", "
3 | \n", "ENCFF429VVB | \n", "ENCFF645UVN | \n", "K562 | \n", "cell line | \n", "1_1 | \n", "Pacific Biosciences Sequel II | \n", "
4 | \n", "ENCFF696GDL | \n", "ENCFF322UJU | \n", "K562 | \n", "cell line | \n", "1_1 | \n", "Pacific Biosciences Sequel II | \n", "
5 | \n", "ENCFF634YSN | \n", "ENCFF645UVN | \n", "K562 | \n", "cell line | \n", "2_1 | \n", "Pacific Biosciences Sequel II | \n", "