Setup and Data Preparation

This tutorial provides guidelines for the preprocessing of the long read transcripome sequencing data as well as the preparation of reference annotation data, required for isotools analysis. Subsequent tutorials on the isotools workflow do not depend on executing these steps, as we compiled a small preprocessed demonstration data set, based on a subset of the genome (download link). Below, we also document how this example dataset was produced.

In brief, the isotools analysis depends on the following input files:

  • Reference transcript annotation in gtf or gff3 format (and corresponding index)

  • Corresponding reference genome sequence in fasta format

  • Aligned long read transcriptome sequencing data in bam format (and corresponding index)

In order to prepare these files, the following tools are required:

Reference Genome and Annotation

Here, we download the reference transcript annotation and genome seqeunce provided by the GENCODE project, release 42. Similar files can be obtained from other sources, such as UCSC (RefSeq) and Ensembl. The transcript annotation needs to be sorted and indexed, for efficient processing, using the samtools tabix command (http://www.htslib.org/doc/tabix.html).

# create a directory for the reference files
refdir=reference
mkdir -p $refdir

# download gencode reference annotation (62 MB)
gff='gencode.v42.chr_patch_hapl_scaff.annotation'
annotation_link=ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_42/${gff}.gff3.gz
wget -P $refdir ${annotation_link}

# sort by chromosome and position
(zcat $refdir/${gff}.gff3.gz| grep ^"#" ; zcat $refdir/${gff}.gff3.gz|grep -v ^"#"| sort -k1,1 -k4,4n)|bgzip  > $refdir/${gff}_sorted.gff3.gz

# create index
tabix -p gff $refdir/${gff}_sorted.gff3.gz

# download the reference genome (849 MB)
genome_link='ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_42/GRCh38.p13.genome.fa.gz'
wget -P $refdir ${genome_link}
gunzip $refdir/GRCh38.p13.genome.fa.gz

ENCODE long read data

The ENCODE project provides PacBio isoseq as well as ONT data of several cell lines and tissues. The data can be downloaded as reads in fastq format, or aligned bam files. Here we demonstrate how this resource can be accessed from within python, and how to select and download preprocessed data. You can also manually download the files using the data portal from the encode project web page (https://www.encodeproject.org/) and download aligned .bam files. In the following snippet, we select all PacBio Sequel II leukemia and B-cell samples available, but of course you can adapt the sample selection to your needs.

[2]:
import requests
from pathlib import Path
from collections import Counter
import os
import pandas as pd


# We prepare a subdirectory for the encode files
Path('./encode').mkdir(parents=True, exist_ok=True)

#first, check what samples are present


data=[('type','Experiment'),
      ('assay_title','long read RNA-seq'),
      ('replicates.library.biosample.donor.organism.scientific_name','Homo sapiens'),
      ('files.file_type','bam'),
      ('files.file_type','fastq')
     ]
resp=requests.get("https://www.encodeproject.org/metadata/", params=data)
header, data=resp.content.decode('utf-8').split('\n',1)
metadata=pd.DataFrame([row.split('\t') for row in data.split('\n')], columns=header.split('\t')).replace("", pd.NA)

# Some information, as the instrument model used for sequencing,
# are not available for processed files, so it needs to be copied from the raw data.
platform=metadata.set_index('Experiment accession').Platform.dropna().to_dict()
# Number of experiments per instrument model
for pf,count in Counter(platform.values()).items():
    print(f'{pf}: In total {count} experiments from ENCODE')


#Select the reads from the metadata, make sure there are only read fastq files are in the table
all_samples=metadata[(metadata['Output type']=='reads') & (metadata['File Status']=='released')].set_index('Experiment accession').copy()
all_samples['Platform']=[platform.get(ea,'unknown') for ea in all_samples.index] #This info is missing for some files
col_select=['File accession','Biosample term name','Biosample type','Technical replicate(s)','Platform']
all_samples=all_samples[col_select].sort_values(['Biosample type','Biosample term name','Technical replicate(s)'] )
all_samples=all_samples.rename({'File accession':'reads accession'},axis=1)
#Add the alignments from the metadata
alignment=metadata.set_index('Experiment accession').query('`Output type`=="alignments"')['File accession'].to_dict()
all_samples.insert(1,'alignment accession', [alignment.get(idx, 'NA') for idx in all_samples.index ])
all_samples.to_csv('encode/all_samples.csv', index=False)

Pacific Biosciences Sequel II: In total 82 experiments from ENCODE
Oxford Nanopore PromethION: In total 4 experiments from ENCODE
Oxford Nanopore MinION: In total 14 experiments from ENCODE
Pacific Biosciences Sequel: In total 16 experiments from ENCODE
[3]:

#we select alignment files of hematopoetic samples sequenced on Sequel II instruments - adjust this as needed
group={"K562":'CML', "GM12878":'B-cell'}
selected_samples=all_samples.query('Platform == "Pacific Biosciences Sequel II" and `Biosample term name` in @group')
selected_samples=selected_samples.sort_values(['Biosample term name','Technical replicate(s)'] ).reset_index(drop=True)
selected_samples.to_csv('encode/encode_samples.csv', index=False)
selected_samples
[3]:
reads accession alignment accession Biosample term name Biosample type Technical replicate(s) Platform
0 ENCFF417VHJ ENCFF219UJG GM12878 cell line 1_1 Pacific Biosciences Sequel II
1 ENCFF450VAU ENCFF225CCJ GM12878 cell line 1_1 Pacific Biosciences Sequel II
2 ENCFF694DIE ENCFF225CCJ GM12878 cell line 2_1 Pacific Biosciences Sequel II
3 ENCFF429VVB ENCFF645UVN K562 cell line 1_1 Pacific Biosciences Sequel II
4 ENCFF696GDL ENCFF322UJU K562 cell line 1_1 Pacific Biosciences Sequel II
5 ENCFF634YSN ENCFF645UVN K562 cell line 2_1 Pacific Biosciences Sequel II
[4]:
from urllib.request import urlretrieve
import pysam



#print the sample table
print(f'selected {len(selected_samples)} samples')

download='reads' # or 'alignments'


#download and index the selected files, if not present
for accession in selected_samples[f'{download} accession']:
    url=metadata.loc[metadata['File accession']==accession,'File download URL'].values[0]
    file=os.path.split(url)[1]
    if not os.path.isfile(f"encode/{file}"):
        print(f'downloading {download} file for {accession}')
        urlretrieve(url, f"encode/{file}")
    else:
        print(f'{download} file {accession} already found')
    if download=='alignment':
        bai=f"encode/{file}.bai"
        if not os.path.isfile(bai) or os.path.getmtime(f"encode/{file}")>os.path.getmtime(bai):
            logger.info(f'indexing {accession}')
            pysam.index(f"encode/{file}")

selected 6
downloading reads file for ENCFF417VHJ
downloading reads file for ENCFF450VAU
downloading reads file for ENCFF694DIE
downloading reads file for ENCFF429VVB
downloading reads file for ENCFF696GDL
downloading reads file for ENCFF634YSN

Alignment

There are several tools for the alignment of long reads, including minimap2 (https://github.com/lh3/minimap2), gmap (http://research-pub.gene.com/gmap/), and uLTRA (https://github.com/ksahlin/ultra). Here, we use minimap2, but isotools is independent of the alignment tool. When following the PacBio isoseq3 pipeline (https://github.com/ylipacbio/IsoSeq3), resulting sequencing reads are stored in unaligned .bam files, and need to be formatted in fastq before alignment with minimap2. The following command uses the recommended parameters for PacBio isoseq reads (with additional –MD for mutation information, which is optional), and sorts the resulting alignment by genomic position.

ref=${refdir}/GRCh38.p13.genome
ubam=/path/to/sampleX_long_read_sequences.bam
out=/path/to/sampleX_aligned
# prepare the reference
minimap2 -x splice:hq -d ${ref}_hq.mmi ${ref}.fa
# align the reads and sort (calculating MD tag is optional)
samtools fastq $ubam| minimap2 -t 80 -ax splice:hq -uf --MD ${ref}_hq.mmi - |samtools sort -O BAM -o ${out}.bam -
# index alignment file
samtools index ${out}.bam

Demonstration Data

To create an demonstration dataset, we aligned the encode fastq files with minimap2, and subselected reads mapping to chromosome 8 only. All resulting files (~270 mb) can be downloaded here.

ref=${refdir}/GRCh38.p13.genome
chr_select=chr8

for fq in encode/*fastq.gz; do
    fn=$(basename $fq)
    id=${fn%*.fastq.gz}
    out=encode/${id}_aligned_mm2
    if [ ! -e ${out}.bam ];
        # align the fastq
        minimap2 -t 40 -ax splice:hq -uf --MD -a ${ref}_hq.mmi $fq |samtools sort -O BAM -o ${out}.bam -
        samtools index ${out}.bam
        # subset the alignment
        samtools view -b  --write-index -o ${out}_${chr_select}.bam##idx##${out}_${chr_select}.bam.bai ${out}.bam  $chr_select
    done

done
#subset the genome
samtools faidx ${refdir}/GRCh38.p13.genome.fa $chr_select> ${refdir}/GRCh38.p13.genome_${chr_select}.fa
samtools faidx ${refdir}/GRCh38.p13.genome_${chr_select}.fa
#subset the annotation
tabix -h $refdir/${gff}_sorted.gff3.gz ${chr_select}|bgzip > $refdir/${gff}_sorted_${chr_select}.gff3.gz
tabix -p gff $refdir/${gff}_sorted_${chr_select}.gff3.gz
[ ]: