API Reference

isotools.Transcriptome

class isotools.Transcriptome(**kwargs)[source]

Contains sequencing data and annotation for Long Read Transcriptome Sequencing (LRTS) Experiments.

classmethod from_reference(reference_file, file_format='auto', **kwargs)[source]

Creates a Transcriptome object by importing reference annotation.

Parameters
  • reference_file (str) – Reference file in gff3 format or pickle file to restore previously imported annotation

  • file_format – Specify the file format of the provided reference_file. If set to “auto” the file type is infrered from the extension.

  • chromosome – If reference file is gtf/gff, restrict import on specified chromosomes

  • infer_transcrpts – If reference file is gtf, genes and transcripts are infered from “exon” entries, no specific transcript

save(pickle_file=None)[source]

Saves transcriptome information (including reference) in a pickle file.

Parameters

pickle_file – Filename to save data

classmethod load(pickle_file)[source]

Restores transcriptome information from a pickle file.

Parameters

pickle_file – Filename to restore data

save_reference(pickle_file=None)[source]

Saves the reference information of a transcriptome in a pickle file.

Parameters

pickle_file – Filename to save data

make_index()[source]

Updates the index of gene names and ids (e.g. used by the the [] operator).

__getitem__(key)[source]

Syntax: self[key]

Parameters

key – May either be the gene name or the gene id

Returns

The gene specified by key.

__len__()[source]

Syntax: len(self)

Returns

The number of genes.

__contains__(key)[source]

Syntax: key in self

Checks whether key is in self.

Parameters

key – May either be the gene name or the gene id

remove_chromosome(chromosome)[source]

Deletes the chromosome from the transcriptome

Parameters

chromosome – Name of the chromosome to remove

property sample_table

The sample table contains sample names, group information, long read coverage, as well as all other potentially relevant information on the samples.

property samples: list

An ordered list of sample names.

groups(by='group') dict[source]

Get sample groups as defined in columns of the sample table.

Parameters

by – A column name of the sample table that defines the grouping.

Returns

Dict with groupnames as keys and list of sample names as values.

property n_transcripts: int

The total number of transcripts isoforms.

property n_genes: int

The total number of genes.

property novel_genes: int

The total number of novel (not reference) genes.

property chromosomes: list

The list of chromosome names.

add_sample_from_bam(fn, sample_name=None, barcode_file=None, fuzzy_junction=5, add_chromosomes=True, min_mapqual=0, min_align_fraction=0.75, chimeric_mincov=2, min_exonic_ref_coverage=0.25, use_satag=False, save_readnames=False, progress_bar=True, **kwargs)

Imports expressed transcripts from bam and adds it to the ‘Transcriptome’ object.

Parameters
  • fn – The bam filename of the new sample

  • sample_name – Name of the new sample. If specified, all reads are assumed to belong to this sample.

  • barcode_file – For barcoded samples, ath to file with assignment of sequencing barcodes to sample names. This file should be a tab seperated text file with two columns: the barcode and the sample name Barcodes not listed in this file will be ignored. If sample_name is specified in addition to bacode_file, it will be used as a prefix

  • fuzzy_junction – maximum size for fuzzy junction correction

  • add_chromosomes – If True, genes from chromosomes which are not in the Transcriptome yet are added.

  • min_mapqual – Minimum mapping quality of the read to be considdered. A mapping quality of 0 usually means ambigous alignment.

  • min_align_fraction – Minimum fraction of the read sequence matching the reference.

  • chimeric_mincov – Minimum number of reads for a chimeric transcript to be considered

  • min_exonic_ref_coverage – Minimal fraction of exonic overlap to assign to reference transcript if no splice junctions match. Also applies to mono-exonic transcripts

  • use_satag – If True, import secondary alignments (of chimeric alignments) from the SA tag. This should only be specified if the secondary alignment is not reported in a seperate bam entry.

  • save_readnames – Save a list of the readnames, that contributed to the transcript.

  • progress_bar – Show the progress.

  • kwargs – Additional keyword arugments are added to the sample table.

add_sample_from_csv(coverage_csv_file, transcripts_file, transcript_id_col=None, sample_cov_cols=None, sample_properties=None, add_chromosomes=True, reconstruct_genes=True, fuzzy_junction=0, min_exonic_ref_coverage=0.25, sep='\t')

Imports expressed transcripts from coverage table and gtf/gff file, and adds it to the ‘Transcriptome’ object.

Transcript to gene assignment is either taken from the transcript_file, or recreated, as specified by the reconstruct_genes parameter. In the first case, the genes are then matched to overlapping genes from the reference annotation by gene id. In absence of a overlapping gene with same id, the gene is matched by splice junction, and renamed. A map reflecting the the renaming is returned as a dictionary.

Parameters
  • coverage_csv_file – The name of the csv file with coverage information for all samples to be added. The file contains columns unique ids for the transcripts, and one column with the coverage for each sample

  • transcripts_file – Transcripts to be added to the ‘Transcriptome’ object, in gtf or gff/gff3 format. Gene and transcript ids should correspond to ids provided in the coverage_csv_file.

  • transcript_id_col – Column name with the transcript ids. Alternatively, a list of column names can be provided, in which case the transcript id is concatenated from the provided columns, seperated by an underscore (“_”). If not specified, checks for columns ‘transcript_id’ or [‘gene_id’, ‘transcript_nr’].

  • sample_cov_cols – Dict with sample names for the new samples as keys, and corresponding coverage column names in coverage_csv_file as values. If not specified, a new sample is added for each <name>_coverage column.

  • sample_properties – Additional properties of the samples, that get added to the sample table, and can be used to group or stratify the samples. Can be provided either as a dict with sample names as keys, and the respective properties dicts as the values, or as a data frame with a column “name” or with the sample names in the index, and the properties in the additional columns.

  • add_chromosomes – If True, genes from chromosomes which are not in the Transcriptome yet are added.

  • reconstruct_genes – If True, transcript gene assignment from gtf is ignored, and transcripts are grouped to genes from scratch.

  • min_exonic_ref_coverage – Minimal fraction of exonic overlap to assign to reference transcript if no splice junctions match. Also applies to mono-exonic transcripts

  • progress_bar – Show the progress.

  • sep – Specify the seperator for the coverage_csv_file.

Returns

Dict with map of renamed gene ids.

remove_samples(sample_names)

Removes samples from the dataset.

Params sample_names

A list of sample names to remove.

add_short_read_coverage(bam_files, load=False)

Adds short read coverage to the genes.

This does, by default (e.g. if load==False), this method does not actually read the bams, but import for each gene is done at first access.

Parameters
  • bam_files – A dict with the sample names as keys, and the path to aligned short reads in bam format as values.

  • load – If True, the coveage of all genes is imported. WARNING: this may take a long time.

remove_short_read_coverage()

Removes short read coverage.

Removes all short read coverage information from self.

collapse_immune_genes(maxgap=300000)

This function collapses annotation of immune genes (IG and TR) of a loci.

As immune genes are so variable, classical annotation as a set of transcripts is not meaningfull for those genes. In consequence, each component of an immune gene is usually stored as an individual gene. This can cause issues when comparing transcripts to these genes, which naturally would overlap many of these components. To avoid these issues, immunoglobulin and T-cell receptor genes of a locus are combined to a single gene, without specifiying transcripts. Immune genes are recognized by the gff/gtf property “gene_type” set to “IG*_gene” or “TR*_gene”. Components within the distance of “maxgap” get collapsed to a single gene called TR/IG_locus_X. :param maxgap: Specify maximum distance between components of the same locus.

gene_table(**filter_args)

Creates a gene summary table.

Exports all genes within region to a table.

Parameters

filter_args – Parameters (e.g. “region”, “query”) are passed to Transcriptome.iter_genes.

transcript_table(samples=None, groups=None, coverage=False, tpm=False, tpm_pseudocount=0, extra_columns=None, **filter_args)

Creates a transcript table.

Exports all transcript isoforms within region to a table.

Parameters
  • samples – provide a list of samples for which coverage / expression information is added.

  • groups – provide groups as a dict (as from Transcriptome.groups()), for which coverage / expression information is added.

  • coverage – If set, coverage information is added for specified samples / groups.

  • tpm – If set, expression information (in tpm) is added for specified samples / groups.

  • tpm_pseudocount – This value is added to the coverage for each transcript, before calculating tpm.

  • extra_columns – Specify the additional information added to the table. These can be any transcrit property as defined by the key in the transcript dict.

  • filter_args – Parameters (e.g. “region”, “query”, “min_coverage”,…) are passed to Transcriptome.iter_transcripts.

chimeric_table(region=None, query=None)

Creates a chimeric table

This table contains relevant infos about breakpoints and coverage for chimeric genes.

Parameters
  • region – Specify the region, either as (chr, start, end) tuple or as “chr:start-end” string. If omitted specify the complete genome.

  • query – Specify transcript filter query.

write_gtf(fn, source='isotools', gzip=False, **filter_args)

Exports the transcripts in gtf format to a file.

Parameters
  • fn – The filename to write the gtf.

  • source – String for the source column of the gtf file.

  • region – Splecify genomic region to export to gtf. If omitted, export whole genome.

  • query – Specify transcript filter query.

  • gzip – compress the output as gzip.

export_alternative_splicing(out_dir, out_format='mats', reference=False, min_total=100, min_alt_fraction=0.1, samples=None, region=None, query=None, progress_bar=True)

Exports alternative splicing events defined by the transcriptome.

This is intended to integrate splicing event analysis from short read data. Tools for short read data implement different formats for the import of events. These formats include several files and depend on specific file naming. Currently only MISO (out_format=”miso”) and rMATS (out_format=’mats’) are supported. rMATS is recommended.

Parameters
  • out_dir – Path to the directory where the event files are written to.

  • out_format – Specify the output format. Must be either “miso” or “mats”.

  • min_total – Minimum total coverage over all selected samples.

  • region – Specify the region, either as (chr, start, end) tuple or as “chr:start-end” string. If omitted specify the complete genome.

  • query – Specify gene filter query.

  • progress_bar – Show the progress.

  • reference – If set to True, the LRTS data is ignored and the events are called from the reference. In this case the following parameters are ignored

  • samples – Specify the samples to consider

  • min_total – Minimum total coverage over all selected samples.

  • min_alt_fraction – Minimum fraction of reads supporting the alternative.

add_qc_metrics(genome_fn, progress_bar=True, downstream_a_len=30, direct_repeat_wd=15, direct_repeat_wobble=2, direct_repeat_mm=2, unify_ends=True)

Retrieves QC metrics for the transcripts.

Calling this function populates transcript[“biases”] information, which can be used do create filters. In particular, the direct repeat length, the downstream adenosine content and information about noncanonical splice sites are fetched. Additionaly genes are scanned for transcripts that are fully contained in other transcripts.

Parameters
  • geneome_fn – Path to the genome in fastA format.

  • downstream_a_len – The number of bases downstream the transcript where the adenosine fraction is determined.

  • direct_repeat_wd – The number of bases around the splice sites scanned for direct repeats.

  • direct_repeat_wobble – Number of bases the splice site sequences are shifted.

  • unify_ends – Unify TSS/PAS across transcripts of a gene.

  • direct_repeat_mm – Maximum number of missmatches in a direct repeat.

add_filter(tag, expression, context='transcript', update=False)

Defines a new filter for gene, transcripts and reference transcripts.

The provided expressions is evaluated during filtering in the provided context. For examples, see the default filter definitions isotools.DEFAULT_GENE_FILTER, isotools.DEFAULT_TRANSCRIPT_FILTER and isotools.DEFAULT_REF_TRANSCRIPT_FILTER.

Parameters
  • tag – Unique tag identifer for this filter. Must be a single word.

  • expression – Expression to be evaluated on gene, transcript, or reference transcript.

  • context – The context for the filter expression, either “gene”, “transcript” or “reference”.

  • update – If set, the already present definition of the provided tag gets overwritten.

remove_filter(tag)

Removes definition of filter tag.

Parameters

tag – Specify the tag of the filter definition to remove.

iter_genes(region=None, query=None, min_coverage=None, max_coverage=None, gois=None, progress_bar=False)

Iterates over the genes of a region, optionally applying filters.

Parameters
  • region – The region to be considered. Either a string “chr:start-end”, or a tuple (chr,start,end). Start and end is optional.

  • query – If provided, query string is evaluated on all genes for filtering

iter_transcripts(region=None, query=None, min_coverage=None, max_coverage=None, genewise=False, gois=None, progress_bar=False)

Iterates over the transcripts of a region, optionally applying filters.

By default, each iteration returns a 3 Tuple with the gene object, the transcript number and the transcript dictionary.

Parameters
  • region – The region to be considered. Either a string “chr:start-end”, or a tuple (chr,start,end). Start and end is optional.

  • query – If provided, query string is evaluated on all transcripts for filtering

  • min_coverage – The minimum coverage threshold. Transcripts with less reads in total are ignored.

  • max_coverage – The maximum coverage threshold. Transcripts with more reads in total are ignored.

  • genewise – In each iteration, return the gene and all transcript numbers and transcript dicts for the gene as tuples.

  • gois – If provided, only transcripts from the list of genes of interest are considered. Provide as a list of gene ids or gene names. By default, all genes are considered

  • progress_bar – Print a progress bar.

iter_ref_transcripts(region=None, query=None, genewise=False, gois=None, progress_bar=False)

Iterates over the referemce transcripts of a region, optionally applying filters.

Parameters
  • region – The region to be considered. Either a string “chr:start-end”, or a tuple (chr,start,end). Start and end is optional.

  • genewise – In each iteration, return the gene and all transcript numbers and transcript dicts for the gene as tuples.

  • query – If provided, query string is evaluated on all transcripts for filtering

  • genewise – In each iteration, return the gene and all transcript numbers and transcript dicts for the gene as tuples.

  • gois – If provided, only transcripts from the list of genes of interest are considered. Provide as a list of gene ids or gene names. By default, all genes are considered

  • progress_bar – Print a progress bar.

die_test(groups, min_cov=25, n_isoforms=10, padj_method='fdr_bh', progress_bar=True)

Reimplementation of the DIE test, suggested by Joglekar et al in Nat Commun 12, 463 (2021): “A spatially resolved brain region- and cell type-specific isoform atlas of the postnatal mouse brain”

Syntax and parameters follow the original implementation in https://github.com/noush-joglekar/scisorseqr/blob/master/inst/RScript/IsoformTest.R :param groups: Dict with group names as keys and lists of sample names as values, defining the two groups for the test. :param min_cov: Minimal number of reads per group for each gene. :param n_isoforms: Number of isoforms to consider in the test for each gene. All additional least expressed isoforms get summarized.

altsplice_test(groups, min_total=100, min_alt_fraction=0.1, min_n=10, min_sa=0.51, test='auto', padj_method='fdr_bh', types=None, progress_bar=True)

Performs the alternative splicing event test.

Parameters
  • groups – Dict with group names as keys and lists of sample names as values, defining the two groups for the test. If more then two groups are provided, test is performed between first two groups, but maximum likelihood parameters (expected PSI and dispersion) will be computed for the other groups as well.

  • min_total – Minimum total coverage over all selected samples (for both groups combined).

  • min_alt_fraction – Minimum fraction of reads supporting the alternative (for both groups combined).

  • min_n – The minimum coverage of the event for an individual sample to be considered for the min_sa filter.

  • min_sa – The fraction of samples within each group that must be covered by at least min_n reads.

  • test – The name of one of the implemented statistical tests (‘betabinom_lr’,’binom_lr’,’proportions’).

  • padj_method – Specify the method for multiple testing correction.

  • types – Restrict the analysis on types of events. If omitted, all types are tested.

coordination_test(samples=None, test='fisher', min_dist=1, min_total=100, min_alt_fraction=0.1, events_dict=None, event_type=('ES', '5AS', '3AS', 'IR', 'ME'), query=None, region=None, padj_method='fdr_bh', progress_bar=True)

Performs gene_coordination_test on all genes.

Parameters
  • samples – Specify the samples that should be considered in the test. The samples can be provided either as a single group name, a list of sample names, or a list of sample indices.

  • test (str) – Test to be performed. One of (“chi2”, “fisher”)

  • min_dist (int) – Minimum distance (in nucleotides) between the two Alternative Splicing Events for the pair to be tested

  • min_total (int) – The minimum total number of reads for an event to pass the filter

  • min_alt_fraction – The minimum fraction of read supporting the alternative

  • min_cov_pair (int) – the minimum total number of a pair of the joint occurrence of a pair of event for it to be reported in the result

  • events_dict – Pre-computed dictionary of alternative splicing events, to speed up analysis of several groups of samples of the same data set. Can be generated with the function _utils.precompute_events_dict.

  • event_type – A tuple with event types to test. Valid types are (“ES”,”3AS”, “5AS”,”IR” or “ME”, “TSS”, “PAS”). Default is (“ES”, “5AS”, “3AS”, “IR”, “ME”)

  • query – If provided, query string is evaluated on all genes for filtering

  • region – The region to be considered. Either a string “chr:start-end”, or a tuple (chr,start,end). Start and end is optional.

  • padj_method – The multiple test adjustment method. Any value allowed by statsmodels.stats.multitest.multipletests (default: Benjamini-Hochberg)

Returns

a Pandas DataFrame, where each column corresponds to the p_values, the statistics (the chi squared statistic if the chi squared test is used and the odds-ratio if the Fisher test is used), the log2 OR, the gene Id, the gene name, the type of the first ASE, the type of the second ASE, the starting coordinate of the first ASE, the ending coordinate of the first ASE, the starting coordinate of the second ASE, the ending coordinate of the second ASE, and the four entries of the contingency table.

alternative_splicing_events(min_total=100, min_alt_fraction=0.1, samples=None, region=None, query=None, progress_bar=False)

Finds alternative splicing events.

Finds alternative splicing events and potential transcription start sites/polyA sites by searching for splice bubbles in the Segment Graph. Genes may be specified by genomic “region”, and/or by filter tags / novelty class using the “query” parameters.

Parameters
  • min_total – Minimum total coverage over all selected samples.

  • min_alt_fraction – Minimum fraction of reads supporting the alternative.

  • samples – Specify the samples to consider. If omitted, all samples are selected.

  • region – Specify the region, either as (chr, start, end) tuple or as “chr:start-end” string. If omitted, the complete genome is searched.

  • query – Specify gene filter query.

  • progress_bar – If set True the progress is shown.

Returns

Table with alternative splicing events.

rarefaction(groups=None, fractions=20, min_coverage=2, tr_filter={})

Rarefaction analysis

Reads are sub-sampled according to the provided fractions, to estimate saturation of the transcriptome.

Parameters
  • groups – A dict {group_name:[sample_name_list]} specifying sample groups. If omitted, the samples are analyzed individually.

  • fractions – the fractions of reads to be sub-sampled. Either a list of floats between 0 and 1, or a integer number, specifying the number of equally spaced fractions.

  • min_coverage – Number of reads per transcript required to consider the transcribed discovered.

  • tr_filter – Filter dict, that is passed to self.iter_transcripts().

Returns

Tuple with: 1) Data frame containing the number of discovered transcripts, for each sub-sampling fraction and each sample / sample group. 2) Dict with total number of reads for each group.

altsplice_stats(groups=None, weight_by_coverage=True, min_coverage=2, tr_filter={})

Summary statistics for novel alternative splicing.

This function counts the novel alternative splicing events of LRTS isoforms with respect to the reference annotation. The result can be depicted by isotools.plots.plot_bar.

Parameters
  • groups – A dict {group_name:[sample_name_list]} specifying sample groups. If omitted, the samples are analyzed individually.

  • weight_by_coverage – If True, each transcript is weighted by the coverage.

  • min_coverage – Threshold to ignore poorly covered transcripts.

  • tr_filter – Filter dict, that is passed to self.iter_transcripts().

Returns

Table with numbers of novel alternative splicing events, and suggested parameters for isotools.plots.plot_bar().

filter_stats(tags=None, groups=None, weight_by_coverage=True, min_coverage=2, tr_filter={})

Summary statistics for filter flags.

This function counts the number of transcripts corresponding to filter tags. The result can be depicted by isotools.plots.plot_bar.

Parameters
  • tags – The filter tags to be evaluated. If omitted, all transcript tags are selected.

  • groups – A dict {group_name:[sample_name_list]} specifying sample groups. If omitted, the samples are analyzed individually.

  • weight_by_coverage – If True, each transcript is weighted by the number of supporting reads.

  • min_coverage – Coverage threshold per sample to ignore poorly covered transcripts.

  • tr_filter – Only transcripts that pass this filter are evaluated. Filter is provided as dict of parameters, passed to self.iter_transcripts().

Returns

Table with numbers of transcripts featuring the filter tag, and suggested parameters for isotools.plots.plot_bar().

transcript_length_hist(groups=None, add_reference=False, bins=50, x_range=(0, 10000), weight_by_coverage=True, min_coverage=2, use_alignment=True, tr_filter={}, ref_filter={})

Retrieves the transcript length distribution.

This function counts the number of transcripts within length interval. The result can be depicted by isotools.plots.plot_dist.

Parameters
  • groups – A dict {group_name:[sample_name_list]} specifying sample groups. If omitted, the samples are analyzed individually.

  • add_reference – Add the transcript length distribution of the reference annotation.

  • bins – Define the length interval, either by a single number of bins, or by a list of lengths, defining the interval boundaries.

  • x_range – The range of the intervals. Ignored if “bins” is provided as a list.

  • weight_by_coverage – If True, each transcript is weighted by the coverage.

  • min_coverage – Threshold to ignore poorly covered transcripts.

  • use_alignment – use the transcript length as defined by the alignment (e.g. the sum of all exon lengths).

  • tr_filter – Filter dict, that is passed to self.iter_transcripts().

  • ref_filter – Filter dict, that is passed to self.iter_ref_transcripts() (relevant only if add_reference=True).

Returns

Table with numbers of transcripts within the length intervals, and suggested parameters for isotools.plots.plot_distr().

transcript_coverage_hist(groups=None, bins=50, x_range=(1, 1001), tr_filter={})

Retrieves the transcript coverage distribution.

This function counts the number of transcripts within coverage interval. The result can be depicted by isotools.plots.plot_dist.

Parameters
  • groups – A dict {group_name:[sample_name_list]} specifying sample groups. If omitted, the samples are analyzed individually.

  • bins – Define the coverage interval, either by a single number of bins, or by a list of values, defining the interval boundaries.

  • x_range – The range of the intervals. Ignored if “bins” is provided as a list.

  • tr_filter – Filter dict, that is passed to self.iter_transcripts().

Returns

Table with numbers of transcripts within the coverage intervals, and suggested parameters for isotools.plots.plot_distr().

transcripts_per_gene_hist(groups=None, add_reference=False, bins=49, x_range=(1, 50), min_coverage=2, tr_filter={}, ref_filter={})

Retrieves the histogram of number of transcripts per gene.

This function counts the genes featuring transcript numbers within specified intervals. The result can be depicted by isotools.plots.plot_dist.

Parameters
  • groups – A dict {group_name:[sample_name_list]} specifying sample groups. If omitted, the samples are analyzed individually.

  • add_reference – Add the transcript per gene histogram of the reference annotation.

  • bins – Define the intervals, either by a single number of bins, or by a list of values, defining the interval boundaries.

  • x_range – The range of the intervals. Ignored if “bins” is provided as a list.

  • min_coverage – Threshold to ignore poorly covered transcripts.

  • tr_filter – Filter dict, that is passed to self.iter_transcripts().

  • ref_filter – Filter dict, that is passed to self.iter_ref_transcripts() (relevant only if add_reference=True).

Returns

Table with numbers of genes featuring transcript numbers within the specified intervals, and suggested parameters for isotools.plots.plot_distr().

exons_per_transcript_hist(groups=None, add_reference=False, bins=34, x_range=(1, 69), weight_by_coverage=True, min_coverage=2, tr_filter={}, ref_filter={})

Retrieves the histogram of number of exons per transcript.

This function counts the transcripts featuring exon numbers within specified intervals. The result can be depicted by isotools.plots.plot_dist.

Parameters
  • groups – A dict {group_name:[sample_name_list]} specifying sample groups. If omitted, the samples are analyzed individually.

  • add_reference – Add the exons per transcript histogram of the reference annotation.

  • bins – Define the intervals, either by a single number of bins, or by a list of values, defining the interval boundaries.

  • x_range – The range of the intervals. Ignored if “bins” is provided as a list.

  • weight_by_coverage – If True, each transcript is weighted by the coverage.

  • min_coverage – Threshold to ignore poorly covered transcripts.

  • tr_filter – Filter dict, that is passed to self.iter_transcripts().

  • ref_filter – Filter dict, that is passed to self.iter_ref_transcripts() (relevant only if add_reference=True).

Returns

Table with numbers of transcripts featuring exon numbers within the specified intervals, and suggested parameters for isotools.plots.plot_distr().

downstream_a_hist(groups=None, add_reference=False, bins=30, x_range=(0, 1), weight_by_coverage=True, min_coverage=2, tr_filter={}, ref_filter={})

Retrieves the distribution of downstream adenosine content.

High downstream adenosine content is indicative for internal priming.

Parameters
  • groups – A dict {group_name:[sample_name_list]} specifying sample groups. If omitted, the samples are analyzed individually.

  • add_reference – Add the distribution of downstream adenosine content of the reference annotation.

  • bins – Define the intervals, either by a single number of bins, or by a list of values, defining the interval boundaries.

  • x_range – The range of the intervals. Ignored if “bins” is provided as a list. Should not exceed (0,1), e.g. 0 to 100%.

  • weight_by_coverage – If True, each transcript is weighted by the coverage.

  • min_coverage – Threshold to ignore poorly covered transcripts.

  • tr_filter – Filter dict, that is passed to self.iter_transcripts().

  • ref_filter – Filter dict, that is passed to self.iter_ref_transcripts() (relevant only if add_reference=True).

Returns

Table with downstream adenosine content distribution, and suggested parameters for isotools.plots.plot_distr().

direct_repeat_hist(groups=None, bins=10, x_range=(0, 10), weight_by_coverage=True, min_coverage=2, tr_filter={})

Retrieves the distribution direct repeat length at splice junctions.

Direct repeats are indicative for reverse transcriptase template switching.

Parameters
  • groups – A dict {group_name:[sample_name_list]} specifying sample groups. If omitted, the samples are analyzed individually.

  • bins – Define the intervals, either by a single number of bins, or by a list of values, defining the interval boundaries.

  • x_range – The range of the intervals. Ignored if “bins” is provided as a list.

  • weight_by_coverage – If True, each transcript is weighted by the coverage.

  • min_coverage – Threshold to ignore poorly covered transcripts.

  • tr_filter – Filter dict, that is passed to self.iter_transcripts().

Returns

Table with direct repeat length distribution, and suggested parameters for isotools.plots.plot_distr().

add_hmmer_domains(domain_models, genome, query=True, ref_query=False, region=None, min_coverage=None, max_coverage=None, gois=None, progress_bar=False)

Align domains to protein sequences using pyhmmer and add them to the transcript isoforms.

Parameters
  • domain_models – The domain models and metadata, imported by “isotools.domains.import_hmmer_models” function

  • genome – Filename of genome fasta file, or Fasta

  • query – Query string to select the isotools transcripts, or True/False to include/exclude all transcripts.

  • ref_query – Query string to select the reference transcripts, or True/False to include/exclude all transcripts.

  • region – The region to be considered. Either a string “chr:start-end”, or a tuple (chr,start,end). Start and end is optional.

  • min_coverage – The minimum coverage threshold. Transcripts with less reads in total are ignored.

  • max_coverage – The maximum coverage threshold. Transcripts with more reads in total are ignored.

  • progress_bar – Print progress bars.

add_annotation_domains(annotation, category, id_col='uniProtId', name_col='name', inframe=True, progress_bar=False)

Annotate isoforms with protein domains from uniprot ucsc table files.

This function adds protein domains and other protein annotation to the transcripts. Annotation tables can be retriefed from https://genome.ucsc.edu/cgi-bin/hgTables. Select group: = “Genes and Gene Predictions”, track: “UniProt”, and chose from the available tables (e.g. “domains”).

Parameters
  • annotation – The file name of the table downloaded from the table browser, or a pandas dataframe with the content.

  • category – A term describing the type of annotation in the table (e.g. “domains”).

  • id_col – The column of the table with the domain id.

  • name_col – The column of the table with the label.

  • inframe – If set True (default), only annotations starting in frame are added to the transcript.

  • append – If set True, the annotation is added to existing annotation. This may lead to duplicate entries. By default, annotation of the same category is removed before annotation is added.

  • progress_bar – If set True, the progress is depicted with a progress bar.

isotools.Gene

class isotools.Gene(begin, end, data, transcriptome)[source]

This class stores all gene information and transcripts. It is derived from intervaltree.Interval.

sashimi_plot(samples=None, title='Long read sashimi plot', ax=None, junctions_of_interest=None, x_range=None, select_transcripts=None, y_range=None, log_y=True, jparams=None, exon_color='green', min_cov_th=0.001, high_cov_th=0.05, text_width=1, arc_type='both', text_height=1)

Draws long read Sashimi plot of the gene.

The Sashimi plot depicts the genomic long read sequencing coverage of one or more samples as blocks, and junction coverage as arcs.

Parameters
  • samples – Names of the samples to be depicted (as a list).

  • title – Specify the title of the axis.

  • ax – Specify the axis.

  • junctions_of_interest – List of int pairs to define junctions of interest (which are highlighed in the plots)

  • x_range – Genomic positions to specify the x range of the plot. If not specified, the range will be determined to include the complete gene.

  • y_range – Range for the coverage axis of the plot. Note to include space for the junction arcs. If not specified, the range will be determined automatically.

  • log_y – Log scale for the coverage.

  • select_transcripts – A list of transcript numbers from which the coverage is to be depicted. If obmitted, all transcripts are displayed.

  • jparams – Define the apperance of junctions, depending on their priority. A list with three dicts, defining parameters for low coverage junctions, high coverage junctions, and junctions of interest. For default values, see isotools._gene_plots.DEFAULT_JPARAMS

  • exon_color – Specify the color of the genomic coverage blocks (e.g. the exons)

  • high_cov_th – Minimum coverage for a junction to be considdered high coverage.

  • min_cov_th – Coverage threshold for a junction to be considdered at all.

  • text_width – Scaling factor for the horizontal space that gets reserved for labels on the arcs. This affects the height of the arcs.

  • arc_type – Label the junction arcs with the “coverage” (e.g. number of supporting reads), “fraction” (e.g. fraction of supporting reads in %), or “both”.

  • text_height – Scaling factor for the vertical space that gets reserved for labels on the arcs. This affects the height of the arcs.

gene_track(ax=None, title=None, reference=True, select_transcripts=None, label_exon_numbers=True, label_transcripts=True, label_fontsize=10, color='blue', x_range=None, draw_other_genes=False, query=None, mincoverage=None, maxcoverage=None)

Draws a gene track of the gene.

The gene track depicts the exon structure of a gene, like in a genome browser. Exons are depicted as boxes, and junctions are lines. For coding regions, the height of the boxes is increased. Transcripts are labeled with the name, and a “>” or “<” sign, marking the direction of transcription.

Parameters
  • ax – Specify the axis.

  • title – Specify the title of the axis.

  • reference – If True, depict the reference transcripts, else transcripts are defined by long read sequencing.

  • select_transcripts – A list of transcript numbers to be depicted. If draw_other_genes is set, select_transcripts should be a dict with gene_name as keys and lists with transcript numbers as values. If obmitted, all transcripts are displayed.

  • label_exon_numbers – Draw exon numbers within exons.

  • label_transcripts – Draw transcript name below transcripts.

  • label_fontsize – Specify the font sice for the labels.

  • color – Specify the color for the exons.

  • x_range – Genomic positions to specify the x range of the plot.

  • draw_other_genes – If set to True, transcripts from other genes overlapping the depicted region are also displayed. You can also provide a list of gene names/ids, to specify which other genes should be included.

  • query – Filter query, which is passed to Gene.filter_transcripts or Gene.filter_ref_transcripts

  • mincoverage – Minimum coverage for the transcript to be depeicted. Ignored in case of reference=True.

  • maxcoverage – Maximum coverage for the transcript to be depeicted. Ignored in case of reference=True.

sashimi_plot_short_reads(samples=None, title='short read coverage', ax=None, junctions_of_interest=None, x_range=None, y_range=None, log_y=True, jparams=None, min_cov_th=0.001, high_cov_th=0.05, text_width=0.02, arc_type='both', text_height=1, exon_color='green')

Draws short read Sashimi plot of the gene.

The Sashimi plot depicts the genomic coverage from short read sequencing as blocks, and junction coverage as arcs.

Parameters
  • samples – Names of the short read samples to be depicted (as a list).

  • title – Specify the title of the axis.

  • ax – Specify the axis.

  • junctions_of_interest – List of int pairs to define junctions of interest (which are highlighed in the plots)

  • x_range – Genomic positions to specify the x range of the plot.

  • y_range – Range for the coverage axis of the plot. Note to include space for the junction arcs. If not specified, the range will be determined automatically.

  • log_y – Log scale for the coverage.

  • jparams – Define the apperance of junctions, depending on their priority. A list with three dicts, defining parameters for low coverage junctions, high coverage junctions, and junctions of interest. For default values, see isotools._gene_plots.DEFAULT_JPARAMS

  • exon_color – Specify the color of the genomic coverage blocks (e.g. the exons)

  • high_cov_th – Minimum coverage for a junction to be considdered high coverage.

  • min_cov_th – Coverage threshold for a junction to be considdered at all.

  • text_width – Control the horizontal space that gets reserved for labels on the arcs. This affects the height of the arcs.

  • arc_type – Label the junction arcs with the “coverage” (e.g. number of supporting reads), “fraction” (e.g. fraction of supporting reads in %), or “both”.

  • text_height – Control the vertical space that gets reserved for labels on the arcs. This affects the height of the arcs.

sashimi_figure(samples=None, short_read_samples=None, draw_gene_track=True, draw_other_genes=True, long_read_params=None, short_read_params=None, junctions_of_interest=None, x_range=None)

Arranges multiple Sashimi plots of the gene.

The Sashimi figure consist of a reference gene track, long read sashimi plots for one or more samples or groups of samples, and optionally short read sashimi plots for one or more samples or groups of samples.

Parameters
  • samples – Definition of samples (as a list) or groups of samples (as a dict) for long read plots.

  • short_read_samples – Definition of samples (as a list) or groups of samples (as a dict) for short read plots.

  • draw_gene_track – Specify whether to plot the reference gene track.

  • draw_other_genes – Specify, whether to draw other genes in the gene track.

  • long_read_params – Dict with parameters for the long read plots, get passed to self.sashimi_plot. See isotools._gene_plots.DEFAULT_PARAMS and isotools._gene_plots.DEFAULT_JPARAMS

  • short_read_params – Dict with parameters for the short read plots, get passed to self.sashimi_plot_short_reads. See isotools._gene_plots.DEFAULT_PARAMS and isotools._gene_plots.DEFAULT_JPARAMS

  • junctions_of_interest – List of int pairs to define junctions of interest (which are highlighed in the plots)

  • x_range – Genomic positions to specify the x range of the plot.

Returns

Tuple with figure and axses

plot_domains(source, categories=None, trids=True, ref_trids=False, label='name', include_utr=False, seperate_exons=True, x_ticks='gene', ax=None, domain_cols={'Coiled-coil': 'blue', 'Disordered': 'pink', 'Domain': 'green', 'Family': 'red', 'Motif': 'grey', 'Repeat': 'orange'}, max_overlap=5, highlight=None, highlight_col='red')

Plot exonic part of transcripts, together with protein domanis and annotations.

Parameters
  • source – Source of protein domains, e.g. “annotation”, “hmmer” or “interpro”, for domains added by the functions “add_annotation_domains”, “add_hmmer_domains” or “add_interpro_domains” respectively.

  • categories – List of domain categories to be depicted, default: all categories.

  • trids – List of transcript indices to be depicted. If True/False, all/none transcripts are depicted.

  • ref_trids – List of reference transcript indices to be depicted. If True/False, all/none reference transcripts are depicted.

  • label – Specify the type of label: eiter None, or id, or name.

  • include_utr – If set True, the untranslated regions are also depicted.

  • seperate_exons – If set True, exon boundaries are marked.

  • x_ticks – Either “gene” or “genome”. If set to “gene”, positions are relative to the gene (continuous, starting from 0). If set to “genome”, positions are (discontinous) genomic coordinates.

  • ax – Specify the axis.

  • domain_cols – Dicionary for the colors of different domain types.

  • max_overlap – Maximum number of overlapping domains to be depicted. Longer domains have priority over shorter domains.

  • highlight – List of genomic positions or intervals to highlight.

  • highlight_col – Specify the color for highlight positions.

add_interpro_domains(genome, email, baseUrl='http://www.ebi.ac.uk/Tools/services/rest/iprscan5', max_jobs=25, poll_time=5, query=True, ref_query=False, min_coverage=None, max_coverage=None, progress_bar=True)

Add domains to gene by webrequests to ebi interpro REST API.

This function adds protein domains from interpro to the transcripts. Note that these rquest may take around 60 seconds per sequence. Seqeunces of transcripts sharing the same coding sequence are requested only once.

Parameters
  • genome – Filename of genome fasta

  • email – Valid email address, as required by the ebi web request. Not sure how it is used.

  • base_url – URL of the api.

  • max_jobs – Specify the maximum number of parallel requests.

  • query – Query string to select the isotools transcripts, or True/False to include/exclude all transcripts.

  • ref_query – Query string to select the reference transcripts, or True/False to include/exclude all transcripts.

  • min_coverage – The minimum coverage threshold. Transcripts with less reads in total are ignored.

  • max_coverage – The maximum coverage threshold. Transcripts with more reads in total are ignored.

  • progress_bar – If set True, the progress is depicted with a progress bar.

short_reads(idx)[source]

Returns the short read coverage profile for a short read sample.

Parameters

idx – The index of the short read sample.

correct_fuzzy_junctions(trid, size, modify=True)[source]

Corrects for splicing shifts.

This function looks for “shifted junctions”, e.g. same difference compared to reference annotaiton at both donor and acceptor) presumably caused by ambigous alignments. In these cases the positions are adapted to the reference position (if modify is set).

Parameters
  • trid – The index of the transcript to be checked.

  • size – The maximum shift to be corrected.

  • modify – If set, the exon positions are corrected according to the reference.

add_noncanonical_splicing(genome_fh)[source]

Add information on noncanonical splicing.

For all transcripts of the gene, scan for noncanonical (i.e. not GT-AG) splice sites. If noncanonical splice sites are present, the corresponding intron index (in genomic orientation) and the sequence i.e. the dinucleotides of donor and aceptor as XX-YY string are stored in the “noncannoncical_splicing” field of the transcript dicts. True noncanonical splicing is rare, thus it might indicate technical artifacts (template switching, missalignment, …)

Parameters

genome_fh – A file handle of the genome fasta file.

add_direct_repeat_len(genome_fh, delta=15, max_mm=2, wobble=2)[source]

Computes direct repeat length.

This function counts the number of consequtive equal bases at donor and acceptor sites of the splice junctions. This information is stored in the “direct_repeat_len” filed of the transcript dictionaries. Direct repeats longer than expected by chance indicate template switching.

Parameters
  • genome_fh – The file handle to the genome fasta.

  • delta – The maximum length of direct repeats that can be found.

  • max_mm – The maximum length of direct repeats that can be found.

  • wobble – The maximum length of direct repeats that can be found.

add_threeprime_a_content(genome_fh, length=30)[source]

Adds the information of the genomic A content downstream the transcript.

High values of genomic A content indicate internal priming and hence genomic origin of the LRTS read. This function populates the ‘downstream_A_content’ field of the transcript dictionaries.

Parameters
  • geneome_fh – A file handle for the indexed genome fasta file.

  • length – The length of the downstream region to be considered.

get_sequence(genome_fh, trids=None, reference=False, protein=False)[source]

Returns the nucleotide sequence of the specified transcripts.

Parameters
  • genome_fh – The path to the genome fasta file, or FastaFile handle.

  • trids – List of transcript ids for which the sequence are requested.

  • reference – Specifiy whether the sequence is fetched for reference transcripts (True) or long read transcripts (False, default).

  • protein – Return protein sequences instead of transcript sequences.

add_orfs(genome_fh, reference=False, minlen=30, start_codons=['ATG'], stop_codons=['TAA', 'TAG', 'TGA'])[source]

find longest ORF for each transcript and add to the transcript properties tr[“ORF”]

get_all_orf(genome_fh, reference=False, minlen=30, start_codons=['ATG'], stop_codons=['TAA', 'TAG', 'TGA'])[source]

Predicts ORF.

add_fragments()[source]

Checks for transcripts that are fully contained in other transcripts.

Transcripts that are fully contained in other transcripts are potential truncations. This function populates the ‘fragment’ filed of the transcript dictionaries with the indices of the containing transcripts, and the exon ids that match the first and last exons.

coding_len(trid)[source]

Returns length of 5’UTR, coding sequence and 3’UTR.

Parameters

trid – The transcript index for which the coding length is requested.

get_infos(trid, keys, sample_i, group_i, **kwargs)[source]

Returns the transcript information specified in “keys” as a list.

tpm(pseudocount=1)[source]

Returns the transcripts per million (TPM).

TPM is returned as a numpy array, with samples in columns and transcript isoforms in the rows.

find_transcript_positions(trid, pos, reference=False)[source]

Converts genomic positions to positions within the transcript.

Parameters
  • trid – The transcript id

  • pos – List of sorted genomic positions, for which the transcript positions are computed.

property coverage

Returns the transcript coverage.

Coverage is returned as a numpy array, with samples in columns and transcript isoforms in the rows.

property gene_coverage

Returns the gene coverage.

Total Coverage of the gene for each sample.

property chrom

Returns the genes chromosome.

property region

Returns the region of the gene as a string in the format “chr:start-end”.

property id

Returns the gene id

property name

Returns the gene name

property is_annotated

Returns “True” iff reference annotation is present for the gene.

property is_expressed

Returns “True” iff gene is covered by at least one long read in at least one sample.

property strand

Returns the strand of the gene, e.g. “+” or “-“

property transcripts

Returns the list of transcripts of the gene, as found by LRTS.

property ref_transcripts

Returns the list of reference transcripts of the gene.

property n_transcripts

Returns number of transcripts of the gene, as found by LRTS.

property n_ref_transcripts

Returns number of reference transcripts of the gene.

property ref_segment_graph

Returns the segment graph of the reference transcripts for the gene

property segment_graph

Returns the segment graph of the LRTS transcripts for the gene

copy()[source]

Returns a shallow copy of self.

coordination_test(samples=None, test='chi2', min_dist=1, min_total=100, min_alt_fraction=0.1, events=None, event_type=('ES', '5AS', '3AS', 'IR', 'ME'))[source]

Performs pairwise independence test for all pairs of Alternative Splicing Events (ASEs) in a gene.

For all pairs of ASEs in a gene creates a contingency table and performs an indeppendence test. All ASEs A have two states, pri_A and alt_A, the primary and the alternative state respectivley. Thus, given two events A and B, we have four possible ways in which these events can occur on a transcript, that is, pri_A and pri_B, pri_A and alt_B, alt_A and pri_B, and alt_A and alt_B. These four values can be put in a contingency table and independence, or coordination, between the two events can be tested.

Parameters
  • samples – Specify the samples that should be considdered in the test. The samples can be provided either as a single group name, a list of sample names, or a list of sample indices.

  • test (str) – Test to be performed. One of (“chi2”, “fisher”)

  • min_dist (int) – Minimum distance (in nucleotides) between the two alternative splicing events for the pair to be tested.

  • min_total (int) – The minimum total number of reads for an event pair to pass the filter.

  • min_alt_fraction (float) – The minimum fraction of reads supporting the minor alternative of the two events.

  • events – To speed up testing on different groups of the same transcriptome objects, events can be precomputed with the isotools._utils.precompute_events_dict function.

  • event_type – A tuple with event types to test. Valid types are (“ES”, “3AS”, “5AS”, “IR”, “ME”, “TSS”, “PAS”). Default is (“ES”, “5AS”, “3AS”, “IR”, “ME”). Not used if the event parameter is already given.

Returns

A list of tuples with the test results: (gene_id, gene_name, strand, eventA_type, eventB_type, eventA_start, eventA_end, eventB_start, eventB_end, p_value, test_stat, log2OR, dcPSI_AB, dcPSI_BA, priA_priB, priA_altB, altA_priB, altA_altB, priA_priB_trids, priA_altB_trids, altA_priB_trids, altA_altB_trids).

die_test(groups, min_cov=25, n_isoforms=10)[source]

Reimplementation of the DIE test, suggested by Joglekar et al in Nat Commun 12, 463 (2021): “A spatially resolved brain region- and cell type-specific isoform atlas of the postnatal mouse brain”

Syntax and parameters follow the original implementation in https://github.com/noush-joglekar/scisorseqr/blob/master/inst/RScript/IsoformTest.R :param groups: Define the columns for the groups. :param min_cov: Minimal number of reads per group for the gene. :param n_isoforms: Number of isoforms to consider in the test for the gene. All additional least expressed isoforms get summarized.

isotools.SegmentGraph

class isotools.SegmentGraph(transcripts, strand)[source]

Segment Graph Implementation

Nodes in the Segment Graph represent disjoint exonic bins (aka segments) and have start (genomic 5’), end (genomic 3’), and a dict of successors and predesessors (edges) Edges link two exonic bins x and y that follow successively in a transcript. They represent either introns (if x.end<y.start) or connect exonic bins of the same exon (x.end==y.start).

Parameters
  • transcripts (list) – A list of transcripts, which are lists of exons, which in turn are (start,end) tuples

  • strand – the strand of the gene, either “+” or “-“

search_transcript2(exons)[source]

Tests if a transcript (provided as list of exons) is contained in self and return the corresponding transcript indices.

Parameters

exons (list) – A list of exon tuples representing the transcript

Returns

a list of supporting transcript indices

Return type

list

search_transcript(exons, complete=True, include_ends=False)[source]

Tests if a transcript (provided as list of exons) is contained in sg and return the corresponding transcript indices.

Search the splice graph for transcripts that match the introns of the provided list of exons.

Parameters
  • exons (list) – A list of exon tuples representing the transcript

  • complete – If True, yield only splice graph transcripts that match all introns from the exon list. If False, yield also splice graph transcripts having additional exons at the beginning and/or end.

  • include_ends – If True, yield only splice graph transcripts that include the first and last exon. If False, also yield splice graph transcripts that extend first and/or last exon but match the intron chain.

Returns

a list of supporting transcript indices

Return type

list

get_node_matrix() numpy.array[source]

Gets the node matrix representation of the segment graph.

find_fragments()[source]

Finds all fragments (e.g. transcript contained in other transcripts) in the segment graph.

get_alternative_splicing(exons, alternative=None)[source]

Compares exons to segment graph and returns list of novel splicing events.

This function computes the novelty class of the provided transcript compared to (reference annotation) transcripts from the segment graph. It returns the “squanti category” (0=FSM,1=ISM,2=NIC,3=NNC,4=Novel gene) and the subcategory.

Parameters
  • exons (list) – A list of exon tuples representing the transcript

  • alternative – list of splice site indices that match other genes

Returns

pair with the squanti category number and the subcategories as list of novel splicing events that produce the provided transcript from the transcripts in splce graph

Return type

tuple

fuzzy_junction(exons, size)[source]

Looks for “fuzzy junctions” in the provided transcript.

For each intron from “exons”, look for introns in the splice graph shifted by less than “size”. These shifts may be produced by ambigious alignments.

Parameters
  • exons (list) – A list of exon tuples representing the transcript

  • size (int) – The maximum size of the fuzzy junction

Returns

a dict with the intron number as key and the shift as value (assuming size is smaller than introns)

Return type

dict

find_splice_sites(sj)[source]

Checks whether the splice sites of a new transcript are present in the segment graph.

Parameters

sj – A list of 2 tuples with the splice site positions

Returns

boolean array indicating whether the splice site is contained or not

get_overlap(exons)[source]

Compute the exonic overlap of a new transcript with the segment graph.

Parameters

exons (list) – A list of exon tuples representing the transcript

Returns

a tuple: the overlap with the gene, and a list of the overlaps with the transcripts

get_intron_support_matrix(exons)[source]

Check the intron support for the provided transcript w.r.t. transcripts from self.

This is supposed to be helpfull for the analysis of novel combinations of known splice sites.

Parameters

exons – A list of exon positions defining the transcript to check.

Returns

A boolean array of shape (n_transcripts in self)x(len(exons)-1). An entry is True iff the intron from “exons” is present in the respective transcript of self.

get_exon_support_matrix(exons)[source]

Check the exon support for the provided transcript w.r.t. transcripts from self.

This is supposed to be helpfull for the analysis of novel combinations of known splice sites.

Parameters

exons – A list of exon positions defining the transcript to check.

Returns

A boolean array of shape (n_transcripts in self)x(len(exons)-1). An entry is True iff the exon from “exons” is fully covered in the respective transcript of self. First and last exon are checked to overlap the first and last exon of the ref transcript but do not need to be fully covered

get_intersects(exons)[source]

Computes the splice junction exonic overlap of a new transcript with the segment graph.

Parameters

exons (list) – A list of exon tuples representing the transcript

Returns

the splice junction overlap and exonic overlap

find_splice_bubbles(types=None, pos=None)[source]

Searches for alternative paths in the segment graph (“bubbles”).

Bubbles are defined as combinations of nodes x_s and x_e with more than one path from x_s to x_e.

Parameters
  • types – A tuple with event types to find. Valid types are (‘ES’,’3AS’, ‘5AS’,’IR’ or ‘ME’, ‘TSS’, ‘PAS’). If ommited, all types are considered

  • pos – If specified, restrict the search on specific position. This is useful to find the supporting transcripts for a given type if the position is known.

Returns

Tuple with 1) transcript indices of primary (e.g. most direct) paths and 2) alternative paths respectivly, as well as 3) start and 4) end node ids and 5) type of alternative event (‘ES’,’3AS’, ‘5AS’,’IR’ or ‘ME’, ‘TSS’, ‘PAS’)

is_exonic(position)[source]

Checks whether the position is within an exon.

Parameters

position – The genomic position to check.

Returns

True, if the position overlaps with an exon, else False.

events_dist(e1, e2)[source]

returns the distance (in nucleotides) between two Alternative Splicing Events.

Parameters
  • e1 – event obtained from .find_splice_bubbles()

  • e2 – event obtained from .find_splice_bubbles()

  • sg – segment graph

class isotools.SegGraphNode(start, end, pre=None, suc=None)[source]

A node in a segment graph represents an exonic segment.

property start: int

the (genomic 5’) start of the segment

property end: int

the (genomic 3’) end of the segment

property pre: dict

the predecessor segments of the segment (linked nodes downstream)

property suc: dict

the successor segments of the segment (linked nodes downstream)

isotools.plots module

isotools.plots.plot_diff_results(result_table, min_support=3, min_diff=0.1, grid_shape=(5, 5), min_cov=10, splice_types=None, group_colors=None, sample_colors=None, pt_size=20, lw=1, ls='solid')[source]

Plots differential splicing results.

For the first (e.g. most significant) differential splicing events from result_table that pass the checks defined by the parameters, the PSI value of the alternative splicing event, as well as the fitted beta model for the groups, is depicted.

Parameters
  • min_cov – Depict samples where the event is covered by at least min_cov reads

  • min_support – Minimum number of samples per group supporting the differential event. A sample is considdered to support the differential event if it is covered > min_cov and the PSI is closer to the group mean than to the alternative group mean.

  • min_diff – Minimum PSI group difference.

  • grid_shape – Number of rows and columns for the figure.

  • splice_type – Only events from the splecified splice_type(s) are depicted. If omitted, all types are selected.

  • group_colors – Specify the colors for the groups (e.g. the lines) as a dict or list of length two.

  • sample_colors – Specify the colors for the samples (e.g. the dots) as a dict. Defaults to the corresponding group color.

  • pt_size – Specify the size for the data points in the plot.

  • lw – Specify witdh of the lines. See matplotlib Line2D for details.

  • ls – Specify style of the lines. See matplotlib Line2D for details.

Returns

figure, axes and list of plotted events

isotools.plots.plot_embedding(splice_bubbles, method='PCA', prior_count=3, top_var=500, min_total=100, min_alt_fraction=0.1, plot_components=(1, 2), splice_types='all', labels=True, groups=None, colors=None, pt_size=20, ax=None, **kwargs)[source]

Plots embedding of alternative splicing events.

Alternative splicing events are soreted by variance and only the top variable events are used for the embedding. A prior weight is added to all samples proportional to the average fraction of the alternatives, in order to bias poorly covered samples towards the mean and limit their potential to disturb the analysis.

Parameters
  • splice_bubbles – The splice bubble table, produced by Transcriptome.alternative_splicing_events().

  • method – The embedding method, either “PCA” or “UMAP”.

  • prior_count – Number of prior reads which are added to each sample proportional to the average fraction of the alternatives.

  • top_var – Number of alternative splicing events which are used for the embedding.

  • min_total – Minimum total coverage over all selected samples.

  • min_alt_fraction – Minimum fraction of reads supporting the alternative (for both groups combined).

  • plot_components – The dimentions to plot (E.g. the components of the PCA)

  • splice_types – Restrict the analysis on specified splicing event(s).

  • labels – If True, sample names are printed in the plot next to the corresponding points.

  • groups – Set a group definition (e.g. by isoseq.Transcirptome.groups()) to color the datapoints. All samples within one group are depicted.

  • colors – List or dict of colors for the groups, if ommited colors are generated automatically.

  • pt_size – Specify the size for the data points in the plot.

  • ax – The axis for plotting.

  • **kwargs – Additional keyword parameters are passed to PCA() or UMAP().

Returns

A dataframe with the proportions of the alternative events, the transformed data and the embedding object.

isotools.plots.plot_bar(df, ax=None, drop_categories=None, legend=True, annotate=True, rot=90, bar_width=0.5, colors=None, **axparams)[source]

Depicts data as a barplot.

This function is intended to be called with the result from isoseq.Transcriptome.filter_stats() or isoseq.Transcriptome.altsplice_stats().

Parameters
  • df – Pandas dataframe with the data to plot.

  • ax – the axis for the plot.

  • drop_categories – Specify columns from df to drop.

  • legend – If True, add a legend.

  • annotate – If True, print numbers / fractions in the bars.

  • rot – Set rotation of the lables.

  • bar_width – Set relative width of the plotted bars.

  • colors – Provide a dictionary with label keys and color values. By default, colors are automatically assigned.

  • **axparams – Additional keyword parameters are passed to ax.set().

isotools.plots.plot_distr(counts, ax=None, density=False, smooth=None, legend=True, fill=True, lw=1, ls='solid', colors=None, **axparams)[source]

Depicts data as density plot.

This function is intended to be called with the result from isoseq.Transcriptome.transcript_length_hist(), isoseq.Transcriptome.transcripts_per_gene_hist(), isoseq.Transcriptome.exons_per_transcript_hist(), isoseq.Transcriptome.downstream_a_hist(), isoseq.Transcriptome.direct_repeat_hist() or isoseq.Transcriptome.transcript_coverage_hist().

Parameters
  • df – Pandas dataframe with the data to plot.

  • ax – The axis for the plot.

  • density – Scale the data by the total.

  • smooth – Ews smoothing span.

  • legend – If True, add a legend.

  • fill – If set, the area below the lines are filled with half transparent color.

  • lw – Specify witdh of the lines. See matplotlib Line2D for details.

  • ls – Specify style of the lines. See matplotlib Line2D for details.

  • colors – Provide a dictionary with label keys and color values. By default, colors are automatically assigned.

  • **axparams – Additional keyword parameters are passed to ax.set().

isotools.plots.plot_saturation(isoseq=None, ax=None, cov_th=2, expr_th=[0.5, 1, 2, 5, 10], x_range=(10000.0, 10000000.0, 10000.0), legend=True, label=True, **axparams)[source]

Plots Negative Binomial model to analyze the saturation of LRTS data.

Saturation (e.g. the probability to observe a transcript of interest in the sample) is dependent on the sequencing depth (number of reads), the concentration of the transcripts of interest in the sample (in TPM), and the requested coverage of the transcript in the data (minimum number of reads per transcript). This function models the relation with a Negative Binomial distribution, to help estimate the required sequencing depth.

Parameters
  • isoseq – If provided, the sequencing depth of samples from this isotools.Transcriptome object are depicted as vertical lines.

  • ax – The axis for the plot.

  • cov_th – The requested coverage, e.g. the minimum number of reads per transcript.

  • expr_th – A list of transcript concentrations in TPM for transcripts of interest.

  • x_range – Specify the range of the x axis (e.g. the sequencing depth)

  • legend – If set True, a legend is added to the plot.

  • label – If set True, the sample names and sequencing depth from the isoseq parameter is printed in the plot.

  • **axparams – Additional keyword parameters are passed to ax.set().

isotools.plots.plot_rarefaction(rarefaction, total=None, ax=None, legend=True, colors=None, lw=1, ls='solid', **axparams)[source]

Plots the rarefaction curve.

Parameters
  • rarefaction – A DataFrame with the observed number of transcripts, as computed by Transcriptome.rarefaction().

  • total – A dictionary with the total number of reads per sample/sample group, as computed by Transcriptome.rarefaction().

  • ax – The axis for the plot.

  • legend – If set True, a legend is added to the plot.

  • colors – Provide a dictionary with label keys and color values. By default, colors are automatically assigned.

  • lw – Specify witdh of the lines. See matplotlib Line2D for details.

  • ls – Specify style of the lines. See matplotlib Line2D for details.

  • **axparams – Additional keyword parameters are passed to ax.set().

isotools.domains module

isotools.domains.add_domains_to_table(table, transcriptome, source='annotation', categories=None, id_col='gene_id', modes=['trA-trB', 'trB-trA'], naming='id', overlap_only=False, insert_after=None, **filter_kwargs)[source]

add domain annotation to table.

Parameters
  • table – A table, for which domains are derived. It should have at least one column with a gene id and one with a list of transcripts.

  • transcriptome – The isotools.Transcriptome object, from which the domains are sourced.

  • source – One of the three possible annotation sources.

  • categories – A list of annotation categories, which are considered. If set to None (default) all annotation categories are considered.

  • modes – A list of strings with column names of transcript ids used to look up domains. Multible columns can be combined to one mode with set operators “|” for union, “&” for intersection, and “-” for set difference. For example “trA-trB” would generate a Series with all domains of transcripts in “trA”, but not in “trB”.

  • naming – Define how domains are named, either by “id” or by “name”.

  • insert_after – Define column after which the domains are inserted into the table, either by column name or index. By default, domain columns returned as seperate DataFrame.

  • **filter_kwargs

    additional keywords are passed to Gene.filter_transcripts, to restrict the transcripts to be considered.

Parm overlap_only

If set “True”, only domains overlapping the region from column “start” to “end” are considered. If set “False”, all domains of the transcripts are considered.

isotools.domains.import_hmmer_models(path, model_file='Pfam-A.hmm.gz', metadata_file='Pfam-A.hmm.dat.gz')[source]

Import the hmmer model and metadata.

This function imports the hmmer Pfam models from “Pfam-A.hmm.gz” and metadata from “Pfam-A.hmm.dat.gz”, which are available for download on the interpro website, at “https://www.ebi.ac.uk/interpro/download/Pfam/”. The models are needed for Transcriptome.add_hmmer_domains() function.

Parameters
  • path – The path where model and metadata files are located.

  • model_file – The filename of the model file.

  • model_file – The filename of the metadata file.

constants

IsoTools: Python package for long read transcriptome sequencing analysis.

isotools.DEFAULT_GENE_FILTER

Default definitions for gene filter, as used in iosotools.Transcriptome.add_filters().

isotools.DEFAULT_TRANSCRIPT_FILTER

Default definitions for transcript filter, as used in iosotools.Transcriptome.add_filters().

isotools.DEFAULT_REF_TRANSCRIPT_FILTER

Default definitions for reference transcript filter, as used in iosotools.Transcriptome.add_filters().

isotools.ANNOTATION_VOCABULARY

Controlled vocabulary for filtering by annotation.

isotools.SPLICE_CATEGORY

Controlled vocabulary for filtering by novel alternative splicing.