Genomic Data

The bnp.genomic_data module contains functions for analysing data that belongs to a reference genome, e.g .bed, .bam, .vcf, .gtf files and so on. The main entry point is the Genome class that creates genomic data objects, either from file, or from BNPDataClass objects. The Genome file is most commonly instaniated from a .chrom.sizes or .fasta file, which contains information about the size of each chromosome in the genome.

API documentation

class bionumpy.genomic_data.Genome(chrom_sizes: ~typing.Dict[str, int], fasta_filename: str | None = None, sort_names=False, filter_function=<function Genome.<lambda>>)[source]

The entry point for working with genomic data. Supports reading data directly from file and into GenomicData objects, or creating GenomicData objects from BNPDataClass objects, such Interval, BedGraph or LocationEntry.

Supports both in memory or streamed/lazy data, controlled with the stream keyword. For streamed data it requires that the data in the file uses the same chromosome ordering as the genome. This will usually be the same as either what is provided in the chrom.sizes file, or a simple alphabetic ordering. If sort order errors occur, try using sort_names=True when creating the Genome object.

classmethod from_file(filename: str, sort_names: bool = False, filter_function=<function ignore_underscores>) Genome[source]

Read genome information from a ‘chrom.sizes’ or ‘fa.fai’ file

File should contain rows of names and lengths of chromosomes. If a fasta file is provided, a fasta index will be written to ‘fa.fai’ if not already theree. The sequence will then be available through, ‘genome.read_sequence()’ without any ‘filename’ parameter.

Parameters

cls : filename : str sort_names : bool Whether or not to sort the chromosome names

Returns

‘Genome’

A Genome object created from the read chromosome sizes

get_intervals(intervals: Interval, stranded: bool = False) GenomicIntervals[source]

Get genomic intervals from interval data.

Parameters

intervalsInterval

Interval’s or stream of Intervals’s

strandedbool

Wheter or not the intervals are stranded

Returns

GenomicIntervals

get_locations(data: LocationEntry, has_numeric_chromosomes=False) GenomicLocation[source]

Create `GenomicLocation`s from location entries

Parameters

dataLocationEntry

BNPDataClass of location entries

Returns

GenomicLocation

get_track(bedgraph: BedGraph) GenomicArray[source]

Create a GenomicArray for the data in the bedgraph

Parameters

bedgraph : BedGraph

Returns

GenomicArray

read_annotation(filename: str) GenomicAnnotation[source]

Read genomic annotation from a ‘gtf’ or ‘gff’ file

The annotation file should have ‘gene’, ‘transcript’ and ‘exon’ entries.

Parameters

filenamestr

Filename of the annotation file

Returns

GenomicAnnotation

GenomicAnnotation containing the genes, transcripts and exons

read_intervals(filename: str, stranded: bool = False, stream: bool = False, buffer_type=None) GenomicIntervals[source]

Read a bed file and represent it as GenomicIntervals

If stream is True then read the bedgraph in chunks and get a lazy evaluated GenomicTrack

Parameters

filenamestr

Filename for the bed file

strandedbool

Whether or not to treat the intervals as stranded

streambool

Wheter to read as a stream

Returns

GenomicIntervals

read_locations(filename: str, stranded: bool = False, stream: bool = False, has_numeric_chromosomes=False, buffer_type=None) GenomicLocation[source]

Read a set of locations from file and convert them to GenomicLocation

Locations can be treated as stranded or not stranded. If has_numeric_chromosomes then ‘chr’ will be prepended to all the chromosome names in order to make them compatible with the genome’s chromosome names

Parameters

filenamestr

Filename of the locations

strandedbool

Whether or not the locations are stranded

streambool

Whether the data should be read as a stream

has_numeric_chromosomesbool

Whether the file has the chromosomes as numbers

Returns

GenomicLocation

(Stranded) GenomicLocation

read_sequence(filename: str | None = None) GenomicSequence[source]

Read the genomic sequence from file.

If a fasta file was used to create the Genome object, filename can be None in which case the sequence will be read from that fasta file

Parameters

filename : str

Returns

GenomicSequence

read_track(filename: str, stream: bool = False) GenomicArray[source]

Read a bedgraph from file and convert it to a GenomicArray

If stream is True then read the bedgraph in chunks and get a lazy evaluated GenomicArray

Parameters

filenamestr

Filename for the bedgraph file

streambool

Whether or not to read as stream

property size

The size of the genome

class bionumpy.genomic_data.GenomicArray[source]

Class for representing data on a genome.

classmethod from_bedgraph(bedgraph: BedGraph, genome_context: GenomeContextBase) GenomicData[source]

Create a genomic array from a bedgraph and genomic context

Parameters

bedgraphBedGraph

Bedgraph with data along the genome

genome_contextGenomeContextBase

The genome context

Returns

‘GenomicData’

classmethod from_global_data(global_pileup: GenomicRunLengthArray, genome_context: GenomeContextBase) GenomicArray[source]

Create the genomic array from data represened on a flat/concatenated genome.

Parameters

global_pileupGenomicRunLengthArray

Array over the concatenated genome

genome_contextGenomeContextBase

The genome context

Returns

‘GenomicArray’

class bionumpy.genomic_data.GenomicIntervals[source]

Class for representing intervals on a genome

abstract clip() GenomicIntervals[source]

Clip the intervals so that they are contained in the genome

Returns

‘GenomicIntervals’

Clipped intervals

abstract extended_to_size(size: int) GenomicIntervals[source]

Extend intervals along strand to reach the given size

Parameters

size : int

Returns

‘GenomicIntervals’

classmethod from_interval_stream(interval_stream: Iterable[Interval], genome_context: GenomeContextBase, is_stranded=False)[source]

Create streamed genomic intervals from a stream of intervals and genome info

Parameters

interval_stream : Iterable[Interval] chrom_sizes : Dict[str, int]

classmethod from_intervals(intervals: Interval, genome_context: GenomeContextBase, is_stranded=False)[source]

Create genomic intervals from interval entries and genome info

Parameters

intervals : Interval chrom_sizes : Dict[str, int]

classmethod from_track(track: GenomicArray) GenomicIntervals[source]

Return intervals of contigous areas of nonzero values of track

Parameters

track : GenomicArray

Returns

‘GenomicIntervals’

abstract get_mask() GenomicArray[source]

Return a boolean mask of areas covered by any interval

Returns

GenomicArray

Genomic mask

abstract get_pileup() GenomicArray[source]

Return a genmic track of counting the number of intervals covering each bp

Returns

GenomicArray

Pileup track

abstract merged(distance: int = 0) GenomicIntervals[source]

Merge intervals that overlap or lie within distance of eachother

Parameters

distance : int

Returns

‘GenomicIntervals’

4

class bionumpy.genomic_data.GenomicLocation[source]

Class representing (possibliy stranded) locations in the genome

classmethod from_data(data: BNPDataClass, genome_context: GenomeContextBase, is_stranded: bool = False, chromosome_name: str = 'chromosome', position_name: str = 'position', strand_name: str = 'strand') GenomicLocation[source]

Create GenomicLocation object from a genome context and a bnpdataclass

The field names for the chromosome, positions and strand can be specified

Parameters

cls3

4

dataBNPDataClass

The data containing the locations

genome_contextGenomeContextBase

Genome context for the genome

is_strandedbool

Whether or not the locations should be stranded

chromosome_namestr

Name of the chromosome field in data

position_namestr

Name of the position field in the data

strand_namestr

Name if the strand field int the data

Returns

‘GenomicLocation’

classmethod from_fields(genome_context: GenomeContextBase, chromosome: List[str], position: List[int], strand: List[str] | None = None) GenomicLocation[source]

Create genomic location from a genome context and the needed fields (chromosome and position)

Parameters

genome_contextGenomeContextBase

Genome context object for the genome

chromosomeList[str]

List of chromosome

positionList[int]:

List of positions

strandList[str]

Optional list of strand

Returns

‘GenomicLocation’

class bionumpy.genomic_data.GenomicAnnotation(data, genome_context)[source]

Class to hold a genomic annotations. Basically just a holder for gene, transcript and exon intervals.

classmethod from_gtf_entries(gtf_entries: GTFEntry, genome_context: GenomeContextBase) GenomicAnnotation[source]

Get a GenomicAnnotation from a set of gtf/gff entries

Parameters

gtf_entriesGTFEntry

BNPDataClass with gtf entries

genome_contextGenomeContextBase

The genome context

Returns

‘GenomicAnnotation’

Annotatin object holding the genes, transcripts and exons

class bionumpy.genomic_data.GenomicSequence(indexed_fasta: IndexedFasta, genome_context=None)[source]

Class to hold a genomic sequence as a GenomicArray

extract_intervals(intervals: Interval, stranded: bool = False) EncodedRaggedArray[source]

Get the data within the (stranded) intervals

Parameters

intervalsInterval

Set of intervals

strandedbool

Wheter to reverse intervals on - strand

Returns

RunLengthRaggedArray

Data for all intervals

classmethod from_dict(sequence_dict: Dict[str, str])[source]

Create genomic data from a dict of data with chromosomes as keys

Parameters

d : Dict[str, GenomicRunLengthArray]

Returns

‘GenomicData’