Genomic Data

The bnp.genomic_data module contains functions for analysing data that belongs to a reference genome, e.g .bed, .bam, .vcf, .gtf files and so on. The main entry point is the Genome class that creates genomic data objects, either from file, or from BNPDataClass objects. The Genome file is most commonly instaniated from a .chrom.sizes or .fasta file, which contains information about the size of each chromosome in the genome.

API documentation

class bionumpy.genomic_data.Genome(chrom_sizes: ~typing.Dict[str, int], fasta_filename: str | None = None, sort_names=False, filter_function=<function Genome.<lambda>>)[source]

The entry point for working with genomic data. Supports reading data directly from file and into GenomicData objects, or creating GenomicData objects from BNPDataClass objects, such Interval, BedGraph or LocationEntry.

Supports both in memory or streamed/lazy data, controlled with the stream keyword. For streamed data it requires that the data in the file uses the same chromosome ordering as the genome. This will usually be the same as either what is provided in the chrom.sizes file, or a simple alphabetic ordering. If sort order errors occur, try using sort_names=True when creating the Genome object.

classmethod from_dict(chrom_sizes: Dict[str, int], *args, **kwargs) Genome[source]

Create a Genome object from a dictionary of chromosome sizes

Parameters

chrom_sizes: Dict[str, int] args: Additional args to be passed to the Genome constructor kwargs: Additional kwargs to be passed to the Genome constructor

Returns

Genome

Examples

>>> import bionumpy as bnp
>>> bnp.Genome.from_dict({'chr1': 1000, 'chr2': 2000})
Genome(['chr1', 'chr2'])
classmethod from_file(filename: str, sort_names: bool = False, filter_function=<function ignore_underscores>) Genome[source]

Read genome information from a ‘chrom.sizes’ or ‘fa.fai’ file

File should contain rows of names and lengths of chromosomes. If a fasta file is provided, a fasta index will be written to ‘fa.fai’ if not already theree. The sequence will then be available through, ‘genome.read_sequence()’ without any ‘filename’ parameter.

Parameters

cls : filename : str sort_names : bool Whether or not to sort the chromosome names

Returns

‘Genome’

A Genome object created from the read chromosome sizes

Examples

>>> import bionumpy as bnp
>>> bnp.Genome.from_file('example_data/hg38.chrom.sizes')
Genome(['chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10', '...'])
get_genome_context() GenomeContext[source]

Get the genome context of the Genome Returns ——- GenomeContext

get_intervals(intervals: Interval, stranded: bool = False) GenomicIntervals[source]

Get genomic intervals from interval data.

Parameters

intervalsInterval

Interval’s or stream of Intervals’s

strandedbool

Wheter or not the intervals are stranded

Returns

GenomicIntervals

Examples

>>> import bionumpy as bnp
>>> intervals = bnp.Interval(chromosome=['chr1', 'chr1', 'chr2'], start=[0, 10, 0], stop=[5, 15, 5])
>>> genome = bnp.Genome.from_dict({'chr1': 20, 'chr2': 10})
>>> genome.get_intervals(intervals)
Genomic Intervals on ['chr1', 'chr2']:
Interval with 3 entries
               chromosome                    start                     stop
                     chr1                        0                        5
                     chr1                       10                       15
                     chr2                        0                        5
get_locations(data: LocationEntry, has_numeric_chromosomes=False) GenomicLocation[source]

Create `GenomicLocation`s from location entries

Parameters

dataLocationEntry

BNPDataClass of location entries

Returns

GenomicLocation

Examples

>>> import bionumpy as bnp
>>> genome = bnp.Genome.from_file('example_data/small_sequence.fa')
>>> genome.get_locations(bnp.datatypes.LocationEntry(chromosome=['chr1', 'chr1', 'chr2'], position=[0, 10, 0]))
Genomic Locations on ['chr1', 'chr2', 'chr3']:
LocationEntry with 3 entries
               chromosome                 position
                     chr1                        0
                     chr1                       10
                     chr2                        0
get_track(bedgraph: BedGraph) GenomicArray[source]

Create a GenomicArray for the data in the bedgraph

Parameters

bedgraph : BedGraph

Returns

GenomicArray

Examples

>>> import bionumpy as bnp
>>> bedgraph = bnp.datatypes.BedGraph(chromosome=['chr1', 'chr1', 'chr2'], start=[0, 10, 0], stop=[5, 15, 5], value=[1, 2, 3])
>>> genome = bnp.Genome.from_dict({'chr1': 20, 'chr2': 10})
>>> genome.get_track(bedgraph)
chr1: [1 1 1 1 1 0 0 0 0 0 2 2 2 2 2 0 0 0 0 0]
chr2: [3 3 3 3 3 0 0 0 0 0]
read_annotation(filename: str) GenomicAnnotation[source]

Read genomic annotation from a ‘gtf’ or ‘gff’ file

The annotation file should have ‘gene’, ‘transcript’ and ‘exon’ entries.

Parameters

filenamestr

Filename of the annotation file

Returns

GenomicAnnotation

GenomicAnnotation containing the genes, transcripts and exons

Examples

>>> import bionumpy as bnp
>>> genome = bnp.Genome.from_file('example_data/small_sequence.fa')
>>> genome.read_annotation('example_data/small.gtf')
GenomicAnnotation(genome_context=['chr1', 'chr2', 'chr3'], data=GTFEntry with 5 entries
               chromosome                   source             feature_type                    start                     stop                    score                   strand                    phase                atributes
                     chr1                knownGene               transcript                    17369                    17436                        .                        -                        .  gene_id "ENST0000061921
                     chr1                knownGene                     exon                    17369                    17436                        .                        -                        .  gene_id "ENST0000061921
                     chr1                knownGene               transcript                    29554                    31097                        .                        +                        .  gene_id "ENST0000047335
                     chr1                knownGene                     exon                    29554                    30039                        .                        +                        .  gene_id "ENST0000047335
                     chr1                knownGene                     exon                    30564                    30667                        .                        +                        .  gene_id "ENST0000047335)
read_intervals(filename: str, stranded: bool = False, stream: bool = False, buffer_type=None) GenomicIntervals[source]

Read a bed file and represent it as GenomicIntervals

If stream is True then read the bedgraph in chunks and get a lazy evaluated GenomicTrack

Parameters

filenamestr

Filename for the bed file

strandedbool

Whether or not to treat the intervals as stranded

streambool

Wheter to read as a stream

Returns

GenomicIntervals

Examples

>>> import bionumpy as bnp
>>> genome = bnp.Genome.from_file('example_data/small_sequence.fa')
>>> genome.read_intervals('example_data/small_summits.bed')
Genomic Intervals on ['chr1', 'chr2', 'chr3']:
Interval with 13 entries
               chromosome                    start                     stop
                     chr1                      639                      640
                     chr1                     6023                     6024
                     chr1                     7124                     7125
                     chr2                      849                      850
                     chr2                     6320                     6321
                     chr2                     8483                     8484
                     chr2                    11342                    11343
                     chr2                    12527                    12528
                     chr2                    13092                    13093
                     chr2                    18943                    18944
read_locations(filename: str, stranded: bool = False, stream: bool = False, has_numeric_chromosomes=False, buffer_type=None) GenomicLocation[source]

Read a set of locations from file and convert them to GenomicLocation

Locations can be treated as stranded or not stranded. If has_numeric_chromosomes then ‘chr’ will be prepended to all the chromosome names in order to make them compatible with the genome’s chromosome names

Parameters

filenamestr

Filename of the locations

strandedbool

Whether or not the locations are stranded

streambool

Whether the data should be read as a stream

has_numeric_chromosomesbool

Whether the file has the chromosomes as numbers

Returns

GenomicLocation

(Stranded) GenomicLocation

Examples

>>> import bionumpy as bnp
>>> genome = bnp.Genome.from_file('example_data/hg38.chrom.sizes')
>>> genome.read_locations('example_data/thousand_genomes.vcf', has_numeric_chromosomes=True)
Genomic Locations on ['chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10', '...']:
LocationEntry with 74 entries
               chromosome                 position
                    chr21                  5033883
                    chr21                  5035657
                    chr21                  5038297
                    chr21                  5038312
                    chr21                  5052250
                    chr21                  5053935
                    chr21                  5053961
                    chr21                  5063903
                    chr21                  5063916
                    chr21                  5064678
read_sequence(filename: str | None = None) GenomicSequence[source]

Read the genomic sequence from file.

If a fasta file was used to create the Genome object, filename can be None in which case the sequence will be read from that fasta file

Parameters

filename : str

Returns

GenomicSequence

Examples

>>> import bionumpy as bnp
>>> genome = bnp.Genome.from_file('example_data/small_sequence.fa')
>>> genome.read_sequence()
GenomicSequence over chromosomes: ['chr1', 'chr2', 'chr3']
>>> genome = bnp.Genome.from_file('example_data/small.chrom.sizes')
>>> genome.read_sequence('example_data/small_sequence.fa')
GenomicSequence over chromosomes: ['chr1', 'chr2', 'chr3']
read_track(filename: str, stream: bool = False) GenomicArray[source]

Read a bedgraph from file and convert it to a GenomicArray

If stream is True then read the bedgraph in chunks and get a lazy evaluated GenomicArray

Parameters

filenamestr

Filename for the bedgraph file

streambool

Whether or not to read as stream

Returns

GenomicArray

Examples

>>> import bionumpy as bnp
>>> genome = bnp.Genome.from_dict({'chr1': 30000, 'chr2': 31000, 'chr3': 32000})
>>> genome.read_track('example_data/small_treat_pileup.bdg')
chr1: [ 0.0 0.0 1.0 ... 0.0 0.0 0.0]
chr2: [ 0.0 0.0 0.0 ... 0.0 0.0 0.0]
chr3: [ 0.0 0.0 0.0 ... 0.0 0.0 0.0]
property size: int

The size of the genome

with_ignored_added(ignored: List[str]) Genome[source]

Make a new GenomeContext with additional ignored chromosomes. This is useful for allowing but ignoring chromosome names that are not in the origin genome. Parameters ———- ignored: Iterable[str]

Returns

Genome

class bionumpy.genomic_data.GenomicArray[source]

Class for representing data on a genome.

classmethod from_bedgraph(bedgraph: BedGraph, genome_context: GenomeContextBase) GenomicData[source]

Create a genomic array from a bedgraph and genomic context

Parameters

bedgraphBedGraph

Bedgraph with data along the genome

genome_contextGenomeContextBase

The genome context

Returns

‘GenomicData’

classmethod from_global_data(global_pileup: GenomicRunLengthArray, genome_context: GenomeContextBase) GenomicArray[source]

Create the genomic array from data represented on a flat/concatenated genome.

Parameters

global_pileupGenomicRunLengthArray

Array over the concatenated genome

genome_contextGenomeContextBase

The genome context

Returns

‘GenomicArray’

class bionumpy.genomic_data.GenomicIntervals[source]

Class for representing intervals on a genome

abstract clip() GenomicIntervals[source]

Clip the intervals so that they are contained in the genome

Returns

‘GenomicIntervals’

Clipped intervals

abstract extended_to_size(size: int) GenomicIntervals[source]

Extend intervals along strand to reach the given size

Parameters

size : int

Returns

‘GenomicIntervals’

classmethod from_fields(genome_context: GenomeContextBase, chromosome: StringArray, start: ndarray, stop: ndarray, strand: EncodedArray | None = None) GenomicIntervals[source]

Create genomic intervals from fields

Parameters

genome_context: GenomeContextBase chromosome: StringArray start: np.ndarray stop: np.ndarray strand: EncodedArray

Returns

GenomicIntervals

classmethod from_interval_stream(interval_stream: Iterable[Interval], genome_context: GenomeContextBase, is_stranded=False) GenomicIntervals[source]

Create streamed genomic intervals from a stream of intervals and genome info

Parameters

interval_stream : Iterable[Interval] genome_context : GenomeContextBase is_stranded : bool

Returns

‘GenomicIntervals’

classmethod from_intervals(intervals: Interval, genome_context: GenomeContextBase, is_stranded: bool | None = False) GenomicIntervals[source]

Create genomic intervals from interval entries and genome info

Parameters

intervals : Interval genome_context : GenomeContextBase is_stranded : bool

Returns

‘GenomicIntervals’

classmethod from_track(track: GenomicArray) GenomicIntervals[source]

Return intervals of contigous areas of nonzero values of track

Parameters

track : GenomicArray

Returns

‘GenomicIntervals’

abstract get_mask() GenomicArray[source]

Return a boolean mask of areas covered by any interval

Returns

GenomicArray

Genomic mask

abstract get_pileup() GenomicArray[source]

Return a genmic track of counting the number of intervals covering each bp

Returns

GenomicArray

Pileup track

abstract merged(distance: int = 0) GenomicIntervals[source]

Merge intervals that overlap or lie within distance of eachother

Parameters

distance : int

Returns

‘GenomicIntervals’

4

class bionumpy.genomic_data.GenomicLocation[source]

Class representing (possibliy stranded) locations in the genome

classmethod from_data(data: BNPDataClass, genome_context: GenomeContextBase, is_stranded: bool = False, chromosome_name: str = 'chromosome', position_name: str = 'position', strand_name: str = 'strand') GenomicLocation[source]

Create GenomicLocation object from a genome context and a bnpdataclass

The field names for the chromosome, positions and strand can be specified

Parameters

cls3

4

dataBNPDataClass

The data containing the locations

genome_contextGenomeContextBase

Genome context for the genome

is_strandedbool

Whether or not the locations should be stranded

chromosome_namestr

Name of the chromosome field in data

position_namestr

Name of the position field in the data

strand_namestr

Name if the strand field int the data

Returns

‘GenomicLocation’

classmethod from_fields(genome_context: GenomeContextBase, chromosome: List[str], position: List[int], strand: List[str] | None = None) GenomicLocation[source]

Create genomic location from a genome context and the needed fields (chromosome and position)

Parameters

genome_contextGenomeContextBase

Genome context object for the genome

chromosomeList[str]

List of chromosome

positionList[int]:

List of positions

strandList[str]

Optional list of strand

Returns

‘GenomicLocation’

class bionumpy.genomic_data.GenomicAnnotation(data, genome_context)[source]

Class to hold a genomic annotations. Basically just a holder for gene, transcript and exon intervals.

classmethod from_gtf_entries(gtf_entries: GTFEntry, genome_context: GenomeContextBase) GenomicAnnotation[source]

Get a GenomicAnnotation from a set of gtf/gff entries

Parameters

gtf_entriesGTFEntry

BNPDataClass with gtf entries

genome_contextGenomeContextBase

The genome context

Returns

‘GenomicAnnotation’

Annotatin object holding the genes, transcripts and exons

class bionumpy.genomic_data.GenomicSequence(indexed_fasta: IndexedFasta, genome_context=None)[source]

Class to hold a genomic sequence as a GenomicArray

extract_intervals(intervals: Interval, stranded: bool = False) EncodedRaggedArray[source]

Get the data within the (stranded) intervals

Parameters

intervalsInterval

Set of intervals

strandedbool

Wheter to reverse intervals on - strand

Returns

RunLengthRaggedArray

Data for all intervals

classmethod from_dict(sequence_dict: Dict[str, str])[source]

Create genomic data from a dict of data with chromosomes as keys

Parameters

d : Dict[str, GenomicRunLengthArray]

Returns

‘GenomicData’