Genomic Data

The bnp.genomic_data module contains functions for analysing data that belongs to a reference genome, e.g .bed, .bam, .vcf, .gtf files and so on. The main entry point is the Genome class that creates genomic data objects, either from file, or from BNPDataClass objects. The Genome file is most commonly instaniated from a .chrom.sizes or .fasta file, which contains information about the size of each chromosome in the genome.

API documentation

class bionumpy.genomic_data.Genome(chrom_sizes: ~typing.Dict[str, int], fasta_filename: str = None, sort_names=False, filter_function=<function Genome.<lambda>>)[source]

The entry point for working with genomic data. Supports reading data directly from file and into GenomicData objects, or creating GenomicData objects from BNPDataClass objects, such Interval, BedGraph or LocationEntry.

Supports both in memory or streamed/lazy data, controlled with the stream keyword. For streamed data it requires that the data in the file uses the same chromosome ordering as the genome. This will usually be the same as either what is provided in the chrom.sizes file, or a simple alphabetic ordering. If sort order errors occur, try using sort_names=True when creating the Genome object.

classmethod from_dict(chrom_sizes: Dict[str, int], *args, **kwargs) → Genome[source]

Create a Genome object from a dictionary of chromosome sizes

Parameters

chrom_sizes: Dict[str, int] args: Additional args to be passed to the Genome constructor kwargs: Additional kwargs to be passed to the Genome constructor

Returns

Genome

Examples

>>> import bionumpy as bnp
>>> bnp.Genome.from_dict({'chr1': 1000, 'chr2': 2000})
Genome(['chr1', 'chr2'])

classmethod from_file(filename: str, sort_names: bool = False, filter_function=<function ignore_underscores>) → Genome[source]

Read genome information from a ‘chrom.sizes’ or ‘fa.fai’ file

File should contain rows of names and lengths of chromosomes. If a fasta file is provided, a fasta index will be written to ‘fa.fai’ if not already theree. The sequence will then be available through, ‘genome.read_sequence()’ without any ‘filename’ parameter.

Parameters

cls : filename : str sort_names : bool Whether or not to sort the chromosome names

Returns

‘Genome’: A Genome object created from the read chromosome sizes

Examples

>>> import bionumpy as bnp
>>> bnp.Genome.from_file('example_data/hg38.chrom.sizes')
Genome(['chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10', '...'])

get_genome_context() → GenomeContext[source]: Get the genome context of the Genome Returns ——- GenomeContext

get_intervals(intervals: Interval, stranded: bool = False) → GenomicIntervals[source]

Get genomic intervals from interval data.

Parameters

intervalsInterval: Interval’s or stream of Intervals’s
strandedbool: Wheter or not the intervals are stranded

Returns

GenomicIntervals

Examples

>>> import bionumpy as bnp
>>> intervals = bnp.Interval(chromosome=['chr1', 'chr1', 'chr2'], start=[0, 10, 0], stop=[5, 15, 5])
>>> genome = bnp.Genome.from_dict({'chr1': 20, 'chr2': 10})
>>> genome.get_intervals(intervals)
Genomic Intervals on ['chr1', 'chr2']:
Interval with 3 entries
               chromosome                    start                     stop
                     chr1                        0                        5
                     chr1                       10                       15
                     chr2                        0                        5

get_locations(data: LocationEntry, has_numeric_chromosomes=False) → GenomicLocation[source]

Create `GenomicLocation`s from location entries

Parameters

dataLocationEntry: BNPDataClass of location entries

Returns

GenomicLocation

Examples

>>> import bionumpy as bnp
>>> genome = bnp.Genome.from_file('example_data/small_sequence.fa')
>>> genome.get_locations(bnp.datatypes.LocationEntry(chromosome=['chr1', 'chr1', 'chr2'], position=[0, 10, 0]))
Genomic Locations on ['chr1', 'chr2', 'chr3']:
LocationEntry with 3 entries
               chromosome                 position
                     chr1                        0
                     chr1                       10
                     chr2                        0

get_track(bedgraph: BedGraph) → GenomicArray[source]

Create a GenomicArray for the data in the bedgraph

Parameters

bedgraph : BedGraph

Returns

GenomicArray

Examples

>>> import bionumpy as bnp
>>> bedgraph = bnp.datatypes.BedGraph(chromosome=['chr1', 'chr1', 'chr2'], start=[0, 10, 0], stop=[5, 15, 5], value=[1, 2, 3])
>>> genome = bnp.Genome.from_dict({'chr1': 20, 'chr2': 10})
>>> genome.get_track(bedgraph)
chr1: [1 1 1 1 1 0 0 0 0 0 2 2 2 2 2 0 0 0 0 0]
chr2: [3 3 3 3 3 0 0 0 0 0]

read_annotation(filename: str) → GenomicAnnotation[source]

Read genomic annotation from a ‘gtf’ or ‘gff’ file

The annotation file should have ‘gene’, ‘transcript’ and ‘exon’ entries.

Parameters

filenamestr: Filename of the annotation file

Returns

GenomicAnnotation: GenomicAnnotation containing the genes, transcripts and exons

Examples

>>> import bionumpy as bnp
>>> genome = bnp.Genome.from_file('example_data/small_sequence.fa')
>>> genome.read_annotation('example_data/small.gtf')
GenomicAnnotation(genome_context=['chr1', 'chr2', 'chr3'], data=GTFEntry with 5 entries
               chromosome                   source             feature_type                    start                     stop                    score                   strand                    phase                atributes
                     chr1                knownGene               transcript                    17369                    17436                        .                        -                        .  gene_id "ENST0000061921
                     chr1                knownGene                     exon                    17369                    17436                        .                        -                        .  gene_id "ENST0000061921
                     chr1                knownGene               transcript                    29554                    31097                        .                        +                        .  gene_id "ENST0000047335
                     chr1                knownGene                     exon                    29554                    30039                        .                        +                        .  gene_id "ENST0000047335
                     chr1                knownGene                     exon                    30564                    30667                        .                        +                        .  gene_id "ENST0000047335)

read_intervals(filename: str, stranded: bool = False, stream: bool = False, buffer_type=None) → GenomicIntervals[source]

Read a bed file and represent it as GenomicIntervals

If stream is True then read the bedgraph in chunks and get a lazy evaluated GenomicTrack

Parameters

filenamestr: Filename for the bed file
strandedbool: Whether or not to treat the intervals as stranded
streambool: Wheter to read as a stream

Returns

GenomicIntervals

Examples

>>> import bionumpy as bnp
>>> genome = bnp.Genome.from_file('example_data/small_sequence.fa')
>>> genome.read_intervals('example_data/small_summits.bed')
Genomic Intervals on ['chr1', 'chr2', 'chr3']:
Interval with 13 entries
               chromosome                    start                     stop
                     chr1                      639                      640
                     chr1                     6023                     6024
                     chr1                     7124                     7125
                     chr2                      849                      850
                     chr2                     6320                     6321
                     chr2                     8483                     8484
                     chr2                    11342                    11343
                     chr2                    12527                    12528
                     chr2                    13092                    13093
                     chr2                    18943                    18944

read_locations(filename: str, stranded: bool = False, stream: bool = False, has_numeric_chromosomes=False, buffer_type=None) → GenomicLocation[source]

Read a set of locations from file and convert them to GenomicLocation

Locations can be treated as stranded or not stranded. If has_numeric_chromosomes then ‘chr’ will be prepended to all the chromosome names in order to make them compatible with the genome’s chromosome names

Parameters

filenamestr: Filename of the locations
strandedbool: Whether or not the locations are stranded
streambool: Whether the data should be read as a stream
has_numeric_chromosomesbool: Whether the file has the chromosomes as numbers

Returns

GenomicLocation: (Stranded) GenomicLocation

Examples

>>> import bionumpy as bnp
>>> genome = bnp.Genome.from_file('example_data/hg38.chrom.sizes')
>>> genome.read_locations('example_data/thousand_genomes.vcf', has_numeric_chromosomes=True)
Genomic Locations on ['chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10', '...']:
LocationEntry with 74 entries
               chromosome                 position
                    chr21                  5033883
                    chr21                  5035657
                    chr21                  5038297
                    chr21                  5038312
                    chr21                  5052250
                    chr21                  5053935
                    chr21                  5053961
                    chr21                  5063903
                    chr21                  5063916
                    chr21                  5064678

read_sequence(filename: str | None = None) → GenomicSequence[source]

Read the genomic sequence from file.

If a fasta file was used to create the Genome object, filename can be None in which case the sequence will be read from that fasta file

Parameters

filename : str

Returns

GenomicSequence

Examples

>>> import bionumpy as bnp
>>> genome = bnp.Genome.from_file('example_data/small_sequence.fa')
>>> genome.read_sequence()
GenomicSequence over chromosomes: ['chr1', 'chr2', 'chr3']
>>> genome = bnp.Genome.from_file('example_data/small.chrom.sizes')
>>> genome.read_sequence('example_data/small_sequence.fa')
GenomicSequence over chromosomes: ['chr1', 'chr2', 'chr3']

read_track(filename: str, stream: bool = False) → GenomicArray[source]

Read a bedgraph from file and convert it to a GenomicArray

If stream is True then read the bedgraph in chunks and get a lazy evaluated GenomicArray

Parameters

filenamestr: Filename for the bedgraph file
streambool: Whether or not to read as stream

Returns

GenomicArray

Examples

>>> import bionumpy as bnp
>>> genome = bnp.Genome.from_dict({'chr1': 30000, 'chr2': 31000, 'chr3': 32000})
>>> genome.read_track('example_data/small_treat_pileup.bdg')
chr1: [ 0.0 0.0 1.0 ... 0.0 0.0 0.0]
chr2: [ 0.0 0.0 0.0 ... 0.0 0.0 0.0]
chr3: [ 0.0 0.0 0.0 ... 0.0 0.0 0.0]

property size: int: The size of the genome

with_ignored_added(ignored: List[str]) → Genome[source]: Make a new GenomeContext with additional ignored chromosomes. This is useful for allowing but ignoring chromosome names that are not in the origin genome. Parameters ———- ignored: Iterable[str]

Returns

Genome

class bionumpy.genomic_data.GenomicArray[source]

Class for representing data on a genome.

classmethod from_bedgraph(bedgraph: BedGraph, genome_context: GenomeContextBase) → GenomicData[source]

Create a genomic array from a bedgraph and genomic context

Parameters

bedgraphBedGraph: Bedgraph with data along the genome
genome_contextGenomeContextBase: The genome context

Returns

‘GenomicData’

classmethod from_global_data(global_pileup: GenomicRunLengthArray, genome_context: GenomeContextBase) → GenomicArray[source]

Create the genomic array from data represented on a flat/concatenated genome.

Parameters

global_pileupGenomicRunLengthArray: Array over the concatenated genome
genome_contextGenomeContextBase: The genome context

Returns

‘GenomicArray’

class bionumpy.genomic_data.GenomicIntervals[source]

Class for representing intervals on a genome

abstract clip() → GenomicIntervals[source]

Clip the intervals so that they are contained in the genome

Returns

‘GenomicIntervals’: Clipped intervals

abstract extended_to_size(size: int) → GenomicIntervals[source]: Extend intervals along strand to reach the given size

Parameters

size : int

Returns

‘GenomicIntervals’

classmethod from_fields(genome_context: GenomeContextBase, chromosome: StringArray, start: ndarray, stop: ndarray, strand: EncodedArray | None = None) → GenomicIntervals[source]: Create genomic intervals from fields

Parameters

genome_context: GenomeContextBase chromosome: StringArray start: np.ndarray stop: np.ndarray strand: EncodedArray

Returns

GenomicIntervals

classmethod from_interval_stream(interval_stream: Iterable[Interval], genome_context: GenomeContextBase, is_stranded=False) → GenomicIntervals[source]: Create streamed genomic intervals from a stream of intervals and genome info

Parameters

interval_stream : Iterable[Interval] genome_context : GenomeContextBase is_stranded : bool

Returns

‘GenomicIntervals’

classmethod from_intervals(intervals: Interval, genome_context: GenomeContextBase, is_stranded: bool | None = False) → GenomicIntervals[source]: Create genomic intervals from interval entries and genome info

Parameters

intervals : Interval genome_context : GenomeContextBase is_stranded : bool

Returns

‘GenomicIntervals’

classmethod from_track(track: GenomicArray) → GenomicIntervals[source]: Return intervals of contigous areas of nonzero values of track

Parameters

track : GenomicArray

Returns

‘GenomicIntervals’

abstract get_mask() → GenomicArray[source]

Return a boolean mask of areas covered by any interval

Returns

GenomicArray: Genomic mask

abstract get_pileup() → GenomicArray[source]

Return a genmic track of counting the number of intervals covering each bp

Returns

GenomicArray: Pileup track

abstract merged(distance: int = 0) → GenomicIntervals[source]

Merge intervals that overlap or lie within distance of eachother

Parameters

distance : int

Returns

‘GenomicIntervals’: 4

class bionumpy.genomic_data.GenomicLocation[source]

Class representing (possibliy stranded) locations in the genome

classmethod from_data(data: BNPDataClass, genome_context: GenomeContextBase, is_stranded: bool = False, chromosome_name: str = 'chromosome', position_name: str = 'position', strand_name: str = 'strand') → GenomicLocation[source]

Create GenomicLocation object from a genome context and a bnpdataclass

The field names for the chromosome, positions and strand can be specified

Parameters

cls3: 4
dataBNPDataClass: The data containing the locations
genome_contextGenomeContextBase: Genome context for the genome
is_strandedbool: Whether or not the locations should be stranded
chromosome_namestr: Name of the chromosome field in data
position_namestr: Name of the position field in the data
strand_namestr: Name if the strand field int the data

Returns

‘GenomicLocation’

classmethod from_fields(genome_context: GenomeContextBase, chromosome: List[str], position: List[int], strand: List[str] = None) → GenomicLocation[source]

Create genomic location from a genome context and the needed fields (chromosome and position)

Parameters

genome_contextGenomeContextBase: Genome context object for the genome
chromosomeList[str]: List of chromosome
positionList[int]:: List of positions
strandList[str]: Optional list of strand

Returns

‘GenomicLocation’

class bionumpy.genomic_data.GenomicAnnotation(data, genome_context)[source]

Class to hold a genomic annotations. Basically just a holder for gene, transcript and exon intervals.

classmethod from_gtf_entries(gtf_entries: GTFEntry, genome_context: GenomeContextBase) → GenomicAnnotation[source]

Get a GenomicAnnotation from a set of gtf/gff entries

Parameters

gtf_entriesGTFEntry: BNPDataClass with gtf entries
genome_contextGenomeContextBase: The genome context

Returns

‘GenomicAnnotation’: Annotatin object holding the genes, transcripts and exons

class bionumpy.genomic_data.GenomicSequence(indexed_fasta: IndexedFasta, genome_context=None)[source]

Class to hold a genomic sequence as a GenomicArray

extract_intervals(intervals: Interval, stranded: bool = False) → EncodedRaggedArray[source]

Get the data within the (stranded) intervals

Parameters

intervalsInterval: Set of intervals
strandedbool: Wheter to reverse intervals on - strand

Returns

RunLengthRaggedArray: Data for all intervals

classmethod from_dict(sequence_dict: Dict[str, str])[source]: Create genomic data from a dict of data with chromosomes as keys

Parameters

d : Dict[str, GenomicRunLengthArray]

Returns

‘GenomicData’