Intervals
Working with intervals is a central part of many bioinformatics analyses. In BioNumPy, an interval is any object (bnpdataclass) having a start and stop attribute and optionally a strand attribute. We will typically get intervals by reading them from files such as .bed, .gtf or .narrowPeaks. Or by converting alignments to reference intervals by alignment.to_interval. The functionality in BioNumPy for handling intervals can be divided into two operating on intervals as independent objects, or operations with some sort of dependencies:
Independent Operations
Independent operations covers any thing we want to do with a set of intervals that can be done one interval at a time. This covers simple geometric operations such as shifting and resizing; filtering based on properties of intervals. Since both the start and stop (and strand) attribute are NumPy arrays, we can implement a lot of functionality simply by using NumPy indexing and ufuncs:
>>> import bionumpy as bnp
>>> intervals = bnp.open("example_data/small_interval.bed").read()
>>> extended_right = bnp.replace(intervals, stop=intervals.stop+10)
>>> shifted = bnp.replace(intervals, start=intervals.start+5, stop=intervals.stop+5)
>>> small = intervals[(intervals.stop-intervals.start)<50]
Even though it is possible to write these kinds of functions using simple NumPy functionality, it is often better to wrap them in functions with instructive names. In the bnp.interval module, a set of utility funcitons ffor working with intervals is provided.
Operations with dependencies
Some queries cannot be answered just by looking at intervals as independent entries. Consider the query “all regions covered by at least two intervals”. In order to answer such questions, we need to look at the intervals as a coherent set. In BioNumPy, such queries are done by reducing all the relevant information from the intervals into a npstructures.RunLengthArray. The two main ways of doing this is to do make a pileup, i.e. counting how many intevals overlap with each position in the genome, or to make a boolean mask indicating which positions are covered by any interval.
>>> from bionumpy.datatypes import Interval
>>> from bionumpy.arithmetics import get_boolean_mask, get_pileup
>>> intervals = Interval(["chr1"]*3, [10, 15, 20], [13, 22, 21])
>>> get_boolean_mask(intervals, 23)
array([False, False, False, False, False, False, False, False, False,
False, True, True, True, False, False, True, True, True,
True, True, True, True, False])
>>> get_pileup(intervals, 23)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 2,
1, 0])
These run length encoded arrays stores the data in a smart way, but are supposed to work a much like a NumPy array as possible. So we can do NumPy-like operations on them to achieve common goals:
Working with boolean masks
The boolean mask representation is good to use when you want to filter a set of intervals or positions based on overlap with another set of intervals. We will examplify that here with a use of: a set of genes, a set of aligned reads and a set of variants.
>>> from bionumpy.datatypes import Bed6, Variant
>>> genes = Bed6(["chr1"]*3, [10, 15, 20], [13, 22, 21], ["gene_a", "gene_b", "gene_c"], ["."]*3, ["+", "-", "+"])
>>> genes
Bed6 with 3 entries
chromosome start stop name score strand
chr1 10 13 gene_a . +
chr1 15 22 gene_b . -
chr1 20 21 gene_c . +
>>> reads = Bed6(["chr1"]*5, [2, 3, 5, 7, 11], [4, 6, 8,10,12], [f"read{i}" for i in range(5)], ["."]*5, ["+", "-", "+", "-", "+"])
>>> reads
Bed6 with 5 entries
chromosome start stop name score strand
chr1 2 4 read0 . +
chr1 3 6 read1 . -
chr1 5 8 read2 . +
chr1 7 10 read3 . -
chr1 11 12 read4 . +
>>> variants = Variant(["chr1"]*4, [2, 4, 11, 16], ["A", "C", "G", "T"], ["G", "C", "T", "A"])
>>> variants
Variant with 4 entries
chromosome position ref_seq alt_seq
chr1 2 A G
chr1 4 C C
chr1 11 G T
chr1 16 T A
Let’s say we now want to get all the variants that lie within a gene, and is also covered by at least one read. We would then make a boolean mask for both the genes and the reads:
>>> gene_mask = get_boolean_mask(genes, 23)
>>> read_mask = get_boolean_mask(reads, 23)
We can now combine these together using the and operator to get the mask where both are covered:
>>> gene_and_read_mask = gene_mask & read_mask
Finally, we want to find all the variants that lie within this mask:
>>> covered_variants = variants[gene_and_read_mask[variants.position]]
>>> covered_variants
Variant with 1 entries
chromosome position ref_seq alt_seq
chr1 11 G T
We see that only one variant is covered both by a gene and a read.
Genomic Intervals
When working with genomic intervals, we often want to only deal intervals from one chromsome at the time. In order to do this, we can use the bnp.groupby function on interval entries.
>>> import bionumpy as bnp
>>> intervals = bnp.open("example_data/small_interval.bed").read()
>>> print(intervals)
Interval with 50 entries
chromosome start stop
0 13 18
0 37 46
0 62 83
0 105 126
0 129 130
1 3 21
1 41 65
1 91 114
1 131 153
1 157 168
>>> for chromosome, data in bnp.groupby(intervals, "chromosome"):
... print(f"---Chromosome: {chromosome}---")
... print(data)
---Chromosome: 0---
Interval with 5 entries
chromosome start stop
0 13 18
0 37 46
0 62 83
0 105 126
0 129 130
---Chromosome: 1---
Interval with 10 entries
chromosome start stop
1 3 21
1 41 65
1 91 114
1 131 153
1 157 168
1 174 201
1 213 230
1 240 268
1 290 315
1 319 339
---Chromosome: 2---
Interval with 15 entries
chromosome start stop
2 2 16
2 44 49
2 77 101
2 108 127
2 135 154
2 163 165
2 173 177
2 201 214
2 242 268
2 292 320
---Chromosome: 3---
Interval with 20 entries
chromosome start stop
3 7 34
3 58 82
3 95 101
3 130 138
3 150 170
3 188 211
3 234 261
3 283 302
3 325 352
3 353 362