.. _intervals: =========== Intervals =========== Working with intervals is a central part of many bioinformatics analyses. In BioNumPy, an interval is any object (`bnpdataclass`) having a `start` and `stop` attribute and optionally a `strand` attribute. We will typically get intervals by reading them from files such as `.bed`, `.gtf` or `.narrowPeaks`. Or by converting alignments to reference intervals by `alignment.to_interval`. The functionality in BioNumPy for handling intervals can be divided into two operating on intervals as independent objects, or operations with some sort of dependencies: Independent Operations ====================== Independent operations covers any thing we want to do with a set of intervals that can be done one interval at a time. This covers simple geometric operations such as shifting and resizing; filtering based on properties of intervals. Since both the start and stop (and strand) attribute are numpy arrays, we can implement alot of functionality simply by using numpy indexing and ufuncs: >>> import bionumpy as bnp >>> intervals = bnp.open("example_data/small_interval.bed").read() >>> extended_right = bnp.replace(intervals, stop=intervals.stop+10) >>> shifted = bnp.replace(intervals, start=intervals.start+5, stop=intervals.stop+5) >>> small = intervals[(intervals.stop-intervals.start)<50] Even though it is possible to write these kinds of functions using simple numpy functionality, it is often better to wrap them in functions with instructive names. In the `bnp.interval` module, a set of utility funcitons ffor working with intervals is provided. Operations with dependencies ============================ Some queries cannot be answered just by looking at intervals as independent entries. For instance: "All regions" covered by at least two intervals". In order to answer such questions, we need to look at the intervals as a coherent set. In BioNumPy, such queries are by reducing all the relevant information from the intervals into a `npstructures.RunLengthArray`. The two main ways of doing this is to do make a pileup, i.e. counting how many intevals overlap with each position in the genome, or to make a boolean mask indicating which positions are covered by any interval. >>> from bionumpy.datatypes import Interval >>> from bionumpy.arithmetics import get_boolean_mask, get_pileup >>> intervals = Interval(["chr1"]*3, [10, 15, 20], [13, 22, 21]) >>> get_boolean_mask(intervals, 23) array([False, False, False, False, False, False, False, False, False, False, True, True, True, False, False, True, True, True, True, True, True, True, False]) >>> get_pileup(intervals, 23) array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 2, 1, 0]) These run length encoded arrays stores the data in a smart way, but are supposed to work a much like a numpy array as possible. So we can do numpy-like operations on them to achieve common goals: Working with boolean masks -------------------------- The boolean mask representation is good to use when you want to filter a set of intervals or positions based on overlap with another set of intervals. We will examplify that here with a use of: a set of genes, a set of aligned reads and a set of variants. >>> from bionumpy.datatypes import Bed6, Variant >>> genes = Bed6(["chr1"]*3, [10, 15, 20], [13, 22, 21], ["gene_a", "gene_b", "gene_c"], ["."]*3, ["+", "-", "+"]) >>> genes Bed6 with 3 entries chromosome start stop name score strand chr1 10 13 gene_a . + chr1 15 22 gene_b . - chr1 20 21 gene_c . + >>> reads = Bed6(["chr1"]*5, [2, 3, 5, 7, 11], [4, 6, 8,10,12], [f"read{i}" for i in range(5)], ["."]*5, ["+", "-", "+", "-", "+"]) >>> reads Bed6 with 5 entries chromosome start stop name score strand chr1 2 4 read0 . + chr1 3 6 read1 . - chr1 5 8 read2 . + chr1 7 10 read3 . - chr1 11 12 read4 . + >>> variants = Variant(["chr1"]*4, [2, 4, 11, 16], ["A", "C", "G", "T"], ["G", "C", "T", "A"]) >>> variants Variant with 4 entries chromosome position ref_seq alt_seq chr1 2 A G chr1 4 C C chr1 11 G T chr1 16 T A Let's say we now want to get all the variants that lie within a gene, and is also covered by at least one read. We would then make a boolean mask for both the genes and the reads: >>> gene_mask = get_boolean_mask(genes, 23) >>> read_mask = get_boolean_mask(reads, 23) We can now combine these together using the `and` operator to get the mask where both are covered: >>> gene_and_read_mask = gene_mask & read_mask Finally, we want to find all the variants that lie within this mask: >>> covered_variants = variants[gene_and_read_mask[variants.position]] >>> covered_variants Variant with 1 entries chromosome position ref_seq alt_seq chr1 11 G T We see that only one variant is covered both by a gene and a read. Genomic Intervals ================= When working with genomic intervals, we often want to only deal intervals from one chromsome at the time. In order to do this, we can use the `bnp.groupby` function on interval entries. >>> import bionumpy as bnp >>> intervals = bnp.open("example_data/small_interval.bed").read() >>> print(intervals) Interval with 50 entries chromosome start stop 0 13 18 0 37 46 0 62 83 0 105 126 0 129 130 1 3 21 1 41 65 1 91 114 1 131 153 1 157 168 >>> for chromosome, data in bnp.groupby(intervals, "chromosome"): ... print(f"---Chromosome: {chromosome}---") ... print(data) ---Chromosome: 0--- Interval with 5 entries chromosome start stop 0 13 18 0 37 46 0 62 83 0 105 126 0 129 130 ---Chromosome: 1--- Interval with 10 entries chromosome start stop 1 3 21 1 41 65 1 91 114 1 131 153 1 157 168 1 174 201 1 213 230 1 240 268 1 290 315 1 319 339 ---Chromosome: 2--- Interval with 15 entries chromosome start stop 2 2 16 2 44 49 2 77 101 2 108 127 2 135 154 2 163 165 2 173 177 2 201 214 2 242 268 2 292 320 ---Chromosome: 3--- Interval with 20 entries chromosome start stop 3 7 34 3 58 82 3 95 101 3 130 138 3 150 170 3 188 211 3 234 261 3 283 302 3 325 352 3 353 362