Getting read pileup inside peaks (intervals)
This is a brief example of how you can you can analyse alignments (pileup) inside a given set of regions. This example assumes that:
You have a set of alignments as a bam or bed file
You have a set of regions in another bed-file
In these scenarios it is common to have a bed-file that easily fits into memory (e.g. binding sites) and a large BAM file that we don’t want to keep in memory. As long as these are sorted, BioNumPy will handle streaming of the files and make sure that only data for a single chromosome is processed in memory at once. We can specify the behaviour by using stream=True/False when reading the data.
We start by reading our genome as the rest of the data depends on the genome (chromosome sizes).
>>> import numpy as np
>>> import bionumpy as bnp
>>> genome = bnp.Genome.from_file("example_data/hg38.chrom.sizes")
>>> print(genome)
Chromosome Size
chr1 248956422
chr2 242193529
chr3 198295559
chr4 190214555
chr5 181538259
chr6 170805979
chr7 159345973
chr8 145138636
chr9 138394717
chr10 133797422
Using this genome object, we can now read our peaks and alignemnts. We add stream=True to tell BioNumPy to stream all data, i.e. only keep the data for a single chromosome in memory at the same time.
>>> peaks = genome.read_intervals("example_data/ctcf_chr21-22.bed.gz", stream=True)
>>> reads = genome.read_intervals("example_data/ctcf_chr21-22.bam", stream=True)
The peaks and reads are now intervals. Using the .get_pileup() method, we can create a GenomeArray pileup, which lets us easily query the pileup value at any position along the genome. A GenomeArray can be thought of as an array along the genome.
>>> read_pileup = reads.get_pileup()
We can then subset this pileup where we have peaks
>>> peaks_pileup = read_pileup[peaks]
… get e.g. the max pileup value inside each peak:
>>> max_pileup_value_per_peak = np.max(peaks_pileup, axis=1)
If you try to print max_ipleup_value_per_peak, you will see that no data is shown. Instead you get a ComputationNode object. This is because since we used stream=True, no computations have been performed yet. BioNumPy waits until you ask for results by calling .compute() on a ComputationNode object, and then finds out how to correctly stream the data and do the necessary jobs for creating the data you ask for.
Having the max peak values, we can easily filter the original peaks, e.g. get the peaks with max pileup value above some threshold:
>>> high_peaks = peaks[max_pileup_value_per_peak > 4]
If we now call .compute() on the final peaks, BioNumPy performs all the necessary steps and gives us a final set of peaks:
>>> print(high_peaks.compute())
Genomic Intervals on ['chr1', 'chr2', 'chr3', 'chr4', 'chr5', '...']:
Interval with 362 entries
chromosome start stop
chr21 13980409 13980679
chr21 37460251 37460521
chr21 31933717 31933987
chr21 29164384 29164654
chr21 34351806 34352076
chr21 25708519 25708789
chr21 32403198 32403468
chr21 30434098 30434368
chr21 30044653 30044724
chr21 31874156 31874275