Numpy Interoperability
Using Numpy functionality
This is a collection of short examples showing how NumPy concepts like indexing, broadcasting and reductions can be used effectively in BioNumPy:
import bionumpy as bnp
import numpy as np
import matplotlib.pyplot as plt
#### Boolean indexing
intervals = bnp.open("example_data/small_interval.bed").read()
# Get intervals larger than 20
mask = (intervals.end-intervals.start) > 20
print(intervals[mask])
### slice-indexing
# Skip first and last 10 intervals and extract every 2nd
print(intervals[10:-10:2])
### List indexing
reference_sequence = bnp.open("example_data/small_genome.fa.fai")["1"]
# Get indices of all Gs in reference
indices = np.flatnonzero(reference_sequence[:-1] == "G")
# Get all letters follwing a G
next_letters = reference_sequence[indices+1]
# Count the letters following a G
counts = bnp.count_encoded(bnp.as_encoded_sequence_array(next_letters, bnp.DNAEncoding))
print(counts)
### Broadcasting
# Get a one_hot_encoding of a sequence
one_hot_encoding = reference_sequence[:, np.newaxis] == "ACTG"
print(one_hot_encoding)
### Reductions
# Get reads from file
reads = bnp.open("example_data/big.fq.gz").read()
# Get proportion of G's in each sequence
g_count = (reads.sequence == "G").mean(axis=-1)
plt.hist(g_count); plt.show()
From NumPy to BioNumPy
If you have an array of already encoded sequences, you can convert it to a BioNumPy sequence array by instantiating a ‘EncodedArray’ object with the array and the encoding you used. Here is an example:
import bionumpy as bnp
import numpy as np
original_sequence = 'ACGTACGTACGT'
lookup = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
numpy_array = np.array([lookup[base] for base in original_sequence], dtype=np.uint8)
encoded_array = bnp.EncodedArray(numpy_array, bnp.DNAEncoding)
print(encoded_array)
ACGTACGTACGT
From BioNumPy to NumPy
The most common way to get a NumPy array from a EncodedArray object is to either count occurances, or creating masks of certain bases. Here is an example:
import bionumpy as bnp
encoded_array = bnp.as_encoded_array('ACGTACGAC', bnp.DNAEncoding)
mask = encoded_array == 'A'
counts = bnp.count_encoded(encoded_array)
print(mask)
print(counts.counts)
[ True False False False True False False True False]
[3 3 2 1]
If you want to access the underlying Numpy array, you can use the ‘raw()’ method:
import bionumpy as bnp
encoded_array = bnp.as_encoded_array('ACGTACGAC', bnp.DNAEncoding)
print(encoded_array.raw())
[0 1 2 3 0 1 2 0 1]