Numpy Interoperability

Using Numpy functionality

This is a collection of short examples showing how NumPy concepts like indexing, broadcasting and reductions can be used effectively in BioNumPy:

import bionumpy as bnp
import numpy as np
import matplotlib.pyplot as plt

#### Boolean indexing
intervals = bnp.open("example_data/small_interval.bed").read()

# Get intervals larger than 20
mask = (intervals.end-intervals.start) > 20
print(intervals[mask])


### slice-indexing
# Skip first and last 10 intervals and extract every 2nd
print(intervals[10:-10:2])


###  List indexing
reference_sequence = bnp.open("example_data/small_genome.fa.fai")["1"]

# Get indices of all Gs in reference
indices = np.flatnonzero(reference_sequence[:-1] == "G")

# Get all letters follwing a G
next_letters = reference_sequence[indices+1]

# Count the letters following a G
counts = bnp.count_encoded(bnp.as_encoded_sequence_array(next_letters, bnp.DNAEncoding))
print(counts)


### Broadcasting
# Get a one_hot_encoding of a sequence
one_hot_encoding = reference_sequence[:, np.newaxis] == "ACTG"
print(one_hot_encoding)


### Reductions
# Get reads from file
reads = bnp.open("example_data/big.fq.gz").read()

# Get proportion of G's in each sequence
g_count = (reads.sequence == "G").mean(axis=-1)
plt.hist(g_count); plt.show()

From NumPy to BioNumPy

If you have an array of already encoded sequences, you can convert it to a BioNumPy sequence array by instantiating a ‘EncodedArray’ object with the array and the encoding you used. Here is an example:

import bionumpy as bnp
import numpy as np

original_sequence = 'ACGTACGTACGT'
lookup = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
numpy_array = np.array([lookup[base] for base in original_sequence], dtype=np.uint8)
encoded_array = bnp.EncodedArray(numpy_array, bnp.DNAEncoding)
print(encoded_array)

ACGTACGTACGT

From BioNumPy to NumPy

The most common way to get a NumPy array from a EncodedArray object is to either count occurances, or creating masks of certain bases. Here is an example:

import bionumpy as bnp

encoded_array = bnp.as_encoded_array('ACGTACGAC', bnp.DNAEncoding)
mask = encoded_array == 'A'
counts = bnp.count_encoded(encoded_array)
print(mask)
print(counts.counts)

[ True False False False  True False False  True False]
[3 3 2 1]

If you want to access the underlying Numpy array, you can use the ‘raw()’ method:

import bionumpy as bnp

encoded_array = bnp.as_encoded_array('ACGTACGAC', bnp.DNAEncoding)
print(encoded_array.raw())

[0 1 2 3 0 1 2 0 1]