Encodings

A central concept in BioNumPy is the encoding of data, such as DNA sequences, base qualities and kmers, into memory-efficient data types that BioNumPy uses internally.

For instance, when you ask BioNumPy to store a DNA sequence with s = bnp.as_encoded_array("ACGT", bnp.DNAEncoding), BioNumPy does not store the letters A, C, G and T, but instead uses an efficient numeric representation of them. The user does not need to know how this works internally and can simply think of the stored sequence as letters. This is why expressions like s == "A" work, even though s is internally a numeric array.
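To make the idea concrete, the following is a minimal sketch of an alphabet encoding written with plain NumPy. It is not BioNumPy's implementation: the names ALPHABET, LOOKUP, encode and decode are hypothetical, and the letter order (A=0, C=1, G=2, T=3) is assumed to match bnp.DNAEncoding.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed letter order: A=0, C=1, G=2, T=3

# Lookup table from byte value to alphabet index; 255 marks unsupported characters.
LOOKUP = np.full(256, 255, dtype=np.uint8)
for index, letter in enumerate(ALPHABET):
    LOOKUP[ord(letter)] = index
    LOOKUP[ord(letter.lower())] = index  # accept lowercase input as well


def encode(sequence: str) -> np.ndarray:
    """Encode a string as an array of small integers (hypothetical helper)."""
    raw = np.frombuffer(sequence.encode("ascii"), dtype=np.uint8)
    codes = LOOKUP[raw]
    if np.any(codes == 255):
        raise ValueError(f"{sequence!r} contains characters outside {ALPHABET!r}")
    return codes


def decode(codes: np.ndarray) -> str:
    """Turn integer codes back into letters (hypothetical helper)."""
    return "".join(ALPHABET[code] for code in codes)


codes = encode("ACGT")
# Comparing with a letter means encoding the letter first, then comparing numbers.
is_a = codes == encode("A")[0]
```

Here codes holds the integers 0 to 3 and is_a is a boolean mask that is True only at the first position, which is the kind of result a comparison such as s == "A" produces.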

Encoding data using a specific encoding

When reading data with bnp.open, BioNumPy automatically encodes your data with a suitable encoding (determined by the file format). If you have data from other sources, e.g. as a string or a list of strings, you can use bnp.as_encoded_array to encode it. In that case, make sure you use an encoding suitable for your data (see the supported encodings below).

>>> import bionumpy as bnp
>>> sequences = ["ACCT", "AcaATA", "ca"]
>>> bnp.as_encoded_array(sequences, bnp.DNAEncoding)
encoded_ragged_array(['ACCT',
                      'ACAATA',
                      'CA'], AlphabetEncoding('ACGT'))

Supported encodings

These are the most common encodings that can be used:

  • bnp.BaseEncoding: Can encode any string

  • bnp.DNAEncoding: Supports A, C, T and G (not N)

  • bnp.encodings.alphabet_encoding.ACTGnEncoding: Supports N, A, C, T, G

  • bnp.encodings.alphabet_encoding.ACUGEncoding: Supports A, C, U, G

  • bnp.encodings.alphabet_encoding.RNAEncoding: Supports A, C, U, G

  • bnp.encodings.alphabet_encoding.AminoAcidEncoding: Supports all valid amino acids
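When it is unclear which encoding fits a set of strings, one option is to inspect the letters they contain before encoding. The sketch below is a hypothetical helper, not part of BioNumPy: it only returns the name of a candidate from the list above, the alphabets are copied from that list, and it assumes that lowercase letters are treated like uppercase ones.

```python
# Alphabets copied from the list above; the keys are only labels in this sketch.
CANDIDATE_ALPHABETS = {
    "DNAEncoding": set("ACGT"),
    "ACTGnEncoding": set("ACGTN"),
    "RNAEncoding": set("ACGU"),
}


def suggest_encoding(sequences):
    """Return the name of the first listed alphabet that covers all letters."""
    letters = set("".join(sequences).upper())
    for name, alphabet in CANDIDATE_ALPHABETS.items():
        if letters <= alphabet:
            return name
    # Fall back to the encoding that accepts any string.
    return "BaseEncoding"


suggestion = suggest_encoding(["ACCT", "AcaATA", "ca"])
```

The candidates are checked from the most restrictive alphabet to the least, so data that fits bnp.DNAEncoding is not assigned a larger alphabet than it needs.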

Using already encoded data (advanced usage)

In some cases, you may already have encoded data that you want to use with BioNumPy. You can then wrap your data in BioNumPy’s EncodedArray or EncodedRaggedArray class, but you need to make sure that your data is encoded correctly, as BioNumPy does not verify this.

For example, if you have already encoded DNA-sequences so that A is 0, C is 1, G is 2 and T is 3, your data is compatible with bnp.DNAEncoding:

>>> import numpy as np
>>> already_encoded = np.array([0, 1, 2, 3])
>>> bnp.EncodedArray(already_encoded, bnp.DNAEncoding)
encoded_array('ACGT', AlphabetEncoding('ACGT'))
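Since BioNumPy does not verify already encoded data, it can be worth checking the values yourself before wrapping them. The following is a minimal sketch using only NumPy; check_codes is a hypothetical helper, and the default alphabet size of 4 assumes the A=0, C=1, G=2, T=3 mapping described above.

```python
import numpy as np


def check_codes(codes, alphabet_size=4):
    """Return codes as an array after checking that every value is a valid index."""
    codes = np.asarray(codes)
    if not np.issubdtype(codes.dtype, np.integer):
        raise TypeError("encoded data must be an integer array")
    if codes.size and (codes.min() < 0 or codes.max() >= alphabet_size):
        raise ValueError(f"codes must lie in the range 0..{alphabet_size - 1}")
    return codes


# Passes the check, so it would be safe to wrap with the matching encoding.
already_encoded = check_codes([0, 1, 2, 3])
```

An array that passes this check can then be wrapped in the same way as in the example above.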