Using BioNumPy in your existing project

If you already have a Python project that reads biological datasets (e.g. fastq, bam or vcf files), you can often easily add BioNumPy to your project to speed up reading and parsing of such files.

All you need is to install BioNumPy, import it, and read your file with bnp.open:

import numpy as np
import bionumpy as bnp

# open your file, bnp.open automatically detects the file format
f = bnp.open("example_data/variants.vcf")
    # a chunk is an efficient representation of a chunk of many lines
for chunk in f.read_chunks():
            # we can iterate over the entries
    for single_entry in chunk.to_iter():
                    print(single_entry)
                    # and we can access things like, chromosome, position and so on
                    position = single_entry.position
VCFEntry(chromosome='chr1', position=883624, id='rs4970378', ref_seq='A', alt_seq='G', quality='.', filter='.', info='.')
VCFEntry(chromosome='chr1', position=887559, id='rs3748595', ref_seq='A', alt_seq='C', quality='.', filter='.', info='.')
VCFEntry(chromosome='chr1', position=887800, id='rs3828047', ref_seq='A', alt_seq='G', quality='.', filter='.', info='.')

The above example shows how you can read any file and iterate over the lines. This way, one can in most cases easily use BioNumPy as an inplace replacement for the existing code that iterates over lines from biological data files. However, iterating over lines is not very efficient, so we can do better by working directly on chunks of the file. See Reading files for more information.

Using BioNumPy with Pandas

Another way that BioNumPy can be used directly in existing projects is if you are using Pandas.

It is straightforward to convert a chunk of a file or a whole file to a Pandas DataFrame. This time, we read a fastq file:

import bionumpy as bnp
f = bnp.open("example_data/big.fq.gz")
for chunk in f.read_chunks():
        df = chunk.topandas()
        print(df)
                              name                       sequence                        quality
0    2fa9ee19-5c51-4281-abdd-ea...  CGGTAGCCAGCTGCGTTCAGTATGGA...  [10, 5, 5, 12, 5, 4, 3, 4,...
1    1f9ca490-2f25-484a-8972-d6...  GATGCATACTTCGTTCGATTTCGTTT...  [5, 4, 5, 4, 6, 6, 5, 6, 1...
2    06936a64-6c08-40e9-8a10-0f...  GTTTTGTCGCTGCGTTCAGTTTATGG...  [3, 5, 6, 7, 7, 5, 4, 3, 3...
3    d6a555a1-d8dd-4e55-936f-ad...  CGTATGCTTTGAGATTCATTCAGGAG...  [2, 3, 4, 4, 4, 4, 6, 6, 7...
4    91ca9c6c-12fe-4255-83cc-96...  CGGTGTACTTCGTTCCAGCTAGATTT...  [4, 3, 5, 6, 3, 5, 6, 5, 5...
..                             ...                            ...                            ...
995  2eef382a-21f7-4a5b-a8d8-64...  CGTTTGCGCTGGTTCATTTTATCGGT...  [2, 6, 2, 2, 4, 2, 3, 3, 4...
996  18949e40-d30d-49f7-8a1c-c2...  GCGTACTTCGTTCAGTTTCGGAAGTG...  [2, 2, 2, 3, 3, 3, 5, 7, 7...
997  f4aeadf5-174e-4974-aef1-8b...  CAGTAATACTTCGTTCCAGTTCTGGG...  [9, 6, 11, 10, 2, 3, 3, 3,...
998  6b3cb23e-3f71-435b-835f-78...  CTGTTGTACTTCGATTCATTCAGGTG...  [5, 3, 3, 5, 6, 4, 5, 3, 5...
999  d65b5418-65d5-4bf3-aac8-aa...  CGGTGACGCTGGTTTAAATCTAACGG...  [7, 3, 4, 3, 4, 2, 2, 3, 3...

[1000 rows x 3 columns]

Continue to see an overview of what you can do with bionumpy.

Using BioNumPy RaggedArray functionalities

Alternatively, if one is working with biological sequences and are already loaded into a list of strings, one can use the as_encoded_array function to convert the list of strings into a RaggedArray. This is useful if you want to make use of all the efficient ways of BioNumpy’s functionalities on handling non-uniform length sequences and computations on them. Below is an example. See all the available encodings in Encodings.

import bionumpy as bnp
my_sequences = ["TGTGCCAGCAGCGGGGATCGTAATCAGCCCCAGCATTTT",
                "TGCAGCGTCAAGGTCCAAGCTTTCTTT",
                "TGTGCCACCAGTGATTATTATTGGTACGAGCAGTACTTC"]
my_ragged_array = bnp.as_encoded_array(my_sequences, bnp.encodings.alphabet_encoding.DNAEncoding)
print(my_ragged_array)
print(my_ragged_array.shape)
TGTGCCAGCAGCGGGGATCGTAATCAGCCCCAGCATTTT
TGCAGCGTCAAGGTCCAAGCTTTCTTT
TGTGCCACCAGTGATTATTATTGGTACGAGCAGTACTTC
(3, array([39, 27, 39]))