BioNumPy at a glance
BioNumPy is a Python library for easy and efficient representation and analysis of biological data. Because BioNumPy builds on the NumPy interface, anyone already used to NumPy or array programming should find it easy to get started.
With BioNumPy, our goal is that everyone should be able to write simple, clean code that scales well to large biological datasets.
Installing BioNumPy
pip install bionumpy
Analyze your biosequences like numerical vectors in NumPy:
>>> import numpy as np
>>> import bionumpy as bnp
>>> reads = bnp.open("example_data/small.fa").read()
>>> reads
SequenceEntry with 3 entries
name sequence
read1 ACACATCACAGCTACGACGA...
read2 AACACTTGGGGGGGGGGGGG...
read3 AACTGGACTAGCGACGTACT...
>>> gc_content = np.mean((reads.sequence == "C") | (reads.sequence == "G"))
>>> gc_content
0.5526315789473685
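For readers without the example file at hand, the same idea can be sketched in plain NumPy. This is only an illustration of the mask-and-mean pattern, not BioNumPy's implementation: the reads in `toy_reads` are hypothetical, and the flattening and per-read bookkeeping are assumptions made for the sketch.

```python
import numpy as np

# Hypothetical toy reads; NOT the contents of example_data/small.fa.
toy_reads = ["ACGT", "GGCCA", "TTTA"]

# Flatten all bases into a single uint8 array of ASCII codes.
flat = np.frombuffer("".join(toy_reads).encode("ascii"), dtype=np.uint8)

# Element-wise comparisons give a boolean mask with one entry per base.
is_gc = (flat == ord("C")) | (flat == ord("G"))

# The mean of a boolean mask is the fraction of True entries,
# i.e. the GC content across all reads.
gc_content = is_gc.mean()

# Per-read GC content: sum the mask within each read's span.
lengths = np.array([len(r) for r in toy_reads])
starts = np.concatenate(([0], np.cumsum(lengths)[:-1]))
per_read_gc = np.add.reduceat(is_gc.astype(np.int64), starts) / lengths

print(gc_content)
print(per_read_gc)
```

The BioNumPy one-liner above expresses the overall computation directly on `reads.sequence`, without the manual flattening done here.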
What can you do with BioNumPy?
The main philosophy behind BioNumPy is that you should be able to efficiently read biological datasets into NumPy-like data structures and then analyse the data using NumPy-like methods, such as indexing, slicing, broadcasting and other vectorized operations (sum, mean, etc.). Since NumPy arrays are used to store the data, BioNumPy has a low memory footprint and its operations are efficient. This makes BioNumPy suitable for working with large datasets, and an alternative to libraries and tools written in lower-level languages such as C and C++.
The core components of BioNumPy are highly generic and thus not limited to any particular type of analysis. A layer of complementary convenience functionality is nevertheless included for some major types of biosequence applications, which can be a useful starting point:
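As a rough illustration of the array-programming style described above, the following plain-NumPy sketch applies slicing, boolean indexing and broadcasting to a small matrix of sequences. The sequences and all variable names are hypothetical, and the fixed-length character matrix is an assumption made for simplicity; it is not how BioNumPy stores data.

```python
import numpy as np

# Hypothetical equal-length toy sequences, stored one character per cell.
seqs = ["ACGT", "GGGT", "ACTT"]
matrix = np.array([list(s) for s in seqs])  # shape (3, 4)

# Slicing: the first two bases of every sequence.
prefixes = matrix[:, :2]

# Boolean indexing: keep only the sequences that start with "A".
starts_with_a = matrix[:, 0] == "A"
selected = matrix[starts_with_a]

# Broadcasting: compare every base against the whole alphabet at once.
# (3, 4, 1) == (4,) broadcasts to (3, 4, 4): sequence x position x letter.
alphabet = np.array(list("ACGT"))
one_hot = matrix[:, :, None] == alphabet

# Vectorized reduction: count each letter at each position.
position_counts = one_hot.sum(axis=0)  # shape (4 positions, 4 letters)

print(selected)
print(position_counts)
```

No explicit Python loop runs over the bases; each step is a single whole-array operation.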
Sequence analysis
Reading and analysing DNA and protein sequences
Kmers
Analysing sequence patterns such as kmers, minimizers and motifs
Genomic Data
Analysing data positioned on a genome (intervals, variants, annotations, etc.)
Multiomics
Combining data-sets from multiple sources/domains
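To give a flavour of how a pattern analysis such as kmer counting can be phrased as array operations, here is a minimal sketch in plain NumPy. It does not use or reproduce BioNumPy's own kmer functionality; the sequence, the 2-bit encoding and the base-4 hashing are assumptions chosen for the illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Hypothetical toy sequence and kmer size.
seq = "ACGTACGA"
k = 3

# Map ASCII codes of A, C, G, T to the integers 0..3 via a lookup table.
lookup = np.full(256, -1, dtype=np.int64)
lookup[np.frombuffer(b"ACGT", dtype=np.uint8)] = np.arange(4)
codes = lookup[np.frombuffer(seq.encode("ascii"), dtype=np.uint8)]

# Every window of k consecutive bases, as a (len(seq) - k + 1, k) view.
windows = sliding_window_view(codes, k)

# Interpret each window as a base-4 number to get one integer id per kmer.
powers = 4 ** np.arange(k - 1, -1, -1)  # [16, 4, 1] for k = 3
kmer_ids = windows @ powers

# Count how often each of the 4**k possible kmers occurs.
counts = np.bincount(kmer_ids, minlength=4 ** k)

print(kmer_ids)
print(counts[6])  # occurrences of "ACG", whose id is 0*16 + 1*4 + 2 = 6
```

The sketch assumes the sequence contains only A, C, G and T; any other character would map to -1 and produce meaningless ids.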
What BioNumPy is not
BioNumPy is not meant to be a broad catalog of specific algorithms and data structures that are useful for the biosequence domain.
BioNumPy also does not directly interface with the many useful data repositories in the field.
For the above purposes we refer to libraries like Biopython, which already provides a broad range of specific functionality. The BioNumPy documentation includes many examples of how to employ BioNumPy for core data representation and processing, while interacting with Biopython for specific needs.
BioNumPy also does not provide any tailored plotting or visualisation functionality for the biosequence domain.
Instead, the flexible data operations on BioNumPy objects make it easy to compute representations that can be visualised using generic libraries like Plotly.