Sequence analysis
BioNumPy lets you very easily read sequence data (e.g. from a FASTA or FASTQ file) into NumPy-like data-structures:
import bionumpy as bnp
f = bnp.open("example_data/big.fq.gz")
chunk = f.read_chunk()
chunk is now an object that contains the names (header in the FASTQ file), qualities and sequences as EncodedRaggedArray.
print(chunk)
SequenceEntryWithQuality with 1000 entries
name sequence quality
2fa9ee19-5c51-4281-a... CGGTAGCCAGCTGCGTTCAG... [10 5 5 12 5 4 3
1f9ca490-2f25-484a-8... GATGCATACTTCGTTCGATT... [ 5 4 5 4 6 6 5
06936a64-6c08-40e9-8... GTTTTGTCGCTGCGTTCAGT... [ 3 5 6 7 7 5 4
d6a555a1-d8dd-4e55-9... CGTATGCTTTGAGATTCATT... [ 2 3 4 4 4 4 6
91ca9c6c-12fe-4255-8... CGGTGTACTTCGTTCCAGCT... [ 4 3 5 6 3 5 6
4dbe5037-abe2-4176-8... GCAGGTGATGCTTTGGTTCA... [ 2 3 4 6 7 7 6
df3de4e9-48ca-45fc-8... CATGCTTCGTTGGTTACCTC... [ 5 5 5 4 7 7 7
bfde9b59-2f6d-48e8-8... CTGTTGTGCGCTTCGTTCAT... [ 8 8 10 7 8 6 3
dbcfd59a-7a96-46a2-9... CGATTATTTGGTTCGTTCAT... [ 5 4 2 3 5 2 2
a0f83c4e-4c20-4c15-b... GTTGTACTTTACGTTTCAAT... [ 3 5 10 6 7 6 6
It is then easy to use NumPy-functionality on this data:
print("Number of Gs in the sequences:")
print(np.sum(chunk.sequence == "G"))
print("Number of sequences with mean quality > 30:")
print(np.sum(np.mean(chunk.quality, axis=1) > 10))
Number of Gs in the sequences:
53686
Number of sequences with mean quality > 30:
760
What to read next?
Tutorial: Quality analysis of FASTQ files
API documentation on the sequence module