Sequence analysis

BioNumPy lets you very easily read sequence data (e.g. from a FASTA or FASTQ file) into NumPy-like data-structures:

import bionumpy as bnp
f = bnp.open("example_data/big.fq.gz")
chunk = f.read_chunk()

chunk is now an object that contains the names (header in the FASTQ file), qualities and sequences as EncodedRaggedArray.

print(chunk)
SequenceEntryWithQuality with 1000 entries
                     name                 sequence                  quality
  2fa9ee19-5c51-4281-a...  CGGTAGCCAGCTGCGTTCAG...  [10  5  5 12  5  4  3
  1f9ca490-2f25-484a-8...  GATGCATACTTCGTTCGATT...  [ 5  4  5  4  6  6  5
  06936a64-6c08-40e9-8...  GTTTTGTCGCTGCGTTCAGT...  [ 3  5  6  7  7  5  4
  d6a555a1-d8dd-4e55-9...  CGTATGCTTTGAGATTCATT...  [ 2  3  4  4  4  4  6
  91ca9c6c-12fe-4255-8...  CGGTGTACTTCGTTCCAGCT...  [ 4  3  5  6  3  5  6
  4dbe5037-abe2-4176-8...  GCAGGTGATGCTTTGGTTCA...  [ 2  3  4  6  7  7  6
  df3de4e9-48ca-45fc-8...  CATGCTTCGTTGGTTACCTC...  [ 5  5  5  4  7  7  7
  bfde9b59-2f6d-48e8-8...  CTGTTGTGCGCTTCGTTCAT...  [ 8  8 10  7  8  6  3
  dbcfd59a-7a96-46a2-9...  CGATTATTTGGTTCGTTCAT...  [ 5  4  2  3  5  2  2
  a0f83c4e-4c20-4c15-b...  GTTGTACTTTACGTTTCAAT...  [ 3  5 10  6  7  6  6

It is then easy to use NumPy-functionality on this data:

print("Number of Gs in the sequences:")
print(np.sum(chunk.sequence == "G"))
print("Number of sequences with mean quality > 30:")
print(np.sum(np.mean(chunk.quality, axis=1) > 10))
Number of Gs in the sequences:
53686
Number of sequences with mean quality > 30:
760

What to read next?