Reading files

There are three main ways of getting your data into memory in BioNumPy. Which method to choose depends mainly on how big your data is:

Method 1: Reading all your data at once

If you are working with a dataset that is small enough to fit into memory, you can read the whole file at once. The benefit of this approach is that you don’t need to deal with the data being split into chunks (as in the next two methods):

>>> import bionumpy as bnp
>>> data = bnp.open("example_data/test.bed").read()
>>> print(data)
Interval with 71 entries
               chromosome                    start                     stop
                       17                  7512371                  7512447
                       17                  7512377                  7512453
                       17                  7512393                  7512469
                       17                  7512420                  7512496
                       17                  7512422                  7512473
                       17                  7512425                  7512501
                       17                  7512428                  7512504
                       17                  7512474                  7512550
                       17                  7512537                  7512613
                       17                  7512559                  7512635

In the above example, data will be an NpDataClass object with various fields, depending on the file type. For instance, if you read a fastq file, you will be able to access the sequences (data.sequence) and the base qualities (data.quality).
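As a minimal sketch of this (using the example_data/reads.fq file that also appears in Method 2 below), you can read a fastq file and access its fields directly:

>>> reads = bnp.open("example_data/reads.fq").read()
>>> sequences = reads.sequence  # the read sequences
>>> qualities = reads.quality  # the base qualities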

Method 2: Reading a single chunk from a large file

If you have a large file and don’t want to read all of it, you can read a single chunk. This can be useful if you just want to get to know your data, or want to run an analysis on a small subset of it:

>>> file = bnp.open("example_data/reads.fq")
>>> chunk = file.read_chunk(min_chunk_size=10000000)

Here min_chunk_size is the number of bytes to read. If the file is small, you may get the whole file by setting min_chunk_size to a large number, but if you want to be sure that the whole file is read as one single chunk, you should use Method 1 instead.
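A chunk behaves like the object returned by read(), so its fields can be used in the same way. A minimal sketch, continuing from the example above (len is assumed here to give the number of entries in the chunk):

>>> n_reads = len(chunk)  # number of entries in this chunk
>>> chunk_qualities = chunk.quality  # base qualities for the reads in this chunk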

Method 3: Reading whole file as a stream of chunks

This is the preferred method for all large files (typically fasta, fastq and bam files). The idea is to read the whole file as a stream of chunks that can be iterated over. This way, only a single chunk is kept in memory at a time.

>>> file = bnp.open("example_data/big.fq.gz")
>>> for chunk in file.read_chunks(min_chunk_size=100000):
...    # do stuff with this chunk, e.g. take the mean
...    print(chunk.quality.mean())
11.243155401311078
11.799580504498538
11.447879005326635
11.753348856321052
11.67464738973286
12.154069194606311

Remember that when one chunk has been processed, that chunk is “lost” and the generator continues to the next chunk. A typical pattern is thus to write a function that does everything you want to do on one chunk, run that function on each chunk, and summarize the results in some way.
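A minimal sketch of that pattern, reusing the per-chunk quality mean from the example above (how you summarize is up to your analysis; note that an exact overall mean would have to weight each chunk by its number of bases, which this rough average does not):

>>> def process(chunk):
...     return chunk.quality.mean()
>>> file = bnp.open("example_data/big.fq.gz")
>>> per_chunk_means = [process(chunk) for chunk in file.read_chunks(min_chunk_size=100000)]
>>> rough_mean = sum(per_chunk_means) / len(per_chunk_means)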

Continue to the guide on Working with big data to see the recommended way of working with chunks in BioNumPy.