.. _reading_files:

Reading files
-------------

There are three main ways of getting your data into memory in BioNumPy. Which
way you choose mainly depends on how big your data is:

Method 1: Reading all your data at once
=======================================

If you are working with datasets that are small enough to fit into memory, you
can read the whole file. The benefit of this approach is that you don't need to
take into account that the data has been split into chunks (as in the next two
methods):

>>> import bionumpy as bnp
>>> data = bnp.open("example_data/test.bed").read()
>>> print(data)
Interval with 71 entries
               chromosome                    start                     stop
                       17                  7512371                  7512447
                       17                  7512377                  7512453
                       17                  7512393                  7512469
                       17                  7512420                  7512496
                       17                  7512422                  7512473
                       17                  7512425                  7512501
                       17                  7512428                  7512504
                       17                  7512474                  7512550
                       17                  7512537                  7512613
                       17                  7512559                  7512635

In the above example, `data` will be an NpDataClass object with various fields,
depending on the file type. For instance, if you read a fastq file, you will be
able to access the sequences (`data.sequence`) and the base qualities
(`data.quality`).

Method 2: Reading a single chunk from a large file
==================================================

If you have a large file and don't want to read all of it, you can read a
single chunk. This can be useful if you only want to get to know your data or
perform an analysis on a small subset of your data:

>>> file = bnp.open("example_data/reads.fq")
>>> chunk = file.read_chunk(min_chunk_size=10000000)

Here `min_chunk_size` is the minimum number of bytes to read. If the file is
small enough, you may get the whole file by setting the chunk size to a large
number, but if you want to be sure that you read the whole file as one single
chunk, you should choose method 1 instead.

Method 3: Reading a whole file as a stream of chunks
====================================================

This is the preferred way to work with large files (typically fasta, fastq and
bam files). The idea is to read the whole file as a stream of chunks that can
be iterated over. This way, only a single chunk is kept in memory at a time.

>>> file = bnp.open("example_data/big.fq.gz")
>>> for chunk in file.read_chunks(min_chunk_size=100000):
...     # do stuff with this chunk, e.g. take the mean of the base qualities
...     print(chunk.quality.mean())
11.243155401311078
11.799580504498538
11.447879005326635
11.753348856321052
11.67464738973286
12.154069194606311

Remember that when one chunk has been processed, that chunk is "lost" and the
generator continues to the next chunk. A typical implementation will thus be to
write a function that does all you want to do on one chunk, run that function
on each chunk and summarize the results in some way; a minimal sketch of this
pattern is shown at the end of this section.

Continue to the guide on :ref:`working_with_big_data` to see the recommended
way of working with chunks in BioNumPy.
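
As a rough illustration of the chunk-and-summarize pattern, the sketch below
counts the total number of reads in a file by handling one chunk at a time and
accumulating the per-chunk results. The helper function `count_reads` is made
up for this example, and it assumes that calling `len()` on a chunk gives the
number of entries in it:

>>> import bionumpy as bnp
>>> def count_reads(chunk):
...     # hypothetical helper: number of entries in this chunk
...     # (assumes chunks support Python's len())
...     return len(chunk)
>>> total_reads = 0
>>> file = bnp.open("example_data/big.fq.gz")
>>> for chunk in file.read_chunks(min_chunk_size=100000):
...     # process one chunk at a time and accumulate the summary
...     total_reads += count_reads(chunk)

After the loop, `total_reads` holds the count for the whole file, even though
only one chunk was ever kept in memory at a time.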