IO
The IO module provides functions and classes for reading and writing files. Central access points are the open function for reading files, the open_indexed function for reading indexed files, and the count_entries function for counting entries in a file.
In addition, several FileBuffer classes are exposed that specify how a file should be interpreted. Passing one of these as the buffer_type argument to open overrides the automatic format detection based on the filename suffix.
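The interplay between suffix-based detection and an explicit override can be sketched in plain Python. This is only an illustration of the idea, not bionumpy's implementation: the SUFFIX_TO_FORMAT table and the resolve_format function are hypothetical names, and format names are plain strings here rather than FileBuffer classes.

```python
from pathlib import Path

# Hypothetical suffix table for illustration only. bionumpy maps suffixes to
# FileBuffer classes; plain strings are used here to keep the sketch standalone.
SUFFIX_TO_FORMAT = {
    ".fa": "fasta",
    ".fasta": "fasta",
    ".fq": "fastq",
    ".fastq": "fastq",
    ".bed": "bed",
}


def resolve_format(filename, buffer_type=None):
    """Return the explicit format if given, otherwise infer it from the suffix."""
    if buffer_type is not None:
        # An explicit choice always wins over suffix detection.
        return buffer_type
    suffixes = Path(filename).suffixes
    # Ignore a trailing compression suffix such as ".gz".
    if suffixes and suffixes[-1] == ".gz":
        suffixes = suffixes[:-1]
    if not suffixes or suffixes[-1] not in SUFFIX_TO_FORMAT:
        raise ValueError(f"Cannot infer the format of {filename!r}")
    return SUFFIX_TO_FORMAT[suffixes[-1]]


detected = resolve_format("example_data/big.fq.gz")
overridden = resolve_format("reads.txt", buffer_type="fasta")
```

Here the first call infers the format from the ".fq" suffix, while the second call succeeds only because the format is given explicitly.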
API documentation
- bionumpy.io.bnp_open(filename: str, mode: str | None = None, buffer_type=None, lazy=None) -> NpDataclassReader | NpBufferedWriter
Open a file according to its suffix
Returns an NpDataclassReader file object that can be used to read the file, either in chunks or completely. Files read in chunks can be combined with the @bnp.streamable decorator to call a function on every chunk in the file and optionally reduce the results.
If mode="w", a writer object (NpBufferedWriter) is opened instead.
Parameters
- filename : str
Name of the file to open
- mode : str
Either "w" or "r"
- buffer_type : FileBuffer
A FileBuffer class to specify how the data in the file should be interpreted
- lazy : bool
If True, the data is read lazily, i.e. only when it is accessed. This speeds up reading of large files, at the cost of higher memory use
Returns
- NpDataclassReader
A file reader object
Examples
>>> import bionumpy as bnp
>>> all_data = bnp.open("example_data/big.fq.gz").read()
>>> print(all_data)
SequenceEntryWithQuality with 1000 entries
name sequence quality
2fa9ee19-5c51-4281-a... CGGTAGCCAGCTGCGTTCAG... [10 5 5 12 5 4 3
1f9ca490-2f25-484a-8... GATGCATACTTCGTTCGATT... [ 5 4 5 4 6 6 5
06936a64-6c08-40e9-8... GTTTTGTCGCTGCGTTCAGT... [ 3 5 6 7 7 5 4
d6a555a1-d8dd-4e55-9... CGTATGCTTTGAGATTCATT... [ 2 3 4 4 4 4 6
91ca9c6c-12fe-4255-8... CGGTGTACTTCGTTCCAGCT... [ 4 3 5 6 3 5 6
4dbe5037-abe2-4176-8... GCAGGTGATGCTTTGGTTCA... [ 2 3 4 6 7 7 6
df3de4e9-48ca-45fc-8... CATGCTTCGTTGGTTACCTC... [ 5 5 5 4 7 7 7
bfde9b59-2f6d-48e8-8... CTGTTGTGCGCTTCGTTCAT... [ 8 8 10 7 8 6 3
dbcfd59a-7a96-46a2-9... CGATTATTTGGTTCGTTCAT... [ 5 4 2 3 5 2 2
a0f83c4e-4c20-4c15-b... GTTGTACTTTACGTTTCAAT... [ 3 5 10 6 7 6 6

>>> first_chunk = bnp.open("example_data/big.fq.gz").read_chunk(300000)
>>> print(first_chunk)
SequenceEntryWithQuality with 511 entries
name sequence quality
2fa9ee19-5c51-4281-a... CGGTAGCCAGCTGCGTTCAG... [10 5 5 12 5 4 3
1f9ca490-2f25-484a-8... GATGCATACTTCGTTCGATT... [ 5 4 5 4 6 6 5
06936a64-6c08-40e9-8... GTTTTGTCGCTGCGTTCAGT... [ 3 5 6 7 7 5 4
d6a555a1-d8dd-4e55-9... CGTATGCTTTGAGATTCATT... [ 2 3 4 4 4 4 6
91ca9c6c-12fe-4255-8... CGGTGTACTTCGTTCCAGCT... [ 4 3 5 6 3 5 6
4dbe5037-abe2-4176-8... GCAGGTGATGCTTTGGTTCA... [ 2 3 4 6 7 7 6
df3de4e9-48ca-45fc-8... CATGCTTCGTTGGTTACCTC... [ 5 5 5 4 7 7 7
bfde9b59-2f6d-48e8-8... CTGTTGTGCGCTTCGTTCAT... [ 8 8 10 7 8 6 3
dbcfd59a-7a96-46a2-9... CGATTATTTGGTTCGTTCAT... [ 5 4 2 3 5 2 2
a0f83c4e-4c20-4c15-b... GTTGTACTTTACGTTTCAAT... [ 3 5 10 6 7 6 6

>>> all_chunks = bnp.open("example_data/big.fq.gz").read_chunks(300000)
>>> for chunk in all_chunks:
...     print(chunk)
...
SequenceEntryWithQuality with 511 entries
name sequence quality
2fa9ee19-5c51-4281-a... CGGTAGCCAGCTGCGTTCAG... [10 5 5 12 5 4 3
1f9ca490-2f25-484a-8... GATGCATACTTCGTTCGATT... [ 5 4 5 4 6 6 5
06936a64-6c08-40e9-8... GTTTTGTCGCTGCGTTCAGT... [ 3 5 6 7 7 5 4
d6a555a1-d8dd-4e55-9... CGTATGCTTTGAGATTCATT... [ 2 3 4 4 4 4 6
91ca9c6c-12fe-4255-8... CGGTGTACTTCGTTCCAGCT... [ 4 3 5 6 3 5 6
4dbe5037-abe2-4176-8... GCAGGTGATGCTTTGGTTCA... [ 2 3 4 6 7 7 6
df3de4e9-48ca-45fc-8... CATGCTTCGTTGGTTACCTC... [ 5 5 5 4 7 7 7
bfde9b59-2f6d-48e8-8... CTGTTGTGCGCTTCGTTCAT... [ 8 8 10 7 8 6 3
dbcfd59a-7a96-46a2-9... CGATTATTTGGTTCGTTCAT... [ 5 4 2 3 5 2 2
a0f83c4e-4c20-4c15-b... GTTGTACTTTACGTTTCAAT... [ 3 5 10 6 7 6 6
SequenceEntryWithQuality with 489 entries
name sequence quality
5f27fb90-2cb0-43d0-a... CGTTGCTGATTCAGCATCAA... [ 5 3 2 3 2 2 4
e23294d9-0079-4345-a... CGAGCCGCTTCGTTCCGGTT... [ 4 5 3 3 3 4 3
56736851-ccc9-41a6-9... CGGTGCCTTCGTTCATTTCT... [ 8 3 7 7 3 1 2
f156362d-d380-480d-8... CTGTTGCGCCCCGGAACAGT... [ 7 11 9 4 4 4 3
300f89ef-608a-463f-8... CATACTTTGGTTCATTCTGT... [ 3 2 4 4 4 4 5
755b1702-4560-4c04-a... GGTATACTTGCCCTACGTTC... [10 9 13 6 3 3 4
98de4f6b-d094-41e8-9... GTTGTACTTCGTTCAGTTTC... [ 4 5 6 4 7 6 6
00ac3f41-f735-49e5-9... GTTGTACTTCGTTCAGCTCT... [ 3 4 5 4 4 10 12 1
f92d30bc-f77f-401e-9... GTTGTACTGCTTCGTTCAGT... [ 6 3 4 3 6 3 2
7e2c14c0-0662-4cc3-8... TGATACATTACTTCGTTCGA... [ 3 8 4 7 2 4 3
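The pattern behind chunked reading (process one chunk at a time, then reduce the per-chunk results) can be sketched with the standard library alone. This is a simplified illustration, not how bionumpy or @bnp.streamable is implemented: read_line_chunks and count_g are hypothetical helpers, and the input is assumed to be a text file with one sequence per line.

```python
import io


def read_line_chunks(handle, chunk_size):
    """Yield lists of complete lines, reading about chunk_size bytes at a time."""
    leftover = b""
    while True:
        data = handle.read(chunk_size)
        if not data:
            break
        data = leftover + data
        # Keep an incomplete trailing line for the next round.
        cut = data.rfind(b"\n") + 1
        leftover = data[cut:]
        if cut:
            yield data[:cut].splitlines()
    if leftover:
        # Final line without a trailing newline.
        yield [leftover]


def count_g(lines):
    """Per-chunk function: count the G bases in a list of sequence lines."""
    return sum(line.count(b"G") for line in lines)


handle = io.BytesIO(b"ACGT\nGGCA\nTTGACC\n")
chunks = list(read_line_chunks(handle, chunk_size=6))
# Reduce the per-chunk results into one total.
total_g = sum(count_g(chunk) for chunk in chunks)
```

Because every chunk holds only complete lines, the reduced total equals what a single pass over the whole file would give, while only one chunk needs to be in memory at a time.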
- bionumpy.io.open_indexed(filename: str) -> IndexedFasta
Open an indexed file for random access (currently only FASTA is supported)
If no index exists for the file yet, one is created
Parameters
- filename : str
The filename of the file
Returns
- IndexedFasta
An indexed FASTA object that supports random access by chromosome or by interval
Examples
>>> from bionumpy import open_indexed
>>> reference = open_indexed("example_data/small_genome.fa")
>>> reference
Indexed Fasta File with chromosome sizes: {'0': 80, '1': 80, '2': 80, '3': 80}
>>> reference["1"]
encoded_array('gcttggtatgaaaacccatc...')
>>> from bionumpy.datatypes import Interval
>>> intervals = Interval.from_entry_tuples([("1", 10, 20), ("2", 20, 30)])
>>> reference.get_interval_sequences(intervals)
encoded_ragged_array(['aaaacccatc', 'ggccgttttt'])
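To show why an index makes random access cheap, here is a minimal standard-library sketch. It is not bionumpy's implementation and does not use the .fai file format: build_index and fetch are hypothetical helpers, and each sequence is assumed to sit on a single line directly after its header.

```python
import io


def build_index(handle):
    """Map each sequence name to (byte offset of its sequence line, length)."""
    index = {}
    handle.seek(0)
    while True:
        header = handle.readline()
        if not header:
            break
        # The name is the first word after the leading ">".
        name = header[1:].split()[0].decode()
        offset = handle.tell()
        sequence = handle.readline()
        index[name] = (offset, len(sequence.rstrip(b"\n")))
    return index


def fetch(handle, index, name, start, stop):
    """Read bases [start, stop) of one sequence without scanning the file."""
    offset, length = index[name]
    stop = min(stop, length)
    handle.seek(offset + start)
    return handle.read(max(stop - start, 0)).decode()


fasta = io.BytesIO(b">1 first\nacgtacgtac\n>2\nggccgttttt\n")
index = build_index(fasta)
piece = fetch(fasta, index, "1", 2, 6)
```

Once the index is built, each lookup is a single seek and read, regardless of how large the file is.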
- bionumpy.io.count_entries(filename: str, buffer_type: FileBuffer | None = None) -> int
Count the number of entries in the file
By default, the file format is inferred from the file suffix, but a specific FileBuffer can be provided instead.
Parameters
- filename : str
Name of the file to count the entries of
- buffer_type : FileBuffer
A FileBuffer class to specify how the data in the file should be interpreted
Returns
- int
The number of entries in the file
Examples
6
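Counting entries does not require parsing them. For formats with a fixed number of lines per entry, counting newlines is enough, as in the following standard-library sketch. This is an illustration under stated assumptions, not bionumpy's implementation: LINES_PER_ENTRY and count_entries_sketch are hypothetical names, and every entry is assumed to span exactly the listed number of lines.

```python
import io

# Hypothetical table: number of lines that make up one entry in each format.
LINES_PER_ENTRY = {"fastq": 4, "two_line_fasta": 2, "bed": 1}


def count_entries_sketch(handle, file_format, chunk_size=1 << 20):
    """Count entries by counting newlines chunk by chunk."""
    n_lines = 0
    last_byte = b"\n"
    while True:
        data = handle.read(chunk_size)
        if not data:
            break
        n_lines += data.count(b"\n")
        last_byte = data[-1:]
    if last_byte != b"\n":
        # The final line has no trailing newline but still counts.
        n_lines += 1
    return n_lines // LINES_PER_ENTRY[file_format]


reads = io.BytesIO(b"@r1\nACGT\n+\nIIII\n@r2\nGG\n+\nII")
n_reads = count_entries_sketch(reads, "fastq", chunk_size=5)
```

The small chunk_size in the usage line is only there to exercise the chunk boundaries; a larger value would normally be used.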
- class bionumpy.io.TwoLineFastaBuffer(buffer_extractor: TextBufferExtractor)
Buffer for FASTA files where each entry is contained in two lines (one for the header and one for the sequence). For multi-line FASTA files, use MultiLineFastaBuffer.
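The two-line layout can be illustrated with a short standard-library sketch. It only shows the file structure the buffer expects and is unrelated to how TwoLineFastaBuffer itself parses data; parse_two_line_fasta is a hypothetical helper.

```python
def parse_two_line_fasta(text):
    """Split two-line FASTA text into (name, sequence) pairs."""
    lines = text.splitlines()
    if len(lines) % 2:
        raise ValueError("Expected an even number of lines")
    entries = []
    # Even-numbered lines are headers, odd-numbered lines are sequences.
    for header, sequence in zip(lines[0::2], lines[1::2]):
        if not header.startswith(">"):
            raise ValueError(f"Expected a header line, got {header!r}")
        entries.append((header[1:], sequence))
    return entries


entries = parse_two_line_fasta(">read1\nACGT\n>read2\nGGCA\n")
```

A file whose sequences wrap over several lines would fail the header check here, which is the case the multi-line buffer is meant for.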
- class bionumpy.io.IndexedFasta(filename: str | Path)
Class representing an indexed FASTA file. Behaves like a dict mapping chromosome names to sequences.