IO

The IO module provides functions and classes for reading and writing files. The main entry points are the open function for reading and writing files, the open_indexed function for random access to indexed files, and the count_entries function for counting the entries in a file.

In addition, several FileBuffer classes are exposed. These specify how a file should be interpreted. Passing one of them as the buffer_type argument to open overrides the automatic format detection based on the filename suffix.
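The override rule can be pictured as a suffix lookup that is skipped whenever an explicit buffer type is given. The sketch below is plain Python and is not bionumpy's implementation: the `SUFFIX_TO_FORMAT` table and the `resolve_format` helper are hypothetical, and strings stand in for the FileBuffer classes.

```python
from pathlib import Path

# Hypothetical suffix table; the strings stand in for FileBuffer classes.
SUFFIX_TO_FORMAT = {
    ".bed": "BedBuffer",
    ".vcf": "VCFBuffer",
    ".fq": "FastQBuffer",
    ".fastq": "FastQBuffer",
}


def resolve_format(filename, buffer_type=None):
    """Return the explicit buffer_type if given, else look up the suffix."""
    if buffer_type is not None:
        return buffer_type  # an explicit choice wins over the suffix
    suffixes = Path(filename).suffixes
    # Ignore a trailing compression suffix such as ".gz".
    if suffixes and suffixes[-1] == ".gz":
        suffixes = suffixes[:-1]
    if not suffixes:
        raise ValueError(f"cannot infer the format of {filename!r}")
    return SUFFIX_TO_FORMAT[suffixes[-1]]


detected = resolve_format("example_data/big.fq.gz")
overridden = resolve_format("reads.txt", buffer_type="FastQBuffer")
print(detected, overridden)
```

An explicit buffer type is mainly useful when the suffix is missing, unusual, or misleading, as with the `reads.txt` case here.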

API documentation

bionumpy.io.bnp_open(filename: str, mode: str = None, buffer_type=None, lazy=None) → NpDataclassReader | NpBufferedWriter

Open a file according to its suffix

Open an NpDataclassReader file object that can be used to read the file, either in chunks or all at once. Files read in chunks can be combined with the @bnp.streamable decorator to call a function on every chunk in the file and optionally reduce the results.

If mode="w", a writer object is opened instead.

Parameters

filename (str): Name of the file to open
mode (str): Either "w" or "r"
buffer_type (FileBuffer): A FileBuffer class to specify how the data in the file should be interpreted
lazy (bool): If True, the data will be read lazily, i.e. only when it is accessed. This can speed up reading of large files, but it uses more memory

Returns

NpDataclassReader: A file reader object (an NpBufferedWriter is returned instead when mode="w")

Examples

>>> import bionumpy as bnp
>>> all_data = bnp.open("example_data/big.fq.gz").read()
>>> print(all_data)
SequenceEntryWithQuality with 1000 entries
                     name                 sequence                  quality
  2fa9ee19-5c51-4281-a...  CGGTAGCCAGCTGCGTTCAG...  [10  5  5 12  5  4  3  
  1f9ca490-2f25-484a-8...  GATGCATACTTCGTTCGATT...  [ 5  4  5  4  6  6  5  
  06936a64-6c08-40e9-8...  GTTTTGTCGCTGCGTTCAGT...  [ 3  5  6  7  7  5  4  
  d6a555a1-d8dd-4e55-9...  CGTATGCTTTGAGATTCATT...  [ 2  3  4  4  4  4  6  
  91ca9c6c-12fe-4255-8...  CGGTGTACTTCGTTCCAGCT...  [ 4  3  5  6  3  5  6  
  4dbe5037-abe2-4176-8...  GCAGGTGATGCTTTGGTTCA...  [ 2  3  4  6  7  7  6  
  df3de4e9-48ca-45fc-8...  CATGCTTCGTTGGTTACCTC...  [ 5  5  5  4  7  7  7  
  bfde9b59-2f6d-48e8-8...  CTGTTGTGCGCTTCGTTCAT...  [ 8  8 10  7  8  6  3  
  dbcfd59a-7a96-46a2-9...  CGATTATTTGGTTCGTTCAT...  [ 5  4  2  3  5  2  2  
  a0f83c4e-4c20-4c15-b...  GTTGTACTTTACGTTTCAAT...  [ 3  5 10  6  7  6  6  

>>> first_chunk = bnp.open("example_data/big.fq.gz").read_chunk(300000)
>>> print(first_chunk)
SequenceEntryWithQuality with 511 entries
                     name                 sequence                  quality
  2fa9ee19-5c51-4281-a...  CGGTAGCCAGCTGCGTTCAG...  [10  5  5 12  5  4  3  
  1f9ca490-2f25-484a-8...  GATGCATACTTCGTTCGATT...  [ 5  4  5  4  6  6  5  
  06936a64-6c08-40e9-8...  GTTTTGTCGCTGCGTTCAGT...  [ 3  5  6  7  7  5  4  
  d6a555a1-d8dd-4e55-9...  CGTATGCTTTGAGATTCATT...  [ 2  3  4  4  4  4  6  
  91ca9c6c-12fe-4255-8...  CGGTGTACTTCGTTCCAGCT...  [ 4  3  5  6  3  5  6  
  4dbe5037-abe2-4176-8...  GCAGGTGATGCTTTGGTTCA...  [ 2  3  4  6  7  7  6  
  df3de4e9-48ca-45fc-8...  CATGCTTCGTTGGTTACCTC...  [ 5  5  5  4  7  7  7  
  bfde9b59-2f6d-48e8-8...  CTGTTGTGCGCTTCGTTCAT...  [ 8  8 10  7  8  6  3  
  dbcfd59a-7a96-46a2-9...  CGATTATTTGGTTCGTTCAT...  [ 5  4  2  3  5  2  2  
  a0f83c4e-4c20-4c15-b...  GTTGTACTTTACGTTTCAAT...  [ 3  5 10  6  7  6  6  

>>> all_chunks = bnp.open("example_data/big.fq.gz").read_chunks(300000)

>>> for chunk in all_chunks:
...       print(chunk)
...
SequenceEntryWithQuality with 511 entries
                     name                 sequence                  quality
  2fa9ee19-5c51-4281-a...  CGGTAGCCAGCTGCGTTCAG...  [10  5  5 12  5  4  3  
  1f9ca490-2f25-484a-8...  GATGCATACTTCGTTCGATT...  [ 5  4  5  4  6  6  5  
  06936a64-6c08-40e9-8...  GTTTTGTCGCTGCGTTCAGT...  [ 3  5  6  7  7  5  4  
  d6a555a1-d8dd-4e55-9...  CGTATGCTTTGAGATTCATT...  [ 2  3  4  4  4  4  6  
  91ca9c6c-12fe-4255-8...  CGGTGTACTTCGTTCCAGCT...  [ 4  3  5  6  3  5  6  
  4dbe5037-abe2-4176-8...  GCAGGTGATGCTTTGGTTCA...  [ 2  3  4  6  7  7  6  
  df3de4e9-48ca-45fc-8...  CATGCTTCGTTGGTTACCTC...  [ 5  5  5  4  7  7  7  
  bfde9b59-2f6d-48e8-8...  CTGTTGTGCGCTTCGTTCAT...  [ 8  8 10  7  8  6  3  
  dbcfd59a-7a96-46a2-9...  CGATTATTTGGTTCGTTCAT...  [ 5  4  2  3  5  2  2  
  a0f83c4e-4c20-4c15-b...  GTTGTACTTTACGTTTCAAT...  [ 3  5 10  6  7  6  6  
SequenceEntryWithQuality with 489 entries
                     name                 sequence                  quality
  5f27fb90-2cb0-43d0-a...  CGTTGCTGATTCAGCATCAA...  [ 5  3  2  3  2  2  4  
  e23294d9-0079-4345-a...  CGAGCCGCTTCGTTCCGGTT...  [ 4  5  3  3  3  4  3  
  56736851-ccc9-41a6-9...  CGGTGCCTTCGTTCATTTCT...  [ 8  3  7  7  3  1  2  
  f156362d-d380-480d-8...  CTGTTGCGCCCCGGAACAGT...  [ 7 11  9  4  4  4  3  
  300f89ef-608a-463f-8...  CATACTTTGGTTCATTCTGT...  [ 3  2  4  4  4  4  5  
  755b1702-4560-4c04-a...  GGTATACTTGCCCTACGTTC...  [10  9 13  6  3  3  4  
  98de4f6b-d094-41e8-9...  GTTGTACTTCGTTCAGTTTC...  [ 4  5  6  4  7  6  6  
  00ac3f41-f735-49e5-9...  GTTGTACTTCGTTCAGCTCT...  [ 3  4  5  4  4 10 12 1
  f92d30bc-f77f-401e-9...  GTTGTACTGCTTCGTTCAGT...  [ 6  3  4  3  6  3  2  
  7e2c14c0-0662-4cc3-8...  TGATACATTACTTCGTTCGA...  [ 3  8  4  7  2  4  3  
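Chunked reading pairs naturally with a per-chunk function whose results are combined afterwards. The sketch below shows that pattern in plain Python, without bionumpy or @bnp.streamable. The helpers `iter_entry_chunks` and `count_bases` are hypothetical, and chunks are sized by number of entries here, which is an assumption made for simplicity rather than how the size argument above is interpreted.

```python
import io


def iter_entry_chunks(handle, entries_per_chunk):
    """Yield lists of FASTQ entries (4 lines each), one list per chunk."""
    chunk = []
    while True:
        lines = [handle.readline().rstrip("\n") for _ in range(4)]
        if not lines[0]:
            break  # end of file
        chunk.append(lines)
        if len(chunk) == entries_per_chunk:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # final, possibly shorter chunk


def count_bases(chunk):
    """Per-chunk function: total sequence length in the chunk."""
    return sum(len(entry[1]) for entry in chunk)


fastq = io.StringIO(
    "@r1\nACGT\n+\n!!!!\n"
    "@r2\nGG\n+\n!!\n"
    "@r3\nTTTAA\n+\n!!!!!\n"
)
# Apply the function to every chunk, then reduce the results by summing.
total = sum(count_bases(chunk) for chunk in iter_entry_chunks(fastq, 2))
print(total)  # 11
```

Only one chunk is held in memory at a time, so the same loop works for files that are larger than the available memory.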

bionumpy.io.open_indexed(filename: str) → IndexedFasta

Open an indexed file with random access (for now, only fasta files are supported)

If an index is not already present for the file, create it

Parameters

filename (str): The filename of the file

Returns

IndexedFasta: An indexed fasta object that supports random access by chromosome or by interval

Examples

>>> from bionumpy import open_indexed
>>> reference = open_indexed("example_data/small_genome.fa")
>>> reference
Indexed Fasta File with chromosome sizes: {'0': 80, '1': 80, '2': 80, '3': 80}
>>> reference["1"]
encoded_array('gcttggtatgaaaacccatc...')
>>> from bionumpy.datatypes import Interval
>>> intervals = Interval.from_entry_tuples([("1", 10, 20), ("2", 20, 30)])
>>> reference.get_interval_sequences(intervals)
encoded_ragged_array(['aaaacccatc',
                      'ggccgttttt'])
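Random access works because an index records where each sequence starts in the file, so a lookup can seek straight to the requested bases instead of scanning. The sketch below illustrates the idea in plain Python for two-line fasta files only. It is not bionumpy's implementation or index format, and `build_index` and `fetch` are hypothetical helpers.

```python
import os
import tempfile


def build_index(path):
    """Map each record name to (sequence length, byte offset of its sequence)."""
    index = {}
    with open(path, "rb") as handle:
        while True:
            header = handle.readline()
            if not header:
                break  # end of file
            name = header[1:].split()[0].decode()
            offset = handle.tell()  # the sequence starts right after the header
            sequence = handle.readline().rstrip(b"\n")
            index[name] = (len(sequence), offset)
    return index


def fetch(path, index, name, start, stop):
    """Read bases [start, stop) of one record without scanning the file."""
    length, offset = index[name]
    stop = min(stop, length)
    with open(path, "rb") as handle:
        handle.seek(offset + start)
        return handle.read(max(stop - start, 0)).decode()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tiny.fa")
    with open(path, "w", newline="\n") as handle:
        handle.write(">0\nacgtacgtac\n>1\ngcttggtatg\n")
    index = build_index(path)
    lengths = {name: length for name, (length, _) in index.items()}
    piece = fetch(path, index, "1", 2, 6)

print(lengths)  # {'0': 10, '1': 10}
print(piece)    # ttgg
```

Building the index costs one pass over the file; every later lookup is a single seek and read.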

bionumpy.io.count_entries(filename: str, buffer_type: FileBuffer = None) → int

Count the number of entries in the file

By default, the file format is inferred from the file suffix, but a specific FileBuffer can be provided to override this.

Parameters

filename (str): Name of the file to count the entries of
buffer_type (FileBuffer): A FileBuffer class to specify how the data in the file should be interpreted

Returns

int: The number of entries in the file

Examples

6
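What counts as one entry depends on the file format, which is why the format has to be known before counting. The sketch below makes that concrete in plain Python for two formats. It is an illustration under simplifying assumptions (uncompressed text, well-formed files), not bionumpy's implementation, and all names in it are hypothetical.

```python
import io


def count_fastq(handle):
    """FASTQ: every entry spans exactly four lines."""
    return sum(1 for _ in handle) // 4


def count_fasta(handle):
    """FASTA: every entry has one header line starting with '>'."""
    return sum(1 for line in handle if line.startswith(">"))


# Hypothetical suffix table used when no counter is given explicitly.
COUNTERS = {".fq": count_fastq, ".fa": count_fasta}


def count_entries_sketch(handle, suffix, counter=None):
    """Count entries, letting an explicit counter override the suffix."""
    counter = counter or COUNTERS[suffix]
    return counter(handle)


fastq_text = "@r1\nACGT\n+\n!!!!\n@r2\nGG\n+\n!!\n"
fasta_text = ">a\nACGT\n>b\nGG\n>c\nT\n"

n_fastq = count_entries_sketch(io.StringIO(fastq_text), ".fq")
# The suffix ".txt" is not in the table, so the counter is passed explicitly.
n_fasta = count_entries_sketch(io.StringIO(fasta_text), ".txt", counter=count_fasta)
print(n_fastq, n_fasta)  # 2 3
```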

class bionumpy.io.BedBuffer(buffer_extractor: TextBufferExtractor, header_data=None)

class bionumpy.io.VCFBuffer(buffer_extractor: TextBufferExtractor, header_data=None)

Format specification: https://samtools.github.io/hts-specs/VCFv4.2.pdf

class bionumpy.io.FastQBuffer(buffer_extractor: TextBufferExtractor)

class bionumpy.io.TwoLineFastaBuffer(buffer_extractor: TextBufferExtractor)

Buffer for fasta files where each entry is contained in two lines (one for the header and one for the sequence). For multi-line fasta files, use MultiLineFastaBuffer.

class bionumpy.io.IndexedFasta(filename: str | Path)

Class representing an indexed fasta file. Behaves like a dict mapping chromosome names to sequences.

get_contig_lengths() → Dict[str, int]

Return a dict mapping chromosome names to sequence lengths

Returns

dict: chromosome name to sequence length mapping

get_interval_sequences(intervals: Interval) → EncodedRaggedArray

Get the sequences for a set of genomic intervals

Parameters

intervals (Interval): The genomic intervals to extract sequences for

Returns

EncodedRaggedArray: Sequences
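The open_indexed example output suggests that intervals are 0-based and half-open: the interval ("1", 10, 20) yields ten bases, starting at index 10 of chromosome "1". The sketch below reproduces that result with ordinary Python string slicing. It uses only the first 20 bases shown in that example, and `interval_sequences` is a hypothetical stand-in, not the library method.

```python
# First 20 bases of chromosome "1", as printed in the open_indexed example.
chrom_1_prefix = "gcttggtatgaaaacccatc"


def interval_sequences(sequences, intervals):
    """Slice each (chromosome, start, stop) interval as a half-open range."""
    return [sequences[chrom][start:stop] for chrom, start, stop in intervals]


result = interval_sequences({"1": chrom_1_prefix}, [("1", 10, 20)])
print(result)  # ['aaaacccatc']
```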