IO

The IO module provides functions and classes for reading and writing files. Central access points are the open function for reading files, the open_indexed function for reading indexed files, and the count_entries function for counting entries in a file.

In addition, several FileBuffer classes are exposed that specify how a file should be interpreted. Passing one of these as the buffer_type argument to open overrides the automatic format detection based on the filename suffix.
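
The precedence described above (an explicit buffer_type wins over the filename suffix) can be sketched in plain Python. This is an illustration under stated assumptions, not BioNumPy's implementation: SUFFIX_TO_FORMAT and resolve_format are hypothetical names, and the format labels are plain strings standing in for FileBuffer classes.

```python
from pathlib import Path

# Hypothetical suffix table for this sketch; strings stand in for FileBuffer classes.
SUFFIX_TO_FORMAT = {
    ".fq": "fastq",
    ".fastq": "fastq",
    ".fa": "fasta",
    ".fasta": "fasta",
    ".bed": "bed",
    ".vcf": "vcf",
}


def resolve_format(filename, buffer_type=None):
    """Return the explicit buffer_type if given, else infer it from the suffix."""
    if buffer_type is not None:
        return buffer_type
    suffixes = Path(filename).suffixes
    # Ignore a trailing compression suffix such as ".gz".
    if suffixes and suffixes[-1] == ".gz":
        suffixes = suffixes[:-1]
    if not suffixes or suffixes[-1] not in SUFFIX_TO_FORMAT:
        raise ValueError(f"Cannot infer the format of {filename!r}")
    return SUFFIX_TO_FORMAT[suffixes[-1]]


inferred = resolve_format("example_data/big.fq.gz")
overridden = resolve_format("reads.txt", buffer_type="fasta")
```

Without the override, the second call would raise a ValueError, because ".txt" is not in the table. An explicit buffer type is the natural way to handle files with nonstandard names.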

API documentation

bionumpy.io.bnp_open(filename: str, mode: str | None = None, buffer_type=None, lazy=None) → NpDataclassReader | NpBufferedWriter

Open a file according to its suffix

Open an NpDataclassReader file object that can be used to read the file, either in chunks or completely. Files read in chunks can be combined with the @bnp.streamable decorator to call a function on every chunk in the file and optionally reduce the results.

If mode="w", a writer object is opened instead.

Parameters

filename : str

Name of the file to open

mode : str

Either "w" or "r"

buffer_type : FileBuffer

A FileBuffer class to specify how the data in the file should be interpreted

lazy : bool

If True, the data will be read lazily, i.e. only when it is accessed. This is useful to speed up reading of large files, but it is more memory demanding.

Returns

NpDataclassReader | NpBufferedWriter

A file reader object, or a writer object if mode="w"

Examples

>>> import bionumpy as bnp
>>> all_data = bnp.open("example_data/big.fq.gz").read()
>>> print(all_data)
SequenceEntryWithQuality with 1000 entries
                     name                 sequence                  quality
  2fa9ee19-5c51-4281-a...  CGGTAGCCAGCTGCGTTCAG...  [10  5  5 12  5  4  3  
  1f9ca490-2f25-484a-8...  GATGCATACTTCGTTCGATT...  [ 5  4  5  4  6  6  5  
  06936a64-6c08-40e9-8...  GTTTTGTCGCTGCGTTCAGT...  [ 3  5  6  7  7  5  4  
  d6a555a1-d8dd-4e55-9...  CGTATGCTTTGAGATTCATT...  [ 2  3  4  4  4  4  6  
  91ca9c6c-12fe-4255-8...  CGGTGTACTTCGTTCCAGCT...  [ 4  3  5  6  3  5  6  
  4dbe5037-abe2-4176-8...  GCAGGTGATGCTTTGGTTCA...  [ 2  3  4  6  7  7  6  
  df3de4e9-48ca-45fc-8...  CATGCTTCGTTGGTTACCTC...  [ 5  5  5  4  7  7  7  
  bfde9b59-2f6d-48e8-8...  CTGTTGTGCGCTTCGTTCAT...  [ 8  8 10  7  8  6  3  
  dbcfd59a-7a96-46a2-9...  CGATTATTTGGTTCGTTCAT...  [ 5  4  2  3  5  2  2  
  a0f83c4e-4c20-4c15-b...  GTTGTACTTTACGTTTCAAT...  [ 3  5 10  6  7  6  6  
>>> first_chunk = bnp.open("example_data/big.fq.gz").read_chunk(300000)
>>> print(first_chunk)
SequenceEntryWithQuality with 511 entries
                     name                 sequence                  quality
  2fa9ee19-5c51-4281-a...  CGGTAGCCAGCTGCGTTCAG...  [10  5  5 12  5  4  3  
  1f9ca490-2f25-484a-8...  GATGCATACTTCGTTCGATT...  [ 5  4  5  4  6  6  5  
  06936a64-6c08-40e9-8...  GTTTTGTCGCTGCGTTCAGT...  [ 3  5  6  7  7  5  4  
  d6a555a1-d8dd-4e55-9...  CGTATGCTTTGAGATTCATT...  [ 2  3  4  4  4  4  6  
  91ca9c6c-12fe-4255-8...  CGGTGTACTTCGTTCCAGCT...  [ 4  3  5  6  3  5  6  
  4dbe5037-abe2-4176-8...  GCAGGTGATGCTTTGGTTCA...  [ 2  3  4  6  7  7  6  
  df3de4e9-48ca-45fc-8...  CATGCTTCGTTGGTTACCTC...  [ 5  5  5  4  7  7  7  
  bfde9b59-2f6d-48e8-8...  CTGTTGTGCGCTTCGTTCAT...  [ 8  8 10  7  8  6  3  
  dbcfd59a-7a96-46a2-9...  CGATTATTTGGTTCGTTCAT...  [ 5  4  2  3  5  2  2  
  a0f83c4e-4c20-4c15-b...  GTTGTACTTTACGTTTCAAT...  [ 3  5 10  6  7  6  6  
>>> all_chunks = bnp.open("example_data/big.fq.gz").read_chunks(300000)
>>> for chunk in all_chunks:
...       print(chunk)
...
SequenceEntryWithQuality with 511 entries
                     name                 sequence                  quality
  2fa9ee19-5c51-4281-a...  CGGTAGCCAGCTGCGTTCAG...  [10  5  5 12  5  4  3  
  1f9ca490-2f25-484a-8...  GATGCATACTTCGTTCGATT...  [ 5  4  5  4  6  6  5  
  06936a64-6c08-40e9-8...  GTTTTGTCGCTGCGTTCAGT...  [ 3  5  6  7  7  5  4  
  d6a555a1-d8dd-4e55-9...  CGTATGCTTTGAGATTCATT...  [ 2  3  4  4  4  4  6  
  91ca9c6c-12fe-4255-8...  CGGTGTACTTCGTTCCAGCT...  [ 4  3  5  6  3  5  6  
  4dbe5037-abe2-4176-8...  GCAGGTGATGCTTTGGTTCA...  [ 2  3  4  6  7  7  6  
  df3de4e9-48ca-45fc-8...  CATGCTTCGTTGGTTACCTC...  [ 5  5  5  4  7  7  7  
  bfde9b59-2f6d-48e8-8...  CTGTTGTGCGCTTCGTTCAT...  [ 8  8 10  7  8  6  3  
  dbcfd59a-7a96-46a2-9...  CGATTATTTGGTTCGTTCAT...  [ 5  4  2  3  5  2  2  
  a0f83c4e-4c20-4c15-b...  GTTGTACTTTACGTTTCAAT...  [ 3  5 10  6  7  6  6  
SequenceEntryWithQuality with 489 entries
                     name                 sequence                  quality
  5f27fb90-2cb0-43d0-a...  CGTTGCTGATTCAGCATCAA...  [ 5  3  2  3  2  2  4  
  e23294d9-0079-4345-a...  CGAGCCGCTTCGTTCCGGTT...  [ 4  5  3  3  3  4  3  
  56736851-ccc9-41a6-9...  CGGTGCCTTCGTTCATTTCT...  [ 8  3  7  7  3  1  2  
  f156362d-d380-480d-8...  CTGTTGCGCCCCGGAACAGT...  [ 7 11  9  4  4  4  3  
  300f89ef-608a-463f-8...  CATACTTTGGTTCATTCTGT...  [ 3  2  4  4  4  4  5  
  755b1702-4560-4c04-a...  GGTATACTTGCCCTACGTTC...  [10  9 13  6  3  3  4  
  98de4f6b-d094-41e8-9...  GTTGTACTTCGTTCAGTTTC...  [ 4  5  6  4  7  6  6  
  00ac3f41-f735-49e5-9...  GTTGTACTTCGTTCAGCTCT...  [ 3  4  5  4  4 10 12 1
  f92d30bc-f77f-401e-9...  GTTGTACTGCTTCGTTCAGT...  [ 6  3  4  3  6  3  2  
  7e2c14c0-0662-4cc3-8...  TGATACATTACTTCGTTCGA...  [ 3  8  4  7  2  4  3  
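
The chunked pattern shown above, where each chunk is processed separately and the per-chunk results are combined, can be illustrated with the standard library alone. This is a sketch, not BioNumPy's implementation: read_fastq_chunks is a hypothetical helper, it chunks by record count rather than by the size argument passed to read_chunks above, and it assumes well-formed four-line FASTQ records.

```python
import gzip
import os
import tempfile
from itertools import islice


def read_fastq_chunks(path, records_per_chunk):
    """Yield lists of (name, sequence, quality) tuples, one list per chunk."""
    with gzip.open(path, "rt") as handle:
        while True:
            # Each FASTQ record occupies four lines.
            lines = list(islice(handle, 4 * records_per_chunk))
            if not lines:
                return
            lines = [line.rstrip("\r\n") for line in lines]
            yield [
                (lines[i][1:], lines[i + 1], lines[i + 3])
                for i in range(0, len(lines), 4)
            ]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tiny.fq.gz")
    with gzip.open(path, "wt") as out:
        for i in range(5):
            out.write(f"@read{i}\nACGT\n+\nIIII\n")

    # Only one chunk is held in memory at a time.
    chunk_sizes = [len(chunk) for chunk in read_fastq_chunks(path, 2)]

    # Reduce a per-chunk result (bases per chunk) into one total.
    total_bases = sum(
        sum(len(sequence) for _, sequence, _ in chunk)
        for chunk in read_fastq_chunks(path, 2)
    )
```

The last chunk is smaller than the others whenever the number of records is not a multiple of the chunk size, which mirrors the 511 and 489 entry chunks in the output above.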

bionumpy.io.open_indexed(filename: str) → IndexedFasta

Open an indexed file with random access. For now, only FASTA files are supported.

If no index exists for the file, one is created.

Parameters

filename : str

Name of the FASTA file to open

Returns

IndexedFasta

An IndexedFasta object that supports random access by chromosome or by interval

Examples

>>> from bionumpy import open_indexed
>>> reference = open_indexed("example_data/small_genome.fa")
>>> reference
Indexed Fasta File with chromosome sizes: {'0': 80, '1': 80, '2': 80, '3': 80}
>>> reference["1"]
encoded_array('gcttggtatgaaaacccatc...')
>>> from bionumpy.datatypes import Interval
>>> intervals = Interval.from_entry_tuples([("1", 10, 20), ("2", 20, 30)])
>>> reference.get_interval_sequences(intervals)
encoded_ragged_array(['aaaacccatc',
                      'ggccgttttt'])
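
To show why an index makes random access cheap, here is a minimal standard-library sketch. It is not BioNumPy's implementation or its index format: build_index and fetch are hypothetical helpers, and the sketch assumes a two-line FASTA file (one header line and one sequence line per entry) with single-byte characters.

```python
import os
import tempfile


def build_index(path):
    """Map each sequence name to (byte offset of its sequence, sequence length)."""
    index = {}
    with open(path, "rb") as handle:
        while True:
            header = handle.readline()
            if not header:
                break
            if not header.startswith(b">"):
                continue  # skip blank or unexpected lines
            name = header[1:].split()[0].decode("ascii")
            offset = handle.tell()  # the sequence starts right after the header
            sequence = handle.readline().rstrip(b"\r\n")
            index[name] = (offset, len(sequence))
    return index


def fetch(path, index, name, start, stop):
    """Read bases [start, stop) of one sequence by seeking, without scanning."""
    offset, length = index[name]
    stop = min(stop, length)
    with open(path, "rb") as handle:
        handle.seek(offset + start)
        return handle.read(max(stop - start, 0)).decode("ascii")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tiny.fa")
    with open(path, "w", newline="\n") as out:
        out.write(">chr1 demo\nACGTACGTAC\n>chr2\nGGGGCCCCAA\n")
    index = build_index(path)
    lengths = {name: length for name, (_, length) in index.items()}
    piece = fetch(path, index, "chr2", 2, 6)
```

Building the index requires one pass over the file. After that, each lookup is a single seek and a short read, regardless of file size.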

bionumpy.io.count_entries(filename: str, buffer_type: FileBuffer | None = None) → int

Count the number of entries in the file

By default, the file format is inferred from the file suffix, but a specific FileBuffer can be provided to override this.

Parameters

filename : str

Name of the file to count the entries of

buffer_type : FileBuffer

A FileBuffer class to specify how the data in the file should be interpreted

Returns

int

The number of entries in the file

Examples

6
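
Counting entries does not require parsing every field. As a rough illustration, assuming a format with a fixed number of lines per entry, the count can be derived from the number of newline bytes. This is a standard-library sketch with hypothetical names (LINES_PER_ENTRY, count_entries_sketch), not the method BioNumPy uses.

```python
import gzip
import os
import tempfile

# Hypothetical table for this sketch: lines per entry in fixed-layout text formats.
LINES_PER_ENTRY = {"fastq": 4, "two_line_fasta": 2}


def count_entries_sketch(path, file_format):
    """Count entries by counting newline bytes, without parsing any fields."""
    opener = gzip.open if path.endswith(".gz") else open
    n_lines = 0
    last_byte = b"\n"
    with opener(path, "rb") as handle:
        # Read fixed-size blocks so memory use stays flat for large files.
        for block in iter(lambda: handle.read(1 << 20), b""):
            n_lines += block.count(b"\n")
            last_byte = block[-1:]
    if last_byte != b"\n":
        n_lines += 1  # the final line had no trailing newline
    return n_lines // LINES_PER_ENTRY[file_format]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tiny.fq.gz")
    with gzip.open(path, "wt") as out:
        for i in range(3):
            out.write(f"@read{i}\nACGT\n+\nIIII\n")
    n_entries = count_entries_sketch(path, "fastq")
```

This shortcut breaks down for formats with header lines or variable-length records, which is one reason a format-aware buffer type is needed in general.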

class bionumpy.io.BedBuffer(buffer_extractor: TextBufferExtractor, header_data=None)

class bionumpy.io.VCFBuffer(buffer_extractor: TextBufferExtractor, header_data=None)

VCF format specification: https://samtools.github.io/hts-specs/VCFv4.2.pdf

class bionumpy.io.FastQBuffer(buffer_extractor: TextBufferExtractor)

class bionumpy.io.TwoLineFastaBuffer(buffer_extractor: TextBufferExtractor)

Buffer for FASTA files where each entry is contained in exactly two lines (one for the header and one for the sequence). For multi-line FASTA files, use MultiLineFastaBuffer.
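
The difference between the two layouts is easiest to see in a small conversion. The sketch below joins wrapped sequence lines so that every entry becomes one header line plus one sequence line. to_two_line is a hypothetical helper written for illustration and is not part of BioNumPy.

```python
def to_two_line(lines):
    """Join wrapped sequence lines: one header and one sequence line per entry."""
    result = []
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if result:
                # Close the previous entry before starting a new one.
                result.append("".join(parts))
            result.append(line)
            parts = []
        else:
            parts.append(line)
    if result:
        result.append("".join(parts))
    return result


multi_line = [">seq1", "ACGT", "AC", ">seq2", "GG"]
two_line = to_two_line(multi_line)
```

Here seq1 is wrapped over two sequence lines in the input and occupies a single sequence line in the result.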

class bionumpy.io.IndexedFasta(filename: str | Path)

Class representing an indexed FASTA file. Behaves like a dict mapping chromosome names to sequences.

get_contig_lengths() → Dict[str, int]

Return a dict mapping chromosome names to sequence lengths

Returns

dict

Mapping from chromosome name to sequence length

get_interval_sequences(intervals: Interval) → EncodedRaggedArray

Get the sequences for a set of genomic intervals

Parameters

intervals : Interval

The genomic intervals to extract sequences for

Returns

EncodedRaggedArray

The sequence of each interval, one row per interval