.. _supported_file_formats:

Supported file formats
-----------------------------------

This is a list of the file formats currently supported by BioNumPy. When you open a file with one of these extensions using `bnp.open`, BioNumPy automatically detects the file type and reads the data into an appropriate data structure (a dataclass-like object with named fields).

* vcf
* bed
* fasta / fa
* fastq / fq
* gfa (limited support only)
* gff
* gtf
* gff3
* sam / bam
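
As a rough illustration of what extension-based detection involves, the sketch below maps a file suffix to a format label using only the Python standard library. It is not BioNumPy's implementation: the names `SUFFIX_TO_FORMAT` and `guess_format` are hypothetical, and the table simply mirrors the list above.

.. code-block:: python

    from pathlib import Path

    # Hypothetical lookup table (not part of BioNumPy): suffix -> format label.
    SUFFIX_TO_FORMAT = {
        ".vcf": "vcf",
        ".bed": "bed",
        ".fasta": "fasta",
        ".fa": "fasta",
        ".fastq": "fastq",
        ".fq": "fastq",
        ".gfa": "gfa",
        ".gff": "gff",
        ".gtf": "gtf",
        ".gff3": "gff3",
        ".sam": "sam",
        ".bam": "bam",
    }

    def guess_format(filename):
        """Return the format label for a filename, based on its last suffix."""
        suffix = Path(filename).suffix.lower()
        try:
            return SUFFIX_TO_FORMAT[suffix]
        except KeyError:
            raise ValueError(f"Unsupported file extension: {suffix!r}") from None

    print(guess_format("example_data/test.bed"))  # bed
    print(guess_format("reads.fq"))               # fastq

`bnp.open` handles detection for you, so a sketch like this is only useful for reasoning about which file names will be recognized.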

=======
Example
=======
We open a bed file, read one chunk and print a description of that chunk:

.. testcode::

    import bionumpy as bnp
    data = bnp.open("example_data/test.bed")
    chunk = data.read_chunk()
    print(chunk)

The same code should work with any of the supported file formats.

The output shows that the chunk contains 71 intervals, followed by the first ten of them:

.. testoutput::

    Interval with 71 entries
                   chromosome                    start                     stop
                           17                  7512371                  7512447
                           17                  7512377                  7512453
                           17                  7512393                  7512469
                           17                  7512420                  7512496
                           17                  7512422                  7512473
                           17                  7512425                  7512501
                           17                  7512428                  7512504
                           17                  7512474                  7512550
                           17                  7512537                  7512613
                           17                  7512559                  7512635


Reading a new file format
------------------------------

BioNumPy works well with popular file formats for biological data.
However, if you have a custom file format that you would like to read into BioNumPy,
you can implement a new buffer class that inherits from `DelimitedBuffer` and
specifies the dataclass that should be used to store the data.
The example below shows how to do this.

We first define a custom dataclass (`MyCustomFormat`) whose fields correspond to the columns of our file format.
We then define a new buffer class (`MyCustomBuffer`) that inherits from `DelimitedBuffer` and sets
its `dataclass` attribute to `MyCustomFormat`.
Finally, we pass the buffer class to `bnp.open` through the `buffer_type` argument to read any file that has this format.

.. testcode::

    from bionumpy.io.delimited_buffers import DelimitedBuffer
    from bionumpy.bnpdataclass import bnpdataclass
    import bionumpy as bnp

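    # The fields correspond to the columns of the custom file format.
    @bnpdataclass
    class MyCustomFormat:
        dna: bnp.DNAEncoding
        amino_acid: bnp.AminoAcidEncoding
        v_gene: str
        j_gene: str

    # The buffer class tells BioNumPy which dataclass to parse the file into.
    class MyCustomBuffer(DelimitedBuffer):
        dataclass = MyCustomFormat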

    my_sequence_data = bnp.open(filename="example_data/airr.tsv", buffer_type=MyCustomBuffer).read()
    print(my_sequence_data)

.. testoutput::

 MyCustomFormat with 100 entries
                          dna               amino_acid                   v_gene                   j_gene
      TGCGCCACCTGGGGGGACGAGCA               CATWGDEQYF                 TRBV10-2                  TRBJ2-7
      TGTGCCAGCTCACCTACGAATTC         CASSPTNSGSNYGYTF                   TRBV18                  TRBJ1-2
      TGCGGGCCCGTAATGAACACTGA              CGPVMNTEAFF                 TRBV10-2                  TRBJ1-1
      TGTGCCAGCAGTGAAGCGCGTCC         CASSEARPARMYGYTF                  TRBV6-1                  TRBJ1-2
      TGTGCCAGCAGTAGTGGGACAGG          CASSSGTGPDQPQHF                  TRBV6-3                  TRBJ1-5
      TGTGCCAGCAACCTAGCGGGGAA          CASNLAGKNTGELFF                  TRBV6-2                  TRBJ2-2
      TGTGCCAGCAGCCAACCGGGGGG         CASSQPGGSGNYGYTF                  TRBV4-2                  TRBJ1-2
      TGCGCCAGCAGCCGCGGCCTCAG           CASSRGLREETQYF                  TRBV5-1                  TRBJ2-5
      TGTGCCAGCAGCCAAGTCTCACG        CASSQVSRQDSSYEQYF                  TRBV4-2                  TRBJ2-7
      TGTGCCAGCAGGCCGGGACAGGG     CASRPGQGAPGWEDNYGYTF                   TRBV28                  TRBJ1-2
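
To make the idea of a dataclass whose fields correspond to file columns more concrete, here is a minimal standard-library analogy. It is a sketch under stated assumptions, not how `DelimitedBuffer` works internally: the names `Row` and `parse_rows` are hypothetical, columns are assumed to be matched to fields by position, the input is assumed to be tab-separated with no header line, and the two input lines are toy values modelled on the output above.

.. code-block:: python

    import csv
    import io
    from dataclasses import dataclass, fields

    # Hypothetical plain-Python stand-in for MyCustomFormat (all fields as str).
    @dataclass
    class Row:
        dna: str
        amino_acid: str
        v_gene: str
        j_gene: str

    def parse_rows(text, row_type=Row):
        """Parse headerless tab-separated text, matching columns to fields by position."""
        names = [f.name for f in fields(row_type)]
        rows = []
        for values in csv.reader(io.StringIO(text), delimiter="\t"):
            if len(values) != len(names):
                raise ValueError(f"Expected {len(names)} columns, got {len(values)}")
            rows.append(row_type(**dict(zip(names, values))))
        return rows

    # Toy input for illustration only.
    text = (
        "TGCGCCACCTGG\tCATWGDEQYF\tTRBV10-2\tTRBJ2-7\n"
        "TGTGCCAGCTCA\tCASSPTNSGSNYGYTF\tTRBV18\tTRBJ1-2\n"
    )
    rows = parse_rows(text)
    print(len(rows))       # 2
    print(rows[0].v_gene)  # TRBV10-2

Unlike this row-by-row analogy, the BioNumPy example above returns a single `MyCustomFormat` object holding all entries, as its printed output shows.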