Simulating sequence datasets
------------------------------

Simulating sequences is very straightforward in BioNumPy. Since a sequence arrays have underlying numeric representations that are easy to relate to, one can directly simulate such an underlying representation using any numeric simulation procedure. In addition, bionumpy provides convenience functions that allows to directly simulate sequence data without having to think about any underlying numeric representation. This is the focus of the current tutorial. The function simulate_sequences is an easy way to simulate a set of sequences, by simply specifying a desired alphabet and a dictionary with desired sequence ids as keys and the desired length of each such sequence as value (here simulating 20 sequences with length 10..30):

    >>> import numpy as np
    >>> rng = np.random.default_rng(seed=1)
    >>> from bionumpy.simulate import simulate_sequences
    >>> from bionumpy.sequence import match_string
    >>> named_seqs = simulate_sequences('ACGT', {f's{i}':10+i for i in range(20)}, rng)
    >>> named_seqs
    SequenceEntry with 20 entries
                         name                 sequence
                           s0               CGTTAATTAC
                           s1              TCCTCCGGAAT
                           s2             TTGTCCTACACT
                           s3            ACCTAGCATACCC
                           s4           ATGTAGCGTCGACT
                           s5          CGCACGCTCGTTCAG
                           s6         GTCCACGTTAGTCCTG
                           s7        GGGTTAAGTAGTTTAGT
                           s8       CACAATGTTTCCGCTATG
                           s9      CGCTTCCAGGTTTTTAACC


One can now easily do a variety of analyses on these simulated sequences, e.g. compute the GC content per simulated sequence:

    >>> seqs = named_seqs.sequence
    >>> gc_content_per_seq = np.mean((seqs=='C')|(seqs=='G'), axis=1)
    >>> gc_content_per_seq
    array([0.3       , 0.54545455, 0.41666667, 0.53846154, 0.5       ,
           0.66666667, 0.5625    , 0.35294118, 0.44444444, 0.47368421,
           0.5       , 0.38095238, 0.68181818, 0.30434783, 0.66666667,
           0.52      , 0.46153846, 0.44444444, 0.53571429, 0.51724138])


If desired, such computed values per sequence can easily be added back as an additional column of the bionumpy data structure:

    >>> named_seqs = named_seqs.add_fields({'gc':gc_content_per_seq}, {'gc':float})
    >>> named_seqs
    DynamicSequenceEntry with 20 entries
                         name                 sequence                       gc
                           s0               CGTTAATTAC                      0.3
                           s1              TCCTCCGGAAT       0.5454545454545454
                           s2             TTGTCCTACACT       0.4166666666666667
                           s3            ACCTAGCATACCC       0.5384615384615384
                           s4           ATGTAGCGTCGACT                      0.5
                           s5          CGCACGCTCGTTCAG       0.6666666666666666
                           s6         GTCCACGTTAGTCCTG                   0.5625
                           s7        GGGTTAAGTAGTTTAGT      0.35294117647058826
                           s8       CACAATGTTTCCGCTATG       0.4444444444444444
                           s9      CGCTTCCAGGTTTTTAACC      0.47368421052631576


We can also easily apply a variety of built-in bionumpy functionality on our simulated sequences:
    >>> ac_hits = match_string(seqs, "AC")
    >>> ac_hit_sums = np.sum(ac_hits,axis=1)
    >>> ac_hit_sums
    array([1, 0, 2, 2, 1, 1, 1, 0, 1, 1, 1, 1, 2, 1, 2, 2, 2, 0, 1, 1])
    >>> named_seqs = named_seqs.add_fields({'ac_hits':ac_hit_sums}, {'ac_hits':int})
    >>> named_seqs
    DynamicSequenceEntry with 20 entries
                         name                 sequence                       gc                  ac_hits
                           s0               CGTTAATTAC                      0.3                        1
                           s1              TCCTCCGGAAT       0.5454545454545454                        0
                           s2             TTGTCCTACACT       0.4166666666666667                        2
                           s3            ACCTAGCATACCC       0.5384615384615384                        2
                           s4           ATGTAGCGTCGACT                      0.5                        1
                           s5          CGCACGCTCGTTCAG       0.6666666666666666                        1
                           s6         GTCCACGTTAGTCCTG                   0.5625                        1
                           s7        GGGTTAAGTAGTTTAGT      0.35294117647058826                        0
                           s8       CACAATGTTTCCGCTATG       0.4444444444444444                        1
                           s9      CGCTTCCAGGTTTTTAACC      0.47368421052631576                        1