Simulating sequence datasets
Simulating sequences is very straightforward in BioNumPy. Since a sequence arrays have underlying numeric representations that are easy to relate to, one can directly simulate such an underlying representation using any numeric simulation procedure. In addition, bionumpy provides convenience functions that allows to directly simulate sequence data without having to think about any underlying numeric representation. This is the focus of the current tutorial. The function simulate_sequences is an easy way to simulate a set of sequences, by simply specifying a desired alphabet and a dictionary with desired sequence ids as keys and the desired length of each such sequence as value (here simulating 20 sequences with length 10..30):
>>> import numpy as np
>>> rng = np.random.default_rng(seed=1)
>>> from bionumpy.simulate import simulate_sequences
>>> from bionumpy.sequence import match_string
>>> named_seqs = simulate_sequences('ACGT', {f's{i}':10+i for i in range(20)}, rng)
>>> named_seqs
SequenceEntry with 20 entries
name sequence
s0 CGTTAATTAC
s1 TCCTCCGGAAT
s2 TTGTCCTACACT
s3 ACCTAGCATACCC
s4 ATGTAGCGTCGACT
s5 CGCACGCTCGTTCAG
s6 GTCCACGTTAGTCCTG
s7 GGGTTAAGTAGTTTAGT
s8 CACAATGTTTCCGCTATG
s9 CGCTTCCAGGTTTTTAACC
One can now easily do a variety of analyses on these simulated sequences, e.g. compute the GC content per simulated sequence:
>>> seqs = named_seqs.sequence
>>> gc_content_per_seq = np.mean((seqs=='C')|(seqs=='G'), axis=1)
>>> gc_content_per_seq
array([0.3 , 0.54545455, 0.41666667, 0.53846154, 0.5 ,
0.66666667, 0.5625 , 0.35294118, 0.44444444, 0.47368421,
0.5 , 0.38095238, 0.68181818, 0.30434783, 0.66666667,
0.52 , 0.46153846, 0.44444444, 0.53571429, 0.51724138])
If desired, such computed values per sequence can easily be added back as an additional column of the bionumpy data structure:
>>> named_seqs = named_seqs.add_fields({'gc':gc_content_per_seq}, {'gc':float})
>>> named_seqs
DynamicSequenceEntry with 20 entries
name sequence gc
s0 CGTTAATTAC 0.3
s1 TCCTCCGGAAT 0.5454545454545454
s2 TTGTCCTACACT 0.4166666666666667
s3 ACCTAGCATACCC 0.5384615384615384
s4 ATGTAGCGTCGACT 0.5
s5 CGCACGCTCGTTCAG 0.6666666666666666
s6 GTCCACGTTAGTCCTG 0.5625
s7 GGGTTAAGTAGTTTAGT 0.35294117647058826
s8 CACAATGTTTCCGCTATG 0.4444444444444444
s9 CGCTTCCAGGTTTTTAACC 0.47368421052631576
- We can also easily apply a variety of built-in bionumpy functionality on our simulated sequences:
>>> ac_hits = match_string(seqs, "AC") >>> ac_hit_sums = np.sum(ac_hits,axis=1) >>> ac_hit_sums array([1, 0, 2, 2, 1, 1, 1, 0, 1, 1, 1, 1, 2, 1, 2, 2, 2, 0, 1, 1]) >>> named_seqs = named_seqs.add_fields({'ac_hits':ac_hit_sums}, {'ac_hits':int}) >>> named_seqs DynamicSequenceEntry with 20 entries name sequence gc ac_hits s0 CGTTAATTAC 0.3 1 s1 TCCTCCGGAAT 0.5454545454545454 0 s2 TTGTCCTACACT 0.4166666666666667 2 s3 ACCTAGCATACCC 0.5384615384615384 2 s4 ATGTAGCGTCGACT 0.5 1 s5 CGCACGCTCGTTCAG 0.6666666666666666 1 s6 GTCCACGTTAGTCCTG 0.5625 1 s7 GGGTTAAGTAGTTTAGT 0.35294117647058826 0 s8 CACAATGTTTCCGCTATG 0.4444444444444444 1 s9 CGCTTCCAGGTTTTTAACC 0.47368421052631576 1