Simulating sequence datasets ------------------------------ Simulating sequences is very straightforward in BioNumPy. Since a sequence arrays have underlying numeric representations that are easy to relate to, one can directly simulate such an underlying representation using any numeric simulation procedure. In addition, bionumpy provides convenience functions that allows to directly simulate sequence data without having to think about any underlying numeric representation. This is the focus of the current tutorial. The function simulate_sequences is an easy way to simulate a set of sequences, by simply specifying a desired alphabet and a dictionary with desired sequence ids as keys and the desired length of each such sequence as value (here simulating 20 sequences with length 10..30): >>> import numpy as np >>> rng = np.random.default_rng(seed=1) >>> from bionumpy.simulate import simulate_sequences >>> from bionumpy.sequence import match_string >>> named_seqs = simulate_sequences('ACGT', {f's{i}':10+i for i in range(20)}, rng) >>> named_seqs SequenceEntry with 20 entries name sequence s0 CGTTAATTAC s1 TCCTCCGGAAT s2 TTGTCCTACACT s3 ACCTAGCATACCC s4 ATGTAGCGTCGACT s5 CGCACGCTCGTTCAG s6 GTCCACGTTAGTCCTG s7 GGGTTAAGTAGTTTAGT s8 CACAATGTTTCCGCTATG s9 CGCTTCCAGGTTTTTAACC One can now easily do a variety of analyses on these simulated sequences, e.g. compute the GC content per simulated sequence: >>> seqs = named_seqs.sequence >>> gc_content_per_seq = np.mean((seqs=='C')|(seqs=='G'), axis=1) >>> gc_content_per_seq array([0.3 , 0.54545455, 0.41666667, 0.53846154, 0.5 , 0.66666667, 0.5625 , 0.35294118, 0.44444444, 0.47368421, 0.5 , 0.38095238, 0.68181818, 0.30434783, 0.66666667, 0.52 , 0.46153846, 0.44444444, 0.53571429, 0.51724138]) If desired, such computed values per sequence can easily be added back as an additional column of the bionumpy data structure: >>> named_seqs = named_seqs.add_fields({'gc':gc_content_per_seq}, {'gc':float}) >>> named_seqs DynamicSequenceEntry with 20 entries name sequence gc s0 CGTTAATTAC 0.3 s1 TCCTCCGGAAT 0.5454545454545454 s2 TTGTCCTACACT 0.4166666666666667 s3 ACCTAGCATACCC 0.5384615384615384 s4 ATGTAGCGTCGACT 0.5 s5 CGCACGCTCGTTCAG 0.6666666666666666 s6 GTCCACGTTAGTCCTG 0.5625 s7 GGGTTAAGTAGTTTAGT 0.35294117647058826 s8 CACAATGTTTCCGCTATG 0.4444444444444444 s9 CGCTTCCAGGTTTTTAACC 0.47368421052631576 We can also easily apply a variety of built-in bionumpy functionality on our simulated sequences: >>> ac_hits = match_string(seqs, "AC") >>> ac_hit_sums = np.sum(ac_hits,axis=1) >>> ac_hit_sums array([1, 0, 2, 2, 1, 1, 1, 0, 1, 1, 1, 1, 2, 1, 2, 2, 2, 0, 1, 1]) >>> named_seqs = named_seqs.add_fields({'ac_hits':ac_hit_sums}, {'ac_hits':int}) >>> named_seqs DynamicSequenceEntry with 20 entries name sequence gc ac_hits s0 CGTTAATTAC 0.3 1 s1 TCCTCCGGAAT 0.5454545454545454 0 s2 TTGTCCTACACT 0.4166666666666667 2 s3 ACCTAGCATACCC 0.5384615384615384 2 s4 ATGTAGCGTCGACT 0.5 1 s5 CGCACGCTCGTTCAG 0.6666666666666666 1 s6 GTCCACGTTAGTCCTG 0.5625 1 s7 GGGTTAAGTAGTTTAGT 0.35294117647058826 0 s8 CACAATGTTTCCGCTATG 0.4444444444444444 1 s9 CGCTTCCAGGTTTTTAACC 0.47368421052631576 1