Sequences

The sequence module of BioNumPy provides functions for analysing sequences, such as extracting kmers and minimizers or computing motif scores across sequences.

Example:

import bionumpy as bnp
file = bnp.open("example_data/big.fq.gz")
sequence = file.read().sequence
sequence = bnp.change_encoding(sequence, bnp.DNAEncoding)
kmers = bnp.sequence.get_kmers(sequence, 31)
print(kmers[0:3, 0:2])  # first three sequences, first 2 kmers

[CGGTAGCCAGCTGCGTTCAGTATGGAAGATT, GGTAGCCAGCTGCGTTCAGTATGGAAGATTT]
[GATGCATACTTCGTTCGATTTCGTTTCAACT, ATGCATACTTCGTTCGATTTCGTTTCAACTG]
[GTTTTGTCGCTGCGTTCAGTTTATGGGTGCG, TTTTGTCGCTGCGTTCAGTTTATGGGTGCGG]

API documentation

bionumpy.sequence.get_kmers(sequence: EncodedRaggedArray, k: int) → EncodedArray

Get kmers for sequences. Sequences should be encoded with an AlphabetEncoding (e.g. DNAEncoding). Use bnp.change_encoding if your sequences do not have a suitable encoding.

Parameters

sequence : EncodedRaggedArray
    Sequences to get kmers from
k : int
    The kmer size (1-31)

Returns

EncodedRaggedArray: Kmers from the sequences.

Examples

>>> import bionumpy as bnp
>>> sequences = bnp.encoded_array.as_encoded_array(["ACTG", "AAA", "TTGGC"], bnp.DNAEncoding)
>>> bnp.sequence.get_kmers(sequences, 3)
encoded_ragged_array([[ACT, CTG],
                      [AAA],
                      [TTG, TGG, GGC]], 3merEncoding(AlphabetEncoding('ACGT')))

>>> sequences = bnp.open("example_data/big.fq.gz").read().sequence
>>> sequences = bnp.change_encoding(sequences, bnp.DNAEncoding)
>>> bnp.sequence.get_kmers(sequences, 31)[0, 0:3]  # first three kmers of first sequence
encoded_array([CGGTAGCCAGCTGCGTTCAGTATGGAAGATT, GGTAGCCAGCTGCGTTCAGTATGGAAGATTT, GTAGCCAGCTGCGTTCAGTATGGAAGATTTG], 31merEncoding(AlphabetEncoding('ACGT')))
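
To make the sliding-window behaviour concrete, here is a minimal plain-Python sketch that does not use BioNumPy. The helper names `kmers_of` and `pack_2bit` are hypothetical. The 2-bit packing is only an assumption about why k is capped at 31 (31 bases at 2 bits each fit in a 64-bit integer); the library's actual encoding may differ.

```python
# Illustration only: plain Python, not BioNumPy's implementation.

def kmers_of(sequence: str, k: int) -> list[str]:
    """Return every length-k substring of `sequence`, left to right."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # A sequence of length n has n - k + 1 kmers (none if n < k).
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def pack_2bit(kmer: str, alphabet: str = "ACGT") -> int:
    """Pack a kmer into an integer using 2 bits per base (assumed scheme)."""
    value = 0
    for base in kmer:
        value = (value << 2) | alphabet.index(base)
    return value


for seq in ["ACTG", "AAA", "TTGGC"]:
    print(seq, kmers_of(seq, 3))

# A 31-mer needs 62 bits, so it still fits in one unsigned 64-bit integer.
assert pack_2bit("T" * 31) < 2 ** 64
```

The three input strings are the ones used in the first example above, so the printed kmers can be compared with the documented output row by row.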

bionumpy.sequence.get_minimizers(sequence: EncodedRaggedArray, k: int, window_size: int) → EncodedRaggedArray

Get minimizers for sequences. Sequences should be encoded with an AlphabetEncoding (e.g. DNAEncoding).

Parameters

sequence : EncodedRaggedArray
    Sequences to get minimizers from
k : int
    The kmer size
window_size : int
    The window size

Returns

EncodedRaggedArray: Minimizers from the sequences.

Examples

>>> import bionumpy as bnp
>>> sequences = bnp.encoded_array.as_encoded_array(["ACTG", "AAA", "TTGGC"], bnp.DNAEncoding)
>>> bnp.sequence.get_minimizers(sequences, 2, 4)
encoded_ragged_array([[AC],
                      [],
                      [GG, GC]], 2merEncoding(AlphabetEncoding('ACGT')))
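
The output above can be reproduced with a short plain-Python sketch, shown here to illustrate what a minimizer is rather than how BioNumPy computes it. The helper name `minimizers` is hypothetical. It assumes that `window_size` is measured in bases (so each window holds `window_size - k + 1` kmers), that kmers are compared alphabetically, and that every window reports one minimizer without deduplication. These assumptions were chosen because they match the documented example; the library may order kmers differently.

```python
# Illustration only: plain Python, not BioNumPy's implementation.

def minimizers(sequence: str, k: int, window_size: int) -> list[str]:
    """Return the smallest kmer of every window of `window_size` bases.

    Assumptions: kmers are compared alphabetically (A < C < G < T) and
    each window contributes exactly one minimizer, with no deduplication.
    """
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    kmers_per_window = window_size - k + 1
    if kmers_per_window < 1:
        raise ValueError("window_size must be at least k")
    # Sequences shorter than one window yield no minimizers.
    n_windows = len(kmers) - kmers_per_window + 1
    return [min(kmers[i:i + kmers_per_window]) for i in range(n_windows)]


for seq in ["ACTG", "AAA", "TTGGC"]:
    print(seq, minimizers(seq, 2, 4))
```

Under these assumptions "AAA" yields an empty list because it is shorter than one window of 4 bases.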

bionumpy.sequence.get_motif_scores(sequence: EncodedRaggedArray, pwm: PWM) → RaggedArray

Computes motif scores for a motif on a sequence. Returns a RaggedArray with the score at each position in every read.

Parameters

sequence : EncodedRaggedArray
pwm : PWM
    Position weight matrix of the motif

Returns

RaggedArray: A numeric RaggedArray. Contains one row for every read with the scores for every position of that read.

Examples

>>> import bionumpy as bnp
>>> pwm = bnp.sequence.position_weight_matrix.PWM.from_dict({"A": [5, 1], "C": [1, 5], "G": [0, 0], "T": [0, 0]})
>>> sequences = bnp.as_encoded_array(["ACTGAC", "CA", "GG"])
>>> bnp.get_motif_scores(sequences, pwm)
ragged_array([5.99146455       -inf       -inf       -inf 5.99146455]
[2.77258872]
[-inf])
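
The numbers in the example above are consistent with summing log(p / 0.25) over the motif positions, that is, a log-likelihood ratio against a uniform background. The following plain-Python sketch works under that assumption and does not use BioNumPy; the helper names `window_score` and `motif_scores` are hypothetical.

```python
import math

# Illustration only: plain Python, not BioNumPy's implementation.

def window_score(window, probabilities, background=0.25):
    """Sum of log(p / background) over the positions of one window."""
    total = 0.0
    for position, base in enumerate(window):
        p = probabilities[base][position]
        # math.log(0) raises, so a zero probability is mapped to -inf.
        total += math.log(p / background) if p > 0 else -math.inf
    return total


def motif_scores(sequence, probabilities, background=0.25):
    """Score every window of motif length along `sequence`."""
    motif_length = len(next(iter(probabilities.values())))
    n_windows = len(sequence) - motif_length + 1
    return [
        window_score(sequence[i:i + motif_length], probabilities, background)
        for i in range(n_windows)
    ]


pwm = {"A": [5, 1], "C": [1, 5], "G": [0, 0], "T": [0, 0]}
for seq in ["ACTGAC", "CA", "GG"]:
    print(seq, motif_scores(seq, pwm))
```

A window that contains a base with zero probability at its position scores -inf, which is why only the two "AC" windows of "ACTGAC" get a finite score.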

Count the occurrences of encoded entries. Works on any encoding with a finite alphabet.

Parameters

values : EncodedArrayLike
weights : ArrayLike
    Weights for each entry
axis : int
    0 for column counts, -1 or 1 for row counts, None for flattened counts

Returns

EncodedCounts
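
As a rough illustration of the row and flattened counting modes described above, here is a plain-Python sketch that does not use BioNumPy. The helper name `count_letters` is hypothetical, plain dicts stand in for EncodedCounts, and weights and column counts (axis=0) are left out for brevity.

```python
from collections import Counter

# Illustration only: plain Python, not BioNumPy's implementation.

def count_letters(rows, alphabet="ACGT", axis=-1):
    """Count alphabet letters per row (axis=-1 or 1) or over all rows (axis=None)."""
    def counts(text):
        seen = Counter(text)
        # Report every letter of the alphabet, including absent ones.
        return {letter: seen[letter] for letter in alphabet}

    if axis is None:
        return counts("".join(rows))
    if axis in (-1, 1):
        return [counts(row) for row in rows]
    raise ValueError("this sketch only covers axis=-1, 1 or None")


reads = ["ACGT", "AAT"]
print(count_letters(reads))             # one dict per read
print(count_letters(reads, axis=None))  # one dict for all reads together
```

Because the alphabet is finite, every letter gets an entry even when its count is zero, which keeps the result the same shape for every row.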

Matches a sequence against sequences and returns a boolean RaggedArray representing positions where the sequence matches.

Parameters

sequence :
matching_sequence :

Returns

ArrayLike: A boolean RaggedArray representing positions where the sequence matches.

Examples

>>> import bionumpy as bnp
>>> sequence = bnp.as_encoded_array(['ACGT', 'TACTAC'])
>>> matching_sequence = bnp.as_encoded_array('AC', sequence.encoding)
>>> bnp.match_string(sequence, matching_sequence)
ragged_array([ True False False]
[False  True False False  True])
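
The shape of this result (one boolean per possible start position, so each row is shorter than its input by the pattern length minus one) can be seen in a small plain-Python sketch. It does not use BioNumPy, and the helper name `match_positions` is hypothetical.

```python
# Illustration only: plain Python, not BioNumPy's implementation.

def match_positions(sequence: str, pattern: str) -> list[bool]:
    """Return True at every start position where `pattern` occurs."""
    # Only positions where the whole pattern fits are considered.
    n_windows = len(sequence) - len(pattern) + 1
    return [
        sequence[i:i + len(pattern)] == pattern
        for i in range(n_windows)
    ]


for seq in ["ACGT", "TACTAC"]:
    print(seq, match_positions(seq, "AC"))
```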

class bionumpy.sequence.PWM(matrix, alphabet)

Class representing a Position Weight Matrix. Calculates scores based on the log-likelihood ratio between the motif and a background probability.

calculate_score(sequence: EncodedArrayLike) → float: Calculates the pwm score for a sequence of the same length as the motif

Parameters

sequence : EncodedArrayLike

Calculate motif scores for an entire sequence

Parameters

sequence : EncodedArrayLike

Returns

ArrayLike: Motif scores for all valid and invalid windows

classmethod from_counts(counts: Dict[str, List[int]]) → PWM: Create a PWM object from a dict of letters to position counts

Create a PWM object from a dict of letters to position probabilities

This takes raw probabilities as input, not log-likelihood ratios.

Parameters

dictionary : Dict[str, ArrayLike]
    Mapping of alphabet letters to position probability scores
background : Dict[str, float]
    Background probabilities. By default assume uniform probabilities

Returns

PWM: Position Weight Matrix object with log-likelihood ratios