Sequences

The sequence module of BioNumPy provides functions for analysing sequences, such as getting kmers and minimizers or computing motif scores across sequences.

Example:

import bionumpy as bnp
file = bnp.open("example_data/big.fq.gz")
sequence = file.read().sequence
sequence = bnp.change_encoding(sequence, bnp.DNAEncoding)
kmers = bnp.sequence.get_kmers(sequence, 31)
print(kmers[0:3, 0:2])  # first three sequences, first 2 kmers
[CGGTAGCCAGCTGCGTTCAGTATGGAAGATT, GGTAGCCAGCTGCGTTCAGTATGGAAGATTT]
[GATGCATACTTCGTTCGATTTCGTTTCAACT, ATGCATACTTCGTTCGATTTCGTTTCAACTG]
[GTTTTGTCGCTGCGTTCAGTTTATGGGTGCG, TTTTGTCGCTGCGTTCAGTTTATGGGTGCGG]

API documentation

bionumpy.sequence.get_kmers(sequence: EncodedRaggedArray, k: int) -> EncodedArray

Get kmers for sequences. Sequences should be encoded with an AlphabetEncoding (e.g. DNAEncoding). Use bnp.change_encoding if your sequences do not have a suitable encoding.

Parameters

sequence : EncodedRaggedArray

Sequences to get kmers from

k : int

The kmer size (1-31)

Returns

EncodedRaggedArray

Kmers from the sequences.

Examples

>>> import bionumpy as bnp
>>> sequences = bnp.encoded_array.as_encoded_array(["ACTG", "AAA", "TTGGC"], bnp.DNAEncoding)
>>> bnp.sequence.get_kmers(sequences, 3)
encoded_ragged_array([[ACT, CTG],
                      [AAA],
                      [TTG, TGG, GGC]], 3merEncoding(AlphabetEncoding('ACGT')))
>>> sequences = bnp.open("example_data/big.fq.gz").read().sequence
>>> sequences = bnp.change_encoding(sequences, bnp.DNAEncoding)
>>> bnp.sequence.get_kmers(sequences, 31)[0, 0:3]  # first three kmers of first sequence
encoded_array([CGGTAGCCAGCTGCGTTCAGTATGGAAGATT, GGTAGCCAGCTGCGTTCAGTATGGAAGATTT, GTAGCCAGCTGCGTTCAGTATGGAAGATTTG], 31merEncoding(AlphabetEncoding('ACGT')))
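To make the sliding-window behaviour in the examples above concrete, here is a minimal plain-Python sketch. It is not BioNumPy's implementation: `sliding_kmers`, `pack_kmer` and the `CODE` table are hypothetical names introduced only for illustration. The packing helper shows that 31 bases at two bits each fit in 62 bits, which may be related to the 1-31 limit on k, although this page does not state the reason.

```python
def sliding_kmers(seq, k):
    """Return every length-k substring of seq, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


# Assumed 2-bit code per base; the bit layout BioNumPy uses is not stated here.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack_kmer(kmer):
    """Pack a kmer into a single integer, two bits per base (illustrative only)."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value


for seq in ["ACTG", "AAA", "TTGGC"]:
    # A sequence of length n yields n - k + 1 kmers.
    print(seq, sliding_kmers(seq, 3))

# The largest possible 31-mer needs 62 bits under this packing.
print(pack_kmer("T" * 31).bit_length())  # 62
```

The three printed rows contain the same kmers as the `get_kmers(sequences, 3)` example above.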
bionumpy.sequence.get_minimizers(sequence: EncodedRaggedArray, k: int, window_size: int) -> EncodedRaggedArray

Get minimizers for sequences. Sequences should be encoded with an AlphabetEncoding (e.g. DNAEncoding).

Parameters

sequence : EncodedRaggedArray

Sequences to get minimizers from

k : int

The kmer size

window_size : int

The window size

Returns

EncodedRaggedArray

Minimizers from the sequences.

Examples

>>> import bionumpy as bnp
>>> sequences = bnp.encoded_array.as_encoded_array(["ACTG", "AAA", "TTGGC"], bnp.DNAEncoding)
>>> bnp.sequence.get_minimizers(sequences, 2, 4)
encoded_ragged_array([[AC],
                      [],
                      [GG, GC]], 2merEncoding(AlphabetEncoding('ACGT')))
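The example output above is consistent with one reading of the parameters: the window is measured in bases, so each window holds window_size - k + 1 consecutive kmers, and the smallest kmer in each window is reported. The sketch below implements that reading in plain Python. It assumes lexicographic ordering of kmers and emits one minimizer per window; the ordering BioNumPy actually uses is not stated on this page, and `window_minimizers` is a hypothetical helper, not a library function.

```python
def window_minimizers(seq, k, window_size):
    """Return the smallest kmer of each window of `window_size` bases.

    Assumes lexicographic order; a window of `window_size` bases
    contains window_size - k + 1 kmers.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    per_window = window_size - k + 1
    return [
        min(kmers[start:start + per_window])
        for start in range(len(kmers) - per_window + 1)
    ]


for seq in ["ACTG", "AAA", "TTGGC"]:
    print(seq, window_minimizers(seq, 2, 4))
```

Under these assumptions, "AAA" yields no minimizers because it is shorter than one window, matching the empty row in the example above.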
bionumpy.sequence.get_motif_scores(sequence: EncodedRaggedArray, pwm: PWM) -> RaggedArray

Computes motif scores for a motif on a sequence. Returns a RaggedArray with the score at each position in every read.

Parameters

sequence : EncodedRaggedArray

pwm : PWM

Returns

RaggedArray

A numeric RaggedArray. Contains one row for every read with the scores for every position of that read.

Examples

>>> import bionumpy as bnp
>>> pwm = bnp.sequence.position_weight_matrix.PWM.from_dict({"A": [5, 1], "C": [1, 5], "G": [0, 0], "T": [0, 0]})
>>> sequences = bnp.as_encoded_array(["ACTGAC", "CA", "GG"])
>>> bnp.get_motif_scores(sequences, pwm)
ragged_array([5.99146455       -inf       -inf       -inf 5.99146455]
[2.77258872]
[-inf])
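The scores in the example above can be reproduced by hand as a sum of natural-log ratios between the per-position value and a uniform background of 0.25. The following plain-Python sketch does this; it is an illustration under that assumption, not BioNumPy's code, and `log_ratio` and `motif_scores` are hypothetical names.

```python
import math


def log_ratio(p, q):
    """Natural log of p / q, with log(0) treated as -inf."""
    return math.log(p / q) if p > 0 else float("-inf")


def motif_scores(seq, probs, background=None):
    """Score every window of seq that has the same length as the motif."""
    if background is None:
        # Assumed default: uniform background over the alphabet.
        background = {letter: 1 / len(probs) for letter in probs}
    motif_length = len(next(iter(probs.values())))
    scores = []
    for start in range(len(seq) - motif_length + 1):
        window = seq[start:start + motif_length]
        scores.append(
            sum(log_ratio(probs[c][i], background[c]) for i, c in enumerate(window))
        )
    return scores


probs = {"A": [5, 1], "C": [1, 5], "G": [0, 0], "T": [0, 0]}
for seq in ["ACTGAC", "CA", "GG"]:
    print(seq, motif_scores(seq, probs))
```

For the window "AC" this gives 2 * ln(5 / 0.25), roughly 5.99146, and for "CA" it gives 2 * ln(1 / 0.25), roughly 2.77259, in line with the values printed above.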
bionumpy.sequence.count_encoded(values: EncodedArrayLike, weights: ArrayLike | None = None, axis: int = -1) -> EncodedCounts

Count the occurrences of encoded entries. Works on any encoding with a finite alphabet.

Parameters

values : EncodedArrayLike

weights : ArrayLike

Weights for each entry

axis : int

0 for column counts, -1 or 1 for row counts, None for flattened counts.

Returns

EncodedCounts
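As a rough illustration of the difference between row counts and flattened counts, the plain-Python sketch below counts letters per sequence and over all sequences together. It only mirrors one interpretation of the axis description above: `row_counts` and `flattened_counts` are hypothetical helpers, the fixed "ACGT" alphabet is an assumption, and weights and column counts are not covered.

```python
from collections import Counter

ALPHABET = "ACGT"  # assumed alphabet for this illustration


def row_counts(sequences):
    """One count vector per sequence (the 'row counts' reading)."""
    return [[Counter(seq)[letter] for letter in ALPHABET] for seq in sequences]


def flattened_counts(sequences):
    """A single count vector over all sequences (the 'flattened' reading)."""
    total = Counter("".join(sequences))
    return [total[letter] for letter in ALPHABET]


sequences = ["ACTG", "AAA", "TTGGC"]
print(row_counts(sequences))
print(flattened_counts(sequences))
```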

bionumpy.sequence.match_string(sequence: EncodedArrayLike, matching_sequence: SingleEncodedArrayLike) -> ArrayLike

Matches a sequence against sequences and returns a boolean RaggedArray representing positions where the sequence matches.

Parameters

sequence : EncodedArrayLike

matching_sequence : SingleEncodedArrayLike

Returns

ArrayLike

A boolean RaggedArray representing positions where the sequence matches.

Examples

>>> import bionumpy as bnp
>>> sequence = bnp.as_encoded_array(['ACGT', 'TACTAC'])
>>> matching_sequence = bnp.as_encoded_array('AC', sequence.encoding)
>>> bnp.match_string(sequence, matching_sequence)
ragged_array([ True False False]
[False  True False False  True])
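The shape of the result above follows from testing every start position at which the pattern still fits, so a sequence of length n and a pattern of length m give n - m + 1 booleans. A minimal plain-Python sketch of that logic, using a hypothetical `match_positions` helper rather than the library function:

```python
def match_positions(seq, pattern):
    """True at each start position where pattern occurs in seq."""
    m = len(pattern)
    return [seq[i:i + m] == pattern for i in range(len(seq) - m + 1)]


for seq in ["ACGT", "TACTAC"]:
    print(seq, match_positions(seq, "AC"))
```

This yields three values for "ACGT" and five for "TACTAC", the same row lengths as in the example above.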
class bionumpy.sequence.PWM(matrix, alphabet)

Class representing a Position Weight Matrix. Calculates scores based on the log-likelihood ratio between the motif and a background probability.

calculate_score(sequence: EncodedArrayLike) -> float

Calculates the PWM score for a sequence of the same length as the motif.

Parameters

sequence : EncodedArrayLike

calculate_scores(sequence: EncodedArrayLike) -> ArrayLike

Calculate motif scores for an entire sequence.

Parameters

sequence : EncodedArrayLike

Returns

ArrayLike

Motif scores for all valid and invalid windows

classmethod from_counts(counts: Dict[str, List[int]]) -> PWM

Create a PWM object from a dict of letters to position counts.

classmethod from_dict(dictionary: Dict[str, ArrayLike], background: Dict[str, float] | None = None) -> PWM

Create a PWM object from a dict of letters to position probabilities.

This takes raw probabilities as input, not log-likelihood ratios.

Parameters

dictionary : Dict[str, ArrayLike]

Mapping of alphabet letters to position probability scores

background : Dict[str, float]

Background probabilities. Defaults to uniform probabilities.

Returns

PWM

Position Weight Matrix object with log-likelihood ratios