Sequences
The sequence module of BioNumPy provides functions for analysing sequences, such as getting kmers and minimizers or computing motif scores across sequences.
Example:
import bionumpy as bnp
file = bnp.open("example_data/big.fq.gz")
sequence = file.read().sequence
sequence = bnp.change_encoding(sequence, bnp.DNAEncoding)
kmers = bnp.sequence.get_kmers(sequence, 31)
print(kmers[0:3, 0:2]) # first three sequences, first 2 kmers
[CGGTAGCCAGCTGCGTTCAGTATGGAAGATT, GGTAGCCAGCTGCGTTCAGTATGGAAGATTT]
[GATGCATACTTCGTTCGATTTCGTTTCAACT, ATGCATACTTCGTTCGATTTCGTTTCAACTG]
[GTTTTGTCGCTGCGTTCAGTTTATGGGTGCG, TTTTGTCGCTGCGTTCAGTTTATGGGTGCGG]
API documentation
- bionumpy.sequence.get_kmers(sequence: EncodedRaggedArray, k: int) -> EncodedArray
Get kmers for sequences. Sequences should be encoded with an AlphabetEncoding (e.g. DNAEncoding). Use bnp.change_encoding if your sequences do not have a suitable encoding.
Parameters
- sequence : EncodedRaggedArray
Sequences to get kmers from
- k : int
The kmer size (1-31)
Returns
- EncodedRaggedArray
Kmers from the sequences.
Examples
>>> import bionumpy as bnp
>>> sequences = bnp.encoded_array.as_encoded_array(["ACTG", "AAA", "TTGGC"], bnp.DNAEncoding)
>>> bnp.sequence.get_kmers(sequences, 3)
encoded_ragged_array([[ACT, CTG], [AAA], [TTG, TGG, GGC]], 3merEncoding(AlphabetEncoding('ACGT')))
>>> sequences = bnp.open("example_data/big.fq.gz").read().sequence
>>> sequences = bnp.change_encoding(sequences, bnp.DNAEncoding)
>>> bnp.sequence.get_kmers(sequences, 31)[0, 0:3] # first three kmers of first sequence
encoded_array([CGGTAGCCAGCTGCGTTCAGTATGGAAGATT, GGTAGCCAGCTGCGTTCAGTATGGAAGATTT, GTAGCCAGCTGCGTTCAGTATGGAAGATTTG], 31merEncoding(AlphabetEncoding('ACGT')))
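To make the kmer semantics concrete, the following plain-Python sketch reproduces the first example above without BioNumPy. The helper names `kmers` and `pack_kmer` are hypothetical and not part of the library. The 2-bit packing is only an assumption about why the kmer size is capped at 31 (31 bases at 2 bits each need 62 bits, which fits in a signed 64-bit integer); the bit order chosen here is not claimed to match BioNumPy's internal layout.

```python
# Illustrative only: plain-Python kmer extraction plus a 2-bit packing.
# The packing order is an assumption, not BioNumPy's internal representation.

ALPHABET = "ACGT"  # same letter order as AlphabetEncoding('ACGT')


def kmers(sequence: str, k: int) -> list:
    """Return every overlapping substring of length k, left to right."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def pack_kmer(kmer: str) -> int:
    """Pack a kmer into an integer, 2 bits per base, first base most significant."""
    value = 0
    for base in kmer:
        value = (value << 2) | ALPHABET.index(base)
    return value


reads = ["ACTG", "AAA", "TTGGC"]
result = [kmers(read, 3) for read in reads]
print(result)

# The largest possible 31-mer still fits below 2**63.
largest_31mer = pack_kmer("T" * 31)
print(largest_31mer.bit_length())
```

A read shorter than k simply yields an empty list, which corresponds to an empty row in a ragged result.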
- bionumpy.sequence.get_minimizers(sequence: EncodedRaggedArray, k: int, window_size: int) -> EncodedRaggedArray
Get minimizers for sequences. Sequences should be encoded with an AlphabetEncoding (e.g. DNAEncoding).
Parameters
- sequence : EncodedRaggedArray
Sequences to get minimizers from
- k : int
The kmer size
- window_size : int
The window size
Returns
- EncodedRaggedArray
Minimizers from the sequences.
Examples
>>> import bionumpy as bnp
>>> sequences = bnp.encoded_array.as_encoded_array(["ACTG", "AAA", "TTGGC"], bnp.DNAEncoding)
>>> bnp.sequence.get_minimizers(sequences, 2, 4)
encoded_ragged_array([[AC], [], [GG, GC]], 2merEncoding(AlphabetEncoding('ACGT')))
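The example output above is consistent with one particular reading of the parameters, sketched here in plain Python: `window_size` is counted in bases, so each window holds `window_size - k + 1` kmers, and the minimizer is the smallest kmer in the window under the A < C < G < T ordering. This interpretation is an assumption inferred from the example, and the helper names are hypothetical. The sketch emits one minimizer per window and does not collapse repeats, which the example does not exercise.

```python
# Illustrative only: a naive minimizer computation in plain Python.
# Assumes window_size is measured in bases and kmers are ordered A < C < G < T.

ALPHABET = "ACGT"


def kmer_rank(kmer: str) -> tuple:
    """Sort key that orders kmers by the alphabet order above."""
    return tuple(ALPHABET.index(base) for base in kmer)


def minimizers(sequence: str, k: int, window_size: int) -> list:
    """Return the smallest kmer of every window of window_size bases."""
    all_kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    kmers_per_window = window_size - k + 1
    n_windows = len(all_kmers) - kmers_per_window + 1
    return [
        min(all_kmers[start:start + kmers_per_window], key=kmer_rank)
        for start in range(n_windows)
    ]


reads = ["ACTG", "AAA", "TTGGC"]
result = [minimizers(read, 2, 4) for read in reads]
print(result)
```

"AAA" is shorter than the window, so it contributes no minimizers, matching the empty row in the documented output.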
- bionumpy.sequence.get_motif_scores(sequence: EncodedRaggedArray, pwm: PWM) -> RaggedArray
Computes motif scores for a motif on a sequence. Returns a RaggedArray with the score at each position in every read.
Parameters
- sequence : EncodedRaggedArray
Sequences (reads) to score
- pwm : PWM
The position weight matrix of the motif
Returns
- RaggedArray
A numeric RaggedArray. Contains one row for every read with the scores for every position of that read.
Examples
>>> import bionumpy as bnp
>>> pwm = bnp.sequence.position_weight_matrix.PWM.from_dict({"A": [5, 1], "C": [1, 5], "G": [0, 0], "T": [0, 0]})
>>> sequences = bnp.as_encoded_array(["ACTGAC", "CA", "GG"])
>>> bnp.get_motif_scores(sequences, pwm)
ragged_array([5.99146455 -inf -inf -inf 5.99146455]
[2.77258872]
[-inf])
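The numbers in the example above can be reproduced by hand, which helps when checking what a score means. The sketch below is plain Python rather than BioNumPy code, and its function names are hypothetical. It assumes a uniform background of 0.25 per letter, natural logarithms, and that the dictionary values are used as given without normalisation; under those assumptions each window score is the sum of log(value / background) over the motif positions, and any zero entry makes the score -inf.

```python
# Illustrative only: sliding-window log-likelihood-ratio scoring in plain Python.
# Assumes a uniform background of 0.25 and natural logarithms.
import math

BACKGROUND = 0.25


def window_score(window: str, matrix: dict) -> float:
    """Sum log(value / background) over the positions of one window."""
    score = 0.0
    for position, letter in enumerate(window):
        value = matrix[letter][position]
        if value == 0:
            return -math.inf  # a zero entry rules the window out entirely
        score += math.log(value / BACKGROUND)
    return score


def motif_scores(sequence: str, matrix: dict) -> list:
    """Score every window of motif width along one sequence."""
    width = len(next(iter(matrix.values())))
    return [
        window_score(sequence[i:i + width], matrix)
        for i in range(len(sequence) - width + 1)
    ]


matrix = {"A": [5, 1], "C": [1, 5], "G": [0, 0], "T": [0, 0]}
scores = [motif_scores(read, matrix) for read in ["ACTGAC", "CA", "GG"]]
print(scores)
```

For the window "AC" this gives 2 * log(5 / 0.25), roughly 5.9915, and for "CA" it gives 2 * log(1 / 0.25), roughly 2.7726, in line with the documented output.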
- bionumpy.sequence.count_encoded(values: EncodedArrayLike, weights: ArrayLike | None = None, axis: int = -1) -> EncodedCounts
Count the occurrences of encoded entries. Works on any encoding with a finite alphabet.
Parameters
- values : EncodedArrayLike
- weights : ArrayLike
Weights for each entry
- axis : int
0 for column counts, -1 or 1 for row counts, None for flattened counts
Returns
EncodedCounts
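The three `axis` settings are easiest to compare side by side. The following plain-Python sketch mimics them on ordinary strings; `count_letters` is a hypothetical helper, not a BioNumPy function, and it returns plain dictionaries rather than an EncodedCounts object. Weights are left out, and column counts are assumed to require rows of equal length.

```python
# Illustrative only: letter counts per row, per column, or flattened.
# Not BioNumPy's implementation; weights are not handled here.
from collections import Counter

ALPHABET = "ACGT"


def tally(letters) -> dict:
    """Count each alphabet letter, reporting zero for absent letters."""
    counts = Counter(letters)
    return {letter: counts[letter] for letter in ALPHABET}


def count_letters(rows: list, axis=-1):
    if axis is None:
        return tally("".join(rows))  # one count over everything
    if axis in (-1, 1):
        return [tally(row) for row in rows]  # one count per row
    if axis == 0:
        if len({len(row) for row in rows}) > 1:
            raise ValueError("column counts need rows of equal length")
        return [tally(column) for column in zip(*rows)]  # one count per column
    raise ValueError(f"unsupported axis: {axis!r}")


rows = ["ACGT", "AACC"]
print(count_letters(rows))             # row counts
print(count_letters(rows, axis=0))     # column counts
print(count_letters(rows, axis=None))  # flattened counts
```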
- bionumpy.sequence.match_string(sequence: EncodedArrayLike, matching_sequence: SingleEncodedArrayLike) -> ArrayLike
Matches a sequence against sequences and returns a boolean RaggedArray representing positions where the sequence matches.
Parameters
- sequence : EncodedArrayLike
- matching_sequence : SingleEncodedArrayLike
Returns
- ArrayLike
A boolean RaggedArray representing positions where the sequence matches.
Examples
>>> import bionumpy as bnp
>>> sequence = bnp.as_encoded_array(['ACGT', 'TACTAC'])
>>> matching_sequence = bnp.as_encoded_array('AC', sequence.encoding)
>>> bnp.match_string(sequence, matching_sequence)
ragged_array([ True False False]
[False True False False True])
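The shape of the result above follows from sliding the query along each sequence: a sequence of length n and a query of length m give n - m + 1 start positions. A plain-Python sketch of that behaviour, using a hypothetical `match_positions` helper rather than BioNumPy itself, is:

```python
# Illustrative only: exact-match positions via a sliding comparison.

def match_positions(sequence: str, query: str) -> list:
    """Return True at every start position where query occurs in sequence."""
    width = len(query)
    return [
        sequence[start:start + width] == query
        for start in range(len(sequence) - width + 1)
    ]


sequences = ["ACGT", "TACTAC"]
result = [match_positions(seq, "AC") for seq in sequences]
print(result)
```

Each row is one entry shorter than its sequence here because the query has length 2; overlapping matches would all be reported, since every start position is tested independently.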
- class bionumpy.sequence.PWM(matrix, alphabet)
Class representing a Position Weight Matrix. Calculates scores based on the log-likelihood ratio between the motif and a background probability.
- calculate_score(sequence: EncodedArrayLike) -> float
Calculates the PWM score for a sequence of the same length as the motif.
Parameters
sequence : EncodedArrayLike
- calculate_scores(sequence: EncodedArrayLike) -> ArrayLike
Calculates motif scores for an entire sequence.
Parameters
sequence : EncodedArrayLike
Returns
- ArrayLike
Motif scores for all valid and invalid windows
- classmethod from_counts(counts: Dict[str, List[int]]) -> PWM
Create a PWM object from a dict of letters to position counts
- classmethod from_dict(dictionary: Dict[str, ArrayLike], background: Dict[str, float] | None = None) -> PWM
Create a PWM object from a dict of letters to position probabilities.
This takes raw probabilities as input, not log-likelihood ratios.
Parameters
- dictionary : Dict[str, ArrayLike]
Mapping of alphabet letters to position probability scores
- background : Dict[str, float]
Background probabilities. Defaults to uniform probabilities.
Returns
- PWM
Position Weight Matrix object with log-likelihood ratios
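To show what the background argument changes, the sketch below converts a probability dictionary into log-likelihood ratios in plain Python. `log_odds_matrix` is a hypothetical helper and makes no claim about how the PWM class stores its matrix; it assumes natural logarithms and falls back to a uniform background over the letters present when none is given.

```python
# Illustrative only: probabilities -> log-likelihood ratios against a background.
import math


def log_odds_matrix(probabilities: dict, background: dict = None) -> dict:
    """Return log(p / background[letter]) for every letter and position."""
    letters = list(probabilities)
    if background is None:
        # Uniform background over the alphabet, e.g. 0.25 for DNA.
        background = {letter: 1 / len(letters) for letter in letters}
    return {
        letter: [
            math.log(p / background[letter]) if p > 0 else -math.inf
            for p in probabilities[letter]
        ]
        for letter in letters
    }


probabilities = {
    "A": [0.7, 0.1],
    "C": [0.1, 0.7],
    "G": [0.1, 0.1],
    "T": [0.1, 0.1],
}
uniform = log_odds_matrix(probabilities)
at_rich = log_odds_matrix(
    probabilities, background={"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}
)
print(uniform["A"][0], at_rich["A"][0])
```

With an AT-rich background, an A at the first position is less surprising, so its log-likelihood ratio drops from log(0.7 / 0.25) to log(0.7 / 0.4).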