Genomic Data

Analysis of genomic data is best done with the knowledge of how the genome looks like. At the very least, we should know how the names of each chromosome in the genome, and how long the sequence of each chromosome is. This serves several purposes:

  • We (and the computer) become aware of which reference genome we are using, which can prevent errors from mismatched reference genomes

  • Chromosomes without any data gets handled correctly

  • We get a speed increase in the algorithms

For these reasons, we always recomend using the genomic data framework for handling genomic data in BioNumPy, not using raw bed, vcf or bdg files directly from the bnp.open interface. The main access point of the genomic data interface is the genome object, which can be created directly from a dict of chromosome names and sizes; or, more commonly, from a .chrom.sizes or .fa.fai file, which are tab separated lines of chromosome names and sizes:

import bionumpy as bnp
genome = bnp.Genome.from_file("example_data/small.chrom.sizes")
print(genome)
Chromosome                     Size
      chr1                    10000
      chr2                    20000
      chr3                    30000

From this genome object we can create genomic data from reading files, or from bnpdataclass objects directly. For instance, we can now read a set of intervals from a .bed file and get a GenomicIntervals object. Because GenomicIntervals objects knows which reference genome they belong to, they are easier and faster to work with than raw ‘bnp.Interval’ objects.

intervals = genome.read_intervals('example_data/small_peaks.narrowPeak')
print(intervals)
mask = intervals.get_mask()
print(mask)
 Genomic Intervals on ['chr1', 'chr2', 'chr3']:
Interval with 13 entries
                chromosome                    start                     stop
                      chr1                      482                      790
                      chr1                     5873                     6183
                      chr1                     6969                     7289
                      chr2                      691                     1007
                      chr2                     6155                     6459
                      chr2                     8337                     8641
                      chr2                    11170                    11481
                      chr2                    12375                    12677
                      chr2                    12918                    13232
                      chr2                    18804                    19134
 chr1: [ False False False ... False False False]
 chr2: [ False False False ... False False False]
 chr3: [ False False False ... False False False]

The mask here is a boolean array which is True wherever any of the intervals overlap. We can also get a GenomicArray from a bedgraph file by using genome.read_track()

pileup = genome.read_track('example_data/small_treat_pileup.bdg')
print(pileup)
chr1: [ 0.0 0.0 1.0 ... 0.0 0.0 0.0]
chr2: [ 0.0 0.0 0.0 ... 0.0 0.0 0.0]
chr3: [ 0.0 0.0 0.0 ... 0.0 0.0 0.0]

A benefit of working with genomic data in this way is that they cooperate in a consistent way. We can use GenomicIntervals as indexes to GenomicData, and can again use the extracted data to filter the intervals. Let’s see how the treat_pileup looks in peak areas (we get the max and mean pileup value for each peak):

peak_pileups = pileup[intervals]
print(peak_pileups.max(axis=-1))
print(peak_pileups.mean(axis=-1))
[ 227.  231.  412.  296.  165.  163.  271.  148.  268.  568. 1901.   90.
  236.]
[145.68181818 147.86129032 256.671875   187.71202532 106.79934211
 106.23355263 173.04501608  96.27152318 169.59235669 346.37272727
 593.44084507  60.25614035 150.72580645]

we can now again use these values to filter the intervals based on the treat_pileup values in each interval. For instance only keep peaks with a max treatment_value above 200:

high_pileup_peaks = intervals[peak_pileups.max(axis=-1)>200]
print(high_pileup_peaks)
Genomic Intervals on ['chr1', 'chr2', 'chr3']:
 Interval with 9 entries
                chromosome                    start                     stop
                      chr1                      482                      790
                      chr1                     5873                     6183
                      chr1                     6969                     7289
                      chr2                      691                     1007
                      chr2                    11170                    11481
                      chr2                    12918                    13232
                      chr2                    18804                    19134
                      chr3                    10677                    11387
                      chr3                    27057                    27367

A further way to analyze these peaks is to check the sequence in the peaks for motifs. We can load the reference sequence using the genome.read_sequence() method:

genome_sequence = genome.read_sequence('example_data/small_sequence.fa')
print(genome_sequence)
GenomicSequence over chromosomes: ['chr1', 'chr2', 'chr3']

Now we can use our intervals as indexed to the reference sequence in much the same way as with genomic arrays. This we can use to get the sequences of the peaks and check them for motifs:

 peak_sequences = genome_sequence[high_pileup_peaks]
 print(peak_sequences)
 from pyjaspar import jaspardb
 from bionumpy.sequence import PWM
 jaspar_object = jaspardb(release="JASPAR2020")
 ctcf_motif = jaspar_object.fetch_motifs_by_name('CTCF')[0]
 motif = PWM.from_dict(ctcf_motif.pwm)
 print(bnp.get_motif_scores(peak_sequences, motif).max(axis=-1))
 AGAGCCGGACCGAATGACGT...
 TCAGGTAGAACTCGCATTTC...
 AAGACTTATTTGATGGCCGG...
 ATAGAGAGCGTTCGGCGCTA...
 CTCCTTAGCATACAAACGGG...
 TGCCCCATCTCTACACAATT...
 GCGCTGCCGTCACGGCGGGG...
 TTACATCCTGACGAAATACA...
 GACGGGTAGGCGATTTTTAT...
 [-0.82852725 -4.78231564 -0.90188366 -4.97871502 -3.06703885 -1.00359223
   7.75920829  0.19382436  2.5188205 ]

Further reading