The Smith-Waterman Algorithm and Parallelization

Results from university projects in Junior Year

Nitin
3 min readOct 7, 2020

To begin with, you can get this information on Wikipedia —

Chargaff’s rules state that DNA from any cell of any organisms should have a 1:1 ratio (base Pair Rule) of pyrimidine and purine bases and, more specifically, that the amount of guanine should be equal to cytosine and the amount of adenine should be equal to thymine. This pattern is found in both strands of the DNA.

And to add to it —

A with T: the purine adenine (A) always pairs with the pyrimidine thymine (T)

C with G: the pyrimidine cytosine always pairs with the purine guanine (G)

Image of a DNA
Source: The US National Library of Medicine

What is sequence alignment?

Sequence alignment is basically to identify regions within nucleotides such as RNA or DNA that may have functional or evolutionary similarity or relationship. Through years of research and of course cuz that research fits with the massive database available, we know that genetic sequences can be mathematically modelled as matrices or vectors. And this modelling is what actually helps us use all possible graph theory algorithms on our genes or it’s amino acids!

Local Vs Global Alignment

Sequence Alignment can be both global and local — to identify through an entire query at once or to find small patches of similarity in a varied population.(So very Shakespearean) . The Smith-Waterman Algorithm is a local sequence alignment algorithm with a very high degree of sensitivity. You can literally start from any point in a query(available population of genetic data) and end anywhere. Other algorithms — the local and especially the global algorithms are not very sensitive.

Picture: Report
Search might be both global and local

So, what makes Smith-Waterman better or worse?

Better sensitivity is reduced fault and hence, better results. But better sensitivity means that the algo needs to possess greater memory so that it remembers patterns longer. And it also means more time needed cuz a more thorough scanning through the available data is needed. And that is why Smith-Waterman is not widely used.

Genetic data can be massive. Data collected from a single sample group of humans can amount to millions of rows. With its quadratic complexity in both time and space, it is not a good idea to use the algorithm for a million or maybe billion plus database.

Parallelization, GPU, Containerization are answers, perhaps.

Parallelizing the algo might be a good idea. Running it on several machines or maybe using parallelization concepts introduced about 3 decades back. But, of course that has been tried. Parallel processing has its own bottlenecks and the infrastructure needed to establish a system that uses this algorithm to assess data with be costly both in setup and power consumption. The same is true for using GPU based systems. As per research, GPUs increases speed by more than 11 times for this algorithm. However, they can also be used for the already faster algorithms (shrug emoji). The same is true for containerization. This is what we tried(don’t know if such a project already exists. Shall be glad to compare results). Containerization using tools like Docker or Kubernetics reduces the need for hardware resources for setting up a system of computers. But then again, the actual processing power needed by the machine where this system is created will be pretty high.

That might be the cost for better results. (Remember, that is what Chris gardener did with that x-ray machine.) However, in our case, the results can anyway be improvised with repeating the faster algorithms several times and deducing faults from the massive dataset thus collected.

But, it is still a great idea since sequence alignment is also used in NLP and other fields. And I strongly believe that languages should be subjected to algorithms of very high sensitivity.

Here’s the link to our project report: P.D.C Report

Additional Tools and articles

  1. Wikipedia page on Gene Sequence Alignment
  2. The Smith-Waterman Algorithm Explained
  3. Pair-Wise Sequence Alignment Tool
  4. ClustalW2 tool for three or more sequence alignment program
  5. Comparing Sequences through curves

--

--