Multiple sequence alignment is a method of getting genomic relationships between 3 sequences or more. In multiple alignments, there are 3 mutation network analyses, namely topological network system, mutation region network and network system of mutation mode. In general, the three analyses show stable and unstable regions that map mutation regions. This area of mutation is described further in a phylogenetic tree which simultaneously illustrates the path of the spread of an epidemic, the Severe Acute Respiratory Syndrome (SARS) epidemic. The process of spreading the SARS viruses, in this case, is described as the process of phylogenetic tree formation, and as a novelty of this research, multiple alignments in the process are analyzed in detail and then optimized with genetic algorithms.
The data used to form the phylogenetic tree for the spread of the SARS epidemic are 14 DNA sequences which are then optimized by using genetic algorithms. The phylogenetic tree is constructed by using the neighbor-joining algorithm with a distance matrix that the intended distance is the genetic distance obtained from sequence alignment by using the Needleman Wunsch Algorithm.
The results of the analysis obtained 3649 stable areas and 19 unstable areas. The results of phylogenetic tree from the network system analysis indicated that the spread of the SARS epidemic extended from Guangzhou 16/12/02 to Zhongshan 27/12/02, then spread simultaneously to Guangzhou 18/02/03 and Guangzhou hospital. After that, the virus reached Metropole, Zhongshan, Hongkong, Singapore, Taiwan, Hong kong, and Hanoi which then continued to Guangzhou 01/01/03 and Toronto at once. The results of the mutation region network system demonstrate decomposition of orthogonal mutations in the 1st order arc.
The genetic algorithm is a searching and optimizing technique, which works by imitating the process of evolution and the genetic structure of living things. Sequence Alignment by Genetic Algorithm (SAGA) software tool is a software package that is also built on the genetic algorithm strategy, which appears to have the capability of finding comprehensively optimal or close-to-optimal multiple alignments in reasonable time [1Notredame C, Higgins DG. SAGA : Sequence alignment by genetic algorithm 24(8): 1515-24.1996; ]. However, such kind of practices of 22 operators by SAGA’s Genetic Algorithm were observed to be too obscure, and the level of complexity could be reduced. From that background, the multiple alignment of the epidemic, spread the pattern of Severe Acute Respiratory Syndrome (SARS) in the Arabian Peninsula, a case example in this research; which will eventually be developed by optimization with genetic algorithms. The multiple alignment applied a progressive alignment to generate phylogenetic trees delineating spread of the SARS epidemic, as well as a network topology and a network orthogonal decomposition of some used data. All the three analyses apply to multiple alignments, and in the following stage, add optimization of Genetic Algorithm (GA) simulated in matlab, which are expected to give more optimal results. The data used in this research are taken from www.ncbi.nlm.nih.gov, a world GenBank database owned by National Center for Biotechnology Information (NCBI) of USA.
Progressive alignment method is a heuristic algorithm that generates a multiple alignment based on a number of pairwise alignments. The general scheme of the method is: two sequences (Sequence 1 and Sequence 2) are aligned, then the third sequence is chosen and aligned with the first sequence. Thus, the process continues until all the sequences are aligned [2Isaev A. Introduction to mathematical methods in bioinformatics 2004; 29-31.].
CLUSTALW [3Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994; 22(22): 4673-80.
[http://dx.doi.org/10.1093/nar/22.22.4673] [PMID: 7984417] ] is a very popular multiple sequence alignment program. It employs a progressive alignment method which assessed by a biologist to analyze the same spreading virus case. The alignment processes are 1) an alignment sequence performed by using dynamic programming, and 2) scores of pairwise alignment are used to form the matrix of distance genetic distance, while phylogenetic trees are used to form the neighbor-joining method [4Isa Irawan M, Amiroch S. Construction of phylogenetic tree using neighbor joining algorithms to identify the host and the spreading of SARS epidemic. J Theor Appl Inf Technol 2015; 71(3)]. 3) A dynamic programming is utilized to highlight the nearest distance sequence alignment from the tree.
Needleman Wunsch is a global alignment algorithm to pairwise alignment [5Shen S, Tuszynski JA. Theory and Mathematical Methods for Bioinformatics 2007; 197-218.]. The steps in this algorithm are as follows. If there are sequences A=(a_{1},a_{2},...,a_{n}) and B = (b_{1},b_{2},...,b_{m}), then the table of the two sequences is included in Table 1 with s (i, j) obtained from the next step.
For element s(i,j) can be calculated using the following formula:
s(i,j)ijs(a_{i},b_{j})sijdsijdThen, the obtained alignments of these sequences are described as follows:
Among various tree topologies generated by the MA, the outputs are a graph and tree used to indicate the relationship between the mutation and evolution. Let M={1,2,....,m} be the subscript set of a MA output C, that is, i ϵ M corresponds to a sequence C_{i} Then, graphs G={M,V and G'={M',V'} are the extension of {M,V} which is similar to that as has been given by phylogenetic tree T'={M',V'}. The network systems generated by the MA output are as follows [5Shen S, Tuszynski JA. Theory and Mathematical Methods for Bioinformatics 2007; 197-218.]:
Penalty matrix is defined [5Shen S, Tuszynski JA. Theory and Mathematical Methods for Bioinformatics 2007; 197-218.], for example, C is alignment matrix induced by multiple sequence A. For any s,t ϵ M penalty function W= w(a,b) stated on V_{s} would be obtained from two expansions C_{s,t} and C_{t,s} based on pairwise sequence A_{s} and A_{t}. For pairwise C_{s,t} and C_{t,s} the penalty score is defined as follows:
(1) |
Where matrix
(2) |
is a matrix of penalties produced by pairwise alignment from multiple sequence A and shortened as a penalty matrix.
One of the methods to construct a tree from a collection of distances among each pair of sequence alignment is called Distance Method. This set of distances is shown in a matrix form, named matrix of distance [5Shen S, Tuszynski JA. Theory and Mathematical Methods for Bioinformatics 2007; 197-218.]. Example:
Fig. (1) shows a distance matrix for 5 sequences, i.e. Operational Taxonomy Units (OTUs), which are a set of sequences {x^{1}, x^{2}, x^{3}, x^{4}, x^{5}} representing five different virus species. Each element of the matrix represents the genetic distance between the sequences involved. For example, the distance between OTU x^{2} and x^{3} is 9, meaning that there are 9 genetic differences between the x^{2} and x^{3} sequences. This difference occurs because of the evolutionary process within its genetic structure, or the difference in the number of genes due to evolution.
Fig. (1) A distance matrix for 5 OTUs. |
The phylogenetic or evolutionary tree is a “tree” that shows the evolutionary relationships between different species of living things based on their similarities and physical and/or genetic characteristics. The taxa are derived from a common ancestor [6Cristianini N, Hahn M. Introduction to Computational Genomics 2006; 110-26.
[http://dx.doi.org/10.1017/CBO9780511808982.009] ].
The phylogenetic tree in this SARS case is used to find out the pattern of epidemic spread based on the locations and dates of sampling in the NCBI.
One of the distance based methods used to construct phylogenetic trees is the Neighbor-Joining method [7Amiroch S, Rohmatullah A. Determining geographical spread pattern of MERS-CoV by distance method using Kimura model AIP Conference Proceedings18252017;
[http://dx.doi.org/10.1063/1.4978970] ]. This algorithm requires the input of a distance matrix, where distance represents the dissimilarity of the aligned sequences.
The Neighbor Joining method starts from a star like structure as in Fig. (2a) and gathers all the “neighbors” together to form a tree without roots as output. For the set of N sequences, the computational steps are given as follows:
Genetic algorithms are classified in several forms such as the Simple Genetic Algorithm. The mechanism of simple genetic algorithms is by copying strings and partially exchanging them. This algorithm gives good results in many practical problems consisting of three operations, namely: 1) reproduction, 2) Cross over and, 3) Mutations [8Sivanandam S, Deepa S. Introduction to Genetic Algorithms Springer2008.].
The general steps of the common Genetic Algorithm can be figured out by the following pseudo-code [9Naznin F, Sarker R, Essam D. Vertical decomposition with Genetic Algorithm for Multiple Sequence Alignment. BMC Bioinformatics 2011; 12: 353-79.
[http://dx.doi.org/10.1186/1471-2105-12-353] [PMID: 21867510] ]:
Genetic algorithm measures how well a chromosome can solve a problem. The measurements are made by using the fitness function, which is the purpose function of the problem to be solved. The greater the fitness value is, the more chromosome in the population, the more likely it is to survive the next generation. The objective function used in the genetic algorithm for multiple sequence alignment follows a scoring scheme called Sum of Pairs [10Gondro C, Kinghorn BP. A simple genetic algorithm for multiple sequence alignment. Genet Mol Res 2007; 6(4): 964-82.
[PMID: 18058716] ]:
Fig. (2a,b) Neighbor-Joining a. Star-like, b. Treelike structure |
(3) |
Where W_{ij} = weight/distance sequences i and j
S_{ij} = score between sequences i and j
The selection is one operation to ensure the amount of upcoming chromosome generation which is gained from the fitness value compared to the average fitness value of population. It then defines the number of representatives of a chromosome received in the next generation. Afterwards, chromosomes with excellent fitness values will have a greater chance of being elected parent and remains to the next generation, while worse chromosomes will be substituted by new chromosomes. One of the techniques of selection in genetic algorithms is the technique of roulette wheel selection introduced by [11Goldberg DA. Genetic Algorithms in Search, Optimization and Machine Learning 1st ed. 1989.]. This selection technique is illustrated as a roulette disc playback technique. The size of the slot is equal to the ratio between the fitness value of a chromosome to the total fitness value of all chromosomes. To produce a population, the roulette is rotated as much as the size of the population [8Sivanandam S, Deepa S. Introduction to Genetic Algorithms Springer2008.].
A child's chromosome can be formed by two main processes; first is by combining elements between two parent chromosomes using a crossover operator, and secondly by modifying a parent chromosome using a mutation operator. The explanation is as follows:
Crossover is the primary operator or primary operator in the genetic algorithm. These operators work on a pair of parent chromosomes to produce two child chromosomes by exchanging some of the elements (genes) that each of the parent chromosomes possess.
Mutations are secondary operators or support operators in genetic algorithms that play a role in altering the chromosome structure unexpectedly. This unexpected change leads to the formation of a mutant, a new chromosome that is genetically distinctive from the preceding chromosome.
The data used in this study were 14 DNA sequences of patients infected with the SARS virus with the genbank access code AY278489, AY394997, AY395004, AY394978, AY394983, AY304495, AY278554, AY278741, AY274119, AY283794, AY291451, AY345986, AY394999 and AY627048. Six of them have been analyzed in paper [12Amiroch S, Pradana MS, Irawan MI, Mukhlash I. Multiple alignment analysis on phylogenetic tree of the spread of SARS epidemic using distance method. J Phys Conf Ser 2017; 890(1)], but in this paper 14 DNA sequences were analyzed along with optimization with genetic algorithms. After analyzing multiple alignment of the 14 DNA of human diseases with the SARS virus, the results obtained by the analysis of the network system topology, network systems area mutation, and network system mode mutations in detail are described as follows:
System network topology is produced by the result Multiple alignment, namely G(W)={M,V,W} where W is a function penalty of outcome Multiple alignment where pairwise alignment uses Needleman Wunsch algorithm simulated in Matlab as shown in the user menu interface Fig. (3).
From the alignment as displayed in Fig. (3), a penalty matrix can be derived as follows:
where A, B, C, D, E, F, G, H, I, J, K, L, and N represent the sequences in particular cities and dates respectively as follows Guangzhou, December 16^{th} 2002; Zhongshan, December 26^{th} 2002; Zhongshan, Jan 4^{th} 2003; Guangzhou, Jan 24^{th} 2003; Guangzhou Hospital; Guangzhou Feb 2^{nd} 2003; Metropole, Feb 21^{st} 2003; Hanoi, Feb 26^{th} 2003; Toronto, Feb 27^{th} 2003; Singapore, March 1^{st} 2003; Taiwan, March 8^{th} 2003; Hongkong, March 19^{th} 2003; Hongkong, May 15^{th} 2003, and Palm civet. Palm Civet is a ferret that was allegedly as host of the SARS epidemic [4Isa Irawan M, Amiroch S. Construction of phylogenetic tree using neighbor joining algorithms to identify the host and the spreading of SARS epidemic. J Theor Appl Inf Technol 2015; 71(3)]. SARS viruses were isolated from Himalayan palm civets found in a live-animal market in Guangdong, China. Evidence of virus infection was also detected in humans working at the same market. Palm civet sequence is raised from that patients [13Guan Y, Zheng BJ, He YQ, et al. Isolation and characterization of viruses related to the SARS coronavirus from animals in Southern ChinaScience (80- ) 302(5643): 276-8.2003;
[http://dx.doi.org/10.1126/science.1087139] ]. Network system topology analysis gains a stable area [12Amiroch S, Pradana MS, Irawan MI, Mukhlash I. Multiple alignment analysis on phylogenetic tree of the spread of SARS epidemic using distance method. J Phys Conf Ser 2017; 890(1)] telling similar nucleotide locus in multiple alignment, and unstable area capturing dissimilar nucleotide locus. The unstable area among sequences here is then well-known as mutation. Stable and unstable regions in the multiple alignment of SARS epidemic can be seen in Table 2.
In Table 2, there are 19 positions that seem unstable regions, with a percentage of 0.5%. It is clear that all the studied SARS DNA sequences have a very high similarity. The number of mutated nucleotides in each sequence is shown in detail in the Table 3.
Fig. (3) Output GUI of sequence alignment. |
Fig. (4) Phylogenetic tree spread of SARS Epidemic. |
The next analysis is mutation network system on multiple alignments of the SARS epidemic. In this section, the outline is how to construct a graph and tree produced by the SARS epidemic. The graph in Fig. (4) displays the phylogenetic tree that tells the SARS epidemic spread in particular regions. Distance matrix which is then converted into evolutionary distance matrix is used as the input for phylogenetic tree construction. Distance matrix obtained from dissimilarities nucleotide between pairs of sequences in multiple alignments. Furthermore, convert dissimilarity into evolutionary distance by correcting for multiple events per site with jukes cantor model [14Lemey P, Salemi M, Vandamme A-M. The Phylogenetic Handbook; A Practical Approach to Phylogenetic Analysis and Hypothesis Testing Second. New York: Cambridge University Press142-81.2009;
[http://dx.doi.org/10.1017/CBO9780511819049] ]. Here are the results of the simulation Matlab phylogenetic tree using neighbor-joining algorithm with Jukes Cantor distance correction.
In Fig. (4), it appears that the closest sequence to Palm Civet as host [4Isa Irawan M, Amiroch S. Construction of phylogenetic tree using neighbor joining algorithms to identify the host and the spreading of SARS epidemic. J Theor Appl Inf Technol 2015; 71(3)] is Zhongshan 26/12/02. However, if the attention is not much on genetic distance from Guangzhou 12/16/02, then it could be reasonably inferred that the extent of the SARS epidemic of Guangzhou 16/12/02, then spread to Zhongshan 26/12/02, then almost simultaneously to Guangzhou 02/18/03 and Guangzhou hospital. From there, the virus continued to spread to the Metropole, Zhongshan, Hongkong, Singapore, Taiwan, Hongkong, Hanoi, Guangzhou 24/01/03 and Toronto simultaneously.
Before explaining analysis of network system of mutations mode, from the penalty matrix, a non-directional graph can be visualized showing the relationship between sequence mutations. The notation on the node indicates the name of the encoded sequence as the letters A,B,...,N with the codes representing names of particular regions as mentioned before.
The number of mutations can be shown in Fig. (5). The thicker the lines, the more the mutation occurred. As mentioned, 19 mutations appeared in unstable regions on 14 different DNA sequences of this SARS epidemic. As shown in Fig. (5), some mutations occur only in the arc orthogonal order to-1, for example in ΔABE, ΔABF, ΔABD, ΔAFD, ΔBFD Mode mutation H_{AE} (a mutation in the sequence Guangzhou, Dec 16^{th} 2002 to Toronto, March 27^{th} 2003), mode mutations H_{AB} (mutations in the sequence Guangzhou, Dec 16^{th} 2002 to sequence Guangzhou Hospital), as well as the mode of mutation H_{BE} (a mutation in the sequence Guangzhou Hospital to sequence Toronto, March 27^{th} 2003).
In ΔABE effect: and structure modulus H_{AE}, H_{AB}, H_{BE} mutually orthogonal.
Genetic algorithm approach for multiple alignment in the case of the SARS epidemic is defined below:
Initial population is a penalty matrix of multiple alignment result of Needleman Wunsch alignment with the use of progressive alignment. Because all the data used have the same sequence length 3768bp, so the multiple alignment outcome has no gap at all.
Fig. (5) Decomposition of mutation network spread of SARS epidemic. |
The objective function used in this case is the score of the weight of the matrix MA Wunsch Needleman results. Objective function:
Fitness value is the value of the objective function:
Because of the spread of the SARS epidemic, the shortest distance showed the closest kinship, in the sense most closely with the host, then the fitness value is taken from the most minimal value.
Selection procedures used an approach of Roulette wheel. One chromosome was selected to produce a new population, and a number r was generated at random from the range [0,1], and the roulette disc was played 14 times.
At this crossing process, the sequence is broken down into several parts. Separation is assumed for each multiple of 500, so for a long sequence of 3768bp there are 8 part-solving sequences. Meanwhile the cross-linking process is done randomly but the benchmark crossing probability (P_{c}) is set to be 0.25. It means that with the values is expected to average 25% of chromosomes in the population will experience a crossing.
Mutation probability (P_{m}) value is set to be 0.01. This means that it is expected on average 1% of the total number of bits in the population will mutate.
In this case of 3768×14 = 52752 when it is multiplied by 0.01, this means that there are 528 mutations in a single generation.
The initial population is the Multiple Alignment with Needleman Wunsch algorithm as shown previously.
The program results are displayed in the command window in matlab. The 14 sequences in the initial population have a very high similarity as seen in Fig. (6).
Before the fitness value is calculated, the weight of MA is calculated from the penalty matrix. From the weight matrix, the score of each sequence is calculated. Retrieved:
Eval(v_1) = 124 ; Eval(v_6) = 64; Eval(v_11) = 46
Eval(v_2) = 102 ; Eval(v_7) = 46; Eval(v_12) = 46
Eval(v_3) = 46 ; Eval(v_8) = 58; Eval(v_13) = 70
Eval(v_4) = 58 ; Eval(v_9) = 58; Eval(v_14) = 112
Eval(v_5) = 52 ; Eval(v_10) = 46
From the above values, the strongest chromosome is the chromosome with the value closest to v_14 (palm civet, host of this SARS epidemic), ie chromosome v_1. And the weakest chromosome is the chromosome with the smallest values, i.e., which are v_7, v_10, v_11, and v_12.
For the 1st generation of the result process, a selection roulette wheel disc is obtained as in Table 4.
Then the disc is rotated 14 times randomly in the range [0,1], and a random value is assumed to be the value of r at each time of rotation as
Finally, after the selection is completed, a new population (potential parental chromosomes) is generated, consisting of chromosomes from sequences 1, 2, 3, 7, 8, 9, 13, and sequence 14.
In this crossover process, a sequence will be broken at any multiple of 500. The probability of crossing is set as 0.25 and based on the random generation in the range [0,1], sequence 2 and sequence 8 are selected to be crossed. So the sequence is broken at the positions of 500, 1000, 1500, and so on.
The mutation process will replace one or more genes with an opportunity equal to the mutation probability. A mutation refers to the change of one nucleotide to another nucleotide. After the crossover process on sequences 2 and sequence 8, the final population in the 1st generation had mutations in the sequences 5, 8, 12, and 14 as in Table 5. The number of mutations in the respective sequences is detailed as follows:
Thus the process repeats over and over again until a plot of fitness values gets close to a constant.
The results of iterations are performed until the 10th generation because for a higher number of generations, the system is not yet supportive. The last values obtained in the 10th iteration are shown as in Table 6.
So the plot of fitness values for the 10th generations based on the obtained results can be shown in the Fig. (7).
Fig. (7) appears that the result is still far from convergent so it needs to be iterated again until the result is near constant.
Fig. (6) Initial population from GA. |
Fig. (7) Plot the result of fitness value for the 10th generation. |
Based on the results of this study, some conclusions can be obtained as follows:
Not applicable.
No animals/humans were used for studies that are the basis of this research.
Not applicable.
The authors decalre that there is no conflict of interest, financial or otherwise.
This research is the outcome of a cooperation research among universities in 2017 were financed by Directorate of Research and Community Service, Directorate General for Research and Development, Ministry of Research, Technology and Higher Education (Kemenristekdikti) number 101/ SP2H/PPM/ DRPM/IV/2017, April 3^{rd} 2017.
[1] | Notredame C, Higgins DG. SAGA : Sequence alignment by genetic algorithm 24(8): 1515-24.1996; |
[2] | Isaev A. Introduction to mathematical methods in bioinformatics 2004; 29-31. |
[3] | Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994; 22(22): 4673-80. [http://dx.doi.org/10.1093/nar/22.22.4673] [PMID: 7984417] |
[4] | Isa Irawan M, Amiroch S. Construction of phylogenetic tree using neighbor joining algorithms to identify the host and the spreading of SARS epidemic. J Theor Appl Inf Technol 2015; 71(3) |
[5] | Shen S, Tuszynski JA. Theory and Mathematical Methods for Bioinformatics 2007; 197-218. |
[6] | Cristianini N, Hahn M. Introduction to Computational Genomics 2006; 110-26. [http://dx.doi.org/10.1017/CBO9780511808982.009] |
[7] | Amiroch S, Rohmatullah A. Determining geographical spread pattern of MERS-CoV by distance method using Kimura model AIP Conference Proceedings18252017; [http://dx.doi.org/10.1063/1.4978970] |
[8] | Sivanandam S, Deepa S. Introduction to Genetic Algorithms Springer2008. |
[9] | Naznin F, Sarker R, Essam D. Vertical decomposition with Genetic Algorithm for Multiple Sequence Alignment. BMC Bioinformatics 2011; 12: 353-79. [http://dx.doi.org/10.1186/1471-2105-12-353] [PMID: 21867510] |
[10] | Gondro C, Kinghorn BP. A simple genetic algorithm for multiple sequence alignment. Genet Mol Res 2007; 6(4): 964-82. [PMID: 18058716] |
[11] | Goldberg DA. Genetic Algorithms in Search, Optimization and Machine Learning 1st ed. 1989. |
[12] | Amiroch S, Pradana MS, Irawan MI, Mukhlash I. Multiple alignment analysis on phylogenetic tree of the spread of SARS epidemic using distance method. J Phys Conf Ser 2017; 890(1) |
[13] | Guan Y, Zheng BJ, He YQ, et al. Isolation and characterization of viruses related to the SARS coronavirus from animals in Southern ChinaScience (80- ) 302(5643): 276-8.2003; [http://dx.doi.org/10.1126/science.1087139] |
[14] | Lemey P, Salemi M, Vandamme A-M. The Phylogenetic Handbook; A Practical Approach to Phylogenetic Analysis and Hypothesis Testing Second. New York: Cambridge University Press142-81.2009; [http://dx.doi.org/10.1017/CBO9780511819049] |