Abstract HTML Views: 491 PDF Downloads: 123 Total Views/Downloads: 614
Abstract HTML Views: 332 PDF Downloads: 94 Total Views/Downloads: 426
Repetitive sequence mining plays very important roles and has been widely studied in DNA or genome, but
there are only a few researches in sequence database. Taking sequence <ababababc> for example, traditional sequential
pattern mining algorithms only regard <ab> appearing one time when calculating the support of <ab>, regardless of
<ab> appearing at least 4 times within the same data sequence. Taking this repetitive property into consideration can help
analysts to grasp more useful information. However, most of the existing algorithms on repetitive sequence mining are
used for DNA or genome and cannot be used for mining such repetitive patterns due to the different data properties.
Therefore, in this paper, (1) we propose a method to clearly determine the number of times a sequence appears in a data
sequence; (2) To solve the problem that the support of a sequence will be more than 100% because the sequence may appear
more than one times in a data sequence. We propose a method to ensure the support range of repetitive sequence still
within [0,100%] so as to let users set up minimum support threshold in a traditional way; and (3) we propose an algorithm,
RptGSP to efficiently mine such repetitive patterns in sequence database by improving the classic algorithm GSP.
Experimental results show that RptGSP is very efficient.
Repetitive sequential patterns (RSP) mining plays very important roles and has been widely studied in DNA or
genome, but there are only a few relevant approaches focusing on mining RSP from sequence database. Taking sequence
<bcbcbcbca> for example, traditional sequential pattern mining algorithms only consider that <bc> appears at one time
when calculating the support of <bc>, regardless of at least 4 times that <bc> appears within this same data sequence. Accordingly,
to catch much more interesting sequential patterns, repetitive property needs to be involved during the mining
process. However, currently the most relevant RSP methods focus on DNA analysis considering that they cannot be used
for recognizing repetitive patterns on events sequences. Therefore, we propose an approach to determine the number of
times a sequence repeatedly makes an appearance in a certain data sequence. The support value of a sequence could be
more than 100% as this sequence might repeat in one data sequence, therefore we proposed a strategy to ensure the support
range of repetitive sequence still within [0,100%]. Finally, we proposed an efficient algorithm, called RptGSP, to discover
such repetitive sequential patterns based on improving GSP Algorithm. The experimental results reveal that
RptGSP can efficiently discover the repetitive patterns.