-
Notifications
You must be signed in to change notification settings - Fork 22
/
Subsequence vs Substring
181 lines (145 loc) · 8.08 KB
/
Subsequence vs Substring
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
The difference between them, can cearly be made out from the following readings:
http://en.wikipedia.org/wiki/Subsequence
http://en.wikipedia.org/wiki/Substring
The ALGORITHMS for finding both are also DIFFERENT:
1. For Subsequnece:
We use the traditional LCS algorithm
2. For Substring:
We use,
1. KMP (Knuth–Morris–Pratt) algorithm
URL: http://jakeboxer.com/blog/2009/12/13/the-knuth-morris-pratt-algorithm-in-my-own-words/
http://www.quora.com/What-is-the-best-resource-to-learn-KMP-Algorithm
2. Rabin-Karp algorithm
URL: http://community.topcoder.com/tc?module=Static&d1=tutorials&d2=stringSearching
In actual understanding, subsequence and substring are DIFFERENT.
However Java developers implemented these two methods to return the same output.
For Example:
(Example Source: http://mytactics.blogspot.com/2013/10/difference-between-subsequence-and.html)
class StringSeqSub{
public static void main(String args[])
{
String str = "Hello World.!!";
// String s1 = str.subSequence(0, 5); // this line will throw error
CharSequence s1 = str.subSequence(0, 5);
String s2 = str.substring(0, 5);
System.out.println(s1);
System.out.println(s2);
}
}
Output -
Hello
Hello
Explanation of KMP Algorithm:
Source: http://jakeboxer.com/blog/2009/12/13/the-knuth-morris-pratt-algorithm-in-my-own-words/
The Knuth-Morris-Pratt Algorithm in my own words
DEC 13TH, 2009 | COMMENTS
For the past few days, I’ve been reading various explanations of the Knuth-Morris-Pratt string searching algorithms.
For some reason, none of the explanations were doing it for me. I kept banging my head against a brick wall once
I started reading “the prefix of the suffix of the prefix of the…”.
Finally, after reading the same paragraph of CLRS over and over for about 30 minutes, I decided to sit down,
do a bunch of examples, and diagram them out. I now understand the algorithm, and can explain it.
For those who think like me, here it is in my own words. As a side note, I’m not going to explain why it’s more
efficient than na”ive string matching; that’s explained perfectly well in a multitude of places.
I’m going to explain exactly how it works, as my brain understands it.
The Partial Match Table
The key to KMP, of course, is the partial match table. The main obstacle between me and understanding KMP was
the fact that I didn’t quite fully grasp what the values in the partial match table really meant.
I will now try to explain them in the simplest words possible.
Here’s the partial match table for the pattern “abababca”:
1
2
3
char: | a | b | a | b | a | b | c | a |
index: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
value: | 0 | 0 | 1 | 2 | 3 | 4 | 0 | 1 |
If I have an eight-character pattern (let’s say “abababca” for the duration of this example), my partial match
table will have eight cells. If I’m looking at the eighth and last cell in the table, I’m interested in the
entire pattern (“abababca”). If I’m looking at the seventh cell in the table, I’m only interested in the first
seven characters in the pattern (“abababc”); the eighth one (“a”) is irrelevant, and can go fall off a building
or something. If I’m looking at the sixth cell of the in the table… you get the idea. Notice that I haven’t
talked about what each cell means yet, but just what it’s referring to.
Now, in order to talk about the meaning, we need to know about proper prefixes and proper suffixes.
Proper prefix: All the characters in a string, with one or more cut off the end. “S”, “Sn”, “Sna”, and “Snap”
are all the proper prefixes of “Snape”.
Proper suffix: All the characters in a string, with one or more cut off the beginning. “agrid”, “grid”, “rid”,
“id”, and “d” are all proper suffixes of “Hagrid”.
With this in mind, I can now give the one-sentence meaning of the values in the partial match table:
The length of the longest proper prefix in the (sub)pattern that matches a proper suffix in the same (sub)pattern.
Let’s examine what I mean by that. Say we’re looking in the third cell. As you’ll remember from above, this means
we’re only interested in the first three characters (“aba”). In “aba”, there are two proper prefixes (“a” and “ab”)
and two proper suffixes (“a” and “ba”). The proper prefix “ab” does not match either of the two proper suffixes.
However, the proper prefix “a” matches the proper suffix “a”. Thus, the length of the longest proper prefix that
matches a proper suffix, in this case, is 1.
Let’s try it for cell four. Here, we’re interested in the first four characters (“abab”). We have three proper
prefixes (“a”, “ab”, and “aba”) and three proper suffixes (“b”, “ab”, and “bab”). This time, “ab” is in both,
and is two characters long, so cell four gets value 2.
Just because it’s an interesting example, let’s also try it for cell five, which concerns “ababa”. We have four
proper prefixes (“a”, “ab”, “aba”, and “abab”) and four proper suffixes (“a”, “ba”, “aba”, and “baba”). Now,
we have two matches: “a” and “aba” are both proper prefixes and proper suffixes. Since “aba” is longer than “a”,
it wins, and cell five gets value 3.
Let’s skip ahead to cell seven (the second-to-last cell), which is concerned with the pattern “abababc”.
Even without enumerating all the proper prefixes and suffixes, it should be obvious that there aren’t going
to be any matches; all the suffixes will end with the letter “c”, and none of the prefixes will. Since there
are no matches, cell seven gets 0.
Finally, let’s look at cell eight, which is concerned with the entire pattern (“abababca”). Since they both
start and end with “a”, we know the value will be at least 1. However, that’s where it ends; at lengths two
and up, all the suffixes contain a c, while only the last prefix (“abababc”) does. This seven-character prefix
does not match the seven-character suffix (“bababca”), so cell eight gets 1.
How to use the Partial Match Table
We can use the values in the partial match table to skip ahead (rather than redoing unnecessary old comparisons)
when we find partial matches. The formula works like this:
If a partial match of length partial_match_length is found and table[partial_match_length] > 1, we may skip ahead
partial_match_length - table[partial_match_length - 1] characters.
Let’s say we’re matching the pattern “abababca” against the text “bacbababaabcbab”. Here’s our partial match
table again for easy reference:
1
2
3
char: | a | b | a | b | a | b | c | a |
index: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
value: | 0 | 0 | 1 | 2 | 3 | 4 | 0 | 1 |
The first time we get a partial match is here:
1
2
3
bacbababaabcbab
|
abababca
This is a partial_match_length of 1. The value at table[partial_match_length - 1] (or table[0]) is 0, so we
don’t get to skip ahead any. The next partial match we get is here:
1
2
3
bacbababaabcbab
|||||
abababca
This is a partial_match_length of 5. The value at table[partial_match_length - 1] (or table[4]) is 3.
That means we get to skip ahead partial_match_length - table[partial_match_length - 1] (or 5 - table[4]
or 5 - 3 or 2) characters:
1
2
3
4
5
// x denotes a skip
bacbababaabcbab
xx|||
abababca
This is a partial_match_length of 3. The value at table[partial_match_length - 1] (or table[2]) is 1.
That means we get to skip ahead partial_match_length - table[partial_match_length - 1] (or 3 - table[2]
or 3 - 1 or 2) characters:
1
2
3
4
5
// x denotes a skip
bacbababaabcbab
xx|
abababca
At this point, our pattern is longer than the remaining characters in the text, so we know there’s no match.
Conclusion
So there you have it. Like I promised before, it’s no exhaustive explanation or formal proof of KMP;
it’s a walk through my brain, with the parts I found confusing spelled out in extreme detail.
If you have any questions or notice something I messed up, please leave a comment; maybe we’ll all
learn something.