題目來源
All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example,
Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",
Return:
["AAAAACCCCC", "CCCCCAAAAA"].
看到這道題的最初想法是用哈希,記錄每一個十個字母長度的數(shù)目,如果大于等于2則輸出。然后直接AC了!
class Solution {
public:
vector<string> findRepeatedDnaSequences(string s) {
vector<string> res;
int n = s.length();
if (n < 10)
return res;
unordered_map<string, int> map;
for (int i=0; i<=n-10; i++) {
map[s.substr(i, 10)]++;
if (map[s.substr(i, 10)] == 2)
res.push_back(s.substr(i, 10));
}
return res;
}
};
然后看看大神們的做法吧,結(jié)合上位操作。
The main idea is to store the substring as int in map to bypass the memory limits.
There are only four possible character A, C, G, and T, but I want to use 3 bits per letter instead of 2.
Why? It's easier to code.
A is 0x41, C is 0x43, G is 0x47, T is 0x54. Still don't see it? Let me write it in octal.
A is 0101, C is 0103, G is 0107, T is 0124. The last digit in octal are different for all four letters. That's all we need!
We can simply use s[i] & 7 to get the last digit which are just the last 3 bits, it's much easier than lookup table or switch or a bunch of if and else, right?
We don't really need to generate the substring from the int. While counting the number of occurrences, we can push the substring into result as soon as the count becomes 2, so there won't be any duplicates in the result.
class Solution {
public:
vector<string> findRepeatedDnaSequences(string s) {
vector<string> res;
int n = s.length();
if (n < 10)
return res;
unordered_map<int, int> map;
int i = 0, t = 0;
while (i < 9) {
t = t << 3 | s[i++] & 7;
}
while (i < n){
if (map[t = t << 3 & 0x3FFFFFFF | s[i++] & 7]++ == 1)
res.push_back(s.substr(i-10, 10));
}
return res;
}
};