c# - Most effective way to lookup a substring C of string B in string A in LINQ -
having 2 strings like:
string = "attagacctgccggaa"; string b = "gccggaatac";
i delete part common in both strings , rest concatenate it. have tell need delete left matched part get
input
attagacctgccggaa gccggaatac
output
attagacctgccggaatac
firstly thought use pattern , seacrh it, not possible not know pattern in advance (the length of matched chars variable)
then thought on search whole string b
in a
if had no succes, delete char in string a
(last 1 since want preserve left unmatched string) , loop until have no more chars in b
like
string = "attagacctgccggaa"; string b = "gccggaatac"; int times = b.length; string wantedstring = string.empty; string auxstring = b; while (times > 0) { if (!a.contains(auxstring)) { //save last char , delete auxstring wantedstring += auxstring[auxstring.length - 1]; auxstring = auxstring.trimend(auxstring[auxstring.length - 1]); } else break; times--; } //reverse string char[] reversedtoappend = wantedstring.tochararray(); array.reverse(reversedtoappend); string toappend = new string(reversedtoappend);
so answer a + toappend ;
there way make more efficient? (maybe in linq?)
edit
as @lavin points out correctly c
can occur anywhere in a
, while being prefix of b. instance if a=aat
, b=aag
, code should return aatg
. reason because common string starting on left c=aa
. delete b
, a=aat
resulting g
aat aag
resulting
aatg
other example be:
a=atttgggccgcgcgcgaaaaccccgcg b= aaccccgcgcgca
here
c= aaccccgcg
so result should be
result = atttgggccgcgcgcgaaaaccccgcgcgca
(all arrays , strings 0 based in answer)
first want point out op's problem confusing. assume c
common part of a
, b
, op's example of input , output suggest c
needs suffix of a
, , prefix of b
@ same time. see of answers above adopted understanding of problem.
however, original implementation provided op suggests that, c
can occur anywhere in a
, while being prefix of b
, because using of a.contains(auxstring)
. means a=aat
, b=aag
, code return aatg
. other people's answers return aataag
.
so there 2 possible interpretation of problem. please clarify.
second, assuming size of first string a
n
, , second string b
m
, unlike o(n*m)
solution provided in original solution , existing answers, o(n+m)
algorithm can achieved using of following: kmp, suffix array, suffix tree, z-algorithm.
i'll briefly describe how use z-algorithm solve problem here, since seems less mentioned on stackoverflow compared others.
about details of z-algorithm, see http://www.cs.umd.edu/class/fall2011/cmsc858s/lec02-zalg.pdf
basically string s
of length l
, calculates array z
of length l
, in z[i]
equals longest common prefix of s
, s[i:]
(s[i:]
means substring of s
starting position i
).
for problem, combine strings a
, b
d=b+a
(b
in front of a
), , calculates z
array of combined string d
. using z
array, can figure out longest prefix of b
occurs in a
.
for possible interpretation 1 of problem, in c
needs suffix of a
, prefix of b
:
max_prefix = 0 in range(m, n+m): if z[i] == n+m - i: if z[i] > max_prefix: max_prefix = z[i]
and answer be:
a+b[max_prefix:]
for possible interpretation 2 of problem, in c
needs prefix of b
, , can anywhere in a
:
max_prefix = 0 in range(m, n+m): if z[i] > max_prefix: max_prefix = z[i]
again answer be:
a+b[max_prefix:]
the difference in 2 cases line:
if z[i] == n+m-i:
to understand line, remember z[i]
longest common prefix of strings d
, d[i:]
, then:
- note
d=b+a
- we enumerate
i
m
m+n-1
, that's range ofa
ind
.d[i:]
equala[i-m:]
. , length ofa[i-m:]
n-(i-m)=n+m-i
. - since
d
startsb
, checking ifz[i]
equaln+m-i
, checking ifa[i-m:]
prefix ofb
. if indeed equal, found common stringc
, prefix ofb
, , suffix ofa
. - without line, know found string
c
prefix ofb
, , occurs ina
starting positioni
, , not guaranteed reach end ofa
.
Comments
Post a Comment