c# - Most effective way to lookup a substring C of string B in string A in LINQ -


having 2 strings like:

string = "attagacctgccggaa"; string b = "gccggaatac"; 

i delete part common in both strings , rest concatenate it. have tell need delete left matched part get

input

 attagacctgccggaa           gccggaatac 

output

attagacctgccggaatac 

firstly thought use pattern , seacrh it, not possible not know pattern in advance (the length of matched chars variable)

then thought on search whole string b in a if had no succes, delete char in string a (last 1 since want preserve left unmatched string) , loop until have no more chars in b like

string = "attagacctgccggaa"; string b = "gccggaatac"; int times = b.length; string wantedstring = string.empty; string auxstring = b; while (times > 0) {      if (!a.contains(auxstring))     {         //save last char , delete auxstring         wantedstring += auxstring[auxstring.length - 1];         auxstring = auxstring.trimend(auxstring[auxstring.length - 1]);     }     else         break;     times--; } //reverse string  char[] reversedtoappend = wantedstring.tochararray(); array.reverse(reversedtoappend); string toappend = new string(reversedtoappend); 

so answer a + toappend ;
there way make more efficient? (maybe in linq?)

edit

as @lavin points out correctly c can occur anywhere in a, while being prefix of b. instance if a=aat , b=aag, code should return aatg. reason because common string starting on left c=aa. delete b , a=aat resulting g

aat aag 

resulting

aatg 

other example be:

a=atttgggccgcgcgcgaaaaccccgcg b=                  aaccccgcgcgca 

here

c= aaccccgcg 

so result should be

result = atttgggccgcgcgcgaaaaccccgcgcgca 

(all arrays , strings 0 based in answer)

first want point out op's problem confusing. assume c common part of a , b, op's example of input , output suggest c needs suffix of a, , prefix of b @ same time. see of answers above adopted understanding of problem.

however, original implementation provided op suggests that, c can occur anywhere in a, while being prefix of b, because using of a.contains(auxstring). means a=aat , b=aag, code return aatg. other people's answers return aataag.

so there 2 possible interpretation of problem. please clarify.

second, assuming size of first string a n, , second string b m, unlike o(n*m) solution provided in original solution , existing answers, o(n+m) algorithm can achieved using of following: kmp, suffix array, suffix tree, z-algorithm.

i'll briefly describe how use z-algorithm solve problem here, since seems less mentioned on stackoverflow compared others.

about details of z-algorithm, see http://www.cs.umd.edu/class/fall2011/cmsc858s/lec02-zalg.pdf

basically string s of length l, calculates array z of length l, in z[i] equals longest common prefix of s , s[i:] (s[i:] means substring of s starting position i).

for problem, combine strings a , b d=b+a (b in front of a), , calculates z array of combined string d. using z array, can figure out longest prefix of b occurs in a.

for possible interpretation 1 of problem, in c needs suffix of a , prefix of b:

max_prefix = 0 in range(m, n+m):   if z[i] == n+m - i:      if z[i] > max_prefix:       max_prefix = z[i] 

and answer be:

a+b[max_prefix:] 

for possible interpretation 2 of problem, in c needs prefix of b, , can anywhere in a:

max_prefix = 0 in range(m, n+m):   if z[i] > max_prefix:     max_prefix = z[i] 

again answer be:

a+b[max_prefix:] 

the difference in 2 cases line:

  if z[i] == n+m-i:  

to understand line, remember z[i] longest common prefix of strings d , d[i:], then:

  1. note d=b+a
  2. we enumerate i m m+n-1, that's range of a in d. d[i:] equal a[i-m:]. , length of a[i-m:] n-(i-m)=n+m-i.
  3. since d starts b, checking if z[i] equal n+m-i, checking if a[i-m:] prefix of b. if indeed equal, found common string c, prefix of b, , suffix of a.
  4. without line, know found string c prefix of b, , occurs in a starting position i, , not guaranteed reach end of a.

Comments

Popular posts from this blog

python - argument must be rect style object - Pygame -

webrtc - Which ICE candidate am I using and why? -

c# - Better 64-bit byte array hash -