c++ - Finding most common sextuplets in a huge set of numbers -


i have matrix 6000x20, containing numbers range 1-80. need find sextuplets appear in matrix. need efficient , fastest solution. current solution steps: 1.i take first row matrix , generate sextuplets (38600 in 1 row) 2. compare each of sextuple other 5999 rows , count them 3. write them file because memory gets full pretty fast 4. take 2nd row , generate sextuplets , steps again

this algorithm pretty bad , im aware of because have 38600x6000 comparisons , possible file writing, , have lots of sextuplets repetition. cant know because cant use variables of size.

i need algorithmic solution can write in matlab/java/c++/python

since values range 1-80, total number of possible sextuplets "only" tad on 300 million (300,500,200 precise). since there 6000 rows in matrix, maximum count sextuplet 6000, count comfortably fit in two-byte integer (uint16_t, assuming exists in c++ implementation). 3 hundred million two-byte integers total 600 megabytes, have available.

so simple algorithm create count vector, initialized zeros, , iterate on rows in matrix; each row, iterate on 38,760 sextuplets, , each sextuplet increment corresponding count.

the trick figuring out element in count vector corresponds given set of 6 numbers. happens, not difficult long numbers in sextuplet in order smallest largest. (that's not limitation, since need have canonical order sextuplet, , sorted in order simple canonical ordering.)

to see how generate indices, consider how (hypothetically) enumerate 300,500,200 combinations of 6 integers set {1..80}. first, enumerate combinations start 1, , continue 5 integers set {2..80}. then, enumerate combinations start 2, , continue 5 integers set {3..80}. then, enumerate combinations start 3 , continue 5 integers set {4..80}. , on. inside enumeration of each starting point, apply same algorithm recursively.

now, let's turn enumeration on head. suppose have sextuplet {a,b,c,d,e,f}. let's ask how many sextuplets come after sextuplet?

  • first, sextuplets start value greater a. since sextuplets ordered, if sextuplet starts value greater a, of values greater a, means combination of 6 values set {a+1..80}, of there 80-ac6.

  • then, there sextuplets start a , continue quintuplet first value greater b. same logic in above point, number of such sextuplets 80-bc5.

  • then, there sextuplets start a,b , continue quartet first value greater c: total of 80-cc4

  • etc.

so total number of sextuplets following {a,b,c,d,e,f} precisely:

80-ac6+80-bc5+80-cc4+80-dc3+80-ec2+80-fc1

the interesting above equation there no interaction between 6 variables. compute value creating 6 lookup tables values of 80-xci values of x 1 80, , values of 1 6. can compute (inverse) index of sextuplets doing 6 lookups , adding values together. (if needed to, subtract inverse index total number of combinations forward index. in case, need bijection combinations integers, , inverse index work fine.)

at end of algorithm, necessary turn count indices sextuplets. can done using same lookup tables, doing succession of binary searches: first, find value of a binary searching in lookup table i==6, subtract corresponding index , continue search remainder in lookup table i==5, etc. (in practice, since lookup table small, might turn out linear search faster binary search. doesn't make difference.)


Comments

Popular posts from this blog

python - argument must be rect style object - Pygame -

webrtc - Which ICE candidate am I using and why? -

c# - Better 64-bit byte array hash -