sql server - T-SQL Fewest Sets of Common Dates that Includes All Row IDs -


my table (@mytable) list of ids start dates , end dates (inclusive) represent interval of days when id appears in file received once per day:

id    start_date    end_date 1     10/01/2014    12/15/2014 2     11/05/2014    03/03/2015 3     12/07/2014    12/09/2014 4     04/01/2015    04/15/2015 

each id appears once, i.e. has 1 associated time interval, , intervals between start_dates , end_dates can (but not necessarily) overlap across different ids. need sql query find sets of dates each id appear @ least once when files these sets of dates merged, in smallest number of dates possible. in table above solution these 2 dates:

file_date     id(s) 12/07/2015    1,2,3 04/01/2015    4 

but example 1 date between id(3)'s start_date , end_date & combined 1 date between id(4)'s start_date , end_date solution.

the actual data consists of 10,000 different ids. date range of possible file dates 04/01/2014 - 07/01/2015. each daily file large in size , must downloaded manually, hence want minimize number must download include ids.

so far have cte results in separate rows dates between start_date , end_date of each id:

;with cte (id, d) (     select          tbl.id id,         tbl.start_date d     @mytable tbl     union     select          tbl.id id,         dateadd(day, 1, cte.d) d     cte     inner join      @mytable tbl on cte.id = tbl.id     cte.d < tbl.end_date ) select     id id,     d file_date  cte order id,d option (maxrecursion 500) 

using @mytable example results are:

id    file_date 1     10/01/2014 1     10/02/2014 1     10/03/2014 1     etc... 

my thinking determine common file_date among ids, pick next common file_date among ids left, , on...but i'm stuck. put in more mathy terms, trying find fewest sets (file_dates) contain items (ids), similar https://softwareengineering.stackexchange.com/questions/263095/finding-the-fewest-sets-which-contain-all-items, don't care minimizing duplicates. final results not have include ids appear in file_dates; need know file_dates.

i'm using ms sql server 2008.

just go on started. result found method not optimal, enough purposes.

for each id generate set of rows each day in range. know how it, though i'd use table of numbers it, rather generating on fly cte every time, doesn't matter.

put result temporary table. have 10,000 ids * ~400 days = ~4m rows. temp table has 2 columns (id, filedate). create appropriate indexes. i'd start two: on (id, filedate) , on (filedate, id). make 1 of them clustered , primary key. i'd try make (filedate, id) clustered primary key.

then process in loop:

find date has number of ids:

select top(1) @vardate = filedate #temp group filedate order count(*) desc; 

remember found date (and optionally ids) in temp table final result.

delete date , ids correspond date big table.

delete #temp filedate = @vardate or id in (     select t2.id     #temp t2     t2.filedate = @vardate ) 

repeat loop until there no rows in #temp.


Comments

Popular posts from this blog

python - argument must be rect style object - Pygame -

webrtc - Which ICE candidate am I using and why? -

c# - Better 64-bit byte array hash -