sql server - T-SQL Fewest Sets of Common Dates that Includes All Row IDs -
my table (@mytable) list of ids start dates , end dates (inclusive) represent interval of days when id appears in file received once per day:
id start_date end_date 1 10/01/2014 12/15/2014 2 11/05/2014 03/03/2015 3 12/07/2014 12/09/2014 4 04/01/2015 04/15/2015
each id appears once, i.e. has 1 associated time interval, , intervals between start_dates , end_dates can (but not necessarily) overlap across different ids. need sql query find sets of dates each id appear @ least once when files these sets of dates merged, in smallest number of dates possible. in table above solution these 2 dates:
file_date id(s) 12/07/2015 1,2,3 04/01/2015 4
but example 1 date between id(3)'s start_date , end_date & combined 1 date between id(4)'s start_date , end_date solution.
the actual data consists of 10,000 different ids. date range of possible file dates 04/01/2014 - 07/01/2015. each daily file large in size , must downloaded manually, hence want minimize number must download include ids.
so far have cte results in separate rows dates between start_date , end_date of each id:
;with cte (id, d) ( select tbl.id id, tbl.start_date d @mytable tbl union select tbl.id id, dateadd(day, 1, cte.d) d cte inner join @mytable tbl on cte.id = tbl.id cte.d < tbl.end_date ) select id id, d file_date cte order id,d option (maxrecursion 500)
using @mytable example results are:
id file_date 1 10/01/2014 1 10/02/2014 1 10/03/2014 1 etc...
my thinking determine common file_date among ids, pick next common file_date among ids left, , on...but i'm stuck. put in more mathy terms, trying find fewest sets (file_dates) contain items (ids), similar https://softwareengineering.stackexchange.com/questions/263095/finding-the-fewest-sets-which-contain-all-items, don't care minimizing duplicates. final results not have include ids appear in file_dates; need know file_dates.
i'm using ms sql server 2008.
just go on started. result found method not optimal, enough purposes.
for each id generate set of rows each day in range. know how it, though i'd use table of numbers it, rather generating on fly cte every time, doesn't matter.
put result temporary table. have 10,000 ids * ~400 days = ~4m rows. temp table has 2 columns (id, filedate)
. create appropriate indexes. i'd start two: on (id, filedate)
, on (filedate, id)
. make 1 of them clustered , primary key. i'd try make (filedate, id)
clustered primary key.
then process in loop:
find date has number of ids:
select top(1) @vardate = filedate #temp group filedate order count(*) desc;
remember found date (and optionally ids) in temp table final result.
delete date , ids correspond date big table.
delete #temp filedate = @vardate or id in ( select t2.id #temp t2 t2.filedate = @vardate )
repeat loop until there no rows in #temp.
Comments
Post a Comment