My Problem
Consider a set of data with two interval dimensions. For instance, consider a student schedule of classes. Each record has a begin and end date, and each class has a start period and an end period. But this schedule is not 'normalized', in the sense that some records overlap. So if you search for records encompassing a given date and period for a student, you might get multiple matches.
Here's a contrived example. I represent the dates as integers to simplify the problem:
create table #schedule (
student char(3),
fromDate int,
toDate int,
fromPeriod int,
toPeriod int
)
insert #schedule values
('amy', 1, 7, 7, 9),
('amy', 3, 9, 5, 8),
('amy', 10, 12, 1, 3),
('ted', 1, 5, 11, 14),
('ted', 7, 11, 13, 16);
Amy's date and period ranges either overlap or are adjacent. If I queried for 'date 5 period 7', I would get two matches. I need these reworked so that they represent the same 'area' but no longer overlap.
Ted's periods overlap but his dates do not. This means there's no real overlap, so no need to re-work anything.
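To make the conflict condition concrete: two records truly overlap only when their date ranges and their period ranges both intersect. A minimal sketch in Python (my own illustration, not part of the schedule code):

```python
def ranges_overlap(a_from, a_to, b_from, b_to):
    """Closed integer ranges [a_from, a_to] and [b_from, b_to] intersect."""
    return a_from <= b_to and b_from <= a_to

def records_overlap(r1, r2):
    """r1/r2 are (fromDate, toDate, fromPeriod, toPeriod) tuples.
    A real conflict needs overlap on BOTH axes."""
    return (ranges_overlap(r1[0], r1[1], r2[0], r2[1]) and
            ranges_overlap(r1[2], r1[3], r2[2], r2[3]))

amy1, amy2 = (1, 7, 7, 9), (3, 9, 5, 8)
ted1, ted2 = (1, 5, 11, 14), (7, 11, 13, 16)
print(records_overlap(amy1, amy2))  # True: dates AND periods intersect
print(records_overlap(ted1, ted2))  # False: periods intersect, dates do not
```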
My Research
I've read many posts and some articles on working with overlapping intervals. Namely:
Merge overlapping date intervals
Flattening intersecting timespans
Condense Time Periods with SQL
SQL Server - cumulative sum on overlapping data - getting date that sum reaches a given value
Flatten/merge overlapping time intervals
SQL Server separate overlapping dates
I've implemented a solution by Itzik from a blog post entitled 'solutions-packing-date-and-time-intervals-puzzle' that has worked great for one particular project. I don't think its link is stable, but I've found a copy of it here.
But I'm having difficulty extending the knowledge in those resources to my problem at hand. It might be my limitation. I have trouble following them. I've studied Itzik's solution and have come to understand a lot of it, but I do recall there's one piece I just couldn't understand. Or it might be that those solutions only work with singular ranges.
My Attempt
I resolved this question by treating the ranges as literal rectangle objects. It works. I've even made a version of it somewhat performant in my own application. So I'll post it as a solution in case it is of use to anyone with the same issue.
But it is so long and involved, and there are enough quirks to it (e.g. buffering lines, looping over shapes, working with float values, rounding issues), that I can't help but think there's a much better way. Can the concepts of my listed resources be extended to dual ranges? Or do some SRIDs allow cutting rectangles with zero-width lines?
Expected Results:
There is no one answer to this problem, because you can aggregate ranges and deconstruct them in different ways. But to minimize the number of resulting rectangles, there are really only two acceptable answers. Visually, with dates on the X axis and periods on the Y axis, overlapping ranges can start out like this:
+------------+
| |
| +------------+
| |||||||| | <- 2 overlapping rectangles
+----| |
| |
+------------+
We can rework it this way:
+---+ +-----+
| | | |
| | | | +---+ <- 3 non-overlapping
| | | | | | vertically cut rectangles
+---| | | | |
| | | |
+-----+ +---+
Or this way:
+-----------+
+-----------+
+-----------------+ <- 3 non-overlapping
+-----------------+ horizontally cut rectangles
+-----------+
+-----------+
Going with vertical cuts, the results would look like this:
+-------------------------------------------+
|student|fromDate|toDate|fromPeriod|toPeriod|
|-------------------------------------------|
|amy |1 |2 |7 |9 |
|amy |3 |7 |5 |9 |
|amy |8 |9 |5 |8 |
|amy |10 |12 |1 |3 |
|ted |1 |5 |11 |14 |
|ted |7 |11 |13 |16 |
+-------------------------------------------+
Going with horizontal cuts, the results would look like this:
+-------------------------------------------+
|student|fromDate|toDate|fromPeriod|toPeriod|
|-------------------------------------------|
|amy |1 |7 |9 |9 |
|amy |1 |9 |7 |8 |
|amy |3 |9 |5 |6 |
|amy |10 |12 |1 |3 |
|ted |1 |5 |11 |14 |
|ted |7 |11 |13 |16 |
+-------------------------------------------+
Either is acceptable. Though, to keep it deterministic and tractable, you would want to choose one strategy and stick with it.
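For reference, the vertical-cut result above can also be reproduced without any geometry types at all. This is a plain-Python sketch of my own (purely illustrative, not the SQL solution that follows): it scans the date axis, tracks which periods each date column covers, and cuts wherever the column's period profile changes.

```python
from collections import defaultdict

def vertical_cut(records):
    """Decompose (fromDate, toDate, fromPeriod, toPeriod) records into
    non-overlapping rectangles, cutting vertically wherever the covered
    period profile of a date column changes."""
    cover = defaultdict(set)                       # date -> covered periods
    for fd, td, fp, tp in records:
        for d in range(fd, td + 1):
            cover[d].update(range(fp, tp + 1))

    def runs(periods):                             # {5,6,8} -> [(5,6), (8,8)]
        out = []
        for p in sorted(periods):
            if out and p == out[-1][1] + 1:
                out[-1] = (out[-1][0], p)
            else:
                out.append((p, p))
        return out

    rects, open_, prev_runs, prev_d = [], {}, [], None
    for d in sorted(cover):
        cur = runs(cover[d])
        if prev_d is None or d != prev_d + 1 or cur != prev_runs:
            # profile changed (or a date gap): close open rectangles, cut here
            rects += [(fd, prev_d, fp, tp) for (fp, tp), fd in open_.items()]
            open_ = {r: d for r in cur}
        prev_runs, prev_d = cur, d
    rects += [(fd, prev_d, fp, tp) for (fp, tp), fd in open_.items()]
    return sorted(rects)

print(vertical_cut([(1, 7, 7, 9), (3, 9, 5, 8), (10, 12, 1, 3)]))
# -> [(1, 2, 7, 9), (3, 7, 5, 9), (8, 9, 5, 8), (10, 12, 1, 3)]
```

This matches the expected vertical-cut table for both students, though it rasterizes every covered cell and so only suits small integer domains.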
Numbers table:
To address the problem geometrically as I indicate in my post, you have to work with the SQL Server geometry data type. Unfortunately, to get each individual shape or point inside a geometry value, you have to call for the shape by index. A numbers table helps with this. So I do that first (swap this out for your preferred implementation).
create table #numbers (i int);
declare @i int = 1;
while @i <= 100 begin
insert #numbers values (@i);
set @i += 1;
end;
Aggregate the Ranges:
The first required task is to convert the numeric ranges to geometric rectangles. geometry::Point creates the corner points, and STUnion and STEnvelope serve to turn these into a rectangle. Also, since we want integer-adjacent ranges to merge together, we add 1 to the 'to' fields before the geometric conversion.
Then the rectangles must be unioned so that there are no overlaps. This is done by UnionAggregate. The result is a single geometry object of rectilinear polygons (boxy shapes). Since that object can still contain multiple rectilinear polygons, these are enumerated and output as individual shapes into #rectilinears.
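As a quick illustration of why the +1 matters (a Python sketch of my own): closed integer ranges [1,2] and [3,4] are adjacent but their raw envelopes would not touch; converting to half-open form makes them share an edge, so UnionAggregate merges them.

```python
def to_half_open(lo, hi):
    # closed integer range [lo, hi] -> half-open [lo, hi + 1)
    return (lo, hi + 1)

a = to_half_open(1, 2)   # periods 1..2 -> (1, 3)
b = to_half_open(3, 4)   # periods 3..4 -> (3, 5)
print(a[1] >= b[0])      # True: the intervals touch, so their union is contiguous
print(2 >= 3)            # False: without the +1, the closed ends never touch
```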
with
aggregateRectangles as (
select student,
rectilinears = geometry::UnionAggregate(rectangle)
from #schedule s
cross apply (select
minPt = geometry::Point(s.fromDate, s.fromPeriod, 0),
maxPt = geometry::Point(s.toDate + 1, s.toPeriod + 1, 0)
) extremePoints
cross apply (select rectangle = minPt.STUnion(maxPt).STEnvelope()) enveloped
group by student
)
select ar.student,
r.rectilinear,
mm.minY,
mm.maxY
into #rectilinears
from aggregateRectangles ar
join #numbers n on n.i between 1 and ar.rectilinears.STNumGeometries()
cross apply (select rectilinear = ar.rectilinears.STGeometryN(n.i)) r
cross apply (select envelope = r.rectilinear.STEnvelope()) e
cross apply (select
minY = e.envelope.STPointN(1).STY,
maxY = e.envelope.STPointN(3).STY
) mm;
Side Note - Performance Option:
I'm not implementing it here, but if you're working with big data, and your 'rectilinears' (plural) value above is shared among many groupings (such as many students having the same schedule), then save the well-known-text version of the rectilinear object (just call ToString()). Then create a second dataset of distinct rectilinears, perform the remaining geometric operations on that condensed dataset, and join it back to the student level later on. This has significantly improved performance in my real case.
Decompose the Ranges:
Next, those rectilinears have to be decomposed back into rectangles. Splitters are created as vertical lines at the x coordinate of each point. The y axis could just as easily have been chosen; I just chose x for my own semantics. Both axes could also have been used, but that would result in more records than necessary.
Unfortunately, SQL Server will not split a shape if the splitter has zero width (set-theoretically that's inappropriate, but I imagine the result couldn't be represented properly in WKT format). So we need to give the splitters a buffer so that they have an area. There is STBuffer, though I've had trouble with it, so I just create the buffer manually.
With this, the rectangles are split. After the split they still all reside in the same geometry object, so they are enumerated and then inserted individually into the #rectangles table.
with
createSplitters as (
select r.student,
rectilinear = geometry::STGeomFromText(r.rectilinear.ToString(), 0),
splitters = geometry::UnionAggregate(sp.splitter)
from #rectilinears r
join #numbers n on n.i between 1 and r.rectilinear.STNumPoints()
cross apply (select
x = r.rectilinear.STPointN(n.i).STX,
buffer = 0.001
) px
cross apply (select splitter =
geometry::Point(x - buffer, minY - buffer, 0).STUnion(
geometry::Point(x + buffer, maxY + buffer, 0)
).STEnvelope()
) sp
group by r.student,
r.rectilinear.ToString()
)
select student,
rectangle = rectangles.STGeometryN(n.i)
into #rectangles
from createSplitters sp
cross apply (select
rectangles = rectilinear.STDifference(sp.splitters)
) r
join #numbers n on n.i between 1 and r.rectangles.STNumGeometries();
Parse the Ranges:
That's the crux of it. What remains is simply to extract the proper values from the rectangles to give the ranges.
To do this, we first invoke STEnvelope to ensure the rectangles are only represented by their corner points. Then we round the corner points to undo the effects of our buffer, and any issues with float representation. We also subtract 1 from the 'to' fields to undo what we did before converting to geometric points.
select student,
fromDate = round(minPt.STX,0),
toDate = round(maxPt.STX,0) - 1,
fromPeriod = round(minPt.STY,0),
toPeriod = round(maxPt.STY,0) - 1
into #normalized
from #rectangles r
cross apply (select
minPt = r.rectangle.STPointN(1),
maxPt = r.rectangle.STPointN(3)
) corners
order by student, fromDate, fromPeriod;
Optional - Visualize Before and After:
Having made it this far, I might as well give a visual representation of the before and after results. Press the 'Spatial results' tab in SSMS, choose 'student' as the label column, and toggle between 'unnormalized' and 'normalized' as the spatial column.
The gaps between Amy's rectangles seem like an error at first, but remember that our 'to' fields represent not just the number recorded in them, but the entire fractional range up to (but excluding) the next integer. So, for instance, a toDate of 2 is really a to date of 2.999…
select student,
unnormalized =
geometry::Point(fromDate, fromPeriod, 0).STUnion(
geometry::Point(toDate, toPeriod, 0)
).STEnvelope(),
normalized = null
from #schedule s
union all
select student,
unnormalized = null,
normalized =
geometry::Point(fromDate, fromPeriod, 0).STUnion(
geometry::Point(toDate, toPeriod, 0)
).STEnvelope()
from #normalized;
That is a very creative solution and an interesting read!
A rather simplistic approach:
with
a as (
select student, fromdate from #schedule union
select student, todate+1 from #schedule
),
b as (
select *,
todate = (
select min(aa.fromdate)
from a as aa
where aa.student = a.student
and aa.fromdate > a.fromdate
) - 1
from a
)
select *
from b
where exists (
select *
from #schedule as s
where s.student = b.student
and s.fromdate < b.todate
and s.todate > b.fromdate
);
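If it helps to see that technique outside SQL, here is a rough Python rendering of the same boundary-point idea (my own sketch; like the query above, it produces only the date slices, and it mirrors the strict inequalities in the exists check):

```python
def date_slices(records):
    """records: (fromdate, todate) pairs for one student."""
    # CTE a: every fromdate, plus every todate + 1, is a potential boundary
    bounds = sorted({fd for fd, td in records} | {td + 1 for fd, td in records})
    slices = []
    for lo, hi in zip(bounds, bounds[1:]):   # CTE b: todate = next boundary - 1
        lo2, hi2 = lo, hi - 1
        # keep the slice only if it intersects some original record (exists)
        if any(fd < hi2 and td > lo2 for fd, td in records):
            slices.append((lo2, hi2))
    return slices

print(date_slices([(1, 7), (3, 9), (10, 12)]))   # amy -> [(1, 2), (3, 7), (8, 9), (10, 12)]
print(date_slices([(1, 5), (7, 11)]))            # ted -> [(1, 5), (7, 11)]
```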
Postgresql behaves strangely when unnesting multiple arrays in the select list:
select unnest('{1,2}'::int[]), unnest('{3,4}'::int[]);
unnest | unnest
--------+--------
1 | 3
2 | 4
vs when arrays are of different lengths:
select unnest('{1,2}'::int[]), unnest('{3,4,5}'::int[]);
unnest | unnest
--------+--------
1 | 3
2 | 4
1 | 5
2 | 3
1 | 4
2 | 5
Is there any way to force the latter behaviour without moving stuff to the from clause?
The SQL is generated by a mapping layer and it will be very much easier for me to implement the new feature I am adding if I can keep everything in the select.
https://www.postgresql.org/docs/10/static/release-10.html
Set-returning functions are now evaluated before evaluation of scalar
expressions in the SELECT list, much as though they had been placed in
a LATERAL FROM-clause item. This allows saner semantics for cases
where multiple set-returning functions are present. If they return
different numbers of rows, the shorter results are extended to match
the longest result by adding nulls. Previously the results were cycled
until they all terminated at the same time, producing a number of rows
equal to the least common multiple of the functions' periods.
(emphasis mine)
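The pre-10 cycling behaviour that the release note describes can be simulated in Python (a sketch of my own, assuming the least-common-multiple semantics quoted above):

```python
from math import gcd

def cycle_srfs(a, b):
    """Pre-PostgreSQL-10 SELECT-list behaviour: cycle both arrays until
    they terminate together, producing LCM(len(a), len(b)) rows."""
    n = len(a) * len(b) // gcd(len(a), len(b))
    return [(a[i % len(a)], b[i % len(b)]) for i in range(n)]

print(cycle_srfs([1, 2], [3, 4]))     # [(1, 3), (2, 4)] -- equal lengths pair up
print(cycle_srfs([1, 2], [3, 4, 5]))  # LCM(2, 3) = 6 rows, as in the question
```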
I tested with version 12.12 of Postgres and, as mentioned by the OP in a comment, having two unnest() calls works as expected when you move them to the FROM clause like so:
SELECT a, b
FROM unnest('{1,2}'::int[]) AS a,
unnest('{3,4}'::int[]) AS b;
Then you get the expected table:
a | b
--+--
1 | 3
1 | 4
2 | 3
2 | 4
As we can see, in this case the arrays have the same size.
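For contrast, the FROM-clause form above is an ordinary cross join; the same pairing in Python:

```python
from itertools import product

# cross join of the two unnested arrays: every element of a paired with every b
rows = list(product([1, 2], [3, 4]))
print(rows)  # [(1, 3), (1, 4), (2, 3), (2, 4)]
```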
In my case, the arrays come from a table. You can name the table first and then reference its columns inside the unnest() calls like so:
SELECT a, b
FROM my_table,
unnest(col1) AS a,
unnest(col2) AS b;
You can, of course, select other columns as required. This is a normal cartesian product.
So I have a relation:
Cars(model, passenger)
The models are all unique; let's say {A, B, C, D, E}.
passenger is just the capacity of the car (any positive, non-zero integer); let's say {1, 2, 2, 3, 3}:
Model|Passenger
A |1
B |2
C |2
D |3
E |3
I need to find the relational algebra expression that would yield what capacities occur for more than 1 vehicle. So with the example values above, the expression should return {2, 3} since they appear more than once for different vehicles.
I have a strong inclination to think that the expression will use a join of some sort but I can't figure out how to do it.
I figured it out:
Assuming an existing relation Cars(model, passenger) that contains all of the cars in question and their passenger capacities.
CARS2(m, p) ≔ ρ_(m,p)(CARS)
Answer(passenger) ≔ π_passenger(CARS ⋈_(model ≠ m ∧ passenger = p) CARS2)
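One way to sanity-check the expression is to evaluate the θ-join directly; a small Python sketch of my own over the sample relation:

```python
cars = [("A", 1), ("B", 2), ("C", 2), ("D", 3), ("E", 3)]

# CARS theta-joined with its renamed copy on (model <> m AND passenger = p),
# then projected onto passenger
answer = sorted({p for (model, p) in cars
                   for (m, q) in cars
                   if model != m and p == q})
print(answer)  # [2, 3]
```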
I'm not sure about the relational algebra expression, which might look something along the lines of
π_passenger (σ_(count(model) ≥ 2) (γ_(passenger, count(model)) (Table1)))
but if you're looking for a query, then it doesn't need a JOIN at all:
SELECT passenger
FROM table1
GROUP BY passenger
HAVING COUNT(model) >= 2
Outcome:
| PASSENGER |
|-----------|
| 2 |
| 3 |
Here is SQLFiddle demo
I have a very small question related to unsupervised learning, because my teacher has not used this word in any lecture; I came across it while reading tutorials. Does 'converge' mean that the cluster values in the last iteration are the same as the values before it? For example:
        | c1 (1,0) | c2 (2,1) | cluster
--------|----------|----------|--------------------
A (1,0) |    ..    |    ..    | take smallest value
B (0,1) |    ..    |    ..    |
C (2,1) |    ..    |    ..    |
D (2,1) |    ..    |    ..    |
Now, after performing n iterations: if the centroid values of c1 and c2 come out the same as in the previous iteration, that is (1,0) and (2,1), in the last (n-th) iteration (taking the average when a cluster holds more than a single point), is that convergence?
Ideally, if the values in the last two consecutive iterations are the same, then the algorithm is said to have converged. But often people use a less strict criterion for convergence, such as the difference between the values of the last two iterations being less than a particular threshold.
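Such a threshold criterion might look like this (an illustrative Python sketch; the `tol` value is an arbitrary choice):

```python
def converged(old_centroids, new_centroids, tol=1e-6):
    """True when every centroid moved less than tol (Euclidean distance)."""
    return all(
        sum((a - b) ** 2 for a, b in zip(old, new)) ** 0.5 < tol
        for old, new in zip(old_centroids, new_centroids)
    )

print(converged([(1, 0), (2, 1)], [(1, 0), (2, 1)]))    # True: nothing moved
print(converged([(1, 0), (2, 1)], [(1.5, 0), (2, 1)]))  # False: c1 moved 0.5
```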
In the case of k-means clustering, convergence means the algorithm has successfully finished grouping the data points into k clusters: if the centroids (the k mean values) remain in the same place for two consecutive iterations, the algorithm considers the points completely grouped into their correct clusters and stops.
I'm currently reading "Artificial Intelligence: A Modern Approach" (Russell+Norvig) and "Machine Learning" (Mitchell) - and trying to learn basics of AINN.
In order to understand a few basic things, I have two 'greenhorn' questions:
Q1: In a genetic algorithm given the two parents A and B with the chromosomes 001110 and 101101, respectively, which of the following offspring could have resulted from a one-point crossover?
a: 001101
b: 001110
Q2: Which of the above offspring could have resulted from a two-point crossover? and why?
Please advise.
It is not possible to find the parents if you do not know the inverse-crossover function (one that maps any child a back to the pair (A, B), the reverse of AxB => (a, b)).
Usually the 1-point crossover function is:
a = A1 + B2
b = B1 + A2
Even if you know a and b, you cannot solve the system (a system of 2 equations with 4 unknowns).
If you know any 2 parts of A and/or B, then it can be solved (a system of 2 equations with 2 unknowns). That is the case for your question, as you provide both A and B.
Generally a crossover function does not have an inverse, so you just need to find the solution logically or, if you know the parents, perform the crossover and compare.
So to make a generic formula for you we should know 2 things:
Crossover function.
Inverse-crossover function.
The 2nd one is not usually implemented in GAs, as it is not required.
Now, I'll just answer your questions.
Q1: In a genetic algorithm given the
two parents A and B with the
chromosomes 001110 and 101101,
respectively, which of the following
offspring could have resulted from a
one-point crossover?
Looking at the a and b I can see the crossover point is here:
1 2
A: 00 | 1110
B: 10 | 1101
Usually the crossover is done using this formula:
a = A1 + B2
b = B1 + A2
so that possible children are:
a: 00 | 1101
b: 10 | 1110
which excludes option b from the question.
So the answer to Q1 is: the resulting child is a: 001101, assuming the given crossover function.
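That cut can be verified mechanically; a short Python sketch of the same formula (a = A1 + B2, b = B1 + A2):

```python
def one_point(A, B, cut):
    """One-point crossover at position `cut`: swap the tails."""
    return A[:cut] + B[cut:], B[:cut] + A[cut:]

# the cut shown above sits after position 2
print(one_point("001110", "101101", 2))  # ('001101', '101110')
```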
Q2: Which of the above offspring could
have resulted from a two-point
crossover? and why?
Looking at the a and b I can see the crossover points can be here:
1 2 3
A: 00 | 11 | 10
B: 10 | 11 | 01
Usual formula for 2-point crossover is:
a = A1 + B2 + A3
b = B1 + A2 + B3
So the children would be:
a = 00 | 11 | 10
b = 10 | 11 | 01
Comparing them to the options you asked (small a and b) we can say the answer:
Q2 answer: neither a nor b could be the result of a 2-point crossover of A and B, according to the given crossover function.
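The two-point formula can be checked the same way (a = A1 + B2 + A3, b = B1 + A2 + B3, exchanging the middle segment [i, j)):

```python
def two_point(A, B, i, j):
    """Two-point crossover exchanging the middle segment [i, j)."""
    return A[:i] + B[i:j] + A[j:], B[:i] + A[i:j] + B[j:]

# cuts after positions 2 and 4, as drawn above; the middle segments are
# equal ('11'), so the children come out identical to the parents
print(two_point("001110", "101101", 2, 4))  # ('001110', '101101')
```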
Again it is not possible to answer your questions without knowing the crossover function.
The functions I provided are common in GA, but you can invent many others that would answer the question differently (see the comment below):
One-point crossover makes one cut, joining one segment from each parent; two-point crossover makes two cuts, i.e. two segments from one parent and one from the other.
See crossover (wikipedia) for further info.
Regarding Q1, (a) could have been produced by a one-point crossover, taking bit 0 from parent A and bits 1-5 from parent B. (b) could not, unless your crossover algorithm allows for null contributions, i.e. parent contributions of zero width. In that case, parent A could contribute its full chromosome (bits 0-5) and parent B would contribute nothing, yielding (b).
Regarding Q2, both (a) and (b) are possible. There are a few combinations to test; too tedious to write, but you can do the work with pen and paper. :-)
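The combinations are few enough to brute-force by machine instead of pen and paper; this Python sketch (my own check) enumerates every one-point and two-point offspring of A and B, null cuts included:

```python
A, B = "001110", "101101"
n = len(A)

# all one-point children, allowing cuts at the ends (null contributions)
one_pt_children = ({A[:k] + B[k:] for k in range(n + 1)} |
                   {B[:k] + A[k:] for k in range(n + 1)})
# all two-point children: swap segment [i, j) in either direction
two_pt_children = {P[:i] + Q[i:j] + P[j:]
                   for P, Q in ((A, B), (B, A))
                   for i in range(n + 1)
                   for j in range(i, n + 1)}

print("001101" in one_pt_children)  # True  (cut after bit 0)
print("001110" in one_pt_children)  # True, but only via a null contribution
print("001101" in two_pt_children)  # True  (segment [4, 6) taken from B)
print("001110" in two_pt_children)  # True  (segment [2, 4), equal in A and B)
```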