I would like to know if any of you guys have written a query for record clustering based on overlapping time intervals AND locations.
Data in my application is represented as individual events of a person being at any given location from start time to end time. Location is defined as latitude and longitude. During a day one person will have multiple different locations and start and end times. I need to get groups of persons who were at the same location and at the same time. One person will most likely be in several groups during a day.
Example:
Person A can be with Person B at the office from 10 AM to 11 AM.
Then Person A leaves the office for gym.
There he is with Person C from 12 noon to 1PM.
At 12:30 Person C leaves gym for the office.
At 1:30PM I have Person B and C at the office.
Persons B and C leave the office at 5PM.
In this example I have
Cluster 1 (Person A and B at the office) from 10AM to 11AM,
Cluster 2 (Person A and C at the gym) from 12 noon to 1PM, and
Cluster 3 (Person B and C at the office) from 1:30PM to 5PM.
The location of each person will not match another person's location exactly. I'm using the SQL geography point type with an STBuffer of the proximity threshold and checking STIntersects. I'm also joining the table to itself to check for time overlaps. But I'm seeing some odd behavior: Person A gets clustered with himself without any other person ever joining him.
I'm wondering if there's a design pattern for handling situations like this. Ideally I would have the recordset grouped on "Overlapping Time Period" and "Centroid of an arbitrary geometry", but I can't figure out how to get the overlapping time period and the arbitrary geometry.
Any ideas are welcome and highly appreciated.
P.S. Writing a Windows application is not an option unless it's the only way.
EDIT: I failed to mention that the clustering locations are never known in advance. There can be an unlimited number of locations where two or more of my customers may cluster. I don't know if clustering will happen in the office, at the gym, in some park or at a bus station. The clustering location (I think) will be the centroid of a polygon formed by the latitudes and longitudes of all the congregated people.
Would the code be something like this?
select a.person, a.eventtime, a.eventplace,
       b.person, b.eventtime, b.eventplace
from people a
join people b
  on a.eventtime between dateadd(hh, -2, b.eventtime) and dateadd(hh, 2, b.eventtime)
 and yourdistancefunction(a.eventplace, b.eventplace) < 5 -- don't know what you are measuring
 and a.person <> b.person -- never pair a person with themselves
I solved the puzzle by first getting the entire dataset for the given time period, then looping through the recordset and generating STUnion shapes for all overlapping locations. I then joined the generated temporary table to the initial dataset and kept only the records that intersected with the STUnion shapes AND with each other in time.
Used three temp tables but hey, who cares if it does the job :)
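For readers who want to see the core test outside of SQL, here is a minimal Python sketch of the pairwise check discussed in this thread: two events match when they belong to different people, their time intervals overlap, and their points lie within a proximity threshold. The `Event` class, the `co_located_pairs` helper, the 50 m threshold and the sample coordinates are all assumptions made up for illustration; this is not the poster's STUnion solution, and it only finds pairs rather than merged groups.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import combinations
from math import asin, cos, radians, sin, sqrt


@dataclass(frozen=True)
class Event:
    person: str
    lat: float
    lon: float
    start: datetime
    end: datetime


def haversine_m(a: Event, b: Event) -> float:
    """Great-circle distance between two events, in metres."""
    lat1, lon1, lat2, lon2 = map(radians, (a.lat, a.lon, b.lat, b.lon))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(h))


def co_located_pairs(events, max_distance_m=50.0):
    """Return (person_a, person_b, overlap_start, overlap_end) tuples for
    events of different people that overlap in time and are close together."""
    pairs = []
    for a, b in combinations(events, 2):
        if a.person == b.person:
            continue  # avoids the "clustered with himself" problem
        overlap_start = max(a.start, b.start)
        overlap_end = min(a.end, b.end)
        if overlap_start < overlap_end and haversine_m(a, b) <= max_distance_m:
            first, second = sorted((a.person, b.person))
            pairs.append((first, second, overlap_start, overlap_end))
    return pairs


def at(hour, minute=0):
    return datetime(2024, 1, 1, hour, minute)  # arbitrary sample day


# Hypothetical data: the "office" is near (40.0, -75.0), the "gym" about 1.1 km north.
events = [
    Event("A", 40.0000, -75.0, at(10), at(11)),      # A at the office
    Event("B", 40.0001, -75.0, at(9), at(17)),       # B at the office all day
    Event("A", 40.0100, -75.0, at(12), at(13)),      # A at the gym
    Event("C", 40.0100, -75.0, at(12), at(13)),      # C at the gym
    Event("C", 40.0000, -75.0, at(13, 30), at(17)),  # C at the office
]
pairs = co_located_pairs(events)
```

Each resulting tuple carries the overlapping time window, so the pairs could then be merged into larger groups (for example with a union-find pass) if more than two people meet at once.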
Related
I have trouble getting my head around the correct data structure for the following application. Although I am sure this is quite standard for people in the field.
I want to create, for learning purposes, a workout journal app. So the idea is to be able to log, each day, a particular workout.
A workout is comprised of exercises. And each exercise has particular attributes.
For example, Workout 1 is a strength session, so I will have e.g. dumbbell press, squats, ... which are all based on sets and reps. So for Workout 1 I need to be able to enter, for each exercise, the sets, reps and weight used in that particular workout.
Workout 2 is, say, a running session. This is time-based and distance-based, so for Workout 2 I need to be able to enter time and distance.
What structure do I need in my database for such an application?
I guess I should have an "exercise" table. Then this should somehow be a foreign key in the "workout" table. But how can the workout table accommodate varying attributes, as well as a varying number of entries (since a workout can be one, three or ten exercises)? Also, all of this should constitute only one record of the "workout" table.
EDIT:
I have tried to come up with a structure. Could someone confirm or refute that this is the correct way to do it?
So the final result is the one below, for human representation :
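Final Result (sport journal)
| Date | Timestamp | Exercise 1 | Sets | Reps | Weight | Exercise 2 | Time | Distance | Exercise 3 | Sets | Reps | Weight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 120821 | 10.30 | Bench press | 5 | 10 | 40 | Run | 60 | 400 | NULL | NULL | NULL | NULL |
| 120821 | 17.00 | Bench press | 5 | 10 | 40 | NULL | NULL | NULL | Squats | 3 | 5 | 120 |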
But I guess this can't really be achieved as such, since it is not (I think) possible in a relational database. So I need separate tables, and the human view shown above will be a join of those tables. For example, one "record" of the human view can be obtained by joining the various tables on the date and timestamp, i.e. the actual timing of the workout.
If that is correct, then I think a structure like this could work (at least, I think the ideas are there):
Exercise database (so simply the list of exercises with their type which determines the attributes needed)
| Name | Type |
|---|---|
| Bench press | Setsreps |
| Run | Run |
| Squats | Setsreps |
Attributes (the attributes depending on exercise type. Maybe this should be split further to avoid a varying number of columns depending on the exercise type, i.e. run vs setsreps?)
| Attributes | Attribute1 | Attribute2 | Attribute3 |
|---|---|---|---|
| setsreps | Sets | Reps | Weight |
| Run | Time | Distance | NULL |
| Carry | Weight | Distance | NULL |
Setsreps instances database (so the actual realization of the exercise on a certain day. This table will be huge !)
| Date | Timestamp | Exercise | Sets | Reps | Weight |
|---|---|---|---|---|---|
| 120821 | 10.30 | Bench press | 5 | 10 | 40 |
| 120821 | 17.00 | Squats | 3 | 5 | 120 |
Run instances database (same as above but for run instances. Since a run instance has different attributes than a setsreps instance. Is this the correct way to do this ?)
| Date | Timestamp | Time | Distance |
|---|---|---|---|
| 120821 | 10.30 | 60 | 400 |
| 170821 | 17.00 | 120 | 800 |
So then I could get the "human" view by joining the setsreps and run tables on a particular date and timestamp (which together form a primary key).
Is this the correct way of thinking?
Thanks for the support
A workout is comprised of exercises. And each exercise has particular
attributes.
One way to do this is to have a Workout table, an Exercise table, and an Attribute table. The Wikipedia article, Database normalization, will help you understand how to create a normalized database.
I'm assuming that you're not going to share attributes with exercises. If you do, this changes a one to many relationship between exercises and attributes to a many-to-many relationship. You would need a junction table to model a many to many relationship.
The Exercise table will have a foreign key for the Workout table. The Attribute table will have a foreign key for the Exercise table.
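As a rough illustration of the three-table layout described in this answer, the sketch below builds it in an in-memory SQLite database from Python. The table and column names (`workout`, `exercise`, `attribute`, `performed_at`, `value`) and the `add_exercise` helper are assumptions chosen for the example, and storing every attribute value as a number is a simplification.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE workout (
    id INTEGER PRIMARY KEY,
    performed_at TEXT NOT NULL            -- date and time of the session
);
CREATE TABLE exercise (
    id INTEGER PRIMARY KEY,
    workout_id INTEGER NOT NULL REFERENCES workout(id),
    name TEXT NOT NULL
);
CREATE TABLE attribute (
    id INTEGER PRIMARY KEY,
    exercise_id INTEGER NOT NULL REFERENCES exercise(id),
    name TEXT NOT NULL,                   -- e.g. sets, reps, weight, time, distance
    value REAL NOT NULL
);
""")


def add_exercise(workout_id, name, **attributes):
    """Insert one exercise and one attribute row per keyword argument."""
    exercise_id = conn.execute(
        "INSERT INTO exercise (workout_id, name) VALUES (?, ?)", (workout_id, name)
    ).lastrowid
    conn.executemany(
        "INSERT INTO attribute (exercise_id, name, value) VALUES (?, ?, ?)",
        [(exercise_id, key, value) for key, value in attributes.items()],
    )
    return exercise_id


# One workout mixing a sets/reps exercise and a run, as in the question.
workout_id = conn.execute(
    "INSERT INTO workout (performed_at) VALUES (?)", ("2021-08-12 10:30",)
).lastrowid
add_exercise(workout_id, "Bench press", sets=5, reps=10, weight=40)
add_exercise(workout_id, "Run", time=60, distance=400)

# The "human" view is a join across the three tables.
rows = conn.execute("""
    SELECT e.name, a.name, a.value
    FROM workout w
    JOIN exercise e ON e.workout_id = w.id
    JOIN attribute a ON a.exercise_id = e.id
    WHERE w.id = ?
    ORDER BY e.id, a.id
""", (workout_id,)).fetchall()
```

Because attributes are rows rather than columns, a workout can hold any number of exercises, and each exercise any set of attributes, without NULL padding.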
I am currently working on a web application that stores information about cooks in the user table. The application lets users search for cooks. If a cook is not available on May 3, 2016, we want to show a Not-Bookable or Not-Available message for that cook when a user searches for May 3, 2016. The solution we have come up with is to create a table named CooksAvailability with the following fields:
ID, //Primary key, auto increment
IDCook, //foreign key to user's table
Date, //date he is available on
AvailableForBreakFast, //bool field
AvailableForLunch, //bool field
AvailableForDinner, //bool field
BreakFastCookingPrice, //decimal nullable
LunchCookingPrice, //decimal nullable
DinnerCookingPrice //decimal nullable
With this schema, we are able to tell if the user is available for a specific date or not. But the problem with this approach is that it requires a lot of db space i.e if a cook is available for 280 days/year, there has to be 280 rows to reflect just one cook's availability.
This is too much space given that we may potentially have thousands of cooks registered with our application. As you can see from the CookingPrice fields for breakfast, lunch and dinner, a cook can charge different rates for cooking on different dates and at different times.
Currently, we are looking for a smart solution that fulfils our requirements and consumes less space than our solution does.
You are storing a record for each day. The main mistake that led you to this redundant design is that you did not separate the concepts enough.
I do not know whether a cook has an expected rate for a given meal, that is, a price one can assume in general if one has no additional information. If that is the case, then you can store these default prices in the table where you store the cooks.
Let's store the availability and the specific prices in different tables. If the availability table does not have to store the prices, then it can store availability intervals. In the other table, where you store the prices, you need to store only the prices that deviate from the expected price. So you will have availability intervals in one table, specific prices (when the price differs from the expected one) in the other, and default meal prices in the cook table, so that if there is no special price, the default price is used.
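The split into three tables might look roughly like the following Python/SQLite sketch. All names here (`cook`, `availability`, `price_override`, the `quote` helper) and the sample prices are assumptions for illustration, and per-meal availability flags are left out to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cook (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    breakfast_price REAL,                 -- default prices live with the cook
    lunch_price REAL,
    dinner_price REAL
);
CREATE TABLE availability (               -- one row per interval, not per day
    cook_id INTEGER NOT NULL REFERENCES cook(id),
    from_date TEXT NOT NULL,              -- ISO dates compare correctly as text
    to_date TEXT NOT NULL
);
CREATE TABLE price_override (             -- only prices that deviate from the default
    cook_id INTEGER NOT NULL REFERENCES cook(id),
    day TEXT NOT NULL,
    meal TEXT NOT NULL,
    price REAL NOT NULL,
    PRIMARY KEY (cook_id, day, meal)
);
INSERT INTO cook VALUES (1, 'Example Cook', 10, 20, 30);
INSERT INTO availability VALUES (1, '2016-05-01', '2016-05-02');
INSERT INTO availability VALUES (1, '2016-05-04', '2016-12-31');
INSERT INTO price_override VALUES (1, '2016-12-31', 'dinner', 80);
""")


def quote(cook_id, day, meal):
    """Return the price for that cook, day and meal, or None if not available."""
    row = conn.execute("""
        SELECT COALESCE(o.price,
                        CASE :meal WHEN 'breakfast' THEN c.breakfast_price
                                   WHEN 'lunch' THEN c.lunch_price
                                   WHEN 'dinner' THEN c.dinner_price END)
        FROM cook c
        JOIN availability a
          ON a.cook_id = c.id AND :day BETWEEN a.from_date AND a.to_date
        LEFT JOIN price_override o
          ON o.cook_id = c.id AND o.day = :day AND o.meal = :meal
        WHERE c.id = :cook_id
    """, {"cook_id": cook_id, "day": day, "meal": meal}).fetchone()
    return None if row is None else row[0]


not_available = quote(1, "2016-05-03", "lunch")    # falls in the gap between intervals
default_price = quote(1, "2016-05-04", "lunch")    # no override, default applies
special_price = quote(1, "2016-12-31", "dinner")   # override applies
```

In this sample the cook's whole year is covered by two availability rows and one override row instead of one row per available day.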
To answer your question I should know more about the structure of the information.
For example, if most cooks are available over a certain period, it could be helpful to organize your availability table with avail_from_date and avail_to_date columns instead of a row for each day. This would reduce the number of rows.
The different prices for breakfast, lunch and dinner could be better stored in the cooks table, if the prices are not different each day. The same goes for the availability for breakfast, lunch and dinner, if it is not different each day.
But if your information structure makes it necessary to keep a record for every cook every day, this would be 365 * 280 = 102,200 records for a year, which is not very much for a SQL database in my eyes. If you put the indexes in the right place, this will perform well.
There are a few questions that would help with the overall answer.
How often does availability change?
How often does price change?
Are there general patterns, e.g. cook X is available for breakfast and lunch, Monday - Wednesday each week?
Is there a normal availability / price over a certain period of time,
but with short-term overrides / differences?
If availability and price change at different speeds, I would suggest you model them separately. That way you only need to show what has changed, rather than duplicating data that is constant.
Beyond that, there's a space / complexity trade-off to make.
At one extreme, you could have a hierarchy of configurations that override each other. So, for cook X there's set A that says they can do breakfast Monday - Wednesday between dates 1 and 2. Then also for cook X there's set B that says they can do lunch on Thursday between dates 3 and 4. Assuming that dates go 1 -> 3 -> 4 -> 2, you can define whether set B overrides set A or adds to it. This is the most concise, but has quite a lot of business logic to work through to interpret it.
At the other extreme, you just say for cook X between date 1 and 2 this thing is true (an availability for a service, a price). You find all things that are true for a given date, possibly bringing in several separate records e.g. a lunch availability for Monday, a lunch price for Monday etc.
I am trying to design a database that contains data about street parking. Each parking spot has GPS coordinates, time restrictions by day, day-of-week rules (some days are permitted, others restricted), and a free or paid status. In the end, I need to run queries that find parking by these criteria.
As a first draft I tried something like this:
Parking
-------
parkingId
Lat
Long
Days (1234567)
Time -- already here comes trouble
But it's not normalized and would quickly bloat the database. How should I design the data?
Update: for now I have two approaches.
The first one is:
I try to use restriction tables with many-to-many links. (This is an example for days and months.) But the queries will be complicated and I don't know how to link time with day.
The second approach is:
Using one restrictions table with a Type field that carries a priority. But this solution is also not normalized.
Just to be clear about what data I have:
ParkingId Coords String Description(NO PARKING 11:30AM TO 1PM THURS)
And I want to show the user where he can find street parking by area, time and day.
Thanks to all for your help and time.
This seems like a difficult task. Just a few thoughts.
Are you only concerned with street parking? Parking houses have multiple floors so GPS coordinates won't work unless you stay on the streets.
What is the accuracy of the coordinates? Would it be easier to identify each parking space individually by some other standard, like unique identifiers of the painted parking squares? (But what happens if people don't park in the squares? Or if the GPS accuracy fails or is not exact enough because of illegal parking? Do you intend to keep records of the parking tickets too?)
Some thought for the tables or information you need to take into account:
time: opening hours, days
price: maybe a different price for different time intervals?
exceptions: holidays, maintenance (maybe not so important, you could just make parking space status active/inactive)
parking slot: id (GPS/random id), status
The three or four tables above could be linked by an intermediate table which records the properties of a parking space for every possible parking time (like a prototype for all possible combinations). That information could be linked to another table where you keep records of actual parking events (so you can, for example, keep records of people who have or have not paid their bills, if you need to).
There are lots of things that affect your implementation, so you really need to list all the rules of the parking space (and event?). The database structure can be done (and redone) later, once you understand the properties of the events you need to record. And that's the key to everything: understanding what you need to do so you can design and create the implementation. If the implementation (application) doesn't work, change the implementation. If the design is faulty, redesign. If you don't understand the whole process (what you really need), everything you do is bound to fail. (Unless you are incredibly lucky, but I wouldn't count on luck...)
Try using two tables with an intersection entity between them.
Table parking will have parking_id, lat and long columns. Table Restrictions will have all the type of restrictions that you have in your scenario with something like restriction_id, restriction_day, restriction_time and restriction_status and maybe restriction_type.
Then you can link the two tables with foreign key constraints in the intersection entity.
Example parking_id has restriction_id.
This way a parking can have more than one restriction and a restriction can be applied to more than one parking.
As you seem to have heard of normalization, and following the comment from Damien, you should use different tables to represent different things.
You should then think about how to link those tables together, and in the process define the type of relationship between the two. It could be one-to-one (this is the case where you could be tempted to put everything in the same table, but a simple foreign key in a linked table is cleaner), one-to-many (this is where the trouble would begin if you put everything in one table, because there will now be several rows in the linked table with the same foreign key, and if everything were in the same table you'd have to multiply the fields in that table), or many-to-many (where you would need to add a table only to make the link between the two other tables, with two foreign key fields pointing to records in both tables).
For example, in your case, a Parking table could hold the parking name, coordinates, etc.
A second table, TimeTable, could hold the opening days/times for each parking, with a foreign key to the parkingId (making it a one-to-many relationship: one parking can have many opening frames). The fields of this table could for example be DayOfWeek (a number indicating the day), openingTime and closingTime. This would allow you to define several timeframes on the same day, or a single one (if it's always open, for example), giving in this case 7 records in this table for this parking (=> one-to-many relationship).
You could then imagine a third table, Price, where you put data concerning the price of that parking (probably one-to-many too, with records for hourly rates, long stay, ...), and so on depending on the needs and the different "objects" you would need to represent.
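To make the Parking/TimeTable idea concrete, here is a small Python/SQLite sketch of the one-to-many layout and a "what is open now" query. The table and column names, the `open_at` helper and the sample rows are assumptions for illustration; days are numbered 1 (Monday) to 7 (Sunday) and times are stored as zero-padded "HH:MM" text so that they compare correctly as strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE parking (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    lat REAL,
    lon REAL
);
CREATE TABLE time_table (                 -- one row per opening frame
    parking_id INTEGER NOT NULL REFERENCES parking(id),
    day_of_week INTEGER NOT NULL CHECK (day_of_week BETWEEN 1 AND 7),
    opening_time TEXT NOT NULL,
    closing_time TEXT NOT NULL
);
INSERT INTO parking VALUES (1, 'Main St', 50.0, 8.0);
INSERT INTO parking VALUES (2, 'Harbour', 50.1, 8.1);
""")

# Main St: Monday to Friday, 08:00-18:00. Harbour: always open (7 rows).
frames = [(1, day, "08:00", "18:00") for day in range(1, 6)]
frames += [(2, day, "00:00", "24:00") for day in range(1, 8)]
conn.executemany("INSERT INTO time_table VALUES (?, ?, ?, ?)", frames)


def open_at(day_of_week, time):
    """Names of parkings with an opening frame covering the given day and time."""
    rows = conn.execute("""
        SELECT DISTINCT p.name
        FROM parking p
        JOIN time_table t ON t.parking_id = p.id
        WHERE t.day_of_week = ?
          AND t.opening_time <= ?
          AND ? < t.closing_time
        ORDER BY p.name
    """, (day_of_week, time, time)).fetchall()
    return [name for (name,) in rows]


thursday_noon = open_at(4, "12:00")
saturday_noon = open_at(6, "12:00")
monday_evening = open_at(1, "19:00")
```

The same shape would work for restriction frames instead of opening frames by inverting the test; which of the two to store depends on how the source data is phrased.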
Please note these are only rough examples. Database design can sometimes be very tricky and it's a subject I'm no specialist in, but I think this advice can help you go further and come back with another question if you get stuck.
Good luck !
I'm new to creating cubes, so please be patient.
Here's an example of my data
I have multiple companies, each company has multiple stores.
Sales are each linked to a particular company, with a particular store on a particular date.
ex:5 sales took place for Company A, Store 1, on 5/19/2011
Returns are linked to a particular company on a particular date.
ex: 3 returns took place for Company A on 3/11/2012
Sometimes my users want to see a list of stores, the date, and how many returns took place, and how many sales.
Sometimes they want to see a list of companies, the specific stores, and the number of sales.
I have a table that stores
COMPANY - DATE - STORE- SALES - RETURNS
I end up having the value for returns repeated for each store under a particular COMPANY - DATE pair.
so if I'm writing a query, and I want to find out returns, I just do a
select distinct company, date, returns from mytable
but I am not sure how to turn this into a cube (using SS BI and Visual Studio).
(I've only made a couple of cubes so far)
Thanks! (also, feel free to point me at appropriate references)
It sounds like Company is an attribute of the Store and should be in the Store dimension rather than the fact table. There may have to be a transformation on returns to convert the Company to a store.
Am I missing anything?
I'm working on a project that must store employees' timetables. For example, one employee works Monday through Thursday from 8am to 2pm and from 4pm to 8pm. Another employee may work Tuesday through Saturday from 6am to 3pm.
I'm looking for an algorithm or a method to store this kind of data in a MySQL database. The data will rarely be accessed, so performance is not important.
I've thought about storing it as a string, but I don't know any algorithm to "encode" and "decode" such a string.
As many of the comments indicate, it's usually a poor idea to encode all the data into a string that is basically meaningless to the data base. It's usually better to define the data elements and their relations and represent these structures in the data base. The Wikipedia article on data models is a good overview of what's involved (although it's way more general than what you need). The problem you are describing seems simple enough that you could do this with pencil and paper.
One way to start is to write down a list of logical relationships between concepts in your problem. For instance, the list might look like this (your rules may be different):
Every employee follows a single schedule.
Every employee has a first and last name, as well as an employee ID. Different employees may have the same name, but each employee's ID is unique to that employee.
A schedule has a start and stop day of the week and a start and stop time of day.
The start and stop time is the same for every day of the schedule.
Several employees may be on the same schedule.
From this, you can list the nouns used in the rules. These are candidates for entities (columns) in the data base:
Employee
Employee ID
Employee first name
Employee last name
Schedule
Schedule start day
Schedule start time
Schedule end day
Schedule end time
For the rules I listed, schedules seem to exist independently of employees. Since there needs be a way of identifying which schedule an employee follows, it makes sense to add one more entity:
Schedule ID
If you then look at the verbs in the rules ("follows", "has", etc.), you start to get a handle on the relationships. I would group everything so far into two relationships:
Employees: ID, first_name, last_name, schedule_ID
Schedules: ID, start_day, start_time, end_day, end_time
That seems to be all that's needed by way of data structures. (A reasonable alternative to start_day and end_day for the Schedules table would be a boolean field for each day of the week.) The next step is to design the indexes. This is driven by the queries you expect to make. You might expect to look up the following:
What schedule is employee with ID=xyz following?
Who is at work on Mondays at noon?
What days have nobody at work?
Since employees and schedules are uniquely identified by their respective IDs, these should be the primary keys of their respective tables. You also probably want to have consistency rules for the data. (For instance, you don't want an employee on a schedule that isn't defined.) This can be handled by defining a "foreign key" relationship between the Employees.schedule_ID field and the Schedules.ID field, which means that Employees.schedule_ID should be indexed. However, since employees can share the same schedule, it should not be a unique index.
If you need to look up schedules by day of week and time of day, those might also be worth indexing. Finally, if you want to look up employees by name, those fields should perhaps be indexed as well.
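The two tables, the foreign key and the non-unique index described above could be sketched as follows, using Python's built-in SQLite module rather than MySQL so the example is self-contained. The sample employees, the numeric day encoding (1 = Monday to 7 = Sunday), the "HH:MM" text times and the `at_work` helper are assumptions for illustration, and the sketch assumes a schedule never wraps around the end of the week.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys on request
conn.executescript("""
CREATE TABLE schedules (
    id INTEGER PRIMARY KEY,
    start_day INTEGER NOT NULL,
    start_time TEXT NOT NULL,
    end_day INTEGER NOT NULL,
    end_time TEXT NOT NULL
);
CREATE TABLE employees (
    id INTEGER PRIMARY KEY,
    first_name TEXT NOT NULL,
    last_name TEXT NOT NULL,
    schedule_id INTEGER NOT NULL REFERENCES schedules(id)
);
-- Non-unique index: several employees may share one schedule.
CREATE INDEX idx_employees_schedule ON employees(schedule_id);

INSERT INTO schedules VALUES (1, 1, '08:00', 4, '14:00');  -- Mon-Thu, 8am-2pm
INSERT INTO schedules VALUES (2, 2, '06:00', 6, '15:00');  -- Tue-Sat, 6am-3pm
INSERT INTO employees VALUES (1, 'Ann', 'Lee', 1);
INSERT INTO employees VALUES (2, 'Bob', 'Ray', 2);
INSERT INTO employees VALUES (3, 'Cy', 'Poe', 1);
""")


def at_work(day, time):
    """First names of employees whose schedule covers the given day and time."""
    rows = conn.execute("""
        SELECT e.first_name
        FROM employees e
        JOIN schedules s ON s.id = e.schedule_id
        WHERE ? BETWEEN s.start_day AND s.end_day
          AND s.start_time <= ?
          AND ? < s.end_time
        ORDER BY e.id
    """, (day, time, time)).fetchall()
    return [name for (name,) in rows]


monday_noon = at_work(1, "12:00")      # "Who is at work on Mondays at noon?"
saturday_early = at_work(6, "07:00")
sunday_noon = at_work(7, "12:00")
```

Note that under the rule "every employee follows a single schedule", a split shift such as 8am to 2pm plus 4pm to 8pm would need a different rule set, for example several schedule rows per employee.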
Assuming you're using PHP:
Store the timetable in a PHP array and then use the serialize function to turn it into a string;
to get the array back, use unserialize.
However, this form of storage is almost never a good idea.