Combining data sets without losing observations in SAS - database

Hey guys,
I know, another post, another problem :D :(.
I took a screenshot to explain my problem more easily.
http://i39.tinypic.com/rhms0h.jpg
As you can see, I want to merge two tables (again), the Base and Analyst tables. What I want to achieve is shown in the bottom-right table. In the analyst table I'm calculating the total number of analysts and the number of female analysts for each month. In the base table I have several observations for one company (here the company Alcoa, with ticker AA). When I use the following command:
data want;
merge base analyst;
by month ;
run;
I get the problem shown in the top-right corner: my observations in the main table are narrowed down to only 4 (one observation for each distinct year: 2001, 2002, 2005, 2006). What I want is that the observations are not reduced, but that the same analyst data is repeated for every observation, as shown in the bottom-right corner. What am I missing in my merge command?
In both tables I have month as a time count variable (the observations in my base table are monthly) on which I need to merge. For clarity I added two screenshots of my real data sets in SAS.
The base table:
http://i42.tinypic.com/dr5jky.jpg
The analyst table:
http://i40.tinypic.com/eqpmqq.jpg
Here is what my merged table looks like:
http://i43.tinypic.com/116i62s.jpg
You can clearly see that the merged table only has four observations left for AA (one for each unique year) instead of the original 8.
Does anyone have an idea how to solve this?

Ugh, it appears you can easily solve this by merging on both ticker and month:
data ftest;
merge ftest tryf1;
by ticker month;
run;
/shame.
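For anyone who hits the same thing, a minimal sketch of the full fix, assuming the data set names from the question (base, analyst, want): a match-merge requires both inputs to be sorted by the BY variables, so PROC SORT steps usually come first.
/* Sort both data sets by the BY variables, then match-merge on them. */
proc sort data=base; by ticker month; run;
proc sort data=analyst; by ticker month; run;

data want;
merge base analyst;
by ticker month;
run;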

Related

Measure that shows the oldest product in the table - DAX

DAX is still a little bit new and confusing to me, which is why I'm looking for your help.
I have a very simple table that shows the products that are on the production line. It contains 3 columns: one for the product name, one for the date on which the product went onto the production line, and one for the amount.
All I want is to show, in a Power BI dashboard, the first product that is still on the production line. In other words, I want a measure that calculates the oldest date in my table and returns the product (or products) related to that date.
One solution is to show a table sorted by the date but it is not really what I'm looking for.
Thanks in advance
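A minimal sketch of one way such a measure could look, assuming a table named Products with columns ProductName and StartDate (all names here are placeholders, not taken from the question):
Oldest Product =
-- Placeholder names: Products, ProductName, StartDate
VAR OldestDate = CALCULATE ( MIN ( Products[StartDate] ), ALL ( Products ) )
RETURN
    CALCULATE (
        CONCATENATEX ( VALUES ( Products[ProductName] ), Products[ProductName], ", " ),
        Products[StartDate] = OldestDate
    )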

Power BI Aggregation of End Tables

I am new to Power BI and database management and I want to clarify for myself how Power BI works, with reference to my last two questions (Database modelling Bridge Table, Power BI Report Bridge Table). I have a main_table with firm-specific information for each year, which is connected to an end_table that contains some quantitative information (e.g. sales data). The tables are modelled as a 1:N relationship so that I do not have to store the same values twice, which I thought was good data-modelling practice.
I want to aggregate the value column of the end table over the group column Year. I am surprised that, as far as I can tell, Power BI sums up the value column within the end table, when I would expect the aggregation to run over the group variable in the connected tables.
My basic example is based on this data and data model (you need to adjust the relationship manually):
# main_table: 25 rows of yearly firm data; FK_id points at end_table$id
main_table<-data.frame(id=1:20, FK_id=sample(1:2,20, replace=TRUE), Jahre=2016:2020)
main_table<-rbind(main_table,data.frame(id=21:25, FK_id=sample(2:3,5, replace=TRUE), Jahre=2015) )
# end_table: the 1-side of the relationship, one value per id
end_table<-data.frame(id=1:3, value=c(10,20,30))
The first 5 rows of the data, including all columns, look like this:
If I take out all row-specific information and sum over value, it always shows the total of the end table, which is 60, for every Year.
Making the relationship bi-directional does not help; it just sums up the end_table values that exist in each year. I do get the correct results if I add the value column to the main table using Related value = RELATED(end_table[value]).
I am just wondering if there is another way to model or analyse this 1:N relationship in Power BI. This comes up frequently, and it feels a bit tedious to always add the column to the main table using RELATED() when it would be intuitive to just click both columns and have the aggregation follow the grouping variable.
In any case, just asking this and my other two questions helped me a lot.
This is a bit of a weird modeling situation (even though it's not terribly uncommon). In general, it's handy to build star schemas where you have dimension tables in 1:N relationships to fact table(s). E.g.
In this setup, the items from the dimension tables (e.g. year or customer) are used in the columns and rows in a visual and measures generally aggregate columns from the fact table (e.g. sales amount).
Your example inverts this. You are trying to sum over a column in your end table using the year as a dimension. As a result, it's not automatically behaving as you'd expect.
In order to get the result that you want, where Year is treated as a dimension, you need a measure that iterates over main_table and picks up the related end_table value for each row, rather than summing end_table directly. Since main_table carries the Year column, you can write
SumValue = SUMX ( main_table, RELATED ( end_table[value] ) )

MS SQL Server, arrays, looping, and inserting qualified data into a table

I've searched around for an answer, and I'm not sure how best to frame the question since I'm rather new to SQL Server.
Here's what I've got going on: I get a weekly report detailing the products that have been sold and the quantity of each. This data needs to go into a yearly totals table. In this table the first column is the product_id and the next 52 columns are the week numbers, 1-52.
There's a JOIN on product_id between the weekly and yearly tables. That finds the proper row and column in which to put the weekly quantity for that product.
Here's where I'm not sure what to do. In 2019 there are no product_ids in that column yet, so there's nothing to JOIN on. Those product_ids need to be added weekly if they aren't there already. I need to take the weekly report of product_id and quantity and check each product_id to see if it's in the yearly table; if not, I need to add it.
If I had it my way, I'd create an array of the product_id numbers from the weekly data and loop through each one, creating a new record in the yearly table for any product_id that is not already there. I just don't know how best to do that in SSMS.
I've searched around and found different strategies for this, but nothing strikes me as a perfect solution: creating a #temp table or table variable, a UNION-based exclusion to get just the rows that aren't in the table yet, or a WHILE loop. Any suggestions would be helpful.
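One of the simpler strategies along those lines, sketched here with the table names used in the accepted approach below (dbo.WeeklyParts as the weekly feed, dbo.WeeklySales2018 as the yearly table), is a set-based insert of only the missing ids; no looping required:
-- Insert only the product ids that are not already in the yearly table.
INSERT INTO dbo.WeeklySales2018 (PartNo)
SELECT w.PartNo
FROM dbo.WeeklyParts AS w
WHERE NOT EXISTS (
    SELECT 1
    FROM dbo.WeeklySales2018 AS y
    WHERE y.PartNo = w.PartNo
);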
I ended up using a MERGE to solve this. I create a WeeklyParts table to dump the weekly data into, then I do a MERGE with the yearly table, inserting only those rows where there is no match. Works well.
-- Merge the PartNo's so that only unique ones are added to the yearly table
MERGE INTO dbo.WeeklySales2018
USING dbo.WeeklyParts
ON (dbo.WeeklySales2018.PartNo = dbo.WeeklyParts.PartNo)
WHEN NOT MATCHED THEN
INSERT (PartNo) VALUES (dbo.WeeklyParts.PartNo);
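The step described earlier that writes the weekly quantity into the proper week column could then be a joined UPDATE. A hedged sketch, assuming a Quantity column in WeeklyParts and one column per week (Week14 here) in the yearly table:
-- Copy this week's quantity into the matching week column (week 14 as an example).
UPDATE y
SET y.Week14 = w.Quantity
FROM dbo.WeeklySales2018 AS y
JOIN dbo.WeeklyParts AS w
    ON w.PartNo = y.PartNo;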

Database table structure for price list

I have about 10 tables containing records with date ranges and some value belonging to each date range.
Each table has its own meaning.
For example
rates
start_date DATE
end_date DATE
price DOUBLE
availability
start_date DATE
end_date DATE
availability INT
and then a dates table
day DATE
which holds a date for each day, two years ahead.
The final result is produced by joining these 10 tables to the dates table.
The query takes a bit longer, because there are some other joins and subqueries.
I have been thinking about creating one bigger table containing the data from all 10 tables for each day, but the final table would have about 1.5M-2M records.
From testing, it seems to be quicker (0.2s instead of about 1s) to search in this one table rather than joining the tables and searching the joined result.
Is there any real reason why it would be a bad idea to have a table with that many records?
The final table would look like
day DATE
price DOUBLE
availability INT
Thank you for your comments.
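For illustration, a hedged sketch of how such a per-day table could be built (MySQL-style CREATE TABLE ... AS SELECT; table and column names follow the examples above, and the BETWEEN joins are one assumed way of expanding the date ranges):
-- One row per calendar day, with the range-based tables expanded onto it.
CREATE TABLE daily_summary AS
SELECT d.day,
       r.price,
       a.availability
FROM dates d
LEFT JOIN rates r
       ON d.day BETWEEN r.start_date AND r.end_date
LEFT JOIN availability a
       ON d.day BETWEEN a.start_date AND a.end_date;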
This is a complicated question. The answer depends heavily on usage patterns. Presumably, most of the values do not change every day. So, you could be vastly increasing the size of the database.
On the other hand, something like availability may change every day, so you already have a large table in your database.
If your usage patterns focused on one table at a time, I'd be tempted to say "leave well-enough alone". That is, don't make a change if it ain't broke. If your usage involved multiple updates to one type of record, I'd be inclined to leave them in separate tables (so locking for one type of value does not block queries on other types).
However, your usage suggests that you are combining the tables. If so, I think putting them in one row per day per item makes sense. If you are getting successive days at one time, you may find that having separate days in the underlying table greatly simplifies your queries. And, if your queries are focused on particular time frames, your proposed structure will keep the relevant data in the cache, giving room for better performance.
I appreciate what Bohemian says. However, you are already going to the lowest level of granularity and seeing that it works for you. I think you should go ahead with the reorganization.
I went down this road once and regretted it.
The fact that you have a projection of millions of rows tells me that the dates from one table don't line up with the dates from another, which forces extra boundaries onto some attributes: once everything is in one table, all attributes must share the same boundaries.
The problem I encountered was that the business changed and suddenly I had a lot more combinations to deal with, and the number of rows blew right out, slowing queries significantly. The other problem was keeping the data up to date - my "super" table had to be recalculated from the separate tables whenever they changed.
I found that keeping them separate and moving the logic into the app layer worked for me.
The data I was dealing with was almost exactly the same as yours, except I had only 3 tables: availability, pricing and margin. The fact was that the 3 were unrelated, so the date ranges never aligned, leading to lots of artificial rows in the big table.

Record clustering based on Place and Time in SQL

I would like to know if any of you have written a query for record clustering based on overlapping time intervals AND locations.
Data in my application is represented as individual events of a person being at a given location from a start time to an end time. Location is defined as latitude and longitude. During a day one person will have multiple different locations, each with its own start and end time. I need to get groups of persons who were at the same location at the same time. One person will most likely be in several groups during a day.
Example:
Person A can be with Person B at the office from 10 AM to 11 AM.
Then Person A leaves the office for gym.
There he is with Person C from 12 noon to 1PM.
At 12:30 Person C leaves gym for the office.
At 1:30PM I have Person B and C at the office.
Persons B and C leave the office at 5PM.
In this example I have
Cluster 1 (Person A and B at the office) from 10AM to 11AM,
Cluster 2 (Person A and C at the gym) from 12 noon to 1PM, and
Cluster 3 (Person B and C at the office) from 1:30PM to 5PM.
The location of each individual person will not exactly match another person's location. I'm using the SQL geography point type with STBuffer at a proximity threshold and checking STIntersects. I'm also joining the table on itself to check time overlaps. But I'm experiencing some weird behaviour where Person A gets clustered on itself without any other person ever joining him.
I'm wondering if there's a design pattern for handling situations like this. Ideally I would have the recordset grouped by "overlapping time period" and "centroid of an arbitrary geometry", but I can't figure out how to get the overlapping time period or the arbitrary geometry.
Any ideas are welcome and highly appreciated.
P.S. Writing a Windows application is not an option unless it's the only way.
EDIT: I failed to mention that the clustering locations are never known in advance. There can be an indefinite number of locations where two or more of my customers may cluster; I don't know whether clustering will happen at the office, the gym, some park or a bus station. The clustering location (I think) will be the centroid of a polygon formed by all the congregated people's latitudes and longitudes.
Would the code be something like
select a.person, a.eventtime, a.eventplace,
       b.person, b.eventtime, b.eventplace
from people a
join people b
  on a.eventtime between dateadd(hh,-2,b.eventtime) and dateadd(hh,2,b.eventtime)
 and yourdistancefunction(a.eventplace, b.eventplace) < 5 -- don't know what you are measuring
 and a.person <> b.person
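A hedged refinement of that idea, with assumed column names (a start_time and end_time per visit, and a geography column called location): test that the two visit intervals actually overlap rather than that the start times fall within two hours of each other, use a buffered STIntersects for "same place", and order the pair to avoid self-matches and mirrored duplicates.
-- Pairs of people whose buffered locations intersect and whose visits overlap in time.
select a.person,
       b.person,
       case when a.start_time > b.start_time then a.start_time else b.start_time end as overlap_start,
       case when a.end_time   < b.end_time   then a.end_time   else b.end_time   end as overlap_end
from events a
join events b
  on a.person < b.person                                   -- no self-pairs, no mirrored duplicates
 and a.start_time < b.end_time
 and b.start_time < a.end_time                             -- the time intervals overlap
 and a.location.STBuffer(50).STIntersects(b.location) = 1; -- within roughly 50 m (assumed threshold)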
I solved the puzzle by first getting the entire dataset for the given time period, then looping through the recordset and generating STUnion shapes for all overlapping locations, and finally joining the generated temporary table back onto the initial dataset to keep only the records that intersected with the STUnion shapes AND with each other in time.
I used three temp tables, but hey, who cares as long as it does the job :)
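Since the question also asked about a centroid for an arbitrarily shaped group, a hedged footnote: the geography type has no STCentroid, but geography::UnionAggregate(...).EnvelopeCenter() gives a usable representative centre point for a set of member locations (the table and column names below are assumptions):
-- A representative centre point per cluster (not a true centroid).
select cluster_id,
       geography::UnionAggregate(location).EnvelopeCenter() as cluster_centre
from #clustered_events
group by cluster_id;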
