Using SSIS, How do I find the cities with the largest population? - sql-server

I have a dataflow task with information that looks something like this:
Province | City     | Population
---------+----------+-----------
Ontario  | Toronto  | 7000000
Ontario  | London   | 300000
Quebec   | Quebec   | 300000
Quebec   | Montreal | 6000000
How do I use the Aggregate transformation to get the city with the largest population in each province:
Province | City     | Population
---------+----------+-----------
Ontario  | Toronto  | 7000000
Quebec   | Montreal | 6000000
If I set "Province" as the Group-By column and "Population" to the "Max" aggregate, what do I do with the City column?

Completely agree with @PaulStock that aggregates are best left to source systems. An aggregate in SSIS is a fully blocking component, much like a sort, and I've already made my argument on that point.
But there are times when doing those operations in the source system just isn't going to work. The best I've been able to come up with is to basically double-process the data. Yes, ick, but I was never able to find a way to pass a column through unaffected. For Min/Max scenarios I'd want that as an option, but obviously something like a Sum would make it hard for the component to know which "source" row it should tie back to.
2005
A 2005 implementation would look like this. Your performance is not going to be good, in fact a few orders of magnitude from good, as you'll have all these blocking transforms in there in addition to having to reprocess your source data.
(Screenshot: the aggregated data is joined back to the source rows with a Merge Join.)
2008
In 2008, you have the option of using the Cache Connection Manager which would help eliminate the blocking transformations, at least where it matters, but you're still going to have to pay the cost of double processing your source data.
Drag two data flows onto the canvas. The first will populate the cache connection manager and should be where the aggregate takes place.
Now that the cache has the aggregated data in there, drop a lookup task in your main data flow and perform a lookup against the cache.
On the lookup's General tab, set the connection type to use a cache connection manager; on the Connection tab, select your cache connection manager; then map the appropriate columns. Great success.
Script task
The third approach I can think of, 2005 or 2008, is to write it yourself. As a general rule I try to avoid the script tasks, but this is a case where it probably makes sense. You will need to make it an asynchronous script transformation and simply handle your aggregations in there. It's more code to maintain, but you can save yourself the trouble of reprocessing your source data.
Finally, as a general caveat, I'd investigate what the impact of ties will do to your solution. For this data set, I would expect something like Guelph to suddenly swell and tie Toronto, but if it did, what should the package do? Right now, both will result in 2 rows for Ontario, but is that the intended behaviour? Script, of course, allows you to define what happens in the case of ties. You could probably stand the 2008 solution on its head by caching the "normal" data, using that as your lookup condition, and using the aggregates to pull back just one of the ties. 2005 can probably do the same just by putting the aggregate as the left source for the merge join.
Edits
Jason Horner had a good idea in his comment. A different approach would be to use a multicast transformation and perform the aggregation in one stream and bring it back together. I couldn't figure out how to make it work with a union all but we could use sorts and merge join much like in the above. This is probably a better approach as it saves us the trouble of reprocessing the source data.

Instead of using the Aggregate transformation, could you use a SQL query?
SELECT
    p.province,
    p.city,
    p.[population]
FROM temp_pop AS p
JOIN (
    SELECT
        province,
        [population] = MAX([population])
    FROM temp_pop
    GROUP BY province
) AS m
    ON  p.province = m.province
    AND p.[population] = m.[population]
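If the source is SQL Server 2005 or newer, a windowed query is another way to get the same result and makes the tie handling explicit. This is just a sketch against the same temp_pop table:

-- Sketch: ROW_NUMBER() keeps exactly one city per province;
-- switch to RANK() if tied cities should all be returned.
SELECT province, city, [population]
FROM (
    SELECT
        province,
        city,
        [population],
        ROW_NUMBER() OVER (PARTITION BY province ORDER BY [population] DESC) AS rn
    FROM temp_pop
) AS ranked
WHERE rn = 1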

Related

Pattern matching based on variable name for variable selection in a Postgres query?

I'm trying to query some data in Postgres and I'm wondering how I might use some sort of pattern matching not merely to select rows - e.g. SELECT * FROM schema.tablename WHERE varname ~ 'phrase' - but to select columns in the SELECT statement, specifically doing so based on the common names of those columns.
I have a table with a bunch of estimates of rates over the course of many years - say, of apples picked per year - along with upper and lower 95% confidence intervals for each year. (For reference, each estimate and 95% CI comes from a different source - a written report, to be precise; these sources are my rows, and the table describes various aspects of each source. Based on a critique from below, I think it's important that the reader know that the unit of analysis in this relational database is a written report with different estimates of things picked per year - apples in one table, oranges in another, pears in a third.)
So in this table, each year has three columns / variables:
rate_1994
low_95_1994
high_95_1994
The thing is, the CIs are mostly null - they haven't been filled in. In my query, I'm really only trying to pull out the rates for each year: all the variables that begin with rate_. How can I phrase this in my SELECT statement?
I'm trying to employ regexp_matches to do this, but I keep getting back errors.
I've done some poking around StackOverflow on this, and I'm getting the sense that it may not even be possible, but I'm trying to make sure. If it isn't possible, it's easy to break up the table into two new ones: one with just the rates, and another with the CIs.
(For the record, I've looked at posts such as this one:
Selecting all columns that start with XXX using a wildcard? )
Thanks in advance!
If what you are basically asking is whether columns can be selected dynamically based on an execution-time condition:
No.
You could, however, use PL/pgSQL to build the query up as a string and then run it with EXECUTE (dynamic SQL).
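A rough sketch of that idea, assuming the estimates live in a table called public.apples (the table and schema names here are placeholders):

-- Build a column list of everything starting with rate_ and materialize
-- just those columns into a temp table.
DO $$
DECLARE
    col_list text;
BEGIN
    SELECT string_agg(quote_ident(column_name::text), ', ' ORDER BY column_name)
      INTO col_list
      FROM information_schema.columns
     WHERE table_schema = 'public'
       AND table_name   = 'apples'
       AND column_name LIKE 'rate\_%';

    EXECUTE format('CREATE TEMP TABLE rates_only AS SELECT %s FROM public.apples', col_list);
END $$;

SELECT * FROM rates_only;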

Is there a more efficient way to do a one-to-many relationship than having the "one" value in every row returned?

I have a pretty basic one-to-many relationship. There are "nodes", and each node relates to multiple "options". I am doing a fairly simple join, and my results are like this:
         content          | optionid |     content
--------------------------+----------+------------------
 This is the node content |        1 | This is option 1
 This is the node content |        2 | This is option 2
However, because it's one-to-many, every row has the same node content: This is the node content. It seems redundant to return that same value with every row when I only need it once. Is there a better way?
The jsonb_object_agg aggregate function seems like a good choice for this.
SELECT
    n.content,
    jsonb_object_agg(o.optionid, o.content)
FROM node n
JOIN option o ON (
    -- whatever the join condition is
)
GROUP BY n.content;
Well, this is how relational databases work; you have to make compromises.
If you think this query generates too much data flowing between the application server and the database, you could split it into two queries: load the content separately, then load the options, avoiding the redundancy. This could also have a performance benefit (less memory, no JOIN, etc.).
On the other hand, this adds latency because two separate queries are run; especially inside a loop this can be notably slower. So it's all about compromises.
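For what it's worth, the two-query version might look something like this (assuming the tables join on a nodeid column, which is a guess):

-- Load the node content once...
SELECT content FROM node WHERE nodeid = $1;

-- ...then load only the options, without repeating the node content on every row.
SELECT optionid, content FROM option WHERE nodeid = $1;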

Making a table with fixed columns versus key-valued pairs of metadata?

I was asked to create a table to store paid-hours data from multiple attendance systems from multiple geographies from multiple sub-companies. This table would be used for high level reporting so basically it is skipping the steps of creating tables for each system (which might exist) and moving directly to what the final product would be.
The request was to have a dimension for each type of hours or pay like this:
date       | employee_id | type          | hours | amount
-----------+-------------+---------------+-------+-------
2016-04-22 | abc123      | regular       |    80 |   3500
2016-04-22 | abc123      | overtime      |     6 |    200
2016-04-22 | abc123      | adjustment    |     1 |     13
2016-04-22 | abc123      | paid time off |    24 |    100
2016-04-22 | abc123      | commission    |       |    600
2016-04-22 | abc123      | gross total   |       |   4413
There are multiple rows per employee, but the thought process is that this will allow us to capture new dimensions if they are added.
The data is coming from several sources and I was told not to worry about the ETL, but just design the ultimate table and make it work for any system. We would provide this format to other people for them to fill in.
I have only seen the raw data from one system, and it looks like this:
date | employee_id | gross_total_amount | regular_hours | regular_amount | OT_hours | OT_amount | classification | amount | hours
It is pretty messy. There are multiple rows per employee, and values like gross_total repeat on each row. There is a classification column which has items like PTO (paid time off), adjustments, empty values, commission, etc. Because of the repeating values, it is impossible to simply sum the data up so that it equals the gross_total_amount.
Anyway, I would prefer a column-based approach where each row describes an employee's paid hours for a cutoff. One problem is that I won't know all of the types of hours that are possible, so I can't necessarily make a table like:
date | employee_id | gross_total_amount | commission_amount | regular_hours | regular_amount | overtime_hours | overtime_amount | paid_time_off_hours | paid_time_off_amount | holiday_hours | holiday_amount
I am more used to data formatted that way, though. The concern is that you might not capture all of the necessary columns, or that something new might be added. (For example, I know there is maternity leave, paternity leave, and bereavement leave, and in other geographies there are labor laws about working at night, etc.)
Any advice? Is the table which was suggested to me from my superior a viable solution?
TAM makes lots of good points, and I have only two additional suggestions.
First, I would generate some fake data in the table as described above, and see if it can generate the required reports. Show your manager each of the reports based on the fake data, to check that they're OK. (It appears that the reports are the ultimate objective, so work back from there.)
Second, I would suggest that you get sample data from as many of the input systems as you can. This is to double check that what you're being asked to do is possible for all systems. It's not so you can design the ETL, or gather new requirements, just testing it all out on paper (do the ETL in your head). Use this to update the fake data, and generate fresh fake reports, and check the reports again.
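As an illustration, mocking up the proposed table only takes a handful of invented rows (the table and column names here are placeholders):

-- A few invented rows in the proposed (date, employee_id, type, hours, amount) shape,
-- just enough to build the reports against and show them to the manager.
INSERT INTO paid_hours (pay_date, employee_id, pay_type, hours, amount) VALUES
    ('2016-04-22', 'abc123', 'regular',       80,   3500),
    ('2016-04-22', 'abc123', 'overtime',       6,    200),
    ('2016-04-22', 'xyz789', 'paid time off', 24,    100),
    ('2016-04-22', 'xyz789', 'commission',   NULL,   600);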
Let me recapitulate what I understand to be the basic task.
You get data from different sources with different structures. Your task is to consolidate them in a single database so you can answer questions about all of these data. I understand the hint about "not to worry about the ETL, but just design the ultimate table" to mean that your consolidated database doesn't need to contain all the detail information that might be present in the original data, just enough information to fulfill the specific requirements of the consolidated database.
This sounds sensible as long as your superior is certain enough about these requirements. In that case, you will reduce the information coming from each source to the consolidated structure.
In any case, you'll have to capture the domain semantics of the data coming in from each source. Lacking access to your domain semantics, I can't clarify the mess of repeating values etc. for you. E.g., if there are detail records and gross-total records, as in your example, it would be wrong to add up the hours of all records, as this would always yield twice the hours actually worked. So someone will have to worry about ETL, namely interpreting each set of records, probably consisting of all entries for an employee and one working day, finding out what they mean, and transforming them into the consolidated structure.
I understand another part of the question to be about the usage of metadata. You can have different columns for notions like holiday leave and maternity leave, or you have a metadata table containing these notions as a key-value pair, and refer to the key from your main table. The metadata way is sometimes praised as being more flexible, as you can introduce a new type (like paternity leave) without redesigning your database. However, you will need to redesign the software filling and probably also querying your tables to make use of the new type. So you'll have to develop and deploy a new software release anyway, and adding a few columns to a table will just be part of that development effort.
There is one major difference between a broad table containing all notions as attributes and the metadata approach. If you want to make sure that, for a time period, either all or none of the values are present, that's easy with the broad table: just make all attributes not null, and you're done. Ensuring this for the metadata solution would mean some rather complicated constraint that may or may not be available depending on the database system you use.
If that's not a main requirement, I would go a pragmatic way and use different columns if I expect only a handful of those types, and a separate key-value table otherwise.
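To make the two shapes concrete, here is a rough sketch of each (all table and column names are invented for illustration):

-- Broad table: one pair of columns per pay type; NOT NULL rules are easy to enforce,
-- but a new pay type means adding columns.
CREATE TABLE paid_hours_wide (
    pay_date        date         NOT NULL,
    employee_id     varchar(20)  NOT NULL,
    regular_hours   decimal(7,2),
    regular_amount  decimal(12,2),
    overtime_hours  decimal(7,2),
    overtime_amount decimal(12,2),
    PRIMARY KEY (pay_date, employee_id)
);

-- Key-value approach: a new pay type is just a new row in pay_type,
-- but "all or nothing per period" rules are harder to express.
CREATE TABLE pay_type (
    pay_type_id int         PRIMARY KEY,
    name        varchar(50) NOT NULL UNIQUE  -- e.g. 'regular', 'overtime', 'paternity leave'
);

CREATE TABLE paid_hours_kv (
    pay_date    date        NOT NULL,
    employee_id varchar(20) NOT NULL,
    pay_type_id int         NOT NULL REFERENCES pay_type,
    hours       decimal(7,2),
    amount      decimal(12,2),
    PRIMARY KEY (pay_date, employee_id, pay_type_id)
);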
All these considerations relied on your superior's assertion (as I understand it) that your consolidated table will only need to fulfill the requirements known today, so you are free to throw original detail information away if it's not needed due to these requirements. I'm wary of that kind of assertion. Let's assume some of your information sources deliver additional information. Then it's quite probable that someday someone asks for a report also containing this information, where present. This won't be possible if your data structure only contains what's needed today.
There are two ways to handle this, i.e. to provide for future needs. You can, after knowing the data coming from each additional source, extend your consolidated database to cover all data structures coming from there. This requires some effort, as different sources might express the same concept using different data, and you would have to consolidate those to make the data comparable. Also, there is some probability that not all of your effort will be worth the trouble, as not all of the detail information you get will actually be needed for your consolidated database. Another more elegant way would therefore be to keep the original data that you import for each source, and only in case of a concrete new requirement, extend your database and reimport the data from the sources to cover the additional details. Prices of storage being low as they are, this might yield an optimal cost-benefit ratio.

DISTINCT COUNT basket analysis between multiple measures with shared dimension

Let's say I have a cube with two different distinct count measures, call them Measure1 and Measure2. Both of these measures contain a common dimension, Dimension1, which is counted by both measures.
What I need to do is return a distinct count of Dimension1 members that exist in both Measure1 and Measure2, after appropriate filtering on each measure as required.
I can define MDX queries for both Measure1 and Measure2 individually and get distinct counts, but I need to be able to "overlap" the result to avoid double-counting the members that exist in both sets.
Note: in the actual scenario, there are more than 2 measures involved, and all MDX queries will be dynamically constructed (the user defines which measures and dimension criteria are included).
Can this be done in SSAS/MDX? If not, is there another Microsoft tool/feature that can? The minimum requirement for the system is SQL Server 2008 R2 Standard Edition.
Honestly I have no idea where to start. Google turned up nothing like this (I saw some basket analysis stuff involving a single measure, but I'm unsure if or how to apply that to my scenario). I'm not an SSAS/MDX/BI expert by any means.
There are two alternatives that I can think of:
Use DRILLTHROUGH with the individual MDX queries and (essentially) COUNT DISTINCT the results.
Use T-SQL on the data warehouse source database. (May be difficult to account for all scenarios efficiently.)
We do have a requirement to also be able to drillthrough, so I'll probably have to implement solution #1 anyway, but it would be nice to have a more efficient way to obtain just the counts, as counts will be needed far more frequently.
I would add a Distinct Count measure based on the Dimension1 Key attribute. I would present it in Excel 2010+ using the Sets MDX feature to filter on Measure1, 2 etc.
I never did find an MDX solution for this.
I went ahead with a solution that queries the data warehouse directly, and it's working pretty well so far after some performance tweaks. This approach may not be suitable for all applications, but it looks like it will work for our particular scenario.
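A sketch of that kind of warehouse query, with made-up fact table and column names, might be:

-- Hypothetical fact tables; counts the Dimension1 members that appear in both,
-- after applying each measure's own filter criteria.
SELECT COUNT(*) AS OverlapCount
FROM (
    SELECT f1.Dimension1Key
    FROM dbo.FactMeasure1 AS f1
    WHERE f1.FilterColumnA = 'criteria for Measure1'   -- placeholder filter
    INTERSECT
    SELECT f2.Dimension1Key
    FROM dbo.FactMeasure2 AS f2
    WHERE f2.FilterColumnB = 'criteria for Measure2'   -- placeholder filter
) AS overlap;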
I would recommend a union, either on the SQL Server side (create a view from the two tables) or on the SSAS side (create a single measure but with different partitions from different sources, for example Partition1 for Credits and Partition2 for Deposits).
For the second way, start by creating a simple Measure1 over this combined ("monstrous") source using the SUM function. Then click on Measure1 and choose to create a new measure using the DistinctCount function.
SSAS will then create a separate new measure group with Measure2 using the DistinctCount function.
It should work.
Let's simplify the problem statement: you want the count of customers who bought both bread and eggs, or who have both a Toyota and a Honda. I faced this issue a long time back and came up with a query design. The performance of these queries was not good; by their nature they open the fact up to the grain level, so all aggregation benefits are lost.
Here is the code. I am counting the customers, based on their names, who ordered ClassicVestS or HLMountainTire along with other products:
WITH
MEMBER [Measures].[CustomersWhoBoughtClassicVestS] AS
    COUNT(
        INTERSECT(
            { NONEMPTY( EXISTING([Customer].[Customer].Children), [Measures].[Internet Order Count] ) },
            { EXTRACT(
                NONEMPTY( ([Customer].[Customer].Children * [Product].[Product].&[471]), [Measures].[Internet Order Count] ),
                [Customer].[Customer]
              ) }
        )
    )
MEMBER [Measures].[CustomersWhoBoughtHLMountainTire] AS
    COUNT(
        INTERSECT(
            { NONEMPTY( EXISTING([Customer].[Customer].Children), [Measures].[Internet Order Count] ) },
            { EXTRACT(
                NONEMPTY( ([Customer].[Customer].Children * [Product].[Product].&[537]), [Measures].[Internet Order Count] ),
                [Customer].[Customer]
              ) }
        )
    )
SELECT
    { [Measures].[CustomersWhoBoughtClassicVestS], [Measures].[CustomersWhoBoughtHLMountainTire] } ON COLUMNS,
    { NONEMPTY( [Product].[Product].Children, [Measures].[Internet Order Count] ) } ON ROWS
FROM [Adventure Works]

Join-Free Table structure for Tags

I'm working on a little blog software, and I'd like to have tags attached to a post. Each Post can have between 0 and infinite Tags, and I wonder if it's possible to do that without having to join tables?
As the number of tags is not limited, I cannot just create n fields (Tag1 to TagN), so another approach (which is apparently the one StackOverflow takes) is to use one large text field and a delimiter, i.e. "<Tag1><Tag2><Tag3>".
The problem there: if I want to display all posts with a tag, I would have to use a "LIKE '%<Tag2>%'" statement, and AFAIK those cannot use any indexes, requiring a full table scan.
Is there any suitable way to solve this?
Note: I know that a separate Tag-Link-Table offers benefits and that I should possibly not worry about performance without measuring etc. I'm more interested in the different ways to design a system.
Wanting to do this without joins strikes me as a premature optimisation. If this table is being accessed frequently, its pages are very likely to be in memory and you won't incur an I/O penalty reading from it, and the plans for the queries accessing it are likely to be cached.
A separate tag table is really the only way to go here. It is THE only way to allow an infinite number of tags.
This sounds like an exercise in denormalization. All that's really needed is a table that can naturally support any query you happen to have, by repeating any information you would otherwise have to join to another table to satisfy. A normalized database for something like what you've got might look like:
Posts:
 PostID  | PostTitle    | PostBody          | PostAuthor
---------+--------------+-------------------+--------------
 1146044 | Join-Free... | I'm working on... | Michael Stum

Tags:
 TagID | TagName
-------+--------------
     1 | Architecture

PostTags:
 PostID  | TagID
---------+-------
 1146044 |     1
Then you can add columns to optimise your queries. If it were me, I'd probably just leave the Posts and Tags tables alone and add extra info to the PostTags join table. Of course, what I add might depend a bit on the queries I intend to run, but I'd probably at least add Posts.PostTitle, Posts.PostAuthor, and Tags.TagName, so that I need only run two queries to show a blog post:
SELECT * FROM `Posts` WHERE `Posts`.`PostID` = $1
SELECT * FROM `PostTags` WHERE `PostTags`.`PostID` = $1
And summarizing all the posts for a given tag requires even less,
SELECT * FROM `PostTags` WHERE `PostTags`.`TagName` = $1
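A sketch of what that denormalized PostTags table could look like (column types are guesses; the point is the copied PostTitle, PostAuthor, and TagName columns plus an index on TagName):

CREATE TABLE PostTags (
    PostID     int          NOT NULL,
    TagID      int          NOT NULL,
    -- denormalized copies of data from Posts and Tags
    PostTitle  varchar(250) NOT NULL,
    PostAuthor varchar(100) NOT NULL,
    TagName    varchar(50)  NOT NULL,
    PRIMARY KEY (PostID, TagID)
);

-- Lets the "all posts for a tag" query use an index instead of a scan.
CREATE INDEX IX_PostTags_TagName ON PostTags (TagName);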
Obviously the downside to denormalization is that you have to do a bit more work to keep the denormalized tables up to date. A typical way of dealing with this is to put some sanity checks in your code that detect when a denormalized result is out of sync by comparing it to other information the code already has available. In the example above, such a check might compare the post titles in the PostTags result set against the title in the Posts result; this doesn't cause an extra query. If there's a mismatch, the program could notify an admin, e.g. by logging the inconsistency or sending an email.
Fixing it is easy (though costly in terms of server workload): throw out the extra columns and regenerate them from the normalized tables. Obviously you shouldn't do this until you have found the cause of the database going out of sync.
If you're using SQL Server, you could use a single text field (varchar(max) seems appropriate) and full-text indexing. Then just do a full-text search for the tag you're looking for.
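A rough sketch of that, assuming the delimited tags live in a Tags column on Posts and the table's primary key index is named PK_Posts (both assumptions):

-- One-time setup: a full-text catalog and index over the delimited Tags column.
CREATE FULLTEXT CATALOG TagCatalog;
CREATE FULLTEXT INDEX ON dbo.Posts (Tags)
    KEY INDEX PK_Posts
    ON TagCatalog;

-- Then finding posts with a given tag is a full-text predicate rather than LIKE '%...%'.
SELECT PostID, PostTitle
FROM dbo.Posts
WHERE CONTAINS(Tags, 'Architecture');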

Resources