Need help understanding alternatives to scd in SSIS - sql-server

I am working on a data warehouse project that will involve integrating data from multiple source systems. I have set up an SSIS package that populates the customer dimension and uses the slowly changing dimension tool to keep track of updates to the customer.
I'm running into some issues. Take this example:
Source system A might have a record that looks like this:
First Name, Last Name, Zipcode
Jane, Doe, 14222
Source system B might have a record for the same client that looks like this:
First Name, Last Name, Zipcode
Jane, Doe, Unknown
If I first import the record from system A, I'll have the first name, last name, and zipcode. Great. Now, if I import the client record from system B, I can do fuzzy matching to recognize that this is the same person and use the slowly changing dimension tool to update the information. But in this case, I'm going to lose the zipcode because the 'Unknown' will overwrite the valid data.
I am wondering if I am approaching this problem in the wrong way. The SCD tool doesn't seem to offer any way of selectively updating attributes based on whether the new data is valid or not. Would a merge statement work better? Am I making some kind of fundamental design mistake that I'm not seeing?
Thanks for any advice!

In my experience the built-in SCD tool is not flexible enough to handle this requirement.
Either a couple of MERGE statements or a series of UPDATE and INSERT statements will probably give you the most flexibility with the logic, and better performance.
There are probably models out there for a MERGE statement for SCD Type 2, but here is the pattern I use:
Merge Target
Using Source
On Target.Key = Source.Key
-- Expire the current row when any tracked attribute has changed.
When Matched And Target.IsCurrent = 1 And (
    Target.NonKeyAttribute <> Source.NonKeyAttribute
    Or IsNull(Target.NonKeyNullableAttribute, '') <> IsNull(Source.NonKeyNullableAttribute, '')
)
Then Update Set SCDEndDate = GetDate(), IsCurrent = 0
-- Keys that are new to the dimension get their first version.
When Not Matched By Target Then
    Insert (Key, ... , SCDStartDate, IsCurrent)
    Values (Source.Key, ..., GetDate(), 1)
-- Keys that have disappeared from the source are expired.
When Not Matched By Source And Target.IsCurrent = 1 Then
    Update Set SCDEndDate = GetDate(), IsCurrent = 0;
Merge Target
Using Source
-- Match only current rows, so keys expired above count as "not matched".
On Target.Key = Source.Key And Target.IsCurrent = 1
-- These will be the changing rows that were expired in first statement.
When Not Matched By Target Then
    Insert (Key, ... , SCDStartDate, IsCurrent)
    Values (Source.Key, ... , GetDate(), 1);
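To illustrate the kind of selective logic that plain UPDATE statements allow for the 'Unknown' zipcode case in the question, here is a minimal sketch. It uses Python's sqlite3 module as a stand-in for SQL Server, and the table and column names (dim_customer, staging_customer, customer_key) are hypothetical. The idea is simply to keep the existing value whenever the incoming one is NULL or 'Unknown'.

```python
import sqlite3

# Hypothetical table and column names; SQLite stands in for SQL Server here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY,
                               first_name TEXT, last_name TEXT, zipcode TEXT);
    CREATE TABLE staging_customer (customer_key INTEGER PRIMARY KEY,
                                   first_name TEXT, last_name TEXT, zipcode TEXT);
    INSERT INTO dim_customer VALUES (1, 'Jane', 'Doe', '14222');       -- from system A
    INSERT INTO staging_customer VALUES (1, 'Jane', 'Doe', 'Unknown'); -- from system B
""")

# Overwrite the zipcode only when the incoming value is usable;
# otherwise keep whatever the dimension already holds.
conn.execute("""
    UPDATE dim_customer
    SET zipcode = (
        SELECT CASE
                 WHEN s.zipcode IS NULL OR s.zipcode = 'Unknown'
                 THEN dim_customer.zipcode
                 ELSE s.zipcode
               END
        FROM staging_customer AS s
        WHERE s.customer_key = dim_customer.customer_key
    )
    WHERE EXISTS (SELECT 1 FROM staging_customer AS s
                  WHERE s.customer_key = dim_customer.customer_key)
""")

zipcode = conn.execute(
    "SELECT zipcode FROM dim_customer WHERE customer_key = 1").fetchone()[0]
print(zipcode)
```

The same CASE expression could go in the change-detection condition of a MERGE, so that an 'Unknown' value does not count as a change in the first place.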

Related

Snowflake JSON unknown keyword error when trying to get distinct values

I have a table in Snowflake where one of the fields, called 'value', is sometimes plain text and sometimes JSON. This field is stored as a string in Snowflake.
I created this view to get only the rows that are in JSON format:
CREATE OR REPLACE VIEW tmp_events AS
SELECT PARSE_JSON(value) as json_data,
id
FROM SessionEvent
WHERE session_event_type_id=7;
Then I flatten the rows to create a new field
CREATE OR REPLACE VIEW tmp_events_step2 AS
SELECT id,
json_data:meta:selected::string AS choice
from tmp_events ,
LATERAL FLATTEN(input => tmp_events.json_data)
WHERE choice IS NOT NULL
Everything runs fine until now, I can preview data from these two views, no error and I get the results I was expecting.
The error comes when I try to get distinct values from choice
SELECT DISTINCT choice from tmp_events_step2;
Error parsing JSON: unknown keyword "brain", pos 6
This name Brain seems to come from my initial table without the WHERE statement.
If I run the query without DISTINCT there is no error.
Weird thing I noticed while trying to debug: when I put a limit in tmp_events_step2, the code works fine again, even though I put a limit that's bigger than the number of rows in the table
CREATE OR REPLACE VIEW tmp_events_step2 AS
SELECT id,
json_data:meta:selected::string AS choice
from tmp_events ,
LATERAL FLATTEN(input => tmp_events.json_data)
WHERE choice IS NOT NULL
LIMIT 10000000;
SELECT DISTINCT choice from tmp_events_step2;
What's the catch? Why does it work only with the limit?
The very simple answer to this is the built-in function TRY_PARSE_JSON()
Er, not. You seem to have problems with the Query optimizer that may do incorrect predicate pushdowns. One way to prevent the optimizer from doing this is to use the secure view option:
CREATE SECURE VIEW tmp_events_step2 ...
and file a support ticket...
We reported this error two years ago, and they said they were not going to fix it, because stopping the optimizer from hoisting the JSON access ahead of the WHERE-clause filters that make the cast valid/safe would have impacted performance.
create table variant_cast_bug(num number, var variant);
insert into variant_cast_bug
select column1 as num, parse_json(column2) as var
from values (1, '{"id": 1}'),
(1, '{"id": 2}'),
(2, '{"id": "text"}')
v;
select * from variant_cast_bug;
select var:id from variant_cast_bug;
select var:id from variant_cast_bug where num = 1;
select var:id::number from variant_cast_bug where num = 1; -- <- broken
select TRY_TO_NUMBER(var:id) from variant_cast_bug where num = 1; -- <- works
Sometimes you can nest the select and it will work, and then you can add another SELECT layer around it, and do some aggregation and the cost explodes again.
The only two safe solutions are a SECURE VIEW, as Hans mentions, but that is a performance nightmare,
or to understand this problem and use TRY_TO_NUMBER or its friends.
At the time this was made worse because JSON boolean values were not valid values to pass to TRY_TO_BOOLEAN.
One of the times we got burnt by this was after a Snowflake release, when code that had been running for a year started getting this error: before the release the query was complex enough that the hoisting did not affect it, and after the release it did. This is where Snowflake are rather responsive; they rolled the release back, and we put TRY_TO on a chunk of already working SQL just to play it safe.
Please submit a support case for this issue.
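The evaluation-order problem described in the answers above can be mimicked outside Snowflake. The following is only a plain-Python analogy, not Snowflake code: the rows, the type_id field, and the try_parse_json helper are all invented for illustration. It shows why a strict parse fails once it runs before the filter, while a tolerant TRY_-style parse gives the same result in either order.

```python
import json

# Invented sample rows: only type_id 7 holds JSON, like the filtered view.
rows = [
    {"type_id": 7, "value": '{"meta": {"selected": "a"}}'},
    {"type_id": 3, "value": "brain dump"},  # plain text, not JSON
]

def try_parse_json(text):
    """Return the parsed value, or None when text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None

# Filter first, then parse: safe, because only JSON rows reach json.loads.
filtered_first = [json.loads(r["value"]) for r in rows if r["type_id"] == 7]

# Parse first (roughly what a hoisted expression amounts to): the strict
# parser raises on the plain-text row even though the filter would drop it.
try:
    parsed_first = [json.loads(r["value"]) for r in rows]
    strict_failed = False
except json.JSONDecodeError:
    strict_failed = True

# The tolerant parser survives either evaluation order.
tolerant = [(r["type_id"], try_parse_json(r["value"])) for r in rows]
choices = [p["meta"]["selected"] for t, p in tolerant if t == 7 and p is not None]
print(strict_failed, choices)
```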

MS Access, use query name as field default value

My department uses a software tool that can use a custom component library sourced from Tables or Queries in an MS Access database.
Table: Components
ID: AutoNumber
Type: String
Mfg: String
P/N: String
...
Query: Resistors
SELECT Components.*
FROM Components
WHERE Components.Type = "Resistors"
Query: Capacitors
SELECT Components.*
FROM Components
WHERE Components.Type = "Capacitors"
These queries work fine for SELECT. But when users add a row to the query, how can I ensure the correct value is saved to the Type field?
Edit #2:
Nope, can't be done. Sorry.
Edit #1:
As was pointed out, I may have misunderstood the question. It's not a wonky question after all, but perhaps an easy one?
If you're asking how to add records to your table while making sure that, for example, "the record shows up in a Resistors query if it's a Resistor", then it's a regular append query that specifies Resistors as your Type.
For example:
INSERT INTO Components ( ID, Type, Mfg )
SELECT 123, 'Resistors', 'Company XYZ'
If you've already tried that and are having problems, it could be because you are using a Reserved Word as a field name which, although it may work sometimes, can cause problems in unexpected ways.
Type is a word that Access, SQL and VBA all use for a specific purpose. It's the same idea as if you used SELECT and FROM as field or table names. (SELECT SELECT FROM FROM).
Here is a list of reserved words that should generally be avoided. (I realize it's labelled Access 2007 but the list is very similar, and it's surprisingly difficult to find a recent 'official' list for Excel VBA.)
Original Answer:
That's kind of a wonky way to do things. The point of databases is to organize in such a way as to prevent duplication of not only data, but queries and code as well.
I made up the programming rule for my own use "If you're doing anything more than once, you're doing it wrong." (That's not true in all cases but a general rule of thumb nonetheless.)
Are the only options "Resistors" and "Capacitors"? (...I hope you're not tracking the inventory of an electronics supply store...) If there are many options, that's even more reason to find an alternative method.
To answer your question, in the Query Design window, it is not possible to return the name of the open query.
Some alternative options:
As @Erik suggested, constrain to a control on a form. Perhaps have a drop-down or option buttons with which the user can select the relevant type. Then your query would look like:
SELECT * FROM Components WHERE Type = Forms![YourFormName]![NameOfYourControl]
In VBA, have the query refer to the value of a variable, for example:
Dim TypeToDel As String
Dim rs As DAO.Recordset
TypeToDel = "Resistors"
' DoCmd.RunSQL only runs action queries, so open the SELECT as a recordset instead.
Set rs = CurrentDb.OpenRecordset("SELECT * FROM Components WHERE Type = '" & TypeToDel & "'")
Not recommended, but you could have the user manually enter the criteria. If your query is like this:
SELECT * FROM Components WHERE Type = [Enter the component type]
...then each time the query is run, it will prompt the user with "Enter the component type".
Similarly, you could have the query prompt for an option, perhaps a single digit or a code, and have an IIf expression in the query criteria choose the appropriate value:
SELECT *
FROM Components
WHERE Type = IIf([Enter 1 for Resistors, 2 for Capacitors, 3 for sharks with frickin' laser beams attached to their heads]=1,'Resistors',IIf([Enter 1 for Resistors, 2 for Capacitors, 3 for sharks with frickin' laser beams attached to their heads]=2,'Capacitors','LaserSharks'));
Note that if you're going to have more than 2 options, the parameter prompt has to appear in the expression more than once, and every occurrence must be spelled identically.
Lastly, if you're still going to take the route of a separate query for each component type, as long as you're making separate queries anyway, why not just put a static value in each one (just like your example):
SELECT * FROM Components WHERE Type = 'Resistors'
There's another wonky answer here but that's just creating even more duplicate information (and more future mistakes).
Side note: Type is a reserved word in Access & VBA; you might be best to choose another. (I usually prefix with a related letter like cType.)
More Information:
Use parameters in queries, forms, and reports
Use parameters to ask for input when running a query
Microsoft Access Tips & Tricks: Parameter Queries
 • Frickin' Lasers

Relational database structure design advice

This is a textual description of data for which I need to create a database design (using SQLite) for an application.
The application needs to keep a record of operations. Each operation has a Name and its list of parameters. Each parameter has its Name and a Value. However, the values of the parameters will change over the lifetime of the app (in fact the user will be able to change them using the GUI) and we want to keep a history of the values which a certain parameter has had. Furthermore, each operation can have multiple parameter sets. A parameter set is like an envelope which encompasses a set of parameter values (which all belong to the same operation) and gives this envelope a unique Number and a non-unique Description.
This is what I have so far:
[Database model image][1]
The database model should allow me to perform these actions on the database data:
Show a list of operations - I know how to do this.
Show a list of parameters for a given operation - I know how to do this.
For a given operation, show all its parameters as columns and show the values of the parameters as rows - each row represents a different parameter value from the history of values. I'm stuck at this one.
For a given operation, show a list of all parameter sets which belong to that operation. I'm stuck at this one too.
For a given operation and for a given parameter set, get the latest values of its parameters. Stuck at this.
I'm not sure if I should re-work my database model or if I should look for proper SQL statements to accomplish the tasks above with the model that I have. Any help is greatly appreciated. Thank you.
EDIT 1
I have re-worked my database model according to helpful advice from @Marek Herman. Thanks to that I am able to accomplish tasks 1) 2) 4).
Now I'm trying to accomplish 5) which should not be that difficult with the current database model. I have this SQL statement:
SELECT Parameter.ParameterIdentifier, ParameterValue.ParameterValue,
ParameterValueVersion.VersionNumber, ParameterValueVersion.ChangedOn
FROM ParameterValueVersion INNER JOIN
(((Operation INNER JOIN Parameter ON Operation.OperationPLC_ID = Parameter.OperationPLC_ID)
INNER JOIN ParameterSet ON Operation.OperationPLC_ID = ParameterSet.OperationPLC_ID)
INNER JOIN ParameterValue ON (ParameterSet.ID = ParameterValue.ParameterSetID) AND
(Parameter.ID = ParameterValue.ParameterID)) ON ParameterValueVersion.ID = ParameterValue.ParameterValueVersionID
WHERE (Operation.OperationPLC_ID=[opID] AND
ParameterSet.ParameterSetNumber=[parSetNum]);
where [opID] and [parSetNum] are the input parameters. This SQL statement actually only joins all these tables together on their PK->FK relationship: Operation, Parameter, ParameterSet, ParameterValue, ParameterValueVersion and filters the rows by specified OperationPLC_ID and ParameterSetNumber.
Here is an example of an output of this SQL statement. Each row shows a name of a parameter, its value, a version number of the value and date of change of that value. Some parameters only have one value (only one version -e.g., "OFFSET"). Some parameters have two values. For example "PREFILLING" has a value of "3" which was input on Oct 20, 2016 (and has a version number 1) and it also has a value of "3.5" which was input on Oct 21, 2016 and has a version number of 2. So I'd like to show only the latest versions of the values of the parameters. Any advice how to modify the SQL statement is much appreciated. Thank you.
EDIT 2
I guess I figured out how to perform 5). I had to study a bit how GROUP BY works. This did the trick:
SELECT Parameter.ParameterIdentifier, last(ParameterValue.ParameterValue) AS ParameterValue, last(ParameterValueVersion.ChangedOn) AS ChangedOn, max(ParameterValueVersion.VersionNumber) AS VersionNumber
FROM ParameterValueVersion INNER JOIN
(((Operation INNER JOIN Parameter ON Operation.OperationPLC_ID = Parameter.OperationPLC_ID)
INNER JOIN ParameterSet ON Operation.OperationPLC_ID = ParameterSet.OperationPLC_ID)
INNER JOIN ParameterValue ON (ParameterSet.ID = ParameterValue.ParameterSetID) AND
(Parameter.ID = ParameterValue.ParameterID)) ON ParameterValueVersion.ID = ParameterValue.ParameterValueVersionID
WHERE (((Operation.OperationPLC_ID)=[opID]) AND ((ParameterSet.ParameterSetNumber)=[parSetNum]))
GROUP BY Parameter.ParameterIdentifier
ORDER BY Parameter.ParameterIdentifier
Now I still need to figure out how to perform task no. 3. I'm gonna study the suggested COALESCE function. Thank you.
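SQLite, which the question targets, has no last() aggregate, so one alternative for task 5 is a correlated subquery on the highest VersionNumber. A minimal sketch using Python's sqlite3 module follows. The single ParameterHistory table is a hypothetical flattening of the joined tables above, and the OFFSET value is invented sample data; the PREFILLING rows mirror the example described earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, flattened stand-in for the joined tables in the question.
    CREATE TABLE ParameterHistory (
        ParameterIdentifier TEXT, ParameterValue TEXT,
        VersionNumber INTEGER, ChangedOn TEXT);
    INSERT INTO ParameterHistory VALUES
        ('OFFSET',     '1',   1, '2016-10-20'),
        ('PREFILLING', '3',   1, '2016-10-20'),
        ('PREFILLING', '3.5', 2, '2016-10-21');
""")

# Keep only the row whose version is the highest for its parameter.
latest = conn.execute("""
    SELECT h.ParameterIdentifier, h.ParameterValue, h.VersionNumber, h.ChangedOn
    FROM ParameterHistory AS h
    WHERE h.VersionNumber = (
        SELECT MAX(h2.VersionNumber)
        FROM ParameterHistory AS h2
        WHERE h2.ParameterIdentifier = h.ParameterIdentifier)
    ORDER BY h.ParameterIdentifier
""").fetchall()

for row in latest:
    print(row)
```

Unlike picking the value and the version with separate aggregates, this returns the value and the date from the same row as the maximum version.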
0) I would connect ParameterSet to Operation and Parameter and not to ParameterValue.
1) okay!
2) okay!
3) I think you can use the COALESCE() function to display the columns and then it should be possible to show all parameters with matching OperationID
4) you can do that if you do point #0
5) same as above I think
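For point 3, one common way to show parameters as columns is conditional aggregation with MAX(CASE ...), used here in place of the COALESCE() suggestion. This is only a sketch under assumptions: it uses Python's sqlite3 module, a hypothetical flattened ParameterHistory table, invented sample values, and it treats each VersionNumber as one output row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical flattened history table with invented sample values.
    CREATE TABLE ParameterHistory (
        ParameterIdentifier TEXT, ParameterValue TEXT, VersionNumber INTEGER);
    INSERT INTO ParameterHistory VALUES
        ('OFFSET', '1', 1), ('PREFILLING', '3', 1), ('PREFILLING', '3.5', 2);
""")

def quote_ident(name):
    """Quote a column alias for SQLite, doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'

# The column list depends on the data, so read the parameter names first.
names = [r[0] for r in conn.execute(
    "SELECT DISTINCT ParameterIdentifier FROM ParameterHistory ORDER BY 1")]

# One MAX(CASE ...) per parameter turns rows into columns.
select_cols = ", ".join(
    f"MAX(CASE WHEN ParameterIdentifier = ? THEN ParameterValue END) AS {quote_ident(n)}"
    for n in names
)
sql = (f"SELECT VersionNumber, {select_cols} FROM ParameterHistory "
       "GROUP BY VersionNumber ORDER BY VersionNumber")
pivot = conn.execute(sql, names).fetchall()

print(["VersionNumber"] + names)
for row in pivot:
    print(row)
```

Parameters that have no value for a given version come back as NULL (None in Python); whether such gaps should be filled with the previous value is a design decision this sketch leaves open.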

Selective PostgreSQL database querying

Is it possible to have selective queries in PostgreSQL which select different tables/columns based on values of rows already selected?
Basically, I've got a table in which each row contains a sequence of two to five characters (tbl_roots), optionally with a length field which specifies how many characters the sequence is supposed to contain (it's meant to be made redundant once I figure out a better way, i.e. by counting the length of the sequences).
There are four tables containing patterns (tbl_patterns_biliteral, tbl_patterns_triliteral, ...etc), each of which corresponds to a root_length, and a fifth table (tbl_patterns) which is used to synchronise the pattern tables by providing an identifier for each row, so row #2 in tbl_patterns_biliteral corresponds to the same row in tbl_patterns_triliteral. The four pattern tables are restricted such that no row in tbl_patterns_(bi|tri|quadri|quinqui)literal can have a pattern_id that doesn't exist in tbl_patterns.
Each pattern table has nine other columns, each of which corresponds to an identifier (root_form).
The last table in the database (tbl_words), contains a column for each of the major tables (word_id, root_id, pattern_id, root_form, word). Each word is defined as being a root of a particular length and form, spliced into a particular pattern. The splicing is relatively simple: translate(pattern, '12345', array_to_string(root, '')) as word_combined does the job.
Now, what I want to do is select the appropriate pattern table based on the length of the sequence in tbl_roots, and select the appropriate column in the pattern table based on the value of root_form.
How could this be done? Can it be combined into a simple query, or will I need to make multiple passes? Once I've built up this query, I'll then be able to code it into a PHP script which can search my database.
EDIT
Here's some sample data (it's actually the data I'm using at the moment) and some more explanations as to how the system works: https://gist.github.com/823609
It's conceptually simpler than it appears at first, especially if you think of it as a coordinate system.
I think you're going to have to change the structure of your tables to have any hope. Here's a first draft for you to think about. I'm not sure what the significance of the "i", "ii", and "iii" are in your column names. In my ignorance, I'm assuming they're meaningful to you, so I've preserved them in the table below. (I preserved their information as integers. Easy to change that to lowercase roman numerals if it matters.)
create table patterns_bilateral (
pattern_id integer not null,
root_num integer not null,
pattern varchar(15) not null,
primary key (pattern_id, root_num)
);
insert into patterns_bilateral values
(1,1, 'ya1u2a'),
(1,2, 'ya1u22a'),
(1,3, 'ya12u2a'),
(1,4, 'me11u2a'),
(1,5, 'te1u22a'),
(1,6, 'ina12u2a'),
(1,7, 'i1u22a'),
(1,8, 'ya1u22a'),
(1,9, 'e1u2a');
I'm pretty sure a structure like this will be much easier to query, but you know your field better than I do. (On the other hand, database design is my field . . . )
Expanding on my earlier answer and our comments, take a look at this query. (The test table isn't even in 3NF, but the table's not important right now.)
create table test (
root_id integer,
root_substitution varchar[],
length integer,
form integer,
pattern varchar(15),
primary key (root_id, length, form, pattern));
insert into test values
(4,'{s,ş,m}', 3, 1, '1o2i3');
This is the important part.
select root_id
, root_substitution
, length
, form
, pattern
, translate(pattern, '12345', array_to_string(root_substitution, ''))
from test;
That query returns, among other things, the translation soşim.
Are we heading in the right direction?
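For readers who want to check the splicing outside the database, here is a small Python sketch that mimics what translate() does in the query above. The helper name pg_translate is made up for this example; it maps each character of the "from" string to the character at the same position in the "to" string and drops "from" characters that have no counterpart.

```python
def pg_translate(text, from_chars, to_chars):
    """Mimic PostgreSQL translate(): map from_chars[i] to to_chars[i];
    characters of from_chars with no counterpart in to_chars are removed."""
    table = {}
    for i, ch in enumerate(from_chars):
        # None tells str.translate to delete the character.
        table.setdefault(ord(ch), to_chars[i] if i < len(to_chars) else None)
    return text.translate(table)

# Same inputs as the test row above: root {s,ş,m} and pattern '1o2i3'.
root_substitution = ["s", "ş", "m"]
word = pg_translate("1o2i3", "12345", "".join(root_substitution))
print(word)  # soşim
```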
Well, that's certainly a bizarre set of requirements! Here's my best guess, but obviously I haven't tried it. I used UNION ALL to combine the patterns of different sizes and then filtered them based on length. You might need to move the length condition inside each of the subqueries for speed reasons, I don't know. Then I chose the column using the CASE expression.
select word,
translate(
case root_form
when 1 then patinfo.pattern1
when 2 then patinfo.pattern2
... up to pattern9
end,
'12345',
array_to_string(root.root, '')) as word_combined
from tbl_words word
join tbl_roots root
on word.root_id = root.root_id
join tbl_patterns pat
on word.pattern_id = pat.pattern_id
join (
select 2 as pattern_length, pattern_id, pattern1, ..., pattern9
from tbl_patterns_biliteral bi
union all
select 3, pattern_id, pattern1, pattern2, ..., pattern9
from tbl_patterns_triliteral tri
union all
...same for quad and quin...
) patinfo
on
patinfo.pattern_id = pat.pattern_id
and array_length(root.root, 1) = patinfo.pattern_length
Consider combining all the different patterns into one pattern_details table with a root_length field to filter on. I think that would be easier than combining them all together with UNION ALL. It might be even easier if you had multiple rows in the pattern_details table and filtered based on root_form. Maybe the best would be to lay out pattern_details with fields for pattern_id, root_length, root_form, and pattern. Then you just join from the word table through the pattern table to the pattern detail that matches all the right criteria.
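The single pattern_details layout suggested above could look roughly like this. It is a sketch under assumptions: it uses Python's sqlite3 module rather than PostgreSQL, stores each root as a plain string instead of an array, uses simplified table names, and the roots, ids, and words are invented sample data (only the pattern strings are taken from this thread). Since SQLite has no translate(), the splicing is done in Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified, hypothetical versions of the tables discussed above.
    CREATE TABLE roots (root_id INTEGER PRIMARY KEY, root TEXT NOT NULL);
    CREATE TABLE pattern_details (
        pattern_id INTEGER, root_length INTEGER, root_form INTEGER, pattern TEXT,
        PRIMARY KEY (pattern_id, root_length, root_form));
    CREATE TABLE words (word_id INTEGER PRIMARY KEY, root_id INTEGER,
                        pattern_id INTEGER, root_form INTEGER);
    INSERT INTO roots VALUES (1, 'kt'), (2, 'sşm');
    INSERT INTO pattern_details VALUES
        (1, 2, 1, 'ya1u2a'), (1, 2, 2, 'ya1u22a'), (1, 3, 1, '1o2i3');
    INSERT INTO words VALUES (10, 1, 1, 2), (11, 2, 1, 1);
""")

# One join picks the pattern matching the id, the root length, and the form.
rows = conn.execute("""
    SELECT w.word_id, r.root, d.pattern
    FROM words AS w
    JOIN roots AS r ON r.root_id = w.root_id
    JOIN pattern_details AS d
      ON d.pattern_id = w.pattern_id
     AND d.root_length = length(r.root)
     AND d.root_form = w.root_form
    ORDER BY w.word_id
""").fetchall()

# Splice each root into its pattern: digit n is replaced by the n-th root letter.
combined = {
    word_id: pattern.translate(str.maketrans("12345"[:len(root)], root))
    for word_id, root, pattern in rows
}
print(combined)
```

With this layout there is no UNION ALL and no CASE over nine columns; choosing the table and the column both become ordinary join conditions.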
Of course, maybe I've completely misunderstood what you're looking for. If so, it would be clearer if you posted some example data and an example result.

SQL Query Notifications and GetDate()

I am currently working on a query that is registered for Query Notifications. In accordance w/ the rules of Notification Services, I can only use deterministic functions in my queries set up for subscription. However, GetDate() (and almost any other means that I can think of) is non-deterministic. Whenever I pull my data, I would like to be able to limit the result set to only relevant records, which is determined by the current day.
Does anyone know of a work around that I could use that would allow me to use the current date to filter my results but not invalidate the query for query notifications?
Example Code:
SELECT fcDate as RecordDate, fcYear as FiscalYear, fcPeriod as FiscalPeriod, fcFiscalWeek as FiscalWeek, fcIsPeriodEndDate as IsPeriodEnd, fcPeriodWeek as WeekOfPeriod
FROM dbo.bFiscalCalendar
WHERE fcDate >= GetDate() -- This line invalidates the query for notification...
Other thoughts:
We have an application controls table in our database that we use to store application-level settings. I had thought to write a small script that keeps a record up to date w/ the current smalldatetime. However, my join to this table is failing for notification as well and I am not sure why. I surmise that it has something to do w/ me specifying a text type (the column name), which is frustrating.
Example Code 2:
SELECT fcDate as RecordDate, fcYear as FiscalYear, fcPeriod as FiscalPeriod, fcFiscalWeek as FiscalWeek, fcIsPeriodEndDate as IsPeriodEnd, fcPeriodWeek as WeekOfPeriod
FROM dbo.bFiscalCalendar
INNER JOIN dbo.xApplicationControls ON fcDate >= acValue AND acName = N'Cache_CurrentDate'
Does anyone have any suggestions?
EDIT: Here is a link on MSDN that gives the rules for Notification Services
As it turns out, I figured out the solution. Basically, I was invalidating my query attempts because I was casting a value as a DateTime, which marks the query as non-deterministic. Even if you don't explicitly write a cast but do something akin to:
RecordDate = 'date_string_value'
you still end up w/ a date cast. Hopefully this will help out someone else who hits this issue.
This link helped me quite a bit.
http://msdn.microsoft.com/en-us/library/ms178091.aspx
A good way to bypass this is simply to create a view that just says "SELECT GetDate() AS Now", then use the view in your query.
EDIT : I see nothing about not using user-defined functions (which is what I've used the 'view today' bit in). So can you use a UDF in the query that points at the view?
