Merge vs Synchronize databases

I know, I know... I'm not the first to ask this question, and I have gone through numerous posts (on SO as well), but I'm still dissatisfied. I need to merge tables in two databases that have identical schema/structure; see the example below. I used a trial of Redgate's SQL Data Compare, but that tool seems to only synchronize Database B to look like Database A, and it often clobbers the data in Database B. If you know of other software that can do a "true DB merge" effectively (note that I DO have foreign-key relationships set up), that's fine. Otherwise, how can I do this quickly and reliably in SQL?
Database A:

PK   Rank
1    Private
5    Sergeant

ID   Name     RankID
54   Joe      1
60   Frank    1
63   Robert   5

Database B:

PK   Rank
2    Private
3    Corporal
4    Sergeant
6    Lieutenant

ID   Name     RankID
40   Moe      2
45   Steve    2
67   Max      3
78   Tom      4
80   Peter    6

Ideal Merged Database:

PK   Rank
1    Private
5    Sergeant
10   Corporal
11   Lieutenant

ID    Name     RankID
54    Joe      1
60    Frank    1
63    Robert   5
100   Moe      1
101   Steve    1
102   Max      10
103   Tom      5
104   Peter    11
Sorry for the formatting (I had a rough time aligning columns). If it's still not clear what I'm looking for, please let me know.
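For the "do it in SQL" route, here is one hedged sketch of the pattern in T-SQL, assuming both databases sit on the same instance. The names [Rank](PK, Rank) and Person(ID, Name, RankID), the DatabaseB prefix, and the IDENTITY keys are all assumptions, since the question doesn't name them. The idea: merge the lookup table first while building an old-to-new key map, then copy the child rows through that map.

-- Sketch only: all object names are hypothetical. Run in Database A.
BEGIN TRANSACTION;

DECLARE @RankMap TABLE (OldRankID INT, NewRankID INT);

-- Ranks that B and A already share, matched on the description text.
INSERT INTO @RankMap (OldRankID, NewRankID)
SELECT b.PK, a.PK
FROM DatabaseB.dbo.[Rank] AS b
JOIN dbo.[Rank] AS a ON a.[Rank] = b.[Rank];

-- Ranks only B has: insert them into A and capture the newly generated keys.
MERGE dbo.[Rank] AS a
USING DatabaseB.dbo.[Rank] AS b
   ON a.[Rank] = b.[Rank]
WHEN NOT MATCHED THEN
    INSERT ([Rank]) VALUES (b.[Rank])
OUTPUT b.PK, inserted.PK INTO @RankMap (OldRankID, NewRankID);

-- Copy B's people; new IDs come from the IDENTITY, RankIDs are remapped.
INSERT INTO dbo.Person (Name, RankID)
SELECT b.Name, m.NewRankID
FROM DatabaseB.dbo.Person AS b
JOIN @RankMap AS m ON m.OldRankID = b.RankID;

COMMIT;

The same pattern extends to deeper foreign-key chains: merge parent tables first, keep an old-to-new key map per table, and remap each child table's foreign keys as you copy it, so nothing in either database gets clobbered.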

Related

How to use a temp table for ETL when populating a star schema?

I have scripts that pull data from the RDBMS and populate the data warehouse, and they work. I was wondering where a temp table comes in. What exactly are the steps of ETL? Even though my data warehouse is populated, my teacher says that we need to use a temp table. Why is it important?
Please help me; I am very confused right now. Thank you.
We need to pull the data from the databases of two different offices in two different locations. I will give the details of the tables below.
lds_job_role
job_role_id: INTEGER, job_role_desc: VARCHAR,
key_skill_1: INTEGER, key_skill_2: INTEGER, key_skill_3: INTEGER,
recommended_sal: INTEGER

lds_account
account_id: INTEGER, acc_name: VARCHAR, acc_postcode: VARCHAR

lds_placement
placement_id: INTEGER, plt_short_desc: VARCHAR,
plt_required_start_date: DATE, plt_estimated_end_date: DATE,
plt_actual_start_date: DATE, plt_renewal_no: INTEGER,
plt_to_permanent: VARCHAR, max_salary: INTEGER,
min_salary: INTEGER, actual_salary: INTEGER

mch_job_role
job_role_id: INTEGER, job_role_desc: VARCHAR, recommended_sal: INTEGER

mch_account
account_id: INTEGER, acc_name: VARCHAR, acc_postcode: VARCHAR

mch_placement
placement_id: INTEGER, plt_short_desc: INTEGER,
plt_required_start_date: DATE, plt_estimated_end_date: DATE,
plt_actual_start_date: DATE, plt_actual_end_date: DATE,
plt_renewal_no: INTEGER, plt_to_permanent: VARCHAR,
max_salary: INTEGER, min_salary: INTEGER,
actual_salary: INTEGER, supervisor_name: VARCHAR
Below are the facts and dimensions of the star schema:

job_role_dim
job_role_id, job_role_desc

time_dim
time_id, year

account_dim
account_id, account_name

fact_accounts
Report_id, no_of_placements, salary, FK1_time_id, FK2_account_id, FK3_job_role_id
The exercise tells us to "deal with data quality issues, measures for FACTs, identifiers etc." I THINK I have already done that with the scripts, but I do not know how to show it. Perhaps that is why the temp table is required?
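Since the exercise seems to hinge on where the cleaning happens, here is a hedged T-SQL sketch of how a temp (staging) table usually sits between extract and load. Only lds_placement, mch_placement, time_dim and fact_accounts are names from the exercise; #stg_placement and the two clean-up rules are made-up placeholders.

-- Sketch only: a staging step between the source offices and the warehouse.

-- 1. EXTRACT: land raw rows from both offices in one temp/staging table.
SELECT placement_id, actual_salary, plt_actual_start_date, 'LDS' AS source_office
INTO   #stg_placement
FROM   lds_placement
UNION ALL
SELECT placement_id, actual_salary, plt_actual_start_date, 'MCH'
FROM   mch_placement;

-- 2. TRANSFORM: apply the data-quality fixes here, so neither the sources nor
--    the warehouse ever see half-cleaned rows (rules below are examples only).
DELETE FROM #stg_placement WHERE placement_id IS NULL;
UPDATE #stg_placement SET actual_salary = NULL WHERE actual_salary <= 0;

-- 3. LOAD: aggregate the cleaned rows into the fact table, resolving surrogate
--    keys against the dimensions (only the time key is shown; the account and
--    job-role keys would be looked up the same way where the source has them).
INSERT INTO fact_accounts (no_of_placements, salary, FK1_time_id)
SELECT COUNT(*), SUM(s.actual_salary), t.time_id
FROM   #stg_placement AS s
JOIN   time_dim AS t ON t.[year] = YEAR(s.plt_actual_start_date)
GROUP BY t.time_id;

DROP TABLE #stg_placement;

That staging step is probably what your teacher is after: it gives you a visible place to show the quality fixes and to reconcile the two offices' data before anything is written to the warehouse tables, and it can be truncated or dropped on every run.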

Best databasing practice for tracing non-homogenous events

Suppose I have the following event data scheme:
event_record_unique_id: long
event_timestamp: long
session_id: long
event_id: int
event_data: data # concrete type depends on event_id
... so the contents of "data" depend on the event_id; with, let's say, 500 event_ids this leads to 200 different concrete data types for "data". For example:
{
  event_record_unique_id: 17126721
  event_timestamp: 1234
  session_id: 3452
  event_id: 50
  event_data: {
    user_id: 123
    page_id: 789
  }
}
{
  event_record_unique_id: 17126723
  event_timestamp: 1234
  session_id: 3454
  event_id: 51
  event_data: {
    user_id: 124
    button_id: 789
  }
}
{
  event_timestamp: 1234
  session_id: 3454
  event_id: 51
  event_data: {
    crash_report: "text"
    device_id: "12312"
  }
}
Also:
many of the event_data attributes appear in many of the concrete event_data objects
I need to perform indexed searches on some of the event_data attributes (e.g. find all the records where user_id = X)
there is a continuing need to keep adding event types and new attributes
the above data structure can always be trivially flattened, so that a single record is represented equivalently as a row with N columns (attribute name/type collisions are resolved by renaming attributes)
The naive RDBMS approach would involve making ~500 tables (one per concrete type of "data"). I've discounted this approach (an excessive waste of human effort in modelling). Plus, I cannot easily search all records by user_id, since user_id appears in very many tables.
Flattening the structure in an RDBMS is also quite costly (roughly N-8 of the N columns in each row are NULL and contain no information).
MongoDB-type document database solutions appear to be a good fit; however, the space cost seems quite high if attribute names are stored with each record, not much better than an RDBMS. They do, however, allow me to index by fields in the data object.
For me, an ideal representation of this data would be a table that is optimized to allow rows with many null elements (e.g. by keeping an active-column bitmask per row), or a document DB in which the document collection maintains a library of document schemas used to enable compacting the data (with each document holding a reference to its schema).
What kind of database would people recommend for the above example case?
MS SQL Server 2008 and up have Sparse Columns. Up to 30,000 can be added to a table, and they can be indexed (filtered indexes are recommended). Or so says BOL; I have not used them myself. This would result in a single very large table that might support what you need.
With that said, I don't know that it would be particularly efficient. Some math:
Assume 10 rows a second
becomes 10*60*60*24 = 864,000 rows a day
or 315,360,000 rows a year
with a very rough over-estimate of 50 bytes a row
is about 14GB a year
for how many years do you have to keep the data?
and double that if it's more like 20 rows per second
So storage doesn't seem too far out of line... but I don't know; you want to work up some serious size projections. And that's just storage: what do you want or need to do with the data? Is retrieval time for specific rows important? What about analysis and data mining? I'm a SQL guy through and through, and I think it could be done, but this is pretty much the kind of problem that Hadoop and NoSQL solutions were devised for, and it could well be worth your time to thoroughly investigate those options.
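To make the sparse-column suggestion concrete, a minimal sketch (hypothetical table and column names, assuming SQL Server 2008 or later): one wide table where the flattened event_data attributes are declared SPARSE, plus a filtered index so the "all records where user_id = X" search stays cheap.

-- Sketch only: names are made up; requires SQL Server 2008+.
CREATE TABLE dbo.event_record (
    event_record_unique_id BIGINT        NOT NULL PRIMARY KEY,
    event_timestamp        BIGINT        NOT NULL,
    session_id             BIGINT        NOT NULL,
    event_id               INT           NOT NULL,
    -- one SPARSE column per flattened event_data attribute; NULLs cost almost nothing
    user_id                INT           SPARSE NULL,
    page_id                INT           SPARSE NULL,
    button_id              INT           SPARSE NULL,
    crash_report           NVARCHAR(MAX) SPARSE NULL,
    device_id              VARCHAR(32)   SPARSE NULL
);

-- Filtered index: only the rows that actually carry a user_id are indexed.
CREATE INDEX IX_event_record_user_id
    ON dbo.event_record (user_id)
    WHERE user_id IS NOT NULL;

-- "find me all the records where user_id = X"
SELECT event_record_unique_id, event_timestamp, event_id
FROM   dbo.event_record
WHERE  user_id = 123;

BOL also documents column sets (an XML column exposing all sparse columns at once), which may help with the steady stream of new attributes, though the storage projections above still apply.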

Are there any modern languages which allow record level database access?

Folks who are familiar with COBOL and languages of that era may remember writing code in this style:
While records exist in table A
    Read a record from table A
    If some condition
        Read records in table B until match found
        If some condition in record B
            Read a record in table C
repeat ad nauseam
Our company is just starting to talk about updating our COBOL codebase to something more modern and any conversion would be much easier if we can continue to use record-level access, at least during the transition. Rewriting everything in a new language and converting everything to SQL might be too much to undertake.
Is there any modern language/database combination out there that will give us record-level access to our data?
The answer depends on specific details of your situation. Some form of record-level access should be available in most modern languages. Here is an example in Python, assuming the data files are what Cobol would call "organization line sequential". Notice the syntax is not too different from your example. Depending on how the data files are structured, you might need to use something like tableB.seek(0) to restart searching at the start of the file.
# Stub predicates; real logic would inspect fields parsed from each record.
def some_Condition(rowA):
    return True

def rowA_rowB_match(rowA, rowB):
    return True

def some_condition_in_record_B(rowB):
    return True

tableA = open('tableA.txt', 'r')
tableB = open('tableB.txt', 'r')
tableC = open('tableC.txt', 'r')

for rowA in tableA:
    if some_Condition(rowA):
        tableB.seek(0)                      # restart the search at the start of table B
        for rowB in tableB:
            if rowA_rowB_match(rowA, rowB):
                if some_condition_in_record_B(rowB):
                    for rowC in tableC:
                        pass                # process rowC ... repeat ad nauseam

tableA.close()
tableB.close()
tableC.close()

Multi-valued Database (UniVerse) -- SM (MV) vs SM (VS) and ASSOC()

I have a question about page 16 of IBM's Nested Relational Database White Paper. I'm confused about why, in the CREATE command below, they use MV/MS/MS rather than MV/MV/MS, when both ORDER_# and PART_# are one-to-many relationships. I don't understand what value vs. sub-value means in non-1NF database design. I'd also like to know more about the ASSOC() clause.
Pg 16 of IBM's Nested Relational Database White Paper (slight whitespace modifications)
CREATE TABLE NESTED_TABLE (
    CUST#     CHAR (9)   DISP ("Customer #"),
    CUST_NAME CHAR (40)  DISP ("Customer Name"),
    ORDER_#   NUMBER (6) DISP ("Order #") SM ("MV") ASSOC ("ORDERS"),
    PART_#    NUMBER (6) DISP ("Part #")  SM ("MS") ASSOC ("ORDERS"),
    QTY       NUMBER (3) DISP ("Qty.")    SM ("MS") ASSOC ("ORDERS")
);
The IBM nested relational databases implement nested tables as repeating attributes and
repeating groups of attributes that are associated. The SM clauses specify that the attribute is either repeating (multivalued--"MV") or a repeating group (multi-subvalued--"MS"). The ASSOC clause associates the attributes within a nested table. If desired, the IBM nested relational databases can support several nested tables within a base table. The following standard SQL statement would be required to process the 1NF tables of Figure 5 to produce the report shown in Figure 6:
SELECT CUSTOMER_TABLE.CUST#, CUST_NAME, ORDER_TABLE.ORDER_#, PART_#, QTY
FROM CUSTOMER_TABLE, ORDER_TABLE, ORDER_CUST
WHERE CUSTOMER_TABLE.CUST_# = ORDER_CUST.CUST_#
  AND ORDER_CUST.ORDER_# = ORDER_TABLE.ORDER_#;
Nested Table

Customer #   Customer Name   Order #   Part #   Qty.
AA2340987    Zedco, Inc.     93-1123   037617   81
                                       053135   36
                             93-1154   063364   32
                                       087905   39
GV1203948    Alphabravo      93-2321   006776   72
                                       055622   81
                                       067587   29
MT1238979    Trisoar         93-2342   005449   33
                                       036893   52
                                       06525    29
                             93-4596   090643   33
I'll go ahead and answer my own question. While perusing IBM's UniVerse SQL Administration for DBAs, I came across code for CREATE TABLE on pg 55.
ACT_NO INTEGER FORMAT '5R' PRIMARY KEY
BADGE_NO INTEGER FORMAT '5R' PRIMARY KEY
ANIMAL_ID INTEGER FORMAT '5L' PRIMARY KEY
(see the distracting side note below) This amused me at first, but essentially I believe this to be a column-level directive equivalent to a table-level directive like PRIMARY ( ACT_NO, BADGE_NO, ANIMAL_ID ).
Later on page 5-19, I saw this
ALTER TABLE LIVESTOCK.T ADD ASSOC VAC_ASSOC (
VAC_TYPE KEY, VAC_DATE, VAC_NEXT, VAC_CERT
);
Which leads me to believe that tacking on ASSOC (VAC_ASSOC) to a column would be the same... like this
CREATE TABLE LIVESTOCK.T (
VAC_TYPE ... ASSOC ("VAC_ASSOC")
VAC_DATE ... ASSOC ("VAC_ASSOC")
VAC_NEXT ... ASSOC ("VAC_ASSOC")
VAC_CERT ... ASSOC ("VAC_ASSOC")
);
Anyway, I'm not 100% sure I'm right, but I'm guessing the order doesn't matter, and that rather than these being an intransitive association they're just an order-insensitive grouping.
Onward! On to the second part of the question, pertaining to MS and MV: I cannot for the life of me figure out where the hell IBM got this syntax from. I believe it to be imaginary. I don't have access to a dev machine I can play on to test this out, but I can't find it (the term MV) in the old 10.1 or the new UniVerse 10.3 SQL Reference.
Side note for those not used to UniVerse: the 5R and 5L mean 5 characters, right- or left-justified. That's right, a display feature built into the table metadata... Google for UniVerse FORMAT (or FMT) for more info.
Just so you know, Attribute, Multivalue and Sub-Multivalue come from the way they structure their data.
Essentially, all data is stored in a tree of sorts.
UniVerse is a multivalue database. Generally, it does not work the same way as relational SQL databases do.
Each record can have multiple attributes.
Each attribute can have multiple multivalues.
Each multivalue can have multiple sub-multivalues.
So, if I have a record called FRED
Then FRED<1,2,3> refers to the 1st attribute, 2nd multivalue position and 3rd subvalue position.
To read more about it, you need to learn more about how UniVerse works. The SQL section is just a side part of it. I suggest you read the other manuals to understand what you are working with.
EDIT
Essentially, the code above is telling you that:
There may be multiple orders per client. These are stored at the MV level in the 'table'.
There may be multiple parts per order. These are stored at the MS level in the 'table'.
There may be multiple qtys per order. These are also stored at the MS level in the 'table'. Since they are at the same level, although they are 1-n with respect to orders, they are 1-1 with respect to parts.
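For contrast with the nested CREATE in the question, here is a rough guess at the 1NF base tables that the white paper's standard SQL SELECT joins. The paper's actual Figure 5 isn't reproduced above, so the column types are assumptions lifted from the nested definition, and identifiers containing # may need quoting in some dialects.

-- Sketch only: a plain-SQL 1NF decomposition (not the paper's actual Figure 5).
CREATE TABLE CUSTOMER_TABLE (
    CUST_#    CHAR(9) PRIMARY KEY,
    CUST_NAME CHAR(40)
);

CREATE TABLE ORDER_CUST (                 -- one row per order: the MV level
    ORDER_#   NUMERIC(6) PRIMARY KEY,
    CUST_#    CHAR(9) REFERENCES CUSTOMER_TABLE (CUST_#)
);

CREATE TABLE ORDER_TABLE (                -- one row per order line: the MS level
    ORDER_#   NUMERIC(6) REFERENCES ORDER_CUST (ORDER_#),
    PART_#    NUMERIC(6),
    QTY       NUMERIC(3),
    PRIMARY KEY (ORDER_#, PART_#)
);

Seen that way, the MV level (ORDER_#) corresponds to one row per order per customer, and the MS level (PART_# and QTY) to one row per line within an order, which is the value vs. sub-value distinction the question asks about.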

SSIS/VB.NET Equivalent of SQL IN (anonymous array.Contains())

I've got some SQL which performs complex logic on combinations of GL account numbers and cost centers like this:
WHEN (#IntGLAcct In (
882001, 882025, 83000154, 83000155, 83000120, 83000130,
83000140, 83000157, 83000010, 83000159, 83000160, 83000161,
83000162, 83000011, 83000166, 83000168, 83000169, 82504000,
82504003, 82504005, 82504008, 82504029, 82530003, 82530004,
83000000, 83000100, 83000101, 83000102, 83000103, 83000104,
83000105, 83000106, 83000107, 83000108, 83000109, 83000110,
83000111, 83000112, 83000113, 83100005, 83100010, 83100015,
82518001, 82552004, 884424, 82550072, 82552000, 82552001,
82552002, 82552003, 82552005, 82552012, 82552015, 884433,
884450, 884501, 82504025, 82508010, 82508011, 82508012,
83016003, 82552014, 81000021, 80002222, 82506001, 82506005,
82532001, 82550000, 82500009, 82532000))
Overall, the whole thing performs poorly in a UDF, especially when it's all nested and the order of the steps matters, etc. I can't make it table-driven just yet, because the business logic is so terribly convoluted.
So I'm doing a little exploratory work in moving it into SSIS to see about doing it in a little bit of a different way. Inside my script task, however, I've got to use VB.NET, so I'm looking for an alternative to this:
Select Case IntGLAcct = 882001 OR IntGLAcct = 882025 OR ...
Which is obviously a lot more verbose, and would make it terribly hard to port the process.
Even something like ({90605, 90607, 90610} AS List(Of Integer)).Contains(IntGLAcct) would be easier to port, but I can't get the initializer to give me an anonymous array like that. And there are so many of these little collections, I'm not sure I can create them all in advance.
It really all NEEDS to be in one place. The business changes this logic regularly. My strategy was to use the udf to mirror their old "include" file, but performance has been poor. Now each of the functions takes just 2 or three parameters. It turns out that in a dark corner of the existing system they actually build a multi-million row table of all these results - even though the pre-calced table is not used much.
So my new experiment is (since I'm still building the massive cross-join table to reconcile that part of the process) to go ahead and use the table instead of the code, but to populate this table during an SSIS phase instead of calling the UDF 12 million times - because my UDF version basically stopped working within a reasonable time frame and the DBAs are not much help right now. Yet I know SSIS can process these rows pretty efficiently, because each month I bring in the known-good results of dozens of multi-million-row tables from the legacy system in minutes AND run queries to reconcile that there are no differences with the new versions.
The SSIS code would theoretically become the keeper of the business logic, and the efficient table would be built from that (based on all known parameter combinations). Of course, if I can simplify the logic down to a real logic table, that would be the ultimate design - but that's not really foreseeable at this point.
Try this:
Array.IndexOf(New Integer() {90605, 90607, 90610}, IntGLAcct) > -1
What if you used a conditional split transform on your incoming data set and then used expressions or something similar (I'm not sure if your GL Accounts are fixed or if you're going to dynamically pass them in) to apply to the results? You can then take the resulting data from that and process as necessary.
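Since the longer-term goal described above is a table-driven version, here is one hedged SQL sketch of that direction. GLAccountSet, SourceRows and the RULE_1 set name are hypothetical: keep each hard-coded IN list as rows keyed by a set name, and turn the branch into a membership test, so the business can change the lists without touching code.

-- Sketch only: hypothetical lookup table replacing the hard-coded IN lists.
CREATE TABLE dbo.GLAccountSet (
    SetName   VARCHAR(50) NOT NULL,
    IntGLAcct INT         NOT NULL,
    PRIMARY KEY (SetName, IntGLAcct)
);

INSERT INTO dbo.GLAccountSet (SetName, IntGLAcct)
VALUES ('RULE_1', 882001), ('RULE_1', 882025), ('RULE_1', 83000154);   -- etc.

-- The "WHEN (#IntGLAcct IN (...))" branch becomes a membership test:
SELECT t.*
FROM   dbo.SourceRows AS t
WHERE  EXISTS (SELECT 1
               FROM   dbo.GLAccountSet AS s
               WHERE  s.SetName   = 'RULE_1'
                 AND  s.IntGLAcct = t.IntGLAcct);

Inside the SSIS script task itself, the Array.IndexOf answer above covers the VB.NET side; the lookup-table form only pays off once the convoluted logic can be expressed as set membership.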
