How to prevent SymmetricDS from sending outdated updates?

I’m using the sync_on_incoming_batch option.
All nodes are offline.
Store1 updated the price to 1.00.
Store2 updated the price to 2.00.
Store1 sent 1.00 to the server.
The server sent 1.00 to Store2 (so now Store2 has 1.00, but sym_data still has 2.00 queued to send).
Store2 sent 2.00 to the server.
The server sent 2.00 to Store1.
In the end it looked like this:
Store1 has 2.00
Server has 2.00
Store2 has 1.00
Everyone should have 1.00, because Store1 made the update first.
I wanted a way to disregard changes with dates prior to the new data received...

I think that the NEWER_WINS conflict resolution, based on a column holding the timestamp of the update, could help in this scenario: https://www.symmetricds.org/doc/3.13/html/user-guide.html#_conflicts
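A rough sketch of what that configuration might look like, assuming the synced table carries a last_update_time column that every node sets on each change, and node group IDs named 'store' and 'corp'. All of these names, and the exact sym_conflict column list, should be checked against your version's documentation:

```sql
-- Hypothetical sym_conflict row: detect conflicts by comparing the row's
-- timestamp column and let the newer change win. The conflict_id, node
-- group IDs and the column name are assumptions for this example.
insert into sym_conflict
  (conflict_id, source_node_group_id, target_node_group_id,
   detect_type, detect_expression, resolve_type, ping_back,
   create_time, last_update_time)
values
  ('price-newer-wins', 'store', 'corp',
   'USE_TIMESTAMP', 'last_update_time', 'NEWER_WINS', 'SINGLE_ROW',
   current_timestamp, current_timestamp);
```

With timestamp-based detection, an incoming row whose timestamp is older than the one already stored is discarded instead of overwriting newer data, which is essentially the "disregard changes with older dates" behaviour you describe. The ping_back setting controls whether the winning row is sent back to the node whose change lost, so the nodes converge; verify its semantics for your version.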

Related

Optimisation of Fact Read & Write Snowflake

All,
I am looking for some tips with regard to optimizing an ELT on a Snowflake fact table with approx. 15 billion rows.
We get approximately 30,000 rows every 35 minutes, like the sample below; we always get 3 Account Dimension Key values, i.e. Sales, COGS & MKT.
Finance_Key | Date_Key   | Department_Group_Key | Scenario_Key | Account_Key | Value | IsCurrent
=================================================================================================
001         | 2019-01-01 | 001                  | 0012         | SALES_001   | 100   | Y
001         | 2019-01-01 | 001                  | 0012         | COGS_001    | 300   | Y
001         | 2019-01-01 | 001                  | 0012         | MKT_001     | 200   | Y
This is then PIVOTED based on Finance Key and Date Key and loaded into another table for reporting, like the one below
Finance_Key | Date_Key   | Department_Group_Key | Scenario_Key | SALES | COGS | MKT | IsCurrent
================================================================================================
001         | 2019-01-01 | 001                  | 0012         | 100   | 300  | 200 | Y
At times an adjustment is made for just 1 Account key:
Finance_Key | Date_Key   | Department_Group_Key | Scenario_Key | Account_Key | Value
=====================================================================================
001         | 2019-01-01 | 001                  | 0012         | SALES_001   | 50
Hence we have to do this
Finance_Key | Date_Key   | Department_Group_Key | Scenario_Key | Account_Key | Value | IsCurrent
=================================================================================================
001         | 2019-01-01 | 001                  | 0012         | SALES_001   | 100   | X
001         | 2019-01-01 | 001                  | 0012         | COGS_001    | 300   | Y
001         | 2019-01-01 | 001                  | 0012         | MKT_001     | 200   | Y
001         | 2019-01-01 | 001                  | 0012         | SALES_001   | 50    | Y
And the resulting value should be
Finance_Key | Date_Key   | Department_Group_Key | Scenario_Key | SALES | COGS | MKT
====================================================================================
001         | 2019-01-01 | 001                  | 0012         | 50    | 300  | 200
However, my question is how to go about optimizing the query that scans and updates the pivoted table of approx. 15 billion rows in Snowflake.
This is mostly about optimizing the read and the write.
Any pointers?
Thanks
So a longer form answer:
I am going to assume that the Date_Key values being far in the past is just a feature of the example data, because if you have 15B rows, and every 35 minutes you have ~10K (30K / 3) updates to apply, and they span the whole date range of your data, then there is very little you can do to optimize it. Snowflake's Query Acceleration Service might help with the IO.
But the primary way to improve total processing time is to process less.
For example, in my old job we had IoT data whose messages could arrive up to two weeks late, and we de-duplicated all messages on load (which is effectively a similar process) as part of our pipeline error handling. We found that handling batches with a minimum date of minus two weeks against the full message tables (which also had billions of rows) spent most of the time reading/writing those tables. By altering the ingestion to sideline messages older than a couple of hours and deferring their processing until "midnight", we could get the processing of all the timely points in a batch done in under 5 minutes (we did a large amount of other processing in that interval). We could then turn the warehouse cluster size down outside of core data hours and use the saved credits to run a bigger instance at the midnight "catch-up" to bring all of the day's sidelined data on board. This eventually-consistent approach worked very well for our application.
So if your destination table is well clustered, so that the reads/writes only touch a fraction of the data, then this should be less of a problem for you. But by the nature of asking the question, I assume this is not the case.
If the table's natural clustering is unaligned with the loaded data, ask whether that is because the destination table needs to be a different shape for the critical reads it serves, and then decide which is more cost/time sensitive. Another option is to have the destination table clustered by the date (again assuming the fresh data covers a small window of time) and a materialized view/table on top of it with a different sort, so that the rewrite is done for you by Snowflake. This is not a free lunch, but it might allow faster upsert times as well as faster read performance, assuming both are time sensitive.
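To make the "process less" point concrete, here is a minimal sketch of the kind of incremental upsert I have in mind, assuming the ~30K staged rows land in a table I am calling STG_FACT_BATCH and the pivoted reporting table is called FACT_PIVOT (both names are illustrative), and that FACT_PIVOT is clustered on DATE_KEY so a batch covering a narrow date window only touches a fraction of the micro-partitions:

```sql
-- Sketch only: pivot the fresh micro-batch and merge it into the big
-- reporting table, rather than rebuilding or rescanning all 15B rows.
MERGE INTO FACT_PIVOT t
USING (
    -- collapse the 3 account rows per finance/date into one pivoted row
    SELECT FINANCE_KEY, DATE_KEY, DEPARTMENT_GROUP_KEY, SCENARIO_KEY,
           MAX(IFF(ACCOUNT_KEY LIKE 'SALES%', VALUE, NULL)) AS SALES,
           MAX(IFF(ACCOUNT_KEY LIKE 'COGS%',  VALUE, NULL)) AS COGS,
           MAX(IFF(ACCOUNT_KEY LIKE 'MKT%',   VALUE, NULL)) AS MKT
    FROM   STG_FACT_BATCH
    GROUP BY FINANCE_KEY, DATE_KEY, DEPARTMENT_GROUP_KEY, SCENARIO_KEY
) s
ON  t.FINANCE_KEY          = s.FINANCE_KEY
AND t.DATE_KEY             = s.DATE_KEY
AND t.DEPARTMENT_GROUP_KEY = s.DEPARTMENT_GROUP_KEY
AND t.SCENARIO_KEY         = s.SCENARIO_KEY
WHEN MATCHED THEN UPDATE SET
    -- an adjustment batch may only carry one account; keep the other columns
    t.SALES = COALESCE(s.SALES, t.SALES),
    t.COGS  = COALESCE(s.COGS,  t.COGS),
    t.MKT   = COALESCE(s.MKT,   t.MKT)
WHEN NOT MATCHED THEN INSERT
    (FINANCE_KEY, DATE_KEY, DEPARTMENT_GROUP_KEY, SCENARIO_KEY, SALES, COGS, MKT)
VALUES
    (s.FINANCE_KEY, s.DATE_KEY, s.DEPARTMENT_GROUP_KEY, s.SCENARIO_KEY, s.SALES, s.COGS, s.MKT);
```

If the adjustments can span the entire history of the table, this still ends up touching micro-partitions across the whole table, which is where the sidelining/deferral idea above comes in.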

Is normalization always necessary and more efficient?

I got into databases and normalization. I am still trying to understand normalization and I am confused about its usage. I'll try to explain it with this example.
Every day I collect data which would look like this in a single table:
TABLE: CAR_ALL
ID  | DATE       | CAR      | LOCATION  | FUEL | FUEL_USAGE | MILES  | BATTERY
==============================================================================
123 | 01.01.2021 | Toyota   | New York  | 40.3 | 3.6        | 79321  | 78
520 | 01.01.2021 | BMW      | Frankfurt | 34.2 | 4.3        | 123232 | 30
934 | 01.01.2021 | Mercedes | London    | 12.7 | 4.7        | 4321   | 89
123 | 05.01.2021 | Toyota   | New York  | 34.5 | 3.3        | 79515  | 77
520 | 05.01.2021 | BMW      | Frankfurt | 20.1 | 4.6        | 123489 | 29
934 | 05.01.2021 | Mercedes | London    | 43.7 | 5.0        | 4400   | 89
In this example I get data for thousands of cars every day. ID, CAR and LOCATION never change. All the other values can change daily. If I understood correctly, normalizing would make it look like this:
TABLE: CAR_CONSTANT
ID  | CAR      | LOCATION
=========================
123 | Toyota   | New York
520 | BMW      | Frankfurt
934 | Mercedes | London
TABLE: CAR_MEASUREMENT
GUID | ID  | DATE       | FUEL | FUEL_USAGE | MILES  | BATTERY
==============================================================
1    | 123 | 01.01.2021 | 40.3 | 3.6        | 79321  | 78
2    | 520 | 01.01.2021 | 34.2 | 4.3        | 123232 | 30
3    | 934 | 01.01.2021 | 12.7 | 4.7        | 4321   | 89
4    | 123 | 05.01.2021 | 34.5 | 3.3        | 79515  | 77
5    | 520 | 05.01.2021 | 20.1 | 4.6        | 123489 | 29
6    | 934 | 05.01.2021 | 43.7 | 5.0        | 4400   | 89
I have two questions:
1) Does it make sense to create an extra table for DATE?
2) It is possible that new cars will appear in the collected data. For every row I insert into CAR_MEASUREMENT, I would have to check whether the ID is already in CAR_CONSTANT, and if it doesn't exist, insert it. But that means I would have to check through CAR_CONSTANT thousands of times every day. Wouldn't it be more efficient to just insert the whole data as one row into CAR_ALL? Then I wouldn't have to check through CAR_CONSTANT every time.
The benefits of normalization depend on your specific use case. I can see both pros and cons to normalizing your schema, but it's impossible to say which is better without more knowledge of your use case.
Pros:
With your schema, normalization could reduce the amount of data consumed by your DB since CAR_MEASUREMENT will probably be much larger than CAR_CONSTANT. This scales up if you are able to factor out additional data into CAR_CONSTANT.
Normalization could also improve data consistency if you ever begin tracking additional fixed data about a car, such as license plate number. You could simply update one row in CAR_CONSTANT instead of potentially thousands of rows in CAR_ALL.
A normalized data structure can make it easier to query data for a specific car: using a LEFT JOIN, the DBMS can search through the CAR_MEASUREMENT table based on the integer ID column instead of having to compare two string columns.
Cons:
As you noted, the normalized form requires an additional lookup and possible insert to CAR_CONSTANT for every addition to CAR_MEASUREMENT. Depending on how fast you are collecting this data, those extra queries could be too much overhead.
To answer your questions directly:
I would not create an extra table for just the date. The date is a part of the CAR_MEASUREMENT data and should not be separated. The only exception that I can think of to this is if you will eventually collect measurements that do not contain any car data. In that case, then it would make sense to split CAR_MEASUREMENT into separate MEASUREMENT and CAR_DATA tables with MEASUREMENT containing the date, and CAR_DATA containing just the car-specific data.
See above. If you have a use case to query data for a specific car, then the normalized form can be more efficient. If not, then the additional INSERT overhead may not be worth it.
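On the lookup overhead mentioned in the cons: if the daily file is bulk-loaded into a staging table first, the existence check can be done once per day as a set-based statement rather than once per row. A minimal sketch, assuming a staging table named CAR_STAGING with the same columns as CAR_ALL (the staging table name is made up; the other names follow the question):

```sql
-- 1) Register any cars never seen before: one scan of the staging data,
--    not thousands of individual per-row lookups.
INSERT INTO CAR_CONSTANT (ID, CAR, LOCATION)
SELECT DISTINCT s.ID, s.CAR, s.LOCATION
FROM   CAR_STAGING s
WHERE  NOT EXISTS (SELECT 1 FROM CAR_CONSTANT c WHERE c.ID = s.ID);

-- 2) Append the day's measurements (GUID assumed to be auto-generated).
INSERT INTO CAR_MEASUREMENT (ID, DATE, FUEL, FUEL_USAGE, MILES, BATTERY)
SELECT s.ID, s.DATE, s.FUEL, s.FUEL_USAGE, s.MILES, s.BATTERY
FROM   CAR_STAGING s;
```

Done this way, the normalized design costs one extra indexed anti-join per day rather than thousands of round trips, which usually makes the insert overhead argument much less of a concern.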

Oracle Service Bus 11g - Scenario

For OSB 11g, I have to manage 2 endpoints (call them URL1 and URL2) to support 24x7 availability, where URL1 works from 08:00 until 20:00 and URL2 works from 20:00 until 08:00.
I have handled this in the business service under transport configuration:
1) create 2 endpoints (URL1 and URL2)
2) set retry count to 1
This worked okay, but when the switch happens from URL1 to URL2 (and vice versa), OSB clients experience a delay. Is there any way to take URL1 offline from 20:00 to 08:00 so that OSB does not attempt to hit URL1 during that period?
I think you might need to mark the inactive URI "offline" - with your current setup you rely on the retry count, which makes OSB first access the inactive endpoint and then fail over to the active one.
See:
https://docs.oracle.com/middleware/1221/osb/administer/GUID-C49400DC-26DD-4175-972A-19DCAE5BCDD0.htm#OSBAG605
You might be able to find a way to do this using a WLST script and schedule it to run at the appropriate times automatically.

SSIS Inserting Records into a table in the same order as in the flat file

I have a flat file that looks like the first set below. I have a table with an auto-incrementing primary key field. Using SSIS, how can I guarantee that when I import the data it keeps the record order as specified in the flat file? I'm assuming that when SSIS reads the file it will keep that order as it inserts into the database. Is this true?
In File:
RecordType | Amount
===================
5 | 1.00
6 | 2.00
6 | 3.00
5 | .5
6 | 1.5
7 | .8
5 | .5
In a Database Table
ID | RecordType | Amount
========================
1  | 5          | 1.00
2  | 6          | 2.00
3  | 6          | 3.00
4  | 5          | .5
5  | 6          | 1.5
6  | 7          | .8
7  | 5          | .5
Just to be safe, I'd add a Sort Transformation in your SSIS package; you can choose the column you want sorted on and how it's sorted. This should ensure it reads the data the way you want.
The order doesn't matter in a table. It only matters in a query.
In my experience it will always load in the order of the input file if you are using an autoincrement ID that is also the clustered index.
Here is a similar discussion that has a couple of ideas, particularly preprocessing the file or using a script component as the source. You may want to take one of those routes, because the fact that it may behave the way you want by default does not mean it always will.
http://www.sqlservercentral.com/Forums/Topic1300952-364-1.aspx
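To build on the point above that order only matters in a query: with an IDENTITY column as the clustered primary key, the surrogate key records arrival order, and any consumer that needs file order should say so with ORDER BY. A small sketch (table and column names are illustrative, following the example above):

```sql
-- Destination table: the IDENTITY value is assigned in insert order,
-- so it records the position of each row as SSIS wrote it.
CREATE TABLE dbo.FlatFileLoad (
    ID         INT IDENTITY(1, 1) NOT NULL PRIMARY KEY CLUSTERED,
    RecordType INT            NOT NULL,
    Amount     DECIMAL(18, 2) NOT NULL
);

-- Reading back in file order: never rely on physical storage order,
-- always ask for the order explicitly.
SELECT ID, RecordType, Amount
FROM   dbo.FlatFileLoad
ORDER BY ID;
```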

Store Curve Data in SQL Server table

I have a SQL Server table in which I need to store daily interest rate data.
Each day, rates are published for multiple time periods. For example, a 1 day rate might be 0.12%, 180 day rate might be 0.070% and so on.
I am considering 2 options for storing the data.
One option is to create columns for date, "days" and rate:
Date | Days | Rate
=========================
11/16/2015 | 1 | 0.12
11/16/2015 | 90 | 0.12
11/16/2015 | 180 | 0.7
11/16/2015 | 365 | 0.97
The other option is to store the "days" and rate via a JSON string (or XML):
Date | Rates
=============================================================
11/16/2015 | { {1,0.12}, {90,0.12}, {180, 0.7}, {365, 0.97} }
Data will only be imported via bulk insert; when we need to delete, we'll just delete all the records and re-import; there is no need for updates. So my need is mostly to read rates for a specified date or range of dates into a .NET application for processing.
I like option 2 (JSON) because it will be easier to create objects in my application, but I also like option 1 because I have more control over the data: data types and constraints.
Any similar experience out there on what might be the best approach, or does anyone care to chime in with their thoughts?
I would go with option 1. MS SQL Server is a relational database; storing key/value pairs as in option 2 is not normalized and is not efficient for SQL Server to deal with. If you really want option 2, I would use something other than SQL Server.
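As a sketch of what option 1 could look like with the data types and constraints mentioned above (the table name, the DECIMAL precision and the check constraint are my assumptions, not part of the question):

```sql
-- One row per published tenor per day; the composite key prevents
-- duplicate tenors for the same date on re-import.
CREATE TABLE dbo.InterestRate (
    RateDate DATE          NOT NULL,
    Days     INT           NOT NULL,
    Rate     DECIMAL(9, 4) NOT NULL,
    CONSTRAINT PK_InterestRate PRIMARY KEY (RateDate, Days),
    CONSTRAINT CK_InterestRate_Days CHECK (Days > 0)
);

-- Typical read for the .NET application: all tenors for a date range,
-- with @StartDate/@EndDate supplied as query parameters.
SELECT RateDate, Days, Rate
FROM   dbo.InterestRate
WHERE  RateDate BETWEEN @StartDate AND @EndDate
ORDER BY RateDate, Days;
```

Grouping the returned rows by RateDate in memory then gives you one curve object per date on the .NET side, while the relational form keeps the ability to type, index and constrain each value.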
