Keeping Data Fresh via Azure Data Factory - sql-server

What is the proper way to keep data slices refreshed? Imagine I have a table with various column, but importantly a DATE_CREATED and a DATE_MODIFIED column.
If my data slicing strategy is based on DATE_CREATED, I could periodically reprocess old slices. This follows the ADF guidance of "repeatability". I don't think ADF has a way of doing this automatically, but I could externally trigger the refresh via the API (I'm guessing.) This seems like perhaps the most correct way, but given that ADF doesn't seem to support this as a feature, it makes me feel like there's a better way of doing it ... it also seems mildly wasteful.
If my data slicing strategy is based on DATE_MODIFIED, I run into issues with the ADF activities not being repeatable. An old slice, when refreshed, would give different results because rows that were within the window may have moved to a different window. On the other hand, the latest slice would always catch rows that have changed. The other issue is preventing row duplication. The pre-activity clean up actions would need to somehow be able to clean up records in the destination table prior to the copy. Or some type of UPSERT method must be employed.
The final option is to TRUNCATE the destination table every day. This is fine for smaller tables but has its own downsides, (1) we're not really "slicing" at all anymore. This is just scorched earth. (2) Any time any slice is being processed, all downstream slices from all dates are in danger of failing, due to the table being blown away. (3) practically impossible if your table has any respectable amount of data in it.
No option seems excellent but the first option seems better. Looking for advice from someone who has solved this problem or is experienced with ADF.

Related

Insert New Record Before Existing Recordds in MS SQL Server

I'm working with an existing MS SQL database and ASP.NET web application. An update is needed in a table, but in order to add the new data and have it display correctly in the site, I need to be able to take a series of existing records, essentially "push them down", and then add the new data in the open space created.
Is there a cleaner, more efficient method than by just creating a new record that's a copy of the last related record, and then essentially doing a copy-and-paste for the remaining records until I reach the insertion point? There are quite a few records to move and I'd prefer something that isn't as mind-numbing and potentially error-prone as that.
I know the existing site and database isn't designed optimally for inserting new data into this table unless it's added to the end, but reconfiguring the database and stored procedures is not an option I presently have.
-- EDIT --
For additional requested information...
Screen shot of table definition:
Screen shot of some table data (filtered by TemplateID):
When looking at the table data, there are a couple other template ID values that bring back a bit more complex data. The issue is that this data needs to maintain this order, which happens to be the order in which it has been entered, since it gets returned and displayed in the shown order. The new data needs to be entered prior to one of these lettered subject headers. Honestly, I think this is not the best way to do this, but I had no hand in the design. It was created by a different company, and mine was hired to handle updates and maintenance after the creators became unpleasant to work with. A different template ID value brings back two levels of headers, which doesn't make my task any easier or alterations much cleaner considering the CS code that calls the stored procedures is completely separated from the code that builds the contents of the pages, and the organizational structure i tough to follow. There are some very poor naming conventions in places.
At any rate, there needs to be an insertion into this group of data under the "A" header value. The same needs to occur with another chunk associated with a different template ID, and there is another main header below the insertion point:

To use or not to use computed columns for performance and maintainability

I have a table where am storing a startingDate in a DateTime column.
Once i have the startingDate value, am supposed to calculate the
number_of_days,
number_of_weeks
number_of_months and
number_of_years
all from the startingDate to the current date.
If you are going to use these values in two or more places in the application and you do care much about the applications response time, would you rather make the calculations in a view or create computed columns for each so you can query the table directly?
Computed columns are easy to maintain and provide an ideal solution to your problem – I have used such a solution recently. However, be aware the values are calculated when requested (when they are SELECTed), not when the row is INSERTed into the table – so performance might still be an issue. This might be acceptable if you can off-load work from the application server to the database server. Views also don’t exist until they are requested (unless they are materialised) so, again, there will be an overhead at runtime, but, again it’s on the database server, not the application server.
Like nearly everything: It depends.
As #RedX suggest it probably not much of a performance difference either way, so it becomes a question of how will use them. To me this is more of a feel thing.
Using them more than once doesn't wouldn't necessary drive me immediately to either a view or computed columns. If I only use them in a few places or low volume code paths I might calc them in-line in those places or use a CTE. But if the are in wide spread or heavy use I would agree with a view or computed column.
You would also want them in a view or cc if you want them available via ORM tools.
Am I using those "computed columns" individual in places or am I using them in sets? If using them in sets I probably want a view of the table that shows included them all.
When i need them do I usually want them associated with data from a particular other table? If so that would suggest a view.
Am I basing updates on the original table of those computed values? If so then I want computed columns to avoid joining the view in these case.
Calculated columns may seem an easy solution at first, but I have seen companies have trouble with them because when they try to do ETL with CDC for real-time Change Data Capture with tools like Attunity it will not recognize the calculated columns since the values are not there permanently. So there are some issues. Also if the columns will be retrieve many, many times by users, you will save time in the long run by putting that logic in the ETL tool or procedure and write it once to the database instead of calculating it many times for each request.

Is this a "correct" database design?

I'm working with the new version of a third party application. In this version, the database structure is changed, they say "to improve performance".
The old version of the DB had a general structure like this:
TABLE ENTITY
(
ENTITY_ID,
STANDARD_PROPERTY_1,
STANDARD_PROPERTY_2,
STANDARD_PROPERTY_3,
...
)
TABLE ENTITY_PROPERTIES
(
ENTITY_ID,
PROPERTY_KEY,
PROPERTY_VALUE
)
so we had a main table with fields for the basic properties and a separate table to manage custom properties added by user.
The new version of the DB insted has a structure like this:
TABLE ENTITY
(
ENTITY_ID,
STANDARD_PROPERTY_1,
STANDARD_PROPERTY_2,
STANDARD_PROPERTY_3,
...
)
TABLE ENTITY_PROPERTIES_n
(
ENTITY_ID_n,
CUSTOM_PROPERTY_1,
CUSTOM_PROPERTY_2,
CUSTOM_PROPERTY_3,
...
)
So, now when the user add a custom property, a new column is added to the current ENTITY_PROPERTY table until the max number of columns (managed by application) is reached, then a new table is created.
So, my question is: Is this a correct way to design a DB structure? Is this the only way to "increase performances"? The old structure required many join or sub-select, but this structute don't seems to me very smart (or even correct)...
I have seen this done before on the assumed (often unproven) "expense" of joining - it is basically turning a row-heavy data table into a column-heavy table. They ran into their own limitation, as you imply, by creating new tables when they run out of columns.
I completely disagree with it.
Personally, I would stick with the old structure and re-evaluate the performance issues. That isn't to say the old way is the correct way, it is just marginally better than the "improvement" in my opinion, and removes the need to do large scale re-engineering of database tables and DAL code.
These tables strike me as largely static... caching would be an even better performance improvement without mutilating the database and one I would look at doing first. Do the "expensive" fetch once and stick it in memory somewhere, then forget about your troubles (note, I am making light of the need to manage the Cache, but static data is one of the easiest to manage).
Or, wait for the day you run into the maximum number of tables per database :-)
Others have suggested completely different stores. This is a perfectly viable possibility and if I didn't have an existing database structure I would be considering it too. That said, I see no reason why this structure can't fit into an RDBMS. I have seen it done on almost all large scale apps I have worked on. Interestingly enough, they all went down a similar route and all were mostly "successful" implementations.
No, it's not. It's terrible.
until the max number of column (handled by application) is reached,
then a new table is created.
This sentence says it all. Under no circumstance should an application dynamically create tables. The "old" approach isn't ideal either, but since you have the requirement to let users add custom properties, it has to be like this.
Consider this:
You lose all type-safety as you have to store all values in the column "PROPERTY_VALUE"
Depending on your users, you could have them change the schema beforehand and then let them run some kind of database update batch job, so at least all the properties would be declared in the right datatype. Also, you could lose the entity_id/key thing.
Check out this: http://en.wikipedia.org/wiki/Inner-platform_effect. This certainly reeks of it
Maybe a RDBMS isn't the right thing for your app. Consider using a key/value based store like MongoDB or another NoSQL database. (http://nosql-database.org/)
From what I know of databases (but I'm certainly not the most experienced), it seems quite a bad idea to do that in your database. If you already know how many max custom properties a user might have, I'd say you'd better set the table number of columns to that value.
Then again, I'm not an expert, but making new columns on the fly isn't the kind of operations databases like. It's gonna bring you more trouble than anything.
If I were you, I'd either fix the number of custom properties, or stick with the old system.
I believe creating a new table for each entity to store properties is a bad design as you could end up bulking the database with tables. The only pro to applying the second method would be that you are not traversing through all of the redundant rows that do not apply to the Entity selected. However using indexes on your database on the original ENTITY_PROPERTIES table could help greatly with performance.
I would personally stick with your initial design, apply indexes and let the database engine determine the best methods for selecting the data rather than separating each entity property into a new table.
There is no "correct" way to design a database - I'm not aware of a universally recognized set of standards other than the famous "normal form" theory; many database designs ignore this standard for performance reasons.
There are ways of evaluating database designs though - performance, maintainability, intelligibility, etc. Quite often, you have to trade these against each other; that's what your change seems to be doing - trading maintainability and intelligibility against performance.
So, the best way to find out if that was a good trade off is to see if the performance gains have materialized. The best way to find that out is to create the proposed schema, load it with a representative dataset, and write queries you will need to run in production.
I'm guessing that the new design will not be perceivably faster for queries like "find STANDARD_PROPERTY_1 from entity where STANDARD_PROPERTY_1 = 'banana'.
I'm guessing it will not be perceivably faster when retrieving all properties for a given entity; in fact it might be slightly slower, because instead of a single join to ENTITY_PROPERTIES, the new design requires joins to several tables. You will be returning "sparse" results - presumably, not all entities will have values in the property_n columns in all ENTITY_PROPERTIES_n tables.
Where the new design may be significantly faster is when you need a compound where clause on custom properties. For instance, finding an entity where custom property 1 is true, custom property 2 is banana, and custom property 3 is not in ('kylie', 'pussycat dolls', 'giraffe') is e`(probably) faster when you can specify columns in the ENTITY_PROPERTIES_n tables instead of rows in the ENTITY_PROPERTIES table. Probably.
As for maintainability - yuck. Your database access code now needs to be far smarter, knowing which table holds which property, and how many columns are too many. The likelihood of entertaining bugs is high - there are more moving parts, and I can't think of any obvious unit tests to make sure that the database access logic is working.
Intelligibility is another concern - this solution is not in most developers' toolbox, it's not an industry-standard pattern. The old solution is pretty widely known - commonly referred to as "entity-attribute-value". This becomes a major issue on long-lived projects where you can't guarantee that the original development team will hang around.

Bitemporal Database Design Question

I am designing a database that needs to store transaction time and valid time, and I am struggling with how to effectively store the data and whether or not to fully time-normalize attributes. For instance I have a table Client that has the following attributes: ID, Name, ClientType (e.g. corporation), RelationshipType (e.g. client, prospect), RelationshipStatus (e.g. Active, Inactive, Closed). ClientType, RelationshipType, and RelationshipStatus are time varying fields. Performance is a concern as this information will link to large datasets from legacy systems. At the same time the database structure needs to be easily maintainable and modifiable.
I am planning on splitting out audit trail and point-in-time history into separate tables, but I’m struggling with how to best do this.
Some ideas I have:
1)Three tables: Client, ClientHist, and ClientAudit. Client will contain the current state. ClientHist will contain any previously valid states, and ClientAudit will be for auditing purposes. For ease of discussion, let’s forget about ClientAudit and assume the user never makes a data entry mistake. Doing it this way, I have two ways I can update the data. First, I could always require the user to provide an effective date and save a record out to ClientHist, which would result in a record being written to ClientHist each time a field is changed. Alternatively, I could only require the user to provide an effective date when one of the time varying attributes (i.e. ClientType, RelationshipType, RelationshipStatus) changes. This would result in a record being written to ClientHist only when a time varying attribute is changed.
2) I could split out the time varying attributes into one or more tables. If I go this route, do I put all three in one table or create two tables (one for RelationshipType and RelationshipStatus and one for ClientType). Creating multiple tables for time varying attributes does significantly increase the complexity of the database design. Each table will have associated audit tables as well.
Any thoughts?
A lot depends (or so I think) on how frequently the time-sensitive data will be changed. If changes are infrequent, then I'd go with (1), but if changes happen a lot and not necessarily to all the time-sensitive values at once, then (2) might be more efficient--but I'd want to think that over very carefully first, since it would be hard to manage and maintain.
I like the idea of requiring users to enter effective daes, because this could serve to reduce just how much detail you are saving--for example, however many changes they make today, it only produces that one History row that comes into effect tomorrow (though the audit table might get pretty big). But can you actually get users to enter what is somewhat abstract data?
you might want to try a single Client table with 4 date columns to handle the 2 temporal dimensions.
Something like (client_id, ..., valid_dt_start, valid_dt_end, audit_dt_start, audit_dt_end).
This design is very simple to work with and I would try and see how ot scales before going with somethin more complicated.

How to keep historic details of modification in a database (Audit trail)?

I'm a J2EE developer & we are using hibernate mapping with a PostgreSQL database.
We have to keep track of any changes occurs in the database, in others words all previous & current values of any field should be saved. Each field can be any type (bytea, int, char...)
With a simple table it is easy but we a graph of objects things are more difficult.
So we have, speaking in a UML point of view, a graph of objects to store in the database with every changes & the user.
Any idea or pattern how to do that?
A common way to do this is by storing versions of objects.
If add a "version" and a "deleted" field to each table that you want to store an audit trail on, then instead of doing normal updates and deletes, follow these rules:
Insert - Set the version number to 0 and insert as normal.
Update - Increment the version number and do an insert instead.
Delete - Increment the version number, set the deleted field to true and do an insert instead.
Retrieve - Get the record with the highest version number and return that.
If you follow this pattern, every time you update you will create a new record rather than overwriting the old data, so you will always be able to track back and see all the old objects.
This will work exactly the same for graphs of objects, just add the new fields to each table within the object graph, and handle each insert/update/delete for each table as described above.
If you need to know which user made the modification, you just add a "ModifiedBy" field as well.
(You can either do this processing in your DA layer code, or if you prefer you can use database triggers to catch your update/delete/retrieve calls and re-process them following the rules.)
Obviously, you need to consider space requirements, as every single update will result in a fully new record. If your application is update heavy, you are going to generate a lot of data. It's common to also include a "last modified time" fields so you can process the database off line and delete data older than required.
Current RDBMS implementations are not very good at handling temporal data. That's one reason why maintaining separate journalling tables through triggers is the usual approach. (The other is that audit trails frequently have different use cases to regular data, and having them in separate tables makes it easier to manage access to them). Oracle does a pretty slick job of hiding the plumbing in its Total Recall product, but being Oracle it charges $$$ for this.
Scott Bailey has published a presentation on temporal data in PostgreSQL. Alas it won't help you right now but it seems like some features planned for 8.5 and 8.6 will enable the transparent storage of time-related data. Find out more.

Resources