How to handle deleted records (from source) in the Data Vault model?

We are building a Data Vault 2.0 model to capture Salesforce data. As in many other sources, records in the source are soft deleted. When loading data into the model, we do not want to filter anything out, and we also want to capture deleted records in the target system. I searched for best practices for handling deleted records in a Data Vault model but had no luck. Can someone shed some light on this? Should we add an IsDeleted flag to the Hub or to the Satellite, considering future expansion of the model and good design practice? Any links to reference material would also be a great help. Thank you.

The DV2.0 specification describes a "record source tracking satellite" that does exactly what you want: it keeps track of inserted, updated and deleted records per source (p. 143, if you have the book).
Basically, this is a satellite with a hash key, a load date, a record source and a status (I/U/D). You insert a row each time a record is added, updated or deleted in a source system.
DV1.0 had a "last seen" field, but it was removed for performance reasons ( https://danlinstedt.com/allposts/datavaultcat/end_of_updates/ ).
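
As a rough illustration of the satellite described above, here is a minimal sketch in Python using an in-memory SQLite database. The table name `sat_account_tracking`, the column names, the MD5-based `hash_key` helper and the sample business key are all assumptions made for this example; they are not taken from the DV2.0 book.

```python
import hashlib
import sqlite3


def hash_key(business_key: str) -> str:
    """Derive a hash key from a business key (MD5 is an assumption here)."""
    return hashlib.md5(business_key.strip().upper().encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")
# Hypothetical tracking satellite: hash key, load date, record source, status.
conn.execute(
    """
    CREATE TABLE sat_account_tracking (
        hub_account_hk TEXT NOT NULL,
        load_date      TEXT NOT NULL,
        record_source  TEXT NOT NULL,
        status         TEXT NOT NULL CHECK (status IN ('I', 'U', 'D')),
        PRIMARY KEY (hub_account_hk, load_date, record_source)
    )
    """
)


def track(conn, business_key, record_source, status, load_date):
    """Insert one tracking row; existing rows are never updated."""
    conn.execute(
        "INSERT INTO sat_account_tracking VALUES (?, ?, ?, ?)",
        (hash_key(business_key), load_date, record_source, status),
    )


def current_status(conn, business_key, record_source):
    """Return the most recent status for a key, or None if it was never seen."""
    row = conn.execute(
        """
        SELECT status FROM sat_account_tracking
        WHERE hub_account_hk = ? AND record_source = ?
        ORDER BY load_date DESC LIMIT 1
        """,
        (hash_key(business_key), record_source),
    ).fetchone()
    return row[0] if row else None


# The same source record is inserted, updated and then soft deleted.
track(conn, "001A", "salesforce", "I", "2024-01-01T00:00:00")
track(conn, "001A", "salesforce", "U", "2024-01-02T00:00:00")
track(conn, "001A", "salesforce", "D", "2024-01-03T00:00:00")
```

Because rows are only ever inserted, the full history stays available, and the latest row per hash key and record source tells you whether the record is currently deleted in that source.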

Related

Performance issue, flat file vs database storage

I am working on a project that requires me to develop a system that provides JSON output. The flow is as follows:
1a) Some tables will be updated via the administration panel (my company's side)
1b) Some related tables will be updated via the administration panel (partner's side)
-- Let's say SuperHeroes & Males are updated in 1a), and Studios & Years are updated in 1b)
2) A client browses our site and requests an information set which:
Has an enabled, non-deleted row in SuperHeroes (Ant-man)
Has an enabled, non-deleted row in Males linked to SuperHeroes (Scott Lang)
After joining the above records, is linked to an existing row in Studios (Marvel)
Is linked to an existing row in Years (2015)
3) A very small amount of data is output as a JSON string like the following: { id:1,type:marvel },{ id:1,type:dc }
Rows in all 4 of the above tables can be updated/deleted at any time without notification [and there are no foreign keys].
I am thinking of updating the information in a flat file every time 1a) is performed (we can change the system on my company's side but not on the partner's side, and they have refused to save extra information to a flat file, so we have no easy way to know whether the Studios or Years tables have been modified).
The JSON request would then first load the information from the flat file (all output data would be stored in this file), and then use a simple SQL statement to check whether a linked record exists in Studios & Years.
I have done my research and ended up confused. My conclusion is that a flat file works well when the amount of data is small, but you have to watch out as the file grows larger and larger (the flat file we are talking about will never hold more than 50 rows at a time, and it should not be modified frequently).
Some answers said a database is good at querying data (I think so too, and the requirement will involve an SQL check anyway).
So I don't know whether a flat file is a good idea when my data volume is small but I still need some communication with the database.
I appreciate your time and your help; all ideas & hints are welcome, thanks!

Your conclusion regarding the amount of data is absolutely correct, and a file should handle those 50 rows, but...
Using a database as storage should give you more options in the future, e.g.:
you'll be able to produce any output thanks to the separation of data from its representation (today it's JSON, but what if you have to produce XML at some point? Will you add files that store XML next to the JSON files? And then CSV or any other format?)
transactions will guarantee ACID properties
scalability and performance: if your dataset gets bigger (you never know :)), many databases offer possibilities such as table partitioning, partitioning-based clustering or replication
Wrong choices regarding architecture and technology made at the beginning of a project will always backfire.
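
To make the point about separating data from its representation concrete, here is a minimal sketch in Python with an in-memory SQLite database. The schema is a simplified assumption (only SuperHeroes and Studios, with hypothetical `enabled`, `deleted` and `studio_id` columns); the real tables from the question are not known.

```python
import json
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
# Simplified, assumed schema: no foreign keys, as in the question.
conn.executescript(
    """
    CREATE TABLE Studios (id INTEGER PRIMARY KEY, type TEXT);
    CREATE TABLE SuperHeroes (
        id INTEGER PRIMARY KEY, name TEXT,
        enabled INTEGER, deleted INTEGER, studio_id INTEGER
    );
    INSERT INTO Studios VALUES (1, 'marvel');
    INSERT INTO SuperHeroes VALUES (1, 'Ant-man', 1, 0, 1);   -- valid row
    INSERT INTO SuperHeroes VALUES (2, 'Orphan', 1, 0, 99);   -- studio missing
    INSERT INTO SuperHeroes VALUES (3, 'Removed', 1, 1, 1);   -- soft deleted
    """
)


def fetch_rows(conn):
    """Query the data once; the inner join drops rows whose studio is gone."""
    cur = conn.execute(
        """
        SELECT h.id, s.type
        FROM SuperHeroes h
        JOIN Studios s ON s.id = h.studio_id
        WHERE h.enabled = 1 AND h.deleted = 0
        ORDER BY h.id
        """
    )
    return [{"id": row[0], "type": row[1]} for row in cur]


def to_json(rows):
    """Render the rows as JSON."""
    return json.dumps(rows)


def to_xml(rows):
    """Render the same rows as XML without touching the query."""
    root = ET.Element("items")
    for row in rows:
        ET.SubElement(root, "item", id=str(row["id"]), type=row["type"])
    return ET.tostring(root, encoding="unicode")


rows = fetch_rows(conn)
```

The query runs against live data on every request, so changes made on the partner's side are picked up without any notification, and adding a new output format only means adding another rendering function.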

Set up relation on two existing Salesforce objects

I have a custom object in Salesforce for which I need to set up a Master-Detail relationship from Accounts, with Accounts being the Master and CompHist being the Detail. The problem I am running into is that I need the relationship to work off custom fields within the objects. Example:
1.) Accounts has a custom field called CustomerId.
2.) CompHist also has a custom field called CustomerId.
3.) I need to be able to have these linked together by the CustomerId field for report generation.
About 2,000 records are inserted into CompHist around the 8th of each month. This is done from a .NET application that kicks off at the scheduled time, collects info from our databases and then uploads that data to Salesforce via the SOAP API.
Maybe I'm misunderstanding how Salesforce relationships work, as I am fairly new (a couple of months) to Salesforce development.
Thanks,
Randy

There is a way to get this to work without triggers that link the records, and without pre-querying Salesforce from .NET to learn the Account Ids before you push the CompHist records.
Setup
On Account: set the "External ID" checkbox on your CustomerId field. I'd recommend setting "Unique" too.
On CompHist: you'll need to decide whether it's acceptable to move the records around, or whether the relation to Account, once set, will stay like that forever. Once you've made that decision, tick / untick "reparentable master-detail" in the definition of your lookup / master-detail to Account.
And if you have some Id on these details, something like a "line item number", consider making it an External ID too. It might save your bacon some time in the future when an end user questions the report or you have to do some kind of "flush" and push all lines from .NET again (it will help you figure out what to insert and what to update).
At this point it's useful to think about how you are going to fill in the missing data (all the nulls in the External ID field).
Actually establishing the relationship
If you have the external ids set, it's pretty easy to tell Salesforce to figure out the linking for you. The operation is called upsert (a mix of update and insert) and can be used in 2 flavours.
"Basic" upsert is for create/update solving; it means "dear Salesforce, please save this CompHist record with MyId=1234. I don't know what the Id is in your database and frankly I don't care, go figure this out, will ya?"
If there was no such record, one will be created.
If there was exactly 1 match, it will be updated.
If more than 1 was found, SF won't know which one to update and will throw an error back at you (that's why marking the field as "unique" is a good idea; there's a chance you'll spot errors sooner).
"Advanced" upsert is for maintaining foreign keys and establishing lookups. "Dear SF, please hook this CompHist up to the Account which is marked as "ABZ123" in my DB. Did I mention I don't care about your Ids and I can't be bothered to query your database before uploading my stuff?"
Again: exactly 1 match works as expected.
0 or 2 Accounts with the same ext. id value = error.
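
The matching rules of the two flavours can be mimicked in a few lines of plain Python. This is only a toy model of the behaviour described above, not Salesforce code; the `upsert` and `resolve_parent` functions, the `UpsertError` class and the field names are all made up for illustration.

```python
class UpsertError(Exception):
    """Raised when an external id does not identify exactly the rows needed."""


def upsert(records, ext_field, incoming):
    """'Basic' flavour: create on 0 matches, update on 1, fail on more."""
    matches = [r for r in records if r.get(ext_field) == incoming[ext_field]]
    if len(matches) > 1:
        raise UpsertError(f"{len(matches)} records share {ext_field}")
    if matches:
        matches[0].update(incoming)
        return "updated"
    records.append(dict(incoming))
    return "created"


def resolve_parent(parents, ext_field, value):
    """'Advanced' flavour: the parent must match exactly once."""
    matches = [p for p in parents if p.get(ext_field) == value]
    if len(matches) != 1:
        raise UpsertError(f"expected 1 parent, found {len(matches)}")
    return matches[0]["Id"]


# Hypothetical data standing in for Account and CompHist rows.
accounts = [{"Id": "acc-1", "CustomerId": "ABZ123"}]
comp_hist = []

first = upsert(comp_hist, "MyId", {"MyId": 1234, "Amount": 10})
second = upsert(comp_hist, "MyId", {"MyId": 1234, "Amount": 25})
parent_id = resolve_parent(accounts, "CustomerId", "ABZ123")
```

Here `first` is `"created"` and `second` is `"updated"`, while looking up a CustomerId that matches no Account, or more than one, raises `UpsertError`.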
Code plz
I'd recommend you play with Data Loader or a similar tool first to get a grasp of what exactly happens, how to map fields and how not to get confused (these 2 flavours of upsert can be used at the same time). Once you manage to push the changes the way you want, you can modify your integration a bit.
SOAP API upsert: http://www.salesforce.com/us/developer/docs/api/Content/sforce_api_calls_upsert.htm (C# example at the bottom)
REST API: http://www.salesforce.com/us/developer/docs/api_rest/Content/dome_upsert.htm
If you'd prefer a Salesforce Apex example: Can I insert deserialized JSON SObjects from another Salesforce org into my org?
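
For the REST route, the request for an upsert that uses both flavours at once could be assembled roughly as below. This sketch only builds the method, URL and JSON body and sends nothing. The object and field API names (`CompHist__c`, `MyId__c`, `Amount__c`, `Account__r`, `CustomerId__c`), the instance URL and the API version are assumptions for illustration; check the names in your own org and the linked documentation before relying on it.

```python
import json
from urllib.parse import quote


def build_upsert_request(instance_url, api_version, sobject,
                         ext_field, ext_value, fields, parent=None):
    """Build (method, url, body) for an upsert by external id.

    `parent` is an optional (relationship_name, parent_ext_field, value)
    tuple used to link the record to its parent by the parent's external id.
    """
    url = (
        f"{instance_url}/services/data/v{api_version}/sobjects/"
        f"{quote(sobject)}/{quote(ext_field)}/{quote(str(ext_value), safe='')}"
    )
    body = dict(fields)
    if parent is not None:
        relationship, parent_ext_field, parent_ext_value = parent
        # Nested object: "find the parent whose external id equals this value".
        body[relationship] = {parent_ext_field: parent_ext_value}
    return "PATCH", url, json.dumps(body)


method, url, body = build_upsert_request(
    "https://example.my.salesforce.com",   # hypothetical instance URL
    "58.0",                                # example API version
    "CompHist__c", "MyId__c", 1234,        # assumed object and ext. id names
    {"Amount__c": 10},                     # assumed ordinary field
    parent=("Account__r", "CustomerId__c", "ABZ123"),
)
```

The external id of the CompHist record travels in the URL, and the nested `Account__r` object asks Salesforce to resolve the parent Account by its own external id, so no Salesforce Ids are needed on the .NET side.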

Importing CSV to database (duplicate entries)

My job requires that I look up information on a long spreadsheet that's updated and sent to me once or twice a week. Sometimes the newest spreadsheet leaves off information that was in the last spreadsheet, so I have to look through several different spreadsheets to find the info I need. I recently discovered that I could convert the spreadsheet to a CSV file and then upload it to a database table. With a few lines of script, all I have to do is type in what I'm looking for and voila! Now I just got the newest spreadsheet and I'm wondering if I can just import it on top of the old one. There is a unique number for each row that I have set as the primary key in the database. If I try to import it on top of the current info, will it just skip the rows where the primary key would be duplicated, or would it mess up my database?
Thought I'd ask the experts before I tried it. Thanks for your input!
Details:
The spreadsheet consists of our clients. Each row contains the client's name, a unique id number, their address and contact info. I can set the column containing the unique ID as the primary key, then upload it. My concern is that there is nothing to signify a new row in a CSV file (I think). When I upload it, it gives me the option to skip duplicates, but will it skip the entire row, or just that cell, causing my data to be placed in the wrong rows? It's an Apache server; I don't know which version of MySQL. I'm using 000webhost for this.

Higgs,
In database/ETL terminology, this issue is called a deduplication strategy.
There is no template answer for this, but I suggest these helpful readings:
Academic paper - Joint Deduplication of Multiple Record Types in Relational Data
Deduplication article
Some open source tools:
Duke tool
Data cleaner

There's a little checkbox near the bottom when you click on import that says 'ignore duplicates' or something like that. Simpler than I thought.
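
To see what "skipping duplicates" does at the row level, here is a small sketch in Python using SQLite's `INSERT OR IGNORE` (MySQL has a comparable `INSERT IGNORE` statement). The `clients` table, its columns and the sample CSV contents are made up for this example, and the import tool's checkbox may not behave identically.

```python
import csv
import io
import sqlite3

# Hypothetical old and new exports; id 2 appears in both.
OLD_CSV = "id,name,phone\n1,Alice,555-0001\n2,Bob,555-0002\n"
NEW_CSV = "id,name,phone\n2,Bob,555-9999\n3,Carol,555-0003\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clients (id INTEGER PRIMARY KEY, name TEXT, phone TEXT)")


def import_csv(conn, text):
    """Import CSV text; rows whose primary key already exists are skipped whole."""
    reader = csv.DictReader(io.StringIO(text))
    rows = [(int(r["id"]), r["name"], r["phone"]) for r in reader]
    before = conn.execute("SELECT COUNT(*) FROM clients").fetchone()[0]
    conn.executemany(
        "INSERT OR IGNORE INTO clients (id, name, phone) VALUES (?, ?, ?)", rows
    )
    after = conn.execute("SELECT COUNT(*) FROM clients").fetchone()[0]
    return after - before  # number of rows actually added


added_old = import_csv(conn, OLD_CSV)
added_new = import_csv(conn, NEW_CSV)
clients = conn.execute("SELECT id, name, phone FROM clients ORDER BY id").fetchall()
```

The duplicate row is dropped as a unit, so nothing shifts into the wrong row. Note that the existing row for id 2 keeps its old phone number; if the newer spreadsheet should win, an update-on-duplicate strategy is needed instead of ignoring.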

Recommended way of adding daily data to database

I receive new data files every day. Right now, I'm building the database with all the required tables to import the data and perform the required calculations.
Should I just append each new day's data to my current tables? Each file contains a date column, which would allow for a "WHERE" query in the future if I need to analyze data for one particular day. Or should I be creating a new set of tables for every day?
I'm new to database design (coming from Excel). I will be using SQL Server for this.

Assuming that the structure of the data being received is the same, you should only need one set of tables rather than creating new tables each day.
I'd recommend storing the value of the date column from your incoming data in your database, and also having a 'CreateDate' column in your tables, with a default value of 'GetDate()' so that it automatically gets populated with the current date when the row is inserted.
You may also want to have another column to store the data filename that the row was imported from, but if you're already storing the value of the date column and the date that the row was inserted, this shouldn't really be necessary.
In the past, when doing this type of activity using a custom data loader application, I've also found it useful to create log files for success/error/warning messages, including some type of unique key for both the source data and the target database. For example, if the data comes from an Excel file and goes into a database table, you could store the row index from Excel and the primary key of the inserted row. This helps with tracking down any problems later on.
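
A minimal sketch of the single-table layout described above, written in Python against SQLite so that it runs anywhere; in SQL Server the default would be `GETDATE()` rather than `CURRENT_TIMESTAMP`. The table name `daily_data`, its columns and the sample values are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE daily_data (
        id          INTEGER PRIMARY KEY,
        data_date   TEXT NOT NULL,   -- date column taken from the incoming file
        value       REAL NOT NULL,
        source_file TEXT,            -- optional: file the row was imported from
        create_date TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP  -- set on insert
    )
    """
)


def load_file(conn, filename, rows):
    """Append one day's rows; create_date is filled in by the default."""
    conn.executemany(
        "INSERT INTO daily_data (data_date, value, source_file) VALUES (?, ?, ?)",
        [(data_date, value, filename) for data_date, value in rows],
    )


def total_for_day(conn, day):
    """Analyse a single day with a plain WHERE clause."""
    return conn.execute(
        "SELECT COALESCE(SUM(value), 0) FROM daily_data WHERE data_date = ?",
        (day,),
    ).fetchone()[0]


# Two hypothetical daily files appended to the same table.
load_file(conn, "2024-01-01.csv", [("2024-01-01", 10.0), ("2024-01-01", 5.0)])
load_file(conn, "2024-01-02.csv", [("2024-01-02", 7.5)])
```

With this layout, `total_for_day(conn, "2024-01-01")` only has to filter one table, and a monthly or quarterly report is the same query with a date range.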

You might want to consider having a look at SSIS (SQL Server Integration Services). It's the SQL Server tool for ETL activities.

Yes, append each day's data to the tables; 1 set of tables for all data.
Yes, use a date column to identify the day that the data was loaded.
Maybe have another table with a date column and a CLOB column: the date to hold the load date and the CLOB to hold the file that you imported.

Good question. You most definitely should have a single set of tables and append the data daily. Consider this: if you create a new set of tables each day, what would, say, a monthly report query look like? A quarterly report query? It would be a mess, with UNIONs and JOINs all over the place.
A single set of tables with a WHERE clause makes the querying and reporting manageable.
You might do a little reading on relational database theory. Wikipedia is a good place to start. The basics are pretty straightforward if you have the knack for it.

I would load the data into a staging table regardless and append it to the main tables afterwards. Once a week I would then refresh all data in the main table to ensure that it remains correct as per the source.
Marcus
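
The staging approach could look roughly like this in Python with SQLite. The table names `stage_sales` and `sales`, the columns, and the rule that rows already present in the main table are skipped are all assumptions added for this sketch; the weekly full refresh is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE stage_sales (sale_id INTEGER, sale_date TEXT, amount REAL);
    CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, sale_date TEXT, amount REAL);
    """
)


def load_day(conn, rows):
    """Load a file into the stage table, then append new rows to the main table."""
    conn.execute("DELETE FROM stage_sales")  # start from an empty stage
    conn.executemany("INSERT INTO stage_sales VALUES (?, ?, ?)", rows)
    # Append only rows whose key is not in the main table yet (assumed rule).
    conn.execute(
        """
        INSERT INTO sales (sale_id, sale_date, amount)
        SELECT s.sale_id, s.sale_date, s.amount
        FROM stage_sales s
        WHERE NOT EXISTS (SELECT 1 FROM sales m WHERE m.sale_id = s.sale_id)
        """
    )
    conn.commit()


# Hypothetical daily files; the second one repeats sale 2.
load_day(conn, [(1, "2024-01-01", 10.0), (2, "2024-01-01", 20.0)])
load_day(conn, [(2, "2024-01-01", 20.0), (3, "2024-01-02", 30.0)])
main_rows = conn.execute("SELECT sale_id FROM sales ORDER BY sale_id").fetchall()
```

Keeping the raw load in a stage table means validation and clean-up can happen there before anything reaches the tables used for reporting.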

Recommended method to handle record editing with approval in CakePHP?

I haven't been able to find any information regarding the best way to handle record editing with approval in CakePHP.
Specifically, I need to allow users to edit data in a record, but the edited data should not overwrite the original record data until administrators have approved the change. I could put the edited records in a new table and then overwrite the originals when I approve them but I wonder if there is an easier way since this idea doesn't seem to play well with the cake philosophy so to speak.

You are going to need somewhere to store that data until an administrator can approve it.
I'm not sure how this can be easier than creating another table with the new edits and the original post id. Then when an administrator approves the edit, the script overwrites the old record with the edited version.

I'm working on a similar setup and I'm going with storing the draft record in the same table but with a flag set on the record called "draft". Also, the original record has a "draft_id" field that has the id of the draft record stored in it.
Then in the model when the original record is loaded by the display engine it shows it normally. But when the edit or preview actions try to load the record, it checks the "draft_id" field and then loads the other record if it's set.
The "draft" flag is used to keep list and other group find type actions from grabbing the draft records too. This might also be solved by a more advanced SQL query but I'm not quite that good with SQL.
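
The same-table draft approach can be sketched independently of CakePHP. The example below uses Python and SQLite only to show the data flow; the `posts` table, its columns and all function names are hypothetical, and the `approve` step follows the "overwrite on approval" idea from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE posts (
        id       INTEGER PRIMARY KEY,
        title    TEXT NOT NULL,
        draft    INTEGER NOT NULL DEFAULT 0,  -- 1 marks a pending edit
        draft_id INTEGER                      -- on the original: id of its draft
    )
    """
)


def create_post(conn, title):
    return conn.execute("INSERT INTO posts (title) VALUES (?)", (title,)).lastrowid


def submit_edit(conn, post_id, new_title):
    """Store the edit as a draft row and point the original at it."""
    draft_id = conn.execute(
        "INSERT INTO posts (title, draft) VALUES (?, 1)", (new_title,)
    ).lastrowid
    conn.execute("UPDATE posts SET draft_id = ? WHERE id = ?", (draft_id, post_id))
    return draft_id


def load_for_display(conn, post_id):
    """Public view: always the approved original."""
    return conn.execute("SELECT title FROM posts WHERE id = ?", (post_id,)).fetchone()[0]


def load_for_edit(conn, post_id):
    """Edit/preview view: the draft if one exists, otherwise the original."""
    row = conn.execute(
        "SELECT title, draft_id FROM posts WHERE id = ?", (post_id,)
    ).fetchone()
    if row[1] is None:
        return row[0]
    return conn.execute("SELECT title FROM posts WHERE id = ?", (row[1],)).fetchone()[0]


def approve(conn, post_id):
    """Administrator approval: copy the draft over the original, drop the draft."""
    draft_id = conn.execute(
        "SELECT draft_id FROM posts WHERE id = ?", (post_id,)
    ).fetchone()[0]
    if draft_id is None:
        return
    title = conn.execute("SELECT title FROM posts WHERE id = ?", (draft_id,)).fetchone()[0]
    conn.execute("UPDATE posts SET title = ?, draft_id = NULL WHERE id = ?", (title, post_id))
    conn.execute("DELETE FROM posts WHERE id = ?", (draft_id,))


def list_posts(conn):
    """List views filter on the draft flag so pending edits stay hidden."""
    return [r[0] for r in conn.execute("SELECT title FROM posts WHERE draft = 0 ORDER BY id")]


post_id = create_post(conn, "Original title")
submit_edit(conn, post_id, "Edited title")
shown_before = load_for_display(conn, post_id)
edited_view = load_for_edit(conn, post_id)
listed_before = list_posts(conn)
approve(conn, post_id)
shown_after = load_for_display(conn, post_id)
```

Until `approve` runs, visitors keep seeing "Original title" while the editor sees "Edited title"; the `WHERE draft = 0` filter is the "more advanced SQL query" alternative in its simplest form.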
