What steps are performed by a 'Rescan'? - cloudant

To automatically warehouse documents from Cloudant to dashDB, there is a schema discovery process (SDP) that automates the data migration for you. When using the SDP to warehouse documents from Cloudant to dashDB, there is an option 'Rescan'.
I have used 'Rescan' a number of times, but am unclear on the steps it actually performs. What steps are performed by a 'Rescan'? E.g.
Drop tables in the dashDB target schema? Which tables?
Scan Cloudant source database?
Recreate the target schema?
...
...

The steps are pretty much as you suggested. Rescan will:
1. Inspect the previously discovered JSON schema and remove all tables from the dashDB instance that were created for that load (leaving any user-defined tables untouched)
2. Re-discover the JSON schema using the current settings (including sample size, type of discovery algorithm, etc.)
3. Create the new tables in the same dashDB target
4. Load the newly created tables with data from Cloudant
5. Subscribe to the _changes feed from Cloudant to continuously synchronize document changes with dashDB
All steps (except for the first) are identical for the initial load as well as the rescan function.
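As a rough illustration, the first four steps above can be sketched in Python with an in-memory dict standing in for dashDB. The names `rescan`, `discover_schema`, `warehouse` and `sdp_tables` are hypothetical and not part of the SDP; the real SDP uses its own discovery algorithms and may create several tables per load. The _changes subscription (the last step) is left out.

```python
from typing import Any


def discover_schema(docs: list[dict[str, Any]], sample_size: int) -> dict[str, str]:
    """Infer column names and type names from the first `sample_size` documents."""
    schema: dict[str, str] = {}
    for doc in docs[:sample_size]:
        for field, value in doc.items():
            schema.setdefault(field, type(value).__name__)
    return schema


def rescan(
    warehouse: dict[str, list[dict[str, Any]]],
    sdp_tables: set[str],
    source_docs: list[dict[str, Any]],
    table_name: str,
    sample_size: int,
) -> dict[str, str]:
    """Toy model of a rescan; `sdp_tables` tracks tables created by earlier loads."""
    # Drop only the tables created by the previous load; user tables stay.
    for name in list(sdp_tables):
        warehouse.pop(name, None)
    sdp_tables.clear()

    # Re-discover the schema with the current settings.
    schema = discover_schema(source_docs, sample_size)

    # Re-create the table and load every document; missing fields become None.
    warehouse[table_name] = [
        {column: doc.get(column) for column in schema} for doc in source_docs
    ]
    sdp_tables.add(table_name)
    return schema
```

Because only the names in `sdp_tables` are dropped, a table such as a hand-made `user_notes` in the same target survives the rescan, which mirrors the "user-defined tables untouched" behaviour described above.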
The main motivation for a rescan is to support schema evolution. Whenever the document structure in a Cloudant source database changes, a user can make a conscious decision to drop and re-create the dashDB tables using this rescan function. SDP won't automate that process to avoid potential conflicts with applications depending on the existing dashDB tables.

Related

How to move data from S3 to Snowflake

I have a few questions regarding the process of copying tables from S3 to Snowflake.
The plan is to copy some data from AWS S3 into Snowflake and then do some modeling with DataRobot.
Some of our tables contain PII data, and we would like to hide those columns from DataRobot. What do you suggest for this?
Does the schema in AWS need to match the schema in Snowflake for the copying process?
Thanks,
Mali
Assuming you know the schema of the data you are loading, you have a few options for using Snowflake:
Use COPY INTO statements to load the data into the tables
Use SNOWPIPE to auto-load the data into the tables (this would be good for instances where you are regularly loading new data into Snowflake tables)
Use EXTERNAL TABLES to reference the S3 data directly as a table in Snowflake. You'd likely want to use MATERIALIZED VIEWS for this in order for the tables to perform better.
As for hiding the PII data from DataRobot, I would recommend leveraging Snowflake DYNAMIC DATA MASKING to establish rules that obfuscate the data (or null it out) for the role that DataRobot is using.
All of these features are well-documented in Snowflake documentation:
https://docs.snowflake.com/
Regarding hiding your PII elements, you can use two different roles: one, say data_owner (the role that creates the table and loads the data into it), and another, say data_modelling (the role DataRobot uses).
Create masking policies as the data owner so that the DataRobot role cannot see the column data.
As for your question about copying the data, there is no requirement that the AWS S3 folder be in sync with Snowflake. You can create the external stage with any name and point it to any S3 folder.
The Snowflake documentation has a good example to get some hands-on practice:
https://docs.snowflake.com/en/user-guide/data-load-s3.html
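To make the two-role idea concrete, here is a purely conceptual Python sketch of what role-based masking does to a row. It is not Snowflake code: in Snowflake the rule lives in a masking policy attached to the column, not in application logic. The role names come from the answer above; the column names in `PII_COLUMNS` and the function `read_row` are made up for illustration.

```python
# Hypothetical PII column names, for illustration only.
PII_COLUMNS = {"email", "ssn"}


def read_row(row: dict, role: str) -> dict:
    """Return the row as the given role would see it."""
    if role == "data_owner":
        # The owning role sees the real values.
        return dict(row)
    # Every other role (e.g. data_modelling) sees masked PII columns.
    return {
        column: ("***MASKED***" if column in PII_COLUMNS else value)
        for column, value in row.items()
    }
```

The point of doing this in the warehouse rather than in a copy of the data is that both roles query the same table, so there is only one dataset to load and maintain.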

What is the process to transfer staging table data to fact tables in Snowflake with custom validations?

Good day.
I need help. I want to transfer data in Snowflake from staging tables to fact tables automatically whenever data is available in the staging table. While moving data from the staging table to the fact tables, I have a couple of custom validations to apply on each column and row.
Any idea how to do this in Snowflake?
If anyone knows, could you please advise?
Thanks in advance!
There are many ways to do this and how you go about it depends on what tools you have available. The simplest way to do this without using tools outside of the Snowflake ecosystem would be:
Set up a stream on each of your staging tables (here is the Snowflake documentation on streams)
Create a task that runs on a schedule (here is the Snowflake doc on tasks) to pull from the streams and write into the fact table.
This is really a general data warehousing question rather than a Snowflake one. Here is some more documentation on building SCD type 2 dimensions also written by someone at Snowflake
Assuming "staging tables" refers to a Snowflake table and not a file in a Snowflake stage, I would recommend using a Stream and Task for this. A stream will identify the delta of data that needs to be loaded, and a Task can execute on a schedule and will only actually run something if there is data in the stream. Create a stored procedure that is executed in the Task to run your validations and Merge the outcome of those into your Fact.
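The stream/task/merge flow described above can be sketched conceptually in Python. This is only a model of the logic, not Snowflake code: a list stands in for the stream's pending rows, a dict keyed by `id` stands in for the fact table, and the two checks in `validate` are invented examples of "custom validations". All names here are hypothetical.

```python
def validate(row: dict) -> list[str]:
    """Hypothetical custom validations: return a list of error messages."""
    errors = []
    if row.get("amount") is None or row["amount"] < 0:
        errors.append("amount must be non-negative")
    if not row.get("customer_id"):
        errors.append("customer_id is required")
    return errors


def run_task(stream: list[dict], fact: dict, rejects: list) -> int:
    """One scheduled run: consume pending rows and merge the valid ones."""
    if not stream:
        # Nothing changed since the last run, so there is nothing to do.
        return 0
    merged = 0
    while stream:
        row = stream.pop(0)          # consuming a row advances the stream
        errors = validate(row)
        if errors:
            rejects.append((row, errors))
        else:
            fact[row["id"]] = row    # insert or update by key, like a MERGE
            merged += 1
    return merged
```

Keeping rejected rows together with their error messages, rather than silently dropping them, makes it possible to inspect and reprocess them later.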

Extracting Data from SAP to SQL Server

I am using SSIS packages to extract data from SAP database tables into SQL Server tables. I am using OLEDB source/destination connections to achieve this.
The problem now is that a table in SAP has 5 million records and it's taking around 2 hours to extract this data into my SQL Server table. I have used the trunc-dump method (truncating the table in SQL Server and dumping data into it from the SAP table) and also tried using a multiple hash key to bring in the updated/new records.
The problem with Hash key is that it still has to scan the entire table to look for changed/new records and hence takes almost the same time as the trunc-dump method.
I am looking for a new way or changing the existing way to reduce the time taken to complete this extraction.
You mentioned you are using an OLEDB source connection to access SAP. If that means you are accessing SAP's underlying database directly, you should stop doing that until you have explicit IT approval, for three reasons:
You skipped SAP's application layer security. There can be an enterprise security compliance issue;
Your company's SAP license may not allow you to do that. If your company only has SAP indirect access license, then you may have to stay on application layer;
You will not get SAP's official support by accessing the underlying database directly.
You have multiple options to fetch data using SSIS through the SAP application layer:
1. Use commercial SSIS custom components for this job (disclaimer: AecorSoft is one of the leading vendors offering such connectivity components);
2. Look into SAP's own OData Gateway interface to consume data;
3. Request your SAP ABAP team to write custom ABAP programs to dump SAP data into CSV files, and then use SSIS to fetch them.
Let's now look at the performance side:
SAP ETL performance depends on many factors, but in general, even for SAP transactional tables with 100+ columns, taking a couple of hours to extract 5 million rows is considered very slow. For example, we've seen cases of extracting the standard SAP General Ledger header table BKPF (almost 100 columns) at a consistent rate of 1M rows every 1-2 minutes. Of course, such performance is achieved through a commercial component and SSIS, but you should expect at least 1M rows per 10 minutes even for option #3 above, going through an intermediate CSV file. Under the hood, through the SAP application layer, all 3 options leverage SAP Open SQL (in contrast to the "Native SQL" that the underlying database offers) to access SAP tables. Therefore, if you experience application-layer performance issues, you can analyze the Open SQL side.
You also mentioned the updated/new records scenario, which is a typical delta extraction problem. Normally, SAP transactional tables have Create Date and Changed Date fields that can help you capture the delta. To avoid a full table scan, apply indices through the SAP application layer on those "delta fields". For example, if you need to extract the Sales Document Header table VBAK, you can filter by ERDAT (Created on) and AEDAT (Changed on). Delta is a complex subject in SAP. There is no simple statement that describes the delta solution, as SAP data models are complex and differ greatly across functional modules, so delta analysis is always a case-by-case effort. Some people may simply recommend using "delta extractors", but don't treat that as a silver bullet, because extractors have their own problems. In short, if you go with table-based extraction, focus on that, and work with your SAP functional team to determine suitable delta fields. Try to avoid full table scans and hashing. Do an incremental load with some optional overlap with the previous extract (e.g. loading today's and yesterday's records), and do a MERGE to absorb the changes.
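The incremental-load-with-overlap idea can be sketched in Python as follows. This is a conceptual model only, not SSIS or ABAP: rows are plain dicts, ERDAT and AEDAT are the VBAK fields named above, and the key VBELN (the sales document number), the `None` convention for never-changed rows, and the function names are assumptions made for the example.

```python
from datetime import date, timedelta


def extract_delta(source_rows: list[dict], last_run: date, overlap_days: int = 1) -> list[dict]:
    """Select rows created or changed since the last run, minus an overlap window."""
    cutoff = last_run - timedelta(days=overlap_days)
    return [
        row
        for row in source_rows
        if row["ERDAT"] >= cutoff
        or (row["AEDAT"] is not None and row["AEDAT"] >= cutoff)
    ]


def merge(target: dict, delta: list[dict], key: str = "VBELN") -> None:
    """Upsert delta rows into the target by key, like a SQL MERGE."""
    for row in delta:
        target[row[key]] = row
```

Because the merge is keyed, re-reading yesterday's rows in the overlap window is harmless: they overwrite identical copies instead of creating duplicates.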
There are a few cases where you may not be able to find any delta field, and it is not practical to do a full load every time. One good example is the Address Master data table ADRC. If you are required to do a delta load on such a table, you either have to ask your SAP functional team to figure out the delta for you (meaning they inject custom logic everywhere an Address master record can be created, updated, or deleted), or you have to ask your SAP Basis team to create a DB trigger on the underlying database table and expose the trigger table at the application layer. This way, you can create an application-layer view on the main table and the trigger table to compute the delta. Your solution still involves no direct database access: the DB-layer trigger is fully managed and controlled by your SAP Basis team, who also support the database.
Hope this helps!

Replicated database for storing historical data

Only part of the data in the database is processed by the application; the rest is needed for reporting purposes, but it causes poor application performance. I would like to archive historical data without modifying the database schema.
Is it possible to replicate the database, delete old data from the primary instance, and regularly synchronise new changes into the replicated database? That way the primary "transactional" database will be lightweight and the replicated database will contain the full set of both current and historical data for reporting purposes.
Could you recommend some tools or give some tips to achieve that on Oracle?
edit:
I'm wondering if I could use streams and somehow make DML handler to ignore DELETE operations on rows (docs.oracle.com/cd/B28359_01/server.111/b28321/…) so that during data replication historical rows will be preserved despite being deleted from transactional db.
You don't need to create two separate databases. Just create one transactional database where you will save all your transactions and then create views based on these tables to show required data. In this way you just have to maintain only one database.

How to partially migrate a database to a new system over time?

We are in the process of a multi-year project where we're building a new system and a new database to eventually replace the old system and database. The users are using the new and old systems as we're changing them.
The problem we keep running into is when an object in one system is dependent on an object in the other system. We've been using views, but have run into a limitation with one of the technologies (Entity Framework) and are considering other options.
The other option we're looking at right now is replication. My boss isn't excited about the extra maintenance that would cause. So, what other options are there for getting dependent data into the database that needs it?
Update:
The technologies we're using are SQL Server 2008 and Entity Framework. Both databases are within the same sql server instance so linked servers shouldn't be necessary.
The limitation we're facing with Entity Framework is we can't seem to create the relationships between the table-based-entities and the view-based-entities. No relationship can exist in the database between a view and a table, as far as I know, so the edmx diagram can't infer it. And I cannot seem to create the relationship manually without getting errors. It thinks all columns in the view are keys.
If I leave it that way I get an error like this for each column in the view:
Association End key property [...] is not mapped.
If I try to change the "Entity Key" property to false on the columns that are not the key I get this error:
All the key properties of the EntitySet [...] must be mapped to all the key properties [...] of table viewName.
According to this forum post it sounds like a limitation of the Entity Framework.
Update #2
I should also mention the main limitation of the Entity Framework is that it only supports one database at a time. So we need the old data to appear to be in the new database for the Entity Framework to see it. We only need read access of the old system data in the new system.
You can use linked server queries to leave the data where it is, but connect to it from the other db.
Depending on how up-to-date the data in each db needs to be & if one data source can remain read-only you can:
Use the Database Copy Wizard to create an SSIS package that you can run periodically as a SQL Agent Task
Use snapshot replication
Create a custom BCP in/out process to get the data to the other db
Use transactional replication, which can be near-realtime.
If data needs to be read-write in both databases then you can use:
transactional replication with update subscriptions
merge replication
As you go down the list the amount of work involved in maintaining the solution increases. Using linked server queries will work best if it's the right fit for what you're trying to achieve.
EDIT: If they're on the same server then, as suggested by another user, you should be able to access the table with servername.databasename.schema.tablename. It looks like an Entity Framework issue and not a db issue.
I don't know about EntityToSql but I know in LinqToSql you can connect to multiple databases/servers in one .dbml if you prefix the tables with:
ServerName.DatabaseName.SchemaName.TableName
MyServer.MyOldDatabase.dbo.Customers
I have been able to click on a table in the .dbml, copy and paste it into the .dbml of the alternate project, prefix the name, and set up the relationships, and it works. Like I said, this was in LinqToSql; I have not tried it with EntityToSql. I would give it a shot before you go through all the work of replication and such.
If Linq-to-Entities cannot cross DB's then Replication or something that emulates it is the only thing that will work.
For performance purposes you probably want either Merge replication or Transactional with queued (not immediate) updating.
Thanks for the responses. We're going to try adding triggers to the old database tables to insert/update/delete records in the new tables of the new database. This way we can continue to use Entity Framework and also do any data transformations we need.
Once the UI functions move over to the new system for a particular feature, we'll remove the table from the old database and add a view to the old database with the same name that points to the new database table for backwards compatibility.
One thing that I realized needs to happen before we can do this is that we have to search all our code and SQL for @@IDENTITY and replace it with SCOPE_IDENTITY() so the triggers don't mess up the Ids in the old system.
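Why the triggers matter here can be shown with a small Python simulation. It is a toy model of one SQL Server session, not database code: `session_identity` stands in for @@IDENTITY (last identity generated in the session, in any scope) and `scope_identity` for SCOPE_IDENTITY() (last identity generated in the caller's own scope). The class and the table names used with it are hypothetical.

```python
class Session:
    """Toy model of one SQL Server session with identity tracking."""

    def __init__(self):
        self.counters = {}            # last identity value issued per table
        self.triggers = {}            # table -> table its insert trigger writes to
        self.session_identity = None  # models @@IDENTITY (any scope)
        self.scope_identity = None    # models SCOPE_IDENTITY() in the caller's scope

    def _next_id(self, table: str) -> int:
        self.counters[table] = self.counters.get(table, 0) + 1
        return self.counters[table]

    def insert(self, table: str) -> int:
        """Insert from the caller's scope and fire the table's trigger, if any."""
        new_id = self._next_id(table)
        self.session_identity = new_id
        self.scope_identity = new_id
        target = self.triggers.get(table)
        if target is not None:
            # The trigger runs in its own scope: it overwrites the session-wide
            # value but leaves the caller's scope value alone.
            self.session_identity = self._next_id(target)
        return new_id
```

With a trigger from a hypothetical `old_customers` table into `new_customers`, an insert into `old_customers` leaves `scope_identity` pointing at the old table's new Id, while `session_identity` holds the Id generated in the new table. Code in the old system that reads the session-wide value would therefore pick up the wrong Id.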
