Design for importing definition data from Excel into SQL Server

We have a Restaurant Inventory Control system that uses SQL Server 2008 R2.
It takes a very long time to add all the definition data: stock items, yields, pack sizes, recipes, categories, etc. So our clients have asked if they can upload it from Excel.
Before I just jump in and start, I want to find out if there is a best practice way to do this.
I know all the tools: SSIS, stored procedures, etc. But I'm looking for advice/resources that can help with the design process: how best to set up the spreadsheet, validate the data, create the child/parent relationships, and so on.
This must be a fairly common project -- so it must have a standard design/approach and that's what I'm looking for.

I think the design will depend on the technologies you're most comfortable with. If you're comfortable with SSIS and stored procedures, this is the general pattern I would use:
Excel Template - I wouldn't spend too much time on this; add the headers and sheets necessary for the tables. You can lock down certain things and/or implement rules, but most of your validation would be done in stored procedures.
SSIS - Have a package that loads the Excel data into staging tables. Have rows with errors added to an error log to be presented to the user, along with the validation issues from the stored procedures.
Staging Tables - Have one staging table per sheet/production table, with an ExecutionId column in each staging table to allow parallel processing. Allow all columns to be NULL so you can get the data into the staging tables, or set the proper NULL constraints and have SSIS redirect those rows on error. Don't put any primary key / foreign key relationships on the staging tables; these can be validated in the stored procedures.
Stored Procedures - Validate the staging data; any issues found are added to the error log to be presented to the user or person performing the import. If there are no issues, import the data into the production tables. If there is existing data in the production tables, you could do a comparison and update where applicable.
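As a rough illustration of the staging-table and stored-procedure part of this pattern, here is a minimal sketch. All object names (StageStockItem, ImportErrorLog, usp_ImportStockItems, and the production StockItem/Category tables) are hypothetical placeholders for your own schema:

-- Staging table: one per sheet, all columns nullable, no keys, ExecutionId for parallel runs
CREATE TABLE dbo.StageStockItem (
    ExecutionId  UNIQUEIDENTIFIER NOT NULL,
    RowNumber    INT              NULL,
    ItemCode     NVARCHAR(50)     NULL,
    ItemName     NVARCHAR(200)    NULL,
    CategoryName NVARCHAR(100)    NULL,
    PackSize     NVARCHAR(50)     NULL
);

-- Error log presented back to the person performing the import
CREATE TABLE dbo.ImportErrorLog (
    ExecutionId  UNIQUEIDENTIFIER NOT NULL,
    RowNumber    INT              NULL,
    ErrorMessage NVARCHAR(400)    NOT NULL
);
GO

-- Validate the staged rows; only touch production when the batch is clean
CREATE PROCEDURE dbo.usp_ImportStockItems
    @ExecutionId UNIQUEIDENTIFIER
AS
BEGIN
    SET NOCOUNT ON;

    INSERT INTO dbo.ImportErrorLog (ExecutionId, RowNumber, ErrorMessage)
    SELECT s.ExecutionId, s.RowNumber, 'ItemCode is required'
    FROM dbo.StageStockItem s
    WHERE s.ExecutionId = @ExecutionId
      AND s.ItemCode IS NULL;

    INSERT INTO dbo.ImportErrorLog (ExecutionId, RowNumber, ErrorMessage)
    SELECT s.ExecutionId, s.RowNumber, 'Unknown category: ' + s.CategoryName
    FROM dbo.StageStockItem s
    WHERE s.ExecutionId = @ExecutionId
      AND s.CategoryName IS NOT NULL
      AND NOT EXISTS (SELECT 1 FROM dbo.Category c WHERE c.Name = s.CategoryName);

    -- Import only if this execution produced no errors
    IF NOT EXISTS (SELECT 1 FROM dbo.ImportErrorLog WHERE ExecutionId = @ExecutionId)
    BEGIN
        INSERT INTO dbo.StockItem (ItemCode, ItemName, CategoryId, PackSize)
        SELECT s.ItemCode, s.ItemName, c.CategoryId, s.PackSize
        FROM dbo.StageStockItem s
        JOIN dbo.Category c ON c.Name = s.CategoryName
        WHERE s.ExecutionId = @ExecutionId;
    END
END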

Related

IntelliJ database tool keeps indexing all tables when I modify one table

I am using the database tool in IntelliJ IDEA 2016.2.1.
I have a database schema with about 100 tables on a DB2 server. Most of the tables are not related.
During development, I often need to add a new column to a table. However, each time IntelliJ starts indexing all the tables, which takes a very long time, even though the table I modified is not related to any other tables. The table I modify is also a trivial table, just for test purposes.
Question: Is there a way to avoid re-indexing all tables when not necessary?

SSIS Cross-DB "WHERE IN" Clause (or Equivalent) in Azure

I'm currently trying to build a data flow in SSIS to select all records from a mapping table where an ID column exists in the related Item table. There are two complications:
The two tables are currently in different databases on different servers.
The databases are in Azure, for which I've read Linked Servers are not supported.
To be clearer, the job is to migrate data from the Staging environment to Production. I only want to push lookup records into Prod if the associated Item IDs are there. Here's some pseudo-T-SQL to give a clear picture of what I'm trying to achieve:
SELECT *
FROM [Staging_Server].[SourceDB].[dbo].[Lookup] L
WHERE L.[ID] IN (
    SELECT P.[Item]
    FROM [Production_Server].[TargetDB].[dbo].[Item] P
)
I haven't found a good way to create this in SSIS. I think I've created a work-around that involves sorting both tables and performing a merge join, but sorting both sides is an unnecessary hit on performance. I'm looking for a more direct and intuitive design for this seemingly simple data flow.
Doing this in a data flow, you'd have your source query, sans filter, fed into a Lookup component that performs the subquery.
The challenge with this is that SSIS is likely on-premises, which means you are going to pull all of your data out of the Azure staging instance to the server running SSIS and push it back to the Azure production instance.
That's a lot of network activity, and as I read the Azure pricing guide, as long as you have the appropriate DTUs you'd be fine. Back in the day you were charged for reads and not writes, so the idiom was to just push all your data to the target server and do the comparison there, much as ElendaDBA mentions. The only suggestion I'd make on the implementation is to avoid temporary tables or ad-hoc creation/destruction of them; just implement a permanent physical table and truncate and reload it prior to transmission to production.
You could create a temp table on the staging server to copy the production data into. Then you could write a query joining those two tables. After the SSIS package runs, you could delete the temp table on the staging server.
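To make the "physical table on the staging side" suggestion concrete, here is a rough sketch. It assumes a helper table (dbo.ProdItemKeys) on the staging server that a small SSIS data flow refills from the production Item table on each run; the names are illustrative only:

-- Permanent helper table on the staging server, truncated and reloaded each run
-- by a simple data flow sourced from [Production_Server].[TargetDB].[dbo].[Item]
CREATE TABLE dbo.ProdItemKeys (
    ItemId INT NOT NULL PRIMARY KEY
);
GO

TRUNCATE TABLE dbo.ProdItemKeys;
-- ... SSIS data flow loads the production Item IDs here ...

-- Source query for the Lookup-to-production data flow: the filter now runs locally,
-- with no cross-server call and no sort/merge join needed
SELECT L.*
FROM [SourceDB].[dbo].[Lookup] AS L
WHERE EXISTS (SELECT 1 FROM dbo.ProdItemKeys k WHERE k.ItemId = L.[ID]);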

Database Problems

I have a database schema in an Oracle database. I also have data dumps from third-party vendors. I load their data using SQL*Loader scripts on a Linux machine.
We also have batch updates every day.
The data is assumed to be free from errors. For example, if a record 'A' is inserted into the database on the first day, the assumption is that 'A' will not occur again in later loads. If we do receive a record named 'A' again, we get a primary key violation.
Question: To avoid these violations, should we build an analyzer to detect the data errors, or are there better solutions?
I built an ETL system for a company that had daily feeds of flat files containing line of business transaction data. The data was supposed to follow a documented schema but in practice there were lots of different types of violations from day to day and file to file.
We built SQL staging tables containing all nullable columns, with varchars sized larger than should ever be needed, and loaded the flat file data into these staging tables using efficient bulk-loading utilities. Then we ran a series of data consistency checks within the database to ensure that the raw (staged) data could be cross-loaded to the proper production tables.
Nothing got out of the staging table environment until all of the edits were passed.
The advantage of loading the flat files into staging tables is that you can take advantage of the RDBMS to perform set actions and to easily compare new values with existing values from previous files, all without having to build special flat file handling code.
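A minimal sketch of this staging-then-check pattern, in generic SQL with made-up table and column names (the same idea applies whether the staging tables live in Oracle or SQL Server):

-- Wide, fully nullable staging table: the bulk load itself never fails on data quality
CREATE TABLE stg_daily_feed (
    load_date   DATE,
    source_file VARCHAR(255),
    record_key  VARCHAR(100),
    amount_text VARCHAR(100),
    description VARCHAR(4000)
);

-- Consistency checks run after the bulk load, before anything is cross-loaded.
-- 1) Keys that already exist in production (the primary key violations described above)
SELECT s.record_key
FROM stg_daily_feed s
JOIN prod_transactions p ON p.record_key = s.record_key;

-- 2) Duplicate keys within the feed itself
SELECT record_key, COUNT(*) AS cnt
FROM stg_daily_feed
GROUP BY record_key
HAVING COUNT(*) > 1;

-- Only if every check returns no rows does the cross-load to production proceed
INSERT INTO prod_transactions (record_key, amount, description, load_date)
SELECT record_key, CAST(amount_text AS DECIMAL(18,2)), description, load_date
FROM stg_daily_feed;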

Grouping ETL Staging Tables With User Schemas?

I was thinking of putting staging tables and the stored procedures that update those tables into their own schema, such that when importing data from SomeTable to the data warehouse, I would run an Initial.StageSomeTable procedure which would insert the data into the Initial.SomeTable table. This way all the procs and tables dealing with the Initial staging are grouped together. Then I'd have a Validation schema for that stage of the ETL, etc.
This seems cleaner than trying to uniquely name all these very similar tables, since each table will have multiple instances of itself throughout the staging process.
Question: Is using a user schema to group tables/procs/views together an appropriate use of user schemas in MS SQL Server? Or are user schemas supposed to be used for security, such as grouping permissions together for objects?
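Concretely, the arrangement described above might look like the following sketch (the object names are the ones from the question; the bodies are just placeholders):

CREATE SCHEMA Initial;
GO
CREATE SCHEMA Validation;
GO

-- Staging copy of the source table, grouped under the Initial schema
CREATE TABLE Initial.SomeTable (
    SomeKey   INT           NULL,
    SomeValue NVARCHAR(100) NULL
);
GO

CREATE PROCEDURE Initial.StageSomeTable
AS
BEGIN
    SET NOCOUNT ON;
    TRUNCATE TABLE Initial.SomeTable;
    INSERT INTO Initial.SomeTable (SomeKey, SomeValue)
    SELECT SomeKey, SomeValue
    FROM SourceSystem.dbo.SomeTable;   -- hypothetical source location
END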
This is actually a recommended practice. Take a look at the Microsoft Business Intelligence ETL Design Practices from Project REAL; you will find that they use quite a few schemata to group and identify objects in the warehouse.
In addition to dbo and etl, they also use admin, audit, part, olap and a few more.
I think it's appropriate enough; it doesn't really matter. You could use another database if you liked, which is actually what we do.
I'm not sure why you would want a validation schema though, what are you going to do there?
Both the reasons you list (purpose/intent, security) are valid reasons to use schemas. Once you start using them, you should always specify schema when referencing an object (although I'm lazy and never specify dbo).
One trick we use is to have the same-named table in each of several schemas, combined with table partitioning (available in SQL 2005 and up). Load the data into the first schema, then once it's validated, "swap" the partition into dbo, after first swapping the dbo partition out into a "dumpster" schema copy of the table. Net production downtime is measured in seconds, and it's all carefully wrapped in a declared transaction.
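A rough sketch of the swap itself, using the non-partitioned form of ALTER TABLE ... SWITCH between identically structured tables in Load, dbo, and Dumpster schemas (the table and schema names are hypothetical; the partitioned variant described above adds a PARTITION clause):

-- Prerequisites: Load.DailySales, dbo.DailySales and Dumpster.DailySales have identical
-- structure on the same filegroup, and Dumpster.DailySales starts out empty.
BEGIN TRANSACTION;

    -- Move the current production rows out of the way (leaves dbo.DailySales empty)
    ALTER TABLE dbo.DailySales SWITCH TO Dumpster.DailySales;

    -- Move the freshly loaded, validated rows into production
    ALTER TABLE Load.DailySales SWITCH TO dbo.DailySales;

COMMIT TRANSACTION;

-- Dumpster.DailySales now holds the previous data and can be archived or truncated
TRUNCATE TABLE Dumpster.DailySales;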

Copy Database Data from Many DBs to One. Data Replication (sort of)

This involves data replication, kind of:
We have many sites with SQL Express installed, there is an 'audit' database on each site that has one table in 1st normal form (to make life simple :)
Now I need to get this table from each site and copy the contents (say, with a DateTime value > 1/1/200 00:00, but this will change, obviously) into a big 'super table' in SQL Server proper, which also has the Site Name as part of the primary key (that needs injecting in), along with the current primary key from the SQL Express table.
e.g. Many SQL Express DBs with the following table columns
ID, Definition Name, Definition Type, DateTime, Success, NvarChar1, NvarChar2 etc etc etc
And the big super table needs to have:
SiteName, ID, Definition Name, Definition Type, DateTime, Success, NvarChar1, NvarChar2 etc etc etc
Where SiteName and ID (the items originally shown in bold) are the primary key(s)
Is there a Microsoft (or non-MS, I suppose) app/tool/thing to manage copying all this data across already, or do we need to write our own?
Many thanks.
You can use SSIS (which comes with SQL Server) to populate it; it can be set up with variables to change the connection string to the various databases. I have one that loops through a whole list and does the same process using three different files from three different vendors. You could do something similar to loop through the different site databases: put the whole list of databases you want to copy the audit data from in a table and loop through it, changing the connection string each time.
However, why on earth would you want one mega audit table per site? If every table in the database populates the audit table as changes happen, then the audit table eventually becomes a huge problem for performance. Every insert, update and delete has to hit this table and then you are proposing to add an export on top of that. This seems to me to be a guaranteed structure for locking and deadlocks and all sorts of nastiness. Do yourself a favor and limit each audit table to the table it is auditing.
Things to consider:
Linked servers and sp_msforeachdb as part of a do-it-yourself solution (a rough sketch of this option follows below).
SQL Server Replication (by Microsoft) (which I believe can pull data from SQL Server Express)
SQL Server Integration Services which can pull data from SQL Server Express instances.
Personally, I would investigate Integration Services first.
Good luck.
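As a sketch of the do-it-yourself linked-server option mentioned above, assuming one linked server per site (the [SITE01] name, database name, and column types are all hypothetical), the central 'super table' and the pull for one site might look like this:

-- 'Super table' on the central server: SiteName + ID form the composite primary key
CREATE TABLE dbo.SuperAudit (
    SiteName          NVARCHAR(50)  NOT NULL,
    ID                INT           NOT NULL,
    [Definition Name] NVARCHAR(100) NULL,
    [Definition Type] NVARCHAR(50)  NULL,
    [DateTime]        DATETIME      NULL,
    Success           BIT           NULL,
    NvarChar1         NVARCHAR(255) NULL,
    NvarChar2         NVARCHAR(255) NULL,
    CONSTRAINT PK_SuperAudit PRIMARY KEY (SiteName, ID)
);
GO

DECLARE @CutoffDate DATETIME = '20000101';   -- placeholder; would be parameterized

-- Pull recent rows from one site over its linked server, injecting the site name
INSERT INTO dbo.SuperAudit (SiteName, ID, [Definition Name], [Definition Type], [DateTime], Success, NvarChar1, NvarChar2)
SELECT 'SITE01', a.ID, a.[Definition Name], a.[Definition Type], a.[DateTime], a.Success, a.NvarChar1, a.NvarChar2
FROM [SITE01].[AuditDB].[dbo].[Audit] AS a
WHERE a.[DateTime] > @CutoffDate
  AND NOT EXISTS (SELECT 1 FROM dbo.SuperAudit s
                  WHERE s.SiteName = 'SITE01' AND s.ID = a.ID);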
You could do this with SymmetricDS. SymmetricDS is open source, web-enabled, database independent, data synchronization/replication software. It uses web and database technologies to replicate tables between relational databases in near real time. The software was designed to scale for a large number of databases, work across low-bandwidth connections, and withstand periods of network outage.
As of right now, however, you would need to implement a custom IDataLoaderFilter extension point (in Java) to add the extra column. The metadata would be available though because your SiteName would be the external_id.
