I am making a daily import script. It collects data from various sources and dumps it into a CSV. The CSV is used to make an external table, and that external table is used to populate the main table with new data.
The trouble is that the schema of the external table is designed to adapt to the collected data. That is, if the collected data has a new column, the generated external table will have it too.
But the same is not true when transferring the data from the external table to the main table. For that, I was wondering if there is some built-in function, or whether a procedure can be designed to do it.
Something like: when the procedure is executed, it compares the schemas of both tables, adds any column to the main table that is not already present, and then proceeds with the insertion of the data.
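What I have in mind is something that reads both schemas from INFORMATION_SCHEMA.COLUMNS, issues an ALTER TABLE for each missing column, and then does the insert. A rough sketch of the idea in SQL Server style dynamic SQL (dbo.MainTable and dbo.ExternalTable are placeholder names, numeric precision handling is omitted, and the exact syntax will differ by platform):

-- Sketch only: add every column that exists in the external table but not in the main
-- table (placeholder names), then copy the data across.
DECLARE @Alter nvarchar(max) = N'';

SELECT @Alter = @Alter
    + N'ALTER TABLE dbo.MainTable ADD ' + QUOTENAME(ext.COLUMN_NAME) + N' ' + ext.DATA_TYPE
    + CASE WHEN ext.CHARACTER_MAXIMUM_LENGTH IS NOT NULL
           THEN N'(' + CASE WHEN ext.CHARACTER_MAXIMUM_LENGTH = -1 THEN N'max'
                            ELSE CAST(ext.CHARACTER_MAXIMUM_LENGTH AS nvarchar(10)) END + N')'
           ELSE N'' END
    + N' NULL; '
FROM INFORMATION_SCHEMA.COLUMNS AS ext
WHERE ext.TABLE_NAME = N'ExternalTable'
  AND NOT EXISTS (SELECT 1 FROM INFORMATION_SCHEMA.COLUMNS AS main
                  WHERE main.TABLE_NAME = N'MainTable'
                    AND main.COLUMN_NAME = ext.COLUMN_NAME);

EXEC sys.sp_executesql @Alter;  -- no-op when there are no new columns

-- With the schemas aligned, insert using the external table's column list
DECLARE @Cols nvarchar(max) =
    (SELECT STRING_AGG(QUOTENAME(COLUMN_NAME), N', ')
     FROM INFORMATION_SCHEMA.COLUMNS
     WHERE TABLE_NAME = N'ExternalTable');

DECLARE @Insert nvarchar(max) =
    N'INSERT INTO dbo.MainTable (' + @Cols + N') SELECT ' + @Cols + N' FROM dbo.ExternalTable;';

EXEC sys.sp_executesql @Insert;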
Is something like this doable?
Related
In Snowflake, there is a concept named Snowpipe which loads data automatically into configured tables from different data sources.
We are trying to do the normalization while loading into Snowflake via Snowpipe.
Table A:
Id & EmployerName
Table B:
Id, EmployeeName & EmployerID
Values in the file:
Name, EmployerName
Raj, Google
Kumar, Microsoft
We are unable to populate table A and table B in the same pipe, as a pipe has only one COPY statement.
Is there any concept like a dependent pipe, or another way to load the lookup table first and then load the main table from the sample file?
Note:
If we have two pipes, we are unable to specify a dependency between them.
Snowpipe should be used to load data into tables as soon as the source data is available in the cloud provider's blob storage location. You cannot set up a dependency between Snowpipes, and doing so would only add a delay to the pipeline anyway.
Your best bet is to set up two Snowpipes to load both tables as soon as data arrives in blob storage, and then use Snowflake tasks to handle the dependencies and business logic.
Just a few ideas:
Set up a Snowpipe to load into a single permanent staging area (PSA) table.
Use hash codes as the surrogate keys for the two separate tables (if you have to use surrogate keys at all). This way you don't have to do lookups for the surrogate key values.
Your tables will look like:
TableA - EmployerHash, EmployerName;
TableB - EmployeeHash, EmployeeName, EmployerHash;
Then create a task with a stored procedure that issues a multi-table insert, so that you load into the two tables at the same time using the same source query.
(https://docs.snowflake.net/manuals/sql-reference/sql/insert-multi-table.html#insert-multi-table)
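As a rough sketch of what that multi-table insert could look like, assuming the Snowpipe lands the file in a PSA table called PSA_EMPLOYEE with columns Name and EmployerName, and that MD5 is acceptable as the hash (deduplicating employers in TableA is left out of the sketch):

-- Sketch only: one source query feeding both target tables
INSERT ALL
    INTO TableA (EmployerHash, EmployerName)
        VALUES (employer_hash, employer_name)
    INTO TableB (EmployeeHash, EmployeeName, EmployerHash)
        VALUES (employee_hash, employee_name, employer_hash)
SELECT
    MD5(EmployerName) AS employer_hash,
    EmployerName      AS employer_name,
    MD5(Name)         AS employee_hash,
    Name              AS employee_name
FROM PSA_EMPLOYEE;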
If your real table structures and processing are more complex, then you can try to use Snowflake Streams and Tasks based on the PSA table.
For a detailed example, see here: https://community.snowflake.com/s/article/Building-a-Type-2-Slowly-Changing-Dimension-in-Snowflake-Using-Streams-and-Tasks-Part-1
HTH,
Gabor
I have an idea for copying to multiple tables:
You may create a stored procedure to copy the data from the source location to the target table, parameterizing the table name.
Use a task in Snowflake to schedule your stored procedure at a periodic interval.
This will populate the data in your target table at the given interval. With this option the file won't be copied immediately from your source location, though. You would have to check the options on TASK for how to get notified on each run.
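A sketch of such a task (the warehouse, schedule, procedure name and parameter below are all placeholders):

-- Sketch only: run the copy procedure every 10 minutes
CREATE OR REPLACE TASK load_target_table_task
    WAREHOUSE = load_wh
    SCHEDULE  = '10 MINUTE'
AS
    CALL copy_stage_to_target('TARGET_TABLE');

-- Tasks are created suspended, so resume it to start the schedule
ALTER TASK load_target_table_task RESUME;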
My team is creating a high-volume data processing tool. The idea is to take a 30,000-line batch file, bulk load it into a table, and then process the records using parallel processing.
The part I'm stuck on is creating dynamic tables. We want to create a new physical table, with a unique name, for each batch file that we receive. The tables will be purged from our system by a separate process after they are completed.
I have the base structure for the table, and I intend to create unique table names using a combination of a date/time stamp and a GUID (with dashes converted to underscores).
I could do this easily enough in a stored procedure but I'm wondering if there is a better way.
Here is what I have considered...
Templates in SQL Server Management Studio. This is a GUI tool built into Management Studio (Ctrl+Alt+T) that allows you to define different SQL objects, including tables, and to specify parameters. This seems like it would work; however, it appears to be purely a GUI tool and not something I could call from a stored procedure.
Stored procedure. I could put everything into a stored procedure, build my table name and schema into an nvarchar(max) string, and use sp_executesql to create the table (a sketch of this follows below). This might be the way to accomplish my goal, but I wonder if there is a better way.
Stored procedure with an existing table as a template. I could define a base table and then query sys.columns and sys.types to build a string representing the new table. This would allow me to add columns to the base table without having to update my stored procedure. I'm not sure if this is a better approach.
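For what it's worth, a minimal sketch of option 2 (the column list below is just a placeholder for the real base structure):

-- Sketch only: build a unique table name from a timestamp plus a GUID (dashes replaced
-- with underscores), then create the table through dynamic SQL.
DECLARE @TableName sysname =
      N'Batch_' + FORMAT(SYSUTCDATETIME(), 'yyyyMMdd_HHmmssfff')
    + N'_' + REPLACE(CAST(NEWID() AS nvarchar(36)), N'-', N'_');

DECLARE @Sql nvarchar(max) =
      N'CREATE TABLE dbo.' + QUOTENAME(@TableName) + N' ('
    + N' RecordId int IDENTITY(1,1) NOT NULL PRIMARY KEY,'
    + N' RawLine nvarchar(4000) NOT NULL,'
    + N' ProcessedFlag bit NOT NULL DEFAULT 0'
    + N');';

EXEC sys.sp_executesql @Sql;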
I'm wondering if any Stack Overflow folks have solved a similar requirement. What are your recommendations?
I'm using an SSIS script task to dynamically create and import staging tables on the fly from CSVs, as there are so many (30+).
For example, a table called 'Customer_03122018_1305' will be created in SQL Server, based on the name of the CSV file. How do I then insert into the actual 'real' 'Customer' table?
Please note there are other tables, e.g. 'OrderHead_03122018_1310', that will need to go into an 'OrderHead' table. Likewise for 'OrderLines_03122018_1405', etc.
I know how to perform the SQL insert, but the staging table names will be constantly changing based on the CSV date/time stamp. I'm guessing this will be a script task?
I'm thinking of using a control table, populated when I originally import the CSVs, and then looking up the real table name from it, something like the sketch below?
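A rough sketch of that control-table idea (the table and column names are made up; the script task would add a row each time it creates a staging table):

-- Sketch only: map each timestamped staging table to its "real" target table
CREATE TABLE dbo.StagingControl (
    StagingTableName sysname   NOT NULL,
    TargetTableName  sysname   NOT NULL,
    LoadedFlag       bit       NOT NULL DEFAULT 0,
    CreatedAt        datetime2 NOT NULL DEFAULT SYSUTCDATETIME()
);

-- Rows the script task would write as it creates each staging table
INSERT INTO dbo.StagingControl (StagingTableName, TargetTableName)
VALUES (N'Customer_03122018_1305',   N'Customer'),
       (N'OrderHead_03122018_1310',  N'OrderHead'),
       (N'OrderLines_03122018_1405', N'OrderLines');

-- Look up the staging tables that still need to be loaded
SELECT StagingTableName, TargetTableName
FROM dbo.StagingControl
WHERE LoadedFlag = 0;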
Any help would be appreciated.
Thanks.
You can follow the process below to dynamically load all the staging tables into the main Customer table by using a FOR loop:
1. While creating the staging tables dynamically, store all the staging table names in a single variable, separated by commas.
2. Also store the count of staging tables created in another variable.
3. Use a FOR Loop container and loop it as many times as the number of staging tables created.
4. Inside the FOR loop, use a script task to fetch the first staging table name from the list into a separate variable.
5. After the script task, still inside the FOR Loop container, add a Data Flow task and, inside it, build the OLE DB Source dynamically using the variable that stores the current staging table name from step 4.
6. Load the results from the staging table into the actual table.
7. Remove that staging table name from the variable created in step 1 (which contains all the staging table names separated by commas).
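The statement each pass through the loop effectively runs would look something like this sketch (the staging table name comes from the loop variable, and it assumes the staging and target column lists line up):

-- Sketch only: derive the real table name by stripping the date/time suffix,
-- then insert the staging rows into it.
DECLARE @StagingTable sysname = N'OrderHead_03122018_1310';  -- supplied by the loop variable
DECLARE @TargetTable  sysname = LEFT(@StagingTable, CHARINDEX(N'_', @StagingTable) - 1);

DECLARE @Sql nvarchar(max) =
      N'INSERT INTO dbo.' + QUOTENAME(@TargetTable)
    + N' SELECT * FROM dbo.' + QUOTENAME(@StagingTable) + N';';

EXEC sys.sp_executesql @Sql;  -- assumes matching column order and no IDENTITY column in the target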
We have a large production MSSQL database (the mdf is approx. 400 GB) and I have a test database. All the tables, indexes, views, etc. are the same in both. I need to make sure that the data in the tables of these two databases stays consistent, so I need to insert all the new rows and update all the changed rows into the test DB from production every night.
I came up with the idea of using SSIS packages to keep the data consistent by checking for updated and new rows in all the tables. My SSIS flow is:
I have a separate package in SSIS for each table. The steps, in order, are:
1. I get the timestamp value in the table in order to fetch only the last day's rows instead of the whole table.
2. I get those rows of the table from production.
3. Then I use the 'Lookup' component to compare this data with the test database table's data.
4. Then I use a Conditional Split to work out whether each row is new or updated.
5.1. If the row is new, I insert it into the destination.
5.2. If the row is updated, I update it in the destination table.
The data flow is shown in the MTRule and STBranch packages in the picture.
The problem is that I am repeating this whole flow for each table, and I have more than 300 tables like this. It takes hours and hours :(
What I'm asking is:
Is there any way in SSIS to do this dynamically?
PS: Every single table has its own columns and PK values, but my data flow schema is always the same (shown below).
You can look into BiMLScript, which lets you create packages dynamically based on metadata.
I believe the best way to achieve this is to use Expressions. They empower you to dynamically set the source and destination.
One possible solution might be as follows:
Create a table which stores all your table names and PK columns.
Define a package which loops through this table and parses a SQL statement for each row.
Call your main package and pass the statement to it.
Use the statement as the data source for your Data Flow.
If applicable, pass the destination table as a parameter as well (another column in your config table).
This is how I processed several really huge tables: the data had to be fetched from 20 tables and moved to one single table.
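A sketch of what that configuration table and the parsed statement might look like (the LastUpdated column is an assumption standing in for whatever timestamp column your tables have):

-- Sketch only: one row per table to be synced
CREATE TABLE dbo.SyncConfig (
    TableName        sysname       NOT NULL,
    KeyColumns       nvarchar(400) NOT NULL,  -- e.g. N'OrderId' or N'OrderId, LineNo'
    DestinationTable sysname       NOT NULL
);

-- The looping package would read this and pass SourceStmt and DestinationTable
-- to the main package for each row
SELECT N'SELECT * FROM dbo.' + QUOTENAME(TableName)
     + N' WHERE LastUpdated >= DATEADD(day, -1, SYSUTCDATETIME())' AS SourceStmt,
       DestinationTable
FROM dbo.SyncConfig;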
Why do you need to use SSIS?
You are better off writing a stored procedure that takes the table name as a parameter and does your CRUD there. Then call the stored procedure from a Foreach Loop container in SSIS.
In fact, you might be able to do everything with a stored procedure and schedule it in a SQL Agent job.
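As a sketch of what that per-table procedure could execute for one hypothetical table (dbo.Customer with primary key CustomerId and a LastModified column; the database names are made up, and the dynamic version would build this statement from the table name parameter):

-- Sketch only: upsert the last day's changed rows from production into test
MERGE TestDb.dbo.Customer AS tgt
USING (
    SELECT CustomerId, CustomerName, LastModified
    FROM ProdDb.dbo.Customer
    WHERE LastModified >= DATEADD(day, -1, SYSUTCDATETIME())
) AS src
    ON tgt.CustomerId = src.CustomerId
WHEN MATCHED THEN
    UPDATE SET tgt.CustomerName = src.CustomerName,
               tgt.LastModified = src.LastModified
WHEN NOT MATCHED THEN
    INSERT (CustomerId, CustomerName, LastModified)
    VALUES (src.CustomerId, src.CustomerName, src.LastModified);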
I have a desktop application through which data is entered, and it is captured in an MS Access DB. The application is used by multiple users (at different locations). The idea is to download the data entered for a particular day into an Excel sheet and load it into a centralized server, which is an MS SQL Server instance.
I.e., data (in the form of Excel sheets) will come from multiple locations and be saved into a shared folder on the server, from where it needs to be loaded into SQL Server.
There is an ID column with IDENTITY in the SQL Server table, which is the primary key column, and there are no other columns in the table that contain unique values. Though the data comes from multiple sources, we need to maintain a single auto-incrementing series (IDENTITY).
Suppose there are 2 sources:
Source1: Has 100 records entered for the day.
Source2: Has 200 records entered for the day.
When they get loaded into the destination (SQL Server), the table should have 300 records, with ID column values from 1 to 300.
Also, for the next day, when data comes from the sources, the destination has to continue loading from ID value 301.
The issue is that there may be requests to change data at a source that has already been loaded into the central server. How do I update that row in the central server, given that the ID column value will not be the same in the source and the destination? As mentioned earlier, ID is the only unique column in the table.
Please suggest some ideas for doing this, or whether I have to take a different approach to accomplish this task.
Thanks in advance!
Krishna
Okay, so first I would suggest .NET, doing it through a file stream reader and dumping it into the disconnected layer of ADO.NET as a DataSet with multiple DataTables from the different sources. But... you mentioned SSIS, so I will go that route.
Create an SSIS project in Business Intelligence Development Studio (BIDS).
If you know for a fact you are just importing a bunch of Excel files, I would create either many 'Data Flow Tasks' or many source-to-destination flows in a single 'Data Flow Task'; up to you.
a. Personally, I would create a table in the database for each Excel file location and have their columns map up. I will explain why later.
b. In a data flow task, select 'Excel Source' as the source. Put in the appropriate file location under 'New connection' by double-clicking the Excel Source.
c. Choose an ADO NET Destination and drag the blue line from the Excel Source to this endpoint.
d. Map your destination to the corresponding table in SQL.
e. Repeat as needed for each Excel source.
Set up the SSIS package to run automatically from SQL Server through SQL Server Management Studio. Remember to connect to an Integration Services instance, not a database instance.
Okay, now you have a bunch of tables instead of one big one, right? I did that for a reason: these should just be entry points, and I would leave the logic for determining dupes and import time to another table.
I would set up another two tables for the combination logic and for auditing later.
a. Create a table like 'Imports' or similar; have the columns be the same as the staging tables, but add more columns to it: 'ExcelFileLocation' and 'DateImported'. Also create an 'identity' column as the first column, have it seed on the default of (1,1), and assign it as the primary key.
b. Create a second table like 'ImportDupes' or similar, and repeat the process above for the columns.
c. Create a unique constraint on the first table on the column or set of columns that make an imported row unique.
d. Write a procedure in SQL to insert from the MANY tables that match up to the Excel files into the ONE 'Imports' table. For each of those inserts, do something similar to:
BEGIN TRY
    -- Insert the staged rows into the central Imports table
    INSERT INTO Imports (datacol1, datacol2, ExcelFileLocation, DateImported)
    SELECT datacol1, datacol2, N'<location of file>', GETDATE()  -- replace the placeholder with the actual file path
    FROM TableExcel1;
END TRY
BEGIN CATCH
    -- If the insert breaks the unique constraint, put the rows into the second table instead
    INSERT INTO ImportDupes (datacol1, datacol2, ExcelFileLocation, DateImported)
    SELECT datacol1, datacol2, N'<location of file>', GETDATE()
    FROM TableExcel1;
END CATCH;

-- Repeat the above for EACH Excel staging table

-- Clean up the individual staging tables for the next import cycle (for EACH Excel table)
TRUNCATE TABLE TableExcel1;
e. Automate the procedure to run on a schedule.
You now have two tables, one for successful imports and one for duplicates.
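For reference, a minimal sketch of the 'Imports' table and the unique constraint from steps a and c (datacol1/datacol2 stand in for the real columns):

-- Sketch only: the central import table with the audit columns and identity key
CREATE TABLE dbo.Imports (
    ImportId          int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    datacol1          nvarchar(200) NULL,
    datacol2          nvarchar(200) NULL,
    ExcelFileLocation nvarchar(400) NOT NULL,
    DateImported      datetime      NOT NULL DEFAULT GETDATE()
);

-- The constraint that diverts duplicates into the CATCH block above
ALTER TABLE dbo.Imports
    ADD CONSTRAINT UQ_Imports_Data UNIQUE (datacol1, datacol2);

The 'ImportDupes' table from step b would have the same shape, just without the unique constraint.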
The reason I did it this way is twofold:
You often need to know more than just the data itself: when it came in, what source it came from, whether it was a duplicate, and, if you do this for millions of rows, whether it can be indexed easily.
This model is easier to take apart and automate. It may be more work to set up, but if a piece breaks you can see where, and you can easily stop the import for one location by turning off the code in that section.