How to insert data into a table such that possible extra columns in data get added to the parent table? - sql-server

I'm trying to insert daily imported data into a SQL Server (2017) table. While most of the time the imported data has a fixed number of columns, sometimes the client wants to add a new column to the data to be imported.
I'm looking for a solution where, when the data gets imported (whether from another table, from R or from .csv files, it doesn't matter), SQL Server automatically adds the missing (extra) column to the parent table, taking over the column name and assigning NULL to all previous entries.
I've tried both UNION ALL and BULK INSERT, but both require the same number of columns. I'm working with SSMS 2017 and R 3.4.1.
Next, I tried with a staging table and modifying the UNION clause as:
SELECT * FROM Table_new
UNION ALL
SELECT Tp.*, '' FROM Table_parent Tp;
But more often than not there is no extra column, so the column-count mismatch occurs again.
I also thought about running the queries from R with DBI and odbc's dbWriteTable(), handling the invalid column error with tryCatch(), parsing the column name from the error message and so on, but this would be the shakiest thing I've ever built and I would prefer not to.
Ultimately I thought of adding an if clause in R and, depending on the number of new columns, looping to add the ', ""' part to the SQL query to create the extra columns. I'm convinced this is too complex a solution for this problem.
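# Pseudo-R; `con` is assumed to be an open DBI connection
# Difference between the number of columns in the new and the parent data
n_diff <- length(colnames_new) - length(colnames_parent)
if (n_diff == 0) {
  dbExecute(con, "INSERT INTO Table_parent SELECT * FROM Table_new;")
} else if (n_diff > 0) {
  # New data has extra columns: pad the parent rows with '' once per extra column
  dbGetQuery(con, paste0("SELECT * FROM Table_new
    UNION ALL
    SELECT Tp.*", strrep(", ''", n_diff), " FROM Table_parent Tp;"))
} else {
  # Parent has extra columns: pad the new rows instead
  dbGetQuery(con, paste0("SELECT * FROM Table_parent
    UNION ALL
    SELECT Tn.*", strrep(", ''", -n_diff), " FROM Table_new Tn;"))
}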
To summarize: when inserting data to SQL table, how to (automatically) append the columns in the parent table, when necessary? Thanks!

The things in your database such as tables, columns, primary keys, foreign keys, check clauses are all part of the database schema. People design the schema before adding data to the database.
If you want to add new columns then you have to redesign your schema. When you do this you will also have to rewrite some of the CRUD procedures.
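If the schema change does have to be automated, one possible shape is to compare the column lists of a staging table and the parent table, add whatever is missing, and then insert by column name. The sketch below is only an illustration: it uses Python with an in-memory SQLite database as a stand-in, the helper names add_missing_columns and append_staging are made up for this example, and on SQL Server the column lists would come from something like INFORMATION_SCHEMA.COLUMNS instead of PRAGMA table_info.

```python
import sqlite3

def add_missing_columns(con, parent, staging):
    """Add to `parent` every column that exists only in `staging`; return their names."""
    def columns(table):
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk)
        return {row[1]: row[2] for row in con.execute(f'PRAGMA table_info("{table}")')}

    parent_cols, staging_cols = columns(parent), columns(staging)
    added = [name for name in staging_cols if name not in parent_cols]
    for name in added:
        # Existing parent rows get NULL in the newly added column
        con.execute(f'ALTER TABLE "{parent}" ADD COLUMN "{name}" {staging_cols[name]}')
    return added

def append_staging(con, parent, staging):
    """Insert all staging rows into parent, matching columns by name."""
    add_missing_columns(con, parent, staging)
    cols = ", ".join(f'"{row[1]}"' for row in con.execute(f'PRAGMA table_info("{staging}")'))
    con.execute(f'INSERT INTO "{parent}" ({cols}) SELECT {cols} FROM "{staging}"')

# Hypothetical demo data: the new batch carries an extra "region" column
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Table_parent (id INTEGER, amount REAL)")
con.execute("INSERT INTO Table_parent VALUES (1, 10.0)")
con.execute("CREATE TABLE Table_new (id INTEGER, amount REAL, region TEXT)")
con.execute("INSERT INTO Table_new VALUES (2, 20.0, 'north')")

append_staging(con, "Table_parent", "Table_new")
rows = con.execute("SELECT id, amount, region FROM Table_parent ORDER BY id").fetchall()
print(rows)
```

The old row ends up with NULL in the new column. Any procedures that read or write the parent table still have to be reviewed after such a change, which is the point made above.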

Related

Creating PL/SQL procedure to fill intermediary table with random data

As part of my classes on relational databases, I have to create procedures as part of package to fill some of the tables of an Oracle database I created with random data, more specifically the tables community, community_account and community_login_info (see ERD linked below). I succeeded in doing this for tables community and community_account, however I'm having some problems with generating data for table community_login_info. This serves as an intermediary table between the many to many relationship of community and community_account, linking the id's of both tables.
My latest approach was to create an associative array with the structure of the target table community_login_info. I then do a cross join of community and community_account (there's already random data in there) along with random timestamps, bulk collect that result into the associative array variable and then insert its contents into the target table community_login_info. But it seems I'm doing something wrong, since Oracle returns error ORA-00947 'not enough values'. To me it seems that all columns of the target table get a value in the insert, so what am I missing here? I added the code from my package body below.
ERD snapshot
PROCEDURE mass_add_rij_koppeling_community_login_info
IS
TYPE type_rec_communties_accounts IS RECORD
(type_community_id community.community_id%type,
type_account_id community_account.account_id%type,
type_start_timestamp_login community_account.start_timestamp_login%type,
type_eind_timestamp_login community_account.eind_timestamp_login%type);
TYPE type_tab_communities_accounts
IS TABLE of type_rec_communties_accounts
INDEX BY pls_integer;
t_communities_accounts type_tab_communities_accounts;
BEGIN
SELECT community_id,account_id,to_timestamp(start_datum_account) as start_timestamp_login, to_timestamp(eind_datum_account) as eind_timestamp_login
BULK COLLECT INTO t_communities_accounts
FROM community
CROSS JOIN community_account
FETCH FIRST 50 ROWS ONLY;
FORALL i_index IN t_communities_accounts.first .. t_communities_accounts.last
SAVE EXCEPTIONS
INSERT INTO community_login_info (community_id,account_id,start_timestamp_login,eind_timestamp_login)
values (t_communities_accounts(i_index));
END mass_add_rij_koppeling_community_login_info;
Your error refers to the part:
INSERT INTO community_login_info (community_id,account_id,start_timestamp_login,eind_timestamp_login)
values (t_communities_accounts(i_index));
(By the way, the complete error message includes the line number where the error occurred, which helps to narrow down the problem.)
When you list the columns to insert, you also need to list the individual record fields in the VALUES clause (the fields carry the type_ prefix used in your record declaration):
INSERT INTO community_login_info (community_id,account_id,start_timestamp_login,eind_timestamp_login)
VALUES (t_communities_accounts(i_index).type_community_id,
t_communities_accounts(i_index).type_account_id,
t_communities_accounts(i_index).type_start_timestamp_login,
t_communities_accounts(i_index).type_eind_timestamp_login);
If the record has exactly the same fields as the columns of COMMUNITY_LOGIN_INFO, in the same order, you could insert the whole record instead (note: no parentheses around the record):
INSERT INTO community_login_info
VALUES t_communities_accounts(i_index);
But I don't like performing inserts without specifying the columns. If the record fields aren't defined in exactly the same order as the table columns, I could end up inserting the start time into the end time and vice versa. And if the table definition changes over time and new columns are added, you have to modify the procedure to include the new column, even if this procedure never fills it and it just stays NULL.
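The ordering hazard described above is easy to reproduce outside Oracle. The following is a small illustrative sketch in Python with an in-memory SQLite table (not PL/SQL); the table layout mirrors the column names from the question and the sample values are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE community_login_info (
    community_id INTEGER, account_id INTEGER,
    start_timestamp_login TEXT, eind_timestamp_login TEXT)""")

# A record whose fields are NOT in table-definition order (end before start)
record = {"community_id": 1, "account_id": 7,
          "eind_timestamp_login": "2020-01-01 10:00",
          "start_timestamp_login": "2020-01-01 09:00"}

# Positional insert: values land in table-definition order, so start and end get swapped
con.execute("INSERT INTO community_login_info VALUES (?, ?, ?, ?)",
            tuple(record.values()))

# Named insert: each value is bound to the column it belongs to
con.execute("""INSERT INTO community_login_info
    (community_id, account_id, start_timestamp_login, eind_timestamp_login)
    VALUES (:community_id, :account_id, :start_timestamp_login, :eind_timestamp_login)""",
            record)

positional, named = con.execute(
    "SELECT start_timestamp_login, eind_timestamp_login "
    "FROM community_login_info ORDER BY rowid").fetchall()
print(positional, named)
```

Both inserts succeed without any error, which is exactly why the positional form is risky.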
PROCEDURE mass_add_rij_koppeling_community_login_info
IS
TYPE type_rec_communties_accounts IS RECORD
(type_community_id community.community_id%type,
type_account_id community_account.account_id%type,
type_start_timestamp_login community_account.start_timestamp_login%type,
type_eind_timestamp_login community_account.eind_timestamp_login%type);
TYPE type_tab_communities_accounts
IS TABLE of type_rec_communties_accounts
INDEX BY pls_integer;
t_communities_accounts type_tab_communities_accounts;
BEGIN
SELECT community_id,account_id,to_timestamp(start_datum_account) as start_timestamp_login, to_timestamp(eind_datum_account) as eind_timestamp_login
BULK COLLECT INTO t_communities_accounts
FROM community
CROSS JOIN community_account
FETCH FIRST 50 ROWS ONLY;
FORALL i_index IN t_communities_accounts.first .. t_communities_accounts.last
SAVE EXCEPTIONS
INSERT INTO community_login_info (community_id,account_id,start_timestamp_login,eind_timestamp_login)
values (select community_id,account_id,start_timestamp_login,eind_timestamp_login
from table(cast(t_communities_accounts as type_tab_communities_accounts)) a);
END mass_add_rij_koppeling_community_login_info;

SSIS data flow - copy new data or update existing

I query some data from table A (source) based on certain conditions and insert it into a temp table (destination) before upserting it into CRM.
If the data already exists in CRM, I don't want to query it from table A and insert it into the temp table (I want this table to stay empty) unless that data has been updated or new data has been created. So basically I want to query only new data, or data from table A that already exists in CRM and has been modified. At the moment my data flow is like this:
clear temp table - delete sql statement
Query from source table A and insert into temp table.
From temp table insert into CRM using script component.
In source table A I have audit columns: createdOn and modifiedOn.
I found one way to do this, "SSIS DataFlow - copy only changed and new records", but I'm not really clear on how to do it.
What is the best and simplest way to achieve this?
The link you posted is basically saying to stage everything and use a MERGE to update your table (essentially an UPDATE/INSERT).
The only way I can really think of to make your process significantly quicker by selecting only part of table A would be to add a "last updated" timestamp to table A and enforce that it is always up to date.
One way to do this is with a trigger; see here for an example.
You could then select based on that timestamp, perhaps keeping a record of the last timestamp used each time you run the SSIS package, and then adding a margin of safety to that.
Edit: I just saw that you already have a modifiedOn column, so you could use that as described above.
Examples:
There are a few different ways you could do it:
ONE
Include the modifiedOn column in your final destination table.
You can then build a dynamic query for your data flow source in an SSIS string variable, something like:
"SELECT * FROM [table A] WHERE modifiedOn >= DATEADD(DAY, -1, '" + @[User::MaxModifiedOnDate] + "')"
@[User::MaxModifiedOnDate] (string variable) would come from an Execute SQL Task, where you would write the result of the following query to it:
SELECT FORMAT(CAST(MAX(modifiedOn) AS date), 'yyyy-MM-dd') MaxModifiedOnDate FROM DestinationTable
The DATEADD part, as well as the CAST to a certain degree, represent your margin of safety.
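To make the watermark logic of option ONE concrete, here is a rough sketch in Python against an in-memory SQLite database rather than SSIS. The function name rows_to_stage, the safety_days parameter and the sample rows are assumptions made for the illustration; the table and column names follow the answer.

```python
import sqlite3
from datetime import date, timedelta

def rows_to_stage(con, safety_days=1):
    """Return source rows modified on or after (latest loaded date - safety margin)."""
    (max_loaded,) = con.execute(
        "SELECT MAX(date(modifiedOn)) FROM DestinationTable").fetchone()
    if max_loaded is None:
        # Nothing has been loaded yet: stage everything
        return con.execute("SELECT id, modifiedOn FROM TableA ORDER BY id").fetchall()
    # Truncating to a date and stepping back a day is the margin of safety
    cutoff = (date.fromisoformat(max_loaded) - timedelta(days=safety_days)).isoformat()
    return con.execute(
        "SELECT id, modifiedOn FROM TableA WHERE modifiedOn >= ? ORDER BY id",
        (cutoff,)).fetchall()

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE TableA (id INTEGER PRIMARY KEY, modifiedOn TEXT)")
con.execute("CREATE TABLE DestinationTable (id INTEGER PRIMARY KEY, modifiedOn TEXT)")
con.executemany("INSERT INTO TableA VALUES (?, ?)", [
    (1, "2024-03-01 12:00:00"),   # loaded long ago
    (2, "2024-03-10 08:00:00"),   # loaded in the last run
    (3, "2024-03-11 07:00:00"),   # new since the last run
])
con.executemany("INSERT INTO DestinationTable VALUES (?, ?)", [
    (1, "2024-03-01 12:00:00"),
    (2, "2024-03-10 08:00:00"),
])

staged = rows_to_stage(con)
print(staged)
```

Row 2 is staged again because of the safety margin, so the later upsert step has to tolerate rows it has already seen.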
TWO
If this isn't an option, you could keep a data load history table that would tell you when you need to load from, e.g.:
CREATE TABLE DataLoadHistory
(
DataLoadID int PRIMARY KEY IDENTITY
, DataLoadStart datetime NOT NULL
, DataLoadEnd datetime
, Success bit NOT NULL
)
You would begin each data load with this (Execute SQL Task):
CREATE PROCEDURE BeginDataLoad
@DataLoadID int OUTPUT
AS
INSERT INTO DataLoadHistory
(
DataLoadStart
, Success
)
VALUES
(
GETDATE()
, 0
)
SELECT @DataLoadID = SCOPE_IDENTITY()
You would store the returned DataLoadID in an SSIS integer variable, and use it when the data load is complete as follows:
CREATE PROCEDURE DataLoadComplete
@DataLoadID int
AS
UPDATE DataLoadHistory
SET
DataLoadEnd = GETDATE()
, Success = 1
WHERE DataLoadID = @DataLoadID
When it comes to building your query for table A, you would do it the same way as before (with the dynamically generated SQL query), except MaxModifiedOnDate would come from the following query:
SELECT FORMAT(CAST(MAX(DataLoadStart) AS date), 'yyyy-MM-dd') MaxModifiedOnDate FROM DataLoadHistory WHERE Success = 1
So the DataLoadHistory table, rather than your destination table.
Note that this would fail on the first run, as there'd be no successful entries in the history table, so you'd need to insert a dummy record, or find some other way around it.
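The load-history bookkeeping of option TWO can be sketched end to end as follows. This is Python with in-memory SQLite, not T-SQL or SSIS; the function names and the default watermark of 1900-01-01 (one possible way around the empty-history case) are assumptions for the example, and timestamps are passed in explicitly so the run is reproducible.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE DataLoadHistory (
    DataLoadID INTEGER PRIMARY KEY AUTOINCREMENT,
    DataLoadStart TEXT NOT NULL,
    DataLoadEnd TEXT,
    Success INTEGER NOT NULL)""")

def begin_data_load(con, now):
    """Record the start of a load as not yet successful; return its ID."""
    cur = con.execute(
        "INSERT INTO DataLoadHistory (DataLoadStart, Success) VALUES (?, 0)", (now,))
    return cur.lastrowid

def data_load_complete(con, load_id, now):
    """Mark a load as finished and successful."""
    con.execute(
        "UPDATE DataLoadHistory SET DataLoadEnd = ?, Success = 1 WHERE DataLoadID = ?",
        (now, load_id))

def last_successful_start(con, default="1900-01-01"):
    """Date of the latest successful load start, or `default` if there is none."""
    (value,) = con.execute(
        "SELECT MAX(date(DataLoadStart)) FROM DataLoadHistory WHERE Success = 1"
    ).fetchone()
    return value if value is not None else default

first_watermark = last_successful_start(con)        # empty history: falls back to default
load1 = begin_data_load(con, "2024-03-10 02:00:00")
data_load_complete(con, load1, "2024-03-10 02:05:00")
load2 = begin_data_load(con, "2024-03-11 02:00:00")  # never completed, e.g. it failed
watermark = last_successful_start(con)
print(first_watermark, watermark)
```

Because the failed load on 2024-03-11 is ignored, the next run picks up from the last successful start.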
THREE
I've seen it done a lot where, say your data load is running every day, you would just stage the last 7 days, or something like that, some margin of safety that you're pretty sure will never be passed (because the process is being monitored for failures).
It's not my preferred option, but it is simple, and can work if you're confident in how well the process is being monitored.

Get a list of columns and widths for a specific record

I want a list of properties about a given table and for a specific record of data from that table - in one result
Something like this:
Column Name , DataLength, SchemaLengthMax
...and for only one record (based on a where filter)
So what I'm thinking is something like this:
- Get a list of columns from sys.columns and also the schema-based maxlength value
- populate column names into a temp table that includes (column_name, data_length, schema_size_max)
- now loop over that temp table and for each column name, fetch the data for that column based on a specific record, then update the temp table with the length of this data
- finally, select from the temp table
sound reasonable?
Yup. That way works. Not sure if it's the best, since it involves one iteration per column along with the where condition on the source table.
Consider this, instead :
1. Get the candidate records into a temporary table after applying the where condition. Make sure to get a primary key. If there is no primary key, get a rowid. (assuming SQL Server 2005 or above).
2. Create a temporary table (Say, #RecValueLens) that has three columns : Primary_key_Value, MyColumnName, MyValueLen
3. Loop through the list of column names (after taking only the column names into another temporary table) and build sql statement shown in Step 4.
4. Insert Into #RecValueLens (Primary_Key_Value, MyColumnName, MyValueLen)
Select Max(Primary_Key_Goes_Here), Max('Column_Name_Goes_Here') as ColumnName, Len(Max(Column_Name)) as ValueMyLen From Source_Table_Goes_Here
Group By Primary_Key_Goes_Here
So, if there are 10 columns, you will have 10 insert statements. You could either insert them into a temporary table and run it as a loop. If the number of columns is few, you could concatenate all statements into a single batch.
5. Run the SQL Statement(s) from above. So, you have Record-wise, column-wise, Value lengths. What is left is to get the column definition.
6. Get the column definition from sys.columns into a temporary table and join with the #RecValueLens to get the output.
Do you want me to write it for you ?
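For illustration only, here is how the same result (column name, data length, schema definition) could be produced for a single record. The sketch uses Python with in-memory SQLite instead of SQL Server, and it swaps the per-column loop for one generated SELECT that computes all lengths in a single pass. The function name column_lengths and the Customer table are invented; in SQL Server the schema side would come from sys.columns (max_length) rather than PRAGMA table_info.

```python
import sqlite3

def column_lengths(con, table, key_column, key_value):
    """Return (column name, length of the value in one record, declared type) per column."""
    info = con.execute(f'PRAGMA table_info("{table}")').fetchall()
    names = [row[1] for row in info]
    declared = {row[1]: row[2] for row in info}
    # Build one LENGTH(...) expression per column from the schema metadata
    exprs = ", ".join(f'LENGTH("{n}")' for n in names)
    lengths = con.execute(
        f'SELECT {exprs} FROM "{table}" WHERE "{key_column}" = ?', (key_value,)
    ).fetchone()
    if lengths is None:
        return []  # no record matches the filter
    return [(n, length, declared[n]) for n, length in zip(names, lengths)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Customer (id INTEGER PRIMARY KEY, name VARCHAR(50), note VARCHAR(200))")
con.execute("INSERT INTO Customer VALUES (1, 'Alice', NULL)")

result = column_lengths(con, "Customer", "id", 1)
print(result)
```

A NULL value yields a NULL length, so empty columns remain distinguishable from zero-length strings.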

Merge query using two tables in SQL server 2012

I am very new to SQL and SQL server, would appreciate any help with the following problem.
I am trying to update a share price table with new prices.
The table has three columns: share code, date, price.
The share code + date = PK
As you can imagine, if you have thousands of share codes and 10 years' data for each, the table can get very big. So I have created a separate table called a share ID table, and use a share ID instead in the first table (I was reliably informed this would speed up the query, as searching by integer is faster than string).
So, to summarise, I have two tables as follows:
Table 1 = Share_code_ID (int), Date, Price
Table 2 = Share_code_ID (int), Share_name (string)
So let's say I want to update the table/s with today's price for share ZZZ. I need to:
Look for the Share_code_ID corresponding to 'ZZZ' in table 2
If it is found, update table 1 with the new price for that date, using the Share_code_ID I just found
If the Share_code_ID is not found, update both tables
Let's ignore for now how the Share_code_ID is generated for a new code, I'll worry about that later.
I'm trying to use a merge query loosely based on the following structure, but have no idea what I am doing:
MERGE INTO [Table 1]
USING (VALUES (1,23-May-2013,1000)) AS SOURCE (Share_code_ID,Date,Price)
{ SEEMS LIKE THERE SHOULD BE AN INNER JOIN HERE OR SOMETHING }
ON Table 2 = 'ZZZ'
WHEN MATCHED THEN UPDATE SET Table 1.Price = 1000
WHEN NOT MATCHED THEN INSERT { TO BOTH TABLES }
Any help would be appreciated.
http://msdn.microsoft.com/library/bb510625(v=sql.100).aspx
You use Table1 as the target table and Table2 as the source table.
You want to take action when the given ID is not found in Table2, that is, in the source table.
In the documentation, which you have already read, that corresponds to the clause
WHEN NOT MATCHED BY SOURCE ... THEN <merge_matched>
and the latter corresponds to
<merge_matched>::=
{ UPDATE SET <set_clause> | DELETE }
Ergo, you cannot insert into the source table there.
You could use triggers for auto-insertion when you insert something into Table1, but a trigger would not be able to insert the proper Share_name, because it simply won't know it.
So you have two options, I guess.
1) Write a T-SQL code block; look into stored procedures. I think there is also a construct for executing an anonymous code block in MS SQL Server, like the EXECUTE BLOCK command in Firebird, but I don't know for sure.
2) Create an updatable SQL view joining Table1 and Table2 to show the most recent date, so that when you insert a row into this view, the view's on-insert trigger actually inserts rows into both tables. And when you update the data in the view, the on-update trigger modifies the data.
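The procedural route (option 1) boils down to three steps in one transaction: look up the share ID, create it if it is missing, then update or insert the price. The following is a hedged sketch of that flow in Python with in-memory SQLite, not T-SQL; the table names ShareCodes and SharePrices stand in for Table 2 and Table 1, and the auto-generated ID is just one possible answer to the ID question that was set aside.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ShareCodes (Share_code_ID INTEGER PRIMARY KEY, Share_name TEXT UNIQUE)")
con.execute("""CREATE TABLE SharePrices (
    Share_code_ID INTEGER, Date TEXT, Price REAL,
    PRIMARY KEY (Share_code_ID, Date))""")

def upsert_price(con, share_name, price_date, price):
    """Look up (or create) the share ID, then update or insert the price for that date."""
    with con:  # commit on success, roll back on error
        row = con.execute(
            "SELECT Share_code_ID FROM ShareCodes WHERE Share_name = ?",
            (share_name,)).fetchone()
        if row is None:
            share_id = con.execute(
                "INSERT INTO ShareCodes (Share_name) VALUES (?)", (share_name,)).lastrowid
        else:
            share_id = row[0]
        updated = con.execute(
            "UPDATE SharePrices SET Price = ? WHERE Share_code_ID = ? AND Date = ?",
            (price, share_id, price_date)).rowcount
        if updated == 0:
            con.execute(
                "INSERT INTO SharePrices (Share_code_ID, Date, Price) VALUES (?, ?, ?)",
                (share_id, price_date, price))
    return share_id

first_id = upsert_price(con, "ZZZ", "2013-05-23", 1000)   # new code: inserts into both tables
second_id = upsert_price(con, "ZZZ", "2013-05-23", 1010)  # known code: updates the price
prices = con.execute("SELECT Share_code_ID, Date, Price FROM SharePrices").fetchall()
print(first_id, second_id, prices)
```

In SQL Server the same steps would typically live in a stored procedure, with a MERGE or UPDATE/INSERT pair doing the price part.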

Copy table data and populate new columns

So I'm trying to copy some data from one database table to another. The problem, though, is that the target table has 2 new columns that are required. I wanted to use the export/import wizard in SQL Server Management Studio, but if I use that I will need to write a query for each table and I can only execute 1 query at a time. I was wondering if there is a more efficient way of doing it.
Here's an example of 1 table:
dbase1.dbo.Appointment { id, name, description, createdate }
dbase2.dbo.Appointment { id, name, description, createdate, auditby, auditat}
I have a total of 8 tables with those 2 additional columns. and most of them are related to each other via fk, so I wanted to use the wizard as it figures out which table gets inserted first. The problem with that is, it only works if I do a "copy data from one or more tables " and not the "write a query to specify data" (I use this to populate those two new columns).
I've been copying data with this very slow process because I'm using MVC Code First for my application and I don't have access to the server to drop and create the tables at my leisure. So I have to resort to this to keep the data that I already have.
An idea: temporarily disable the foreign key constraints in the destination database. Then it doesn't matter what order you run your inserts. In order to populate the two new and required columns, you just need to pick some stock values to put in there (since obviously these rows initially are not subject to initial auditing). For example:
INSERT dbase2.dbo.appointment
(id, name, description, createdate, auditby, auditat)
SELECT id, name, description, createdate,
auditby = 'me', auditat = GETDATE()
FROM dbo.appointment;
Since it seems the challenge is merely that the destination requires columns that aren't in the source, and that you need to determine what should be populated in these audit columns, this seems to solve multiple problems at once. You just need to figure out what to put in there instead of 'me' and GETDATE().
(To get the wizard to pull these 8 tables for you, you might be able to create a view similar to the select portion of the above query, but that's more work and it won't see the underlying FK constraints to generate them in the right order anyway.)
Write the SQL query for each of the insert processes in the order you want them to run. That would be the simplest approach.
Set default values for these two columns:
for AuditAt, the default date, i.e. GETDATE();
for AuditBy, the person ID/name.
Now you can insert into these tables without supplying values for these two columns.
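A quick way to see the effect of column defaults is the sketch below, written in Python with in-memory SQLite rather than SQL Server. The src_ and dst_ table names stand in for the two databases, the default value 'migration' is an arbitrary placeholder, and CURRENT_TIMESTAMP plays the role of GETDATE().

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE src_Appointment (id INTEGER, name TEXT, description TEXT, createdate TEXT)")
# The two required audit columns get defaults, so they may be omitted on insert
con.execute("""CREATE TABLE dst_Appointment (
    id INTEGER, name TEXT, description TEXT, createdate TEXT,
    auditby TEXT NOT NULL DEFAULT 'migration',
    auditat TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)""")
con.execute("INSERT INTO src_Appointment VALUES (1, 'Checkup', 'Annual', '2013-01-15')")

# Copy only the columns that exist in the source; the defaults fill in the rest
con.execute("""INSERT INTO dst_Appointment (id, name, description, createdate)
    SELECT id, name, description, createdate FROM src_Appointment""")

copied = con.execute(
    "SELECT id, name, auditby, auditat IS NOT NULL FROM dst_Appointment").fetchall()
print(copied)
```

With defaults in place, a plain table-to-table copy no longer needs a hand-written query per table.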
