Bulk Insert MAXERRORS - sql-server

Is there any way to use the Bulk Insert statement and disable MAXERRORS?
I would like to allow for an infinite number of errors, as the number of errors can be high in the file I'm bulk inserting (I don't have control of this file, and am currently working with the vendor to fix their issues on certain rows).
If there isn't a way to disable it, what is the maximum number that MAXERRORS can handle? Is it 2147483647?

Normally, when I import data from external sources, I am very wary of problems in the data. SQL Server offers several solutions. Many people use SSIS; I avoid it. One reason is the trouble it has opening an Excel file that is already open in a user's session. It has a few other shortcomings as well.
In any case, once the data is in a text file, I create a staging table that has all the columns of the original table with the data type varchar(8000). This tends to be larger than necessary, but should be sufficient.
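For illustration, a rough sketch of that pattern against a pipe-delimited vendor file (the file path, column names, and MAXERRORS value are assumptions, not from the original question):
-- Staging table: every column is varchar(8000), so bad values load without conversion errors.
create table StagingTable (
    CharColumn varchar(8000),
    IntColumn varchar(8000),
    FloatColumn varchar(8000),
    DateColumn varchar(8000)
);

-- Bulk insert the raw file; with everything landing as varchar, conversion
-- errors largely disappear, and MAXERRORS is just a safety net.
bulk insert StagingTable
from 'C:\import\vendor_file.txt'
with (fieldterminator = '|', rowterminator = '\n', maxerrors = 100000);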
Then, I create a table with the proper columns, and populate it using something like:
insert into RealTable (CharColumn, IntColumn, FloatColumn, DateTimeColumn)
select CharColumn,
(case when isnumeric(IntColumn) = 1 and IntColumn not like '%.%' then cast(IntColumn as int) end),
(case when isnumeric(FloatColumn) = 1 then cast(FloatColumn as float) end),
(case when isdate(DateColumn) = 1 then cast(DateColumn as date) end)
from StagingTable st
That is, I do the type checks in SQL code, using a case statement to avoid errors. The result is NULL values where the types don't match. I can then investigate the values in the database, using the StagingTable, to understand any issues.
Also, in the RealTable, I always have the following columns:
RealTableId int identity(1,1),
CreatedBy varchar(255) default system_user,
CreatedAt datetime default getdate()
These provide tracking information about the data that often comes in useful.
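As a sketch, the full table definition might look like this (the data column names mirror the example above; the exact data types are assumptions):
create table RealTable (
    RealTableId int identity(1,1) primary key,
    CharColumn varchar(255),
    IntColumn int,
    FloatColumn float,
    DateTimeColumn datetime,
    CreatedBy varchar(255) default system_user,
    CreatedAt datetime default getdate()
);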

Related

SSIS data flow - copy new data or update existing

I query some data from table A (source) based on a certain condition and insert it into a temp table (destination) before upserting into CRM.
If the data already exists in CRM, I don't want to query it from table A and insert it into the temp table (I want this table to stay empty) unless that data has been updated or new data has been created. So basically I want to query only new data, or modified data from table A that already exists in CRM. At the moment my data flow is like this:
clear temp table - delete sql statement
Query from source table A and insert into temp table.
From temp table insert into CRM using script component.
In source table A I have audit columns: createdOn and modifiedOn.
I found one way to do this, SSIS DataFlow - copy only changed and new records, but I'm not really clear on how to do it.
What is the best and simplest way to achieve this?
The link you posted is basically saying to stage everything and use a MERGE to update your table (essentially an UPDATE/INSERT).
The only way I can really think of to make your process significantly quicker by partially selecting from table A would be to add a "last updated" timestamp to table A and enforce that it is always kept up to date.
One way to do this is with a trigger; see here for an example.
You could then select based on that timestamp, perhaps keeping a record of the last timestamp used each time you run the SSIS package, and then adding a margin of safety to that.
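For illustration, a minimal sketch of such a trigger, keeping the modifiedOn column current (the table name and the "id" key column are assumptions):
CREATE TRIGGER trg_TableA_SetModifiedOn
ON dbo.TableA
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- Stamp the current time on every updated row; assumes an "id" key column.
    UPDATE a
    SET modifiedOn = GETDATE()
    FROM dbo.TableA a
    INNER JOIN inserted i ON i.id = a.id;
END;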
Edit: I just saw that you already have a modifiedOn column, so you could use that as described above.
Examples:
There are a few different ways you could do it:
ONE
Include the modifiedOn column in your final destination table.
You can then build a dynamic query for your data flow source in an SSIS string variable, something like:
"SELECT * FROM [table A] WHERE modifiedOn >= DATEADD(DAY, -1, '" + @[User::MaxModifiedOnDate] + "')"
@[User::MaxModifiedOnDate] (a string variable) would come from an Execute SQL Task, where you would write the result of the following query to it:
SELECT FORMAT(CAST(MAX(modifiedOn) AS date), 'yyyy-MM-dd') MaxModifiedOnDate FROM DestinationTable
The DATEADD part, as well as the CAST to a certain degree, represent your margin of safety.
TWO
If this isn't an option, you could keep a data load history table that would tell you when you need to load from, e.g.:
CREATE TABLE DataLoadHistory
(
DataLoadID int PRIMARY KEY IDENTITY
, DataLoadStart datetime NOT NULL
, DataLoadEnd datetime
, Success bit NOT NULL
)
You would begin each data load with this (Execute SQL Task):
CREATE PROCEDURE BeginDataLoad
@DataLoadID int OUTPUT
AS
INSERT INTO DataLoadHistory
(
DataLoadStart
, Success
)
VALUES
(
GETDATE()
, 0
)
SELECT @DataLoadID = SCOPE_IDENTITY()
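For illustration, the Execute SQL Task could call this procedure with a parameter marker mapped to that variable (the exact mapping is an assumption; here the ? placeholder is mapped to an integer package variable such as User::DataLoadID with direction Output):
-- Hypothetical SQLStatement for the Execute SQL Task (OLE DB connection);
-- the ? marker maps to the SSIS integer variable as an Output parameter.
EXEC BeginDataLoad ? OUTPUT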
You would store the returned DataLoadID in an SSIS integer variable, and use it when the data load is complete as follows:
CREATE PROCEDURE DataLoadComplete
@DataLoadID int
AS
UPDATE DataLoadHistory
SET
DataLoadEnd = GETDATE()
, Success = 1
WHERE DataLoadID = @DataLoadID
When it comes to building your query for table A, you would do it the same way as before (with the dynamically generated SQL query), except MaxModifiedOnDate would come from the following query:
SELECT FORMAT(CAST(MAX(DataLoadStart) AS date), 'yyyy-MM-dd') MaxModifiedOnDate FROM DataLoadHistory WHERE Success = 1
So the DataLoadHistory table, rather than your destination table.
Note that this would fail on the first run, as there'd be no successful entries in the history table, so you'd need to insert a dummy record, or find some other way around it.
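One simple workaround is to seed the history table with a dummy successful load dated far enough back to cover all existing data (the date below is illustrative):
INSERT INTO DataLoadHistory (DataLoadStart, DataLoadEnd, Success)
VALUES ('2000-01-01', '2000-01-01', 1)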
THREE
I've seen it done a lot where, say, your data load runs every day and you just stage the last 7 days, or something like that: a margin of safety that you're pretty sure will never be exceeded (because the process is being monitored for failures).
It's not my preferred option, but it is simple, and can work if you're confident in how well the process is being monitored.

Convert Date Stored as VARCHAR into INT to compare to Date Stored as INT

I'm using SQL Server 2014. My request I believe is rather simple. I have one table containing a field holding a date value that is stored as VARCHAR, and another table containing a field holding a date value that is stored as INT.
The date value in the VARCHAR field is stored like this: 2015M01
The data value in the INT field is stored like this: 201501
I need to compare these tables against each other using EXCEPT. My thought process was to somehow extract or TRIM the "M" out of the VARCHAR value and see if it would let me compare the two. If anyone has a better idea such as using CAST to change the date formats or something feel free to suggest that as well.
I am also concerned that even extracting the "M" out of the VARCHAR may still prevent the comparison since one will still remain VARCHAR and the other is INT. If possible through a T-SQL query to convert on the fly that would be great advice as well. :)
REPLACE the string and then CONVERT to integer
SELECT A.*, B.*
FROM TableA A
INNER JOIN
(SELECT intField
FROM TableB
) as B
ON CONVERT(INT, REPLACE(A.varcharField, 'M', '')) = B.intField
Since you say you already have the query and are using EXCEPT, you can simply change the definition of that one "date" field in the query containing the VARCHAR value so that it matches the INT format of the other query. For example:
SELECT Field1, CONVERT(INT, REPLACE(VarcharDateField, 'M', '')) AS [DateField], Field3
FROM TableA
EXCEPT
SELECT Field1, IntDateField, Field3
FROM TableB
HOWEVER, while I realize that this might not be feasible, your best option, if you can make this happen, would be to change how the data in the table with the VARCHAR field is stored so that it is actually an INT in the same format as the table with the data already stored as an INT. Then you wouldn't have to worry about situations like this one.
Meaning:
Add an INT field to the table with the VARCHAR field.
Do an UPDATE of that table, setting the INT field to the string value with the M removed.
Update any INSERT and/or UPDATE stored procedures used by external services (app, ETL, etc) to do that same M removal logic on the way in. Then you don't have to change any app code that does INSERTs and UPDATEs. You don't even need to tell anyone you did this.
Update any "get" / SELECT stored procedures used by external services (app, ETL, etc) to do the opposite logic: convert the INT to VARCHAR and add the M on the way out. Then you don't have to change any app code that gets data from the DB. You don't even need to tell anyone you did this.
This is one of many reasons that having a stored procedure API to your DB is quite handy. I suppose an ORM can just be rebuilt, but you still need to recompile, even if all of the code references are automatically updated. But making a datatype change (or even moving a field to a different table, or even replacing a field with a simple CASE statement) "behind the scenes", and masking it so that any code outside of your control doesn't know that a change happened, is not nearly as difficult as most people might think.
I have done all of these operations (datatype change, move a field to a different table, replace a field with simple logic, etc.) and it buys you a lot of time until the app code can be updated. That might be another team who handles that. Maybe their schedule won't allow for making any changes in that area (plus testing) for 3 months. Ok. It will be there waiting for them when they are ready. And if there are several areas to update, then they can be done one at a time. You can even create new stored procedures to run in parallel for any updated app code to have the proper INT datatype as the input parameter. And once all references to the VARCHAR value are gone, then delete the original versions of those stored procedures.
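As a hypothetical sketch of that masking idea (the procedure and column names are illustrative, building on the TableA example above): a "get" procedure keeps returning the old '2015M01' style string even though the table now stores the value as an INT.
CREATE PROCEDURE dbo.GetTableARecord
    @Field1 int
AS
SELECT Field1,
       -- Rebuild the '2015M01' style value from the stored INT on the way out.
       LEFT(CAST(IntDateField AS varchar(6)), 4) + 'M'
           + RIGHT(CAST(IntDateField AS varchar(6)), 2) AS VarcharDateField,
       Field3
FROM dbo.TableA
WHERE Field1 = @Field1;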
If you want everything in the first table that is not in the second, you might consider something like this:
select t1.*
from t1
where not exists (select 1
from t2
where cast(replace(t1.varcharfield, 'M', '') as int) = t2.intfield
);
This should be close enough to except for your purposes.
I should add that you might need to include other columns in the where statement. However, the question only mentions one column, so I don't know what those are.
You could create a persisted view on the table with the char column, with a calculated column where the M is removed. Then you could JOIN the view to the table containing the INT column.
CREATE VIEW dbo.PersistedView
WITH SCHEMABINDING
AS
SELECT ConvertedDateCol = CONVERT(INT, REPLACE(VarcharCol, 'M', ''))
--, other columns including the PK, etc
FROM dbo.TableWithCharColumn;
-- The clustered index on an indexed view must be unique, e.g. on the PK:
CREATE UNIQUE CLUSTERED INDEX IX_PersistedView
ON dbo.PersistedView(<the PK column>);
SELECT *
FROM dbo.PersistedView pv
INNER JOIN dbo.TableWithIntColumn ic ON pv.ConvertedDateCol = ic.IntDateCol;
If you provide the actual details of both tables, I will edit my answer to make it clearer.
A persisted view with a computed column will perform far better on the SELECT statement where you join the two columns compared with doing the CONVERT and REPLACE every time you run the SELECT statement.
However, a persisted view will slightly slow down inserts into the underlying table(s), and will prevent you from making DDL changes to the underlying tables.
If you're looking to not persist the values via a schema-bound view, you could create a non-persisted computed column on the table itself, then create a non-clustered index on that column. If you are using the computed column in WHERE or JOIN clauses, you may see some benefit.
By way of example:
CREATE TABLE dbo.PCT
(
PCT_ID INT NOT NULL
CONSTRAINT PK_PCT
PRIMARY KEY CLUSTERED
IDENTITY(1,1)
, SomeChar VARCHAR(50) NOT NULL
, SomeCharToInt AS CONVERT(INT, REPLACE(SomeChar, 'M', ''))
);
CREATE INDEX IX_PCT_SomeCharToInt
ON dbo.PCT(SomeCharToInt);
INSERT INTO dbo.PCT(SomeChar)
VALUES ('2015M08');
SELECT SomeCharToInt
FROM dbo.PCT;
Results:
SomeCharToInt
-------------
201508

How to obtain max length of fields in huge pipe delimited file

I have a pipe delimited file that is too large to open in Excel. I'm trying to import this file into MSSQL using the import wizard in SSMS.
Normally when I do this, I open the file in Excel and use an array function =MAX(LEN(An:Annnn)) to get the max length of each column. Then I use that to specify the size of each field in my table.
This file is too large to open in Excel and SQL doesn't check all of the data to give an accurate suggestion (I think it's a crazy small sample like 200 records).
Anyone have a solution to this? (I'm not opposed to doing something in Linux, especially if it's free.)
Thanks in advance for any help.
When I import text data into a database, typically I first read the data into a staging table where all the columns are long-enough character fields (say, varchar(8000)).
Then, I load from the staging table into the final table:
create table RealTable (
RealTableId int identity(1, 1) primary key,
Column1 int,
Column2 datetime,
Column3 varchar(12),
. . .
);
insert into RealTable(<all columns but id>)
select (case when column1 not like '%[^0-9]%' then cast(column1 as int) end),
(case when isdate(column2) = 1 then cast(column2 as datetime) end),
. . .
I find it much easier to debug type issues inside the database rather than when inserting into the database.
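To get back to the original question of finding the maximum field lengths, once everything is sitting in the staging table you can compute them directly there; a minimal sketch, assuming the staging table and column names above:
select max(len(column1)) as max_len_column1,
       max(len(column2)) as max_len_column2,
       max(len(column3)) as max_len_column3
from StagingTable;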

Getting bulk data into a busy table

I am currently performing analysis on a client's MSSQL Server. I've already fixed many issues (unnecessary indexes, index fragmentation, NEWID() being used for identities all over the shop etc), but I've come across a specific situation that I haven't seen before.
Process 1 imports data into a staging table, then Process 2 copies the data from the staging table using an INSERT INTO. The first process is very quick (it uses BULK INSERT), but the second takes around 30 mins to execute. The "problem" SQL in Process 2 is as follows:
INSERT INTO ProductionTable(field1,field2)
SELECT field1, field2
FROM SourceHeapTable (nolock)
The above INSERT statement inserts hundreds of thousands of records into ProductionTable, each row allocating a UNIQUEIDENTIFIER, and inserting into about 5 different indexes. I appreciate this process is going to take a long time, so my issue is this: while this import is taking place, a 3rd process is responsible for performing constant lookups on ProductionTable - in addition to inserting an additional record into the table as such:
INSERT INTO ProductionTable(fields...)
VALUES(values...)
SELECT *
FROM ProductionTable (nolock)
WHERE ID = @Id
For the 30 or so minutes that the INSERT...SELECT above is taking place, the INSERT INTO times out.
My immediate thought is that SQL Server is locking the entire table during the INSERT...SELECT. I did quite a lot of profiling on the server during my analysis, and there are definitely locks being allocated for the duration of the INSERT...SELECT, though I fail to remember what type they were.
Having never needed to insert records into a table from two sources at the same time - at least during an ETL process - I'm not sure how to approach this. I've been looking up INSERT table hints, but most are being made obsolete in future versions.
It looks to me like a CURSOR is the only way to go here?
You could consider BULK INSERT for Process-2 to get the data into the ProductionTable.
Another option would be to batch Process-2 into small batches of around 1000 records and use a Table Valued Parameter to do the INSERT. See: http://msdn.microsoft.com/en-us/library/bb510489.aspx#BulkInsert
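A rough sketch of what the table-valued-parameter route could look like (the type, procedure, and column definitions are assumptions based on the field1/field2 example above):
-- Table type describing one batch of rows.
CREATE TYPE dbo.ProductionRowList AS TABLE
(
    field1 int,
    field2 varchar(50)
);
GO
-- Procedure that inserts a whole batch in one call.
CREATE PROCEDURE dbo.InsertProductionBatch
    @Rows dbo.ProductionRowList READONLY
AS
    INSERT INTO ProductionTable (field1, field2)
    SELECT field1, field2
    FROM @Rows;
GO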
It seems like a table lock.
Try inserting in small portions in the ETL process. Something like:
while 1=1
begin
    INSERT INTO ProductionTable(field1,field2)
    SELECT top (1000) field1, field2
    FROM SourceHeapTable sht (nolock)
    where not exists (select 1 from ProductionTable pt where pt.id = sht.id)

    -- optional
    --waitfor delay '00:00:01.0'

    if @@rowcount = 0
        break;
end

Sql Server Column with Auto-Generated Data

I have a customer table, and my requirement is to add a new varchar column that automatically obtains a random unique value each time a new customer is created.
I thought of writing an SP that randomizes a string, then checks and re-generates if the string already exists. But integrating the SP into the customer record creation process would require transactional SQL stuff at code level, which I'd like to avoid.
Help please?
edit:
I should've emphasized, the varchar has to be 5 characters long with numeric values between 1000 and 99999, and if the number is less than 10000, pad 0 on the left.
if it has to be varchar, you can cast a uniqueidentifier to varchar.
to get a random uniqueidentifier do NewId()
here's how you cast it:
CAST(NewId() as varchar(36))
EDIT
as per your comment to @Brannon:
are you saying you'll NEVER have over 99k records in the table? if so, just make your PK an identity column, seed it with 1000, and take care of "0" left padding in your business logic.
This question gives me the same feeling I get when users won't tell me what they want done, or why, they only want to tell me how to do it.
"Random" and "Unique" are conflicting requirements unless you create a serial list and then choose randomly from it, deleting the chosen value.
But what's the problem this is intended to solve?
With your edit/update, sounds like what you need is an auto-increment and some padding.
Below is an approach that uses a bogus table, then adds an IDENTITY column (assuming that you don't have one) which starts at 1000, and then which uses a Computed Column to give you some padding to make everything work out as you requested.
CREATE TABLE Customers (
CustomerName varchar(20) NOT NULL
)
GO
INSERT INTO Customers
SELECT 'Bob Thomas' UNION
SELECT 'Dave Winchel' UNION
SELECT 'Nancy Davolio' UNION
SELECT 'Saded Khan'
GO
ALTER TABLE Customers
ADD CustomerId int IDENTITY(1000,1) NOT NULL
GO
ALTER TABLE Customers
ADD SuperId AS right(replicate('0',5)+ CAST(CustomerId as varchar(5)),5)
GO
SELECT * FROM Customers
GO
DROP TABLE Customers
GO
I think Michael's answer with the auto-increment should work well - your customer will get "01000" and then "01001" and then "01002" and so forth.
If you want to, or have to, make it more random, then in this case I'd suggest you create a table that contains all possible values, from "01000" through "99999". When you insert a new customer, use a technique (e.g. randomization) to pick one of the remaining rows from that table (your pool of still-available customer IDs), use it, and remove it from the table.
Anything else will become really bad over time. Imagine you've used up 90% or 95% of your available customer IDs - trying to randomly find one of the few remaining possibilities could lead to an almost endless retry of "is this one taken? Yes -> try the next one".
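A rough sketch of this ID-pool approach (the table and column names are illustrative):
-- Pool of all still-available customer IDs, '01000' through '99999'.
CREATE TABLE AvailableCustomerId
(
    CustomerId varchar(5) NOT NULL PRIMARY KEY
)
GO
-- Populate it once (99000 values: 1000 through 99999, left-padded to 5 characters).
INSERT INTO AvailableCustomerId (CustomerId)
SELECT RIGHT('0000' + CAST(n AS varchar(5)), 5)
FROM (SELECT TOP (99000) 999 + ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
      FROM sys.all_objects a CROSS JOIN sys.all_objects b) AS nums
GO
-- When creating a customer, claim one ID at random and remove it from the pool.
DECLARE @NewCustomerId TABLE (CustomerId varchar(5))
;WITH pick AS
(
    SELECT TOP (1) CustomerId
    FROM AvailableCustomerId
    ORDER BY NEWID()
)
DELETE FROM pick
OUTPUT deleted.CustomerId INTO @NewCustomerId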
Marc
Does the random string data need to be a certain format? If not, why not use a uniqueidentifier?
insert into Customer ([Name], [UniqueValue]) values (@Name, NEWID())
Or use NEWID() as the default value of the column.
EDIT:
I agree with @rm, use a numeric value in your database, and handle the conversion to string (with padding, etc) in code.
Try this:
ALTER TABLE Customer ADD AVarcharColumn varchar(50)
CONSTRAINT DF_Customer_AVarcharColumn DEFAULT CONVERT(varchar(50), GETDATE(), 109)
It returns a date and time up to milliseconds, which would be enough in most cases.
Do you really need a unique value?
