I have a big table in SQL Server 2008 R2. It contains billions of rows. I need to load the whole data set into our application. Querying the whole table is very slow, so I want to use bcp to dump it into a file and load that instead. The problem is that the string columns contain all kinds of special characters like '\t', '\0', commas, and '\n', so I can't find a good field/row terminator, and a long terminator string slows down the data file loading in my application. The questions are:
Is there any API that loads data faster than a SQL query? I found the native bulk-import API IRowsetFastLoad, but I have had no luck finding an equivalent for exporting.
Is there any API for the bcp native format? I can't find any documentation about the format of a native bcp file.
From BOL:
-n
Performs the bulk copy operation using the native (database) data types of the data. This option does not prompt for each field; it uses the native values.
Billions of rows? Then you will also want to use:
-b batch_size
Specifies the number of rows per batch of data copied. Each batch is copied to the server as one transaction. SQL Server commits or rolls back, in the case of failure, the transaction for every batch.
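For reference, a sketch of how those switches might be combined on the command line (the server, database, table, and file names below are placeholders, not taken from the question):
rem export the table in native format (no field/row terminators to worry about)
bcp BigDb.dbo.BigTable out D:\dump\BigTable.dat -n -S SourceServer -T
rem reload the file, committing in batches of 100,000 rows (-b applies to the import side)
bcp BigDb.dbo.BigTable in D:\dump\BigTable.dat -n -b 100000 -S TargetServer -T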
Can't you access the two databases at once, perhaps through a linked server? It would make things easier.
DECLARE @StartId BIGINT
DECLARE @NmbrOfRecords BIGINT
DECLARE @RowCount BIGINT

SET @StartId = 0
SET @NmbrOfRecords = 9999
SET @RowCount = 1

WHILE @RowCount > 0
BEGIN
    BEGIN TRANSACTION

    INSERT INTO DestinationDatabase.dbo.Mytable
    SELECT * FROM SourceDatabase.dbo.Mytable
    WHERE ID BETWEEN @StartId AND @StartId + @NmbrOfRecords

    SET @RowCount = @@ROWCOUNT
    SET @StartId = @StartId + @NmbrOfRecords + 1

    COMMIT TRANSACTION
END
The Bulk Insert API is exposed to programmers by the SqlBulkCopy class:
http://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlbulkcopy.aspx
Import and Export Bulk Data by Using the bcp Utility (SQL Server)
http://technet.microsoft.com/en-us/library/aa337544.aspx
Bulk Copy Functions reference:
http://technet.microsoft.com/en-us/library/ms130922.aspx
Related
Software I'm working with uses a text field to store XML. From my searches online, the text datatype is supposed to hold 2^31 - 1 characters. Currently SQL Server is truncating the XML at 65,535 characters every time. I know this is caused by SQL Server, because if I add a 65,536th character to the column directly in Management Studio, it states that it will not update because characters will be truncated.
Is the max length really 65,535 or could this be because the database was designed in an earlier version of SQL Server (2000) and it's using the legacy text datatype instead of 2005's?
If this is the case, will altering the datatype to Text in SQL Server 2005 fix this issue?
That is a limitation of SSMS, not of the text field, but you should use varchar(max) anyway since text is deprecated.
Here is also a quick test
create table TestLen (bla text)
insert TestLen values (replicate(convert(varchar(max),'a'), 100000))
select datalength(bla)
from TestLen
Returns 100000 for me
MSSQL 2000 should allow up to 2^31 - 1 characters (non-Unicode) in a text field, which is over 2 billion. I don't know what's causing this limitation, but you might want to try varchar(max) or nvarchar(max). These store just as many characters but also allow the regular string T-SQL functions (LEN, SUBSTRING, REPLACE, RTRIM, ...).
If you're able to convert the column, you might as well, since the text data type will be removed in a future version of SQL Server. See here.
The recommendation is to use varchar(MAX) or nvarchar(MAX). In your case, you could also use the XML data type, but that may tie you to certain database engines (if that's a consideration).
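If you do decide to convert the column, a minimal sketch of the change (the table and column names here are hypothetical, not taken from the question):
-- switch the legacy text column to varchar(max)
ALTER TABLE dbo.SoftwareData ALTER COLUMN XmlPayload varchar(max) NULL;

-- optionally, a second step could convert it to the xml type,
-- assuming every existing value is well-formed XML
-- ALTER TABLE dbo.SoftwareData ALTER COLUMN XmlPayload xml NULL;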
You should have a look at
XML Support in Microsoft SQL Server 2005
Beginning SQL Server 2005 XML Programming
So I would rather use the data type appropriate for the job than make a data type from a previous version fit your current use.
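As a rough illustration of that route (a sketch only; the table, column, and element names are made up):
CREATE TABLE dbo.SoftwareConfig
(
    Id      int IDENTITY(1,1) PRIMARY KEY,
    Payload xml NULL
);

INSERT INTO dbo.SoftwareConfig (Payload)
VALUES (N'<config><setting name="retries">3</setting></config>');

-- the xml type can then be queried directly with XQuery
SELECT Payload.value('(/config/setting[@name="retries"])[1]', 'int') AS Retries
FROM dbo.SoftwareConfig;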
Here's a little script I wrote for getting all the data out in chunks:
DECLARE @data NVARCHAR (MAX) = N'huge data';
DECLARE @readSentence NVARCHAR (MAX) = N'';
DECLARE @dataLength INT = ( SELECT LEN (@data));
DECLARE @currIndex INT = 1; -- SUBSTRING is 1-based

WHILE @data <> @readSentence
BEGIN
    DECLARE @temp NVARCHAR (MAX) = N'';

    SET @temp = ( SELECT SUBSTRING (@data, @currIndex, 65535));

    SELECT @temp;

    SET @readSentence += @temp;
    SET @currIndex += 65535;
END;
We have a system that creates a table in a database on our production server for each day/shift. I would like to somehow grab the data from that server and move it to our archive server, and if the data is more than X days old, remove it from the production server.
On the production server, the database is called "transformations" and the tables are named "yyyy-mm-dd_shift_table". I would like to move these into a database called "Archive" on another server running SQL Server 2012, keeping the same table names. Each table contains about 30k records for the day.
The way I see it, it would be something like:
Get a list of tables on the Production server
If the table exists on the Archive server, look for any changes (only really relevant for the current table) and sync changes
If the table doesn't exist on the Archive server, create the table and sync changes
If the date on the table is more than X days old, delete the table from the archive server
Ideally I would like to have this as a procedure in SQL that can run daily/hourly etc.
Suggestions on how to attack this would be great.
EDIT: I'm happy to do a select on all matching tables in the database and write them into a single table in my database.
After a lot of digging today, I have come up with the following. This will load all the data from the remote server and insert it into the table on the local server. It requires a linked server on your archive server which you can use to query the remote server. I'm sure you could reverse this and push the data, but I didn't want to chew up cycles on the production server.
-- Set up the variables
--Tracer for the loop
DECLARE @i int
--Variable to hold the SQL queries
DECLARE @SQLCode nvarchar(300)
--Variable to hold the number of rows to process
DECLARE @numrows int
--Table to hold the SQL queries with an index for looping
DECLARE @SQLQueries TABLE (
    idx smallint Primary Key IDENTITY(1,1)
    , SQLCode nvarchar(300)
)

--Set up a table with the SQL queries that will need to be run on the remote server. This section creates an INSERT statement
--which is returning all the records in the remote table that do not exist in the local table.
INSERT INTO @SQLQueries
select 'INSERT INTO Local_Table_Name
SELECT S.* FROM [Remote_ServerName].[Transformations].[dbo].[' + name + '] AS S
LEFT JOIN Local_Table_Name AS T ON (T.Link_Field = S.Link_Field)
WHERE T.Link_Field IS Null' +
CHAR(13) + CHAR(10) + CHAR(13) + CHAR(10)
from [Remote_ServerName].[Transformations].sys.sysobjects
where type = 'U' AND name Like '%_Table_Suffix'

--Set up the loop to process all the tables
SET @i = 1
--Set up the number of rows in the resultant table
SET @numrows = (SELECT COUNT(*) FROM @SQLQueries)
--Only process if there are rows in the database
IF @numrows > 0
    --Loop while there are still records to go through
    WHILE (@i <= (SELECT MAX(idx) FROM @SQLQueries))
    BEGIN
        --Load the code to run into a variable
        SET @SQLCode = (SELECT SQLCode FROM @SQLQueries WHERE idx = @i);
        --Execute the code
        EXEC (@SQLCode)
        --Increase the counter
        SET @i = @i + 1
    END
The initial run over 45 tables inserted about 1.2 million records and took 2.5 minutes. After that, each run took about 1.5 minutes and only inserted about 50-100 records.
I actually created a solution for this and have it posted on GitHub. It uses a library called EzAPI and will sync all the tables and columns from one server to another.
You're welcome to use it, but the basic process works by first checking the metadata between the databases and generating any changed objects. After making the necessary modifications to the destination server, it will generate one SSIS package per object and then execute the package. You can choose to remove or keep the packages after they are generated.
https://github.com/thevinnie/SyncDatabases
I have a database on server X containing my source data.
I also have a database on server Y that contains data I need to augment with data on server X.
Currently we have a nightly job on server Y that calls a stored procedure on server X, inserts the result into a table variable, then sends the data as XML to a stored procedure in the database on server Y.
Below is basically what the code looks like:
--Get data from source
DECLARE @MySourceData TABLE
(
    [ColumnX] VARCHAR(50),
    [ColumnY] VARCHAR(50)
);

INSERT INTO @MySourceData EXECUTE [ServerX].SourceDatabase.dbo.[pListData];

DECLARE @XmlData XML;

SELECT
    @XmlData =
    (
        SELECT
            [ColumnX]
            ,[ColumnY]
        FROM
            @MySourceData
        FOR XML RAW ('Item'), ROOT('Items'), ELEMENTS, TYPE
    )

--Send data to target
EXEC TargetDatabase.dbo.pImportData @XmlData;
This approach keeps any server names or database names within the SQL of the job step (which we think of as part of configuration), and allows us to abide by our in-house development standard of using stored procedures for data access. While this particular solution only processes a few thousand records and the XML won't get that big, I worry about how poorly it might scale if we applied it in scenarios where the dataset was larger. I'm curious whether others have better suggestions.
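For context, the receiving procedure shreds that XML back into rows. A minimal sketch of what pImportData might look like (the target table name and procedure body are assumptions; only the element layout follows from the FOR XML RAW ('Item'), ROOT('Items'), ELEMENTS clause above):
CREATE PROCEDURE dbo.pImportData
    @XmlData XML
AS
BEGIN
    SET NOCOUNT ON;

    -- shred /Items/Item elements back into relational rows
    INSERT INTO dbo.TargetTable ([ColumnX], [ColumnY])   -- hypothetical target table
    SELECT
        i.value('(ColumnX)[1]', 'VARCHAR(50)'),
        i.value('(ColumnY)[1]', 'VARCHAR(50)')
    FROM @XmlData.nodes('/Items/Item') AS x(i);
END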
I am trying to push binary data from SQL Server to an Oracle LONG RAW column. I have a linked server created on SQL Server that connects to the Oracle server. I have a stored procedure on the Oracle side that I am trying to call from SQL Server. I can't seem to get the binary to pass into the stored procedure. I've tried changing the from and to types; however, the data ultimately needs to end up in a LONG RAW column. I have control of the Oracle stored procedure and the SQL Server code, but I do not have control of the predefined Oracle table structure.
varbinary(max) -> long raw
ORA-01460: unimplemented or unreasonable conversion requested
varbinary(max) -> blob
PLS-00306: wrong number or types of arguments in call to 'ADDDOC'
varbinary -> long raw
No errors, but get data truncation or corruption
The varbinary(max) does work if I set @doc = null.
Below is the Oracle procedure and the SQL Server.
Oracle:
CREATE OR REPLACE
PROCEDURE ADDDOC (param1 IN LONG RAW)
AS
BEGIN
-- insert param1 into a LONG RAW column
DBMS_OUTPUT.PUT_LINE('TEST');
END ADDDOC;
SQL Server:
declare @doc varbinary(max)

select top 1 @doc = Document from Attachments

execute ('begin ADDDOC(?); end;', @doc) at ORACLE_DEV
-- tried this too, same error
--execute ('begin ADDDOC(utl_raw.cast_to_raw(?)); end;', @doc) at ORACLE_DEV
I've also tried creating the record in the Oracle Documents table then updating the LONG RAW field from SQL Server without invoking a stored procedure, but the query just seems to run and run and run and run...
--already created record and got the Id of the record I want to put the data in
--hard coding for this example
declare @attachmentId int, @documentId int

set @attachmentId = 1
set @documentId = 1

update ORACLE_DEV..MYDB.Documents
set Document = (select Document from Attachments where Id = @attachmentId)
where DocumentId = @documentId
As noted in the comments, LONG RAW is very difficult to work with; unfortunately, our vendor uses the datatype in their product and I have no choice but to work with it. I found that I could not pass binary data from SQL Server to an Oracle stored procedure parameter. I ended up creating a new record with a NULL value for the LONG RAW field and then using an OPENQUERY update to set the field from the VARBINARY(MAX) field. I did try an update with the four-part identifier, as noted in my code sample, but it took over 11 minutes for a single update, while this new approach completes in less than 3 seconds. I am using an Oracle stored procedure here because in my real-world scenario I am creating multiple records in multiple tables, plus business logic that is not relevant here, and then tying them together with the docId.
This feels more like a workaround than a solution, but it actually works with acceptable performance.
Oracle:
create or replace procedure ADDDOC(docId OUT Number)
as
begin
select docseq.nextval into docId from dual;
-- insert new row, but leave Document LONG RAW field null for now
insert into DOC (Id) values(docId);
end ADDDOC;
SQL Server:
declare @DocId float, @AttachmentID int, @Qry nvarchar(max)

set @AttachmentID = 123 -- hardcoded for example

execute('begin ADDDOC(?); end;', @DocId output) at ORACLE_DEV

-- write openquery sql that will update Oracle LONG RAW field from a SQL Server varbinary(max) field
set @Qry = '
update openquery(ORACLE_DEV, ''select Document from Documents where Id=' + cast(@DocId as varchar) + ''')
set Document = (select Document from Attachments where Id = ' + cast(@AttachmentID as varchar) + ')
'

execute sp_executesql @Qry
Assume that I have a table on my local server called Local_Table, and I have another server with another database and a table called Remote_Table (the table structures are the same).
Local_Table has data, Remote_Table doesn't. I want to transfer data from Local_Table to Remote_Table with this query:
Insert into RemoteServer.RemoteDb..Remote_Table
select * from Local_Table (nolock)
But the performance is quite slow.
However, when I use SQL Server import-export wizard, transfer is really fast.
What am I doing wrong? Why is it fast with Import-Export wizard and slow with insert-select statement? Any ideas?
The fastest way is to pull the data rather than push it. When the tables are pushed, every row requires a connection, an insert, and a disconnect.
If you can't pull the data because you have a one-way trust relationship between the servers, the workaround is to construct the entire table as a giant T-SQL statement and run it all at once.
DECLARE @xml XML

SET @xml = (
    SELECT 'insert Remote_Table values (' + '''' + isnull(first_col, 'NULL') + ''',' +
           -- repeat for each col
           '''' + isnull(last_col, 'NULL') + '''' + ');'
    FROM Local_Table
    FOR XML path('')
) --This concatenates all the rows into a single xml object; the empty path keeps it from having <colname> </colname> wrapped around each value

DECLARE @sql AS VARCHAR(max)

SET @sql = 'set nocount on;' + cast(@xml AS VARCHAR(max)) + 'set nocount off;' --Converts XML back to a long string

EXEC ('use RemoteDb;' + @sql) AT RemoteServer
It seems like it's much faster to pull data from a linked server than to push data to a linked server: Which one is more efficient: select from linked server or insert into linked server?
Update: My own, recent experience confirms this. Pull if possible -- it will be much, much faster.
Try this on the other server:
INSERT INTO Local_Table
SELECT * FROM RemoteServer.RemoteDb..Remote_Table
The Import/Export wizard will essentially be doing this as a bulk insert, whereas your code is not.
Assuming that you have a clustered index on the remote table, make sure that you have the same clustered index on the local table, set trace flag 610 globally on your remote server, and make sure the remote database is in simple or bulk-logged recovery mode.
If your remote table is a heap (which will speed things up anyway), make sure your remote database is in simple or bulk-logged recovery mode and change your code to read as follows:
INSERT INTO RemoteServer.RemoteDb..Remote_Table WITH(TABLOCK)
SELECT * FROM Local_Table WITH (nolock)
The reason why it's so slow to insert into the remote table from the local table is because it inserts a row, checks that it inserted, and then inserts the next row, checks that it inserted, etc.
Don't know if you figured this out or not, but here's how I solved this problem using linked servers.
First, I have a LocalDB.dbo.Table with several columns:
IDColumn (int, PK, Auto Increment)
TextColumn (varchar(30))
IntColumn (int)
And I have a RemoteDB.dbo.Table that is almost the same:
IDColumn (int)
TextColumn (varchar(30))
IntColumn (int)
The main difference is that the remote IDColumn isn't set up as an identity column, so that I can do inserts into it.
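In DDL terms, the two tables look roughly like this (a sketch reconstructed from the column lists above; the name Table is kept from the example):
-- MainServer, LocalDB
CREATE TABLE dbo.[Table]
(
    IDColumn   int IDENTITY(1,1) PRIMARY KEY,
    TextColumn varchar(30),
    IntColumn  int
);

-- RemoteServer, RemoteDB: same shape, but IDColumn is a plain int
-- (no IDENTITY), so the trigger below can insert explicit ID values
CREATE TABLE dbo.[Table]
(
    IDColumn   int,
    TextColumn varchar(30),
    IntColumn  int
);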
Then I set up a trigger on the remote table that fires on Delete:
Create Trigger Table_Del
On Table
After Delete
AS
Begin
Set NOCOUNT ON;
Insert Into Table (IDColumn, TextColumn, IntColumn)
Select IDColumn, TextColumn, IntColumn from MainServer.LocalDB.dbo.table L
Where not exists (Select * from Table R WHere L.IDColumn = R.IDColumn)
END
Then when I want to do an insert, I do it like this from the local server:
Insert Into LocalDB.dbo.Table (TextColumn, IntColumn) Values ('textvalue', 123);
Delete From RemoteServer.RemoteDB.dbo.Table Where IDColumn = 0;
--And if I want to clean the table out and make sure it has all the most up to date data:
Delete From RemoteServer.RemoteDB.dbo.Table
By triggering the remote server to pull the data from the local server and then do the insert, I was able to turn a job that took 30 minutes to insert 1258 lines into a job that took 8 seconds to do the same insert.
This does require a linked server connection on both sides, but after that's set up it works pretty well.
Update:
So in the last few years I've made some changes, and have moved away from the delete trigger as a way to sync the remote table.
Instead I have a stored procedure on the remote server that has all the steps to pull the data from the local server:
CREATE PROCEDURE [dbo].[UpdateTable]
-- Add the parameters for the stored procedure here
AS
BEGIN
-- SET NOCOUNT ON added to prevent extra result sets from
-- interfering with SELECT statements.
SET NOCOUNT ON;
-- Insert statements for procedure here
--Fill Temp table
Insert Into WebFileNamesTemp Select * From MAINSERVER.LocalDB.dbo.WebFileNames
--Fill normal table from temp table
Delete From WebFileNames
Insert Into WebFileNames Select * From WebFileNamesTemp
--empty temp table
Delete From WebFileNamesTemp
END
And on the local server I have a scheduled job that does some processing on the local tables, and then triggers the update through the stored procedure:
EXEC sp_serveroption @server='REMOTESERVER', @optname='rpc', @optvalue='true'
EXEC sp_serveroption @server='REMOTESERVER', @optname='rpc out', @optvalue='true'
EXEC REMOTESERVER.RemoteDB.dbo.UpdateTable
EXEC sp_serveroption @server='REMOTESERVER', @optname='rpc', @optvalue='false'
EXEC sp_serveroption @server='REMOTESERVER', @optname='rpc out', @optvalue='false'
If you must push data from the source to the target (e.g., for firewall or other permissions reasons), you can do the following:
In the source database, convert the recordset to a single XML string (i.e., multiple rows and columns combined into a single XML string).
Then push that XML over as a single row (as a varchar(max), since XML isn't allowed over linked databases in SQL Server).
DECLARE @xml XML

SET @xml = (select * from SourceTable FOR XML path('row'))

Insert into TempTargetTable values (cast(@xml AS VARCHAR(max)))
In the target database, cast the varchar(max) as XML and then use XML parsing to turn that single row and column back into a normal recordset.
DECLARE @X XML = (select '<toplevel>' + ImportString + '</toplevel>' from TempTargetTable)
DECLARE @iX INT

EXEC sp_xml_preparedocument @iX output, @X

insert into TargetTable
SELECT [col1],
       [col2]
FROM OPENXML(@iX, '//row', 2)
WITH ([col1] [int],
      [col2] [varchar](128)
     )

EXEC sp_xml_removedocument @iX
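For what it's worth, the same shredding can also be written with the xml type's nodes()/value() methods, which avoids managing the sp_xml_preparedocument handle. A sketch under the same assumed two-column layout:
DECLARE @X XML = (select '<toplevel>' + ImportString + '</toplevel>' from TempTargetTable)

-- shred each <row> element back into a relational row
insert into TargetTable
SELECT r.value('(col1)[1]', 'int'),
       r.value('(col2)[1]', 'varchar(128)')
FROM @X.nodes('/toplevel/row') AS t(r)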
I've found a workaround. Since I'm not a big fan of GUI tools like SSIS, I've reused a bcp script to dump the table into a CSV file and load it back. Yes, it's odd that bulk operations are supported for files but not for tables. Feel free to edit the following script to fit your needs:
exec xp_cmdshell 'bcp "select * from YourLocalTable" queryout C:\CSVFolder\Load.csv -w -T -S .'
exec xp_cmdshell 'bcp YourAzureDBName.dbo.YourAzureTable in C:\CSVFolder\Load.csv -S yourdb.database.windows.net -U youruser@yourdb.database.windows.net -P yourpass -q -w'
Pros:
No need to define table structures every time.
I've tested it and it worked way faster than inserting directly through the linked server.
It's easier to manage than XML (which is limited to varchar(max) length anyway).
No need for an extra layer of abstraction (tools like SSIS).
Cons:
Uses the external tool bcp through the xp_cmdshell interface.
Table properties will be lost after exporting/importing CSV (i.e. data types, nullability, lengths, separators within values, etc.).
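If losing the column layout is a concern, bcp can also generate a format file from the source table and apply it on the import side, which pins down field types, lengths, and terminators for the transfer (it won't restore constraints). A sketch along the lines of the script above, using the same placeholder names:
exec xp_cmdshell 'bcp YourLocalTable format nul -f C:\CSVFolder\Load.fmt -w -T -S .'
exec xp_cmdshell 'bcp YourAzureDBName.dbo.YourAzureTable in C:\CSVFolder\Load.csv -f C:\CSVFolder\Load.fmt -S yourdb.database.windows.net -U youruser@yourdb.database.windows.net -P yourpass -q'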