I have to insert millions of records from MATLAB into Vertica. I tried using the datainsert function provided by MATLAB, but it seems slow: it takes about 6 seconds for 3,000 records. The other functions, fastinsert and insert, are even slower. Is there a faster method to insert the data?
Do yourself a favor and export the data into csv format. See this link for more details.
Vertica's performance on sequential INSERT statements is poor. You have to use Vertica's native COPY command to load data from the exported csv file; it will do about 1 million rows per second even on a small single-node cluster.
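A minimal sketch of the COPY approach, assuming the target table is called my_table and the file was exported to /tmp/export.csv (both are placeholders):

COPY my_table
FROM '/tmp/export.csv'
DELIMITER ','
ENCLOSED BY '"'
EXCEPTIONS '/tmp/copy_exceptions.log'  -- rejected rows are logged here
DIRECT;                                -- load straight to disk storage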
I have a sample transformation setup for the purpose of this question:
Table Input step -> Table output step.
When running the transformation and looking at the live stats I see this:
The Table Output step loads ~11 rows per second, which is extremely slow. My commit size in the Table Output step is set to 1000. The SQL input returns 40k rows and completes in 10 seconds when run by itself, without pointing to the table output. The input and output tables are located in the same database.
System Info:
pdi 8.0.0.0
Windows 10
SQL Server 2017
Table output is in general very slow.
If I'm not entirely mistaken, it does an insert for each incoming row, which takes a lot of time.
A much faster approach is to use a bulk-load step, which streams data from inside Kettle to a named pipe and loads it with "LOAD DATA INFILE 'FIFO File' INTO TABLE ....".
You can read more about how bulk loading works here: https://dev.mysql.com/doc/refman/8.0/en/load-data.html
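For reference, the statement issued by the MySQL bulk-load step has roughly this shape (the FIFO path and table name are placeholders):

LOAD DATA INFILE '/tmp/kettle_fifo'
INTO TABLE target_table
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';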
Anyway: if you are moving data from one table to another table in the same database, then I would instead create an 'Execute SQL script' step and do the whole operation with a single query.
If you take a look at this post, you can learn more about updating a table from another table in a single SQL query:
SQL update from one Table to another based on a ID match
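A rough sketch of that single-query approach (table and column names are placeholders, not taken from the original transformation):

-- Copy everything in one statement instead of row-by-row output:
INSERT INTO target_table (id, col_a, col_b)
SELECT id, col_a, col_b
FROM source_table;

-- Or, if existing rows should be updated from the other table
-- (the pattern described in the linked post):
UPDATE t
SET t.col_a = s.col_a
FROM target_table AS t
JOIN source_table AS s ON s.id = t.id;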
I have a UNIX script.
In it we create the table and its index, and load the data from a file into the table using SQL*Loader.
Then we run nearly 70 direct UPDATE statements (not using FORALL or BULK COLLECT) on this table.
Finally we insert this new table's data into another table. It processes 500,000 records per day, and all of the updates are very fast.
Inserting this data into the other table, however, takes 20 minutes. How can this be improved?
The target table is not the problem: we also insert 500,000 records into the same table from another source table, and that insert completes in less than a minute.
Insert into tables () select () from tablex;
This takes 20 minutes for 500,000 records.
tablex is created, loaded, and has the 70 direct updates applied, all in the same shell script.
I checked the explain plan cost for the SELECT alone and for the full INSERT ... SELECT; both are the same.
Insert into tables () select () from tabley;
The above statement executes in less than a second.
I tried a PARALLEL hint; the cost is reduced, but CPU utilisation is zero.
Should I create one more table, tablez, and then load the data from tablez into my final table?
Is stats gathering required? This is a daily job.
When we do a direct-path insert using SQL*Loader, the records are inserted above the high water mark. After the load completes and the high water mark moves up, there can be lots of empty blocks below the original/old high water mark position. If your SELECT is doing a full table scan, it will read all those empty blocks too. Check whether your table has accumulated lots of empty blocks over time; you can use the Segment Advisor for this. Based on the advisor's recommendations, shrink the table and free the unused space. This could speed up the execution. Hope this helps.
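If the advisor does flag wasted space, the shrink itself is short; a sketch, assuming the staging table is called tablex:

ALTER TABLE tablex ENABLE ROW MOVEMENT;
ALTER TABLE tablex SHRINK SPACE;          -- compacts rows and lowers the high water mark
ALTER TABLE tablex DISABLE ROW MOVEMENT;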
I have a huge text file with 1 million rows, and each row contains only a 28-character number as text.
I want to import them into SQL Server, into a table with one corresponding column, so that a million rows will be inserted into a single-column table.
I used SSIS, but it's quite slow (1 million rows take 4.5 hours or more to insert). Are there any other ways to do this much faster?
You can use the BCP utility for fast imports. See the official documentation here: DOC
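If you would rather stay in T-SQL than run the bcp command line, a BULK INSERT statement does the same kind of fast load; a sketch with a placeholder file path and table name:

BULK INSERT dbo.Numbers
FROM 'C:\data\numbers.txt'
WITH (
    ROWTERMINATOR = '\n',   -- one 28-character number per line
    TABLOCK,                -- table lock allows a minimally logged load
    BATCHSIZE = 100000      -- commit every 100,000 rows
);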
In the end, I decided to split the huge file into parts and run several SSIS packages at the same time, all inserting into the same table. There is no locking conflict on the inserts. I expect 6 SSIS packages to finish this job in about an hour.
Thanks.
Dilemma:
I am about to populate data on MS SQL Server (2012 Developer Edition). The data is based on production data; the volume is around 4 TB (around 250 million items).
Purpose:
To test the performance of full-text search and of regular indexes as well. The target is around 300 million items of around 500K each.
Question:
What should I do beforehand to speed up the process, and are there consequences I should worry about?
Ex.
Switching off statistics?
Should I do bulk inserts of 1k items per transaction instead of a single transaction? (A sketch of these options follows this list.)
Simple recovery model?
Log truncation?
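For reference, the settings mentioned above look roughly like this in T-SQL; TestDb is a placeholder database name, and whether each one actually helps depends on the workload:

ALTER DATABASE TestDb SET RECOVERY SIMPLE;             -- simple recovery model
ALTER DATABASE TestDb SET AUTO_CREATE_STATISTICS OFF;  -- "switching off statistics"
ALTER DATABASE TestDb SET AUTO_UPDATE_STATISTICS OFF;
-- Under simple recovery the log is truncated at each checkpoint, so explicit
-- truncation is usually unnecessary; load in batches (e.g. 1,000 items per
-- transaction) rather than one huge transaction to keep the log small.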
Important:
I will use a sample of 2k production items to generate every random item that will be inserted into the database. I will use near-unique samples generated in C#. It will be one table:
table
(
long[id],
nvarchar(50)[index],
nvarchar(50)[index],
int[index],
float,
nvarchar(50)[index],
text[full text search index]
)
Almost invariably in a situation like this, and I've had several of them, I've used SSIS. SSIS is the fastest way I know to import large amounts of data into a SQL Server database. You have complete control over batch (transaction) size, and it performs bulk inserts. In addition, if you have transformation requirements, SSIS handles them with ease.
I've read many answers here about this topic, but everyone suggests either the BCP utility or the SqlBulkCopy class from .NET.
I have a query which inserts into targetTable the union of 5 selects from different tables.
I have the correct indexes on the tables being selected, and only one clustered identity index on targetTable. However, this takes a long time (~25 minutes). I'm talking about 5M rows (x 20 columns).
When I look at sp_who2, the session is suspended most of the time...
I want to use bulk copy but not from .net (the db already fetches the data - so I don't need to go to C#).
Questions
How can I use bulk insert (no bcp) in my select command?
Also, why is it suspended most of the time? How can I give my query a higher priority?
Thank you.
p.s. I can't use bcp here because of security restrictions... I don't have permission to run this.
You're right: this is taking longer than usual. You're getting 3k rows per second; you should easily get 10k or 20k per second, and in the best case 200k per second per CPU core.
I suspect you are inserting all over the table, not just at the end. In this case, 3k rows per second is not unusual.
In any case, bulk copy cannot help you. It does not insert any faster than a server-side INSERT ... SELECT statement.
What you can do, though, is insert using multiple threads. Partition your row source into N distinct ranges and insert each range concurrently from a separate connection. This will help if you are CPU bound. It won't if you are IO bound.
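A sketch of that partitioned approach, with placeholder table, column, and range values (source_view stands in for the UNION of your 5 SELECTs):

-- Session 1: first range of the row source
INSERT INTO targetTable (col1, col2 /* , ... */)
SELECT col1, col2 /* , ... */
FROM source_view
WHERE id >= 0 AND id < 1000000;

-- Session 2, run concurrently from a separate connection: next range
INSERT INTO targetTable (col1, col2 /* , ... */)
SELECT col1, col2 /* , ... */
FROM source_view
WHERE id >= 1000000 AND id < 2000000;

-- ...and so on for the remaining ranges.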