I am trying to update around 200,000 records, setting a constant value on a single field. However, using Data Loader it takes too much time and I get a CPU time limit error. Is there any other way to update these records?
You'll need to set your Data Loader batch size lower. CPU timeouts commonly occur in orgs that use complex or recursive declarative customization, or inefficient Apex triggers, when the default batch size of 200 is in use.
Experiment with reducing the batch size in Data Loader settings until you find the point at which CPU timeouts stop occurring. I'd suggest halving it to 100, then 50, and so on. Presuming that you're able to update these records in the UI, there will be some batch size (potentially 1) that will work in Data Loader.
I need to insert only the even-numbered records into the database through batch processing.
For example, if I have 100 records of data, only the even-numbered records should be inserted.
The use case you are trying to implement is a poor fit for the Batch module. One of the characteristics of batch processing is that you have limited access to records:
Batch streaming and access to items: The biggest drawback to using batch streaming is that you have limited access to the items in the output. In other words, with a fixed-size commit, you get an unmodifiable list, thus allowing you to access and iteratively process its items; with streaming commit, you get a one-read, forward-only iterator.
You could try setting a commit size of 2, then use only the second element of each aggregator. Note that performance will probably be bad because of the low commit size.
If you don't need a Batch component, you could settle for a For Each component and then use an Expression component with flowVars.counter % 2 == 0 to keep only the even-numbered records.
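Outside of Mule, the selection logic itself is simple; here is a minimal sketch of the same idea in plain Java (the insert method is a hypothetical stand-in for the actual database call):

    import java.util.List;

    public class EvenRecordFilter {

        // Keep only the even-numbered records (2nd, 4th, ...), mirroring the
        // flowVars.counter % 2 == 0 expression with a 1-based counter.
        static <T> void insertEvenRecords(List<T> records) {
            int counter = 0;
            for (T record : records) {
                counter++;              // 1-based position of the current record
                if (counter % 2 == 0) {
                    insert(record);     // hypothetical database insert
                }
            }
        }

        static <T> void insert(T record) {
            // placeholder: replace with the real insert logic
            System.out.println("inserting " + record);
        }
    }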
I am parsing thousands of CSV files from my application, and for each parsed row I am making an insert into Cassandra. It seems that after letting it run for a while, it stops at 2048 inserts and throws a BusyConnection error.
What's the best way for me to make about 1 million inserts?
Should I export the inserts as strings into a file, then run that file directly from CQL to make these massive inserts, so I don't actually do it over the network?
We solve such issues using a script.
The script goes through the input data and:
1. takes a specific amount of data from the input and inserts it,
2. waits for a specific amount of time,
3. continues reading and inserting data.
Ad 1: for our configuration and data (at most 10 columns with mostly numbers and short text) we found that 500 to 1000 rows per chunk are optimal.
Ad 2: we define the wait time as n * t, where n is the number of rows processed in a single run of the script and t is a time constant in milliseconds. The value of t strongly depends on your configuration; for us, t = 70 ms is enough to keep the process smooth.
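A minimal sketch of that throttling loop in plain Java (the chunk size, the constant t, and the insertRows call are placeholder assumptions to adapt to your own setup):

    import java.util.List;

    public class ThrottledLoader {

        static final int CHUNK_SIZE = 500;   // rows taken per run (see point 1)
        static final long T_MILLIS = 70;     // time constant t in milliseconds (see point 2)

        static void load(List<String[]> rows) throws InterruptedException {
            for (int from = 0; from < rows.size(); from += CHUNK_SIZE) {
                int to = Math.min(from + CHUNK_SIZE, rows.size());
                List<String[]> chunk = rows.subList(from, to);
                insertRows(chunk);                       // issue the inserts for this chunk
                Thread.sleep(chunk.size() * T_MILLIS);   // wait n * t before continuing
            }
        }

        static void insertRows(List<String[]> chunk) {
            // placeholder: execute the INSERT statements for this chunk
        }
    }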
One million requests is really not that big a number; you can load the data from cqlsh using the COPY FROM command, or you can load it via your Java code as well.
From the error message it looks like you're using the asynchronous API. You can use it for high-performance inserts, but you need to control how many requests are processed at the same time (so-called in-flight requests).
There are several aspects here:
Starting with version 3 of the protocol, you may have up to 32K in-flight requests per connection instead of the default of 1024. You can configure this when creating the Cluster object.
You need to control how many requests are in flight by wrapping session.executeAsync with some kind of counter, as in the sketch below (not ideal, because it limits the total requests per session rather than the requests per connection to individual hosts; a per-host limit would require much more logic, especially around token-aware requests).
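A rough sketch of that pattern with the DataStax Java driver 3.x; the contact point, keyspace, table, and column names are made up, and the semaphore size and pooling limit are only starting points to tune:

    import java.util.concurrent.Semaphore;

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.HostDistance;
    import com.datastax.driver.core.PoolingOptions;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.ResultSetFuture;
    import com.datastax.driver.core.Session;
    import com.google.common.util.concurrent.FutureCallback;
    import com.google.common.util.concurrent.Futures;
    import com.google.common.util.concurrent.MoreExecutors;

    public class ThrottledAsyncInserts {

        // Cap on concurrent in-flight requests; tune this to your cluster.
        private static final Semaphore IN_FLIGHT = new Semaphore(1024);

        public static void main(String[] args) throws InterruptedException {
            // Raise the per-connection request limit (protocol v3+ allows up to 32K).
            PoolingOptions pooling = new PoolingOptions()
                    .setMaxRequestsPerConnection(HostDistance.LOCAL, 4096);

            Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1")               // assumed local node
                    .withPoolingOptions(pooling)
                    .build();
            Session session = cluster.connect("my_keyspace");   // hypothetical keyspace

            PreparedStatement insert = session.prepare(
                    "INSERT INTO rows_by_id (id, payload) VALUES (?, ?)");  // hypothetical table

            for (long i = 0; i < 1_000_000; i++) {
                IN_FLIGHT.acquire();   // blocks while too many requests are in flight
                ResultSetFuture f = session.executeAsync(insert.bind(i, "row " + i));
                Futures.addCallback(f, new FutureCallback<ResultSet>() {
                    @Override public void onSuccess(ResultSet rs) { IN_FLIGHT.release(); }
                    @Override public void onFailure(Throwable t) {
                        IN_FLIGHT.release();                    // release, then log/retry as needed
                        System.err.println("insert failed: " + t);
                    }
                }, MoreExecutors.directExecutor());
            }

            // Note: in real code, wait for the outstanding futures to complete before closing.
            cluster.close();
        }
    }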
I'm doing a rather large import into a SQL database, 10^8+ items, and I am doing this with a bulk insert. I'm curious to know whether the speed at which the bulk insert runs can be improved by importing multiple rows of data as a single row and splitting them once imported.
If the time to import the data is defined by the sheer volume of the data itself (i.e. 10 GB), then I'd expect that importing 10^6 rows vs 10^2 rows with the data consolidated would take about the same amount of time.
If, however, the import time is limited more by row operations and logging each line than by the data itself, then I'd expect that consolidating data would have a performance benefit. I'm not sure, however, how this would carry over if one had to then break up the data in the DB later on.
Does anyone have experience with this and can shed some light on what specifically can be done to reduce bulk insert time without simply adding that time later to split the data in DB?
Given a 10GB import, is it better to import data on separate rows or consolidate and separate the rows in the DB?
[EDIT] I'm testing this on a quad-core 2.5 GHz machine with 8 GB of RAM and 300 MB/sec of read/write to disk (striped array). The files are hosted on the same array, and the average row size varies, with some rows containing large amounts of data (> 100 KB) and many under 100 B.
I've chunked my data into 100 MB files and it takes about 40 seconds to import the file. Each file has 10^6 rows in it.
Where is the data that you are importing? If it is on another server, then the network might be the bottleneck. This then depends on the number of NICs and frame sizes.
If it is on the same server, things to play with are the batch size and the recovery model, which affect the log file. In the full recovery model, everything is written to a log file; the bulk-logged recovery model has somewhat less log overhead.
Since this is staging data, taking a full backup before the process, changing the model to simple, and then importing might reduce the time. Of course, change the model back to full and take another backup afterwards.
As for importing non-normalized data, multiple rows at a time, I usually stay away from the extra coding.
Most of the time, I use SSIS packages. More packages and threads mean a fuller NIC pipe. I usually have at least a 4 GB backbone that is seldom full.
Other things that come into play are your disks. Do you have multiple files (pathways) to the RAID 5 array? If not, you might want to think about it.
In short, it really depends on your environment.
Use a DMAIC process.
1 - Define what you want to do
2 - Measure the current implementation
3 - Analyze ways to improve.
4 - Implement the change.
5 - Control the environment by remeasuring.
Did the change go in the positive direction?
If not, roll back the change and try another one.
Repeat the process until the desired result (timing) is achieved.
Good luck, J
If this is a one-time thing done in an offline change window, you may want to consider putting the database into the simple recovery model prior to inserting the data.
Keep in mind, though, that this would break the log chain.
I'm a Node.js newbie and was wondering which way is better for inserting a huge number of rows into a DB. On the surface, inserting records one at a time looks like the way to go because I can free the event loop quickly and serve other requests, but the code is hard to understand that way. For bulk inserts I'd have to prepare the data beforehand, which would certainly mean using loops, and that would cause fewer requests to be served during that period because the event loop is busy with the loop.
So, what's the preferred way ? Is my analysis correct ?
There's no right answer here. It depends on the details: why are you inserting a huge number of rows? How often? Is this just a one-time bootstrap or does your app do this every 10 seconds? It also matters what compute/IO resources are available. Is your app the only thing using the database or is blasting it with requests going to be a denial of service for other users?
Without the details, my rule of thumb would be bulk insert with a small concurrency limit, like fire off up to 10 inserts, and then wait until one of them finishes before sending another insert command to the database. This follows the model of async.eachLimit. This is how browsers handle concurrent requests to a given web site, and it has proven to be a reasonable default policy.
In general, loops on in-memory objects should be fast, very fast.
I know you're worried about blocking the CPU, but you should be considering the total amount of work to be done. Sending items one at a time carries a lot of overhead. Each query to the DB has its own sequence of inner for loops that probably make your "batching" for loop look pretty small.
If you need to dump 1000 things in the DB, the minimum amount of work you can do is to run it all at once. If you make it 10 batches of 100 "things", you have to do all of the same work, plus you have to generate and track all of those requests.
So how often are you doing these bulk inserts? If this is a regular occurrence, you probably want to minimize the total amount of work and bulk insert everything at once.
The trade-off here is logging and retries. It's usually not enough to just perform some type of bulk insert and forget about it. The bulk insert is eventually going to fail (fully or partially) and you will need some type of logic for retries or consolidation.
If that's a concern, you probably want to manage the size of the bulk insert so that you can retry blocks intelligently.
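The block-and-retry idea isn't specific to Node.js; here is a rough sketch of the shape of it (shown in plain Java, with insertBatch as a hypothetical stand-in for whatever bulk-insert call your driver provides):

    import java.util.List;

    public class RetryableBulkInsert {

        static final int BLOCK_SIZE = 1000;   // size of each retryable block
        static final int MAX_RETRIES = 3;

        static void insertAll(List<String> rows) {
            for (int from = 0; from < rows.size(); from += BLOCK_SIZE) {
                int to = Math.min(from + BLOCK_SIZE, rows.size());
                insertWithRetry(rows.subList(from, to));
            }
        }

        static void insertWithRetry(List<String> block) {
            for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
                try {
                    insertBatch(block);   // one bulk insert for the whole block
                    return;               // block committed, move on to the next one
                } catch (Exception e) {
                    System.err.println("block failed (attempt " + attempt + "): " + e);
                    // fall through and retry just this block, not the whole load
                }
            }
            // after MAX_RETRIES, log or park the block for later consolidation
        }

        static void insertBatch(List<String> block) {
            // placeholder: execute the actual bulk insert
        }
    }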
I've got approximately 1000 rows (4 fields) in a MySQL database. I know for a fact that the data in the database is not going to change very often (they are GPS coordinates). Is it better for me to call this information from the database every time the appropriate script is loaded, or would it be better for me to "hard code" the data into the script, and when I do make a change to the database, simply update the hard coded data too?
I'm wondering if this improves performance, but part of me thinks this may not be best practice.
Thanks
Hard coding coordinates into a script is not a good idea.
I would read the 1000 coordinates at startup into an array, either from the SQL DB or from a file.
But do that reading only once at startup, and not at each calculation step.
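A minimal sketch of that load-once idea in plain Java with JDBC (the connection URL, credentials, table, and column names are all made up):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;
    import java.util.ArrayList;
    import java.util.List;

    public class CoordinateCache {

        private static List<double[]> coordinates;   // loaded once, reused for every calculation

        static synchronized List<double[]> getCoordinates() throws SQLException {
            if (coordinates == null) {
                coordinates = new ArrayList<>();
                // hypothetical connection string, table, and column names
                try (Connection conn = DriverManager.getConnection(
                             "jdbc:mysql://localhost/geo", "user", "password");
                     Statement st = conn.createStatement();
                     ResultSet rs = st.executeQuery("SELECT lat, lng FROM gps_points")) {
                    while (rs.next()) {
                        coordinates.add(new double[] { rs.getDouble("lat"), rs.getDouble("lng") });
                    }
                }
            }
            return coordinates;   // later calls skip the query entirely
        }
    }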
Given that changes might occur once or twice per month, and that 0.0063 seconds isn't very much (at least not from my point of view; if it were a matter of life or death, or very important Wall Street stock data, that would be another matter), my recommendation is that you use the SQL query. Of course, only as long as you perform the query once per script execution.
Indeed, it could improve performance by some milliseconds if you hard-code the data into your script. But ask yourself the question: how much extra work is needed to maintain the hard-coded data? If you really want to be sure, make a version of the script where you hard-code the data, execute the script 1000 times, and measure the time difference. (However, just running this test would probably take more time than it would save...)
If your script is run 5000 times per day and each time the SQL takes an extra 0.01 seconds compared to having hard-coded values, that's a total of 50 seconds per day across your users. However, each individual user will most likely not notice any difference.