How to Handling Incremental Load with large datasets ssis

How to Handling Incremental Load with large datasets ssis - database

I have 2 tables (~ 4 million rows) that I have to do insert/update actions on matching and unmatching records. I am pretty confused about the method I have to use for incremental load. Should I use Lookup component or new sql server merge statement? and will there be too much performance differences?

I've run into this exact problem a few times and I've always had to resort to loading the complete dataset into SQLserver via ETL, and then manipulating with stored procs. It always seemed to take way, way too long updating the data on the fly in SSIS transforms.

The SSIS Lookup has three caching modes which are key to getting the best performance from it. If you are looking up against a large table, FULL Cache mode will eat up a lot of your memory and could hinder performance. If your lookup destination is small, keep it in memory. You've also got to decide if the data you are looking up against is changing as you process data. If it is, then you don't want to cache.
Can you give us some more info on what you are oding so I can formulate a more precise answer.

Premature optimization is the root of all evil, I don't know about ssis, but it's always to early to think about this.
4 million rows could be "large" or "small", depending on the type of data, and the hardware configuration you're using.

Related

How to improve ESRI/ArcGIS database performance while maintaining normalization?

I work with databases containing spatial data. Most of these databases are in a proprietary format created by ESRI for use with their ArcGIS software. We store our data in a normalized data model within these geodatabases.
We have found that the performance of this database is quite slow when dealing with relationships (i.e. relating several thousand records to several thousand records can take several minutes).
Is there any way to improve performance without completely flattening/denormalizing the database or is this strictly limited by the database platform we are using?

There is only one way: measure. Try to obtain a query plan, and try to read it. Try to isolate a query from the logfile, edit it to an executable (non-parameterised) form, and submit it manually (in psql). Try to tune it, and see where it hurts.
Geometry joins can be costly in term of CPU, if many (big) polygons have to be joined, and if their bounding boxes have a big chance to overlap. In the extreme case, you'll have to do a preselection on other criteria (eg zipcode, if available) or maintain cache tables of matching records.
EDIT:
BTW: do you have statistics and autovacuum? IIRC, ESRI is still tied to postgres-8.3-something, where these were not run by default.
UPDATE 2014-12-11
ESRI does not interfere with non-gis stuff. It is perfectly Ok to add PK/FK relations or additional indexes to your schema. The DBMS will pick them up if appropiate. And ESRI will ignore them. (ESRI only uses its own meta-catalogs, ignoring the system catalogs)

When I had to deal with spatial data, I tended to precalulate the values and store them. Yes that makes for a big table but it is much faster to query when you only do the complex calculation once on data entry. Data entry does take longer though. I was in a situation where all my spacial data came from a monthly load, so precalculating wasn't too bad.

Optimize database for web usage (lots more reading than writing)

I am trying to layout the tables for use in new public-facing website. Seeing how there will lots more reading than writing data (guessing >85% reading) I would like to optimize the database for reading.
Whenever we list members we are planning on showing summary information about the members. Something akin to the reputation points and badges that stackoverflow uses. Instead of doing a subquery to find the information each time we do a search, I wanted to have a "calculated" field in the member table.
Whenever an action is initiated that would affect this field, say the member gets more points, we simply update this field by running a query to calculate the new values.
Obviously, there would be the need to keep this field up to date, but even if the field gets out of sync, we can always rerun the query to update this field.
My question: Is this an appropriate approach to optimizing the database? Or are the subqueries fast enough where the performance would not suffer.

There are two parts:
Caching
Tuned Query
Indexed Views (AKA Materialized views)
Tuned table
The best solution requires querying the database as little as possible, which would require caching. But you still need a query to fill that cache, and the cache needs to be refreshed when it is stale...
Indexed views are the next consideration. Because they are indexed, querying against is faster than an ordinary view (which is equivalent to a subquery). Nonclustered indexes can be applied to indexed views as well. The problem is that indexed views (materialized views in general) are very constrained to what they support - they can't have non-deterministic functions (IE: GETDATE()), extremely limited aggregate support, etc.
If what you need can't be handled by an indexed view, a table where the data is dumped & refreshed via a SQL Server Job is the next alternative. Like the indexed view, indexes would be applied to make fetching data faster. But data change means cleaning up the indexes to ensure the query is running as best it can, and this maintenance can take time.

The least expensive database query is the one that you don't have to run against the database at all.
In the scenario you describe, using a high-performance cache technology (example: memcached) to store query results in your application can be a lot better strategy than trying to trick out the database to be highly scalable.

The First Rule of Program Optimization: Don't do it.
The Second Rule of Program Optimization (for experts only!): Don't do it yet.
Michael A. Jackson
If you are just designing the tables, I'd say, it's definitely premature to optimize.
You might want to redesign your database a few days later, you might find out that things work pretty fast without any clever hacks, you might find out they work slow, but in a different way than you expected. In either case you would waste your time, if you start optimizing now.
The approach you describe is generally fine; you could get some pre-computed values, either using triggers/SPs to preserve data consistency, or running a job to update these values time-to-time.

All databases are more than 85% read only! Usually high nineties too.
Tune it when you need to and not before.

30 million records a day, SQL Server can't keep up, other kind of database system needed?

Some time ago I thought an new statistics system over, for our multi-million user website, to log and report user-actions for our customers.
The database-design is quite simple, containing one table, with a foreignId (200,000 different id's), a datetime field, an actionId (30 different id's), and two more fields containing some meta-information (just smallints). There are no constraints to other tables. Furthermore we have two indexes each containing 4 fields, which cannot be dropped, as users are getting timeouts when we are having smaller indexes. The foreignId is the most important field, as each and every query contains this field.
We chose to use SQL server, but after implementation doesn't a relational database seem like a perfect fit, as we cannot insert 30 million records a day (it's insert only, we don't do any updates) when also doing alot of random reads on the database; because the indexes cannot be updated fast enough. Ergo: we have a massive problem :-) We have temporarily solved the problem, yet
a relational database doesn't seem to be suited for this problem!
Would a database like BigTable be a better choice, and why? Or are there other, better choices when dealing with this kind of problems?
NB. At this point we use a single 8-core Xeon system with 4 GB memory and Win 2003 32-bit. RAID10 SCSI as far as I know. The index size is about 1.5x the table size.

You say that your system is capable of inserting 3000 records per second without indexes, but only about 100 with two additional non-clustered indexes. If 3k/s is the maximum throughput your I/O permits, adding two indexes should in theory reduces the throughput at about 1000-1500/sec. Instead you see a degradation 10 times worse. The proper solution and answer is 'It Dependts' and some serious troubleshooting and bottleneck identification would have to be carried out. With that in mind, if I was to venture a guess, I'd give two possible culprits:
A. Th additional non-clustered indexes distribute the writes of dirty pages into more allocation areas. The solution would be to place the the clustered index and each non-clustered index into its own filegroup and place the three filegroups each onto separate LUNs on the RAID.
B. The low selectivity of the non-clustered indexes creates high contention between reads and writes (key conflicts as well as %lockres% conflicts) resulting in long lock wait times for both inserts and selects. Possible solutions would be using SNAPSHOTs with read committed snapshot mode, but I must warn about the danger of adding lot of IO in the version store (ie. in tempdb) on system that may already be under high IO stress. A second solution is using database snapshots for reporting, they cause lower IO stress and they can be better controlled (no tempdb version store involved), but the reporting is no longer on real-time data.
I tend to believe B) as the likely cause, but I must again stress the need to proper investigation and proper root case analysis.
'RAID10' is not a very precise description.
How many spindles in the RAID 0 part? Are they short-striped?
How many LUNs?
Where is the database log located?
Where is the database located?
How many partitions?
Where is tempdb located?
As on the question whether relational databases are appropriate for something like this, yes, absolutely. There are many more factors to consider, recoverability, availability, toolset ecosystem, know-how expertise, ease of development, ease of deployment, ease of management and so on and so forth. Relational databases can easily handle your workload, they just need the proper tuning. 30 million inserts a day, 350 per second, is small change for a database server. But a 32bit 4GB RAM system hardly a database server, regardless the number of CPUs.

It sounds like you may be suffering from two particular problems. The first issue that you are hitting is that your indexes require rebuilding everytime you perform an insert - are you really trying to run live reports of a transactional server (this is usually considered a no-no)? Secondly, you may also be hitting issues with the server having to resize the database - check to ensure that you have allocated enough space and aren't relying on the database to do this for you.
Have you considered looking into something like indexed views in SQL Server? They are a good way to remove the indexing from the main table, and move it into a materialised view.

You could try making the table a partitioned one. This way the index updates will affect smaller sets of rows. Probably daily partitioning will be sufficient. If not, try partitioning by the hour!

You aren't providing enough information; I'm not certain why you say that a relational database seems like a bad fit, other than the fact that you're experiencing performance problems now. What sort of machine is the RDBMS running on? Given that you have foreign ID's, it seems that a relational database is exactly what's called for here. SQL Server should be able to handle 30 million inserts per day, assuming that it's running on sufficient hardware.

Replicating the database for reporting seems like the best route, given heavy traffic. However, a couple of things to try first...
Go with a single index, not two indexes. A clustered index is probably going to be a better choice than non-clustered. Fewer, wider indexes will generally perform better than more, narrower, indexes. And, as you say, it's the indexing that's killing your app.
You don't say what you're using for IDs, but if you're using GUIDs, you might want to change your keys over to bigints. Because GUIDs are random, they put a heavy burden on indexes, both in building indexes and in using them. Using a bigint identity column will keep the index running pretty much chronological, and if you're really interested in real-time access for queries on your recent data, your access pattern is much better suited for monotonically increasing keys.

Sybase IQ seems pretty good for the goal as our architects/DBAs indicated (as in, they explicitly move all our stats onto IQ stating that capability as the reason). I can not substantiate myself though - merely nod at the people in our company who generally know what they are talking about from past experience.
However, I'm wondering whether you MUST store all 30mm records? Would it not be better to store some pre-aggregated data?

Not sure about SQL server but in another database system I have used long ago, the ideal method for this type activity was to store the updates and then as a batch turn off the indexes, add the new records and then reindex. We did this once per night. I'm not sure if your reporting needs would be a fit for this type solution or even if it can be done in MS SQL, but I'd think it could.

You don't say how the inserts are managed. Are they batched or is each statistic written separately? Because inserting one thousand rows in a single operation would probably be way more efficient than inserting a single row in one thousand separate operations. You could still insert frequently enough to offer more-or-less real time reporting ;)

ensuring database performance as data volume increases

I am currently into a performance tuning exercise. The application is DB intensive with very little processing logic. The performance tuning is around the way DB calls are made and the DB itself.
We did the query tuning, We put the missing indexes, We reduced or eliminated DB calls where possible. The application is performing very well and all is fine.
With smaller data volume (say upto 100,000 records), the performance is fantastic. My Question is, what needs to be done to ensure such good performance at higher data volumes ?
The data volumes are expected to reach 10 million records.
I can think of table and index partitioning, suggesting filesystems optimized for DB storage and periodic archiving to keep the number of rows in check. I would like to know what else could be done. Any tips/strategies/patterns would be very helpful.

Monitoring. Use some tools to monitor performance, and saturation of CPU, memory, and I/O. Make trend lines so you know where your next bottleneck will be before you get there.
Testing. Create mock data so you have 10 million rows on a testing server today. Benchmark the queries you have in your application and see how well they perform as the volume of data increases. You might be surprised at what breaks down first, or it may go exactly as predicted. The point is that you can find out.
Maintenance. Make sure your application and infrastructure support some downtime, because that's always necessary. You might have to defrag and rebuild your indexes. You might have to refactor some of the table structure. You might have to upgrade the server software or apply patches. To do this without interrupting continuous operation, you'll need some redundancy built in to the design.
Research. Find the best journals and blogs for the database brand you're using, and read them (e.g. http://www.mysqlperformanceblog.com if you use MySQL). You can ask good questions like the one you ask here, but also read what other people are asking, and what they're being advised to do about it. You can learn solutions to problems that you don't even have yet, so that when you have them, you'll have some strategies to employ.

Different databases need to be tuned in different ways. What RDBMS are you using?
Also, how do you know whether or not what you have done so far will result in poor performance with larger data sets? Have you tested your current optimisations with a large amount of test data?
When you did this, how did the performance change? If you are able to tune the database so that it performs with the data it has now, there's no reason to think that your methods won't work with a larger data set.
Depending on the RDBMS, the next type of solution is simple: get bigger, beefier hardware. More RAM, more disks, more CPUs.

You are on the right way:
1) Proper indexes
2) DBMS options tuning (memory caches, buffers, internal threads control and so on)
3) Queries tuning (especially log slow queries and then tune/rewrite them)
4) To tune your queries and indexes you may need to research your queries execution plans
5) Poweful dedicated server
6) Think about queries which your client applications send to the database. Are they always necessary? Do you need all the data you ask for? Is it possible to cache some data?

10 million records is probably too small to bother with partitioning. Typically partitioning will only be interesting if your data volumes are an order or magnitude or so bigger than that.
Index tuning for a database with 100,000 rows will probably get you 99% of what you need with 10 million rows. Keep an eye out for table scans or index range scans on the large tables in the system. On smaller tables they are fine and in some cases even optimal.
Archiving old data may help but this is probably overkill for 10 million rows.
One possible optimisation is to move reporting off onto a separate server. This will reduce the burden on the server - reports are often quite anti-social when run on operational systems as the schema tends not to be well optimised for it.
You can use database replication to do this or make a data mart for reporting. Replication is easier to implement but the reports will be less efficient, no more efficient than they were on the production system. Building a star schema data mart will be more efficient for reporting but incur additional development work.

Are stored procedures required for large data sets?

I've just started my first development job for a reasonably sized company that has to manage a lot of data. An average database is 6gb (from what I've seen so far). One of the jobs is reporting. How it's done currently is -
Data is replicated and transferred onto a data warehouse. From there, all the data required for a particular report is gathered (thousands of rows and lots of tables) and aggregated to a reports database in the warehouse. This is all done with stored procedures.
When a report is requested, a stored procedure is invoked which copies the data onto a reports database which PHP reads from to display the data.
I'm not a big fan of stored procs at all. But the people I've spoken to insist that stored procedures are the only option, as queries directly against the data via a programming language are incredibly slow (think 30 mins?). Security is also a concern.
So my question is - are stored procedures required when you have a very large data set? Do queries really take that long on such a large amount of data or is there a problem with either the DB servers or how the data is arranged (and indexed?). I've got a feeling that something is wrong.

The reasoning behind using a stored procedure is that the execution plan that is created in order to execute your procedure is cached by SQL Server in an area of memory known as the Plan Cache. When the procedure is then subsequently re-run at a later time, the execution plan has the possibility of being re-used.
A stored procedure will not run any faster than the same query, executed as a batch of T-SQL. It is the execution plans re-use that result in a performance improvement. The query cost will be the same for the actual T-SQL.
Offloading data to a reporting database is a typical pursuit however you may need to review your indexing strategy on the reporting database as it will likely need to be quite different from that of your OLTP platform for example.
You may also wish to consider using SQL Server Analysis Services in order to service your reporting requirements as it sounds like your reports contain lots of data aggregations. Storing and processing data for the purpose of fast counts and analytics is exactly what SSAS is all about. It sounds like it is time for your business to look as building a data warehouse.
I hope this helps but please feel free to request further details.
Cheers, John

In the context in which you are operating - large corporate database accessed in several places - it is virtually always best to place as much business logic inside the database as is possible.
In this case your immediate performance benefits are :
Firstly because if the the SP involves any processing beyond a simple select the processing of the data within the database can be orders of magnitude faster than sending rows across the network to your program for handling there.
You do acquire some benefits in that the SP is stored compiled. This is usually marginal compared to 1. if processing large volumes
However, and in my mind often more important than performance, is the fact that with corporate databases encapsulating the logic inside the database itself provides major management and maintenance benefits:-
Data structures can be abstracted away from program logic, allowing database structures to change without requiring changes to programs accessing the data. Anyone who has spent hours grep'ing a corporate codebase for SQL using [mytable] before making a simple database change will appreciate this.
SPs can provide a security layer, although this can be overused and overrelied on.
You say this is your first job for a company with a database of this type, so you can be forgiven for not appreciating how a database-centric approach to handling the data is really essential in such environments. You are not alone either - in a recent podcast Jeff Attwood said he wasn't a fan of putting code into databases. This is a fine and valid opinion where you are dealing with a database serving a single application, but is 100% wrong with a database used across a company by several applications, where the best policy is to screw down the data with a full complement of constraints and use SPs liberally for access and update.
The reason for this is if you don't such databases always lose data integrity and accumulate crud. Sometimes it's virtually impossible to imagine how they do, but in any large corporate database (tens of millions of records) without sufficient constraints there will be badly formed records - at best these force a periodic clean-up of data (a task I regularly used to get dumped with as a junior programmer), or worse will cause applications to crash due to invalid inputs, or even worse not cause them to crash but deliver incorrect business information to the end-users. And if your end user is your finance director then that's your job on the line :-)

It seems to me that there is an additional step in there that, based on your description, appears unneccessary. Here is what I am referring to -
When a report is requested, a stored
procedure is invoked which gathers the
data into a format required for a
report, and forwarded to another
stored procedure which transforms the
data into a view, and forwards THAT
off to a PHP framework for display.
A sproc transforms the data for a report, then another sproc transforms this data into another format for front-end presentation - is the data ever used in the format in which it is in after the first sproc? If not, that stage seems unneccessary to me.
I'm assuming that your reports database is a data warehouse and that data is ETL'ed and stored within in a format for the purposes of reporting. Where I currently work, this is common practice.
As for your question regarding stored procedures, they allow you to centralize logic within the database and "encapsulate" security, the first of which would appear to be of benefit within your organisation, given the other sprocs that you have for data transformation. Stored procedures also have a stored execution plan which, under some circumstances, can provide some improvement to performance.

I found that stored procedures help with large data sets because they eliminate a ton of network traffic, which can be a huge performance bottleneck depending on how large the data set actually is.

When processing large numbers of rows, where indexes are available and the SQL is relatively tuned, the database engine performing set-based operations directly on the data - through SQL, say - will almost always outperform row-by-row processing (even on the same server) in a client tool. The data is not crossing any physical or logical boudaries to leave the database server processes or to leave the database server and go out across the network. Even performing RBAR (row by agonizing row) on the server will be faster than performing it in a client tool, if only a limited amount of data really needs to ever leave the server, because...
When you start to pull more data across networks, then the process will slow down and limiting the number of rows at each stage becomes the next optimization.
All of this really has nothing to do with stored procedures. Stored procedures (in SQL Server) no longer provide much performance advantages over batch SQL. Stored procedures do provide a large number of other benefits like modularization, encapsulation, security management, design by contract, version management. Performance, however is no longer an advantage.

Generally speaking stored procedures have a number of advantages over direct queries. I can't comment on your complete end to end process, however, SPs will probably perform faster. For a start a direct query needs to be compiled and an execution plan worked out every time you do a direct query - SPs don't.
There are other reasons, why you would want to use stored procedure - centralisation of logic, security etc.

The end to end process does look a little complicated but there may be good reasons for it simply due to the data volume - it might well be that if you run the reports on the main database, the queries are slowing down the rest of the system so much that you'll cause problems for the rest of the users.
Regarding the stored procedures, their main advantage in a scenario like this is that they are pre-compiled and the database has already worked out what it considers to be the optimal query plan. Especially with the data volumes you are talking about, this might well result in a very noticeable performance improvement.
And yes, depending on the complexity of the report, a query like this can take half an hour or longer...

This reporting solution seems to have been designed by people that think the database is the centre of the world. This is a common and valid view – however I don’t always hold to it.
When moving data between tables/databases, it can be a lot quicker to use stored procs, as the data does not need to travel between the database and the application. However in most cases, I would rather not use stored proc as they make development more complex, I am in the ORM camp myself. You can sometimes get great speedups by loading lots into RAM and processing it there, however that is a totally different way of coding and will not allow the reuse of the logic that is already in the stored procs. Sorry I think you are stack with stored proc while in that job.
Giving the amount of data being moved about, if using SQL server I would look at using SSIS or DTS – oracle will have something along the same line. SSIS will do the data transformations on many threads while taking care of a lot of the details for you.
Remember the design of software has more to do with the history of the software and the people working it in, than it has to do with the “right way of doing it”. Come back in 100 years and we may know how to write software, at present it is mostly a case of the blind leading the blind. Just like when the first bridges were build and a lot of them fell down, no one could tell you in advance witch bridge would keep standing and why.

Unlike autogenerated code from an ORM product, stored procs can be performance tuned. This is critical in large production environment. There are many ways to tweak performance that are not available when using an ORM. Also there are many many tasks performed by a large database which have nothing to do with the user interface and thus should not be run from code produced from there.
Stored procs are also required if you want to control rights so that the users can only do the procedures specified in the proc and nothing else. Otherwise, users can much more easily make unauthorized changes to the databases and commit fraud. This is one reason why database people who work with large business critical systems, do not allow any access except through stored procs.
If you are moving large amounts of data to other servers though, I would consider using DTS (if using SQL Server 2000) or SSIS. This may speed up your processes still further, but it will depend greatly on what you are doing and how.
The fact that sps may be faster in this case doesn't preclude that indexing may be wrong or statistics out of date, but generally dbas who manage large sets of data tend to be pretty on top of this stuff.
It is true the process you describe seems a bit convoluted, but without seeing the structure of what is happening and understanding the database and environment, I can't say if maybe this is the best process.
I can tell you that new employees who come in and want to change working stuff to fit their own personal predjudices tend to be taken less than seriously and then you will have little credibility when you do need to suggest a valid change. This is particularly true when your past experience is not with databases of the same size or type of processing. If you were an expert in large systems, you might be taken more seriously from the start, but, face it, you are not and thus your opinion is not likely to sway anybody until you have been there awhile and they have a measure of your real capabilities. Plus if you learn the system as it is and work with it as it is, you will be in a better position in six months or so to suggest improvements rather than changes.

I could perhaps come up with more, but a few points.
Assuming a modern DB, stored procedures probably won't actually be noticeably faster than normal procedures due to caching and the like.
The security benefits of Stored procedures are somewhat overrated.
Change is evil. Consistency is king.
I'd say #3 trumps all other concerns unless stored procedures are causing a legitimate problem.

The faster way for reporting is to just read all data into memory (64 bit OS required) and just walk the objects. This is of course limited to ram size (affordable 32 GB) and reports where you hit a large part of the db. No need to make the effort for small reports.
In the old days I could run a report querying over 8 million objects in 1.5 seconds. That was in about a gigabyte of ram on a 3GHz pentium 4. 64 bit should be about twice as slow, but that is compensated by faster processors.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight