I recently completed an SSIS course.
One of the pieces of best practice I came away with was to ALWAYS use stored procedures in data flow tasks in SSIS.
I guess there is an argument around security; however, the tutor said that because the stored procedures perform all of the work "natively" on the SQL Server, there is a significant performance boost.
Is there any truth to this, or are there articles that debate the point?
Thanks
Remember - most courses are taught by clueless people, because people with knowledge earn money doing consulting, which pays a LOT better than training. Most trainers live in a glass house and have never spent 9 months working on a 21 TB data warehouse ;)
This is wrong. Period.
It only makes sense when the SQL statement does not pull data out of the database - for example, merging tables etc.
Otherwise it is a question of how smartly you set up the SSIS side. SSIS can write data without using SQL, using bulk copy mechanisms. SSIS is a lot more flexible, and if you pull data from a remote database then the argument of not leaving the database (i.e. processing natively) is a stupid point to make. When I copy data from SQL Server A to SQL Server B, an SP on B cannot process the data from A natively.
In general, it is only faster when you take data FROM A and push it TO A and all the processing can be done in a simple SP - which is a degenerate edge case (i.e. a simplistic one).
The advantage of SSIS is the flexibility of processing data in an environment designed for data flow, which in many cases is what the project needs, and doing that in stored procedures would turn into a nightmare.
Old thread, but a pertinent topic.
For a data source connection, I favor SPs over embedded queries when A) the logic is simple enough to be handled either way, and B) supporting the SP is easier than working with the package.
I haven't found much, if any, difference in performance for the data source if the SP returns a fairly straightforward result set.
Our shop has a more involved deploy process for packages, which makes SPs a preferred source.
I have not found very many applications for an SP as a data destination, except maybe an occasional logging SP call.
I have a SQL Server database which is used to store data coming from a lot of different sources (writers).
I need to provide users with some aggregated data, however in SQL Server this data is stored in several different tables and querying it is too slow (a join of 5 tables with several million rows in each, one-to-many).
I'm currently thinking that the best way is to extract the data, transform it and store it in a separate database (let's say MongoDB, since it will be used only for reads).
I don't need the data to be live, just not older than 24 hours compared to the 'master' database.
But what's the best way to achieve this? Can you recommend any tools for it (preferably free) or is it better to write your own piece of software and schedule it to run periodically?
I recommend avoiding NIH (Not Invented Here) syndrome here; reading and transforming data is a well-understood exercise. There are several free ETL tools available, with different approaches and focus. Pentaho (formerly Kettle) and Talend are UI-based examples. There are other ETL frameworks like Rhino ETL that merely hand you a set of tools to write your transformations in code. Which one you prefer depends on your knowledge and, unsurprisingly, preference. If you are not a developer, I suggest using one of the UI-based tools. I have used Pentaho ETL in a number of smaller data warehousing scenarios; it can be scheduled using operating system tools (cron on Linux, Task Scheduler on Windows). More complex scenarios can make use of the Pentaho PDI repository server, which allows central storage and scheduling of your jobs and transformations. It has connectors for several database types, including MS SQL Server. I haven't used Talend myself, but I've heard good things about it and it should be on your list too.
The main advantage of sticking with a standard tool is that once your demands grow, you'll already have the tools to deal with them. You may be able to solve your current problem with a small script that executes a complex select and inserts the results into your target database. But experience shows those demands seldom stay the same for long, and once you have to incorporate additional databases or maybe even some information in text files, your scripts become less and less maintainable, until you finally give in and redo your work in a standard toolset designed for the job.
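For what it's worth, the "small script" path really can start out tiny, which is exactly why it tends to grow unmaintainable later. Here is a minimal sketch in Java, assuming the Microsoft JDBC driver and the MongoDB Java driver are on the classpath; the connection strings, the aggregate query, the table and the collection name are placeholders invented for the example, not anything from your system:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

import org.bson.Document;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;

public class NightlyAggregateExport {

    public static void main(String[] args) throws Exception {
        // Placeholder aggregate; in practice this is your slow 5-table join,
        // pre-aggregated once a day instead of at query time.
        String sql = "SELECT customer_id, SUM(amount) AS total, COUNT(*) AS orders "
                   + "FROM dbo.Orders GROUP BY customer_id";

        try (Connection sqlServer = DriverManager.getConnection(
                 "jdbc:sqlserver://prod-db;databaseName=Sales;user=etl;password=changeme");
             MongoClient mongo = MongoClients.create("mongodb://report-db:27017");
             Statement stmt = sqlServer.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {

            MongoCollection<Document> target =
                mongo.getDatabase("reporting").getCollection("customer_totals");

            target.deleteMany(new Document());      // full refresh: drop yesterday's snapshot

            List<Document> batch = new ArrayList<>();
            ResultSetMetaData meta = rs.getMetaData();
            while (rs.next()) {
                Document doc = new Document();
                for (int i = 1; i <= meta.getColumnCount(); i++) {
                    // getObject() is fine for simple ints/strings; real code
                    // would map SQL types to BSON types explicitly.
                    doc.append(meta.getColumnLabel(i), rs.getObject(i));
                }
                batch.add(doc);
                if (batch.size() == 1000) {          // insert in chunks
                    target.insertMany(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                target.insertMany(batch);
            }
        }
    }
}

Scheduled with cron or the Windows Task Scheduler, that covers the 24-hour freshness requirement; the maintainability concerns above start to bite once you need incremental loads, more sources or proper error handling, which is where the standard tools pay off.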
It's obvious that an HQL query is slower than a native one. There is a project in mind which will need to perform a huge number of small transactions, so the question is which will perform better:
a native query in JPA
a native query through JDBC
How big is the difference? Because of JPA's mapping capabilities and prepared statements I would prefer it, but given the performance requirements it could be that Hibernate will not be fast enough...
EDIT
replaced "Huge amount of data to process" to "huge amount of small transactions to perform"
It is a way too little info you have given.
Huge amount of data to process
How huge is that? Like a few tens of millions of records, or a few hundred gigabytes?
How do you plan to process that data? If you pull it to the Java side via Hibernate (or a native query, or JDBC), then you are probably on the wrong track. You should keep the data in the database and process it there with the tools the database offers for that; they are light-years more performant than any client-side processing. Consider database-side processing (Oracle's PL/SQL, MS SQL Server's Transact-SQL).
What are you planning to optimize? Data insertion? Data retrieval? Are you planning to use select statements which dig through a lot of data? Then consider an OLAP solution over classic OLTP. OLAP solutions are built for business intelligence and for analyzing huge amounts of data with a few clever tricks. Google for it (OLAP, decision cubes).
Can you use any of the capabilities of the underlying SQL engine? For example, if you are using Oracle, you have 1000 times more features available than you can actually use through Hibernate. For example, you simply cannot make an Oracle Text query in Hibernate, and there are really a lot of things you cannot do.
I could sum this up by saying that the performance difference is not between using native SQL and HQL. Instead:
Hibernate is powerful at what it was built for: handling very few records of data, optimistically caching them locally on the Java side, as a groundwork for data processing systems built on databases which are not capable of data processing themselves (but only of some selects, inserts, updates and deletes).
Once you really have to move a huge amount of data, Java-side processing is not an option. Programmers in the '80s invented stored procedures for exactly this reason. Pick a database which supports database-side processing, to shortcut all the network round trips and the imperative data processing with for-loops in your Java code. Prefer as much declarative SQL as you can instead, and run the processing on your database.
Once you start using the features of your database, Hibernate will be pretty much in your way. It is really built as an ORM wrapper - however, data processing problems are not always CRUD, ORM-able problems. For example, Hibernate is not too useful for an OLAP use case.
Huge amount of small transactions
In the case of having lots of small transactions (inserting/updating data in the database), Hibernate does not have any performance advantage (since no Java-side caching can be utilized), regardless of the database in use. However, you may still prefer Hibernate, since it is a nice tool for converting Java objects to SQL statements.
But with Hibernate, you are totally not in control of what is happening. For example, Hibernate + Oracle, inserting new entities into the database: this is the worst performance nightmare you can imagine.
This is what Hibernate does:
select one new id from a sequence
execute one insert
repeat a zillion times
This performs very poorly. (The sequence reference should be part of the insert statement, and the whole insert should use JDBC batching.)
I found that a JDBC-based approach runs about 1000 times quicker than Hibernate in this particular use case: prefetch the next 100 sequence values, use JDBC batch mode for Oracle, bind variables to the batch, send the records down in batches of 100, and use asynchronous commit (yet again something you cannot control in Hibernate).
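To make that concrete, here is a minimal sketch of such a JDBC batch insert (Oracle flavour). The EVENTS table, its columns and the EVENTS_SEQ sequence are invented for the example; the sequence is assumed to be defined with INCREMENT BY 100 so that one round trip yields a whole block of usable ids, and the asynchronous-commit part is left out because it relies on Oracle-specific session settings:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

public class BatchInsertSketch {

    private static final int BATCH_SIZE = 100;

    public static void insert(Connection con, List<String> payloads) throws SQLException {
        con.setAutoCommit(false);

        try (PreparedStatement insert =
                 con.prepareStatement("INSERT INTO events (id, payload) VALUES (?, ?)")) {

            long nextId = 0;
            int idsLeft = 0;
            int pending = 0;

            for (String payload : payloads) {
                if (idsLeft == 0) {
                    // One round trip buys a block of 100 ids, because the
                    // sequence is defined with INCREMENT BY 100 (assumption).
                    nextId = nextSequenceBlock(con);
                    idsLeft = BATCH_SIZE;
                }
                insert.setLong(1, nextId++);        // bind variables: one parsed statement
                insert.setString(2, payload);
                insert.addBatch();
                idsLeft--;

                if (++pending == BATCH_SIZE) {
                    insert.executeBatch();          // one round trip sends 100 rows
                    pending = 0;
                }
            }
            if (pending > 0) {
                insert.executeBatch();
            }
            con.commit();
        }
    }

    private static long nextSequenceBlock(Connection con) throws SQLException {
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT events_seq.NEXTVAL FROM dual")) {
            rs.next();
            return rs.getLong(1);
        }
    }
}

The two knobs that matter here are the bind variables (one parsed statement for every row) and the batch size (one network round trip per 100 rows instead of one per row).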
I found that if I want to squeeze the most out of my tools, I need to learn them in depth. Unfortunately you can find lots of opinion-based comments, especially in Hibernate vs. non-Hibernate wars on the net. Most of them are written by Java developers who have literally no idea what happens behind the scenes. So, don't believe - measure :)
The debate over the use of inline code vs stored procedures pretty much always centres around simple CRUD operations and whether to call stored procedures or use inline SQL in a live application. However, it is also common for companies such as banks, hedge funds and insurance companies to do batch processing which is scheduled to occur after hours. These are not simple CRUD operations; we're talking about often transactional, specialised business logic. One example might be the calculation of daily compounded interest.
The process needs to be efficient and scalable due to the volume of records to be processed. By processing overnight, the batches can utilise resources that would not be available to it during the day.
It is no surprise to me that this kind of logic is often implemented in the back end using something like stored procedures in SQL Server, or their equivalent on other platforms. I would expect such an implementation to always be more efficient than inline code, even if that inline code were implemented as a service running on the database server (without network latency).
The T-SQL implementation benefits from compiled query execution plans and does not have to pass data between processes via a connection.
Am I wrong about this? I would like to hear from people with experience in this area. Anyone who believes a back-end implementation means writing inline code via cursors in stored procedures need not comment.
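To illustrate the kind of batch job I mean, here is a stripped-down sketch of the calling side. The procedure name, its parameter and the statement inside it are hypothetical; the point is only that the scheduler issues a single call and the engine does the per-row arithmetic with a compiled plan, with no rows travelling over the connection:

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.Date;
import java.sql.DriverManager;
import java.time.LocalDate;

public class NightlyInterestJob {

    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                 "jdbc:sqlserver://batch-server;databaseName=Banking;integratedSecurity=true");
             CallableStatement proc = con.prepareCall("{call dbo.usp_ApplyDailyInterest(?)}")) {

            proc.setDate(1, Date.valueOf(LocalDate.now()));
            proc.execute();

            // Inside the (hypothetical) procedure, one set-based statement such as
            //   UPDATE dbo.Accounts
            //   SET Balance = Balance * (1 + DailyRate)
            //   WHERE Status = 'Active';
            // replaces any row-by-row loop in application code.
        }
    }
}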
We recently put a new production database into use. The schema of this database is optimized for OLTP. We're also getting ready to implement a reporting server to be used for reporting purposes. I'm not convinced we should just blindly use the same schema for our reporting database as we do for our production database, and replicate data over.
For those of you that have dealt with having separate production and reporting databases, have you chosen to use the same database schema for your reporting database, or a schema that is more efficient for reporting; for example, perhaps something more denormalized?
Thanks for your thoughts on this.
There are really two sides to the story:
if you keep the schema identical, then updating the reporting database from production is a simple copy (or MERGE in SQL Server 2008) command - a sketch of such a refresh follows at the end of this answer. On the other hand, the reports might get a bit harder to write, and might not perform optimally
if you devise a separate reporting schema, you can optimize it for reporting needs - then the creation of new reports might be easier and faster, and the reports should perform better. BUT: The updating is going to be harder
So it really boils down to: are you going to create a lot of reports? If so: I'd recommend coming up with a specific reporting schema optimized for reports.
Or is the main pain point the upgrade? If you can define and implement that once (e.g. with SQL Server Integration Services), maybe that's not really going to be a big issue after all?
Typically, chances are that you'll be creating a lot of reports over time, so there's a good chance it might be beneficial in the long run to invest a bit upfront in a separate reporting schema and a data loading process (typically using SSIS), and then reap the benefit of having better-performing reports and faster report creation time.
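To make the "identical schema" option concrete, here is a rough sketch of the refresh mentioned in the first bullet. The MERGE is the interesting part and JDBC is only the carrier; the database, table and column names are assumptions for illustration (production and reporting reachable from the same connection, e.g. on one instance or via a linked server):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ReportingRefresh {

    public static void main(String[] args) throws Exception {
        // One MERGE per table keeps the "simple copy" promise: new rows are
        // inserted, changed rows updated, deleted rows removed.
        String merge =
            "MERGE reporting.dbo.Customers AS target " +
            "USING production.dbo.Customers AS source " +
            "   ON target.CustomerID = source.CustomerID " +
            "WHEN MATCHED THEN " +
            "   UPDATE SET target.Name = source.Name, target.City = source.City " +
            "WHEN NOT MATCHED BY TARGET THEN " +
            "   INSERT (CustomerID, Name, City) " +
            "   VALUES (source.CustomerID, source.Name, source.City) " +
            "WHEN NOT MATCHED BY SOURCE THEN DELETE;";

        try (Connection con = DriverManager.getConnection(
                 "jdbc:sqlserver://report-server;databaseName=reporting;integratedSecurity=true");
             Statement st = con.createStatement()) {
            st.executeUpdate(merge);
        }
    }
}

In practice this is exactly the step an SSIS package or replication would own; the sketch only shows why the identical-schema route keeps the load logic trivial.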
I think that the reporting database schema should be optimized for reporting - so you'll need an ETL process to load your data. In my experience, I quickly reached the point where the production schema did not fit my reporting needs.
If you are starting your reporting project, I would suggest that you design your reporting database for your reporting needs.
For serious reporting, you usually create a data warehouse (which is typically at least somewhat denormalized, and certain types of calculations are done when the data is refreshed, to save you from averaging the values of 1.3 million records when you run the report). This is for the kind of reporting that includes a lot of aggregate data.
If your reporting needs are not that great, a replicated database might work. It may also depend on how up to date you need the data to be, as data warehouses are typically updated once or twice a day, so the reporting data is often one day behind - OK for monthly and quarterly reports, not so good for seeing how many widgets have been ordered so far today.
The determinant of whether you need a data warehouse tends to be how long it would take to run the reports users need. This is why data warehouses pre-aggregate data when loading it. If your reports are running fine and you just want to get the reporting workload away from the input workload, a replicated database should do the trick. If you are trying to do math on all the records for the last ten years, you need a data warehouse.
You could do this in steps too. Do the replication now, to get reporting away from data input. That should be an immediate improvement (even if not as much as you want), then design and implement the datawarehouse (which can be a fairly long and involved project and which will take some time to get right).
It's easiest just to copy over.
You could add some views to that schema to simplify queries - to conceptually denormalize.
If you want to go the full Data Warehouse/Analysis Services route, it will be quite a bit of work. But it's very fast, takes up less space, and users seem to like it. If you're concerned about large amounts of data and response times, you should look into this.
If you have many many tables being joined, you might look into actually denormalizing the data. I'd do a test case just to see how much gain for pain you'll be getting.
Without going directly for the data warehouse solution you could always put together some views that rearrange data for better reporting access. This helps you in that you don't have to start a large warehouse project right away and could help scope out a warehouse project if you decide to go that way.
All the answers I've read here are good; I would just add that you can do this in stages, stopping as soon as your goals for performance and functionality are met:
Keep the schema identical - this just takes contention and load off the OLTP server
Keep the schema identical - but add new indexed views OR index base tables differently
Build a partial data-warehouse-style model (perhaps not keeping snapshot-style history, slowly changing dimensions or anything special not catered for in your normal database) from the copy-schema, in another schema or database on the same reporting server. The benefits of star-schema models for reporting are huge: flattened views for users, data dictionaries, etc. In this model, if your OLTP database loses changes (for instance customer name changes) due to overwrites, the data warehouse doesn't capture that information (often that's not so important if you stop at this stage). Effectively you are getting data-warehouse-style organization for "current" data only. The benefit of retaining the copy of the original schema on your reporting server at this point is that you can pull from the source data in its original SQL Server form, instead of some kind of intermediate form (like text files), without affecting production OLTP, and you can migrate data models gradually - some in stars, some in normal form - all without affecting production. At some point later, you might be able to drop all or part of the copy.
Build a full data-warehouse including slowly changing dimensions where all the data is captured from the source system.
I need a solution to pump data from Lotus Notes to SQL Server. Data will be transferred in 2 modes:
Archive data transfer
Current data transfer
Availability of the data in SQL Server is not critical; the data is used for reports. Reports could be created daily, weekly or monthly.
I am considering choosing one of these solutions: DECS or SSIS. Could you please give me some tips about the pros and cons of both technologies? If you suggest something else, it could also be taken into consideration.
DECS - Domino Enterprise Connection Services
SSIS - SQL Server Integration Services
I've personally used XML frequently to get data out of Lotus Notes in a way that can be read easily by other systems. I'd suggest you take a look and see if that fits your needs. You can create views that emit XML or use NotesAgents or Java Servlets, all of which can be accessed using HTTP.
SSIS is a terrific tool for complex ETL tasks. You can even write C# code if you need to. There are lots of pre-written data cleaning components already out there for you to download if you want. It can pretty much do anything you need to do. It does, however, have a fairly steep learning curve. SSIS comes free with SQL Server, so that is a plus. A couple of things I really like about SSIS are the ability to log errors and the way it handles configuration, so that moving the package from the dev environment to QA and Prod is easy once you have set it up.
We have also set up a meta data database to record a lot of information about our imports, such as the start and stop time, when the file was received, the number of records processed, types of errors, etc. This has really helped us in researching data issues and has helped us write some processes that are automatically stopped when the file exceeds the normal parameters by a set amount. This is handy if you normally receive a file with 2 million records and the file comes in one day with 1000 records. Much better than deleting 2,000,000 potential customer records because you got a bad file. We also now have the ability to do reporting on files that were received but not processed, or files that were expected but not received. This has tremendously improved our importing processes (we have hundreds of imports and exports in our system). If you are designing from scratch, you might want to take some time and think about what meta data you want to have and how it will help you over time.
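To give a flavour of what that meta data layer can look like, here is a rough sketch. The dbo.ImportLog table, its columns and the "half the recent average" threshold are all made up for the example, not our actual schema:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;
import java.time.Instant;

public class ImportMetadata {

    /** Log one import run and tell the caller whether the file looks normal enough to load. */
    public static boolean startImport(Connection con, String fileName, int recordCount)
            throws Exception {

        // Compare today's record count against the recent average before touching real data.
        int recentAvg = 0;
        try (PreparedStatement avg = con.prepareStatement(
                 "SELECT AVG(RecordCount) FROM dbo.ImportLog " +
                 "WHERE FileName = ? AND StartTime > DATEADD(day, -30, GETDATE())")) {
            avg.setString(1, fileName);
            try (ResultSet rs = avg.executeQuery()) {
                if (rs.next()) {
                    recentAvg = rs.getInt(1);
                }
            }
        }
        boolean suspicious = recentAvg > 0 && recordCount < recentAvg / 2;

        try (PreparedStatement log = con.prepareStatement(
                 "INSERT INTO dbo.ImportLog (FileName, ReceivedAt, StartTime, RecordCount, Status) " +
                 "VALUES (?, ?, ?, ?, ?)")) {
            Timestamp now = Timestamp.from(Instant.now());
            log.setString(1, fileName);
            log.setTimestamp(2, now);               // when the file arrived (simplified here)
            log.setTimestamp(3, now);               // when processing started
            log.setInt(4, recordCount);
            log.setString(5, suspicious ? "HELD - count below normal range" : "STARTED");
            log.executeUpdate();
        }
        return !suspicious;                         // caller only proceeds when the file looks normal
    }
}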
Now, depending on your situation at work: if there is a possibility that data will also be sent to the SQL Server database from sources other than Lotus Notes, in addition to the Notes imports you are developing, I would suggest it might be worth your time to go ahead and start using SSIS, as that is how the other imports are likely to be done. As a database person, I would prefer to have all the imports I support using the same technology.
I can't say anything about DECS as I have never used it.
Just a thought - but as Lotus Notes tends to behave a bit "differently" than relational databases (or anything else), you might be safer going with a tool which comes out of the Notes world, rather than a tool from the SQL world.
(I have used DECS in the past (prior to Domino 8) and it has worked fine for pumping data out into a SQL Server database. I have not used SSIS).