Native query performance in Hibernate - database

It's obvious that an HQL query is slower than a native one. I have a project in mind that will need to perform a huge number of small transactions. So the question is which will perform better:
native query in jpa
native query through jdbc
How big is the difference? Because of JPA's mapping capabilities and prepared statements I would prefer it. But given the performance requirements, it could be that Hibernate will not be fast enough...
EDIT
replaced "Huge amount of data to process" to "huge amount of small transactions to perform"

You have given way too little info.
Huge amount of data to process
How huge is that? A few tens of millions of records, a few hundred gigabytes?
How do you plan to process that data? If you pull it to the Java side via Hibernate (or a native query, or JDBC), you are probably on the wrong track. You should keep the data in the database and process it there with the tools the database offers for that purpose; this is light-years more performant than any client-side processing. Consider database-side processing (PL/SQL in Oracle, Transact-SQL in MS SQL Server).
What are you planning to optimize? Data insertion? Data retrieval? Are you planning to use select statements that dig through lots of data? Then consider an OLAP solution over classic OLTP. OLAP solutions are built for business intelligence and for analyzing huge amounts of data with a few clever tricks. Google it (OLAP, decision cubes).
Can you use any of the capabilities of the underlying SQL engine? For example, if you are using Oracle, you have 1000 times more features available than you can actually use through Hibernate. For example, you simply cannot make an Oracle Text query in Hibernate, and there are really lots of things you cannot do.
I would sum this up by saying that the real performance difference is not between native SQL and HQL. Instead:
Hibernate is powerful at what it was built for: handling a small number of records at a time, optimistically caching them locally on the Java side, as groundwork for data-processing systems built on databases that are not themselves capable of data processing (but only for simple selects, inserts, updates, and deletes).
Once you really have to move huge amounts of data, Java-side processing is not an option. Programmers in the '80s invented stored procedures for exactly this reason. Pick a database that supports database-side processing to cut out all the network round trips and the imperative, for-loop data processing in your Java code. Prefer as much declarative SQL as you can, and run the processing on your database.
Once you start using the features of your database, Hibernate will be pretty much in your way. It is built as an ORM wrapper, but data-processing problems are not always CRUD, ORM-able problems. For example, Hibernate is not very useful for an OLAP use case.
Huge amount of small transactions
In the case of lots of small transactions (inserting/updating data in the database), Hibernate has no performance advantage (since no Java-side caching can be utilized), regardless of the database in use. However, you may still prefer Hibernate since it is a nice tool for converting Java objects into SQL statements.
But with Hibernate you are not at all in control of what is happening. For example, Hibernate + Oracle, inserting new entities into the database: this is the worst performance nightmare you can imagine.
This is what Hibernate does:
select one new id from a sequence
execute one insert
repeat a zillion times
This does not perform well at all. (The sequence reference should be part of the insert statement, and the inserts should use JDBC batching.)
I found that a JDBC-based approach runs about 1000 times faster than Hibernate in this particular use case (prefetch the next 100 sequence values, use JDBC batch mode for Oracle, bind variables to the batch, send the records down in batches of 100, and use asynchronous commit, which is yet again something you cannot control in Hibernate).
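To make that concrete, here is a minimal sketch of the batched JDBC approach, assuming an Oracle-style sequence and a hypothetical EVENTS(ID, PAYLOAD) table; sequence caching and asynchronous commit are database-side settings and are not shown here:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.List;

    public class BatchInsertSketch {
        // url/user/password, the EVENTS table and the EVENT_SEQ sequence are hypothetical.
        static void insertAll(String url, String user, String password, List<String> payloads) throws Exception {
            String sql = "INSERT INTO events (id, payload) VALUES (event_seq.NEXTVAL, ?)";
            try (Connection conn = DriverManager.getConnection(url, user, password);
                 PreparedStatement ps = conn.prepareStatement(sql)) {
                conn.setAutoCommit(false);
                int count = 0;
                for (String payload : payloads) {
                    ps.setString(1, payload);      // bind variable: the statement is parsed only once
                    ps.addBatch();
                    if (++count % 100 == 0) {
                        ps.executeBatch();         // one network round trip per 100 rows
                    }
                }
                ps.executeBatch();                 // flush the remainder
                conn.commit();
            }
        }
    }

On the database side, defining the sequence with a generous cache value approximates the sequence prefetching mentioned above, and asynchronous commit is likewise configured outside of this code.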
I have found that if I want to squeeze the most out of my tools, I need to learn them in depth. Unfortunately you can find lots of opinion-based comments, especially in the Hibernate vs. non-Hibernate wars on the net. Most of them are written by Java developers who have literally no idea what happens behind the scenes. So don't believe, measure :)

Related

Extract & transform data from Sql Server to MongoDB periodically

I have a Sql Server database which is used to store data coming from a lot of different sources (writers).
I need to provide users with some aggregated data; however, in SQL Server this data is stored in several different tables and querying it is too slow (a join of 5 tables with several million rows in each, one-to-many).
I'm currently thinking that the best way is to extract data, transform it and store it in a separate database (let's say MongoDB, since it will be used only for read).
I don't need the data to be live, just not older than 24 hours compared to the 'master' database.
But what's the best way to achieve this? Can you recommend any tools for it (preferably free) or is it better to write your own piece of software and schedule it to run periodically?
I recommend resisting the NIH (not-invented-here) urge here: reading and transforming data is a well-understood exercise. There are several free ETL tools available, with different approaches and focus. Pentaho (formerly Kettle) and Talend are UI-based examples. There are other ETL frameworks, like Rhino ETL, that merely hand you a set of tools to write your transformations in code. Which one you prefer depends on your knowledge and, unsurprisingly, your preference. If you are not a developer, I suggest using one of the UI-based tools. I have used Pentaho ETL in a number of smaller data-warehousing scenarios; it can be scheduled using operating-system tools (cron on Linux, Task Scheduler on Windows). More complex scenarios can make use of the Pentaho PDI repository server, which allows central storage and scheduling of your jobs and transformations. It has connectors for several database types, including MS SQL Server. I haven't used Talend myself, but I've heard good things about it and it should be on your list too.
The main advantage of sticking with a standard tool is that once your demands grow, you'll already have the tools to deal with them. You may be able to solve your current problem with a small script that executes a complex select and inserts the results into your target database. But experience shows those demands seldom stay the same for long, and once you have to incorporate additional databases or maybe even some information in text files, your scripts become less and less maintainable, until you finally give in and redo your work in a standard toolset designed for the job.
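If you do start with the small-script route (keeping the maintainability warning above in mind), a minimal sketch could look like the following, using plain JDBC and the official MongoDB Java driver; the connection strings, query, and collection names are hypothetical placeholders:

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.ArrayList;
    import java.util.List;

    public class NightlyExtract {
        public static void main(String[] args) throws Exception {
            try (Connection sqlServer = DriverManager.getConnection(
                     "jdbc:sqlserver://source;databaseName=app", "user", "pass");
                 MongoClient mongo = MongoClients.create("mongodb://target:27017")) {

                MongoCollection<Document> out =
                    mongo.getDatabase("reporting").getCollection("aggregates");

                // Run the expensive aggregation once, off-peak, instead of per user request.
                List<Document> batch = new ArrayList<>();
                try (Statement st = sqlServer.createStatement();
                     ResultSet rs = st.executeQuery(
                         "SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id")) {
                    while (rs.next()) {
                        batch.add(new Document("customerId", rs.getLong("customer_id"))
                                      .append("total", rs.getDouble("total")));
                    }
                }

                out.drop();                        // rebuild the read-only copy on every run
                if (!batch.isEmpty()) {
                    out.insertMany(batch);
                }
            }
        }
    }

Scheduled with cron or the Windows Task Scheduler, something like this keeps the copy within the 24-hour freshness window, but as noted above it tends to become unmaintainable once more sources are added.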

Architecting a high performing "inserting solution"

I am tasked with putting together a solution that can handle a high volume of inserts into a database. There will be many AJAX-type calls from web pages; it is not only one web site/page, but several different ones.
It will be dealing with tracking people's behavior on a web site, triggered by various javascript events, etc.
It is important for the solution to be able to handle the heavy database inserting load.
After it has been inserted, I don't mind migrating the data to an alternative/supplementary data store.
We are initially looking at using the MEAN stack with MongoDB and migrating some data to MySQL for reporting purposes. I am also wondering about using some sort of queueing before the insert into the db, or caching such as memcached.
I didn't manage to find much help on this elsewhere. I did see this post, but it is now close to 5 years old, feels a bit outdated, and doesn't quite ask the same questions.
Your thoughts and comments are most appreciated. Thanks.
Why do you need a stack at all? Are you looking for a web-application to do the inserting? Or do you already have an application?
It's doubtful any caching layer will outrun your NoSQL database for inserts, but you should probably confirm that you even need a NoSQL database. MySQL has pretty solid raw insert performance, as long as your load can be handled on a single box. Most NoSQL solutions scale better horizontally. This is probably worth a read. But realistically, if you already have MySQL in-house, and you separate your reporting from your insert instances, you will probably be fine with MySQL.
Some initial theory
To optimize for a heavy insert workload, I suggest first understanding the main overheads involved in inserting data into a database. Once the various overheads are understood, all kinds of optimizations will come to you naturally. The bonus is that you will have more confidence in the solution, you will know more about databases, and you can apply these optimizations to multiple engines (MySQL, PostgreSQL, Oracle, etc.).
I will first make a non-exhaustive list of insertion overheads and then show simple solutions for avoiding them.
1. SQL query overhead: In order to communicate with a database you first need to create a network connection to the server, pass credentials, get the credentials verified, serialize the data and send it over the network, and so on.
And once the query is accepted, it needs to be parsed, its grammar validated, data types must be parsed and validated, the objects (tables, indexes, etc.) referenced by the query searched and access permissions are checked, etc. All of these steps (and I'm sure I forgot quite a few things here) represent significant overheads when inserting a single value. The overheads are so large that some databases, e.g. Oracle, have a SQL cache to avoid some of these overheads.
Solution: Reuse database connections, use prepared statements, and insert many rows per SQL statement (thousands to hundreds of thousands). A sketch combining this and the next two solutions appears after point 3.
2. Ensuring strong ACID guarantees: The ACID properties of a DB come at the cost of logging all logical and physical modifications to the database ahead of time, and they require complex synchronization techniques (fine-grained locking and/or snapshot isolation). The actual time required to deal with the ACID guarantees can be several orders of magnitude higher than the time it takes to actually copy a 200B row into a database page.
Solution: Disable undo/redo logging when you import data into a table. Alternatively, you could also (1) drop the isolation level to trade weaker ACID guarantees for lower overhead or (2) use asynchronous commit (a feature that allows the DB engine to complete an insert before the redo logs are properly hardened to disk).
3. Updating the physical design / database constraints: Inserting a value into a table usually requires updating multiple indexes, materialized views, and/or executing various triggers. These overheads can again easily dominate the insertion time.
Solution: You can consider dropping all secondary data structures (indexes, materialized views, triggers) for the duration of the insert/import. Once the bulk of the inserts is done, you can re-create them. For example, it is significantly faster to create an index from scratch than to populate it through individual insertions.
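As a rough illustration of how the three solutions combine, here is a hedged JDBC sketch; the table, index, and column names are hypothetical, and the SET synchronous_commit statement is PostgreSQL-specific (other engines expose their own asynchronous-commit or minimal-logging switches):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.Statement;
    import java.util.List;

    public class BulkLoadSketch {
        static void load(String url, String user, String password, List<long[]> rows) throws Exception {
            try (Connection conn = DriverManager.getConnection(url, user, password)) {
                conn.setAutoCommit(false);
                try (Statement ddl = conn.createStatement()) {
                    // Overhead 2: relax durability for this session (PostgreSQL-specific setting).
                    ddl.execute("SET synchronous_commit TO OFF");
                    // Overhead 3: drop secondary structures for the duration of the load.
                    ddl.execute("DROP INDEX IF EXISTS idx_events_user");
                }
                // Overhead 1: one connection, one prepared statement, many rows per round trip.
                try (PreparedStatement ps = conn.prepareStatement(
                        "INSERT INTO events (user_id, amount) VALUES (?, ?)")) {
                    int count = 0;
                    for (long[] row : rows) {
                        ps.setLong(1, row[0]);
                        ps.setLong(2, row[1]);
                        ps.addBatch();
                        if (++count % 10_000 == 0) {
                            ps.executeBatch();
                            conn.commit();         // commit in large chunks, never per row
                        }
                    }
                    ps.executeBatch();
                    conn.commit();
                }
                try (Statement ddl = conn.createStatement()) {
                    // Re-create the index once, from scratch, after the bulk insert.
                    ddl.execute("CREATE INDEX idx_events_user ON events (user_id)");
                    conn.commit();
                }
            }
        }
    }

Each overhead from the list above is addressed in one clearly marked place, and the engine-specific parts can be swapped out per database.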
In practice
Now let's see how we can apply these concepts to your particular design. The main issue I see in your case is that the insert requests are sent by many distributed clients, so there is little opportunity for bulk processing of the inserts.
You could consider adding a caching layer in front of whatever database engine you end up with. I don't think memcached is a good fit for such a caching layer; memcached is typically used to cache query results, not new insertions. I have personal experience with VoltDB and I definitely recommend it (I have no connection with the company). VoltDB is an in-memory, scale-out, relational DB optimized for transactional workloads that should give you orders of magnitude higher insert performance than MongoDB or MySQL. It is open source, but not all features are free, so I'm not sure whether you would need to pay for a license. If you cannot use VoltDB, you could look at the MEMORY engine for MySQL or other similar in-memory engines.
Another optimization you can consider is to use a different database for the analytics. Most likely, a database optimized for high data-ingest volume is quite bad at executing OLAP-style queries, and vice versa. Coming back to my recommendation, VoltDB is no exception and is also suboptimal at executing long analytical queries. The idea would be to create a background process that reads all new data in the frontend DB (i.e., the VoltDB cluster) and moves it in bulk to the backend DB for analytics (MongoDB or maybe something more efficient). You can then apply all the optimizations above to the bulk data movement, create a rich set of additional index structures to speed up data access, run your favourite analytical queries, and save the results as a new set of tables/materialized views for later access. The import/analysis process can be repeated continuously in the background.
Tables are usually designed with the implied assumption that queries will far outnumber DML of all sorts. So the table is optimized for queries with indexes and such. If you have a table where DML (particularly Inserts) will far outnumber queries, then you can go a long way just by eliminating any indexes, including a primary key. Keys and indexes can be added to the table(s) the data will be moved to and subsequently queried from.
Fronting your web application with a NoSQL table to handle the high insert rate then moving the data more or less at your leisure to a standard relational db for further processing is a good idea.

Hive vs SQL Server performance

1) I started using Hive about 2 months ago. I have the same task that I previously ran in SQL Server. I found that Hive is slow and takes more time to execute queries, while SQL Server executes them in minutes or even seconds.
After executing the task in Hive, when I cross-checked the results in both (SQL Server and Hive), I found some differences (not in all tables, but in some).
e.g.: I have one table with 2012 records; when I ran the task against the same table in Hive, I got 2007 records.
Why is this happening?
2) If I want to speed up execution in Hive, what should I do?
(Currently I am executing all of this on a single cluster only. If I decide to add more clusters, how many would I need to increase the performance?)
Please suggest a solution or some good practices so that I can do this properly.
Thanks.
Hive and SQL Server are not comparable in any way other than the similarity in the syntax of the query language.
While SQL Server is built to be able to respond in real time from a single machine, Hive is for processing large data sets that may span hundreds or thousands of machines.
Hive (via Hadoop) has a lot of overhead for starting up a job.
Hive and Hadoop will not cache data in memory like SQL Server does.
Hive has only recently added indexes, so most queries end up being a table scan.
If your dataset fits on a single computer, you probably want to stick with SQL Server and not Hive. Hive performance tuning is mostly Hadoop performance tuning, although depending on the types of queries you run there can be free performance gains from using the LazyBinarySerDe.
Hive does have some differences from regular SQL that may be affecting your query. Without more details I can't speculate as to why.
Ignore the "they aren't comparable in any way" comment. If it stores data, it is comparable to any other method of storing data.
But be aware that SQL Server, 13 years ago, already had 1000+ people being paid full-time to improve the product. While that doesn't "prove" anything, it does increase one's confidence that more work = more results.
More importantly, look for any non-trivial benchmark of an open-source and/or non-relational method of storing data vs. one of the mainstream relational databases. You won't find one. That says a lot to me. (Also, mainstream isn't necessary, since the current world's fastest data engine isn't even mainstream. But if that level is needed, look at ExoSol.)
If your need is to learn to work with the technology at your job and that technology is Hive, my recommendation is to find someone who is really focused on getting as much as possible out of Hive query performance. If there is a Hive query guru out there, find them. But if you need a lot more than what they can give you, you're using the wrong technology.
And if Hive isn't a requirement, I would avoid it and other technologies that lack a compelling business model to guarantee their survival past 5 years and to move them out of the niche category they currently exist in (currently 20 times less popular than any mainstream data engine - https://db-engines.com/en/ranking).

Many connections vs. big data queries

Hello, I am creating a Windows application that will be installed on 10 computers, all accessing the same database through Entity Framework.
I was wondering what's better:
Split the work into several smaller queries (i.e., load a contact and then attach the included navigation properties: DataContext.Contacts.Include("Phone")).
Load everything in one query rather than splitting it into individual queries.
You name it.
BTW, I have a query whose trace string produced over 500 lines of SQL. I'm having doubts; maybe I should trade user experience for performance, since performance is also a part of the user experience.
You could put your SQL in stored procedures and write your Entity Framework logic to use the procedures instead of generating the SQL and sending it over the wire.
As with everything database related, it depends. Things like the connection type (LAN vs WAN), how you handle caching, database load level, type of database load (writes vs reads) etc, can all make a difference.
But in general, whenever you can reduce the number of round trips to the database that's a good thing. And remember: you can have more than one result set after executing a single SqlCommand.
Load everything in one query rather than splitting it into individual queries.
This will normally be superior. You're usually better off writing chunkier queries than chatty ones. Fewer calls have less overhead: you need to obtain fewer connections, deal with less latency, etc.
Does the database server have to support other applications? For most business software applications, SQL Server won't even break a sweat servicing ten clients, particularly for basic entity lookups. It won't even really know you're there unless it's installed on a 486SX.

Swapping out databases?

It seems like the goal of a lot of ORM tools and custom data access layers (DAO pattern, etc.) is to abstract the database to the point where you could supposedly swap out the entire database system with minimal work.
Following the common DAL patterns is usually a good idea in code, but it seems like it would never be minimal work to swap out a database. (Cost, training, data migration, etc.)
Does anyone have any experience with swapping out one database for another in a large system, and dealing with the implications in code? Is it worth it to worry about abstracting the actual database from your code?
Question 1: Does anyone have any experience with swapping out one database for another in a large system, and dealing with the implications in code?
Yes, we tried it. Our customer uses a large MS Access-based Delphi client/server application. After about five years we considered switching to SQL Server. We analyzed the problem and concluded that swapping the database would be very costly and provide only a few advantages. The customer decided not to swap the database. The application is still running fine and the customer is still happy.
Note that:
MS Access is only being used for data storage and report generation.
The server application ensures that MS Access is only accessed on the server. Normal multi-user MS Access applications transfer large chunks of the Access database over the network, resulting in slow and unreliable database functionality. This is not the case for this application: Client <> Server <> MS Access. Only the server application communicates with the MS Access database; in fact, the server has exclusive access to it, and no other computer can open the MS Access database. Conclusion: MS Access is being used as a true RDBMS (Relational DataBase Management System). Please no flaming about MS Access being inferior and unstable; it has been running fine for more than 10 years.
The most important issues you will have to consider:
SQL statements (SELECT, UPDATE, DELETE, INSERT, CREATE TABLE): make sure they are compatible with the target SQL database. It's amazing how much RDBMSs differ in the details (date formats, number formats, search syntax, string formats, join syntax, create table syntax, stored procedures, user-defined functions, (auto) primary keys, etc.).
Report generation: Depending on your database you might be using a different reporting tool. Our customer has over 200 complex reports. Converting all these reports is very time consuming.
Performance: all RDBMSs perform differently in different environments. Performance optimizations are usually very much RDBMS-dependent.
Costs: the costs of tools, developers, server and user licenses varies greatly. It ranges from free to very expensive. Free does not mean cheap and expensive does not always equate to good. A cost/value comparison will have to be made.
Experience: making the best use of your RDBMS requires experience. If you have to develop for an "unknown" RDBMS your productivity will suffer.
Question 2: Is it worth it to worry about abstracting the actual database from your code?
Yes. In an ideal world, swapping a database would just be a matter of adjusting the connection string. In the real world this is not possible because all databases are different. They all have tables and SQL support, but the differences are in the details. If you can keep the databases' differences shielded behind an abstraction, please do so. Make a list of the databases you need to support, check the selected database systems for differences, and provide centralized code to handle those differences. Support one RDBMS and provide stubs for future support of other RDBMSs.
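As one possible, purely illustrative shape for that centralized code, a small dialect interface can gather the vendor-specific SQL fragments in one place; the names and the vendors chosen below are hypothetical:

    // Hypothetical sketch: vendor-specific SQL fragments live behind one interface,
    // and the rest of the application builds its statements only through it.
    interface SqlDialect {
        String concat(String exprA, String exprB);   // string concatenation syntax differs per vendor
        String currentTimestamp();                   // so do date/time functions
    }

    class SqlServerDialect implements SqlDialect {
        public String concat(String a, String b) { return a + " + " + b; }
        public String currentTimestamp()         { return "GETDATE()"; }
    }

    class OracleDialect implements SqlDialect {
        public String concat(String a, String b) { return a + " || " + b; }
        public String currentTimestamp()         { return "SYSTIMESTAMP"; }
    }

    // Stub for an RDBMS that might be supported later: it compiles, but fails loudly if used.
    class MySqlDialect implements SqlDialect {
        public String concat(String a, String b) { throw new UnsupportedOperationException("not supported yet"); }
        public String currentTimestamp()         { throw new UnsupportedOperationException("not supported yet"); }
    }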
I disagree that the purpose is to be able to swap out databases, and I think you are correct in showing some suspicion about ORMs leading towards that goal.
However, I would still use an ORM, as it abstracts away the details of data access. Isn't this the goal of object oriented programming? Keep your concerns separated.
I think the primary use case for database abstraction (via ORM tools) is to be able to ship a product that works with multiple database brands. I believe it's a rarer occurrence for a company to switch between database vendors, but that's still one of the use cases.
I've worked jobs where we started out using MySQL for monetary reasons (think a startup) and, once we started making money, wanted to switch to Oracle. We didn't end up making the switch, but it was nice to have the option.
Still, ORM tools are not completely leak-free abstractions, and I know our migration would still have been painful and costly. It totally depends on what you are building, but it has been my experience that, usually for performance reasons, you end up either working around your ORM solution or exploiting vendor-specific features at some point.
The only time I've seen a database switch was from HSQL during early development to Oracle as the project progressed. The ORM made this easy.
I often use the DAO pattern to swap out data services (from a database to a web service, or from a web service to a test stub).
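A minimal sketch of what that can look like in Java, with hypothetical names: callers depend only on the interface, so the JDBC-backed implementation and the test stub are interchangeable:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    // Hypothetical domain interface: calling code only ever sees this type.
    interface ContactDao {
        List<String> findNamesByCity(String city);
    }

    // Production implementation backed by a relational database through JDBC.
    class JdbcContactDao implements ContactDao {
        private final String url, user, password;

        JdbcContactDao(String url, String user, String password) {
            this.url = url;
            this.user = user;
            this.password = password;
        }

        @Override
        public List<String> findNamesByCity(String city) {
            List<String> names = new ArrayList<>();
            try (Connection conn = DriverManager.getConnection(url, user, password);
                 PreparedStatement ps = conn.prepareStatement(
                     "SELECT name FROM contacts WHERE city = ?")) {
                ps.setString(1, city);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        names.add(rs.getString("name"));
                    }
                }
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
            return names;
        }
    }

    // Test stub: swapping it in never touches the calling code.
    class StubContactDao implements ContactDao {
        @Override
        public List<String> findNamesByCity(String city) {
            return Arrays.asList("Alice", "Bob");
        }
    }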
For ORM, I don't think the goal is to enable you to switch databases; it is to shield you from the complexities of different database implementations and remove the need to worry about the fine details of translating between relational and object representations of your data.
Because someone smart has written an ORM that handles caching, only updates fields that have changed, groups updates, etc., I don't need to. And in the cases where I need something special, I can still drop down to SQL if I want.
