Hive vs SQL Server performance

1) I started using Hive about two months ago. I have the same task that I previously ran in SQL Server, and I found that Hive is slow and takes much longer to execute queries, while SQL Server executes them in a few minutes or even seconds.
After executing the task in Hive, when I cross-check the results in both (SQL Server and Hive), I find some differences (not in all tables, but in some).
E.g.: I have one table with 2012 records; when I executed the task on the same table in Hive, I got 2007 records.
Why is this happening?
2) What should I do to speed up execution in Hive?
(Currently I am executing all of this on a single cluster only. If I scale out the cluster, how many nodes would I need to improve performance?)
Please suggest some solutions or good practices so I can do this well.
Thanks.

Hive and SQL Server are not comparable in any way other than the similarity in the syntax of the query language.
While SQL Server is built to respond in real time from a single machine, Hive is for processing large data sets that may span hundreds or thousands of machines.
Hive (via Hadoop) has a lot of overhead for starting up a job.
Hive and Hadoop will not cache data in memory the way SQL Server does.
Hive has only recently added indexes, so most queries end up as full table scans.
If your dataset fits on a single computer, you probably want to stick with SQL Server rather than Hive. Hive performance tuning is mostly Hadoop performance tuning, although depending on the types of queries you run there can be free performance in switching to the LazyBinarySerDe, as sketched below.
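For what it's worth, a minimal sketch of switching a table's SerDe over JDBC (everything here is a placeholder: it assumes a reachable HiveServer2, the Hive JDBC driver on the classpath, and a made-up table name):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveSerdeSwitch {
    public static void main(String[] args) throws Exception {
        // Placeholder host/database/credentials.
        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://hivehost:10000/default", "user", "");
             Statement st = con.createStatement()) {
            // Store a (hypothetical) table with the LazyBinarySerDe instead
            // of the default text SerDe; binary (de)serialization is cheaper
            // for data that gets scanned repeatedly.
            st.execute("ALTER TABLE my_table SET SERDE "
                     + "'org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe'");
        }
    }
}
```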
Hive does have some differences from regular SQL that may be affecting your query. Without more details I can't speculate as to why.

Ignore the "they aren't comparable in any way" comment. If it stores data, it is comparable to any other method of storing data.
But be aware that SQL Server, 13 years ago, had 1000+ people being paid full-time to improve the product. So while that doesn't "prove" anything, it does increase one's confidence that more work = more results.
More importantly, look for any non-trivial benchmark of an open-source and/or non-relational method of storing data against one of the mainstream relational databases. You won't find one. That says a lot to me. (Also, mainstream isn't necessary, since the current world's fastest data engine isn't even mainstream. But if that level is needed, look at Exasol.)
If your need is to learn to work with the technology at your job, and that technology is Hive, my recommendation is to find someone who is really focused on getting as much as possible out of Hive query performance. If there is a Hive query guru out there, find them. But if you need a lot more than what they can give you, you're using the wrong technology.
And if Hive isn't a requirement, I would avoid it and other technologies lacking the compelling business model that would guarantee their survival past 5 years and move them out of the niche category they currently occupy (currently 20 times less popular than any mainstream data engine: https://db-engines.com/en/ranking).

Related

Native query performance in Hibernate

It's obvious that an HQL query is slower than a native one. There is a project in mind that will have to perform a huge number of small transactions. So the question is which will perform better:
a native query in JPA
a native query through JDBC
How big is the difference? Because of JPA's mapping capabilities and prepared statements I'd prefer it, but given the performance requirements it could be that Hibernate will not be fast enough...
EDIT
Replaced "huge amount of data to process" with "huge amount of small transactions to perform".
You have given way too little info.
Huge amount of data to process
How huge is that? Tens of millions of records? A few hundred gigabytes?
How do you plan to process that data? If you pull it to the Java side via Hibernate (or a native query, or JDBC), then you are probably on the wrong track. Keep the data in the database and process it there with the tools the database offers for that; this is light-years more performant than any client-side processing. Consider database-side processing (Oracle's PL/SQL, MS SQL Server's Transact-SQL).
What are you planning to optimize? Data insertion? Data retrieval? Are you planning to use select statements that dig through lots of data? Then consider an OLAP solution over classic OLTP. OLAP solutions are built for business intelligence and analyzing huge amounts of data with a few clever tricks. Google for it (OLAP, decision cubes).
Can you use any of the capabilities of the underlying SQL engine? For example, if you are using Oracle, you have a thousand times more features available than you can actually use from Hibernate. For example, you simply cannot issue an Oracle Text query from Hibernate, and there are really a lot of things you cannot do.
I could sum this up by saying that the real performance difference is not between using native SQL and HQL. Instead:
Hibernate is powerful at what it was built for: handling a small number of records, optimistically caching them on the Java side, as the groundwork of data-processing systems built on databases that are not capable of data processing (but only of simple selects, inserts, updates, and deletes).
Once you really have to move a huge amount of data, Java-side processing is not an option. Programmers in the '80s invented stored procedures for exactly this reason. Pick a database that supports database-side processing to cut out all the network round trips and the imperative for-loop processing in your Java code. Prefer as much declarative SQL as you can, and run the processing in your database.
Once you are about to start using the features of your database, Hibernate will be pretty much in your way. It is really built as an ORM wrapper, but data processing problems are not always CRUD, ORM-able problems. For example, Hibernate is not much use for an OLAP use case.
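To make the "run it in the database" point concrete, here is a minimal sketch: one round trip invoking a made-up stored procedure instead of pulling the rows into Java. The connection details and the procedure are placeholders; the procedure itself would be written in PL/SQL (or Transact-SQL) on the server:

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;

public class ServerSideProcessing {
    public static void main(String[] args) throws Exception {
        // Placeholder URL/credentials - the point is the single call below.
        try (Connection con = DriverManager.getConnection(
                 "jdbc:oracle:thin:@//dbhost:1521/ORCL", "scott", "tiger");
             // Hypothetical procedure that aggregates a whole month of data
             // entirely inside the database - no rows cross the network.
             CallableStatement cs = con.prepareCall(
                 "{call process_monthly_data(?)}")) {
            cs.setInt(1, 202401); // e.g. the period to process
            cs.execute();
        }
    }
}
```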
Huge amount of small transactions
In the case of lots of small transactions (inserting/updating data in the database), Hibernate does not have any performance advantage (since no Java-side caching can be utilized), regardless of the database in use. However, you may prefer Hibernate since it is a nice tool for converting Java objects to SQL statements.
But with Hibernate you are not at all in control of what is happening. For example, Hibernate + Oracle, inserting new entities into the database: this is the worst performance nightmare you can imagine.
This is what Hibernate does:
select a new id from a sequence
execute one insert
repeat a zillion times
This performs very poorly. (The sequence reference should be part of the insert statement, and the whole set of inserts should use JDBC batching.)
I found that a JDBC-based approach ran about 1000 times quicker than Hibernate in this particular use case: prefetch the next 100 sequence values, use JDBC batch mode for Oracle, bind variables to the batch, send down records in batches of 100, and use asynchronous commit (yet again something you cannot control in Hibernate).
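A minimal sketch of that pattern, with a made-up ORDERS table and ORDERS_SEQ sequence. Sequence prefetching and asynchronous commit are server-side settings, so they appear only as comments here:

```java
import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchInsert {
    public static void main(String[] args) throws Exception {
        // Placeholder URL/credentials.
        // Sequence prefetch (server side):   ALTER SEQUENCE orders_seq CACHE 100
        // Async commit (Oracle, server side): ALTER SESSION SET COMMIT_WRITE = 'BATCH,NOWAIT'
        try (Connection con = DriverManager.getConnection(
                 "jdbc:oracle:thin:@//dbhost:1521/ORCL", "scott", "tiger")) {
            con.setAutoCommit(false);
            // The sequence reference is part of the insert itself, so there
            // is no separate round trip just to fetch the next id.
            String sql = "INSERT INTO orders (id, customer_id, amount) "
                       + "VALUES (orders_seq.NEXTVAL, ?, ?)";
            try (PreparedStatement ps = con.prepareStatement(sql)) {
                for (int i = 0; i < 100_000; i++) {
                    ps.setLong(1, i % 500);                     // bind variables...
                    ps.setBigDecimal(2, BigDecimal.valueOf(i)); // ...into the batch
                    ps.addBatch();
                    if (i % 100 == 99) {
                        ps.executeBatch(); // one network round trip per 100 rows
                    }
                }
                ps.executeBatch(); // flush any remainder
            }
            con.commit();
        }
    }
}
```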
I found that if I want to squeeze the most out of my tools, I need to learn them in depth. Unfortunately you can find lots of opinion-based comments, especially in the Hibernate vs. non-Hibernate wars on the net, and most of them are written by Java developers who have literally no idea about what happens behind the scenes. So don't believe - measure :)

SQL Server Database Optimization Strategy

I am starting a new project using SQL Server for a medical office. Their current database (SQL Server 2008) has over 500,000 rows spread across 15+ tables. Currently they are complaining that their data-entry application is very slow at generating reports and inserting new data.
For my new system I was thinking of a two-tiered database approach: the primary SQL Server 2012 instance would contain only the most recent 3 months of rows, and a second SQL Server 2012 instance would maintain all the data for the system. This way, when users insert new data it goes into a much smaller database, and when they query recent data the query should execute much faster. This system will also have reporting, but I think the reports will have to be generated from the larger data set.
My questions are as follows:
Will a solution like this improve the overall performance of the database?
Are there any scalability concerns with this solution?
What is the best way to transfer that data between the two servers each night?
If my solution makes no sense please feel free to offer any other solutions.
Don't do this. Splitting your app into multiple databases will be a management nightmare. Plus, 500k records isn't that many, assuming that the records are of reasonable size.
Instead, go after the low-hanging fruit. Turn on logging and look at the access patterns. Which queries are slow? Figure out why. Do they lack indexes? Can the queries be simplified? Debug the problem.
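One cheap place to start the hunt: SQL Server (2005 and later) records "missing index" suggestions in DMVs as queries run. Here is a sketch of pulling the top candidates over JDBC; the connection string is a placeholder, and it assumes the Microsoft JDBC driver is on the classpath:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class MissingIndexes {
    public static void main(String[] args) throws Exception {
        String sql =
            "SELECT TOP 10 d.statement, d.equality_columns, " +
            "       d.inequality_columns, d.included_columns, " +
            "       s.user_seeks, s.avg_user_impact " +
            "FROM sys.dm_db_missing_index_details d " +
            "JOIN sys.dm_db_missing_index_groups g " +
            "  ON g.index_handle = d.index_handle " +
            "JOIN sys.dm_db_missing_index_group_stats s " +
            "  ON s.group_handle = g.index_group_handle " +
            "ORDER BY s.user_seeks * s.avg_user_impact DESC";
        try (Connection con = DriverManager.getConnection(
                 "jdbc:sqlserver://dbhost;databaseName=medoffice;"
                 + "user=sa;password=secret;encrypt=false");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                // Each row is an index SQL Server wished it had.
                System.out.printf("%s | eq=%s | seeks=%d | impact=%.1f%%%n",
                    rs.getString("statement"),
                    rs.getString("equality_columns"),
                    rs.getLong("user_seeks"),
                    rs.getDouble("avg_user_impact"));
            }
        }
    }
}
```

Don't create every suggestion blindly; weigh each candidate against the extra write cost it adds.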
Keep in mind that sometimes throwing hardware at the problem is the right solution. If you can solve the problem with an $800 server, do it. That's a lot cheaper than your time.
To chime in: 500K records is not so big. You ought to be able to make the db work very speedily as is with some tuning.

Best database for multi million row store/query

We have a database that has been growing for around 5 years. The main table has nearly 100 columns and 700 million rows (and growing).
The common use case is to count how many rows match a given criterion, that is:
select count(*) from main_table where column1 = 'TypeA' and column2 = 'BlockC';
The other use case is to retrieve the rows that match the criterion.
The queries used to take a little while; now they take a couple of minutes.
I want to find a DBMS that makes both use cases as fast as possible.
I've been looking into some column-store databases and Apache Cassandra, but I still have no idea what the best option is. Any ideas?
Update: these days I'd recommend Hive 3 or PrestoDB for big data analysis
I am going to assume this is an analytic (historical) database with no current data. If not, you should consider separating your dbs.
You are going to want a few features to help speed up analysis:
Materialized views. This is essentially pre-calculating values and then storing the results for later analysis. MySQL and Postgres do not support this yet (it is coming in Postgres 9.3), but you can mimic it with triggers; see the sketch after this list.
Easy OLAP analysis. You could use the Mondrian OLAP server (Java); Excel doesn't talk to it easily, but JasperSoft and Pentaho do.
You might want to change the schema for easier OLAP analysis, i.e. a star schema. Good book:
http://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimensional/dp/0471200247/ref=pd_sim_b_1
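To illustrate the trigger trick against this question's count(*) use case, here is a sketch for Postgres (table and column names are made up to match the question; ON CONFLICT needs Postgres 9.5+, so on older versions you would write an update-then-insert instead):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CountSummarySetup {
    public static void main(String[] args) throws Exception {
        // Placeholder URL/credentials.
        try (Connection con = DriverManager.getConnection(
                 "jdbc:postgresql://dbhost/analytics", "user", "secret");
             Statement st = con.createStatement()) {
            // Pre-aggregated counts per (column1, column2) pair.
            st.execute("CREATE TABLE type_block_counts (" +
                       "  column1 text NOT NULL, column2 text NOT NULL, " +
                       "  n bigint NOT NULL DEFAULT 0, " +
                       "  PRIMARY KEY (column1, column2))");
            // Trigger function: bump the matching counter on every insert.
            st.execute("CREATE FUNCTION bump_count() RETURNS trigger AS $$ " +
                       "BEGIN " +
                       "  INSERT INTO type_block_counts (column1, column2, n) " +
                       "  VALUES (NEW.column1, NEW.column2, 1) " +
                       "  ON CONFLICT (column1, column2) " +
                       "  DO UPDATE SET n = type_block_counts.n + 1; " +
                       "  RETURN NEW; " +
                       "END; $$ LANGUAGE plpgsql");
            st.execute("CREATE TRIGGER maintain_counts " +
                       "AFTER INSERT ON main_table " +
                       "FOR EACH ROW EXECUTE PROCEDURE bump_count()");
        }
    }
}
```

After this, the multi-minute count(*) becomes a primary-key lookup (SELECT n FROM type_block_counts WHERE column1 = 'TypeA' AND column2 = 'BlockC'), at the cost of a little extra work on every insert.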
If you want open source, I'd go with Postgres (it doesn't choke on big queries like MySQL can), plus Mondrian, plus Pentaho.
If not open source, then best bang for buck is likely Microsoft SQL Server with Analysis Services.

Which database is the best for my situation?

I have a simple plan: one application creates records and inserts them into the database (2-10 records per second), and 3 or more applications (clients) connect to the DBMS server to rapidly query records (SELECT statements for search and filtered results).
I don't use the SQL commands DELETE or UPDATE. My data grows to up to 2 billion records!
With this specification, can someone suggest which database is best for my situation?
MySQL,
MS-SQL Server,
Oracle,
or any other?
MySQL is free and actively supported/maintained. I have used it with 10+ million rows in the main table of our application and it was fine. It also has some funky SQL extensions: you can do things with it that are simple and effective but that you can't do in other DBs (or that are extremely painful to do).
MS-SQL is IMHO not worth paying for; like most (all?) MS products, they start with a "small" mentality (departmental system size), then retrofit concurrency and scalability, and wonder why stuff doesn't work.
Oracle is obscenely expensive, but it's easy to get contract staff to help you (they follow the money) and it's a very mature, rock-solid product.
Another option is Postgres, which I have used a lot. I really like Postgres: it's free and very solid. There is one caveat: it only recently bundled real-time failover replication into its offering, something MySQL has had for many years.
I would go MySQL and see what happens.
Our experience with MySQL was not very good when dealing with a huge number of records (2 million records is huge). I feel MySQL would not be the best choice for such frequent inserts either.
MS-SQL is better, say my colleagues at work. But then your DBMS server will have to be Windows-based; it's your call whether that is a problem or not.
You cannot go wrong with Oracle. But it is very expensive.
You should have a separate database access layer in your application: define an interface for all your database operations. If you use .NET, Entity Framework will make your life easier. Once you do this, you will be able to experiment with different databases relatively easily, at least in theory; a sketch of the idea follows below.
Just my two cents.
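To make that last point concrete, a minimal sketch of such an interface in Java (the .NET/Entity Framework equivalent would look much the same; every type and method name here is made up):

```java
import java.util.List;

// Illustrative domain types - shape these to your real records (Java 16+).
record DataRecord(long id, long createdAt, String category, double value) {}
record SearchFilter(String category, long fromTs, long toTs) {}

// The single seam between the application and the DBMS. Implement it once
// per candidate database (MySQL, MS-SQL, Oracle, ...) and benchmark each
// implementation against the real workload before committing.
interface RecordStore {
    void insert(DataRecord r);               // the one writer: 2-10 records/s
    List<DataRecord> search(SearchFilter f); // the read-only client apps
}
```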
Edit: 2 Billion ??!
If you really need that many records, this is my one line answer: Hire an expert.

Database tuning advice

Possibly some of you don't even know about these features, so you will learn a lot from this post (which will in fact help me optimize better), and some of you probably use them on a daily basis, so you can help me and other less DBA-savvy users.
I'm using SQL Server 2005 Standard.
I run SQL Server Profiler a lot. Each time, I find ad hoc queries or stored procedures whose execution time exceeds my limits: 100 ms for complex queries and 30 ms for short ones (the numbers don't mean much; they are just to give a sense of scale). After I find potentially problematic queries, I write them down so I can run them through the Database Engine Tuning Advisor, which replays the offending queries and recommends the indexes I need to build to improve performance. Each night I run an index rebuild task from Maintenance Plans.
Now question time!!!
1. If the Database Engine Tuning Advisor gives me 10 indexes to create with an estimated improvement of about 40%, should I take its advice or not? A better question: what ratio of number-of-indexes to improvement percentage should I follow? Indexes take space and time to rebuild.
2. If I create about 5-7 indexes for each problematic query, I can end up with 500 indexes per DB. How many indexes can I build while the DB still performs normally? Are there any limitations?
3. Is there any other way to optimize (not redesign) your DB besides my method, or going stored procedure by stored procedure by hand and eye?
There's no right answer to this question as it depends heavily on your workload.
For workloads with a heavy ratio of reads (e.g. a data warehouse) it might make sense to create an index that would be positively counterproductive in an environment with a greater proportion of writes.
The DTA can help in this regard by assessing the impact on the overall workload, but you would need to try to capture a representative sample (not just the poorly performing queries). SQL Profiler is quite resource-intensive, so to do this with the least possible impact on your server you would need to use a server-side SQL trace with appropriate filters to only log events related to the database of interest.
To identify the poorest-performing queries in isolation: if you have at least the SQL 2005 SP1 client tools installed, you should be able to right-click the database node in Management Studio and use the Reports -> Standard Reports menu to see the plans in the cache with the highest CPU/IO.
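If you prefer a query to the GUI report, the same information lives in the plan-cache DMVs (SQL 2005 and later). A sketch over JDBC; connection details are placeholders, and it assumes the Microsoft JDBC driver:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TopQueries {
    public static void main(String[] args) throws Exception {
        String sql =
            "SELECT TOP 10 " +
            "  qs.total_worker_time / qs.execution_count AS avg_cpu, " +
            "  qs.total_logical_reads / qs.execution_count AS avg_reads, " +
            "  qs.execution_count, " +
            "  SUBSTRING(st.text, 1, 200) AS query_text " +
            "FROM sys.dm_exec_query_stats qs " +
            "CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st " +
            "ORDER BY avg_cpu DESC";
        try (Connection con = DriverManager.getConnection(
                 "jdbc:sqlserver://dbhost;databaseName=mydb;"
                 + "user=sa;password=secret;encrypt=false");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                // One line per cached plan, worst average CPU first.
                System.out.printf("cpu=%d reads=%d execs=%d | %s%n",
                    rs.getLong("avg_cpu"), rs.getLong("avg_reads"),
                    rs.getLong("execution_count"),
                    rs.getString("query_text"));
            }
        }
    }
}
```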
If you are interested in this area, I recommend the book SQL Server 2008 Query Performance Tuning Distilled (most of it is applicable to SQL 2005 as well).
You can get SQL Profiler to log to a table, so it will write the queries to a table you specify. If you can, leave it running for a few hours, or however long it takes to cover as many queries/events as possible.
Next, use the Database Engine Tuning Advisor and get it to use this table of queries as its source input. You will find it looks at the whole pattern, and it will recommend you create some indices and remove others.
This is better than looking at queries one by one in isolation, although that still has its place.
