HP Vertica database in a concurrent query environment?

I tried using Vertica for a web application that handles about 3,600 simple database queries per second, but the performance turned out to be very low under concurrent queries. The machine is powerful: 128 GB of RAM and a 40-core CPU.
So I just want to know: is Vertica simply designed for OLAP and not suitable for OLTP applications?
Does anyone have hands-on experience using Vertica in an OLTP situation?
Everything I find on the web is about how powerful Vertica is for analytic queries.

Vertica was purpose-built for analytic workloads, not transactional ones. That said, there are narrow use cases where you can tune the environment to achieve higher concurrency, but that takes a larger cluster (not just a single machine), and probably still won't reach the number you've mentioned.
You should also be clear about whether you're looking for parallelism or concurrency.
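If you do experiment with it, most of that tuning happens through Vertica's resource pools. A minimal sketch, assuming a short-query workload (the pool and user names are made up; the parameters are standard resource-pool settings):

```sql
-- Hypothetical pool for many short, simple queries.
CREATE RESOURCE POOL web_lookups
    MEMORYSIZE '8G'            -- memory reserved for this pool
    PLANNEDCONCURRENCY 40      -- expected concurrent queries; sizes each query's memory budget
    MAXCONCURRENCY 200         -- hard cap on simultaneously executing queries
    EXECUTIONPARALLELISM 1;    -- single-threaded plans suit short point lookups

-- Route the application's database user to the pool.
ALTER USER web_app RESOURCE POOL web_lookups;
```

Even then, a column store still pays per-query planning and storage-access costs that a row-store OLTP engine avoids, so treat this as mitigation rather than a fix.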

Related

Does Snowflake have a workload management feature like Redshift?

In AWS Redshift we can manage query priority using WLM. Is there any such feature in Snowflake, or is it handled with a multi-warehouse strategy?
I think you've got the right idea that warehouses are typically the best approach to this problem in Snowflake.
If you have a high-priority query/process/account, it's entirely reasonable to give it a dedicated warehouse. That guarantees your query won't be competing for resources with queries on other warehouses.
You can also then size that warehouse appropriately. If it's a small query or a file-copy query, for example, and it's just really important that it runs right away, you can give it a dedicated Small/X-Small warehouse. If it's a big query that doesn't run very frequently, you can give it a larger warehouse. If you set it to auto-suspend, you won't even incur much extra cost for the extra dedicated compute.
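As a rough sketch of that setup (the warehouse name is a placeholder; the options are standard Snowflake DDL):

```sql
-- Dedicated warehouse for the high-priority workload.
CREATE WAREHOUSE IF NOT EXISTS priority_wh
    WITH WAREHOUSE_SIZE = 'XSMALL'   -- size to the workload, not the account
         AUTO_SUSPEND = 60           -- suspend after 60 seconds idle to limit cost
         AUTO_RESUME = TRUE
         INITIALLY_SUSPENDED = TRUE;

-- The important process then runs on its own compute:
USE WAREHOUSE priority_wh;
```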

Columnar Database Comparisons and DBA efforts

I'm trying to find a database solution, and I came across Infobright and Amazon Redshift as potential options. Both are columnar databases. Infobright has been around for quite some time, whereas Amazon Redshift is newer.
What is the DBA effort between Infobright and Amazon Redshift?
How accessible is Infobright (API, query interface, etc.) vs AWS?
Where do both sit in your system architecture? Do they operate as a layer on top of your traditional RDBMS?
What is the DevOps effort to set up Infobright and Redshift?
I'm leaning a bit more towards Redshift because my application is hosted on AWS and I thought this would create tangible benefits in the long-run since everything is in AWS. Thank you in advance!
Firstly, I'll admit that I work for Infobright. I've done significant research into Redshift, and I feel I can give an honest opinion. I just wrote up a comparison between the two technologies; it can be found here: https://www.infobright.com/wp-content/plugins/download-monitor/download.php?id=37
DBA effort: Infobright requires very little administration. You cannot create indexes, and you don't need to partition, etc. It's an SMP architecture and scales well, so you won't be dealing with multiple nodes. Redshift is also fairly simple, but you will need to maintain sort order and make sure ANALYZE is run often enough.
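For illustration, the Redshift side of that maintenance looks roughly like this (the table and columns are invented):

```sql
-- Distribution and sort keys are chosen at table creation time.
CREATE TABLE page_views (
    view_ts TIMESTAMP,
    user_id BIGINT,
    url     VARCHAR(2048)
)
DISTKEY (user_id)
SORTKEY (view_ts);

-- Routine upkeep after heavy loads:
VACUUM page_views;    -- re-sorts rows and reclaims space from deleted rows
ANALYZE page_views;   -- refreshes statistics for the query planner
```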
Infobright uses a MySQL shell, so any tool that can talk to MySQL can talk to Infobright; you get the same set of tools/interfaces/APIs for Infobright as you do with MySQL. Redshift does have a SQL interface and some API capabilities, but it requires that you load data directly from S3. Infobright loads from flat files and named pipes, from local or remote servers.
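To make that loading difference concrete (bucket, role, file, and table names are placeholders):

```sql
-- Redshift loads from S3:
COPY page_views
FROM 's3://my-bucket/page_views/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-load'
CSV GZIP;

-- Infobright, through its MySQL shell, loads from a local flat file or named pipe:
LOAD DATA INFILE '/data/page_views.csv'
INTO TABLE page_views
FIELDS TERMINATED BY ',';
```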
Both databases are analytic databases. You would not want to use either as a transactional database. Instead, you typically push data from your transactional system to your analytic database.
The DevOps effort to set up Infobright will be lower than for Redshift. However, Redshift is not overly complicated either; maintenance of the environment is more of an ongoing requirement for Redshift, though.
Infobright does have many installations on AWS. In fact, we have implementations that approach nearly 100 TB of raw storage on one server. That said, Redshift with many nodes can reach petabyte scale in a single implementation.
There are other factors that can impact your choice. For example, Redshift has very nice failover/HA options built in. On the flip side, Infobright can support many concurrent queries and users; Redshift limits concurrent queries to 15 regardless of cluster size.
Take a look at the document, and feel free to contact me if you have any specific questions about either technology.

Hive vs SQL Server performance

1) I started using Hive about two months ago. I have the same task that I previously ran in SQL Server. I found that Hive is slow and takes longer to execute queries, while SQL Server executes them in a few minutes or even seconds.
After running the task in Hive, when I cross-check the results in both (SQL Server and Hive), I find some differences (not in all tables, but in some).
For example: I have one table with 2,012 records; when I ran the task against the same table in Hive, I got 2,007 records.
Why is this happening?
2) If I want to speed up execution in Hive, what should I do?
(Currently I am running all of this on a single cluster only. If I increase the cluster size, how many nodes would I need to improve performance?)
Please suggest some solutions or good practices so that I can approach this properly.
Thanks.
Hive and SQL Server are not comparable in any way other than the similarity in the syntax of the query language.
While SQL Server is built to respond in real time from a single machine, Hive is for processing large data sets that may span hundreds or thousands of machines.
Hive (via Hadoop) has a lot of overhead for starting up a job.
Hive and Hadoop will not cache data in memory the way SQL Server does.
Hive has only recently added indexes, so most queries end up being full table scans.
If your dataset fits on a single computer, you probably want to stick with SQL Server rather than Hive. Hive performance tuning is mostly a matter of Hadoop performance tuning, although depending on the types of queries you run there can be free performance gains from using the LazyBinarySerDe.
Hive does have some differences from regular SQL that may be affecting your results. Without more details I can't speculate as to why.
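On the tuning point: since most Hive queries end up as table scans, partitioning the data so that a query reads only the relevant slices is usually the first and biggest win. A hedged sketch (the table, columns, and date are made up):

```sql
-- Partitioned columnar table; each order_date value becomes its own directory in HDFS.
CREATE TABLE sales (
    order_id BIGINT,
    amount   DECIMAL(10,2)
)
PARTITIONED BY (order_date STRING)
STORED AS ORC;

-- A filter on the partition column lets Hive skip every other partition entirely:
SELECT COUNT(*) FROM sales
WHERE order_date = '2014-06-01';
```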
Ignore the "they aren't comparable in any way" comment. If it stores data, it is comparable to any other method of storing data.
But be aware that SQL Server, 13 years ago, already had 1000+ people being paid full-time to improve the product. So while that doesn't "prove" anything, it does increase one's confidence that more work = more results.
More importantly, look for any non-trivial benchmark of an open source and/or non-relational method of storing data against one of the mainstream relational databases. You won't find them. That says a lot to me. (Also, mainstream isn't necessary, since the current world's fastest data engine isn't even mainstream. But if that level is needed, look at ExoSol.)
If your need is to learn to work with the technology at your job and that technology is Hive, my recommendation is to find someone who is really focused on getting the most out of Hive query performance. If there is a Hive query guru out there, find them. But if you need a lot more than what they can give you, you're using the wrong technology.
And if Hive isn't a requirement, I would avoid it and other technologies lacking the compelling business model that will guarantee their survival past 5 years and move them out of the niche category they currently occupy (currently about 20 times less popular than any mainstream data engine - https://db-engines.com/en/ranking).

Best database engine for huge datasets

I do data mining and my work involves loading and unloading 1 GB+ database dump files into and out of MySQL. I am wondering: is there any other free database engine that works better than MySQL on huge databases? Is PostgreSQL better in terms of performance?
I only use basic SQL commands, so speed is the only factor in my choice of database.
It is unlikely that substituting a different database engine will provide a huge increase in performance. The slowdown you mention is more likely related to your schema design and data access patterns. Maybe you could provide some more information about that? For example, is the data stored as a time series? Are records written once sequentially, or inserted/updated/deleted arbitrarily?
As long as you drop indexes before inserting huge amounts of data, there should not be much difference between the two.
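The pattern looks like this in either engine (MySQL syntax shown; the table, index, and file names are placeholders):

```sql
-- Drop secondary indexes so the load only has to write the base table.
DROP INDEX idx_events_user ON events;

-- Bulk-load the dump file.
LOAD DATA INFILE '/data/events.csv'
INTO TABLE events
FIELDS TERMINATED BY ',';

-- Rebuild the index once, after the data is in.
CREATE INDEX idx_events_user ON events (user_id);
```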
HDF is the storage choice of NASA's Earth Observing System, for instance. It's not exactly a database in the traditional sense, and it has its own quirks, but in terms of pure performance it's hard to beat.
If your data mining tool supports it, consider working from flat-file sources. This should save you most of your import/export operations. It does have some caveats, though:
You may need to get proficient with a scripting language like Perl or Python to do the data munging (assuming you're not already familiar with one).
You may need to expand the memory on your computer or go to a 64-bit platform if you need more memory.
Your data mining tool might not support working from flat data files in this manner, in which case you're buggered.
Modern disks - even SATA ones - will pull 100MB/sec or so off the disk in sequential reads. This means that something could inhale a 1GB file fairly quickly.
Alternatively, you could try getting SSDs on your machine and see if that improves the performance of your DBMS.
If you are doing data mining, perhaps you could use a document-oriented database.
These can be faster than relational databases if you do not need SQL.
MongoDB and CouchDB are both good options. I prefer MongoDB because I don't know Java, and found CouchDB easier to get up and running.
Here are some articles on the topic:
Why we migrated from MySQL to MongoDB
MySQL vs. CouchDB vs. MongoDB
I am using PostgreSQL in my current project and also have to dump/restore databases pretty often. It takes less than 20 minutes to restore a 400 MB compressed dump.
You may give it a try, although some server configuration parameters need to be tweaked to match your hardware. These parameters include, but are not limited to, the following (an illustrative sketch follows the list):
shared_buffers
work_mem
temp_buffers
maintenance_work_mem
commit_delay
effective_cache_size
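On recent PostgreSQL releases these can be set with ALTER SYSTEM (older versions edit postgresql.conf directly). The values below are only illustrative starting points and depend entirely on your RAM and workload:

```sql
ALTER SYSTEM SET shared_buffers = '2GB';          -- a common rule of thumb is roughly 25% of RAM
ALTER SYSTEM SET work_mem = '64MB';               -- applies per sort/hash node, per query, so keep it modest
ALTER SYSTEM SET maintenance_work_mem = '512MB';  -- speeds up index builds and VACUUM
ALTER SYSTEM SET effective_cache_size = '6GB';    -- planner hint about the OS cache, not an allocation

-- Reload to apply (shared_buffers needs a full restart):
SELECT pg_reload_conf();
```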
Your question is too ambiguous to answer usefully. "Performance" means many different things to different people. I can comment on how MySQL and PostgreSQL compare in a few areas that might be important, but without more information it's hard to say which of these actually matter to you. I've written up a lot more background on this topic in Why PostgreSQL Instead of MySQL: Comparing Reliability and Speed. Which is faster certainly depends on what you're doing.
Is the problem that loading data into the database is too slow? That's one area PostgreSQL doesn't do particularly well at; the COPY command in Postgres is not a particularly fast bulk-loading mechanism.
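For reference, that bulk-load path looks like this (the file path and table are placeholders); the usual mitigations are loading inside a single transaction and building indexes afterwards:

```sql
BEGIN;
TRUNCATE raw_events;   -- truncating in the same transaction can enable a faster load path
COPY raw_events FROM '/data/events.csv' WITH (FORMAT csv, HEADER true);
COMMIT;
```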
Is the problem that queries run too slowly? If so, how complicated are they? On complicated queries, the PostgreSQL optimizer can do a better job than the one in MySQL, particularly if there are many table joins involved. Small, simple queries tend to run faster in MySQL because it doesn't do as much thinking about how to execute the query before beginning; smarter execution costs a bit of overhead.
How many clients are involved? MySQL can do a good job with a small number of clients; at higher client counts, the locking mechanism in PostgreSQL might do a better job.
Do you care about transactional integrity? If not, it's easier to turn more of those features off in MySQL, which gives it a significant speed advantage compared to PostgreSQL.
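Concretely, the durability trade-offs meant here look like this in MySQL (these are real server variables, but the values are only examples; use with care):

```sql
-- Flush the InnoDB log to disk roughly once per second instead of at every commit.
SET GLOBAL innodb_flush_log_at_trx_commit = 2;

-- Let the OS decide when to sync the binary log.
SET GLOBAL sync_binlog = 0;
```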

Scaling a postgres server to multiple servers

Our Postgres server is about to hit its capacity and we're looking into adding a second database server. Are there any scaling solutions that are particularly good for a Postgres setup?
You are looking at a limited set of choices, very dependent on what your specific requirements are (read-to-write ratios and how tolerant your application is of occasional inconsistent reads [synchronous vs. asynchronous replication? master-slave vs. multi-master?], how strongly connected your tables are [clustering], etc.)
http://www.postgresql.org/download/products/3
http://pgpool.projects.postgresql.org/
http://www.slony.info/
UPDATE
Over six years have elapsed since the original answer. Please refer to the High Availability, Load Balancing, and Replication chapter in the PostgreSQL documentation for the latest solutions available to you.
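As one example of the newer options, logical replication (PostgreSQL 10 and later) can be set up entirely in SQL; the names and connection string below are placeholders:

```sql
-- On the primary (wal_level must be set to 'logical'):
CREATE PUBLICATION app_pub FOR ALL TABLES;

-- On the replica (the schema must already exist there):
CREATE SUBSCRIPTION app_sub
    CONNECTION 'host=primary.example.com dbname=app user=replicator password=secret'
    PUBLICATION app_pub;
```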
Have you checked what your bottleneck is? Which queries make your server work hard? Maybe it can be tuned better.
If tuning does not help, it is often much easier to upgrade the server than to set up replication: add disks in RAID 1 or RAID 10, add RAM, more cores, and a faster processor. A good RAID controller with a battery-backed cache would make a big difference too.
Replication is good for high availability, but a bigger server will often be more cost-effective if you have performance problems.
Postgres Advanced Server and Continuent Tungsten are also worth looking into for an enterprise-class solution.
