Processing large amounts of data quickly - database

I'm working on a web application where the user provides parameters, and these are used to produce a list of the top 1000 items from a database of up to 20 million rows. I need all top 1000 items at once, and I need this ranking to happen more or less instantaneously from the perspective of the user.
Currently, I'm using a MySQL with a user-defined function to score and rank the data, then PHP takes it from there. Tested on a database of 1M rows, this takes about 8 seconds, but I need performance around 2 seconds, even for a database of up to 20M rows. Preferably, this number should be lower still, so that decent throughput is guaranteed for up to 50 simultaneous users.
I am open to any process with any software that can process this data as efficiently as possible, whether it is MySQL or not. Here are the features and constraints of the process:
The data for each row that is relevant to the scoring process is about 50 bytes per item.
Inserts and updates to the DB are negligible.
Each score is independent of the others, so scores can be computed in parallel.
Due to the large number of parameters and parameter values, the scores cannot be pre-computed.
The method should scale well for multiple simultaneous users
The fewer computing resources this requires, in terms of number of servers, the better.
Thanks

A feasible approach seems to be to load (and later update) all data into about 1GB RAM and perform the scoring and ranking outside MySQL in a language like C++. That should be faster than MySQL.
The scoring must be relatively simple for this approache because your requirements only leave a tenth of a microsecond per row for scoring and ranking without parallelization or optimization.

If you could post query you are having issue with can help.
Although here are some things.
Make sure you have indexes created on database.
Make sure to use optimized queries and using joins instead of inner queries.

Based on your criteria, the possibility of improving performance would depend on whether or not you can use the input criteria to pre-filter the number of rows for which you need to calculate scores. I.e. if one of the user-provided parameters automatically disqualifies a large fraction of the rows, then applying that filtering first would improve performance. If none of the parameters have that characteristic, then you may need either much more hardware or a database with higher performance.

I'd say for this sort of problem, if you've done all the obvious software optimizations (and we can't know that, since you haven't mentioned anything about your software approaches), you should try for some serious hardware optimization. Max out the memory on your SQL servers, and try to fit your tables into memory where possible. Use an SSD for your table / index storage, for speedy deserialization. If you're clustered, crank up the networking to the highest feasible network speeds.

Related

Snowflake multi-cluster warehouse performance vs single warehouse with large warehouse size

I am very new to Snowflake and while working with snowflake I had conflict between the below 2 options.
Single warehouse with size X-Large (16 credits / hour)
Multi-cluster (with max clusters=2 & min clusters=2) with size Large (8 credits / hour)
Considering the above 2 options
Is there any advantage I can get by choosing 2nd option in terms of performance?
Note: I know the advantages of multi-cluster over a single warehouse. Please share your answer for this specific scenario (when min = max).
So the things that happen in running a query are.
belong I am going to just use single to mean the single instance and 'multi` to mean the multi instance cluster, of which when we run a query it is only ever on one instance.
Reading\Writing IO from your storage layer:
Here a single has twice the IO over the multi thus if your query is IO saturated the single is the better choice.
Parallel steps:
So if you have a GROUP BY over a high cardinality columns, both the single and multi should be equally good. If you have a low cardinality but billions of rows, the smaller instance might give better results as those complex steps cannot be broken over the larger cluster size of the single instance. But this is most likely lost in the wash if you have many concurrent queries.
Many queries / Noisy neighbour:
If you have hundreds of queries hitting in waves the larger single instance is worse at starting those queries, as it just has less concurrent tasks at once, and a single very large query which can flush caches, or just dominate cluster, means you stop handling normal/small queries. Where-as having the mutli cluster allow if only one "super heavy" query comes in, you only stall half your normal queries.
Other thoughts
It also really depends on your load patterns, at my last job, we had auto-scaling cluster of SMALL instances used to used to answer our read queries of dashboards, reports, and we allowed it to run a little over provisioned, so things were snappy.
Where-as our data loading ran on second auto-scaling cluster of MEDIUM instances, and which we overloaded on purpose, as we were trying to load data the fastest/cheapest. And in non-peak times we programmatically reduced the auto-scalling MAX to almost starve the load. But would do some expensive reprocessing on a LARGE instance via those saved credits in "the middle of the night" and also our loading tasks had the ability to spin up exclusive LARGE+ size warehouses to do one off rebuilds, as this was all IO bound work, and thus the smaller the window of "outage" the better the system was, and the IO scale linear, so the total cost was the same.
Which is all to say, "what is best" really depends on what you are doing, your budget, and the trade offs you are prepared for. The golden thing about snowflake it is not like a classic DB where you have to pick the size and get it right, pick one, and watch it, if it's struggling change it on the fly. We did this a number of times when a release of our code or snowflake changed the performance of some critical SQL, we would jump in, and double or triple the instance count, or size, to get past the situation, while trying to fix or work around SF issues, or awaiting SF to roll a release back. for a couple hours generally spending more credits is not budget braking. This flexibility also means you can just experiment, "what happens if we trying 4x smaller instance.." "oh nothing... look we just saved heaps of money"..
If you have min=max=2 then you permanently have 2 warehouses running (as long as they are not suspended). If you configure your multi-cluster warehouse like this then you lose a lot of the advantages but for your specific use case it might make sense, I suppose
Based on your comment, here is my answer:
In both scenarios, you will have the same resources to process your queries. The important difference would be about running single heavy queries. As you may know, a single query can not spawn to multiple clusters (yet), so when you run a query in your multi-cluster warehouse, it will be processed on one of the Large warehouses (and use max 8 nodes).
If you run the same query on your single XL warehouse, it can be executed by (max) 16 nodes. So if you will run heavy queries which requires more memory and CPU, using a single XL warehouse would be better for you.
About concurrency, there is a parameter named "MAX_CONCURRENCY_LEVEL". Its default value is 8, and it limits maximum number of concurrent executions per warehouse. If you do not change it, your single XL warehouse will execute a maximum of 8 queries concurrently, while your multi-cluster warehouse can execute 16 queries concurrently.
https://docs.snowflake.com/en/sql-reference/parameters.html#max-concurrency-level
You may increase this parameter, and provide same concurrency on both single XL and multi-cluster L warehouse. But in this case, you should be careful when you runn heavy and light queries together. Because one query may use most of the resources of the warehouse, and your light queries may have less resources and take a longer time. So I would recommend using a multi-cluster warehouse, if you will have "relatively" light/concurrent queries.

Query optimization for hive table

We have a table which is of size 100TB and we have multiple customers using the same table (i.e every customer uses different where conditions). Now the problem statement is every time a customer tries to query the table it gets scanned from top to bottom.
This creates lot of slowness for all the queries. We cannot even partition/bucket the table basing on any business keys. Can someone provide solution or point to similar problem statements and their resolution.
you can provide your suggestions as well as alternative technologies so that we can pick the best suitable one. Thanks.
My 2 cents: experiment with an ORC table with GZip compression (default) and clever partitioning / ordering...
every SELECT that uses a partition key in its WHERE clause will
do "partition pruning" and thus avoid to scan everything [OK, OK, you said you had no good candidate in your specific case, but in general it can be done so I had to mention it first]
then within each ORC file in scope, the min/max counters will be
checked for "stripe pruning", limiting the I/O further
With clever partitioning & clever ordering of the data at INSERT time, using the most-frequent filters, the pruning can be quite efficient.
Then you can look into optimizations such as using a non-default ORC stripe size, a non-default "bytes-per-reducer" threshold, etc.
Reference:
http://fr.slideshare.net/oom65/orc-andvectorizationhadoopsummit
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC
https://streever.atlassian.net/wiki/display/HADOOP/Optimizing+ORC+Files+for+Query+Performance
http://thinkbig.teradata.com/hadoop-performance-tuning-orc-snappy-heres-youre-missing/
One last thing: with 15 nodes for running queries and a replication factor of 3, each HDFS block is available "locally" on 3 the nodes (20%) and "remotely" in the rest (80%). A higher replication factor may reduce I/O and network bottlenecks -- at the cost of disk space, of course.

Performance of 100M Row Table (Oracle 11g)

We are designing a table for ad-hoc analysis that will capture umpteen value fields over time for claims received. The table structure is essentially (pseudo-ish-code):
table_huge (
claim_key int not null,
valuation_date_key int not null,
value_1 some_number_type,
value_2 some_number_type,
[etc...],
constraint pk_huge primary key (claim_key, valuation_date_key)
);
All value fields all numeric. The requirements are: The table shall capture a minimum of 12 recent years (hopefully more) of incepted claims. Each claim shall have a valuation date for each month-end occurring between claim inception and the current date. Typical claim inception volumes range from 50k-100k per year.
Adding all this up I project a table with a row count on the order of 100 million, and could grow to as much as 500 million over years depending on the business's needs. The table will be rebuilt each month. Consumers will select only. Other than a monthly refresh, no updates, inserts or deletes will occur.
I am coming at this from the business (consumer) side, but I have an interest in mitigating the IT cost while preserving the analytical value of this table. We are not overwhelmingly concerned about quick returns from the Table, but will occasionally need to throw a couple dozen queries at it and get all results in a day or three.
For argument's sake, let's assume the technology stack is, I dunno, in the 80th percentile of modern hardware.
The questions I have are:
Is there a point at which the cost-to-benefit of indices becomes excessive, considering a low frequency of queries against high-volume tables?
Does the SO community have experience with +100M row tables and can
offer tips on how to manage?
Do I leave the database technology problem to IT to solve or should I
seriously consider curbing the business requirements (and why?)?
I know these are somewhat soft questions, and I hope readers appreciate this is not a proposition I can test before building.
Please let me know if any clarifications are needed. Thanks for reading!
First of all: Expect this to "just work" if leaving the tech problem to IT - especially if your budget allows for an "80% current" hardware level.
I do have experience with 200M+ rows in MySQL on entry-level and outdated hardware, and I was allways positivly suprised.
Some Hints:
On monthly refresh, load the table without non-primary indices, then create them. Search for the sweet point, how many index creations in parallell work best. In a project with much less date (ca. 10M) this reduced load time compared to the naive "create table, then load data" approach by 70%
Try to get a grip on the number and complexity of concurrent queries: This has influence on your hardware decisions (less concurrency=less IO, more CPU)
Assuming you have 20 numeric fields of 64 bits each, times 200M rows: If I can calculate correctly, ths is a payload of 32GB. Trade cheap disks against 64G RAM and never ever have an IO bottleneck.
Make sure, you set the tablespace to read only
You could consider anchor modeling approach to store changes only.
Considering that there are so many expected repeated rows, ~ 95% --
bringing row count from 100M to only 5M, removes most of your concerns.
At this point it is mostly cache consideration, if the whole table
can somehow fit into cache, things happen fairly fast.
For "low" data volumes, the following structure is slower to query than a plain table; at one point (as data volume grows) it becomes faster. That point depends on several factors, but it may be easy to test. Take a look at this white-paper about anchor modeling -- see graphs on page 10.
In terms of anchor-modeling, it is equivalent to
The modeling tool has automatic code generation, but it seems that it currenty fully supports only MS SQL server, though there is ORACLE in drop-down too. It can still be used as a code-helper.
In terms of supporting code, you will need (minimum)
Latest perspective view (auto-generated)
Point in time function (auto-generated)
Staging table from which this structure will be loaded (see tutorial for data-warehouse-loading)
Loading function, from staging table to the structure
Pruning functions for each attribute, to remove any repeating values
It is easy to create all this by following auto-generated-code patterns.
With no ongoing updates/inserts, an index NEVER has negative performance consequences, only positive (by MANY orders of magnitude for tables of this size).
More critically, the schema is seriously flawed. What you want is
Claim
claim_key
valuation_date
ClaimValue
claim_key (fk->Claim.claim_key)
value_key
value
This is much more space-efficient as it stores only the values you actually have, and does not require schema changes when the number of values for a single row exceeds the number of columns you have allocated.
Using partition concept & apply partition key on every query that you perform will save give the more performance improvements.
In our company we solved huge number of performance issues with the partition concept.
One more design solutions is if we know that the table is going to be very very big, try not to apply more constraints on the table & handle in the logic before u perform & don't have many columns on the table to avoid row chaining issues.

Column Stores: Comparing Column Based Databases

I've really been struggling to make SQL Server into something that, quite frankly, it will never be. I need a database engine for my analytical work. The DB needs to be fast and does NOT need all the logging and other overhead found in typical databases (SQL Server, Oracle, DB2, etc.)
Yesterday I listened to Michael Stonebraker speak at the Money:Tech conference and I kept thinking, "I'm not really crazy. There IS a better way!" He talks about using column stores instead of row oriented databases. I went to the Wikipedia page for column stores and I see a few open source projects (which I like) and a few commercial/open source projects (which I don't fully understand).
My question is this: In an applied analytical environment, how do the different column based DB's differ? How should I be thinking about them? Anyone have practical experience with multiple column based systems? Can I leverage my SQL experience with these DBs or am I going to have to learn a new language?
I am ultimately going to be pulling data into R for analysis.
EDIT: I was requested for some clarification in what exactly I am trying to do. So, here's an example of what I would like to do:
Create a table that has 4 million rows and 20 columns (5 dims, 15 facts). Create 5 aggregation tables that calculate max, min, and average for each of the facts. Join those 5 aggregations back to the starting table. Now calculate the percent deviation from mean, percent deviation of min, and percent deviation from max for each row and add it to the original table. This table data does not get new rows each day, it gets TOTALLY replaced and the process is repeated. Heaven forbid if the process must be stopped. And the logs... ohhhhh the logs! :)
The short answer is that for analytic data, a column store will tend to be faster, with less tuning required.
A row store, the traditional database architecture, is good at inserting small numbers of rows, updating rows in place, and querying small numbers of rows. In a row store, these operations can be done with one or two disk block I/Os.
Analytic databases typically load thousands of records at a time; sometimes, as in your case, they reload everything. They tend to be denormalized, so have a lot of columns. And at query time, they often read a high proportion of the rows in the table, but only a few of these columns. So, it makes sense from an I/O standpoint to store values of the same column together.
Turns out that this gives the database a huge opportunity to do value compression. For instance, if a string column has an average length of 20 bytes but has only 25 distinct values, the database can compress to about 5 bits per value. Column store databases can often operate without decompressing the data.
Often in computer science there is an I/O versus CPU time tradeoff, but in column stores the I/O improvements often improve locality of reference, reduce cache paging activity, and allow greater compression factors, so that CPU gains also.
Column store databases also tend to have other analytic-oriented features like bitmap indexes (yet another case where better organization allows better compression, reduces I/O, and allows algorithms that are more CPU-efficient), partitions, and materialized views.
The other factor is whether to use a massively parallel (MMP) database. There are MMP row-store and column-store databases. MMP databases can scale up to hundreds or thousands of nodes, and allow you to store humungous amounts of data, but sometimes have compromises like a weaker notion of transactions or a not-quite-SQL query language.
I'd recommend that you give LucidDB a try. (Disclaimer: I'm a committer to LucidDB.) It is open-source column store database, optimized for analytic applications, and also has other features such as bitmap indexes. It currently only runs on one node, but utilizes several cores effectively and can handle reasonable volumes of data with not much effort.
4 million rows times 20 columns times 8 bytes for a double is 640 mb. Following the rule of thumb that R creates three temporary copies for every object, we get to around 2 gb. That is not a lot by today's standard.
So this should be doable in memory on a suitable 64-bit machine with a 'decent' amount of ram (say 8 gb or more). Installing Ubuntu or Debian (possibly in the server version) can be done in a few minutes.
I have some experience with Infobright Community edition --- column-or. db, based on mysql.
Pro:
you can use mysql interfaces/odbc mysql drivers, from R too
fast enough queries on big chunks of data selection (because of KnowledgeGrid & data packs)
very fast native data loader and connectors for ETL (talend, kettle)
optimized exactly that operations what I (and I think most of us) use (selection by factor levels, joining etc)
special "lookup" option for optimized storing R factor variables ;) (ok, char/varchar variables with relatively small levels number/rows number)
FOSS
paid support option
?
Cons:
no insert/update operations in Community edition (yet?), data loading only via native data loader/ETL connectors
no utf-8 official support (collation/sort etc), planned for q3 2009
no functions in aggregate queries f.e. select month (date) from ...) yet, planned for July(?) 2009, but because of column storage, I prefer simply create date columns for every aggregation levels (week number, month, ...) I need
cannot installed on existing mysql server as storage engine (because of own optimizer, if I understood correctly), but you may install Infobright & mysql on different ports if you need
?
Resume:
Good FOSS solution for daily analytical tasks, and, I think, your tasks as well.
Here is my 2 cents: SQL server does not scale well. We attempted to use SQL server to store financial data in real time (i.e. prices ticks coming in for 100 symbols). It worked perfectly for the first 2 weeks - then it went slower and slower as the database size increased, and finally ground to a halt, too slow to insert each price as it was received. We tried to work around it by moving data from the active database to offline storage every night, but ultimately the project was abandoned as it just didn't work.
Bottom line: if you're planning on storing a lot of data ( >1GB) you need something that scales properly, and that probably means a column database.
It looks like an implementation change (2-D array in column-major order, instead of row-major order), rather than an interface change.
Think "strategy" pattern, rather than being an entire paradigm shift. Of course, I've never used these products, so they may in fact force a paradigm shift down your throat. I don't know why, though.
We might be better able to help you reach an informed decision if you described [1] your specific goal and [2] the issues you're running into with SQL Server.

Database scalability - performance vs. database size

I'm creating an app that will have to put at max 32 GB of data into my database. I am using B-tree indexing because the reads will have range queries (like from 0 < time < 1hr).
At the beginning (database size = 0GB), I will get 60 and 70 writes per millisecond. After say 5GB, the three databases I've tested (H2, berkeley DB, Sybase SQL Anywhere) have REALLY slowed down to like under 5 writes per millisecond.
Questions:
Is this typical?
Would I still see this scalability issue if I REMOVED indexing?
What are the causes of this problem?
Notes:
Each record consists of a few ints
Yes; indexing improves fetch times at the cost of insert times. Your numbers sound reasonable - without knowing more.
You can benchmark it. You'll need to have a reasonable amount of data stored. Consider whether or not to index based upon the queries - heavy fetch and light insert? index everywhere a where clause might use it. Light fetch, heavy inserts? Probably avoid indexes. Mixed workload; benchmark it!
When benchmarking, you want as real or realistic data as possible, both in volume and on data domain (distribution of data, not just all "henry smith" but all manner of names, for example).
It is typical for indexes to sacrifice insert speed for access speed. You can find that out from a database table (and I've seen these in the wild) that indexes every single column. There's nothing inherently wrong with that if the number of updates is small compared to the number of queries.
However, given that:
1/ You seem to be concerned that your writes slow down to 5/ms (that's still 5000/second),
2/ You're only writing a few integers per record; and
3/ You're queries are only based on time queries,
you may want to consider bypassing a regular database and rolling your own sort-of-database (my thoughts are that you're collecting real-time data such as device readings).
If you're only ever writing sequentially-timed data, you can just use a flat file and periodically write the 'index' information separately (say at the start of every minute).
This will greatly speed up your writes but still allow a relatively efficient read process - worst case is you'll have to find the start of the relevant period and do a scan from there.
This of course depends on my assumption of your storage being correct:
1/ You're writing records sequentially based on time.
2/ You only need to query on time ranges.
Yes, indexes will generally slow inserts down, while significantly speeding up selects (queries).
Do keep in mind that not all inserts into a B-tree are equal. It's a tree; if all you do is insert into it, it has to keep growing. The data structure allows for some padding, but if you keep inserting into it numbers that are growing sequentially, it has to keep adding new pages and/or shuffle things around to stay balanced. Make sure that your tests are inserting numbers that are well distributed (assuming that's how they will come in real life), and see if you can do anything to tell the B-tree how many items to expect from the beginning.
Totally agree with #Richard-t - it is quite common in offline/batch scenarios to remove indexes completely before bulk updates to a corpus, only to reapply them when update is complete.
The type of indices applied also influence insertion performance - for example with SQL Server clustered index update I/O is used for data distribution as well as index update, where as nonclustered indexes are updated in seperate (and therefore more expensive) I/O operations.
As with any engineering project - best advice is to measure with real datasets (skews page distribution, tearing etc.)
I think somewhere in the BDB docs they mention that page size greatly affects this behavior in btree's. Assuming you arent doing much in the way of concurrency and you have fixed record sizes, you should try increasing your page size

Resources