I have a huge HBase table of about half a billion rows, with about 100 columns of data per row (the exact set of columns varies from row to row).
I would like to query this data, based on any column qualifier value, as fast as possible.
I know that HBase is optimized for fast reads when the row key is known, but I want to query based on different column values. Applying column filters (using the Java API) leads to full table scans, which slows the system down.
What are my options?
INDEXING: The set of columns present varies from row to row. Can I still do indexing?
Do I continue to use HBase to store data? Or use it along with Solr or ElasticSearch?
What sort of performance can I expect for random queries based on any column values with maybe a billion rows?
Any other suggestions are welcome.
Getting data by row key is fast in HBase, but since values are not indexed, querying with a value filter is painfully slow. If the number of columns to be indexed is small, you can consider a reverse index table (a second table keyed by the values you want to look up).
But if you want more, like multi-criteria queries, you should have a look at Elasticsearch and use it to store only the index on your columns, keeping your data in HBase. Don't forget to disable the source store with "_source" : {"enabled" : false} when creating your index: all your data is already in HBase, so don't waste your HDD :)
Can someone explain when we should use search optimization versus a cluster key for a table, or whether we should use both?
I see that we are charged credits if we enable both of them?
Thanks,
Sye
Search Optimization is used when you need to access a small number of rows (point lookup queries), as you would in an OLTP database.
A Cluster Key is for partitioning your data. It's generally good for any kind of workload unless you need to read the whole table.
If you don't need to access a specific row in your large table, you don't need the Search Optimization service.
If your table is not large, or if you ingest "ordered" data to your table, you don't need auto-clustering (cluster keys).
When you load a table into Snowflake, it creates micro-partitions based on the order of the rows at load time. When a SQL statement is run, the WHERE clause is used to prune the set of micro-partitions that need to be scanned.
A Cluster Key in Snowflake simply reorders the data by the cluster key so that it is co-located within the same micro-partitions. This can result in massive performance improvements if your queries frequently use the cluster key in the WHERE clause to filter the results.
Search optimization is for finding 1 or a small number of records based on using '=' in the where clause.
So suppose you have a table with Product_ID, Transaction_Date, and Amount.
Queries using 'WHERE YEAR(Transaction_Date) >= 2017' would benefit from a cluster key on Transaction_Date.
Queries using 'Where Product_ID = 111222333' would benefit from search optimization.
In either case, these are only needed if your table is large (think billions of rows). Otherwise, the native Snowflake micro-partition approach will do a good job of optimization.
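To make the two features concrete, here is a minimal Snowflake SQL sketch; the table and column names are assumed for illustration:

```sql
-- Cluster the table on the date column used in range filters
ALTER TABLE transactions CLUSTER BY (transaction_date);

-- Enable search optimization for point lookups on high-cardinality columns
ALTER TABLE transactions ADD SEARCH OPTIMIZATION;

-- Benefits from the cluster key (range predicate on the clustering column):
SELECT SUM(amount)
FROM   transactions
WHERE  transaction_date >= '2017-01-01';

-- Benefits from search optimization (equality predicate, point lookup):
SELECT *
FROM   transactions
WHERE  product_id = 111222333;
```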
Please don't call Cluster Key "partitioning". Although the effect is similar, they are two distinct operations with different meanings. I will be publishing an article on partitioning and pruning shortly.
In Oracle 11g, say, I have a table Task which has a column ProcessState. The values of this column can be Queued, Running and Complete (there may be a couple more states in the future). The table will have 50M+ rows, with 99.9% of rows having Complete as that column's value. Only a few thousand rows will have the value Queued/Running.
I read that although a bitmap index is good for low-cardinality columns, it is largely meant for static tables.
So, what index can improve the query for Queued/Running tasks? A bitmap index or a normal non-unique B-tree index?
Also, what index can improve the query for a binary column (NUMBER(1,0) with just yes/no values)?
Disclaimer: I am an accidental dba.
A regular (B*Tree) index is fine. Just make sure there is a histogram on the column (see the METHOD_OPT parameter of DBMS_STATS.GATHER_TABLE_STATS).
With a histogram on that column, Oracle will have the data it needs to use the index when looking for queued/running jobs but use a full table scan when looking for completed jobs.
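A minimal sketch of that setup, with assumed object names:

```sql
-- Plain B*Tree index on the skewed column
CREATE INDEX task_state_idx ON task (process_state);

-- Gather stats with a histogram on that column so the optimizer
-- knows the value distribution is heavily skewed
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname    => USER,
    tabname    => 'TASK',
    method_opt => 'FOR COLUMNS process_state SIZE 254'
  );
END;
/
```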
Do NOT use a bitmap index, as suggested in the comments. With lots of updates, you'll have concurrency and, worse, deadlocking issues.
Also, what index can improve the query for a binary column (NUMBER(1,0) with just yes/no values)
Sorry -- I missed this part of your question. If the data in the column is skewed (i.e., almost all 1 or almost all 0), then a regular (b*tree) index as above. If the data is evenly distributed, then no index will help. Reading 50% of your table's rows via an index will be slower than a full table scan.
I guess that you are interested in selecting rows with the Queued/Running states in order to update them. So it would be nice to separate the completed rows from the others, because there is not much sense in indexing completed rows. You can use partitioning here, or a function-based index with a function returning NULL for completed rows and the actual value for the others; in that case only uncompleted rows appear in the index tree.
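A hedged sketch of that function-based index approach, with assumed names; rows where the expression evaluates to NULL are simply not stored in the index:

```sql
-- Only non-complete rows have a non-NULL key, so only they appear in the index
CREATE INDEX task_active_state_idx ON task (
  CASE WHEN process_state <> 'Complete' THEN process_state END
);

-- The query must repeat the same expression for the index to be usable:
SELECT *
FROM   task
WHERE  CASE WHEN process_state <> 'Complete' THEN process_state END = 'Queued';
```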
I'm using DSE for Cassandra/Solr integration, so data is stored in Cassandra and indexed in Solr. It's very natural to use Cassandra for CRUD operations and Solr for full-text search respectively, and DSE really simplifies data synchronization between Cassandra and Solr.
When it comes to query, however, there are actually two ways to go: Cassandra secondary/manual configured index vs. Solr. I want to know when to use which method and what's the performance difference in general, especially under DSE setup.
Here is one example use case in my project. I have a Cassandra table storing some item entity data. Besides the basic CRUD operation, I also need to retrieve items by equality on some field (say category) and then sort by some order (in my case here, a like_count field).
I can think of three different ways to handle it:
Declare 'indexed=true' in the Solr schema for both the category and like_count fields, and query in Solr
Create a denormalized table in Cassandra with primary key (category, like_count, id)
Create a denormalized table in Cassandra with primary key (category, order, id) and use an external component, such as Spark/Storm, to sort the items by like_count
The first method seems to be the simplest to implement and maintain. I just write some trivial Solr access code and the rest of the heavy lifting is handled by Solr/DSE Search.
The second method requires manual denormalization on create and update, and I also need to maintain a separate table. There is also a tombstone issue, as the like_count can be updated frequently. The good part is that reads may be faster (if there are no excessive tombstones).
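For reference, a minimal CQL sketch of the denormalized table in the second option (names assumed for illustration); because like_count is part of the primary key, updating it is effectively a delete plus an insert, which is where the tombstones come from:

```sql
CREATE TABLE items_by_category (
    category    text,
    like_count  int,
    id          uuid,
    title       text,          -- other item fields duplicated as needed
    PRIMARY KEY ((category), like_count, id)
) WITH CLUSTERING ORDER BY (like_count DESC, id ASC);

-- Items in one category come back from a single partition, already sorted:
SELECT id, title, like_count
FROM   items_by_category
WHERE  category = 'books'
LIMIT  20;
```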
The third method can alleviate the tombstone issue at the cost of one extra component for sorting.
Which method do you think is the best option? What is the difference in performance?
Cassandra secondary indexes have limited use cases:
No more than a couple of columns indexed.
Only a single indexed column in a query.
Too much inter-node traffic for high cardinality data (relatively unique column values)
Too much inter-node traffic for low cardinality data (high percentage of rows will match)
Queries need to be known in advance so data model can be optimized around them.
Because of these limitations, it is common for apps to create "index tables" which are keyed by whatever column is desired. This requires either that data be duplicated from the main table into each index table, or that an extra query be issued: read the index table to get the main key, then read the actual row from the main table. Queries on multiple columns have to be manually indexed in advance, making ad hoc queries problematic. And any duplicated data will have to be manually kept up to date by the app in each index table.
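A hedged CQL sketch of the second variant of an index table (only the keys are stored, so each lookup takes two queries); all names here are assumed:

```sql
-- Index table: maps the queried value (category) to the main-table key (id)
CREATE TABLE item_ids_by_category (
    category  text,
    id        uuid,
    PRIMARY KEY (category, id)
);

-- Step 1: read the keys from the index table
SELECT id FROM item_ids_by_category WHERE category = 'books';

-- Step 2: read the actual rows from the main table using those keys
SELECT * FROM items WHERE id IN (...);   -- ids obtained in step 1
```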
Other than that... they will work fine in cases where a "modest" number of rows will be selected from a modest number of nodes, and queries are well specified in advance and not ad hoc.
DSE/Solr is better for:
A moderate number of columns are indexed.
Complex queries with a number of columns/fields referenced - Lucene matches all specified fields in a query in parallel. Lucene indexes the data on each node, so nodes query in parallel.
Ad hoc queries in general, where the precise queries are not known in advance.
Rich text queries such as keyword search, wildcard, fuzzy/like, range, inequality.
There is a performance and capacity cost to using Solr indexing, so a proof of concept implementation is recommended to evaluate how much additional RAM, storage, and nodes are needed, which depends on how many columns you index, the amount of text indexed, and any text filtering complexity (e.g., n-grams need more.) It could range from 25% increase for a relatively small number of indexed columns to 100% if all columns are indexed. Also, you need to have enough nodes so that the per-node Solr index fits in RAM or mostly in RAM if using SSD. And vnodes are not currently recommended for Solr data centers.
I'm currently using a MySQL table for an online game under LAMP.
One of the tables is huge (soon millions of rows) and contains only integers (IDs, timestamps, booleans, scores).
I did everything to never have to JOIN on this table. However, I'm worried about the scalability. I'm thinking about moving this single table to another faster database system.
I use intermediary tables to calculate the scores, but in some cases I have to use SUM() or AVG() directly on some filtered rowsets of this table.
For you, what is the best database choice for this table?
My requirements/specs:
This table contains only integers (around 15 columns)
I need to filter by certain columns
I'd like to have UNIQUE KEYS
It would be nice to have "INSERT ... ON DUPLICATE KEY UPDATE", but I suppose my scripts can manage it by themselves (see the sketch after this list).
I have to use SUM() or AVG()
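A minimal MySQL sketch of the upsert pattern mentioned above, with assumed table and column names:

```sql
CREATE TABLE player_scores (
    player_id  INT UNSIGNED NOT NULL,
    game_id    INT UNSIGNED NOT NULL,
    score      INT NOT NULL,
    played_at  INT UNSIGNED NOT NULL,
    UNIQUE KEY uq_player_game (player_id, game_id)
) ENGINE = InnoDB;

-- Insert a new row, or update the existing one if the unique key already exists
INSERT INTO player_scores (player_id, game_id, score, played_at)
VALUES (42, 7, 1300, UNIX_TIMESTAMP())
ON DUPLICATE KEY UPDATE
    score     = VALUES(score),
    played_at = VALUES(played_at);
```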
thanks
Just make sure you have the correct indexes in place, and selecting should be quick.
Millions of rows in a table isn't huge. You shouldn't expect any problems in selecting, filtering or upserting data if you index on relevant keys as #Tom-Squires suggests.
Aggregate queries (sum and avg) may pose a problem though. The reason is that they require a full table scan and thus multiple fetches of data from disk to memory. A couple of methods to increase their speed:
If your data changes infrequently then caching those query results in your code is probably a good solution.
If it changes frequently then the quickest way to improve their performance is probably to ensure that your database engine keeps the table in memory. A quick calculation of expected size: 15 columns x 8 bytes x millions =~ 100's of MB - not really an issue (unless you're on a shared host). If your RDBMS does not support tuning this for a specific table, then simply put it in a different database schema - shouldn't be a problem since you're not doing any joins on this table. Most engines will allow you to tune that.
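One addition, hedged: when the SUM()/AVG() is always restricted by an indexed column (as in the question's filtered rowsets), a composite index that also contains the aggregated column lets the engine answer the query from an index range scan instead of a full table scan. A sketch reusing the assumed player_scores table from the earlier example:

```sql
-- Composite index: filter column first, aggregated column second,
-- so the query below can be satisfied from the index alone
CREATE INDEX idx_game_score ON player_scores (game_id, score);

SELECT SUM(score), AVG(score)
FROM   player_scores
WHERE  game_id = 7;
```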
I have to design a database to store log data, but I don't have prior experience with this. My table contains about 19 columns (about 500 bytes per row) and grows by up to 30,000 new rows per day. My app must be able to query this table efficiently.
I'm using SQL Server 2005.
How can I design this database?
EDIT: data I want to store contains a lot of type: datetime, string, short and int. NULL cells are about 25% in total :)
However else you'll do lookups, a logging table will almost certainly have a timestamp column. You'll want to cluster on that timestamp first to keep inserts efficient. That may mean also always constraining your queries to specific date ranges, so that the selectivity on your clustered index is good.
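A minimal T-SQL sketch of that layout, with assumed (and abbreviated) table and column names:

```sql
CREATE TABLE dbo.AppLog (
    LogId     BIGINT IDENTITY(1,1) NOT NULL,
    LoggedAt  DATETIME NOT NULL,
    Severity  SMALLINT NULL,
    Source    VARCHAR(100) NULL,
    Message   VARCHAR(400) NULL
    -- ...the real table would have the remaining columns here
);

-- Cluster on the timestamp so inserts append at the end of the clustered index
CREATE CLUSTERED INDEX IX_AppLog_LoggedAt ON dbo.AppLog (LoggedAt);

-- Constrain queries to a date range so the clustered index prunes well
SELECT LoggedAt, Severity, Source, Message
FROM   dbo.AppLog
WHERE  LoggedAt >= '20120101' AND LoggedAt < '20120201';
```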
You'll also want indexes for the fields you'll query on most often, but don't jump the gun here. You can add the indexes later. Profile first so you know which indexes you'll really need. On a table with a lot of inserts, unwanted indexes can hurt your performance.
Well, given the description you've provided all you can really do is ensure that your data is normalized and that your 19 columns don't lead you to a "sparse" table (meaning that a great number of those columns are null).
If you'd like to add some more data (your existing schema and some sample data, perhaps) then I can offer more specific advice.
Throw an index on every column you'll be querying against.
Huge amounts of test data, and execution plans (with query analyzer) are your friend here.
In addition to the comment on sparse tables, you should index the table on the columns you wish to query.
Alternatively, you could test it using the profiler and see what the profiler suggests in terms of indexing based on actual usage.
Some optimisations you could make:
Cluster your data based on the most likely look-up criteria (e.g. clustered primary key on each row's creation date-time will make look-ups of this nature very fast).
Assuming that rows are written one at a time (not in batch) and that each row is inserted but never updated, you could code all select statements to use the "with (NOLOCK)" option. This will offer a massive performance improvement if you have many readers as you're completely bypassing the lock system. The risk of reading invalid data is greatly reduced given the structure of the table.
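A hedged T-SQL sketch of the NOLOCK reads, reusing the assumed dbo.AppLog table from the earlier example:

```sql
-- Dirty reads are acceptable here because log rows are inserted once
-- and never updated, so readers can bypass the lock system entirely:
SELECT TOP (100) LoggedAt, Severity, Message
FROM   dbo.AppLog WITH (NOLOCK)
WHERE  LoggedAt >= '20120101'
ORDER BY LoggedAt DESC;
```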
If you're able to post your table definition I may be able to offer more advice.