I'm investigating Cassandra as a possible alternative backing store for a data-intensive application, and I'm looking at ways to structure the schema and use CQL to do the kinds of queries we do today using MySQL.
The concrete problem I currently have is: I need to insert, say, 1 million rows into a table. However, if a row with the right identity already exists (i.e. it's already in the system, identified by a hash), I want to reuse its id for relational reasons. I only expect an overlap of around 10,000 IDs, but of course it could be all 1 million.
Suppose I have a table like this:
create table records_by_hash(hash text primary key, id bigint);
Is it enough to issue a select hash, id from records_by_hash where hash in (...) with all hashes in a multi-megabyte comma-separated list? Is this the best approach for Cassandra?
The way we do this in MySQL is like this:
create temporary table hashes(hash text);
-- file is actually JDBC OutputStream
load data infile '/dev/stdin' into table hashes -- csv format
select id, hash from records join hashes on records.hash = hashes.hash;
Since records is indexed on hash, and the lookup data is now inside MySQL (no more round trips), this is fairly quick and painless. load data is very fast, and there are only three logical round trips.
Using the in operator is usually not the best idea, because you are hitting multiple partitions (located on effectively random nodes) within the same query. It is slow and puts a lot of work on the coordinator node, and passing a multi-megabyte list there is definitely not a good idea.
Check-before-set is rarely a good idea because it doesn't really scale, and Cassandra does not provide you with joins. Depending on your needs, you would have to have some sort of script on the client side that checks all of this before doing the inserts, i.e. a check-and-set workflow.
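A sketch of what that per-hash check could look like on the CQL side, using the records_by_hash table from the question: prepare a single-partition query once, then execute it once per hash (ideally asynchronously from the client driver) instead of passing one giant in list.

-- prepared once; the driver binds one hash per execution and runs them in parallel
select hash, id from records_by_hash where hash = ?;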
An alternative approach for this would be to use Spark.
The thing is, Cassandra won't mind if the hash is already there and you insert new data over it, but that is not what you want here, because you need to keep the existing references. One possible approach is to use lightweight transactions, so that IF NOT EXISTS performs the insertion only if the row does not already exist. Be aware that IF NOT EXISTS incurs a performance hit, because it uses Paxos internally.
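A minimal CQL sketch of that, against the records_by_hash table from the question (the hash and id values here are made up): when the row already exists, the response contains [applied] = false together with the existing row, so the client can pick up the stored id and reuse it.

insert into records_by_hash (hash, id) values ('a1b2c3d4', 12345) if not exists;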
In MySQL the ID is usually an AUTO_INCREMENT, and there is no parallel for this in Cassandra. It's not clear to me whether you want Cassandra to create the ID(s) as well, or whether some other system/db will create them for you.
Another thing to note is that MySQL's INSERT INTO table (a,b,c) VALUES (1,2,3) ON DUPLICATE KEY UPDATE is the parallel of a plain Cassandra CQL INSERT; that is, a CQL INSERT will simply overwrite the record if one already exists.
You may want to model information in a different manner in Cassandra
Related
I have a scenario where I have to insert into a table for thirteen different use cases. I am primarily getting the data from a "MainTable", and I only need the records with "IsLocal" = 1 from it.
I am contemplating whether I should use the table directly with the "IsLocal" condition in all thirteen use cases, or populate a temporary table with the "IsLocal" = 1 records from "MainTable" and use that instead. Which would be the better option for me?
This "MainTable" is expected to have around 1 million records, with a significant portion of them having "IsLocal" = 1.
It basically depends on your business logic, your infrastructure and the table definition itself.
If you store the data in a temporary table, it is stored in tempdb. So the question is: can we afford to store that amount of data in tempdb without affecting general performance?
What's the amount of data? If we are just storing one million BIGINT values we might be OK. But what if we are storing one million rows with many nvarchar(max) values?
How big is our tempdb, and is it on a RAM disk?
How often is this temporary table going to be populated? Once per day, or hundreds of times every minute?
You need to think about the questions above and implement a solution. Then, after a few days or weeks, you may find out that it was not a good one and change it.
Without knowing the details of your production environment, I can only advise that you optimize your query using indexes. You are filtering by IsLocal = 1, which seems to be a good match for a filtered index (even if most of the rows have this value, we still eliminate some of them on read).
Also, if you are reading only a few of the columns from the table, you can try to create a covering index for your query by creating an index with included columns. Having an index with the columns we need plus the filtering predicate can speed the query up a lot. But you have to test this; creating the perfect index is not an easy task.
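For illustration, a filtered covering index might look like the sketch below; the key column Id and the included columns ColA and ColB are assumptions, standing in for whatever columns your thirteen queries actually seek on and read.

-- only rows with IsLocal = 1 are indexed; ColA and ColB are included to avoid key lookups
CREATE NONCLUSTERED INDEX IX_MainTable_IsLocal
ON dbo.MainTable (Id)
INCLUDE (ColA, ColB)
WHERE IsLocal = 1;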
It's definitely better to store frequently used data in a temporary table and work from that; in your case, store the data from MainTable for the condition IsLocal = 1. It avoids scanning the whole table again and again for the same set of data, so you should see a noticeable performance gain. In addition I would like to suggest a few things while following this approach:
1- Store the data using the INTO clause instead of INSERT INTO; it is much faster:
SELECT a, b, c, ... INTO #tmp_main_table FROM main_table WHERE IsLocal = 1
2- Index the columns in #tmp_main_table that your use cases filter or join on, for example:
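(A minimal sketch; the column name JoinKey is an assumption for whatever column your queries actually filter or join on.)

CREATE NONCLUSTERED INDEX IX_tmp_main_table_JoinKey ON #tmp_main_table (JoinKey)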
Note: tempdb storage and other resource issues are then your own concern, so be careful about that.
Note: Oracle 11gR2 Standard Edition (so no partitioning)
So I have to build a process to produce reports off a table containing about 27 million records. The dilemma I'm facing is that I can't create my own indexes on this table, as it's a 3rd-party table that we can't alter. So I started experimenting with materialized views, where I can then create my own indexes, or with a physical table that would basically just be a duplicate that I'd truncate and repopulate on demand.
The advantage of the materialized view is that it's basically pulling from the "live" table, so I don't have to worry about discrepancies as long as I refresh it before use; the problem is that the refresh seems to take a significant amount of time. I then tried the physical table approach: truncating and repopulating took around 10 minutes, and rebuilding the indexes takes another 10, give or take. I also tried loading only the "new" records by performing a:
INSERT ... SELECT ... WHERE NOT EXISTS (SELECT 1 FROM target WHERE target.PK = source.PK)
which also takes almost 10 minutes, regardless of my indexes, parallelism, etc.
Has anyone had to deal with this amount of data (which will keep growing) and found an approach that performs well and works efficiently?
It seems a plain view won't do, so I'm left with those two options, since I can't tweak the indexes on my primary table; any tips or suggestions would be greatly appreciated. The whole purpose of this process was to make reporting "faster", but where I gain performance in some areas I end up losing in others, given the amount of data I need to move around. Are there other options aside from:
Truncate / Populate Table, Rebuild indexes
Populate secondary table from primary table where PK not exist
Materialized view (Refresh, Rebuild indexes)
View that pulls from Live table (No new indexes)
Thanks in advance for any suggestions.....
Does anyone know whether a "Create Table As Select..." performs better than an "Insert... Select" if I render my indexes and such unusable while doing the insert in the second option, or should it be fairly similar?
I think that there's a lot to be said for a very simple approach to this sort of task. Consider a truncate and direct path (append) insert into the duplicate table, without disabling/rebuilding indexes, and with NOLOGGING set on the table. The direct path insert has an index maintenance mechanism associated with it that is possibly more efficient than running multiple index rebuilds post-load, as it logs the data required to build the indexes in temporary segments and thus avoids multiple subsequent full table scans.
If you do want to experiment with index disable/rebuild, then try rebuilding all the indexes at the same time without query parallelism, as only one physical full scan will be used: the rest of the scans will be "parasitic" in that they'll read the table blocks from memory.
When you load the duplicate table consider ordering the rows in the select so that commonly used predicates on the reports are able to access fewer blocks. For example if you commonly query on date ranges, order by the date column. Remember that a little extra time spent in building this report table can be recovered in reduced report query execution time.
Consider compressing the table also, but only if you're loading with direct path insert unless you have the pricey Advanced Compression option. Index compression and bitmap indexes are also worth considering.
Also, consider not analyzing the reporting table. Report queries commonly use multiple predicates that are not well estimated using conventional statistics, and you have to rely on dynamic sampling for good cardinality estimates anyway.
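Putting those suggestions together, a rough sketch of the load; report_copy, the source schema/table and the report_date ordering column are assumptions, not your actual names.

alter table report_copy nologging;
truncate table report_copy;
-- direct path insert; table compression only pays off with this kind of load
insert /*+ append */ into report_copy
select * from third_party_schema.live_table
order by report_date;  -- cluster rows around the most common report predicate
commit;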
"Create Table As Select" generate lesser undo. That's an advantage.
When data is "inserted" indexes also are maintained and performance is impacted negatively.
I'm currently using a MySQL table for an online game under LAMP.
One of the tables is huge (soon millions of rows) and contains only integers (IDs, timestamps, booleans, scores).
I did everything to never have to JOIN on this table. However, I'm worried about the scalability. I'm thinking about moving this single table to another faster database system.
I use intermediary tables to calculate the scores, but in some cases I have to use SUM() or AVG() directly on some filtered rowsets of this table.
For you, what is the best database choice for this table?
My requirements/specs:
This table contains only integers (around 15 columns)
I need to filter by certain columns
I'd like to have UNIQUE KEYS
It would be nice to have "INSERT ... ON DUPLICATE KEY UPDATE", but I suppose my scripts can manage that by themselves.
I have to use SUM() or AVG()
thanks
Just make sure you have the correct indexes in place, and selecting should be quick.
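As a hedged sketch of what "the correct indexes" might look like for the filtered SUM()/AVG() queries mentioned in the question; the table and column names below are assumptions, not the real schema. A composite index whose leading columns match the filter lets MySQL answer the aggregate from an index range scan.

-- assumed schema: scores(game_type, created_at, score, ...), all integer columns
ALTER TABLE scores ADD INDEX ix_type_time_score (game_type, created_at, score);

SELECT SUM(score), AVG(score)
FROM scores
WHERE game_type = 3
  AND created_at BETWEEN 1293840000 AND 1296518400;  -- unix timestamps for the period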
Millions of rows in a table isn't huge. You shouldn't expect any problems selecting, filtering or upserting data if you index on relevant keys, as @Tom-Squires suggests.
Aggregate queries (sum and avg) may pose a problem though. The reason is that they require a full table scan and thus multiple fetches of data from disk to memory. A couple of methods to increase their speed:
If your data changes infrequently then caching those query results in your code is probably a good solution.
If it changes frequently then the quickest way to improve their performance is probably to ensure that your database engine keeps the table in memory. A quick calculation of expected size: 15 columns x 8 bytes x millions =~ 100's of MB - not really an issue (unless you're on a shared host). If your RDBMS does not support tuning this for a specific table, then simply put it in a different database schema - shouldn't be a problem since you're not doing any joins on this table. Most engines will allow you to tune that.
I have an application that will generate millions of date/type/value entries. We don't need to do complex queries, only, for example, get the average value per day of type X between dates A and B.
I'm sure a normal db like MySQL isn't the best way to handle this sort of thing; is there a better system suited to this sort of data?
EDIT: The goal is not to say that a relational database cannot handle my problem, but to find out whether another type of database (key/value store, NoSQL, document-oriented, ...) would be better adapted to what I want to do.
If you are dealing with a simple table as such:
CREATE TABLE myTable (
[DATE] datetime,
[TYPE] varchar(255),
[VALUE] varchar(255)
)
Creating an index, probably on TYPE, DATE, VALUE in that order, will give you good performance on the query you've described. Use EXPLAIN PLAN or the equivalent on the database you're working with to review the performance metrics. Also, set up a scheduled task to defragment that index regularly; the frequency will depend on how often inserts, deletes and updates occur.
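A sketch of that index and of the example query from the question ("average value per day of type X between date A and B"); the dates below are placeholders, and since VALUE is declared varchar(255) above it has to be cast before aggregating. CAST(... AS date) assumes SQL Server 2008 or later.

CREATE INDEX IX_myTable_Type_Date_Value ON myTable ([TYPE], [DATE], [VALUE]);

SELECT CAST([DATE] AS date) AS [day],
       AVG(CAST([VALUE] AS float)) AS avg_value
FROM myTable
WHERE [TYPE] = 'X'
  AND [DATE] >= '2011-01-01' AND [DATE] < '2011-02-01'  -- placeholder dates A and B
GROUP BY CAST([DATE] AS date)
ORDER BY [day];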
As far as an alternative persistence store (i.e. NoSQL) goes, you don't gain anything. NoSQL shines when you want schema-less storage, in other words when you don't know the entity definitions ahead of time. But from what you've described, you have a very clear picture of what you want to store, which lends itself well to a relational database.
Possibilities for scaling over time include partitioning, or splitting each TYPE out into a separate table. The partitioning could be done by type and/or date. It really depends on the nature of the queries you're dealing with, for instance whether you typically query for values within the same year, and on what your database offers in that regard.
MS SQL Server and Oracle offer the concept of partitioned tables and indexes.
In short: you could group your rows by some value, e.g. by year and month. Each group is then accessible as a separate table with its own index, so you can list, summarize and edit the February 2011 sales without touching the rest of the rows. Partitioned tables complicate the database, but for extremely long tables they can lead to significantly better performance.
Based upon cost you can choose either MySQL or SQL Server. In this case you have to be clear about what you want to achieve with the database; if it's just storage, then any RDBMS can handle it.
You could store the data as fixed-length records in a file.
Do a binary search on the file (opened for random access) to find your start and end records, then sum the appropriate field, for the given condition, over all records between your start and end offsets in the file.
I've seen a few questions on this topic already but I'm looking for some insight on the performance differences between these two techniques.
For example, let's say I am recording a log of events which come into the system with a dictionary of key/value pairs for each specific event. I will record an entry in an Events table with the base data, but then I need a way to also link the additional key/value data. I will never know what kinds of keys or values will come in, so any sort of predefined enum table seems out of the question.
This event data will be constantly streaming in, so insert times are just as important as query times.
When I query for specific events I will be using some fields on the Event as well as data from the key/value data. For the XML approach I would simply use an Attributes.exist('xpath') predicate as part of the WHERE clause to filter the records.
The normalized way would be to use a Table with basically Key and Value fields with a foreign link to the Event record. This seems clean and simple but I worry about the amount of data that is involved.
You've got three major options for a 'flexible' storage mechanism.
XML fields are flexible but put you in the realm of blob storage, which is slow to query. I've seen queries against small data sets of 30,000 rows take 5 minutes when they were digging stuff out of the blobs with XPath queries. This is the slowest option by far, but it is flexible.
Key/value pairs are a lot faster, particularly if you put a clustered index on the event key. This means that all attributes for a single event will be physically stored together in the database, which will minimise the I/O. The approach is less flexible than XML but substantially faster. The most efficient queries to report against it would involve pivoting the data (i.e. a table scan to make an intermediate flattened result); joining to get individual fields will be much slower.
The fastest approach is to have a flat table with a set of user defined fields (Field1 - Field50) and hold some metadata about the contents of the fields. This is the fastest to insert and fastest and easiest to query, but the contents of the table are opaque to anything that does not have access to the metadata.
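A minimal T-SQL sketch of the key/value option described above; the table and column names are assumptions. The clustered primary key on (EventId, AttrKey) is what keeps all attributes of a single event physically stored together.

CREATE TABLE EventAttributes (
    EventId   bigint        NOT NULL,  -- references the Events table
    AttrKey   nvarchar(100) NOT NULL,
    AttrValue nvarchar(400) NULL,
    CONSTRAINT PK_EventAttributes PRIMARY KEY CLUSTERED (EventId, AttrKey)
);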
The problem I see with the key/value table approach is the datatypes: if a value could be a datetime, a string, a unicode string or an integer, how do you define the column? This dilemma means the value column has to be a datatype that can contain all the different types of data, which then raises the question of efficiency and ease of querying. Alternatively, you could have multiple columns of specific datatypes, but I think this is a bit clunky.
For a true flexible schema, I can't think of a better option than XML. You can index XML columns.
This MSDN article discusses XML storage in more detail.
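If you go the XML route, a rough sketch of an indexed XML column and an exist()-style filter; the table, element and attribute names here are made up.

CREATE TABLE Events (
    EventId    bigint IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- clustered PK is required before creating XML indexes
    EventTime  datetime NOT NULL,
    Attributes xml NULL
);

CREATE PRIMARY XML INDEX PXML_Events_Attributes ON Events (Attributes);

-- events that carry a "userId" attribute
SELECT EventId
FROM Events
WHERE Attributes.exist('/attrs/attr[@key="userId"]') = 1;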
I'd assume the normalized way would be faster for both INSERT and SELECT operations, if only because that's what any RDBMS would be optimized for. The "amount of data involved" part might be an issue too, but a more solvable one - how long do you need that data immediately on hand, can you archive it after a day, or a couple weeks, or 3 months, etc? SQL Server can handle an awful lot.
This event data will be constantly streaming in, so insert times are just as important as query times.
Option 3: If you really have a lot of data constantly streaming in, create a separate queue (in shared memory, in-process SQLite, a separate db table, or even its own server) to store the incoming raw events and attributes, and have another process (scheduled task, Windows service, etc.) parse that queue into whatever format is tuned for speedy SELECTs. Optimal input, optimal output, ready to scale in either direction, everyone's happy.