Cassandra's ORDER BY does not work as expected - database

So, I'm storing some statistics in Cassandra.
I want to get the top 10 entries based on a specific column. The column in this case is kills.
As there is no ORDER BY command like in MySQL, I have to create a PARTITION KEY.
I've created the following table:
CREATE TABLE IF NOT EXISTS stats (
    uuid uuid,
    kills int,
    deaths int,
    playedGames int,
    wins int,
    srt int,
    PRIMARY KEY (srt, kills)
) WITH CLUSTERING ORDER BY (kills DESC);
The problem I have is the following: as you can see above, I'm using the column srt for ordering, because when I use the column uuid for ordering, the result of my SELECT query comes back in seemingly random order and is not sorted as expected.
So I tried adding a column that always holds the same value as my PARTITION KEY. Sorting works now, but not really well. When I now run SELECT * FROM stats;, the result is the following:
srt | kills | deaths | playedgames | uuid | wins
-----+-------+--------+-------------+--------------------------------------+------
0 | 49 | 35 | 48 | 6f284e6f-bd9a-491f-9f52-690ea2375fef | 2
0 | 48 | 21 | 30 | 4842ad78-50e4-470c-8ee9-71c5a731c935 | 4
0 | 47 | 48 | 14 | 91f41144-ef5a-4071-8c79-228a7e192f34 | 42
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
0 | 2 | 32 | 20 | 387448a7-a08e-46d4-81a2-33d8a893fdb6 | 31
0 | 1 | 16 | 17 | fe4efbcd-34c3-419a-a52e-f9ae8866f2bf | 12
0 | 0 | 31 | 25 | 82b13d11-7eeb-411c-a521-c2c2f9b8a764 | 10
The problem with the result is that, "per kill" amount/value, there is only one row - but there should definitely be more.
So, any idea about using sorting in Cassandra without getting data stripped out?
I also heard about DataStax Enterprise (DSE), which supports Solr in queries, but DSE is only free for non-production use (and even then only for 6 months), and the paid version is, at least from what I've heard, pretty expensive (around $4000 per node). So, is there any alternative, like a DataStax Enterprise Community Edition? That may not make sense, but I'm just asking. I haven't found anything by googling, so: can I also use Solr with "normal" Cassandra?
Thank you for your help!
PS: Please don't mark this as a duplicate of "order by clause not working in Cassandra query", because it didn't help me. I already googled for about an hour and a half for a solution.
EDIT:
Because my primary key is PRIMARY KEY (srt, kills), the combination (srt, kills) must be unique, which basically means that rows with the same number of kills overwrite each other. I could use PRIMARY KEY (uuid, kills), which would solve the overwriting problem, but when I then run SELECT * FROM stats LIMIT 10, the results come back in seemingly random order and are not sorted by kills.

If you want to sort by a column, keep it out of the partition key and make it a clustering column. Rows are sorted by that column within each partition: Cassandra distributes data across nodes by partition key and orders the rows inside each partition by clustering key:
PRIMARY KEY ((srt), kills)
EDIT:
You need to understand the concepts a little better. I suggest taking one of the free courses on the DSE site; it will help you with further development.
Anyway, about your question:
Primary key is a set of columns that make each row unique.
There are 2 types of columns in this primary key - partition key columns and clustering columns.
You can't use the partition key for sorting or range queries. That goes against Cassandra's model, because such a query would be split across several nodes, or even all nodes and SSTables. If you want to use both of the listed columns for sorting, you can partition on another column (a random number from 1 to 100, for example) and then run your query once per "batch". Alternatively, use another column that has a high enough number of unique values (at least 100), provided the data is evenly distributed between those values and is accessed using all of them; otherwise you will end up with hot nodes/partitions.
PRIMARY KEY ((another_column), kills, srt)
What you have to understand is that you can order your data only within partitions, not between partitions.
that "per kill" amount/value - can you elaborate? There is only one row for each key in Cassandra; if you insert several rows with the same key, they are overwritten with the values of the last insert (read about upserts).
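The overwriting described above can be sketched in plain Python. This is only a simulation of upsert semantics with a dict, not Cassandra driver code, and the helper names (upsert, top) are made up for illustration. It assumes that adding uuid as a trailing clustering column, i.e. something like PRIMARY KEY ((srt), kills, uuid), is acceptable for your data; that makes each row's key unique while rows stay ordered by kills inside the partition.

```python
from uuid import uuid4

def upsert(table, row, key_columns):
    """Write a row under its primary key; an equal key silently replaces the old row."""
    key = tuple(row[c] for c in key_columns)
    table[key] = row

# Six players, several of them with the same number of kills.
players = [{"uuid": uuid4(), "srt": 0, "kills": k} for k in (49, 49, 48, 48, 48, 47)]

narrow = {}  # simulates PRIMARY KEY (srt, kills)
wide = {}    # simulates PRIMARY KEY ((srt), kills, uuid)  (assumed fix)
for p in players:
    upsert(narrow, p, ("srt", "kills"))
    upsert(wide, p, ("srt", "kills", "uuid"))

def top(table, n):
    """Rows of the single partition, ordered by kills DESC, then uuid as tie-breaker."""
    return sorted(table.values(), key=lambda r: (-r["kills"], str(r["uuid"])))[:n]

print(len(narrow), len(wide))                 # rows that survived in each layout
print([r["kills"] for r in top(wide, 4)])
```

With the narrow key only one row per kills value survives, which matches the output in the question. Note that a single constant srt value still puts everything in one partition, so the hot-partition warning above applies.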

Related

Index in SQL Server 2008 database [duplicate]

I've created composite indexes (indices for you mathematical folk) on tables before with an assumption of how they worked. I was just curious if my assumption is correct or not.
I assume that when you list the order of columns for the index, you are also specifying how the indexes will be grouped. For instance, if you have columns a, b, and c, and you specify the index in that same order a ASC, b ASC, and c ASC then the resultant index will essentially be many indexes for each "group" in a.
Is this correct? If not, what will the resultant index actually look like?
Composite indexes work just like regular indexes, except that they have multi-value keys.
If you define an index on the fields (a,b,c) , the records are sorted first on a, then b, then c.
Example:
| A | B | C |
-------------
| 1 | 2 | 3 |
| 1 | 4 | 2 |
| 1 | 4 | 4 |
| 2 | 3 | 5 |
| 2 | 4 | 4 |
| 2 | 4 | 5 |
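The ordering in the example above is the same as ordinary tuple ordering, which a few lines of Python can show. This is just an illustration of the sort order, not of how SQL Server stores the index; the shuffled input rows are the ones from the table above.

```python
# Rows (a, b, c) in arbitrary insertion order.
rows = [(2, 4, 5), (1, 4, 4), (2, 3, 5), (1, 2, 3), (2, 4, 4), (1, 4, 2)]

# Tuples compare element by element: first a, then b, then c.
index_order = sorted(rows)

for a, b, c in index_order:
    print(f"| {a} | {b} | {c} |")
```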
A composite index is like a plain alphabetical index in a dictionary, but covering two or more letters, like this:
AA - page 1
AB - page 12
etc.
Table rows are ordered first by the first column in the index, then by the second one, and so on.
It's usable when you search by both columns OR by the first column only. If your index is like this:
AA - page 1
AB - page 12
…
AZ - page 245
BA - page 246
…
you can use it for searching on 2 letters ( = 2 columns in a table), or like a plain index on one letter:
A - page 1
B - page 246
…
Note that in the case of a dictionary, the pages themselves are alphabetically ordered. That's an example of a CLUSTERED index.
In a plain, non-CLUSTERED index, the references to pages are ordered, like in a history book:
Gaul, Alesia: pages 12, 56, 78
Gaul, Augustodonum Aeduorum: page 145
…
Gaul, Vellaunodunum: page 24
Egypt, Alexandria: pages 56, 194, 213, 234, 267
Composite indexes may also be used when you ORDER BY two or more columns. In this case a DESC clause may come in handy.
See this article in my blog about using DESC clause in a composite index:
Descending indexes
The most common implementation of indices uses B-trees to allow somewhat rapid lookups, and also reasonably rapid range scans. It's too much to explain here, but here's the Wikipedia article on B-trees. And you are right, the first column you declare in the create index will be the high order column in the resulting B-tree.
A search on the high order column amounts to a range scan, and a B-tree index can be very useful for such a search. The easiest way to see this is by analogy with the old card catalogs you find in libraries that have not yet converted to online catalogs.
If you are looking for all the cards for Authors whose last name is "Clemens", you just go to the author catalog, and very quickly find a drawer that says "CLE- CLI" on the front. That's the right drawer. Now you do a kind of informal binary search in that drawer to quickly find all the cards that say "Clemens, Roger", or "Clemens, Samuel" on them.
But suppose you want to find all the cards for the authors whose first name is "Samuel". Now you're up the creek, because those cards are not gathered together in one place in the Author catalog. A similar phenomenon happens with composite indices in a database.
Different DBMSes differ in how clever their optimizer is at detecting index range scans, and accurately estimating their cost. And not all indices are B-trees. You'll have to read the docs for your specific DBMS to get the real info.
No. The resultant index will be a single index with a compound key.
KeyX = A,B,C,D; KeyY = 1,2,3,4;
An index on (KeyX, KeyY) will actually contain: A1,A2,A3,B1,B3,C3,C4,D2
So when you need to find something by both KeyX and KeyY, the lookup will be fast and will use a single index. Something like SELECT ... WHERE KeyX = 'B' AND KeyY = 3.
But it's important to understand: WHERE KeyX = ? queries can seek on that index, while WHERE KeyY = ? queries cannot seek on it at all (at best, the engine scans the whole index).
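Here is a rough sketch of why the leading column matters, using a sorted Python list as a stand-in for the B-tree. The function names are hypothetical and the entries are the ones listed above; a real engine works on pages rather than a flat list, so treat this as an analogy only.

```python
import bisect

# Index entries (KeyX, KeyY), kept in index order.
index = sorted([("A", 1), ("A", 2), ("A", 3), ("B", 1),
                ("B", 3), ("C", 3), ("C", 4), ("D", 2)])

def seek_prefix(index, key_x):
    """WHERE KeyX = ?: binary search finds the contiguous run of matching entries."""
    lo = bisect.bisect_left(index, (key_x,))
    hi = bisect.bisect_right(index, (key_x, float("inf")))
    return index[lo:hi]

def scan_second_column(index, key_y):
    """WHERE KeyY = ?: matches are scattered, so every entry has to be inspected."""
    return [entry for entry in index if entry[1] == key_y]

print(seek_prefix(index, "B"))
print(scan_second_column(index, 3))
```

The first lookup touches only a small slice of the list; the second one has to walk all of it, because entries with KeyY = 3 are not stored next to each other.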

SQL Server - Query String Greater Or Equal To

I am attempting to optimise a query in my application that is causing problems when scaling my application.
The table contains two columns: FROM and TO which each contain values. Here is an example:
Row | From | To
1 | AA | Z
2 | B | C
3 | JA | JZ
4 | JM | JZ
The query is passed a name (JOHN) and should return a list of ranges from the table that could contain the name.
select * from Ranges where [From] <= 'JOHN' and [To] >= 'JOHN'
Using the table above this would result in rows 1, 3 and 4 being returned (JM sorts before JOHN, so row 4 qualifies as well).
The problem I am having is one of query consistency.
All indexes are in place but if I search for JOHN the query returns in 20 milliseconds, whereas MARK returns in 250 milliseconds.
Looking at query analyzer shows me that JOHN is actually searching for more rows than MARK but I'm struggling to understand how or why MARK takes so long.
If the time difference was 20 - 40 milliseconds, I could live with that but 250 is so large a difference that the overall performance of my application is terrible.
Does anybody have any idea how I could narrow down why I get such variance in my queries, OR know a better way of storing and searching for string ranges (which could contain letters and numbers)?
Many thanks in advance.
EDIT - One thing I forgot to mention was that the original table contains approximately 15 million rows (it's actually postcodes).
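For reference, the range predicate from the query can be checked against the four sample rows in Python. This assumes plain ordinal, same-case string comparison, which may differ from the collation of the real table. Under that assumption row 4 also contains JOHN, because 'JM' sorts before 'JOHN'.

```python
# (row, range_from, range_to) taken from the sample table.
ranges = [(1, "AA", "Z"), (2, "B", "C"), (3, "JA", "JZ"), (4, "JM", "JZ")]

def containing(name):
    """Rows whose [from, to] range contains the name: from <= name AND to >= name."""
    return [row for row, lo, hi in ranges if lo <= name <= hi]

print(containing("JOHN"))
print(containing("MARK"))
```

The two bounds are independent inequalities, so an index on From alone can only narrow the search to "everything that starts at or before the name", and the size of that set depends heavily on which name is searched.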

Why does Postgres not use better index for my query?

I have a table where I keep record of who is following whom on a Twitter-like application:
\d follow
Table "public.follow"
Column | Type | Modifiers
---------+--------------------------+-----------------------------------------------------
xid | text |
followee | integer |
follower | integer |
id | integer | not null default nextval('follow_id_seq'::regclass)
createdAt | timestamp with time zone |
updatedAt | timestamp with time zone |
source | text |
Indexes:
"follow_pkey" PRIMARY KEY, btree (id)
"follow_uniq_users" UNIQUE CONSTRAINT, btree (follower, followee)
"follow_createdat_idx" btree ("createdAt")
"follow_followee_idx" btree (followee)
"follow_follower_idx" btree (follower)
Number of entries in table is more than a million and when I run explain analyze on the query I get this:
explain analyze SELECT "follow"."follower"
FROM "public"."follow" AS "follow"
WHERE "follow"."followee" = 6
ORDER BY "follow"."createdAt" DESC
LIMIT 15 OFFSET 0;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.43..353.69 rows=15 width=12) (actual time=5.456..21.497 rows=15 loops=1)
-> Index Scan Backward using follow_createdat_idx on follow (cost=0.43..61585.45 rows=2615 width=12) (actual time=5.455..21.488 rows=15 loops=1)
Filter: (followee = 6)
Rows Removed by Filter: 62368
Planning time: 0.068 ms
Execution time: 21.516 ms
Why is it doing a backward index scan on follow_createdat_idx, when execution could have been faster had it used follow_followee_idx?
This query takes around 33 ms on the first run and around 22 ms on subsequent calls, which I feel is on the high side.
I am using Postgres 9.5 provided by Amazon RDS. Any idea what could be going wrong here?
The multicolumn index on (followee, "createdAt") that user1937198 suggested is perfect for the query - as you found in your test already.
Since "createdAt" can be NULL (not defined NOT NULL), you may want to add NULLS LAST to query and index:
...
ORDER BY "follow"."createdAt" DESC NULLS LAST
And:
"follow_followee_createdat_idx" btree (followee, "createdAt" DESC NULLS LAST)
More:
PostgreSQL sort by datetime asc, null first?
There are other minor performance implications:
The multicolumn index on (followee, "createdAt") is 8 bytes per row bigger than the simple index on (followee) - 44 bytes vs 36. More (btree indexes have mostly the same page layout as tables):
Making sense of Postgres row sizes
Columns involved in an index in any way cannot be changed with a HOT update. Adding more columns to an index might block this optimization - though an update to this column seems particularly unlikely given its name. And since you have another index on just ("createdAt"), that's not an issue anyway. More:
PostgreSQL Initial Database Size
There is no downside to keeping the other index on just ("createdAt"), other than the maintenance cost of each index (which affects write performance, not read performance). The two indexes support different queries. You may or may not need the index on just ("createdAt") in addition. Detailed explanation:
Is a composite index also good for queries on the first field?
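To make the difference between the two plans concrete, here is a small Python simulation. It is not PostgreSQL code: the data is random, the function names are made up, and sorted lists stand in for the btree indexes. It assumes the multicolumn index leads with the column from the WHERE clause (followee in the query as posted) followed by the timestamp.

```python
import bisect
import random

random.seed(0)
# (followee, created_at) pairs; created_at is unique and increasing.
rows = [(random.randrange(1, 400), ts) for ts in range(50_000)]

def scan_created_at_backward(rows, followee, limit):
    """Walk the created_at index newest-first and filter, like the plan in the question."""
    hits, removed = [], 0
    for f, ts in sorted(rows, key=lambda r: r[1], reverse=True):
        if f == followee:
            hits.append(ts)
            if len(hits) == limit:
                break
        else:
            removed += 1  # corresponds to "Rows Removed by Filter"
    return hits, removed

def seek_composite(rows, followee, limit):
    """Use an index ordered by (followee, created_at): the matches are contiguous."""
    index = sorted(rows)
    lo = bisect.bisect_left(index, (followee,))
    hi = bisect.bisect_left(index, (followee + 1,))
    newest = index[max(lo, hi - limit):hi][::-1]
    return [ts for _, ts in newest]

hits, removed = scan_created_at_backward(rows, 6, 15)
print(removed, "rows discarded by the filter")
print(hits == seek_composite(rows, 6, 15))
```

Both strategies return the same 15 rows, but the first one discards many non-matching entries along the way, while the second reads only the slice it needs.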

Another way to build database structure

I have to optimize my little-big database because it's too slow; maybe we'll find another solution together.
First of all, let's talk about the data stored in the database. There are two objects: users and, let's say, messages.
Users
There is something like that:
+----+---------+-------+-----+
| id | user_id | login | etc |
+----+---------+-------+-----+
| 1 | 100001 | A | ....|
| 2 | 100002 | B | ....|
| 3 | 100003 | C | ....|
|... | ...... | ... | ....|
+----+---------+-------+-----+
There is no problem with this table. (Don't be afraid of id and user_id; user_id is used by another application, so it has to be here.)
Messages
The second table has some problems. Each user has, for example, messages like this:
+----+---------+------+----+
| id | user_id | from | to |
+----+---------+------+----+
| 1 | 1 | aab | bbc|
| 2 | 2 | vfd | gfg|
| 3 | 1 | aab | bbc|
| 4 | 1 | fge | gfg|
| 5 | 3 | aab | gdf|
|... | ...... | ... | ...|
+----+---------+------+----+
There is no need to edit messages, but it should be possible to update the list of messages for a user. For example, an external service sends all of a user's messages to the db and the list has to be updated.
The most important thing is that there are about 30 million users and the average user has 500+ messages. Another problem is that I have to search through the field from and count the number of matches. I designed a simple SQL query with a join, but it takes too much time to get the data.
So... it's quite a big amount of data. I decided not to use RDS (I used PostgreSQL) and to move to a database like ClickHouse.
However, I ran into the problem that ClickHouse, for example, doesn't support the UPDATE statement.
To resolve this issue I decided to store messages as one row. So the table Messages should look like this:
Here I'd like to store messages in JSON format
{"from":"aaa", "to":"bbe"}
{"from":"ret", "to":"fdd"}
{"from":"gfd", "to":"dgf"}
||
\/
+----+---------+----------+------+ And there I'd like to store the
| id | user_id | messages | hash | <= hash of the messages.
+----+---------+----------+------+
I think that full-text search inside the messages column will save some time resources and so on.
Do you have any ideas? :)
In ClickHouse, the most efficient approach is to store data in one "big flat table".
So, you store every message in a separate row.
15 billion rows is OK for ClickHouse, even on a single node.
Also, it's reasonable to keep each user's attributes directly in the messages table (pre-joined), so you don't need to do JOINs. This is suitable if user attributes are not updated.
These attributes will have repeated values for each of a user's messages - that's OK because ClickHouse compresses data well, especially repeated values.
If users' attributes are updated, consider storing the users table in a separate database and using the 'External dictionaries' feature to join it.
If a message is updated, just don't update it in place. Write another row with the modified message to the table instead and leave the old message as is.
It's important to have the right primary key for your table. You should use a table from the MergeTree family, which constantly reorders data by primary key and so keeps range queries efficient. The primary key is not required to be unique; for example, you could define the primary key as just (from) if you frequently filter with "from = ..." and those queries must be processed quickly.
You could also use user_id as the primary key if queries by user id are frequent and must be processed as fast as possible, but then queries with a predicate on 'from' will scan the whole table (mind that ClickHouse does full scans efficiently).
If you need fast lookups by many different attributes, you could just duplicate the table with different primary keys. Typically the table compresses well enough that you can afford to keep the data in a few copies with different orderings for different range queries.
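The "write another row instead of updating" idea can be sketched in Python like this. The message_id and version fields are assumptions added for the illustration (the tables in the question have no version column), and the list stands in for the table; ClickHouse's ReplacingMergeTree engine is built around a similar idea, but this is not ClickHouse code.

```python
from collections import Counter

log = []  # append-only: rows are added, never modified

def write(message_id, user_id, sender, recipient, version):
    """Append a new version of a message instead of updating the old row."""
    log.append({"message_id": message_id, "user_id": user_id,
                "from": sender, "to": recipient, "version": version})

def latest(rows):
    """Keep only the highest version of each message at read time."""
    newest = {}
    for row in rows:
        current = newest.get(row["message_id"])
        if current is None or row["version"] > current["version"]:
            newest[row["message_id"]] = row
    return list(newest.values())

write(1, 1, "aab", "bbc", version=1)
write(2, 2, "vfd", "gfg", version=1)
write(3, 1, "aab", "bbc", version=1)
write(3, 1, "fge", "bbc", version=2)  # "update" of message 3: a new row, old one kept

# Count matches on the from field, using only the newest version of each message.
matches = Counter(row["from"] for row in latest(log))
print(len(log), dict(matches))
```

All four rows stay in the log, but the count only sees the newest version of message 3.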
First of all, with such a big dataset, the from and to columns should be integers if possible, as comparing them is faster.
Second, you should consider creating proper indexes. As each user has relatively few records (about 500 out of roughly 15 billion in total), it should give you a huge performance benefit.
If everything else fails, consider using partitions:
https://www.postgresql.org/docs/9.1/static/ddl-partitioning.html
In your case they would be dynamic and would slow down first-time inserts immensely, so I would consider them only as a last resort, albeit a very efficient one.

PostgreSQL multidimensional array search

I am a newbie to PostgreSQL and was trying it out.
I have created a simple table:
CREATE TABLE items_tags (
    ut_id SERIAL PRIMARY KEY,
    item_id integer,
    item_tags_weights text[]
);
where:
item_id - ID of the item these tags are associated with
item_tags_weights - Tags associated with the item, including weight
Example entry:
--------------------
ut_id | item_id | item_tags_weights
---------+---------+-------------------------------------------------------------------------------------------------------------------------------
3 | 2 | {{D,1},{B,9},{W,3},{R,18},{F,9},{L,15},{G,12},{T,17},{0,3},{I,7},{E,14},{S,2},{O,5},{M,4},{V,3},{H,2},{X,14},{Q,9},{U,6},{P,16},{N,11},{J,1},{A,12},{Y,15},{C,15},{K,4},{Z,17}}
1000003 | 3 | {{Q,4},{T,19},{P,15},{M,14},{O,20},{S,3},{0,6},{Z,6},{F,4},{U,13},{E,18},{B,14},{V,14},{X,10},{K,18},{N,17},{R,14},{J,12},{L,15},{Y,3},{D,20},{I,18},{H,20},{W,15},{G,7},{A,11},{C,14}}
4 | 4 | {{Q,2},{W,7},{A,6},{T,19},{P,8},{E,10},{Y,19},{N,11},{Z,13},{U,19},{J,3},{O,1},{C,2},{L,7},{V,2},{H,12},{G,19},{K,15},{D,7},{B,4},{M,9},{X,6},{R,14},{0,9},{I,10},{F,12},{S,11}}
5 | 5 | {{M,9},{B,3},{I,6},{L,12},{J,2},{Y,7},{K,17},{W,6},{R,7},{V,1},{0,12},{N,13},{Q,2},{G,14},{C,2},{S,6},{O,19},{P,19},{F,4},{U,11},{Z,17},{T,3},{E,10},{D,2},{X,18},{H,2},{A,2}}
(4 rows)
where:
{D,1} - D = tag, 1 = tag weight
Well, I just want to list the item_ids where tag = 'U', ordered by tag weight.
One way is to select ALL the tags from the database, do the processing and sorting in a high-level language, and use the result set.
For this, I can do the following:
1) SELECT * FROM items_tags WHERE 'U' = ANY (item_tags_weights)
2) Extract and sort the information and display it.
But considering that multiple items can be associated with a single tag, and assuming 10 million entries, this method will surely be sluggish.
Any idea how to produce the list as needed with CREATE FUNCTION or similar?
Any pointers will be helpful.
Many thanks.
Have you considered normalization, i.e. moving the array field into another table? Apart from being easy to query and extend, it's likely to have better performance on larger databases.
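A minimal sketch of what the normalized layout could look like, using Python's built-in sqlite3 module as a stand-in for PostgreSQL so that it runs anywhere. The table name item_tag, its columns and the index name are hypothetical; the sample rows are a few of the tag/weight pairs from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per (item, tag) pair instead of one array per item.
conn.execute("""
    CREATE TABLE item_tag (
        item_id INTEGER NOT NULL,
        tag     TEXT    NOT NULL,
        weight  INTEGER NOT NULL,
        PRIMARY KEY (item_id, tag)
    )
""")
# Index that matches the lookup: filter by tag, order by weight.
conn.execute("CREATE INDEX item_tag_tag_weight_idx ON item_tag (tag, weight DESC)")

conn.executemany(
    "INSERT INTO item_tag (item_id, tag, weight) VALUES (?, ?, ?)",
    [(2, "U", 6), (2, "D", 1), (3, "U", 13), (3, "Q", 4),
     (4, "U", 19), (4, "Q", 2), (5, "U", 11), (5, "M", 9)],
)

# Items tagged 'U', heaviest weight first.
result = conn.execute(
    "SELECT item_id, weight FROM item_tag WHERE tag = ? ORDER BY weight DESC",
    ("U",),
).fetchall()
print(result)
```

With this layout the filtering and sorting happen in the database, and the weight is stored as an integer rather than as text inside an array.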

Resources