Cassandra, implementing high-cardinality indexes

It's well known that Cassandra is great with low-cardinality indexes and not so good with high-cardinality ones. My column family contains a field storing a URL value.
Naturally, searching for a specific URL in a big dataset can be slow.
As a solution, I've come up with the idea of taking the first characters of the URL and storing them
in a separate column, e.g. test.com/abcd would be stored as the pair (te, test.com/abcd).
That way, when a search for a specific URL needs to be done, I can narrow the candidate set down by a factor of 26*26 by looking up "te" first and only then searching for the exact URL in the resulting set.
Does this look like a workable solution for reducing URL cardinality in Cassandra?

If you need this to be really fast, you probably want to consider having a separate table with the value that you are searching for as the column key. Key prefix searches are usually faster than column searches in BigTable implementations.

A problem with that is that a sequential scan still has to follow the low-cardinality index lookup in order to finally arrive at the one specific URL queried.
As Chris Shain mentioned, you can create a separate column family to serve as an inverted index:
Column Family 'people'
ssn  | name | url
-----+------+------------------------
1234 | foo  | http://example.com/1234
5678 | bar  | http://hello.com/world
Column Family 'urls'
url                     | ssn
------------------------+------
http://example.com/1234 | 1234
http://hello.com/world  | 5678
The downside is that you need to maintain the integrity of your manual index yourself.
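In modern CQL terms, a minimal sketch of that inverted index might look like this (table and column names are illustrative, not from the original answer):

CREATE TABLE people (
    ssn  text PRIMARY KEY,
    name text,
    url  text
);

CREATE TABLE urls (
    url text PRIMARY KEY,
    ssn text
);

-- An exact-match lookup on the inverted index hits a single partition:
SELECT ssn FROM urls WHERE url = 'http://example.com/1234';

Every write to people then needs a matching write to urls (and likewise for deletes), which is exactly the integrity-maintenance cost mentioned above.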


Another way to build database structure

I have to optimize my little big database, because it's too slow; maybe we'll find another solution together.
First of all, let's talk about the data stored in the database. There are two objects: users and, let's say, messages.
Users
There is something like that:
+----+---------+-------+-----+
| id | user_id | login | etc |
+----+---------+-------+-----+
| 1 | 100001 | A | ....|
| 2 | 100002 | B | ....|
| 3 | 100003 | C | ....|
|... | ...... | ... | ....|
+----+---------+-------+-----+
There is no problem with this table. (Don't be afraid of id and user_id: user_id is used by another application, so it has to be here.)
Messages
The second table is the one with a problem. Each user has messages like this, for example:
+----+---------+------+----+
| id | user_id | from | to |
+----+---------+------+----+
| 1 | 1 | aab | bbc|
| 2 | 2 | vfd | gfg|
| 3 | 1 | aab | bbc|
| 4 | 1 | fge | gfg|
| 5 | 3 | aab | gdf|
|... | ...... | ... | ...|
+----+---------+------+----+
There is no need to edit messages, but there should be a way to update the list of messages for a user. For example, an external service sends all of a user's messages to the DB and the list has to be updated.
And the most important thing is that there are about 30 million users and the average user has 500+ messages. Another problem is that I have to search through the from field and count the number of matches. I designed a simple SQL query with a join, but it takes too much time to return the data.
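The query looked roughly like this (a sketch reconstructed from the description above, not the actual query):

SELECT u.login, COUNT(*) AS matches
FROM users u
JOIN messages m ON m.user_id = u.id
WHERE m."from" = 'aab'
GROUP BY u.login;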
So... it's quite a big amount of data. I decided to stop using a relational DB (I used PostgreSQL) and to move to databases like ClickHouse and the like.
However, I ran into the problem that ClickHouse, for example, doesn't support the UPDATE statement.
To resolve this issue I decided to store all of a user's messages as one row, so the Messages table would look like this:
Here I'd like to store the messages in JSON format:
{"from":"aaa", "to":"bbe"}
{"from":"ret", "to":"fdd"}
{"from":"gfd", "to":"dgf"}
        ||
        \/
+----+---------+----------+------+
| id | user_id | messages | hash |  <= hash of the messages
+----+---------+----------+------+
I think that a full-text search inside the messages column will save some time, resources, and so on.
Do you have any ideas? :)
In ClickHouse, the optimal way is to store data in a "big flat table".
So, you store every message in a separate row.
15 billion rows is Ok for ClickHouse, even on single node.
Also, it's reasonable to have each user's attributes directly in the messages table (pre-joined), so you don't need to do JOINs. This is suitable if the user attributes are not updated.
These attributes will have repeated values for each of a user's messages - that's OK, because ClickHouse compresses data well, especially repeated values.
If users' attributes are updated, consider storing the users table in a separate database and using the 'external dictionaries' feature to join it.
If a message is updated, just don't update it in place. Write another row with the modified message to the table instead, and leave the old message as is.
It's important to have the right primary key for your table. You should use a table from the MergeTree family, which constantly reorders data by primary key and so maintains the efficiency of range queries. The primary key is not required to be unique; for example, you could define the primary key as just (from) if you frequently filter on from = ... and those queries must be processed in a short time.
Alternatively, you could use user_id as the primary key, if queries by user id are frequent and must be processed as fast as possible - but then queries with a predicate on from will scan the whole table (mind that ClickHouse does full scans efficiently).
If you need fast lookups by many different attributes, you can simply duplicate the table with different primary keys. Typically the table will compress well enough that you can afford to keep the data in a few copies, ordered differently for different range queries.
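A minimal sketch in current ClickHouse syntax (msg_from/msg_to stand in for from/to, which are reserved words; the types and the particular sort key are assumptions):

CREATE TABLE messages
(
    user_id  UInt32,
    msg_from String,
    msg_to   String
)
ENGINE = MergeTree
ORDER BY (user_id, msg_from);

-- Counting matches for one sender within one user is then a range read on the sort key:
SELECT count() FROM messages WHERE user_id = 1 AND msg_from = 'aab';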
First of all, with such a big dataset the from and to columns should be integers if possible, as integer comparisons are faster.
Second, you should consider creating proper indexes. As each user has relatively few records (500 compared to 30M in total), it should give you a huge performance benefit.
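For example (column names as in the question; "from" has to be quoted because it is a reserved word in SQL):

CREATE INDEX messages_user_id_idx ON messages (user_id);
CREATE INDEX messages_from_idx ON messages ("from");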
If everything else fails, consider using partitions:
https://www.postgresql.org/docs/9.1/static/ddl-partitioning.html
In your case they would be dynamic and would hinder first-time inserts immensely, so I would consider them only as a last, if very efficient, resort.

How to store different types of address data in db?

I have to create a database combining 4 types of xls files, for example A, B, C and D. Every year a new file is created, starting from 2004. A has 7 sheets with 800-1000 rows each; B through D have one sheet with at most 200 rows.
Everyone knows that people are lazy, so in the Excel files the address data is stored differently in each sheet. One of them, from 2008, has the address data stored in separate fields, but all the other sheets have this data combined into one column.
Sooo, here is the question - how should I design the table? Something like this?
+---------+----------+----------+-------------+--------------------------------+
| Street | House Nr | City | Postal Code | Combined Address |
+---------+----------+----------+-------------+--------------------------------+
| Street1 | 20 | Somwhere | 00-000 | null |
| Street2 | 98 | Elswhere | 99-999 | null |
| null | null | null | null | Somwhere 00-000, street3 29 |
| null | null | null | null | st. Street2 65 12-345 Elswhere |
+---------+----------+----------+-------------+--------------------------------+
There will be a lot of nulls, so maybe the best solution would be 2 different tables?
The most important thing is that users will search using this data and will, in the future, add data to the database directly, without Excel files.
There are at least two different angles of view here: Normalization and efficiency, leading to different results.
Normalization
If this is the most important criterion, you would even make three tables. Obviously Combined Address needs a place of its own, but Postal Code and City also have to be stored in another table, because there is a dependency between them. Just one of the two, most probably Postal Code, will stay. (Yes, there is even something to be said about Street and Postal Code too, but I'm clearly not going to be that pedantic.)
Efficiency
Normalization as an end in itself doesn't necessarily produce the best result. If you permit yourself to be a bit sloppy about it and leave things the way they are in the model you posted, the coding might become easier. You could use a trigger to make sure Combined Address is never null, or use a (materialized) view that pretends it is, and just search in Combined Address for the time being.
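A sketch of such a view, assuming the column names from the table above (the table name, the id column, and text types for the parts are assumptions):

CREATE VIEW address_search AS
SELECT id,
       COALESCE(combined_address,
                street || ' ' || house_nr || ', ' || postal_code || ' ' || city)
           AS combined_address
FROM addresses;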
Imagine the effort if you use different tables and there is a need for referencing these addresses in other tables (Which table to use when? How to provide a unique id? Clearly a problem.).
So, decide on what's more important.
If I'm not mistaken we are talking about some 2000 rows, or some 8000 rows if it is actually '7 sheets with 800-1000 rows each'. Even if the latter applies, this isn't a number that makes data correction impracticable. If the number of different input patterns in the combined column is low, you might be able to do this (partly) automatically and just have someone proof-read the result.
So you might want to think about a future redesign as well and choose what's more convenient in this case.

Can anyone suggest a method of versioning ATTRIBUTE (rather than OBJECT) data in DB

Taking MySQL as an example DB to perform this in (although I'm not restricted to Relational flavours at this stage) and Java style syntax for model / db interaction.
I'd like the ability to allow versioning of individual column values (and their corresponding types) as and when users edit objects. This is primarily in an attempt to drop the amount of storage required for frequent edits of complex objects.
A simple example might be
- Food (Table)
- id (INT)
- name (VARCHAR(255))
- weight (DECIMAL)
So we could insert an object into the database that looks like...
Food banana = new Food("Banana",0.3);
giving us
+----+--------+--------+
| id | name | weight |
+----+--------+--------+
| 1 | Banana | 0.3 |
+----+--------+--------+
if we then want to update the weight we might use
banana.weight = 0.4;
banana.save();
+----+--------+--------+
| id | name | weight |
+----+--------+--------+
| 1 | Banana | 0.4 |
+----+--------+--------+
Obviously though this is going to overwrite the data.
I could add a revision column to this table, which could be incremented as items are saved, and set a composite key combining id and revision, but this would still mean storing ALL attributes of the object for every single revision:
- Food (Table)
- id (INT)
- name (VARCHAR(255))
- weight (DECIMAL)
- revision (INT)
+----+--------+--------+----------+
| id | name | weight | revision |
+----+--------+--------+----------+
| 1 | Banana | 0.3 | 1 |
| 1 | Banana | 0.4 | 2 |
+----+--------+--------+----------+
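In MySQL terms, that revision-keyed variant might be declared like this (a sketch; the DECIMAL precision is an assumption):

CREATE TABLE food (
    id       INT NOT NULL,
    name     VARCHAR(255),
    weight   DECIMAL(10,2),
    revision INT NOT NULL,
    PRIMARY KEY (id, revision)
);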
But in this instance we're going to be storing every single piece of data about every single item. This isn't massively efficient if users are making minor revisions to larger objects where Text fields or even BLOB data may be part of the object.
What I'd really like is the ability to store data selectively and discretely, so the weight could possibly be saved in a separate table (or DB) in its own right, referencing the table, row and column it relates to.
This could then be smashed together with a VIEW of the table that imposes any later revisions of individual column data onto the mix to create the latest version, but without the need to store ALL the data for each small revision.
+----+--------+--------+
| id | name | weight |
+----+--------+--------+
| 1 | Banana | 0.3 |
+----+--------+--------+
+-----+------------+-------------+-----------+-----------+----------+
| ID | TABLE_NAME | COLUMN_NAME | OBJECT_ID | BLOB_DATA | REVISION |
+-----+------------+-------------+-----------+-----------+----------+
| 456 | Food | weight | 1 | 0.4 | 2 |
+-----+------------+-------------+-----------+-----------+----------+
I'm not sure how successful storing any data as a BLOB and then CASTing it back to its original type might be, but since I was inventing functionality here anyway, why not go nuts.
This method of storage would also be fairly dangerous, since table and column names are entirely subject to change, but hopefully it at least outlines the sort of behaviour I'm thinking of.
A table in 6NF has one CK (candidate key) (in SQL, a PK) and at most one other column. Essentially, 6NF allows each pre-6NF table column's update time/version and value to be recorded in an anomaly-free way. You decompose a table by dropping a non-prime column while adding a table holding that column plus the old CK's columns. For temporal/versioning applications you further add a time/version column, and the new CK is the old one plus it.
Adding a column for a time/whatever interval (in SQL, start-time and end-time columns) instead of a single time to a CK allows a kind of data compression, by recording the longest uninterrupted stretches of time (or of another dimension) through which a column had the same value. One queries by an original CK plus the time whose value you want. You don't need this for your purposes, but the initial process of normalizing to 6NF and the addition of a time/whatever column should be explained in temporal tutorials.
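Applied to the Food example from the question, a 6NF-style decomposition with a version column might look like this (a sketch; table names and types are illustrative):

CREATE TABLE food (
    id INT NOT NULL PRIMARY KEY
);

CREATE TABLE food_name (
    id       INT NOT NULL,
    revision INT NOT NULL,
    name     VARCHAR(255) NOT NULL,
    PRIMARY KEY (id, revision)
);

CREATE TABLE food_weight (
    id       INT NOT NULL,
    revision INT NOT NULL,
    weight   DECIMAL(10,2) NOT NULL,
    PRIMARY KEY (id, revision)
);

A new weight then inserts a single row into food_weight, leaving the name history untouched.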
Read about temporal databases (which deal both with "valid" data that is times and time intervals but also "transaction" times/versions of database updates) and 6NF and its role in them. (Snodgrass/TSQL2 is bad, Date/Darwen/Lorentzos is good and SQL is problematic.)
Your final suggested table is an example of EAV. This is usually an anti-pattern. It encodes a database into one or more tables that are effectively metadata, but since the DBMS doesn't know that, you lose much of its functionality. EAV is not called for if DDL is sufficient to manage the tables and columns that you need. Just declare appropriate tables in each database (which is really one database, since you expect transactions affecting both). From that link:
You are using a DBMS anti-pattern EAV. You are (trying to) build part of a DBMS into your program + database. The DBMS already exists to manage data and metadata. Use it.
Do not have a class/table of metadata. Just have attributes of movies be fields/columns of Movies.
The notion that one needs to use EAV "so every entity type can be extended with custom fields" is mistaken. Just implement via calls that update metadata tables sometimes instead of just updating regular tables: DDL instead of DML.
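Concretely, the contrast looks something like this (hypothetical names; the EAV row mirrors the revision table sketched in the question):

-- EAV style (DML): the DBMS cannot type-check, constrain, or index this sensibly
INSERT INTO attribute_revisions (table_name, column_name, object_id, blob_data, revision)
VALUES ('Food', 'calories', 1, '89', 1);

-- DDL instead: let the DBMS manage the new attribute as a real column
ALTER TABLE food ADD COLUMN calories INT;
UPDATE food SET calories = 89 WHERE id = 1;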

Relevance and Solr Grouping

Say I have the following collection of webpages in a Solr index:
+-----+----------+----------------+--------------+
| ID | Domain | Path | Content |
+-----+----------+----------------+--------------+
| 1 | 1.com | /hello1.html | Hello dude |
| 2 | 1.com | /hello2.html | Hello man |
| 3 | 1.com | /hello3.html | Hello fella |
| 4 | 2.com | /hello1.html | Hello sir |
...
And I want a query for hello to show results grouped by domain like:
Results from 1.com:
/hello1.html
/hello2.html
/hello3.html
Results from 2.com:
/hello1.html
How is the ordering determined if I sort by score? I normally use a combination of TF/IDF and PageRank for my results, but since that calculates a score for each individual item, how does it determine how to order the groups? What if 1.com/hello3.html and 1.com/hello2.html have very low relevance but there are two of them, while 2.com/hello1.html has really high relevance but is only one result? Or vice versa? Or is relevance summed when there are multiple items in a grouping field?
I've looked around, but haven't been able to find a good answer to this.
Thanks.
It sounds to me like you are using Result Grouping. If that's the case, then the groups are sorted according to the sort parameter, and the records within each group are sorted according to the group.sort parameter. If you sort the groups by sort=score desc (this is the default, so you wouldn't actually need to specify it), then it sorts the groups according to the score of each group. How this score is determined isn't made very clear, but if you look through the examples in the linked documentation you can see this statement:
The groups are sorted by the score of the top document within each group.
So, in your example, if 2.com's hello1.html was the most relevant document in your result set, "Results from 2.com" would be your most relevant group even though "Results from 1.com" includes three times the document count.
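For reference, the request being described would use parameters along these lines (field names are illustrative, following the table above; sort=score desc is the default and is shown only for clarity):

q=content:hello
group=true
group.field=domain
sort=score desc
group.sort=score desc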
If this isn't what you want, your best options are to provide a different sort parameter or result post-processing. For example, for one project I was involved in, (where we had a very modest number of groups,) we chose to pull the top three results for each group and in post processing we calculated our own sort order for the groups based on the combination of their scores and numFound values. This sort of strategy might have been prohibitive for cases with too many groups, and may not be a good idea if the more numerous groups run the risk of making the most relevant documents harder to find.

Which normal form does this table violate?

Consider this table:
+-------+-------+-------+-------+
| name |hobby1 |hobby2 |hobby3 |
+-------+-------+-------+-------+
| kris | ball | swim | dance |
| james | eat | sing | sleep |
| amy | swim | eat | watch |
+-------+-------+-------+-------+
There is no priority among the hobbies; they all belong to the same domain. That is, a given hobby can appear in any of the hobby# columns - it doesn't matter which column a particular hobby is in.
Which database normalization rule does this table violate?
Edit
Q. Is "the list of hobbies [...] in an arbitrary order"?
A. Yes.
Q. Does the table have a primary key?
A. Yes, suppose the key is an AUTO_INCREMENT column type named user_id.
The question is whether the hobby# columns are repeating groups or not.
Sidenote: This is not homework. It's a debate that started in the comments of the question SQL - match records from one table to another table based on several columns. I believe this question is a clear example of a 1NF violation.
However, the other guy believes that I "have fallen foul of one of the fallacies of 1NF." That argument is based on the section "The ambiguity of Repeating Groups" of the article Facts and Fallacies about First Normal Form.
I am not writing this to humiliate him, me, or anyone else. I am writing this because I might be wrong; there may be something I am clearly missing, and maybe this guy is just not explaining it well enough to me.
You say that the hobbies belong to the same domain and that they can move around in the columns. If by this you mean that for any specific name the list of hobbies is in an arbitrary order, and kris could just as easily have dance, ball, swim as ball, swim, dance, then I would say you have a repeating group and the table violates 1NF.
If, on the other hand, there is some fundamental semantic difference between a particular person's first and second hobbies, then there may be an argument for saying that the hobbies are not repeating groups and the table may be in 3NF (assuming that hobby columns are FK to a hobby table). I would suggest that this argument, if it exists, is weak.
One other factor to consider is why there are precisely 3 hobbies, and whether more or fewer hobbies are a potential concern. This factor matters not so much for normalization as for flexibility of design. It is one reason I would split the hobbies into rows, even if they are semantically different from one another.
Your three-hobby table design probably violates what I usually call the spirit of the original 1NF (likely for the reasons given by dportas and others).
It turns out, however, that it is extremely difficult to find [a set of] formal and precise "measurable" criteria that accurately express that original "spirit". That's what the other guy was trying to explain when talking about "the ambiguity of repeating groups".
Stress "formal", "precise" and "measurable" here. Definitions exist for all the other normal forms that are formal, precise and measurable (i.e. objectively observable). For 1NF it's just hard (impossible?) to do. If you want to see why, try this:
You stated that the question was "whether those three hobby columns constitute a repeating group". Answer this question with "yes", and then provide a rigorous formal underpinning for your answer.
You cannot just say "the column names are the same, except for the numbered suffix". Making a violation of such a rule objectively observable/measurable would require to enumerate all the possible ways of suffixing.
You cannot just say "swim, tennis" could equally well be "tennis, swim", because getting to know that for sure requires inspecting the external predicate of the table. If that is just "person <name> has hobby <hobby1> and also has <hobby2>" , then indeed both are equally valid (aside : and due to the closed world assumption it would in fact require all possible permutations of the hobbies to be present in the table !!!). However, if that external predicate is "person <name> spends the most time on <hobby1> and the least on <hobby2>", then "swim, tennis" could NOT equally well be "tennis,swim". But how do you make such interpretations of the external predicate of the table objective (for ALL POSSIBLE PREDICATES) ???
etc. etc.
This clearly "looks" like a design error.
It's not a design error, though, when this data is simply stored and retrieved: if you need only 3 of the hobbies and you don't intend to use the data in any way other than retrieval.
Let's consider this relationship:
Hobby1 is the main hobby at some point in a person's life (before 18 years of age, for example)
Hobby2 is the hobby at another point (19-30)
Hobby3 is her hobby at yet another one.
Then this table seems definitely well designed; while the 1NF convention is respected, the naming arguably "sucks".
In the case of indiscriminate storage of hobbies this is clearly wrong in most if not all cases I can think of right now. Your table has repeating columns, which goes against 1NF principles.
Let's not consider the reduced efficiency of SQL requests to access data from this table when you need to sort the results for paging or any other practical reason.
Let's take into consideration the effort required to work with your data when your database will be used by another developer or team:
The data here is "scattered". You have to look in multiple columns to aggregate related data.
You are limited to only 3 of the hobbies.
You can't use simple rules to enforce uniqueness (the same hobby only once per user).
You basically create frustration, anger and hatred and the Force is disturbed.
Well,
The point is that, as long as all hobby1, hobby2 and hobby3 values are not null, AND names are unique, this table could be considered as more or less abiding by 1NF rules (see here for example ...).
But does everybody have 3 hobbies? Of course not! Do not forget that databases are basically supposed to hold data as a representation of reality! So, setting all theory aside, one cannot say that everybody has 3 hobbies - except if our table is meant to hold data about people who have exactly three hobbies with no preference among them!
This said, and supposing that we are in the general case, the correct model could be
+------------+-------+
| id_person |name |
+------------+-------+
for the persons (do not forget a unique key; I do not think 'name' is a good one),
+------------+-------+
| id_hobby |name |
+------------+-------+
for the hobbies. The id_hobby key is theoretically not mandatory, as the hobby name could be the key ...
+------------+-----------+
| id_person |id_hobby |
+------------+-----------+
for the link between persons and hobbies, as the physical representation of the many-to-many link that exists between persons and their hobbies.
My proposal is basic, and satisfies theory. It can be improved in many ways ...
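In MySQL, a minimal sketch of this model (names as above; the VARCHAR lengths are assumptions) might be:

CREATE TABLE person (
    id_person INT AUTO_INCREMENT PRIMARY KEY,
    name      VARCHAR(100) NOT NULL
);

CREATE TABLE hobby (
    id_hobby INT AUTO_INCREMENT PRIMARY KEY,
    name     VARCHAR(100) NOT NULL UNIQUE
);

CREATE TABLE person_hobby (
    id_person INT NOT NULL,
    id_hobby  INT NOT NULL,
    PRIMARY KEY (id_person, id_hobby),  -- also enforces "same hobby only once per person"
    FOREIGN KEY (id_person) REFERENCES person (id_person),
    FOREIGN KEY (id_hobby)  REFERENCES hobby (id_hobby)
);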
Without knowing what keys exist and what dependencies the table is supposed to satisfy it's impossible to determine for sure what Normal Form it satisfies. All we can do is make guesses based on your attribute names.
Does the table have a key? Suppose for the sake of example that Name is a candidate key. If there is exactly one value permitted for each of the other attributes for each tuple (which means that no attribute can be null) then the table is in at least First Normal Form.
If any of the columns in the table accept nulls, then the table violates first normal form. Assuming no nulls, dportas has already provided the correct answer.
The table does not violate first normal form.
First normal form does not have any prohibition against multiple columns of the same type. As long as they have distinct column names, it is fine.
The prohibition against "Repeating Groups" concerns nested records - a structure which is common in hierarchical databases, but typically not possible in relational databases.
The table using repeating groups would look something like this:
+-------+--------+
| name |hobbies |
+-------+--------+
| kris |+-----+ |
| ||ball | |
| |+-----+ |
| ||swim | |
| |+-----+ |
| ||dance| |
| |+-----+ |
+-------+--------+
| james |+-----+ |
| ||eat | |
| |+-----+ |
| ||sing | |
| |+-----+ |
| ||sleep| |
| |+-----+ |
+-------+--------+
| amy |+-----+ |
| ||swim | |
| |+-----+ |
| ||eat | |
| |+-----+ |
| ||watch| |
| |+-----+ |
+-------+--------+
In a table conforming to 1NF, all values can be located through table name, primary key, and column name. That is not possible with repeating groups, which require further navigation.
