Database Design - Need help scaling query - database

I'm trying to find the best data-structure/data store solution (highest performance) for the following request:
I have a list of attributes that I need to store for all individual in the US, for example:
+------------+-------+-------------+
| Attribute | Value | SSN |
+------------+-------+-------------+
| hair color | black | 123-45-6789 |
| eye color | brown | 123-45-6789 |
| height | 175 | 123-45-6789 |
| sex | M | 123-45-6789 |
| shoe size | 42 | 123-45-6789 |
As you can guess, with the general population, there are nothing unique and identifiable from those attributes.
However, let's assume that if we were to fetch from a combination of 3 or 4 attributes, then I would be able to uniquely identify a person (find their SSN).
Now here's the difficulties, the set of combinations that can uniquely identify a person will evolve over time and be adjusted.
What would be my best bet for storing and querying the data with the scenario mentioned above, that will remain highly performant (<100ms) at scale?
Current attempt with combining two attributes:
SELECT * FROM (SELECT * FROM people WHERE hair='black') p1
JOIN (SELECT * FROM people WHERE height=175) p2
ON p1.SSN = p2.SSN
But with a database with millions of rows, as you can guess.. NOT performant.
Thank you!

if the data store is not a constraint, I would use a DocumentDB, something like MongoDB, CosmosDB or even ElasticSearch.
With Mongo, for example, you could leverage its schemaless nature and have a collection of People with one property per "attribute":
{
"SSN": "123-45-6789",
"eyeColor": "brown",
"hairColor" "blond",
"sex": "M"
}
documents in this collection might have different properties, but it's not an issue. All you have to do now is to put an index on each one and run your queries.

Related

What is the name of the approach where you store the history in the same table?

I'm developing an application that uses a mysql database and we wanted to do an approach for history purposes, that we store the current state and the history in the same table for performance reasons (on updates the application doesn't have the id for an entity just a key pair, so it is easier just to insert a new row).
The table looks like this:
+------+-------+-----------+------------------------------+
| id |user_id| type |content |
+------+-------+-----------+------------------------------+
| 1 |'1-2-3'| position | *creation |
| 2 |'1-2-3'| position | *something_changed |
| 3 |'1-2-3'| device | *creation |
| 4 |'1-2-4'| position | *creation |
| 5 |'1-2-4'| device | *creation |
| 6 |'1-2-4'| device | *something_changed |
+------+-------+-----------+------------------------------+
Every entity is described with the user_id and type "key" pair, when something is changed in the entity a new row is inserted. The current state of an entity is selected by the highest id row from the group, which is grouped by the user_id and type. Performance wise the updates should be super fast and the selects can be slower, because those are not used often.
I would like to look up best practices and other people experiences with this method, but I don't know how to search for them. Can you help me? I'm interested in your experiences or opinions on this topic as well.
I know about Kafka and other streaming platforms, but that was sadly not an option for this.

Another way to build database structure

I have to optimize my little-big database, because it's too slow, maybe we'll find another solution together.
First of all let's talk about data that are stored in the database. There are two objects: users and let's say messages
Users
There is something like that:
+----+---------+-------+-----+
| id | user_id | login | etc |
+----+---------+-------+-----+
| 1 | 100001 | A | ....|
| 2 | 100002 | B | ....|
| 3 | 100003 | C | ....|
|... | ...... | ... | ....|
+----+---------+-------+-----+
There is no problem inside this table. (Don't afraid of id and user_id. user_id is used by another application, so it has to be here.)
Messages
And the second table has some problem. Each user has for example messages like this:
+----+---------+------+----+
| id | user_id | from | to |
+----+---------+------+----+
| 1 | 1 | aab | bbc|
| 2 | 2 | vfd | gfg|
| 3 | 1 | aab | bbc|
| 4 | 1 | fge | gfg|
| 5 | 3 | aab | gdf|
|... | ...... | ... | ...|
+----+---------+------+----+
There is no need to edit messages, but there should be an opportunity to updated the list of messages for the user. For example, an external service sends all user's messages to the db and the list has to be updated.
And the most important thing is that there are about 30 Mio of users and average user has 500+ of messages. Another problem that I have to search through the field from and calculate number of matches. I designed a simple SQL query with join, but it takes too much time to get the data.
So...it's quite big amount of data. I decided not to use RDS (I used Postgresql) and decided to move to databases like Clickhouse and so on.
However I faced with a problem that for example Clickhouse doesn't support UPDATE statement.
To resolve this issues I decided to store messages as one row. So the table Messages should be like this:
Here I'd like to store messages in JSON format
{"from":"aaa", "to":bbe"}
{"from":"ret", "to":fdd"}
{"from":"gfd", "to":dgf"}
||
\/
+----+---------+----------+------+ And there I'd like to store the
| id | user_id | messages | hash | <= hash of the messages.
+----+---------+----------+------+
I think that full-text search inside the messages column will save some time resources and so on.
Do you have any ideas? :)
In ClickHouse, the most optimal way is to store data in "big flat table".
So, you store every message in a separate row.
15 billion rows is Ok for ClickHouse, even on single node.
Also, it's reasonable to have each user attributes directly in messages table (pre-joined), so you don't need to do JOINs. It is suitable if user attributes are not updated.
These attributes will have repeated values for each users' message - it's Ok because ClickHouse compresses data well, especially repeated values.
If users' attributes are updated, consider to store users table in separate database and use 'External dictionaries' feature to join it.
If message is updated, just don't update it. Write another row with modified message to a table instead and leave old message as is.
Its important to have right primary key for your table. You should use table from MergeTree family, which constantly reorders data by primary key and so maintains efficiency of range queries. Primary key is not required to be unique, for example you could define primary key as just (from) if you would frequently write "from = ...", and if these queries must be processed in short time.
And you could use user_id as primary key: if queries by user id are frequent and must be processed as fast as possible, but then queries with predicate on 'from' will scan whole table (mind that ClickHouse do full scan efficiently).
If you need to fast lookup by many different attributes, you could just duplicate table with different primary keys. It's typically that table will be compressed well enough and you could afford to have data in few copies with different order for different range queries.
First of all, when we have such a big dataset, from and to columns should be integers, if possible, as their comparison is faster.
Second, you should consider creating proper indexes. As each user has relatively few records (500 compared to 30M in total), it should give you a huge performance benefit.
If everything else fails, consider using partitions:
https://www.postgresql.org/docs/9.1/static/ddl-partitioning.html
In your case they would be dynamic, and hinder first time inserts immensely, so I would consider them only as last, if very efficient, resort.

Database Design - how to store quantities that are measured in different ways

I would like to know if the database design i have in mind for an online food store is good according to the usually followed standards and conventions.
Basically the confusion i have is how to store items whose quantity is measured in different ways.
For example, there are items that are measured in terms of kilograms and then there are items measured in terms of number of packets.
For example rice is measured in kilograms and something like say, Noodles would be measured in terms of number of packets.
so the tables are planned to have below fields:
Items table with the fields: category,name,company,variant and a boolean variable named measured_in_packets?..
for items where measured_in_packets is set to true, an entry in another table will hold the available packet sizes:
packet_sizes table with item_id and packet_size..
so if one product is available in multiple packet sizes (250 gm, 500 gm etc), a row would be made for each available size against the item id...
does this sound like a good database design?
In a nutshell, you have items which have a quantity value, but that quantity value can be measured in different kinds of measurement types. You gave examples such as kilograms, packages, and we can perhaps add others such as litres for liquids, etc.
One of the problems with the current solution is that is doesn't allow for any easy alteration or expansion. It also relies on the checking of a boolean field in order to make decisions (such as which table to join I believe, based on your description).
Instead, a better approach would be to create a table containing the possible measurement types, such as kilograms or packets. Your items then simply have a foreign key to this table, and that tells you how the item is measured. This allows you to expand the types in the future, and no need to maintain a boolean flag, or do any other manual work.
This diagram illustrates what I'm referring to:
So if the data in these tables looked like this:
items
+----+---------+----------+----------------------+
| id | name | quantity | measurement_types_id |
+----+---------+----------+----------------------+
| 1 | Rice | 50 | 1 |
| 2 | Noodles | 75 | 2 |
+----+---------+----------+----------------------+
measurement_types
+----+-----------+--------------------+
| id | name | measurement_symbol |
+----+-----------+--------------------+
| 1 | Kilograms | kg |
| 2 | Packets | packets |
+----+-----------+--------------------+
A practical example of this data using the following query:
SELECT items.name, items.quantity, measurement_types.measurement_symbol
FROM items
INNER JOIN measurement_types
ON measurement_types.id = items.measurement_types_id;
would yield this result:
+---------+----------+--------------------+
| name | quantity | measurement_symbol |
+---------+----------+--------------------+
| Rice | 50 | kg |
| Noodles | 75 | packets |
+---------+----------+--------------------+

Friendship Website Database Design

I'm trying to create a database for a frienship website I'm building. I want to store multiple attributes about the user such as gender, education, pets etc.
Solution #1 - User table:
id | age | birth day | City | Gender | Education | fav Pet | fav hobbie. . .
--------------------------------------------------------------------------
0 | 38 | 1985 | New York | Female | University | Dog | Ping Pong
The problem I'm having is the list of attributes goes on and on and right now my user table has 20 something columns.
I feel I could normalize this by creating another table for each attribute see below. However this would create many joins and I'm still left with a lot of columns in the user table.
Solution #2 - User table:
id | age | birth day | City | Gender | Education | fav Pet | fav hobbies
--------------------------------------------------------------------------
0 | 38 | 1985 | New York | 0 | 0 | 0 | 0
Pets table:
id | Pet Type
---------------
0 | Dog
Anyone have any ideas how to approach this problem it feels like both answers are wrong. What is the proper table design for this database?
There is more to this than meets the eye: First of all - if you have tons of attributes, many of which will likely be null for any specific row, and with a very dynamic selection of attributes (i.e. new attributes will appear quite frequently during the code's lifecycle), you might want to ask yourself, whether a RDBMS is the best way to materialize this ... essentially non-schema. Maybe a document store would be a better fit?
If you do want to stay in the RDBMS world, the canonical answer is to have either one or one-per-datatype property table plus a table of properties:
Users.id | .name | .birthdate | .Gender | .someotherfixedattribute
----------------------------------------------------------
1743 | Me. | 01/01/1970 | M | indeed
Propertytpes.id | .name
------------------------
234 | pet
235 | hobby
Poperties.uid | .pid | .content
-----------------------------
1743 | 234 | Husky dog
You have a comment and an answer that recommend (or at least suggest) and Entity-Attribute-Value (EAV) model.
There is nothing wrong with using EAV if your attributes need to be dynamic, and your system needs to allow adding new attributes post-deployment.
That said, if your columns and relationships are all known up front, and they don't need to be dynamic, you are much better off creating an explicit model. It will (generally) perform better and will be much easier to maintain.
Instead of a wide table with a field per attribute, or many attribute tables, you could make a skinny table with many rows, something like:
Attributes (id,user_id,attribute_type,attribute_value)
Ultimately the best solution depends greatly on how the data will be used. People can only have one DOB, but maybe you want to allow for multiple addresses (billing/mailing/etc.), so addresses might deserve a separate table.

Which is a better database schema for a tracking tool?

I have to generate a view that shows tracking across each month. The ultimate view will be something like this:
| Person | Task | Jan | Feb | Mar| Apr | May | June . . .
| Joe | Roof Work | 100% | 50% | 50% | 25% |
| Joe | Basement Work | 0% | 50% | 50% | 75% |
| Tom | Basement Work | 100% | 100% | 100% | 100% |
I already have the following tables:
Person
Task
I am now creating a new table to foreign key into the above 2 tables and i am trying to figure out the pros and cons of creating 1 or 2 tables.
Option 1:
Create a new table with the following Columns:
Id
PersonId
TaskId
Jan2012
Feb2012
Mar2012
Apr2013
or
Option 2:
have 2 seperate tables
One table for just
Id
PersonId
TaskId
and another table for just the following columns
Id
PersonTaskId (the id from table above)
MonthYearKey
MonthYearValue
So an example record would be
| 1 | 13 | Jan2011 | 100% |
where 13 would represent a specific unique Person and Task combination. This second way would avoid having to create new columns to continue over time (which seems right) but i also want to avoid overkill.
which would be a more scalable way to have this schema. Also, any other suggestions or more elegant ways of doing this would be great as well?
You can have a m2m table with data columns. I don't see a reason why you can't just put MonthYearKey, MonthYearValue on the same table with PersonId and TaskId
Id
TaskId
PersonId
MonthYearKey
MonthYearValue
It's possible too that you would want to move the MonthYearKey out into their own table, it really just comes down to common queries and what this data is used for.
I would note, you never want to design a schema where you are adding columns due to time. The first option would require maintenance all the time, and would become very difficult to query also.
Option 2 is definitely more scalable and is not overkill.
Option 1 would require you to add a new column every month and simple date based queries of your data would not be possible, e.g. Show me all people who worked at least 90% in any month last year.
The ultimate view would be generated from a particular query or view of your data.

Resources