Cassandra/Solr data model improvement - solr

I have the following table:
CREATE TABLE videos_tags (
    id text,
    tag text,
    video text,
    someotherfield bigint,
    PRIMARY KEY (id)
) WITH gc_grace_seconds = 1296000
AND compaction = {'class': 'LeveledCompactionStrategy'}
AND compression = {'sstable_compression': 'LZ4Compressor'};
The table stores a list of tags and videos. A video can have one or more tags; and a tag can be attributed to more than one video. Example:
id | tag | video
------------------------------------------
1 | dancing | video1
2 | singing | video2
3 | prank | video3
4 | prank | video4
5 | funny | video3
6 | cover | video2
I want to show my users a list of related videos based on tag assignment: the more tags a video has in common with the user's video, the more "related" it is. My current approach consists of 2 steps:
Get the list of tags of the user's video:
q=*:*&fq=video:video1&fl=tag
Identify the videos that use the same tags as the user's video and select the top 10 (result set slicing is done on the application side):
q=*:*&fq=tag:tag1 AND tag:tag2 AND tag:tag3 AND !video:video1&fl=video&stats=true&stats.field=someotherfield&stats.facet=video
Note: I used stats instead of a plain facet because I also need the sum of someotherfield.
This approach yields an average execution time of 30 seconds. Unfortunately, the maximum acceptable query time for my app is 10 seconds.
Is there a better approach to tackling this data requirement? I'm open to:
Alternative query approach (minor tweaks are preferred; but I can accept something as drastic as replacing my 2-step approach completely)
Alternative schema
Notes:
The actual schema has several other fields that I removed from this post for brevity
I do all read operations via Solr (Datastax Enterprise 4.6.0). Nothing fancy in the Solr schema
The table currently holds 1.5 billion rows, but could double or triple within a few years (so the solution must take the table/index size into account)
No fulltext search - only exact string filters
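For reference, the ranking the question describes (count the tags each other video shares with the user's video, and also sum someotherfield) can be sketched in plain Python. This is only an illustration of the intended result, not a Solr or Cassandra query; the someotherfield values and the function name related_videos are made up for the example.

```python
from collections import defaultdict

# In-memory stand-in for videos_tags rows: (id, tag, video, someotherfield).
# The someotherfield values are invented sample data.
ROWS = [
    ("1", "dancing", "video1", 10),
    ("2", "singing", "video2", 20),
    ("3", "prank", "video3", 30),
    ("4", "prank", "video4", 40),
    ("5", "funny", "video3", 50),
    ("6", "cover", "video2", 60),
]


def related_videos(rows, video, limit=10):
    """Rank other videos by the number of tags they share with `video`."""
    # Step 1: collect the tags of the user's video.
    tags = {tag for _, tag, v, _ in rows if v == video}
    # Step 2: per other video, count shared tags and sum someotherfield.
    shared = defaultdict(int)
    total = defaultdict(int)
    for _, tag, v, other in rows:
        if v != video and tag in tags:
            shared[v] += 1
            total[v] += other
    # Most shared tags first; ties broken by video name for a stable order.
    ranked = sorted(shared, key=lambda v: (-shared[v], v))
    return [(v, shared[v], total[v]) for v in ranked[:limit]]
```

With the sample rows, related_videos(ROWS, "video3") returns a single entry for video4, which shares the tag prank.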

Related

Load big table into web browser using react in on-demand instantiation of table row

I'm building an Excel-like table in the web browser with React.js, using only <div> elements rather than <table>.
There are about 90 columns and about 24,000 rows.
As we know, loading the whole data set into the HTML of a single page is not feasible because of performance issues.
So I decided to show only part of the data and let the user scroll.
The main concept is simple: build HTML only for the rows near the user's viewport.
Suppose the user is seeing rows 1800 to 1900 in the viewport. I will load only about rows 1750 to 1950 into the HTML. If the user scrolls up, I'll build HTML for rows 1700 to 1750 and remove rows 1900 to 1950.
I think I need to track the scroll offset manually to find where the user is. If every row is 40px high and the viewport is 1000px high, the user sees 25 rows at a time, so I need to load about 25 (before) + 25 (currently visible) + 25 (after) rows; when the user scrolls up or down, I'll load additional rows and remove the rows that are far from the viewport.
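The fixed-height arithmetic above can be written down as a small function. This is a minimal sketch assuming every row has the same height; the function and parameter names are hypothetical and not part of any library.

```python
def visible_window(scroll_top, viewport_height, row_height, total_rows, overscan):
    """Return (first, last) row indices to render; `last` is exclusive."""
    # First and last rows that are at least partially inside the viewport.
    first_visible = scroll_top // row_height
    last_visible = (scroll_top + viewport_height - 1) // row_height
    # Extend the range by `overscan` rows on both sides, clamped to the data.
    first = max(0, first_visible - overscan)
    last = min(total_rows, last_visible + 1 + overscan)
    return first, last
```

For the numbers in the text (40px rows, a 1000px viewport, 25 extra rows on each side), a scroll offset of 72000px puts row 1800 at the top of the viewport and gives the range 1775 to 1850.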
However, I found that my table's requirements do not match this situation. Here is my situation.
First, the rows do not all have the same height. Basically, my table shows a group of rows as a single row. What I mean is that a single table row can look like this:
| Photo| ProductName | Size Pool | Stock |
.... // Below are single row
+------+---------------+-------------------+------------+
| | Boots | 110-120 | 24 | // Row header (Shows Summary of child row)
+ +---------------+-------------------+------------+
| | Boots | 110 | 16 | // Row's row #1
+ +---------------+-------------------+------------+
| | Boots | 120 | 8 | // Row's row #2
+------+------------------------------------------------+
...
+------+---------------+-------------------+------------+
| |Leather Shoe | 120 | 8 | // Row can come with no header row, only single
+------+---------------+-------------------+------------+
...
As shown above, if a product has multiple options, they are merged into one row made of sub-rows and shown with a summary header. If the product has no options, only its own row is shown. If the content inside a row is large, the row stretches to fit it.
All data comes from a remote database and is retrieved via a REST API.
The database schema is as follows, with 2 tables as an example.
Table #1 ProductInfo
+--------------+------------+------------+-----------+
| GroupNumber |ProductName | Size | Stock |
+--------------+------------+------------+-----------+
| 1 | Boots | 110 | 16 |
+--------------+------------+------------+-----------+
| 1 | Boots | 120 | 8 |
+--------------+------------+------------+-----------+
| 2 |Leather Shoe| 120 | 8 |
+--------------+------------+------------+-----------+
Table #2 GroupInfo
+-----------+------------+--------------+
|GroupNumber| SizePool | ImageURL |
+-----------+------------+--------------+
| 1 | 110-120 | https://abc |
+-----------+------------+--------------+
| 2 | 120 | https://def |
+-----------+------------+--------------+
The requirements are listed below (most of them are already implemented):
Sort by each column, multi-pivot sort by sub-row OR row (handled via SQL)
Filter data by expression (handled by the client)
Hiding, resizing and reordering of column(s) (handled by the client)
Interactive components inside cells, such as a DatePicker, pop-ups, etc. (handled by the client)
I succeeded in creating such a table with a page-based method, but I need a table with a scrolling viewport.
The table contains many dependent-value columns such as sum and average, which are not stored in the DB except for special reasons (such as performance). Most of them are handled by a DB view or procedure, including sorting, calculations, etc. So overall performance is really important.
I have considered a few questions and ways to handle this. Can you check them and give me some advice?
Q1. How can I decide when data should be loaded and removed, and how much?
Row height is not consistent, so I think I cannot use the scroll offset or the row number as a measurement criterion. (Is it possible in a predictable way?)
Is it possible to achieve this by accessing DOM elements? I'm new to web development, sorry.
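One common way to handle inconsistent row heights is to keep a running total of the heights and binary-search it with the scroll offset. The sketch below assumes the height of each row is already known (for example measured from the DOM after rendering, or estimated); all names are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def build_offsets(heights):
    """offsets[i] is the top pixel of row i; offsets[-1] is the total height."""
    return [0, *accumulate(heights)]


def row_at(offsets, y):
    """Index of the row that contains vertical pixel y (clamped to valid rows)."""
    index = bisect_right(offsets, y) - 1
    return min(max(index, 0), len(offsets) - 2)


def visible_window_variable(offsets, scroll_top, viewport_height, overscan=0):
    """Return (first, last) row indices to render; `last` is exclusive."""
    first = row_at(offsets, scroll_top)
    last = row_at(offsets, scroll_top + viewport_height - 1)
    total_rows = len(offsets) - 1
    return max(0, first - overscan), min(total_rows, last + 1 + overscan)
```

Each lookup is a binary search, so it stays cheap for 24,000 rows. When a row's measured height changes, the offsets after that row have to be recomputed.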
Q2. I can get the data from the DB in 2 different ways:
Getting ProductInfo and GroupInfo separately: [<ProductInfo>,...] and [<GroupInfo>,...]
Getting a single object per group, like this: { group:<GroupInfo>, values:[<ProductInfo>,...] }
Which is better for performance in typical situations?
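Whichever shape the API returns, the nested form can be derived from the two flat lists in a single pass on the client. A minimal sketch using the sample tables from this question (the function name nest is hypothetical, and this says nothing about which form is faster to query):

```python
from collections import defaultdict

product_info = [
    {"GroupNumber": 1, "ProductName": "Boots", "Size": 110, "Stock": 16},
    {"GroupNumber": 1, "ProductName": "Boots", "Size": 120, "Stock": 8},
    {"GroupNumber": 2, "ProductName": "Leather Shoe", "Size": 120, "Stock": 8},
]
group_info = [
    {"GroupNumber": 1, "SizePool": "110-120", "ImageURL": "https://abc"},
    {"GroupNumber": 2, "SizePool": "120", "ImageURL": "https://def"},
]


def nest(groups, products):
    """Build [{group: <GroupInfo>, values: [<ProductInfo>, ...]}, ...]."""
    # Index the products by group number once, then attach them to each group.
    by_group = defaultdict(list)
    for product in products:
        by_group[product["GroupNumber"]].append(product)
    return [
        {"group": group, "values": by_group.get(group["GroupNumber"], [])}
        for group in groups
    ]
```

The summary row of a group (for example the total stock) can then be computed from values on the client.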
Q3. If I get the data as { group:<GroupInfo>, values:[<ProductInfo>,...] }, are there any performance problems?
For example, query overhead: I need a query with 6 joins and SELECT queries nested up to 6 levels deep, with 30 calculated columns, for a single data retrieval. A pre-calculated view or table can be a problem because I have many users who use it and update it frequently, so I need to worry about mutual exclusion, at least on updates.
I'm sure the above query's performance is sufficient for fetching if I get the data as [<ProductInfo>,...] and [<GroupInfo>,...]. But I think the latter form is better, so I would need to change the interface if possible.
Q4. If I fetch the whole data set from the DB and structure it at the beginning, and then add and remove data only in the DOM, would that be a good approach?
Of course, Q1 is my primary concern, but this also seems good except for data sync with the DB (because other users can update values while the client holds outdated data).
I considered using infinite scrolling, but it does not fit my case: I need to load and remove data at the same time, and infinite scrolling does not seem to support removing data from the viewport. Inconsistent row height may also be a problem.
I found react-virtualized and it works.
It also supports dynamic row resizing, which helped greatly.

Optimising WordPress query speed by grouping metadata

I am building an eCommerce site and I have about 10 different custom post types. For each custom post type, I have about 20 custom fields (with a single meta_value assigned to each meta_key).
What I was thinking about doing is grouping all 20 custom fields into a single custom field so that I'll have 20 meta_values (array) assigned to a single meta_key. This would essentially reduce the number of rows inside the database table from 20 rows down to a single row for each custom post type.
For example, let's say I have post type "book".
Currently my post_meta table for this post type might look like this:
meta_key | meta_value
_________________________
chapters | 5
_________________________
paragraphs | 10
_________________________
author | john_smith
_________________________
price | 19
_________________________
If I convert it to this:
meta_key | meta_value
____________________________
book_detail | [chapters:5 ,paragraphs:10, author:john_smith, price:19]
____________________________
Would doing something like this optimize WordPress query speed or would I just be wasting time?
The tables should be kept as normalized as possible. Storing CSV or other delimited values in a single field is not recommended.
Also, if you need to query a specific aspect of a post, such as only chapters, normalized data comes in handy because you just need a simple predicate:
where meta_key = 'chapters';
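To make the point concrete, here is a small sketch using Python's built-in sqlite3 module. The postmeta table below is a simplified stand-in for the WordPress meta table, and the second post is invented sample data. With one row per key, filtering on a single field is a plain predicate; with everything packed into one book_detail value, the database could not filter on chapters without parsing every row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the post meta table: one row per (post, key).
conn.execute(
    "CREATE TABLE postmeta (post_id INTEGER, meta_key TEXT, meta_value TEXT)"
)
conn.executemany(
    "INSERT INTO postmeta VALUES (?, ?, ?)",
    [
        (1, "chapters", "5"),
        (1, "paragraphs", "10"),
        (1, "author", "john_smith"),
        (1, "price", "19"),
        (2, "chapters", "12"),  # invented second book
        (2, "price", "25"),
    ],
)
conn.execute("CREATE INDEX idx_meta_key ON postmeta (meta_key)")


def posts_with_min_chapters(minimum):
    """Return the ids of posts whose 'chapters' value is at least `minimum`."""
    rows = conn.execute(
        "SELECT post_id FROM postmeta "
        "WHERE meta_key = 'chapters' AND CAST(meta_value AS INTEGER) >= ? "
        "ORDER BY post_id",
        (minimum,),
    )
    return [post_id for (post_id,) in rows]
```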

Another way to build database structure

I have to optimize my little-big database, because it's too slow, maybe we'll find another solution together.
First of all let's talk about data that are stored in the database. There are two objects: users and let's say messages
Users
There is something like that:
+----+---------+-------+-----+
| id | user_id | login | etc |
+----+---------+-------+-----+
| 1 | 100001 | A | ....|
| 2 | 100002 | B | ....|
| 3 | 100003 | C | ....|
|... | ...... | ... | ....|
+----+---------+-------+-----+
There is no problem with this table. (Don't be afraid of id and user_id; user_id is used by another application, so it has to be here.)
Messages
And the second table has some problem. Each user has for example messages like this:
+----+---------+------+----+
| id | user_id | from | to |
+----+---------+------+----+
| 1 | 1 | aab | bbc|
| 2 | 2 | vfd | gfg|
| 3 | 1 | aab | bbc|
| 4 | 1 | fge | gfg|
| 5 | 3 | aab | gdf|
|... | ...... | ... | ...|
+----+---------+------+----+
There is no need to edit messages, but it must be possible to update the list of messages for a user. For example, an external service sends all of a user's messages to the db and the list has to be updated.
The most important thing is that there are about 30 million users and the average user has 500+ messages. Another problem is that I have to search through the from field and count the matches. I designed a simple SQL query with a join, but it takes too much time to get the data.
So it's quite a big amount of data. I decided not to use RDS (I used PostgreSQL) and to move to a database like ClickHouse.
However, I ran into the problem that ClickHouse, for example, doesn't support the UPDATE statement.
To resolve this issue I decided to store each user's messages as one row. So the Messages table would look like this:
Here I'd like to store the messages in JSON format
{"from":"aaa", "to":"bbe"}
{"from":"ret", "to":"fdd"}
{"from":"gfd", "to":"dgf"}
||
\/
+----+---------+----------+------+ And there I'd like to store the
| id | user_id | messages | hash | <= hash of the messages.
+----+---------+----------+------+
I think that full-text search inside the messages column will save time and resources.
Do you have any ideas? :)
In ClickHouse, the best approach is to store data in a "big flat table".
So you store every message in a separate row.
15 billion rows is OK for ClickHouse, even on a single node.
It is also reasonable to store each user's attributes directly in the messages table (pre-joined), so you don't need JOINs. This is suitable if user attributes are not updated.
These attributes will have repeated values for each of a user's messages. That is OK because ClickHouse compresses data well, especially repeated values.
If users' attributes are updated, consider storing the users table in a separate database and using the 'External dictionaries' feature to join it.
If a message is updated, just don't update it in place. Write another row with the modified message to the table instead and leave the old message as is.
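The append-only idea can be illustrated in a few lines of Python: every change is a new row, and readers pick the newest row per message. The version column is an assumption made for this sketch (the original table has no such column), and the function name is hypothetical.

```python
def latest_messages(rows):
    """rows: iterable of (message_id, version, payload); highest version wins."""
    latest = {}
    for message_id, version, payload in rows:
        current = latest.get(message_id)
        # Keep the row with the highest version seen so far for this message.
        if current is None or version > current[0]:
            latest[message_id] = (version, payload)
    return {message_id: payload for message_id, (_, payload) in latest.items()}
```

Old rows stay in the table; only the read side decides which version is current.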
It's important to have the right primary key for your table. You should use a table from the MergeTree family, which constantly reorders data by primary key and so keeps range queries efficient. The primary key is not required to be unique; for example, you could define the primary key as just (from) if you frequently filter with "from = ..." and these queries must be processed quickly.
Alternatively, you could use user_id as the primary key if queries by user id are frequent and must be processed as fast as possible, but then queries with a predicate on 'from' will scan the whole table (mind that ClickHouse does full scans efficiently).
If you need fast lookups by many different attributes, you could just duplicate the table with different primary keys. Typically the table compresses well enough that you can afford to keep the data in a few copies with different orderings for different range queries.
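Why ordering by the lookup key helps can be shown with a toy example: once rows are sorted by a key, all rows for one key value are contiguous and can be located by binary search instead of a full scan. This only illustrates the idea with the sample Messages rows from the question; it is not a description of how ClickHouse stores data internally, and all names are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Sample rows from the question's Messages table: (user_id, from, to).
MESSAGES = [
    (1, "aab", "bbc"),
    (2, "vfd", "gfg"),
    (1, "aab", "bbc"),
    (1, "fge", "gfg"),
    (3, "aab", "gdf"),
]

# Copy ordered by `from`, for "from = ..." lookups.
by_from = sorted(MESSAGES, key=lambda m: m[1])
from_keys = [m[1] for m in by_from]

# Second copy ordered by user_id, for per-user lookups.
by_user = sorted(MESSAGES, key=lambda m: m[0])
user_keys = [m[0] for m in by_user]


def count_from(value):
    """Count rows whose `from` equals value using two binary searches."""
    return bisect_right(from_keys, value) - bisect_left(from_keys, value)


def messages_of(user_id):
    """Return the contiguous slice of rows that belong to user_id."""
    lo = bisect_left(user_keys, user_id)
    hi = bisect_right(user_keys, user_id)
    return by_user[lo:hi]
```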
First of all, when we have such a big dataset, from and to columns should be integers, if possible, as their comparison is faster.
Second, you should consider creating proper indexes. As each user has relatively few records (500 compared to 30M in total), it should give you a huge performance benefit.
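As a small illustration of the index suggestion, the sketch below uses Python's built-in sqlite3 module as a stand-in for PostgreSQL (the planner and plan output differ between the two, so treat it only as a demonstration of the idea). The index name is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "from" and "to" are SQL keywords, so they are quoted as identifiers.
conn.execute(
    'CREATE TABLE messages ('
    'id INTEGER PRIMARY KEY, user_id INTEGER, "from" TEXT, "to" TEXT)'
)
conn.executemany(
    'INSERT INTO messages (user_id, "from", "to") VALUES (?, ?, ?)',
    [(1, "aab", "bbc"), (2, "vfd", "gfg"), (1, "aab", "bbc"),
     (1, "fge", "gfg"), (3, "aab", "gdf")],
)
conn.execute('CREATE INDEX idx_messages_from ON messages ("from")')

COUNT_SQL = 'SELECT COUNT(*) FROM messages WHERE "from" = ?'


def count_matches(value):
    """Count the messages whose from field equals value."""
    return conn.execute(COUNT_SQL, (value,)).fetchone()[0]


def query_plan(value):
    """Return the textual query plan so index usage can be inspected."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + COUNT_SQL, (value,)).fetchall()
    return " ".join(str(row[-1]) for row in rows)
```

Checking the plan before and after creating an index is a quick way to confirm that the query actually uses it.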
If everything else fails, consider using partitions:
https://www.postgresql.org/docs/9.1/static/ddl-partitioning.html
In your case they would be dynamic and would slow first-time inserts immensely, so I would consider them only as a last resort, even if a very efficient one.

Optimal View Design To Find Mismatches Between Two Sets of Data

A bit of background...my company utilizes a piece of software that stores information about a mortgage loan in independent fields. These fields are broken up across many tables in the loan database.
My current dilemma revolves around designing a view(s) that will allow me to find mismatched data on a subset of loans from the underwriting side of our software and the lock side of our software.
Here is a quick example of the data returned from the two views that already exist:
UW View
transID | DTIField | LTVField | MIField
50000 | 37.5 | 85.0 | 1
Lock View
transID | DTIField | LTVField | MIField
50000 | 42.0 | 85.0 | 0
In the above situation, the view should return the fields that are not matching (in this case the DTIField and the MIField). I have built a comparison view that uses a series of CASE statements to return either a 0 for not matched or a 1 for matched already:
transID | DTIField | LTVField | MIField
50000 | 0 | 1 | 0
This is fine in itself but it is creating a bit of an issue downstream on the reporting side. We want to be able to build a report that would display only those transIDs that have mismatched data and show which columns are not matched. Crystal Reports is the reporting solution in question.
Some specifics about the data sets...we have 27 items of the loan that we are comparing (so a total 54 fields). There are over 4000 loans in the system and growing. There are already indexes on the transID fields.
How would you structure the view to return all the data needed for the report? We can do a good amount of work in Crystal Reports but ideally much of the logic would be handled in MSSQL.
Thanks for any assistance.
I think there should be no issue in comparing the 27 columns for a given row. Since you'll be reading each row just once and comparing the columns of that row in both tables, it shouldn't really pose any performance issues. You can use a hash function such as HASHBYTES to compute a hash value over the combination of these 27 fields in both tables and then compare that value to decide which rows the view should return. This should result in some performance improvement. Testing will reveal more.
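The hash-then-compare idea can be sketched outside the database as follows. This is a Python illustration using hashlib rather than T-SQL's HASHBYTES, with only the three sample fields from the question; the separator character and the extra matching loan 50001 are assumptions made for the example.

```python
import hashlib

FIELDS = ("DTIField", "LTVField", "MIField")


def row_hash(row):
    """Hash the compared fields of one row."""
    # The unit separator is assumed not to occur in the values, so that
    # ("ab", "c") and ("a", "bc") cannot produce the same joined string.
    joined = "\x1f".join(str(row[field]) for field in FIELDS)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def mismatches(uw_rows, lock_rows):
    """Return {transID: [mismatched field names]} for rows whose hashes differ."""
    lock_by_id = {row["transID"]: row for row in lock_rows}
    result = {}
    for uw in uw_rows:
        lock = lock_by_id.get(uw["transID"])
        # Equal hashes mean the row matches; skip the per-column comparison.
        if lock is None or row_hash(uw) == row_hash(lock):
            continue
        result[uw["transID"]] = [f for f in FIELDS if uw[f] != lock[f]]
    return result
```

Only rows whose hashes differ pay for the column-by-column comparison, and the result already has the shape the report needs: one entry per mismatched transID with the names of the columns that differ.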

Can anyone suggest a method of versioning ATTRIBUTE (rather than OBJECT) data in DB

Taking MySQL as an example DB to perform this in (although I'm not restricted to Relational flavours at this stage) and Java style syntax for model / db interaction.
I'd like the ability to allow versioning of individual column values (and their corresponding types) as and when users edit objects. This is primarily in an attempt to drop the amount of storage required for frequent edits of complex objects.
A simple example might be
- Food (Table)
- id (INT)
- name (VARCHAR(255))
- weight (DECIMAL)
So we could insert an object into the database that looks like...
Food banana = new Food("Banana",0.3);
giving us
+----+--------+--------+
| id | name | weight |
+----+--------+--------+
| 1 | Banana | 0.3 |
+----+--------+--------+
if we then want to update the weight we might use
banana.weight = 0.4;
banana.save();
+----+--------+--------+
| id | name | weight |
+----+--------+--------+
| 1 | Banana | 0.4 |
+----+--------+--------+
Obviously though this is going to overwrite the data.
I could add a revision column to this table, which could be incremented as items are saved, and set a composite key that combines id/version, but this would still mean storing ALL attributes of this object for every single revision
- Food (Table)
- id (INT)
- name (VARCHAR(255))
- weight (DECIMAL)
- revision (INT)
+----+--------+--------+----------+
| id | name | weight | revision |
+----+--------+--------+----------+
| 1 | Banana | 0.3 | 1 |
| 1 | Banana | 0.4 | 2 |
+----+--------+--------+----------+
But in this instance we're going to be storing every single piece of data about every single item. This isn't massively efficient if users are making minor revisions to larger objects where Text fields or even BLOB data may be part of the object.
What I'd really like is the ability to selectively store data discretely, so the weight could possibly be saved in a separate DB in its own right, with a reference to the table, row and column that it relates to.
This could then be combined with a VIEW of the table that overlays any later revisions of individual column data to create the latest version, without the need to store ALL data for each small revision.
+----+--------+--------+
| id | name | weight |
+----+--------+--------+
| 1 | Banana | 0.3 |
+----+--------+--------+
+-----+------------+-------------+-----------+-----------+----------+
| ID | TABLE_NAME | COLUMN_NAME | OBJECT_ID | BLOB_DATA | REVISION |
+-----+------------+-------------+-----------+-----------+----------+
| 456 | Food | weight | 1 | 0.4 | 2 |
+-----+------------+-------------+-----------+-----------+----------+
Not sure how successful storing any data as blob to then CAST back to original DTYPE might be, but thought since I was inventing functionality here, why not go nuts.
This method of storage would also be fairly dangerous, since table and column names are entirely subject to change, but hopefully this at least outlines the sort of behaviour I'm thinking of.
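The overlay behaviour described in the question could be sketched like this in Python. It only illustrates the intended semantics with in-memory data; the structure names and the as_of parameter are hypothetical, and values are kept as strings to mirror the BLOB_DATA column, so a real implementation would still need to cast them back to the column type.

```python
# Base rows keyed by (table, object_id); values kept as strings like BLOB_DATA.
BASE = {("Food", 1): {"id": 1, "name": "Banana", "weight": "0.3"}}

# Per-column revisions: (table_name, column_name, object_id, value, revision).
REVISIONS = [("Food", "weight", 1, "0.4", 2)]


def current_version(table, object_id, base=BASE, revisions=REVISIONS, as_of=None):
    """Overlay column revisions (up to `as_of`, if given) on the base row."""
    row = dict(base[(table, object_id)])
    applicable = [
        rev for rev in revisions
        if rev[0] == table and rev[2] == object_id
        and (as_of is None or rev[4] <= as_of)
    ]
    # Apply in revision order so the latest value for each column wins.
    for _, column, _, value, _ in sorted(applicable, key=lambda rev: rev[4]):
        row[column] = value
    return row
```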
A table in 6NF has one CK (candidate key) (in SQL a PK) and at most one other column. Essentially 6NF allows each pre-6NF table column's update time/version and value to be recorded in an anomaly-free way. You decompose a table by dropping a non-prime column while adding a table with that column plus the old CK's columns. For temporal/versioning applications you further add a time/version column, and the new CK is the old one plus that column.
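Applied to the Food example from the question, the decomposition might look like the sketch below, using Python's built-in sqlite3 module. The table names food_name and food_weight and the function food_as_of are hypothetical; each attribute table holds the old key plus a revision column and exactly one value column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE food (id INTEGER PRIMARY KEY);
-- One table per attribute: old key + revision + a single value column.
CREATE TABLE food_name (
    id INTEGER, revision INTEGER, name TEXT, PRIMARY KEY (id, revision));
CREATE TABLE food_weight (
    id INTEGER, revision INTEGER, weight REAL, PRIMARY KEY (id, revision));
INSERT INTO food VALUES (1);
INSERT INTO food_name VALUES (1, 1, 'Banana');
INSERT INTO food_weight VALUES (1, 1, 0.3), (1, 2, 0.4);
""")


def food_as_of(food_id, revision):
    """Return (name, weight) as they were at the given revision."""
    return conn.execute(
        """
        SELECT
          (SELECT name FROM food_name
            WHERE id = ? AND revision <= ?
            ORDER BY revision DESC LIMIT 1),
          (SELECT weight FROM food_weight
            WHERE id = ? AND revision <= ?
            ORDER BY revision DESC LIMIT 1)
        """,
        (food_id, revision, food_id, revision),
    ).fetchone()
```

Changing only the weight adds one row to food_weight; the name is not stored again.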
Adding a column of time/whatever interval (in SQL, start time and end time columns) instead of a single time to a CK allows a kind of data compression by recording the longest uninterrupted stretches of time, or of another dimension, through which a column had the same value. One queries by an original CK plus the time whose value you want. You don't need this for your purposes, but the initial process of normalizing to 6NF and the addition of a time/whatever column should be explained in temporal tutorials.
Read about temporal databases (which deal both with "valid" data that is times and time intervals but also "transaction" times/versions of database updates) and 6NF and its role in them. (Snodgrass/TSQL2 is bad, Date/Darwen/Lorentzos is good and SQL is problematic.)
Your final suggested table is an example of EAV. This is usually an anti-pattern. It encodes a database into one or more tables that are effectively metadata. But since the DBMS doesn't know that, you lose much of its functionality. EAV is not called for if DDL is sufficient to manage tables with the columns that you need. Just declare appropriate tables in each database. Which is really one database, since you expect transactions affecting both. From that link:
You are using a DBMS anti-pattern EAV. You are (trying to) build part of a DBMS into your program + database. The DBMS already exists to manage data and metadata. Use it.
Do not have a class/table of metadata. Just have attributes of movies be fields/columns of Movies.
The notion that one needs to use EAV "so every entity type can be extended with custom fields" is mistaken. Just implement via calls that update metadata tables sometimes instead of just updating regular tables: DDL instead of DML.

Resources