ClickHouse schema design, predefined set of columns

I have multiple sources of input with different schemas. To do some analytics using ClickHouse, I thought of 2 approaches to handling the analytic workload, using either a join or an aggregation operation:
Using joins involves defining a table corresponding to each input.
Using aggregate functions requires a single table with a predefined set of columns. The number and types of the columns would be based on my approximations, and may change in the future.
My question is: if I go with the second approach, defining lots of columns, say hundreds of columns, how does it affect performance, storage cost, etc.?

Generally speaking, a large table with all your values plus the use of aggregate functions is often the use case ClickHouse was designed for.
Various types of join-based queries start being efficient on large datasets when the queries are distributed between machines. But if you can afford to keep your data on a single SSD RAID, try using a single table and aggregate functions.
Of course, that's generic advice, it really depends on your data.
As far as irregular data goes, depending on how varied it can be, you may want to look into a dynamic solution (e.g. Spark or Elasticsearch) or a database that supports "sparse" columns (e.g. Cassandra or ScyllaDB).
If you want to use ClickHouse for this, look into using arrays and tuples to hold the irregular values.
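As a minimal sketch of that idea (the table and column names here are hypothetical, not from the question), irregular attributes can be kept in paired key/value arrays and pulled out at query time:

    -- Irregular attributes held in paired arrays instead of fixed columns.
    CREATE TABLE raw_events
    (
        event_date  Date,
        source      String,
        attr_keys   Array(String),
        attr_values Array(String)
    )
    ENGINE = MergeTree
    ORDER BY (source, event_date);

    -- indexOf() finds the position of a key; arrays are 1-based, so a missing
    -- key yields position 0 and the lookup returns an empty string.
    SELECT count()
    FROM raw_events
    WHERE attr_values[indexOf(attr_keys, 'country')] = 'DE';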
Overall, ClickHouse is pretty clever about compressing data, so adding a lot of empty values should be fine (they will barely increase query time and won't occupy much extra space). Queries are column-oriented, so if you don't need a column for a specific query, performance won't be affected by the mere fact that the column exists (as it would be in a row-oriented RDBMS).
So even if your table has, say, 200 columns, as long as your query only uses 2 of those columns, it will be basically as efficient as if the table only had 2 columns. Also, the lower the granularity of a column, the faster the queries on that column (with some caveats). That being said, if you plan to query hundreds of columns in the same query, it's probably going to be fairly slow, but ClickHouse is very good at parallelizing work, so if your data is in the lower dozens of TB (uncompressed), getting a machine with some large SSDs and a couple of Xeons will usually do the trick.
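To make the single-table approach concrete, here is a minimal sketch of a wide MergeTree table and an aggregate query over just two of its columns; all names and the choice of Nullable metric columns are assumptions for illustration, not part of the question:

    -- A wide table with many optional metrics; unused values compress to almost nothing.
    CREATE TABLE events
    (
        event_date Date,
        source     LowCardinality(String),
        user_id    UInt64,
        metric_a   Nullable(Float64),
        metric_b   Nullable(Float64),
        metric_c   Nullable(UInt32)
    )
    ENGINE = MergeTree
    ORDER BY (source, event_date, user_id);

    -- Only the referenced columns are read, no matter how wide the table is.
    SELECT source, avg(metric_a) AS avg_a, sum(metric_c) AS total_c
    FROM events
    WHERE event_date >= '2024-01-01'
    GROUP BY source;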
But, again, this all depends heavily on the dataset, you have to explain your data and the types of queries you need in order to get a more meaningful answer.

Related

Performance of Column Family in Cassandra DB

I have a table where my queries will be based purely on id and created_time. There are 50 other columns, all of which will also be queried purely based on id and created_time. I can design it in two ways:
Multiple small tables with 5 columns each, covering all 50 parameters
A single table with all 50 columns, with id and created_at as the primary key
Which will be better? My row count will grow tremendously, so should I worry about the width of the column family while modelling?
Actually, you should use smaller tables to reduce the load on a single table, and you should also try to model tables around your queries. If your queries read all 50 columns, then you can proceed with a single table. But if you plan to read only part of the data in each query, then you should maintain query-specific smaller tables, which will redistribute the data more evenly across the nodes, or maintain multiple partitions as Alex suggested (but then you cannot do range-based queries).
This really depends on how you structure your partition key and how data is distributed inside a partition. CQL has some limits, such as a maximum of 2 billion cells per partition, but that is a theoretical limit; the practical limits are more like not having partitions bigger than 100 MB, etc. (DSE has recommendations in its planning guide).
If you will always search by id and created_time, and never do range queries on created_time, then you may even use a composite partition key comprising both; this will distribute data more evenly across the cluster. Otherwise, make sure that you don't have too much data inside a partition.
Or you can add another piece to the partition key; for example, people sometimes add a truncated date-time (time rounded to the hour or to the day), but this will affect your queries. It really depends on them.
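A minimal CQL sketch of the composite-key idea with a time bucket; the table and column names are hypothetical:

    -- id plus a day bucket form the composite partition key, so a single id's
    -- data is spread over one partition per day instead of one huge partition.
    CREATE TABLE readings (
        id           uuid,
        day_bucket   date,        -- created_time truncated to the day
        created_time timestamp,
        col_1        int,
        col_2        int,
        PRIMARY KEY ((id, day_bucket), created_time)
    );

Queries then always supply both id and day_bucket, and can still range over created_time within a day.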
Sort of in line with what Alex mentions, the determining factor here is going to be the size of your various partitions (which is largely a function of the number and size of your columns).
Practically speaking, you can have problems going both ways - partitions that are too narrow can be as problematic as partitions that are too wide, so this is the type of thing you may want to try benchmarking and seeing which works best. I suspect for normal data models (staying away from the pathological edge cases), either will work just fine, and you won't see a meaningful difference (assuming 3.11).
In 3.11.x, Cassandra does a better job of skipping unrequested values than in 3.0.x, so if you do choose to combine it all into one table, consider using 3.11.2 or whatever the latest available release is in the 3.11 (or newer) branch.

Tableau Reading Data: Which table type is better?

What performs better (what returns queries faster) with Tableau (a read-only program) when Tableau is connected to tables of data through SQL Server? Multiple tall, thin tables that are joined or a single short and wide table?
The tall and thin tables have many rows but few columns and are joined. The short and wide table has fewer rows, but more columns.
I believe the tall and thin option returns queries faster because there is less redundant data, fewer columns (which makes indexing faster), fewer NULLs, and less indexing overhead (because there are fewer columns), but I need at least a second opinion, so please let me know yours.
The reason I'm interested in this question is to improve query performance for our clients when they query our server for data to render their visualizations.
It depends largely on what you're trying to achieve. For some applications, it's better to have fewer entries with many fields, and for others it's better to have many entries with fewer fields.
Keep in mind that Tableau is not like Excel or SQL; you should keep data manipulation to a minimum, as some calculations are not easy (or even possible) to do in Tableau, and some are possible but involve exporting data and reconnecting to it. Tableau should be used mostly for data visualization purposes.
Additionally, it's very troublesome to compare different measures in the same chart. Meaning, if you want to compare sum(A) to sum(B), you'll have to plot 2 different charts rather than putting both in the same one. I find it easier to have few measure fields and lots of dimensions, so that I can easily slice and compare measures. In the last example, instead of having 1 entry with A and B measures, I would have 2 entries: one with the A measure and one dimension (saying it is A that is being measured), and one with the B measure and one dimension (in the same respective fields).
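A minimal T-SQL sketch of that reshape, assuming a hypothetical metrics table with measure columns A and B; UNPIVOT turns one wide row into one tall row per measure, with the measure name becoming a dimension:

    -- Reshape a wide row (columns A, B) into tall rows (measure_name, measure_value).
    SELECT id, measure_name, measure_value
    FROM metrics
    UNPIVOT (
        measure_value FOR measure_name IN (A, B)
    ) AS u;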
BUT that doesn't mean you should always go with tall, thin tables. You need to see what you're trying to achieve and which format better suits your needs (and Tableau's design). And unless you're working with really big tables, your analyses are run many times a day (or in real time), and performance is a very big issue, you should focus on what makes your life easier (especially when you have to change and adapt analyses later on).
And for performance, in Tableau I follow 3 rules:
1) Always extract (data to a .tde) - it's way faster than most other formats (I haven't tested them all, but it's way faster than CSV, MDB, XLS, or a direct SQL connection)
2) Never use Tableau links - unless it does not affect performance (e.g., nomenclature for a low-range field), it's better that all your information is already in the same database
3) Remove the trash - it's very appealing to have all possible information in a database, but it also comes at a performance cost. I try to keep only the information necessary for the analysis, within the limits of the flexibility I need. Filtering the data is OK, putting the filter in context is better, but filtering in the extract or in the data source itself is the best solution
After lots of researching, I've found a general answer. Generally, and especially with SQL Server and Tableau, you want to steer towards normalizing your tables so you can avoid redundant data; your table then has less data to scan, making its queries faster to execute. However, you don't want to normalize your tables to the point where the joins between them actually cause the query to take longer than if it were just sent to one short, wide table. Ultimately, you're just going to have to test to see what amount of normalization/denormalization gives the quickest query return.

Querying large non-indexed tables

We are developing a CRUD-like web interface for our application. For this, we need to show data from different tables. Some are huge and very "alive", with many rows (millions). Some are small configuration tables.
Now we want to allow our users to filter, refine, sort, and paginate the grids we show. Based on the user's selections, we build SELECT queries.
For obvious reasons, filtering on non-indexed fields will produce rather long-running queries. On the other hand, indexing every column of a table looks a bit "weird". And we do have tables with more than 50 columns.
We are looking into Apache Lucene, but as far as I understand, it will help us with text indexing. But what about numbers, dates, and ranges? Are there any solutions or discussions available for this issue?
Also, I must point out that this issue is UX-specific only. For the application's own needs, we are doing fine.
You are correct: in general, you don't want to allow arbitrary predicates on non-indexed fields. However, how much effect this has depends heavily on table size, the database engine being used, and the machine driving the database. Some engines are not too bad with non-indexed columns, but in the worst case each query will degenerate into a sequential scan. Sequential scans aren't always as bad as they sound, either.
Some ideas:
Investigate using a column-store database engine; these store data column-wise rather than row-wise, which can be much faster for arbitrary predicates on non-indexed columns. Column stores aren't a universal solution, though, if you often need all the fields of a row.
Index the main columns that users will query, and indicate in the UX layer that queries on other columns will be slower. Users will be more accepting, especially if they know in advance that a query on a given column will be slow.
If possible, just throw memory at it. Engines like Oracle or SQL Server will be pretty good while most of your database fits in memory. The only problem is that once your database exceeds memory, performance will fall off a cliff (without warning).
Consider using vertical partitioning if possible. This lets you split a row into 2 or more pieces for storage, which can reduce I/O for predicates (see the sketch after this list).
I'm sure you know this, but make sure the columns used for joins are indexed.
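A minimal generic-SQL sketch of two of the ideas above, indexing the columns users filter on most and splitting a wide row vertically; all table and column names are hypothetical:

    -- Index the columns users filter on most often.
    CREATE INDEX idx_orders_status_created ON orders (status, created_at);

    -- Vertical partitioning: keep frequently filtered columns in a narrow "hot"
    -- table and move wide, rarely used columns into a companion table joined by id.
    CREATE TABLE orders_hot (
        id         BIGINT PRIMARY KEY,
        status     VARCHAR(20),
        created_at TIMESTAMP,
        amount     DECIMAL(12,2)
    );

    CREATE TABLE orders_cold (
        id          BIGINT PRIMARY KEY REFERENCES orders_hot(id),
        notes       TEXT,
        raw_payload TEXT
    );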

Best choice for a huge database table which contains only integers (have to use SUM() or AVG())

I'm currently using a MySQL table for an online game under LAMP.
One of the tables is huge (soon millions of rows) and contains only integers (IDs, timestamps, booleans, scores).
I did everything to never have to JOIN on this table. However, I'm worried about the scalability. I'm thinking about moving this single table to another faster database system.
I use intermediary tables to calculate the scores, but in some cases I have to use SUM() or AVG() directly on some filtered row sets of this table.
For you, what is the best database choice for this table?
My requirements/specs:
This table contains only integers (around 15 columns)
I need to filter by certain columns
I'd like to have UNIQUE KEYS
It would be nice to have "INSERT ... ON DUPLICATE KEY UPDATE", but I suppose my scripts can manage it by themselves.
I have to use SUM() or AVG()
thanks
Just make sure you have the correct indexes in place, so that selecting is quick.
Millions of rows in a table isn't huge. You shouldn't expect any problems selecting, filtering, or upserting data if you index on the relevant keys, as @Tom-Squires suggests.
Aggregate queries (sum and avg) may pose a problem though. The reason is that they require a full table scan and thus multiple fetches of data from disk to memory. A couple of methods to increase their speed:
If your data changes infrequently then caching those query results in your code is probably a good solution.
If it changes frequently, then the quickest way to improve performance is probably to ensure that your database engine keeps the table in memory. A quick calculation of expected size: 15 columns × 8 bytes × a few million rows ≈ hundreds of MB - not really an issue (unless you're on a shared host). If your RDBMS does not support tuning this for a specific table, then simply put the table in a different database schema - that shouldn't be a problem, since you're not doing any joins on it. Most engines will allow you to tune that.
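As a minimal MySQL sketch of the kind of table being described (the names, the unique key, and the supporting index are assumptions for illustration):

    -- Integer-only scores table with a unique key, an upsert, and an
    -- aggregate over a filtered, index-assisted subset.
    CREATE TABLE scores (
        player_id  INT UNSIGNED NOT NULL,
        game_id    INT UNSIGNED NOT NULL,
        score      INT NOT NULL,
        created_at INT UNSIGNED NOT NULL,   -- unix timestamp
        UNIQUE KEY uk_player_game (player_id, game_id),
        KEY idx_game_created (game_id, created_at)
    ) ENGINE=InnoDB;

    INSERT INTO scores (player_id, game_id, score, created_at)
    VALUES (42, 7, 100, UNIX_TIMESTAMP())
    ON DUPLICATE KEY UPDATE score = VALUES(score), created_at = VALUES(created_at);

    -- The (game_id, created_at) index keeps this from scanning the whole table.
    SELECT SUM(score), AVG(score)
    FROM scores
    WHERE game_id = 7 AND created_at >= UNIX_TIMESTAMP() - 86400;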

Table design and performance related to database?

I have a table with 158 columns in SQL Server 2005.
Are there any disadvantages to keeping so many columns?
Also, given that I have to keep those columns,
how can I improve performance - for example with stored procedures or indexes?
Wide tables can be quite performant when you usually want all the fields for a particular row. Have you traced your users' usage patterns? If they're usually pulling just one or two fields from multiple rows, then your performance will suffer. The main issue arises when your total row size hits the 8 KB page mark: SQL Server then has to hit the disk twice for every row (the first page plus an overflow page), and that's not counting any index hits.
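If you want to check whether rows are already spilling into overflow pages, a query along these lines (the table name is hypothetical) reports the allocation units involved:

    -- ROW_OVERFLOW_DATA allocation units indicate rows exceeding the 8 KB page.
    SELECT index_id, alloc_unit_type_desc, page_count
    FROM sys.dm_db_index_physical_stats(
             DB_ID(), OBJECT_ID('dbo.WideTable'), NULL, NULL, 'DETAILED');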
The guys at Universal Data Models will have some good ideas for refactoring your table. And Red Gate's SQL Refactor makes splitting a table heaps easier.
There is nothing inherently wrong with wide tables. The main case for normalization is database size, where lots of null columns take up a lot of space.
The more columns you have, the slower your queries will be.
That's just a fact. That isn't to say you aren't justified in having many columns; the statement above does not give one carte blanche to split one entity's worth of data from a table with many columns into multiple tables with fewer columns. The administrative overhead of such a solution would most probably outweigh any perceived performance gains.
My number one recommendation, based on my experience with abnormally wide tables (denormalized schemas of bulk-imported data), is to keep the columns as thin as possible. I had to work with a lot of messy data and left most of the columns as VARCHAR(255); I recommend against this. Although convenient for development purposes, performance would spiral out of control, especially when working with Perl. Shrinking the columns to their bare minimum (VARCHAR(18), for instance) helped immensely.
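A minimal T-SQL sketch of that kind of shrink, with hypothetical table and column names; checking the longest stored value first avoids truncation errors:

    -- Find the longest value currently stored before shrinking the column.
    SELECT MAX(LEN(OrderCode)) AS max_len FROM dbo.WideTable;

    -- Shrink the column from VARCHAR(255) to a size that actually fits the data
    -- (keep the column's existing nullability when you run this for real).
    ALTER TABLE dbo.WideTable ALTER COLUMN OrderCode VARCHAR(18) NOT NULL;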
Stored procedures are just batches of SQL commands; they don't have any direct effect on speed, other than that regular use of certain types of stored procedures will end up using cached query plans (which is a performance boost).
You can use indexes to speed up certain queries, but there's no hard and fast rule here. Good index design depends entirely on the type of queries you're running. Indexing will, by definition, make your writes slower; indexes exist only to make your reads faster.
The problem with having that many columns in a table is that finding rows using the clustered primary key can be expensive. If it is possible to change the schema, breaking this up into many normalized tables would be the best way to improve efficiency, and I would strongly recommend that course.
If not, then you may be able to use indexes to make some SELECT queries faster. If you have queries that only use a small number of the columns, adding indexes on those columns could mean that the clustered index does not need to be scanned. Of course, there is always a price to pay with indexes, in terms of storage space and INSERT, UPDATE, and DELETE time, so this may not be a good idea for you.
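A minimal T-SQL sketch of that approach, with hypothetical table and column names: a narrow nonclustered index that covers a frequently run query so the wide clustered index never has to be scanned.

    -- Covering index for a query that touches only a few of the many columns.
    CREATE NONCLUSTERED INDEX IX_WideTable_Status_Created
        ON dbo.WideTable (StatusCode, CreatedAt)
        INCLUDE (Amount);

    -- This query can be answered entirely from the narrow index above.
    SELECT StatusCode, CreatedAt, Amount
    FROM dbo.WideTable
    WHERE StatusCode = 3 AND CreatedAt >= '20240101';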

Resources