My application (industrial automation) uses SQL Server 2017 Standard Edition on a Dell T330 server with the following configuration:
Xeon E3-1200 v6
16 GB DDR4 UDIMMs
2 x 2 TB 7200 RPM HDD (RAID 1)
In this database, I am saving to the following tables:
Table: tableHistory
Insert interval: every 2 seconds
410 columns of type float
409 columns of type int
--
Table: tableHistoryLong
Insert interval: every 10 minutes
410 columns of type float
409 columns of type int
--
Table: tableHistoryMotors
Insert interval: every 2 seconds
328 columns of type float
327 columns of type int
--
Table: tableHistoryMotorsLong
Insert interval: every 10 minutes
328 columns of type float
327 columns of type int
--
Table: tableEnergy
Insert interval: every 700 milliseconds
220 columns of type float
219 columns of type int
Note:
When I generate reports/graphs, my application buffers the pending inserts, because the system cannot insert and query at the same time; the queries are quite heavy.
The columns hold values such as current, temperature, level, etc. This information is retained for one year.
Question
With this level of processing, could I run into performance problems?
Do I need better hardware for this level of demand?
Could my application break at some point because of the hardware?
Your question may be closed as too broad, but I want to elaborate on the comments and offer additional suggestions.
How much RAM you need for adequate performance depends on the reporting queries. Factors include the number of rows touched, the execution plan operators (sort, hash, etc.), and the number of concurrent queries. More RAM can also improve performance by avoiding I/O, which is especially costly with spinning media.
A reporting workload (large scans) against a 1-2 TB database with traditional tables needs fast storage (SSD) and/or more RAM (hundreds of GB) to provide decent performance. The existing hardware is the worst case scenario because data are unlikely to be cached with only 16 GB RAM, and a single spindle can only read about 150 MB per second. Based on my rough calculation of the schema in your question, a monthly summary query of tableHistory will take about a minute just to scan 10 GB of data (assuming a clustered index on a date column). Query duration will increase with the number of concurrent queries, so it would take at least 5 minutes per query with 5 concurrent users running the same query due to disk bandwidth limitations. SSD storage can sustain multiple GB per second, so with the same query and RAM, the data transfer time for the query above would be under 5 seconds.
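For reference, here is the rough arithmetic behind that 10 GB figure (a back-of-envelope sketch assuming a rowstore table, 8-byte floats and 4-byte ints; exact numbers depend on nullability and row overhead):
row size ≈ 410 x 8 + 409 x 4 ≈ 4,916 bytes, so only one row fits on an 8 KB page
rows per month ≈ (86,400 / 2) x 30 ≈ 1.3 million
scan size per month ≈ 1.3 million x 8 KB ≈ 10 GB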
A columnstore (e.g. a clustered columnstore index), as suggested by @ConorCunninghamMSFT, will greatly reduce the amount of data transferred from storage because only data for the columns specified in the query are read, and the inherent columnstore compression
will reduce both the size of data on disk and the amount transferred from disk. The compression savings will depend greatly on the actual column values, but I'd expect 50 to 90 percent less space compared to a rowstore table.
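If the table currently has no clustered rowstore index, converting it is a one-statement change. A minimal sketch, using the dbo schema and the table name from the question (everything else about your schema is assumed):

-- convert tableHistory to a clustered columnstore
-- (if a rowstore clustered index already exists, drop it first or recreate it
--  under the same name with WITH (DROP_EXISTING = ON))
CREATE CLUSTERED COLUMNSTORE INDEX ccix_tableHistory
    ON dbo.tableHistory;

SQL Server 2017 Standard Edition does support columnstore indexes, though with tighter resource limits than Enterprise.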
Reporting queries against measurement data are likely to specify date range criteria, so partitioning the columnstore by date will limit scans to the specified date range without a traditional b-tree index. Partitioning will also facilitate purging for the 12-month retention requirement with sliding window partition maintenance (partition TRUNCATE, MERGE, SPLIT), which performs much better than a DELETE query.
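A rough T-SQL sketch of that sliding window, with illustrative names and only a few monthly boundaries shown (a real function would carry one boundary per retained month):

-- monthly partition function and scheme on the insert timestamp column
CREATE PARTITION FUNCTION pf_HistoryMonth (datetime2(0))
    AS RANGE RIGHT FOR VALUES ('2019-01-01', '2019-02-01', '2019-03-01');

CREATE PARTITION SCHEME ps_HistoryMonth
    AS PARTITION pf_HistoryMonth ALL TO ([PRIMARY]);

-- the clustered columnstore index from the previous sketch would then be created
-- ON ps_HistoryMonth(RecordTime), where RecordTime is the (assumed) insert timestamp column

-- monthly purge: empty the oldest partition, then remove its boundary
TRUNCATE TABLE dbo.tableHistory WITH (PARTITIONS (1));
ALTER PARTITION FUNCTION pf_HistoryMonth() MERGE RANGE ('2019-01-01');

-- and open a partition for the upcoming month
-- (an ALTER PARTITION SCHEME ps_HistoryMonth NEXT USED [PRIMARY]; may be needed first)
ALTER PARTITION FUNCTION pf_HistoryMonth() SPLIT RANGE ('2019-04-01');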
Related
Suppose I have a table with ten billion rows spread across 100 machines. The table has the following structure:
PK1 PK2 PK3 V1 V2
Where PK represents a partition key and V represents a value. So in the above example, the partition key consists of 3 columns.
Scylla requires that you specify all columns of the partition key in the WHERE clause.
If you want to execute a query while specifying only some of the columns, you'd get a warning, as this requires a full table scan:
SELECT V1, V2 FROM table WHERE PK1 = X AND PK2 = Y
In the above query, we only specify 2 out of 3 columns. Suppose the query matches 1 billion out of 10 billion rows - what is a good mental model to think about the cost/performance of this query?
My assumption is that the cost is high: it is equivalent to executing ten billion separate queries on the data set, since 1) there is no logical association between the rows in the way they are stored on disk, as each row has a different partition key (high cardinality), and 2) in order for Scylla to determine which rows match the query it has to scan all 10 billion rows (even though the result set only matches 1 billion rows).
Assuming a single server can process 100K transactions per second (well within the range advertised by ScyllaDB folks) and the data resides on 100 servers, the (estimated) time to process this query can be calculated as: 100K * 100 = 10 million queries per second. 10 billion divided by 10M = 1,000 seconds. So it would take the cluster approx. 1,000 seconds to process the query (consuming all of the cluster resources).
Is this correct? Or is there any flaw in my mental model of how Scylla processes such queries?
Thanks
As you suggested yourself, Scylla (and everything I will say in my answer also applies to Cassandra) keeps the partitions hashed by the full partition key - containing three columns. So Scylla has no efficient way to scan only the matching partitions. It has to scan all the partitions and check for each of them whether its partition key matches the request.
However, this doesn't mean that it's as grossly inefficient as "executing ten billion separate queries on the data". A scan of ten billion partitions is usually (when each row's data itself isn't very large) much more efficient than executing ten billion random-access reads, each reading a single partition individually. There's a lot of work that goes into random-access reads - Scylla needs to reach a coordinator, which then sends the request to replicas, each replica needs to find the specific position in its on-disk data files (often multiple files), often needs to over-read from the disk (as disk and compression alignments require), and so on. Compare this to a scan, which can read long contiguous swathes of data sorted by tokens (partition-key hash) from disk and can return many rows fairly quickly with fewer I/O operations and less CPU work.
So if your example setup can do 100,000 random-access reads per node, it can probably read a lot more than 100,000 rows per second during a scan. I don't know which exact number to give you, but in the blog post https://www.scylladb.com/2019/12/12/how-scylla-scaled-to-one-billion-rows-a-second/ we (full disclosure: I am a ScyllaDB developer) showed an example use case scanning a billion (!) rows per second with just 83 nodes - that's 12 million rows per second on each node instead of your estimate of 100,000. So your example use case could potentially be over in just 8.3 seconds, instead of the 1,000 seconds you calculated.
Finally, please don't forget (this is also mentioned in the aforementioned blog post) that if you do a large scan you should explicitly parallelize, i.e., split the token range into pieces and scan them in parallel. First of all, obviously no single client will be able to handle the results of scanning a billion partitions per second, so this parallelization is more-or-less unavoidable. Second, scanning returns partitions in partition order, which (as I explained above) sit contiguously on individual replicas - which is great for peak throughput but also means that only one node (or even one CPU) will be active at any time during the scan. So it's important to split the scan into pieces and do all of them in parallel. We also had a blog post about the importance of parallel scan, and how to do it: https://www.scylladb.com/2017/03/28/parallel-efficient-full-table-scan-scylla/.
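To make the parallelization concrete, each sub-query restricts the scan to one slice of the token ring. This is only a sketch of the pattern (the column names come from the question, the token bounds are placeholders, and these are not the exact queries from the blog post):

SELECT PK1, PK2, PK3, V1, V2
FROM table
WHERE token(PK1, PK2, PK3) >= -9223372036854775808  -- start of this slice
  AND token(PK1, PK2, PK3) < -9151314442816847872;  -- start of the next slice

-- run many such slices concurrently from several client threads/machines,
-- and keep only the rows where PK1 = X and PK2 = Y on the client side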
Another option is to move one of the partition key columns to become a clustering key; this way, if you have the first two PKs, you'll be able to locate the partition and just search within it.
Summary
I have a simple Power BI report with 3 KPIs that takes 0.3 seconds to refresh. Once I move the data to an Azure Dedicated SQL Pool, the same report takes 6 seconds to refresh. The goal is to have a report that loads in less than 5 seconds with 10-15 KPIs and up to 10 filters applied, connected to much larger datasets (several million rows).
How to improve performance of Power BI connected via Direct Query to Azure Dedicated SQL pool?
More details
Setup: Dedicated SQL pool > Power BI (Direct Query)
Data: 3 tables (columns x rows: 4x136 user table, 4x3,000 date table, 8x270,000 main table)
Data model: Main table to Date table connection - many-to-one based on “date” type field. Main to User connection - many-to-one based on user_id (varchar(10)). Other data types used: date, integers, varchars (50 and less).
Dedicated SQL pool performance level is Gen2: DW200c. However, the number of DWUs used never exceeds 50, most often it stays under 15.
Report:
The report includes 3 KPIs that are calculated based on the same formula (just with “5 symbols text” being different):
KPI 1 :=
CALCULATE (
    COUNTA ( 'Table Main'[id] ),
    CONTAINSSTRING ( 'Table Main'[varchar 50 field 0], "5 symbols text" ),
    ISBLANK ( 'Table Main'[varchar 50 field 1] ),
    ISBLANK ( 'Table Main'[varchar 50 field 2] )
)
The report includes 6 filters: 5 “basic filtering” with 1 or several values chosen, 1 “string contains” filter.
Problem:
The visual displaying the 3 KPIs (just 3 numbers) takes 6…8 seconds to load, which is a lot, especially given the need to add more KPIs and filters (compared to 0.3 seconds if all data is loaded into the .pbix file). It is a problem because adding more KPIs to the page (10-15) increases the load time proportionally, or even more for KPIs that are more complicated to calculate, and the reports become unusable.
Question:
Is it possible to significantly improve the performance of the report/AAS/SQL pool (2…10 times faster)? How?
If not, is it possible to somehow cache the calculated KPIs / visual contents in the report or AAS without querying the data every time and without keeping the data in pbix or AAS model?
Solutions tried and not working:
Sole use and different combinations of clustered columnstore, clustered rowstore, and non-clustered rowstore indexes. Automated statistics on/off. Automated indexes and stats do give an improvement of 10…20%, but it is definitely not enough.
A simple list of values (1 column from any table) takes 1.5 to 4 seconds to load.
What I have tried
Moving SQL pool from West Europe to France and back. No improvements
Applying indexes: row- and columnstore, clustered and non-clustered, manual and automatically defined (incl. statistics) - gives a 10...20% improvement in performance, which does not resolve the issue.
Changing resource classes: smallrc, largerc, xlargerc. % of DWUs used still never exceeds 50 (out of 200). No improvements
Shrinking data formats and removing excessive data: minimal nvarchar(n) possible, the biggest is nvarchar (50), all excessive columns have been removed. No improvements
Isolating the model: I have a larger data model; for testing purposes I have isolated 3 tables into a separate model to make sure other parts do not affect the performance. No improvements.
Reducing the number of KPIs and filters
With only 2 report filters left (main table fields only) the visual takes 2 seconds to load; with 2 more filters on the connected date table, 2.5 sec; with 2 more filters on the user table, 6 sec. This means I could only build reports with 1-2 filters, which is not acceptable.
It's a bit of a trial and error process, unfortunately. Here are a few things that are not in your list already:
many-to-one based on user_id (varchar(10)) -> Add a numeric column which is a hash of the user_id column and use that to join instead of the varchar column (see the sketch after this list).
Ensure your statistics are up to date.
Try Dual mode. Load smaller dimension tables in-memory and keep fact tables in DB.
Use aggregations so that, unless a user is trying to drill down, the report is populated without querying the DB.
Partition your fact table by appropriate column.
Make sure you're using the right distribution for your fact table and have chosen the right column.
Be careful with partitioning and distribution. Synapse is designed a little differently than traditional RDBMSs (like MySQL); it's closer to NoSQL DBs to some extent, but not completely. So understand how these concepts work in Synapse before using them (or you might get worse performance)!
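To make a few of these concrete, here is a rough dedicated SQL pool sketch combining the numeric hash key, distribution and partitioning points. Every name in it (table, columns, partition boundaries) is an assumption, not something taken from your model:

-- rebuild the main (fact) table with an explicit distribution and a numeric user key
CREATE TABLE dbo.MainTable_New
WITH
(
    DISTRIBUTION = HASH (user_key),                   -- numeric key distributes and joins better than varchar(10)
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION (event_date RANGE RIGHT FOR VALUES
        ('2023-01-01', '2023-04-01', '2023-07-01'))   -- only worthwhile once the table reaches many millions of rows
)
AS
SELECT
    CAST(HASHBYTES('SHA2_256', user_id) AS BIGINT) AS user_key,  -- numeric hash of user_id
    m.*
FROM dbo.MainTable AS m;

The same hashed key would also have to be added to the user dimension so the Power BI relationship joins on the numeric column instead of the varchar.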
I am modelling for the database CrateDB.
I have an average of 400 customers and they produce different amounts of time-series data every day (between 5K and 500K rows per day per customer; avg. ~15K).
Later I should be able to query per customer_year_month and per customer_year_calendar_week.
That means that I will only query for the intervals:
week
and month
Now I am asking myself: how should I partition this table?
I would partition by customer and year.
Does this make sense?
Or would it be better to partition by customer, year, and month?
So, the question of how to partition a table is quite complex and should consider a lot of things. Among others:
What queries should be run?
The way the data is inserted
Available hardware resources
Cluster size
Essentially, each partition also creates overhead by multiplying the shard count (a partition can be considered a "sub-table" based on a column value), which - if chosen improperly - can hinder performance a lot.
So in your case, 15k inserts a day is not too much; however, the distribution of inserts might cause problems: a customer partition that grows by 500k inserts a day will run into performance problems earlier than one growing by 5k. As a consequence, I would use weekly partitioning only.
create table "customer-logging" (
    customer_id long,
    log string,
    ts timestamp,
    week as date_trunc('week', ts)
) partitioned by (week) into 8 shards
Please only use 8 shards if you have an appropriate amount of CPU cores ;)
Docs: date_trunc(), partitioned tables
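To illustrate the calendar-week access pattern, a hedged query sketch against the table above (the customer id is a placeholder, and the literal must be the Monday that date_trunc('week', ...) produces so the partition gets pruned):

select week, count(*) as events
from "customer-logging"
where customer_id = 42
  and week = '2019-06-03'
group by week;

For the customer_year_month case you would filter on the range of weeks falling in that month in the same way.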
Ideally you try out a few different combinations and find what works best for you. Insights into shard sizes and locations are provided by our sys tables, so you can see if there's a particularly fat shard that overloads a node ;)
Cheers, Claus
There is a table with 5 columns and no more. The size of each row is less than 200 bytes, but the number of rows in the table may grow to several tens of billions over time.
The application will be storing data at a rate of 100 rows per second or more. Once these data are stored, they will never be updated, but they will be removed after 1 year. They will not be read many times, but they may be queried by selecting within a time range, e.g. selecting rows for a given hour on a given day.
Questions
Which type of NoSQL database is suited for this?
Which of these databases would be best suited? (Doesn't have to be listed)
If your Oracle license includes the partitioning option, partition by month or year, and if most/all of your queries include the date column you partitioned on, that will help dramatically. It also makes dropping a year's worth of data take a few seconds.
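A rough Oracle sketch of what that could look like with interval partitioning; the table, columns and dates are assumptions, not taken from the question:

CREATE TABLE sensor_log (
    id          NUMBER,
    device_id   NUMBER,
    reading     NUMBER,
    payload     VARCHAR2(100),
    created_at  TIMESTAMP NOT NULL
)
PARTITION BY RANGE (created_at)
INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
(
    PARTITION p_initial VALUES LESS THAN (TIMESTAMP '2022-01-01 00:00:00')
);

-- retention purge: dropping the month that has aged out takes seconds,
-- unlike deleting hundreds of millions of rows
ALTER TABLE sensor_log DROP PARTITION FOR (TIMESTAMP '2022-02-15 00:00:00') UPDATE GLOBAL INDEXES;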
As others noted in the comments, it depends on how much data is being returned by a query. If a query is returning millions of rows, then yes, it may take 15 minutes. Oracle can handle queries against billion-row tables in a few seconds if the criteria are restrictive enough, appropriate indexes are present, and statistics are gathered appropriately.
So how many rows are returned by your 15-minute query?
I have a growing table storing time series data, 500M entries now, and 200K new records every day. The total size is around 15GB for now.
My clients are querying the table via a PHP script mostly, and the size of the result set is around 10K records (not very large).
select * from T where timestamp > X and timestamp < Y and additionalFilters
And I want this operation to be cheap.
Currently my table is hosted in Postgres 7, on a single box with 16 GB of memory, and I would love to see some good suggestions for hosting this at low cost while also allowing me to scale up for performance if needed.
The table serves:
1. Query: 90%
2. Insert: 9.9%
3. Update: 0.1% <-- very rare.
PostgreSQL 9.2 supports partitioning and partial indexes. If there are a few hot partitions, and you can put those partitions or their indexes on a solid state disk, you should be able to run rings around your current configuration.
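A rough sketch of what that could look like on 9.2, which only has inheritance-based partitioning (declarative PARTITION BY arrived in version 10). The child table, index, and fast_ssd tablespace names are assumptions; T and timestamp come from the query in the question:

-- one child table per month, attached to the parent table T
CREATE TABLE t_2013_06 (
    CHECK ("timestamp" >= '2013-06-01' AND "timestamp" < '2013-07-01')
) INHERITS (T);

-- keep the hot partition and its index on SSD
ALTER TABLE t_2013_06 SET TABLESPACE fast_ssd;
CREATE INDEX t_2013_06_ts_idx ON t_2013_06 ("timestamp") TABLESPACE fast_ssd;

-- or, if you keep one big table instead, a partial index keeps the hot range's index small
CREATE INDEX t_recent_ts_idx ON T ("timestamp") WHERE "timestamp" >= '2013-06-01';

With constraint_exclusion = partition (the default), queries whose WHERE clause includes the timestamp range only touch the relevant child tables.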
There may or may not be a low cost, scalable option. It depends on what low cost and scalable mean to you.