HanaDB - Complexity of: SELECT COUNT( * ) FROM dbtab

This question is the same as MySQL - Complexity of: SELECT COUNT(*) FROM MyTable;.
The difference is that instead of MySQL, I want to know the answer for HDB.
I Googled it and searched the SAP Knowledge Base without finding an answer.
To clarify: The question is regarding selecting count without any additional conditions:
SELECT COUNT( * ) FROM dbtab.
What is the complexity of the above query? Does HDB store a counter on top of each table?

HANA supports a large variety of table types; ROW, COLUMN, VIRTUAL, EXTENDED, and MULTISTORE tables come to mind here.
For some of those, the current raw record count is kept as part of the internal storage structures and does not need to be computed at query time.
This is specifically true for ROW and COLUMN tables.
VIRTUAL tables are at the other extreme and behave a lot more like complex views when it comes to SELECT count(*). Depending on the DB "behind" the virtual table, the performance of this can vary wildly!
Also, be careful about assuming that ROW and COLUMN store tables will return the information with nearly no effort. HANA is a shared-nothing distributed database (in scale-out setups), which means that this kind of information is only known to the node a table is located on. Finding out the row count of, say, a partitioned table with X partitions across Y nodes can take a considerable amount of time!
Finally, this raw record count is only available for tables that are currently in memory. Running a SELECT count(*) on a table that is currently unloaded will trigger the load of the columns that are required to answer that query (basically all primary key columns + some internal table management structures).
In the ideal case (a column table, loaded into memory, with all partitions on a single node) this query should return instantaneously; but the other scenarios mentioned need to be considered, too.
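As a quick illustration (a sketch that assumes the standard monitoring view M_TABLES is available on your release and that dbtab is a loaded ROW or COLUMN table), you can compare the plain count with the record count the system already tracks internally:
-- plain count; may trigger a column load if the table is currently unloaded
SELECT COUNT(*) FROM dbtab;
-- record count as kept in the internal storage structures
SELECT schema_name, table_name, record_count
FROM m_tables
WHERE table_name = 'DBTAB';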
Hope that answers the rather broad question.

Related

Azure Database Large Table Group By Performance

I'm looking for design and/or index recommendations for the problem listed below.
I have a couple of denormalized tables in an Azure S1 Standard (20 DTU) database. One of those tables has ~20 columns and a million rows. My application requirements call for sub-second (or at least close to it) querying of this table by any combination of columns in my WHERE clause, as well as sub-second (or at least close to it) querying of DISTINCT values in each column.
In order to picture the use case behind this, here is an example. Imagine you were using an HR application that allowed you to search for employees and view employee information. The employee table might have 5 columns and millions of rows. The application allows you to filter by any column, and provides an interface to allow this. Therefore, the underlying SQL queries that must be made are:
A GROUP BY (or DISTINCT) query for each column, which provides the interface with the available filter options
A general employee search query, that filters all rows by any combination of filters
In order to solve performance issues on the first set of queries, I've implemented the following:
Index columns with a large variety of values
Full-Text index columns that require string matching (so CONTAINS querying instead of LIKE)
Do not index columns with a small variety of values
In order to solve the performance issues on the second query, I've implemented the following:
Forcing the front end to use pagination, implemented using SELECT * FROM table OFFSET 0 ROWS FETCH NEXT n ROWS ONLY, and ensuring the order by column is indexed
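Concretely, that pagination pattern looks something like this (hypothetical table and column names; note that OFFSET/FETCH requires an ORDER BY clause):
-- page 4 with a page size of 50 rows
SELECT *
FROM dbo.Employee
ORDER BY LastName            -- the ORDER BY column should be indexed
OFFSET 150 ROWS              -- page number * page size
FETCH NEXT 50 ROWS ONLY;     -- page size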
Locally, this seemed to work fine. Unfortunately, an Azure Standard database doesn't have the same performance as my local machine, and I'm seeing issues. Specifically, the columns I am not indexing (the ones with a very small set of distinct values) take 30+ seconds to query. Additionally, while the paging is initially very quick, the query takes longer and longer the higher I increase the offset.
So I have two targeted questions, but any other advice or design suggestions would be most welcome:
How bad is it to index every column in the table? Note that the table does need to be updated, but the columns that I update won't actually be part of any filters or WHERE clauses. Will the indexes still need to be rebuilt on update? You can also safely assume that the table will not see any inserts/deletes, except for once a month when the entire table is truncated and rebuilt from scratch.
In regards to the paging getting slower and slower the deeper I get, I've read this is expected, but the performance becomes unacceptable at a certain point. Outside of making the sort column my clustered index column, are there any other suggestions to get this working?

Indexing and Partitioning: Improving SqlServer 2008 R2 query performance

I have a database in SQL Server 2008 R2 that is about 5 TB in size and grows continuously.
I have a problem running a simple query on tbl1, which has hundreds of millions of rows:
select x1,x2,x3
from tbl1
where date > '2017-04-03 00:00:00.000' and date < '2017-04-04 00:00:00.000'
and mid = 300
This query takes about 20 seconds.
I have two non-clustered indexes, on the date and mid columns, and this query takes advantage of them.
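For reference, those two indexes presumably look something like this (the index names are made up):
CREATE NONCLUSTERED INDEX IX_tbl1_date ON tbl1 ([date]);
CREATE NONCLUSTERED INDEX IX_tbl1_mid ON tbl1 (mid);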
What is the best idea for improving performance of select and insert in this table? (such as automatic partitioning)
I'm using Entity Framework, so I don't want to change the name of the table or partition it into several differently named tables.
I appreciate any help.
The way your question is stated it leads me to believe that you are under the impression that partitioning is something that you have to do manually, i.e. splitting a table into multiple tables each having a different name.
That's not the case.
With MS SQL Server, all you need to do in order to partition your tables and indexes is to issue the CREATE PARTITION commands. So go ahead and look them up:
CREATE PARTITION FUNCTION
CREATE PARTITION SCHEME
So, in your case I would presume that you would partition on the date column, probably putting each year on a different partition, or possibly even each month on a different partition.
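A minimal sketch of the yearly variant (object names, boundary dates, and the filegroup mapping are made up for illustration):
CREATE PARTITION FUNCTION pfByYear (datetime)
AS RANGE RIGHT FOR VALUES ('2015-01-01', '2016-01-01', '2017-01-01');
CREATE PARTITION SCHEME psByYear
AS PARTITION pfByYear ALL TO ([PRIMARY]);
-- the table's clustered index (or heap) then has to be created/rebuilt ON psByYear([date])
-- so that the data is actually stored partitioned by date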
However, be aware that your question might be a case of an X-Y problem. The difficulty you are having seems to be performance related. You appear to have arrived at the conclusion that what you need to do in order to solve your problem is partitioning, so you are posting a question about partitioning. I am answering your question, but it may well be the case that your problem is not partitioning. It could be a large number of other things, for example locking: if your table is so huge and it is continuously growing, then what is probably happening is that rows are being continuously added to it, so it could be that your SELECTs are fighting against your INSERTs for access to the table.

Best choice for a huge database table which contains only integers (have to use SUM() or AVG())

I'm currently using a MySQL table for an online game under LAMP.
One of the tables is huge (soon millions of rows) and contains only integers (IDs, timestamps, booleans, scores).
I did everything to never have to JOIN on this table. However, I'm worried about the scalability. I'm thinking about moving this single table to another faster database system.
I use intermediary tables to calculate the scores, but in some cases I have to use SUM() or AVG() directly on some filtered row sets of this table.
For you, what is the best database choice for this table?
My requirements/specs:
This table contains only integers (around 15 columns)
I need to filter by certain columns
I'd like to have UNIQUE KEYS
It would be nice to have "INSERT ... ON DUPLICATE KEY UPDATE", but I suppose my scripts can manage it by themselves (see the sketch after this list).
I have to use SUM() or AVG()
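For reference, a sketch of the MySQL upsert mentioned above (table and column names are made up for illustration):
-- assumes a UNIQUE key on (player_id, game_id); the UPDATE branch fires when that key already exists
INSERT INTO player_scores (player_id, game_id, score)
VALUES (42, 7, 1250)
ON DUPLICATE KEY UPDATE score = VALUES(score);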
Just make sure you have the correct indexes in place, and selecting should be quick.
Millions of rows in a table isn't huge. You shouldn't expect any problems selecting, filtering or upserting data if you index on relevant keys, as @Tom-Squires suggests.
Aggregate queries (sum and avg) may pose a problem though. The reason is that they require a full table scan and thus multiple fetches of data from disk to memory. A couple of methods to increase their speed:
If your data changes infrequently then caching those query results in your code is probably a good solution.
If it changes frequently, then the quickest way to improve their performance is probably to ensure that your database engine keeps the table in memory. A quick calculation of expected size: 15 columns x 8 bytes x millions =~ hundreds of MB - not really an issue (unless you're on a shared host). If your RDBMS does not support tuning this for a specific table, then simply put it in a different database schema - that shouldn't be a problem since you're not doing any joins on this table. Most engines will allow you to tune that.
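As a rough illustration of the "keep it in memory" point for the asker's MySQL/InnoDB setup (table name and sizes are examples, not measurements):
-- how big is the table, data plus indexes?
SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024) AS size_mb
FROM information_schema.tables
WHERE table_name = 'scores';
-- then make sure the InnoDB buffer pool comfortably exceeds that, e.g. in my.cnf:
-- innodb_buffer_pool_size = 1G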

How to manage millions/billions of small values in a "database"

I have an application that will generate millions of date/type/value entries. We don't need to do complex queries; only, for example, get the average value per day of type X between date A and B.
I'm sure a normal DB like MySQL isn't the best for handling this sort of thing. Is there a better system suited to this sort of data?
EDIT: The goal is not to say that a relational database cannot handle my problem, but to find out whether another type of database (key/value, NoSQL, document-oriented, ...) might be better suited to what I want to do.
If you are dealing with a simple table such as this:
CREATE TABLE myTable (
[DATE] datetime,
[TYPE] varchar(255),
[VALUE] varchar(255)
)
Creating an index, probably on TYPE, DATE, VALUE - in that order - will give you good performance on the query you've described. Use explain plan or whatever the equivalent is on the database you're working with to review the performance metrics. And set up a scheduled task to defragment that index regularly - the frequency will depend on how often inserts, deletes and updates occur.
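A sketch of that index together with the "average value per day of type X between dates A and B" query from the question (SQL Server syntax to match the table above; VALUE is stored as varchar, so it needs a cast before AVG):
CREATE INDEX IX_myTable_type_date_value ON myTable ([TYPE], [DATE], [VALUE]);
-- average value per day for one type within a date range
SELECT CAST([DATE] AS date) AS [day],
       AVG(CAST([VALUE] AS float)) AS avg_value
FROM myTable
WHERE [TYPE] = 'X'
  AND [DATE] >= '2011-02-01' AND [DATE] < '2011-03-01'
GROUP BY CAST([DATE] AS date)
ORDER BY [day];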
As far as an alternative persistence store (i.e. NoSQL) goes, you don't gain anything. NoSQL shines when you want schema-less storage, in other words when you don't know the entity definitions ahead of time. But from what you've described, you have a very clear picture of what you want to store, which lends itself well to a relational database.
Now, possibilities for scaling over time include partitioning and splitting each TYPE into a separate table. The partitioning piece could be done by type and/or date. It really depends on the nature of the queries you're dealing with (if you typically query for values within the same year, for instance) and on what your database offers in that regard.
MS SQL Server and Oracle offer the concept of partitioned tables and indexes.
In short: you can group your rows by some value, e.g. by year and month. Each group is then accessible as a separate table with its own index, so you can list, summarize and edit the February 2011 sales without accessing all the other rows. Partitioned tables complicate the database, but in the case of extremely long tables they can lead to significantly better performance.
Based on cost you can choose either MySQL or SQL Server. In either case you have to be clear about what you want to achieve with the database; if it is just for storage, then any RDBMS can handle it.
You could store the data as fixed-length records in a file.
Do a binary search on the file (opened for random access) to find your start and end records, then sum the appropriate field, for the given condition, over all records between your start and end positions in the file.

Is it possible to partition more than one way at a time in SQL Server?

I'm considering various ways to partition my data in SQL Server. One approach I'm looking at is to partition a particular huge table into 8 partitions, then within each of these partitions to partition on a different partition column. Is this even possible in SQL Server, or am I limited to defining one partition column+function+scheme per table?
I'm interested in the more general answer, but this strategy is the one I'm considering for a Distributed Partitioned View (DPV), where I'd partition the data under the first scheme, using the DPV to distribute the huge amount of data over 8 machines, and then on each machine partition that portion of the full table on another partition key in order to be able to drop (for example) sub-partitions as required.
You are incorrect that the partitioning key cannot be computed. Use a computed, persisted column for the key:
ALTER TABLE MYTABLE ADD PartitionID AS ISNULL(Column1 * Column2, 0) PERSISTED;
I do it all the time, very simple.
The DPV across a set of partitioned tables is your only clean option to achieve this: something like a DPV across tblSales2007, tblSales2008, tblSales2009, where each of the respective sales tables is partitioned again, but potentially by a different key. There are some very good benefits to doing this in terms of operational resilience (one partitioned table going offline does not take the DPV down; it can still satisfy queries for the other timelines).
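A rough sketch of the view side of that setup (local flavour; a true DPV references member tables on linked servers, and each member table needs a CHECK constraint on the partitioning column so the optimizer can skip irrelevant members):
CREATE VIEW dbo.vSales
AS
SELECT * FROM dbo.tblSales2007
UNION ALL
SELECT * FROM dbo.tblSales2008
UNION ALL
SELECT * FROM dbo.tblSales2009;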
The hack option is to create an arbitrary hash of 2 columns, store it per record, and partition by it. You would have to generate this hash for every query / insertion etc., since the partition key cannot be computed; it must be a stored value. It's a hack, and I suspect it would lose more performance than you would gain.
You do have to be thinking about specific management issues / DR over these data quantities though. If the data volumes are very large and you are accessing them in a primarily read-only manner, then you should look into SQL 'Madison', which will scale enormously in both number of rows and overall size of data. But it really only suits the 99.9%-read type of data warehouse; it is not suitable for OLTP.
I have production data sets sitting in the 'billions' bracket, and they reside on partitioned table systems and provide very good performance - although much of this is based on the hardware underlying a system, not the database itself. Scaling up to this level is not an issue, and I know of others who have gone well beyond those quantities as well.
The max number of partitions per table remains at 1000; from what I remember of a conversation about this, it was a figure set by the testing performed, not a figure in place due to a technical limitation.
