Memory Size in Delta after Delta Merge

On my HANA database, I executed a delta merge on a few tables on which I had run a number of INSERT statements.
I see that the merge command completed successfully.
On the other hand, when I query the M_CS_TABLES view, I see that MEMORY_SIZE_IN_DELTA is greater than zero and is a high percentage of MEMORY_SIZE_IN_MAIN. I was expecting to see 0, or at least a much smaller percentage.
Could you please help me to understand this delta merge and memory size in delta issue?
I created sample column tables in my schema on a HANA database and populated them with data using INSERT commands. Then I executed the following command:
MERGE DELTA OF "SALESORDERHEADER";
To query merge statistics for the column tables, I ran:
select * from M_CS_TABLES where schema_name = Current_schema;
Although the table sizes are quite small, I expected the delta or row-store section of the tables to be near zero.
Additionally, RAW_RECORD_COUNT_IN_DELTA is 0 for all tables; as I understand it, this means there are no records waiting in delta to be merged.
For column based statistics, I executed
select * from M_CS_ALL_COLUMNS
where schema_name = Current_schema and table_name = 'SALESORDERHEADER';
Output is as follows (the screenshot is not preserved here):

Thanks for adding the information.
When you look at the sizes of the delta stores for each column (M_CS_ALL_COLUMNS), you see that most of the values are really close to 8K (8192 bytes). That's no accident: it is the internal minimal allocation size for delta store structures.
So, yes, even if there are no entries left to be merged, each column of a column store table still has a delta store memory structure of at least 8K "hanging on to it".
Your remark about the relative size of this delta store is valid, but it is important to recognize that this table is tiny. 480KB is virtually nothing in terms of database tables.
And for such tiny tables, column store structures are generally less space-efficient than other data structures. This of course changes once you load data volumes for which you want or need a column store. Then compression, parallelization, CPU cache friendliness, etc. all far outweigh the additional memory used for empty delta stores.
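To make the proportions concrete, here is a rough back-of-the-envelope sketch in Python. It assumes one empty delta structure of about 8 KB per column, as described above; the column count and main-store sizes are hypothetical, and real allocations vary by column type and HANA version.

```python
# Rough lower bound on delta-store memory for a fully merged column table.
# Assumption: every column keeps one empty delta structure of about 8 KB.
MIN_DELTA_BYTES_PER_COLUMN = 8 * 1024  # 8192 bytes


def min_delta_bytes(column_count: int) -> int:
    """Memory held by empty delta stores, even with zero rows left to merge."""
    return column_count * MIN_DELTA_BYTES_PER_COLUMN


def delta_share(column_count: int, main_bytes: int) -> float:
    """Fraction of (main + delta) memory taken by the empty delta stores."""
    delta = min_delta_bytes(column_count)
    return delta / (delta + main_bytes)


if __name__ == "__main__":
    # Hypothetical table shapes: 20 columns, tiny vs. large main store.
    for main_bytes in (300 * 1024, 3 * 1024**3):
        share = delta_share(20, main_bytes)
        print(f"main={main_bytes:>12} B  delta share={share:.4%}")
```

The fixed per-column overhead dominates a tiny table and all but disappears once the main store holds gigabytes.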
Anyhow, all good here, nothing to see ... move on :-D

Related

HanaDB - Complexity of: SELECT COUNT( * ) FROM dbtab

This question is the same as MySQL - Complexity of: SELECT COUNT(*) FROM MyTable;.
The difference is that instead of MySQL, I want to know the answer for HDB.
I Googled it, and looked for it in SAP Knowledge Base without finding an answer.
To clarify: The question is regarding selecting count without any additional conditions:
SELECT COUNT( * ) FROM dbtab.
What is the complexity of the above query? Does HDB store a counter on top of each table?
HANA supports a large variety of table types, e.g. ROW-, COLUMN-, VIRTUAL-, EXTENDED-, and MULTISTORE-tables come to mind here.
For some of those, the current raw record count is kept as part of the internal storage structures and does not need to be computed at query time.
This is specifically true for ROW and COLUMN tables.
VIRTUAL tables are on the extreme other end and behave a lot more like complex views when it comes to SELECT count(*). Depending on the DB "behind" the virtual table, the performance of this can vary wildly!
Also, be careful assuming that ROW and COLUMN store tables will return the information with nearly no effort. HANA is a shared-nothing distributed database (in scale-out setups), which means that this kind of information is only known to the node that a table is located on. Finding out the row count of, e.g. a partitioned table with X partitions on Y number of nodes can take a considerable amount of time!
Finally, this raw record count is only available for tables that are currently in memory. Running a SELECT count(*) on a table that is currently unloaded will trigger the load of the columns that are required to answer that query (basically all primary key columns + some internal table management structures).
In the ideal case (a column table, loaded to memory, with all partitions on a single node) this query should return instantaneously, but the other scenarios mentioned above need to be considered, too.
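As a toy illustration of the scale-out point (not HANA's actual implementation), the Python sketch below models each partition as holding its own stored row counter on exactly one node. The class and function names are hypothetical.

```python
# Toy model: each partition stores its own row counter and lives on one node.
# A COUNT(*) only sums counters, but must ask every node hosting a partition.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Partition:
    node: str
    row_count: int  # maintained on insert/delete, not recomputed per query


def count_star(partitions: List[Partition]) -> Tuple[int, int]:
    """Return (total rows, number of distinct nodes that must be contacted)."""
    total = sum(p.row_count for p in partitions)
    nodes_contacted = len({p.node for p in partitions})
    return total, nodes_contacted


if __name__ == "__main__":
    parts = [Partition("node1", 10), Partition("node2", 5), Partition("node1", 1)]
    print(count_star(parts))
```

Summing the counters is cheap; the cost in a distributed setup comes from the round trips to each node, which is why the single-node case is the fast one.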
Hope that answers the rather broad question.

Maintenance commands on Redshift

I find myself dealing with a Redshift cluster with 2 different types of tables: the ones that get fully replaced every day and the ones that receive a merge every day.
From what I understand so far, there are maintenance commands that should be given since all these tables have millions of rows. The 3 commands I've found so far are:
vacuum table_name;
vacuum reindex table_name;
analyze table_name;
Which of those commands should be applied in which circumstances? I'm planning on running them daily, after the tables load in the middle of the night. The reason to do it daily is that after running some of these manually, there was a huge performance improvement.
After reading the documentation, I feel it's not very clear what the standard procedure should be.
All the tables have interleaved sortkeys regardless of the load type.
A quick summary of the commands, from the VACUUM documentation:
VACUUM: Sorts the specified table (or all tables in the current database) and reclaims disk space occupied by rows that were marked for deletion by previous UPDATE and DELETE operations. A full vacuum doesn't perform a reindex for interleaved tables.
VACUUM REINDEX: Analyzes the distribution of the values in interleaved sort key columns, then performs a full VACUUM operation.
ANALYZE: Updates table statistics for use by the query planner.
It is good practice to perform an ANALYZE when significant quantities of data have been loaded into a table. In fact, Amazon Redshift will automatically skip the analysis if less than 10% of data has changed, so there is little harm in running ANALYZE.
You mention that some tables get fully replaced every day. This should be done either by dropping and recreating the table, or by using TRUNCATE. Emptying a table with an unqualified DELETE (no WHERE clause) is less efficient and should be avoided.
VACUUM can take significant time. In situations where data is being appended in time-order and the table's SORTKEY is based on the time, it is not necessary to vacuum the table. This is because the table is effectively sorted already. This, however, does not apply to interleaved sorts.
Interleaved sorts are more tricky. From the sort key documentation:
An interleaved sort key gives equal weight to each column in the sort key, so query predicates can use any subset of the columns that make up the sort key, in any order.
Basically, interleaved sorts use a fancy algorithm to sort the data so that queries based on any of the columns (individually or in combination) will minimize the number of data blocks that need to be read from disk. Disk access usually takes the most time in a database, so minimizing disk access is the best way to speed up the database. Amazon Redshift uses Zone Maps to identify which blocks to read from disk, and the best way to minimize such access is to sort the data and then skip over as many blocks as possible when performing queries.
Interleaved sorts are less performant than normal sorts, but give the benefit that multiple fields are fairly well sorted. Only use interleaved sorts if you often query on many different fields. The overhead in maintaining an interleaved sort (via VACUUM REINDEX) is quite high and should only be done if the reindex effort is worth the result.
So, in summary:
ANALYZE after significant data changes
VACUUM regularly if you delete data from the table
VACUUM REINDEX if you use Interleaved Sorts and significant amounts of data have changed
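The summary above can be sketched as a small decision helper in Python. This is only one reading of the advice, not an official procedure: the function name and its flags are hypothetical, and what counts as a "significant" change is left to the caller.

```python
from typing import List


def maintenance_commands(
    table: str,
    *,
    rows_deleted_or_updated: bool,
    significant_change: bool,
    interleaved_sortkey: bool,
) -> List[str]:
    """Pick the Redshift maintenance statements to run after a nightly load."""
    commands = []
    if interleaved_sortkey and significant_change:
        # VACUUM REINDEX re-analyzes the interleaved key distribution and
        # then performs a full vacuum, so a plain VACUUM is not added as well.
        commands.append(f"VACUUM REINDEX {table};")
    elif rows_deleted_or_updated:
        # Re-sort and reclaim space left behind by UPDATE/DELETE.
        commands.append(f"VACUUM {table};")
    if significant_change:
        # Refresh planner statistics.
        commands.append(f"ANALYZE {table};")
    return commands


if __name__ == "__main__":
    print(maintenance_commands(
        "sales",  # hypothetical table name
        rows_deleted_or_updated=True,
        significant_change=True,
        interleaved_sortkey=True,
    ))
```

A table that is truncated and reloaded has no deleted rows to reclaim, so for that case the helper would typically emit only ANALYZE unless the interleaved key needs a reindex.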

Netezza table size increased after using CTAS command

I have a large table in Netezza; its size is approximately 600 GB.
When I created a new table from the existing table, the size increased. The new table is 617 GB.
SQL which I used to create new table:
create table new_table_name as select * from old_table_name distribute on (column_name);
generate statistics on new_table_name;
However, the row counts for the new table and the old table are the same.
What could be the reason for increasing table size?
Thanks in advance.
There are two relevant measurements for 'size' of a table: allocated and used size (both in bytes)
_v_table_storage_stat will help you look at both sizes for a given table
For small tables, the allocated size can be many times larger than the used size: assuming an even distribution of rows, a minimum of 3 MB is allocated on each data slice. I do most of my work on a double-rack Mako system with 480 data slices, so the minimum allocation there is 480 × 3 MB = 1,440 MB. Any table smaller than about 1.44 GB is therefore more or less irrelevant when optimizing for 'size'.
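The minimum-allocation arithmetic can be checked with a few lines of Python. The 3 MB per data slice and the 480 slices are the figures given in this answer; other systems will have different slice counts.

```python
MIN_ALLOC_MB_PER_SLICE = 3  # minimum allocation per data slice, per the answer


def min_allocated_mb(data_slices: int) -> int:
    """Smallest allocated size of an evenly distributed table, in MB."""
    return data_slices * MIN_ALLOC_MB_PER_SLICE


if __name__ == "__main__":
    mb = min_allocated_mb(480)
    print(f"{mb} MB, about {mb / 1000:.2f} GB")
```

Below that floor, the allocated size says more about the number of data slices than about the data itself.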
Nevertheless I'll try to explain what you see:
You must realize that
1) all data in Netezza are compressed.
2) Compression is being done for a 'block' of data on each individual dataslice.
3) Compression ratio (the size of data after compression divided by the size before) gets better (smaller) if data in each block share many similarities compared to the most 'mixed' situation imaginable.
4) 'distribute on' and 'organize on' can both affect this. So can an 'order by' or even a 'group by' in the select statement used when adding data to your table.
In my system, I have a very wide table with several 'copies' per day of the bank accounts of our customers. Each copy is 99% identical to the previous one, and only things like 'balance' change.
By distributing on accountID and organizing on AccountID,Timestamp - I saw a 10-15% smaller size. Some data slices had a better effect, because they contained a lot of 'system' account ID's which have a different pattern in the data.
In short:
A) it's perfectly natural
B) don't worry too much about it since:
C) a 'large' table on a Netezza system is not the same as on a 4-core database with too little memory and sloooow disks :)

Does the number of records have any impact on performance

Does the number of records in a db affect the speed of select queries?
I mean, if one db has 50 records and another has 5 million records, will the selects from the second one be slower, assuming I have all the indexes in the right place?
Yes, but it doesn't have to be a large penalty.
At the most basic level, an index is a B-tree. Performance is somewhat correlated with the number of levels in the tree. If you model it as a binary tree, a 5-record database has about 2 levels and a 5-million-record database has about 22 levels. Growth is logarithmic, so doubling to 10 million rows adds only one level, for about 23. Real B-trees have a much higher fan-out than two, so the actual depth is smaller still. And really, index access times are typically not the problem in performance tuning; the usual problem is tables that aren't indexed properly.
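The level counts can be checked with a short Python sketch. The binary model uses log2 of the row count; the fan-out of 100 keys per node in the second function is purely an illustrative assumption, since real fan-out depends on page and key size.

```python
import math


def binary_levels(rows: int) -> float:
    """Depth of a perfectly balanced binary tree over `rows` keys (log2)."""
    return math.log2(rows)


def btree_levels(rows: int, fanout: int) -> int:
    """Levels needed when every node can point to `fanout` children."""
    levels, capacity = 1, fanout
    while capacity < rows:
        capacity *= fanout
        levels += 1
    return levels


if __name__ == "__main__":
    for rows in (50, 5_000_000, 10_000_000):
        print(
            f"{rows:>10} rows: binary ~{binary_levels(rows):.1f} levels, "
            f"fan-out 100 -> {btree_levels(rows, 100)} levels"
        )
```

Doubling the row count adds exactly one level in the binary model, and with a wide fan-out it often adds none.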
As noted by odedsh, caching is also a large contributor, and small databases will be cached well. SQLite stores records in primary key sequence, so picking a primary key that allows records that are commonly used together to be stored together can be a big benefit.
Yeah it matters for the reasons the others said.
There are other things that can affect the speed of SELECT statements too, such as how many columns you're grabbing data from.
I once did some speed tests on a table with over 150 columns, where I needed only about 40 of the columns but all 20,000+ records. While the speed differences were minimal (we're talking 20 to 40 milliseconds), it was actually faster to grab the data from all the columns with a 'SELECT ALL *' than with 'Select All Field1, Field2, etc'.
I assume that the more records and columns there are in your table, the greater the speed difference will be, but I never had a need to test it any further in more extreme cases like 5 million records in a table.
Yes.
If a table is tiny and the entire db is tiny, then when you select anything from the table it is very likely that all the data is already in memory and the result can be returned immediately.
If the table is huge but you have an index and you are doing a simple select on the indexed columns, then the index can be scanned, the correct blocks read from disk, and the result returned.
If there is no index that can be used, then the db will do a full table scan, reading the table block by block looking for matches.
If there is a partial match between the index columns and the columns in the select query, then the db can try to minimize the number of blocks that must be read. A lot of thought can go into properly choosing the index structure and type (BITMAP / REGULAR).
And this is just for the most basic SQL that selects from a single table without any calculations.
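The difference between a full table scan and an index lookup can be observed with Python's built-in sqlite3 module. SQLite is used here only because it ships with Python; the table and index names are made up, and other engines expose their plans differently.

```python
import sqlite3


def plan(conn: sqlite3.Connection, sql: str, params=()) -> str:
    """Return the human-readable steps SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column is the detail text


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer INTEGER, total REAL)"
    )
    conn.executemany(
        "INSERT INTO orders (customer, total) VALUES (?, ?)",
        ((i % 1000, i * 1.5) for i in range(10_000)),
    )
    query = "SELECT * FROM orders WHERE customer = ?"
    before = plan(conn, query, (42,))  # no usable index yet: full scan
    conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")
    after = plan(conn, query, (42,))  # now the index can be searched
    conn.close()
    return before, after


if __name__ == "__main__":
    for label, text in zip(("without index", "with index"), demo()):
        print(f"{label}: {text}")
```

The first plan reports a scan of the whole table, while the second reports a search using the new index.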

Database that can handle >500 million rows

I am looking for a database that could handle more than 500 million rows, meaning it can create an index on a column in a reasonable time and return results for select queries in less than 3 seconds. Would Postgresql or Msql on a low-end machine (Core 2 CPU 6600, 4GB, 64-bit system, Windows VISTA) handle such a large number of rows?
Update: With this question I am looking for information on which database I should use on a low-end machine in order to answer select queries with one or two fields specified in the where clause. No joins. I need to create indices (it cannot take ages like on mysql) to achieve sufficient performance for my select queries. This machine is a test PC used to perform an experiment.
The table schema:
create table mapper (
key VARCHAR(1000),
attr1 VARCHAR (100),
attr1 INT, -- duplicate column name, as in the original post
attr2 INT,
value VARCHAR (2000),
PRIMARY KEY (key),
INDEX (attr1),
INDEX (attr2)
);
MSSQL can handle that many rows just fine. The query time is completely dependent on a lot more factors than just simple row count.
For example, it's going to depend on:
how many joins those queries do
how well your indexes are set up
how much ram is in the machine
speed and number of processors
type and spindle speed of hard drives
size of the row/amount of data returned in the query
Network interface speed / latency
It's very easy to have a small (less than 10,000 rows) table which would take a couple of minutes to execute a query against. For example, using lots of joins, functions in the where clause, and zero indexes on an Atom processor with 512MB of total RAM. ;)
It takes a bit more work to make sure all of your indexes and foreign key relationships are good, that your queries are optimized to eliminate needless function calls and only return the data you actually need. Also, you'll need fast hardware.
It all boils down to how much money you want to spend, the quality of the dev team, and the size of the data rows you are dealing with.
UPDATE
Updating due to changes in the question.
The amount of information here is still not enough to give a real world answer. You are going to just have to test it and adjust your database design and hardware as necessary.
For example, I could very easily have 1 billion rows in a table on a machine with those specs, run a "select top(1) id from tableA (nolock)" query, and get an answer in milliseconds. By the same token, you can execute a "select * from tablea" query and it will take a while: although the query executes quickly, transferring all of that data across the wire takes time.
Point is, you have to test. Which means, setting up the server, creating some of your tables, and populating them. Then you have to go through performance tuning to get your queries and indexes right. As part of the performance tuning you're going to uncover not only how the queries need to be restructured but also exactly what parts of the machine might need to be replaced (ie: disk, more ram, cpu, etc) based on the lock and wait types.
I'd highly recommend you hire (or contract) one or two DBAs to do this for you.
Most databases can handle this, it's about what you are going to do with this data and how you do it. Lots of RAM will help.
I would start with PostgreSQL: it's free, has no limits on RAM (unlike SQL Server Express), and has no potential problems with licences (too many processors, etc.). But it's also my work :)
Pretty much every non-stupid database can handle a billion rows today easily. 500 million is doable even on 32 bit systems (albeit 64 bit really helps).
The main problem is:
You need to have enough RAM. How much is enough depends on your queries.
You need to have a good enough disc subsystem. This pretty much means that if you want to do large selects, then a single platter for everything is totally out of the question. Many spindles (or an SSD) are needed to handle the IO load.
Both Postgres and MySQL can easily handle 500 million rows. On proper hardware.
What you want to look at is the table size limit the database software imposes. For example, as of this writing, MySQL InnoDB has a limit of 64 TB per table, while PostgreSQL has a limit of 32 TB per table; neither limits the number of rows per table. If correctly configured, these database systems should not have trouble handling tens or hundreds of billions of rows (if each row is small enough), let alone 500 million rows.
For best performance handling extremely large amounts of data, you should have sufficient disk space and good disk performance—which can be achieved with disks in an appropriate RAID—and large amounts of memory coupled with a fast processor(s) (ideally server-grade Intel Xeon or AMD Opteron processors). Needless to say, you'll also need to make sure your database system is configured for optimal performance and that your tables are indexed properly.
The following article discusses the import and use of a 16 billion row table in Microsoft SQL.
https://www.itprotoday.com/big-data/adventures-big-data-how-import-16-billion-rows-single-table.
From the article:
Here are some distilled tips from my experience:
The more data you have in a table with a defined clustered index, the slower it becomes to import unsorted records into it. At some point, it becomes too slow to be practical.
If you want to export your table to the smallest possible file, make it native format. This works best with tables containing mostly numeric columns because they’re more compactly represented in binary fields than character data. If all your data is alphanumeric, you won’t gain much by exporting it in native format. Not allowing nulls in the numeric fields can further compact the data. If you allow a field to be nullable, the field’s binary representation will contain a 1-byte prefix indicating how many bytes of data will follow.
You can’t use BCP for more than 2,147,483,647 records because the BCP counter variable is a 4-byte integer. I wasn’t able to find any reference to this on MSDN or the Internet. If your table consists of more than 2,147,483,647 records, you’ll have to export it in chunks or write your own export routine.
Defining a clustered index on a prepopulated table takes a lot of disk space. In my test, my log exploded to 10 times the original table size before completion.
When importing a large number of records using the BULK INSERT statement, include the BATCHSIZE parameter and specify how many records to commit at a time. If you don’t include this parameter, your entire file is imported as a single transaction, which requires a lot of log space.
The fastest way of getting data into a table with a clustered index is to presort the data first. You can then import it using the BULK INSERT statement with the ORDER parameter.
Even that is small compared to the multi-petabyte Nasdaq OMX database, which houses tens of petabytes (thousands of terabytes) and trillions of rows on SQL Server.
Have you checked out Cassandra? http://cassandra.apache.org/
As mentioned, pretty much all DBs today can handle this situation. What you want to concentrate on is your disk I/O subsystem. You need to configure a RAID 0 or RAID 0+1 setup, throwing as many spindles at the problem as you can. Also, divide up your Log/Temp/Data logical drives for performance.
For example, let's say you have 12 drives. In your RAID controller I'd create 3 RAID 0 partitions of 4 drives each. In Windows (let's say), format each group as a logical drive (G, H, I). Then, when configuring SQL Server (let's say), assign tempdb to G, the log files to H, and the data files to I.
I don't have much input on which is the best system to use, but perhaps this tip could help you get some of the speed you're looking for.
If you're going to be doing exact matches of long varchar strings, especially ones that are longer than allowed for an index, you can do a sort of pre-calculated hash:
CREATE TABLE BigStrings (
BigStringID int identity(1,1) NOT NULL PRIMARY KEY CLUSTERED,
Value varchar(6000) NOT NULL,
Chk AS (CHECKSUM(Value))
);
CREATE NONCLUSTERED INDEX IX_BigStrings_Chk ON BigStrings(Chk);
--Load 500 million rows in BigStrings
DECLARE @S varchar(6000);
SET @S = '6000-character-long string here';
-- nasty, slow table scan:
SELECT * FROM BigStrings WHERE Value = @S;
-- super fast nonclustered seek followed by very fast clustered index range seek:
SELECT * FROM BigStrings WHERE Value = @S AND Chk = CHECKSUM(@S);
This won't help you if you aren't doing exact matches, but in that case you might look into full-text indexing. This will really change the speed of lookups on a 500-million-row table.
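The same pre-calculated hash idea can be tried outside SQL Server. The sketch below uses Python's sqlite3 module with zlib.crc32 standing in for CHECKSUM; both substitutions are assumptions made for illustration, and the table and index names are made up.

```python
import sqlite3
import zlib


def chk(value: str) -> int:
    """Small integer hash of a long string (stand-in for T-SQL CHECKSUM)."""
    return zlib.crc32(value.encode("utf-8"))


def build(values):
    """Create the table, store each value with its hash, index the hash."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE big_strings ("
        "id INTEGER PRIMARY KEY, value TEXT NOT NULL, chk INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO big_strings (value, chk) VALUES (?, ?)",
        ((v, chk(v)) for v in values),
    )
    conn.execute("CREATE INDEX ix_big_strings_chk ON big_strings (chk)")
    return conn


def find(conn, needle: str):
    """Seek on the short indexed hash, then confirm with the full string."""
    return conn.execute(
        "SELECT id, value FROM big_strings WHERE chk = ? AND value = ?",
        (chk(needle), needle),
    ).fetchall()


if __name__ == "__main__":
    conn = build(["a" * 6000, "b" * 6000])
    print([row[0] for row in find(conn, "b" * 6000)])
```

The comparison on the full value stays in the WHERE clause because different strings can share a hash; the index on the hash only narrows the candidates.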
I need to create indices (that does not take ages like on mysql) to achieve sufficient performance for my select queries
I'm not sure what you mean by "creating" indexes. That's normally a one-time thing. When loading a huge amount of data, as you might, it is typical to drop the indexes, load the data, and then add the indexes back, so that the data load is very fast. After that, the indexes are updated as you make changes to the database; they don't need to be re-created each time a query runs.
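The load-first, index-afterwards pattern might look like this in Python with sqlite3. SQLite and the simplified mapper columns are assumptions chosen only to keep the sketch runnable; the same order of operations applies in other engines.

```python
import sqlite3


def load_then_index(rows):
    """Bulk-load into an unindexed table, then build each index once."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE mapper "
        "(map_key TEXT, attr1 INTEGER, attr2 INTEGER, value TEXT)"
    )
    with conn:  # one transaction for the whole load
        conn.executemany("INSERT INTO mapper VALUES (?, ?, ?, ?)", rows)
    # Indexes are created after the load, so inserts never have to maintain them.
    conn.execute("CREATE INDEX ix_mapper_attr1 ON mapper (attr1)")
    conn.execute("CREATE INDEX ix_mapper_attr2 ON mapper (attr2)")
    return conn


if __name__ == "__main__":
    data = ((f"k{i}", i % 10, i % 7, "v") for i in range(1000))
    conn = load_then_index(data)
    print(conn.execute("SELECT COUNT(*) FROM mapper").fetchone()[0])
```

Later inserts and updates keep both indexes current automatically, so they are built only once.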
That said, databases have query optimization engines that analyze your query and determine the best plan to retrieve the data: how to join the tables (not relevant in your scenario) and which indexes are available. Obviously you'd want to avoid a full table scan, so performance tuning and reviewing the query plan are important, as others have already pointed out.
The point above about a checksum looks interesting, and that could even be an index on attr1 in the same table.
