Netezza table size increased after using CTAS command

I have a large table in Netezza and the table size is approx 600 GB.
When I created a new table from my existing table, the table size increased. The new table size is 617 GB.
The SQL I used to create the new table:
create table new_table_name as select * from old_table_name distribute on (column_name);
generate statistics on new_table_name;
However, the row counts of the new table and the old table are the same.
What could be the reason for the increase in table size?
Thanks in advance.

There are two relevant measurements for the 'size' of a table: allocated and used size (both in bytes).
_v_table_storage_stat will help you look at both sizes for a given table.
For small tables, the allocated size can be many times larger than the used size: assuming an even distribution of rows, a minimum of 3 MB will be allocated on each data slice. I do most of my work on a double-rack MAKO system with 480 data slices, so any table smaller than roughly 1.4 GB (480 x 3 MB) is more or less irrelevant for optimization of 'size'.
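A minimal sketch of such a check (the column names USED_BYTES and ALLOCATED_BYTES are my assumption and may differ slightly between NPS versions):
-- Hedged sketch: compare used vs allocated bytes for the two tables
SELECT tablename,
       used_bytes      / 1073741824.0 AS used_gb,
       allocated_bytes / 1073741824.0 AS allocated_gb
FROM _v_table_storage_stat
WHERE tablename IN ('OLD_TABLE_NAME', 'NEW_TABLE_NAME');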
Nevertheless I'll try to explain what you see:
You must realize that:
1) all data in Netezza is compressed.
2) Compression is done for a 'block' of data on each individual data slice.
3) The compression ratio (the size of data after compression divided by the size before) gets better (smaller) if the data in each block shares many similarities, compared to the most 'mixed' situation imaginable.
4) 'distribute on' and 'organize on' can both affect this. So can an 'order by' or even a 'group by' in the select statement used when adding data to your table.
In my system, I have a very wide table with several 'copies' per day of the bank accounts of our customers. Each copy is 99% identical to the previous one, and only things like 'balance' change.
By distributing on AccountID and organizing on AccountID, Timestamp, I saw a 10-15% smaller size. Some data slices benefited more because they contained a lot of 'system' account IDs, which have a different pattern in the data.
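If you want to experiment with that yourself, here is a hedged sketch (table and column names are placeholders, and the exact syntax may vary by NPS version):
-- Rebuild with an explicit distribution key, then declare organizing keys
-- and groom so the rows are physically reordered.
CREATE TABLE accounts_new AS
SELECT * FROM accounts_old
DISTRIBUTE ON (account_id);

ALTER TABLE accounts_new ORGANIZE ON (account_id, snapshot_ts);
GROOM TABLE accounts_new RECORDS ALL;   -- applies the organization
GENERATE STATISTICS ON accounts_new;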
In short:
A) it's perfectly natural
B) don't worry too much about it since:
C) a 'large' table on a Netezza system is not the same as on a 4-core database with too little memory and sloooow disks :)

Related

Memory Size in Delta after Delta Merge

On my HANA database, I executed a delta merge on a few tables against which I had run a number of INSERT statements.
I can see that the merge command completed successfully.
On the other hand, when I query the M_CS_TABLES view, I see that MEMORY_SIZE_IN_DELTA is bigger than zero and a high percentage compared with MEMORY_SIZE_IN_MAIN. I was expecting to see 0, or at least a lower percentage.
Could you please help me understand this delta merge and memory-size-in-delta issue?
I created sample column tables in my schema on a HANA database and populated them with data using INSERT commands. Then I executed the following command:
MERGE DELTA OF "SALESORDERHEADER";
To query merge statistics for column tables,
select * from M_CS_TABLES where schema_name = Current_schema;
Although the table sizes are quite small, I expected the delta (row-store) section of the tables to be near zero.
Additionally, RAW_RECORD_COUNT_IN_DELTA for all tables is 0, which as I understand it means there are no records waiting in the delta to be merged.
For column-based statistics, I executed
select * from M_CS_ALL_COLUMNS
where schema_name = Current_schema and table_name = 'SALESORDERHEADER';
The output (a screenshot of the per-column memory sizes) is omitted here.
Thanks for adding the information.
When you look at the sizes of the delta stores for each column (M_CS_ALL_COLUMNS), you see that most of the values are really close to 8K (8,192 bytes). That's no accident: it is the internal minimum allocation size for delta store structures.
So, yes, even if there are no entries left to be merged, each column of a column store table still has a delta store memory structure of at least 8K 'hanging on to it'.
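A quick way to see that is to sum the per-column delta sizes; as a hedged sketch (the MEMORY_SIZE_IN_DELTA column name is from memory and may differ by HANA revision), the total should come out to roughly the column count times 8 KB:
-- Sum the delta memory per column for the sample table
SELECT table_name,
       COUNT(*)                  AS column_count,
       SUM(memory_size_in_delta) AS delta_bytes
FROM m_cs_all_columns
WHERE schema_name = CURRENT_SCHEMA
  AND table_name = 'SALESORDERHEADER'
GROUP BY table_name;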
Your remark about the relative size of this delta store is valid, but it is important to recognize that this table is tiny. 480KB is virtually nothing in terms of database tables.
And for such tiny tables, column store structures generally are less space efficient than other data structures. This of course changes once you load data volumes for which you actually want/need a column store. Then compression, parallelization, CPU-cache friendliness, etc. all far outweigh the additional memory used for empty delta stores.
Anyhow, all good here, nothing to see ... move on :-D

Performance of Column Family in Cassandra DB

I have a table where my queries will be based purely on id and created_time. There are 50 other columns, all of which will also be queried purely by id and created_time. I can design it in two ways:
Either multiple small tables with 5 columns each for all 50 parameters
Or a single table with all 50 columns, with id and created_at as the primary key
Which will be better? My row count will increase tremendously, so should I be concerned about the width of the column family while modelling?
Actually, you should use small tables to decrease the load on a single table, and you should also try to maintain query-based tables. If your queries read all 50 columns, then you can proceed with a single table. But if you plan to read only part of the data in each query, then you should maintain query-based small tables, which will redistribute the data evenly across the nodes, or maintain multiple partitions as Alex suggested (but then you cannot do range-based queries).
This really depends on how you structure your partition key and how data is distributed inside a partition. CQL has some limits, like a maximum of 2 billion cells per partition, but that is a theoretical limit; the practical limits are more like not having partitions bigger than 100 MB, etc. (DSE has recommendations in its planning guide).
If you'll always search by id & created_time, and not do range queries on created_time, then you may even use a composite partition key comprising both - this will distribute data more evenly across the cluster. Otherwise, make sure that you don't have too much data inside a partition.
Or you can add another piece to the partition key; for example, people sometimes add a truncated date-time to the partition key, such as the time rounded to the hour or to the day - but this will affect your queries. It really depends on them.
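A hedged CQL sketch of that bucketing idea (all names here are placeholders, not the asker's actual schema):
-- One wide table; the partition key combines id with a coarse day bucket so
-- a single id's history is split across many bounded partitions.
CREATE TABLE readings_by_id_day (
    id           uuid,
    day          date,          -- created_time truncated to the day
    created_time timestamp,
    col1         text,
    col2         text,
    -- ... remaining columns ...
    PRIMARY KEY ((id, day), created_time)
) WITH CLUSTERING ORDER BY (created_time DESC);

-- Queries must then supply both parts of the partition key:
-- SELECT * FROM readings_by_id_day WHERE id = ? AND day = '2018-04-01';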
Sort of in line with what Alex mentions, the determining factor here is going to be the size of your various partitions (which is an extension of the size of your columns).
Practically speaking, you can have problems going both ways - partitions that are too narrow can be as problematic as partitions that are too wide, so this is the type of thing you may want to try benchmarking and seeing which works best. I suspect for normal data models (staying away from the pathological edge cases), either will work just fine, and you won't see a meaningful difference (assuming 3.11).
In 3.11.x, Cassandra does a better job of skipping unrequested values than in 3.0.x, so if you do choose to join it all in one table, do consider using 3.11.2 or whatever the latest available release is in the 3.11 (or newer) branch.

Why does PostgreSQL (timescaledb) use more storage in a table?

I'm new to databases. Recently I started using timescaledb, which is an extension of PostgreSQL, so I guess this is also PostgreSQL related.
I observed a strange behavior. I worked out my table structure: 1 timestamp and 2 doubles, so 24 bytes per row in total. I imported (via psycopg2 copy_from) 2,750,182 rows from a CSV file. By my manual calculation the size should be about 63 MB, but when I query timescaledb it tells me the table size is 137 MB, the index size is 100 MB, and the total is 237 MB. I was expecting the table size to match my calculation, but it doesn't. Any idea?
There are two basic reasons your table is bigger than you expect:
1. Per tuple overhead in Postgres
2. Index size
Per tuple overhead: An answer to a related question goes into detail that I won't repeat here, but basically Postgres uses 23 (+ padding) bytes per row for various internal things, mostly multi-version concurrency control (MVCC) management (Bruce Momjian has some good intros if you want more info). That overhead gets you pretty darn close to the 137 MB you are seeing. The rest might be due to either the fill factor setting of the table or dead rows still included in the table from, say, a previous insert and subsequent delete.
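To see how the space splits up, a hedged sketch (the table name is a placeholder; for a timescaledb hypertable the data actually lives in chunk tables, so these numbers are per chunk):
-- Heap vs index vs total size, and any dead rows left behind
SELECT pg_size_pretty(pg_relation_size('my_table'))       AS heap_size,
       pg_size_pretty(pg_indexes_size('my_table'))        AS index_size,
       pg_size_pretty(pg_total_relation_size('my_table')) AS total_size;

SELECT n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'my_table';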
Index Size: Unlike some other DBMSs, Postgres does not organize its tables on disk around an index unless you manually cluster the table on an index, and even then it will not maintain the clustering over time (see https://www.postgresql.org/docs/10/static/sql-cluster.html). Rather, it keeps its indexes separately, which is why there is extra space for your index. If on-disk size is really important to you and you aren't using your index for, say, uniqueness constraint enforcement, you might consider a BRIN index, especially if your data is inserted in some ordering (see https://www.postgresql.org/docs/10/static/brin-intro.html).
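As a hedged sketch of that last suggestion (table and column names are placeholders; check which indexes you already have before dropping or replacing anything):
-- A BRIN index on the timestamp column is tiny compared to a btree when
-- rows arrive roughly in time order.
CREATE INDEX my_table_ts_brin ON my_table USING BRIN (ts);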

Does the number of records have any impact on performance

Does the number of records in a db affect the speed of select queries?
I mean, if one db has 50 records and another one has 5 million records, will selects from the second one be slower, assuming I have all the indexes in the right place?
Yes, but it doesn't have to be a large penalty.
At the most basic level an index is a b-tree. Performance is somewhat correlated with the number of levels in the b-tree: counted as binary splits, a 5-record database has about 2 levels and a 5-million-record database about 22. But growth is logarithmic, so a 10-million-row database has 23 levels, and really, index access times are typically not the problem in performance tuning - the usual problem is tables that aren't indexed properly.
As noted by odedsh, caching is also a large contributor, and small databases will be cached well. Sqlite stores records in primary key sequence, so picking a primary key that allows records that are commonly used together to be stored together can be a big benefit.
Yeah it matters for the reasons the others said.
There are other things that can affect the speed of select statements too, such as how many columns you're grabbing data from.
I once did some speed tests on a table with over 150 columns, where I needed only about 40 of the columns but all 20,000+ records. While the speed differences were very minimal (we're talking 20 to 40 milliseconds), it was actually faster to grab the data from all the columns with a 'SELECT *' rather than a 'SELECT Field1, Field2, etc.'.
I assume the more records and columns in your table, the greater the speed difference this example will net you, but I never had a need to test it any further in more extreme cases like 5 million records in a table.
Yes.
If a table is tiny and the entire db is tiny, then when you select anything from the table it is very likely that all the data is already in memory and the result can be returned immediately.
If the table is huge but you have an index and you are doing a simple select on the indexed columns, then the index can be scanned, the correct blocks can be read from disk, and the result returned.
If there is no index that can be used, then the db will do a full table scan, reading the table block by block looking for matches.
If there is a partial match between the index columns and the select query columns, then the db can try to minimize the number of blocks that need to be read. A lot of thought can go into properly choosing the index structure and type (BITMAP / REGULAR).
And this is just for the most basic SQL that selects from a single table without any calculations.
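Since SQLite came up above, here is a hedged sketch of that index-vs-scan difference (schema and names are placeholders):
-- With an index on the filtered column the planner reports an index search;
-- without one it falls back to a full table scan.
CREATE TABLE orders (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER,
    total       REAL
);
CREATE INDEX idx_orders_customer ON orders (customer_id);

EXPLAIN QUERY PLAN
SELECT * FROM orders WHERE customer_id = 42;
-- SEARCH orders USING INDEX idx_orders_customer (customer_id=?)

EXPLAIN QUERY PLAN
SELECT * FROM orders WHERE total > 100;
-- SCAN orders  (no usable index: full table scan)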

Database that can handle >500 million rows

I am looking for a database that can handle (create an index on a column in a reasonable time and provide results for select queries in less than 3 seconds) more than 500 million rows. Would Postgresql or Msql on a low-end machine (Core 2 CPU 6600, 4 GB, 64-bit system, Windows Vista) handle such a large number of rows?
Update: Asking this question, I am looking for information on which database I should use on a low-end machine in order to provide results for select queries with one or two fields specified in the where clause. No joins. I need to create indexes - it cannot take ages like on mysql - to achieve sufficient performance for my select queries. This machine is a test PC to perform an experiment.
The table schema:
create table mapper (
    key VARCHAR(1000),
    attr1 VARCHAR(100),
    attr1 INT,
    attr2 INT,
    value VARCHAR(2000),
    PRIMARY KEY (key),
    INDEX (attr1),
    INDEX (attr2)
)
MSSQL can handle that many rows just fine. The query time is completely dependent on a lot more factors than just simple row count.
For example, it's going to depend on:
how many joins those queries do
how well your indexes are set up
how much ram is in the machine
speed and number of processors
type and spindle speed of hard drives
size of the row/amount of data returned in the query
Network interface speed / latency
It's very easy to have a small (less than 10,000 rows) table which would take a couple of minutes to execute a query against. For example, using lots of joins, functions in the where clause, and zero indexes on an Atom processor with 512 MB of total RAM. ;)
It takes a bit more work to make sure all of your indexes and foreign key relationships are good, that your queries are optimized to eliminate needless function calls and only return the data you actually need. Also, you'll need fast hardware.
It all boils down to how much money you want to spend, the quality of the dev team, and the size of the data rows you are dealing with.
UPDATE
Updating due to changes in the question.
The amount of information here is still not enough to give a real world answer. You are going to just have to test it and adjust your database design and hardware as necessary.
For example, I could very easily have 1 billion rows in a table on a machine with those specs, run a "select top(1) id from tableA (nolock)" query, and get an answer in milliseconds. By the same token, you could execute a "select * from tableA" query and it would take a while, because although the query executes quickly, transferring all of that data across the wire takes a while.
Point is, you have to test. Which means, setting up the server, creating some of your tables, and populating them. Then you have to go through performance tuning to get your queries and indexes right. As part of the performance tuning you're going to uncover not only how the queries need to be restructured but also exactly what parts of the machine might need to be replaced (ie: disk, more ram, cpu, etc) based on the lock and wait types.
I'd highly recommend you hire (or contract) one or two DBAs to do this for you.
Most databases can handle this; it's about what you are going to do with this data and how you do it. Lots of RAM will help.
I would start with PostgreSQL: it's free and has no limits on RAM (unlike SQL Server Express) and no potential problems with licences (too many processors, etc.). But it's also my work :)
Pretty much every non-stupid database can handle a billion rows today easily. 500 million is doable even on 32 bit systems (albeit 64 bit really helps).
The main problem is:
You need to have enough RAM. How much is enough depends on your queries.
You need to have a good enough disk subsystem. This pretty much means that if you want to do large selects, then a single platter for everything is totally out of the question. Many spindles (or an SSD) are needed to handle the IO load.
Both Postgres and MySQL can easily handle 500 million rows. On proper hardware.
What you want to look at is the table size limit the database software imposes. For example, as of this writing, MySQL InnoDB has a limit of 64 TB per table, while PostgreSQL has a limit of 32 TB per table; neither limits the number of rows per table. If correctly configured, these database systems should not have trouble handling tens or hundreds of billions of rows (if each row is small enough), let alone 500 million rows.
For best performance handling extremely large amounts of data, you should have sufficient disk space and good disk performance—which can be achieved with disks in an appropriate RAID—and large amounts of memory coupled with a fast processor(s) (ideally server-grade Intel Xeon or AMD Opteron processors). Needless to say, you'll also need to make sure your database system is configured for optimal performance and that your tables are indexed properly.
The following article discusses the import and use of a 16 billion row table in Microsoft SQL.
https://www.itprotoday.com/big-data/adventures-big-data-how-import-16-billion-rows-single-table.
From the article:
Here are some distilled tips from my experience:
The more data you have in a table with a defined clustered index, the slower it becomes to import unsorted records into it. At some point, it becomes too slow to be practical.
If you want to export your table to the smallest possible file, make it native format. This works best with tables containing mostly numeric columns because they're more compactly represented in binary fields than character data. If all your data is alphanumeric, you won't gain much by exporting it in native format.
Not allowing nulls in the numeric fields can further compact the data. If you allow a field to be nullable, the field's binary representation will contain a 1-byte prefix indicating how many bytes of data will follow.
You can't use BCP for more than 2,147,483,647 records because the BCP counter variable is a 4-byte integer. I wasn't able to find any reference to this on MSDN or the Internet. If your table consists of more than 2,147,483,647 records, you'll have to export it in chunks or write your own export routine.
Defining a clustered index on a prepopulated table takes a lot of disk space. In my test, my log exploded to 10 times the original table size before completion.
When importing a large number of records using the BULK INSERT statement, include the BATCHSIZE parameter and specify how many records to commit at a time. If you don't include this parameter, your entire file is imported as a single transaction, which requires a lot of log space.
The fastest way of getting data into a table with a clustered index is to presort the data first. You can then import it using the BULK INSERT statement with the ORDER parameter.
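As a hedged illustration of those last two tips (file path, table and column names are placeholders):
-- Commit in batches to limit log growth, and tell SQL Server the file is
-- already sorted on the clustered index key so the sort can be skipped.
BULK INSERT dbo.BigTable
FROM 'D:\export\bigtable.dat'
WITH (
    DATAFILETYPE = 'native',
    BATCHSIZE = 1000000,
    ORDER (ID ASC),
    TABLOCK
);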
Even that is small compared to the multi-petabyte Nasdaq OMX database, which houses tens of petabytes (thousands of terabytes) and trillions of rows on SQL Server.
Have you checked out Cassandra? http://cassandra.apache.org/
As mentioned, pretty much all DBs today can handle this situation - what you want to concentrate on is your disk I/O subsystem. You need to configure a RAID 0 or RAID 0+1 setup, throwing as many spindles at the problem as you can. Also, split up your log/temp/data logical drives for performance.
For example, let's say you have 12 drives - in your RAID controller I'd create 3 RAID 0 partitions of 4 drives each. In Windows (say), format each group as a logical drive (G, H, I) - now when configuring SQL Server (say), assign tempdb to G, the log files to H and the data files to I.
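A hedged T-SQL sketch of that file placement (drive letters, database and file names are placeholders; moving tempdb only takes effect after a service restart):
-- Data and log on separate logical drives
CREATE DATABASE BigDb
ON PRIMARY ( NAME = BigDb_data, FILENAME = 'I:\Data\BigDb.mdf' )
LOG ON     ( NAME = BigDb_log,  FILENAME = 'H:\Log\BigDb.ldf' );

-- Relocate tempdb to its own drive
ALTER DATABASE tempdb
MODIFY FILE ( NAME = tempdev, FILENAME = 'G:\TempDb\tempdb.mdf' );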
I don't have much input on which is the best system to use, but perhaps this tip could help you get some of the speed you're looking for.
If you're going to be doing exact matches of long varchar strings, especially ones that are longer than allowed for an index, you can do a sort of pre-calculated hash:
CREATE TABLE BigStrings (
BigStringID int identity(1,1) NOT NULL PRIMARY KEY CLUSTERED,
Value varchar(6000) NOT NULL,
Chk AS (CHECKSUM(Value))
);
CREATE NONCLUSTERED INDEX IX_BigStrings_Chk ON BigStrings(Chk);
-- Load 500 million rows into BigStrings
DECLARE @S varchar(6000);
SET @S = '6000-character-long string here';
-- nasty, slow table scan:
SELECT * FROM BigStrings WHERE Value = @S
-- super fast nonclustered seek followed by very fast clustered index range seek:
SELECT * FROM BigStrings WHERE Value = @S AND Chk = CHECKSUM(@S)
This won't help you if you aren't doing exact matches, but in that case you might look into full-text indexing. This will really change the speed of lookups on a 500-million-row table.
I need to create indices (that does not take ages like on mysql) to achieve sufficient performance for my select queries
I'm not sure what you mean by "creating" indexes. That's normally a one-time thing. Now, when loading a huge amount of data as you might do, it is typical to drop the indexes, load your data, and then add the indexes back, so the data load is very fast. Then as you make changes to the database, the indexes are updated, but they don't need to be created each time your query runs.
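A hedged sketch of that load pattern against the mapper table from the question (the index names are hypothetical, and the bulk-load step is engine-specific):
-- Drop secondary indexes so the load isn't slowed by index maintenance
DROP INDEX idx_mapper_attr1 ON mapper;
DROP INDEX idx_mapper_attr2 ON mapper;

-- ... bulk load here (LOAD DATA INFILE / COPY / BULK INSERT, depending on the engine) ...

-- Rebuild the indexes once, after the data is in place
CREATE INDEX idx_mapper_attr1 ON mapper (attr1);
CREATE INDEX idx_mapper_attr2 ON mapper (attr2);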
That said, databases do have query optimization engines: they will analyze your query, determine the best plan to retrieve the data, work out how to join the tables (not relevant in your scenario), and see which indexes are available. Obviously you'd want to avoid a full table scan, so performance tuning and reviewing the query plan are important, as others have already pointed out.
The point above about a checksum looks interesting, and that could even be an index on attr1 in the same table.
