PostgreSQL: Column disk usage

I have a big table in my database, but it has a lot of empty fields in every column, and I'd like to know how much space each column uses.
Is there any way to know how much disk space each table's columns are using?

Try pg_column_size(); it returns the size of a stored value in bytes:
SELECT sum(pg_column_size(some_column)) FROM yourtable;
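If you want totals for several columns at once, a minimal sketch (still against yourtable, with hypothetical columns col_a, col_b and col_c) looks like this:
-- hypothetical column names; NULLs add nothing to the totals,
-- consistent with the note about the null bitmap below
SELECT sum(pg_column_size(col_a)) AS col_a_bytes,
       sum(pg_column_size(col_b)) AS col_b_bytes,
       sum(pg_column_size(col_c)) AS col_c_bytes
FROM yourtable;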

As the documentation mentions, NULL values are indicated in the null bitmap of a tuple, which is present whenever the row contains at least one NULL value.
So a NULL value consumes no extra space on disk.
If you design tables with very many columns, rethink your design.

Reducing disk space of sql database

I have a database with 2TB of data, and I want to reduce it to 500GB by dropping some rows and removing some useless columns, but I have other ideas for optimizations, and I need answers to a few questions first.
My database has one .mdf file and 9 other .ndf files, and each file has an initial size of 100GB.
Should I reduce the initial size of each .ndf file to 50GB? Can this operation affect my data?
Does dropping an index help to reduce space?
PS: My database contains only one single table, which has one clustered index and two non-clustered indexes.
I want to remove the two non-clustered indexes
Remove the insertdate column
If you have any other ideas for optimizations, it would be very helpful.
Before dropping any indexes, query these two dynamic management objects:
sys.dm_db_index_usage_stats
sys.dm_db_index_operational_stats
They will let you know if any of them are being used to support queries. The last thing you want is to remove an index and start seeing full table scans on a 2TB table.
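For example, a quick check against sys.dm_db_index_usage_stats (a sketch only; dbo.YourBigTable is a placeholder for your table name):
-- How each index on the table has been used since the last restart;
-- an index with no seeks/scans/lookups but many updates is a drop candidate
SELECT i.name AS index_name,
       i.type_desc,
       s.user_seeks, s.user_scans, s.user_lookups, s.user_updates
FROM sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS s
       ON s.object_id = i.object_id
      AND s.index_id = i.index_id
      AND s.database_id = DB_ID()
WHERE i.object_id = OBJECT_ID('dbo.YourBigTable');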
If you can't split the table up into a relational model, then try these for starters.
Check your data types.
-Can you replace NVARCHAR with VARCHAR or NCHAR with CHAR? (They take up half the space.)
-Does your table experience a lot of updates or a lot of inserts (the views above will tell you this)? If there are very few updates, consider changing CHAR fields to VARCHAR fields. Heavy updates can cause page splits and result in poor page fullness.
-Check that columns only storing a date with no time component are not declared as DATETIME (use DATE instead).
-Check value ranges in numeric fields, e.g. use SMALLINT instead of INT where the values fit.
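As a rough sketch of what such data-type changes look like (table and column names here are hypothetical; verify that existing values fit before narrowing a type, and note that altering a column on a large table is itself a size-of-data operation):
ALTER TABLE dbo.YourBigTable ALTER COLUMN Comments varchar(500) NULL;   -- was nvarchar(500)
ALTER TABLE dbo.YourBigTable ALTER COLUMN LoadDate date NOT NULL;       -- was datetime, no time part stored
ALTER TABLE dbo.YourBigTable ALTER COLUMN Quantity smallint NOT NULL;   -- was int, values fit in smallint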
Look at the activity on the table (update and insert behaviour). If the activity means very few pages are rearranged, consider increasing your fill factor.
Look at the plan cache to get an idea of how the table is being queried; if the bulk of queries focus on a specific portion of the table, implement a filtered index.
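For example, a filtered index sketch, assuming (hypothetically) that most queries only hit non-archived rows:
CREATE NONCLUSTERED INDEX IX_YourBigTable_Active
    ON dbo.YourBigTable (CustomerId, LoadDate)
    WHERE IsArchived = 0;   -- indexes only the slice the queries actually touch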
Is your clustered index unique? If not, SQL Server adds a hidden 4-byte "uniquifier" column that creates uniqueness under the bonnet.

Oracle - Make one table with many columns or split into many tables

What is the best way to model a database? I have many known channels with values. Is it better to create one table with many columns, one for each channel, or to create two tables, one for values and one for channels? Like this:
Table RAW_VALUES: SERIE_ID, CHANNEL_1, ..., CHANNEL_1000
or
Table RAW_VALUES: SERIE_ID, CHANNEL_ID, VALUE
Table CHANNELS: CHANNEL_ID, NAME, UNIT, ....
My question is about the performance of searching the data, and about saving database space.
Thanks.
Usually, one would want to know what type of queries you will run against the tables as well as the data distribution etc to choose between two designs. However, I think that there are more fundamental issues here to guide you.
The second alternative is certainly more flexible. Adding one more channel ("Channel_1001") can be done simply by inserting rows in the two tables (a simple DML operation), whereas if you use the first option, you need to add a column to the table (a DDL operation), and that will not be usable by any programs using this table unless you modify them.
That type of flexibility alone is probably a good reason to go with the second option.
Searching will also be better served with the second option. You may create one index on the raw_values table and support indexed searches on the Channel/Value columns. (I would avoid the name "value" for a column by the way.)
Now if you consider what column(s) to index under the first option, you will probably be stumped: you have 1001 columns there. If you want to support indexed searches on the values, would you index them all? Even if you were dealing with just 10 channels, you would still need to index those 10 columns under your first option; not a good idea in general to load a table with more than a few indexes.
As an aside, if I am not mistaken, the limit is 1000 columns per table these days, but a table with more than 255 columns will store a row in multiple row pieces, each storing 255 columns and that would create a lot of avoidable I/O for each select you issue against this table.
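A minimal sketch of the second design in Oracle (the names and types are illustrative, not prescriptive):
CREATE TABLE channels (
    channel_id NUMBER PRIMARY KEY,
    name       VARCHAR2(100) NOT NULL,
    unit       VARCHAR2(30)
);

CREATE TABLE raw_values (
    serie_id   NUMBER NOT NULL,
    channel_id NUMBER NOT NULL REFERENCES channels (channel_id),
    reading    NUMBER,   -- "value" avoided as a column name, as suggested above
    CONSTRAINT raw_values_pk PRIMARY KEY (serie_id, channel_id)
);

-- One index supports searches on channel/reading for all 1000 channels
CREATE INDEX raw_values_chan_reading_ix ON raw_values (channel_id, reading);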

Database: Primary key columns at the beginning of the table

Does it have any impact to put all the primary key columns at the beginning of the table?
I know partial index reads most likely involve table scans that bring the whole row into the buffer pool for predicate matching. I am curious to know what performance gain, if any, having the primary keys at the top of the table would provide.
In Oracle, the order of the columns of a table has little impact in general on performance.
The reason is that all columns of a row are generally contained on a single block and that the difference in time between finding the first column and the last column of a row in a block is infinitesimal compared to finding/reading the block.
Furthermore, when you reach the database block to read a row, the primary key may not be the most important column.
Here are a few exceptions where column order might have an impact:
when you have > 255 columns in your table, the rows will be split into two (or more) row pieces. Accessing the first 255 columns may be cheaper than accessing the remaining columns.
the last columns of a row take 0 byte of space if they are NULL. As such, columns that contain many NULL values are best left at the end of a row if possible to reduce space usage and therefore IO. In general the impact will be minimal since other NULL columns take 1 byte each so the space saved is small.
when compression is enabled, the efficiency of the compression may depend upon the column order. A good rule of thumb would be that columns with few distinct values should be grouped to enhance the chance that they will be merged by the compression algorithm.
You should think about the order of columns when you use an Index Organized Table (IOT) with the OVERFLOW clause. With this clause, all columns after a chosen dividing column are stored out of line, and accessing them incurs additional cost. Primary key columns are always stored physically at the beginning of the rows in an IOT.
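A sketch of the IOT case (illustrative names; the INCLUDING clause determines which trailing columns are pushed to the overflow segment):
CREATE TABLE readings_iot (
    id     NUMBER PRIMARY KEY,   -- primary key is stored first in the IOT
    status VARCHAR2(10),
    notes  VARCHAR2(4000)        -- everything after "status" goes out of line
)
ORGANIZATION INDEX
INCLUDING status
OVERFLOW;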
At least in SQL Server there is no performance benefit based on the order of the columns in the table, primary key or not. The only benefit to having your primary key columns at the top of the list is organizational. Kind of like having a table with these columns Id, FirstName, LastName, Address1, Address2, City, State, Zip. It's a lot easier to follow in that order than Address2, State, Firstname, Id, Address1, Lastname, Zip, City. I don't know much about Oracle or DB2 but I believe it's the same.
In DB2 (and I think the same question should be checked for the other database management systems), the columns that are modified least often should be placed at the beginning of each row, because when performing an update, DB2 writes everything from the first modified column to the end of the row to the transaction logs.
It only impacts update operations; inserts, deletes and selects are not affected. The benefit is that I/O is slightly reduced, because less information has to be logged when only the last columns are modified. This can matter when frequently updating a few small columns on tables with big rows and lots of records. If the first column is modified, DB2 will write the whole row.
Ordering columns to minimize update logging: http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/topic/com.ibm.db2.luw.admin.dbobj.doc/doc/c0024496.html
(for ORACLE)
Is it fair to say then that any and all primary key columns, even if there is just one, should be the first or among the first few columns in a row? Further, that tagging them onto the END of the row is bad practice, particularly after a series of possibly/likely null attribute fields?
Thus, a row like:
pkcol(s), att1,att2,att3, varchar2(2000)
is better organized for all the reasons stated above than
att1, att2, att3, varchar2(2000), pkcol(s)
Why am I asking? Well, don't judge, but we are simplifying the PK for some tables and the developers have happily tagged the new GUID PK (don't judge #2) onto the end of the row. I am bothered by this but need some feedback to justify my fears. Also, does this matter at all for SQL Server?

Adding a column efficiently in SQL Server

I want to add an integer column to a table with a large number of rows and many indexes (it's a data warehouse fact table).
To keep the row width as narrow as possible all the columns in this table are defined as not null. So I want the new column to be not null with a default of zero.
From experience, adding this column will take some time, presumably because the database will need to rewrite all the rows with the new column filled in. And this presumably will involve updating the clustered index and all the non-clustered indexes.
So should I drop all the indexes before adding the column and then recreate them all afterwards?
Or is there an easier way to do this?
Also, I don't really understand why adding a column that is nullable is so much quicker. Why does this not involve re-writing the records with an extra Is Null bit flipped for each row?
It will require updating the clustered index, yes - this IS the table data, after all.
But I don't see why any of the non-clustered indices would have to be updated - your new column won't be a member of any of the non-clustered indices.
Also, I don't see how dropping and recreating the indices would benefit you in this scenario. If you were bulk-loading several million existing rows from another table or database, then yes, it might be faster (due to the INSERTs being much faster), but adding a column doesn't really suffer from any indices or constraints being around, I don't think.
Marc
SQL Server is a row oriented database. This is in contrast to a column oriented database. This means that in SQL Server, all of the data for a given row is stored together on the disk. Let's have an example:
Say you have a Customer table with 3 columns, FirstName, MiddleInitial, and LastName. Then, say you have 3 records in this table for Jabba T. Hutt, Dennis T. Menace, and George W. Bush.
In a row oriented database (like SQL Server), the records will be stored on disk as such:
Jabba, T, Hutt; Dennis, T, Menace; George, W, Bush;
In contrast, a column oriented database would store the records on disk like this:
Jabba, Dennis, George; T, T, W; Hutt, Menace, Bush;
Where columns are grouped together instead of rows.
Now, when you go to add a column to a table in a row oriented database (SQL Server, for example), the new data has to be inserted alongside each existing row, and shifting the rows requires a lot of read/write operations. So, if you were to insert a new column for the customer prefix that defaults to 'Mr', this is what you'd get:
Mr, Jabba, T, Hutt; Mr, Dennis, T, Menace; Mr, George, W, Bush;
As you can see, all of the original data has been shifted to the right. On the other hand, when you insert a new column that defaults to NULL, no new data has to be put into the existing rows. Thus, there is less shifting, requiring fewer disk read/write operations.
Of course, this is an oversimplification of what's actually going on on disk. There are other things to take into account when dealing with indexes, pages, etc. But it should help you get the picture.
For clarification I'm not at all suggesting you move to a column oriented database, I just put that info in there to help explain what Row oriented meant.
"Also I don't really understand why adding a column that is nullable is so much quicker. Why does this not involve re-writng the records with an extra Is Null bit flipped for each row."
Adding a nullable column merely changes the definition of the table. The individual records are not affected.
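As a sketch of the difference (dbo.FactTable and the column names are placeholders):
-- NOT NULL with a default: existing rows must be populated with 0,
-- so the table data (i.e. the clustered index) gets rewritten
ALTER TABLE dbo.FactTable
    ADD NewMeasure int NOT NULL
    CONSTRAINT DF_FactTable_NewMeasure DEFAULT (0);

-- Nullable, no default: only the table definition changes,
-- the existing records are not touched
ALTER TABLE dbo.FactTable
    ADD NewNullableMeasure int NULL;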

Best way to store tags in a SQL Server table?

What's the best way to store tags for a record? Just use a varchar field? What about when selecting rows that contain tag x? Use the LIKE operator?
thanks!
Depends on two things:
1) The amount of tags/tagged records
2) Whether or not you have a religious opinion on normalization :-)
Unless you're dealing with very large volumes of data, I'd suggest having a 'Tags' table mapping varchar values to integer identifiers, and then a second table mapping tagged records to their tag ids. I'd suggest implementing this first, then checking whether it meets your performance needs. If it doesn't, keep a single table with an id for the tagged row and the actual text of the tag; in that case I'd suggest you use a char column, as it will kill your query if the optimizer does a full table scan against a large table with a varchar column.
Use a tags table with the smallest allowable primary key. If there are fewer than 256 tags, use a byte (tinyint); otherwise a word (smallint). The smaller the key, the smaller and faster the index on the foreign key in the main table.
No, it is generally a bad idea to put multiple pieces of data in a single field. Instead, use a separate Tags table (perhaps with just a TagID and TagName) and then, for each record, indicate the TagID associated with it. If a record is associated with multiple tags, you will have duplicate records with the only difference being TagID.
The advantage here is that you can easily query by tag, by record, and maintain the Tags table separately (i.e. what if a tag name changes?).
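A minimal sketch of that design (names and sizes are illustrative):
CREATE TABLE Tags (
    TagID   tinyint IDENTITY(1,1) PRIMARY KEY,  -- smallest key that fits the tag count
    TagName varchar(50) NOT NULL UNIQUE
);

CREATE TABLE RecordTags (
    RecordID int NOT NULL,        -- references the tagged table
    TagID    tinyint NOT NULL REFERENCES Tags (TagID),
    PRIMARY KEY (RecordID, TagID)
);

-- Rows carrying tag 'x': an indexed join instead of LIKE over a delimited varchar
SELECT rt.RecordID
FROM RecordTags AS rt
JOIN Tags AS t ON t.TagID = rt.TagID
WHERE t.TagName = 'x';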
