How does PostgreSQL manage columns?

I want to know how PostgreSQL manages the columns of a table.
Say, for example,
I have created a table that contains 2 fields: how does PostgreSQL manage these columns and the table? In how many catalog tables does PostgreSQL create an entry for a single column?
I would like to know the structure of how PostgreSQL manages a table and its fields.
I only know about the pg_attribute table.
It would be good if anyone can share useful links.
Any help would be really appreciated.

Tables (and indexes) are organized in 8KB blocks in files in the data directory.
The column definitions are only in pg_attribute.
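As an illustration, here is a minimal catalog query against pg_attribute (the table name my_table is just a placeholder):

-- List the user-defined columns of a table, in definition order.
-- attnum <= 0 are system columns; attisdropped marks dropped columns.
SELECT attname, atttypid::regtype AS data_type, attnum
FROM pg_attribute
WHERE attrelid = 'my_table'::regclass
  AND attnum > 0
  AND NOT attisdropped
ORDER BY attnum;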
A table row with all its columns is stored together in one table block, and a table block can contain several such rows. In other words, PostgreSQL uses the traditional row-oriented storage model.
Details can be read in the documentation.
Note: Don't use PostgreSQL 9.1 any more.

Related

Why can't columnar databases like Snowflake and Redshift change the column order?

I have been working with Redshift and am now testing Snowflake. Both are columnar databases. Everything I have read about this type of database says that they store the information by column rather than by row, which helps with massive parallel processing (MPP).
But I have also seen that they are not able to change the order of columns or add a column in between existing columns (I don't know about other columnar databases). The only way to add a new column is to append it at the end. If you want to change the order, you need to recreate the table with the new order, drop the old one, and rename the new one (this is called a deep copy). But this sometimes isn't possible because of dependencies or even memory utilization.
I'm more surprised by the fact that this can be done in row databases and not in columnar ones. Of course, there must be a reason why it's not a feature yet, but I clearly don't have enough information about it. I thought it was going to be just a matter of changing the ordinal of the columns in the information_schema, but clearly it is not that simple.
Does anyone know the reason for this?
Generally, column ordering within the table is not considered to be a first class attribute. Columns can be retrieved in whatever order you require by listing the names in that order.
Emphasis on column order within a table suggests frequent use of SELECT *. I'd strongly recommend not using SELECT * in columnar databases without an explicit LIMIT clause to minimize the impact.
If column order must be changed, you can do that in Redshift by creating a new, empty version of the table with the columns in the desired order and then using ALTER TABLE APPEND to move the data into the new table very quickly.
https://docs.aws.amazon.com/redshift/latest/dg/r_ALTER_TABLE_APPEND.html
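A sketch of that approach (all table and column names here are illustrative):

-- Create an empty copy of the table with the desired column order
CREATE TABLE sales_new (
    sale_date   DATE,
    customer_id INT,
    amount      DECIMAL(10,2)
);

-- Move the rows into the new table; columns are matched by name,
-- so the differing order is fine (cannot run inside a transaction block)
ALTER TABLE sales_new APPEND FROM sales;

-- Swap the names
DROP TABLE sales;
ALTER TABLE sales_new RENAME TO sales;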
The order in which the columns are stored internally cannot be changed without dropping and recreating them.
Your SQL can retrieve the columns in any order you want.
The general requirement to have columns listed in some particular order is for viewing purposes.
You could define a view with the desired column order and use the view in the required operation.
-- The table stores B first, then A
CREATE OR REPLACE TABLE CO_TEST (B NUMBER, A NUMBER);
INSERT INTO CO_TEST VALUES (1,2),(3,4),(5,6);

-- Returns columns in stored order: B, A
SELECT * FROM CO_TEST;
-- Returns columns in the listed order: A, B
SELECT A, B FROM CO_TEST;

-- The view presents the columns as A, B without touching the table
CREATE OR REPLACE VIEW CO_VIEW AS SELECT A, B FROM CO_TEST;
SELECT * FROM CO_VIEW;
Creating a view to list the columns in the required order will not disturb the actual table underneath the view, and no resources are wasted on recreating the table.
In some databases (Oracle especially), the ordering of columns in a table can make a difference in performance: NULLable columns are best stored at the end of the list. This has to do with how storage is utilized within the data block.

Partition existing tables using PostgreSQL 10

I have gone through a bunch of documentation for PostgreSQL 10 partitioning, but I am still not clear on whether existing tables can be partitioned. Most of the posts mention partitioning existing tables using PostgreSQL 9.
Also, the official PostgreSQL documentation, https://www.postgresql.org/docs/current/static/ddl-partitioning.html, mentions: 'It is not possible to turn a regular table into a partitioned table or vice versa'.
So, my question is: can existing tables be partitioned in PostgreSQL 10?
If the answer is YES, my plan is :
Create the partitions.
Alter the existing table to include the range so new data goes into the new partitions. Once that is done, write a script that loops over the master table and moves the data into the right partitions.
Then truncate the master table and enforce that nothing can be inserted into it.
If the answer is NO, my plan is to make the existing table the first partition (see the sketch below):
Create a new parent table and children (partitions).
Perform a light transaction which renames the existing table to a partition table name and the new parent to the actual table name.
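A sketch of that rename-and-attach approach in PostgreSQL 10 (table names, columns, and ranges are illustrative):

-- Rename the existing table; it becomes the first partition
ALTER TABLE measurements RENAME TO measurements_2017;

-- Create a partitioned parent under the original name
-- (it must have the same columns as the existing table)
CREATE TABLE measurements (
    logdate DATE NOT NULL,
    value   NUMERIC
) PARTITION BY RANGE (logdate);

-- Attach the old table as a partition covering its existing data
ALTER TABLE measurements ATTACH PARTITION measurements_2017
    FOR VALUES FROM ('2017-01-01') TO ('2018-01-01');

-- Add a partition for new data
CREATE TABLE measurements_2018 PARTITION OF measurements
    FOR VALUES FROM ('2018-01-01') TO ('2019-01-01');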
Are there better ways to partition existing tables in PostgreSQL 10/9?

Database Partitioning- partition columns

I was reading about database partitioning in the Oracle documentation link (sales table). I got the concept that in range partitioning, you specify a column (or multiple columns) and give the corresponding ranges of values, so that when you insert a value into the table, it goes into the related partition, making searches efficient.
At the beginning of the Range Partitioning section, it talks about the partitioning columns (sales_year, sales_month and sales_day), but those columns do not appear in the CREATE TABLE statement. What exactly are partitioning columns? Are partitioning columns different from table columns?
No, partitioning columns are not different from table columns. I guess you are talking about the sale_year, sale_month, and sale_day columns in the documentation link.
I think the example given in the link is incorrect. The partitioning is based on time_id, and there are no sale_year, sale_month, and sale_day columns in the table. I think this piece of documentation was copied and pasted from older Oracle versions and has not been updated appropriately in this case.
You can search for sale_year in the links below and find the correct example.
https://docs.oracle.com/cd/F49540_01/DOC/server.815/a67772/partiti.htm
https://books.google.co.in/books?id=z7UYI2-329MC&pg=PA462&lpg=PA462&dq=sale_year,+sale_month,+and+sale_day&source=bl&ots=j6_Pn7saNp&sig=YLwus8UqXzyhkG5_XxcY2wyV7kU&hl=en&sa=X&ved=0ahUKEwiOqPT3193MAhWBMI8KHUDPCCYQ6AEIGzAA#v=onepage&q=sale_year%2C%20sale_month%2C%20and%20sale_day&f=false
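For reference, a sketch of the multicolumn range-partitioning example those older documents describe (the partition bounds here are illustrative):

CREATE TABLE sales (
    invoice_no NUMBER,
    sale_year  INT NOT NULL,
    sale_month INT NOT NULL,
    sale_day   INT NOT NULL
)
PARTITION BY RANGE (sale_year, sale_month, sale_day) (
    -- A row goes into the first partition whose bound exceeds its key
    PARTITION sales_q1 VALUES LESS THAN (1999, 4, 1),
    PARTITION sales_q2 VALUES LESS THAN (1999, 7, 1),
    PARTITION sales_q3 VALUES LESS THAN (1999, 10, 1),
    PARTITION sales_q4 VALUES LESS THAN (2000, 1, 1)
);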

How to create a 'sanitized' copy of our SQL Server database?

We're a manufacturing company, and we've hired a couple of data scientists to look for patterns and correlation in our manufacturing data. We want to give them a copy of our reporting database (SQL 2014), but it must be in a 'sanitized' form. This means that all table names get converted to 'Table1', 'Table2' etc., and column names in each table become 'Column1', 'Column2' etc. There will be roughly 100 tables, some having 30+ columns, and some tables have 2B+ rows.
I know there is a hard way to do this: manually create each table with the sanitized table name and column names, and then use something like SSIS to bulk insert the rows from one table to another. This would be rather time-consuming and tedious because of the manual SSIS column mapping required and the manual setup of each table.
I'm hoping someone has done something like this before and has a much faster, more efficient way.
By the way, the 'sanitized' database will have no indexes or foreign keys. Also, it may not seem to make sense why we would want to do this, but this is what was agreed to by our Director of Manufacturing and the data scientists as the first of what will be many rounds of analysis.
You basically want to scrub the data and objects, correct? Here is what I would do:
1. Restore a backup of the db.
2. Drop all objects not needed (indexes, constraints, stored procedures, views, functions, triggers, etc.).
3. Create a mapping table with two columns and populate it; each row has the original table name and the new table name.
4. Write a script that iterates through the mapping table, row by row, and renames your tables. Better yet, put the data into Excel and create a third column that builds the T-SQL you want to run, then cut/paste and execute it in SSMS.
5. Repeat step 4, but for all columns. It's best to query sys.columns to get all the objects you need, put them into Excel, and build your T-SQL (see the sketch after this list).
6. Repeat again for any other objects needed.
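A sketch of steps 4 and 5, generating the rename statements straight from the system catalogs rather than Excel (the generated names Table1, Column1, ... follow the scheme in the question):

-- Step 4: build sp_rename statements for the tables
SELECT 'EXEC sp_rename ''' + name + ''', ''Table'
       + CAST(ROW_NUMBER() OVER (ORDER BY name) AS VARCHAR(10)) + ''';'
FROM sys.tables;

-- Step 5: build sp_rename statements for the columns
-- (generate these after running the table renames, so t.name is the new name)
SELECT 'EXEC sp_rename ''' + t.name + '.' + c.name + ''', ''Column'
       + CAST(ROW_NUMBER() OVER (PARTITION BY t.name ORDER BY c.column_id) AS VARCHAR(10))
       + ''', ''COLUMN'';'
FROM sys.columns AS c
JOIN sys.tables AS t ON t.object_id = c.object_id;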
A backup/restore will be quicker than dabbling in SSIS and data transfer.
They can see the data but they can't see the column names? What can that possibly accomplish? What are you protecting by not revealing the table or column names? How is a data scientist supposed to evaluate data without context? Without an FK, all I see is a bunch of numbers in a column named colx. What are you expecting to accomplish? Get a confidentiality agreement instead. Consider an FK column customerID versus a materialID: the patterns have widely different meanings and analyses. I would correlate a quality measure with materialID or shiftID, but not with a customerID.
Oh look, there is a correlation between tableA.colB and tableX.colY. Well yes, that customer is a college team and they use aluminum bats.
On top of that, you strip indexes (on tables with 2B+ rows), so the analysis they run will be slow. What does that accomplish?
As for the question as stated: do a backup and restore. Using the system tables, drop all triggers, FKs, indexes, and constraints. Don't forget the triggers and constraints; they may disclose trade secrets. Then rename the columns, and then the tables.

Table clusters in SQL Server

In Oracle, a table cluster is a group of tables that share common columns and store related data in the same blocks. When tables are clustered, a single data block can contain rows from multiple tables. For example, a block can store rows from both the employees and departments tables rather than from only a single table:
http://download.oracle.com/docs/cd/E11882_01/server.112/e10713/tablecls.htm#i25478
Can this be done in SQL Server?
On the one hand, this sounds very much like views. Data is stored in the table, and the views provide access to only those columns within the table specified by the view's definition. (Thus, your "common columns".)
On the other hand, this sounds like how the database engine stores data on the hard drive. In SQL Server, this is done via 8KB pages. Assuming two completely separate table definitions, there is no way to store data from two such distinct tables in the same page. (If an Oracle block is more along the lines of OS files, then it maps to SQL Server files and filegroups, at which point the answer is "yes"... but I suspect this is not what blocks are about.)
Not based on what I am reading here. In SQL Server, each table's pages are independent of other tables' pages.
On the other hand, each table can have one clustered index of its choosing, which can influence performance greatly. In addition, partitioning will influence the execution plan, and if both tables have similar partition functions, this might boost performance; but the normal objective of partitioning is not performance.
Typically, optimizing JOINs involves index strategies (in my experience, preferably with covering non-clustered indexes), as sketched below.
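A sketch of that idea: the closest SQL Server analogue to an Oracle table cluster is to cluster both tables on the shared join key (all names here are illustrative):

-- Physically order each table by the join key, so related rows sit
-- together within each table's own pages (though never in shared pages)
CREATE CLUSTERED INDEX cx_departments ON departments (department_id);
CREATE CLUSTERED INDEX cx_employees   ON employees   (department_id);

-- A covering non-clustered index for a frequent join/query pattern
CREATE NONCLUSTERED INDEX ix_employees_dept
    ON employees (department_id)
    INCLUDE (employee_name, hire_date);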
