Compression and distribution of structured data in PostgreSQL

I am creating a very large table (terabytes in size) in PostgreSQL and, as in Greenplum, I would like to specify compression and distribute the data randomly.
But in the PostgreSQL documentation, I can't find any clause for compression.
Any idea how I can achieve compression and random distribution of the data in PostgreSQL?
Thanks in advance

For compression, there is only TOAST. That compresses data automatically, but only for large rows (exceeding 2000 bytes). There is no way to compress the whole table as such.
I am not sure what "random data distribution" in a table is, but if you want to distribute that table data across several devices, you have to define tablespaces for them and use hash partitioning with a partition on each tablespace.
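As a rough sketch of the hash partitioning idea above, the following Python snippet builds the DDL for one hash partition per tablespace. The table name, column list and tablespace names are hypothetical, the tablespaces are assumed to exist already (created with CREATE TABLESPACE), and declarative hash partitioning assumes PostgreSQL 11 or later.

```python
def hash_partition_ddl(table, key, tablespaces):
    """Build DDL that places one hash partition of `table` on each tablespace."""
    modulus = len(tablespaces)
    # Parent table: hypothetical two-column layout, partitioned by hash of `key`.
    statements = [
        f"CREATE TABLE {table} (id bigint NOT NULL, payload text) "
        f"PARTITION BY HASH ({key});"
    ]
    for remainder, tablespace in enumerate(tablespaces):
        statements.append(
            f"CREATE TABLE {table}_p{remainder} PARTITION OF {table} "
            f"FOR VALUES WITH (MODULUS {modulus}, REMAINDER {remainder}) "
            f"TABLESPACE {tablespace};"
        )
    return statements


# Hypothetical tablespace names, one per device.
ddl = hash_partition_ddl("big_table", "id", ["ts_disk0", "ts_disk1", "ts_disk2", "ts_disk3"])
for statement in ddl:
    print(statement)
```

Rows are then routed to a partition by the hash of the key, which is about as close to Greenplum's random distribution as a single PostgreSQL node gets.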

For compression, PostgreSQL will do this automatically for you when values go above a certain size. Compression is applied to each individual data value, though, not at the full table level. This means that if you have a billion rows that are very narrow, they won't get compressed. Likewise, if you have very many columns, each holding only a small value, they won't get compressed. Details about this scheme (TOAST) are in the manual.
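To illustrate why narrow rows gain nothing from per-value compression, here is a small simulation. It is not PostgreSQL's code: zlib stands in for the real compressor, and the 2000-byte threshold is simply the figure quoted in the other answer.

```python
import zlib

THRESHOLD = 2000  # bytes; assumed stand-in for the TOAST threshold


def store(value: bytes) -> bytes:
    """Compress a single value only when it exceeds the threshold."""
    if len(value) > THRESHOLD:
        return zlib.compress(value)
    return value


# Many narrow values: each is far below the threshold, so none is compressed.
narrow_rows = [b"row-%09d" % i for i in range(100_000)]
narrow_raw = sum(len(v) for v in narrow_rows)
narrow_stored = sum(len(store(v)) for v in narrow_rows)

# One wide, repetitive value: above the threshold, so it is compressed.
wide_value = b"lorem ipsum " * 1000
wide_stored = len(store(wide_value))

print(narrow_raw, narrow_stored)
print(len(wide_value), wide_stored)
```

The narrow rows occupy exactly as much space after "storage" as before, no matter how many of them there are, while the single wide value shrinks considerably.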
If you need it on the full table level, a solution is to create a TABLESPACE for those tables that you want to be compressed and point it to a compressed filesystem. As long as the filesystem still obeys fsync() and standard POSIX semantics, this should be perfectly safe. Details about this in the manual.
PostgreSQL is not natively distributed. If you want a distributed version of PostgreSQL where data can be spread across several nodes, and have those nodes use replication for high availability, there are some 3rd party options like:
Postgres-XL - a forked version of Postgres designed to be distributed and has some other features like MPP.

Table-level compression does not exist in PostgreSQL. There is no built-in way to do that. The only exception is that large values are compressed by default by TOAST, which is clearly inappropriate for many LOBs that are already compressed (pictures like .jpg, .png...).
Read my article about PostgreSQL limitations compared to MS SQL Server.
http://mssqlserver.fr/postgresql-vs-sql-server-mssql-part-3-very-extremely-detailed-comparison/
particularly § "17 – Data and index compression"

Related

Choosing the right database engine for relational data with billions of records

My Python application data structure is pure relational.
My estimation for the biggest table is around 10 billion rows each year (all the other tables are very small).
Each row size is about 20-30 bytes
What is the right database engine for me?

You might consider the following that I have used, but of course this will depend on what your data looks like and how your APP/Users need to interact with it. This is not an exhaustive list, it's only the stuff I have used.
Greenplum Database is an open source distributed Postgres database. http://greenplum.org/
It scales nicely and supports pretty much all Postgres stuff except for full text indexing last I knew
Apache Phoenix: An open source SQL layer on top of Hadoop/HBase. It scales nicely, but the ecosystem is a bit complex (as with Hadoop in general). Cloudera's Impala is similar. https://phoenix.apache.org/
Oracle Partitioning (preferably on RAC). If you can afford the license, Oracle partitioning allows for sharding of your data in various ways. If you have it with RAC, that will also provide parallel query execution
Just partition your data (on any RDBMS) and put the partitions on good disk
Those are the 4 ideas I have actually used. Remember that on good hardware, with some table partitioning, 10B rows isn't really all that much, so you might just need to get a better box (or boxes) and hook it to a SAN with SSD of some kind over a 10G network or better. Also think about putting indexes on a separate disk from where the db files are, and always use SSD if you can afford it.
Anyway, HTH
MG

At 30 bytes per row, that's at most about 300 GB per year, which is a small database, well within the capabilities of Oracle or SQL Server Enterprise editions. You won't need Oracle RAC.
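The estimate can be reproduced with a few lines of Python. This is a back-of-the-envelope sketch using the row count and row sizes from the question; it counts raw payload only, so per-row headers and indexes would add to it.

```python
ROWS_PER_YEAR = 10_000_000_000  # from the question

# Raw payload per year in decimal gigabytes for the quoted row sizes.
yearly_gb = {
    row_bytes: ROWS_PER_YEAR * row_bytes / 1e9
    for row_bytes in (20, 30)
}

for row_bytes, gigabytes in yearly_gb.items():
    print(f"{row_bytes} bytes/row -> {gigabytes:.0f} GB per year")
```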
You'll need to pay attention to application design and indexing/partitioning. Query and storage optimization will have a greater impact on performance than the choice of DBMS will.

SQL server scalability question

We are trying to build an application which will have to store billions of records, 1 trillion+.
A single record will contain text data and metadata about the text document.
Please help me understand the storage limitations. Can a database such as SQL Server or Oracle support this much data, or do I have to look for some other filesystem-based solution? What are my options?
Since the central server has to handle incoming load from many clients, how will parallel insertions and search scale? How do I distribute data over multiple databases or tables? I am a little green on database specifics for an environment at this scale.
Initially, to fill the database, the insert load will be high. Later, as the database grows, search load will increase and inserts will reduce.
The total size of data will cross 1000 TB.
Thanks.

"1 trillion+. A single record will contain text data and metadata about the text document. Please help me understand the storage limitations."
I hope you have a BIG budget for hardware. This is big as in "millions".
A trillion documents at 1024 bytes of total storage per document (VERY unlikely to be realistic when you say text) comes to about 950 terabytes of data. Storage limitations mean you are talking about a high-end SAN here. Using a non-redundant setup of 2 TB discs, that is about 450 discs. Do the math. Add redundancy / RAID to that and you are talking about a major hardware investment. And this assumes only 1 KB per document. If you have on average 16 KB of data per document, this is about 7200 2 TB discs.
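The disc counts can be checked with a short calculation. This sketch assumes decimal units (a 2 TB disc holds 2 * 10^12 bytes) and no redundancy; with those assumptions the results come out somewhat higher than the rounded figures above, and binary units would land in between.

```python
DOCUMENTS = 10**12          # one trillion documents, from the question
DISC_BYTES = 2 * 10**12     # assumed: a 2 TB disc in decimal units


def discs_needed(bytes_per_document):
    """Return (total bytes, non-redundant 2 TB discs) for a given document size."""
    total_bytes = DOCUMENTS * bytes_per_document
    discs = -(-total_bytes // DISC_BYTES)  # integer ceiling division
    return total_bytes, discs


for size in (1024, 16 * 1024):
    total_bytes, discs = discs_needed(size)
    print(f"{size} bytes/doc -> {total_bytes / 1e12:.0f} TB, {discs} discs")
```

Either way the order of magnitude holds: hundreds of discs at 1 KB per document, thousands at 16 KB, before any RAID overhead.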
That is a hardware problem to start with. SQL Server does not scale that high, and you cannot do this in a single system anyway. The normal approach for a document store like this would be a clustered storage system (a clustered or otherwise distributed file system) plus a central database for the keywords / tagging. Depending on load / inserts, possibly with replicas of the database for distributed search.
Whatever it is going to be, the storage / backup requirements are enormous. Large project here, large budget.
IO load is going to be another issue, hardware-wise. You will need a large machine and get a TON of IO bandwidth into it. I have seen 8 Gb links overloaded on a SQL Server (fed by an HP EVA with 190 discs) and I can imagine you will run something similar. You will want hardware with as much RAM as technically possible, regardless of the price, unless you store the blobs outside.
SQL row compression may come in VERY handy. Full text search will be a problem.
"The total size of data will cross 1000 TB."
No. Seriously. It will be bigger, I think. 1000 TB would assume the documents are small, like the XML form of a travel ticket.

According to the MSDN page on SQL Server limitations, it can accommodate 524,272 terabytes in a single database - although it can only accommodate 16TB per file, so for 1000TB, you'd be looking to implement partitioning. If the files themselves are large, and just going to be treated as blobs of binary, you might also want to look at FILESTREAM, which does actually keep the files on the file system, but maintains SQL Server notions such as Transactions, Backup, etc.
All of the above is for SQL Server. Other products (such as Oracle) should offer similar facilities, but I couldn't list them.
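To put those limits in perspective, here is a quick calculation using only the figures quoted above (maximum database size, maximum file size, and the 1000 TB target from the question).

```python
import math

MAX_DB_TB = 524_272   # maximum database size quoted above
MAX_FILE_TB = 16      # maximum size of a single data file quoted above
TARGET_TB = 1000      # total data size from the question

# Minimum number of data files needed to hold the target volume.
min_files = math.ceil(TARGET_TB / MAX_FILE_TB)

# How much of the documented per-database limit the target would use.
share_of_limit = TARGET_TB / MAX_DB_TB

print(f"at least {min_files} data files")
print(f"{share_of_limit:.2%} of the per-database limit")
```

So the documented limits are not the constraint here. Spreading the data over at least 63 files is the minimum; hardware and IO are the harder part.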

In the SQL Server space you may want to take a look at SQL Server Parallel Data Warehouse, which is designed for 100s TB / Petabyte applications. Teradata, Oracle Exadata, Greenplum, etc also ought to be on your list. In any case you will be needing some expert help to choose and design the solution so you should ask that person the question you are asking here.

When it comes to databases, scaling is quite tricky, and there can be multiple components involved in getting good performance, such as a Redis cache, sharding, read replicas, etc.
The post below describes DB scalability options in simplified form.
http://www.cloudometry.in/2015/09/relational-database-scalability-options.html

Fastest method to fill a database table with 10 Million rows

What is the fastest method to fill a database table with 10 million rows? I'm asking about the technique but also about any specific database engine that would allow for a way to do this as fast as possible. I'm not requiring this data to be indexed during this initial data-table population.

Using SQL to load a lot of data into a database will usually result in poor performance. In order to do things quickly, you need to go around the SQL engine. Most databases (including Firebird I think) have the ability to backup all the data into a text (or maybe XML) file and to restore the entire database from such a dump file. Since the restoration process doesn't need to be transaction aware and the data isn't represented as SQL, it is usually very quick.
I would write a script that generates a dump file by hand, and then use the database's restore utility to load the data.
After a bit of searching I found FBExport, that seems to be able to do exactly that - you'll just need to generate a CSV file and then use the FBExport tool to import that data into your database.

The fastest method is probably running an INSERT SQL statement with a SELECT FROM. I've generated test data to populate tables from other databases, and even from the same database, a number of times. But it all depends on the nature and availability of your own data. In my case I had enough rows of collected data that a few select/insert routines with random row selection, applied half-cleverly against real data, yielded decent test data quickly. In some cases where table data was uniquely identifying, I used intermediate tables and frequency distribution sorting to eliminate things like uncommon names (eliminating instances where a count with GROUP BY was less than or equal to 2).
Also, Red Gate actually provides a utility to do just what you're asking. It's not free and I think it's SQL Server-specific, but their tools are top notch. Well worth the cost. There's also a free trial period.
If you don't want to pay for their utility, you could conceivably build your own pretty quickly. What they do is not magic by any means. A decent developer should be able to knock out a similarly-featured, though alpha/hardcoded, version of the app in a day or two...
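A minimal sketch of the INSERT ... SELECT approach with random row selection, using Python's built-in sqlite3 module so it runs anywhere. The table names and the generated source rows are made up for illustration; on another engine the random-ordering function will be spelled differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical source table standing in for "real" collected data.
conn.execute("CREATE TABLE source_people (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO source_people (name) VALUES (?)",
    [(f"name-{i}",) for i in range(1000)],
)

conn.execute("CREATE TABLE test_people (name TEXT)")

# Each statement copies a random sample entirely inside the engine,
# so no rows travel through the client.
for _ in range(10):
    conn.execute(
        "INSERT INTO test_people (name) "
        "SELECT name FROM source_people ORDER BY RANDOM() LIMIT 500"
    )
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM test_people").fetchone()[0]
print(count)
```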

You might be interested in the answers to this question. It looks at uploading a massive CSV file to a SQL server (2005) database. For SQL Server, it appears that a SSIS DTS package is the fastest way to bulk import data into a database.

It entirely depends on your DB. For instance, Oracle has something called direct path load (http://download.oracle.com/docs/cd/B10501_01/server.920/a96652/ch09.htm), which effectively disables indexing, and if I understand correctly, builds the binary structures that will be written to disk on the -client- side rather than sending SQL over.
Combined with partitioning and rebuilding indexes per partition, we were able to load a 1 billion row (I kid you not) database in a relatively short order. 10 million rows is nothing.

Use MySQL or MS SQL and built-in functions to generate records inside the database engine. Or generate a text file (in a CSV-like format) and then use the bulk copy functionality.
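For the text-file route, a sketch like the following generates the CSV. The column layout and file name are invented for the example, and the row count is kept small here; the resulting file would then be handed to the engine's bulk loader (for example COPY in PostgreSQL, BULK INSERT or bcp in SQL Server, LOAD DATA INFILE in MySQL).

```python
import csv
import os
import tempfile


def write_rows(path, n_rows):
    """Write a header plus n_rows of synthetic data to a CSV file."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "name", "amount"])  # hypothetical columns
        for i in range(n_rows):
            writer.writerow([i, f"name-{i}", i % 1000])
    return path


path = os.path.join(tempfile.mkdtemp(), "bulk_rows.csv")
write_rows(path, 100_000)  # raise to 10_000_000 for the real load

with open(path, newline="") as handle:
    line_count = sum(1 for _ in handle)
print(line_count)
```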

Relational database design pattern question - Massive Read Only database

I have an interesting challenge of building a database that imports data from about 500 different sources.
Each source has its own schema, and many are very, very different. However, they are all data about a common entity.
My first thought is a typical Entity / Attribute / Value (EAV) schema. However, after converting the denormalized import from one source (550k rows) into EAV, I end up with 36 million rows in the Attribute_Value table. With proper indexes, this is still very fast, but this is just one out of 500 import sources so far.
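The row explosion is inherent to the conversion, as this small sketch shows. The sample rows and attribute names are invented; the last line just derives the average number of attributes per source row from the two counts given above.

```python
def to_eav(entity_id, row):
    """Turn one wide row into (entity, attribute, value) triples, skipping NULLs."""
    return [
        (entity_id, attribute, value)
        for attribute, value in row.items()
        if value is not None
    ]


# Hypothetical wide rows from one import source.
wide_rows = [
    {"name": "Widget", "color": "red", "weight_kg": 1.2},
    {"name": "Gadget", "color": None, "weight_kg": 0.4},
]

eav_rows = [
    triple
    for entity_id, row in enumerate(wide_rows, start=1)
    for triple in to_eav(entity_id, row)
]
print(len(wide_rows), "wide rows ->", len(eav_rows), "EAV rows")

# Average attributes per row implied by the figures in the question.
attributes_per_row = 36_000_000 / 550_000
print(round(attributes_per_row))
```

Every non-null column becomes its own row, so a source with around 65 populated columns multiplies its row count by around 65.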
I don't think this will scale. However, it does make for very nice logical partitioning: we don't need to join across import sources, so we could (theoretically) build out 50 or so separate databases.
I'm looking for people who have worked with massive datasources, and their experience with how to handle things when your row count is in the hundreds of millions.

Have you considered OLAP solutions? They are probably designed for situations like yours. Massive amount of data to read and analyze.

I have billion+ row tables. The number of rows is not as critical as the fragmentation level and the width of the table itself: the wider the table, the fewer rows you can fit on a page.
Besides OLAP/SSAS:
Have you looked at using partition functions (new in SQL Server 2005)?
You could also take advantage of page and row level compression (new in SQL Server 2008). This will help you fit more data into RAM. I did my own testing with compression; see "A Quick Look At Compression In SQL 2008" for how it compared to no compression.

Using SQL Server as Image store

Is SQL Server 2008 a good option to use as an image store for an e-commerce website? It would be used to store product images of various sizes and angles. A web server would output those images, reading the table by a clustered ID. The total image size would be around 10 GB, but will need to scale. I see a lot of benefits over using the file system, but I am worried that SQL server, not having an O(1) lookup, is not the best solution, given that the site has a lot of traffic. Would that even be a bottle-neck? What are some thoughts, or perhaps other options?

10 GB is not a huge amount of data, so you can probably use the database to store it without big issues. Of course, performance-wise it's best to use the filesystem, and for safety and management it's better to use the DB (backups and consistency).
Happily, SQL Server 2008 allows you to have your cake and eat it too, with:
The FILESTREAM Attribute
In SQL Server 2008, you can apply the FILESTREAM attribute to a varbinary column, and SQL Server then stores the data for that column on the local NTFS file system. Storing the data on the file system brings two key benefits:
Performance matches the streaming performance of the file system.
BLOB size is limited only by the file system volume size.
However, the column can be managed just like any other BLOB column in SQL Server, so administrators can use the manageability and security capabilities of SQL Server to integrate BLOB data management with the rest of the data in the relational database—without needing to manage the file system data separately.
Defining the data as a FILESTREAM column in SQL Server also ensures data-level consistency between the relational data in the database and the unstructured data that is physically stored on the file system. A FILESTREAM column behaves exactly the same as a BLOB column, which means full integration of maintenance operations such as backup and restore, complete integration with the SQL Server security model, and full-transaction support.
Application developers can work with FILESTREAM data through one of two programming models; they can use Transact-SQL to access and manipulate the data just like standard BLOB columns, or they can use the Win32 streaming APIs with Transact-SQL transactional semantics to ensure consistency, which means that they can use standard Win32 read/write calls to FILESTREAM BLOBs as they would if interacting with files on the file system.
In SQL Server 2008, FILESTREAM columns can only store data on local disk volumes, and some features such as transparent encryption and table-valued parameters are not supported for FILESTREAM columns. Additionally, you cannot use tables that contain FILESTREAM columns in database snapshots or database mirroring sessions, although log shipping is supported.

Check out this white paper from MS Research (http://research.microsoft.com/research/pubs/view.aspx?msr_tr_id=MSR-TR-2006-45)
They detail exactly what you're looking for. The short version is that any file size over 1 MB starts to degrade performance compared to saving the data on the file system.

I doubt that O(log n) for lookups would be a problem. You say you have 10GB of images. Assuming an average image size of say 50KB, that's 200,000 images. Doing an indexed lookup in a table for 200K rows is not a problem. It would be small compared to the time needed to actually read the image from disk and transfer it through your app and to the client.
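The arithmetic behind that estimate, as a quick sketch. It uses decimal units and the assumed 50 KB average from above; the comparison count is the depth of a plain binary search, which is an upper bound on what a B-tree index with a much larger fan-out would need in page reads.

```python
import math

TOTAL_BYTES = 10 * 10**9       # 10 GB of images, from the question
AVG_IMAGE_BYTES = 50 * 10**3   # assumed 50 KB average image size

image_count = TOTAL_BYTES // AVG_IMAGE_BYTES

# Comparisons needed by a binary search over that many keys.
comparisons = math.ceil(math.log2(image_count))

print(image_count, "images,", comparisons, "comparisons per lookup")
```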
It's still worth considering the usual pros and cons of storing images in a database versus storing paths in the database to files on the filesystem. For example:
Images in the database obey transaction isolation, automatically delete when the row is deleted, etc.
A database with 10 GB of images is of course larger than a database storing only pathnames to image files. Backup speed and other factors are relevant.
You need to set MIME headers on the response when you serve an image from a database through an application.
Images on a filesystem are more easily cached by the web server (e.g. Apache mod_mmap), or could be served by a leaner web server like lighttpd. This is actually a pretty big benefit.

For something like an e-commerce web site, I would be more likely to go with storing the image in a blob store in the database. While you don't want to engage in premature optimization, just having my images easily organized alongside my data, as well as very portable, is one automatic benefit for something like e-commerce.

If the images are indexed then lookup won't be a big problem. I'm not sure but I don't think the lookup for file system is O(1), more like O(n) (I don't think the files are indexed by the file system).
What worries me in this setup is the size of the database, but if managed correctly that won't be a big problem, and a big advantage is that you have only one thing to backup (the database) and not worry about files on disk.

Normally a good solution is to store the images themselves on the filesystem, and the metadata (file name, dimensions, last updated time, anything else you need) in the database.
Having said that, there's no "correct" solution to this.