PostgreSQL - calculate database size

I need to build a deployment plan for a medium-sized application that uses many PostgreSQL databases (720). The schemas are almost all the same, but I have to keep the databases separate for manageability and performance. Each database holds about 400,000 to 1,000,000 records and sees both reads and writes.
I have three questions:
1. How can I work out how many databases each machine can handle (CentOS, 2.08 GHz CPU and 4 GB RAM)? I guess concurrency will be about 10.
2. Is there a tutorial on how to estimate database size?
3. Can PostgreSQL run in an Active/Standby configuration?

I don't think that your server can possibly handle such a load (if these are your true numbers).
A simple calculation: let's round your 720 databases up to 1,000.
Let's also round your average row width of 7,288 bytes up to 10,000 bytes (10 KB).
Assume that every database will be using 1 million rows.
Considering all of this, the total size of all databases in bytes can be estimated as:
1,000 * 10,000 * 1,000,000 = 10,000,000,000,000 = 10 TB
In other words, you will need at least a few of the biggest hard drives money can buy (probably 4 TB each), and then you will need hardware or software RAID to get adequate reliability out of them.
Note that I did not account for indexes. Depending on the nature of your data and your queries, indexes can take anything from 10% to 100% of your data size. Full-text search indexes can take 5x more than the raw data.
At any rate, your server with just 4 GB of RAM will be barely moving while trying to serve such a huge installation.
However, it should be able to serve not 1,000, but probably 10 or slightly more databases of the kind you describe.
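To make question 2 concrete, PostgreSQL can report its own sizes, so you can measure one representative database and multiply out. A minimal sketch (the filter on the public schema is an assumption about where your tables live):

    -- Total size of the current database, human readable
    SELECT pg_size_pretty(pg_database_size(current_database()));

    -- Per-table breakdown: table data, indexes, and combined size
    SELECT c.relname                                      AS table_name,
           pg_size_pretty(pg_relation_size(c.oid))        AS table_size,
           pg_size_pretty(pg_indexes_size(c.oid))         AS index_size,
           pg_size_pretty(pg_total_relation_size(c.oid))  AS total_size
    FROM   pg_class c
    JOIN   pg_namespace n ON n.oid = c.relnamespace
    WHERE  c.relkind = 'r'
      AND  n.nspname = 'public'   -- assumption: tables live in the public schema
    ORDER  BY pg_total_relation_size(c.oid) DESC;

Measure one typical database after loading realistic data, multiply by 720, and add the index and RAID overhead discussed above.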

Related

Question about SQL Server replication scalability

I'm hoping someone has some insight to offer here. I'm in an environment where we have a central database server with a database of around 20GB and individual database servers across about 200 facilities. The intention is to run a copy of our application at each facility pointing at their local server, but to sync all databases in both directions as often as possible (no more than 10,000 rows affected per day, individual rows on average 1.5kb). Due to varying connectivity, a facility could be offline for a week or two at times and it needs to catch up once back online.
Question: Using pull replication with the merge strategy, are there practical limits that would affect our environment? At 50, 100, 200 facilities, what negative effects can we expect to see, if any? What kind of bandwidth expectations should we have for the central server (I'm finding very little about this number anywhere I look)?
I appreciate any thoughts or guidance you may have.
Based on your description, the math looks like this:
1.5 KB (per row) * 10,000 rows ≈ 15 MB per day (minimum) incoming at every one of your 50 to 200 sites.
15 MB * (50 to 200 sites) ≈ 0.75 to 3 GB per day (minimum) sent from your central server.
Those steady-state volumes are modest, but a facility catching up after a week or two offline will pull a much larger burst in one go, and the hub pays that cost for every site that has fallen behind.
So bandwidth might still be a concern. You will definitely want to monitor your bandwidth and throughput. The negative side effect to expect is periodic slowness at your hub (at every sync).
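For completeness, the same back-of-the-envelope numbers as a query you can adjust; the row size and counts are simply the figures from the question:

    -- Rough daily merge-sync volume, per site and for the hub at 200 sites
    SELECT 1.5 * 10000 / 1024.0        AS mb_per_site_per_day,   -- about 15 MB
           1.5 * 10000 / 1024.0 * 200  AS mb_from_hub_per_day;   -- about 2.9 GB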

MariaDb one instance vs. many having hundreds of databases on Ubuntu server

I'm finishing a web app that consumes a lot of data. There are currently 150 tables per database: many tables with 1,000 or fewer records, some with up to 10,000 records, and just a few with up to 100,000 records, with one database per customer. I'm setting up a server with 64 GB RAM, a 1 TB NVMe SSD, a 4-core Xeon at 4 GHz and 1 GB of bandwidth. I'm planning to run 10 instances of MariaDB with 6 GB RAM each, leaving the rest for the OS (Ubuntu 18 64-bit), and putting 10-15 databases per instance.
Do you think this could be a good approach for the project?
Although 15000 tables is a lot, I see no advantage in setting up multiple instances on a single machine. Here's the rationale:
Unless you have lousy indexes, even a single instance will not chew up all the CPUs.
Having 10 instances means the ancillary data (tables, caches, other overhead) will be repeated 10 times.
When one instance is busy, it cannot rob RAM from the other instances.
If you do a table scan on 100K rows in a 6G instance (perhaps 4G for innodb_buffer_pool_size), it may push the other users in that instance out of the buffer_pool. Yeah, there are DBA things you could do, but remember, you're not a DBA.
150 tables? Could it be that there is some "over-normalization"?
I am actually a bit surprised that you are worried about 150 tables where the largest tables have at most 100k rows.
That certainly does not look like much of a problem for the modern hardware you are planning to use. I will, however, suggest allocating more RAM to the database and using a large enough InnoDB buffer pool; 70% of total RAM is the number suggested by MariaDB.
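As a quick check on that buffer pool suggestion, something along these lines should work; the 45 GB figure assumes one instance on the 64 GB box, and online resizing needs MariaDB 10.2+ (or MySQL 5.7+):

    -- Current InnoDB buffer pool size, in GB
    SELECT @@innodb_buffer_pool_size / 1024 / 1024 / 1024 AS buffer_pool_gb;

    -- Resize online to roughly 70% of a 64 GB server
    SET GLOBAL innodb_buffer_pool_size = 45 * 1024 * 1024 * 1024;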
I have seen clients running MariaDB with much larger data sets and a lot more tables at very high concurrency!
Anyway, the best way to proceed is, of course, to test during your UAT cycles with data loads similar to production. I am still sure it's not going to be a problem.
If you can set up master/slave replication and put MariaDB MaxScale on top, MaxScale can give you automatic server selection, connection failover and transaction replay, which gives you a very high level of availability. It also takes care of load balancing with its read/write splitting service, so your servers' load will be balanced and the overall experience smooth :)
Not sure whether the above is something you have already planned for in your design, but I'm putting it here for a second look, as your application would certainly benefit.
Hope this helps.
Cheers,
Faisal.

SQL Server long running query taking hours but using low CPU

I'm running some stored procedures in SQL Server 2012 under Windows Server 2012 in a dedicated server with 32 GB of RAM and 8 CPU cores. The CPU usage is always below 10% and the RAM usage is at 80% because SQL Server has 20 GB (of 32 GB) assigned.
There are some stored procedures that are taking 4 hours some days and other days, with almost the same data, are taking 7 or 8 hours.
I'm using the least restrictive isolation level so I think this should not be a locking problem. The database size is around 100 GB and the biggest table has around 5 million records.
The processes have bulk inserts, updates and deletes (in some cases I can use truncate to avoid generating logs and save some time). I'm making some full-text-search queries in one table.
I have full control of the server so I can change any configuration parameter.
I have a few questions:
Is it possible to improve the performance of the queries using parallelism?
Why is the CPU usage so low?
What are the best practices for configuring SQL Server?
What are the best free tools for auditing the server? I tried one from Microsoft called SQL Server 2012 BPA, but the report is always empty, with no warnings.
EDIT:
I checked the log and I found this:
03/18/2015 11:09:25,spid26s,Unknown,SQL Server has encountered 82 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [C:\Program Files\Microsoft SQL Server\MSSQL11.HLSQLSERVER\MSSQL\DATA\templog.ldf] in database [tempdb] (2). The OS file handle is 0x0000000000000BF8. The offset of the latest long I/O is: 0x00000001fe4000
Bump up max server memory to 24 GB.
Move tempdb off the C: drive and consider multiple tempdb files, with autogrowth of at least 128 MB or 256 MB.
Install the Performance Dashboard and run its reports to see what queries are running and check waits.
If you are using 10% autogrowth on your user data and log files, change that to fixed increments similar to the tempdb growth above.
Using the Performance Dashboard, check for obvious missing indexes that predict a 95% or higher improvement impact.
Disregard all the naysayers who say not to do what I'm suggesting. If you do these 5 things and you're still having trouble, post some of the results from the Performance Dashboard, which, by the way, is free.
One more thing that may be helpful: download and install the sp_whoisactive stored procedure, run it, and see what processes are running. Research the queries that you find after running sp_whoisactive.
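A minimal sketch of the first two suggestions; the E:\tempdb path is only a placeholder for whichever non-C: drive you choose, and the file move takes effect after the next restart:

    -- Cap SQL Server at 24 GB so the OS keeps some headroom
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'max server memory (MB)', 24576;
    RECONFIGURE;

    -- Relocate the tempdb log off C: and use a fixed 256 MB growth increment
    ALTER DATABASE tempdb
        MODIFY FILE (NAME = templog,
                     FILENAME = 'E:\tempdb\templog.ldf',  -- placeholder path
                     FILEGROWTH = 256MB);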
query taking hours but using low CPU
You say that as if CPU mattered for most database operations. Hint: it does not.
Databases need IO. RAM in some cases helps mitigate this, but in the end it comes down to IO.
And you know what I see in your question? CPU, memory (somehow assuming 32 GB is impressive), but NO WORD ON DISC LAYOUT.
And that is what matters: discs, and the distribution of files to spread the load.
If you look at the performance counters, you will see latency being super high on the discs, because whatever "pathetic" (in SQL Server terms) disc layout you have there, it simply is not up to the task.
Time to start buying. SSDs are a LOT cheaper than discs. You may ask "How are they cheaper?" Well, you do not buy GB, you buy IO. And last time I checked, SSDs did not cost 100 times the price of discs, but they have 100 times or more the IO, and we are always talking about random IO.
Then isolate tempdb on a separate SSD. tempdb either does not do a lot or does a TON, and you want to see which.
Then isolate the log file.
Make multiple data files, for the database and for tempdb (particularly tempdb: as many as you have cores).
And yes, this will cost money. But in the end you need IO, and like most developers you got CPU. Bad for a database.
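If you want to confirm the disc-latency theory from SQL Server's own counters before buying anything, a query along these lines works; the numbers are cumulative since the last restart, so treat the averages as a rough signal:

    -- Average I/O latency per database file since the last restart
    SELECT DB_NAME(vfs.database_id)                             AS database_name,
           mf.physical_name,
           vfs.num_of_reads,
           vfs.num_of_writes,
           vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_latency_ms,
           vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_latency_ms
    FROM   sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    JOIN   sys.master_files AS mf
           ON mf.database_id = vfs.database_id
          AND mf.file_id     = vfs.file_id
    ORDER  BY avg_write_latency_ms DESC;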

SQL server scalability question

We are trying to build an application which will have to store billions of records: 1 trillion+.
A single record will contain text data and metadata about the text document.
Please help me understand the storage limitations. Can a database such as SQL Server or Oracle support this much data, or do I have to look at some other, filesystem-based solution? What are my options?
Since the central server has to handle incoming load from many clients, how will parallel insertions and searches scale? How should data be distributed over multiple databases or tables? I am a little green on database specifics for an environment at this scale.
Initially, while the database is being filled, the insert load will be high; later, as the database grows, the search load will increase and inserts will taper off.
The total size of the data will cross 1000 TB.
Thanks.
1 trillion+. A single record will contain text data and metadata about the text document. Please help me understand the storage limitations.
I hope you have a BIG budget for hardware. This is big as in "millions".
A trillion documents at 1,024 bytes of total storage per document (VERY unlikely to be realistic when you say "text") is about 950 terabytes of data. "Storage limitations" means you are talking high-end SAN here. Using a non-redundant setup of 2 TB discs, that is roughly 475 discs. Do the maths. Add redundancy / RAID to that and you are talking a major hardware investment. And this assumes only 1 KB per document. If you average 16 KB per document, this is... on the order of 8,000 2 TB discs.
That is a hardware problem to start with. SQL Server does not scale that high, and you cannot do this in a single system anyway. The normal approach for a document store like this would be a clustered storage system (a clustered or otherwise distributed file system) plus a central database for the keywords / tagging, depending on load / inserts possibly with replication of the database for distributed search.
Whatever it is going to be, the storage / backup requirements are enormous. Large project here, large budget.
IO load is going to be another issue, hardware-wise. You will need a large machine and a TON of IO bandwidth into it. I have seen 8 Gb links overloaded on a SQL Server (fed by an HP EVA with 190 discs) and I can imagine you will run into something similar. You will want hardware with as much RAM as technically possible, regardless of price, unless you store the blobs outside the database.
SQL row compression may come in VERY handy. Full-text search will be a problem.
the total size of data will cross 1000 TB.
No. Seriously. It will be bigger than that, I think. 1000 TB would assume the documents are small, like the XML form of a travel ticket.
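On the row-compression point above, enabling it is a single rebuild per table; dbo.documents is a hypothetical table name:

    -- Illustrative only: enable row compression on a large table
    ALTER TABLE dbo.documents REBUILD WITH (DATA_COMPRESSION = ROW);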
According to the MSDN page on SQL Server limitations, it can accommodate 524,272 terabytes in a single database - although it can only accommodate 16TB per file, so for 1000TB, you'd be looking to implement partitioning. If the files themselves are large, and just going to be treated as blobs of binary, you might also want to look at FILESTREAM, which does actually keep the files on the file system, but maintains SQL Server notions such as Transactions, Backup, etc.
All of the above is for SQL Server. Other products (such as Oracle) should offer similar facilities, but I couldn't list them.
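A hedged sketch of the partitioning idea mentioned above; the filegroup names, boundary values and table definition are all illustrative assumptions, and the filegroups must already exist in the database:

    -- Illustrative only: spread one huge table across four filegroups by doc_id range
    CREATE PARTITION FUNCTION pf_doc_id (bigint)
        AS RANGE RIGHT FOR VALUES (250000000000, 500000000000, 750000000000);

    CREATE PARTITION SCHEME ps_doc_id
        AS PARTITION pf_doc_id TO (fg_docs_1, fg_docs_2, fg_docs_3, fg_docs_4);

    CREATE TABLE dbo.documents (
        doc_id   bigint         NOT NULL,
        doc_text nvarchar(max)  NULL,
        meta     nvarchar(4000) NULL,
        CONSTRAINT pk_documents PRIMARY KEY CLUSTERED (doc_id)
    ) ON ps_doc_id (doc_id);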
In the SQL Server space you may want to take a look at SQL Server Parallel Data Warehouse, which is designed for 100s TB / Petabyte applications. Teradata, Oracle Exadata, Greenplum, etc also ought to be on your list. In any case you will be needing some expert help to choose and design the solution so you should ask that person the question you are asking here.
When it comes to the database side, it gets quite tricky at this scale, and multiple components can be involved to get performance: a Redis cache, sharding, read replicas, etc.
The post below describes simplified DB scalability options.
http://www.cloudometry.in/2015/09/relational-database-scalability-options.html

SQL Server FILESTREAM performance

SQL Server FILESTREAM has some known limitations.
1) Database mirroring does not support FILESTREAM.
2) For clustering, FILESTREAM filegroups must be put on a shared disk which defeats the purpose of creating the clusters.
Given these limitations, is it advisable to build a FILESTREAM solution? I'm looking to save and retrieve 0.5 million files in a FILESTREAM database (approx. 1 TB of disk) which would be accessed simultaneously by approx. 2,000 users. Given that FILESTREAM cannot be clustered or mirrored, how does one devise a scalable solution?
If I live with a non scalable solution what would be the performance of such a system. Can I serve up say 100 users with 100 1 MB files within a 5 second window?
Reality check: issue 2 is a non-issue. In a cluster, ALL data must be on shared discs, otherwise the node that takes over after a failover cannot access the data. If that defeats the purpose of a cluster, you are invited to try installing a SQL Server cluster without shared discs. ALL data storage on clusters must be on shared discs; it has been like this since the cluster service was first created for Windows.
Which basically makes your conclusions already quite, hm, wrong.
You also need to do some mathematics. 5th-grade style.
Can I serve up say 100 users with 100 1 MB files within a 5 second window?
Ignore SQL Server for a moment. Depending on how I read this, it is either 100 MB or 10,000 MB. Anyhow, 100 MB in 5 seconds = 20 MB per second, which with protocol overhead runs around 200 Mbit/s. This is serious traffic. We are talking a minimum of 250 to 300 Mbit/s of external bandwidth needed.
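For reference, a minimal FILESTREAM table looks like the sketch below; the filegroup name fg_files and the table and column names are assumptions, and FILESTREAM must already be enabled on the instance and the database:

    -- Minimal FILESTREAM table sketch (assumes an existing FILESTREAM filegroup fg_files)
    CREATE TABLE dbo.stored_files (
        file_id   uniqueidentifier ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
        file_name nvarchar(260)    NOT NULL,
        contents  varbinary(max)   FILESTREAM NULL
    ) FILESTREAM_ON fg_files;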
