I'm hoping someone has some insight to offer here. I'm in an environment with a central database server holding a database of around 20 GB and individual database servers across about 200 facilities. The intention is to run a copy of our application at each facility pointing at its local server, but to sync all databases in both directions as often as possible (no more than 10,000 rows affected per day, individual rows averaging 1.5 KB). Due to varying connectivity, a facility could be offline for a week or two at times, and it needs to catch up once back online.
Question: Using pull replication with the merge strategy, are there practical limits that would affect our environment? At 50, 100, 200 facilities, what negative effects can we expect to see, if any? What kind of bandwidth expectations should we have for the central server (I'm finding very little about this number anywhere I look)?
I appreciate any thoughts or guidance you may have.
Based on your description, the math looks like this:
1.5 KB (per row) * 10,000 rows = 15 MB per day (min) incoming at every one of your 50 to 200 sites.
15 MB * (50 to 200 sites) = 0.7 to 3 GB per day (min), sent from your central server.
Your sites will see light traffic (15 MB per day each) and your hub will be the busiest point (up to ~3 GB per day).
So bandwidth can still be a concern, especially over the intermittent facility links, and a site that has been offline for a week or two will pull its whole backlog at once when it reconnects. You will definitely want to monitor your bandwidth and throughput. The negative side effect to expect is periodic slowness at your hub (at every sync).
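A quick back-of-envelope script for playing with these numbers (a sketch; the row size and row count come from the question, and the offline window is the worst case you described):

```python
# Rough traffic estimate for the hub-and-spoke merge replication topology.
ROW_SIZE_KB = 1.5        # average row size, from the question
ROWS_PER_DAY = 10_000    # maximum rows changed per day, from the question
OFFLINE_DAYS = 14        # assumed worst-case offline window for a facility

daily_change_mb = ROW_SIZE_KB * ROWS_PER_DAY / 1024    # ~14.6 MB/day of changes
catchup_mb = daily_change_mb * OFFLINE_DAYS            # backlog after two weeks offline

for sites in (50, 100, 200):
    hub_daily_gb = daily_change_mb * sites / 1024
    print(f"{sites:>3} sites: hub sends ~{hub_daily_gb:.2f} GB/day")

print(f"a site offline {OFFLINE_DAYS} days pulls ~{catchup_mb:.0f} MB to catch up")
```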
I'm finishing a web app that consumes a lot of data. Currently there are 150 tables per database: many tables with 1,000 or fewer records, some with 10,000 or fewer, and just a few with 100,000 or fewer, with one database per customer. I'm setting up a server with 64 GB RAM, a 1 TB NVMe SSD, a 4-core Xeon at 4 GHz, and 1 Gbps bandwidth. I'm planning to run 10 instances of MariaDB with 6 GB RAM each, leaving the rest for the OS (Ubuntu 18, 64-bit), and putting 10-15 databases per instance.
Do you think this could be a good approach for the project?
Although 15000 tables is a lot, I see no advantage in setting up multiple instances on a single machine. Here's the rationale:
Unless you have lousy indexes, even a single instance will not chew up all the CPUs.
Having 10 instances means the ancillary overhead (data dictionary, caches, other per-instance structures) will be repeated 10 times.
When one instance is busy, it cannot rob RAM from the other instances.
If you do a table scan on 100K rows in a 6G instance (perhaps 4G for innodb_buffer_pool_size), it may push the other users in that instance out of the buffer_pool. Yeah, there are DBA things you could do, but remember, you're not a DBA.
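To put rough numbers on the RAM point (a sketch; the 70% buffer pool rule of thumb is common MariaDB guidance, and the per-instance overhead figure is an assumption for illustration):

```python
# Compare usable InnoDB buffer pool RAM: one instance vs. ten on a 64 GB box.
TOTAL_RAM_GB = 64
PER_INSTANCE_OVERHEAD_GB = 1      # assumed fixed cost per instance (caches, buffers)

single_pool = 0.70 * TOTAL_RAM_GB # one shared pool at ~70% of total RAM
print(f"one instance : ~{single_pool:.0f} GB of shared buffer pool")

instances, per_instance_pool = 10, 4  # ten 6 GB instances, ~4 GB pool apiece
print(f"ten instances: {instances * per_instance_pool} GB of pool in total, "
      f"but hard-capped at {per_instance_pool} GB per instance, "
      f"plus ~{instances * PER_INSTANCE_OVERHEAD_GB} GB of duplicated overhead")
```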
150 tables? Could it be that there is some "over-normalization"?
I am actually a bit surprised that you are worried about 150 tables where the largest tables have 100k or fewer rows.
That certainly does not look like much of a problem for the modern hardware you are planning to use. I will, however, suggest allocating more RAM and using a large enough InnoDB buffer pool size; 70% of your total RAM is the number suggested by MariaDB.
I have seen clients running MariaDB with much larger data sets and a lot more tables + very high concurrency!
Anyway, the best way to proceed is, of course, to test during your UAT cycles with data loads similar to production. I am still sure it's not going to be a problem.
If you can provide a master/slave setup and use MariaDB MaxScale on top, MaxScale can give you automatic database server failover, connection failover, and transaction replay, which provides a very high level of availability; it also takes care of load balancing with its cool "Read/Write" splitting service. Your servers' load will be balanced, for a smooth experience overall :)
Not sure if the above is something you have planned for in your design, but I'm putting it here for a second look, as your application would certainly benefit.
Hope this helps.
Cheers,
Faisal.
We work for a small company that cannot afford to pay for a SQL DBA or for consulting.
What started as a small project has now become a full scale system with a lot of data.
I need someone to help me sort out performance improvements. I realise no-one will be able to help directly and nail this issue completely, but I just want to make sure I have covered all the bases.
OK, the problem is basically that we are experiencing time-outs on queries against our cached data. I have increased the time-out in the C# code, but I can only go so far before the value becomes ridiculous.
The current setup is a database that has data inserted every 5-10 seconds, constantly! During this process we populate tables from CSV files. Overnight we run data-caching processes that reduce the load on the "inserted" tables. Originally we were able to condense 10+ million rows into, say, 400,000 rows, but as users want more filtering we had to include more data, which of course has grown the cached tables from 400,000 to 1-3 million rows. A rough sketch of what the overnight caching does is shown below.
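To make it concrete, here's roughly what the overnight job boils down to (the file and column names are made up for illustration):

```python
import pandas as pd

# Toy stand-in for the overnight caching job: collapse millions of raw
# rows into one summary row per (device, day, metric) combination.
raw = pd.read_csv("readings.csv", parse_dates=["recorded_at"])  # hypothetical file

summary = (
    raw.assign(day=raw["recorded_at"].dt.date)
       .groupby(["device_id", "day", "metric"], as_index=False)
       .agg(total=("value", "sum"), samples=("value", "count"))
)

# User-facing queries hit the summary table, not the raw inserted rows.
summary.to_csv("readings_summary.csv", index=False)
```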
On my SQL Server development box (which does not have data inserted every 5 seconds) it used to take 30 seconds to run queries against a cached table with 5 million rows; with indexing and some improvements it's now 17 seconds. The live server runs SQL Server Standard and used to take 57 seconds, now 40.
We have 15+ instances running, with the same number of databases.
So far we have outlined the following ways of improving the system:
Indexing on some of the cached tables - the database is now bloated and the overnight processes have slowed down.
Increased CommandTimeout
Moved databases to SSD
Likely further improvements:
Realised we will have to move the CSV files to another disk, rather than the same SSD the SQL Server databases reside on.
Possibly use filegroups for the indexes or cached tables - not sure if SQL Server Standard will cover this.
Enterprise edition and table partitioning - the customer may pay for this, but we certainly can't afford it.
As I said, I'm looking for rough guidelines and realise no-one may be able to fix this issue completely. We're a small team and no-one has deep SQL Server experience. The customer wants answers and we've tried everything we know. Incidentally, they had a small-scale version in Excel and said they found no issues, so why are we seeing them?!
Hope someone can help.
==> My System:
Processor – Xeon, 8 cores @ 3.8 GHz
RAM – 20 GB
Storage – 11-HDD SAN, RAID 5, I/O rate 260 MB/s
Network – Cisco 1 Gbps intranet
Front end – C#.NET desktop application; PL/SQL Developer (for writing PL/SQL)
==> Database:
Oracle 11g (11.2.0.3) Standard Edition on Windows Server 2008 R2 64-bit (OLTP use)
More than 60 tables.
Most tables have over 8 million records.
Reports are generated whose output is approx. 5 million records, which is sent to the front end.
==> My Problem:
Fetching and processing data with complex queries performs well enough.
But when a procedure or query whose output runs to millions of rows is executed, and the results are sent to the front end or to PL/SQL Developer (to test the procedure/query), my problem starts:
The Oracle database has already finished processing the data (no high CPU or disk usage on the server side).
Network usage shows a transfer rate of only 2-3 MB per second.
Rows arrive at the client slowly, around 800 records per second.
With millions of rows to transfer, it takes far too long for all the data to reach the front end.
So management is not happy; the report on the front end takes minutes to display.
How can I improve this? I need the data to reach the client faster.
In any report, the user won't (and will never) look at all 5M rows, so what's the point of pulling that much data?
Do all the aggregation/header/footer calculations in the database itself and bring only up to a few hundred rows back to the UI/app.
The design needs to be worked out properly. Neither the DB, the app, nor the network is the issue; they are all well and good!
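A minimal sketch of that approach in Python with the cx_Oracle driver (connection details, the transactions table, and its columns are placeholders; the aggregate stands in for whatever the report really computes). If you genuinely must pull bulk rows, raising cursor.arraysize batches more rows per network round trip, which directly attacks the ~800 rows/second symptom:

```python
import cx_Oracle

conn = cx_Oracle.connect("app_user", "secret", "dbhost/ORCL")  # placeholder credentials
cur = conn.cursor()

# Fetch more rows per network round trip; the default is modest, and a
# small value means many round trips when result sets are large.
cur.arraysize = 5000

# Better still: aggregate in the database and return hundreds of rows,
# not millions. 'transactions' and its columns are hypothetical.
cur.execute("""
    SELECT region,
           TRUNC(txn_date) AS day,
           COUNT(*)        AS txns,
           SUM(amount)     AS total
    FROM   transactions
    GROUP  BY region, TRUNC(txn_date)
""")
for region, day, txns, total in cur:
    print(region, day, txns, total)

cur.close()
conn.close()
```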
I need to build a deployment plan for a medium-sized application comprising many PostgreSQL databases (720). The schemas are almost all similar, but I have to keep them separate for manageability and performance. Each database has about 400,000 to 1,000,000 records, with both reads and writes.
I have two questions:
1. How can I calculate the number of databases each machine can handle (CentOS, 2.08 GHz CPU, 4 GB RAM)? Or, how many databases can I deploy on each machine? I guess concurrency will be about 10.
2. Is there any tutorial on how to calculate database size?
3. Can PostgreSQL run "Active - Standby"?
I don't think that your server can possibly handle such a load (if these are your true numbers).
A simple calculation: let's round your 720 databases up to 1,000.
Also, let's round your average row width of 7,288 bytes up to 10,000 (10 KB).
Assume that every database will hold 1 million rows.
Given all of this, the total size of the databases in bytes can be estimated as:
1,000 * 10,000 * 1,000,000 = 10,000,000,000,000 = 10 TB
In other words, you will need at least a few of the biggest hard drives money can buy (probably 4 TB each), and then you will need hardware or software RAID to get adequate reliability out of them.
Note that I did not account for indexes. Depending on the nature of your data and your queries, indexes can take anything from 10% to 100% of your data size. Full-text search indexes can take 5x more than the raw data.
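The same estimate as a quick script (a sketch; the inputs are the rounded figures above, and the index factors are the rough range just mentioned):

```python
# Back-of-envelope sizing for the proposed 720-database deployment.
databases = 1_000        # 720, rounded up
row_bytes = 10_000       # average row width, rounded up to 10 KB
rows_per_db = 1_000_000

raw_tb = databases * row_bytes * rows_per_db / 1e12
print(f"raw data: {raw_tb:.0f} TB")                    # 10 TB

for index_factor in (0.10, 1.00):                      # index overhead: 10% to 100%
    print(f"with {index_factor:.0%} index overhead: "
          f"{raw_tb * (1 + index_factor):.1f} TB")
```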
At any rate, your server with just 4 GB of RAM will be barely moving while trying to serve such a huge installation.
However, it should be able to serve not 1,000, but probably 10 or slightly more databases in your setup.
I'm looking for help deciding on which database system to use. (I've been googling and reading for the past few hours; it now seems worthwhile to ask for help from someone with firsthand knowledge.)
I need to log around 200 million rows (or more) per 8-hour workday to a database, then perform weekly/monthly/yearly summary queries on that data. The summary queries would be for collecting data for things like billing statements, e.g. "How many transactions of type A did each user run this month?" (they could be more complex, but that's the general idea).
I can spread the database amongst several machines, as necessary, but I don't think I can take old data offline. I'll definitely need to be able to query a month's worth of data, maybe a year. These queries would be for my own use, and wouldn't need to be generated in real-time for an end-user (they could run overnight, if needed).
Does anyone have any suggestions as to which databases would be a good fit?
P.S. Cassandra looks like it would have no problem handling the writes, but what about the huge monthly table scans? Is anyone familiar with Cassandra/Hadoop MapReduce performance?
I'm working on a very similar system at present (a web-domain-crawling database) with similarly significant transaction rates.
At these ingest rates, it is critical to get the storage layer right first. You're going to be looking at several machines connecting to the storage in a SAN cluster. A single database server can support millions of writes a day; what matters is the amount of CPU used per write and the speed at which the writes can be committed.
(Network performance is also often an early bottleneck.)
With clever partitioning, you can reduce the effort required to summarise the data. You don't say how up to date the summaries need to be, and this is critical. I would try to push back from "real-time" and suggest overnight (or, if you can get away with it, monthly) summary calculations.
Finally, we're using a 2-CPU, 4 GB RAM Windows 2003 virtual machine running SQL Server 2005 and a single-CPU, 1 GB RAM IIS web server as our test system, and we can ingest 20 million records in a 10-hour period (with storage on RAID 5 on a shared SAN). We get ingest rates of up to 160 records per second, batched in blocks of 40 records per network round trip.
Cassandra + Hadoop does sound like a good fit for you. 200M/8h is about 7,000/s, which a single Cassandra node could handle easily, and it sounds like your aggregation work would be simple to do with map/reduce (or the higher-level Pig).
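Checking that rate arithmetic, with a burst factor added purely as an assumption to show the headroom you might want:

```python
# Sanity-check the sustained write rate implied by the question.
rows_per_day = 200_000_000
workday_seconds = 8 * 3600

sustained = rows_per_day / workday_seconds
print(f"sustained: {sustained:,.0f} rows/s")              # ~6,944, i.e. ~7,000/s

# Assume a 3x peak-to-average burst when sizing capacity (illustrative only).
print(f"peak (3x burst assumption): {sustained * 3:,.0f} rows/s")
```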
Greenplum or Teradata would be a good option. These are MPP databases and can handle petabyte-scale data. Greenplum is a distributed PostgreSQL database and also has its own MapReduce. While Hadoop may solve your storage problem, it wouldn't be helpful for performing summary queries on your data.