Can TerminusDB run in a multi-server environment?
Either with a Sharded and Clustered database mode
Or can it run on Distributed database scheme?
Or it can only run as a single process and does not horizontally scale?
If it can scale horizontally, how?
You can run multiple TerminusDB instances that use the same store by using a distributed file system that supports at least optional locking. This is the case for NFS4 and many other distributed POSIX file systems. It may also work on Windows shares (SMB), but this is currently untested. Besides pointing each server at the correct storage directory (using the TERMINUSDB_SERVER_DB_PATH environment variable, or by starting the server in the appropriate directory to allow auto-discovery), no extra setup is needed to make this work.
TerminusDB does not support sharding. To use a database, each server instance needs to load the full database into memory, so there is nothing to gain from scaling horizontally if your goal is to reduce the memory footprint of individual instances.
Using multiple TerminusDB instances will help, though, if you want to increase query throughput, as you can simply round-robin requests across a server pool. This is especially beneficial for read-heavy workloads.
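As a rough illustration of the round-robin idea, here is a minimal Python sketch. The server URLs and the `RoundRobinPool` class are made up for the example; they are not part of TerminusDB or its client library, and a real deployment would more likely put a load balancer in front of the pool.

```python
from itertools import cycle

# Hypothetical pool: these URLs are placeholders, not real endpoints.
SERVER_POOL = [
    "http://terminus-1.example.internal:6363",
    "http://terminus-2.example.internal:6363",
    "http://terminus-3.example.internal:6363",
]


class RoundRobinPool:
    """Hands out servers from a fixed pool in rotating order."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("server pool must not be empty")
        # cycle() repeats the list forever, which gives the rotation.
        self._servers = cycle(list(servers))

    def next_server(self):
        """Return the server that should receive the next request."""
        return next(self._servers)


pool = RoundRobinPool(SERVER_POOL)
# Four requests against three servers: the fourth wraps around to the first.
picks = [pool.next_server() for _ in range(4)]
```

Each request simply asks the pool for the next server, so load spreads evenly as long as requests cost roughly the same.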
I know there have been many articles written about database replication. Trust me, I spent some time reading those articles, including this SO one that explains the pros and cons of replication. This SO article goes in depth about replication and clustering individually, but doesn't answer these simple questions that I have:
When do you replicate your database, and when do you cluster?
Can both be performed at the same time? If yes, what are the inspirations for each?
Thanks in advance.
MySQL currently supports two different solutions for creating a high availability environment and achieving multi-server scalability.
MySQL Replication
The first form is replication, which MySQL has supported since MySQL version 3.23. Replication in MySQL is currently implemented as an asynchronous master-slave setup that uses a logical log-shipping backend.
A master-slave setup means that one server is designated as the master and must receive all of the write queries. The master executes and logs the queries, and the log is then shipped to the slave, which executes the same queries to keep the data identical across all replication members.
Replication is asynchronous, which means that the slave server is not guaranteed to have the data when the master performs the change. Normally, replication will be as real-time as possible. However, there is no guarantee about the time required for the change to propagate to the slave.
Replication can be used for many reasons. Some of the more common ones include scalability, server failover, and backup solutions.
Scalability can be achieved because you can now run SELECT queries against any of the slaves. Write statements, however, generally do not improve, because every write has to occur on each replication member.
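To make the read/write split concrete, here is a minimal Python sketch of how an application might route statements. The function name and the server labels are invented for illustration; this is not a MySQL API, and classifying a statement by its first keyword is a deliberate simplification.

```python
import random


def pick_server(sql, master, replicas, rng=random):
    """Return the server that should run `sql`.

    Assumption for this sketch: only statements starting with SELECT
    are treated as reads; everything else goes to the master.
    """
    words = sql.split(None, 1)
    first_word = words[0].upper() if words else ""
    if first_word == "SELECT" and replicas:
        # Any slave can serve a read; pick one at random to spread load.
        return rng.choice(replicas)
    # Writes (and reads when no slave is configured) go to the master.
    return master
```

Because replication is asynchronous, a read sent to a slave may not yet see a change that was just written to the master, so reads that must see their own writes should still go to the master.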
Failover can be implemented fairly easily using an external monitoring utility that uses a heartbeat or similar mechanism to detect the failure of a master server. MySQL does not currently do automatic failover, as the logic is generally very application dependent. Keep in mind that because replication is asynchronous, it is possible that not all of the changes made on the master will have propagated to the slave.
MySQL replication works very well even across slower connections, and with connections that aren't continuous. It can also be used across different hardware and software platforms. It is possible to use replication with most storage engines, including MyISAM and InnoDB.
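The heartbeat check itself can be very small. Below is a Python sketch of the detection half only; the `HeartbeatMonitor` class and its timeout are hypothetical, and the promotion of a slave (the application-dependent part) is left out on purpose.

```python
import time


class HeartbeatMonitor:
    """Declares the master down when no heartbeat arrives within a timeout."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self._timeout = timeout_seconds
        self._clock = clock  # injectable so the logic can be tested
        # Start counting from construction so a fresh monitor is not "down".
        self._last_seen = clock()

    def record_heartbeat(self):
        """Call this whenever the master answers a ping."""
        self._last_seen = self._clock()

    def master_is_down(self):
        """True if the last heartbeat is older than the timeout."""
        return self._clock() - self._last_seen > self._timeout
```

A monitoring loop would ping the master periodically, call `record_heartbeat()` on success, and start its failover procedure once `master_is_down()` returns true.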
MySQL Cluster
MySQL Cluster is a shared nothing, distributed, partitioning system that uses synchronous replication in order to maintain high availability and performance.
MySQL Cluster is implemented through a separate storage engine called NDB Cluster. This storage engine will automatically partition data across a number of data nodes. The automatic partitioning of data allows for parallelization of queries that are executed. Both reads and writes can be scaled in this fashion since the writes can be distributed across many nodes.
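For intuition about automatic partitioning, here is a small Python sketch that assigns rows to partitions by hashing their key. This is only an illustration of the general idea under an assumed hash-modulo scheme; it is not NDB Cluster's actual partitioning algorithm, and the function names are made up.

```python
import zlib


def partition_for(key, partition_count):
    """Map a row key to a partition number with a stable hash."""
    # crc32 is deterministic across runs, unlike Python's built-in hash().
    return zlib.crc32(key.encode("utf-8")) % partition_count


def group_by_partition(keys, partition_count):
    """Group row keys by the partition that would store them."""
    groups = {p: [] for p in range(partition_count)}
    for key in keys:
        groups[partition_for(key, partition_count)].append(key)
    return groups
```

Since every key lands on exactly one partition, a query or write touching different keys can be handled by different nodes in parallel, which is where the read and write scaling comes from.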
Internally, MySQL Cluster also uses synchronous replication in order to remove any single point of failure from the system. Since two or more nodes are always guaranteed to have each data fragment, at least one node can fail without any impact on running transactions. Failure detection is handled automatically, with the dead node being removed transparently to the application. When the node restarts, it is automatically re-integrated into the cluster and begins handling requests as soon as possible.
There are a number of limitations that currently exist and have to be kept in mind while deciding if MySQL Cluster is the correct solution for your situation.
Currently all of the data and indexes stored in MySQL Cluster are stored in main memory across the cluster. This does restrict the size of the database based on the systems used in the cluster.
MySQL Cluster is designed to be used on an internal network as latency is very important for response time.
As a result, it is not possible to run a single cluster across a wide geographic distance. In addition, while MySQL Cluster will work over commodity network setups, in order to attain the highest performance possible special clustering interconnects can be used.
In a master-slave configuration, write operations are performed by the master and reads by the slave. All SQL requests first reach the master, which maintains a queue of requests, and a read operation is executed only after the write completes. A common problem with the master-slave configuration, which I have also witnessed, is that when the queue becomes too large for the master to maintain, the architecture collapses and the slave starts behaving like a master.
For clusters, I have worked with Cassandra, where a request reaches a node (table) and a commit hash is maintained that records the changes made on a node and updates the other nodes based on that commit hash. So here operations do not all depend on a single node.
We use master-slave when the write data is not large in size and count; otherwise we use clusters.
Clusters are expensive in space and master-slave in time, so your decision depends on what you want to save.
We can also use both at the same time; I have done this at my current company.
We moved the tables with the most write operations to Cassandra and wrote four APIs to perform the CRUD operations on those tables. Whenever an HTTP request comes in, it first hits our web server; the code running there decides which CRUD operation has to be performed and then calls the corresponding API to make the change in the Cassandra database.
My exposure to NoSQL or NewSQL/NeoSQL database servers is extremely limited, only theoretical. I've worked with traditional RDBMSs (like MySQL, Postgres) and directory-server (OpenLDAP), with and without replication.
My application stack is based on JBoss, and I've been tasked with setting up a minimal demo (with our application) that can demonstrate durability and high availability of data in VoltDB. Performance testing is not an objective at all.
I have been going through the VoltDB Planning Guide, but I am confused between the "+1" and the "x2" in terms of the number of servers (or VoltDB instances) required, especially given these two statements:
The easiest way to size hardware for a K-Safe cluster is to size the
initial instance of the database, based on projected throughput and
capacity, then multiply the number of servers by the number of
replicas you desire (that is, the K-Safety value plus one).
Rule of Thumb
When using K-Safety, configure the number of cluster nodes as a whole multiple of the number of copies of the database
(that is, K+1)
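Read literally, the two quoted statements describe a multiplication, not an addition. A small Python sketch of that arithmetic (the function name is made up for illustration and is not part of VoltDB):

```python
def ksafe_cluster_size(base_servers, k_safety):
    """Servers needed per the quoted sizing rule: base * (K + 1).

    base_servers: servers needed for capacity/throughput with no replicas.
    k_safety:     the K-Safety value, i.e. the number of extra copies.
    """
    if base_servers < 1:
        raise ValueError("base_servers must be at least 1")
    if k_safety < 0:
        raise ValueError("k_safety must not be negative")
    copies = k_safety + 1  # the "+1" counts the original copy
    return base_servers * copies


assert ksafe_cluster_size(1, 1) == 2   # one base server, K=1
assert ksafe_cluster_size(3, 1) == 6   # "x2" only coincides with K=1
assert ksafe_cluster_size(3, 2) == 9
```

So the "+1" applies to the number of copies and the multiplication applies to the server count; the two only look like "x2" when K is 1.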
Questions:
Now, let's say that I need 1 server given capacity/throughput
requirements. So, to be able to have durability and
high-availability, do I need: 2, 3 or 4 servers ?
OTOH, using just 1 server, what all key features of VoltDB would I
have to forgo ?
Is there any relationship (or conflict) between VoltDB's full
disk-persistence and snapshots ? Say, the availability of disk-persistence
removes the need for snapshots ?
If you use 2 servers, you can keep a synchronous replica of data to protect from data loss, much like a RAID1 hard drive. Your data is double-safe, but there is a catch with availability. With only two servers, it's impossible to differentiate a network split from a failed node. In some cases, VoltDB will shut down a live node when another fails to ensure there will be no split brain. With 3 nodes, this won't be an issue and the cluster will remain available after any single node failure (with k=1 or k=2).
With just 1 server, all you lose is the multiple copies of data on multiple servers and the high-availability features that allow VoltDB to continue running after a node failure. You still have all of the other VoltDB features, including full disk persistence.
(Apologies...I'm new to SQL Server...)
We have SQL Server 2012 Express installed as a default instance which is the data store for our scheduler (JAMS).
I have a small data entry project where SQL Server will be the back end. It will have a small number of users with minor traffic/data entry/edits throughout the day. The data volumes will be small: one table with 300K rows, 3 more with around 5K rows.
I'm wondering whether I should convert the default instance into two named instances to segregate the applications. Or perhaps it's not worth the bother? IIRC, named instances run separate services for each instance, so I could, say, restart the data entry instance without affecting the scheduler instance.
High availability isn't really needed, but if I adversely affect the scheduler it wouldn't be good.
Your thoughts? Thanks...
Named instances are required only when you want to run multiple instances. If you choose to run a single instance it doesn't really matter if it is named or default, although I would personally run as the default instance to avoid the need to specify the instance name.
The big advantage of a single instance is that it uses memory resources more efficiently (e.g. shared buffer pool memory). The downside, as you mentioned, is lack of isolation. Separate databases will help mitigate this concern but you will still be locked into the version supported by the vendor. For example, you can't use SQL 2014 until JAMS supports it. For this specific reason, I suggest you avoid sharing user and vendor databases on the same instance unless you are willing to be tied to the SQL Server versions supported by the vendor. Based on your description, I don't think resource utilization will be a concern so multiple instances is probably the best option if you want flexibility.
I have to create an Access 2003 database and share it among 100 users. The users won't be making any modifications, only viewing several reports that are generated once a day by a scheduled task on the host machine.
Would 100 simultaneous users degrade performance in that context?
What would you advise me regarding this workflow?
Exclude:
Using a database server (SQL Server, etc.) is off the table
I've already thought about outputting the reports into static html, but now I want to first evaluate the sharing of the whole database (because filtering capability might be needed)
I'd like to avoid replication
You have used the word "host". Remember, Access is not a true client-server engine: it merely provides access to the data; consumers pull the data down to their local machines, where their local Access runtime or local Access development version executes the query against the downloaded data. Entire "freight trains" of data can come down across the wire to the desktop.
Some years ago we had a large database that the customer wanted in Access (eventually moved it to Oracle). Some queries would eat up 90%-100% of available LAN bandwidth for 15-30 seconds, during which time other write operations to completely different databases on the LAN would time-out, and data corruption would result.
So the main concern of your scenario would be the effects of possibly severe degradation on other applications. It will depend on the size of your database and the nature of your queries behind the reports.
I'd recommend "canning" the reports if you can, so that each running of a report does not invoke the query that instantiates the data behind it.
EDIT: An alternative, if one is necessary, would be to have a web server running on the same machine as the Access "host" executing the queries, and serving the end-result reports out to the consumers' browsers as HTML. This would reduce bandwidth consumption. The LAN becomes "the cloud".
If you give each user their own copy of the front end, linked to the data source, then you might get away with 100 users if the network is up to scratch. I have about 100 users, mostly read-only, on an Access DB, but they are not all using it at the same time.
You can automate the front end installation using the excellent autoFE updater
www.autofeupdater.com/
I run a very high-traffic (10m impressions a day), high-revenue web site built with .NET. The core metadata is stored on a SQL Server. My team and I have a unique caching strategy that involves querying the database for new metadata at regular intervals from a middle-tier server, serializing the data to files, and sending those to the web nodes. The web application uses the data in these files (some are actually serialized objects) to instantiate objects and caches those in memory to use for real-time requests.
The advantage of this model is that it:
Allows the web nodes to cache all data in memory and not incur any IO overhead querying a database.
If the database ever goes down either unexpectedly or for maintenance windows, the web servers will continue to run and generate revenue. You can even fire up a web server without having to retrieve its initial data from the DB because all the data it needs are in files on its own disks.
Allows us to be completely horizontally scalable. If throughput suffers, we can just add a web server.
The disadvantage is that this caching and persistence layer adds complexity to the code that queries the database, packages the data, and unpackages it on the web server. Any time our domain model requires us to add entities, more of this "plumbing" has to be coded. This architecture has been in place for four years and there are probably better ways to tackle this.
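For readers who want a picture of the package/unpackage "plumbing" described above, here is a rough Python sketch. The real system is .NET and uses its own serialization; the JSON format, file layout, and function names here are assumptions made purely for illustration.

```python
import json
import os
import tempfile


def write_snapshot(path, records):
    """Middle tier: serialize metadata to a file that web nodes can load.

    Writes to a temporary file first and then renames it, so a web node
    never reads a half-written snapshot.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(records, handle)
        os.replace(tmp_path, path)  # atomic swap on the same file system
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_snapshot(path):
    """Web node: rebuild the in-memory cache from the local file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

A web node can call `load_snapshot` at start-up without touching the database, which is what lets it keep serving during a database outage. Every new entity type needs its own serialization handling, which is the maintenance cost mentioned above.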
One strategy I have been considering is using replication to copy our master SQL Server database to local database instances installed on each web server. The web server application would use normal SQL/ORM techniques to instantiate objects. Here, we can still sustain a master database outage, and we would not have to write specialized caching code; we could instead use NHibernate to handle the persistence.
This seems like a more elegant solution, and I would like to see what others think or if anyone else has any alternatives to suggest.
I think you're overthinking this. SQL Server already has mechanisms available to you to handle these kinds of things.
First, implement a SQL Server cluster to protect your main database. You can fail over from node to node in the cluster without losing data, and downtime is a matter of seconds, max.
Second, implement database mirroring to protect from a cluster failure. Depending on whether you use synchronous or asynchronous mirroring, your mirrored server will either be updated in realtime or a few minutes behind. If you do it in realtime, you can fail over to the mirror automatically inside your app - SQL Server 2005 & above support embedding the mirror server's name in the connection string, so you don't even have to lift a finger. The app just connects to whatever server's live.
Between these two things, you're protected from just about any main database failure short of a datacenter-wide power outage or network outage, and there's none of the complexity of the replication stuff. That covers your high availability issue, and lets you answer the scaling question separately.
My favorite starting point for scaling is using three separate connection strings in your application and choosing the right one based on the needs of your query:
Realtime - Points directly at the one master server. All writes go to this connection string, and only the most mission-critical reads go here.
Near-Realtime - Points at a load balanced pool of read-only SQL Servers that are getting updated by replication or log shipping. In your original design, these lived on the web servers, but that's dangerous practice and a maintenance nightmare. SQL Server needs a lot of memory (not to mention money for licensing) and you don't want to be tied into adding a database server for every single web server.
Delayed Reporting - In your environment right now, it's going to point to the same load-balanced pool of subscribers, but down the road you can use a technology like log shipping to have a pool of servers 8-24 hours behind. These scale out really well, but the data's far behind. It's great for reporting, search, long-term history, and other non-realtime needs.
If you design your app to use those 3 connection strings from the start, scaling is a lot easier, and doesn't involve any coding complexity - just pick the right connection string.
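To show how little code the three-connection-string approach needs, here is a Python sketch (the original context is .NET, so treat this as language-neutral pseudocode that happens to run). The server names and connection string contents are placeholders, and the `Freshness` enum is invented for the example.

```python
from enum import Enum


class Freshness(Enum):
    REALTIME = "realtime"
    NEAR_REALTIME = "near_realtime"
    DELAYED_REPORTING = "delayed_reporting"


# Placeholder values: hypothetical hosts, not real servers.
READ_POOL = "Server=read-pool.example.internal;Database=app"
CONNECTION_STRINGS = {
    Freshness.REALTIME: "Server=master.example.internal;Database=app",
    Freshness.NEAR_REALTIME: READ_POOL,
    # Starts out on the same pool; can later point at delayed servers.
    Freshness.DELAYED_REPORTING: READ_POOL,
}


def connection_string_for(is_write, freshness=Freshness.NEAR_REALTIME):
    """Pick the connection string for a query.

    Writes always go to the master; reads go wherever their
    freshness requirement allows.
    """
    if is_write:
        return CONNECTION_STRINGS[Freshness.REALTIME]
    return CONNECTION_STRINGS[freshness]
```

Moving reporting queries onto a delayed pool later is then a one-line configuration change rather than a code change.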
Have you considered memcached? Since it is:
in memory
can run locally
fully scalable horizontally
prevents the need to re-cache on each web server
It may fit the bill. Check out Google for lots of details and usage stories.
Just an addition to what RickNZ proposed above.
Since the master data you are currently caching won't change very frequently, and probably only during a maintenance window, here is what you should do first on the database side:
Create a SNAPSHOT replication for the master tables which you want to cache. Adding new entities will be equally easy.
On all the web servers, install SQL Express and subscribe to this publication.
Since this data does not change frequently, you can rest assured that there will not be much server resource usage beyond the network trips for the master data.
All the caching that was available via the previous mechanism is still available, minus the headache that comes when you add new entities.
Next, you can leverage .NET mechanisms as suggested above. You won't face a memcached cluster failure unless your web server itself goes down. There is a lot available in .NET which a .NET pro can point out after this stage.
It seems to me that Windows Server AppFabric (AKA "Velocity") is exactly what you are looking for. From the introductory documentation:
Windows Server AppFabric provides a
distributed in-memory application
cache platform for developing
scalable, available, and
high-performance applications.
AppFabric fuses memory across multiple
computers to give a single unified
cache view to applications.
Applications can store any
serializable CLR object without
worrying about where the object gets
stored. Scalability can be achieved by
simply adding more computers on
demand. The cache also allows for
copies of data to be stored across the
cluster, thus protecting data against
failures. It runs as a service
accessed over the network. In
addition, Windows Server AppFabric
provides seamless integration with
ASP.NET that enables ASP.NET session
objects to be stored in the
distributed cache without having to
write to databases. This increases
both the performance and scalability
of ASP.NET applications.
Have you considered using SqlDependency caching?
You could also write the data to the local disk at the web tier, if you're concerned about initial start-up time or DB outages. But at least with a SqlDependency, you shouldn't have to poll the DB to look for changes. It can also be made relatively transparent.
In my experience, adding a DB instance on web servers generally doesn't work out too well from a scalability or performance perspective.
If you're concerned about performance and scalability, you might consider partitioning your data tier. The specifics depend on your app, but as an example, you could move read-only data onto a couple of SQL Express servers that are populated with replication.
In case it helps, I talk about this subject at length in my book (Ultra-Fast ASP.NET).