Why the replication in Memgraph? - graph-databases

What’s the major reason for Memgraph to support replication if it’s Cloud ready? What are the benefits? I've also seen that both Azure and AWS support replication as well

So I found out replication is a general concept, but the implementation is quite specific for a given system.
Memgraph implements a custom high-performance storage engine. A custom implementation of replication on top of the mentioned storage engine is required to properly support replicating data (to be correct, performant, etc.).
Of course, it’s possible to use Cloud services to backup data from Memgraph in various ways, but that has limitations when it comes to backup all data and the system’s availability.

Related

Kubernetes and Cloud Databases

Could someone explain the benefits/issues with hosting a database in Kubernetes via a persistent volume claim combined with a storage volume over using an actual cloud database resource?
It's essentially a trade-off: convenience vs control. Take a concrete example: let's say you pay Amazon money to use Athena, which is really just a nicely packaged version of Facebook Presto which AWS kindly operates for you in exchange for $$$. You could run Presto on EKS yourself, but why would you.
Now, let's say you want to or need to use Apache Drill or Apache Impala. Amazon doesn't offer it. Nor does any of the other big public cloud providers at time of writing, as far as I know.
Another thought: what if you want to migrate off of AWS? Your data has gravity as well.
Could someone explain the benefits/issues with hosting a database in Kubernetes ... over using an actual cloud database resource?
As previous excellent answer noted:
It's essentially a trade-off: convenience vs control
In addition to previous example (Athena), take a look at RDS as well and see what you would need to handle yourself (why would you, as said already):
Automatic backups
Multizone deployments
Snapshots
Engine upgrades
Read replicas
and other bells and whistles that come with managed service opposed to self-hosted/managed one.
But there is more to it than just convenience/control that this post I trying to shed light onto:
Kubernetes is adding another layer of abstraction there (pods, services...), and depending on way of handling storage (persistent volumes) you can have two additional considerations:
Access speed (depending on your use case this can be negligent or show stopper).
Storage that you have at hand might not be optimized for relational database type of I/O (or restrict you to schedule pods efficiently). The very same reasons you are not advised to run db on NFS for example.
There are several recent conference talks on kubernetes pointing out that database is big no-no for kubernetes (although this is highly opinionated, we do run average load mysql and postgresql databases in k8s), and large load/fast I/O is somewhat challenge to get right on k8s as opposed to somebody already fine tuned everything for you in managed cloud solution.
In conclusion:
It is all about convenience, controls and capabilities.

NServiceBus & ServiceInsight Sql Server Transport & Persistence

The application we have been building is starting to solidify in that the majority of the functionality is now in place. This has given us some breathing room and we are starting evaluate our persistence model and the management of it. I guess you could say the big elephant in the room is RavenDB. While we functionally have not experienced any issues with it yet, we are not comfortable with managing it. Simple tasks such as executing a query, truncating a collection, etc, are challenging for us as we are new to the platform and document based NoSql solutions in general. Of course we are capable of learning it, but I think it comes down to confidence, time, and leveraging our existing Sql Server skill sets. For example, we pumped millions of events through the system over the course of a few weeks and the successfully processes message were routed to our Audit queue in MSMQ. We also had ServiceInsight installed and it processed the messages in the Audit queue, which chewed up all the disk space on the server. We did not know how to fix this and literally had to delete the Data file that we found for RavenDB. Let me just say, doing that caused all kinds of headaches.
So with that in mind, I have been charged with evaluating the feasibility and benefits of potentially leveraging Sql Server for the Transport and/or Persistence for our Service Endpoints. In addition, I could use some guidance as well for configuring ServiceControl and ServiceInsight to leverage Sql Server. Any information you might be able to provide regarding configuring these and identifying any draw backs or architectural issues that we should consider would be greatly appreciated.
Thank you, Jeffrey
Using SQL persistence requires very little configuration (implementation detail), however, using SQL transport is more of an architectural decision then an infrastructure one as you are changing to a broker style architecture, that has implications you need to consider before going down that route.
ServiceControl and ServiceInsight persistance:
Although the ServiceControl monitors MSMQ as the default transport, you can use ServiceControl to support other transports such as RabbitMQ, SqlServer as well, Here you can find the details of how to do that
At the moment ServiceControl relies on RavenDb for it's persistence and it is not possible to change that to SQL as ServiceControl relies on Raven features.(AFIK)
There is an open issue for expiring data in ServiceControl's data, see this issue in github
HTH
Regarding ServiceControl usage of RavenDB (this is the underlying service that serves the data to ServiceInsight UI):
As Sean Farmar mentioned (above), in the post-beta releases we will be including message expiration, and on-demand audited message deletion commands so that you can have full control of the capacity utilization of SC.
You can also change the drive/path of the ServiceControl database location to allow it to use a larger drive.
Note that ServiceControl (and ServiceInsight / ServicePulse that use it) is intended for analysis, debugging and operational monitoring. Its intended to store a limited amount of audited data (based on your throughput and capacity needs, this may vary significantly when counted as number of messages, but the database storage capacity can be up to 16TB).
If you need a long term storage for audited data, you can hook into ServiceControl HTTP API and transfer the messages' data into various long-term / unlimited-size / low-cost storage solutions (e.g. http://aws.amazon.com/glacier).
Please let us know if this answers your needs and whether you have additional questions
Danny.

GAE datastore settings for non-database applications

I have an application on GAE which does not use any database features, nor do I plan to do so in the future. In Application Settings it says Master/Slave Replication Deprecated!.
My question: Should I still migrate my application to High Replication Datastore for some reason or doesn’t it make any difference?
One thought: What if the database server connected to my application is down? Will my application be stopped too, if it’s not HRD? There might be other issues/considerations, too.
Another thought: It sounds like the SLA is only available for HRD-applications.
Yes, you should definitely upgrade anyway.
Not only will you get the SLA that's only for HRD, you will also have your app served from the newer datacenters. Plus, even if you aren't directly using the datastore, if you are using sessions or task queues, those services use datastore internally to store state for you.
Also, I noticed an improvement on memcache performance after upgrading... So, yes. Do the upgrade, the M/S datastore is deprecated because the HRD infrastructure is built on newer, better technology.

GAE DataStore vs Google Cloud SQL for Enterprise Management Systems

I am building an application that is an enterprise management system using gae. I have built several applications using gae and the datastore, but never one that will require a high volume of users entering transactions along with the need for administrative and management reporting. My biggest fear is that when I need to create cross-tab and other detailed reports (or business intelligence reporting and data manipulation) I will be facing a mountain of problems with gae's datastore querying and data pull limits. Is it really just architectural preference or are there quantitative concerns here?
In the past I have built systems using C++/c#/Java against an Oracle/MySql/MSSql (with a caching layer sprinkled in for some added performance on complex or frequently accessed db results).
I keep reading that we are to throw away the old mentality of relational data and move to the new world of the big McHashTable in the sky... but new isnt always better... Any insight or experience on the above would be helpful.
From the Cloud SQL FAQ:
Should I use Google Cloud SQL or the App Engine Datastore?
This depends on the requirements of the application. Datastore provides NoSQL key-value > storage that is highly scalable, but does not support the complex queries offered by a SQL database. Cloud SQL supports complex queries and ACID transactions, but this means the database acts as a ‘fixed pipe’ and performance is less scalable. Many applications use both types of storage.
If you need a lot of writes (~XXX per/s) to db entity w/ distributed keys, that's where the Google App Engine datastore really shine.
If you need support for complex and random user crafted queries, that's where Google Cloud SQL is more convenient.
What is scare me more in GAE datastore is index number limitation. For example if you need search by some field or sorting - you need +1 index. Totally you can have 200 indexes. If you have entity with 10 searchable fields and you can sort by any field - there will be about 100 combunations. So you need 100 indexes. I have developed few small projects for gae - and this is success stories. But when big one come - this is not for gae.
About cache - you can do it with gae, but they distributed cache works very slow. I prefer to create private single instance of permanent backend with RESTfull API that holds cached values in memory. Frontend instances call this API to get/set values.
Maybe it is posible to build complex system with gae, but this will be a set of small applications/services.

To CouchDB or not to?

Note: (I have investigated CouchDB for sometime and need some actual experiences).
I have an Oracle database for a fleet tracking service and some status here are:
100 GB db
Huge insertion/sec (our received messages)
Reliable replication (via Oracle streams on 4 servers)
Heavy complex queries.
Now the question: Can CouchDB be used in this case?
Note: Why I thought of CouchDB?
I have read about it's ability to scale horizontally very well. That's very important in our case.
Since it's schema free we can handle changes more properly since we have a lot of changes in different tables and stored procedures.
Thanks
Edit I:
I need transactions too. But I can tolerate other solutions too. And If there is a little delay in replication, that would be no problem IF it is guaranteed.
You are enjoying the following features with your database:
Using it in production
The data is naturally relational (related to itself)
Huge insertion rate (no MVCC concerns)
Complex queries
Transactions
These are all reasons not to switch to CouchDB.
Of course, the story is not so simple. I think you have discovered what many people never learn: complex problems require complex solutions. We cannot simply replace our database and take the rest of the month off. Sure, CouchDB (and BigCouch) supports excellent horizontal scaling (and cross-datacenter replication too!) but the cost will be rewriting a production application. That is not right.
So, where can CouchDB benefit you?
I suggest that you begin augmenting your application with CouchDB applications. Deploy CouchDB, import your data into it, and build non mission-critical applications. See where it fits best.
For your project, these are the key CouchDB strengths:
It is a small, simple tool—easy for you to set up on a workstation or server
It is a web server. It integrates very well with your infrastructure and security policies.
For example, if you have a flexible policy, just set it up on your LAN
If you have a strict network and firewall policy, you can set it up behind a VPN, or with your SSL certificates
With that step done, it is very easy to access now. Just make http or http requests. Whether you are importing data from Oracle with a custom tool, or using your web browser, it's all the same.
Yes! CouchDB is an app server too! It has a built-in administrative app, to explore data, change the config, etc. (like a built-in phpmyadmin). But for you, the value will be building admin applications and reports as simple, traditional HTML/Javascript/CSS applications. You can get as fancy or as simple as you like.
As your project grows and becomes valuable, you are in a great position to grow, using replication
Either expand the core with larger CouchDB clusters
Or, replicate your data and applications into different data centers, or onto individual workstations, or mobile phones, etc. (The strategy will be more obvious when the time comes.)
CouchDB gives you a simple web server and web site. It gives you a built-in web services API to your data. It makes it easy to build web apps. Therefore, CouchDB seems ideal for extending your core application, not replacing it.
I don't agree with this answer..
I think CouchDB suits especially well fleet tracking use case, due to their distributed nature. Moreover, the unreliable nature of gprs connections used for transmitting position data, makes the offline-first paradygm of couchapps the perfect partner for your application.
For uploading data from truck, Insertion-rate can take a huge advantage from couchdb replication and bulk inserts, especially if performed on ssd-based couchdb hosting.
For downloading data to truck, couchdb provides filtered replication, allowing each truck to download only the data it really needs, instead of the whole database.
Regarding complex queries, NoSQL database are more flexible and can perform much faster than relation databases.. It's only a matter of structuring and querying your data reasonably.

Resources