[Some background information - possibly skippable]
To begin with, I have barely any understanding of database management and only shallow experience with Mongoose and Node on the backend (a couple of Udemy courses). Those courses led me to believe MongoDB was still a viable choice for a database with relational properties, and off I went building a backend for a forum-like website. After learning about transactions, I attempted to implement them in my backend, since a rollback feature seemed absolutely necessary when executing an array of queries. However, it turns out that transactions are only possible on replica sets, which also appeared to require a minimum of two additional members. Three databases for a startup MVP was obviously considered overkill.
[The question]
Is it possible to implement transactions with only one database? If so, how?
If that is not possible, how would one launch a MongoDB deployment with minimal configuration that still supports transactions, keeping in mind that the database backs a startup MVP?
(For anyone who has implemented a production-level Mongo database in a similar scenario) If forgoing transactions were considered, how does one safely send queries that edit/create multiple documents without transactions, and without spraying every bit of code with try/catch blocks full of rollback queries for every point of failure (which I consider way too much overhead)?
I have a tight deadline and have already done a substantial bit of groundwork, including a couple of routes using Mongoose, which means ditching MongoDB for a relational database would be difficult at this point.
I think I've googled everything related to the subject matter, and I've even tried blogs/articles on the second page of a Google search (which many, including myself, consider the dark web /s).
Yet I may well have missed what I was looking for, so an answer consisting of just links is also welcome.
Thank you for reading!
You need a replica set[*] to use transactions, but you can create a single-node replica set for testing purposes.
The full procedure is described in the documentation; for a single-node RS you follow it as written but configure only a single member.
Briefly, you pass the --replSet argument to mongod, then connect to it through the mongo shell and run rs.initiate().
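As a minimal sketch (the set name rs0 and the dbpath are placeholders), the whole procedure is roughly:

    # start mongod as a replica set with a single member
    mongod --replSet rs0 --dbpath /data/db

    # in another terminal: initiate the set with its default
    # one-member configuration
    mongo --eval 'rs.initiate()'

After rs.initiate() returns, the node elects itself primary and transactions become available.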
Note that transactions aren't a magic solution to all problems. There are scenarios where they are appropriate and scenarios where MongoDB provides other functionality that would be a better fit.
[*] or a sharded cluster with MongoDB 4.2+, but this involves more setup work.
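Once the single-member set is initiated, the rollback behavior the question asks about falls out of the session API. A minimal sketch with Mongoose (the User/Post models and field names here are made up for illustration, and a connection to the replica set is assumed):

    const mongoose = require('mongoose');

    // Hypothetical models, just for the example
    const User = mongoose.model('User', new mongoose.Schema({ postCount: Number }));
    const Post = mongoose.model('Post', new mongoose.Schema({
      author: mongoose.Schema.Types.ObjectId,
      title: String,
    }));

    async function createPost(authorId, title) {
      const session = await mongoose.startSession();
      try {
        // withTransaction commits when the callback resolves and aborts
        // the whole transaction (undoing every write below) when it
        // throws, so no per-query try/catch rollback code is needed
        await session.withTransaction(async () => {
          await Post.create([{ author: authorId, title }], { session });
          await User.updateOne(
            { _id: authorId },
            { $inc: { postCount: 1 } },
            { session }
          );
        });
      } finally {
        session.endSession();
      }
    }

Every query that should be part of the transaction has to be given the session explicitly; anything issued without it runs outside the transaction.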
Related
Why would I want multiple replicas of my DB?
Redundancy: I have more than one replica of my app code so that, when run behind a load balancer, another node can fill in if one fails.
Load: a load balancer can distribute traffic to multiple instances of the app.
A/B testing: I can have one node serve one version of the app, and another serve a different one.
Maintenance: I can bring one instance down for maintenance and keep the other one up, with zero downtime.
So, I assume I'd want to do the same with the backing DB if possible.
I realize that many NoSQL DBs are better suited to running multiple instances, but I am interested in relational DBs.
I've played with operators like this and this, but I found problems with the docs, haven't been able to get them up and running, and found the community a bit lacking. Relying on this kind of thing in production makes me nervous. The MySQL operator even carries a note saying it's not for production use.
I see that native k8s StatefulSets support scaling, but those docs aren't specific to DBs at all. I assume the complication is that DBs need to write persistently to disk via a volume, and that data has to be synced and routed somehow if you have more than one instance.
So, is this something that's non-trivial to do myself? Or am I better off having a dev environment that uses a one-replica DB image in the cluster to save on billing, and a prod environment that uses a fully managed DB, something like this, which takes care of the scaling/HA for me? Then I'd use kustomize to manage the YAML variances.
Edit:
I actually found a Postgres operator that worked great. Followed the docs one time through and it all worked, and it's from the Postgres docs.
I have created this community wiki answer to summarize the topic and to make pertinent information more visible.
As Turing85 rightly mentioned in the comments:
Do NOT share a pvc to multiple db instances. Even if you use the right backing volume (it must be an object-based storage in order to be read-write many), with enough scaling, performance will take a hit (after all, everything goes to one file system, this will stress the FS). The proper way would be to configure clustering. All major relational databases (mssql, mysql, postgres, oracle, ...) do support clustering. To be on the secure side, however, I would recommend to buy a scalable database "as a service" unless you know exactly what you are doing.
A good solution might be to use a single-replica StatefulSet for development, to keep billing down, and a fully managed cloud-based SQL solution in prod, unless you have the knowledge, or a sufficiently professional operator, to deploy a clustered DBMS.
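For the dev side, such a single-replica setup is just an ordinary StatefulSet. A minimal sketch, with placeholder names, a stock postgres image, and a small volume claim (not a production configuration):

    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: dev-postgres
    spec:
      serviceName: dev-postgres
      replicas: 1            # one replica: fine for dev, no HA
      selector:
        matchLabels:
          app: dev-postgres
      template:
        metadata:
          labels:
            app: dev-postgres
        spec:
          containers:
            - name: postgres
              image: postgres:15
              env:
                - name: POSTGRES_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: dev-postgres   # hypothetical secret
                      key: password
              volumeMounts:
                - name: data
                  mountPath: /var/lib/postgresql/data
      volumeClaimTemplates:
        - metadata:
            name: data
          spec:
            accessModes: ["ReadWriteOnce"]
            resources:
              requests:
                storage: 1Gi

In prod, the kustomize overlay would swap this out for the connection details of the managed database rather than scaling it up.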
Another solution may be to use a different operator, as Aaron did:
I actually found a Postgres operator that worked great. Followed the docs one time through and it all worked, and it's from Postgres: https://www.kubegres.io/doc/getting-started.html
See also this similar question.
So I'm designing this blog engine, and I'm trying to keep just my blog data, without considering comments, a membership system, or any other type of multi-user data.
The blog itself revolves around two types of data. The first is the actual blog post entry, which consists of: title, post body, and metadata (mostly dates and statistics), so it's really simple and can be represented by a simple JSON object. The second type is the blog admin configuration and personal information. Comments and the like will be implemented using Disqus.
My main concern here is the ability of such an engine to scale with spikes in visits (I know you might argue this, but let's take it for granted). Since starting this project I've been moving along well with the rest of my stack, except the data layer. I've been stuck on this dilemma of choosing the database: I considered MongoDB, but some reviews and benchmarking articles suggested slow reads once collections reach a certain size. Next I looked at Redis and its persistence features, RDB and AOF; while Redis is good at both fast reading and writing, I'm wary of using it because I'm not familiar with it. And the whole search keeps going on to things like "PostgreSQL 9.4 is now faster than MongoDB for storing JSON documents", etc.
So is there any way I can settle this issue for good, considering that I only need to represent my data in a key/value structure, only require fast reads (not writes), and need fault tolerance?
Thank you
If I were you, I would start small and not try to optimize for big data just yet. A lot of the blog posts about the downsides of a NoSQL solution concern large data sets, or people trying to do relational things with a database designed for denormalized data.
My list of databases to consider:
Mongo. It has huge community support and based on recent funding - it's going to be around for a while. It runs very well on a single instance and a basic replica set. It's easy to set up and free, so it's worth spending a day or two running your own tests to settle the issue once and for all. Don't trust a blog.
Couchbase. Supports key/value storage and also has persistence to disk. http://www.couchbase.com/couchbase-server/features Also has had some recent funding so hopefully that means stability. =)
CouchDB/PouchDB. You can use PouchDB purely on the client side and it can connect to a server side CouchDB. CouchDB might not have the same momentum as Mongo or Couchbase, but it's an actively supported product and does key/value with persistence to disk.
Riak. http://basho.com/riak/. Another NoSQL that scales and is a key/value store.
You can install and run a proof-of-concept on all of the above products in a few hours. I would recommend this for the following reasons:
A given database might scale and meet your requirements, yet still be unpleasant to use. Consider picking a database that feels fun! Sort of akin to picking Ruby/Python over Java because the syntax is nicer.
Your use case and domain will be fairly unique. Worth testing various products to see what fits best.
Each database has quirks and you won't find those until you actually try one. One might have quirks that are passable, one will have quirks that are a show stopper.
The benefit of trying all of them is that they all support schemaless data, so if you write JSON, you can use all of them! No need to create objects in your code for each database.
If you abstract the database correctly in code, swapping out data stores won't be that painful. In other words, your code will be happier if you make it easy to swap out data stores (see the sketch below).
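As an illustration of that last point, here is a minimal sketch in JavaScript (all names invented): the app codes against a tiny store interface, and each candidate database gets its own implementation behind it.

    // The only surface the rest of the app sees; swap implementations
    // without touching call sites.
    class PostStore {
      async get(id) { throw new Error('not implemented'); }
      async save(id, doc) { throw new Error('not implemented'); }
    }

    // An in-memory stand-in, handy for tests; a MongoPostStore,
    // CouchbasePostStore, etc. would implement the same two methods.
    class MemoryPostStore extends PostStore {
      constructor() {
        super();
        this.docs = new Map();
      }
      async get(id) {
        return this.docs.get(id) ?? null;
      }
      async save(id, doc) {
        this.docs.set(id, doc);
      }
    }

    module.exports = { PostStore, MemoryPostStore };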
This is only an option for really simple CMSes, but it sounds like that's what you're building.
If your blog is super-simple as you describe and your main concern is very high traffic then the best option might be to avoid a database entirely and have your CMS generate static files instead. By doing this, you eliminate all your database concerns completely.
It's not the best option if you're doing anything dynamic or complex, but in this small use case it might fit the bill.
This is quite a common question, and yet the answers are unclear to me. I have two different databases on two different servers: one is a pure XML database and the other a traditional DBMS (SQL Server). Can anybody point me to recent articles, or share their experience, on dealing with transaction management across them? I have put together a 1PC strategy which works fine for runtime exceptions, but I am not sure it is bullet-proof. Secondly, using a Spring JUnit test, how does one specify a default rollback? It only rolls back the first transaction manager's transactions; the other transactions are persisted to the other database.
Sounds like you want to use a ChainedTransactionManager.
Spring has implemented one of these for Neo4j, so you can take the code out of that project.
There was a good article on how to do this, but I can't find it anymore. Perhaps this is enough to get you started.
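For what it's worth, newer versions of spring-data-commons ship this class as org.springframework.data.transaction.ChainedTransactionManager, so you may not need to copy the code at all. A minimal sketch (the two injected manager beans are hypothetical placeholders for whatever you already configured for the XML database and SQL Server):

    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.data.transaction.ChainedTransactionManager;
    import org.springframework.transaction.PlatformTransactionManager;

    @Configuration
    public class TxConfig {

        // xmlTxManager and sqlTxManager are hypothetical bean names
        @Bean
        public PlatformTransactionManager transactionManager(
                PlatformTransactionManager xmlTxManager,
                PlatformTransactionManager sqlTxManager) {
            // Transactions start in the given order and commit in
            // reverse order, so put the manager most likely to fail
            // last: its commit runs first, and a failure there still
            // rolls back the other one.
            return new ChainedTransactionManager(sqlTxManager, xmlTxManager);
        }
    }

Note this is still best-efforts, not XA: a crash between the two commits can leave the databases out of sync, it only narrows the window.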
I am developing an analytics tool, similar to Google Analytics, that will store keywords, visits, and pages in a database.
So the database can grow very quickly, because I want many people using it.
How should I set up the database? One database for all the accounts and all the websites being monitored, or one database per account?
Also, I am planning to start with one dedicated server, but I'm sure that I will need more than one server in the future, so I have to build with that in mind.
I also know that if I create multiple databases, one per account, I will have to run upgrade scripts on all of them whenever the app's schema changes.
What kind of database do you plan to use? There is a BIG difference between relational (PostgreSQL, MySQL) and "NoSQL" (MongoDB, CouchDB) databases.
I'm only going to talk about PostgreSQL on the relational side since it's the only database I have experience with.
First, I would keep everything in one database. There's no benefit in using a database per account.
Second, you should be absolutely sure you WILL outgrow a single machine. Given the kind of application, you'll be dealing with a lot more writes than reads, so master-slave replication will only buy you high availability, and multi-master replication with PostgreSQL is NOT easy.
From my last round of research, the least painful way to do that was a tool like Postgres-XC, which is designed to be write-scalable, but I have no idea how production-ready it is.
Another solution is to use tools like Bucardo or SkyTools. I have no experience with SkyTools, but I had a lot of trouble getting Bucardo to work last year.
The last solution is sharding. The naive way to shard is something like shard number = id % 10; however, with this scheme you would need to rebalance your cluster whenever you add or remove a shard.
It also requires that you write your application to be "shard-aware", so that you direct each query to the correct shard (see the sketch below).
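To make the rebalancing point concrete, here is a toy sketch in Python (shard counts and ids invented):

    NUM_SHARDS = 10

    def shard_for(row_id: int, num_shards: int = NUM_SHARDS) -> int:
        # Naive modulo routing: the application must send each
        # query to the shard this function returns.
        return row_id % num_shards

    # Why resizing forces a rebalance: the same id maps to a
    # different shard as soon as the shard count changes.
    assert shard_for(12, num_shards=10) == 2
    assert shard_for(12, num_shards=11) == 1

Schemes like consistent hashing reduce how much data moves on a resize, but the application still has to be shard-aware.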
Anyway, like I said before, make sure you will NEED to shard/clusterize first.
Now for the "NoSQL" side: I have no experience with any of those solutions, but I do know that MongoDB and CouchDB handle sharding themselves, so it's way easier with them; however, you give up quite a lot.
I'll expand a bit on Vincent's answer.
As for sharding, we have had good experience with PL/Proxy, and with sharding you can outgrow a single machine without issues (for reads or writes).
As for replication, Londiste from SkyTools is very easy to set up and use, and with it you get PgQ, quite a nice messaging solution for Postgres.
We have a new Django-powered project with a potentially heavy-traffic profile (meaning heavy DB interaction), so we need to consider database scalability in advance. After some research, the following questions are still not clear to us:
coarse-grained: how do we pin one DB table (a Django model) to a specific DB (possibly on another server)?
fine-grained: how do we pin a group of table rows to a specific DB (so-called sharding, possibly also on another DB server)?
how do we direct writes and reads to different DBs (which will be helpful for future MySQL master/slave replication)?
We are looking for a solution that:
is transparent to the application code (meaning no additional code in views.py)
works at the ORM level (meaning it only needs to be specified in models.py)
is compatible with the current (and future) Django releases (to minimize changes when upgrading Django)
I'm still doing the research, and will share in this thread later if it bears fruit.
I hope anyone with experience can answer. Thanks.
Don't forget about caching either. Using memcached to relieve your DB of load is key to building a high-performance site.
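In Django this is just a settings change; a minimal sketch, assuming a memcached instance on the default local port (the backend class name has varied across Django versions):

    # settings.py
    CACHES = {
        "default": {
            # On Django 3.2+; older releases used
            # "django.core.cache.backends.memcached.MemcachedCache"
            "BACKEND": "django.core.cache.backends.memcached.PyMemcacheCache",
            "LOCATION": "127.0.0.1:11211",
        }
    }

With that in place, the per-site cache middleware or the cache_page decorator can absorb most read traffic before it ever reaches the database.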
As Alex said, django-core doesn't support your specific requests for those features, though they are definitely on the to-do list.
If you don't do this in the application layer, you're basically asking for performance trouble. There aren't any really good open source automation layers for this sort of task, since it tends to break SQL axioms. If you're really concerned about it, you should be coding the entire application for it, not simply hoping that your ORM will take care of it.
There is a GSoC project by Alex Gaynor that will eventually allow the use of multiple databases in one Django project, but for now there is no working cross-RDBMS solution.
For this, too, there is no solution right now.
And again, there is no cross-RDBMS solution. But if you are using MySQL, you can try an excellent third-party Django application called mysql_replicated. It makes it easy to set up a master/slave replication scenario.
Here, for some reason, we are using Django with SQLAlchemy. Maybe the combination of Django and SQLAlchemy also works for your needs.