To CouchDB or not to? - database

Note: (I have investigated CouchDB for sometime and need some actual experiences).
I have an Oracle database for a fleet tracking service and some status here are:
100 GB db
Huge insertion/sec (our received messages)
Reliable replication (via Oracle streams on 4 servers)
Heavy complex queries.
Now the question: Can CouchDB be used in this case?
Note: Why I thought of CouchDB?
I have read about it's ability to scale horizontally very well. That's very important in our case.
Since it's schema free we can handle changes more properly since we have a lot of changes in different tables and stored procedures.
Thanks
Edit I:
I need transactions too. But I can tolerate other solutions too. And If there is a little delay in replication, that would be no problem IF it is guaranteed.

You are enjoying the following features with your database:
Using it in production
The data is naturally relational (related to itself)
Huge insertion rate (no MVCC concerns)
Complex queries
Transactions
These are all reasons not to switch to CouchDB.
Of course, the story is not so simple. I think you have discovered what many people never learn: complex problems require complex solutions. We cannot simply replace our database and take the rest of the month off. Sure, CouchDB (and BigCouch) supports excellent horizontal scaling (and cross-datacenter replication too!) but the cost will be rewriting a production application. That is not right.
So, where can CouchDB benefit you?
I suggest that you begin augmenting your application with CouchDB applications. Deploy CouchDB, import your data into it, and build non mission-critical applications. See where it fits best.
For your project, these are the key CouchDB strengths:
It is a small, simple tool—easy for you to set up on a workstation or server
It is a web server. It integrates very well with your infrastructure and security policies.
For example, if you have a flexible policy, just set it up on your LAN
If you have a strict network and firewall policy, you can set it up behind a VPN, or with your SSL certificates
With that step done, it is very easy to access now. Just make http or http requests. Whether you are importing data from Oracle with a custom tool, or using your web browser, it's all the same.
Yes! CouchDB is an app server too! It has a built-in administrative app, to explore data, change the config, etc. (like a built-in phpmyadmin). But for you, the value will be building admin applications and reports as simple, traditional HTML/Javascript/CSS applications. You can get as fancy or as simple as you like.
As your project grows and becomes valuable, you are in a great position to grow, using replication
Either expand the core with larger CouchDB clusters
Or, replicate your data and applications into different data centers, or onto individual workstations, or mobile phones, etc. (The strategy will be more obvious when the time comes.)
CouchDB gives you a simple web server and web site. It gives you a built-in web services API to your data. It makes it easy to build web apps. Therefore, CouchDB seems ideal for extending your core application, not replacing it.

I don't agree with this answer..
I think CouchDB suits especially well fleet tracking use case, due to their distributed nature. Moreover, the unreliable nature of gprs connections used for transmitting position data, makes the offline-first paradygm of couchapps the perfect partner for your application.
For uploading data from truck, Insertion-rate can take a huge advantage from couchdb replication and bulk inserts, especially if performed on ssd-based couchdb hosting.
For downloading data to truck, couchdb provides filtered replication, allowing each truck to download only the data it really needs, instead of the whole database.
Regarding complex queries, NoSQL database are more flexible and can perform much faster than relation databases.. It's only a matter of structuring and querying your data reasonably.

Related

DB recommendation - Portable, Concurrent (multiple read only, one write)

I'm looking for a portable database solution I can use with a website that is designed to handle service outages. I need to nightly retrieve a list of users from SQL Server and upsert their details into a portable database. It's roughly about 250,000 users (and growing) and each one has probably 25 fields that are required. Of those fields, i'd say less than 5 need to be searched on. The rest just need retrieving.
The idea is, in times of a service outage, we can use a website that's designed to work from the portable database rather than SQL Server. Our long term goal, is to move to the cloud and handle things in an entirely different way, but for the short term this is our aim.
The website is going to be a .Net Core web api so will be being accessed by multiple users in multiple threads. The website will only ever need read access, it will not be updating these details what-so-ever.
To keep the portable database up-to-date i'm thinking of having another application that just runs nightly to update the data. Our business is 24 hours (albeit quieter overnight), so there is a potential this updater is in use while the website is in use. While service outage would assume the SQL Server is down, this may not be the case. There are other factors in play that could cause what we would describe as outages. This will be the only piece of software updating the database.
I've tried using LiteDB but I couldn't get it working in a way that worked with my concurrency requirements. It did seem to do some of the job, and was easy to get running. However, i'd often run into locked files due to the nature of web api. I did work out a solution for that, but then the updater app couldn't access the database file.
Does anyone have any recommendations I can look into?
Given the description of the problem (1 table, 250k rows with - I assume - relative fast growth rate) and requirements, I don't think a relational database is what you are looking for.
I think nosql databases, or, more specifically, document oriented databases are more fitted to meet your requirements. There are many choices: Mongo, Cassandra, CouchDB, ... the choice is yours.
Personally I have some experience with ElasticSearch (https://www.elastic.co/elasticsearch), that is quite easy to learn, is portable (runs on Linux, Windows, Containers, etc...), is scalable, and it is fast. I mean, really, really fast, you can get results in 10-20 milliseconds (even less, sometimes).
The NEST nuget package acts as a high level client for working with ElasticSearch (https://www.elastic.co/guide/en/elasticsearch/client/net-api/7.x/nest-getting-started.html)

keeping databases in sync (after write/update) across regions/zones

I have to write a webservice in php to serve at three different zones/(cities or countries). Each zone will have its own machine to run this web service instance behind every webservice is a database which is exact clone/copy in each region, web service serves the clients with data from db. Main reason for multiples instances of web service is to distribute client load.
The clients can make read and write calls via web service APIs.
Write calls will modify the database for that instance but this change has to be applied as soon as possible to all databases in other zones also as all the databases in each zone are clones and exact copies, so changes in one db must be synced in all the databases in other zones.
I presume the write calls must go to some kind of master server which coordinates among all the web services etc. But I am sure this pattern is quite common and some solution is already out there.
Please advise if there is any database or application level technique which would keep the databases in sync when there are write calls so that modification or addition is reflected in all instances of db ? I can choose the database of my choice but primary choice would be mysql server or postgres, but can change to other database which can solve this issue.
You're right, this pattern is quite common and there is a name for it - Synchronous Master-Master replication. Most modern RDBMS support it:
PosgreSQL supports it thru pg_cluster https://wiki.postgresql.org/wiki/PgCluster
MySQL https://www.howtoforge.com/mysql_master_master_replication
But before implementing it straight away I'd recommend reading more about different types of replication, their pros and cons:
https://wiki.postgresql.org/wiki/Replication,_Clustering,_and_Connection_Pooling
https://dev.mysql.com/doc/refman/8.0/en/replication.html
Synchronous Master-Master replication will be quite slow, especially in a multi-zone scenario, so you might consider other techniques:
Asynchronous replication
Sharding/Partitioning
A mix of sharding and replication
There is a very good book on different distributed techniques(including sharding and replication) - "Designing Data Intensive Applications" by Martin Kleppmann.
Replication techniques are definitely worth looking at, but there can be a certain amount of technical overhead and cost to replication. I work for a company called Redactics (https://www.redactics.com), and we came up with a simpler solution that is sort of a near realtime replication based on delta updates using a pure SQL approach.
There are certainly pros and cons to both approaches, I'm not trying to push Redactics hard if this is not the most appropriate solution for your needs, but Redactics simply tracks the most recent primary keys and uses modification timestamps to find new and changed records, and then copies them over. You can run the sync pretty often without a lot of load since it is just a delta update. Obviously any workflow can break, but repairing broken replication can be tricky, so we like this approach and running these sync workflows within your own infrastructure.

Database for a java application in cluster

I'd like to play around with kubernetes, I'm able to start a simple app, but now I'd like to design something more complex. Nevertheless I can't figure out, how to handle the database access in such architecture.
Let's say I have 100 pod replicas of some simple chat application. They all need to access the same database (or more like data set) and perform CRUD operations upon them. How to design it to keep the data consistent and eliminate the risk of deadlocks?
If possible, I'd like to use SQL-like database, so I can comfortably use hibernate and other tools I'm familiar with.
Is this even possible or do I have to use totally a different approach? What is the name of the technology or architecture I'm searching for?
1) You can use a connection pool to reduce this number and make the connection settings more aggressive/elastic;
2) Split your microservices in such way the access to the persistence is a microservice exposing your CRUD service to your persistence(mysql/rdms/nosql/etc). In that way you most likely don't need hundreds of replicas of your pods.
3) Deadlocks / locking strategies - as Andrew mentioned in the comments, it's more related to your software development architecture rather than K8s itself. There are plenty of ways to deal with that with pros/cons.

Pluggable database interface

I am working on a project where we are scoping out the specs for an interface to the backend systems of multiple wholesalers. Here is what we are working with,
Each wholesaler has multiple products, upwards of 10,000. And each wholesaler has customized prices for their products.
The list of wholesalers being accessed will keep growing in the future, so potentially 1000s of wholesalers could be accessed by the system.
Wholesalers are geographically dispersed.
The interface to this system will allow the user to select the wholesaler they wish and browse their products.
Product price updates should be reflected on the site in real time. So, if the wholesaler updates the price it should immediately be available on the site.
System should be database agnostic.
The system should be easy to setup on the wholesalers end, and be minimally intrusive in their daily activities.
Initially, I thought about creating databases for each wholesaler on our end, but with potentially 1000s of wholesalers in the future, is this the best option as far as performance and storage.
Would it be better to query the wholesalers database directly instead of storing their data locally? Can we do this and still remain database agnostic?
What would be best technology stack for such an implementation? I need some kind of ORM tool.
Java based frameworks and technologies preferred.
Thanks.
If you want to create a software that can switch the database I would suggest to use Hibernate (or NHibernate if you use .Net).
Hibernate is an ORM which is not dependent to a specific database and this allows you to switch the DB very easy. It is already proven in large applications and well integrated in the Spring framework (but can be used without Spring framework, too). (Spring.net is the equivalent if using .Net)
Spring is a good technology stack to build large scalable applications (contains IoC-Container, Database access layer, transaction management, supports AOP and much more).
Wiki gives you a short overview:
http://en.wikipedia.org/wiki/Hibernate_(Java)
http://en.wikipedia.org/wiki/Spring_Framework
Would it be better to query the wholesalers database directly instead
of storing their data locally?
This depends on the availability and latency for accessing remote data. Databases itself have several posibilities to keep them in sync through multiple server instances. Ask yourself what should/would happen if a wholesaler database goes (partly) offline. Maybe not all data needs to be duplicated.
Can we do this and still remain database agnostic?
Yes, see my answer related to the ORM (N)Hibernate.
What would be best technology stack for such an implementation?
"Best" depends on your requirements. I like Spring. If you go with .Net the built-in ADO.NET Entity Framework might be fit, too.

Suggest: Non RDBMS database for a noob

For a new application based on Erlang, Python, we are thinking of trying out a non-RDBMS database(just for the sake of it). Some of the databases I've researched are Mongodb, CouchDB, Cassandra, Redis, Riak, Scalaris). Here is a list of simple requirements.
Ease of development - I need to make a quick proof-of-concept demo. So the database needs to have good adapters for Eralang and Python.
I'm working on a new application where we have lots of "connected" data. Somebody recommended Neo4j for graph-like data. Any ideas on that?
Scalable - We are looking at a distributed architecture, hence scalability is important.
For the moment performance(in any form) isn't exactly on top of my list, and I don't think we'll be hitting the limitations of any of the above mentioned databases anytime soon.
I'm just looking for a starting point for non-RDBMS database. Any recommendations?
We have used Mnesia in building an Enterprise Application. Mnesia when in a mode where the tables are Fragmented performs at its best because it would not have table size limits. Mnesia has performed well for the last 1 year and is still on. We have around 15 million records per table on the average and around 24 tables in a given database Schema.
I recommend mnesia Database especially the one that comes shipped within Erlang 14B03 at the Erlang.org website. We have used CouchDB and Membase Server (http://www.couchbase.com)for some parts of the system but mnesia is the main data storage (primary storage). Backups have been automated very well and the system scales well against increasing size of data yet tables running under many checkpoints. Its distribution, auto-replication and Complex Data Model enabled us to build the application very quickly without worrying about replication, scalability and fail-over / take-over of systems.
Mnesia Scales well and it's schema can be configured and changes while the database is running. Tables can be moved, copied, altered e.t.c while the system is live. Generally, it has all features of powerful systems built on top of Erlang/OTP. When you google mnesia DBMS, you will get a number of books and papers that will tell you more.
Most importantly, our application is Web based, powered by Yaws web server (yaws.hyber.org) and we are impressed with Mnesia's performance. Its record look up speeds are very good and the system feels so light yet renders alot of data. Do give mnesia a try and you will not regret it.
EDIT: To quickly use it in your application, look at the answer given here
Ease of development - I need to make a quick proof-of-concept demo. So the database needs to have good adapters for Eralang and Python.
Riak is written in Erlang => speaks Erlang natively
I'm working on a new application where we have lots of "connected" data. Somebody recommended Neo4j for graph-like data. Any ideas on that?
Neo4j is great for "connected" data. It has Python bindings, and some Erlang adapters How to Use Neo4j From Erlang. Thing to note, Neo4j is not as easy to Scale Out, at least for free. But.. it is fully transactional ( even JTA ), it persists things to disk, it is baked into Spring Data.
Scalable - We are looking at a distributed architecture, hence scalability is important.
For the moment performance(in any form) isn't exactly on top of my list, and I don't think we'll be hitting the limitations of any of the above mentioned databases anytime soon.
I believe given your input, Riak would be the best choice for you:
Written in Erlang
Naturally Distributed
Very easy to develop for/with
Lots of features ( secondary indicies, virtual nodes, fully modular, pluggable persistence [LevelDB, Bitcask, InnoDB, flat file, etc.. ], extremely reliable, built in full text search, etc.. )
Has an extremely passionate and helpful community with Basho backing it up

Resources