We are designing a distributed system that needs a highly available key-value store. The system has to hold the latest state of client records, so the records will be updated frequently.
Is there an update-friendly NoSQL database?
I would start with Mongo or Redis; you likely have more requirements than the ones listed.
That assumes you are not looking only at pure open-source options, i.e. Couch/Cassandra etc.
Can etcd be used as a reliable database replacement? Since it is distributed and stores key/value pairs persistently, it would seem to make a great alternative NoSQL database. In addition, it has a great API. Can someone explain why this is not a thing?
etcd
etcd is a highly available key-value store that Kubernetes uses for persistent storage of all of its objects, such as deployment, pod, and service information.
Access to etcd is tightly controlled: in a Kubernetes cluster it is reached only through the API server on the master node. Nodes other than the master have no direct access to the etcd store.
nosql database
There are currently more than 255 NoSQL databases, which can be broadly classified as key-value based, column based, document based, and graph based. Taking etcd as a key-value store, let's look at the available NoSQL key-value data stores.
Redis, Memcached, and MemcacheDB are popular key-value stores. They are general-purpose distributed memory caching systems, often used to speed up dynamic database-driven websites by caching data and objects in memory.
Why etcd is not an alternative
etcd's data is not served purely from memory (RAM); it is always persisted to disk storage, whereas Redis keeps its data in RAM and can additionally persist it to disk.
etcd does not offer a variety of data types; it is designed for storing small values such as Kubernetes objects. Redis and other key-value stores give you data-type flexibility.
etcd guarantees high availability, but it does not give you fast querying and indexing. The NoSQL key-value stores are all built with fast querying and searching as a goal.
Even if it is not immediately obvious that etcd cannot serve as an alternative NoSQL database, I think the explanation above shows it is not a suitable one.
From the ETCD.IO site:
etcd is a strongly consistent, distributed key-value store that
provides a reliable way to store data that needs to be accessed by a
distributed system or cluster of machines. It gracefully handles
leader elections during network partitions and can tolerate machine
failure, even in the leader node.
It has a simple interface using HTTP and JSON. It is NOT just for Kubernetes; Kubernetes is just one example of a critical application that uses it.
You are right, it should be a thing. It is a nice, reliable data store with an easy-to-use API, and it has a nice way of telling you when things change (its watch mechanism, with consistency provided by the Raft protocol). This is great for feature toggles and other items where every node needs to know about changes, and it is much better than hacks such as putting a trigger in an SQL database and having it send an event to an external application, or really horrible polling.
So if you are writing something like the Kubernetes use case, it is perfect: a well-proven store for a distributed application.
If you are writing something very different from the Kubernetes use case, then you are comparing it with all the other NoSQL databases. It is very different from something like MongoDB, so it may be a better fit precisely when MongoDB or similar does not work for you.
Other example users
M3, a large-scale metrics platform for Prometheus created by Uber, uses etcd for rule storage and other functions
Consistency
There is a nice comparison of NoSQL database consistency by Jepsen at https://jepsen.io/analyses
etcd sums up its results at https://etcd.io/blog/jepsen-343-results/
The only obstacles I've come to see are those between our ears. I guess we need to show first that it can be done, and what the benefits are.
My colleagues seem to shy away from it because "it's for storing secrets and common truth". The etcd v3 revision made etcd capable of much more, but the news simply hasn't rippled down yet.
Let's build some showcases, some success stories. Personally, I like etcd for the reasons you mentioned, and for its focus on dependable performance.
First off: no, etcd is not the next NoSQL replacement. But there are scenarios where it can come in handy.
Let's imagine you have (configuration) data that is mostly static but may change at runtime. Maybe your frontend needs to know the backend endpoints for each customer's country to comply with legal requirements, and you know the worldwide rollout is done in phases.
So you could just use a Kubernetes ConfigMap to store the array of data (country -> endpoint) and let your backend watch that ConfigMap for changes.
On a change, the application reads in the list and provides a repository that exposes the data to your service layer.
All operations (search, get, update, ...) need to be implemented in the repository, but the data will sit in memory (probably a linked hash map), so retrieval will be very fast, like a local cache.
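The repository idea above can be sketched in a few lines. This is a hypothetical Python sketch (the class, field names, and payload shape are invented for illustration), assuming the ConfigMap holds a JSON array of country/endpoint pairs:

```python
import json

class EndpointRepository:
    """In-memory repository over (country -> endpoint) configuration.

    Hypothetical sketch: a ConfigMap watch handler calls reload() with
    the new payload, so normal reads never leave process memory.
    """

    def __init__(self):
        self._by_country = {}

    def reload(self, raw_json):
        # Deserialize the watched payload and rebuild the lookup table.
        entries = json.loads(raw_json)
        self._by_country = {e["country"]: e["endpoint"] for e in entries}

    def endpoint_for(self, country):
        return self._by_country.get(country)

repo = EndpointRepository()
repo.reload('[{"country": "DE", "endpoint": "https://api.eu.example.com"},'
            ' {"country": "US", "endpoint": "https://api.us.example.com"}]')
```

A lookup such as `repo.endpoint_for("DE")` then only touches the local dictionary, which is exactly the "local cache" behaviour described above.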
If the data is changed by the application, just serialize the list and patch the ConfigMap. Any other application watching the ConfigMap will update its internal state.
However, there is no locking, so rapid successive changes may result in race conditions.
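One common way around such races is optimistic concurrency: Kubernetes objects carry a resourceVersion, and an update is rejected when someone else wrote in between. The sketch below simulates that check with a toy in-memory store (class and function names are invented for illustration, not a real client API):

```python
class VersionedStore:
    """Toy stand-in for the API server's resourceVersion check:
    an update succeeds only if the caller read the latest version."""

    def __init__(self, data):
        self.data, self.version = data, 1

    def read(self):
        return self.data, self.version

    def compare_and_swap(self, new_data, expected_version):
        if expected_version != self.version:
            return False  # someone else wrote in between; caller must retry
        self.data, self.version = new_data, self.version + 1
        return True

def update_with_retry(store, mutate, attempts=5):
    # Re-read and retry on conflict instead of blindly overwriting.
    for _ in range(attempts):
        data, version = store.read()
        if store.compare_and_swap(mutate(data), version):
            return True
    return False

store = VersionedStore({"DE": "https://api.eu.example.com"})
ok = update_with_retry(store, lambda d: {**d, "US": "https://api.us.example.com"})
```

The retry loop means a quick concurrent change costs you a re-read rather than a lost update.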
etcd allows roughly 1 MiB per value (Kubernetes caps ConfigMaps at 1 MiB). That is enough for mostly static data.
Another application might be feature toggles. They do not change that often, but when they do, every application needs to know quickly, and polling sucks.
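What a watch buys you over polling can be shown with a toy publish-on-change toggle store (pure illustration with invented names; in a real deployment the etcd/ConfigMap watch would drive the callbacks):

```python
class FeatureToggles:
    """Minimal publish-on-change toggle store, mimicking what an
    etcd or ConfigMap watch gives you without any polling loop."""

    def __init__(self):
        self._flags, self._subscribers = {}, []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def set(self, name, enabled):
        # Notify subscribers only on an actual change, like a watch event.
        if self._flags.get(name) != enabled:
            self._flags[name] = enabled
            for notify in self._subscribers:
                notify(name, enabled)

    def is_enabled(self, name):
        return self._flags.get(name, False)

seen = []
toggles = FeatureToggles()
toggles.subscribe(lambda name, enabled: seen.append((name, enabled)))
toggles.set("new-checkout", True)
```

Every subscriber learns about the flip immediately, instead of discovering it on its next polling interval.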
See whether this checklist of etcd's limitations, compared with a more full-featured database, works for you:
Your database size has to stay within 2 GB (extensible to a maximum of 8 GB).
No sharding, and hence none of the data scalability that NoSQL DB clusters (Mongo, Redis, ...) provide.
Meant for simple value stores, with payloads limited to 1.5 MB. The limit can be raised, but that impacts other queries. Most DBs can store large BLOBs; Redis can store a single value of up to 512 MB.
No query language for searches more complex than a key prefix. Other databases provide richer models such as document or graph storage, with querying and indexing; even the key-value DB Redis supports more complex types through modules, along with querying and search capabilities.
No general-purpose ACID transactions (etcd offers only simple compare-and-swap style multi-key transactions).
When you have a hammer, everything may look like a potential nail. You need to make sure it really is one.
I ask this question apprehensively, because it is not a pure programming question and because I am seeking (well-informed) suggestions.
I have an analytic front end, written in JavaScript, with lots of aggregations and charting happening in the browser (dimple.js, even stats.js, ...)
I want to feed this application with JSON or delimited data from some high performance data structure server. No writes except for loading. Data will be maybe 1-5 GB in size and there could be dozens, if not hundreds concurrent readers, but only in peak hours. This data is collected from and fed by Apache Hive.
Now my question is about the selection of a database/datastore server choices for this.
(I have pretty good command of SQL/NoSQL choices, so I am really seeking advice for the very specific requirements)
Requirements and specifications for this datastore are:
Mostly if not all queries will be reads, initiated by the web, JS-based front end.
Data can be served as JSON or flat tabular csv, psv, tsv.
Total data size on this store will be 1-5 GB, with possible future growth, but nothing imminent (6-12 months)
Data on this datastore will be refreshed/loaded into this store daily. Probably never in a real time.
Data will/can be accessed via some RESTful web services, Socket IO, etc.
The faster the read access, the better; speed matters.
There has to be a security/authentication method for sensitive data protection.
It needs to be reasonably stable, not a patching-requiring bleeding edge.
Liberal, open source license.
So far, my initial candidates for examination were Postgres (optimized for large cache) and Mongo. Just because I know them pretty well.
I am also familiar with Redis, Couch.
I have not benchmarked them myself, but I have seen benchmarks where Postgres was faster than Mongo (while also offering a JSON format). Mongo is more web-friendly.
I am considering in-memory stores with persistence such as Redis, Aerospike, Memcached. Redis 3.0 is my favorite so far.
So, I ask you here if you have any recommendations for the production quality datastore that would fit well what I need.
Any civil and informed suggestions are welcome.
What exactly does your data look like? Since you said CSV like exports, I'm assuming this is tabular, structured data that would usually be found in a relational database?
Some options:
1. Don't use a database
Given the small dataset, just serve it out of memory. You can probably spend a few hours writing a quick app with any decent web framework that loads the data into memory (for example, from a flat file) and then searches and returns it in whatever format and way you need.
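As a sketch of this no-database option (the dataset shape and field names are made up for illustration), the whole "datastore" can be a list of dicts plus a filter and two serializers:

```python
import csv
import io
import json

# Toy dataset standing in for the 1-5 GB flat-file export from Hive.
ROWS = [
    {"region": "EMEA", "metric": "visits", "value": 120},
    {"region": "APAC", "metric": "visits", "value": 80},
]

def query(rows, **filters):
    # A linear scan is fine for data that fits comfortably in RAM.
    return [r for r in rows if all(r.get(k) == v for k, v in filters.items())]

def as_json(rows):
    return json.dumps(rows)

def as_csv(rows):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["region", "metric", "value"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

emea = query(ROWS, region="EMEA")
```

A web framework endpoint would simply call `query(...)` and return `as_json(...)` or `as_csv(...)` depending on the requested format.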
2. Use an embedded database
You can also try an embedded database like SQLite which gives you in-memory performance but with a reliable SQL interface. Since it's just a single-file database, you can have another process generate a new DB file, then swap it out when you update the data for the app.
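The generate-then-swap step might look like this with Python's built-in sqlite3 module (table name and columns are invented; os.replace gives an atomic rename on the same filesystem):

```python
import os
import sqlite3
import tempfile

def build_database(path, rows):
    """Write a fresh single-file DB, then atomically swap it into place."""
    tmp = path + ".new"
    con = sqlite3.connect(tmp)
    con.execute("CREATE TABLE metrics (region TEXT, value INTEGER)")
    con.executemany("INSERT INTO metrics VALUES (?, ?)", rows)
    con.commit()
    con.close()
    # Atomic on POSIX: readers see either the old or the new file, never half.
    os.replace(tmp, path)

db = os.path.join(tempfile.mkdtemp(), "analytics.db")
build_database(db, [("EMEA", 120), ("APAC", 80)])

con = sqlite3.connect(db)
total = con.execute("SELECT SUM(value) FROM metrics").fetchone()[0]
con.close()
```

The daily refresh then becomes: regenerate the file from the Hive export, swap, and let new connections pick up the fresh data.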
3. Use a full database system
Use a regular relational database. MySQL, PostgreSQL, and SQL Server (Express Edition) are all free, can handle that dataset easily, and will simply cache it all in RAM. With read-only queries, I don't see any issues with a few hundred concurrent users. You can also use MemSQL Community Edition if you need more performance. They all support security, are very reliable, and you can't beat SQL for data access.
Use a key/value system if your data isn't relational or tabular and fits better as simple values or documents. However, remember that KV stores aren't great at scans or aggregations and don't have joins. Memcached is just a distributed cache; don't use it for real data. Redis and Aerospike are both great key/value systems, with Redis giving you lots of nice data structures to use. Mongo is good for data flexibility. Elasticsearch is a good option for advanced search-like queries.
If you're going to these database systems though, you will still need a thin app layer somewhere to interface with the database and then return the data in the proper format for your frontend.
If you want to skip that part, then just use CouchDB or Riak instead. Both are document-oriented and have a native HTTP interface with JSON responses, so you can consume them directly from your frontend, although this might raise security issues, since anyone can see the JavaScript calls.
I am working on a project where we are scoping out the specs for an interface to the backend systems of multiple wholesalers. Here is what we are working with,
Each wholesaler has multiple products, upwards of 10,000. And each wholesaler has customized prices for their products.
The list of wholesalers being accessed will keep growing in the future, so potentially 1000s of wholesalers could be accessed by the system.
Wholesalers are geographically dispersed.
The interface to this system will allow the user to select the wholesaler they wish and browse their products.
Product price updates should be reflected on the site in real time. So, if the wholesaler updates the price it should immediately be available on the site.
System should be database agnostic.
The system should be easy to setup on the wholesalers end, and be minimally intrusive in their daily activities.
Initially, I thought about creating databases for each wholesaler on our end, but with potentially thousands of wholesalers in the future, is this the best option in terms of performance and storage?
Would it be better to query the wholesalers' databases directly instead of storing their data locally? Can we do this and still remain database agnostic?
What would be best technology stack for such an implementation? I need some kind of ORM tool.
Java based frameworks and technologies preferred.
Thanks.
If you want to create software that can switch databases, I would suggest using Hibernate (or NHibernate if you use .NET).
Hibernate is an ORM that is not tied to a specific database, which allows you to switch the DB very easily. It is proven in large applications and well integrated into the Spring framework (but can be used without Spring, too). (Spring.NET is the equivalent if you are using .NET.)
Spring is a good technology stack for building large, scalable applications (it contains an IoC container, a database access layer, transaction management, AOP support, and much more).
Wikipedia gives you a short overview:
http://en.wikipedia.org/wiki/Hibernate_(Java)
http://en.wikipedia.org/wiki/Spring_Framework
Would it be better to query the wholesalers' databases directly instead of storing their data locally?
This depends on the availability of, and latency when accessing, the remote data. Databases themselves offer several ways of keeping multiple server instances in sync. Ask yourself what should/would happen if a wholesaler's database goes (partly) offline. Maybe not all data needs to be duplicated.
Can we do this and still remain database agnostic?
Yes; see my answer above regarding the ORM (N)Hibernate.
What would be best technology stack for such an implementation?
"Best" depends on your requirements. I like Spring. If you go with .NET, the built-in ADO.NET Entity Framework might fit, too.
I'm attempting to use CouchDB for a system that requires full auditing of all data operations. Because of its built-in revision tracking, Couch seemed like an ideal choice. But then I read in the O'Reilly book that "CouchDB does not guarantee that older versions are kept around."
I can't seem to find much more documentation on this point, or on how Couch handles its revision tracking internally. Is there any way to configure Couch, either per database or per document, to keep all versions around forever? If so, how?
The revisions in CouchDB are not revisions in the way you are thinking of them. They are an artifact of the way it appends updated data to the database, and are cleaned up upon compaction. This is a common misunderstanding.
You need to implement the revision-tracking as part of the schema/document design of your application.
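One common application-level design is to never overwrite: every save writes a new document keyed by (id, version), so history survives compaction. Here is a toy in-memory sketch of that idea (class and method names are invented; in real CouchDB you might use document IDs like "invoice-17:2"):

```python
class VersionedDocs:
    """Application-level versioning sketch: each save creates a new
    document under a fresh version number instead of overwriting,
    so old versions are never reclaimed by compaction."""

    def __init__(self):
        self._docs = {}    # (doc_id, version) -> body
        self._latest = {}  # doc_id -> latest version number

    def save(self, doc_id, body):
        version = self._latest.get(doc_id, 0) + 1
        self._docs[(doc_id, version)] = body
        self._latest[doc_id] = version
        return version

    def get(self, doc_id, version=None):
        if version is None:
            version = self._latest[doc_id]
        return self._docs[(doc_id, version)]

store = VersionedDocs()
store.save("invoice-17", {"total": 100})
store.save("invoice-17", {"total": 120})
```

Audit queries then become ordinary reads: fetch any (id, version) pair, or walk all versions of an id.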
Couch does keep all versions, but if you click the "compact database" link in Futon, all previous versions will be deleted. So as long as you never click compact database, all versions will be preserved, I think. :)
There are two situations when the previous versions of documents in CouchDB are removed:
Replication
Compaction
Thus, if you do not need to store lots of data (meaning terabytes), you probably do not need replication. That said, master-to-master replication is one of CouchDB's most important features. CouchDB's size on disk is greater than a traditional database's, so you will probably need compaction eventually.
As mentioned above: you need to implement revision tracking as part of the schema/document design of your application.
I hear a lot about CouchDB, but after reading some documents about it, I still don't get why I would use it, or how.
Could you clarify this mystery for me?
It's a non-relational, open-source, distributed (incremental, bidirectional replication), schema-free database. A CouchDB database is a collection of documents; each document is a bunch of string "keys" with corresponding "values" (which can be numbers, strings, lists, dates, ...). You can have indices, queries, and views.
If a relational DB feels confining to you (you find schemas too rigid, you can't spread the DB engine's work across a very large number of servers, etc.), CouchDB is worth considering; it's one of the most interesting of the many non-relational DBs emerging these days.
But if all of your work fits happily in a relational database, that's probably what you want to keep using for production work. (Playing around with a non-relational DB is still well worth your time for personal growth and edification, but that's quite different from migrating huge production systems off a relational DB!)
It sounds like you should be reading Why CouchDB
To quote from wikipedia
It is not a relational database management system. Instead of storing data in rows and columns, the database manages a collection of JSON documents. The documents in a collection need not share a schema, but retain query abilities via views.
CouchDB provides a different model for data storage than a traditional relational database in that it does not represent data as rows within tables, instead it stores data as "documents" in JSON format.
This difference in data-storage model is what differentiates CouchDB from products like MySQL and SQL Server.
In terms of programmatic access to CouchDB, it exposes a REST API which you can use by sending HTTP requests from your code.
I hope this has been somewhat helpful, though I acknowledge it may not be, given my minimal familiarity with the product.
I'm far from an expert (all I've done is play around with it some...), but here's how I'm thinking of using it:
Usually when I design an app, I've got a bunch of app servers behind a load balancer. Often I use sticky sessions, so that each user goes back to the same app server during a session. What I'm thinking of doing is tying a CouchDB instance to each app server.
That way you can use the local CouchDB to access user preferences, product data... whatever data you've got that doesn't have to be perfectly up to date.
So now you've got data in these local CouchDBs. CouchDB supports replication, so every fixed time period (every X seconds?) merge the data back into its peers to keep them up to date.
As a whole you shouldn't have to worry about conflicts, because each app server has its own CouchDB and users are pinned to an app server; and you get eventual consistency because you've got replication.
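To make "eventual consistency through replication" concrete, here is a toy last-writer-wins merge (purely illustrative; CouchDB itself converges via revision trees and deterministic conflict winners, not timestamps):

```python
def merge_last_writer_wins(local, remote):
    """Toy replication merge: each entry is (value, timestamp) and the
    newer write wins, so two peers that exchange data converge on the
    same state regardless of merge order."""
    merged = dict(local)
    for key, (value, ts) in remote.items():
        if key not in merged or ts > merged[key][1]:
            merged[key] = (value, ts)
    return merged

# Two app-server replicas that diverged between replication rounds.
a = {"theme": ("dark", 5), "lang": ("en", 3)}
b = {"theme": ("light", 7)}
merged = merge_last_writer_wins(a, b)
```

After both peers apply the merge, they hold identical data, which is the "eventual" part of eventual consistency.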
Does that answer your question?
A good example is when you have to deal with people data in a website or application. If you set out wanting to model the data while keeping individuals' information separate, that makes a good case for CouchDB, which stores data in documents rather than relational tables. In a production deployment, my users may end up adding ad-hoc data for about 10% of the people, and other odd details for another 5%. In a relational schema this could add up to loads of redundancy, but not in CouchDB.
And it's not just that CouchDB is non-relational: if you focus too much on that, you're missing the point. CouchDB is plugged into the web; all you need to get started is HTTP for creating documents and making queries (GET/PUT/POST/DELETE...), and it's RESTful. It's also portable and great for peer-to-peer sharing. It can even serve up web applications, so-called "CouchApps", where CouchDB holds the images, CSS, and markup as data stored in special documents called design documents.
Check out this collection of videos introducing non-relational databases, the one on CouchDB should give you a better idea.