Which key-value database to use for BLOB storage? [closed] - database

We have a service that currently runs on top of a MySQL database and uses JBoss to run the application. The database is growing at an accelerating rate and I am looking to change the setup to improve scaling. The issue is not a large number of rows, nor (yet) a particularly high volume of queries, but rather the large number of BLOBs stored in the database. In particular, the time it takes to create or restore a backup (we use mysqldump and Percona XtraBackup) is a concern, as is the fact that we will need to scale horizontally to keep expanding disk space in the future. At the moment the database size is around 500GB.
The kind of arrangement that I figure would work well for our future needs is a hybrid database that uses both MySQL and some key-value database. The latter would only store the BLOBs. The metadata, as well as the data for user management and the application's business logic, would remain in the MySQL database and benefit from structured tables and full consistency. The application itself would handle consistency between the two databases.
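To make concrete what "the application handles consistency" could look like, here is a minimal Java sketch of the dual write; the BlobStore interface and the document_blob table are hypothetical placeholders for whichever key-value database and schema end up being chosen:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.util.UUID;

    // Hypothetical abstraction over whichever key-value store is chosen (Riak, Cassandra, ...).
    interface BlobStore {
        void put(String key, byte[] value);
        void delete(String key);
    }

    class DocumentService {
        private final Connection mysql;   // metadata, users, business logic
        private final BlobStore blobs;    // opaque byte arrays only

        DocumentService(Connection mysql, BlobStore blobs) {
            this.mysql = mysql;
            this.blobs = blobs;
        }

        /** Store the BLOB first, then the metadata row that points at it. */
        String store(long documentId, byte[] content) throws Exception {
            String blobKey = UUID.randomUUID().toString();
            blobs.put(blobKey, content);                      // KV store holds the payload
            try (PreparedStatement ps = mysql.prepareStatement(
                    "INSERT INTO document_blob (document_id, blob_key, size_bytes) VALUES (?, ?, ?)")) {
                ps.setLong(1, documentId);
                ps.setString(2, blobKey);
                ps.setLong(3, content.length);
                ps.executeUpdate();
            } catch (Exception e) {
                blobs.delete(blobKey);                        // compensate if the metadata write fails
                throw e;
            }
            return blobKey;
        }
    }

Deletes would run in the opposite order: remove the metadata row first, then the now-unreferenced value object.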
The question is which database to use? There are lots of NoSQL databases to choose from. Here are some points on what qualities I am looking for:
Distributed over multiple nodes, which are flexible to add or remove.
Redundancy of storage, with the database automatically making sure each value object is stored on at least two different nodes.
Value objects' size could range from a few dozen bytes to around 100MB.
The database is accessed from a Java EJB application running on JBoss, as well as from a C++ program that processes the data in the database, so some sort of connector for each would be needed.
No structure is needed for the data: a single string or even just a large integer would suffice for the key, and a pure byte array for the value.
No updates for the value objects are needed, only inserts and deletes. If a particular object is made obsolete by a new object that fulfills the same role, the old object is deleted and a new object with a new key is inserted.
Having looked around a bit, Riak sounds good except for its problems with storing large value objects. Which database would you recommend?

Related

Database to store large binary files of malware [closed]

So, I am trying to create a database that can store thousands of malware binary files, with sizes ranging anywhere from a few KB to 50 MB. I am currently testing with Cassandra using blobs, but with files that big Cassandra isn't handling it that well. Does anyone have any good ideas, maybe a better database or a better way to go about using Cassandra? I am relatively new to databases, so please be as detailed as possible.
Thank You
If you have your heart set on Cassandra, you would want to store the blob files outside Cassandra, as the large file sizes will cause problems with your compaction and repairs. Ideally you would store the blob files on a network store somewhere outside Cassandra. That said, Walmart apparently did do it previously.
Cassandra setup:
CREATE TABLE IF NOT EXISTS malware_table (
    malware_hash varchar,
    filepath varchar,
    date_found timestamp,
    object blob,
    -- other columns...
    PRIMARY KEY (malware_hash, filepath)
);
What we're doing here is creating a composite key based on the malware hash, so you can do SELECT * FROM malware_table WHERE malware_hash = ?. If there was a hash collision, you have two files to look at. Additionally, this lookup will be very fast as it's a key-value lookup. Keep in mind that with Cassandra you can only query by your primary key.
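As a rough illustration, a lookup by hash with the DataStax Java driver 4.x might look like this; the keyspace name "malware" and the contact-point defaults are assumptions:

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.cql.PreparedStatement;
    import com.datastax.oss.driver.api.core.cql.ResultSet;
    import com.datastax.oss.driver.api.core.cql.Row;
    import java.nio.ByteBuffer;

    public class MalwareLookup {
        public static void main(String[] args) {
            // Connects to a local node; the "malware" keyspace is an assumption.
            try (CqlSession session = CqlSession.builder().withKeyspace("malware").build()) {
                PreparedStatement ps = session.prepare(
                    "SELECT filepath, date_found, object FROM malware_table WHERE malware_hash = ?");
                ResultSet rs = session.execute(ps.bind(args[0]));    // hash passed as first argument
                for (Row row : rs) {                                 // one row per filepath sharing the hash
                    ByteBuffer blob = row.getByteBuffer("object");
                    System.out.printf("%s (%d bytes)%n", row.getString("filepath"), blob.remaining());
                }
            }
        }
    }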
As it's not likely that you're going to be updating files after the fact, you're going to want to run size-tiered compaction for faster lookups in the long run. This will be more expensive on hard-drive space, since you'll need to keep about 50% of your disk space free at any given time.
Alternative solution:
I would probably store this in S3/GCS or some other network store. Create a folder named after each file's hash, store the file inside it, and use the API to determine whether a file is already there. If this is something being hit thousands of times a second, you would want to create a caching layer in front of it to reduce lookup times. The cost of an object store is going to be vastly cheaper than a Cassandra cluster and will likely scale better.
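A sketch of that layout with the AWS SDK for Java v2; the bucket name and the hash-prefixed key scheme are illustrative choices, and a real setup would also need credentials and region configuration:

    import software.amazon.awssdk.core.sync.RequestBody;
    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.HeadObjectRequest;
    import software.amazon.awssdk.services.s3.model.NoSuchKeyException;
    import software.amazon.awssdk.services.s3.model.PutObjectRequest;

    public class MalwareObjectStore {
        private static final String BUCKET = "malware-samples";   // illustrative bucket name
        private final S3Client s3 = S3Client.create();

        /** Key the object by its hash, e.g. "ab/cd/abcd1234...", to avoid one huge flat prefix. */
        private static String keyFor(String sha256) {
            return sha256.substring(0, 2) + "/" + sha256.substring(2, 4) + "/" + sha256;
        }

        public boolean exists(String sha256) {
            try {
                s3.headObject(HeadObjectRequest.builder().bucket(BUCKET).key(keyFor(sha256)).build());
                return true;
            } catch (NoSuchKeyException e) {
                return false;
            }
        }

        public void putIfAbsent(String sha256, byte[] bytes) {
            if (!exists(sha256)) {
                s3.putObject(PutObjectRequest.builder().bucket(BUCKET).key(keyFor(sha256)).build(),
                             RequestBody.fromBytes(bytes));
            }
        }
    }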

Which database supports scalability and availability? [closed]

What database software can do these?
Scalability via data partitioning, such as consistent hashing.
Redundancy for failover.
Both memory cache and disk storage together.
Key-value data, where the value is a document type such as JSON.
Prefers A and P in the CAP theorem.
I heard that Memcached can do all of these, but I am not sure.
Here are the details:
Data storage volume: a JSON document of <30KB for each key, with more than 100,000,000 keys.
Data is accessed more than 10K times per second.
Persistence is needed for every key-value data.
No need for transaction.
Development environment is C#, but other languages are ok if the protocol spec is known.
Map reduce is not needed.
This is too short a spec description to choose a database from. There are tons of other constraints to consider (data storage volume, data transfer volume, persistence requirements, need for transactions, development environment, map-reduce, etc.).
That being said:
Memcached and Redis are in-memory databases, which means you cannot store more than what your machine's memory can hold. This is less true now that distributed capabilities have been added to Redis.
Document databases (such as MongoDB or Microsoft DocumentDB) support all of this, and you can add Memcached or Redis in front of them; that's how most people use them.
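A minimal cache-aside sketch of that pattern, assuming a local Redis (via Jedis) in front of MongoDB; the database and collection names are illustrative:

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;
    import redis.clients.jedis.Jedis;

    import static com.mongodb.client.model.Filters.eq;

    public class JsonStore {
        private final Jedis redis = new Jedis("localhost", 6379);
        private final MongoCollection<Document> docs;

        public JsonStore() {
            MongoClient mongo = MongoClients.create("mongodb://localhost:27017");
            docs = mongo.getDatabase("app").getCollection("docs");
        }

        /** Cache-aside read: try Redis first, fall back to MongoDB and populate the cache. */
        public String getJson(String key) {
            String cached = redis.get(key);
            if (cached != null) {
                return cached;
            }
            Document doc = docs.find(eq("_id", key)).first();   // persistent copy
            if (doc == null) {
                return null;
            }
            String json = doc.toJson();
            redis.setex(key, 300, json);                         // cache for 5 minutes
            return json;
        }
    }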
I would also add that any major SQL database can now deal with JSON, so that works too, with a cache up front if needed.
There are links of interest for JSON-oriented databases, but once again: that's too short a spec to choose a database.

Strategies to building a database of 30m images [closed]

Summary
I am facing the task of building a searchable database of about 30 million images (of different sizes) associated with their metadata. I have no real experience with databases so far.
Requirements
There will be only a few users, and the database will be almost read-only (if things get written, it will be by a controlled automatic process), so downtime for maintenance should be no big issue. We will probably perform more or less complex queries on the metadata.
My Thoughts
My current idea is to save the images in a folder structure and build a relational database on the side that contains the metadata as well as links to the images themselves. I have read about document-based databases; I am sure they are reliable, but presumably the images would then only be accessible through a database query. Is that true? In that case I am worried that future users of the data might be faced with the problem of learning how to query the database before actually getting things done.
Question
What database could/should I use?
Storing big fields that are not used in queries outside the "lookup table" is recommended for certain database systems, so it does not seem unusual to store the 30m images in the file system.
As to "which database", that depends on the frameworks you intend to work with, how complicated your queries usually are, and what resources you have available.
I had some complicated queries run for minutes on MySQL that were done in seconds on PostgreSQL and vice versa. Didn't do the tests with SQL Server, which is the third RDBMS that I have readily available.
One thing I can tell you: whatever you can do in the DB, do it in the DB. You won't get anywhere near the same performance if you pull all the data from the database and then do the matching in the framework code.
A second thing I can tell you: Indexes, indexes, indexes!
It doesn't sound like the data is very relational, so a non-relational DBMS like MongoDB might be the way to go. With any DBMS you will have to use queries to get information from it. However, if you're worried about future users, you could put a software layer between the user and the DB that makes querying easier.
Storing images in the filesystem and metadata in the DB is a much better idea than storing large BLOBs in the DB (IMHO). I would also note that filesystem performance will be better if you spread the images over many folders and subfolders rather than putting 30M images in one big folder (citation needed).
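A sketch of that split, with the image written under a hash-prefixed folder tree and only the path plus metadata stored in the database; the table and column names are illustrative:

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.security.MessageDigest;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.Timestamp;
    import java.util.HexFormat;

    public class ImageArchive {
        private final Path root;          // e.g. /data/images
        private final Connection db;      // JDBC connection to the metadata database

        public ImageArchive(Path root, Connection db) {
            this.root = root;
            this.db = db;
        }

        /** Write the image under ab/cd/<hash>.jpg and record only metadata plus path in the DB. */
        public void add(byte[] image, String camera, Timestamp takenAt) throws Exception {
            String hash = HexFormat.of().formatHex(
                    MessageDigest.getInstance("SHA-256").digest(image));
            Path target = root.resolve(hash.substring(0, 2))
                              .resolve(hash.substring(2, 4))
                              .resolve(hash + ".jpg");
            Files.createDirectories(target.getParent());
            Files.write(target, image);

            try (PreparedStatement ps = db.prepareStatement(
                    "INSERT INTO images (hash, path, camera, taken_at) VALUES (?, ?, ?, ?)")) {
                ps.setString(1, hash);
                ps.setString(2, root.relativize(target).toString());
                ps.setString(3, camera);
                ps.setTimestamp(4, takenAt);
                ps.executeUpdate();
            }
        }
    }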

Are there any design patterns for bitemporal NoSQL databases? [closed]

I'm curious if anyone has implemented or even knows of any bitemporal databases built on NoSQL platforms (e.g., riak).
I don't know of any NoSQL datastores that are specifically designed to handle temporal data.
In order to put the valid and transaction time periods onto data in Riak you would need to either:
Wrap your documents/values with a structure that can hold metadata like:
{
  "meta": {
    "valid": ["2001-11-08", "2001-11-09"],
    "transaction": ["2011-01-29 10:27:00", "2011-01-29 10:28:00"]
  },
  "payload": "This is the actual document/value I want to store!"
}
Create a "meta-document" for each document and use Riak Links to link them up.
I think this is a little bit cleaner but if you need to retrieve these times often then this method may be too slow.
If you want to retrieve documents by time then I don't think Riak (or any other key/value datastores that I know of) will be the right datastore to use. SQL or possibly some BigTable system may be your only good option.
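For what it's worth, the first option (wrapping the value) amounts to storing something like this plain envelope, here sketched in Java with illustrative field names:

    import java.time.Instant;
    import java.time.LocalDate;

    /** Generic bitemporal envelope around an otherwise opaque payload (option 1 above). */
    public class BitemporalValue<T> {
        public final LocalDate validFrom;      // valid-time interval: [validFrom, validTo)
        public final LocalDate validTo;
        public final Instant txFrom;           // transaction-time interval: [txFrom, txTo)
        public final Instant txTo;
        public final T payload;                // the actual document/value to store

        public BitemporalValue(LocalDate validFrom, LocalDate validTo,
                               Instant txFrom, Instant txTo, T payload) {
            this.validFrom = validFrom;
            this.validTo = validTo;
            this.txFrom = txFrom;
            this.txTo = txTo;
            this.payload = payload;
        }

        /** True if the record was believed current at txTime for the given valid date. */
        public boolean coversAt(LocalDate validAt, Instant txTime) {
            return !validAt.isBefore(validFrom) && validAt.isBefore(validTo)
                && !txTime.isBefore(txFrom) && txTime.isBefore(txTo);
        }
    }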
I have written a small bitemporal, open-source database layer based on MongoDB:
https://github.com/1123/bitemporaldb
When storing Scala or Java objects, the object is wrapped into a generic bitemporal object with bitemporal meta-information (valid time, transaction time). It is then serialized to JSON and stored as BSON in MongoDB.
It handles temporal and non-temporal updates to objects transparently. Search by bitemporal context is possible.
Document-oriented databases are a good fit for bitemporal data, since document-oriented storage reduces the number of joins needed for data retrieval. Joins in a bitemporal context can be inefficient and hard to code by hand.
Feedback, contribution and feature-requests are very welcome.
To support a bitemporal (or temporal) data model, you need ACID transactions to perform the proper DML to update and insert records along two time dimensions (valid/effective time and transaction/system time); see the literature on temporal modeling for details.
Popular NoSQL databases like Cassandra, MongoDB and Couchbase, for example, don't have the ACID support needed to perform the record update/insert operations required for bitemporal record manipulation. In temporal and bitemporal databases, records must never overlap, and records must be properly terminated when superseded by succeeding valid/transaction-time records.
The MarkLogic NoSQL database claims support for bitemporal data, but I have never tried it and it is not open source. You can also roll your own solution by using an ACID database that effectively functions as a valid/transaction-time tracking journal and then using NoSQL for the actual data store.
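For illustration, the terminate-then-insert step usually looks something like the following inside a single ACID transaction (plain JDBC; the record_versions table and its columns are hypothetical):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.Timestamp;
    import java.time.Instant;

    public class BitemporalWriter {

        /** Supersede the current version of a record: close its transaction time, insert the new one. */
        public void supersede(Connection con, long recordId, String newPayload) throws Exception {
            con.setAutoCommit(false);                      // both statements commit or roll back together
            Timestamp now = Timestamp.from(Instant.now());
            try {
                try (PreparedStatement close = con.prepareStatement(
                        "UPDATE record_versions SET end_tt = ? WHERE record_id = ? AND end_tt IS NULL")) {
                    close.setTimestamp(1, now);
                    close.setLong(2, recordId);
                    close.executeUpdate();
                }
                try (PreparedStatement insert = con.prepareStatement(
                        "INSERT INTO record_versions (record_id, payload, start_tt, end_tt) VALUES (?, ?, ?, NULL)")) {
                    insert.setLong(1, recordId);
                    insert.setString(2, newPayload);
                    insert.setTimestamp(3, now);
                    insert.executeUpdate();
                }
                con.commit();
            } catch (Exception e) {
                con.rollback();                            // records must never overlap, so fail atomically
                throw e;
            }
        }
    }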
From Wikipedia:
"Bitemporal data is a concept used in a temporal database. It denotes both the valid time and transaction time of the data.
In a database table bitemporal data is often represented by four extra table-columns StartVT and EndVT, StartTT and EndTT. Each time interval is closed at its lower bound, and open at its upper bound."
So you can't just put these four values onto your data?

Is there a database with git-like qualities? [closed]

I'm looking for a database where multiple users can contribute and commit new data; other users can then pull that data into their own database repository, all in a git-like manner. A transcriptional database, if you like; does such a thing exist?
My current thinking is to dump the database to a single file as SQL, but that could well get unwieldy once it is of any size. Another option is to dump the database and use the filesystem, but again that gets unwieldy at any real size.
There's Irmin: https://github.com/mirage/irmin
Currently it's only offered as an OCaml API, but there are future plans for a GraphQL API and a Cap'n Proto one.
Despite the complex API and the still-scarce documentation, it allows you to plug in any backend (in-memory, Unix filesystem, Git in-memory and Git on-disk), so it even runs on unikernels and in browsers.
It also offers a bidirectional model where changes on the local Git repository are reflected in the application state and vice versa. With the complex API, you can operate at any Git level:
Append-only Blob storage.
Transactional/compound Tree layer.
Commit layer featuring chain of changes and metadata.
Branch/Ref/Tag layer (local-only, but also offers remotes) for mutability.
In the documentation, the immutable store generally refers to the blobs + trees + commits.
Thanks to the content-addressable storage inherited from Git, Irmin allows deduplication and thus reduced memory consumption. Some purely functional persistent data structures fit this database perfectly, and its 3-way merge is a novel, CRDT-style approach to handling merge conflicts.
Answer from: How can I put a database under version control?
I have been looking for the same feature for Postgres (or SQL databases in general) for a while, but I found no tools to be suitable (simple and intuitive) enough. This is probably due to the binary nature of how data is stored. Klonio sounds ideal but looks dead. Noms DB looks interesting (and alive). Also take a look at Irmin (OCaml-based with Git-properties).
Though this doesn't answer the Postgres part of the question, check out the Flur.ee database. It has a "time-travel" feature that allows you to query the data as of an arbitrary point in time. I'm guessing it should be able to work with a "branching" model.
This database was recently developed for blockchain purposes. Due to the nature of blockchains, the data needs to be recorded in increments, which is exactly how Git works. They are targeting an open-source release in Q2 2019.
Because each Fluree database is a blockchain, it stores the entire history of every transaction performed. This is part of how a blockchain ensures that information is immutable and secure.
It's not SQL, but CouchDB supports replicating the database and pushing/pulling changes between users in a way similar to what you describe.
There is more information in the chapter on replication in the O'Reilly CouchDB book.
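As a rough example, a one-off pull of someone else's changes goes through CouchDB's _replicate endpoint; the host names and database name below are placeholders:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class CouchPull {
        public static void main(String[] args) throws Exception {
            // Pull changes from a remote database into the local one (one-shot replication).
            String body = """
                {"source": "http://colleague.example.com:5984/projectdb",
                 "target": "projectdb",
                 "create_target": true}
                """;
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:5984/_replicate"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());            // replication status document
        }
    }

Continuous replication works the same way with an extra "continuous": true field in the request body.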
