I'm looking for a distributed file (or other storage) system for managing a very large number of mutable documents. Each document can be rather large (1-100MB). Some reads need to be guaranteed to be working from the latest data, and some can be read from eventually-consistent replicated data. Each document could be a self-contained file (say, a SQLite database or other custom file format).
For optimal performance, writes for different documents must be able to happen on different nodes of the distributed file system. In other words, server A is the master for document 1 and server B replicates it, while server B is the master for document 2 and server A replicates it. For my application, a single server is not going to be able to handle all of the write traffic for the whole system, so having a single master for all data is not acceptable.
Each document should be replicated across some number of servers (say, 3). So if I have 1000 documents and 10 servers, each server would have a copy of 300 documents, and be the master for 100 of those. Ideally, the cluster would automatically promote servers to be masters for documents whose master server had crashed, and re-balance the storage load as new servers are added to the cluster.
I realize this is a pretty tall order... is there something available that meets most of my core needs?
I think HDFS would fit the criteria you listed above.
Hypothetical scenario - I have a 10 node Greenplum cluster with 100 TB of data in 1000 tables that needs to be offloaded to S3 for reasons. Ideally, the final result is a .csv file that corresponds to each table in the database.
I have three possible approaches, each with positives and negatives.
COPY - There is a question that already answers the how, but the problem with psql COPY in a distributed architecture is that everything has to go through the master, creating a bottleneck for moving 100 TB of data.
gpcrondump - This would create 10 files per table in a tab-delimited format, which would require some post-gpcrondump ETL to unite the files into a single .csv, but it takes full advantage of the distributed architecture and automatically logs successful/failed transfers.
EWT - Takes advantage of the distributed architecture and writes each table to a single file without holding it in local memory until the full file is built, yet it will probably be the most complicated of the scripts to write because the ETL has to be implemented as part of the dump rather than separately afterwards.
All the options are going to have different issues with table locks as we move through the database and figuring out which tables failed so we can re-address them for a complete data transfer.
Which approach would you use and why?
I suggest you use the S3 protocol.
http://www.pivotalguru.com/?p=1503
http://gpdb.docs.pivotal.io/43160/admin_guide/load/topics/g-s3-protocol.html
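A rough sketch of what this could look like, assuming your Greenplum version's s3 protocol supports writable external tables and that the protocol is already enabled with a config file on the segment hosts (the table, bucket, and path names below are made up):

    -- Writable external table backed by the s3 protocol; each segment writes
    -- its share of the data directly to the bucket in parallel, bypassing the
    -- master bottleneck you get with psql COPY.
    CREATE WRITABLE EXTERNAL TABLE sales_s3_out (LIKE sales)
        LOCATION ('s3://s3-us-west-2.amazonaws.com/my-bucket/greenplum/sales/ config=/home/gpadmin/s3.conf')
        FORMAT 'CSV';

    INSERT INTO sales_s3_out SELECT * FROM sales;

Note that each segment still writes its own file(s) to the bucket, so if a single .csv per table is a hard requirement you would need a small post-step to concatenate them, much like with the gpcrondump option.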
I'm setting up a ClickHouse server in a cluster, but one thing that doesn't appear in the documentation is how to manage a very large amount of data. It says it can handle up to petabytes of data, but you can't store that much on a single server; you will usually have a few terabytes in each.
So my question is: how do I handle storing data on one node of the cluster and then, when it requires more space, add another node? Will it handle the distribution to the new server automatically, or will I have to play with the weights in the shard distribution?
When you have more than 1 disk in one server, how can it use them all to store the data?
Is there a way to store very old data in the cloud and download it if needed? For example, all data older than 2 years could be stored in Amazon S3, as it will hardly ever be requested, and if it is, the longer retrieval time wouldn't be a problem.
What solution would you suggest for this? Handling an ever-expanding database to avoid disk space issues in the future.
Thanks
I will assume that you use standard configuration for the ClickHouse cluster: several shards consisting of 2-3 replica nodes, and on each of these nodes a ReplicatedMergeTree table containing data for its respective shard. There are also Distributed tables created on one or more nodes that are configured to query the nodes of the cluster (relevant section in the docs).
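As a rough sketch of that layout (the cluster name, ZooKeeper path, and schema below are made up, and assume the usual {shard}/{replica} macros are defined in each node's config):

    -- On every node: the local, replicated table holding that shard's data.
    CREATE TABLE events_local (
        event_date Date,
        event_id   UInt64,
        payload    String
    ) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
    PARTITION BY toYYYYMM(event_date)
    ORDER BY (event_date, event_id);

    -- On the node(s) you query: a Distributed table that fans out to all shards.
    CREATE TABLE events_all AS events_local
    ENGINE = Distributed(my_cluster, default, events_local, rand());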
When you add a new shard, old data is not moved to it automatically. The recommended approach is indeed to "play with the weights" as you put it, i.e. increase the weight of the new shard until the volume of data evens out. But if you want to rebalance the data immediately, you can use the ALTER TABLE RESHARD command. Read the docs carefully and keep in mind the various limitations of this command, e.g. it is not atomic.
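For example, if the existing shards hold roughly equal volumes and you add a new one, giving the new shard a higher weight (say 2 or 3, versus 1 for the others) in the cluster configuration means a proportionally larger share of newly inserted rows lands on it; once the volumes have evened out you can set all weights back to 1.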
When you have more than 1 disk in one server, how can it use them all to store the data?
Please read the section on configuring RAID in the administration tips.
Is there a way to store very old data in the cloud and download it if needed? For example, all data older than 2 years could be stored in Amazon S3, as it will hardly ever be requested, and if it is, the longer retrieval time wouldn't be a problem.
MergeTree tables in ClickHouse are partitioned by month. You can use the ALTER TABLE DETACH/ATTACH PARTITION commands to manipulate partitions. For example, at the start of each month you can detach the partition for some older month and back it up to Amazon S3. Or you can set up a cluster of cheaper machines with ample disk space and manually move old partitions there. If your queries always include a filter on date, irrelevant partitions will be skipped automatically; otherwise you can set up two Distributed tables: table_recent and table_all (with the cluster config for table_all including the nodes that hold the old partitions).
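A rough sketch of the detach/attach cycle (table and partition names are made up; the copy to and from S3 itself happens outside ClickHouse, e.g. with the AWS CLI):

    -- Detach the partition for the month you want to archive; its parts move to
    -- the table's detached/ subdirectory and are no longer visible to queries.
    ALTER TABLE events_local DETACH PARTITION 201601;

    -- After copying the detached parts to S3 you can delete them locally.
    -- To restore later, place the files back under detached/ and run:
    ALTER TABLE events_local ATTACH PARTITION 201601;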
Version 19.15 introduced multi-disk storage configuration, and 20.1 introduced time-based (TTL) moves of data between disks and volumes.
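A minimal sketch of such a TTL-based move, assuming a storage policy with a 'cold' volume (cheaper/larger disks) has already been configured on the servers and assigned to the table (the names are made up):

    -- Parts whose data is older than two years are moved to the 'cold' volume
    -- automatically in the background.
    ALTER TABLE events_local
        MODIFY TTL event_date + INTERVAL 2 YEAR TO VOLUME 'cold';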
My exposure to NoSQL or NewSQL/NeoSQL database servers is extremely limited and only theoretical. I've worked with traditional RDBMSs (like MySQL and Postgres) and a directory server (OpenLDAP), with and without replication.
My application stack is based on JBoss, and I've been tasked with setting up a minimal demo (with our application) that can demonstrate durability and high availability of data in VoltDB. Performance testing is not an objective at all.
I have been going through the VoltDB Planning Guide, but I am confused about the "+1" versus "x2" in terms of the number of servers (or VoltDB instances) required, especially given these two statements:
The easiest way to size hardware for a K-Safe cluster is to size the initial instance of the database, based on projected throughput and capacity, then multiply the number of servers by the number of replicas you desire (that is, the K-Safety value plus one).
Rule of Thumb: When using K-Safety, configure the number of cluster nodes as a whole multiple of the number of copies of the database (that is, K+1).
Questions:
1. Now, let's say that I need 1 server given capacity/throughput requirements. So, to be able to have durability and high availability, do I need 2, 3, or 4 servers?
2. OTOH, using just 1 server, which key features of VoltDB would I have to forgo?
3. Is there any relationship (or conflict) between VoltDB's full disk persistence and snapshots? Say, does the availability of disk persistence remove the need for snapshots?
If you use 2 servers, you can keep a synchronous replica of data to protect from data loss, much like a RAID1 hard drive. Your data is double-safe, but there is a catch with availability. With only two servers, it's impossible to differentiate a network split from a failed node. In some cases, VoltDB will shut down a live node when another fails to ensure there will be no split brain. With 3 nodes, this won't be an issue and the cluster will remain available after any single node failure (with k=1 or k=2).
With just 1 server, all you lose is the multiple copies of data on multiple servers and the high-availability features that allow VoltDB to continue running after a node failure. You still have all of the other VoltDB features, including full disk persistence.
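To make the sizing concrete for the first question: if one server meets your capacity/throughput needs and you want K=1 (two copies of every partition), the planning-guide multiplication gives 1 x 2 = 2 servers, but for the split-brain reason above a 3-node cluster with K=1 is the safer minimum for high availability.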
I have inherited a legacy content delivery system and I need to re-design & re-build it. The content is delivered by content suppliers (e.g. Sony Music) and is ingested by a legacy .NET app into a SQL Server database.
Each content item has some common properties (e.g. Title and Artist Name) as well as some content-type-specific properties (e.g. bit rate for MP3 files and frame rate for video files).
This information is stored in a relational database across multiple tables. These tables may have null values in some of their fields because a given field may not apply to a particular piece of content. The database is constantly under write load because the ingestion system is continually receiving content files from the suppliers and adding their metadata to the database.
Also, there is a public-facing web application which lets end users buy the ingested content (e.g. music, videos, etc.). This web application relies entirely on an Elasticsearch index; in fact, it does not see the database at all and uses the Elasticsearch index as its source of data. The reason is that SQL Server does not perform text search as fast or as efficiently as Elasticsearch.
To keep the database and Elasticsearch in sync, there is a Windows service which reads the updates from SQL Server and writes them to the Elasticsearch index!
As you can see there are a few problems here:
The data is saved in a relational database, which makes it hard to manage. For example, there is a table of 3 billion records that stores the metadata of each content item as key-value pairs! To me, using a NoSQL database or index would make a lot more sense, as they allow storing documents with different structures.
The Elasticsearch index needs to be kept in sync with the database. If the Windows service stops working for any reason, the index does not get updated. Also, when there are too many inserts/updates in the database, it takes a while for the index to catch up.
We need to maintain two sources of data, which adds cost overhead.
Now my question: is there a NoSQL database with these characteristics?
1. Allows me to store documents with different structures in it?
2. Provides good text-search functionality and performance, e.g. fuzzy search?
3. Allows multiple updates to be made to its data concurrently? Based on my experience, Elasticsearch has problems with concurrent updates.
4. Can be installed and used on Amazon AWS infrastructure, because our new products will be hosted on AWS. Auto-scaling and clustering are important, e.g. DynamoDB.
5. Has some kind of GUI so that support staff or developers can modify the data to some extent.
A combination of DynamoDB and ElasticSearch may work for your use case.
DynamoDB certainly supports characteristics 1, 3, 4, and 5.
There is now a Logstash Input Plugin for DynamoDB that can be combined with an ElasticSearch output plugin to keep your table and index in sync in real time. ElasticSearch provides characteristic 2.
I have an application deployed into multiple zones and there are some issues with opening larger documents (20-50MB) across the WAN.
Currently the documents are stored in Zone 1 (Americas), and a link to each document is stored in the database.
I have heard some things about BLOBs in Oracle and storing binary data in MS SQL Server 2005, and then perhaps copying the database to the other zones.
Any other suggestions or good results with one of the described options?
Your best option here may be caching the document in the requesting zone the first time it is requested, and pinging the source document's last-modified timestamp each time the cached document is requested in order to determine whether it needs to be refreshed. In this case you're only requesting a small piece of information (a date) across the WAN most of the time the document is accessed. This works best for a subset of documents that are frequently requested.
If you have a large set of documents, each infrequently requested by a disparate group, then you may want to look into replicating the documents in each of your zones each time the master is updated. This may best be accomplished by storing the document as binary data in your master database and having the slaves pull from the master.
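If you do store the documents in the master database and replicate them out, a minimal sketch of the storage table (SQL Server 2005 syntax; the names are made up), which also supports the last-modified check described above:

    CREATE TABLE Documents (
        DocumentId   INT IDENTITY(1,1) PRIMARY KEY,
        FileName     NVARCHAR(260)  NOT NULL,
        LastModified DATETIME       NOT NULL,   -- cheap to ping across the WAN
        Content      VARBINARY(MAX) NOT NULL    -- the 20-50MB document itself
    );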
If you're running on Windows, you could look at the Distributed File System (DFS).