Infinispan data persistence sharding?

CapeDwarf uses Infinispan to store data. In what way can Infinispan be configured to persist data on a node machine with a maximum amount of disk space? For example, if each server hosting CapeDwarf only has 1TB of mounted block storage, how do you configure Infinispan so that, if the "overall" data exceeds 1TB, it is "sharded" across different servers?
Running CapeDwarf, it stores Infinispan data in: $\CapeDwarf_WildFly_2.0.0.Final\standalone\data\infinispan\capedwarf

When using local storage (single file store, soft-index store or RocksDB store) in combination with a distributed cache, the data is already "sharded" based on ownership: each node will store approximately TOTAL_DATA / NUM_NODES * NUM_OWNERS.
For example, if you store 1GB of data on a 5-node cluster in a distributed cache with 2 owners (the default), each node would require approximately 400MB. As the data is not perfectly balanced, allow for a certain margin of difference (typically 10-15%).
Alternatively, you can use a shared store (JDBC, cloud) which would store the data externally.
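As a rough illustration (not the configuration CapeDwarf itself manages), a distributed cache with two owners and a node-local single file store can be declared programmatically along the following lines; the cache name and store location are placeholders, and the exact builder API may differ between Infinispan versions.

import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.configuration.global.GlobalConfigurationBuilder;
import org.infinispan.manager.DefaultCacheManager;

public class DistributedFileStoreExample {
    public static void main(String[] args) {
        // Clustered cache manager using the default transport settings.
        DefaultCacheManager manager = new DefaultCacheManager(
                GlobalConfigurationBuilder.defaultClusteredBuilder().build());

        ConfigurationBuilder builder = new ConfigurationBuilder();
        // Distributed cache: each entry lives on numOwners nodes, so each node
        // holds roughly TOTAL_DATA / NUM_NODES * NUM_OWNERS on its local disk.
        builder.clustering().cacheMode(CacheMode.DIST_SYNC).hash().numOwners(2);
        // Node-local single file store on the mounted 1TB volume (placeholder path).
        builder.persistence().addSingleFileStore().location("/data/infinispan/capedwarf");
        Configuration cfg = builder.build();

        manager.defineConfiguration("capedwarf-data", cfg);
        manager.getCache("capedwarf-data").put("key", "value");
        manager.stop();
    }
}

With 5 such nodes and 2 owners, the 1GB example above works out to roughly 400MB on each node's local store.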

Related

How do I configure multiple hard disks for a DolphinDB data node

Because the storage capacity of a hard disk is limited, I want to configure multiple storage hard disks for my DolphinDB data node. How should I configure them?
Please set the volumes parameter for the data node. If you would like to configure multiple disks, just put the paths there separated by commas.
volumes=/hdd/disk1,/hdd/disk2

How are sparse files handled in Google Cloud Storage?

We have a 200GB sparse file which is about 80GB in actual size (VMware disk).
How does Google calculate the space for this file, 200GB or 80GB?
What would be the best practice to store it in the Google Cloud using gsutil (similar to rsync -S)?
Would it be solved by using tar cSf and then uploading via gsutil? How slow could it be?
We have a 200GB sparse file which is about 80GB in actual size (VMware disk).
How does Google calculate the space for this file, 200GB or 80GB?
Google Cloud Storage does not introspect your files to understand what they are, so it's the actual size (80GB) that it takes on disk that matters.
What would be the best practice to store it in the Google Cloud using gsutil (similar to rsync -S)?
There's gsutil rsync, but it does not support -S, so that won't be very efficient. Also, Google Cloud Storage does not store files as blocks which can be accessed and rewritten randomly, but as blobs keyed by bucket name + object name, so you'll essentially be uploading the entire 80GB file every time.
One alternative you might consider is to use Persistent Disks, which provide block-level access to your files, with the following workflow:
One-time setup:
create a persistent disk and use it only for storage of your VM image
Pre-sync setup:
create a Linux VM instance with its own boot disk
attach the persistent disk in read-write mode to the instance
mount the attached disk as a file system
Synchronize:
use ssh+rsync to synchronize your VM image to the persistent disk on the VM
Post-sync teardown:
unmount the disk within the instance
detach the persistent disk from the instance
delete the VM instance
You can automate the setup and teardown steps with scripts so it should be very easy to run on a regular basis whenever you want to do the synchronization.
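If it helps, here is a rough sketch of such a script as a small Java program shelling out to the gcloud CLI (a newer tool than what was available when this answer was written); the instance, disk, zone, device path and mount point are placeholders, and the actual rsync-over-ssh synchronization step is left out.

import java.io.IOException;
import java.util.Arrays;

// Hypothetical automation of the pre-sync setup and post-sync teardown steps.
public class PersistentDiskSync {

    // Placeholder names -- substitute your own instance, disk and zone.
    static final String INSTANCE = "sync-vm";
    static final String DISK = "vm-image-disk";
    static final String ZONE = "us-central1-a";

    static void run(String... cmd) throws IOException, InterruptedException {
        System.out.println("$ " + String.join(" ", cmd));
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        if (p.waitFor() != 0) {
            throw new IOException("command failed: " + Arrays.toString(cmd));
        }
    }

    public static void main(String[] args) throws Exception {
        // Pre-sync setup: create a VM, attach the persistent disk read-write, mount it.
        run("gcloud", "compute", "instances", "create", INSTANCE, "--zone", ZONE);
        run("gcloud", "compute", "instances", "attach-disk", INSTANCE,
                "--disk", DISK, "--mode", "rw", "--zone", ZONE);
        // Assumes the attached disk shows up as /dev/sdb and is already formatted.
        run("gcloud", "compute", "ssh", INSTANCE, "--zone", ZONE,
                "--command", "sudo mkdir -p /mnt/image && sudo mount /dev/sdb /mnt/image");

        // Synchronize: run rsync over ssh from your machine to the instance here
        // (omitted -- depends on your image path and the instance's address).

        // Post-sync teardown: unmount, detach the disk, delete the instance.
        run("gcloud", "compute", "ssh", INSTANCE, "--zone", ZONE,
                "--command", "sudo umount /mnt/image");
        run("gcloud", "compute", "instances", "detach-disk", INSTANCE,
                "--disk", DISK, "--zone", ZONE);
        run("gcloud", "compute", "instances", "delete", INSTANCE, "--zone", ZONE, "--quiet");
    }
}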
Would it be solved by using tar cSf and then uploading via gsutil? How slow could it be?
The method above will be limited by your network connection, and would be no different from ssh+rsync to any other server. You can test this by, say, artificially throttling your bandwidth to another server on your own network to match your external upload speed, and running rsync over ssh against it.
Something not covered above is pricing, so I'll leave these pointers here, as cost may also be relevant to your analysis.
Using Google Cloud Storage mode, you'll incur:
Google Cloud Storage pricing: currently $0.026 / GB / month
Network egress (ingress is free): varies by total amount of data
Using the Persistent Disk approach, you'll incur:
Persistent Disk pricing: currently $0.04 / GB / month
VM instance: needs to be up only while you're running the sync
Network egress (ingress is free): varies by total amount of data
The actual amount of data you will download should be small, since that's what rsync is supposed to minimize, so most of the data will be uploaded rather than downloaded and your network cost should be low. That said, this depends on the actual rsync implementation, which I cannot speak for.
Hope this helps.

JSONStore Worklight - Size Limit

JSONStore provides us with a great way to sync data with a server and track changes a user makes while offline. Is there a limit on how much information can be saved in JSONStore? I found that the WebKit database has a limit of 5 MB, whereas an SQLite database has no limit. I'm also wondering whether JSONStore uses the WebKit database or SQLite to store its underlying information.
JSONStore ultimately stores information on the file system. The only bounds would be the space remaining on the device or any file size limits imposed by the device's operating system. We have easily created JSONStore instances that were hundreds and hundreds of MB on disk.

What methods are available to implement a local cache of a large DB driven data?

My company maintains a number of large time series databases of process data. We implement a replica of a subset at a pseudo-central location. I access the data from my laptop. Data access over our internal WAN, even to the pseudo-central server, is fairly expensive (in time).
I would like to cache data requests locally on my laptop so that when I access the same data a second time I actually pull it from a local DB.
There is a fairly ugly client-side DAO that I can wrap to maintain the cache, but I'm unsure how I can easily get the "official" client applications to talk to the cache. I have the freedom to write my own "client" graphing/plotting system, and I already have a custom data-mining application implemented. The custom application dumps data into .csv files that are moved around manually on a very ad-hoc basis.
What is the best approach to this sort of caching/synchronization? What tools could implement the cache?
For further info, I estimate the raw data set at approximately 5-8TB of raw time series data per year, with at least half of the data being very compressible. I only want to cache, say, a few hundred MB locally. Ad-hoc queries on the data tend to be very repetitive over very small chunks of the data.
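One way to structure the wrapping described above is a read-through cache around the existing DAO. Below is a minimal, purely illustrative sketch: the TimeSeriesDao interface, the query-key scheme and the in-memory map standing in for a local embedded database (and its eviction policy) are all hypothetical.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical read-through cache wrapped around an existing time-series DAO.
public class CachingDao {

    // Placeholder for the existing client-side DAO.
    interface TimeSeriesDao {
        double[] fetch(String tag, long startMillis, long endMillis);
    }

    private final TimeSeriesDao remote;
    // Stand-in for a local store; a real implementation would persist chunks to a
    // local embedded database and evict old ones to stay within a few hundred MB.
    private final Map<String, double[]> localCache = new ConcurrentHashMap<>();

    public CachingDao(TimeSeriesDao remote) {
        this.remote = remote;
    }

    public double[] fetch(String tag, long startMillis, long endMillis) {
        String key = tag + ":" + startMillis + ":" + endMillis;
        // Repeated queries for the same chunk are served locally;
        // only a cache miss goes over the WAN.
        return localCache.computeIfAbsent(key,
                k -> remote.fetch(tag, startMillis, endMillis));
    }
}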

What happens to distributed in-memory cloud databases such as Hazelcast and Scalaris if there is more data to store than RAM in the cluster?

What happens to distributed in-memory cloud databases such as
Hazelcast
Scalaris
if there is more data to store than RAM in the cluster?
Are they going to swap? What if the swap space is full?
I can't see a disaster recovery strategy for either database! Maybe all data is lost if the memory is full?
Is there a way to write things to the hard disk when memory runs out?
Are there other databases out there that offer the same functionality as Hazelcast or Scalaris, with backup features / HDD storage / disaster recovery?
I don't know what the state of affairs was when the accepted answer by Martin K. was published, but the Scalaris FAQ now claims that this is supported.
Can I store more data in Scalaris than RAM + swap space is available in the cluster?
Yes. We have several database backends, e.g. src/db_ets.erl (ets) and src/db_tcerl (tokyocabinet). The former uses the main memory for storing data, while the latter uses tokyocabinet for storing data on disk. With tokyocabinet, only your local disks should limit the total size of your database. Note, however, that this still does not provide persistency. For instructions on switching the database backend to tokyocabinet, see Tokyocabinet.
According to the Hazelcast and Scalaris teams, writing more data than the available RAM isn't supported by either.
The Hazelcast team is going to write a flat-file store in the near future.
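For background on the Hazelcast side: Hazelcast exposes a MapStore/MapLoader extension point that lets you plug your own disk-backed store behind a map. It is not the built-in flat-file store mentioned above, and on its own it does not change the in-memory limits discussed in this answer, but it shows where such a store hooks in. A minimal sketch; the map name and storage directory are placeholders, and the package of MapStore varies by Hazelcast version (com.hazelcast.core in older releases).

import com.hazelcast.config.Config;
import com.hazelcast.config.MapConfig;
import com.hazelcast.config.MapStoreConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.MapStore;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

// Hypothetical one-file-per-entry store; the directory is a placeholder.
public class SimpleFileMapStore implements MapStore<String, String> {

    private final Path dir = Path.of("/var/hazelcast-store");

    @Override public void store(String key, String value) {
        try {
            Files.createDirectories(dir);
            Files.writeString(dir.resolve(key), value);
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }
    @Override public void storeAll(Map<String, String> map) { map.forEach(this::store); }
    @Override public void delete(String key) {
        try { Files.deleteIfExists(dir.resolve(key)); }
        catch (IOException e) { throw new UncheckedIOException(e); }
    }
    @Override public void deleteAll(Collection<String> keys) { keys.forEach(this::delete); }
    @Override public String load(String key) {
        try {
            Path f = dir.resolve(key);
            return Files.exists(f) ? Files.readString(f) : null;
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }
    @Override public Map<String, String> loadAll(Collection<String> keys) {
        Map<String, String> result = new HashMap<>();
        for (String k : keys) {
            String v = load(k);
            if (v != null) result.put(k, v);
        }
        return result;
    }
    @Override public Iterable<String> loadAllKeys() { return null; } // no pre-loading

    public static void main(String[] args) {
        Config config = new Config();
        config.addMapConfig(new MapConfig("disk-backed-map")
                .setMapStoreConfig(new MapStoreConfig()
                        .setEnabled(true)
                        .setImplementation(new SimpleFileMapStore())));
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        hz.getMap("disk-backed-map").put("greeting", "hello");
        hz.shutdown();
    }
}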
