Is it trivial? I will be using Bitcask and file backups (of the files on each node).
Let's say my initial ring size is 256 with 16 nodes. Now if I am required to expand to a ring of 1024, can I set up 16 new instances configured with a ring size of 1024, copy the backup files from the old cluster onto these 16 new instances, and start Riak up? Will Riak be able to pick up this old data?
I guess not, since the partition ids and their mapping to individual nodes may also change once the ring size is changed. But what other way is there? Will riak-backup work in this case (when the ring size changes)?
I just want to know that the choice I've made is future-proof enough. Obviously at some point, when the requirements change drastically or the user base balloons, the entire architecture might need to be changed. But I do hope to be able to make these sorts of changes (to the ring size) at some point - naturally with SOME effort involved, but without it being impossible.
Migrating a cluster to a different ring size is difficult to do with node-based file backups (meaning, if you just back up the /data directories on each node, as recommended in Backing Up Riak). As you've suspected, the backend data files depend on the mapping of nodes and partitions for a given ring size.
What should you do instead?
You have to use "logical" backups of the entire cluster, using one of these two tools:
riak-admin backup and restore (which does in fact work with clusters of different ring sizes), or
the Riak Data Migrator
Using either one basically dumps the contents of the entire cluster into one location (so be careful not to run out of disk space, obviously), which you can then transfer and restore to your new cluster with a different ring size.
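For reference, the classic invocation looks roughly like this; the node name, Erlang cookie, and paths are placeholders, and the exact arguments vary between Riak versions, so check the docs for the one you run:
riak-admin backup riak@192.168.1.10 riak /backups/cluster.bak all
riak-admin restore riak@10.0.0.1 riak /backups/cluster.bak
The first command dumps every node's data into one file; the second replays it into the new cluster, which then redistributes the keys across the new ring.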
Things to watch out for:
Only do backups of non-live clusters. Meaning, either take the cluster down, or at least make sure no new writes are happening to the old cluster while the backup is taking place. Otherwise, if you start the backup while new writes are still coming in, there is no guarantee that they'll make it into the backed-up data set.
Be sure to transfer the app.config and custom bucket settings to the new cluster before doing backup/restore.
Hopefully this helps. So, it's not trivial (meaning, it'll take a while and require a lot of disk space, but that's true whenever you're transferring large amounts of data), but it's not extremely complicated either.
I know this is an old question, but with Riak 2.x it is now possible to resize the ring dynamically without shutting down the cluster:
riak-admin cluster resize-ring <new_size>
riak-admin cluster plan
riak-admin cluster commit
Note: The size of a Riak ring should always be a power of two (2^n), e.g. 16, 32, 64, etc.
http://docs.basho.com/riak/latest/ops/advanced/ring-resizing/
When someone says "I have a cluster size of 6TB", what do they mean?
In terms of database size? Or in terms of the amount of data being processed at any given time? I don't know what this means.
That may depend on the context.
If someone mentions 6TB, they are probably talking about storage.
Strictly speaking, a cluster relates to the chunks in which data is stored, where related data should be stored in the same cluster.
To understand a database cluster you should first understand a file system cluster, which may be 4, 8, 16, 32, 64... KB, for example, and in which your data is stored in one or many clusters. For example, if your file system is intended to store big files, then your cluster size should be bigger and performance is improved. If you want to store small files, then your cluster should be smaller in order to optimize space usage.
In a database, your goal should be to store related data in the same cluster; in that case performance is optimized.
Anyway, 6TB is not a cluster size that makes sense here, so this is probably the storage space (or the sum of several storage volumes).
You may want to check the following doc to get an idea:
https://docs.oracle.com/database/121/ADMIN/clustrs.htm#ADMIN018
We're adding a new datacenter to our Cassandra cluster. Currently, we have a 15-node DC with RF=3 resulting in about 50TB~ of data.
We are adding another datacenter in a different country and we want both data centers to contain all the data. Obviously, synchronizing 50TB of data across the internet will take a gargantuan amount of time.
Is it possible to copy a full backup to a few disks, ship those to the new DC and then recover from them? I'm just wondering what the procedure to do so would be.
Could someone give me a few pointers on this operation, if possible at all?
Or any other tips?
Our new DC is going to be smaller (6 nodes) for the time being, although enough space will be available. The new DC is mostly meant as a live-backup/failover and will not be the primary cluster for writing, generally speaking.
TL;DR: Due to the topology (node count) change between the two DCs, avoiding streaming the data in isn't possible, AFAIK.
Our new DC is going to be smaller (6 nodes) for the time being
The typical process isn't going to work due to the token alignment on the nodes being different (the new cluster's ring will change). So just copying the existing SSTables won't work: the nodes that hold those tables might not have the tokens corresponding to the data in the files, and so C* won't be able to find said data.
Bulk loading the data to the new DC is out too, as you'll be overwriting the old data if you re-insert it.
To give you an overview of the process if you were to retain the topology:
Snapshot the data from the original DC.
Configure the new DC. It's extremely important that you set initial_token for each machine. You can get a list of the tokens you need by running nodetool ring on the original cluster. This is why you need the same number of nodes. Just as importantly, when copying the SSTable files over, the files and the tokens need to come from the same node (see the sketch after this list).
Ship the data to the new DC. (Remember, if the new node 10.0.0.1 got its tokens from 192.168.0.100 in the old DC, then it also has to get its snapshot data from 192.168.0.100.)
Start the new DC and ensure both DCs see each other OK.
Rebuild and repair system_distributed and system_auth (assuming you have authentication enabled)
Update client consistency to whatever you need. (Do you want to write to both DCs? From your description sounds like a no so you might be all good).
Update the schema: ensure that you're using NetworkTopologyStrategy for any keyspace that you want to be shared, then add replication for the new DC.
ALTER KEYSPACE ks WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'oldDC' : 3, 'newDC':3 };
Run a full repair on each node in the new DC.
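To make the token-matching step above concrete, here is a rough sketch; addresses, the keyspace name, and the snapshot tag are placeholders, and this only applies if the node counts match:
# On each node of the original DC: note the tokens it owns, then snapshot its data
nodetool ring | grep 192.168.0.100
nodetool snapshot -t dc_migration my_keyspace
# On the matching new node, before its first start, pin the same tokens in cassandra.yaml:
# initial_token: <comma-separated tokens copied from 192.168.0.100>
# Copy the snapshot SSTables into the corresponding table directories on that node, then,
# once both DCs see each other and the keyspace replication has been altered:
nodetool repair -full system_distributed
nodetool repair -full system_auth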
Can you please tell me which data structure InfluxDB has and which data model InfluxDB uses? Is it a key-value model? I read the full documentation and I didn't catch that.
Thank you in advance!
1. Data model and terminology
An InfluxDB database stores points. A point has four components: a measurement, a tagset, a fieldset, and a timestamp.
The measurement provides a way to associate related points that might have different tagsets or fieldsets. The tagset is a dictionary of key-value pairs to store metadata with a point. The fieldset is a set of typed scalar values—the data being recorded by the point.
The serialization format for points is defined by the line protocol (which includes additional examples and explanations if you’d like to read more detail). An example point from the specification helps to explain the terminology:
temperature,machine=unit42,type=assembly internal=32,external=100 1434055562000000035
The measurement is temperature.
The tagset is machine=unit42,type=assembly. The keys, machine and type, in the tagset are called tag keys. The values, unit42 and assembly, in the tagset are called tag values.
The fieldset is internal=32,external=100. The keys, internal and external, in the fieldset are called field keys. The values, 32 and 100, in the fieldset are called field values.
Each point is stored within exactly one database within exactly one retention policy. A database is a container for users, retention policies, and points. A retention policy configures how long InfluxDB keeps points (duration), how many copies of those points are stored in the cluster (replication factor), and the time range covered by shard groups (shard group duration). The retention policy makes it easy for users (and efficient for the database) to drop older data that is no longer needed. This is a common pattern in time series applications.
We’ll explain replication factor, shard groups, and shards later when we describe how the write path works in InfluxDB.
There’s one additional term that we need to get started: series. A series is simply a shortcut for saying retention policy + measurement + tagset. All points with the same retention policy, measurement, and tagset are members of the same series.
You can refer to the documentation glossary for these terms or others that might be used in this blog post series.
2. Receiving points from clients
Clients POST points (in line protocol format) to InfluxDB’s HTTP /write endpoint. Points can be sent individually; however, for efficiency, most applications send points in batches. A typical batch ranges in size from hundreds to thousands of points. The POST specifies a database and an optional retention policy via query parameters. If the retention policy is not specified, the default retention policy is used. All points in the body will be written to that database and retention policy. Points in a POST body can be from an arbitrary number of series; points in a batch do not have to be from the same measurement or tagset.
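As an illustration, the example point from section 1 could be posted to a local InfluxDB 1.x instance like this (the database name mydb is a placeholder):
curl -i -XPOST 'http://localhost:8086/write?db=mydb' --data-binary 'temperature,machine=unit42,type=assembly internal=32,external=100 1434055562000000035'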
When the database receives new points, it must (1) make those points durable so that they can be recovered in case of a database or server crash and (2) make the points queryable. This post focuses on the first half, making points durable.
3. Persisting points to storage
To make points durable, each batch is written and fsynced to a write-ahead log (WAL). The WAL is an append-only file that is only read during database recovery. For space and disk-IO efficiency, each batch in the WAL is compressed using snappy compression before being written to disk.
While the WAL format efficiently makes incoming data durable, it is an exceedingly poor format for reading, making it unsuitable for supporting queries. To allow new data to be queried immediately, incoming points are also written to an in-memory cache. The cache is an in-memory data structure that is optimized for query and insert performance. The cache data structure is a map of series to a time-sorted list of fields.
The WAL makes new points durable. The cache makes new points queryable. If the system crashes or shuts down before the cache is written to TSM files, it is rebuilt when the database starts by reading and replaying the batches stored in the WAL.
The combination of WAL and cache works well for incoming data but is insufficient for long-term storage. Since the WAL must be replayed on startup, it is important to constrain it to a reasonable size. The cache is limited to the size of RAM, which is also undesirable for many time series use cases. Consequently, data needs to be organized and written to long-term storage blocks on disk that are size-efficient (so that the database can store a lot of points) and efficient for query.
Time series queries are frequently aggregations over time: scans of points within a bounded time range that are then reduced by a summary function like mean, max, or moving windows. Columnar database storage techniques, where data is organized on disk by column and not by row, fit this query pattern nicely. Additionally, columnar systems compress data exceptionally well, satisfying the need to store data efficiently. There is a lot of literature on column stores; Columnar-oriented Database Systems is one such overview.
Time series applications often evict data from storage after a period of time. Many monitoring applications, for example, will store the last month or two of data online to support monitoring queries. It needs to be efficient to remove data from the database if a configured time-to-live expires. Deleting points from columnar storage is expensive, so InfluxDB additionally organizes its columnar format into time-bounded chunks. When the time-to-live expires, the time-bounded file can simply be deleted from the filesystem rather than requiring a large update to persisted data.
Finally, when InfluxDB is run as a clustered system, it replicates data across multiple servers for availability and durability in case of failures.
The optional time-to-live duration, the granularity of time blocks within the time-to-live period, and the number of replicas are configured using an InfluxDB retention policy:
CREATE RETENTION POLICY <retention_policy_name> ON <database_name> DURATION <duration> REPLICATION <n> [SHARD DURATION <duration>] [DEFAULT]
The duration is the optional time to live (if data should not expire, set duration to INF). SHARD DURATION is the granularity of data within the expiration period. For example, a one-hour shard duration with a 24-hour duration configures the database to store 24 one-hour shards. Each hour, the oldest shard is expired (removed) from the database. Set REPLICATION to configure the replication factor: how many copies of a shard should exist within a cluster.
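For example, the 24-hour policy with one-hour shards described above could be created like this (the policy and database names are placeholders):
CREATE RETENTION POLICY "one_day" ON "mydb" DURATION 24h REPLICATION 1 SHARD DURATION 1h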
Concretely, the database creates this physical organization of data on disk:
Database directory: /db
  Retention policy directory: /db/rp
    Shard group (time-bounded, logical grouping)
      Shard directory: /db/rp/<shard id>
        TSM0001.tsm (data file)
        TSM0002.tsm (data file)
        …
The in-memory cache is flushed to disk in the TSM format. When the flush completes, flushed points are removed from the cache and the corresponding WAL is truncated. (The WAL and cache are also maintained per-shard.) The TSM data files store the columnar-organized points. Once written, a TSM file is immutable. A detailed description of the TSM file layout is available in the InfluxDB documentation.
4. Compacting persisted points
The cache is a relatively small amount of data. The TSM columnar format works best when it can store long runs of values for a series in a single block. A longer run both compresses better and reduces the seeks needed to scan a field for a query. The TSM format is based heavily on log-structured merge trees. New (level one) TSM files are generated by cache flushes. These files are later combined (compacted) into level two files. Level two files are further combined into level three files. Additional levels of compaction occur as the files become larger and eventually become cold (the time range they cover is no longer hot for writes). The documentation reference above offers a detailed description of compaction.
There’s a lot of logic and sophistication in the TSM compaction code. However, the high-level goal is quite simple: organize values for a series together into long runs to best optimize compression and scanning queries.
Refer: https://www.influxdata.com/blog/influxdb-internals-101-part-one/
It is essentially key-value, the key being time, where the value can be one or more fields/columns. Values can also optionally be indexed columns, called tags in InfluxDB, that are optimised for searching along with time, which is always required. At least one non-indexed value is required.
See schema design documentation for more details.
Much like Cassandra, in fact, though Influx essentially builds the schema on write, while developers have to define the schema for Cassandra.
Storage-engine-wise it is again very similar to Cassandra, using a variation of the SSTables used in Cassandra, optimised for time series data.
I am not sure if the following influx document was there when you were looking for your answer:
https://docs.influxdata.com/influxdb/v1.5/concepts/key_concepts/
But it really helped me understanding the data structure of influxdb.
I have a website where users can submit text messages, dead simple data structure...
Name <-- Less than 20 characters
Message <-- Max 150 characters
Timestamp
IP
Hidden <-- Bool (True or False)
On the previous version of the website they are stored in a MySQL database which is very big, with lots of tables, and I want to simplify the database. So I heard Redis is good for simple data structures and non-relational information...
Would Redis be a good option for this kind of data, and how would it perform, in terms of memory usage and read times, when talking about 100,000+ records a year?
Redis is really only good for in-memory problem sets. It DOES have a page-to-disk capability - but then you're at the mercy of the OS swapper - namely, your RAM will be in competition with system caches. Also, I think the keys always have to fit in RAM. So you're NOT going to want to store 1G+ of log records - a mysql archive table is MUCH better for that.
Redis has master-slave functionality, similar to MySQL. So you can perform various tricks, such as sorting on a slave, to keep the master responsive. While I haven't used it, I'd speculate that for in-memory databases mysql-cluster is probably far more advanced - but with correspondingly extra complexity / resource costs.
If you have large values for your key-value set, you can perform client-side compression/decompression. There isn't much the server can do to search on the values of those 'blobs' anyway.
One common way to get around the RAM limitation is to perform client-side sharding (partitioning). Namely, if you KNOW your upper bounds, and you don't have enough RAM to throw at the problem for some reason (say you already have 64GB of RAM), then you could 'shard' based on the primary key. If it's a sequence counter, you could take the bottom 3 bits (or some hashing function + partition function) and distribute amongst 4, 8, 16, etc. server nodes. That scales linearly, though if you need to re-partition, that could be painful. You COULD take advantage of the 'slots' in redis to start off with fewer machines: say 1 machine with 16 slots. Then later, dump slots 7-15, restore them on a different machine, and remap all the clients to point to the two machines (with the same slot numbers). And so forth to 16-way sharding. At which point, you'd need to remap ALL your data to go 32-way.
Obviously, first evaluate the command set of redis to see if ALL your data-storage and reporting needs can be met. There are equivalents to "select * from foo for update", but they're not obvious. Not all RDBMS queries can be reproduced efficiently with key-value stores. But for simple natural-primary-key record structures it should do fine.
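As a sketch of how the message structure from the question could be laid out: a hash per message plus a sorted set as a time index (key names are made up; HSET with multiple field/value pairs needs Redis 4.0+, use HMSET on older versions):
HSET message:1001 name "Alice" message "Hello there" timestamp 1434055562 ip "203.0.113.7" hidden 0
ZADD messages:by_time 1434055562 message:1001
ZREVRANGE messages:by_time 0 9
The last command returns the ids of the ten newest messages; filtering on hidden would happen client-side or via a second index.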
Additionally, it should be easy to extend the redis command set to perform custom operations. Just keep in mind, it's designed around no-pause single-threaded execution (it avoids locking / context-switching overhead).
But the things I really like are the FIFOs, pub/sub, data time-outs, atomic mutations (inc/dec), lazy sorting (e.g. on a client with read-only nodes), and maps of maps. It's simple enough that instead of using namespaces, you just launch separate redis processes on different ports / UNIX sockets (my preference, if possible).
It's meant to replace memcached more than anything else, but it has a very nice background persistence framework.
I've been playing around with database programming lately, and I noticed something a little bit alarming.
I took a binary flat file saved in a proprietary, non-compressed format that holds several different types of records, built schemas to represent the same records, and uploaded the data into a Firebird database. The original flat file was about 7 MB. The database is over 70 MB!
I can understand that there's some overhead to describe the tables themselves, and I've got a few minimal indices (mostly PKs) and FKs on various tables, and all that is going to take up some space, but a factor of 10 just seems a little bit ridiculous. Does anyone have any ideas as to what could be bloating up this database so badly, and how I could bring the size down?
From Firebird FAQ:
Many users wonder why they don't get their disk space back when they delete a lot of records from a database.
The reason is that reclaiming it is an expensive operation; it would require a lot of disk writes and memory - just like defragmenting a hard disk partition. The parts of the database (pages) that were used by such data are marked as empty, and Firebird will reuse them the next time it needs to write new data.
If disk space is critical for you, you can get the space back by doing a backup and then a restore. Since you're doing the backup only to restore right away, it's wise to use the "inhibit garbage collection" (or "don't use garbage collection") switch (-g in gbak), which will make the backup go A LOT FASTER. Garbage collection is used to clean up your database, and as it is a maintenance task, it's often done together with backup (since backup has to go through the entire database anyway). However, you're soon going to ditch that database file, and there's no need to clean it up.
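For illustration, such a backup/restore cycle might look like this (file names are placeholders; -b backs up, -g skips garbage collection, -c restores into a new file):
gbak -b -g mydatabase.fdb mydatabase.fbk
gbak -c mydatabase.fbk mydatabase_restored.fdb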
gstat is the tool to examine table sizes etc.; maybe it will give you some hints about what's using the space.
In addition, you may also have multiple snapshots or other garbage in the database file; it depends on how you add data to the database. The database file never shrinks automatically, but a backup/restore cycle gets rid of junk and empty space.
Firebird fills pages to a certain fill factor, not completely.
E.g. a database page can contain 70% data and 30% free space, to speed up future record updates and deletes without moving records to a new database page.
CONFIGREVISIONSTORE (213)
Primary pointer page: 572, Index root page: 573
Data pages: 2122, data page slots: 2122, average fill: 82%
Fill distribution:
0 - 19% = 1
20 - 39% = 0
40 - 59% = 0
60 - 79% = 79
80 - 99% = 2042
The same applies to indexes.
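A per-table report like the one above can be produced with something like the following (the database path is a placeholder; the table name is taken from this example):
gstat -a -t CONFIGREVISIONSTORE /path/to/database.fdb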
You can see how big the database really is when you do a backup and restore with the option
-USE_ALL_SPACE
Then the database will be restored without this space reservation.
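For example (file names are placeholders):
gbak -c -use_all_space mydatabase.fbk mydatabase_restored.fdb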
You should also know that not only pages with data are allocated; some pages are preallocated (empty) for fast future use, avoiding expensive disk allocation and fragmentation.
as "Peter G." say - database is much more then flat file and is optimized to speed up thinks.
and as "Harriv" say - you can get details about database file with gstat
use command like gstat -
here are details about its output