Optimizing network bandwidth over distributed database aggregation jobs

I have a distributed/federated database structured as follows:
The databases are spread across three geographic locations ("nodes")
Multiple databases are clustered at each node
The relational databases are a mix of PostgreSQL, MySQL, Oracle, and MS SQL Server; the non-relational databases are either MongoDB or Cassandra
Loose coupling within each node and across the node federation is achieved via RabbitMQ, with each node running a RabbitMQ broker
I am implementing a readonly inter-node aggregation job system for jobs that span the node federation (i.e. for jobs that are not local to a node). These jobs only perform "get" queries - they do not modify the databases. (If the results of the jobs are intended to go into one or more of the databases then this is accomplished by a separate job that is not part of the inter-node job system I am trying to optimize.) My objective is to minimize the network bandwidth required by these jobs (first to minimize the inter-node / WAN bandwidth, then to minimize the intra-node / LAN bandwidth); I assume a uniform cost for each WAN link, and another uniform cost for each LAN link. The jobs are not particularly time-sensitive. I perform some CPU load-balancing within a node but not between nodes.
The amount of data transported across the WAN/LAN for the aggregation jobs is small relative to the amount of database writes that are local to a cluster or to a specific database, so it would not be practical to fully distribute the databases across the federation.
The basic algorithm I use for minimizing network bandwidth is:
Given a job that runs on a set of data that is spread across the federation, the manager node sends a message to each of the other nodes containing the relevant database queries.
Each node runs its set of queries, compresses the resultsets with gzip, caches them, and sends their compressed sizes to the manager node.
The manager moves the job to the node containing the plurality of the data (specifically, to the machine within that cluster that has the most data and has idle cores); that machine requests the rest of the data from the other two nodes and from the other machines within its cluster, then it runs the job.
When possible the jobs use a divide-and-conquer approach to minimize the amount of data co-location that is needed. For example, if the job needs to compute the sums of all Sales figures across the federation, then each node locally calculates its Sales sums which are then aggregated at the manager node (rather than copying all of the unprocessed Sales data to the manager node). However, sometimes (such as when performing a join between two tables that are located at different nodes) data co-location is needed.
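To make the divide-and-conquer step concrete, here is a minimal Python sketch of the Sales-sum example, with the per-node data stubbed out as in-memory lists (all names and numbers are hypothetical):

# Each node reduces its own Sales rows to a partial sum; only the partial sums
# cross the WAN to the manager node, never the raw rows.
local_sales_by_node = {
    "node1": [120.0, 75.5, 310.25],
    "node2": [99.99, 12.0],
    "node3": [500.0, 43.1, 7.7],
}

def local_subjob(rows):
    # Runs inside each node; the unprocessed rows never leave the node.
    return sum(rows)

partial_sums = {node: local_subjob(rows) for node, rows in local_sales_by_node.items()}

# Three floats are transported inter-node instead of every Sales row.
federated_total = sum(partial_sums.values())
print(partial_sums, federated_total)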
The first thing I did to optimize this was to aggregate the jobs and to run the aggregated jobs in ten-minute epochs (the machines are all running NTP, so I can be reasonably certain that "every ten minutes" means the same thing at each node). The goal is for two jobs to be able to share the same data, which reduces the overall cost of transporting the data.
Given two jobs that query the same table, I generate each job's resultset, and then I take the intersection of the two resultsets.
If both jobs are scheduled to run on the same node, then the network transfer cost is calculated as the sum of the sizes of the two resultsets minus the size of their intersection.
The two resultsets are stored in PostgreSQL temporary tables (in the case of relational data) or in temporary Cassandra columnfamilies / MongoDB collections (in the case of NoSQL data) at the node selected to run the jobs; the original queries are then performed against the combined resultsets and the data is delivered to the individual jobs. (This step is only performed on combined resultsets; an individual resultset is simply delivered to its job without first being stored in temporary tables/columnfamilies/collections.)
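For reference, the transfer-cost arithmetic for two co-scheduled jobs can be sanity-checked with a few lines of Python (the resultsets are modeled as sets of primary keys; real code would use the compressed byte sizes reported by each node):

job_a_rows = {1, 2, 3, 4, 5}   # hypothetical resultset of job A
job_b_rows = {4, 5, 6, 7}      # hypothetical resultset of job B

shared = job_a_rows & job_b_rows       # rows both jobs need
combined = job_a_rows | job_b_rows     # rows that actually get shipped once

# cost with sharing = |A| + |B| - |A ∩ B| = |A ∪ B|
cost_shared = len(job_a_rows) + len(job_b_rows) - len(shared)
cost_unshared = len(job_a_rows) + len(job_b_rows)
assert cost_shared == len(combined)
print(cost_unshared, cost_shared)      # 9 rows vs 7 rows transported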
This job aggregation results in an improvement to network bandwidth, but I'm wondering if there's a framework/library/algorithm that would improve on this. One option I considered is to cache the resultsets at a node and to account for these cached resultsets when determining network bandwidth (i.e. trying to reuse resultsets across jobs in addition to the current set of pre-scheduled co-located jobs, so that e.g. a job run in one 10-minute epoch can use a cached resultset from a previous 10-minute epoch). But unless the jobs use the exact same resultsets (i.e. unless they use identical WHERE clauses), I don't know of a general-purpose algorithm that would fill in the gaps in the resultset. For example, if the cached resultset used the clause "where N > 3" and a different job needs the resultset for the clause "where N > 0", what algorithm could I use to determine that I need to take the union of the cached resultset and the resultset for the clause "where N > 0 AND N <= 3"? I could try to write my own algorithm to do this, but the result would be a buggy useless mess. I would also need to determine when the cached data is stale - the simplest way to do this is to compare the cached data's timestamp with the last-modified timestamp on the source table and replace all of the data if the timestamp has changed, but ideally I'd want to be able to update only the values that have changed, with per-row or per-chunk timestamps.
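I have not implemented this, but for the narrow case of one-sided numeric range predicates the "gap query" in the example above can be derived mechanically. A purely illustrative Python sketch of the idea (not a general predicate-subsumption algorithm; the column name N and the predicate shape are assumptions):

def gap_predicate(cached_lower_bound, requested_lower_bound):
    # Cached resultset covers "N > cached_lower_bound"; the new job needs
    # "N > requested_lower_bound". The missing range, if any, is
    # (requested_lower_bound, cached_lower_bound].
    if requested_lower_bound >= cached_lower_bound:
        return None   # the cache already covers the request (filter it locally)
    return f"N > {requested_lower_bound} AND N <= {cached_lower_bound}"

print(gap_predicate(3, 0))   # "N > 0 AND N <= 3" - fetch only this, then union with the cache
print(gap_predicate(3, 5))   # None - "N > 5" can be answered from the cached rows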

I've started to implement my solution to the question.
In order to simplify the intra-node cache and also to simplify CPU load balancing, I'm using a Cassandra database at each database cluster ("Cassandra node") to run the aggregation jobs (previously I was aggregating the local database resultsets by hand) - I'm using the single Cassandra database for the relational, Cassandra, and MongoDB data (the downside is that some relational queries run slower on Cassandra, but this is made up for by the fact that the single unified aggregation database is easier to maintain than the separate relational and non-relational aggregation databases). I am also no longer aggregating jobs in ten minute epochs since the cache makes this algorithm unnecessary.
Each machine in a node refers to a Cassandra columnfamily called Cassandra_Cache_[MachineID] that is used to store the key_ids and column_ids that it has sent to the Cassandra node. The Cassandra_Cache columnfamily consists of a Table column, a Primary_Key column, a Column_ID column, a Last_Modified_Timestamp column, a Last_Used_Timestamp column, and a composite key consisting of the Table|Primary_Key|Column_ID. The Last_Modified_Timestamp column denotes the datum's last_modified timestamp from the source database, and the Last_Used_Timestamp column denotes the timestamp at which the datum was last used/read by an aggregation job. When the Cassandra node requests data from a machine, the machine calculates the resultset and then takes the set difference of the resultset and the table|key|columns that are in its Cassandra_Cache and that have the same Last_Modified_Timestamp as the rows in its Cassandra_Cache (if the timestamps don't match then the cached data is stale and is updated along with the new Last_Modified_Timestamp). The local machine then sends the set difference to the Cassandra node and updates its Cassandra_Cache with the set difference and updates the Last_Used_Timestamp on each cached datum that was used to compose the resultset. (A simpler alternative to maintaining a separate timestamp for each table|key|column is to maintain a timestamp for each table|key, but this is less precise and the table|key|column timestamp is not overly complex.) Keeping the Last_Used_Timestamps in sync between Cassandra_Caches only requires that the local machines and remote nodes send the Last_Used_Timestamp associated with each job, since all data within a job uses the same Last_Used_Timestamp.
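A condensed Python sketch of that per-machine diff step, with the cache modeled as a plain dict keyed by (table, primary_key, column_id) (all names are illustrative, not the actual schema):

def diff_against_cache(resultset, cassandra_cache, job_timestamp):
    # resultset: (table, key, column_id) -> (value, last_modified_timestamp)
    # cassandra_cache: (table, key, column_id) -> {"last_modified": ts, "last_used": ts}
    to_send = {}
    for triple, (value, last_modified) in resultset.items():
        cached = cassandra_cache.get(triple)
        if cached is None or cached["last_modified"] != last_modified:
            # Not cached, or stale at the Cassandra node: send it and (re)record it.
            to_send[triple] = (value, last_modified)
            cassandra_cache[triple] = {"last_modified": last_modified,
                                       "last_used": job_timestamp}
        else:
            # Cached and fresh: skip the transfer, just refresh Last_Used_Timestamp.
            cached["last_used"] = job_timestamp
    return to_send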
The Cassandra node updates its resultset with the new data that it receives from within the node and also with the data that it receives from the other nodes. The Cassandra node also maintains a columnfamily that stores the same data that is in each machine's Cassandra_Cache (except for the Last_Modified_Timestamp, which is only needed on the local machine to determine when data is stale), along with a source id indicating whether the data came from within the node or from another node - the id distinguishes between the different nodes, but does not distinguish between the different machines within the local node. (Another option is to use a unified Cassandra_Cache rather than one Cassandra_Cache per machine plus another Cassandra_Cache for the node, but I decided that the added complexity was not worth the space savings.)
Each Cassandra node also maintains a Federated_Cassandra_Cache, which consists of the {Database, Table, Primary_Key, Column_ID, Last_Used_Timestamp} tuples that have been sent from the local node to one of the other two nodes.
When a job comes through the pipeline, each Cassandra node updates its intra-node cache with the local resultsets and completes the sub-jobs that can be performed locally (e.g. in a job that sums data across multiple nodes, each node sums its intra-node data first in order to minimize the amount of data that needs to be co-located across the federation) - a sub-job can be performed locally if it only uses intra-node data. The manager node then determines on which node to perform the rest of the job: each Cassandra node can locally compute the cost of sending its resultset to another node by taking the set difference of its resultset and the subset of that resultset already cached remotely according to its Federated_Cassandra_Cache, and the manager node minimizes the total transport cost ["cost to transport resultset from NodeX" + "cost to transport resultset from NodeY"]. For example, if it costs Node1 {3, 5} to transport its resultset to {Node2, Node3}, Node2 {2, 2} to transport its resultset to {Node1, Node3}, and Node3 {4, 3} to transport its resultset to {Node1, Node2}, then running the job costs 2 + 4 = 6 at Node1, 3 + 3 = 6 at Node2, and 5 + 2 = 7 at Node3, so the job is run on Node1 with a cost of 6 (Node2 ties at 6; with uniform link costs either choice is acceptable).
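The manager's placement decision in that example reduces to a few lines of Python (the costs are the hypothetical numbers above):

# transport_cost[src][dst] = cost for src to ship its cache-adjusted resultset to dst
transport_cost = {
    "Node1": {"Node2": 3, "Node3": 5},
    "Node2": {"Node1": 2, "Node3": 2},
    "Node3": {"Node1": 4, "Node2": 3},
}

def placement_cost(target, costs):
    # Cost of running the job at `target`: every other node ships its resultset there.
    return sum(c[target] for src, c in costs.items() if src != target)

costs = {node: placement_cost(node, transport_cost) for node in transport_cost}
print(costs)                        # {'Node1': 6, 'Node2': 6, 'Node3': 7}
print(min(costs, key=costs.get))    # Node1 (ties broken by whichever node is listed first)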
I'm using an LRU eviction policy for each Cassandra node; I was originally using an oldest-first eviction policy because it is simpler to implement and requires fewer writes to the Last_Used_Timestamp column (once per datum update instead of once per datum read), but the implementation of an LRU policy turned out not to be overly complex and the Last_Used_Timestamp writes did not create a bottleneck. When a Cassandra node reaches 20% free space it evicts data until it reaches 30% free space, so each eviction pass frees approximately 10% of the total space available. The node maintains two timestamps: the timestamp of the last-evicted intra-node data and the timestamp of the last-evicted inter-node / federated data. Because inter-node communication has higher latency than intra-node communication, the goal of the eviction policy is for 75% of the cached data to be inter-node data and 25% to be intra-node data, which can be quickly approximated by making 25% of each eviction pass inter-node data and 75% intra-node data. Eviction works as follows:
while (evicted_local_data_size < 7.5% of total space available) {
    last_evicted_local_timestamp += 1 hour
    evict local data with Last_Used_Timestamp < last_evicted_local_timestamp
    add the size of the evicted data to evicted_local_data_size
}
while (evicted_federated_data_size < 2.5% of total space available) {
    last_evicted_federated_timestamp += 1 hour
    evict federated data with Last_Used_Timestamp < last_evicted_federated_timestamp
    add the size of the evicted data to evicted_federated_data_size
}
Evicted data is not permanently deleted until eviction acknowledgments have been received from the machines within the node and from the other nodes.
The Cassandra node then sends a notification to the machines within its node indicating what the new last_evicted_local_timestamp is. The local machines update their Cassandra_Caches to reflect the new timestamp, and send a notification to the Cassandra node when this is complete; when the Cassandra node has received notifications from all local machines then it permanently deletes the evicted local data. The Cassandra node also sends a notification to the remote nodes with the new last_evicted_federated_timestamp; the other nodes update their Federated_Cassandra_Caches to reflect the new timestamp, and the Cassandra node permanently deletes the evicted federated data when it receives notifications from each node (the Cassandra node keeps track of which node a piece of data came from, so after receiving an eviction acknowledgment from NodeX the node can permanently delete the evicted NodeX data before receiving an eviction acknowledgment from NodeY). Until all machines/nodes have sent their notifications, the Cassandra node uses the cached evicted data in its queries if it receives a resultset from a machine/node that has not evicted its old data. For example, the Cassandra node has a local Table|Primary_Key|Column_ID datum that it has evicted, and meanwhile a local machine (which has not processed the eviction request) has not included the Table|Primary_Key|Column_ID datum in its resultset because it thinks that the Cassandra node already has the datum in its cache; the Cassandra node receives the resultset from the local machine, and because the local machine has not acknowledged the eviction request the Cassandra node includes the cached evicted datum in its own resultset.
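A compressed Python sketch of that acknowledgment bookkeeping (class and method names are hypothetical; for brevity it retains all evicted data until every participant has acknowledged, rather than deleting per-node as described above):

class EvictionTracker:
    def __init__(self, participants):
        self.pending_acks = set(participants)   # machines/nodes that must acknowledge
        self.evicted_but_retained = {}          # (table, key, column_id) -> value

    def evict(self, triple, value):
        # Logically evicted, but retained until every participant acknowledges.
        self.evicted_but_retained[triple] = value

    def acknowledge(self, participant):
        self.pending_acks.discard(participant)
        if not self.pending_acks:
            self.evicted_but_retained.clear()   # now safe to delete permanently

    def patch_resultset(self, sender, resultset):
        # A sender that has not yet acknowledged may have omitted data it believes
        # is still cached here; fill those gaps from the retained evicted data.
        if sender in self.pending_acks:
            for triple, value in self.evicted_but_retained.items():
                resultset.setdefault(triple, value)
        return resultset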

Related

ClickHouse replica out of sync

I have a cluster of 3 ClickHouse servers with a table using ReplicatedMergeTree. Two of the servers are out of sync and the number of entries in 'system.replication_queue' keeps increasing. I can see this error in the logs:
Not executing log entry for part e87a3a2d13950a90846a513f435c2560_2428139_2436934_22 because source parts size (470.12 MiB) is greater than the current maximum (4.45 MiB).
How do I increase the source parts size? I could not find it in settings.
Update:
I read the source code; it is auto-calculated based on the available resources. I am also getting this message:
Not executing log entry for part de77ce6a2937ce543cd003eb289fdb7e_8097652_8107495_1904 because another log entry for the same part is being processed. This shouldn't happen often.
The servers which are getting the above message in the logs have high CPU usage and high insert latency.
The replication queue gets cleared once I stop inserting.
I found the solution. It happens because "merges are processing significantly slower than inserts", as suggested by @vladimir.
I was inserting data in big batches, but that does not mean that ClickHouse will also store the data in one big file. ClickHouse stores data based on
number of partitions * number of columns * (times 2 for every nullable column)
So even for an insertion with a single large batch, multiple files are created. I solved this issue by reducing the number of partitions (by removing a partition key), thereby reducing the number of files being created.

Does Snowflake execute a "single" query on multiple nodes or on a single node in a cluster?

When a "single" query is executed on a Snowflake cluster, will it use (if available) as many of the nodes as possible in parallel to execute the query, or just one single node in the cluster?
I am specifically looking for a scaling strategy to speed up the following query:
INSERT INTO x SELECT * FROM y
Most of the time, Snowflake will try to run the query in parallel and use all nodes in the cluster, but in rare cases it may run on only a subset of the nodes - for example, if the data source is very small, if there's only one file to ingest with the COPY command, or if you are calling a JavaScript stored procedure to process data.
Here is a simple demonstration. The following query will run on only 1 node, no matter how many nodes the cluster has:
create or replace table dummy_test (id varchar) as
select randstr(2000, random()) from table(generator(rowcount=>500000));
This is because the data source is a generator, which cannot be read in parallel. You can try running it on various warehouse sizes and you will see that it completes in around 55 seconds (assuming there is no other workload in the warehouse).
As Simeon and Mike mentioned, a query can be executed in one cluster in multi-cluster warehouses. Multi-cluster warehouses are for increasing concurrency.
In the context of a multi-cluster warehouse, just a single node.
So large queries are better run on a larger warehouse size, and large volumes of queries run best against clusters of correctly sized nodes (from an average wait-time perspective), but of course this costs more. If you had a fixed pool of queries, the total cost should be the same running them on a wider cluster, just with less wall-clock time.
This is a good read also on the topic of scaling

Data Partitioning and Replication on Cassandra cluster

I have a 3-node Cassandra cluster with RF=3. When I run nodetool status, the owns value for each node in the cluster is 100%.
But when I have 5 nodes in the cluster with RF=3, the owns value is approximately 60% for each node.
Now, as per my understanding, the partitioner will calculate the hash corresponding to the first replica node, and the data will also be replicated on the other nodes as per the RF.
Now we have a 5 node cluster and RF is 3.
Shouldn't 3 nodes be owning all the data evenly (100%), since the partitioner will point to one node as per the partitioning strategy and then the same data will be replicated to the remaining RF-1 nodes? It's like the data is getting evenly distributed among all 5 nodes even though RF is 3.
Edit1:
As per my understanding, the reason each node owns approximately 60% is that the RF is 3. That means there will be 3 replicas of each row, i.e. 300% of the data in total. There are 5 nodes in the cluster, and the partitioner will be using the default random hashing algorithm, which will distribute the data evenly across all the nodes in the cluster.
But now the issue is that we checked all the nodes of our cluster and all the nodes contain all the data even though the RF is 3.
Edit2:
@Aaron I did as specified in the comment. I created a new cluster with 3 nodes.
I created a keyspace "test", set the class to SimpleStrategy, and set RF to 2.
Then I created a table "emp" having partition key (id,name).
Now I inserted a single row into the first node.
As per your explanation, It should only be in 2 nodes as RF=2.
But when I logged into all 3 nodes, I could see the row replicated on all the nodes.
I think that since the keyspace is getting replicated to all the nodes, the data is also getting replicated.
Percent ownership is not affected (at all) by actual data being present. You could add a new node to a single node cluster (RF=1) and it would instantly say 50% on each.
Percent ownership is purely about the percentage of token ranges which a node is responsible for. When a node is added, the token ranges are recalculated, but data doesn't actually move until a streaming event happens. Likewise, data isn't actually removed from its original node until cleanup.
For example, if you have a 3 node cluster with a RF of 3, each node will be at 100%. Add one node (with RF=3), and percent ownership drops to about 75%. Add a 5th node (again, keep RF=3) and ownership for each node correctly drops to about 3/5, or 60%. Again, with a RF of 3 it's all about each node being responsible for a set of primary, secondary, and tertiary token ranges.
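The ownership percentages in that walk-through are simply RF divided by the node count (capped at 100%); a quick illustrative check in Python:

def expected_ownership(replication_factor, node_count):
    # With balanced token ranges, each node is responsible for RF/N of the ring.
    return min(1.0, replication_factor / node_count)

for nodes in (3, 4, 5):
    print(nodes, format(expected_ownership(3, nodes), ".0%"))   # 100%, 75%, 60%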
Regarding "the default random hashing algorithm which will distribute the data evenly across all the nodes in the cluster":
Actually, the distributed hash with Murmur3 partitioner will evenly distribute the token ranges, not the data. That's an important distinction. If you wrote all of your data to a single partition, I guarantee that you would not get even distribution of data.
The data replicated to other nodes when you add them isn't cleaned up automatically - you need to run nodetool cleanup on the "old" nodes after you add the new node into the cluster. This will remove the ranges that were moved to other nodes.

What is "compute to control node" in Azure SQL DW Query plan?

What exactly is the "Compute to Control Node" step in an Azure SQL DW query execution plan? Does that mean ADW is moving the data to the control node and then performing the JOIN? I understand the Shuffle operation, which redistributes data among the compute nodes, but I don't understand in what situation the data flows from a Compute node to the Control node for a JOIN.
All 3 high-cost operations in the screenshot are associated with moving the 2 fact tables and the biggest dimension tables.
[screenshot: Query_Plan]
Thanks
You can have portions of a query sent to the control node in operations such as PartitionMoves. For example, this might occur when you do a GroupBy on a column that's not a distribution column and the optimizer thinks the result set is small enough to send up to the control node for final aggregations.

InfluxDB data structure & database model

Can you please tell me which data structure InfluxDB has and which data model InfluxDB uses? Is it a key-value model? I read the full documentation and I didn't catch that.
Thank you in advance!
1. Data model and terminology
An InfluxDB database stores points. A point has four components: a measurement, a tagset, a fieldset, and a timestamp.
The measurement provides a way to associate related points that might have different tagsets or fieldsets. The tagset is a dictionary of key-value pairs to store metadata with a point. The fieldset is a set of typed scalar values—the data being recorded by the point.
The serialization format for points is defined by the [line protocol] (which includes additional examples and explanations if you’d like to read more detail). An example point from the specification helps to explain the terminology:
temperature,machine=unit42,type=assembly internal=32,external=100 1434055562000000035
The measurement is temperature.
The tagset is machine=unit42,type=assembly. The keys, machine and type, in the tagset are called tag keys. The values, unit42 and assembly, in the tagset are called tag values.
The fieldset is internal=32,external=100. The keys, internal and external, in the fieldset are called field keys. The values, 32 and 100, in the fieldset are called field values.
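A toy Python split of that example point makes the terminology tangible (it ignores the escaping and type rules of the real line protocol; for illustration only):

line = "temperature,machine=unit42,type=assembly internal=32,external=100 1434055562000000035"

measurement_and_tags, fields, timestamp = line.split(" ")
measurement, *tags = measurement_and_tags.split(",")

tagset = dict(tag.split("=") for tag in tags)
fieldset = dict(field.split("=") for field in fields.split(","))

print(measurement)   # temperature
print(tagset)        # {'machine': 'unit42', 'type': 'assembly'}
print(fieldset)      # {'internal': '32', 'external': '100'}
print(timestamp)     # 1434055562000000035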
Each point is stored within exactly one database within exactly one retention policy. A database is a container for users, retention policies, and points. A retention policy configures how long InfluxDB keeps points (duration), how many copies of those points are stored in the cluster (replication factor), and the time range covered by shard groups (shard group duration). The retention policy makes it easy for users (and efficient for the database) to drop older data that is no longer needed. This is a common pattern in time series applications.
We'll explain replication factor, shard groups, and shards later when we describe how the write path works in InfluxDB.
There’s one additional term that we need to get started: series. A series is simply a shortcut for saying retention policy + measurement + tagset. All points with the same retention policy, measurement, and tagset are members of the same series.
You can refer to the [documentation glossary] for these terms or others that might be used in this blog post series.
2. Receiving points from clients
Clients POST points (in line protocol format) to InfluxDB’s HTTP /write endpoint. Points can be sent individually; however, for efficiency, most applications send points in batches. A typical batch ranges in size from hundreds to thousands of points. The POST specifies a database and an optional retention policy via query parameters. If the retention policy is not specified, the default retention policy is used. All points in the body will be written to that database and retention policy. Points in a POST body can be from an arbitrary number of series; points in a batch do not have to be from the same measurement or tagset.
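For example, a single point can be written to the 1.x /write endpoint with a plain HTTP POST using Python and the third-party requests library (the host, port, and database name are placeholders):

import requests

point = "temperature,machine=unit42,type=assembly internal=32,external=100 1434055562000000035"

# POST line-protocol points to /write; the database (and optionally the retention
# policy, rp) are passed as query parameters.
resp = requests.post(
    "http://localhost:8086/write",
    params={"db": "mydb"},           # add "rp": "some_policy" to override the default
    data=point.encode("utf-8"),
)
resp.raise_for_status()              # InfluxDB replies 204 No Content on success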
When the database receives new points, it must (1) make those points durable so that they can be recovered in case of a database or server crash and (2) make the points queryable. This post focuses on the first half, making points durable.
3. Persisting points to storage
To make points durable, each batch is written and fsynced to a write ahead log (WAL). The WAL is an append only file that is only read during a database recovery. For space and disk IO efficiency, each batch in the WAL is compressed using [snappy compression] before being written to disk.
While the WAL format efficiently makes incoming data durable, it is an exceedingly poor format for reading—making it unsuitable for supporting queries. To allow immediate query ability of new data, incoming points are also written to an in-memory cache. The cache is an in-memory data structure that is optimized for query and insert performance. The cache data structure is a map of series to a time-sorted list of fields.
The WAL makes new points durable. The cache makes new points queryable. If the system crashes or shuts down before the cache is written to TSM files, it is rebuilt when the database starts by reading and replaying the batches stored in the WAL.
The combination of WAL and cache works well for incoming data but is insufficient for long-term storage. Since the WAL must be replayed on startup, it is important to constrain it to a reasonable size. The cache is limited to the size of RAM, which is also undesirable for many time series use cases. Consequently, data needs to be organized and written to long-term storage blocks on disk that are size-efficient (so that the database can store a lot of points) and efficient for query.
Time series queries are frequently aggregations over time—scans of points within a bounded time range that are then reduced by a summary function like mean, max, or moving windows. Columnar database storage techniques, where data is organized on disk by column and not by row, fit this query pattern nicely. Additionally, columnar systems compress data exceptionally well, satisfying the need to store data efficiently. There is a lot of literature on column stores. [Columnar-oriented Database Systems] is one such overview.
Time series applications often evict data from storage after a period of time. Many monitoring applications, for example, will store the last month or two of data online to support monitoring queries. It needs to be efficient to remove data from the database if a configured time-to-live expires. Deleting points from columnar storage is expensive, so InfluxDB additionally organizes its columnar format into time-bounded chunks. When the time-to-live expires, the time-bounded file can simply be deleted from the filesystem rather than requiring a large update to persisted data.
Finally, when InfluxDB is run as a clustered system, it replicates data across multiple servers for availability and durability in case of failures.
The optional time-to-live duration, the granularity of time blocks within the time-to-live period, and the number of replicas are configured using an InfluxDB retention policy:
CREATE RETENTION POLICY <retention_policy_name> ON <database_name> DURATION <duration> REPLICATION <n> [SHARD DURATION <duration>] [DEFAULT]
The duration is the optional time to live (if data should not expire, set the duration to INF). SHARD DURATION is the granularity of data within the expiration period. For example, a one-hour shard duration with a 24-hour duration configures the database to store 24 one-hour shards. Each hour, the oldest shard is expired (removed) from the database. Set REPLICATION to configure the replication factor - how many copies of a shard should exist within a cluster.
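The shard arithmetic in that example is easy to check; a small illustrative Python calculation (the durations are the ones from the example, not defaults):

from datetime import timedelta

duration = timedelta(hours=24)        # retention policy DURATION
shard_duration = timedelta(hours=1)   # SHARD DURATION

# A 24-hour duration with one-hour shard groups keeps roughly 24 shards online;
# each hour the oldest shard falls outside the retention window and is dropped whole.
shards_online = int(duration / shard_duration)
print(shards_online)   # 24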
Concretely, the database creates this physical organization of data on disk:
Database directory: /db
  Retention policy directory: /db/rp
    Shard group (time bounded; logical)
      Shard directory: /db/rp/Id#
        TSM0001.tsm (data file)
        TSM0002.tsm (data file)
        …
The in-memory cache is flushed to disk in the TSM format. When the flush completes, flushed points are removed from the cache and the corresponding WAL is truncated. (The WAL and cache are also maintained per-shard.) The TSM data files store the columnar-organized points. Once written, a TSM file is immutable. A detailed description of the TSM file layout is available in the [InfluxDB documentation].
4. Compacting persisted points
The cache is a relatively small amount of data. The TSM columnar format works best when it can store long runs of values for a series in a single block. A longer run produces both better compression and reduces seeks to scan a field for query. The TSM format is based heavily on log-structured merge-trees. New (level one) TSM files are generated by cache flushes. These files are later combined (compacted) into level two files. Level two files are further combined into level three files. Additional levels of compaction occur as the files become larger and eventually become cold (the time range they cover is no longer hot for writes.) The documentation reference above offers a detailed description of compaction.
There’s a lot of logic and sophistication in the TSM compaction code. However, the high-level goal is quite simple: organize values for a series together into long runs to best optimize compression and scanning queries.
Refer: https://www.influxdata.com/blog/influxdb-internals-101-part-one/
It is essentially key-value, with the key being time and the value being one or more fields/columns. There can also optionally be indexed columns, called tags in InfluxDB, which are optimised for searching along with time (which is always required). At least one non-indexed value is required.
See schema design documentation for more details.
Much like Cassandra, in fact, though Influx essentially builds its schema on write, while developers define the schema up front for Cassandra.
Storage-engine-wise it is again very similar to Cassandra, using a variation of the SSTables used in Cassandra, optimised for time series data.
I am not sure if the following influx document was there when you were looking for your answer:
https://docs.influxdata.com/influxdb/v1.5/concepts/key_concepts/
But it really helped me understand the data structure of InfluxDB.
