There is a configuration parameter "balance" in /etc/taos/taos.cfg whose default value is 1. What is it and how do I use it?
# enable/disable load balancing
# balance 1
TDengine's data is distributed across different vnodes. After running for a long time, the data distribution may become uneven. This switch can then be used to automatically migrate data between vnodes to achieve a balanced distribution.
I am in the process of creating an ETL and fraud management module using Flink to analyze a stream of real-time credit card transactions.
All transactions are received by an exposed API that pushes the data into a Kafka topic.
First, the received data needs to be checked and cleaned, and then stored in a database.
The next step is a fraud analysis of these transactions.
In this first step, with Flink, I have to check in the card database that the card is known before continuing. The problem is, there are around a billion cards in this database and new cards could be added over time.
So I'm not sure if I could cache the entire set of card numbers in memory, or how to handle this check effectively: is Flink able to handle some kind of sliding cache to check cards for existence in batch?
What you might do is mirror the card database into Flink's key-partitioned state, either on-heap, or using RocksDB if you want this to spill to disk. Key-partitioned state is sharded across the cluster, so if you do want to keep the entire card database in memory, you can scale up the cluster until that's feasible.
To keep only recently seen values, you could rely on state TTL to expire records that haven't been accessed recently.
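A minimal sketch of that approach, assuming a placeholder Transaction event type keyed by card number; the 7-day TTL, the state name, and the lookup behavior are illustrative only:

    import org.apache.flink.api.common.state.StateTtlConfig;
    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.api.common.time.Time;
    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
    import org.apache.flink.util.Collector;

    // Transaction is a placeholder for your own event type, keyed by card number.
    public class CardCheckFunction extends KeyedProcessFunction<String, Transaction, Transaction> {

        private transient ValueState<Boolean> knownCard;

        @Override
        public void open(Configuration parameters) {
            // Expire entries that have not been read or written for 7 days
            StateTtlConfig ttl = StateTtlConfig.newBuilder(Time.days(7))
                    .setUpdateType(StateTtlConfig.UpdateType.OnReadAndWrite)
                    .build();
            ValueStateDescriptor<Boolean> descriptor =
                    new ValueStateDescriptor<>("known-card", Types.BOOLEAN);
            descriptor.enableTimeToLive(ttl);
            knownCard = getRuntimeContext().getState(descriptor);
        }

        @Override
        public void processElement(Transaction tx, Context ctx, Collector<Transaction> out) throws Exception {
            if (Boolean.TRUE.equals(knownCard.value())) {
                out.collect(tx);   // card already cached as known
            } else {
                // Not cached yet: this is where you would consult the mirrored card data
                // or an external lookup; here we simply mark the card as seen.
                knownCard.update(true);
                out.collect(tx);
            }
        }
    }

Usage would be something like transactions.keyBy(tx -> tx.getCardNumber()).process(new CardCheckFunction()), with RocksDB configured as the state backend if the cached set is too large for the heap.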
An alternative: Flink SQL has support for doing streaming lookup joins against JDBC databases, and you can configure caching for that.
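For the SQL route, a rough fragment (not a complete job), assuming a JDBC card table and a transactions table that already declares a processing-time attribute proc_time; the connection URL, table and column names are placeholders, and the lookup.cache options follow the older JDBC connector option names:

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    TableEnvironment tEnv = TableEnvironment.create(
            EnvironmentSettings.newInstance().inStreamingMode().build());

    // JDBC-backed dimension table with a lookup cache
    tEnv.executeSql(
        "CREATE TABLE cards (" +
        "  card_number STRING," +
        "  status STRING," +
        "  PRIMARY KEY (card_number) NOT ENFORCED" +
        ") WITH (" +
        "  'connector' = 'jdbc'," +
        "  'url' = 'jdbc:postgresql://dbhost:5432/carddb'," +
        "  'table-name' = 'cards'," +
        "  'lookup.cache.max-rows' = '1000000'," +
        "  'lookup.cache.ttl' = '10min'" +
        ")");

    // Lookup join: each incoming transaction probes the card table (and its cache)
    tEnv.executeSql(
        "SELECT t.*, c.status " +
        "FROM transactions AS t " +
        "LEFT JOIN cards FOR SYSTEM_TIME AS OF t.proc_time AS c " +
        "ON t.card_number = c.card_number");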
We are working on HomeKit-enabled IoT devices. HomeKit is designed for consumer use and does not have the ability to collect metrics (power, temperature, etc.), so we need to implement metrics collection separately.
Let's say we have 10,000 devices. They send one collection of metrics every 5 seconds, so each second we need to receive 10,000 / 5 = 2,000 collections. The end user needs to see graphs of each metric over a specified period of time (1 week, month, year, etc.). So each day the system will receive 172.8 million records. Here come a lot of questions.
First of all, there's no need to store all of the raw data, as the user only needs graphs over the specified period, so some aggregation is required. What database solution fits this? I believe no RDBMS will handle such an amount of data. Then, how do we compute averages of the metrics to present to the end user?
AWS has shared a time-series data processing architecture:
Very simplified, I think of it this way (a rough ingestion sketch follows the list):
Devices push data directly to DynamoDB using an HTTP API
Metrics are stored in one table per 24 hours
At the end of the day a procedure runs on Elastic MapReduce and produces ready-made JSON files with the data required to show graphs per time period.
Old tables are stored in Redshift for further applications.
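As a rough illustration of the first step only, a single metric write with the AWS SDK for Java v2 could look like this (table and attribute names are made up; in the real design the devices would call an HTTP endpoint that performs this write on their behalf):

    import java.util.HashMap;
    import java.util.Map;
    import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
    import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
    import software.amazon.awssdk.services.dynamodb.model.PutItemRequest;

    public class MetricWriter {
        public static void main(String[] args) {
            DynamoDbClient ddb = DynamoDbClient.create();

            Map<String, AttributeValue> item = new HashMap<>();
            item.put("deviceId", AttributeValue.builder().s("thermostat-001").build());
            item.put("ts", AttributeValue.builder().n(Long.toString(System.currentTimeMillis())).build());
            item.put("temperature", AttributeValue.builder().n("21.5").build());

            // One table per day, matching the "one table per 24 hours" idea above
            ddb.putItem(PutItemRequest.builder()
                    .tableName("metrics_2024_05_01")
                    .item(item)
                    .build());
        }
    }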
Has anyone already done something similar before? Maybe there is a simpler architecture?
This requires big data infrastructure such as:
1) Hadoop cluster
2) Spark
3) HDFS
4) HBase
You can use Spark to read the data as a stream. The streamed data can be stored in the HDFS file system, which allows you to store large files across the Hadoop cluster. You can then use a MapReduce job to extract the required data set from HDFS and store it in HBase, the Hadoop database. HDFS is a distributed, scalable big data store for the raw records. Finally, you can use query tools to query HBase.
IoT data --> Spark --> HDFS --> MapReduce --> HBase --> query HBase.
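A minimal sketch of the first two hops, using Spark Structured Streaming for the ingest (broker address, topic name, and HDFS paths are placeholders):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class IotIngest {
        public static void main(String[] args) throws Exception {
            SparkSession spark = SparkSession.builder().appName("iot-ingest").getOrCreate();

            // Read the raw device metrics as a stream
            Dataset<Row> metrics = spark.readStream()
                    .format("kafka")
                    .option("kafka.bootstrap.servers", "broker:9092")
                    .option("subscribe", "device-metrics")
                    .load();

            // Land the raw records on HDFS; later batch jobs aggregate them into HBase
            metrics.selectExpr("CAST(value AS STRING) AS json")
                    .writeStream()
                    .format("parquet")
                    .option("path", "hdfs:///data/device-metrics")
                    .option("checkpointLocation", "hdfs:///checkpoints/device-metrics")
                    .start()
                    .awaitTermination();
        }
    }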
The reason I am suggesting this architecture is scalability. The input data can grow with the number of IoT devices. In the above architecture the infrastructure is distributed, and nodes can be added to the cluster as the load grows.
This is a proven architecture for big data analytics applications.
We need to move our SolrCloud cluster from one cloud vendor to another. The cluster is composed of 8 shards with a replication factor of 2, spread among 8 servers, with roughly 500 GB of data in total.
I wonder what the common approaches are to migrate the cluster, and especially its data, with the least impact on availability, performance, etc.
I was thinking of some sort of initial dump copy, then synchronizing the clusters by catching up on the diff (which could be huge); after keeping them in sync, just switch over from the other side whenever everything is ready.
Is that doable? What tools should/could I use?
Thanks!
You have multiple choices depending on your existing setup and Solr version:
As mentioned earlier, make use of the backup and restore APIs from the Collections API (a sketch follows this list)
If you have Solr 6 or above, I would recommend exploring the option of CDCR, which is Solr's native Cross Data Center Replication.
Reindex onto the new cluster and then leverage Solr collection aliasing to switch your application endpoints to the target provider once reindexing completes.
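For the first option, a rough SolrJ sketch (host names, collection name, backup name, and the backup location are placeholders; the location must be a path that every node in the respective cluster can reach):

    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class BackupRestore {
        public static void main(String[] args) throws Exception {
            // Take a backup on the source cluster
            HttpSolrClient source = new HttpSolrClient.Builder("http://old-node:8983/solr").build();
            CollectionAdminRequest.Backup backup =
                    CollectionAdminRequest.backupCollection("mycollection", "pre-migration");
            backup.setLocation("/mnt/shared/backups");
            backup.process(source);

            // Restore it on the target cluster (after copying the backup files across)
            HttpSolrClient target = new HttpSolrClient.Builder("http://new-node:8983/solr").build();
            CollectionAdminRequest.Restore restore =
                    CollectionAdminRequest.restoreCollection("mycollection", "pre-migration");
            restore.setLocation("/mnt/shared/backups");
            restore.process(target);
        }
    }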
Currently we have 2 servers with a load balancer in front of them. We want to be able to turn 1 machine off and later back on, without the user noticing it.
Our application also uses Solr, and now I want to install and configure Solr on both servers. The question is: how do I configure master-master replication?
After my initial research I found out that it's not possible :(
But what are my options here? I want both indices to stay in sync, and when a document is committed on one server it should also go to the other.
Thanks for your help!
I'm not certain of your specific use case (why turn 1 server on and off?), but there is no specific "master-master" replication. Solr does, however, support distributed indexing and querying via SolrCloud. From the SolrCloud documentation:
Replication ensures redundancy for your data, and enables you to send an update request to any node in the shard. If that node is a replica, it will forward the request to the leader, which then forwards it to all existing replicas, using versioning to make sure every replica has the most up-to-date version. This architecture enables you to be certain that your data can be recovered in the event of a disaster, even if you are using Near Real Time searching.
It's a bit complex, so I'd suggest you spend some time going through the documentation, as it's not quite as simple as setting up a couple of masters and load balancing between them. It is a big step up from the previous master/slave replication that Solr used, so even if it's not a perfect fit it will be a lot closer to what you need.
https://cwiki.apache.org/confluence/display/solr/SolrCloud
https://cwiki.apache.org/confluence/display/solr/Getting+Started+with+SolrCloud
You can just create simple master-slave replication as described here:
https://cwiki.apache.org/confluence/display/solr/Index+Replication
But be sure to send your inserts, deletes, and updates directly to the master; selects can go through the load balancer.
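A small SolrJ sketch of that write/read split, with placeholder host names, core name, and fields:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrInputDocument;

    public class WriteReadSplit {
        public static void main(String[] args) throws Exception {
            // Writes go straight to the master node
            HttpSolrClient master = new HttpSolrClient.Builder("http://master-host:8983/solr/mycore").build();
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "42");
            doc.addField("title_s", "example");
            master.add(doc);
            master.commit();

            // Reads can go through the load balancer in front of the slaves
            HttpSolrClient search = new HttpSolrClient.Builder("http://loadbalancer/solr/mycore").build();
            QueryResponse rsp = search.query(new SolrQuery("title_s:example"));
            System.out.println(rsp.getResults().getNumFound());
        }
    }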
The other alternative is to create a third server as the master with 2 slaves, and put the load balancer in front of the two slaves.
I would like to run two Solr instances on different computers as a cluster.
My main interest is high availability, meaning that if one server crashes or is down, there will always be another one.
(My performance on a single instance is great. I do not need to split the data across two servers.)
Questions:
1. What is the best practice?
Is it different from clustering for index splitting? Do I need shards?
2. Do I need ZooKeeper?
3. Is it a container-based configuration (different for Jetty and Tomcat)?
4. Do I need an external NLB for that?
5. When one computer comes back up after crashing, how does it update its index?
You can define numShards=1 and that's it; you need a single shard (slice) that is replicated. If you want automated cluster management and hot replication, then yes, you need SolrCloud mode and ZooKeeper. As for load balancing, it depends on your architecture; if you are going to use SolrJ, there is a basic load-balancing implementation there.
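A rough SolrJ sketch of both points, assuming SolrJ 7+ (collection name, config set, and host names are placeholders):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.impl.LBHttpSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class HaSetup {
        public static void main(String[] args) throws Exception {
            // Create one shard with two replicas, one per server (SolrCloud + ZooKeeper)
            HttpSolrClient admin = new HttpSolrClient.Builder("http://server1:8983/solr").build();
            CollectionAdminRequest
                    .createCollection("mycollection", "myconfigset", 1, 2)
                    .process(admin);

            // SolrJ's basic client-side load balancing across the two nodes
            LBHttpSolrClient lb = new LBHttpSolrClient.Builder()
                    .withBaseSolrUrls("http://server1:8983/solr/mycollection",
                                      "http://server2:8983/solr/mycollection")
                    .build();
            lb.query(new SolrQuery("*:*"));
        }
    }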
When a node initializes, it enters the recovery stage. During recovery it synchronizes with the other existing replicas as well as with its own transaction log. If its index version is old, it gets a newer version from another server.