Cassandra performance issue/understanding - Solr

We have 3 Cassandra nodes, and we are facing a problem where our application runs quite slowly most of the time. We do many writes per second, and that could be one of the reasons. Perhaps what we are doing wrong is that we continuously write to one node (let's say Node 1) and read (through Solr) from the same node (Node 1). So a possible solution we are considering is to build some logic around the connection string so that we write to all three nodes and read from all three nodes. Will this work, or is there a better way?
We have 100 GB of storage on each node and approximately 10 GB of data per node. We have also seen that Node 1, the node we are writing to, holds more data than the other two nodes. This is something we are still trying to figure out.
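To make the idea concrete, here is a minimal sketch of what we are considering, using the DataStax Python driver (the host names, keyspace, and table are placeholders). Rather than switching connection strings ourselves, the driver would take all three nodes as contact points with a token-aware load-balancing policy, so requests are spread over the ring instead of always hitting Node 1:

    from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
    from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

    # All three nodes as contact points; the token-aware policy routes each
    # request to a replica that owns the partition instead of to a single node.
    profile = ExecutionProfile(
        load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy())
    )
    cluster = Cluster(
        contact_points=["node1.example.com", "node2.example.com", "node3.example.com"],
        execution_profiles={EXEC_PROFILE_DEFAULT: profile},
    )
    session = cluster.connect("my_keyspace")  # hypothetical keyspace

    # Hypothetical table; both reads and writes are now balanced by the driver.
    session.execute(
        "INSERT INTO events (id, payload) VALUES (%s, %s)", ("event-1", "data")
    )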

Related

ClickHouse - How to remove node from cluster for reading?

Background
I'm beginning work to set up a ClickHouse cluster with 3 CH nodes. The first node (Node A) would be write-only, and the remaining 2 (Nodes B + C) would be read-only. By this I mean that writes for a given table to Node A would automatically replicate to Nodes B + C. When querying the cluster, reads would only be resolved against Nodes B + C.
The purpose for doing this is two-fold.
This datastore serves both real-time and background jobs. Both are high volume, but only on the read side, so it makes sense to segment the traffic. Node A would be used for writing to the cluster and for all background reads. Nodes B + C would be used strictly for the UX.
The volume of writes is very low, perhaps 1 write per 10,000 reads. Data is entirely refreshed once per week. Background jobs need to be certain that the most current data is being read before they can be kicked off. Reading from replicas introduces eventual consistency as a concern, so reading directly from Node A (rather than through the cluster) guarantees the data is strongly consistent.
Question
I'm not finding much specific information in the CH documentation, and am wondering whether this might be possible. If so, what would the cluster configuration look like?
Yes, it is possible to do so. But wouldn't the best solution be to read from and write to each server in turn using a Distributed table?
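For example, a sketch only (using the clickhouse-driver Python client; the cluster name, database, table, and host are hypothetical, and the cluster name must match a remote_servers entry in your server configuration): a replicated local table is paired with a Distributed table that fans reads and writes out over the cluster.

    from clickhouse_driver import Client  # pip install clickhouse-driver

    client = Client("node-a.example.com")  # any node in the cluster

    # Replicated local storage on every node; the ZooKeeper path and replica
    # macro are placeholders that depend on your configuration.
    client.execute("""
        CREATE TABLE IF NOT EXISTS db.events_local ON CLUSTER my_cluster
        (
            event_date Date,
            event_id   UInt64,
            payload    String
        )
        ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events_local', '{replica}')
        ORDER BY (event_date, event_id)
    """)

    # The Distributed "umbrella" table: inserts and selects against db.events
    # are spread over the cluster rather than pinned to one node.
    client.execute("""
        CREATE TABLE IF NOT EXISTS db.events ON CLUSTER my_cluster
        AS db.events_local
        ENGINE = Distributed(my_cluster, db, events_local, rand())
    """)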

Maximum recommended children in a Firebase Database node

I'm starting off with using Firebase Database for the first time, and I think I've got a decent structure planned out, but I'm worried about the number of child nodes I might end up with.
Is there a recommended limit or average known-good number of child values which can be added to a node without running into noticeable performance problems? I've not had much database experience at all, and I've not been able to find any information on what an acceptable value would be, so I have no idea if my planned structure will scale well.
As a rough estimate, I'm expecting a maximum of around 30,000 children all-in. I'll only be requesting data from around 10 of those, but as far as I know, Firebase will retrieve the entire node before filtering out any results, which is why I'm worried about the performance impact of retrieving the entire node. Any help with this would be massively appreciated! Thanks!
As a rough estimate, I'm expecting a maximum of around 30,000 children all-in.
That's not really a very large number of child nodes.
as far as I know, Firebase will retrieve the entire node before filtering out any results
If you query the database using a field with an index, the nodes will be filtered on the server. You can create an index to avoid performance problems for larger numbers of child nodes.
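For example, a minimal sketch with the Firebase Admin SDK for Python (the items path, score field, credentials file, and database URL are hypothetical; the matching ".indexOn": ["score"] rule would live in your database security rules):

    import firebase_admin
    from firebase_admin import credentials, db

    # Hypothetical service account and database URL.
    cred = credentials.Certificate("service-account.json")
    firebase_admin.initialize_app(
        cred, {"databaseURL": "https://example-app.firebaseio.com"}
    )

    # With an index on "score", the filtering happens on the server, so only
    # about 10 children are downloaded rather than all 30,000.
    top_items = (
        db.reference("items")
        .order_by_child("score")
        .limit_to_last(10)
        .get()
    )
    print(top_items)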

What to do when nodes in a Cassandra cluster reach their limit?

I am studying up on Cassandra and am in the process of setting up a cluster for a project that I'm working on. Consider this example:
Say I set up a 5-node cluster with 200 GB of space on each node. That adds up to 1000 GB (roughly 1 TB) of space overall. Assuming that my partitions are split evenly across the cluster, I can easily add nodes and achieve linear scalability. However, what if these 5 nodes start approaching the SSD limit of 200 GB? In that case, I can add 5 more nodes, and the partitions would then be split across 10 nodes. But the older nodes would still be accepting writes, as they are part of the cluster. Is there a way to make these 5 older nodes 'read-only'? I want to fire off random read queries across the entire cluster, but I don't want to write to the older nodes anymore (as they are capped by the 200 GB limit).
Help would be greatly appreciated. Thank you.
Note: I can say that 99% of the queries will be write queries, with 1% or less for reads. The app has to persist click events in Cassandra.
Usually, when a cluster reaches its limit, we add a new node to the cluster. After adding a new node, the old Cassandra nodes distribute part of their data to the new node. After that, we run nodetool cleanup on every old node to clean up the data that was handed over to the new node. The entire scenario happens within a single DC.
For example:
Suppose you have 3 nodes (A, B, C) in DC1 and 1 node (D) in DC2, and your nodes are reaching their limit. So you decide to add a new node (E) to DC1. Nodes A, B, and C will distribute part of their data to node E, and we'll run nodetool cleanup on A, B, and C to reclaim the space.
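For example, a rough sketch of that cleanup step (host names are hypothetical, and this assumes nodetool can reach each node's JMX port from wherever the script runs; you could equally run nodetool cleanup locally on each node):

    import subprocess

    # The pre-existing nodes (A, B, C) that handed token ranges to the new node E.
    old_nodes = ["node-a.example.com", "node-b.example.com", "node-c.example.com"]

    for host in old_nodes:
        # nodetool cleanup removes data this node no longer owns after the
        # token ranges were reassigned to the new node.
        subprocess.run(["nodetool", "-h", host, "-p", "7199", "cleanup"], check=True)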
I had some difficulty understanding the question properly.
I am assuming you know that by adding 5 new nodes, some of the data load is transferred to the new nodes, since some token ranges are assigned to them.
Given that, if you are concerned that the old 5 nodes would not be able to accept writes because they have reached their limit, that is not going to happen: the new nodes have taken over part of the data load, so the old nodes now have free space for further writes.
Isolating reads and writes to particular nodes is a different problem altogether. But if you want to isolate reads to the old 5 nodes and writes to the new 5 nodes, the best way to do this is to add the new 5 nodes as another datacenter under the same cluster and then use different consistency levels for reads and writes to make the old datacenter effectively read-only.
However, the new datacenter will not lighten the data load on the first one; it will take on the same load itself. (So you would need more than 5 new nodes to solve both problems at once: a few nodes to lighten the load, and others to isolate reads from writes by forming the new datacenter, which should itself have more than 5 nodes.) Best practice is to monitor the data load and address it before such a problem happens, by adding new nodes or increasing the storage limit.
Having done that, you will also need to ensure that the nodes you provide as contact points for reads and for writes are in different datacenters.
Suppose you have the following situation:
dc1(n1, n2, n3, n4, n5)
dc2(n6, n7, n8, n9, n10)
Now, for reads you connect through node n1, and for writes you connect through node n6.
The read/write isolation can then be achieved by choosing the right consistency level from the options below:
LOCAL_QUORUM
or
LOCAL_ONE
These basically confine the search for replicas to the local datacenter only.
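For example, a sketch of how a client could enforce that split with the DataStax Python driver (the datacenter names follow the dc1/dc2 example above; host names, keyspace, and table are hypothetical):

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster, ExecutionProfile
    from cassandra.policies import DCAwareRoundRobinPolicy

    # Reads are confined to dc1 and writes to dc2; the LOCAL_* consistency
    # levels keep each request within the chosen local datacenter.
    profiles = {
        "reads": ExecutionProfile(
            load_balancing_policy=DCAwareRoundRobinPolicy(local_dc="dc1"),
            consistency_level=ConsistencyLevel.LOCAL_QUORUM,
        ),
        "writes": ExecutionProfile(
            load_balancing_policy=DCAwareRoundRobinPolicy(local_dc="dc2"),
            consistency_level=ConsistencyLevel.LOCAL_ONE,
        ),
    }

    cluster = Cluster(
        contact_points=["n1.example.com", "n6.example.com"],
        execution_profiles=profiles,
    )
    session = cluster.connect("my_keyspace")  # hypothetical keyspace

    # Hypothetical click-events table.
    session.execute(
        "INSERT INTO clicks (id, ts) VALUES (%s, toTimestamp(now()))",
        ("click-1",),
        execution_profile="writes",
    )
    rows = session.execute(
        "SELECT ts FROM clicks WHERE id = %s", ("click-1",),
        execution_profile="reads",
    )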
Look at these references for more:
Adding a datacenter to a cluster
and
Consistency Levels

Is it possible to get notified when an OrientDB master node is up to date?

I mean that when we have a multi-master cluster with 3 master nodes and we add a new master node (a fourth node), we need to somehow be notified when it is up to date, or perhaps poll it every 5 seconds to find out whether it is up to date.
Is it possible to implement this?
This feature is not currently present, and looking at the roadmap, I don't think it will be.
Consider that if you have a situation with 5 nodes (so the quorum is 3), the sync takes place only between two nodes, while the other 3 still form a quorum to write data. So you can keep doing inserts and other operations, and you do not need a mechanism to tell you whether, for example, node 5 is ready or not. This, of course, assumes you can arrange such a setup.

Question about Mnesia distribution

I have two nodes running Mnesia. I created a schema and some tables on node 1, and used mnesia:add_table_copy on node 2 to copy the tables from node 1 to node 2.
Everything works well until I call q() on node 1 and then q() on node 2. I found that when I start node 1 again, mnesia:wait_for_tables([sometable], infinity) won't return. It will only return when I start node 2 again.
Is there a way to fix this? This is a problem because I won't be able to start node 1 again if node 2 is down.
In this discussion a situation similar to the one you're facing is presented.
Reading from that source:
At startup, Mnesia tries to connect with the other nodes and, if that succeeds, it loads its tables from them. If the other nodes are down, it looks for mnesia_down marks in its local transaction log in order to determine whether or not it has a consistent replica of its tables. The node that was shut down last has mnesia_down's from all the other nodes. This means that it can safely load its tables. If some of the other nodes were started first (as in your case), Mnesia will wait indefinitely for another node to connect in order to load its tables.
You're shutting down node 1 first, so it doesn't have the mnesia_down mark from the other node. What happens if you reverse the shutdown order?
Also, it should be possible to force the table loading via the force_load_table/1 function:
force_load_table(Tab) -> yes | ErrorDescription
The Mnesia algorithm for table load might lead to a situation where a table cannot be loaded. This situation occurs when a node is started and Mnesia concludes, or suspects, that another copy of the table was active after this local copy became inactive due to a system crash.
If this situation is not acceptable, this function can be used to override the strategy of the Mnesia table load algorithm. This could lead to a situation where some transaction effects are lost, with an inconsistent database as a result, but for some applications high availability is more important than consistent data.
