I have different availability requirements (e.g. 90%, 95%, 99% guaranteed uptime) for certain subsets of data. For example, the subsets of data are split (sharded?) based on the possible shard key values A, B, C.
What happens if the shard key value is updated? Will the corresponding record be migrated to another shard? And does this have a negative impact on performance?
Initially I thought of just using replication instead of sharding, but I don't see how I could differentiate the availability requirements for the different subsets of data, meaning I would have to guarantee 99% for everything.
My company uses off-the-shelf software that exports logs to Elasticsearch (and uses these logs). The software creates an index per day for every data type, for example:
"A" record data => A_Data_2022_12_13, A_Data_2022_12_14, and so on.
Because of this data storage method, our Elasticsearch cluster has thousands of shards for 100 GB of data.
I want to merge all those shards into a small number of shards, 1 or 2 per data type.
I thought about reindexing, but it seems like overkill for my purpose, because I want the data to stay the same as it is now, just merged into one shard.
What is the best practice to do it?
Thanks!
I tried reindexing, but it takes a lot of time, and I think it is not the right solution.
Too many shards can cause excessive heap usage. Unbalanced shards can cause hot spots in the cluster. Your decision is right: you should combine the small indices into one or a few indexes. That way you will have more stable shards and, therefore, a more stable cluster.
What can you do?
Create a rollover index and point your indexer at that index. That way, new data will be stored in the new index, so you only need to be concerned about the existing data.
Use a filtered alias to search your data.
Reindex or wait. The new data is being indexed into the new index, but what are you going to do about the existing indices? There are 2 ways to handle this: I assume you have an index retention period, so you can either wait until all the separate indices are deleted, or you can directly reindex your data.
Note: You can tune the reindex speed with slicing, and by setting number_of_replicas to 0 on the destination index.
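A minimal sketch of these steps with the Python elasticsearch client (8.x API); the index and alias names (a_data-000001, a_data_write, a_data_merged) and the @timestamp field are assumptions for illustration:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical cluster address

# 1. Rollover target with a write alias; point the indexer at the alias.
es.indices.create(
    index="a_data-000001",  # trailing digits let rollover auto-increment
    aliases={"a_data_write": {"is_write_index": True}},
    settings={"number_of_shards": 1},
)
es.indices.rollover(  # call periodically, or let ILM do it
    alias="a_data_write",
    conditions={"max_primary_shard_size": "50gb", "max_age": "30d"},
)

# 2. Merge the existing daily indices into one index. "slices" parallelizes
# the copy; zero replicas on the destination speeds up the initial write.
es.indices.create(
    index="a_data_merged",
    settings={"number_of_shards": 1, "number_of_replicas": 0},
)
task = es.reindex(
    source={"index": "a_data_2022_12_*"},  # lowercase: index names must be lowercase
    dest={"index": "a_data_merged"},
    slices="auto",
    wait_for_completion=False,
)
print("reindex task:", task["task"])

# 3. Once the copy finishes: restore replicas, and add filtered aliases so
# searches against the old per-day names keep working.
es.indices.put_settings(index="a_data_merged", settings={"index": {"number_of_replicas": 1}})
es.indices.put_alias(
    index="a_data_merged",
    name="a_data_2022_12_13",
    filter={"range": {"@timestamp": {"gte": "2022-12-13", "lt": "2022-12-14"}}},
)
```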
A new business need has emerged in our firm, where a relatively "big" data set needs to be accessed by online processes (with typical latency of up to 1 second). There is only one key, with high granularity (a row count measured in tens of millions), and the expected number of columns / fields / value columns will likely exceed hundreds of thousands.
The key column is shared among all value columns, so key-value storage, while scalable, seems rather wasteful here. Is there any hope of using Cassandra / ScyllaDB (to which we gradually narrowed down our search) for such a wide data set, while ideally also cutting data storage needs in half (by storing the common key only once)?
If I understand your use case correctly, you will have tens of millions of partitions (what you called rows), each with hundreds of thousands of different values in it (each of those would be a clustering row in modern CQL - CQL no longer supports un-schema-ed wide rows). This is a fairly reasonable data set for Scylla and Cassandra.
But I want to add that I'm not sure the storage saving you are hoping for will really be there. Yes, Scylla/Cassandra will not need to store the partition key multiple times, but unless the partition key is very long, this will often be negligible compared to the other overheads of storing the data on disk.
Another thing you should consider is your expected queries. How will you read from this database? If you'll want to read all 100,000 columns of a particular key, or a contiguous range of them, then the data model you described is perfect. However, if the expected use case is that you always plan to read a single column from a specific key, then this data model will be inefficient - a random-access read from the middle of a long partition is slower than reading the value from a short partition.
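To make this concrete, here is a minimal sketch of that data model using the Python cassandra-driver; the keyspace, table, and column names are hypothetical:

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")  # hypothetical cluster

# One partition per key; each "column" of the wide data set becomes a
# clustering row, so the common key is stored once per partition.
session.execute("""
    CREATE TABLE IF NOT EXISTS wide_data (
        key       bigint,
        col_name  text,
        col_value double,
        PRIMARY KEY (key, col_name)
    )
""")

# Efficient: a whole partition, or a contiguous range of columns.
rows = session.execute(
    "SELECT col_name, col_value FROM wide_data "
    "WHERE key = %s AND col_name >= %s AND col_name < %s",
    (42, "a", "b"),
)

# Works, but slower on long partitions: one random column from the middle.
one = session.execute(
    "SELECT col_value FROM wide_data WHERE key = %s AND col_name = %s",
    (42, "col_12345"),
).one()
```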
I've read in a book that:
"Flink maintains one state instance per key value and partitions all records with the same key to the operator task that maintains the state for this key."
My question is: let's say I have 4 tasks with 2 slots each, and there's a key that belongs to 95% of the data. Does that mean that 95% of the data is routed to the same machine?
Yes, it does mean that. If you have a hot key, then partitioning by key doesn't scale well.
In some cases, there are ways to work around this limitation. For example, if you are computing analytics (e.g., you want to count page views per page per minute, and one page gets 95% of the page views), you can do pre-aggregation -- split the work for the hot key across several parallel instances, and then do one final, non-parallel reduction of the partial results. (This is just standard map/reduce logic.)
This is called "data skew" and it is the bane of scalable applications everywhere.
It's also possible that the entire (100%) load goes to the same machine. There's no guarantee that the data is spread as evenly as possible by key, only that each key gets processed on a single machine. Technically, each key gets mapped to a key group (the number of key groups is the max parallelism for the topology) and each key group gets handled by a specific instance of an operator.
One way to handle this situation involves adding a second field to the key, resulting in a greater number of possible keys and possibly reducing the data skew across the keys. Then aggregate the results in a subsequent operator using just the one original key.
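To illustrate the salting idea independently of any particular Flink API, here is a minimal Python sketch; NUM_SALTS and the toy record stream are assumptions:

```python
import random
from collections import Counter

NUM_SALTS = 8  # hypothetical fan-out; tune to your parallelism

def salted_key(key: str) -> str:
    # Stage 1 key: a random salt spreads a hot key over NUM_SALTS sub-keys.
    return f"{key}#{random.randrange(NUM_SALTS)}"

# Stage 1: partial aggregation keyed by (key, salt). In Flink this would be
# a keyBy on the salted key followed by a windowed aggregate.
partials = Counter()
for record_key in ["hot", "hot", "hot", "cold", "hot"]:  # toy stream
    partials[salted_key(record_key)] += 1

# Stage 2: strip the salt and combine the partials per original key --
# a second keyBy on the original key with a final, much cheaper reduce.
totals = Counter()
for sk, count in partials.items():
    original_key, _, _ = sk.partition("#")
    totals[original_key] += count

print(dict(totals))  # e.g. {'hot': 4, 'cold': 1}
```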
I'm using DSE for Cassandra/Solr integration so that data are stored in Cassandra and indexed in Solr. It's very natural to use Cassandra to handle CRUD operation and use Solr for full text search respectively, and DSE can really simplify data synchronization between Cassandra and Solr.
When it comes to querying, however, there are actually two ways to go: Cassandra secondary/manually configured indexes vs. Solr. I want to know when to use which method, and what the performance difference is in general, especially under the DSE setup.
Here is one example use case in my project. I have a Cassandra table storing some item entity data. Besides the basic CRUD operation, I also need to retrieve items by equality on some field (say category) and then sort by some order (in my case here, a like_count field).
I can think of three different ways to handle it:
Declare 'indexed=true' in Solr schema for both category and like_count field and query in Solr
Create a denormalized table in Cassandra with primary key (category, like_count, id) (see the sketch after this list)
Create a denormalized table in Cassandra with primary key (category, order, id) and use an external component, such as Spark/Storm, to sort the items by like_count
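A rough sketch of what the second option could look like, via the Python cassandra-driver; the table and non-key column names are hypothetical:

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")  # hypothetical

# Denormalized query table: one partition per category, clustered by
# like_count so rows come back pre-sorted (highest first); id breaks ties.
session.execute("""
    CREATE TABLE IF NOT EXISTS items_by_category (
        category   text,
        like_count int,
        id         uuid,
        title      text,
        PRIMARY KEY (category, like_count, id)
    ) WITH CLUSTERING ORDER BY (like_count DESC, id ASC)
""")

# Top 10 most-liked items in a category, already sorted by the storage order.
top10 = session.execute(
    "SELECT id, title, like_count FROM items_by_category "
    "WHERE category = %s LIMIT 10",
    ("books",),
)
```

Note that because like_count is part of the primary key, changing it means deleting the old row and inserting a new one, which is exactly the tombstone concern mentioned below.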
The first method seems to be the simplest to implement and maintain. I just write some trivial Solr access code and the rest of the heavy lifting is handled by Solr/DSE Search.
The second method requires manual denormalization on create and update. I also need to maintain a separate table. There is also the tombstone issue, as like_count can be updated frequently. The good part is that reads may be faster (if there are no excessive tombstones).
The third method can alleviate the tombstone issue at the cost of one extra component for sorting.
Which method do you think is the best option? What is the difference in performance?
Cassandra secondary indexes have limited use cases:
No more than a couple of columns indexed.
Only a single indexed column in a query.
Too much inter-node traffic for high cardinality data (relatively unique column values)
Too much inter-node traffic for low cardinality data (high percentage of rows will match)
Queries need to be known in advance so data model can be optimized around them.
Because of these limitations, it is common for apps to create "index tables" which are indexed by whatever column is desired. This requires either that data be duplicated from the main table into each index table, or that an extra query be made: read the index table to get the main key, then read the actual row from the main table. Queries on multiple columns will have to be manually indexed in advance, making ad hoc queries problematic. And any duplicated data will have to be manually updated by the app in each index table.
Other than that... they will work fine in cases where a "modest" number of rows will be selected from a modest number of nodes, and queries are well specified in advance and not ad hoc.
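To make the index-table pattern concrete, here is a minimal sketch (Python cassandra-driver, hypothetical table and column names) of the extra-query variant, where the index table stores only the main key:

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")  # hypothetical

# Main table, keyed by id.
session.execute("""
    CREATE TABLE IF NOT EXISTS users (
        id    uuid PRIMARY KEY,
        email text,
        name  text
    )
""")

# Index table mapping the indexed column back to main-table keys.
# The app must write to BOTH tables on every insert/update.
session.execute("""
    CREATE TABLE IF NOT EXISTS users_by_email (
        email text,
        id    uuid,
        PRIMARY KEY (email, id)
    )
""")

# Two-step read: index table first, then the actual row.
hit = session.execute(
    "SELECT id FROM users_by_email WHERE email = %s", ("a@example.com",)
).one()
if hit:
    user = session.execute(
        "SELECT * FROM users WHERE id = %s", (hit.id,)
    ).one()
```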
DSE/Solr is better for:
A moderate number of columns are indexed.
Complex queries with a number of columns/fields referenced - Lucene matches all specified fields in a query in parallel. Lucene indexes the data on each node, so nodes query in parallel.
Ad hoc queries in general, where the precise queries are not known in advance.
Rich text queries such as keyword search, wildcard, fuzzy/like, range, inequality.
There is a performance and capacity cost to using Solr indexing, so a proof of concept implementation is recommended to evaluate how much additional RAM, storage, and nodes are needed, which depends on how many columns you index, the amount of text indexed, and any text filtering complexity (e.g., n-grams need more.) It could range from 25% increase for a relatively small number of indexed columns to 100% if all columns are indexed. Also, you need to have enough nodes so that the per-node Solr index fits in RAM or mostly in RAM if using SSD. And vnodes are not currently recommended for Solr data centers.
The question says it all.
Example: I'm planning to shard a database table. The table contains customer orders which are flagged as "active", "done" and "deleted". I also have three shards, one for each flag.
As far as I understand, a row has to be moved to the right shard when the flag is changed.
Am I right?
What's the best way to do this?
Can triggers be used?
I thought about not moving the row immediately, but only at the end of the day/week/month, but then it is not determined in which shard a row with a specific flag resides, and searches always have to be done over all shards.
EDIT: Some clarification:
In general I have to choose a criterion to decide in which shard a row resides. In this case I want it to be the flag described above, because it's the most natural way to shard this kind of data (in my opinion). There is only a limited number of active orders, which are accessed very often; a large number of finished orders, which are seldom accessed; and a very large number of rows which are almost never accessed.
If I want to know where a specific data row resides, I don't have to search all shards. If the user wants to load an active order, I already know in which database I have to look.
Now the flag, which is my sharding criterion, changes, and I want to know the best way to deal with this case. If I just kept the record in its original database, eventually all the data would accumulate in a single table.
In my opinion, keeping all active records in a single shard may not be a good idea. In such a sharding strategy, all I/O will be performed on a single database instance, leaving all the others highly underutilized.
An alternative sharding strategy can be to distribute the newly created rows among the shards using some kind of hash function (see the sketch after this list). This will allow:
Quick lookup of a row.
Distribution of I/O across all the shard instances.
No need to move the data from one shard to another (except when you want to increase the number of shards).
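A minimal sketch of such hash-based routing in Python; the shard count and order id format are hypothetical:

```python
import hashlib

NUM_SHARDS = 3  # hypothetical; changing it later requires rebalancing rows

def shard_for(order_id: str) -> int:
    # A stable hash (not Python's built-in hash(), which varies per process)
    # maps each order id to the same shard on every lookup.
    digest = hashlib.sha256(order_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

print(shard_for("order-12345"))  # always the same shard for this id
```

Because the shard is derived from the immutable order id rather than the mutable flag, a flag change never forces a row move.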
Sharding usually refers to separating the data into different databases on different servers. Oracle can do what you want using a feature called partitioned tables.
If you're using triggers (after/before update/insert), it would be an immediate move; other methods would result in having different types of data in the first shard (active) until it is cleaned up.
I would also suggest doing this by date, like a monthly job that moves anything that's inactive and older than a month to another "Archive" database; a rough sketch follows.
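A rough sketch of such a monthly job, using sqlite3 as a stand-in for any DB-API driver; the database files, the orders table, and its flag/updated_at columns are all hypothetical:

```python
import sqlite3  # stand-in for whatever DB-API driver your databases use

active_db = sqlite3.connect("active.db")    # hypothetical active shard
archive_db = sqlite3.connect("archive.db")  # hypothetical archive database

PREDICATE = "flag != 'active' AND updated_at < date('now', '-30 days')"

# Copy the inactive, month-old rows into the archive first...
rows = active_db.execute(
    f"SELECT id, flag, payload, updated_at FROM orders WHERE {PREDICATE}"
).fetchall()
archive_db.executemany(
    "INSERT INTO orders (id, flag, payload, updated_at) VALUES (?, ?, ?, ?)",
    rows,
)
archive_db.commit()

# ...and only delete them from the active shard after the archive committed.
# (A production job would delete by the exact archived ids, in case rows
# changed between the SELECT and the DELETE.)
active_db.execute(f"DELETE FROM orders WHERE {PREDICATE}")
active_db.commit()
```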
I'd like to ask you to reconsider doing this if you're doing it to increase performance (unless you have terabytes of data in this table). Please tell us why you want to shard and we'll all think about ways to solve your problem.