Elasticsearch: merge data from multiple indices into one merged index

My company uses out-of-the-box software that exports its logs to Elasticsearch (and reads those logs back). The software creates one index per day for every data type, for example:
"A" record data => A_Data_2022_12_13, A_Data_2022_12_14, and so on.
Because of this storage pattern, our Elasticsearch cluster has thousands of shards for only 100 GB of data.
I want to merge all those shards into a small number of shards, 1 or 2 per data type.
I thought about reindexing, but it seems like overkill for my purpose, because I want the data to stay exactly as it is now, just merged into one shard.
What is the best practice for doing this?
Thanks!
I tried reindexing, but it takes a long time, and I don't think it is the right solution.

Too many shards can cause excessive heap usage, and unbalanced shards can cause hot spots in the cluster. Your decision is right: you should combine the small indices into one or a few larger indices. That way you will have more stable shards and, therefore, a more stable cluster.
What can you do?
1. Create a rollover index and point your indexer at it. That way, new data will be stored in the new index, so you only need to worry about the existing data (see the sketch after this list).
2. Use a filtered alias to search your data.
3. Reindex or wait. New data is now being indexed into the new index, but what are you going to do with the existing indices? There are two options here. Assuming you have an index retention period, you can either wait until all the separated indices are deleted, or reindex the data directly.
Note: you can tune the reindex speed with slicing and by setting number_of_replicas to 0 on the destination index.
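A minimal sketch of these steps against the standard Elasticsearch REST APIs, using Python's requests; all index and alias names here are made up for illustration:

```python
import requests

ES = "http://localhost:9200"  # assumed cluster address

# One single-shard write index behind a rollover alias; new data lands here.
requests.put(f"{ES}/a_data-000001", json={
    "settings": {"number_of_shards": 1, "number_of_replicas": 0},
    "aliases": {"a_data_write": {"is_write_index": True}},
})

# Roll over on size/age so a single index never grows unbounded.
requests.post(f"{ES}/a_data_write/_rollover", json={
    "conditions": {"max_size": "50gb", "max_age": "30d"},
})

# Merge the existing daily indices into one, sliced for throughput.
requests.post(f"{ES}/_reindex?slices=auto&wait_for_completion=false", json={
    "source": {"index": "a_data_2022_12_*"},
    "dest": {"index": "a_data_merged"},
})
```

slices=auto lets Elasticsearch parallelize the copy (one slice per source shard), and keeping number_of_replicas at 0 on the destination until the copy finishes avoids paying replication costs during the bulk reindex.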

Related

Get only N months of data indexed in a collection, on a rolling basis

Currently I am facing an issue where a MongoDB collection might have billions of records, containing documents logged for rapid events happening in the system.
Since we have 2-3 composite indexes on the same collection, searches have definitely become slow.
The way out is that our customer has agreed that if we index only the last N months of data in MongoDB, read efficiency can increase, instead of keeping 2-3 years of data indexed and reading against that.
My thoughts on solution 1: we can use TTL indexes and set an expiry. After this expiry the data gets deleted from the main collection, and we can somehow back up the expired records. This way the main collection holds only the data we actually need.
My thoughts on solution 2: I can drop all the indexes and recreate them based on a time frame, i.e., create indexes that cover only the past N months of data. This way I maintain a limited index, but I am not sure how feasible that is.
Question: I need more help on how I can achieve this selective indexing. It must also be rolling, since records are added every day and the indexing has to follow.
If you're on Mongo 3.2 or above, you should be able to use a partial index to create the "selective index" that you want -- https://docs.mongodb.com/manual/core/index-partial/#index-type-partial You'll just need to be sure that your queries share the same partial filter expression that the index has.
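If it helps, here is a minimal pymongo sketch of such a partial index. The events collection and createdAt field are made up for illustration; note that the cutoff date is fixed at index-creation time, so keeping the window truly rolling means periodically dropping and recreating the index:

```python
from datetime import datetime, timedelta, timezone
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
events = client.mydb.events  # hypothetical database/collection

cutoff = datetime.now(timezone.utc) - timedelta(days=90)  # last ~3 months

# Partial index: only documents newer than the cutoff get index entries.
events.create_index(
    [("createdAt", ASCENDING), ("status", ASCENDING)],
    partialFilterExpression={"createdAt": {"$gte": cutoff}},
    name="recent_events_idx",
)

# Queries must include the partial filter expression to be eligible for the index.
recent = events.find({"createdAt": {"$gte": cutoff}, "status": "FAILED"})
```

For solution 1, the TTL variant is the same call with expireAfterSeconds, e.g. events.create_index([("createdAt", ASCENDING)], expireAfterSeconds=90 * 24 * 3600), though TTL deletes the documents themselves rather than merely leaving them unindexed.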
(I suspect that there might also be issues with the indexes you currently have, and that reducing index size won't necessarily have a huge impact on search duration. Mongo indexes are stored in a B-tree, so the time to navigate the tree to find a single item is going to scale relative to the log of the number of items. It might be worth examining the explain output for the queries that you have to see what mongo is actually doing.)
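Checking what the planner actually does is quick via the cursor's explain output (reusing the hypothetical events collection from the sketch above):

```python
# Look for IXSCAN (index scan) vs. COLLSCAN (full collection scan).
plan = events.find({"createdAt": {"$gte": cutoff}, "status": "FAILED"}).explain()
print(plan["queryPlanner"]["winningPlan"])
```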

When to use Cassandra vs. Solr in DSE?

I'm using DSE for Cassandra/Solr integration, so that data is stored in Cassandra and indexed in Solr. It's very natural to use Cassandra for CRUD operations and Solr for full-text search respectively, and DSE really simplifies data synchronization between Cassandra and Solr.
When it comes to querying, however, there are actually two ways to go: Cassandra secondary/manually configured indexes vs. Solr. I want to know when to use which method, and what the performance difference is in general, especially under a DSE setup.
Here is one example use case from my project. I have a Cassandra table storing item entity data. Besides the basic CRUD operations, I also need to retrieve items by equality on some field (say, category) and then sort them by some order (in my case, a like_count field).
I can think of three different ways to handle it:
1. Declare indexed=true in the Solr schema for both the category and like_count fields, and query in Solr.
2. Create a denormalized table in Cassandra with primary key (category, like_count, id).
3. Create a denormalized table in Cassandra with primary key (category, order, id) and use an external component, such as Spark or Storm, to sort the items by like_count.
The first method seems to be the simplest to implement and maintain: I just write some trivial Solr access code, and the rest of the heavy lifting is handled by Solr/DSE Search.
The second method requires manual denormalization on create and update (sketched below), and I need to maintain a separate table. There is also a tombstone issue, since like_count can be updated frequently. The good part is that reads may be faster (if there are no excessive tombstones).
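For reference, method 2 is usually modeled with like_count as a descending clustering column, so the top items come back already in order. A sketch with the DataStax Python driver; the keyspace, table, and columns are made up to match the example:

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("mykeyspace")  # hypothetical keyspace

# Denormalized query table: partition by category, highest like_count first.
session.execute("""
    CREATE TABLE IF NOT EXISTS items_by_category (
        category   text,
        like_count int,
        id         uuid,
        title      text,
        PRIMARY KEY ((category), like_count, id)
    ) WITH CLUSTERING ORDER BY (like_count DESC, id ASC)
""")

# The top-N query needs no sorting at read time.
rows = session.execute(
    "SELECT id, title, like_count FROM items_by_category "
    "WHERE category = %s LIMIT 10",
    ("games",),
)
```

Every like_count change then becomes a delete plus an insert in this table, which is exactly the tombstone concern mentioned above.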
The third method can alleviate the tombstone issue at the cost of one extra component for sorting.
Which method do you think is the best option? What is the difference in performance?
Cassandra secondary indexes have limited use cases:
No more than a couple of columns indexed.
Only a single indexed column in a query.
Too much inter-node traffic for high cardinality data (relatively unique column values)
Too much inter-node traffic for low cardinality data (high percentage of rows will match)
Queries need to be known in advance so data model can be optimized around them.
Because of these limitations, it is common for apps to create "index tables" which are indexed by whatever column is desired. This requires either duplicating the data from the main table into each index table, or issuing an extra query: read the main key from the index table, then read the actual row from the main table. Queries on multiple columns have to be manually indexed in advance, making ad hoc queries problematic, and any duplicated data has to be manually updated by the app in each index table.
Other than that, they will work fine in cases where a "modest" number of rows is selected from a modest number of nodes, and the queries are well specified in advance rather than ad hoc.
DSE/Solr is better for:
A moderate number of columns are indexed.
Complex queries with a number of columns/fields referenced - Lucene matches all specified fields in a query in parallel. Lucene indexes the data on each node, so nodes query in parallel.
Ad hoc queries in general, where the precise queries are not known in advance.
Rich text queries such as keyword search, wildcard, fuzzy/like, range, inequality.
There is a performance and capacity cost to using Solr indexing, so a proof-of-concept implementation is recommended to evaluate how much additional RAM, storage, and how many nodes are needed; that depends on how many columns you index, the amount of text indexed, and any text-filtering complexity (e.g., n-grams need more). It could range from a 25% increase for a relatively small number of indexed columns to 100% if all columns are indexed. Also, you need enough nodes that the per-node Solr index fits in RAM, or mostly in RAM if using SSDs. And vnodes are not currently recommended for Solr data centers.
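For comparison, method 1 from the question ends up as a single CQL statement in DSE Search via the solr_query pseudo-column. A sketch under the same hypothetical schema, assuming category and like_count are indexed in Solr:

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("mykeyspace")  # hypothetical keyspace

# DSE Search exposes the Lucene index through the solr_query column.
rows = session.execute(
    "SELECT id, title, like_count FROM items "
    "WHERE solr_query = %s LIMIT 10",
    ('{"q": "category:games", "sort": "like_count desc"}',),
)
for row in rows:
    print(row.id, row.title, row.like_count)
```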

Search using Solr vs Map Reduce on Files - which is reliable?

I have an application which needs to store a huge volume of data (around 200,000 transactions per day), each record around 100 KB to 200 KB in size. The data format will be JSON/XML.
The application should be highly available, so we plan to store the data on S3 or in AWS DynamoDB.
We have use cases where we may need to search the data by a few attributes (date ranges, status, etc.). Most searches will be on a few common attributes, but there may be arbitrary queries for certain operational use cases.
I researched ways to search non-relational data and so far found two approaches used by most technologies:
1) Build an index (Solr/CloudSearch, etc.)
2) Run a MapReduce job (Hive/HBase, etc.)
Our requirement is for the search results to be reliable (consistent with the data in S3/DB, something like an Oracle query; it is okay to be slow, but when we get the data, we should have everything that matched the query returned, or at least be told that some results were skipped).
At the outset it looks like the index-based approach would be faster than MR, but I am not sure if it is reliable: the index may be stale. (Is there a way to know the index was stale when we search, so that we can correct it? Is there a way to keep the index always consistent with the values in the DB/S3, similar to the indexes on Oracle DBs?)
The MR job seems to always be reliable (as it fetches the data from S3 for each query); is that assumption right? Is there any way to speed up such a query, maybe by partitioning the data in S3 and running multiple MR jobs, one per partition?
You can <commit /> and <optimize /> the Solr index after you add documents, so I'm not sure a stale index is a concern. I set up a Solr instance that handled maybe 100,000 additional documents per day. At the time I left the job we had 1.4 million documents in the index. It was used for internal reporting and it was performant (the most complex query took under a minute). I just asked a former coworker and it's still doing fine a year later.
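For what it's worth, the add-then-commit cycle is a couple of lines with the pysolr client (the core URL and document fields below are made up); once the commit returns, the new documents are visible to searches, which is why staleness stays controllable:

```python
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/txns", timeout=30)  # hypothetical core

# Index a batch; commit=True makes it searchable before the call returns.
solr.add(
    [{"id": "txn-0001", "status": "DONE", "date": "2023-01-15T00:00:00Z"}],
    commit=True,
)

# Attribute search with a range, like the date-range use case above.
results = solr.search("status:DONE AND date:[2023-01-01T00:00:00Z TO *]", rows=20)
print(results.hits)
```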
I can't speak to the map reduce software, though.
You should think about having one Solr core per week/month, for instance; that way older cores will be read-only, easier to manage, and very easy to spread over several Solr instances. If 200k docs are to be added per day forever, you need either that or Solr sharding; a single core will not be enough forever.

Best way to move a data row to another shard?

The question says it all.
Example: I'm planning to shard a database table. The table contains customer orders which are flagged as "active", "done" and "deleted". I also have three shards, one for each flag.
As far as I understand, a row has to be moved to the right shard when the flag is changed.
Am I right?
What's the best way to do this?
Can triggers be used?
I thought about not moving the row immediately, but only at the end of the day/week/month, but then it is not determined which shard a row with a specific flag resides in, and searches always have to go across all shards.
EDIT: Some clarification:
In general I have to choose a criterion to decide in which shard a row resides. In this case I want it to be the flag described above, because it's the most natural way to shard this kind of data (in my opinion). There is only a limited number of active orders, which are accessed very often; a large number of finished orders, which are seldom accessed; and a very large number of rows which are almost never accessed.
If I want to know where a specific row resides, I don't have to search all shards. If the user wants to load an active order, I already know which database I have to look in.
Now the flag, which is my sharding criterion, changes, and I want to know the best way to deal with this case. If I just kept the record in its original database, eventually all the data would accumulate in a single table.
In my opinion, keeping all active records in a single shard may not be a good idea. Under such a sharding strategy, all I/O is performed on a single database instance, leaving all the others highly underutilized.
An alternative sharding strategy is to distribute newly created rows among the shards using some kind of hash function (see the sketch below this list). This allows:
Quick lookup of a row.
Distribution of I/O across all shard instances.
No need to move data from one shard to another (except when you want to increase the number of shards).
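A minimal sketch of such hash-based routing, assuming the order ID is the distribution key (the connection strings are illustrative):

```python
import hashlib

SHARD_DSNS = [  # hypothetical connection strings, one per shard
    "postgresql://db0/orders",
    "postgresql://db1/orders",
    "postgresql://db2/orders",
]

def shard_for(order_id: str) -> str:
    """A stable hash of the key picks the shard. The flag plays no part,
    so rows never move when their status changes."""
    digest = hashlib.md5(order_id.encode()).hexdigest()
    return SHARD_DSNS[int(digest, 16) % len(SHARD_DSNS)]

print(shard_for("order-12345"))  # always resolves to the same shard
```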
Sharding usually refers to separating data into different databases on different servers. Oracle can do what you want using a feature called partitioned tables.
If you're using triggers (after/before update/insert), the move would be immediate; other methods would leave different types of data in the first ("active") shard until it is cleaned up.
I would also suggest doing this by date (like a monthly job that moves anything that's inactive and older than a month to a separate "Archive" database).
I'd like to ask you to reconsider doing this if you're doing it to increase performance (unless you have terabytes of data in this table). Please tell us why you want to shard and we'll all think about ways to solve your problem.

Searching across shards?

Short version
If I split my users into shards, how do I offer a "user search"? Obviously, I don't want every search to hit every shard.
Long version
By shard, I mean having multiple databases where each contains a fraction of the total data. For a (naive) example, the databases UserA, UserB, etc. might contain users whose names begin with "A", "B", etc. When a new user signs up, I simply examine his name and put him into the correct database. When a returning user signs in, I again look at his name to determine the correct database to pull his information from.
The advantage of sharding vs read replication is that read replication does not scale your writes. All the writes that go to the master have to go to each slave. In a sense, they all carry the same write load, even though the read load is distributed.
Meanwhile, shards do not care about each other's writes. If Brian signs up on the UserB shard, the UserA shard does not need to hear about it. If Brian sends a message to Alex, I can record that fact on both the UserA and UserB shards. In this way, when either Alex or Brian logs in, he can retrieve all his sent and received messages from his own shard without querying all shards.
So far, so good. What about searches? In this example, if Brian searches for "Alex" I can check UserA. But what if he searches for Alex by his last name, "Smith"? There are Smiths in every shard. From here, I see two options:
Have the application search for Smiths on each shard. This can be done slowly (querying each shard in succession) or quickly (querying each shard in parallel), but either way, every shard needs to be involved in every search. In the same way that read replication does not scale writes, having searches hit every shard does not scale your searches. You may reach a time when your search volume is high enough to overwhelm each shard, and adding shards does not help you, since they all get the same volume.
Some kind of indexing that is itself tolerant of sharding. For example, let's say I have a constant number of fields by which I want to search: first name and last name. In addition to UserA, UserB, etc., I also have IndexA, IndexB, etc. When a new user registers, I attach him to each index I want him to be found in. So I put Alex Smith into both IndexA and IndexS, and he can be found by either "Alex" or "Smith", but not by substrings. This way, you don't need to query each shard, so search might be scalable.
So can search be scaled? If so, is this indexing approach the right one? Is there any other?
There is no magic bullet.
Searching each shard in succession is out of the question, obviously, due to the incredibly high latency you will incur.
So you want to search in parallel, if you have to.
There are two realistic options, and you already listed them -- indexing, and parallelized search. Allow me to go into a little more detail on how you would go about designing them.
The key insight you can use is that in search, you rarely need the complete set of results. You only need the first (or nth) page of results. So there is quite a bit of wiggle room you can use to decrease response time.
Indexing
If you know the attributes on which the users will be searched, you can create custom, separate indexes for them. You can build your own inverted index, which will point to the (shard, recordId) tuple for each search term, or you can store it in the database. Update it lazily and asynchronously. I do not know your application requirements; it might even be possible to just rebuild the index every night (meaning you will not have the most recent entries on any given day -- but that might be ok for you). Make sure to optimize this index for size so it can fit in memory; note that you can shard this index if you need to.
Naturally, if people can search for something like "lastname='Smith' OR lastname='Jones'", you can read the index for Smith, read the index for Jones, and compute the union -- you do not need to store all possible queries, just their building parts.
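A minimal in-memory sketch of such an inverted index (in practice it would live in its own store and be updated asynchronously, as described above):

```python
from collections import defaultdict

# term -> set of (shard, record_id) postings
index: dict[str, set[tuple[str, int]]] = defaultdict(set)

def add_user(shard: str, record_id: int, first: str, last: str) -> None:
    """Register a user under each whole-name term (no substrings)."""
    index[first.lower()].add((shard, record_id))
    index[last.lower()].add((shard, record_id))

def search(*terms: str) -> set[tuple[str, int]]:
    """OR-query: the union of the posting lists, computed at query time."""
    return set().union(*(index[t.lower()] for t in terms))

add_user("UserA", 1, "Alex", "Smith")
add_user("UserB", 2, "Brian", "Jones")
print(search("smith", "jones"))  # {('UserA', 1), ('UserB', 2)}
```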
Parallel Search
For every query, send off requests to every shard, unless you know which shard to look in because the search happens to be on the distribution key. Make the requests asynchronous. Reply to the user as soon as you get the first page worth of results; collect the rest and cache it locally, so that if the user hits "next" you will have the results ready and do not need to re-query the servers. This way, if some of the servers are taking longer than others, you do not need to wait on them to service the request.
While you are at it, log the response times of the sharded servers to observe potential problems with uneven data and/or load distribution.
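A minimal scatter-gather sketch with concurrent.futures; query_shard is a stand-in for your real per-shard query:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

SHARDS = ["UserA", "UserB", "UserC"]  # illustrative shard names
PAGE_SIZE = 20

def query_shard(shard: str, last_name: str) -> list[dict]:
    # Stand-in for the real per-shard query (SQL, HTTP, ...).
    return [{"shard": shard, "last_name": last_name}]

def search_all(last_name: str) -> list[dict]:
    """Scatter the query to all shards in parallel and gather results as
    they arrive, stopping once the first page is full."""
    results: list[dict] = []
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        futures = [pool.submit(query_shard, s, last_name) for s in SHARDS]
        for future in as_completed(futures):
            results.extend(future.result())
            if len(results) >= PAGE_SIZE:
                break  # a real system replies here and caches the stragglers
    return results[:PAGE_SIZE]

print(search_all("Smith"))
```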
I'm assuming you are talking about shards a la:
http://highscalability.com/unorthodox-approach-database-design-coming-shard
If you read that article, he goes into some detail on exactly your question, but to make a long answer short: you write custom application code to bring your disparate shards together. You can do some smart hashing both to query individual shards and to insert data into shards. You need to ask a more specific question to get a more specific answer.
You actually do need every search to hit every shard, or at least every search needs to be performed against an index that contains the data from all shards, which boils down to the same thing.
Presumably you shard based on a single property of the user, probably a hash of the username. If your search feature allows the user to search based on other properties of the user it is clear that there is no single shard or subset of shards that can satisfy a query, because any shard could contain users that match the query. You can't rule out any shards before performing the search, which implies that you must run the query against all shards.
You may want to look at Sphinx (http://www.sphinxsearch.com/articles.html). It supports distributed searching. GigaSpaces has parallel query and merge support. This can also be done with MySQL Proxy (http://jan.kneschke.de/2008/6/2/mysql-proxy-merging-resultsets).
Building a non-sharded index kind of defeats the purpose of sharding in the first place :-) A centralized index probably won't work if shards were necessary.
I think all the shards need to be hit in parallel. The results need to be filtered, ranked, sorted, grouped and the results merged from all the shards. If the shards themselves become overwhelmed you have to do the usual (reshard, scale up, etc) to underwhelm them again.
RDBMSs are not a good tool for textual search. You will be much better off looking at Solr. The performance difference between Solr and a database will be on the order of 100x.
