Solr Architecture/Performance

Number of rows in SQL Table (Indexed till now using Solr): 1 million
Total Size of Data in the table: 4GB
Total Index Size: 3.5 GB
Total Number of Rows that I have to index: 20 Million (approximately 100 GB Data) and growing
What are the best practices for distributing the index? What I mean is: when should I distribute, and is there a magic number for index size per instance?
For 1 million rows alone, a Solr instance running on a VM takes me roughly 2.5 hrs to index. So for 20 million it would take roughly 60-70 hrs. That would be too much.
What would be the best distributed architecture for my case? It would be great if people could share their best practices and experience.
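For reference, distributing an index in SolrCloud comes down to creating the collection with more than one shard via the Collections API. The sketch below is a minimal, hypothetical example (the host, collection name, shard count and replica count are placeholders, not a recommendation); indexing throughput also improves by batching documents and indexing from several client threads in parallel.

    import requests

    # Hypothetical SolrCloud node; adjust host/port for your cluster.
    SOLR = "http://solr-node1:8983/solr"

    # CREATE spreads the collection over numShards cores so documents can be
    # indexed and queried in parallel across nodes.
    resp = requests.get(f"{SOLR}/admin/collections", params={
        "action": "CREATE",
        "name": "sql_table_index",   # placeholder collection name
        "numShards": 4,              # placeholder shard count
        "replicationFactor": 2,      # placeholder replica count
    })
    resp.raise_for_status()
    print(resp.json())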

Is AWS DynamoDB's free tier of 25 RCUs sufficient to read nearly 100 MB of data from 8 tables in total at any given instant?

I have text data in DynamoDB tables. There are 8 tables in total, each with a maximum size of 256 KB - 300 KB, which makes the total size of the DB about 2 MB - 2.5 MB.
I am reading the tables from an app, and at any given instant the app makes nearly 50 reads against the tables. That means at any given instant the total reads amount to about 100 MB.
Will the 25 RCUs provided by the AWS DynamoDB free tier be sufficient to carry out the above task, or will I be billed at the end of the month?
I need 50 reads for each table, and eventually consistent reads are fine.
Then you need 400 (50 * 8) eventually consistent (EC) reads per second. 1 RCU covers 2 EC reads per second, which means that performing 400 EC reads per second requires 200 RCUs, putting you way over your 25 RCUs.
Consequently, you will have to pay for the excess RCUs that you use.
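Spelling out that arithmetic (the 1 RCU = 2 eventually consistent reads per second rule assumes items of up to 4 KB; larger items consume proportionally more capacity):

    import math

    tables = 8
    reads_per_table_per_second = 50
    ec_reads_per_rcu = 2   # 1 RCU = 2 eventually consistent reads/s for items up to 4 KB

    total_reads_per_second = tables * reads_per_table_per_second        # 400
    rcus_needed = math.ceil(total_reads_per_second / ec_reads_per_rcu)  # 200

    free_tier_rcus = 25
    print(rcus_needed, rcus_needed > free_tier_rcus)   # 200 True -> well over the free tier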

SQL Server optimization

My application (industrial automation) uses SQL Server 2017 Standard Edition on a Dell T330 server with the following configuration:
Xeon E3-1200 v6
16 GB DDR4 UDIMMs
2 x 2 TB 7200 RPM HDDs (RAID 1)
In this database, I am saving the following tables:
Table: tableHistory
Insert interval: every 2 seconds
410 columns of type float
409 columns of type int
--
Table: tableHistoryLong
Insert interval: every 10 minutes
410 columns of type float
409 columns of type int
--
Table: tableHistoryMotors
Insert interval: every 2 seconds
328 columns of type float
327 columns of type int
--
Table: tableHistoryMotorsLong
Insert interval: every 10 minutes
328 columns of type float
327 columns of type int
--
Table: tableEnergy
Insert interval: every 700 milliseconds
220 columns of type float
219 columns of type int
Note:
When I generate reports/graphs, my application buffers the inserts, because the system cannot insert and query at the same time; the queries are quite heavy.
The columns hold values such as current, temperature, level, etc. This information is kept for one year.
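As a rough sizing sketch for the tables above (assuming SQL Server's default 8-byte float and 4-byte int, and ignoring row/page overhead, indexes and the two slow 10-minute tables):

    def row_bytes(n_float, n_int):
        # SQL Server float defaults to 8 bytes; int is 4 bytes.
        return n_float * 8 + n_int * 4

    def gb_per_year(n_float, n_int, insert_interval_s):
        rows_per_year = 365 * 24 * 3600 / insert_interval_s
        return row_bytes(n_float, n_int) * rows_per_year / 1024**3

    print(gb_per_year(410, 409, 2))    # tableHistory       ~72 GB/year of raw data
    print(gb_per_year(328, 327, 2))    # tableHistoryMotors ~58 GB/year
    print(gb_per_year(220, 219, 0.7))  # tableEnergy        ~111 GB/year

The real on-disk footprint will be larger once overhead and indexes are included, but it shows the raw measurements alone arrive at a few hundred GB per year.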
Question
With this level of processing, could I run into performance problems?
Do I need better hardware due to high demand?
Can my application break at some point due to the hardware?
Your question may be closed as too broad, but I want to elaborate on the comments and offer additional suggestions.
How much RAM you need for adequate performance depends on the reporting queries. Factors include the number of rows touched, the execution plan operators (sort, hash, etc.), and the number of concurrent queries. More RAM can also improve performance by avoiding IO, which is especially costly with spinning media.
A reporting workload (large scans) against a 1-2 TB database with traditional tables needs fast storage (SSD) and/or more RAM (hundreds of GB) to provide decent performance. The existing hardware is the worst case scenario because data are unlikely to be cached with only 16 GB RAM, and a single spindle can only read about 150 MB per second. Based on my rough calculation of the schema in your question, a monthly summary query of tblHistory will take about a minute just to scan 10 GB of data (assuming a clustered index on a date column). Query duration will increase with the number of concurrent queries, such that it would take at least 5 minutes per query with 5 concurrent users running the same query, due to disk bandwidth limitations. SSD storage can sustain multiple GB per second, so with the same query and RAM, the data transfer time for the query above drops to under 5 seconds.
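The disk-bandwidth arithmetic behind those estimates, using the same round numbers (10 GB scanned, roughly 150 MB/s for a single 7200 RPM spindle, a few GB/s for an SSD):

    scan_gb = 10
    hdd_mb_s = 150      # sequential read, single 7200 RPM spindle
    ssd_mb_s = 2500     # "multiple GB per second" class SSD

    def scan_seconds(gb, mb_per_s, concurrent_queries=1):
        # Concurrent scans share the same bandwidth, so duration scales with users.
        return gb * 1024 / mb_per_s * concurrent_queries

    print(scan_seconds(scan_gb, hdd_mb_s))     # ~68 s  -> "about a minute"
    print(scan_seconds(scan_gb, hdd_mb_s, 5))  # ~341 s -> 5+ minutes with 5 concurrent users
    print(scan_seconds(scan_gb, ssd_mb_s))     # ~4 s on SSD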
A columnstore (e.g. a clustered columnstore index), as suggested by #ConorCunninghamMSFT, will greatly reduce the amount of data transferred from storage because only the columns specified in the query are read, and the inherent columnstore compression will reduce both the size of data on disk and the amount transferred from disk. The compression savings will depend much on the actual column values, but I'd expect 50 to 90 percent less space compared to a rowstore table.
Reporting queries against measurement data are likely to specify date range criteria, so partitioning the columnstore by date will limit scans to the specified date range without a traditional b-tree index. Partitioning will also facilitate purging for the 12-month retention requirement with sliding-window partition maintenance (partition TRUNCATE, MERGE, SPLIT), and thereby greatly improve the performance of that process compared to a delete query.
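A minimal sketch of what that could look like, submitted through pyodbc to keep the examples in Python. The connection string, the SampleTime partitioning column and the boundary dates are hypothetical, and a table that already has a clustered rowstore index would first need that index dropped or rebuilt; treat this as an outline of the approach, not a ready-made script.

    import pyodbc

    # Hypothetical connection string.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
        "DATABASE=Automation;Trusted_Connection=yes;", autocommit=True)
    cur = conn.cursor()

    # Monthly partitions: reporting queries touch only the months they ask for,
    # and old months can be purged by partition instead of by DELETE.
    cur.execute("""
        CREATE PARTITION FUNCTION pf_history_month (datetime2(0))
        AS RANGE RIGHT FOR VALUES ('2024-01-01', '2024-02-01', '2024-03-01');""")
    cur.execute("""
        CREATE PARTITION SCHEME ps_history_month
        AS PARTITION pf_history_month ALL TO ([PRIMARY]);""")

    # Clustered columnstore on the partition scheme (SampleTime is assumed to be
    # the datetime column each row is stamped with).
    cur.execute("""
        CREATE CLUSTERED COLUMNSTORE INDEX cci_tableHistory
        ON dbo.tableHistory ON ps_history_month (SampleTime);""")

    # Sliding-window purge of the oldest month: with RANGE RIGHT the first
    # boundary's data sits in partition 2, so truncate it, then merge the boundary.
    cur.execute("TRUNCATE TABLE dbo.tableHistory WITH (PARTITIONS (2));")
    cur.execute("ALTER PARTITION FUNCTION pf_history_month() MERGE RANGE ('2024-01-01');")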

Snowflake auto scaling and concurrency

I have a medium Snowflake warehouse with the configuration below:
Min clusters: 1
Max clusters: 6
Max concurrency level: 12
I am running 100 parallel queries from 100 sessions (1 session, 1 query each). With 100 queries on a moderate data set, it takes at most 21 seconds to process all the queries, and most of the queries spend more than 90% of their time queuing. Even though we allow up to 6 clusters, Snowflake adds only 2 clusters for the whole workload. I am pretty confused, as I was expecting all the clusters to become active instead of seeing so much queuing time with just 2 clusters. Can you please help me here and share your experiences with auto scaling?
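As a back-of-envelope check on that expectation, assuming every query arrives at the same moment and each cluster admits MAX_CONCURRENCY_LEVEL queries at once (a simplification; how many clusters Snowflake actually starts also depends on the scaling policy and how quickly the queue drains):

    import math

    queries = 100
    max_concurrency_level = 12   # per cluster
    max_clusters = 6

    clusters_wanted = math.ceil(queries / max_concurrency_level)  # 9
    clusters_available = min(clusters_wanted, max_clusters)       # capped at 6
    print(clusters_wanted, clusters_available)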

How to decide SolrCloud shards per node?

We have 16 machines, each with 64 GB RAM and 4 cores. The index size is around 200 GB. Initially we decided to have 64 shards, i.e. 4 shards per node. We arrived at 4 shards per node because we have 4-core machines (4 cores can process 4 shards at a time). When we tested, the query qtime was pretty high. We re-ran the performance tests with fewer shards: once with 32 total shards (2 shards per node) and once with 16 total shards (1 shard per node). The qtime went down drastically (by up to 90%) with 16 shards.
So how is the number of shards per node decided? Is there a formula based on machine configuration and index volume?
One other thing you will want to review is the type and volume of queries you are sending to Solr. There is no single magic formula you can use; my best advice would be to test a few different alternatives and see which one performs best.
One thing to keep in mind is the JVM size and index size per server. I think it'd be nice if you could cache the entire index in memory on each box.
Additionally, make sure you are testing query response time with the queries you will actually be running, not just made up things. Things like grouping and faceting will make a huge difference.
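A quick sanity check on the memory point with the numbers from the question (arithmetic only; it ignores the JVM heap and any replicas):

    index_gb = 200
    nodes = 16
    ram_per_node_gb = 64

    index_per_node_gb = index_gb / nodes          # ~12.5 GB of index hosted per box
    print(index_per_node_gb < ram_per_node_gb)    # True: the node's slice fits in the OS page cache

    # Per-shard size for the three layouts that were tested:
    for total_shards in (64, 32, 16):
        print(total_shards, index_gb / total_shards, "GB per shard")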

Long Binary Array Compression

There is an array of binary values, and the number of elements in this array is around 10^20.
The number of "ones" in the array is around 10^10, and these ones are randomly distributed.
Once the data is generated and saved, it won't be edited: it will remain read-only for its whole life cycle.
Once the data is saved, requests will be received. Each request contains an index into the array, and the response should be the value at that particular index. The indexes in these requests are not in order (they may be random).
The question is: how do I encode this information to save space and at the same time get good performance when serving requests?
My thoughts so far are:
Store an array of the indexes of the "ones". I would then have an array of 10^10 elements, containing indexes in the range 0 - 10^20. Maybe not the best compression method, but it is easy to decode.
The optimum for compression: enumerate every combination (choosing 10^10 positions out of the 10^20 available), so the data is just the "id" of one such combination... but that could be a problem to decode, I think.
Look up "sparse array". If access speed is important, a good solution is a hash table of indices. You should allocate about 2x the space, requiring a 180 GB table. The access time would be O(1).
You could have just a 90 GB table and do a binary search for an index. The access time would be O(log n), if you're happy with that speed.
You can pack the indices more tightly, to less than 84 GB to minimize the size of the single-table approach.
You can break it up into multiple tables. E.g. if you had eight tables, each representing a possible value of the high three bits of the index, then only 64 bits (8 bytes) per entry remain and the tables would take 80 GB.
You can break it up further. E.g. with 2048 tables, each representing the high 11 bits of the index, 56 bits (7 bytes) per entry remain and the total would be 70 GB, plus some very small amount for the table of pointers to the sub-tables.
Even further, with 524288 tables, you can do six bytes per entry for 60 GB, plus the table of tables overhead. That would still be small in comparison, just megabytes.
Going up by another factor of 256 should still be a win. With 134 million sub-tables you could get it down to 50 GB (5 bytes per entry), plus less than a GB for the table of tables. So less than 51 GB. Then you could, for example, keep the table of tables in memory and load a sub-table into memory for each binary search. You could keep a cache of sub-tables in memory, throwing out old ones when you run out of space. Each sub-table would have, on average, only about 75 entries. The binary search is then around seven steps, after one step to find the sub-table. Most of the time will be spent getting the sub-tables into memory, assuming that you don't have 64 GB of RAM. Then again, maybe you do.
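A minimal sketch of the sub-table idea, using the 2048-table (70 GB) layout with 11 high bits; a real implementation would pack the 56-bit remainders into 7-byte entries in files or memory-mapped arrays rather than Python lists, but the lookup logic is the same.

    import bisect

    HIGH_BITS = 11                      # 2048 sub-tables, as in the 70 GB layout
    TOTAL_BITS = 67                     # 2**67 > 10**20, so 67 bits cover every index
    LOW_BITS = TOTAL_BITS - HIGH_BITS   # 56 bits (7 bytes) stored per entry
    LOW_MASK = (1 << LOW_BITS) - 1

    def build(sorted_one_indexes):
        """Split the sorted indexes of the ones into sub-tables keyed by the high bits."""
        tables = [[] for _ in range(1 << HIGH_BITS)]
        for idx in sorted_one_indexes:
            tables[idx >> LOW_BITS].append(idx & LOW_MASK)
        return tables

    def lookup(tables, idx):
        """Return 1 if position idx holds a one, else 0."""
        sub = tables[idx >> LOW_BITS]
        low = idx & LOW_MASK
        pos = bisect.bisect_left(sub, low)      # binary search within one sub-table
        return 1 if pos < len(sub) and sub[pos] == low else 0

    # Tiny usage example with made-up positions:
    ones = sorted([3, 10**15, 10**19 + 7])
    t = build(ones)
    print(lookup(t, 10**15), lookup(t, 4))      # 1 0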

Resources