We are facing high fingerprint-match Solr query times. Here is our setup info:
echonest/echoprint-server running on a single node (Solr 1.0) on an Amazon EC2 m3.2xlarge instance with 30 GB RAM & 8 cores
2.5 million tracks (segment count 19933333) ingested with Solr 1.0; the index size is around 91 GB.
Applied the HashQueryComponent.java optimization from https://github.com/playax/echoprint-server/commit/706d26362bbe9141203b2b6e7846684e7a417616#diff-f9e19e870c128c0d64915f304cf43677
Also tried to capture stats for the eval method: some of the loop iterations over the sequential sub-readers of the index reader took more than 1 second to iterate over all the terms.
Any suggestions or pointers in the right direction would be very helpful.
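Since the slow part appears to be inside eval(), one low-effort way to get comparable numbers after each tweak is to time queries from the client side. A minimal Python sketch, where the endpoint URL and query parameters are placeholders for however your echoprint query handler is actually invoked (adding debugQuery=true can additionally expose per-component timing, depending on how the handler is configured):

import time
import requests

SOLR_URL = "http://localhost:8502/solr/select"   # placeholder endpoint
SAMPLE_QUERIES = ["placeholder fingerprint code 1", "placeholder fingerprint code 2"]

timings = []
for fp in SAMPLE_QUERIES:
    start = time.time()
    requests.get(SOLR_URL, params={"q": fp, "wt": "json"}, timeout=60)
    timings.append(time.time() - start)

timings.sort()
print("median: %.3fs  p95: %.3fs  max: %.3fs" % (
    timings[len(timings) // 2],
    timings[int(len(timings) * 0.95)],
    timings[-1],
))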
I am researching the Snowflake database and have a data aggregation use case where I need to expose the aggregated data via a REST API. While the data ingestion and aggregation seem to be well defined, is Snowflake a system that can be used as an operational data store for servicing high-throughput APIs? Or is this an anti-pattern for this system?
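For what it's worth, here is a minimal sketch of what "Snowflake behind a REST endpoint" looks like, mainly useful for measuring per-request latency (including connection and warehouse overhead) before committing to the pattern. The connection parameters, the AGG_SALES table, and its columns are placeholders:

import snowflake.connector
from flask import Flask, jsonify

app = Flask(__name__)

def query_total(region_code):
    # Opening a connection per request is deliberately naive here -- the
    # connection/warehouse overhead is part of what you want to measure.
    conn = snowflake.connector.connect(
        account="my_account",      # placeholder
        user="my_user",            # placeholder
        password="my_password",    # placeholder
        warehouse="REPORTING_WH",  # placeholder
        database="ANALYTICS",      # placeholder
        schema="PUBLIC",
    )
    try:
        cur = conn.cursor()
        cur.execute(
            "SELECT SUM(value) FROM AGG_SALES WHERE region_code = %s",
            (region_code,),
        )
        row = cur.fetchone()
        return float(row[0]) if row and row[0] is not None else 0.0
    finally:
        conn.close()

@app.route("/regions/<region_code>/total")
def region_total(region_code):
    return jsonify({"region": region_code, "total": query_total(region_code)})

if __name__ == "__main__":
    app.run()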
Updating based on your recent comment.
Here are some quick test results I ran on large tables we have in production. (Table names changed for display.)
vLookupView records = 175,760,316
vMainView records = 179,035,026
SELECT
LP.REGIONCODE
, SUM(LP.VALUE)
FROM DBO.vLookupView AS LP
INNER JOIN DBO.vMainView AS M
ON LP.PK = M.PK
GROUP BY LP.REGIONCODE;
Results:
SQL SERVER
Production box - 2:04 minutes
**Snowflake:**
By Warehouse (compute) size
XS - 17.1 seconds
Small - 9.9 seconds
Medium - 7.1 seconds
Large - 5.4 seconds
Extra Large - 5.4 seconds
When I added a WHERE condition:
WHERE LP.ENTEREDDATE BETWEEN '1/1/2018' AND '6/1/2018'
the results were:
SQL SERVER
Production box - 5 seconds
**Snowflake:**
By Warehouse (compute) size
XS - 12.1 seconds
Small - 3.9 seconds
Medium - 3.1 seconds
Large - 3.1 seconds
Extra Large - 3.1 seconds
I'm trying to improve the performance of my Solr 6.0 index.
Originally we were indexing 45m rows using a SELECT statement joining 7 tables, and it was taking 7+ hours to index. This caused us to get a "snapshot too old" error, since the JDBC connection stays open for the entire duration of the indexing, causing our full index to fail.
We were able to archive about 10m rows and build an external table from the original 7-join SELECT. This simplified the query Solr was using to a SELECT * from one table.
Now we are indexing 35m rows using a SELECT * FROM ONE_BIG_External-TABLE and it's taking ~4-5 hrs at 2.3k docs/s ±250. Since we are using an external table, we shouldn't be getting "snapshot too old" from the UNDO stack.
We have 77 columns we are indexing.
So we found a solution for our initial issue, but now I'm looking to increase our indexing speed when doing clean full imports.
Referencing SolrPerformanceFactors I have tried:
Batch Sizes:
2000 - no change
6000 - no change
4000 - no change
Example:
<dataSource jndiName="xxxxxx" batchSize="2000" type="JdbcDataSource"/>
Autocommit:
Every 1 hour - no change
MergeFactor:
20 vs the default 10 - shaved off 20 mins
Indexed Fields:
Cut out 11 indexed fields - nothing
EDIT: Adding some information per the questions below. I set auto-commit to every hour, which didn't help at all, with a soft commit every second. I copied the parameters below from a much smaller Solr core we have here, which they said has been running well:
<autoCommit>
<maxTime>3600000</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
<maxTime>1000</maxTime>
</autoSoftCommit>
Are there any gotchas that I'm missing, other than throwing hardware at this?
Let me know if you need more info; I'll try to answer questions as best as I'm allowed.
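Not a fix, but while experimenting with these knobs it helps to verify the docs/s figure objectively. A small sketch that polls the DataImportHandler status endpoint during a full import and prints throughput; the URL/core name is a placeholder and the statusMessages key may differ slightly between Solr versions:

import time
import requests

STATUS_URL = "http://localhost:8983/solr/mycore/dataimport"  # placeholder core

def docs_processed():
    resp = requests.get(STATUS_URL, params={"command": "status", "wt": "json"})
    msgs = resp.json().get("statusMessages", {})
    # Key name as reported by DIH; treat it as an assumption for your version.
    return int(msgs.get("Total Documents Processed", 0))

prev, prev_t = docs_processed(), time.time()
while True:  # Ctrl-C to stop
    time.sleep(60)
    cur, now = docs_processed(), time.time()
    print("%.0f docs/s" % ((cur - prev) / (now - prev_t)))
    prev, prev_t = cur, now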
I'm running 5 nodes in one DC of Cassandra 3.10.
To maintain those nodes, I'm running the following on a daily basis on every node:
nodetool repair -pr
and weekly
nodetool repair -full
This is the only table I'm having difficulties with:
Table: user_tmp
SSTable count: 4
Space used (live): 366.71 MiB
Space used (total): 366.71 MiB
Space used by snapshots (total): 216.87 MiB
Off heap memory used (total): 5.28 MiB
SSTable Compression Ratio: 0.4690289976332873
Number of keys (estimate): 1968368
Memtable cell count: 2353
Memtable data size: 84.98 KiB
Memtable off heap memory used: 0 bytes
Memtable switch count: 1108
Local read count: 62938927
Local read latency: 0.324 ms
Local write count: 62938945
Local write latency: 0.018 ms
Pending flushes: 0
Percent repaired: 76.94
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 4.51 MiB
Bloom filter off heap memory used: 4.51 MiB
Index summary off heap memory used: 717.62 KiB
Compression metadata off heap memory used: 76.96 KiB
Compacted partition minimum bytes: 51
Compacted partition maximum bytes: 654949
Compacted partition mean bytes: 194
Average live cells per slice (last five minutes): 2.503074492537404
Maximum live cells per slice (last five minutes): 179
Average tombstones per slice (last five minutes): 1.0
Maximum tombstones per slice (last five minutes): 1
Dropped Mutations: 19 bytes
Percent repaired never goes above 80% for this table on this node and one other, but on the other nodes it is above 85%. RF is 3, and the compaction strategy is SizeTieredCompactionStrategy.
gc_grace_seconds is set to 10 days, and somewhere in that period I get a write timeout on exactly this table, but the consumer that got the timeout is immediately replaced by another one and everything keeps going like nothing happened. It's like a one-off write timeout.
My question is: do you maybe have a suggestion for a better repair strategy? I'm kind of a noob, so every suggestion is a big win for me, plus anything else for this table.
Maybe repair -inc instead of repair -pr?
The nodetool repair command in Cassandra 3.10 defaults to running incremental repair. There have been some major issues with incremental repair, and the community currently does not recommend running it. Please see this article for some great insight into repair and the issues with incremental repair: http://thelastpickle.com/blog/2017/12/14/should-you-use-incremental-repair.html
I would recommend, as do many others, running:
nodetool repair -full -pr
Please be aware that you need to run repair on every node in your cluster. This means that if you run repair on one node per day you can have a maximum of 7 nodes (since with the default gc_grace you should aim to finish repair within 7 days). You also have to rely on nothing going wrong during a repair, since you would have to restart any failing jobs yourself.
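If you script it yourself, the bare minimum is a sequential loop like the sketch below (the host names and password-less ssh access to the nodes are assumptions), so only one node repairs at a time and a failed run gets retried:

import subprocess

NODES = ["cass-node1", "cass-node2", "cass-node3", "cass-node4", "cass-node5"]  # placeholders

for node in NODES:
    cmd = ["ssh", node, "nodetool", "repair", "-full", "-pr"]
    print("repairing", node)
    result = subprocess.run(cmd)
    if result.returncode != 0:
        # A failed repair leaves ranges unrepaired; rerun before moving on
        # (check=True raises if the retry fails too, so you notice).
        print("repair failed on", node, "- retrying once")
        subprocess.run(cmd, check=True)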
This is why tools like Reaper exist. It solves these issues with ease: it automates repair and makes life simpler. Reaper runs scheduled repairs and provides a web interface to make administration easier. I would highly recommend using Reaper for routine maintenance and nodetool repair for unplanned activities.
Edit: Link http://cassandra-reaper.io/
I set up a Solr server for indexing my trending data (4 million records).
When I tested it, it was very fast (0.3 seconds). Then I found that the result looks different from the result I get when querying MySQL.
In MySQL I get about 4000 records that belong to a set of user IDs, then I sort and group them in JavaScript.
Solr returns about 4000 records for the same set of IDs. After I sorted and grouped them in JavaScript, I found some records were missing (about 10). I then added 'rows': 5000000 to the Solr query and got the same result as MySQL, but the time increased from 0.3 seconds to 1.7 seconds.
So I wonder: is adding 'rows': 5000000 the only way to make Solr return all the data I need? If yes, am I doing it right? If not, what else should I use?
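For context, a sketch of the two usual alternatives to a blanket 'rows': 5000000: either request a bounded but sufficient rows value and verify it against numFound, or page with cursorMark (available since Solr 4.7). The URL, core, and field names (user_id, id) below are placeholders:

import requests

SOLR = "http://localhost:8983/solr/trending/select"  # placeholder URL/core
params = {
    "q": "user_id:(101 102 103)",  # placeholder set of user IDs
    "wt": "json",
    "rows": 10000,                 # option 1: a bounded but generous limit
}
resp = requests.get(SOLR, params=params).json()
num_found = resp["response"]["numFound"]
docs = resp["response"]["docs"]
if len(docs) < num_found:
    print("increase rows or switch to cursorMark paging")

# Option 2: cursorMark paging; the sort must end on the uniqueKey field (here 'id').
params.update({"rows": 1000, "sort": "id asc", "cursorMark": "*"})
all_docs = []
while True:
    page = requests.get(SOLR, params=params).json()
    all_docs.extend(page["response"]["docs"])
    next_cursor = page["nextCursorMark"]
    if next_cursor == params["cursorMark"]:
        break
    params["cursorMark"] = next_cursor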
My program fetches ~100 entries in a loop. All entries are fetched using get_by_key_name(). Appstats shows that some get_by_key_name() requests take as much as 750 ms! (Other big values are 355 ms, 260 ms, 230 ms.) The average for the other fetches ranges from 30 ms to 100 ms. These times are real time and hence contribute towards 'ms' and not 'cpu_ms'.
Due to the above, the total time taken to return the webpage is very high: ms=5754, where cpu_ms=1472. (The above times are seen repeatedly for back-to-back requests.)
Environment: Python 2.7, webapp2, jinja2, High Replication, No other concurrent requests to the server, Frontend Instance Class is F1, No memcache set yet, max idle instances is automatic, min pending latency is automatic, using db (NOT NDB).
Any help will be greatly appreciated, as I based the whole database design on fetching entries from the datastore using only get_by_key_name()!
Update:
I tried profiling using time.clock() before and immediately after every get_by_key_name() call. The difference I get from time.clock() for every single call is 10 ms! (Just to clarify: get_by_key_name() is called on different Kinds.)
According to time.clock(), the total execution time (in wall-clock time) is 660 ms. But the real time (the 'ms' value) is 5754, and cpu_ms is 1472 per the GAE logs.
Summary of Questions:
[Update: this was addressed by passing a list of key names; see the sketch after this list.] Why is get_by_key_name() taking that long?
Why is ms (5754) so much more than cpu_ms (1472)? Is task execution halted/waiting for 75% (1 - 1472/5754) of the time, which is why the real (wall-clock) time is so long as far as the end user is concerned?
If the above is true, then why does time.clock() show that only 660 ms (wall-clock time) elapsed between the start of the first get_by_key_name() request and the last (~100th) one, while GAE reports this time as 5754 ms?
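For reference, the batch fetch mentioned in the bracketed update above: db's get_by_key_name() also accepts a list of key names, so ~100 entities come back in one datastore round trip instead of ~100 sequential RPCs. The model class and key names below are placeholders:

from google.appengine.ext import db

class TrendingEntry(db.Model):  # placeholder model
    value = db.IntegerProperty()

key_names = ["entry-%d" % i for i in range(100)]  # placeholder key names

# Before: one synchronous RPC per entity; the per-call waits add up to the large 'ms'.
entries_slow = [TrendingEntry.get_by_key_name(name) for name in key_names]

# After: a single batch RPC; missing keys come back as None in the result list.
entries_fast = TrendingEntry.get_by_key_name(key_names)

# If the keys span several Kinds, batch per Kind, or build full db.Key objects
# (e.g. with db.Key.from_path) and fetch them together via db.get(list_of_keys).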