Particular Solr core fills up all hard disk space - solr

I have a Solr instance running with a couple of cores, each holding between 15 and 25 million documents. Normally each core's index takes around 30-50 GB on disk, but one particular core's index keeps growing until the hard disk is full (rising to 200 GB and more).
When I look at the other indexes, all files are from the current day, but this one core also keeps files from 4-5 days ago (I guess the data is duplicated on every import).
What could be causing such behavior and what should I look for when debugging it? Thanks.
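As a first debugging step, the observation about file ages can be confirmed by totalling the size of the index files per modification day. The following is a minimal sketch, not a Solr tool: `size_by_day` is a hypothetical helper name, and the directory you pass to it (the core's index directory) depends on your installation.

```python
import os
import time
from collections import defaultdict


def size_by_day(index_dir):
    """Return {YYYY-MM-DD: total bytes} for the files directly under index_dir."""
    totals = defaultdict(int)
    for entry in os.scandir(index_dir):
        if entry.is_file():
            info = entry.stat()
            # Group by the local calendar day of the last modification.
            day = time.strftime("%Y-%m-%d", time.localtime(info.st_mtime))
            totals[day] += info.st_size
    return dict(totals)


def format_report(totals):
    """Render one line per day, oldest first, with sizes in GiB."""
    return [f"{day}  {size / 1024**3:8.2f} GiB" for day, size in sorted(totals.items())]
```

Running `format_report(size_by_day(path))` for the healthy cores and for the growing one shows how much of the disk usage belongs to files from previous days.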

Related

Writing to DocumentDB timed out because of large dataset

I have a large dataset of over 1000 very large .csv files (approximately 700 MB each). I scheduled it for upload to DocumentDB with AWS Glue, and after 48 hours the job timed out. I know some of the data was uploaded, but some was left out because of the timeout.
I haven’t tried anything yet because I am not sure where to go from here. I only want one copy of the data in DocumentDB, and if I re-upload I will likely end up with 1.5 times the amount of data. I also know that the connection was not at fault, because I saw the DocumentDB CPU spike and checked that data was arriving there.
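One possible way to make a re-upload safe is to give every document a deterministic `_id` derived from its content, so that writing the same row twice targets the same document instead of creating a copy. The sketch below only shows the ID derivation; it assumes that a row's content identifies it (identical rows would collapse into one document), the function names are hypothetical, and the actual write to the database (for example an upsert keyed on `_id`) is not shown.

```python
import csv
import hashlib
import io


def row_to_document(row):
    """Build a document whose _id is a stable hash of the row's content."""
    # Sort the keys so that column order does not change the hash.
    canonical = "\x1f".join(f"{key}={row[key]}" for key in sorted(row))
    doc = dict(row)
    doc["_id"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return doc


def documents_from_csv(text):
    """Yield one document per CSV row, each with a deterministic _id."""
    for row in csv.DictReader(io.StringIO(text)):
        yield row_to_document(row)
```

Because the `_id` depends only on the row, rows that already made it into the database before the timeout would map to the same IDs on the second run.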

Why does CouchDB's _dbs.couch keep growing when purging/compacting DBs?

The setup:
CouchDB 2.0 running in Docker on a Raspberry Pi 3
A node application that uses PouchDB, also in Docker on the same Pi 3
The scenario:
At any given moment, the CouchDB instance has at most 4 databases with a total of about 60 documents
The node application purges (using PouchDB's destroy) and recreates these databases periodically (some of them every two seconds, others every 15 minutes)
The databases are always recreated with the newest entries
The reason for purging the databases instead of deleting their documents is that I'd otherwise have a huge number of deleted documents, and my web client can't handle syncing all of them
The problem:
The file var/lib/couchdb/_dbs.couch keeps growing and never shrinks. Last time I left it alone for three weeks, and it grew to 37 GB. Fauxton showed that CouchDB contains only these up to 60 documents, but the file still keeps growing until it fills all the available space
What I tried:
running everything on an x86 machine (OS X)
running CouchDB without Docker (because of this info)
using CouchDB 2.1
running compaction manually (which didn't do anything)
googling for about 3 days now
Whatever I do, I always get the same result: _dbs.couch keeps growing. I also wasn't really able to find out what that file's purpose is. Googling that specific filename only yields two pages of search results, none of which are specific.
The only thing I can currently do is manually delete this file from time to time and restart the Docker container, which does delete all my databases, but that is not a problem as the node application recreates them soon after.
The _dbs database is a meta-database. It records the locations of all the shards of your clustered databases, but since it is a CouchDB database too (though not a sharded one), it also needs compacting from time to time.
Try:
curl localhost:5986/_dbs/_compact -XPOST -Hcontent-type:application/json
You can enable the compaction daemon to do this for you, and we enable it by default in the recent 2.1.0 release.
Add this to the end of your local.ini file and restart CouchDB:
[compactions]
_default = [{db_fragmentation, "70%"}, {view_fragmentation, "60%"}]

Solr schema overhead with lots of cores

I have a Solr server that indexes billions of logs per day (50k/sec).
For performance reasons I prefer to index every 60M documents (around 20 minutes) in a different core, since the insert rate drops after that.
This means I'll have 72 cores per day.
Most searches will be only on the last day, but on some occasions I'll need to search 7-30 days back.
All cores share the exact same schema / settings.
Does Solr know to hold the schema/settings once for all cores, or will it need to create them for each core it loads for a query? (I'm planning to use transient cores and keep only 30 loaded at any given moment.)
I'm specifically concerned about the overhead of loading the schema/settings for tens / hundreds of cores for performing queries.
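To put the scale in numbers, the core counts implied by the figures in the question can be worked out directly. This is only arithmetic on the values quoted above (50k documents per second, 60M documents per core); `cores_for_days` is a hypothetical helper name.

```python
DOCS_PER_SECOND = 50_000        # ingest rate from the question
DOCS_PER_CORE = 60_000_000      # documents indexed into each core
SECONDS_PER_DAY = 24 * 60 * 60

seconds_per_core = DOCS_PER_CORE / DOCS_PER_SECOND   # 1200 s, i.e. 20 minutes
cores_per_day = SECONDS_PER_DAY / seconds_per_core   # 72 cores per day
docs_per_day = DOCS_PER_SECOND * SECONDS_PER_DAY     # 4.32 billion documents


def cores_for_days(days):
    """Number of cores that cover a search window of the given length."""
    return int(cores_per_day * days)
```

With these figures a 7-day search spans 504 cores and a 30-day search spans 2160, which is the range over which the schema loading cost would apply.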

Shared memcache vs. dedicated memcache & quota calculation

I have an app that stores ~20000 entries in the memcache. Each entry is a Serializable with one String and two integers. I set the expiration time to 6 hours.
So I was using the shared/free memcache. It only seemed to store ~5000 entries -> ~7mb. The oldest entry is always just some minutes old. Why is that?
Then I thought: let's switch to dedicated memcache. Then the cache runs fine, it stores all entries, oldest entry is 6 hours old, everything is as expected. Except for the quota. After just some hours it says that I already used 18 "Gbyte hours".
Well my total cache size is ~11mb. So I would guess that the cost would be ($0.12 / Gbyte / hr) -> $0.12*~0.01Gb*24hrs per day, which would be just ~$0.03.
What am I doing wrong? Is my calculation wrong? Do I misunderstand the meaning of "Gbyte hours"?
AppEngine dedicated memcache is priced per 1GB chunks, not based on what you use in your dedicated allocated 1GB of storage. See here https://developers.google.com/appengine/docs/adminconsole/memcache
Your dedicated memcache will cost you a flat $2.88 per day.
The "dedicated" option gives you not only space but also operations per second (10,000 ops per second).
Regarding your experience on the shared (and free) memcache, what you are seeing here is not what you should generally expect. You are unfortunately most likely on an AppEngine cluster where some other apps are abusing the shared memcache.
The dedicated memcache bills by the gigabyte. So anything less than 1GB is billed as 1GB, and you probably had it running for 18hrs. Yeah, kinda sucks.
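The difference between the expected and the actual charge can be reproduced with a small calculation. This is a sketch of the billing rule as the answers describe it (usage is rounded up to whole gigabytes, at the $0.12 per GB per hour quoted in the question); the function name is hypothetical, and current pricing may differ.

```python
import math

PRICE_PER_GB_HOUR = 0.12  # USD, the rate quoted in the question


def dedicated_memcache_cost(used_gb, hours):
    """Cost when anything below a full gigabyte is billed as a full gigabyte."""
    billed_gb = max(1, math.ceil(used_gb))
    return billed_gb * hours * PRICE_PER_GB_HOUR


def naive_cost(used_gb, hours):
    """Cost the question expected: pay only for the bytes actually stored."""
    return used_gb * hours * PRICE_PER_GB_HOUR
```

For an ~11 MB cache (about 0.011 GB) over 24 hours, `naive_cost` gives roughly $0.03, while `dedicated_memcache_cost` gives $2.88, matching the flat daily figure in the answer. One billed gigabyte running for 18 hours also accounts for the 18 "Gbyte hours" shown in the quota.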

Performance of Solr Sharded is worse than Solr Unsharded

I have been tasked with building a test index of about 100million small records using Solr. I have been running this on my laptop since some time yesterday, incrementing it by 10million records at a time and running queries at the major "milestones" (10m, 20m... etc). I have reached about 70million records, and all is going well... The laptop specs are as follows:
Quad Core i7
8Gb RAM
Windows 7
Tomcat 7 + latest version of Solr.
As a test, I decided to see what happens when I run a similar workload on my home workstation (dual processor, quad-core Xeon, 12 GB RAM, 2x 10K RPM disks in RAID 0 for the index, Windows 2008 R2, same software). The only difference is that I am now using multiple cores, with the same schema and conf directory from the laptop and a modified solr.xml.
Anyway, on the laptop, at about 70 million records, I am getting results in less than 500 ms. That's for 150 queries, 100 of which are one-word and 50 two-word queries. Only one field is queried (the name field). All good... On my workstation, using multiple cores and the following query string, I am getting times in excess of 4-5 seconds!
http://localhost:8080/solr/core0/select?shards=localhost:8080/solr/core0,localhost:8080/solr/core1,localhost:8080/solr/core2,localhost:8080/solr/core3&q=Name:Test Name
This is a new index I have generated: I loop from 0 to 100,000,000, and every time i % 10000 == 0 I add the documents to a Solr core. Each time that happens, I increment a commitID; when commitID % 4 == 0 the batch goes to core0, when it is 1 to core1, etc...
I am pretty sure it's a config issue somewhere, but I just want to make sure: should I expect this to be a lot faster? Both processors (laptop and workstation) are in the 2.2 GHz range. Both are recent enough architectures (Nehalem on the workstation, an i7 from 2010 on the laptop). So, any ideas what I should be looking at?
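The batch routing described above can be sketched as follows, to make the resulting distribution explicit. This is an illustration only: it assumes commitID starts at 0, uses hypothetical function names, and leaves out the actual calls that send documents to Solr.

```python
BATCH_SIZE = 10_000   # documents added per commit, from the question
NUM_CORES = 4         # core0 .. core3


def core_for_batch(commit_id, num_cores=NUM_CORES):
    """Pick the target core for a batch by round-robin on the commit ID."""
    return f"core{commit_id % num_cores}"


def batches(total_docs, batch_size=BATCH_SIZE):
    """Yield (commit_id, first_doc, end_doc_exclusive) for each batch."""
    for commit_id, start in enumerate(range(0, total_docs, batch_size)):
        yield commit_id, start, min(start + batch_size, total_docs)


def docs_per_core(total_docs):
    """Count how many documents each core receives under this routing."""
    counts = {}
    for commit_id, start, end in batches(total_docs):
        core = core_for_batch(commit_id)
        counts[core] = counts.get(core, 0) + (end - start)
    return counts
```

For 100,000,000 documents this yields 25,000,000 documents per core, so every query has to be sent to all four shards, as the `shards` parameter in the query string does.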
