Find the number of already-existing documents in Solr with the solrindex job in Nutch

In Nutch's solrindex job, how can we determine how many documents were updated in Solr versus how many were indexed as new documents?

You can use this to see stats and per-status counts (fetched, not_modified, gone, ...):
bin/nutch readdb crawl/crawldb/ -stats
Alternatively, you can dump the crawldb to see every URL that has been crawled along with its status:
bin/nutch readdb crawl/crawldb/ -dump whole_db
vi whole_db/part-r-00000
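As a rough sketch, you can also tally how many URLs are in each state directly from the dump. This assumes the default readdb dump format, where each record prints a "Status:" line:
grep -h 'Status:' whole_db/part-r-* | sort | uniq -c    # count of URLs per CrawlDb status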


Is it possible to restore documents in Solr using the tlog file?

Every time I add/update a document in Solr, the tlog file records the request for each commit.
For example, when I commit using the update command:
curl -XPOST -H 'Content-Type: application/json' 'http://localhost:8983/solr/sample_list/update' --data-binary '{"add":{"doc":{"id":"7","name":"test dfasdata jan565765"}},"commit":{}}'
the tlog file content looks like:
^B
^B)SOLR_TLOGA'strings<83>"id$name)version^#^#^#)<83>A^G^VYîÖ·^#^#^#^P^C^H?<80>^#^#á!7â7test dfasdata jan565765ã^G^VYîÖ·^#^#^#^#^#^#8<83>D`-SOLR_TLOG_END^#^#^#^Q
Is it possible to replay the request recorded in the tlog file for recovery purposes, given that it is not in a readable format?
I had the same problem and found a solution/workaround:
open your tlog file
remove the trailing "ƒD`-SOLR_TLOG_END " at the end of the file
start Solr
The content of the tlog file is then indexed/committed to the index. (A scripted version of this trim is sketched below.)
In my test case I had only one tlog file in the folder.
To find out which characters you should delete, add an example document in the admin web interface under "Documents" with a "commit within" value of 100000, then stop Solr immediately.
It is important to stop Solr before it performs the commit (which is why the commit-within time should be high).
Then copy the new tlog file to your desktop or another folder.
Start Solr again and compare the two files; you will see that once the tlog has been added to the index, Solr appends the "ƒD`-SOLR_TLOG_END " string.
In short:
The tlog is used to add documents to the index if Solr goes down before committing; when Solr comes back up, the tlog is replayed to add the documents that were not yet committed to the index.
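A hypothetical shell sketch of the manual trim, assuming a single tlog file, that the trailer is the final SOLR_TLOG_END marker plus the four bytes that precede it (verify against your own diff, as described above), and GNU coreutils. Always work on a copy:
TLOG=tlog.0000000000000000000          # assumed filename; use your actual tlog file
cp "$TLOG" "$TLOG.bak"                 # keep a backup before touching the file
offset=$(grep -abo 'SOLR_TLOG_END' "$TLOG" | tail -1 | cut -d: -f1)   # byte offset of last marker
truncate -s $((offset - 4)) "$TLOG"    # drop the marker and its 4-byte prefix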

Solr field UUID is not unique

I'm running Solr 5.1.0 on our Plone 4.2.6 system. In my schema.xml I set UID as the uniqueKey. There are approx. 82,000 documents on the system. After building an index, I found the following counts of indexed docs:
numDocs: 54537
maxDocs: 82561
deletedDocs: 28024
Furthermore, under 'core-specific-tools > Schema Browser > Load Term Info > Histogram' I found that 26,514 UID values are contained in only one document, but 28,022 values are contained in two documents each, which is why that many indexed documents were marked as deleted.
Can anyone tell me what could cause so many UIDs to be the same even though they should be distinct?
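One way to inspect this, sketched here with an assumed core name of plone, is to facet on UID and list only values that occur in more than one document:
curl 'http://localhost:8983/solr/plone/select?q=*:*&rows=0&facet=true&facet.field=UID&facet.mincount=2&facet.limit=20&wt=json'   # UID values appearing in 2+ live documents
Note that facets only count live (non-deleted) documents, so duplicates that have already been replaced will not show up here.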

Querying Solr: the /get request handler returns a result but the select handler does not

I have an id field in Solr that uniquely identifies a Solr document.
When querying Solr using the /get handler:
solr/{collection}/get?id=p_1266762970&fl=*
Result:
"doc":
{
"lastIndexed":"2014-12-25T09:48:56.509Z",
"id":"1266762970",
"solrId":"p_1266762970",
.....
}
But when querying using the Solr admin select handler, no documents are returned.
The Solr queries look like:
solr/{collection}/select?q=*:*&fq=solrId:p_1266762970
solr/{collection}/select?q=*:*&fq=id:1266762970
I tried doing a hard commit; it completed successfully, but the results are still the same.
I have other documents in Solr that show up correctly. This issue exists for only some of the ids (8 out of 2.3 million).
Update: the uniqueKey is
<uniqueKey>solrId</uniqueKey>
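One difference worth knowing: /get is the realtime-get handler and can serve uncommitted documents straight from the update log, while select only sees what the last opened searcher has. As a hedged debugging step (the core names below are assumptions), you can query each shard's core directly with distrib=false to see which shard, if any, actually holds the document:
curl 'http://localhost:8983/solr/{collection}_shard1_replica1/select?q=solrId:p_1266762970&distrib=false'   # repeat for each shard/core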

How to delete a doc at a specific shard in Solr

I want to delete a specific doc on a specific shard in Solr; below is my query:
http://localhost:8080/solr/collections_1_replica1/update?stream.body=<delete><query>id:1</query></delete>&commit=true&distrib=false
But this still affects collections_2_replica1, so what is the correct query in this case?
If you use the default SolrCloud collection configuration, Solr chooses which shard a document goes to based on a hash of the document id (roughly docId.hash() % number of shards).
In other words, you're not supposed to delete from a specific shard, because you can't be sure whether the document is on that shard or on another one.
If I'm not mistaken, the distrib=false parameter has no effect on updates.
Try this one:
curl http://xx.xx.xx.xx:8983/solr/collection_name/update/?commit=true -H "Content-Type: text/xml" --data-binary '<delete><query>id:1</query></delete>'
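An equivalent sketch using the JSON update format, which deletes by id and lets Solr route the delete to the correct shard (the host and collection name are placeholders):
curl 'http://xx.xx.xx.xx:8983/solr/collection_name/update?commit=true' -H 'Content-Type: application/json' --data-binary '{"delete":{"id":"1"}}'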

Speeding up SOLR search

Search responses are extremely slow on Solr (Apache Lucene) 3.6.
Some performance enhancement techniques I'm experimenting with are:
1. Solr pagination
2. mergeFactor, currently set to 10 in solrconfig.xml
3. Solr facet queries
4. filterCache in solrconfig.xml set to size 512, using solr.FastLRUCache and autowarmCount=0 (see the config sketch after the resource list below)
5. queryResultCache in solrconfig.xml set to size 512 with autowarmCount=0
6. newSearcher, firstSearcher, and useColdSearcher
7. a single-segment index for 100,000 documents on a single-machine Solr server
How can I tune items 1-7 to improve Solr's search response time for a term/query?
Are there any other optimization parameters to consider that are not mentioned above?
You can also check the following:
SolrPerformanceFactors
ImproveSearchingSpeed
ImproveIndexingSpeed
SolrCaching
The Seven Deadly Sins of Solr
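For items 4-5 above, a hypothetical solrconfig.xml sketch with larger caches and some autowarming; the sizes are illustrative starting points, not tuned recommendations:
<!-- autowarmCount > 0 repopulates the cache from the old searcher after each commit -->
<filterCache class="solr.FastLRUCache" size="4096" initialSize="1024" autowarmCount="256"/>
<queryResultCache class="solr.LRUCache" size="1024" initialSize="256" autowarmCount="128"/>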
