Solr URL with '#' causing issue in multicore

Why does the Solr URL have a '#' as part of it? There were quite a few posts about this question in the past, e.g. http://lucene.472066.n3.nabble.com/Curious-why-Solr-Jetty-URL-has-a-sign-td4069434.html, but with no proper workaround.
I never had any problem when I was using a single core, but since I made my Solr instance multicore I have been having issues with the Solr URL containing '#' (pound sign).
For example,
solr url - http://localhost:8983/solr/
(when the above Solr admin URL loads in a browser, it changes to this - http://localhost:8983/solr/#/)
When I click on individual collections to get their url, this is what I get as seen below -
solr url for collection1 (core 1)- http://localhost:8983/solr/#/collection1
solr url for collection2 (core 2)- http://localhost:8983/solr/#/collection2
I have two different applications, each of which should query its own Solr collection, which means I have to provide a collection-specific Solr URL to each. When I used the URL http://localhost:8983/solr/#/collection1, the application that should use collection 'collection1' was unable to connect to Solr; it returned 'Problems were found while connecting to the SOLR server HTTP code=404 Not Found'. The same happens with the other application using collection 'collection2'.
Please tell me how I can get rid of the '#' in the Solr URL, or suggest any possible fix for the above issue.

The # is part of the URL generated by the admin dashboard; the admin UI uses it as a browser-side fragment for navigation. For actually interacting with a collection, the URL format is unchanged, just remove the #/ --
localhost:8983/solr/collection1
followed by /select, /update, or whichever handler you need.
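As a minimal sketch (assuming SolrJ 4.x and that the core really is named collection1), an application would point its client at the core path, with no #/ segment:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class QueryCollection1 {
    public static void main(String[] args) throws Exception {
        // Note the core name directly in the path -- no "#/" anywhere.
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        // Match-all query, just to prove the connection works.
        long numFound = server.query(new SolrQuery("*:*")).getResults().getNumFound();
        System.out.println("collection1 contains " + numFound + " documents");
    }
}

That same base URL (without the #) is what each application should be configured with.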

Related

Typo3 Solr - failed to execute page indexer request

I have a question related to the Typo3 v10 Solr extension. For some reason, some pages cannot be indexed properly. In the logs I see these errors:
Failed to execute Page Indexer Request. See log for details. Request ID:
In the logs it looks like some requests contain an empty raw body. The remaining pages of the same type have been indexed properly. Has anyone encountered similar issues with the Solr indexer? What could cause the issue in this case?

SOLR index status

Can someone help me with a URL for Solr to get the status of a specific index?
I know with ElasticSearch it is easy:
http://domain-name:9200/_cluster/stats?<index-name>
What is the equivalent in Solr?
Thanks in advance
There are a few URLs that provide information about the current state of the Solr server:
http://localhost:8080/solr/admin/cores?wt=json
http://localhost:8080/solr/<corename>/admin/luke?wt=json&show=index&numTerms=0
http://localhost:8080/solr/<corename>/admin/system?wt=json
http://localhost:8080/solr/<corename>/replication?command=details&wt=json
A good way to discover these URLs is to watch the "Network" tab in your browser's debug tools while browsing the admin page for a Solr server. All the information shown in the UI is fetched from the above (and several other) URLs (you can also see these requests in the logs of your application container).
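If you would rather fetch this programmatically than scrape those URLs, here is a rough sketch using SolrJ's core admin request (assuming SolrJ 4.x and the same host and core name as in the URLs above):

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;
import org.apache.solr.client.solrj.response.CoreAdminResponse;

public class CoreStatus {
    public static void main(String[] args) throws Exception {
        // Point at the Solr root, not at an individual core.
        HttpSolrServer server = new HttpSolrServer("http://localhost:8080/solr");
        CoreAdminResponse status = CoreAdminRequest.getStatus("corename", server);
        // The "index" section holds numDocs, segmentCount, size and similar details.
        System.out.println(status.getCoreStatus("corename").get("index"));
    }
}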

Search using SOLR is not up to date

I am writing an application in which I provide search capabilities based on Solr 4.
I am seeing strange behaviour: during massive indexing, search requests don't always "see" newly indexed data. It seems like the index reader is not being refreshed frequently enough, and only after I manually refresh the core from the Solr Core Admin window do the expected results come back...
I am indexing my data using JsonUpdateRequestHandler.
Is it a matter of configuration? Do I need to configure Solr to reopen its index reader more frequently somehow?
Changes to the index are not available until they are committed.
For SolrJ, do
HttpSolrServer server = new HttpSolrServer(host);
server.commit();
For XML, either send <commit/> or add ?commit=true to the update URL, e.g. http://localhost:8983/solr/update?commit=true
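If an explicit commit after every batch is too heavy during massive indexing, one alternative (a sketch, assuming SolrJ 4.x) is to pass a commitWithin time with each add, so Solr makes the documents searchable on its own within that window; the autoCommit / autoSoftCommit settings in solrconfig.xml do something similar on the server side.

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class AddWithCommitWithin {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "example-1");   // hypothetical document with just an id field
        // Ask Solr to commit this update within 5 seconds instead of committing per request.
        server.add(doc, 5000);
    }
}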

Finding or configuring Solr home directory

I'm following this tutorial on setting up django-haystack and solr: http://django-haystack.readthedocs.org/en/latest/tutorial.html
I hit a stumbling block here:
If you’re using the Solr backend, you have an extra step. Solr’s configuration is XML-based, so you’ll need to manually regenerate the schema. You should run ./manage.py build_solr_schema first, drop the XML output in your Solr’s schema.xml file and restart your Solr server.
Where is my schema.xml file located? It says it should be in the conf folder inside the Solr home directory. But where is the Solr home directory, and/or how do I configure its location?
The solr home is the place where you can find your schema.xml and solrconfig.xml, as well as some other files depending on the text analysis you're using (dictionaries for stemming, stopwords etc.), and where your index gets created by default.
There are a couple of ways to configure the solr home, since it is located outside of the servlet container:
solr.solr.home java system property (most used one)
java:comp/env/solr/home for JNDI lookup
You can either check your servlet container configuration or go to the Solr admin page http://host:port/solr/admin, which prints out the actual solr home location together with other information about the solr instance running.
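For example, with the example Jetty that ships with Solr 4, the home directory is typically passed as a system property when the server starts (the path below is only a placeholder):
java -Dsolr.solr.home=/path/to/solr/home -jar start.jar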
First check whether your Solr instance is working.
Go to http://localhost:8983/solr
If you can see the Solr web panel, you have a live Solr instance.
Now go to Java Properties in the admin UI.
Here you will see the JVM variables; this is where you can find the home directories (for example solr.solr.home).
Note that the schema is now managed by default. If you want to override this with a hand-edited schema.xml, you will have to adjust the configuration a bit.

Nutch solrindex command not indexing all URLs in Solr

I have a Nutch index crawled from a specific domain and I am using the solrindex command to push the crawled data to my Solr index. The problem is that it seems that only some of the crawled URLs are actually being indexed in Solr. I had the Nutch crawl output to a text file so I can see the URLs that it crawled, but when I search for some of the crawled URLs in Solr I get no results.
Command I am using to do the Nutch crawl: bin/nutch crawl urls -dir crawl -depth 20 -topN 2000000
This command is completing successfully and the output displays URLs that I cannot find in the resulting Solr index.
Command I am using to push the crawled data to Solr: bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
The output for this command says it is also completing successfully, so it does not seem to be an issue with the process terminating prematurely (which is what I initially thought it might be).
One final thing that I am finding strange is that the entire Nutch & Solr config is identical to a setup I used previously on a different server and I had no problems that time. It is literally the same config files copied onto this new server.
TL;DR: I have a set of URLs successfully crawled in Nutch, but when I run the solrindex command only some of them are pushed to Solr. Please help.
UPDATE: I've re-run all these commands and the output still insists it's all working fine. I've looked into any blockers for indexing that I can think of, but still no luck. The URLs being passed to Solr are all active and publicly accessible, so that's not an issue. I'm really banging my head against a wall here so would love some help.
I can only guess what happened, based on my experience:
There is a URL normalizer component (the regex normalizer reads its rules from regex-normalize.xml) which rewrites or truncates some URLs (removing URL parameters, session IDs, ...).
Additionally, Nutch enforces a unique constraint: by default each URL is only saved once.
So, if the normalizer truncates two or more URLs ('foo.jsp?param=value', 'foo.jsp?param=value2', 'foo.jsp?param=value3', ...) down to exactly the same one ('foo.jsp'), only one of them gets saved, and Solr will only see a subset of all your crawled URLs.
cheers
