Is there any limit on POST length for Solr?

I'm using Solr, but for some reason some of my docs are missing.
I want to know: is there any limit on the HTTP POST size,
or on the number of Solr docs per commit?
Thank you.

The POST size limit actually lies more on the server side than on the Solr side. If you run Solr in Tomcat, it's Tomcat's POST limit that usually matters; if you put your Tomcat behind Apache or nginx, their maximum POST/body sizes will matter as well.
As for POST itself, the HTTP spec doesn't impose any limit. Most of the time it's the server that limits it.
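For reference, a minimal sketch of where those limits usually live (the values below are only examples, not recommendations): in Tomcat it is the maxPostSize attribute on the HTTP connector in conf/server.xml, and in nginx it is the client_max_body_size directive.

<!-- Tomcat conf/server.xml: maxPostSize is in bytes; a value below zero disables the limit (10 MB shown as an example) -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           maxPostSize="10485760" />

# nginx: maximum accepted request body for the server/location that proxies to Solr (example value)
client_max_body_size 10m;

Apache httpd has a similar LimitRequestBody directive. If large update requests are being rejected or silently dropped, these are the first settings worth checking.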

Related

Solr reindex is stopping prematurely when running Collective Solr for Plone

My team is working on a search application for our websites. We are using Collective Solr in Plone to index our intranet and documentation sites. We recently set up shared blob storage on our test instance of the intranet site because Solr was not indexing our PDF files. This appears to be working; however, each time I run the reindexing script (@@solr-maintenance/reindex) it stops after about an hour and a half. I know that it is not indexing our entire site, as there are numerous pages, files, etc. missing when I run a query in the Solr dashboard.
The warning below is the last thing I see in the Solr log before the script stops. I am very new to Solr so I'm not sure what it indicates. When I run the same script on our documentation site, it completes without error.
2017-04-14 18:05:37.259 WARN (qtp1989972246-970) [ ] o.a.s.h.a.LukeRequestHandler Error getting file length for [segments_284]
java.nio.file.NoSuchFileException: /var/solr/data/uvahealthPlone/data/index/segments_284
I'm hoping someone out there might have more experience with Collective Solr for Plone and could recommend some good resources for debugging this issue. I've done a lot of searching lately but haven't found much useful info.
This was a bug that was fixed some time ago by https://github.com/collective/collective.solr/pull/122

What is the benefit of pyramid_celery if I am using a standard celeryconfig?

So I have a Pyramid app which stores data in ZODB (Substanced) and also builds a Solr index for speedy search of that data. Some of the Solr indexing takes a while, so I want to make the Solr indexing asynchronous. I am going to use RabbitMQ and Celery.
Do I benefit from using pyramid_celery? I don't want to use the ini file to store the Celery config, and there are no scheduled tasks, so no Celery beat. This is small scale and all of the processes/tasks will run on one machine.
Thanks
OK, so I am answering my own question. I asked this on the Pylons Google group, and the response from the author of pyramid_celery was:
Absolutely nothing. pyramid_celery is specifically for sharing your ini configuration / app configuration with your celery workers. If you don't have a need to share those things you have no need for pyramid_celery :)
I will also look at Mikko's option.
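For anyone else wondering, a minimal sketch of what "plain Celery, no pyramid_celery" looks like in this setup (the module name, broker URL and task are my assumptions, not from the original question):

# tasks.py - standalone Celery configuration, no pyramid_celery involved
from celery import Celery

# Assumed local RabbitMQ broker with default credentials.
celery_app = Celery('solr_indexing', broker='amqp://guest:guest@localhost:5672//')

@celery_app.task
def index_document(doc_id):
    # Hypothetical task body: load the object by id and push it to the Solr index here.
    pass

The Pyramid view then just calls index_document.delay(doc_id) and returns, and a worker started with "celery -A tasks worker" picks the job up. As the author notes above, pyramid_celery only adds value when the workers need to share the web app's ini configuration.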

Solr 5 - Restrict access to specific IP

For security reasons, I wish to restrict access to the Solr server to particular IPs only. Since Solr 5 runs as a standalone server, could someone please let me know how I can set up this restriction?
I did try searching for a solution, but all I could find related to Solr 4 and earlier versions, where the Solr WAR was deployed in a servlet container rather than standalone.
Thanks in advance!

Parse data with Tika for Apache Solr

I have managed to get Apache Nutch to index a news website and pass the results off to Apache Solr,
using this tutorial:
https://github.com/renepickhardt/metalcon/wiki/simpleNutchSolrSetup The only difference is that I have decided to use Cassandra instead.
As a test I am trying to crawl CNN, to extract the titles of articles and the dates they were published.
Question 1:
How do I parse data from the webpage to extract the date and the title?
I have found this article about a plugin. It seems a bit outdated and I am not sure it still applies. I have also read that Tika can be used as well, but again most tutorials are quite old.
http://www.ryanpfister.com/2009/04/how-to-sort-by-date-with-nutch/
Another SO article is this:
How to extend Nutch for article crawling. I would prefer to use Nutch, only because that is what I have started with, but I do not really have a strong preference.
Anything would be a great help.
Norconex HTTP Collector will store with your document all possible metadata it can find, without restriction. That ranges from the HTTP header values obtained when downloading a page to all the tags in that HTML page.
That may well be too many fields for you. If so, you can reject the ones you do not want, or instead be explicit about the ones you want to keep by adding a "KeepOnlyTagger" to the <importer> section of your configuration:
<tagger class="com.norconex.importer.tagger.impl.KeepOnlyTagger"
        fields="title,pubdate,anotherone,etc"/>
You'll find how to get started quickly along with configuration options here: http://www.norconex.com/product/collector-http/configuration.html

Nutch querying on the fly

I am a newbie to Nutch and Solr. Well, relatively much newer to Solr than Nutch :)
I have been using Nutch for the past two weeks, and I wanted to know if I can query or search my Nutch crawls on the fly (before they complete). I am asking this because the websites I am crawling are really huge and it takes around 3-4 days for a crawl to complete. I want to analyze some quick results while the Nutch crawler is still crawling the URLs. Someone suggested that Solr would make this possible.
I followed the steps in http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ for this. I see only the injected URLs in the Solr search. I know I did something really foolish and the crawl never happened; I feel I am missing some information here. But I did all the steps mentioned in the link. I think a crawl should have happened somewhere in the process, and it was missed.
Just wanted to see if someone could help me figure out where I went wrong in the process. Forgive my foolishness and thanks for your patience.
Cheers,
Abi
This is not possible.
What you could do, though, is chunk the crawl cycle into smaller batches of URLs, so that it publishes results more often, using this command:
nutch generate crawl/crawldb crawl/segments -topN <the limit>
If you are using the one-stop crawl command, it should work the same way.
I typically use a 24-hour chunking scheme; a full chunked cycle is sketched below.
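To make that concrete, here is a rough sketch of one such chunked cycle using the individual Nutch 1.x commands (the paths, -topN value and Solr URL are only examples, and the exact solrindex arguments vary a little between Nutch versions):

# Generate a small batch of URLs, fetch and parse just that segment,
# fold the results back into the crawldb, then push the batch to Solr.
nutch generate crawl/crawldb crawl/segments -topN 1000
SEGMENT=$(ls -d crawl/segments/* | tail -1)
nutch fetch $SEGMENT
nutch parse $SEGMENT
nutch updatedb crawl/crawldb $SEGMENT
nutch invertlinks crawl/linkdb -dir crawl/segments
nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb $SEGMENT

Repeating that loop means each pass makes a fresh batch of pages searchable in Solr, instead of waiting 3-4 days for the whole crawl to finish.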
