1) I referred to https://github.com/evolvingweb/ajax-solr/wiki/reuters-tutorial for the Ajax-Solr setup.
I know that Ajax-Solr is running, but it only searches the Reuters data. If I want to crawl the web using Nutch and integrate it with Solr, I would have to replace Solr's schema.xml with Nutch's schema.xml, which does not match the Ajax-Solr configuration. After replacing schema.xml, Ajax-Solr won't work (correct me if I am wrong).
How would I integrate Nutch with Solr alongside Ajax-Solr, so that Ajax-Solr can search other data from the web as well?
2) Are there any front-end APIs for Solr searching, other than Ajax-Solr, that would help in efficiently searching the crawled web?
Look at Solr with multiple cores; it's better not to mix documents of a different nature in one collection.
There are many client APIs for Solr, such as SolrJ for Java (http://wiki.apache.org/solr/Solrj), SolPHP for PHP (http://wiki.apache.org/solr/SolPHP), and so on.
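If you are searching from Java, a minimal SolrJ query sketch looks roughly like this (SolrJ 4.x style; the core URL, core name, and field names are assumptions, and newer SolrJ versions use HttpSolrClient instead of HttpSolrServer):

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SolrSearchExample {
    public static void main(String[] args) throws SolrServerException {
        // Point at the core you want to query (core name "collection1" is assumed).
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

        // Build a simple keyword query and ask for the top 10 results.
        SolrQuery query = new SolrQuery("content:nutch");
        query.setRows(10);

        QueryResponse response = server.query(query);
        for (SolrDocument doc : response.getResults()) {
            System.out.println(doc.getFieldValue("id") + " -> " + doc.getFieldValue("title"));
        }
    }
}
```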
I am going to use Apache Solr for the first time. I just want to write some basic code for Apache Solr in Java with Eclipse. Can you please suggest how to get familiar with Apache Solr?
Basically, Solr is a full-text search engine. You can use it as a data store for data of an unstructured nature, and you can use SolrJ to communicate with Solr from Java.
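For example, a minimal SolrJ sketch that adds a document and commits it (the core URL and field names are assumptions; adjust them to your schema.xml):

```java
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

import java.io.IOException;

public class SolrIndexExample {
    public static void main(String[] args) throws IOException, SolrServerException {
        // URL of the Solr core (core name "collection1" is assumed).
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

        // Build a document; "id" and "title" must exist in your schema.xml.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("title", "Hello Solr");

        server.add(doc);
        server.commit();   // make the document visible to searches
    }
}
```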
My objective is to
- crawl URLs,
- extract microdata, and
- save it to Solr.
I used this guide to set up Nutch, HBase, and Solr.
I'm using Nutch to crawl URLs, with HBase as the storage backend, and the Tika plugin for Nutch to parse pages, but it only extracts metadata.
Did I miss something in the configuration? Please guide me or suggest alternatives.
You need to implement your own ParseFilter and put the extraction logic there. You will get the DocumentFragment generated by the Tika parser and can use e.g. XPath to get the microdata.
Note that the DOM generated by Tika is heavily normalised/modified, so your XPath expressions might not match. It may be better to rely on the old HTML parser instead.
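As a rough sketch of the extraction side (the exact ParseFilter signature depends on your Nutch version, so this only shows the XPath part over the DocumentFragment the filter receives; the itemprop-based expression is an assumption about how your microdata is marked up, and it may not match a Tika-normalised DOM):

```java
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import java.util.HashMap;
import java.util.Map;

public class MicrodataExtractor {

    // Collect itemprop name -> text content from the DOM fragment handed to the parse filter.
    public static Map<String, String> extract(DocumentFragment doc) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate("//*[@itemprop]", doc, XPathConstants.NODESET);

        Map<String, String> microdata = new HashMap<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Node node = nodes.item(i);
            String name = node.getAttributes().getNamedItem("itemprop").getNodeValue();
            microdata.put(name, node.getTextContent().trim());
        }
        return microdata;
    }
}
```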
One generic way of doing this would be to use Apache Any23, as done for instance in this storm-crawler module.
BTW There is an open JIRA for a MicroDataHandler in Tika which hasn't been committed yet.
HTH
I am trying to implement Solr in Sitecore but could not find any way to create a Solr instance for it. I have a few PDFs from SDN, but I could not find a way to create a Solr instance in any of them. Considering that I am new to the CMS, I hope I can get some help here. Thank you.
There are lots of resources available for setting up Solr and integrating it with Sitecore.
Essentially, Sitecore is agnostic about how you set up Solr (barring a few exceptions), so you need to follow the standard methods to set Solr up. If you are doing this on your local machine, then I recommend you simply download Solr and run it through the provided Jetty app server.
Once Solr is running, download the Solr extensions from SDN, then follow the search scaling guide to integrate Solr. This really only boils down to the following:
Remove Lucene config files
Add Solr config files and binaries
Add Solr endpoint into relevant config
Generate the Solr schema via Control Panel -> Search (within Sitecore)
Add Schema file to Solr Core configuration
et voila
There is a great guide here: http://www.dansolovay.com/2013/05/setting-up-solr-with-sitecore-7.html
Which is the best place to integrate Apache Tika, assuming that I have already connected and used Nutch (2.2.1) + Solr (4.3)?
I understand that Tika can be integrated with Nutch and/or Solr, but which one is the better choice?
Set up the Tika plugin with Nutch; Nutch will parse the data and do all the hard work for you.
I would suggest setting it up on Solr as well: you may wish to send documents to Solr via the curl command, and it helps to have it set up there too. It comes with little extra configuration and no performance cost.
There is a guide to setting up Tika and the ExtractingRequestHandler here.
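If you prefer to stay in Java rather than curl, a rough SolrJ sketch that pushes a PDF through Solr's /update/extract handler might look like this (the core URL, file path, and literal field are assumptions, and the ExtractingRequestHandler must be enabled in solrconfig.xml):

```java
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

import java.io.File;

public class TikaExtractExample {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

        // Send the raw PDF to the ExtractingRequestHandler; Tika runs inside Solr.
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File("/path/to/document.pdf"), "application/pdf");
        req.setParam("literal.id", "doc-1");          // unique key for the extracted document
        req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

        server.request(req);
    }
}
```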
Apply the Tika parser in Nutch's parsing phase.
I have a large number of documents (mainly PDFs) that I want to index and query on.
I want to store all these docs in a filesystem structure by year.
I currently have this set up in Solr, but I have to run scripts to extract metadata from the PDFs and then update the index.
Is there a product out there that basically lets me drop a new PDF into a folder and have it automatically indexed by Solr?
I have seen that Alfresco does this, but it has some drawbacks. Is there anything else along these lines?
Or would I use Nutch to crawl my filesystem and post updates to Solr? I'm not sure how I should do this.
Solr is a search server, not a crawler. As you noted, Nutch can do this (I have used it for a similar use case, indexing a knowledge base dump).
Essentially, you would host a web server with the root of the folder structure as the document root and allow directory listing on that web server. Nutch could then crawl the top-level URL of this document dump.
Once you have this Nutch-created index, you can expose it through Solr as well.