How can I manipulate multicore in Solr dynamically

I am new to Apache Solr.
I want to manipulate cores dynamically using the org.apache.solr.handler.admin.CoreAdminHandler class.
There are no tutorials on how to use it, nor any good examples I could google up.
Please give me an example of how I can manipulate the cores of a multicore setup deployed in Tomcat (not embedded) using CoreAdminHandler and SolrJ.
How can I specify the path of my Tomcat server where Solr is deployed for CoreAdminHandler/CoreContainer? And how do I specify the path where the cores are placed?

Here is an example you can use to get the list of available cores through a STATUS request:

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;
import org.apache.solr.client.solrj.response.CoreAdminResponse;
import org.apache.solr.common.params.CoreAdminParams.CoreAdminAction;
import org.apache.solr.common.util.NamedList;

// solrUrl must point at the Solr root, e.g. http://localhost:8080/solr
// (not at a specific core), since CoreAdmin requests go to /admin/cores
CoreAdminRequest adminRequest = new CoreAdminRequest();
adminRequest.setAction(CoreAdminAction.STATUS);
CoreAdminResponse adminResponse = adminRequest.process(new CommonsHttpSolrServer(solrUrl));
NamedList<NamedList<Object>> coreStatus = adminResponse.getCoreStatus();
The following CoreAdmin actions are available: STATUS, LOAD, UNLOAD, RELOAD, CREATE, PERSIST, SWAP, RENAME, ALIAS (deprecated), and MERGEINDEXES.
The code is pretty much the same for each of them; you just have to pick the right action and correctly read the results from the returned NamedList object. Let me know if you have any more specific questions.
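For example, to walk through the per-core entries in the status response and then unload a core, a sketch would look something like this (the key names and core name are illustrative; the exact status entries depend on your Solr version):

// continues from the snippet above; also needs java.util.Map
for (Map.Entry<String, NamedList<Object>> entry : coreStatus) {
    String coreName = entry.getKey();
    NamedList<Object> info = entry.getValue();
    System.out.println(coreName + " -> instanceDir: " + info.get("instanceDir"));
}
// unload a core by name using the static helper on CoreAdminRequest
CoreAdminRequest.unloadCore("core1", new CommonsHttpSolrServer(solrUrl));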

Solr "core" permissions

I'm using the Solr Basic auth plugin. How do you set permissions based on the core? I only want certain users to be able to access a certain core.
I thought that was the "collection" value, but that isn't it.
From my own experience trying to solve this exact issue a year ago (I'm unable to find the reference where I read this, but):
You can't limit access to a specific core through security.json. If you need to limit which users can access which sets of data, you'll have to use SolrCloud and the collection parameter on the permission rules.
You might be able to limit access by putting a reverse proxy in front, but be aware that in the "old days" certain GET parameters could be used to switch the request handler; that might have been removed by now (and I'm not sure whether you could switch cores the same way).
You also have to use the path parameter to limit operations (e.g. to only allow a user to access /select or /replication).
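For reference, this is roughly what a rule-based setup in security.json looks like; the user, role, collection name, and credentials value here are made up, and the collection/path combination only behaves as described above under SolrCloud:

{
  "authentication": {
    "class": "solr.BasicAuthPlugin",
    "credentials": { "alice": "<base64 hash> <base64 salt>" }
  },
  "authorization": {
    "class": "solr.RuleBasedAuthorizationPlugin",
    "permissions": [
      { "collection": "products", "path": "/select", "role": "products-reader" }
    ],
    "user-role": { "alice": ["products-reader"] }
  }
}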

How to programmatically create indexing from the list of urls in java

I want to create a search engine, so I have been using Nutch and Solr to develop it.
But Nutch is not able to crawl each and every URL of the website, and the search results are not as good as Google's. So I started using JCrawler to get a list of URLs.
Now I have the list of URLs, but I have to index them.
Is there any way I can index a list of URLs stored line by line in a file,
and show the results via Lucene, Solr, or any other Java API?
How you programmatically do something really depends on which language you plan on writing your code in - fetching content from a URL and making sense of that content before indexing will be largely dependent on the libraries available for your programming language of choice.
You can still use Nutch with the Solr backend: give it the list of URLs as input and set the crawl depth to 1 (so that it doesn't spider anything further), roughly as sketched below.
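With Nutch 1.x, the one-shot crawl command looked roughly like this (the seed directory, Solr URL, and option syntax are from memory; check them against your Nutch version):

# put the URL list into a seed directory, one URL per line
mkdir urls
cp mylist.txt urls/seeds.txt
# crawl one level deep and post the pages to Solr
bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 1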
There are also other "ready" options, such as Crawl Anywhere (which has a Solr backend) and Scrapy.
"Not as good as Google" is not a good description of what you want to accomplish and how to approach that (keep in mind that Search is a core product for Google and they have a very, very large set of custom technologies for handling search). If you have specific issues with your own data and how to display that (usually you can do more useful results as you have domain knowledge of the task you're trying to solve), ask concrete, specific questions.
You can use the Data Import Handler to load the list of URLs from a file and then fetch and index them.
You'd need to use a nested entity, with the outer entity having its rootEntity flag set to false.
You'd need to practice a little bit with DIH, so I recommend that you first learn how to import just the URLs into individual Solr documents and then enhance that with actual parsing of the URL content. A rough sketch of the end result follows.
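Something along these lines, as an untested sketch (the file path and field names are made up; LineEntityProcessor exposes each line of the file as rawLine, and TikaEntityProcessor wants the binary BinURLDataSource):

<dataConfig>
  <dataSource name="file" type="FileDataSource"/>
  <dataSource name="web" type="BinURLDataSource"/>
  <document>
    <!-- outer entity: one row per line of the file; rootEntity="false"
         means the lines themselves do not become Solr documents -->
    <entity name="urlList" processor="LineEntityProcessor"
            url="/path/to/urls.txt" dataSource="file" rootEntity="false">
      <!-- inner entity: fetch each URL and extract its text with Tika -->
      <entity name="page" processor="TikaEntityProcessor"
              url="${urlList.rawLine}" dataSource="web" format="text">
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>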

Index content of PDFs with Solr and Tika

The problem briefly: I would like Sitecore to index the contents of PDFs using Solr's built-in functionality (supplied by Tika). I'm not sure how to configure Sitecore's indexing to use this feature in Solr (Tika). (I think I need to write a custom indexer.)
I'm working with Sitecore 7 (7.1 Update 1) and want to index content from PDFs (or other rich media types). I'd like to index this data for search purposes.
I have Solr (4.6.1) installed and working with Sitecore 7. When I index my site it saves all of the documents to the correct Solr core, and I can successfully retrieve these documents for display.
Using curl, I can send a PDF to my Solr instance and get it indexed.
curl "http://localhost:8983/solr/update/extract?literal._id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true" -F "myfile=#sample.pdf"
This works, and I can read this content in my Sitecore web project and display it in views, so I know I can get access to this data. However, I would like the data to be attached to the items that I have uploaded in Sitecore.
I'd like something like this to happen when I upload a PDF to the Sitecore Media Library and publish the item, or at least when I re-index the site.
I'm currently walking through the following tutorial to learn some things about writing custom indexing (here is a link to part 1):
http://www.sitecore.net/Community/Technical-Blogs/Getting-to-Know-Sitecore/Posts/2013/04/Sitecore-7-Search-Provider-Part-1-Manually-Triggered-Indexing.aspx
Thanks for your patience.
For Sitecore, when handling media data, Lucene and Solr needed to index the content in a consistent way (so that you could switch between them if needed and still get data indexed in the same way). As Tika integration is very much a Solr thing, it was decided that both should use the general Windows concept of IFilters for indexing (http://en.wikipedia.org/wiki/IFilter).
This means that as long as you have the correct IFilter for that MIME type installed on the machine that is doing the indexing, the '_content' computed field will be populated with the output.
This doesn't mean you can't use the Solr Tika integration but it is not supported by default and would be a customization.
It would be very simple to:
Disable the '_content' computed field
Set up a publish pipeline processor that looks at each item being published
Check if it is a media item
Check to see if it is a PDF
Issue a command to push the content to your Solr server for indexing by Tika.
You may want to see what results you get by using an IFilter first. If they are close enough to what you want, you can go with that; if Tika produces better results for you, you should be able to switch to it. Note, though, that you would probably have your media content indexed in a separate Solr core, so you would lose any Sitecore-specific metadata around the document.
Some blog posts that might be helpful:
http://www.samjgriffin.com/blog/2013/11/06/sitecore-7-pdf-and-document-content-search/
http://www.sitecore.net/Community/Technical-Blogs/John-West-Sitecore-Blog/Posts/2013/04/Sitecore-7-Indexing-Media-with-IFilters.aspx
I second the recommendation to use Sitecore's built-in MediaItemContentExtractor + IFilter approach, unless you've already ruled that out for some reason (IFilter difficulties, perhaps). If an IFilter is not an option, or if you're interested in the other approach regardless, I would integrate Tika a little differently than Stephen suggested, though.
The tutorial you referenced deals with writing your own search provider - i.e. replacing the built-in provider entirely. You should be able to leverage Sitecore's Solr provider and accomplish what you need with something lighter: a computed index field. The built-in media extractor mentioned above uses this approach, which enables you to put just about anything into your index during the normal indexing process. Here is a blog post from John West that walks through creating a basic computed index field: Sitecore 7: Computed Index Fields.
In short, write a class that implements IComputedIndexField and represents content extracted from PDFs or other rich documents. In your implementation of the ComputeFieldValue method:
Call GetMediaStream() on the document.
Pass the stream to Solr in an extract-only command and capture the result.
Return that result to store it in the computed index field.
If you need the media content to be in a particular existing index field, then configure the computed field as a copy field (see 3.3.3) into the existing field. Otherwise, configure your search to reference the computed field.
The main drawback here is the expense of passing the extracted content back and forth, rather than committing it directly to the index in a single step. Depending on your index size and contents, that may not be an issue for you.
One other potential option would be a post-rebuild task to add media content to the existing indexed documents. I am not certain this would work. It depends on knowing the IDs of the media items' documents and committing the rich document content in partial document updates, which this person was unsuccessful in attempting. If you try this, be sure to execute it in the indexing:end event prior to HTML cache clearing.
Whatever approach you take, if you want to work with Tika on a higher level than cURL, have a look at SolrNet's implementation of ExtractCommand and related classes.
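If you are on the Java side instead, a rough SolrJ equivalent of an extract-only call would look something like this (the URL, file name, and MIME type are examples):

import java.io.File;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.NamedList;

// extractOnly=true makes Solr run Tika and return the extracted
// text in the response instead of indexing it
SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
req.addFile(new File("sample.pdf"), "application/pdf");
req.setParam("extractOnly", "true");
NamedList<Object> result = server.request(req);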
If you can upgrade your site to Sitecore 7.2, media item content is indexed automatically and there is no need to install the related IFilter. You should read the following:
Media Content Indexing in Sitecore 7.2

Easy GUI way to add data to a Solr index

I'm working on the front end of an app that uses Solr for data storage. Currently I have an empty index, but it'd (understandably) be a lot easier for me if some dummy data was returned, so I could make sure it's output correctly on the front end.
If I was working with an RDBMS (let's say Postgres), I'd open up a GUI (e.g. pgAdmin) and type data manually into a few rows to achieve this goal. I have access to the Solr web interface, but I can't see any obvious call to action saying INSERT YOUR DATA HERE. The closest thing I can find to an answer on the web is this SO thread, but it's still not quite the easy GUI-based solution I'm looking for.
So, my question is: is there a way to quickly and easily insert some data, equivalent to the RDBMS method mentioned above?
Make sure you have defined a schema in schema.xml.
Solr does indeed have a (limited) HTML GUI, which on a local installation is usually found at localhost:8983/solr (the default). If you can get to the base admin page, there is a small combobox on the left where you can select a core/collection. Clicking it reveals a list of options, and you can pick 'Documents' to get a GUI similar to what I think you expect from Postgres/an RDBMS.
http://localhost:8983/solr/#/collection1/documents is the URL on a default Solr installation that I have. This should work as long as you don't have default cores. (Replace collection1 with your collection name and localhost:8983 with wherever your Solr is hosted/the port.)
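On that Documents screen you can paste a document in JSON form and submit it; for example (the field names here are made up and must exist in your schema):

{ "id": "test-1", "title": "My first dummy document" }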

Point to different data dirs in solrj?

I am using SolrJ for a distributed search. Based on certain user preferences, I will have to search through a particular set of indexes. Is there any way through which I can programmatically (duh!) specify the dataDir for that particular search query?
I did snoop through the documentation, but couldn't find any way other than having the datasets in different cores. Is there a better way?
Edit 1: All of the index sets have the same schema format.
You can change the dataDir almost dynamically by using the CREATE CoreAdmin action and specifying a dataDir parameter. The documentation describes CREATE as follows:
Creates a new core based on preexisting instanceDir/solrconfig.xml/schema.xml, and registers it. If persistence is enabled (persist=true), the configuration for this new core will be saved in 'solr.xml'. If a core with the same name exists, then while the "new" core is initializing, the "old" one will continue to accept requests. Once it has finished, all new requests will go to the "new" core, and the "old" core will be unloaded.
In any case you'll need to load a new core; it isn't possible to change the data directory without doing so.
You can use SolrJ to invoke CoreAdmin actions through the CoreAdminRequest and CoreAdminResponse classes, roughly as sketched below.
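A minimal sketch, assuming your SolrJ version exposes setDataDir on CoreAdminRequest.Create (the core name, paths, and URL are examples):

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;
import org.apache.solr.client.solrj.response.CoreAdminResponse;

// create a core that serves an alternate index directory
CoreAdminRequest.Create create = new CoreAdminRequest.Create();
create.setCoreName("user-set-a");
create.setInstanceDir("/path/to/instanceDir"); // must contain conf/solrconfig.xml and conf/schema.xml
create.setDataDir("/path/to/alternate/dataDir"); // the index this core should search
CoreAdminResponse response = create.process(new CommonsHttpSolrServer("http://localhost:8080/solr"));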
