How to get IndexReader from custom request handler? - solr

This is an extension of my earlier question.
I'm going to create a custom request handler to provide terms association mining over an existing index. To do this I need access to Solr's IndexReader opened on the default index directory.
The only way I can think of to do this is to get an IndexReaderFactory by invoking SolrQueryRequest.getCore().getIndexReaderFactory(). This factory has a newReader() method which seems to be what I need, but it requires the index directory as its first argument.
Here's my question: is this the correct way to get an IndexReader? If so, how can I get Solr's index directory? Can I access the Solr configuration to find it from my code, or should I go with something else?

I found the answer myself while reading the LukeRequestHandler source:
SolrIndexSearcher searcher = req.getSearcher();
IndexReader reader = searcher.getReader();
So they first get the searcher, and only then the reader.
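For reference, a minimal sketch of how this could look inside a custom handler; the class name and response key are illustrative only, and package/abstract-method details shift between Solr versions:
import org.apache.lucene.index.IndexReader;
import org.apache.solr.handler.RequestHandlerBase;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse; // org.apache.solr.request in very old Solr versions
import org.apache.solr.search.SolrIndexSearcher;

public class TermsAssociationHandler extends RequestHandlerBase {

    @Override
    public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) throws Exception {
        SolrIndexSearcher searcher = req.getSearcher();
        IndexReader reader = searcher.getReader(); // getIndexReader() in newer Solr versions
        // ...run the terms association mining against the reader...
        rsp.add("numDocs", reader.numDocs());
    }

    @Override
    public String getDescription() {
        return "Terms association mining handler (sketch)";
    }

    @Override
    public String getSource() {
        return null;
    }

    // Depending on the Solr version, RequestHandlerBase may also require
    // getSourceId() and getVersion() to be implemented.
}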

Related

How to use Camel's Exchange setProperty in content enrich()?

I have a Camel route that splits and aggregates according to some IDs. When an ID is retrieved, a call is made to another endpoint to retrieve the project information for that ID. After retrieving the project information I have to enrich it by calling multiple enrich() methods on it. In the first enrich I do some XPath processing to retrieve a primaryOrgId value, which I set as a property on the exchange (don't worry about the XPath processing, I have that sorted out). My problem is that when I set the property (primaryOrgId) inside the first enrich, the value doesn't persist when the route reaches the second enrich. When I log the primaryOrgId value, the original value of "testValue" (set in the direct:createSomeIds route) is displayed instead of "changeTheValueHere", which was set in the first enrich.
I am using Camel 2.15 based on Fuse 6.2.1.
I went to the Camel site and read this part from http://camel.apache.org/content-enricher.html. I'm not sure I understood how to implement it... "For that you must set the filename in the endpoint URI". That text was talking about the header; I'm thinking it's also applicable to the properties in the exchange.
pollEnrich or enrich does not access any data from the current
Exchange which means when polling it cannot use any of the existing
headers you may have set on the Exchange. For example you cannot set a
filename in the Exchange.FILE_NAME header and use pollEnrich to
consume only that file. For that you must set the filename in the
endpoint URI.
Here is my code:
from("direct:createSomeIds")
.routeId("createSomeIds")
.process(new IdCreatorProcessor()
.setProperty("primaryOrgId").constant("testValue")
.split(xpath("/TempProjects/TempProject/Code/text()").namespaces(ns) , new IdStrategy())
.to("direct:splitRouteById")
.end();
from("direct:splitRouteById")
.routeId("splitRouteById")
.to("direct:getProjectByID")
.to("xquery:template/AllProjectToSingleProject.xq") //xquery template
.convertBodyTo(Project.class)
.enrich("direct:getAdditionalInfo", new ProjectStrategy(ProjectStrategy.AggregatorType.AdditionalInfo))
.enrich("direct:getSecondaryInfo", new ProjectStrategy(ProjectStrategy.AggregatorType.SecondaryInfo))
.end();
from("direct:getAdditionalInfo")
//some xpath stuff here
.setProperty("primaryOrgId").constant("changeTheValueHere")
.end();
from("direct:getSecondaryInfo")
.log("Value of primaryOrgId = " + "${exchangeProperty.primaryOrgId}")
.end();
If you can provide a code example, that would be helpful.
If you read a bit further down you will see that it's recommended that you instead use RecipientList with an AggregationStrategy.
.recipientList("direct:getAdditionalInfo", "direct:getSecondaryInfo")
.aggregationStrategy(new ProjectStrategy())
Setting the filename in the endpoint URI would only be applicable if you were accessing a file on an FTP server or some other file area.
Edit:
I now see that you need the property from the first enrichment in your second enrichment. However, if you're not modifying the message body in the first enrich, then I don't actually see the need for enrich at all.
If you are in fact modifying the body, then you can still use the RecipientList; just use two separate ones, each calling a single endpoint.
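If the property really does need to travel from the first enrichment onwards, one option (a sketch only, not the asker's actual ProjectStrategy) is to copy it across inside the AggregationStrategy, which sees both the original and the enrichment exchange:
import org.apache.camel.Exchange;
import org.apache.camel.processor.aggregate.AggregationStrategy;

public class PropagatePrimaryOrgIdStrategy implements AggregationStrategy {

    @Override
    public Exchange aggregate(Exchange oldExchange, Exchange newExchange) {
        if (oldExchange == null) {
            return newExchange;
        }
        // Merge the enrichment result into the original body as needed (omitted here),
        // then copy the property from the enrichment exchange onto the original one.
        Object primaryOrgId = newExchange.getProperty("primaryOrgId");
        if (primaryOrgId != null) {
            oldExchange.setProperty("primaryOrgId", primaryOrgId);
        }
        return oldExchange;
    }
}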

How can I keep changes in the index when I use DIH fullimport?

I'm using Solr 6.5 to index files from multiple FTP sources into multiple cores (one core for each type of document: audio files, images, software, video and documents).
The situation is that I'm doing this to populate an app whose front end takes a social-networking approach, in which every user can add new tags or modify other metadata without restriction.
So when I run the data import handler again to add new files to my application, it erases the index entries that users previously modified and resets them to the data-config default configuration.
My question: is there a way to tell DIH that, if an id already exists, it should skip it and import only the files whose ids are not yet in the index?
If this is not possible, can I do something similar in a different way?
Thanks for everything!
Sounds like you are doing a full import with default settings. One of them is clean, which defaults to true and deletes the whole index before the import.
Try setting it to false and also look at preImportDeleteQuery and postImportDeleteQuery for even more precision.
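For example, a full-import request with clean disabled would look something like this (host and core name are placeholders):
http://localhost:8983/solr/<core>/dataimport?command=full-import&clean=false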

Adding raw query parameters via Criteria API

I could not find an answer to this; I found the previous similar question unanswered. I'd like to use Spring Data Solr for queries, but @Query is insufficient for my needs. As I understand it, whatever you give there becomes the q parameter to Solr's 'select' handler.
In my case I need to add more parameters, for example sfield for a spatial search. If @Query won't cut it, I am ready to write a custom repository implementation by autowiring SolrTemplate, but then the Criteria API does not seem to let me add a raw query parameter either.
Any help/pointers will be greatly appreciated.
I worked around it by creating a QueryParser decorator that adds the required parameters to the parsed Solr query. The QueryParser was registered using solrTemplate.registerQueryParser().
Note, however, that I had to do a really nasty hack to get this working, since all queries sent to solrTemplate.queryForPage are wrapped by a static, package-protected inner class in QueryBase. So my registration code had to live in the package org.springframework.data.solr.core.
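A rough sketch of the decorator idea, assuming the Spring Data Solr 1.x QueryParser interface (class names and method signatures may differ in other versions, and the sfield value is only an example):
import org.apache.solr.client.solrj.SolrQuery;
import org.springframework.core.convert.converter.Converter;
import org.springframework.data.solr.core.DefaultQueryParser;
import org.springframework.data.solr.core.QueryParser;
import org.springframework.data.solr.core.query.SolrDataQuery;

public class RawParamQueryParser implements QueryParser {

    private final QueryParser delegate = new DefaultQueryParser();

    @Override
    public SolrQuery constructSolrQuery(SolrDataQuery query) {
        SolrQuery solrQuery = delegate.constructSolrQuery(query);
        // Append the raw parameters that Criteria cannot express, e.g. for spatial search.
        solrQuery.set("sfield", "location");
        return solrQuery;
    }

    @Override
    public String getQueryString(SolrDataQuery query) {
        return delegate.getQueryString(query);
    }

    @Override
    public void registerConverter(Converter<?, ?> converter) {
        delegate.registerConverter(converter);
    }
}

// Registered roughly as described above, e.g.:
// solrTemplate.registerQueryParser(SimpleQuery.class, new RawParamQueryParser());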

How to list all datasets in CKAN (not only the active ones)

I am working on a project based on CKAN, and I am required to list on a page all the datasets that have the state "active" or "draft". When you go to the datasets page, you can only see the ones whose state is "active", but not "draft".
If I use the API (the package_list() method) or REST calls (http://localhost/api/3/action/package_list), CKAN only returns "active" datasets, not drafts. I double- and triple-checked the documentation, and apparently one cannot list datasets by their state.
Does anybody have a clue on how to do this? Has anybody done this already?
Thanks!
If nothing else, you could write an extension to do this. The database call itself will be pretty simple:
SELECT id,title,name FROM package WHERE state='active' OR state='draft';
I managed to modify CKAN core to list the datasets that do not have the state "draft" or "deleted", and it works, but I do not want to touch CKAN's core. I want to do this with a plugin, so the normal thing to do is to implement plugins.IActions and override the package_list method with a custom one. I have already written my own extension to try to change CKAN's behavior for package_list(), but I can't figure out how to make it work.
Here is my code:
import ckan.plugins as plugins
import ckan.model as model
from ckan.logic import side_effect_free


@side_effect_free
def package_list_custom(context, data_dict=None):
    # List every dataset that is not a draft and not deleted.
    datasets = []
    dataset_q = (model.Session.query(model.Package)
                 .join(model.PackageRole))
    for dataset in dataset_q:
        if dataset.state != 'draft' and dataset.state != 'deleted':
            datasets.append(dataset)
    return [dataset.id for dataset in datasets]


class Cnaf_WorkflowPlugin(plugins.SingletonPlugin):
    plugins.implements(plugins.IActions)

    def get_actions(self):
        # Override the stock package_list action with the custom one above.
        return {
            'package_list': package_list_custom
        }
If I modify CKAN core it works very well, but the problem is that I am not supposed to touch it, so I am obliged to do this via an extension.
EDIT: OK, I managed to make it work: you need to decorate the method with @side_effect_free. I have modified my code above, and now it works.
The package_search API is capable of this, by searching for state:draft and setting the include_drafts=True flag. Something like this:
https://my-site.com/api/action/package_search?q=state:draft&include_drafts=True
You should be able to access this from a plugin with something like: ckan.plugins.toolkit.get_action('package_search')(context=context, data_dict={'q': 'state:draft', 'include_drafts': True}) (you'll need to assemble the context yourself, containing a 'user' key for the current username and a 'userobj' key for the current user object).
Then make a page from the results.

Jackrabbit XPath Issue

I'm relatively new to Jackrabbit. In our application we never turned on the SearchIndex section within the repository.xml (and workspace.xml) files, because we always go directly to a given document using its JCR UUID reference. We are using Jackrabbit v2.2.1 with Oracle as the repository. Now our requirements are expanding, and we would like to use the document metadata feature to store contextual info about a document so that we can use the metadata to retrieve a selected set of documents.
As a first step, I added the default SearchIndex section to the workspace.xml file and restarted the JCR.
I saw a bunch of lines like this in my log file, and then I saw that it had created the index folder under the workspace area.
2011-07-05 15:04:01.724 INFO [WebContainer : 0] MultiIndex.java:1204 indexing... /vfs:metaData/21ee130e-978e-415f-bfd1-7aa03d91608c/vfs:attributes (3500)
I have the folder structure shown below. When I create a document in the JCR, I specify the metadata info as part of the document; it is defined by a complex XSD type with tags like docType, uploadedBy, contextValue, etc.
/ (root)
  /MyApp (sub-folder)
    /documents/ (sub-folder)
      /document-1.pdf (file)
      /document-2.pdf (file)
    /accounts/ (sub-folder)
      /account.txt (file)
    etc...
The following XPath expression works.
//jcr:root/vfs:metaData//*[vfs:attributes/vfs:docType='TAX_DOCS']
If I give a wrong value, for example 'TAX' instead of 'TAX_DOCS', it returns no documents, as expected, which is great. This proves that the metadata is stored as expected and is used correctly in the filter process.
The problem with this query is that it starts searching from the root folder but I want to search from /MyApp/documents sub-folder only. So I tried this:
//jcr:root/MyApp/documents//vfs:metaData//*[vfs:attributes/vfs:docType='TAX_DOCS']
It returns nothing. Then I tried this too but no success.
//jcr:root/MyApp/documents//*[vfs:metaData/vfs:attributes/vfs:docType='TAX_DOCS']
So what am I doing wrong? Is there anything in the workspace.xml configuration that we need to set or are missing?
Any help is appreciated.
Thanks, Jack
Drop the double slashes from everything but the last path step and use the @ notation for the property value, resulting in:
/jcr:root/MyApp/documents//*[vfs:attributes/@vfs:docType='TAX_DOCS']
The // construct matches the whole subtree instead of just the immediate children as / does. The JCR specification only requires implementations to support the // construct as the last step of an XPath query.
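For completeness, running the corrected query through the plain JCR query API would look roughly like this (the session is assumed to be obtained elsewhere; Query.XPATH is deprecated in JCR 2.0 but still supported by Jackrabbit 2.x):
import javax.jcr.Node;
import javax.jcr.NodeIterator;
import javax.jcr.RepositoryException;
import javax.jcr.Session;
import javax.jcr.query.Query;
import javax.jcr.query.QueryManager;
import javax.jcr.query.QueryResult;

public class TaxDocsQuery {

    // Runs the corrected XPath query against an already-opened JCR session.
    public static void printTaxDocs(Session session) throws RepositoryException {
        QueryManager qm = session.getWorkspace().getQueryManager();
        Query query = qm.createQuery(
                "/jcr:root/MyApp/documents//*[vfs:attributes/@vfs:docType='TAX_DOCS']",
                Query.XPATH);
        QueryResult result = query.execute();
        for (NodeIterator it = result.getNodes(); it.hasNext(); ) {
            Node node = it.nextNode();
            System.out.println(node.getPath());
        }
    }
}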
