Manipulating DocList in Solr

I have written a custom request handler in Solr to meet my business requirements. The handler gets data from two different SolrIndexSearchers, and I want the DocLists returned by the two searchers merged into one.
I tried iterating through one and adding it doc by doc to the other, but all I got was an UnsupportedOperationException. Is there any way to merge two DocLists?
[Edit 1]: Code snippet from inside the overridden handleRequestBody method:
SolrCore core = new SolrCore("Desired Directory 1", schema);
reader = IndexReader.open("Desired Directory 1");
searcher = new SolrIndexSearcher(core, schema, getName(), reader, false);
Sort lsort = null;
FilteredQuery filter = null;
DocList results1 = searcher.getDocList(query, filter, lsort, 0, 10);
reader.close();
searcher.close();
core.close();
SolrCore core = new SolrCore("Desired Directory 2", schema);
reader = IndexReader.open("Desired Directory 2");
searcher = new SolrIndexSearcher(core, schema, getName(), reader, false);
Sort lsort = null;
FilteredQuery filter = null;
DocList results2 = searcher.getDocList(query, filter, lsort, 0, 10);
reader.close();
searcher.close();
core.close();
rsp.add("response",results1);
rsp.add("response",results2);
Now that I have two DocLists results1 and results2, how do I merge them?
[Edit 2]: The problem is not an exception/stack trace. When I add the two responses, I get the results as two response sets when it is a single-machine search. When it is a distributed search, I only get the distribution between response 1 of machine 1 and response 1 of machine 2. In my understanding, only when I merge the responses into a single set will I get proper distribution. I hope that is understandable?

DocList is not supposed to be changed after you retrieve it from a search result.
I had a similar problem and, if I remember correctly, I used:
org.apache.solr.util.SolrPluginUtils.docListToSolrDocumentList
to convert DocList to SolrDocumentList, which extends ArrayList<SolrDocument>, which consequently supports add(SolrDocument).
So the worst-case scenario (from a performance standpoint) is to convert the first DocList to a SolrDocumentList and then loop through all the other DocLists, calling add on each doc.
You'll have to test how efficient this approach is. I'm not a Solr expert, but this is where I'd start testing.
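For illustration, a minimal sketch of that merge, assuming the docListToSolrDocumentList(DocList, SolrIndexSearcher, Set&lt;String&gt;, Map) signature and a placeholder returnFields set; the conversion needs the searcher that produced each DocList, so do it before closing that searcher:

// Assumes: import java.util.Set;
//          import org.apache.solr.common.SolrDocument;
//          import org.apache.solr.common.SolrDocumentList;
//          import org.apache.solr.util.SolrPluginUtils;
Set<String> returnFields = null; // placeholder: null here stands for "all stored fields"
SolrDocumentList merged = SolrPluginUtils.docListToSolrDocumentList(results1, searcher1, returnFields, null);
SolrDocumentList fromSecond = SolrPluginUtils.docListToSolrDocumentList(results2, searcher2, returnFields, null);
for (SolrDocument doc : fromSecond) {
    merged.add(doc); // SolrDocumentList extends ArrayList<SolrDocument>, so add() is supported
}
rsp.add("response", merged); // a single merged result set instead of two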

Related

Action to spawn multiple further actions in Gatling scenario

Background
I'm currently working on a set of capability-analysis stress-testing tools, for which I'm using Gatling.
Part of this involves loading up an Elasticsearch instance with scroll queries followed by update API calls.
What I want to achieve
Step 1: Run the scroll initiator and save the _scroll_id where it can be used by further scroll queries
Step 2: Run a scroll query on repeat; as part of each scroll query, make a modification to each hit returned and index it back into Elasticsearch, effectively spawning up to 1000 actions from the one scroll-query action and having the results sampled.
Step 1 is easy. Step 2 not so much.
What I've tried
I'm currently trying to achieve this via a ResponseTransformer that parses the JSON-formatted results, makes modifications to each one, and fires off a thread for each one that attempts another exec(http(...).post(...) etc.) to index the changes back into Elasticsearch.
Basically, I think I'm going about it the wrong way. The indexing threads never get run, let alone sampled by Gatling.
Here's the main body of my scroll query action:
...
val pool = Executors.newFixedThreadPool(parallelism)

val query = exec(http("Scroll Query")
  .get(s"/_search/scroll")
  .body(ElFileBody("queries/scrollquery.json")).asJSON // Do the scroll query
  .check(jsonPath("$._scroll_id").saveAs("scroll_id")) // Get the scroll ID from the response
  .transformResponse { case response if response.isReceived =>
    new ResponseWrapper(response) {
      val responseJson = JSON.parseFull(response.body.string)
      // Get the hits
      val hits = responseJson.get.asInstanceOf[Map[String, Any]]("hits").asInstanceOf[Map[String, Any]]("hits").asInstanceOf[List[Map[String, Any]]]
      for (hit <- hits) {
        val id = hit.get("_id").get.asInstanceOf[String]
        val immutableSource = hit.get("_source").get.asInstanceOf[Map[String, Any]]
        val source = collection.mutable.Map(immutableSource.toSeq: _*) // Make the map mutable
        source("newfield") = "testvalue" // Make a modification
        Thread.sleep(pause) // Pause to simulate topology throughput
        pool.execute(new DocumentIndexer(index, doctype, id, source)) // Create a new thread that executes the index request
      }
    }
  }) // Make some mods and re-index into elasticsearch
...
DocumentIndexer looks like this:
class DocumentIndexer(index: String, doctype: String, id: String, source: scala.collection.mutable.Map[String, Any]) extends Runnable {
  ...
  val httpConf = http
    .baseURL(s"http://$host:$port/${index}/${doctype}/${id}")
    .acceptHeader("application/json")
    .doNotTrackHeader("1")
    .disableWarmUp

  override def run() {
    val json = new ObjectMapper().writeValueAsString(source)
    exec(http(s"Index ${id}")
      .post("/_update")
      .body(StringBody(json)).asJSON)
  }
}
Questions
Is this even possible using gatling?
How can I achieve what I want to achieve?
Thanks for any help/suggestions!
It's possible to achieve this by using jsonPath to extract the JSON hit array and save the elements into the session; then, using a foreach in the action chain and exec-ing the index task inside the loop, you can perform the indexing accordingly.
i.e.:
ScrollQuery
...
val query = exec(http("Scroll Query")
.get(s"/_search/scroll")
.body(ElFileBody("queries/scrollquery.json")).asJSON // Do the scroll query
.check(jsonPath("$._scroll_id").saveAs("scroll_id")) // Get the scroll ID from the response
.check(jsonPath("$.hits.hits[*]").ofType[Map[String,Any]].findAll.saveAs("hitsJson")) // Save a List of hit Maps into the session
)
...
Simulation
...
val scrollQueries = scenario("Enrichment Topologies").exec(
  ScrollQueryInitiator.query,
  repeat(numberOfPagesToScrollThrough, "scrollQueryCounter") {
    exec(ScrollQuery.query, pause(10 seconds).foreach("${hitsJson}", "hit") { exec(HitProcessor.query) })
  }
)
...
HitProcessor
...
def getBody(session: Session): String = {
  val hit = session("hit").as[Map[String, Any]]
  val id = hit("_id").asInstanceOf[String]
  val source = mapAsScalaMap(hit("_source").asInstanceOf[java.util.LinkedHashMap[String, Any]])
  source.put("newfield", "testvalue")
  val sourceJson = new ObjectMapper().writeValueAsString(mapAsJavaMap(source))
  val json = s"""{"doc":${sourceJson}}"""
  json
}

def getId(session: Session): String = {
  val hit = session("hit").as[Map[String, Any]]
  val id = URLEncoder.encode(hit("_id").asInstanceOf[String], "UTF-8")
  val uri = s"/${index}/${doctype}/${id}/_update"
  uri
}

val query = exec(http(s"Index Item")
  .post(session => getId(session))
  .body(StringBody(session => getBody(session))).asJSON)
...
Disclaimer: this code still needs optimising, and I haven't actually learnt much Scala yet. Feel free to comment with better solutions.
Having done this, what I really want to achieve now is to parallelise a given number of the indexing tasks, i.e. I get 1000 hits back and want to execute an index task for each individual hit, but rather than iterating over them and doing them one after another, I want to do 10 at a time concurrently.
However, I think this is a separate question, really, so I'll present it as such.

Why does this get method stop returning correctly?

I am trying to write an App Engine application for my university. What I am trying to achieve right now is to create a method which takes in a Course name and returns a list of all the CourseYears (think of that as being like a link table, e.g. if Maths is the course and it has Year 1, Year 2 and Year 3, then MathsYear1, MathsYear2 and MathsYear3 would be the names of the CourseYears).
This is the code for the module (WARNING: super dirty code below!):
#ApiMethod(name = "courseYears")
public ArrayList<CourseYear> courseYears(#Named("name") String name){
DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
Query.Filter keyFilter = new Query.FilterPredicate("name", Query.FilterOperator.EQUAL, name);
Query query = new Query("Course").setFilter(keyFilter);
PreparedQuery preparedQuery = datastore.prepare(query);
List<Entity> resultList = preparedQuery.asList(FetchOptions.Builder.withLimit(1));
Course course = ofy().load().type(Course.class).id(resultList.get(0).getKey().getId()).now();
ArrayList<String> courseYearNames = course.getAllCourseYearNames();
System.out.println(course.getName());
ArrayList<CourseYear> courseYears = new ArrayList<CourseYear>();
for(String courseYearName: courseYearNames){
Query.Filter courseNameFilter = new Query.FilterPredicate("name", Query.FilterOperator.EQUAL, courseYearName);
Query query2 = new Query("CourseYear").setFilter(courseNameFilter);
List<Entity> resL = preparedQuery.asList(FetchOptions.Builder.withLimit(1));
System.out.println("test");
CourseYear courseYear = ofy().load().type(CourseYear.class).id(resL.get(0).getKey().getId()).now();
courseYears.add(courseYear);
}
return courseYears;
}
It basically takes a Course name in, applies a filter on all Courses to get the corresponding Course object, and then calls getAllCourseYearNames() on the course to get an ArrayList containing all its CourseYears' names. (I would have loved to do this using Keys, but parameterised Objectify keys don't seem to be supported in this version of App Engine.)
I then try to get the CourseYears by looping through the ArrayList of names and applying the filter for each name. I print "test" each time to see how many times it loops. Like I said, a super dirty way of doing it.
When I try passing a few course names as parameters, it loops the correct number of times only once or twice, and after that does not loop at all (doesn't print "test"). I could understand if it never looped, but not working correctly once or twice and then never again. Even when it does loop, it doesn't successfully return a list of CourseYears, but rather the relevant number of nulls; I don't know if this is relevant. I believe it successfully retrieves the Course every time, as I print the name of the course after loading and it never fails to do this.
If anyone has ANY suggestions for why this may be happening, I would be incredibly grateful to hear them!
Thanks
query2 is never used in your code. You reuse preparedQuery from your previous query, which runs on a different entity kind.
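In other words, the loop needs to prepare and run query2 itself. A minimal, untested sketch of the corrected loop body, using the same APIs as the question's code:

Query.Filter courseNameFilter = new Query.FilterPredicate("name", Query.FilterOperator.EQUAL, courseYearName);
Query query2 = new Query("CourseYear").setFilter(courseNameFilter);
PreparedQuery preparedQuery2 = datastore.prepare(query2); // prepare the CourseYear query, not the Course one
List<Entity> resL = preparedQuery2.asList(FetchOptions.Builder.withLimit(1));
CourseYear courseYear = ofy().load().type(CourseYear.class).id(resL.get(0).getKey().getId()).now();
courseYears.add(courseYear);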

ndb query by KeyProperty

I'm struggling with a KeyProperty query, and can't see what's wrong.
My model is
class MyList(ndb.Model):
    user = ndb.KeyProperty(indexed=True)
    status = ndb.BooleanProperty(default=True)
    items = ndb.StructuredProperty(MyRef, repeated=True, indexed=False)
I create an instance of MyList with the appropriate data and can run the following properly
cls = MyList
lists = cls.query().fetch()
Returns
[MyList(key=Key('MyList', 12), status=True, items=..., user=Key('User', 11))]
But it fails when I try to filter by user, i.e. finding lists where the user equals a particular entity, even when using the key I've just used for the insert, or one taken from the previous query result.
key = lists[0].user
lists = cls.query(cls.user == key).fetch()
Returns
[]
It works fine with status == True as the filter, though, and I can't see what's missing.
I should add it happens in a unit testing environment with the following v3_stub
self.policy = datastore_stub_util.PseudoRandomHRConsistencyPolicy(probability=0)
self.testbed.init_datastore_v3_stub(
    require_indexes=True,
    root_path="%s/../" % (os.path.dirname(__file__)),
    consistency_policy=self.policy
)
user=Key('User', 11) is a key to a different kind (User), not MyList.
Perhaps you meant:
user = ndb.KeyProperty(kind='User', indexed=True)
Your code looks fine, but I have noticed some data integrity issues when developing locally with NDB. I copied your model and code, and I also got the empty list at first, but then after a few more attempts, the data is there.
Try it a few times?
edit: possibly related?
google app engine ndb: put() and then query(), there is always one less item

How to optimize one-to-many queries in the datastore

I have a latency problem in my application due to the datastore doing additional queries for referenced entities. I have received good advice on how to handle this for single-value properties by using the get_value_for_datastore() function. However, my application also has one-to-many relationships, as shown in the code below, and I have not found a way to prefetch these entities. The result is unacceptable latency when trying to show a table of 200 documents and their associated DocumentFiles (>6000 ms).
(There will probably never be more than 10,000 Documents or DocumentFiles.)
Is there a way to solve this?
models.py
class Document(db.Expando):
    title = db.StringProperty()
    lastEditedBy = db.ReferenceProperty(DocUser, collection_name='documentLastEditedBy')
    ...

class DocUser(db.Model):
    user = db.UserProperty()
    name = db.StringProperty()
    hasWriteAccess = db.BooleanProperty(default=False)
    isAdmin = db.BooleanProperty(default=False)
    accessGroups = db.ListProperty(db.Key)
    ...

class DocumentFile(db.Model):
    description = db.StringProperty()
    blob = blobstore.BlobReferenceProperty()
    created = db.DateTimeProperty()  # needs to be stored here in relation to upload / download of everything
    document = db.ReferenceProperty(Document, collection_name='files')

    @property
    def link(self):
        return '%s %s' % (self.key().id(), self.blob.filename)
...
main.py
docUsers = DocUser.all()
docUsersNameDict = dict([(i.key(), i.name) for i in docUsers])

documents = Document.all()
for d in documents:
    out += '<td>%s</td>' % d.title
    docUserKey = Document.lastEditedBy.get_value_for_datastore(d)
    out += '<td>%s</td>' % docUsersNameDict.get(docUserKey)
    out += '<td>'
    # Creates a new query for each document, resulting in unacceptable latency
    for file in d.files:
        out += file.link + '<br>'
    out += '</td>'
Denormalize and store the link in your Document, so that getting the link will be fast.
You will need to be careful that when you update a DocumentFile, you need to update the associated Document. This operates under the assumption that you read the link from the datastore far more often than you update it.
Denormalizing is often the fix for poor performance on App Engine.
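A minimal sketch of that idea, assuming a new fileLinks list property on Document (not part of the original models) that is refreshed whenever a DocumentFile is created or changed:

class Document(db.Expando):
    title = db.StringProperty()
    lastEditedBy = db.ReferenceProperty(DocUser, collection_name='documentLastEditedBy')
    fileLinks = db.StringListProperty()  # assumed: denormalized copy of the children's link values

def refresh_file_links(document):
    # Hypothetical helper: call after putting/updating a DocumentFile of this document
    document.fileLinks = [f.link for f in document.files]
    document.put()

The table-building loop can then print '<br>'.join(d.fileLinks) without issuing any per-document queries.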
Load your files asynchronously. Use get_value_for_datastore on d.files, which should return a collection of keys, on which you can then call db.get_async(key) to get back a future object. You will not be able to write out your result procedurally as you have done, but it should be trivial to assemble a partial request/dictionary for all documents, with a collection of pending future gets(), and then, when you do your iteration to build the results, you can finalize the futures, which will have finished without blocking (~0 ms latency).
Basically, you need two iterations: the first iteration goes through and asynchronously requests the files you need, and the second iteration goes through, finalizes your gets, and builds your response.
https://developers.google.com/appengine/docs/python/datastore/async
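A rough two-pass sketch of that pattern; note it uses per-document asynchronous queries via Query.run() rather than the explicit db.get_async() key gets described above, so treat it as an untested illustration of the two-iteration idea:

documents = Document.all().fetch(200)

# Pass 1: start one asynchronous DocumentFile query per document.
# Query.run() begins fetching in the background and returns an iterable.
pending = [(d, DocumentFile.all().filter('document =', d.key()).run())
           for d in documents]

# Pass 2: build the table; iterating each result set only blocks if that
# particular query has not finished yet.
out = ''
for d, files in pending:
    out += '<td>%s</td><td>%s</td>' % (d.title, '<br>'.join(f.link for f in files))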

RIA Services - Silverlight 4.0 - Accessing entities from code

I have a strange situation. I have a simple project to test RIA functionality in Silverlight 4.0.
When I use a data source for the Domain Service it works great, but when I access the context from code and execute a simple query, it returns 0 rows.
// test One: with query provided to DataSource
var q = ctx.GetDoctorsWithPatientsAndHospitalQuery();
var result = ctx.Load(q);

// test Two: using EntityQuery
EntityQuery<Doctor> query =
    from c in ctx.GetDoctorsWithPatientsAndHospitalQuery()
    select c;
LoadOperation<Doctor> loadOp = this.ctx.Load(query);
var result2 = loadOp.Entities;

// test Three: using only the entity set and LINQ
var result3 = ctx.Doctors.ToList();
Strangely, it works great when I add a new entity instance from code:
Doctor newDoctor = new Doctor()
{
    FirstName = firstNameTextBoxNew.Text,
    LastName = lastNameTextBoxNew.Text,
    Hospital_Id = tmp,
    Hospital = tmpH
};
ctx.Doctors.Add(newDoctor);
ctx.SubmitChanges();
Can anyone point out what I have done wrong in executing the select from code?
Regards,
Daniel Skowroński
Calling "LoadOperation loadOp = this.ctx.Load(query);" from code is an async operation so you are basically checking the result before it completes.
If you want to see the results, you need to provide a callback, to the Load() method, that will execute after the data is loaded.
Data sources for domain services handle async updates, so keep propagating changes as and when load operations complete.
Your "save" works as it does not wait around for the result. You are manually checking the database afterwards. Not checking it in code.
Hope this helps.
As a quick check, try this (breakpoint on the "result2 =" line). Your loadOp is redundant in this example, but I did not want to change your code too much:
LoadOperation<Doctor> loadOp = this.ctx.Load(query, loadOperation =>
{
    var result2 = loadOp.Entities;
}, null);
Note: for those who want to edit this code, please don't. I wanted to retain the flavour of the asker's code. loadOp and loadOperation point to the same object, and result2 was the asker's choice of variable name.
