Getting IDs of added documents after import operation is complete - Solr

I'm trying to set up a Solr dataimport.EventListener to call a SOAP service with the IDs of the documents that were added during the update event. I have a class which implements org.apache.solr.handler.dataimport.EventListener, and I thought that getAllEntityFields() would yield a collection of document IDs. Unfortunately, the method yields an empty list. Even more confusing, context.getSolrCore().getName() yields an empty string rather than the actual core name, so it seems I am not quite on the right path here.
The current setup is the following:
Whenever a certain sproc is called in SQL, it puts a message in a queue. This queue has a listener on it which initiates a program that reads the queue and calls other sprocs. After the sprocs are complete, a delta or full import operation is performed on Solr. Immediately after, a method is called to update a cache. However, because the import operation on Solr may not have completed before this update method is called, the cache may be updated with "stale" data.
I was hoping to use a dataimport EventListener to call the method which updates the cache, since my other options seem far too complex (e.g. polling the dataimport URL to determine when to call the update method, or using a queue of document IDs which need to be updated and having the EventListener call a method on a service to read this queue and update the cache). I'm having a hard time finding documentation or examples. Does anyone have any ideas on how I should approach the problem?
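For what it's worth, the polling fallback mentioned above can stay fairly small. Below is a minimal sketch of the decision logic, assuming the dataimport handler's status response reports "busy" while an import is running and "idle" once it has finished; the exact endpoint path and field names vary by Solr version, so treat them as assumptions to verify:

```java
// Sketch of the polling approach: fetch /solr/<core>/dataimport periodically
// and only update the cache once the status no longer reports a busy import.
// The "idle"/"busy" values are assumptions based on the DIH status response.
public class DataImportPoller {

    // Returns true once the raw status body no longer reports a busy import.
    public static boolean importFinished(String statusBody) {
        return statusBody.contains("idle") && !statusBody.contains("busy");
    }

    public static void main(String[] args) {
        String busy = "{\"status\":\"busy\",\"importResponse\":\"A command is still running...\"}";
        String idle = "{\"status\":\"idle\",\"importResponse\":\"\"}";
        System.out.println(importFinished(busy)); // false
        System.out.println(importFinished(idle)); // true
    }
}
```

In practice you would fetch the status body with an HTTP client on a short interval and call the cache-update method the first time this returns true.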

From what I understand, you are trying to update your cache as and when documents are added. Depending on which version of Solr you are running, you can do one of the following.
Solr 4.0 provides a ScriptTransformer that lets you do this.
http://wiki.apache.org/solr/DataImportHandler#ScriptTransformer
With prior versions of Solr, you can chain one handler on top of the other, as answered in the following post.
Solr and custom update handler
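For reference, a ScriptTransformer is wired into the DIH data-config roughly like this (the entity, query, and function names here are made up for illustration; check the wiki page above for your version's exact syntax):

```xml
<dataConfig>
  <script><![CDATA[
    function flagRow(row) {
      // runs once per row during the import
      row.put('imported', 'true');
      return row;
    }
  ]]></script>
  <document>
    <entity name="item" query="select id, name from item"
            transformer="script:flagRow">
    </entity>
  </document>
</dataConfig>
```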


Guava Cache as ValueState in Flink

I am trying to de-duplicate events in my Flink pipeline using a Guava cache.
My requirement is to de-duplicate over a 1-minute window, but at any given point maintain no more than 10,000 elements in the cache.
A small background on my experiment with Flink windowing:
Tumbling windows: I was able to implement this using tumbling windows + a custom trigger. But the problem is, if an element occurs at the 59th second and again at the 61st second, it is not recognized as a duplicate.
Sliding windows: I also tried a sliding window with a 10 second overlap + a custom trigger. But an element that came in the 55th second is part of 5 different windows, and it is written to the sink 5 times.
Please let me know if I should not be seeing the above behavior with windowing.
Back to Guava:
I have an Event which looks like this, and an EventsWrapper for these events which looks like this. I will be getting a stream of EventsWrappers, and I should remove duplicate Events across different EventsWrappers.
Example: if I have 2 EventsWrappers like below:
[EventsWrapper{id='ew1', org='org1', events=[Event{id='e1', name='event1'}, Event{id='e2', name='event2'}]},
EventsWrapper{id='ew2', org='org2', events=[Event{id='e1', name='event1'}, Event{id='e3', name='event3'}]}]
I should emit the following as output:
[EventsWrapper{id='ew1', org='org1', events=[Event{id='e1', name='event1'}, Event{id='e2', name='event2'}]},
EventsWrapper{id='ew2', org='org2', events=[Event{id='e3', name='event3'}]}]
i.e. making sure that the e1 event is emitted only once, assuming these two events fall within the time and size bounds of the cache.
I created a RichFlatMap function where I initialize a Guava cache and a ValueState like this, and set the Guava cache in the ValueState like this. My overall pipeline looks like this.
But each time I try to update the guava cache inside the value state:
eventsState.value().put(eventId, true);
I get the following error:
java.lang.NullPointerException
at com.google.common.cache.LocalCache.hash(LocalCache.java:1696)
at com.google.common.cache.LocalCache.put(LocalCache.java:4180)
at com.google.common.cache.LocalCache$LocalManualCache.put(LocalCache.java:4888)
at events.piepline.DeduplicatingFlatmap.lambda$flatMap$0(DeduplicatingFlatmap.java:59)
at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:176)
On further digging, I found out that the error is because the keyEquivalence inside the Guava cache is null.
I checked by directly setting on the Guava cache(not through state, but directly on the cache) and that works fine.
I felt this could be because ValueState is not able to serialize the Guava cache, so I added a serializer like this and registered it like this:
env.registerTypeWithKryoSerializer((Class<Cache<String,Boolean>>)(Class<?>)Cache.class, CacheSerializer.class);
But this didn't help either.
I have the following questions:
Any idea what I might be doing wrong with the Guava cache in the above case?
Is what I am seeing with my tumbling and sliding window implementations expected, or am I doing something wrong?
What will happen if I don't set the Guava cache in ValueState, and instead just use it as a plain object in the DeduplicatingFlatmap class, operating directly on the Guava cache instead of going through the ValueState? My understanding is that the Guava cache won't be part of the checkpoint, so when the pipeline fails and restarts, the Guava cache will have lost all its values and will be empty on restart. Is this understanding correct?
Thanks a lot in advance for the help.
See below.
These windows are behaving as expected.
Your understanding is correct.
Even if you do get it working, using a Guava cache as ValueState will perform very poorly, because RocksDB is going to deserialize the entire cache on every access, and re-serialize it on every update.
Moreover, it looks like you are trying to share a single cache instance across all of the orgs that happen to be multiplexed across a single flatmap instance. That's not going to work, because the RocksDB state backend will make a copy of the cache for each org (a side effect of the serialization involved).
Your requirements aren't entirely clear, but a deduplication query might help. I'm thinking, though, that MapState in combination with timers in a KeyedProcessFunction is more likely to be the building block you need. Here's an example that might help you get started (but you'll want to handle the timers differently).
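The size-and-time bound described in the question can be prototyped without Flink at all. The sketch below is plain Java, purely illustrative, and deliberately not fault-tolerant: it keeps at most maxSize keys and expires entries older than ttlMillis, which is roughly the behavior a MapState-plus-timers implementation would reproduce per key (with the important difference that MapState survives restarts):

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative, framework-free stand-in for the MapState + timer logic:
// a bounded, time-expiring set of event ids. Like a plain Guava cache in
// a flatmap, this state is lost on failure, which is exactly why MapState
// is the real answer in Flink.
public class DedupCache {
    private final long ttlMillis;
    private final int maxSize;
    // insertion order == arrival order, since ids are only added once
    private final LinkedHashMap<String, Long> seen = new LinkedHashMap<>();

    public DedupCache(long ttlMillis, int maxSize) {
        this.ttlMillis = ttlMillis;
        this.maxSize = maxSize;
    }

    // Returns true if the id was not seen within the TTL window (keep it),
    // false if it is a duplicate (drop it).
    public boolean firstSeen(String id, long nowMillis) {
        // drop expired entries from the front; the rest are newer
        for (Iterator<Map.Entry<String, Long>> it = seen.entrySet().iterator(); it.hasNext(); ) {
            if (it.next().getValue() + ttlMillis <= nowMillis) {
                it.remove();
            } else {
                break;
            }
        }
        if (seen.containsKey(id)) {
            return false; // duplicate inside the window
        }
        if (seen.size() >= maxSize) {
            // evict the oldest entry to respect the size cap
            Iterator<String> oldest = seen.keySet().iterator();
            oldest.next();
            oldest.remove();
        }
        seen.put(id, nowMillis);
        return true;
    }

    public static void main(String[] args) {
        DedupCache cache = new DedupCache(60_000L, 10_000); // 1 minute, 10k entries
        System.out.println(cache.firstSeen("e1", 0L));      // true
        System.out.println(cache.firstSeen("e1", 2_000L));  // false: duplicate within the minute
        System.out.println(cache.firstSeen("e1", 61_000L)); // true: the first sighting has expired
    }
}
```

In a KeyedProcessFunction you would key by event id (or keep a MapState of ids per org), store the timestamp in MapState, and register an event-time or processing-time timer to remove the entry after the TTL instead of sweeping on access.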

How to temporarily disable sitecore indexing while editing items

I am developing a Sitecore project that has several data import jobs running on a daily basis. Every time a job is executed, it may update a large number of Sitecore items (thousands), and I've noticed that all these edits trigger Solr index updates.
My concern is that I'm not sure whether this is better than updating everything at the end of the job, so I would love to try both options. Could anyone tell me how I can use code to temporarily disable Lucene/Solr indexing and enable it again later, when I have finished editing all the items?
This is a common requirement, and you're right to have such concerns. In general it's considered good practice to disable indexing during big import jobs, then rebuild afterwards.
Assuming you're using Sitecore 7 or above, this is pretty much what you need:
IndexCustodian.PauseIndexing();
IndexCustodian.ResumeIndexing();
Here's a comprehensive article discussing this:
http://blog.krusen.dk/disable-indexing-temporarily-in-sitecore-7/
In addition to @Martin's answer, you can pass silent=true when you finish editing the item. Something like:
item.Editing.BeginEdit();
//Change fields values
item.Editing.EndEdit(true,true);
The second parameter of the EndEdit() method forces a silent update of the item, which means no events/indexing will be triggered on item save.
I feel this is safer than pausing indexing at the whole application level during the import process; you just skip indexing of the items you are updating.
EDIT:
In case you need to rebuild the index for the updated items after the import process is done, you can use the following code. It will index the content tree starting from RootItemInTree and below:
var index = Sitecore.ContentSearch.ContentSearchManager.GetIndex("Your_Index_Name");
index.Refresh(new SitecoreIndexableItem(RootItemInTree));
To disable indexing during large import/update tasks you should wrap your logic inside a BulkUpdateContext block. You can also use other wrappers like the EventDisabler to stop events from being fired if that is appropriate in your context. Alternatively you could wrap your code in an EditContext and set it to silent. So your code could end up something like this:
using (new BulkUpdateContext())
using (new EditContext(targetItem, false, true))
{
// insert update logic here...
}
Here is an older question that discusses this topic: Optimisation tips when migrating data into Sitecore CMS

Still seeing old shard after calling SPLITSHARD

I called SPLITSHARD, and now this is what I see even after posting a commit:
I thought SPLITSHARD was supposed to get rid of the original shard (shard1, in this case). Am I missing something? I was expecting the only two remaining shards to be shard1_0 and shard1_1.
The REST call I used was /admin/collections?collection=default-collection&shard=shard1&action=SPLITSHARD if that helps.
Response from the Solr mailing list:
Once the SPLITSHARD call completes, it just marks the original shard as inactive, i.e. it no longer accepts requests. So yes, you would have to use DELETESHARD (https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api7) to clean it up.
As far as what you see on the admin UI, that information is wrong, i.e. the UI does not respect the state of the shards while displaying them. So, though the parent shard might be inactive, you would still end up seeing it as just another active shard. There's an open issue for this one.
One way to confirm the shard state is to look at it in clusterstate.json (or state.json, depending on the version of Solr you're using).

How to make another search call inside SOLR

I would like to implement some kind of fallback querying mechanism inside Solr. That is, if a first search call doesn't generate enough results, I would like to make another call with a different ranking and then combine the results and return them. I guess this could be done on the Solr client side, but I hope to do it inside Solr. From reading the documentation, I guess I need to implement a search component and then add it next to the "query" component? Any reference or experience in this regard would be highly appreciated.
SearchHandler calls all the registered search components in the order you define, and there are several stages (prepare, process, etc.).
You only know the number of results after the distributed processing phase (I suppose you work in distributed mode), so your custom search component should check the number of results in the response object and run its own query if necessary.
Actually, you may inherit from (or wrap) the regular QueryComponent for that, augmenting its process/distributed process phases.
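The control flow such a component needs (run the query, check the count, run the fallback, merge) is framework-independent. Here is a stripped-down sketch with plain functions standing in for the real Solr queries; the method and parameter names are invented for illustration, not Solr APIs:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.function.Function;

// Framework-free sketch of the fallback logic a custom search component
// would implement after the (distributed) process phase: run the primary
// query, and if it returns too few results, run a fallback query with a
// different ranking and append its results, de-duplicated by document id.
public class FallbackSearch {

    public static List<String> search(
            Function<String, List<String>> primary,
            Function<String, List<String>> fallback,
            String query,
            int minResults) {
        List<String> results = new ArrayList<>(primary.apply(query));
        if (results.size() < minResults) {
            // LinkedHashSet keeps primary ordering first and drops duplicate ids
            LinkedHashSet<String> merged = new LinkedHashSet<>(results);
            merged.addAll(fallback.apply(query));
            return new ArrayList<>(merged);
        }
        return results;
    }
}
```

In a real component, the two functions would correspond to the normal QueryComponent pass and your re-ranked query, and the merge would operate on the DocList/response object rather than on strings.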

CakePHP afterSave Timing

I have a situation where, in a model's afterSave callback, I'm trying to access data from a distant association (it's a legacy data model with very wonky association linkage). What I'm finding is that within the callback I can execute a find call on the model, but if I exit right then, the record is never inserted into the database. The lack of a record means that I can't execute a find on the related model using data that was just inserted into the current one.
I haven't found any mention of when data is actually committed with respect to when the afterSave callback is engaged. I'm working with legacy code, but I see no indication that we're specifically engaging transactions, so I'm trying to figure out what my options might be.
Thanks.
UPDATE
The gist of the scenario is this: We're taking event registrations, but folks can be wait listed. A user can register (or be registered) for a given Date. After a registration is complete, I need to check the wait list for the existence of a record for the registering user (WaitList.user_id) on the date being registered for (WaitList.date_id). If such a record exists, it can be deleted because it's become an active registration.
The legacy schema puts me in a place where the registration isn't directly tied to a date so I can't get the Date.id easily. Instead, Registration->Registrant->Ticket->Date. Unintuitive, I know, but it is what it is for now. Even better (sarcasm included), we have a view named attendees that rolls all of this info up and from which I would be able to use the newly created Registration->id to return Attendee.date_id. Since the record doesn't exist, it's not available in the view.
Hopefully that provides a little more context.
What's the purpose of the find query inside of your afterSave?
Update
Is it at all possible to properly associate the records? Or are we talking about way too much refactoring for it to be worth it? You could move the check to the controller if it's not possible to modify the associations between the records.
Something like this (sketched against the usual CakePHP 2.x model methods; the WaitList.user_id and WaitList.date_id conditions come from your schema, and $userId/$dateId are placeholders):
if ($this->Registration->save($this->request->data)) {
    $entry = $this->WaitList->find('first', array('conditions' => array(
        'WaitList.user_id' => $userId, 'WaitList.date_id' => $dateId)));
    if (!empty($entry)) {
        // the registration is now active, so the wait list record can go
        $this->WaitList->delete($entry['WaitList']['id']);
    }
}
It's not best practice, but it will get you around your issue.
