How to temporarily disable sitecore indexing while editing items - solr

I am developing a Sitecore project that has several data import jobs running on a daily basis. Every time a job is executed, it may update a large number of Sitecore items (thousands), and I've noticed that all these edits trigger Solr index updates.
My concern is that I'm not sure whether this is better than updating everything at the end of the job, so I would like to try both options. Could anyone tell me how I can use code to temporarily disable Lucene/Solr indexing and re-enable it once I finish editing all items?

This is a common requirement, and you're right to have such concerns. In general it's considered good practice to disable indexing during big import jobs, then rebuild afterwards.
Assuming you're using Sitecore 7 or above, this is pretty much what you need:
IndexCustodian.PauseIndexing();
IndexCustodian.ResumeIndexing();
Here's a comprehensive article discussing this:
http://blog.krusen.dk/disable-indexing-temporarily-in-sitecore-7/

In addition to @Martin's answer, you can pass silent=true when you finish editing the item, something like:
item.Editing.BeginEdit();
//Change fields values
item.Editing.EndEdit(true,true);
The second parameter of the EndEdit() method forces a silent update of the item, which means no events or indexing will be triggered when the item is saved.
I feel this is safer than pausing indexing at the application level during the import process; you just skip indexing of the items you are updating.
EDIT:
In case you need to rebuild the index for the updated items after the import process is done, you can use the following code. It will index the content tree starting from RootItemInTree and below:
var index = Sitecore.ContentSearch.ContentSearchManager.GetIndex("Your_Index_Name");
index.Refresh(new SitecoreIndexableItem(RootItemInTree));

To disable indexing during large import/update tasks you should wrap your logic inside a BulkUpdateContext block. You can also use other wrappers like the EventDisabler to stop events from being fired if that is appropriate in your context. Alternatively you could wrap your code in an EditContext and set it to silent. So your code could end up something like this:
using (new BulkUpdateContext())
using (new EditContext(targetItem, false, true))
{
// insert update logic here...
}
Here is an older question that discusses this topic: Optimisation tips when migrating data into Sitecore CMS

When applying a scenario, do we have to delete the scenario as well, to prevent applying the changes twice?

Running through the full loop for a scenarios-based workflow and noticed that the scenario once applied is NOT auto-deleted. What is considered best practice to prevent users from accidentally applying a scenario twice? Is it best to delete them afterwards? If so, why is auto-delete not enabled?
You could delete them, but usually I would add a flag called "applied" and filter my list of displayed scenarios by "applied" == false or something like that.
If you ever want to use your scenarios for metrics (e.g. counting how many scenarios have been applied, or writing some stats to the scenario object on apply), that data is all lost if you delete on apply. I believe that's also why it's not part of the workflow by default. The idea is that you should make that decision for your use case.
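A minimal sketch of the "applied" flag idea, in plain Python rather than any particular platform's API. The Scenario class, its fields, and both function names are assumptions made up for illustration:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Scenario:
    # Hypothetical stand-in for a scenario object; field names are assumptions.
    name: str
    applied: bool = False
    applied_at: Optional[datetime] = None


def apply_scenario(scenario: Scenario) -> bool:
    """Apply a scenario at most once; return False if it was already applied."""
    if scenario.applied:
        return False
    # ... the actual changes would be applied here ...
    scenario.applied = True
    scenario.applied_at = datetime.now(timezone.utc)  # kept for later metrics
    return True


def pending(scenarios: List[Scenario]) -> List[Scenario]:
    """Scenarios that should still be offered to the user."""
    return [s for s in scenarios if not s.applied]
```

The guard in apply_scenario protects against a double apply even if a stale list is still on screen, and the scenario object survives for stats.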

Guava Cache as ValueState in Flink

I am trying to de-duplicate events in my Flink pipeline. I am trying to do that using guava cache.
My requirement is to de-duplicate over a 1-minute window, while never keeping more than 10000 elements in the cache at any given point.
A small background on my experiment with Flink windowing:
Tumbling windows: I was able to implement this using tumbling windows + a custom trigger. But the problem is that if an element occurs in the 59th second and again in the 61st second, it is not recognized as a duplicate.
Sliding windows: I also tried a sliding window with a 10 second overlap + a custom trigger. But an element that arrives in the 55th second is part of 5 different windows, and it is written to the sink 5 times.
Please let me know if I should not be seeing the above behavior with windowing.
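To illustrate the tumbling-window case, here is a small sketch of how a timestamp maps to a window, assuming 60-second windows aligned to the epoch with no offset. The function name is made up for illustration and this is plain arithmetic, not Flink code:

```python
WINDOW_SECONDS = 60  # assumed window size


def tumbling_window_start(timestamp_s: int, size_s: int = WINDOW_SECONDS) -> int:
    """Start (in seconds) of the tumbling window that contains timestamp_s."""
    return timestamp_s - (timestamp_s % size_s)


# Two occurrences only two seconds apart, on either side of a boundary:
first = tumbling_window_start(59)
second = tumbling_window_start(61)
print(first, second)
```

The two occurrences land in different windows, so any per-window de-duplication never sees them together.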
Back to Guava:
I have an Event which looks like this and an EventsWrapper for these events which looks like this. I will be getting a stream of EventsWrappers, and I should remove duplicate Events across different EventsWrappers.
For example, if I have 2 EventsWrappers like below:
[EventsWrapper{id='ew1', org='org1', events=[Event{id='e1', name='event1'}, Event{id='e2', name='event2'}]},
EventsWrapper{id='ew2', org='org2', events=[Event{id='e1', name='event1'}, Event{id='e3', name='event3'}]}]
I should emit the following as output:
[EventsWrapper{id='ew1', org='org1', events=[Event{id='e1', name='event1'}, Event{id='e2', name='event2'}]},
EventsWrapper{id='ew2', org='org2', events=[Event{id='e3', name='event3'}]}]
i.e. making sure that the e1 event is emitted only once, assuming these two events are within the time and size requirements of the cache.
I created a RichFlatMapFunction in which I initialize a Guava cache and value state like this, and set the Guava cache in the value state like this. My overall pipeline looks like this.
But each time I try to update the guava cache inside the value state:
eventsState.value().put(eventId, true);
I get the following error:
java.lang.NullPointerException
at com.google.common.cache.LocalCache.hash(LocalCache.java:1696)
at com.google.common.cache.LocalCache.put(LocalCache.java:4180)
at com.google.common.cache.LocalCache$LocalManualCache.put(LocalCache.java:4888)
at events.piepline.DeduplicatingFlatmap.lambda$flatMap$0(DeduplicatingFlatmap.java:59)
at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:176)
On further digging, I found out that the error is because the keyEquivalence inside the Guava cache is null.
I checked by setting the value directly on the Guava cache (not through state), and that works fine.
I felt this could be because ValueState is not able to serialize the Guava cache. So I added a serializer like this and registered it like this:
env.registerTypeWithKryoSerializer((Class<Cache<String,Boolean>>)(Class<?>)Cache.class, CacheSerializer.class);
But this didn't help either.
I have the following questions:
1. Any idea what I might be doing wrong with the Guava cache in the above case?
2. Is what I am seeing with my tumbling and sliding window implementations expected, or am I doing something wrong?
3. What will happen if I don't set the Guava cache in ValueState, and instead just use it as a plain object in the DeduplicatingFlatmap class, operating directly on the Guava cache rather than through the ValueState? My understanding is that the Guava cache won't be part of the checkpoint, so when the pipeline fails and restarts, the Guava cache will have lost all its values and will be empty on restart. Is this understanding correct?
Thanks a lot in advance for the help.
1. See below.
2. These windows are behaving as expected.
3. Your understanding is correct.
Even if you do get it working, using a Guava cache as ValueState will perform very poorly, because RocksDB is going to deserialize the entire cache on every access, and re-serialize it on every update.
Moreover, it looks like you are trying to share a single cache instance across all of the orgs that happen to be multiplexed across a single flatmap instance. That's not going to work, because the RocksDB state backend will make a copy of the cache for each org (a side effect of the serialization involved).
Your requirements aren't entirely clear, but a deduplication query might help. But I'm thinking MapState in combination with timers in a KeyedProcessFunction is more likely to be the building block you need. Here's an example that might help you get started (but you'll be wanting to handle the timers differently).
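To make the "map of seen ids plus expiry" idea concrete, here is a rough sketch in plain Python. It is not Flink code: the class and method names are invented, the per-key dictionary only stands in for keyed MapState, and the expiry loop stands in for what timers would do. It also assumes timestamps arrive in non-decreasing order per key.

```python
from collections import OrderedDict
from typing import Dict, Hashable


class KeyedDeduplicator:
    """Remembers event ids per key for ttl_s seconds, at most max_size per key."""

    def __init__(self, ttl_s: float = 60.0, max_size: int = 10_000):
        self.ttl_s = ttl_s
        self.max_size = max_size
        # key -> {event_id: first-seen timestamp}, oldest entry first
        self._seen: Dict[Hashable, "OrderedDict[str, float]"] = {}

    def is_first(self, key: Hashable, event_id: str, now_s: float) -> bool:
        """Return True if event_id has not been seen for this key within the TTL."""
        seen = self._seen.setdefault(key, OrderedDict())
        # Expire old entries (the job a timer would do in a KeyedProcessFunction).
        while seen and next(iter(seen.values())) <= now_s - self.ttl_s:
            seen.popitem(last=False)
        if event_id in seen:
            return False
        seen[event_id] = now_s
        if len(seen) > self.max_size:
            seen.popitem(last=False)  # enforce the size cap by dropping the oldest
        return True
```

Unlike the tumbling-window attempt, a repeat at second 59 is still caught and the id is only forgotten once a full TTL has passed. Which key to partition by depends on the requirements, which, as noted above, aren't entirely clear.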

Keeping repository synced with multiple clients

I have a WPF application that uses Entity Framework. I am going to implement a repository pattern to make interactions with EF simpler and more testable. Multiple clients can use this application, connect to the same database and do CRUD operations. I am trying to think of a way to synchronize the clients' repositories when one of them makes a change to the database. Could anyone give me some direction on how one would solve this type of issue, and some patterns that would be beneficial for this type of problem?
I would be very open to any information/books on how to keep clients synchronized, and even alerted to things other clients are doing (the only thing I could think of was having a server process running that passes messages around). Thank you
The easiest way by far to keep every client UI up to date is simply to refresh the data every so often. If it's really that important, you can set a DispatcherTimer to tick every minute and fetch the latest version of the data that is being displayed.
Clearly, I'm not suggesting that you refresh an item that is being edited, but if you get the fresh data, you can certainly compare collections with what's being displayed currently. Rather than just replacing the old collection items with the new, you can be more user friendly and just add the new ones, remove the deleted ones and update the newer ones.
You could even detect whether an item being currently edited has been saved by another user since the current user opened it and alert them to the fact. So rather than concentrating on some system to track all data changes, you should put your effort into being able to detect changes between two sets of data and then seamlessly integrating it into the current UI state.
UPDATE >>>
There is absolutely no benefit from holding a complete set of data in your application (or repository). In fact, you may well find that it adds detrimental effects, due to the extra RAM requirements. If you are polling data every few minutes, then it will always be up to date anyway.
So rather than asking for all of the data all of the time, just ask for what the user wants to see (dependent on which view they are currently in) and update it every now and then. I do this by simply fetching the same data that the view requires when it is first opened. I wrote some methods that compare every property of every item with their older counterparts in the UI and switch old for new.
Think of the Equals method... You could do something like this:
public override bool Equals(Release otherRelease)
{
return base.Equals(otherRelease) && Title == otherRelease.Title &&
Artist.Equals(otherRelease.Artist) && Artists.Equals(otherRelease.Artists);
}
(Don't actually use the Equals method though, or you'll run into problems later). And then something like this:
if (!oldRelease.Equals(newRelease)) oldRelease.UpdatePropertyValues(newRelease);
And/Or this:
if (!oldReleases.Contains(newRelease)) oldReleases.Add(newRelease);
I'm guessing that you get the picture now.
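The add/remove/update merge described above can be sketched roughly as follows. This is language-neutral logic written in Python rather than WPF code; the function name, the id-keyed dictionaries, and the being_edited parameter are all assumptions for illustration:

```python
from typing import Dict, FrozenSet, List, Tuple

Item = Dict[str, object]


def merge_fresh_data(
    displayed: Dict[int, Item],
    fresh: Dict[int, Item],
    being_edited: FrozenSet[int] = frozenset(),
) -> Tuple[List[int], List[int], List[int]]:
    """Merge freshly polled items into the displayed ones, in place.

    Returns the ids that were (added, removed, updated).
    """
    added, removed, updated = [], [], []
    # Drop items that no longer exist, unless the user is editing them.
    for item_id in list(displayed):
        if item_id not in fresh and item_id not in being_edited:
            del displayed[item_id]
            removed.append(item_id)
    for item_id, new_item in fresh.items():
        if item_id in being_edited:
            continue  # never overwrite an item that is open for editing
        old_item = displayed.get(item_id)
        if old_item is None:
            displayed[item_id] = dict(new_item)
            added.append(item_id)
        elif old_item != new_item:
            # Update the existing object so anything bound to it sees the change.
            old_item.clear()
            old_item.update(new_item)
            updated.append(item_id)
    return added, removed, updated
```

Updating the existing objects instead of swapping the whole collection is what keeps selection and scroll position intact in the UI.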

Objects not saving using Objectify and GAE

I'm trying to save an object and verify that it is saved right after, and it doesn't seem to be working.
Here is my object
import java.util.ArrayList;
import com.googlecode.objectify.annotation.Entity;
import com.googlecode.objectify.annotation.Id;
@Entity
public class PlayerGroup {
    @Id public String n; // group name, e.g. "sharks"
    public ArrayList<String> m; // members, e.g. [39393, 23932932, 3223]
}
Here is the code for saving then trying to load right after.
playerGroup = new PlayerGroup();
playerGroup.n = reqPlayerGroup.n;
playerGroup.m = reqPlayerGroup.m;
ofy().save().entity(playerGroup).now();
response.i = playerGroup;
PlayerGroup newOne = ofy().load().type(PlayerGroup.class).id(reqPlayerGroup.n).get();
But the "newOne" object is null, even though I just finished saving it. What am I doing wrong?
--Update--
If I try later (like minutes later) sometimes I do see the object, but not right after saving. Does this have to do with the high replication storage?
I had the same behavior some time ago and asked a question on the Objectify Google Group.
Here is the answer I got:
You are seeing the eventual consistency of the High-Replication
Datastore. There has been a lot of discussion of this exact subject
on the Objectify list in Google Groups, including several links to the
Google documentation on the subject.
Basically, any kind of query which does not include an ancestor() may
return results from a stale view of the datastore.
Jeff
I also got another good answer on how to deal with the behavior:
For deletes, query for keys and then batch-get the entities. Make sure
your gets are set to strong consistency (though I believe this is the
default). The batch-get should return null for the deleted entities.
When adding, it gets a little trickier. Index updates can take a few
seconds. AFAIK, there are three ways out of this: 1; Use precomputed
results (avoiding the query entirely). If your next view is the user's
recently created entities, keep a list of those keys in the user
entity, and update that list when a new entity is created. That list
will always be fresh, no query required. Besides avoiding stale
indexes, this also speeds up your app. The more result sets you
can reliably manage, the more queries you can avoid.
2; Hide the latency by "enhancing" the query results with the recently
added entities. Depending on the rate at which you're adding entities,
either inject only the most recent key, or combine this with the
solution in 1.
3; Hide the latency by taking the user through some unaffected views
before landing on your query-based view. This strategy definitely has
a smell over it. You need to make sure those extra steps are relevant
to the user, or you'll give a poor experience.
Butterflies, Joakim
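A rough sketch of how the delete advice and option 2 could be combined, in plain Python with an in-memory stand-in for the datastore. The function name and the get_by_key callback are hypothetical; get_by_key represents a strongly consistent get by key that returns None for deleted entities:

```python
from typing import Callable, Dict, Iterable, List, Optional

Entity = Dict[str, object]


def enhanced_results(
    query_keys: Iterable[str],
    recent_keys: Iterable[str],
    get_by_key: Callable[[str], Optional[Entity]],
) -> List[Entity]:
    """Combine possibly stale query keys with keys we know were just added."""
    # Recently added keys first; dict.fromkeys drops duplicates but keeps order.
    ordered = list(dict.fromkeys([*recent_keys, *query_keys]))
    # Fetch by key; entities deleted since the index was written come back as None.
    entities = (get_by_key(key) for key in ordered)
    return [entity for entity in entities if entity is not None]
```

For example, with store = {"a": {...}, "c": {...}}, a stale query returning ["a", "b"] and recent keys ["c"], calling enhanced_results(["a", "b"], ["c"], store.get) yields the entities for "c" and "a" and silently drops the deleted "b".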
You can read it all here:
How come If I dont use async api after I'm deleting an object i still get it in a query that is being done right after the delete or not getting it right after I add one
Another good answer to a similar question : Objectify doesn't store synchronously, even with now

Updating Model Schema in Google App Engine?

Google is proposing changing one entry at a time to the default values ....
http://code.google.com/appengine/articles/update_schema.html
I have a model with a million rows, and doing this with a web browser will take me ages. Another option is to run this using task queues, but this will cost me a lot of CPU time.
Is there any easy way to do this?
Because the datastore is schema-less, you do literally have to add or remove properties on each instance of the Model. Using Task Queues should use the exact same amount of CPU as doing it any other way, so go with that.
Before you go through all of that work, make sure that you really need to do it. As noted in the article that you link to, it is not the case that all entities of a particular model need to have the same set of properties. Why not change your Model class to check for the existence of new or removed properties and update the entity whenever you happen to be writing to it anyhow?
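A minimal sketch of that upgrade-on-write idea, using plain dicts instead of the App Engine model classes. The property names, the default value, and the put callback are all placeholders for illustration:

```python
from typing import Callable, Dict

Entity = Dict[str, object]

DEFAULT_NEW_VALUE = "default"  # assumed default for the new property


def upgrade(entity: Entity) -> Entity:
    """Bring one entity to the current shape; safe to call repeatedly."""
    entity.setdefault("new_attribute", DEFAULT_NEW_VALUE)  # add if missing
    entity.pop("old_attribute", None)  # remove if still present
    return entity


def save(entity: Entity, put: Callable[[Entity], None]) -> None:
    """Upgrade lazily, right before a write that was happening anyway."""
    put(upgrade(entity))
```

Entities that are never written again simply keep the old shape, so any code that reads them still has to tolerate missing or extra properties.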
Instead of what the docs suggest, I would suggest using the low-level GAE API to migrate.
The following code will migrate all the items of type DbMyModel:
new_attribute will be added if it does not exist.
old_attribute will be deleted if it exists.
changed_attribute will be converted from boolean to string (True to Priority 1, False to Priority 3)
Please note that query.Run returns an iterator of Entity objects. Entity objects behave like dicts:
from google.appengine.api.datastore import Query, Put
query = Query("DbMyModel")
for item in query.Run():
    if 'new_attribute' not in item:
        item['new_attribute'] = some_value  # some_value: your chosen default
    if 'old_attribute' in item:
        del item['old_attribute']
    # .get() avoids a KeyError on entities that lack the property
    if item.get('changed_attribute') is True:
        item['changed_attribute'] = 'Priority 1'
    elif item.get('changed_attribute') is False:
        item['changed_attribute'] = 'Priority 3'
    # and so on...
    # Put the item to the db:
    Put(item)
In case you need to select only some records, see the google.appengine.api.datastore module's source code for extensive documentation and examples of how to create a filtered query.
Using this approach it is simpler to remove/add properties and avoid issues when you have already updated your application model than in GAE's suggested approach.
For example, now-required fields might not exist (yet) causing errors while migrating. And deleting fields does not work for static properties.
This doesn't help OP but may help googlers with a tiny app: I did what Alex suggested, but simpler. Obviously this isn't appropriate for production apps.
Deploy App Engine Console.
Write code right inside the web interpreter against your live datastore, like so:
from models import BlogPost
for item in BlogPost.all():
    item.attr = "defaultvalue"
    item.put()
