State handling on KeyedCoProcessFunction serving ML models - apache-flink

I am working on a KeyedCoProcessFunction that looks like this:
class MyOperator extends KeyedCoProcessFunction[String, ModelDef, Data, Prediction]
    with CheckpointedFunction {
  // To hold loaded models
  @transient private var models: HashMap[(String, String), Model] = _
  // For serialization purposes
  @transient private var modelsBytes: MapState[(String, String), Array[Byte]] = _
  ...
  override def snapshotState(context: FunctionSnapshotContext): Unit = {
    modelsBytes.clear() // This raises an exception when there is no active key set
    for ((k, model) <- models) {
      modelsBytes.put(k, model.toBytes(v))
    }
  }
  override def initializeState(context: FunctionInitializationContext): Unit = {
    // Key and value classes must match the declared type of modelsBytes
    modelsBytes = context.getKeyedStateStore.getMapState[(String, String), Array[Byte]](
      new MapStateDescriptor("modelsBytes", classOf[(String, String)], classOf[Array[Byte]])
    )
    if (context.isRestored) {
      // restore models from modelsBytes
    }
  }
}
The state consists of a collection of ML models built using a third party library. Before checkpoints, I need to dump the loaded models into byte arrays in snapshotState.
My problem is that, within snapshotState, modelsBytes.clear() raises an exception when there is no active key. This happens when I start the application from scratch without any data on the input streams. When the time for a checkpoint comes, I get this error:
java.lang.NullPointerException: No key set. This method should not be called outside of a keyed context.
However, when the input stream contains data, checkpoints work just fine. This confuses me, because snapshotState does not provide a keyed context (unlike processElement1 and processElement2, where the current key is accessible via ctx.getCurrentKey). It seems to me that the calls to clear and put within snapshotState should always fail, since they are supposed to work only within a keyed context. Can anyone clarify whether this is actually the expected behaviour?

A keyed state can only be used on a keyed stream as written in the documentation.
* <p>The state is only accessible by functions applied on a {@code KeyedStream}. The key is
* automatically supplied by the system, so the function always sees the value mapped to the
* key of the current element. That way, the system can handle stream and state partitioning
* consistently together.
If you call clear(), you will not clear the whole map, but just reset the state of the current key. The key is always known in processElementX.
/**
* Removes the value mapped under the current key.
*/
void clear();
You should actually receive a better exception when you try to call clear in a function other than processElementX. In the end, you are using the keyed state incorrectly.
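To make the point about clear() concrete, here is a toy model of keyed map state in plain Java. It is only a sketch: ToyKeyedMapState and its methods are hypothetical names, not Flink classes, and Flink's real state backends are more involved. The sketch assumes the runtime sets a "current key" before each element is processed and that every state call is resolved against that key. Under that assumption, a call made before any element has arrived fails because no key has been set, while a call made later silently operates on whatever key was set last, which would be consistent with the behaviour described in the question.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of keyed MapState: one inner map per key, addressed through a "current key".
// This is NOT Flink's implementation; all names here are hypothetical.
class ToyKeyedMapState<K, UK, UV> {
    private final Map<K, Map<UK, UV>> perKey = new HashMap<>();
    private K currentKey; // assumed to be set by the runtime before each element is processed

    void setCurrentKey(K key) {
        currentKey = key;
    }

    private K requireKey() {
        if (currentKey == null) {
            throw new NullPointerException("No key set.");
        }
        return currentKey;
    }

    void put(UK userKey, UV value) {
        perKey.computeIfAbsent(requireKey(), k -> new HashMap<>()).put(userKey, value);
    }

    UV get(UK userKey) {
        Map<UK, UV> entries = perKey.get(requireKey());
        return entries == null ? null : entries.get(userKey);
    }

    // Removes only the entries stored under the current key, not the whole state.
    void clear() {
        perKey.remove(requireKey());
    }

    // Number of keys that currently have any entries.
    int keysWithState() {
        return perKey.size();
    }
}
```

In this model, clearing under key "b" leaves the entries of key "a" untouched, which mirrors the Javadoc quoted above.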
Now for your actual problem. I'm assuming you are using a KeyedCoProcessFunction because the models are updated through a separate input. If they were static, you could just load them in open() from a static source (for example, one included in the jar). Furthermore, often there is only one model that is applied to all values regardless of key; in that case you could use broadcast state. So I'm assuming you have different models for different types of data, separated by key.
If they are coming in from input2, then you can serialize them right away, upon invocation of processElement2.
override def processElement2(model: Model, ctx: Context, collector): Unit = {
  models.put(ctx.getCurrentKey, model)
  modelsBytes.put(ctx.getCurrentKey, model.toBytes(v))
}
Then you would not override snapshotState, as the state is already up-to-date. initializeState would deserialize models eagerly or you could also materialize them lazily in processElement1.
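The lazy variant can be sketched in plain Java as follows. This is a minimal illustration, not Flink code: LazyModelCache is a hypothetical name, the byte map merely stands in for the checkpointed state, and the serialize/deserialize functions are placeholders for whatever the third-party model library provides. The idea is that the serialized bytes are the durable copy, while the live model objects are a cache that is rebuilt on first use after a restore.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: serialized bytes are the durable copy,
// live model objects are a cache that can always be rebuilt from them.
class LazyModelCache<M> {
    private final Map<String, byte[]> bytesByKey;         // stands in for checkpointed state
    private final Map<String, M> live = new HashMap<>();  // transient, empty after a restart
    private final Function<byte[], M> deserialize;
    int deserializations = 0;                              // exposed only to observe the laziness

    LazyModelCache(Map<String, byte[]> restoredBytes, Function<byte[], M> deserialize) {
        this.bytesByKey = restoredBytes;
        this.deserialize = deserialize;
    }

    // Where a new model arrives: keep the live object and its bytes in sync.
    void update(String key, M model, Function<M, byte[]> serialize) {
        live.put(key, model);
        bytesByKey.put(key, serialize.apply(model));
    }

    // Where data is scored: materialize the model on first use, then reuse it.
    M get(String key) {
        M model = live.get(key);
        if (model == null) {
            byte[] bytes = bytesByKey.get(key);
            if (bytes == null) {
                return null; // no model known for this key
            }
            model = deserialize.apply(bytes);
            deserializations++;
            live.put(key, model);
        }
        return model;
    }
}
```

Compared with eager deserialization in initializeState, this only pays the deserialization cost for keys that actually receive data after a restore.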

Related

Flink re-scalable keyed stream stateful function

I have the following Flink job, where I tried to use a stateful function on a keyed stream (MapState) with the RocksDB state backend:
environment
    .addSource(consumer).name("MyKafkaSource").uid("kafka-id")
    .flatMap(pojoMapper).name("MyMapFunction").uid("map-id")
    .keyBy(new MyKeyExtractor())
    .map(new MyRichMapFunction()).name("MyRichMapFunction").uid("rich-map-id")
    .addSink(sink).name("MyFileSink").uid("sink-id");
MyRichMapFunction is a stateful function that extends RichMapFunction and has the following code:
public static class MyRichMapFunction extends RichMapFunction<MyEvent, MyEvent> {
    private transient MapState<String, Boolean> cache;
    @Override
    public void open(Configuration config) {
        MapStateDescriptor<String, Boolean> descriptor =
            new MapStateDescriptor<>("seen-values", TypeInformation.of(new TypeHint<String>() {}), TypeInformation.of(new TypeHint<Boolean>() {}));
        cache = getRuntimeContext().getMapState(descriptor);
    }
    @Override
    public MyEvent map(MyEvent value) throws Exception {
        // Already seen under the current key: flag it and pass it on
        if (cache.contains(value.getEventId())) {
            value.setIsSeenAlready(Boolean.TRUE);
            return value;
        }
        value.setIsSeenAlready(Boolean.FALSE);
        cache.put(value.getEventId(), Boolean.TRUE);
        return value;
    }
}
In the future, I would like to rescale the parallelism (from 2 to 4). How can I make the keyed state re-scalable, so that after changing the parallelism each key's cached data ends up in the task slot that now owns that key? While exploring this I found documentation saying that re-scalable operator state can be achieved with the ListCheckpointed interface, which provides snapshotState/restoreState methods. But I am not sure how re-scalable keyed state (as in MyRichMapFunction) can be achieved. Do I need to implement the ListCheckpointed interface in MyRichMapFunction? If so, how can I redistribute the cache according to the new parallelism's key hashing in restoreState? (My MapState will hold a huge number of keys with TTL enabled; say up to 1 billion keys at any point in time.) Could someone please help me with this, or point me to an example?
The code you've written is already rescalable; Flink's managed keyed state is rescalable by design. Keyed state is rescaled by rebalancing the assignment of keys to instances. (You can think of keyed state as a sharded key/value store. Technically what happens is that consistent hashing is used to map keys to key groups, and each parallel instance is responsible for some of the key groups. Rescaling simply involves redistributing the key groups among the instances.)
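The key-group mechanism can be sketched in a few lines of plain Java. This is a simplified illustration rather than Flink's actual code: KeyGroupSketch is a hypothetical class, the value 128 for the maximum parallelism is just an example, and the sketch uses the plain hashCode() where Flink (as far as I know) additionally scrambles it with a murmur hash. The point it shows is that a key's key group depends only on the key and the maximum parallelism, so changing the parallelism only changes which instance owns each group.

```java
// Simplified sketch of key-group based rescaling; not Flink's actual classes.
class KeyGroupSketch {
    // A key always maps to the same key group, regardless of the current parallelism.
    // Assumption: the extra hash scrambling done by the real system is omitted here.
    static int keyGroupFor(Object key, int maxParallelism) {
        return Math.floorMod(key.hashCode(), maxParallelism);
    }

    // Each parallel instance owns a contiguous range of key groups.
    static int operatorIndexFor(int keyGroup, int maxParallelism, int parallelism) {
        return keyGroup * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 128; // example value
        for (String key : new String[] {"event-1", "event-2", "event-3"}) {
            int keyGroup = keyGroupFor(key, maxParallelism);
            System.out.println(key + " -> key group " + keyGroup
                + ", instance at parallelism 2: " + operatorIndexFor(keyGroup, maxParallelism, 2)
                + ", at parallelism 4: " + operatorIndexFor(keyGroup, maxParallelism, 4));
        }
    }
}
```

Because state is stored per key group, rescaling from 2 to 4 means handing whole key groups to different instances; no user code has to rehash individual keys.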
The ListCheckpointed interface is for state used in a non-keyed context, so it's inappropriate for what you are doing. Note also that ListCheckpointed will be deprecated in Flink 1.11 in favor of the more general CheckpointedFunction.
One more thing: if MyKeyExtractor is keying by value.getEventId(), then you could be using ValueState<Boolean> for your cache, rather than MapState<String, Boolean>. This works because with keyed state there is a separate value of ValueState for every key. You only need to use MapState when you need to store multiple attribute/value pairs for each key in your stream.
Most of this is discussed in the Flink documentation under Hands-on Training, which includes an example that's very close to what you are doing.

On efficient checkpoints with dynamic (self-evolving) keyed state

In a KeyedCoProcessFunction, I am managing keyed state that consists of third-party library models. These models are created in processElement1 when new data arrives on the control stream. Because the models are self-evolving, in the sense that they have their own internal state, I need to make sure that they are serialized into modelsBytes whenever their state changes. My first attempt goes like this:
class MyOperator
    extends KeyedCoProcessFunction[String, Control, Data, Prediction]
    with CheckpointedFunction {
  // To hold loaded models
  @transient private var models: HashMap[String, Model] = _
  // For serialization purposes
  @transient private var modelsBytes: MapState[String, Array[Byte]] = _
  override def processElement1(control, ctx, ...) {
    if (restoreModels) {
      restoreModels()
    }
    // - Create new model out of `control` element
    // - Add it to `models` keyed state
  }
  override def processElement2(data, ctx, ...) {
    if (restoreModels) {
      restoreModels()
    }
    // - Send `data` element to the corresponding models
    //   This will update their internal states
  }
  override def snapshotState(context: FunctionSnapshotContext): Unit = {
    // Suspicious, wishful-thinking code that compiles and runs just fine
    for ((k, model) <- models) {
      modelsBytes.put(k, model.toBytes(v))
    }
  }
  override def initializeState(context: FunctionInitializationContext): Unit = {
    // Key and value classes must match the declared type of modelsBytes
    modelsBytes = context.getKeyedStateStore.getMapState[String, Array[Byte]](
      new MapStateDescriptor("modelsBytes", classOf[String], classOf[Array[Byte]])
    )
    if (context.isRestored) restoreModels = true
  }
}
So, the idea is to use snapshotState to overwrite the keyed state entries in modelsBytes. I am trying this approach because serializing the models (model.toBytes) might be an expensive operation, so I would prefer to do it once per model when a checkpoint comes. The problem is that this approach might be conceptually wrong. Here is why: even though the code within snapshotState compiles and runs just fine, I am accessing keyed state without a keyed context being passed in, so it is not clear at all which key I am really working on. I have written a small test to verify the checkpoints, and I have observed that from time to time I get an empty state back, even though the modelsBytes entries were updated in snapshotState. So it seems that snapshotting my models like this is not reliable at all. What confuses me is that the user is perfectly allowed to do this. Maybe the put method should raise an exception to make it clear that a keyed context is required in the first place; otherwise it gives false hope and might lead to hard-to-spot bugs. As a matter of fact, shouldn't this be considered a bug?
The other option I have is, of course, to serialize my models in processElement2, after sending new data elements to them. However, continuously serializing my models to update modelsBytes might be costly.
What would be the most efficient way to handle this scenario?

Getting Collection of particular Type with Hybris ModelService

Hi!
I want to get all objects (rows of the Test type) with ModelService,
so that I can iterate through the collection and update a single row's (object's) attribute with a new value.
I see getModelService.create(TestModel.class) and getModelService.save(),
but won't they create a new object/row rather than update an existing one?
I don't want to create a new one; I want to select an existing one matching my criteria and update one attribute of it.
Can somebody tell me whether List<TestModel> testModels = getModelService.get(TestModel.class) will return all rows (a collection) of the Test type/table?
Unfortunately I can't test it, so I need help.
Actually I am in a ValidateInterceptor, and based on the changed attribute value of the intercepted model I have to update an attribute of another model.
Thanks.
ModelService.create(TestModel.class) will create a single instance of the specified type and attach it to the ModelService's context.
But it will only be saved to the persistence store when you call modelService.save(newInstance)
ModelService.get() returns a model object but expects a Jalo object as input, (Jalo being the legacy persistence layer of hybris) so that won't work for you.
To retrieve objects you can either write your own queries using the FlexibleSearchService or you can have a look at the DefaultGenericDao which has a bunch of simple find() type of methods.
Typically you would inject the dao like e.g.:
private GenericDao<TestModel> dao;
[...]
public void myMethod()
{
List<TestModel> allTestModels = dao.find();
[...]
}
There are a lot more methods with which you can create WHERE type of statements to restrict your result.
Regarding ValidateInterceptor:
Have a look at the wiki page for the lifecycle of interceptors:
https://wiki.hybris.com/display/release5/Interceptors
It's not a good idea to modify 'all' objects of a type while being an interceptor of that type.
So if you're in an interceptor declared for the Test item type, then don't try to modify the items there.
If you happen to be in a different interceptor and want to modify items of a different type:
E.g. you have Type1 which has a list of Type2 objects in it and in the interceptor for Type1 you want to modify all Type2 objects.
For those scenarios you would have to add the instances of Type2 that you modify to the interceptor context so that those changes will be persisted.
That would be something like:
void onValidate(Type1 model, InterceptorContext ctx) throws InterceptorException
{
    ...
    List<Type2> type2s = dao.find();
    for (Type2 type2 : type2s)
    {
        // do something with it
        // then make sure to persist that change
        ctx.registerElementFor(type2, PersistenceOperation.SAVE);
        [...]
    }
}
First of all, I think it's not a good idea to create or update models in any interceptor, especially in a validation one.
Regarding your question:
ModelService in most cases works with a single model and is
designed for create/update/delete operations.
To retrieve all models of a certain type, you have to use FlexibleSearchService.
Then, to update each retrieved TestType model, you can use ModelService's save method.
A query to retrieve all TestType models will look like:
SELECT {PK} FROM {TestType}
You could simply use the Flexible Search Service search by example method, and the model service to save them all. Here is an example using Groovy script, with all products :
import java.util.List
import de.hybris.platform.core.model.product.ProductModel
import de.hybris.platform.servicelayer.search.FlexibleSearchService
import de.hybris.platform.servicelayer.model.ModelService
FlexibleSearchService fsq = spring.getBean("flexibleSearchService")
ModelService ms = spring.getBean("modelService")
ProductModel prd = ms.create(ProductModel.class)
List<ProductModel> products = fsq.getModelsByExample(prd)
//Do Whatever you want with the objects in the List
ms.saveAll(products)

How to save and retrieve view when it's needed

My goal is to keep the session size as small as possible. (Why? That's another topic.)
What I have is a phase listener declared in faces-config.xml:
<lifecycle>
<phase-listener>mypackage.listener.PhaseListener</phase-listener>
</lifecycle>
I want to save all views except the last one (two at most) in some memcache. Getting the session map:
Map<String, Object> sessionMap = event.getFacesContext().getExternalContext().getSessionMap();
in the beforePhase(PhaseEvent event) method gives me access to all views. So here I could save all views to the memcache and delete them from the session. The question is: where in JSF are the views that are still loaded in the browser requested, so that I can put such a view back when it's needed? Is it possible at all? Thank you.
To address the core of your question, implement a ViewHandler, within which you can take control of the RESTORE_VIEW and RENDER_RESPONSE phases/processes. You'll save the view during the RENDER_RESPONSE and selectively restore, during the RESTORE_VIEW phase. Your view handler could look something like the following
public class CustomViewHandlerImpl extends ViewHandlerWrapper {
    @Inject ViewStore viewStore; // hypothetical storage for the views. Could be anything, like a ConcurrentHashMap
    ViewHandler wrapped;
    public CustomViewHandlerImpl(ViewHandler toWrap) {
        this.wrapped = toWrap;
    }
    @Override
    public ViewHandler getWrapped() {
        return wrapped; // ViewHandlerWrapper delegates to whatever this returns
    }
    @Override
    public UIViewRoot restoreView(FacesContext context, String viewId) {
        // this assumes you've previously saved the view, using the viewId
        UIViewRoot theView = viewStore.get(viewId);
        if (theView == null) {
            // not in the store: fall back to the standard restore mechanism
            theView = getWrapped().restoreView(context, viewId);
        }
        return theView;
    }
    @Override
    public void renderView(FacesContext context, UIViewRoot viewToRender) throws IOException, FacesException {
        viewStore.put(viewToRender.getId(), viewToRender);
        getWrapped().renderView(context, viewToRender);
    }
}
Simply plug in your custom viewhandler, using
<view-handler>com.you.customs.CustomViewHandlerImpl</view-handler>
Of course, you probably don't want to give this treatment to all your views; you're free to add any conditions to the logic above, to implement conditional view-saving and restoration.
You should also consider other options. It appears that you're conflating issues here. If your true concern is to limit the overhead associated with view processing, you should consider:
1. Stateless views, new with JSF 2.2. The stateless view option allows you to exclude specific pages from the JSF view-saving mechanism, simply by specifying transient="true" on the f:view. Much cleaner than mangling the UIViewRoot by hand. The caveat here is that a stateless view cannot be backed by scopes that depend on state saving, i.e. @ViewScoped. In a stateless view, the @ViewScoped bean is going to be recreated for every postback. Ajax functionality also suffers in this scenario, because state saving is the backbone of ajax operations.
2. Selectively marking components as transient. The transient property is available for all UIComponents, which means that, on a per-view basis, you can mark specific components with transient="true", effectively giving you the same benefits as 1) but on a much smaller scope, and without the downside of losing @ViewScoped.
EDIT: For some reason, UIViewRoot#getViewId() is not returning the name of the current view (this might be a bug). Alternatively, you can use
ExternalContext extCtxt = FacesContext.getCurrentInstance().getExternalContext();
String viewName = ((HttpServletRequest)extCtxt.getRequest()).getRequestURI(); //use this id as the key to store your views instead

Entity Framework and WPF best practices

Is it ever a good idea to work directly with the context? For example, say I have a database of customers and a user can search them by name, display a list, choose one, then edit that customer's properties.
It seems I should use the context to get a list of customers (mapped to POCOs or CustomerViewModels) and then immediately close the context. Then, when the user selects one of the CustomerViewModels in the list the customer properties section of the UI populates.
Next they can change the name, type, website address, company size, etc. Upon hitting a save button, I then open a new context, use the ID from the CustomerViewModel to retrieve that customer record, and update each of its properties. Finally, I call SaveChanges() and close the context. This is a LOT OF WORK.
My question is why not just work directly with the context leaving it open throughout? I have read using the same context with a long lifetime scope is very bad and will inevitably cause problems. My assumption is if the application will only be used by ONE person I can leave the context open and do everything. However, if there will be many users, I want to maintain a concise unit of work and thus open and close the context on a per request basis.
Any suggestions? Thanks.
@PGallagher - Thanks for the thorough answer.
@Brice - your input is helpful as well.
However, @Manos D., the 'epitome of redundant code' comment concerns me a bit. Let me go through an example. Let's say I'm storing customers in a database and one of my customer properties is CommunicationMethod.
[Flags]
public enum CommunicationMethod
{
None = 0,
Print = 1,
Email = 2,
Fax = 4
}
The UI for my manage customers page in WPF will contain three check boxes under the customer communication method (Print, Email, Fax). I can't bind each checkbox to that enum, it doesn't make sense. Also, what if the user clicked that customer, gets up and goes to lunch... the context sits there for hours which is bad. Instead, this is my thought process.
End user chooses a customer from the list. I new up a context, find that customer and return a CustomerViewModel, then the context is closed (I've left repositories out for simplicity here).
using(MyContext ctx = new MyContext())
{
CurrentCustomerVM = new CustomerViewModel(ctx.Customers.Find(customerId));
}
Now the user can check/uncheck the Print, Email, Fax buttons as they are bound to three bool properties in the CustomerViewModel, which also has a Save() method. Here goes.
public class CustomerViewModel : ViewModelBase
{
    Customer _customer;
    public CustomerViewModel(Customer customer)
    {
        _customer = customer;
    }
    public bool CommunicateViaEmail
    {
        get { return _customer.CommunicationMethod.HasFlag(CommunicationMethod.Email); }
        set
        {
            if (value == _customer.CommunicationMethod.HasFlag(CommunicationMethod.Email)) return;
            if (value)
                _customer.CommunicationMethod |= CommunicationMethod.Email;
            else
                _customer.CommunicationMethod &= ~CommunicationMethod.Email;
        }
    }
    public bool CommunicateViaFax
    {
        get { return _customer.CommunicationMethod.HasFlag(CommunicationMethod.Fax); }
        set
        {
            if (value == _customer.CommunicationMethod.HasFlag(CommunicationMethod.Fax)) return;
            if (value)
                _customer.CommunicationMethod |= CommunicationMethod.Fax;
            else
                _customer.CommunicationMethod &= ~CommunicationMethod.Fax;
        }
    }
    public bool CommunicateViaPrint
    {
        get { return _customer.CommunicationMethod.HasFlag(CommunicationMethod.Print); }
        set
        {
            if (value == _customer.CommunicationMethod.HasFlag(CommunicationMethod.Print)) return;
            if (value)
                _customer.CommunicationMethod |= CommunicationMethod.Print;
            else
                _customer.CommunicationMethod &= ~CommunicationMethod.Print;
        }
    }
    public void Save()
    {
        using (MyContext ctx = new MyContext())
        {
            var toUpdate = ctx.Customers.Find(_customer.Id);
            // All three check boxes map onto the single flags property
            toUpdate.CommunicationMethod = _customer.CommunicationMethod;
            ctx.SaveChanges();
        }
    }
}
Do you see anything wrong with this?
It is OK to use a long-running context; you just need to be aware of the implications.
A context represents a unit of work. Whenever you call SaveChanges, all the pending changes to the entities being tracked will be saved to the database. Because of this, you'll need to scope each context to what makes sense. For example, if you have a tab to manage customers and another to manage products, you might use one context for each, so that when a user clicks save on the customer tab, the changes they made to products are not saved as well.
Having a lot of entities tracked by a context could also slow down DetectChanges. One way to mitigate this is by using change tracking proxies.
Since the time between loading an entity and saving that entity could be quite long, the chance of hitting an optimistic concurrency exception is greater than with short-lived contexts. These exceptions occur when an entity is changed externally between loading and saving it. Handling these exceptions is pretty straightforward, but it's still something to be aware of.
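The conflict described here can be illustrated with a small language-neutral sketch, written in Java. It is not Entity Framework code: VersionedStore and its row-version field are hypothetical, and this sketch reports a conflict by returning false, whereas Entity Framework, as far as I know, raises a concurrency exception when a configured concurrency token no longer matches. The idea is the same in both cases: a save succeeds only if the row has not changed since it was loaded.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory "table" with a row version, to illustrate optimistic concurrency.
class VersionedStore {
    static final class Row {
        final String value;
        final int version;

        Row(String value, int version) {
            this.value = value;
            this.version = version;
        }
    }

    private final Map<Integer, Row> rows = new HashMap<>();

    void insert(int id, String value) {
        rows.put(id, new Row(value, 1));
    }

    Row load(int id) {
        return rows.get(id);
    }

    // Succeeds only if nobody has saved this row since `loaded` was read.
    boolean save(int id, Row loaded, String newValue) {
        Row current = rows.get(id);
        if (current == null || current.version != loaded.version) {
            return false; // conflict: the caller should reload, then retry or ask the user
        }
        rows.put(id, new Row(newValue, current.version + 1));
        return true;
    }
}
```

The longer a loaded row sits in an open context, the more likely it is that another writer bumps the version first, which is why long-lived contexts see these conflicts more often.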
One cool thing you can do with long-lived contexts in WPF is bind to the DbSet.Local property (e.g. context.Customers.Local). This is an ObservableCollection that contains all of the tracked entities that are not marked for deletion.
Hopefully this gives you a bit more information to help you decide which approach to take.
Microsoft Reference:
http://msdn.microsoft.com/en-gb/library/cc853327.aspx
They say:
Limit the scope of the ObjectContext
"In most cases, you should create an ObjectContext instance within a using statement (Using…End Using in Visual Basic). This can increase performance by ensuring that the resources associated with the object context are disposed automatically when the code exits the statement block. However, when controls are bound to objects managed by the object context, the ObjectContext instance should be maintained as long as the binding is needed and disposed of manually."
For more information, see Managing Resources in Object Services (Entity Framework): http://msdn.microsoft.com/en-gb/library/bb896325.aspx
Which says:
"In a long-running object context, you must ensure that the context is disposed when it is no longer required."
StackOverflow Reference:
This StackOverflow question also has some useful answers...
Entity Framework Best Practices In Business Logic?
Where a few have suggested that you promote your context to a higher level and reference it from here, thus keeping only one single Context.
My ten pence worth:
Wrapping the context in a Using statement ensures that it is disposed as soon as the block exits, so its resources are released promptly and memory leaks are avoided.
Obviously in simple apps this isn't much of a problem. However, if you have multiple screens, all using a lot of data, you could end up in trouble unless you are certain to dispose your context correctly.
Hence I have employed a method similar to the one you have mentioned: I've added an AddOrUpdate method to each of my repositories, to which I pass my new or modified entity, and which updates or adds it depending on whether it already exists.
Updating Entity Properties:
Regarding updating properties however, I've used a simple function which uses reflection to copy all the properties from one Entity to Another;
Public Shared Function CopyProperties(Of sourceType As {Class, New}, targetType As {Class, New})(ByVal source As sourceType, ByVal target As targetType) As targetType
    Dim sourceProperties() As PropertyInfo = source.GetType().GetProperties()
    Dim targetProperties() As PropertyInfo = GetType(targetType).GetProperties()
    For Each sourceProp As PropertyInfo In sourceProperties
        For Each targetProp As PropertyInfo In targetProperties
            If sourceProp.Name <> targetProp.Name Then Continue For
            ' Only try to set property when able to read the source and write the target
            '
            ' *** Note: We are checking for Entity Types by Checking for the PropertyType to Start with either a Collection or a Member of the Context Namespace!
            '
            If sourceProp.CanRead And _
               targetProp.CanWrite Then
                ' We want to leave System types alone
                If sourceProp.PropertyType.FullName.StartsWith("System.Collections") Or (sourceProp.PropertyType.IsClass And _
                   sourceProp.PropertyType.FullName.StartsWith("System.Collections")) Or sourceProp.PropertyType.FullName.StartsWith("MyContextNameSpace.") Then
                    '
                    ' Do Not Store
                    '
                Else
                    Try
                        targetProp.SetValue(target, sourceProp.GetValue(source, Nothing), Nothing)
                    Catch ex As Exception
                    End Try
                End If
            End If
            Exit For
        Next
    Next
    Return target
End Function
Where I do something like;
dbColour = Classes.clsHelpers.CopyProperties(Of Colour, Colour)(RecordToSave, dbColour)
This reduces the amount of code I need to write for each Repository of course!
The context is not permanently connected to the database. It is essentially an in-memory cache of records you have loaded from disk. It will only request records from the database when you request a record it has not previously loaded, if you force it to refresh or when you're saving your changes back to disk.
Opening a context, grabbing a record, closing the context and then copying modified properties to an object from a brand new context is the epitome of redundant code. You are supposed to leave the original context alone and use that to do SaveChanges().
If you're looking to deal with concurrency issues you should do a google search about "handling concurrency" for your version of entity framework.
As an example I have found this.
Edit in response to comment:
So from what I understand you need a subset of the columns of a record to be overridden with new values while the rest is unaffected? If so, yes, you'll need to manually update these few columns on a "new" object.
I was under the impression that you were talking about a form that reflects all the fields of the customer object and is meant to provide edit access to the entire customer record. In this case there's no point to using a new context and painstakingly copying all properties one by one, because the end result (all data overridden with form values regardless of age) will be the same.

Resources