On efficient checkpoints with dynamic (self-evolving) keyed state - apache-flink

In a KeyedCoProcessFunction, I am managing keyed state that consists of third-party library models. These models are created on reception of new data on the control stream within processElement1. Because the models are self-evolving, in the sense that they have their own internal state, I need to make sure they are serialized into modelsBytes whenever their state changes. My first attempt goes like this:
class MyOperator
    extends KeyedCoProcessFunction[String, Control, Data, Prediction]
    with CheckpointedFunction {

  // To hold loaded models
  @transient private var models: HashMap[String, Model] = _
  // For serialization purposes
  @transient private var modelsBytes: MapState[String, Array[Byte]] = _
  // Set after a restore, while the models still need to be rebuilt
  @transient private var restorePending: Boolean = false

  override def processElement1(control, ctx, ...) {
    if (restorePending) {
      restoreModels()
    }
    // - Create new model out of `control` element
    // - Add it to `models` keyed state
  }

  override def processElement2(data, ctx, ...) {
    if (restorePending) {
      restoreModels()
    }
    // - Send `data` element to the corresponding models
    //   This will update their internal states
  }

  override def snapshotState(context: FunctionSnapshotContext): Unit = {
    // Suspicious, wishful-thinking code that compiles and runs just fine
    for ((k, model) <- models) {
      modelsBytes.put(k, model.toBytes)
    }
  }

  override def initializeState(context: FunctionInitializationContext): Unit = {
    modelsBytes = context.getKeyedStateStore.getMapState(
      new MapStateDescriptor("modelsBytes", classOf[String], classOf[Array[Byte]])
    )
    if (context.isRestored) restorePending = true
  }
}
So, the idea is to use snapshotState to overwrite the keyed state entries in modelsBytes. The reason I am trying this approach is that serializing the models (model.toBytes) might be an expensive operation, so I would prefer to do it once per model when a checkpoint comes.

The problem with this approach is that it might be inherently/conceptually wrong. Here is why: even though the code within snapshotState compiles and runs just fine, I am referring to a piece of keyed state without a keyed context being passed in, so it is not clear at all what key I am actually working on. I have written a small test to verify the checkpoints, and I have observed that from time to time I get an empty state back, even though the modelsBytes entries were updated in snapshotState. So snapshotting my models like this does not seem reliable at all. What confuses me is that the user is perfectly allowed to do this; maybe the put method should raise an exception to make it clear that a keyed context is required in the first place, otherwise it gives false hope and might lead to hard-to-spot bugs. As a matter of fact, shouldn't this be considered a bug?

The other option I have is, of course, to serialize my models in processElement2, after sending new data elements to them. However, continuously serializing my models to update modelsBytes might be costly.

What would be the most efficient way to handle this scenario?

Related

State handling on KeyedCoProcessFunction serving ML models

I am working on a KeyedCoProcessFunction that looks like this:
class MyOperator extends KeyedCoProcessFunction[String, ModelDef, Data, Prediction]
    with CheckpointedFunction {

  // To hold loaded models
  @transient private var models: HashMap[(String, String), Model] = _
  // For serialization purposes
  @transient private var modelsBytes: MapState[(String, String), Array[Byte]] = _

  ...

  override def snapshotState(context: FunctionSnapshotContext): Unit = {
    modelsBytes.clear() // This raises an exception when there is no active key set
    for ((k, model) <- models) {
      modelsBytes.put(k, model.toBytes)
    }
  }

  override def initializeState(context: FunctionInitializationContext): Unit = {
    modelsBytes = context.getKeyedStateStore.getMapState(
      new MapStateDescriptor("modelsBytes",
        classOf[(String, String)], classOf[Array[Byte]])
    )
    if (context.isRestored) {
      // restore models from modelsBytes
    }
  }
}
The state consists of a collection of ML models built using a third-party library. Before checkpoints, I need to dump the loaded models into byte arrays in snapshotState.
My question is: within snapshotState, modelsBytes.clear() raises an exception when there is no active key. This happens when I start the application from scratch without any data on the input streams. So, when the time for a checkpoint comes, I get this error:
java.lang.NullPointerException: No key set. This method should not be called outside of a keyed context.
However, when the input stream contains data, checkpoints work just fine. I am a bit confused about this, because snapshotState does not provide a keyed context (contrary to processElement1 and processElement2, where the current key is accessible via ctx.getCurrentKey), so it seems to me that the calls to clear and put within snapshotState should always fail, since they are supposed to work only within a keyed context. Can anyone clarify whether this is the expected behaviour?
A keyed state can only be used on a keyed stream, as written in the documentation:
 * <p>The state is only accessible by functions applied on a {@code KeyedStream}. The key is
 * automatically supplied by the system, so the function always sees the value mapped to the
 * key of the current element. That way, the system can handle stream and state partitioning
 * consistently together.
If you call clear(), you will not clear the whole map, but just reset the state of the current key. The key is always known in processElementX.
/**
* Removes the value mapped under the current key.
*/
void clear();
You should actually receive a better exception when you try to call clear in a function other than processElementX. In the end, you are using the keyed state incorrectly.
Now for your actual problem. I'm assuming you are using a KeyedCoProcessFunction because the models are updated through a separate input. If they are static, you could just load them in open() from a static source (for example, included in the jar). Furthermore, often there is only one model that is applied to all values across different keys; in that case you could use broadcast state. So I'm assuming you have different models for different types of data, separated by keys.
If they are coming in from input2, then you already serialize them upon invocation of processElement2.
override def processElement2(model: Model, ctx: Context, collector): Unit = {
  models.put(ctx.getCurrentKey, model)
  modelsBytes.put(ctx.getCurrentKey, model.toBytes)
}
Then you would not override snapshotState, as the state is already up-to-date. initializeState would deserialize models eagerly or you could also materialize them lazily in processElement1.
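For instance, the lazy variant could look roughly like this in processElement1 (a sketch with a simplified String key; Model.fromBytes is a hypothetical deserialization helper, not part of the question's code):

override def processElement1(data: Data, ctx: Context, out: Collector[Prediction]): Unit = {
  val key = ctx.getCurrentKey
  // Rebuild the model for this key on first use after a restore
  if (!models.contains(key) && modelsBytes.contains(key)) {
    models.put(key, Model.fromBytes(modelsBytes.get(key)))
  }
  // ... apply the model to `data` as usual
}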

How to save and retrieve a view when it's needed

My goal is to keep the session size as small as possible. (Why? That's another topic.)
What I have is a PhaseListener declared in faces-config.xml:
<lifecycle>
    <phase-listener>mypackage.listener.PhaseListener</phase-listener>
</lifecycle>
I want to save all views except the last one (maximum two) in some memcache. Getting the session map:
Map<String, Object> sessionMap = event.getFacesContext().getExternalContext().getSessionMap();
in the beforePhase(PhaseEvent event) method gives me access to all views. So here I could save all views to the memcache and delete them from the session. The question is: where in JSF are these views, which are still loaded in the browser, requested again, so that I can put a saved view back when it's needed? Is it possible at all? Thank you.
To address the core of your question, implement a ViewHandler, within which you can take control of the RESTORE_VIEW and RENDER_RESPONSE phases/processes. You'll save the view during the RENDER_RESPONSE and selectively restore, during the RESTORE_VIEW phase. Your view handler could look something like the following
public class CustomViewHandlerImpl extends ViewHandlerWrapper {

    @Inject
    ViewStore viewStore; //hypothetical storage for the views. Could be anything, like a ConcurrentHashMap

    ViewHandler wrapped;

    public CustomViewHandlerImpl(ViewHandler toWrap) {
        this.wrapped = toWrap;
    }

    @Override
    public ViewHandler getWrapped() {
        return wrapped;
    }

    @Override
    public UIViewRoot restoreView(FacesContext context, String viewId) {
        // This assumes you've previously saved the view, using the viewId
        UIViewRoot theView = viewStore.get(viewId);
        if (theView == null) {
            theView = getWrapped().restoreView(context, viewId);
        }
        return theView;
    }

    @Override
    public void renderView(FacesContext context, UIViewRoot viewToRender) throws IOException, FacesException {
        viewStore.put(viewToRender.getId(), viewToRender);
        getWrapped().renderView(context, viewToRender);
    }
}
Simply plug in your custom ViewHandler in faces-config.xml (the view-handler entry goes inside the application element):
<application>
    <view-handler>com.you.customs.CustomViewHandlerImpl</view-handler>
</application>
Of course, you probably don't want to give this treatment to all your views; you're free to add conditions to the logic above to implement selective view-saving and restoration.
You should also consider other options. It appears that you're conflating issues here. If your true concern is limiting the overhead associated with view processing, you should consider:
Stateless views, new with JSF 2.2. The stateless view option allows you to exclude specific pages from the JSF view-saving mechanism, simply by specifying transient="true" on the f:view (see the snippet after this list). Much cleaner than mangling the UIViewRoot by hand. The caveat here is that a stateless view cannot be backed by scopes that depend on state saving, i.e. @ViewScoped. In a stateless view, a @ViewScoped bean is recreated for every postback. Ajax functionality also suffers in this scenario, because state saving is the backbone of ajax operations.
Selectively mark components as transient. The transient property is available on all UIComponents, which means that, on a per-view basis, you can mark specific components with transient="true", effectively getting the same benefits as option 1 but at a much smaller scope, and without the downside of losing @ViewScoped.
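For illustration, marking a whole page stateless is just an attribute on f:view in the Facelets file:

<f:view transient="true">
    <!-- page content; no view state is saved for this page -->
</f:view>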
EDIT: For some reason, UIViewRoot#getViewId() is not returning the name of the current view (this might be a bug). Alternatively, you can use
ExternalContext extCtxt = FacesContext.getCurrentInstance().getExternalContext();
String viewName = ((HttpServletRequest)extCtxt.getRequest()).getRequestURI(); //use this id as the key to store your views instead

Calling DAL from ViewModel asynchronously

I am building composite WPF application using MVVM-light. I have Views that have ViewModels injected into them using MEF:
DataContext = App.Container.GetExportedValue<ViewModelBase>(
ViewModelTypes.ContactsPickerViewModel);
In addition, I have ViewModels for each View (Screens and UserControls), where constructor usually looks like this:
private readonly ICouchDataModel _couchModel;

[ImportingConstructor]
public ContactsPickerControlViewModel(ICouchDataModel couchModel)
{
    _couchModel = couchModel;
    _couchModel.GetContactsListCompleted += GetContactsListCompleted;
    _couchModel.GetContactsListAsync("Andy");
}
Currently, I have some performance issues. Everything is just slow.
I have two related questions.
What is the right way of calling DAL methods asynchronously (methods that access my CouchDB)? await/async? Tasks? Currently I have to write a lot of wrappers (OnBegin, OnCompletion) around each operation, and a GetAsyncResult method that does some crazy things with ThreadPool.QueueUserWorkItem, Action, etc. I hope there is a more elegant way of calling these.
Currently, I have some screens in my application, and on each screen there are different custom UserControls, some of which need the same data (or slightly changed data) from the DB. What is the right way to share a data source among them? I am mostly viewing data, not editing it.
Example: on Screen A, I have a contacts dropdown list user control (UC1) and a contact details user control (UC2). In each user control, the ViewModel calls the DAL:
_couchModel.GetContactsListAsync("Andy");
And on completion I assign the result data to a property:
List<ContactInfo> ContactsList = e.Results;
ContactsList is bound to the ItemsSource of a DropDownListBox in UC1. The same story happens in UC2. So I end up with two exactly identical calls to the DB.
Also, if I go to Screen B, where I also have UC1, I'll make yet another call to the DB when I navigate from Screen A to Screen B.
What is the right way to handle these interactions, i.e. getting data and binding it to user controls?
Thank you.
Ad.1
I think you can simply use Task.Factory to invoke code asynchronously (thanks to that you can get rid of the OnBegin/OnCompletion wrappers), or, if you need more flexibility, you can make the methods async.
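For example, an event-based DAL call can be wrapped into a single awaitable Task with a TaskCompletionSource (a sketch against the question's ICouchDataModel; the delegate type GetContactsListCompletedEventHandler and the event-args members e.Error and e.Results are assumptions):

public static class CouchDataModelExtensions
{
    // Turns the GetContactsListAsync/GetContactsListCompleted pair
    // into one awaitable call.
    public static Task<List<ContactInfo>> GetContactsListTaskAsync(
        this ICouchDataModel model, string name)
    {
        var tcs = new TaskCompletionSource<List<ContactInfo>>();
        GetContactsListCompletedEventHandler handler = null;
        handler = (s, e) =>
        {
            model.GetContactsListCompleted -= handler; // unsubscribe once done
            if (e.Error != null) tcs.SetException(e.Error);
            else tcs.SetResult(e.Results);
        };
        model.GetContactsListCompleted += handler;
        model.GetContactsListAsync(name);
        return tcs.Task;
    }
}

The ViewModel can then simply write: ContactsList = await _couchModel.GetContactsListTaskAsync("Andy");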
Ad. 2
The nice way (in my opinion) to do it is to create a DatabaseService (singleton), which would be injected in the constructor. Inside DatabaseService you can implement logic to determine whether you want to refresh a collection (call the DAL) or return the current one (a simple form of caching).
Then you call the DatabaseService instead of the DAL directly, and the DatabaseService decides what to do with the call (fetch the collection from the DB, or return the same or a slightly modified current collection).
Edit:
DatabaseService will simply share a collection of objects between ViewModels.
Maybe the name "DBCacheService" would be more appropriate (you will probably use it only for special tasks such as caching collections).
I don't know your architecture, but basically you can put that service in your client application, so the plan would be:
Create DatabaseService.cs
[Export(typeof(IDatabaseService))]
public class DatabaseService : IDatabaseService
{
    private List<object> _contacts = new List<object>();

    public async Task<List<object>> GetContactsList(string name)
    {
        if (_contacts.Count == 0)
        {
            // Call the DAL to fetch it, e.g.:
            //_contacts = await Task.Factory.StartNew(() => dal.GetContactsList(name));
        }
        else
        {
            // Refresh the list if required (there could be some condition)
        }
        return _contacts;
    }
}
Add IDatabaseService to your ViewModel's constructor.
Call IDatabaseService instead of DAL.
If you choose the async version of DatabaseService, then you'll need to use await and mark the calling methods async. You can also implement it synchronously and call it (whenever you want it to be asynchronous) like this:
Task.Factory.StartNew(() =>
{
    var result = dbService.GetContactsList("Andy");
});
Edit2:
Invoking an awaitable method inside a task. Note that Task.Run is preferable to Task.Factory.StartNew here: with an async lambda, StartNew returns a nested Task<Task> that would need to be unwrapped, while Task.Run unwraps it automatically.
Task.Run(async () =>
{
    ListOfContacts = await _couchModel.GetContactsList("Andy");
});

Entity Framework and WPF best practices

Is it ever a good idea to work directly with the context? For example, say I have a database of customers and a user can search them by name, display a list, choose one, then edit that customer's properties.
It seems I should use the context to get a list of customers (mapped to POCOs or CustomerViewModels) and then immediately close the context. Then, when the user selects one of the CustomerViewModels in the list the customer properties section of the UI populates.
Next they can change the name, type, website address, company size, etc. Upon hitting a save button, I then open a new context, use the ID from the CustomerViewModel to retrieve that customer record, and update each of its properties. Finally, I call SaveChanges() and close the context. This is a LOT OF WORK.
My question is why not just work directly with the context leaving it open throughout? I have read using the same context with a long lifetime scope is very bad and will inevitably cause problems. My assumption is if the application will only be used by ONE person I can leave the context open and do everything. However, if there will be many users, I want to maintain a concise unit of work and thus open and close the context on a per request basis.
Any suggestions? Thanks.
@PGallagher - Thanks for the thorough answer.
@Brice - your input is helpful as well.
However, @Manos D., the 'epitome of redundant code' comment concerns me a bit. Let me go through an example. Let's say I'm storing customers in a database, and one of my customer properties is CommunicationMethod.
[Flags]
public enum CommunicationMethod
{
    None = 0,
    Print = 1,
    Email = 2,
    Fax = 4
}
The UI for my manage-customers page in WPF will contain three check boxes under the customer communication method (Print, Email, Fax). I can't bind each check box to that enum; it doesn't make sense. Also, what if the user clicks that customer, gets up and goes to lunch... the context sits there for hours, which is bad. Instead, this is my thought process.
End user chooses a customer from the list. I new up a context, find that customer and return a CustomerViewModel, then the context is closed (I've left repositories out for simplicity here).
using (MyContext ctx = new MyContext())
{
    CurrentCustomerVM = new CustomerViewModel(ctx.Customers.Find(customerId));
}
Now the user can check/uncheck the Print, Email, Fax buttons as they are bound to three bool properties in the CustomerViewModel, which also has a Save() method. Here goes.
public class CustomerViewModel : ViewModelBase
{
    Customer _customer;

    public CustomerViewModel(Customer customer)
    {
        _customer = customer;
    }

    public bool CommunicateViaEmail
    {
        get { return _customer.CommunicationMethod.HasFlag(CommunicationMethod.Email); }
        set
        {
            if (value == _customer.CommunicationMethod.HasFlag(CommunicationMethod.Email)) return;
            if (value)
                _customer.CommunicationMethod |= CommunicationMethod.Email;
            else
                _customer.CommunicationMethod &= ~CommunicationMethod.Email;
        }
    }

    public bool CommunicateViaFax
    {
        get { return _customer.CommunicationMethod.HasFlag(CommunicationMethod.Fax); }
        set
        {
            if (value == _customer.CommunicationMethod.HasFlag(CommunicationMethod.Fax)) return;
            if (value)
                _customer.CommunicationMethod |= CommunicationMethod.Fax;
            else
                _customer.CommunicationMethod &= ~CommunicationMethod.Fax;
        }
    }
    public bool CommunicateViaPrint
    {
        get { return _customer.CommunicationMethod.HasFlag(CommunicationMethod.Print); }
        set
        {
            if (value == _customer.CommunicationMethod.HasFlag(CommunicationMethod.Print)) return;
            if (value)
                _customer.CommunicationMethod |= CommunicationMethod.Print;
            else
                _customer.CommunicationMethod &= ~CommunicationMethod.Print;
        }
    }
    public void Save()
    {
        using (MyContext ctx = new MyContext())
        {
            var toUpdate = ctx.Customers.Find(_customer.Id);
            toUpdate.CommunicationMethod = _customer.CommunicationMethod;
            ctx.SaveChanges();
        }
    }
}
Do you see anything wrong with this?
It is OK to use a long-running context; you just need to be aware of the implications.
A context represents a unit of work. Whenever you call SaveChanges, all the pending changes to the entities being tracked will be saved to the database. Because of this, you'll need to scope each context to what makes sense. For example, if you have a tab to manage customers and another to manage products, you might use one context for each so that when a users clicks save on the customer tab, all of the changes they made to products are not also saved.
Having a lot of entities tracked by a context could also slow down DetectChanges. One way to mitigate this is by using change tracking proxies.
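For illustration, a proxy-friendly entity simply makes every mapped property virtual so EF can intercept changes instead of scanning snapshots (a sketch; Customer here mirrors the example above):

public class Customer
{
    // All mapped properties must be virtual for EF to create
    // a change-tracking proxy at runtime.
    public virtual int Id { get; set; }
    public virtual string Name { get; set; }
    public virtual CommunicationMethod CommunicationMethod { get; set; }
}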
Since the time between loading an entity and saving that entity could be quite long, the chance of hitting an optimistic concurrency exception is greater than with short-lived contexts. These exceptions occur when an entity is changed externally between loading and saving it. Handling these exceptions is pretty straightforward, but it's still something to be aware of.
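A minimal way to handle such an exception with the DbContext API might look like this (a sketch; the reload-and-retry policy is just one option):

try
{
    ctx.SaveChanges();
}
catch (DbUpdateConcurrencyException ex)
{
    // The entity was modified externally between load and save.
    // One simple policy: refresh from the database and let the user retry.
    foreach (var entry in ex.Entries)
    {
        entry.Reload();
    }
}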
One cool thing you can do with long-lived contexts in WPF is bind to the DbSet.Local property (e.g. context.Customers.Local). This is an ObservableCollection that contains all of the tracked entities that are not marked for deletion.
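For example (a sketch; Load() is the System.Data.Entity extension that pulls the query results into the context):

// Pull all customers into the context, then bind the UI to the live,
// change-tracked collection; adds and removes show up automatically.
ctx.Customers.Load();
customersListBox.ItemsSource = ctx.Customers.Local;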
Hopefully this gives you a bit more information to help you decide which approach fits your situation.
Microsoft Reference:
http://msdn.microsoft.com/en-gb/library/cc853327.aspx
They say:
Limit the scope of the ObjectContext. In most cases, you should create an ObjectContext instance within a using statement (Using…End Using in Visual Basic). This can increase performance by ensuring that the resources associated with the object context are disposed automatically when the code exits the statement block. However, when controls are bound to objects managed by the object context, the ObjectContext instance should be maintained as long as the binding is needed and disposed of manually.
For more information, see Managing Resources in Object Services (Entity Framework). http://msdn.microsoft.com/en-gb/library/bb896325.aspx
Which says:
In a long-running object context, you must ensure that the context is disposed when it is no longer required.
StackOverflow Reference:
This StackOverflow question also has some useful answers...
Entity Framework Best Practices In Business Logic?
A few answers there suggest promoting your context to a higher level and referencing it from there, thus keeping only one single context.
My ten pence worth:
Wrapping the context in a Using statement ensures its resources are cleaned up deterministically and prevents memory leaks.
Obviously in simple apps this isn't much of a problem; however, if you have multiple screens, all using a lot of data, you could end up in trouble unless you are careful to dispose of your context correctly.
Hence I have employed a method similar to the one you mentioned: I've added an AddOrUpdate method to each of my repositories, where I pass in my new or modified entity, and it updates or adds the record depending on whether it already exists.
Updating Entity Properties:
Regarding updating properties however, I've used a simple function which uses reflection to copy all the properties from one Entity to Another;
Public Shared Function CopyProperties(Of sourceType As {Class, New}, targetType As {Class, New})(ByVal source As sourceType, ByVal target As targetType) As targetType
    Dim sourceProperties() As PropertyInfo = source.GetType().GetProperties()
    Dim targetProperties() As PropertyInfo = GetType(targetType).GetProperties()
    For Each sourceProp As PropertyInfo In sourceProperties
        For Each targetProp As PropertyInfo In targetProperties
            If sourceProp.Name <> targetProp.Name Then Continue For
            ' Only try to set the property when we can read the source and write the target.
            ' Entity types are detected by checking whether the property type is a collection
            ' or a member of the context namespace; those are skipped so only scalars are copied.
            If sourceProp.CanRead And targetProp.CanWrite Then
                If sourceProp.PropertyType.FullName.StartsWith("System.Collections") OrElse _
                   sourceProp.PropertyType.FullName.StartsWith("MyContextNameSpace.") Then
                    ' Navigation property or collection - do not copy
                Else
                    Try
                        targetProp.SetValue(target, sourceProp.GetValue(source, Nothing), Nothing)
                    Catch ex As Exception
                        ' Ignore properties that cannot be copied
                    End Try
                End If
            End If
            Exit For
        Next
    Next
    Return target
End Function
Where I do something like:
dbColour = Classes.clsHelpers.CopyProperties(Of Colour, Colour)(RecordToSave, dbColour)
This reduces the amount of code I need to write for each Repository of course!
The context is not permanently connected to the database. It is essentially an in-memory cache of records you have loaded from disk. It will only request records from the database when you request a record it has not previously loaded, when you force it to refresh, or when you're saving your changes back to disk.
Opening a context, grabbing a record, closing the context, and then copying modified properties to an object from a brand-new context is the epitome of redundant code. You are supposed to keep the original context around and use it to call SaveChanges().
If you're looking to deal with concurrency issues, you should search for "handling concurrency" for your version of Entity Framework.
As an example, I have found this.
Edit in response to comment:
So from what I understand, you need a subset of the columns of a record to be overridden with new values while the rest is unaffected? If so, then yes, you'll need to manually update those few columns on a "new" object.
I was under the impression that you were talking about a form that reflects all the fields of the customer object and is meant to provide edit access to the entire customer record. In that case there's no point in using a new context and painstakingly copying all properties one by one, because the end result (all data overridden with the form values regardless of age) will be the same.
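For that partial-update case, a sketch with the DbContext API (the property choice is illustrative):

using (var ctx = new MyContext())
{
    // Attach the detached entity without querying the database.
    ctx.Customers.Attach(_customer);
    // Mark only the columns that should be written; all others are left alone.
    ctx.Entry(_customer).Property(c => c.CommunicationMethod).IsModified = true;
    ctx.SaveChanges();
}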

How to figure out which SQLDependency triggered change function?

I'm exploring query notifications with the SqlDependency class. Building a simple working example is easy, but I feel like I'm missing something. Once I step past a simple one-table/one-dependency example, I'm left wondering: how can I figure out which dependency triggered my callback?
I'm having a bit of trouble explaining, so I included the simple example below. When AChange() is called, I cannot look at the SQL inside the dependency, and I don't have a reference to the associated cache object.
So what's a boy to do?
Option 1 - create a distinct function for each object I want to track and hard-code the cache key (or relevant information) in the callback. This feels dirty and eliminates the possibility of adding new cache items without deploying new code--ewww.
Option 2 - use the Dependency Id property and a parallel tracking structure.
Am I just missing something? Is this a deficiency in the SqlDependency structure? I've looked at 20 different articles on the topic and all of them seem to have the same hole. Suggestions?
Code Sample
public class DependencyCache {
    public static string cacheName = "Client1";
    public static MemoryCache memCache = new MemoryCache(cacheName);

    public DependencyCache() {
        SqlDependency.Start(connString);
    }

    private static string GetSQL() {
        return "select someString FROM dbo.TestTable";
    }

    public void DoTest() {
        if (memCache["TEST_KEY"] != null) {
            Debug.WriteLine("resources found in cache");
            return;
        }
        Cache_GetData();
    }

    private void Cache_GetData() {
        SqlConnection oConn;
        SqlCommand oCmd;
        SqlDependency oDep;
        SqlDataReader oRS;
        List<string> stuff = new List<string>();
        CacheItemPolicy policy = new CacheItemPolicy();

        SqlDependency.Start(connString);
        using (oConn = new SqlConnection(connString)) {
            using (oCmd = new SqlCommand(GetSQL(), oConn)) {
                oDep = new SqlDependency(oCmd);
                oConn.Open();
                oRS = oCmd.ExecuteReader();
                while (oRS.Read()) {
                    stuff.Add(oRS.GetString(0));
                }
                oDep.OnChange += new OnChangeEventHandler(AChange);
            }
        }
        memCache.Set("TEST_KEY", stuff, policy);
    }

    private void AChange(object sender, SqlNotificationEventArgs e) {
        string msg = "Dependency Change \nINFO: {0} : SOURCE {1} : TYPE: {2}";
        Debug.WriteLine(String.Format(msg, e.Info, e.Source, e.Type));
        // If multiple queries use this as a callback, how can I figure
        // out WHAT QUERY TRIGGERED the change?
        // I can't figure out how to tell multiple dependency objects apart.
        ((SqlDependency)sender).OnChange -= AChange;
        Cache_GetData(); // reload data
    }
}
First and foremost: the handler has to be set up before the command is executed:
oDep = new SqlDependency(oCmd);
oConn.Open();
oDep.OnChange += new OnChangeEventHandler(AChange);
oRS = oCmd.ExecuteReader();
while (oRS.Read()) {
    stuff.Add(oRS.GetString(0));
}
Otherwise you have a window when the notification may be lost and your callback never invoked.
Now about your question: you should use a separate callback for each query. While this may seem cumbersome, it is actually trivial to do by using a lambda. Something like the following:
oDep = new SqlDependency(oCmd);
oConn.Open();
oDep.OnChange += (sender, e) =>
{
    string msg = "Dependency Change \nINFO: {0} : SOURCE {1} : TYPE: {2}";
    Debug.WriteLine(String.Format(msg, e.Info, e.Source, e.Type));
    // The command that triggered the notification is captured in the
    // closure: it is oCmd.
    //
    // You can now call a handler passing in the relevant info:
    //
    Reload_Data(oCmd, ...);
};
oRS = oCmd.ExecuteReader();
...
And remember to always check the notification source, info and type. Otherwise you run the risk of spinning ad nauseam when you are notified for reasons other than a data change, such as an invalid query. As a side comment, I would add that a good cache design does not refresh the cache on invalidation; it simply invalidates the cached item and lets the next request fetch a fresh one. With your 'proactive' approach you refresh cached items even when not needed, refresh them multiple times before they are accessed, and so on. I left error handling and proper thread synchronization (both required) out of the example.
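In that invalidate-only spirit, the callback from the sample could be reduced to something like this (a sketch; it reuses the names from the question's code):

private void AChange(object sender, SqlNotificationEventArgs e)
{
    ((SqlDependency)sender).OnChange -= AChange;
    // React only to genuine data changes; ignore error/restart notifications.
    if (e.Type == SqlNotificationType.Change)
    {
        // Drop the cached item; the next DoTest() call will fetch fresh
        // data and register a new dependency.
        memCache.Remove("TEST_KEY");
    }
}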
Finally, have a look at LinqtoCache which does pretty much what you're trying to do, but for LINQ queries.
