Flink reference data advice/best practice - apache-flink

Looking for some advice on where to store/access Flink reference data. The use case here is really simple: I have a single-column text file with a list of countries. I am streaming Twitter data and then matching the countries from the text file against the (parsed) Location field of each tweet.

In the IDE (Eclipse) it's all good, as I have a static ArrayList populated when the routine fires up, via a static Build method in my Flink mapper (i.e. it implements Flink's MapFunction). This class is now a static inner class, as it gets shirty about serialization otherwise. The point is, when the overridden map function is invoked at runtime from within the stream, the static array of country data is there waiting, fully populated and ready to be matched against. Works a charm. BUT, when deployed into a Flink cluster (and it took me to hell and back last week just to get the code to FIND the text file), the array is only populated as part of the Build method. By the time it comes to be used, the data has mysteriously disappeared and I am left with an array of size 0 (ergo, not a lot of matches get found).

Thus, two questions: why does it work in Eclipse and not on deploy (which renders a lot of Eclipse unit tests pointless as well)? And perhaps more generally, what is the right way to cross-reference this kind of static, fixed reference data within Flink, in a way that works both in Eclipse and on the cluster?

The standard way to handle static reference data is to load it in the open method of a RichMapFunction or RichFlatMapFunction. Rich functions have open and close methods that are useful for creating and finalizing local state, and they can access the runtime context. open is called once on each parallel task instance before any records are processed, so the data gets loaded in the worker JVMs where it is actually needed. That also explains the Eclipse-vs-cluster difference: in the IDE everything runs in a single JVM, so a statically populated field is visible to the mapper, but on a cluster the mapper is serialized and shipped to task-manager JVMs where that static initialization never ran.
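For the country-list use case, a minimal sketch might look like the following (the class name and file path are placeholders; the file must be readable from every node, e.g. shipped with the job or on a shared mount):

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class CountryMatcher extends RichMapFunction<String, String> {

    private transient Set<String> countries;

    @Override
    public void open(Configuration parameters) throws Exception {
        // Runs once per parallel task instance, in the worker JVM,
        // before any tweets are processed.
        countries = new HashSet<>(Files.readAllLines(Paths.get("/path/to/countries.txt")));
    }

    @Override
    public String map(String parsedLocation) {
        return countries.contains(parsedLocation) ? parsedLocation : "UNKNOWN";
    }
}

This way each parallel instance loads its own copy when the job starts, so the data is present wherever map actually runs.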


Kafka-Flink-Stream processing: Is there a way to reload input files into the variables being used in a streaming process?

We are planning to use Flink to process a stream of data from a Kafka topic (logs in JSON format).
But for that processing, we need to use input files which change every day, and the information within them can change completely (not the format, but the contents).
Each time one of those input files changes, we will have to reload it into the program while keeping the stream processing going.
Re-loading the data could be done the same way as it is done now:
DataSet<String> globalData = env.readTextFile("file:///path/to/file");
But so far I couldn't find examples or come up with a way to trigger that reload in a stream processing job.
As extra information: we won't be using HDFS but the local filesystem on each node, so the reload will have to be done on each node from the local file.
This is because the only reason we would need HDFS would be for these input files, which are just 100 MB in total; using HDFS would be overkill.
So far I have been experimenting with RichMapFunction, trying to find a Kafka topic that would provide this functionality (reload files), and trying to find examples of this, with no luck.
Edit:
After reading a lot more, I found in several places that this is the way to go: the DataArtisans examples.
Trying to write some simple code that applies a change to a stream based on a control stream, I ended up with the following:
public class RichCoFlatMapExample extends EventTimeJoinHelper {

    private String config_source_path = "NOT_INITIALIZED";

    @Override
    public void open(Configuration conf) {
        config_source_path = "first_file_path";
    }

    public abstract void processElement1(String one, String two, Collector<String> out) {
        config_source_path = one;
    }

    public abstract void processElement2(String one, String two, Collector<String> out) {
        String three = two + config_source_path;
        out.collect(three);
    }
}
The problem I'm having now is that no matter what I try, I get the following error:
Class 'RichCoFlatMapExample' must either be declared abstract or implement abstract method 'processElement1(String, String, Collector)' in 'RichCoFlatMapExample'
The problem is, the requested method is implemented, but I can't make the methods "abstract" in a non-abstract class (I get an error from the IDE).
If I make the class RichCoFlatMapExample abstract, I won't be able to instantiate it from Flink methods (DataStream methods).
I'm not sure what is happening, but I think this must be close. I will keep trying and will update this if I make it work.
Flink can monitor a directory and ingest files when they are moved into that directory; maybe that's what you are looking for. See the PROCESS_CONTINUOUSLY option for readFile in the documentation.
However, if the data is in Kafka, it would be much more natural to use Flink's Kafka consumer to stream the data directly into Flink. There is also documentation about using the Kafka connector, and the Flink training includes an exercise on using Kafka with Flink.
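A minimal sketch of that monitoring approach (the directory path and rescan interval are placeholders):

import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;

public class ReloadableInputJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        String dir = "file:///path/to/input-files";
        // Re-scan the directory every 60 seconds; files moved into it
        // (or modified) are re-ingested as new stream elements.
        DataStream<String> reference = env.readFile(
                new TextInputFormat(new Path(dir)),
                dir,
                FileProcessingMode.PROCESS_CONTINUOUSLY,
                60_000L);

        reference.print();
        env.execute("Continuously reload input files");
    }
}

The resulting stream can then be connected to your main stream (for example via broadcast() and connect() with a CoFlatMapFunction) to update the in-memory reference data, which is the control-stream pattern you were already experimenting with.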

Short lived DbContext in WPF application reasonable?

In his book on DbContext, Rowan Miller shows how to use the DbSet.Local property to avoid 1) unnecessary roundtrips to the database and 2) passing around collections (created with e.g. ToList()) in the application (page 24). I then tried to follow this approach. However, I noticed that from one using {} block to the next, the DbSet.Local property becomes empty:
ObservableCollection<Destination> destinationsList;
using (var context = new BAContext())
{
    var query = from d in context.Destinations …;
    query.Load();
    destinationsList = context.Destinations.Local; // Nonzero here.
}
// Do stuff with destinationsList
using (var context = new BAContext())
{
    // context.Destinations.Local is zero here again;
    // so there is no way of getting the in-memory data from the previous using block here?
    // Do I have to do another roundtrip to the database here to get the same data I wanted
    // to cache locally???
}
Then, what is the point of page 24? How can I avoid passing around my collections if DbSet.Local is only usable inside the using block? Furthermore, how can I benefit from change tracking if these short-lived context instances don't hand over any cached data to each other under the hood? So, if contexts should be short-lived to free resources such as connections, do I have to give up caching? I.e., can't I have both at once (short-lived connections but a long-lived cache)? Then my only option would be to store the results returned by the query in my own variables, which is exactly what is discouraged in the motivation on page 24.
I am developing a WPF application which may also become multi-tiered in the future, involving WCF. I know Julia has an example of this in her book, but I currently don't have access to it. I found several others on the web, e.g. http://msdn.microsoft.com/en-us/magazine/cc700340.aspx (the old ObjectContext, but good at explaining the inter-layer collaboration). There, a long-lived context is used, and although its disadvantages are mentioned, no solution to them is provided.
It's not only the single Destinations.Local that gets lost; as you surely know, all other entities fetched by the query are, too.
[Edit]:
After some more reading in Julia Lerman's book, it seems to boil down to this: EF does not have second-level caching by default; with some (considerable, I think) effort, however, one can add third-party caching solutions, as described in the book and in various articles on MSDN, CodeProject, etc.
I would have appreciated it if the section about DbSet.Local in the DbContext book had mentioned that it is in fact a first-level cache which is destroyed when the using {} block ends (just my proposal to make it more transparent to readers). After a first reading I had the impression that DbSet.Local would always return the same reference (singleton-style), even in the second using {} block, despite the new DbContext instance.
But I am still unsure whether a second-level cache is the way to go for my WPF application (Julia mentions the second-level cache in her article in the context of distributed applications). Or is the way to go to load the aggregate-root instances (DDD, Eric Evans) of my domain model into memory with one or a few queries inside a using {} block, dispose the DbContext, and hold on only to the references to the aggregate instances, thus avoiding a long-lived context? It would be great if you could help me with this decision.
http://msdn.microsoft.com/en-us/magazine/hh394143.aspx
http://www.codeproject.com/Articles/435142/Entity-Framework-Second-Level-Caching-with-DbConte
http://blog.3d-logic.com/2012/03/31/using-tracing-and-caching-provider-wrappers-with-codefirst/
The Local property provides a "local view of all Added, Unchanged, and Modified entities in this set". Like all change tracking, it is specific to the context you are currently using.
The DbContext is a workspace for loading data and preparing changes.
If two users were to add changes at the same time, they must not see each other's changes before those are saved; either one may still discard their prepared changes, which would otherwise suddenly lead to problems for the other user as well.
A DbContext should indeed be short-lived, but it may live longer than "super short" when necessary. Also consider that you may not save resources by keeping it short-lived if you do not load and discard data but only add changes you are going to save. It is not only about resources, though, but also about the database state potentially changing while the DbContext is still active and has data loaded, which is important to keep in mind for longer-living contexts.
If you do not yet know all the related changes you want to save into the database at once, then I suggest you do not use the DbContext to store your changes in memory, but rather a data structure in your own code.
You can of course use entity objects for this without an active DbContext. This makes sense if you do not have another appropriate data class for it and do not want to create one, or if you decide that preparing the changes in the entities makes more sense. You can then use DbSet.Attach to attach the entities to a DbContext and save the changes when you are ready.
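A rough sketch of that last approach (BAContext and Destination come from your code; EF6 namespaces are assumed):

using System.Collections.Generic;
using System.Data.Entity; // EF6; earlier versions expose EntityState in System.Data
using System.Linq;

public class DestinationRepository
{
    // Load once in a short-lived context; the materialized entities
    // outlive the context and can back the WPF bindings.
    public List<Destination> LoadAll()
    {
        using (var context = new BAContext())
        {
            return context.Destinations.ToList();
        }
    }

    // Attach the detached, edited entities to a fresh context to save them.
    public void SaveEdits(IEnumerable<Destination> edited)
    {
        using (var context = new BAContext())
        {
            foreach (var d in edited)
            {
                context.Destinations.Attach(d);
                // Attach marks entities Unchanged, so flag them as edited.
                context.Entry(d).State = EntityState.Modified;
            }
            context.SaveChanges();
        }
    }
}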

How to get all the AutomationIDs of a WPF application in a file?

In automating a WPF application (using UI Automation; VSTS 2010), we were manually adding all the automation IDs to a resource file and then accessing them one by one. Considering that the application can expand at any time, manually adding these IDs can become tedious.
So, is there any tool available that can create this for us? I.e., one that gets all the IDs in a hierarchical format and stores them in a file (XML or CSV), which we could then parse whenever required.
I was hoping for a tool like UISpy, which not only can spy on all the elements but also export them.
Do such tools exist? Or is there any alternate approach?
Any valuable feedback is highly appreciated.
Thanks!
I do it like this:
using System.Runtime.CompilerServices; // needed for [CallerMemberName]

public static class AutomationIds
{
    public static readonly string MyDataGridId = Create();

    private static string Create([CallerMemberName] string name = null)
    {
        return name; // the field name itself becomes the automation ID
    }
}
<DataGrid AutomationProperties.AutomationId="{x:Static local:AutomationIds.MyDataGridId}"
          ... />
Then in tests:
var dataGrid = window.Get<ListView>(AutomationIds.MyDataGridId);
Assign the automation IDs directly in XAML, then parse the XAML files, since they are XML after all...
Let's see...
First, I think that your data is not hierarchical, if only because a control can be dynamically assigned as a child of another.
If we reduce the problem to a subset, "how can we get a hierarchical view of the controls at a time t?", then we can answer it with MS UIA, say with a simple RawViewWalker (a plain breadth-first search on the walker, starting from your main window, will do; the application must of course be running so that UIA can reach and query it).
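For illustration, such a walk might look like this (a hypothetical sketch; start it from AutomationElement.RootElement or from your main window's element):

using System;
using System.Collections.Generic;
using System.Windows.Automation;

public static class RawViewDumper
{
    // Breadth-first walk over the raw UIA tree, printing each AutomationId.
    public static void Dump(AutomationElement root)
    {
        var queue = new Queue<AutomationElement>();
        queue.Enqueue(root);
        while (queue.Count > 0)
        {
            AutomationElement current = queue.Dequeue();
            Console.WriteLine(current.Current.AutomationId);

            // The raw view includes every element, even those filtered
            // out of the control and content views.
            AutomationElement child = TreeWalker.RawViewWalker.GetFirstChild(current);
            while (child != null)
            {
                queue.Enqueue(child);
                child = TreeWalker.RawViewWalker.GetNextSibling(child);
            }
        }
    }
}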
But this subset will not satisfy your initial question, because you'll probably only see a portion of your whole UI collection (since some controls will be hidden / not yet activated at time t).
That makes it very hard to use a UIA-based tool (such as UISpy), because you would have to drive the application into different states to reach all the controls at different times t1, t2...
I would therefore suggest parsing all your XAML files at once and building a complete tree of the application's "static" control map, which I believe is closest to what you're asking for.
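A sketch of that idea (hypothetical; it only finds literal AutomationId attributes, so values set through bindings or x:Static references, like the AutomationIds class above, would need extra handling):

using System;
using System.Linq;
using System.Xml.Linq;

public static class XamlAutomationIdScanner
{
    // XAML is XML, so every AutomationProperties.AutomationId attribute
    // can be harvested with a plain XML parse.
    public static void PrintIds(string xamlPath)
    {
        XDocument doc = XDocument.Load(xamlPath);
        var ids = doc.Descendants()
            .SelectMany(e => e.Attributes())
            .Where(a => a.Name.LocalName == "AutomationProperties.AutomationId")
            .Select(a => a.Value);

        foreach (string id in ids)
            Console.WriteLine(id);
    }
}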
Given that this is an old question, I doubt it matters anymore, but I just wanted to make these distinctions clear.

NSTreeController how to save to file

Hi, I am using an NSTreeController to control an NSOutlineView. The application loads bookmarks from a file, as in the SourceView example from ADC:
http://developer.apple.com/mac/library/samplecode/SourceView/index.html
My question is: how do I save the bookmarks back to the file once the user makes changes? Should I maintain the array/tree internally in my application and save it before quitting, or is there an easier method?
You want to reverse the action taking place in the populateOutline method of MyWindowController.m. That method reads the plist into a dictionary, reads a value from that dictionary, and uses it to build the tree. Start with that method and follow the code to see how it builds the tree. It uses the BaseNode and ChildNode classes to build up the data model as a tree (I'm not sure why they didn't just use NSTreeNode). You want to reverse that procedure, ending up with an NSDictionary, and then use writeToFile:atomically: to save the dictionary back to disk.
This can get as complex as you'd like to make it. For instance, the current code loads the dictionary file in a separate thread, so you could save in a separate thread, too. Or you might want to save after every edit, again in a separate thread.
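A rough sketch of that reverse direction (names such as rootNode, nodeTitle and children follow the SourceView sample and are assumptions):

// Recursively rebuild a plist-friendly dictionary from the tree.
- (NSDictionary *)plistFromNode:(BaseNode *)node
{
    NSMutableArray *children = [NSMutableArray array];
    for (BaseNode *child in node.children) {
        [children addObject:[self plistFromNode:child]];
    }
    return @{ @"name" : (node.nodeTitle ?: @""),
              @"children" : children };
}

// Then, e.g. after each edit or before quitting:
- (void)saveBookmarksToPath:(NSString *)path
{
    NSDictionary *plist = [self plistFromNode:self.rootNode];
    [plist writeToFile:path atomically:YES];
}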

SSRS Code Shared Variables and Simultaneous Report Execution

We have some SSRS reports that are failing when two of them are executed very close together.
I've found out that if two instances of an SSRS report run at the same time, any Code variables declared at the class level (not inside a function) can collide. I suspect this may be the cause of our report failures, and I'm working up a potential fix.
The reason we're using the Code portion of SSRS at all is for things like custom group and page-header calculation. The code is called from expressions in text boxes and returns what the current label should be. The code needs to maintain state: it must remember the last header value in order to return it when the current value is unknown, and store the new header value for reuse.
Note: here are my resources for the variable collision problem:
The MSDN SSRS Forum:
Because this uses static variables, if two people run the report at the exact same moment, there's a slim chance one will smash the other's variable state (in SQL 2000, this could occasionally happen due to two users paginating through the same report at the same time, not just due to exactly simultaneous executions). If you need to be 100% certain to avoid this, you can make each of the shared variables a hash table based on user ID (Globals!UserID).
Embedded Code in Reporting Services:
... if multiple users are executing the report with this code at the same time, both reports will be changing the same Count field (that is why it is a shared field). You don't want to debug these sorts of interactions – stick to shared functions using only local variables (variables passed ByVal or declared in the function body).
I guess the idea is that on the report generation server, the report is loaded and the Code module is a static class. If a second client asks for the same report quickly enough, it connects to the same instance of that static class. (You're welcome to correct my description if I'm getting this wrong.)
So I was proceeding with the idea of using a hash table to keep things isolated. I was planning on the hash key being an internal report parameter called InstanceID with a default of =Guid.NewGuid().ToString().
Partway through my research into this, though, I found that it is even more complicated, because Hashtables aren't thread-safe, according to Maintaining State in Reporting Services.
That writer has code similar to what I was developing, only the whole thread-safety business is completely outside my experience. It's going to take me hours to research all of this and put together sensible code that I can be confident in and that performs well.
So before I go much farther, I'm wondering if anyone else has already been down this path and could give me some advice. Here's the code I have so far:
Private Shared Data As New System.Collections.Hashtable()

Public Shared Function Initialize() As String
    If Not Data.ContainsKey(Parameters!InstanceID.Value) Then
        Data.Add(Parameters!InstanceID.Value, New System.Collections.Hashtable())
    End If
    LetValue("SomethingCount", 0)
    Return ""
End Function

Private Shared Function GetValue(ByVal Name As String) As Object
    Return Data.Item(Parameters!InstanceID.Value).Item(Name)
End Function

Private Shared Sub LetValue(ByVal Name As String, ByVal Value As Object)
    Dim V As System.Collections.Hashtable = Data.Item(Parameters!InstanceID.Value)
    If Not V.ContainsKey(Name) Then
        V.Add(Name, Value)
    Else
        V.Item(Name) = Value
    End If
End Sub

Public Shared Function SomethingCount() As Long
    SomethingCount = GetValue("SomethingCount") + 1
    LetValue("SomethingCount", SomethingCount)
End Function
My biggest concern here is thread safety. I might be able to figure out the rest of the questions below, but I am not experienced with this, and I know it is an area where it is EASY to go wrong. The link above uses the method Dim _sht As System.Collections.Hashtable = System.Collections.Hashtable.Synchronized(_hashtable). Is that best? What about a Mutex? A Semaphore? I have no experience with these.
I think the namespace System.Collections for Hashtable is correct, but I'm having trouble adding System.Collections as a reference in my report to cure my current error of "Could not load file or assembly 'System.Collections'". When I browse to add the reference, it's not an available component to select.
I just confirmed that I can call code from a parameter's default-value expression, so I'll put my Initialize code there. I also just found out about the OnInit procedure, but this has its own gotchas to research and work around: the Parameters collection may not be referenced from the OnInit method during parameter initialization.
I'm unsure about declaring the Data variable with New; perhaps it should only be instantiated in the initializer if not already done (but I worry about race conditions, because of the delay between the check that it's empty and its instantiation).
I also have a question about the Shared keyword. Is it necessary in all cases? I get errors if I leave it off function declarations, but it appears to work when I leave it off the variable declaration. Testing multiple simultaneous report executions is difficult... Could someone explain what Shared means specifically in the context of SSRS Code?
Is there a better way to initialize variables? Should I give the GetValue function a second parameter holding the default value to use if the variable doesn't exist in the hashtable yet?
Is it better to have nested Hashtables, as in my implementation, or to concatenate my InstanceID with the variable name and use a flat hashtable?
I'd really appreciate guidance, ideas and/or critiques on any aspect of what I've presented here.
Thank you!
Erik
Your code looks fine. For thread safety, only the root (shared) hashtable Data needs to be synchronized. If you want to avoid using your InstanceID, you could use Globals.ExecutionTime and User.UserID concatenated.
Basically, I think you just want to change the initialization to something like this:
Private Shared Data As System.Collections.Hashtable

If Data Is Nothing Then
    Data = System.Collections.Hashtable.Synchronized(New System.Collections.Hashtable())
End If
The contained hashtables should only be used by one thread at a time anyway, but if in doubt, you could synchronize them too.
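To also close the race you were worried about between the Nothing check and the instantiation, a sketch using SyncLock might look like this (the instance ID is passed in as an argument, e.g. from an expression such as =Code.Initialize(Parameters!InstanceID.Value), so the shared code does not depend on the Parameters collection):

Private Shared ReadOnly InitLock As New Object()
Private Shared Data As System.Collections.Hashtable

Public Shared Function Initialize(ByVal instanceId As String) As String
    SyncLock InitLock
        ' Lazily create the synchronized root table exactly once.
        If Data Is Nothing Then
            Data = System.Collections.Hashtable.Synchronized(New System.Collections.Hashtable())
        End If
        ' One synchronized inner table per report execution.
        If Not Data.ContainsKey(instanceId) Then
            Data.Add(instanceId, System.Collections.Hashtable.Synchronized(New System.Collections.Hashtable()))
        End If
    End SyncLock
    Return ""
End Function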
