I'm drawing a blank, or as some would say, having a senior moment. I know there’s a formal definition and a name for the concept where a db operation (stored procedure) that runs in a database will yield the same results if run repeatedly.
It's something in the genre of the Mathematician’s reflexive, symmetric, transitive, etc.
Do you mean "deterministic" - as in will always return the same result if called with the same input?
Or maybe "idempotent", which also means that calling the function again will have no further effect on the database.
IT's called idempotent
I think what you're looking for is Idempotent. Idempotence is a property that can apply to any sort of operation (not just databases). It means that doing the operation any number of times more than once is equivalent to doing it once. I.e. every subsequent operation after the first leaves the state unchanged.
For example, the play button on most DVD remotes is idempotent while playing a video because no matter how many times you push it, it keeps playing. However a power button on your remote is usually not idempotent, because it toggles the machine on and off each time. Idempotence is a nice property because you don't always have to know what state a system is in before engaging an operation to try to produce a given state.
Or perhaps deterministic.
I'm pretty sure you're thinking of the work "Deterministic". A function is deterministic if it returns the same answer for the same inputs all the time. A function is nondeterministic if it can return different answers for the same input.
Related
I am trying to figure out the best practices to deal with poison messages / unhandled exceptions with Apache Flink. We have a Job doing real time event processing of location data from IoT devices. There are two potential scenarios where this can arise:
Data is bad in some way - e.g. invalid value
Data triggers a bug due to some edge case we have not anticipated.
Currently, all my data processing stops because of just one message.
I've seen two suggestions:
Catch the exceptions - this requires me wrapping every piece of logic with something to catch every runtime exception
Use side outputs as a kind of DLQ - from what I can tell this seems to be a variation on #1 where I have to catch all the exceptions and send them to the side output.
Is there really no way to do this other than wrap every piece of logic with exception handling? Is there no generic way to catch exceptions and not have processing continue?
I think the idea is not to catch all kinds of exceptions and send them elsewhere, but rather to have well-tested and functioning code and use dead letters only for invalid inputs.
So a typical pipeline would be
source => validate => ... => sink
\=> dead letter queue
As soon as your record passes your validate operator, you want all errors to bubble up, as any error in these operators may result in corrupted aggregates and data that - once written - cannot be reverted easily.
The validate step would work with any of the two approaches that you outlined. Typically, side-outputs have better semantics, but you may end up with more code.
Now you may have a service with high SLAs and actually want it to produce output even if it is corrupted just to produce data. Or you have simple transformation pipeline, where you'd miss some events but keep the majority (and downstream can deal with incomplete data). Then you are right that you need to wrap the code of all operators with try-catch. However, you'd typically still would only do it for the fragile operators and not for all of them. Trivial operators should be tested and then trusted to work. Further, you'd usually only catch specific kinds of exceptions to limit the scope to the kind of expected exceptions that can happen.
You might wonder why Flink doesn't have it incorporated as a default pattern. There are two reasons as far as I can see:
If Flink silently ignores any kind of exception and sends an extra message to a secondary sink, how can Flink ensure that the throwing operator is in a sane state afterwards? How can it avoid any kind of leaks that may happen because cleanup code is not executed?
It's more common in Java to let the developers explicitly reason about exceptions and exception handling. It's also not straight-forward to see what the requirements are: Do you want to have the input only? Do you also want to store the exception? What about the operator state that may have influenced the outcome? Should Flink still fail when too many errors have been received in a given time window? It quickly becomes a huge feature for something that should not happen at all in an ideal world where high quality data is ingested and properly processed.
So while it looks easy for your case because you exactly know which kinds of information you want to store, it's not easy to have a solution for all purposes, especially since the extra code that a user has to write is tiny compared to the generic solution.
What you could do is to extract most of the complicated logic things into a single ProcessFunction and use side-outputs as you have outlined. Since it's a central piece, you'd only need to write the side-output function once. If it's done multiple times, you could extract a helper function where you pass your actual code as a RunnableWithException lambda which hides all the side-output logic. Make sure you use plenty of finally blocks to ensure a sane state.
I'd also add quite a few IT cases and use mutation testing to harden your pipeline quicker. If you keep your test data inline, the mutants may also exactly simulate your unexpected data issues, such that your validate operator gets more complete.
Let's say we want to compute the sum and average of the items,
and can either working with states or windows(time).
Example working with windows -
https://ci.apache.org/projects/flink/flink-docs-release-0.10/apis/streaming_guide.html#example-program
Example working with states -
https://github.com/dataArtisans/flink-training-exercises/blob/master/src/main/java/com/dataartisans/flinktraining/exercises/datastream_java/ride_speed/RideSpeed.java
Can I ask what would be the reasons to make decision? Can I infer that if the data arrives very irregularly (50% comes in the defined window length and the other 50% don't), the result of the window approach is more biased (because the 50% events are dropped)?
On the other hand, do we spend more time checking and updating the states when working with states?
First, it depends on your semantics... The two examples use different semantics and are thus not comparable directly. Furthermore, windows work with state internally, too. It is hard to say in general with approach is the better one.
As Flink's window semantics are very rich, I would suggest to use windows. If you cannot express your semantics with windows, using state can be a good alternative. Using windows, has the additional advantage that state handling---which is hard to get done right---is done automatically for you.
The decision is definitely independent from your data arrival rate. Flink does not drop any data. If you work with event time (rather than with processing time) your result will be the same independently of the data arrival rate after all.
I was reading about the .settings file on msdn and I noticed they give 2 examples of how to set the value of a item in the settings. Now my question is what is the real diffrence between the 2 and when would you use one instead of the other, since to me they seem pretty mutch the same.
To Write and Persist User Settings at Run Time
Access the user setting and assign it a new value, as shown in the following example:
Properties.Settings.Default.myColor = Color.AliceBlue;
If you want to persist changes to user settings between application sessions, call the Save method, as shown in the following code:
Properties.Settings.Default.Save();
The first statement updates the value of the setting in memory. The second statement updates the persisted value in the user.config file on the disk. That second statement is required to get the value back when you restart the program.
It is very, very important to realize that these two statements must be separate and never be written close together in your code. Keeping them close is harakiri-code. Settings tend to implement unsubtle features in your code, making it operate differently. Which isn't always perfectly tested. What you strongly want to avoid is persisting a setting value that subsequently crashes your program.
That's the harakiri angle, if you saved that value then it is highly likely that the program will immediately crash again when the user restarts it. Or in other words, your program will never run correctly again.
The Save() call must be made when you have a reasonable guarantee that nothing bad happened when the new setting value was used. It belongs at the end of your Main() method. Only reached when the program terminated normally.
I've got a situation which I want to fetch data from a database, and assign it to the tooltips of each row in a ListView control in WPF. (I'm using C# 4.0.) Since I've not done this sort of thing before, I've started a smaller, simpler app to get the ideas down before I attempt to use them in my main WPF app.
One of my concerns is the amount of data that could potentially come down. For that reason I thought I would use LINQ to SQL, which uses deferred execution. I thought that would help and not pull down the data until the user passes their mouse over the relevant row. To do this, I'm going to use a separate function to assign the values to the tooltip, from the database, passed upon the parameters I need to pass to the relevant stored procedures. I'm doing 2 queries using LINQ to SQL, using 2 different stored procedures, and assigning the results to 2 different DataGrids.
Even though I know that LINQ to SQL does use deferred execution, I'm beginning to wonder if some of the code I'm writing may defeat my whole intent of using LINQ to SQL. For example, in testing in my simpler app, I am choosing several different values to see how it works. One selection of values brought no data back, as there was no data for the given parameters. I thought this could potentially cause the user confusion, so I thought I would check the Count property of the list that I assign from running the DBML associated method (related to the stored procedure). Thinking about it, I would think it would be necessary for LINQ to run the query, in order to give me a result for the Count property. Am I not correct?
If I eliminate the call to the list's Count property, I'm still wondering if I might have a problem; if LINQ may still be invoked, because I'm associating the tooltip to the control via a function call?
You are correct, when you call the Count property it iterates over the result set. Not clear on your last question, but the LINQ probably gets called at the point where you populate your DataGrids, way after the tooltip comes into play.
EDIT: however, this does not mean there is anything wrong with deffered execution or your use of it, it executes at the latest possible stage, right when you need the data. If you still want to check the Count ahead of actually fetching all the data, you could have a simple LINQ to SQL function that checks for Any() rows. (Actually Any() is probably what you want more than Count > 0)
You should use Any(), not Count(), but even Any() will cause the query to be executed - after all, it can't determine whether or not there are any rows in the result set without executing the query. But there's executing the query, and there's fetching the result set. Any() will fetch one row, Count() will fetch them all.
That said, I think that having a non-instantaneous operation that occurs on mouseover is just a bad idea. There was a build of Outlook, once, that displayed a helpful tooltip when you moused over the Print button. Less helpfully, it got the data for that tooltip by calling the system function that finds out what printers are available. So you'd be reaching for a menu, and the button would grab the mouse pointer and the UI would freeze for two seconds while it went out and figured out how to display a tooltip that you weren't even asking for. I still hate this program today. Don't be this guy.
A better approach would be to get your tooltip data asynchronously after populating the visible data on the screen. It's easy enough to create a BackgroundWorker that fetches the data into a DataTable, and then make the DataTable available to the view models in the RunWorkerCompleted event handler. (Do it there so that you don't do any updates to UI-bound data on the UI thread.) You can implement a ToolTip property in your view model that returns a default value (probably null, but maybe something like "Fetching data...") if the DataTable containing tool tip data is null, and that calculates the value if it's not. That should work admirably. You can even implement property-change notification so that the ToolTip will still get updated if the user keeps the mouse pointer over it while you're fetching the data.
Alex is correct that calling Count() or Any() will enumerate the LINQ expression causing the query to execute. I would recommend re-thinking your design as you probably don't want a query to the database executed every time the user moves his/her mouse. There is also the issue of the delay to query the database. What might be instantaneous on your dev box with a local database might have a multi-second delay on a heavily loaded server. I would recommend creating a DisplayTooltip() function that takes a lazily evaluated LINQ expression. You can then cache the results or apply other heuristics to decide whether you should actually be querying the database or not.
Weird question, but I'm not sure if it's anti-pattern or not.
Say I have a web app that will be rendering 1000 records to an html table.
The typical approach I've seen is to send a query down to the database, translate the records in some way to some abstract state (be it an array, or a object, etc) and place the translated records into a collection that is then iterated over in the view.
As the number of records grows, this approach uses up more and more memory.
Why not send along with the query a callback that performs an operation on each of the translated rows as they are read from the database? This would mean that you don't need to collect the data for further iteration in the view so the memory footprint shrinks, and you're not iterating over the data twice.
There must be something implicitly wrong with this approach, because I rarely see it used anywhere. What's wrong with this approach?
Thanks.
Actually, this is exactly how a well-developed application should behave.
There is nothing wrong with this approach, except that not all database interfaces allow you to do this easily.
If we talk about tabularizing 10 records for a yet another social network, there is no need to mess with callbacks if you can get an array of hashes or whatever with a single call that is already implemented for you.
There must be something implicitly wrong with this approach, because I rarely see it used anywhere.
I use it. Frequently. Even when i wouldn't use too much memory repeatedly copying the data, using a callback just seems cleaner. In languages with closures, it also lets you keep relevant code together while factoring out the messy DB stuff.
This is a "limited by your tools" class of problem: Most programming languages don't allow to say "Do something around this code". This was solved in recent years with the advent of closures. Think of a closure as a way to pass code into another method which is then executed in a context. For example, in GSQL, you can write:
def l = []
sql.execute ("select id from table where time > ?", time) { row ->
l << row[0]
}
This will open a connection to the database, create a statement and a result set and then run the l << it[0] for each row the DB returns. Note that the code runs inside of sql.execute() but it can access local variables (l) and variables defined in sql.execute() (row).
With this kind of code, you can even generate the result of a HTTP request on the fly without keeping much of the page in RAM at any time. In my case, I'd stream a 2MB document to the browser using only a few KB of RAM and the browser would then chew 83s to parse this.
This is roughly what the iterator pattern allows you to do. In many cases this breaks down on the interface between your application and the database. Technologies like LINQ even have solutions that can send back code to the database.
I've found it easier to use an interface resolver than deep callback where its hooked up through several classes. MS has a much fancier version than mine called Unity. This provides a much cleaner way of accessing classes that should not be tightly coupled
http://www.codeplex.com/unity