How to access Java variables declared outside in a Flink map? - apache-flink

I am creating a List in Java. I want to share the List within the map function in Flink. How do I share variables across Flink processes?
Requirement
I have static data records (fewer than 1000 records). I want to join these records with the data stream.

Flink is designed with scalability in mind, and so it embodies a share-nothing approach. An operator can be stateful, but its state is strictly local and not visible to other operators.
In this particular case, a good solution might be to use a RichMapFunction (or RichFlatMapFunction) to do the join, and to load a copy of all of the static data records into a local transient data structure in its open method.
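A minimal sketch of that pattern, assuming the records can be read in open() (the loadStaticRecords() helper and the (key, payload) tuple shapes are placeholders for your own types and loading logic):

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

import java.util.HashMap;
import java.util.Map;

// Joins a stream of (key, payload) events against <1000 static (key, info) records
// loaded once per parallel task in open().
public class StaticJoinFunction
        extends RichFlatMapFunction<Tuple2<String, String>, Tuple3<String, String, String>> {

    // transient: rebuilt in open() on each task, never shipped with the serialized closure
    private transient Map<String, String> staticData;

    @Override
    public void open(Configuration parameters) throws Exception {
        staticData = new HashMap<>();
        // hypothetical loader: read the static records from a bundled file, a small table, etc.
        for (Tuple2<String, String> record : loadStaticRecords()) {
            staticData.put(record.f0, record.f1);
        }
    }

    @Override
    public void flatMap(Tuple2<String, String> event, Collector<Tuple3<String, String, String>> out) {
        String match = staticData.get(event.f0);
        if (match != null) {
            // emit (key, event payload, static info) only when the event finds a match
            out.collect(Tuple3.of(event.f0, event.f1, match));
        }
    }

    private static Iterable<Tuple2<String, String>> loadStaticRecords() {
        // placeholder: supply your own loading logic here
        return java.util.Collections.emptyList();
    }
}

The function is then applied with stream.flatMap(new StaticJoinFunction()). If the static records can change while the job runs, Flink's broadcast state pattern is the usual alternative, but for truly fixed data the open() approach is enough.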

Related

Local aggregation for data stream in Flink

I'm trying to find a good way to combine a Flink keyed WindowedStream locally for a Flink application. The idea is similar to a combiner in MapReduce: combine partial results in each partition (or mapper) before the data (which is still a keyed WindowedStream) is sent to a global aggregator (or reducer). The closest function I found is aggregate, but I wasn't able to find a good example of its usage on a WindowedStream.
It looks like aggregate doesn't allow a WindowedStream output. Is there any other way to solve this?
There have been some initiatives to provide pre-aggregation in Flink, but you have to implement your own operator. In the case of the streaming environment, you have to extend the class AbstractStreamOperator.
KurtYoung implemented a BundleOperator. You can also use the Table API on top of the stream API; the Table API already provides local aggregation. I also have an example of a pre-aggregate operator that I implemented myself. Usually, the drawback of all those solutions is that you have to set the number of items to pre-aggregate, or a timeout for the pre-aggregation. If you don't, you can run out of memory, or you may never shuffle items (if the threshold number of items is never reached). In other words, they are rule-based. What I would like to have is something cost-based and more dynamic, something that adjusts those parameters at run time.
I hope these pointers can help you. And if you have ideas for the cost-based solution, please come and talk with me =).
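For illustration, here is a rough, count-based sketch of the rule-based approach described above, written as a plain RichFlatMapFunction rather than a custom AbstractStreamOperator or the BundleOperator. The MAX_BUFFERED threshold is exactly the kind of hand-tuned parameter criticized above, and the locally buffered values are not checkpointed, so treat this as a sketch rather than a fault-tolerant implementation:

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

import java.util.HashMap;
import java.util.Map;

// Rule-based local pre-aggregation before the keyed shuffle: partial counts are
// buffered per key and flushed every MAX_BUFFERED input records.
public class PreAggregateCounts extends RichFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {

    private static final int MAX_BUFFERED = 1000;   // the "rule": flush after this many inputs

    private transient Map<String, Long> partialSums;
    private transient int buffered;

    @Override
    public void open(Configuration parameters) {
        partialSums = new HashMap<>();
        buffered = 0;
    }

    @Override
    public void flatMap(Tuple2<String, Long> value, Collector<Tuple2<String, Long>> out) {
        partialSums.merge(value.f0, value.f1, Long::sum);   // combine locally, like a MapReduce combiner
        if (++buffered >= MAX_BUFFERED) {
            for (Map.Entry<String, Long> e : partialSums.entrySet()) {
                out.collect(Tuple2.of(e.getKey(), e.getValue()));
            }
            partialSums.clear();
            buffered = 0;
        }
    }
}

// Downstream, the partial sums can still be keyed and aggregated globally, e.g.:
//   stream.flatMap(new PreAggregateCounts()).keyBy(t -> t.f0).sum(1);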

Data Modeling - modeling an Append-only list in NDB

I'm trying to make a general purpose data structure. Essentially, it will be an append-only list of updates that clients can subscribe to. Clients can also send updates.
I'm curious for suggestions on how to implement this. I could have an ndb.Model, 'Update', that contains the data and an index, or I could use a StructuredProperty with repeated=True on the main entity. I could also just store a list of keys somehow and then keep the actual update data in a not-strongly-linked structure.
I'm not sure how the repeated properties work - does appending to the list of them (via the Python API) have to rewrite them all?
I'm also worried about consistency. Since multiple clients might be sending updates, I don't want them to overwrite each other and lose an update, or somehow end up with two updates with the same index.
The problem is that you have a maximum total size for each model in the datastore.
So any single model that accumulates updates (storing the data directly or via collecting keys) will eventually run out of space (I'm not sure how the limit applies with regard to structured properties, however).
Why not have a model "update", as you say, and a simple version would be to have each provided update create and save a new model. If you track the save date as a field in the model you can sort them by time when you query for them (presumably there is an upper limit anyway at some level).
That way you also don't have to worry about simultaneous client updates overwriting each other; the datastore will worry about that for you. And you don't need to worry about what "index" they've been assigned; it's done automatically.
As that might be costly in datastore reads, I'm sure you could implement a version that used repeated properties in a single model, moving to a new model after N keys are stored, but then you'd have to wrap it in a transaction to be sure multiple updates don't clash, and so on.
You can also cache the query that generates the results and invalidate it only when a new update is saved. Look at NDB as well, since it provides some automatic caching (not for queries, however).

How to handle "reference data" (static data) in the Google App Engine datastore?

I have an application I am working on where I have a set of data that, while not technically static, will not change very often (say, 3 or 4 times a year on average). However, some of this data is interrelated.
An example of this type of data would be states and counties - ideally, we would like to know all of the states available when putting in an address or location, but we would also like to know the counties available for each state, so we can display that information appropriately to the user (i.e. filtering out the inappropriate counties when a user has a state selected).
In the past, I have done this in a relational database by having a state and county table, where the county is linked back to the state it belongs in, and the state and counties are linked to any tables that need their information.
This data is not owned, however, and in the Google datastore it seems like the locking transaction mechanism will cause locks to occur even though we are not actively modifying this data. What is the best way to handle this type of data? Is it to have an entity for these pieces that has no parent (a parent of None/null)? Will this cause locking problems in the future?
I'd consider storing this in an optimized data structure inside your code and updating it manually. The performance gain will be huge, and since Google charges you for datastore operations, you will end up being thankful for it.
The idea is to mix these fixed data structures with your database: you give each country (or whatever) an id, and you reference that id in your models.
A simple approach is to make a list of countries, each with a list of states inside it. You can load them in def main(), before you run the app. Of course this can bring all sorts of problems if you are not careful, but if you are, you should be fine.
A more advanced approach would be to keep only the most-used entries in memory, and lazily load and evict countries on the fly.
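The answer above assumes Python on App Engine; purely for illustration, the same idea sketched in Java looks like this, with placeholder states and counties kept in a static structure keyed by a stable id, so that stored entities only ever reference the id:

import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Rarely-changing reference data kept in code; datastore entities store only the state id.
public final class ReferenceData {

    public static final Map<Integer, String> STATES = new LinkedHashMap<>();
    public static final Map<Integer, List<String>> COUNTIES_BY_STATE = new HashMap<>();

    static {
        STATES.put(1, "California");                                   // placeholder data
        STATES.put(2, "Texas");
        COUNTIES_BY_STATE.put(1, List.of("Alameda", "Kern", "Marin"));
        COUNTIES_BY_STATE.put(2, List.of("Travis", "Harris"));
    }

    // Populates the dependent county dropdown once the user picks a state.
    public static List<String> countiesFor(int stateId) {
        return COUNTIES_BY_STATE.getOrDefault(stateId, List.of());
    }

    private ReferenceData() {}
}

When the data does change a few times a year, a redeploy updates the structure; that trade-off is usually acceptable for data this static.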

Are objects usually stored in databases as serialized objects, or just as their fields?

Let's say a program uses certain objects while it's running and stores the objects and their data in a database when the program is not running. Is it more common to store the actual objects serialized in the database, so that when the program runs again they can be deserialized back into main memory and used, or is it more common to store the fields of each object in the database, so that when the program starts up again a new object is created with the fields passed as constructor arguments or through set methods? The former (serialized objects) seems cleaner from the programmer's perspective, but I can see the latter being preferable for other programs that don't have the exact same class API needed to deserialize. What's the trend in actual practice?
Normally one saves them as fields; that is, the discrete pieces of data in an object are stored in different fields.
This allows you to make ad-hoc querying of the data, which would be impossible (or very difficult) with a serialized form.
The point of a relational database is to minimize required storage and duplication while maintaining ACID.
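A small sketch of the difference using plain JDBC; the table names (customers, customer_blobs) and the Customer attributes are hypothetical:

import java.sql.Connection;
import java.sql.PreparedStatement;

public class CustomerDao {

    // Field-per-column: enables ad-hoc queries such as
    //   SELECT name FROM customers WHERE balance > 1000
    public void saveAsFields(Connection conn, long id, String name, double balance) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO customers (id, name, balance) VALUES (?, ?, ?)")) {
            ps.setLong(1, id);
            ps.setString(2, name);
            ps.setDouble(3, balance);
            ps.executeUpdate();
        }
    }

    // Serialized blob: compact to write, but opaque to SQL and brittle when the class changes.
    public void saveAsBlob(Connection conn, long id, byte[] serializedCustomer) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO customer_blobs (id, data) VALUES (?, ?)")) {
            ps.setLong(1, id);
            ps.setBytes(2, serializedCustomer);
            ps.executeUpdate();
        }
    }
}

The first form is what lets the database index, filter, and join on individual attributes; the second reduces it to a key-value store holding opaque bytes that only the original class can interpret.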

When is it a good idea to use a database

I am doing an information retrieval project in C++. What are the advantages of using a database to store terms, as opposed to storing them in a data structure such as a vector? More generally, when is it a good idea to use a database rather than a data structure?
(Shawn): Whenever you want to keep the data beyond the lifetime of one run of the program (persistence across time).
(Michael Kjörling): Whenever you want many instances of your program, either on the same computer or on many computers (such as across a network or the Internet), to access and manipulate (share) the same data (persistence across network space).
Whenever you have a very large amount of data that does not fit into memory.
Whenever you have very complex data structures and you would prefer not to rewrite the code to manipulate them (e.g. to search or update them), given that the database programmers have already written such code, and probably written it much faster than the code you (or I) would write.
Whenever you want to keep the data beyond the length of the instance of the program?
In addition to Shawn pointing out persistence: whenever you want multiple instances of the program to be able to easily share data?
In-memory data structures are great, but they are not a replacement for persistence.
It really depends on the scope. For example, if you're going to have multiple applications accessing the data, then a database is better because you won't have to worry about file locks, etc. Also, you'd use a database when you need to do things like joining other data, sorting, and so on - unless you like implementing Quicksort yourself.
