What magic does Flink use in distinct()? How are surrogate keys generated? - apache-flink

Regarding surrogate key generation, the first step is to get the distinct elements and then build an incremental key for each tuple.
So I used a Java Set to get the distinct elements, and it ran out of heap space.
Then I used Flink's distinct() and it worked without problems.
Could I ask what makes the difference?
A related question: can Flink generate the surrogate keys in a mapper?

Flink executes a distinct() internally as a GroupBy followed by a ReduceGroup operator, where the reduce operator returns the first element of the group only.
The GroupBy is done by sorting the data. Sorting is done on a binary data representation, in memory if possible, but it might spill to disk if not enough memory is available. This blog post gives some insight into that. GroupBy and Sort are memory-safe in Flink and will not fail with an OutOfMemoryError.
You can also do a distinct on a custom key, by using DataSet.distinct(KeySelector ks). The key selector is basically a MapFunction that generates a custom key.
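For illustration, here is a minimal sketch of both forms in the Java DataSet API (the sample data and key choice are made up): distinct() with a KeySelector, and the roughly equivalent GroupBy + ReduceGroup plan described above.

import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

DataSet<Tuple2<String, Integer>> data = env.fromElements(
        Tuple2.of("a", 1), Tuple2.of("a", 2), Tuple2.of("b", 1));

// distinct() on a custom key: here only the first tuple field is considered
DataSet<Tuple2<String, Integer>> dist = data.distinct(
        new KeySelector<Tuple2<String, Integer>, String>() {
            @Override
            public String getKey(Tuple2<String, Integer> t) {
                return t.f0;
            }
        });

// roughly the GroupBy + ReduceGroup plan described above, grouping on the same key (field 0)
// and emitting only the first element of each group
DataSet<Tuple2<String, Integer>> manual = data
        .groupBy(0)
        .reduceGroup(new GroupReduceFunction<Tuple2<String, Integer>, Tuple2<String, Integer>>() {
            @Override
            public void reduce(Iterable<Tuple2<String, Integer>> group,
                               Collector<Tuple2<String, Integer>> out) {
                out.collect(group.iterator().next());
            }
        });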

Related

Reverse Indexing and Data modeling in Key-Value store

I am new to key-value stores. My objective is to use an embedded key-value store to keep the persistent data model. The data model would comprise a few related tables if designed with a conventional RDBMS. I was reading a Medium article on modeling a table for a key-value store. Although the article uses LevelDB with Java, I am planning to use RocksDB or FASTER with C++ for my work.
It uses a scheme where one key is used for every attribute of each row, like the following example.
$table_name:$primary_key_value:$attribute_name = $value
The above is fine for point lookups when the user code knows exactly which key to get. But there are scenarios like searching for users with the same email address, searching for users above a certain age, or searching for users of one specific gender. In these search scenarios the article performs a linear scan over all keys. In each iteration it checks the pattern of the key and applies the business logic (checking the value for a match) once a key with a matching pattern is found.
Such searching seems inefficient; in the worst case it has to traverse the entire store. To solve that, a reverse lookup table is required. My question is:
How to model the reverse lookup table? Is it some sort of reinvention of the wheel? Is there an alternative way?
One solution that readily comes to mind is to have a separate store for each indexable property, like the following.
$table_name:$attribute_name:$value_1 = $primary_key_value
With this approach the immediate question is:
How to handle collisions in this reverse lookup table? Multiple $primary_keys may be associated with the same value.
As an immediate solution, instead of storing a single value, an array of primary keys can be stored, as shown below.
$table_name:$attribute_name:$value_1 = [$primary_key_value_1, ... , $primary_key_value_N]
But this kind of modeling requires the user code to parse the array from a string and serialize it back to a string after every manipulation (assuming the underlying key-value store is not aware of array values).
Is it efficient to store multiple keys as an array value, or is there some vendor-provided, more efficient way?
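For example, a minimal sketch of the append path (RocksJava here, since the article uses Java; a comma-joined string stands in for the JSON array, and the names are made up) already shows the full read-modify-write cycle:

import java.nio.charset.StandardCharsets;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

// Illustrative only: append one primary key to an array-valued index entry.
static void addToIndex(RocksDB db, String table, String attr, String value, String pk)
        throws RocksDBException {
    byte[] key = (table + ":" + attr + ":" + value).getBytes(StandardCharsets.UTF_8);
    byte[] existing = db.get(key);                        // 1. read the current list (may be null)
    String list = (existing == null)
            ? pk
            : new String(existing, StandardCharsets.UTF_8) + "," + pk;   // 2. parse + append
    db.put(key, list.getBytes(StandardCharsets.UTF_8));   // 3. serialize + write back
}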
Assuming the stringified-array design works, there has to be such an index for each indexable property. This gives fine-grained control over what to index and what not to index. The next design decision is where these indexes will be stored:
Should the indexes be stored in a separate store/file, or in the same store/file the actual data belongs to? Should there be a different store for each property?
For this question I don't have a clue, because both approaches require more or less the same amount of I/O. However, a single large data file means more data on disk and less in memory (so more I/O), whereas with multiple files more of the working set fits in memory, so there are fewer page faults. This assumption could be totally wrong depending on the architecture of the specific key-value store. At the same time, having too many files turns into the problem of managing a complicated file structure. Also, maintaining indexes requires transactions for insert, update and delete operations. With multiple files a single logical update touches multiple trees, whereas with a single file it becomes multiple updates within one tree.
Are transactions, more specifically transactions spanning multiple stores/files, supported?
Besides the indexes, some table metadata also has to be kept along with the table data. To generate a new (auto-incremented) primary key it is necessary to know the last row number or the last primary key generated, because something like COUNT(*) won't work. Additionally, since not all properties are indexed, the metadata may record which properties are indexed and which are not.
How to store the metadata of each table?
The same set of questions applies to the meta table as well, e.g. should the metadata be a separate store/file? Additionally, since not all properties are indexed, we may even decide to store each row as a JSON-encoded value in the data store and keep it alongside the index stores. The underlying key-value store will treat that JSON as an opaque string value, like the following.
$table_name:data:$primary_key_value = {$attr_1_name: $attr_1_value, ..., $attr_N_name: $attr_N_value}
...
$table_name:index:$attribute_name = [$primary1, ..., $primaryN]
Reverse lookups are still possible through the indexes pointing back to the primary key.
Are there any drawbacks to using JSON-encoded values instead of storing all properties as separate keys?
So far I could not find any drawbacks with this method, other than forcing the user to use JSON encoding and some heap allocation for JSON encoding/decoding.
The problems mentioned above are not specific to any particular application. They are generic enough to apply to any development using a key-value store, so it is essential to know whether any wheel is being reinvented here.
Is there any de facto standard solution to the problems mentioned in the question? Do the solutions differ from the ones stated in the question?
How to model the reverse lookup table? Is it some sort of reinvention of the wheel? Is there an alternative way?
All the ways you describe are valid ways to create an index.
It is not reinventing the wheel in the case of RocksDB, because RocksDB does not support indices.
It really depends on the data; in general you will need to copy the index value and the primary key into another space to create the index.
How to handle collisions in this reverse lookup table? Multiple $primary_keys may be associated with the same value.
You can serialize the pks using JSON (or something else). The problem with that approach is when the list of pks grows very large (which might or might not be an issue for you).
Is it efficient to store multiple keys as an array value, or is there some vendor-provided, more efficient way?
With RocksDB, you have nothing that will make it "easier".
You did not mention the following approach:
$table_name:$attribute_name:$value_1:$primary_key_value_1 = ""
$table_name:$attribute_name:$value_1:$primary_key_value_2 = ""
...
$table_name:$attribute_name:$value_1:$primary_key_value_n = ""
Here the value is empty and the indexed pk is part of the key.
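For illustration, a minimal RocksJava sketch of this scheme (the table/attribute names and the helper methods are mine, not from the question): each index entry is written with an empty value, and a lookup becomes a prefix scan over $table_name:$attribute_name:$value:.

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.RocksIterator;

static final byte[] EMPTY = new byte[0];

// index entry: user:email:<email>:<pk> -> ""
static void indexEmail(RocksDB db, String email, String pk) throws RocksDBException {
    String key = "user:email:" + email + ":" + pk;
    db.put(key.getBytes(StandardCharsets.UTF_8), EMPTY);
}

// lookup: prefix scan over user:email:<email>:
static List<String> findByEmail(RocksDB db, String email) {
    String prefix = "user:email:" + email + ":";
    List<String> pks = new ArrayList<>();
    try (RocksIterator it = db.newIterator()) {
        for (it.seek(prefix.getBytes(StandardCharsets.UTF_8)); it.isValid(); it.next()) {
            String key = new String(it.key(), StandardCharsets.UTF_8);
            if (!key.startsWith(prefix)) {
                break;                       // left the prefix range
            }
            pks.add(key.substring(prefix.length()));
        }
    }
    return pks;
}

Pagination then falls out naturally: to continue after a given pk, seek to prefix + lastSeenPk and skip the first entry.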
Should the indexes be stored in a separate store/file, or in the same store/file the actual data belongs to? Should there be a different store for each property?
It depends on the key-value store. With RocksDB, if you need transactions, you must stick to one database file.
Are transactions, more specifically transactions spanning multiple stores/files, supported?
Only Oracle Berkeley DB and WiredTiger support that feature.
How to store the metadata of each table?
Metadata can be stored in the database or in the code.
Are there any drawbacks to using JSON-encoded values instead of storing all properties as separate keys?
Yes, as I said above: if you encode all the pks into a single value, it might lead to problems downstream when the number of pks is large. For instance, you need to read the whole list just to paginate.
Is there any de facto standard solution to the problems mentioned in the question? Do the solutions differ from the ones stated in the question?
To summarize:
With RocksDB, use a single database file.
In the index, encode the primary key inside the key and leave the value empty, so that you can paginate.

How to perform a self-join in Apache Flink

Usually, when performing a self-join on a table, one includes a restriction on the ids of the joined table(s) to avoid symmetric results and joins of a row with itself.
There seem to be at least three ways to implement a self-join in Apache Flink (as mentioned here):
Use the regular join operator, dataset.join(dataset). However, the aforementioned restriction does not seem to be implementable this way.
Use a reduce operator and manually implement the join, which may lead to memory problems.
A combination of reduceGroup, flatMap, and reduceGroup. This approach is used in some Gelly implementations of graph algorithms, e.g. JaccardIndex, where the operations are named GenerateGroupSpans, GenerateGroups, GenerateGroupPairs. It is not clear to me whether these names refer to a certain pattern or computation strategy.
Is there a single best way to perform a self-join in Apache Flink or does it depend on the use case? Which method is best performance-, memory- and reliability-wise? Is there a general pattern to understand method 3?
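For reference, a minimal sketch of what approach 1 can look like in the Java DataSet API (the types, the field positions, and the left.f0 < right.f0 restriction are illustrative assumptions, not taken from the question); note that the restriction is applied inside the join function, i.e. only after the join pairs have been formed:

import org.apache.flink.api.common.functions.FlatJoinFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

// items: Tuple2<id, joinKey>; keep only pairs with left.id < right.id to drop
// self-pairs and one element of each symmetric pair
DataSet<Tuple2<Long, Long>> pairs = items
        .join(items)
        .where(1)
        .equalTo(1)
        .with(new FlatJoinFunction<Tuple2<Long, Long>, Tuple2<Long, Long>, Tuple2<Long, Long>>() {
            @Override
            public void join(Tuple2<Long, Long> left, Tuple2<Long, Long> right,
                             Collector<Tuple2<Long, Long>> out) {
                if (left.f0 < right.f0) {
                    out.collect(Tuple2.of(left.f0, right.f0));
                }
            }
        });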

Does anyone have a good example of a ProcessFunction that sums or aggregates data at some frequency

I am looking to mimic the behaviour of a window().reduce() operation, but without a key, at the task manager level. Sort of like what .windowAll().reduce() does for a stream, except that I am looking to get individual results from each task manager.
I tried searching for "flink processFunction examples" but am not finding anything useful to look at.
For ProcessFunction examples, I suggest the examples in the Flink docs and in the Flink training materials.
Another approach would be to use windows with a random key selector. That's not as easy as it sounds: you can't just key by a random number, as the value of the key must be deterministic for each stream element. So you could add a field that you set to a random value, and then keyBy that field. Compared to the ProcessFunction approach this will force a shuffle, but it is simpler.
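A minimal sketch of that random-key variant (the Event type, the window size, and the fanout are illustrative assumptions): the random routing key is assigned once per element in a map and stored in the tuple, so the key selector itself stays deterministic.

import java.util.concurrent.ThreadLocalRandom;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

final int fanout = 4;  // roughly how many parallel partial aggregations you want

DataStream<Tuple2<Integer, Long>> partialSums = events
        .map(new MapFunction<Event, Tuple2<Integer, Long>>() {
            @Override
            public Tuple2<Integer, Long> map(Event e) {
                int routingKey = ThreadLocalRandom.current().nextInt(fanout);  // assigned once, stored in the tuple
                return Tuple2.of(routingKey, e.getValue());
            }
        })
        .keyBy(t -> t.f0)                                             // deterministic: just reads the stored field
        .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
        .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1));

This yields one partial sum per routing key and window rather than strictly one per task manager, but it spreads the aggregation across parallel subtasks without a global windowAll.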

Obtain KeyedStream from custom partitioning in Flink

I know that Flink comes with custom partitioning APIs. However, the problem is that, after invoking partitionCustom on a DataStream, you get back a DataStream and not a KeyedStream.
On the other hand, you cannot override the partitioning strategy for a KeyedStream.
I do want to use KeyedStream, because the DataStream API does not have the reduce and sum operators, and because of its automatically partitioned internal state.
I mean, if the word count is:
words.map(s -> Tuple2.of(s, 1)).keyBy(0).sum(1)
I wish I could write:
words.map(s -> Tuple2.of(s, 1)).partitionCustom(myPartitioner, 0).sum(1)
Is there any way to accomplish this?
Thank you!
According to Flink's documentation (as of version 1.2.1), a custom partitioner only controls how data is physically distributed across partitions and machines; it does not logically group the data into a KeyedStream. To do the summation, we still need to group the elements by key using the keyBy operator, and only then are we allowed to call sum.
For details, please refer to "https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/datastream_api.html#physical-partitioning" :)
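Putting it together with the snippets from the question (myPartitioner being the custom partitioner from the question), a sketch would be:

words.map(s -> Tuple2.of(s, 1))
     .partitionCustom(myPartitioner, 0)  // physical placement of records only
     .keyBy(0)                           // logical grouping into a KeyedStream
     .sum(1);

Note that keyBy is what produces the KeyedStream that sum needs; it applies its own (hash) partitioning, so partitionCustom alone cannot replace it.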

GAE NDB Sorting a multiquery with cursors

In my GAE app I'm doing a query which has to be ordered by date. The query has to contain an IN filter, but this results in the following error:
BadArgumentError: _MultiQuery with cursors requires __key__ order
Now I've read through other SO questions (like this one), which suggest switching to sorting by key (as the error also points out). The problem, however, is that the query then becomes useless for its purpose: it needs to be sorted by date. What would be suggested ways to achieve this?
The Cloud Datastore server doesn't support IN. The NDB client library effectively fakes this functionality by splitting a query with IN into multiple single queries with equality operators. It then merges the results on the client side.
Since the same entity could be returned in 1 or more of these single queries, merging these values becomes computationally silly*, unless you are ordering by the Key**.
Relatedly, you should read up on the underlying caveats/limitations of cursors to get a better understanding:
Because the NOT_EQUAL and IN operators are implemented with multiple queries, queries that use them do not support cursors, nor do composite queries constructed with the CompositeFilterOperator.or method.
Cursors don't always work as expected with a query that uses an inequality filter or a sort order on a property with multiple values. The de-duplication logic for such multiple-valued properties does not persist between retrievals, possibly causing the same result to be returned more than once.
If the list of values used in IN is static rather than determined at runtime, a workaround is to compute an indexed Boolean field when you write the Entity. This allows you to use a single equality filter. For example, if you have a bug tracker and you want to see a list of open issues, you might use an IN('new', 'open', 'assigned') restriction on your query. Alternatively, you could set a property called is_open to True instead, so you no longer need the IN condition.
* Computationally silly: Requires doing a linear scan over an unbounded number of preceding values to determine if the current retrieved Entity is a duplicate or not. Also known as conceptually not compatible with Cursors.
** Key order works because we can alternate between the different single queries, retrieving the next set of values, without having to do a linear scan over the entire preceding result set. This gives us a bounded data set to work with.
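To make the footnotes concrete, here is a small illustrative sketch (plain Java, not NDB or Datastore code) of why key order keeps the client-side merge cheap: since every sub-query returns results sorted by key, de-duplication only needs to remember the last key emitted.

import java.util.ArrayList;
import java.util.List;

// Illustrative only: merge several key-sorted sub-query results into one key-sorted,
// de-duplicated list. Dropping duplicates needs just the last key emitted, not a scan
// over everything returned so far.
static List<Long> mergeDistinct(List<List<Long>> sortedSubQueryKeys) {
    int n = sortedSubQueryKeys.size();
    int[] pos = new int[n];                      // cursor into each sub-result
    List<Long> out = new ArrayList<>();
    Long last = null;
    while (true) {
        int best = -1;                           // sub-result holding the smallest current key
        for (int i = 0; i < n; i++) {
            if (pos[i] < sortedSubQueryKeys.get(i).size()
                    && (best < 0 || sortedSubQueryKeys.get(i).get(pos[i])
                        < sortedSubQueryKeys.get(best).get(pos[best]))) {
                best = i;
            }
        }
        if (best < 0) {
            break;                               // all sub-results exhausted
        }
        Long key = sortedSubQueryKeys.get(best).get(pos[best]++);
        if (!key.equals(last)) {                 // constant-state de-duplication
            out.add(key);
            last = key;
        }
    }
    return out;
}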
