Usually, when performing a self-join on a table, one would include a restriction on the ids of the joined table(s) to avoid symmetric results and joins of a row with itself.
There seem to be at least three ways to implement a self-join in Apache Flink (as mentioned here):
Use the regular join operator, dataset.join(dataset). However, the aforementioned restriction does not seem possible to implement this way (a small sketch appears after this list).
Use a reduce operator and implement the join manually, which may lead to memory problems.
A combination of reduceGroup, flatMap, and reduceGroup. This approach is used in some Gelly implementations of graph algorithms, e.g. JaccardIndex, where the operations are named GenerateGroupSpans, GenerateGroups, and GenerateGroupPairs. It is not clear to me whether these names refer to a certain pattern or computation strategy.
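For reference, here is a minimal sketch of what option 1 looks like in the Java DataSet API (the data set and field positions are made up). The join condition itself only accepts key equality, so the id restriction from the opening paragraph would have to be applied in a separate step, for example as a filter over the joined pairs:

    // Sketch only: a self-join on field 1 (a grouping key), followed by a
    // filter that keeps each unordered pair once and drops self-pairs.
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;

    public class SelfJoinSketch {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // (id, key) records -- purely illustrative data
            DataSet<Tuple2<Long, String>> items = env.fromElements(
                    Tuple2.of(1L, "a"), Tuple2.of(2L, "a"), Tuple2.of(3L, "b"));

            // The default join result is a Tuple2 of the two joined records.
            DataSet<Tuple2<Tuple2<Long, String>, Tuple2<Long, String>>> pairs =
                    items.join(items)
                         .where(1)
                         .equalTo(1)
                         .filter(pair -> pair.f0.f0 < pair.f1.f0);

            pairs.print();
        }
    }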
Is there a single best way to perform a self-join in Apache Flink, or does it depend on the use case? Which method is best in terms of performance, memory, and reliability? Is there a general pattern that helps to understand method 3?
Related
I am currently working with Java Spring and Postgres.
I have a query on a table; many filters can be applied to the query, and each filter requires several joins.
This query is very slow due to the number of joins that must be performed, and also because there are many rows in the table.
Foreign keys and indexes are correctly created.
I know one approach could be to keep duplicate information to avoid doing the joins: create a new table called infoSearch, keep it updated via triggers, and at query time perform the search operations on that table. This way I would do just one join.
But I have some doubts:
What is the best approach in Postgres to store an item list in flat (denormalized) form?
I know there is a json data type; could I use it to hold the information needed for the search and query it with JSONPath? Is this performant with lists? (A rough sketch of what I mean appears below.)
I would also greatly appreciate any advice on another approach that could be used to fix this.
Is there any software that can be used to make this more efficient?
I'm wondering whether it would be more performant to move to another style of database, such as a graph database. At this point the only problem I have is with this specific table; the rest of the workload consists of simple queries that map very well to a relational database.
Is there any scaling guideline, based on ratios and the number of items, for choosing which kind of database to use?
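To make the jsonb idea concrete, here is a rough sketch of what I have in mind, using Spring's JdbcTemplate. The table and column names (info_search, tags) are placeholders, and @> is the Postgres jsonb containment operator, which can be backed by a GIN index:

    // Rough sketch only: query a denormalized table whose "tags" column is a
    // jsonb array, using the containment operator (@>) for the search.
    import org.springframework.jdbc.core.JdbcTemplate;

    import java.util.List;

    public class InfoSearchDao {

        private final JdbcTemplate jdbcTemplate;

        public InfoSearchDao(JdbcTemplate jdbcTemplate) {
            this.jdbcTemplate = jdbcTemplate;
        }

        // Ids of rows whose jsonb "tags" array contains the given value,
        // e.g. tags @> '["urgent"]'. (Placeholder names throughout.)
        public List<Long> findIdsWithTag(String tagAsJsonArray) {
            return jdbcTemplate.query(
                    "SELECT id FROM info_search WHERE tags @> ?::jsonb",
                    (rs, rowNum) -> rs.getLong("id"),
                    tagAsJsonArray); // e.g. "[\"urgent\"]"
        }
    }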
Denormalization is a tried-and-true way to speed up queries, reports, and search processes in relational databases. It is a standard time-vs-space tradeoff: query time is reduced at the cost of duplicating data and increasing write/insert time.
There are third-party tools that are specifically designed for this use case, including search tools (like Elasticsearch, Solr, etc.) and other document-centric databases. Graph databases are probably not useful in this context; they are focused on traversing relationships, not broad searches.
I am building a scalable solution and hence require sharding of my data.
I know the specific usage map of my present shards, and based on that I want to break them up and create new shards. [A key range with higher usage gets broken down into smaller parts and distributed across different machines to equalize load across nodes.]
Is there any theory/text/algorithm that gives the most efficient sharding strategy (sharding as such, without breaking the sequence/index), if it is known which key ranges are used the most?
It is better to match the sharding algorithm/strategy to the business scenario.
There are some common algorithms, such as Hash, Range, Mod, Tag, HashMod, Time, etc.
And sometimes the algorithms need to be customized, for example: use user_id mod for database sharding and order_id mod for table sharding.
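A minimal sketch of that kind of customized routing (the shard counts and naming scheme are just assumptions for illustration):

    // Sketch only: user_id mod picks the database, order_id mod picks the table.
    public final class ModShardingRouter {

        private static final int DATABASE_COUNT = 4;
        private static final int TABLES_PER_DATABASE = 8;

        private ModShardingRouter() {}

        // Database index derived from the user id.
        static int databaseIndex(long userId) {
            return (int) (userId % DATABASE_COUNT);
        }

        // Table index derived from the order id.
        static int tableIndex(long orderId) {
            return (int) (orderId % TABLES_PER_DATABASE);
        }

        public static void main(String[] args) {
            long userId = 1234L;
            long orderId = 56789L;
            // Prints "ds_2.t_order_5" for these ids.
            System.out.printf("ds_%d.t_order_%d%n",
                    databaseIndex(userId), tableIndex(orderId));
        }
    }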
Maybe you can have a look at Apache ShardingSphere; this project defines some standard sharding algorithms and lets developers customize their own.
The related documentation is: https://shardingsphere.apache.org/document/current/en/dev-manual/sharding/
The source code FYI: https://github.com/apache/shardingsphere/blob/master/shardingsphere-features/shardingsphere-sharding/shardingsphere-sharding-core/src/main/resources/META-INF/services/org.apache.shardingsphere.sharding.spi.ShardingAlgorithm
I have been browsing the internet for quite a few hours now and have not come to a satisfactory answer as to why one is better than the other. If this is situation dependent, what are the situations in which to use one over the other? It would be great if you could provide an example, if there can be one. I understand that since the aggregation operators came later they are probably the better option, but I have still seen people using the find() + sort() method.
You shouldn't think of this as an issue of "which method is better?", but "what kind of query do I need to perform?"
The MongoDB aggregation pipeline exists to handle a different set of problems than a simple .find() query. Specifically, aggregation is meant to allow processing of data on the database end in order to reduce the workload on the application server. For example, you can use aggregation to generate a numerical analysis on all of the documents in a collection.
If all you want to do is retrieve some documents in sorted order, use find() and sort(). If you want to perform a lot of processing on the data before retrieving the results, then use aggregation with a $sort stage.
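As a rough illustration with the MongoDB Java driver (collection and field names are made up): the first call simply retrieves matching documents in sorted order, while the second pushes grouping work to the server before a $sort stage.

    // Sketch only: find()+sort() versus an aggregation pipeline with $sort.
    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Accumulators;
    import com.mongodb.client.model.Aggregates;
    import com.mongodb.client.model.Filters;
    import com.mongodb.client.model.Sorts;
    import org.bson.Document;

    import java.util.Arrays;

    public class FindVsAggregateSketch {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> orders =
                        client.getDatabase("shop").getCollection("orders");

                // Simple retrieval in sorted order: find() + sort().
                orders.find(Filters.eq("status", "open"))
                      .sort(Sorts.descending("createdAt"))
                      .forEach(doc -> System.out.println(doc.toJson()));

                // Server-side processing first, then a $sort stage.
                orders.aggregate(Arrays.asList(
                              Aggregates.match(Filters.eq("status", "open")),
                              Aggregates.group("$customerId",
                                      Accumulators.sum("total", "$amount")),
                              Aggregates.sort(Sorts.descending("total"))))
                      .forEach(doc -> System.out.println(doc.toJson()));
            }
        }
    }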
In my GAE app I'm doing a query which has to be ordered by date. The query has to contain an IN filter, but this results in the following error:
BadArgumentError: _MultiQuery with cursors requires __key__ order
Now I've read through other SO questions (like this one), which suggest switching to sorting by key (as the error also points out). The problem, however, is that the query then becomes useless for its purpose: it needs to be sorted by date. What would be suggested ways to achieve this?
The Cloud Datastore server doesn't support IN. The NDB client library effectively fakes this functionality by splitting a query with IN into multiple single queries with equality operators. It then merges the results on the client side.
Since the same entity could be returned in 1 or more of these single queries, merging these values becomes computationally silly*, unless you are ordering by the Key**.
Relatedly, you should read up on the underlying caveats/limitations of cursors to get a better understanding:
Because the NOT_EQUAL and IN operators are implemented with multiple queries, queries that use them do not support cursors, nor do composite queries constructed with the CompositeFilterOperator.or method.
Cursors don't always work as expected with a query that uses an inequality filter or a sort order on a property with multiple values. The de-duplication logic for such multiple-valued properties does not persist between retrievals, possibly causing the same result to be returned more than once.
If the list of values used in IN is a static list rather than one determined at runtime, a workaround is to compute this as an indexed Boolean field when you write the Entity. This allows you to use a single equality filter. For example, if you have a bug tracker and you want to see a list of open issues, you might use an IN('new', 'open', 'assigned') restriction on your query. Alternatively, you could set a property called is_open to True instead, so you no longer need the IN condition (a rough sketch of this appears below).
* Computationally silly: Requires doing a linear scan over an unbounded number of preceding values to determine if the current retrieved Entity is a duplicate or not. Also known as conceptually not compatible with Cursors.
** Key works because we can alternate between the different single queries retrieving the next set of values and not have to worry about doing a linear scan over the entire preceding result set. This gives us a bounded data set to work with.
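For illustration only, here is roughly what the write-time flag looks like with the Cloud Datastore Java client (not the NDB client the question uses); kind and property names are made up:

    // Sketch only: precompute an indexed boolean at write time so the read
    // query needs a single equality filter plus the date sort.
    import com.google.cloud.Timestamp;
    import com.google.cloud.datastore.Datastore;
    import com.google.cloud.datastore.DatastoreOptions;
    import com.google.cloud.datastore.Entity;
    import com.google.cloud.datastore.Key;
    import com.google.cloud.datastore.Query;
    import com.google.cloud.datastore.QueryResults;
    import com.google.cloud.datastore.StructuredQuery.OrderBy;
    import com.google.cloud.datastore.StructuredQuery.PropertyFilter;

    public class OpenIssuesSketch {
        public static void main(String[] args) {
            Datastore datastore = DatastoreOptions.getDefaultInstance().getService();

            // Write: derive is_open from the status instead of relying on IN.
            Key key = datastore.newKeyFactory().setKind("Issue").newKey("issue-1");
            Entity issue = Entity.newBuilder(key)
                    .set("status", "open")
                    .set("is_open", true)
                    .set("created", Timestamp.now())
                    .build();
            datastore.put(issue);

            // Read: one equality filter, so ordering by date is allowed.
            Query<Entity> openIssues = Query.newEntityQueryBuilder()
                    .setKind("Issue")
                    .setFilter(PropertyFilter.eq("is_open", true))
                    .setOrderBy(OrderBy.desc("created"))
                    .build();
            QueryResults<Entity> results = datastore.run(openIssues);
            while (results.hasNext()) {
                System.out.println(results.next().getString("status"));
            }
        }
    }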
Regarding generating a surrogate key, the first step is to get the distinct tuples and then build an incremental key for each of them.
So I used a Java Set to get the distinct elements, and it ran out of heap space.
Then I used Flink's distinct(), and it worked without any problem.
Could I ask what makes the difference?
Another related question is: can Flink generate a surrogate key in a mapper?
Flink executes a distinct() internally as a GroupBy followed by a ReduceGroup operator, where the reduce operator returns only the first element of each group.
The GroupBy is done by sorting the data. Sorting is done on a binary data representation, in memory if possible, but it might spill to disk if not enough memory is available. This blog post gives some insight into that. GroupBy and Sort are memory-safe in Flink and will not fail with an OutOfMemoryError.
You can also do a distinct on a custom key by using DataSet.distinct(KeySelector ks). The key selector is essentially a MapFunction that generates a custom key.
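A small sketch of that last point (the class and field choices are made up): the KeySelector derives the key on which distinct() groups.

    // Sketch only: distinct() on a custom key via a KeySelector.
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.functions.KeySelector;
    import org.apache.flink.api.java.tuple.Tuple2;

    public class DistinctByKeySketch {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            DataSet<Tuple2<String, Integer>> records = env.fromElements(
                    Tuple2.of("Alice", 1), Tuple2.of("ALICE", 2), Tuple2.of("Bob", 3));

            // The key selector acts like a MapFunction that produces the key;
            // here the name is normalized to lower case before deduplication.
            DataSet<Tuple2<String, Integer>> distinctByName =
                    records.distinct(new KeySelector<Tuple2<String, Integer>, String>() {
                        @Override
                        public String getKey(Tuple2<String, Integer> value) {
                            return value.f0.toLowerCase();
                        }
                    });

            distinctByName.print();
        }
    }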