I have a use case where I want to build a real-time decision tree evaluator using Flink. I have a decision tree something like the one below:
Decision tree example:

Root node (Product A) ---> check if the price of Product A increased by $10 in the last 10 minutes.
If yes ---> left child of A (Product B) ---> check if the price of Product B increased by $20 in the last 10 minutes ---> if not, output Product B.
If no ---> right child of A (Product C) ---> check if the price of Product C increased by $20 in the last 10 minutes ---> if not, output Product C.
Note: this is just an example of one decision tree. I have multiple such trees with different product types, numbers of nodes, and conditions, and I want to write a common Flink app to evaluate all of them.
As input I receive a data stream with prices of all product types (A, B and C) every minute. One approach I can think of to achieve this use case is as follows:
Filter the input stream by product type.
For each product type, apply a sliding window over the last X minutes (X depends on the product type), triggered every minute.
Use a process window function to compute the price difference for the product type and emit the price difference for each product type to an output stream.
Now that we have the price difference of each product type/node of the tree, we can evaluate the decision tree logic. To do this, we have to make sure that the price-difference calculations for all product types in a decision tree (Products A, B and C in the example above) have completed before determining the output. One way is to store the outputs for all these products from the output stream in a datastore and keep checking from an EC2 instance every 5 s or so whether all the price computations are complete. Once they are, execute the decision tree logic to determine the output product.
I wanted to understand whether there is any other way to do this entire computation in Flink itself, without needing any other components (datastore/EC2). I am fairly new to Flink, so any leads would be highly appreciated!
Problem statement:
Trying to evaluate Apache Flink for modelling advanced real time low latency distributed analytics
Use case abstract:
Provide complex analytics for instruments I1, I2, I3, ..., each having a product definition P1, P2, P3, configured with dynamic user parameters U1, U2, U3, and requiring streaming market data M1, M2, M3, ...
The instrument analytics functions (A1, A2) are complex in terms of computational cost; some of them could take 300-400 ms, but they can be computed in parallel.
From the above, the market data stream is clearly much faster (<1 ms) than the analytics functions, and the calculations need to consume the latest consistent market data.
The next challenge is multiple dependent enrichment functions E1, E2, E3 (e.g. Risk/PnL) which combine streaming market data with instrument analytics results (e.g. price or yield).
The last challenge is consistency of the calculations: function A1 could be faster than A2, yet we need a consistent result across all instruments for a given market input.
Calculation Graph dependency examples (scale it to hundreds of instruments & 10-15 market data sources):
In case the image above is not visible, the graph dependency flow is:
- M1 + M2 + P1 => A2
- M1 + P1 => A1
- A1 + A2 => E2
- A1 => E1
- E1 + E2 => Result numbers
Questions:
What is the correct design/model for these calculation data streams? Currently I use ConnectedStreams for (P1 + M1); another approach could be an iterative model feeding the same instrument's static data back to itself.
I am facing an issue using only the latest market data events in the calculations, as the analytics function (A1) is a lot slower than the market data (M1) stream.
Hence I need eviction of stale market data before the next iteration, retaining entries for which no newer value is available (like an LRU cache).
I need to synchronize/correlate the execution of functions with different time complexity, so that iteration 2 starts only when everything in iteration 1 has finished.
This is quite a broad question and to answer it more precisely, one would need a few more details.
Below are a few thoughts that I hope will point you in a good direction and help you to approach your use case:
Connected streams by key (a.keyBy(...).connect(b.keyBy(...))) are the most powerful join- or union-like primitive. Using a CoProcessFunction on a connected stream should give you the flexibility to correlate or join values as needed. You can, for example, store the events from one stream in state while waiting for a matching event to arrive from the other stream.
Always holding the latest data of one input is easily done by putting that value into the state of a CoFlatMapFunction or a CoProcessFunction: for each event from stream 1, you store the event in the state; for each event from stream 2, you look into the state to find the latest event from stream 1.
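For illustration, here is a minimal PyFlink sketch of this pattern (the tuple layouts, names and sample values are made up; the same idea applies to the Java/Scala CoProcessFunction API):

from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.functions import KeyedCoProcessFunction, RuntimeContext
from pyflink.datastream.state import ValueStateDescriptor

class LatestMarketDataJoin(KeyedCoProcessFunction):
    def open(self, runtime_context: RuntimeContext):
        # one value per key: the most recent market data seen on stream 1
        self.latest = runtime_context.get_state(
            ValueStateDescriptor("latest-market-data", Types.DOUBLE()))

    def process_element1(self, value, ctx):
        # stream 1 (market data): just remember the latest price for this key
        self.latest.update(value[1])

    def process_element2(self, value, ctx):
        # stream 2 (instrument/static data): enrich it with the latest market data, if any
        price = self.latest.value()
        if price is not None:
            yield value[0], value[1], price

env = StreamExecutionEnvironment.get_execution_environment()
market = env.from_collection([("I1", 101.5), ("I1", 102.0)])           # e.g. M1
instruments = env.from_collection([("I1", "product-definition-P1")])   # e.g. P1

result = (market.connect(instruments)
                .key_by(lambda m: m[0], lambda p: p[0])
                .process(LatestMarketDataJoin()))
result.print()
env.execute("latest-value-join-sketch")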
To synchronize on time, you could actually look into using event time. Event time can also be "logical time", meaning just a version number, iteration number, or anything. You only need to make sure that the timestamp you assign and the watermarks you generate reflect that consistently.
If you then window by event time, you will get all data of that version together, regardless of whether one operator is faster than others or whether the events arrive via paths with different latency. That is the beauty of real event-time processing :-)
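And a hedged PyFlink sketch of the "version number as event time" idea (the (key, value, version) record layout and the window/reduce at the end are assumptions, just to show the mechanics):

from pyflink.common import Time
from pyflink.common.watermark_strategy import WatermarkStrategy, TimestampAssigner
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.window import TumblingEventTimeWindows

class VersionAsTimestamp(TimestampAssigner):
    def extract_timestamp(self, value, record_timestamp):
        return value[2]  # treat the iteration/version number as the event timestamp

env = StreamExecutionEnvironment.get_execution_environment()
# (key, value, version) -- the "version" plays the role of logical event time
stream = env.from_collection([("A1", 1.0, 1), ("A2", 2.0, 1), ("A1", 1.5, 2)])

versioned = stream.assign_timestamps_and_watermarks(
    WatermarkStrategy.for_monotonous_timestamps()
                     .with_timestamp_assigner(VersionAsTimestamp()))

# All events carrying the same version fall into the same 1-unit event-time window,
# no matter how fast or slow the operators that produced them were.
per_version = (versioned.key_by(lambda e: e[0])
                        .window(TumblingEventTimeWindows.of(Time.milliseconds(1)))
                        .reduce(lambda a, b: (a[0], a[1] + b[1], a[2])))
per_version.print()
env.execute("version-as-event-time-sketch")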
I have a table with millions of transactions of a single account. Each transaction contains:
moment - Timestamp when the transaction happened.
sequence - A number to sort transactions that happen at exact same moment.
description, merchant, etc - overall information.
amount - The monetary value of the transaction, which may be positive or negative.
balance - The account balance after the transaction (the sum of the current and all previous amounts). This is computed by the system.
What data structure is optimized for quickly displaying or updating the correct balance of all transactions, assuming the user can insert, delete or modify the amount of very old transactions?
My current option is to organize the transactions in a B-tree of order M and store the sum of the amounts in each node. Then if some very old transaction is updated, I only update the sum of the corresponding node and all its parents up to the root, which is very fast. It also allows me to show the total balance with a single read of the root node. However, in order to display the right balance for later records, I eventually need to read M nodes, which is kind of slow assuming each node is on cloud storage.
Is there a better solution?
The B-tree solution can be enhanced further: you can store a list of delta modifications in RAM. This list (which may itself be a binary tree) contains only the updates and is sorted by timestamp.
For example, this list may look like following at some point:
(t1, +5), (t10, -6), (t15, +80)
This means that when you need to display the balance of a transaction with timestamp:
less than t1 - do nothing
in [t1, t10) - add 5
in [t10, t15) - subtract 6
in [t15, inf) - add 80
Now suppose that we need to make the modification (t2, -3). We:
Insert this node into the list at the proper position.
Update all nodes to the right with the delta (-3).
Update this node's value with the value from its left neighbor plus the delta (+5 - 3 = +2).
List becomes:
(t1, +5), (t2, +2), (t10, -9), (t15, +77)
Eventually, when the delta list becomes large, you will need to apply it to your B-tree.
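For concreteness, a small Python sketch of such a delta list kept as a sorted list (using bisect; a balanced tree would be the more scalable choice, as noted above):

import bisect

# delta list: (timestamp, cumulative adjustment for balances at or after that timestamp)
deltas = [(1, +5), (10, -6), (15, +80)]

def adjustment_for(ts):
    # cumulative adjustment to apply to the stored balance of a transaction at time ts
    i = bisect.bisect_right(deltas, (ts, float("inf"))) - 1
    return deltas[i][1] if i >= 0 else 0

def add_modification(ts, amount):
    # record that the amount of the transaction at time ts changed by `amount`
    # (for simplicity, assumes at most one pending delta per timestamp)
    i = bisect.bisect_left(deltas, (ts, float("-inf")))
    left = deltas[i - 1][1] if i > 0 else 0
    deltas.insert(i, (ts, left + amount))      # value = left neighbour + new delta
    for j in range(i + 1, len(deltas)):        # update all nodes to the right
        t, v = deltas[j]
        deltas[j] = (t, v + amount)

add_modification(2, -3)
print(deltas)  # [(1, 5), (2, 2), (10, -9), (15, 77)]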
Why would one use the k-medoids algorithm rather than k-means? Is it only the fact that the number of metrics that can be used in k-means is very limited, or is there something more?
Is there an example of data for which it makes much more sense to choose the best representatives of the clusters from the data rather than from R^n?
The problem with k-means is that it is not interpretable. By interpretability I mean that the model should also be able to output the reason why it produced a certain output.
Let's take an example.
Suppose there is a food review dataset with two possibilities, a positive (+ve) review or a negative (-ve) review, so we can say k = 2, where k is the number of clusters. Now if you go with k-means, the third step of the algorithm is the update step, where you update your k centroids based on the mean of the points that lie in a particular cluster. The example we have chosen is a text problem, so you would also apply some text featurization scheme like bag of words (BoW) or word2vec, and for every review you would get a corresponding vector. The centroid c_i that you get after running k-means is then the mean of the vectors present in that cluster, and from that centroid you cannot interpret much, or rather, nothing at all.
But if you apply k-medoids to the same problem, you choose your k centroids/medoids from the dataset itself. Say you choose the point x_5 from your dataset as the first medoid. Your interpretability increases, because you now have the review itself serving as the medoid/centroid. So in k-medoids you choose the centroids from your dataset itself.
This is the foremost motivation for introducing k-medoids.
As for the metrics, you can apply all the metrics that you use for k-means.
Hope this helps.
Why would we use k-medoids instead of k-means in case of (squared) Euclidean distance?
1. Technical justification
In the case of relatively small data sets (as the complexity of k-medoids is greater): to obtain a clustering more robust to noise and outliers.
Example 2D data showing that:
The graph on the left shows clusters obtained with K-medoids (sklearn_extra.cluster.KMedoids method in Python with default options) and the one on the right with K-means for K=2. Blue crosses are cluster centers.
The Python code used to generate green points:
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=32)
# four groups of points with different sizes and spreads
a = rng.random((6,2))*2.35 - 3*np.ones((6,2))
b = rng.random((50,2))*0.25 - 2*np.ones((50,2))
c = rng.random((100,2))*0.5 - 1.5*np.ones((100,2))
d = rng.random((7,2))*0.55
points = np.concatenate((a, b, c, d))
plt.plot(points[:,0], points[:,1], "g.", markersize=8, alpha=0.3)  # green points
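If it helps to reproduce the comparison, a possible continuation of the snippet above (assuming scikit-learn and scikit-learn-extra are installed; the plotting details are only an approximation of the figure described):

from sklearn.cluster import KMeans
from sklearn_extra.cluster import KMedoids

kmedoids = KMedoids(n_clusters=2).fit(points)
kmeans = KMeans(n_clusters=2).fit(points)

# mark the cluster centers of both algorithms with blue crosses
plt.plot(kmedoids.cluster_centers_[:, 0], kmedoids.cluster_centers_[:, 1],
         "bx", markersize=12, label="K-medoids centers")
plt.plot(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
         "b+", markersize=12, label="K-means centers")
plt.legend()
plt.show()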
2. Business case justification
Here are some example business cases showing why we would prefer k-medoids. They mostly come down to the interpretability of the results and the fact that in k-medoids the resulting cluster centers are members of the original dataset.
2.1 We have a recommender engine based only on user-item preference data and want to recommend to the user those items (e.g. movies) that other similar people enjoyed. So we assign the user to his/her closest cluster and recommend the top movies that the cluster representative (an actual person) watched. If the cluster representative weren't an actual person, we wouldn't have a history of actually watched movies to recommend; each time we would additionally have to search for, e.g., the closest person in the cluster. Example data: the classic MovieLens 1M Dataset.
2.2 We have a database of patients and want to pick a small representative group of size K to test a new drug on. After clustering the patients with k-medoids, the cluster representatives are invited to the drug trial.
The difference is that in k-means the centroids (cluster centers) are calculated as the average of the vectors contained in the cluster, while in k-medoids the medoid (cluster center) is the record from the dataset closest to the centroid. So if you need to represent the cluster center by a record of your data, use k-medoids; otherwise use k-means (but the concept of these algorithms is the same).
The K-Means algorithm uses a distance function such as Euclidean distance or Manhattan distance, which is computed over vector-based instances. The K-Medoids algorithm instead uses a more general (and less constrained) distance function: a pairwise distance function.
This distinction works well in contexts like complex data types or relational rows, where the instances have a high number of dimensions.
High dimensionality problem
In standard clustering libraries and k-means implementations, the distance computation phase can spend a lot of time scanning the entire vector of attributes that belongs to an instance. For instance, in document clustering with the standard TF-IDF representation, computing the cosine similarity means the distance function scans all the words that appear in the whole collection of documents, which in many cases can be millions of entries. This is why, in this domain, some authors [1] suggest restricting the words considered to a subset of the N most frequent words of that language.
Using K-Medoids there is no need to represent and store the documents as vectors of word frequencies.
As an alternative representation for the documents, it is possible to use the set of words appearing at least twice in the document, and as a distance measure, the Jaccard distance can be used.
With the vector representation, by contrast, each vector is as long as the number of words in your dictionary.
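As a small illustration of that representation and distance (the whitespace tokenization here is naive and purely for illustration):

def frequent_words(text):
    # set of words appearing at least twice in the document
    words = text.lower().split()
    return {w for w in words if words.count(w) >= 2}

def jaccard_distance(doc_a, doc_b):
    a, b = frequent_words(doc_a), frequent_words(doc_b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

print(jaccard_distance("the cat sat on the mat the cat slept",
                       "the dog and the cat chased the ball"))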
Heterogeneity and complex data types
There are many domains where it is considerably better to abstract away the implementation of an instance:
Graph's nodes clustering;
Car driving behaviour, represented as GPS routes;
Complex data types allow the design of ad hoc distance measures which can fit the data domain better.
[1] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA.
Source: https://github.com/eracle/Gap
Advice please. I have a collection of documents that all share a common attribute (e.g. the word "French" appears). Some of these documents have been marked as not pertinent to this collection (e.g. "French kiss" appears), but not all such documents are guaranteed to have been identified. What is the best method to figure out which other documents don't belong?
Assumptions
Given your example "French", I will work under the assumption that the feature is a word that appears in the document. Also, since you mention that "French kiss" is not relevant, I will further assume that in your case, a feature is a word used in a particular sense. For example, if "pool" is a feature, you may say that documents mentioning swimming pools are relevant, but those talking about pool (the sport, like snooker or billiards) are not relevant.
Note: Although word sense disambiguation (WSD) methods would work, they require too much effort and are overkill for this purpose.
Suggestion: localized language model + bootstrapping
Think of it this way: You don't have an incomplete training set, but a smaller training set. The idea is to use this small training data to build bigger training data. This is bootstrapping.
For each occurrence of your feature in the training data, build a language model based only on the words surrounding it. You don't need to build a model for the entire document. Ideally, just the sentences containing the feature should suffice. This is what I am calling a localized language model (LLM).
Build two such LLMs from your training data (let's call it T_0): one for pertinent documents, say M1, and another for irrelevant documents, say M0. Now, to build a bigger training data, classify documents based on M1 and M0. For every new document d, if d does not contain the feature-word, it will automatically be added as a "bad" document. If d contains the feature-word, then consider a local window around this word in d (the same window size that you used to build the LLMs), and compute the perplexity of this sequence of words with M0 and M1. Classify the document as belonging to the class which gives lower perplexity.
To formalize, the pseudo-code is:
T_0 := initial training set (consisting of relevant/irrelevant documents)
D0 := additional data to be bootstrapped
N := iterations for bootstrapping
for i = 0 to N-1
    T_i+1 := empty training set
    Build M0 and M1 as discussed above using a window-size w
    for d in D0
        if feature-word not in d
            then add d to irrelevant documents of T_i+1
        else
            compute perplexity scores P0 and P1 corresponding to M0 and M1 using
                window size w around the feature-word in d.
            if P0 < P1 - delta
                add d to irrelevant documents of T_i+1
            else if P1 < P0 - delta
                add d to relevant documents of T_i+1
            else
                do not use d in T_i+1
            end
        end
    end
    Select a small random sample from relevant and irrelevant documents in
    T_i+1, and (re)classify them manually if required.
end
T_N is your final training set. In the bootstrapping above, the parameter delta needs to be determined through experiments on some held-out data (also called development data).
The manual reclassification on a small sample is done so that the noise during this bootstrapping is not accumulated through all the N iterations.
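If it helps to make the idea concrete, here is a toy Python sketch of a "localized language model" and its perplexity, using a Laplace-smoothed unigram model over a fixed window (the window size, tokenization and the unigram simplification are all assumptions for illustration, not the exact model intended above):

import math
from collections import Counter

def local_window(document, feature_word, w=5):
    # the w words before and after the feature word (empty if the word is absent)
    words = document.lower().split()
    if feature_word not in words:
        return []
    i = words.index(feature_word)
    return words[max(0, i - w):i] + words[i + 1:i + 1 + w]

class LocalLM:
    def __init__(self, windows):
        self.counts = Counter(word for win in windows for word in win)
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1          # +1 for unseen words

    def perplexity(self, window):
        if not window:
            return float("inf")
        log_p = sum(math.log((self.counts[w] + 1) / (self.total + self.vocab))
                    for w in window)
        return math.exp(-log_p / len(window))

# M1 built from relevant documents, M0 from irrelevant ones (tiny made-up examples)
m1 = LocalLM([local_window("she learned french at a paris school", "french")])
m0 = LocalLM([local_window("the famous french kiss scene in the film", "french")])

d = "he teaches french at a school in paris"
win = local_window(d, "french")
print(m0.perplexity(win), m1.perplexity(win))      # classify by the lower perplexity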
Firstly, you should take care of how you extract features from the sample docs. Counting every word is not a good way; you might need a technique like TF-IDF to teach the classifier which words are important for classification and which are not.
Build the right dictionary. In your case, the phrase "French kiss" should be a single token instead of the sequence "French" + "kiss". Using the right technique to build the dictionary is important.
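As an illustration of both points, a hedged scikit-learn sketch (scikit-learn is an assumption here, not something prescribed above) that builds a dictionary in which "french kiss" survives as a single feature next to the unigrams:

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "the french alps are beautiful in winter",
    "a french kiss is mentioned in this review",
    "french cuisine and french wine",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))   # unigrams and bigrams
X = vectorizer.fit_transform(documents)
print([f for f in vectorizer.get_feature_names_out() if f.startswith("french")])
# ['french', 'french alps', 'french cuisine', 'french kiss', 'french wine']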
The remaining errors in the samples are normal; we call this "not linearly separable". There is a huge amount of advanced research on how to solve this problem. For example, an SVM (support vector machine) might be what you want to use. Please note that a single-layer Rosenblatt perceptron usually performs very badly on datasets that are not linearly separable.
Some kinds of neural networks (like the Rosenblatt perceptron) can be trained on an erroneous data set and can perform better than the trainer. Moreover, in many cases you should allow errors to avoid over-training.
You can label all unlabeled documents randomly, train several nets, and estimate their performance on the test set (of course, you should not include unlabeled documents in the test set). After that you can iteratively recalculate the weights of the unlabeled documents as w_i = sum over j of quality(j) * w_ij, then repeat the training, recalculate the weights, and so on. Because this procedure is equivalent to introducing a new hidden layer and recalculating its weights by a Hebbian procedure, the overall procedure should converge if your positive and negative sets are linearly separable in some network feature space.
I have a number of timetables for a train network that are of the form -
Start Location (Time) -> Stop 1 (time1) -> Stop 2 (time2) -> ... End Location
I.e. each route/timetable is composed of a sequence of stops which occur at ascending times. The routes are repeated multiple times per day.
Eventually I want to build a pathfinding algorithm on top of this data. Initially this would return paths from a single timetable/route. But eventually I would like to be able to calculate optimal journeys across more than one route.
My question is therefore, what is the best way of storing this data to make querying routes as simple as possible? I imagine a query being of the format...
Start Location: x, End Location: y, At Time: t
If you are doing path finding, a lot of path-finding algorithms work by following the shortest path segment to the next node and then querying paths from that node. So your queries will end up being: all segments from station x at or after time t, but only the earliest one for each distinct destination.
If you have a route from Washington, DC to Baltimore, your stop 1 and stop 2 might be New Carrollton and Aberdeen. So you might store:
id (auto-increment), from_station_id, to_station_id, departure_time, arrival_time
You might store a record for Washington to New Carrollton, a record for New Carrollton to Aberdeen, and a record for Aberdeen to Baltimore. However, I would only include these stops if (a) they are possible origins and destinations for your trip planning, or (b) there is some significant connecting route (not just getting off the train and taking the next one on the same route).
Your path-finding algorithm is going to have a step (in a loop) of starting from the node with the lowest current cost (earliest arrival), listing the next segments, and the nodes those segments bring you to.
select segments.*
from segments
inner join segments compare_seg
        on segments.to_station_id = compare_seg.to_station_id
where segments.from_station_id = ?        -- origin station x
  and segments.departure_time >= ?        -- at or after time t
  and compare_seg.from_station_id = ?
  and compare_seg.departure_time >= ?
group by segments.id
having segments.arrival_time = min(compare_seg.arrival_time)
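To connect this with the path-finding loop described above, here is a rough Python sketch of an earliest-arrival search over such segment records (the in-memory list of tuples stands in for whatever storage layer you end up querying; the names are made up):

import heapq
from collections import defaultdict

def earliest_arrival(segments, origin, destination, start_time):
    # segments: iterable of (from_station, to_station, departure_time, arrival_time)
    by_origin = defaultdict(list)
    for seg in segments:
        by_origin[seg[0]].append(seg)

    best = {origin: start_time}                 # earliest known arrival per station
    queue = [(start_time, origin)]
    while queue:
        arrival, station = heapq.heappop(queue)
        if station == destination:
            return arrival
        if arrival > best.get(station, float("inf")):
            continue                            # stale queue entry
        for _, nxt, dep, arr in by_origin[station]:
            # only board segments departing at or after our arrival at this station
            if dep >= arrival and arr < best.get(nxt, float("inf")):
                best[nxt] = arr
                heapq.heappush(queue, (arr, nxt))
    return None                                 # destination unreachable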