Storing Route Data - database

I have a number of timetables for a train network that are of the form -
Start Location (Time) -> Stop 1 (time1) -> Stop 2 (time2) -> ... End Location
I.e. each route/timetable is composed of a sequence of stops which occur at ascending times. The routes are repeated multiple times per day.
Eventually I want to build a pathfinding algorithm on top of this data. Initially this would return paths from a single timetable/route. But eventually I would like to be able to calculate optimal journeys across more than one route.
My question is therefore, what is the best way of storing this data to make querying routes as simple as possible? I imagine a query being of the format...
Start Location: x, End Location: y, At Time: t

If you are doing path finding, a lot of path finding algorithms work by following the cheapest segment to the next node and then querying the segments available from that node. So your queries will end up being: all segments departing station x at or after time t, keeping only the earliest-arriving one for each distinct destination.
If you have a route from Washington, DC to Baltimore, your stop 1 and stop 2 might be New Carrolton and Aberdeen. So you might store:
id (auto-increment), from_station_id, to_station_id, departure_time, arrival_time
You might store a record for Washington to New Carrolton, a record for New Carrolton to Aberdeen, and a record for Aberdeen to Baltimore. However, I would only include these stops if (a) they are possible origins and destinations for your trip planning, or (b) there is some significant connecting route (not just getting off the train and taking the next one on the same route).
Your path finding algorithm is going to have a step (in a loop) of starting from the node with the lowest current cost (earliest arrival), listing the next segments from it, and the nodes those segments bring you to.
select segments.*
from segments
inner join segments compare_seg
   on compare_seg.from_station_id = segments.from_station_id
  and compare_seg.to_station_id = segments.to_station_id
  and compare_seg.departure_time >= ?
where segments.from_station_id = ?
  and segments.departure_time >= ?
group by segments.id
having segments.arrival_time = min(compare_seg.arrival_time)
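To make that loop concrete, here is a minimal in-memory sketch of the earliest-arrival search (plain Python rather than SQL; the tuple layout and function name are invented for illustration):

import heapq

def earliest_arrival(segments, start, goal, depart_after):
    """segments: iterable of (from_id, to_id, departure_time, arrival_time).
    Returns the earliest arrival time at goal, or None if unreachable."""
    best = {start: depart_after}          # earliest known time we can be at each station
    frontier = [(depart_after, start)]    # min-heap ordered by arrival time
    while frontier:
        arrival, station = heapq.heappop(frontier)
        if station == goal:
            return arrival
        if arrival > best.get(station, float("inf")):
            continue                      # stale heap entry
        for from_id, to_id, dep, arr in segments:
            # only trains we can still catch from the station we are standing at
            if from_id == station and dep >= arrival and arr < best.get(to_id, float("inf")):
                best[to_id] = arr
                heapq.heappush(frontier, (arr, to_id))
    return None

In practice the inner loop over all segments would be replaced by the query above, restricted to the station currently being expanded.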

Related

Tree evaluation in Flink

I have a use case where I want to build a realtime decision tree evaluator using Flink. I have a decision tree something like below:
Decision tree example:
Root node (Product A) --> check if the price of Product A increased by $10 in the last 10 mins
If yes --> left child of A (Product B) --> check if the price of Product B increased by $20 in the last 10 mins --> if not, output Product B
If no --> right child of A (Product C) --> check if the price of Product C increased by $20 in the last 10 mins --> if not, output Product C
Note: This is just example of one decision tree, I have multiple such decision trees with different product type/number of nodes and different conditions. Want to write a common Flink app to evaluate all these.
Now as input I am getting a data stream with the prices of all product types (A, B and C) every 1 min. To achieve my use case, one approach I can think of is as follows:
Filter input stream by product type
For each product type, use a sliding window over the last X mins (X depends on the product type), triggered every minute
Use a process window function to compute the price difference for that product type and emit it to an output stream
Now that we have the price difference for each product type (i.e. for each node of the tree), we can evaluate the decision tree logic. To do this, we have to make sure that the price-diff calculation for every product type in a decision tree (Products A, B and C in the above example) has completed before determining the output. One way is to store the outputs of all these products from the output stream in a datastore and poll from an EC2 instance every 5s or so until all the price computations are done, then execute the decision tree logic to determine the output product.
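For illustration only, the tree-evaluation step itself (once the per-window price diffs for every product are available) can be kept generic; a plain-Python sketch with invented names, which could later sit inside whatever Flink function ends up doing this:

class Node:
    def __init__(self, product, threshold, yes=None, no=None):
        self.product = product      # product whose price diff this node checks
        self.threshold = threshold  # "increased by $threshold in the window"
        self.yes = yes              # child to follow when the condition holds
        self.no = no                # child to follow when it does not

def evaluate(node, price_diffs):
    """price_diffs: dict mapping product -> price change over the window."""
    while node is not None:
        holds = price_diffs.get(node.product, 0) >= node.threshold
        child = node.yes if holds else node.no
        if child is None:
            # in the example, a product is output when its own condition
            # does not hold and there is nothing further to check
            return None if holds else node.product
        node = child
    return None

# the example tree from above
tree = Node("A", 10, yes=Node("B", 20), no=Node("C", 20))
print(evaluate(tree, {"A": 12, "B": 5, "C": 0}))   # -> "B"

Keeping each tree as data like this is what would let a single job evaluate many different trees.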
Wanted to understand if there is any other way this entire computation can be done in Flink itself without needing any other components (datastore/EC2). I am fairly new to Flink so any leads would be highly appreciated!

Alternative 1 of index data entry

One of the three alternatives of what to store as a data entry in an index is a data entry k* which is an actual record with search key value k. My question is, if you're going to store the actual record in the index, then what's the point of creating the index?
This is an extreme case, because it does not really correspond to having data entries separated from data records (the hashed file is an example of this case).
(M. Lenzerini, R. Rosati, Database Management Systems: Access file manager and query evaluation, "La Sapienza" University of Rome, 2016)
Alternative 1 is often used for direct indexing, for instance in B-trees and hash indexes (see also Oracle, Building Domain Indexes)
Let's do a concrete example.
We have a relation R(a,b,c) and we have a clustered B+⁠-⁠tree using alternative 2 on search key a. Since the tree is clustered, the relation R must be sorted by a.
Now, let's suppose that a common query for the relation is:
SELECT *
FROM R
WHERE b > 25
so we want to build another index to efficiently support this kind of query.
Case 1: clustered tree with alt. 2
We know that clustered B+-trees with alternative 2 are efficient for range queries, because they just need to search for the first good result (say the one with b=25), do 1 page access to the relation's page that this result points to, and finally scan that page (and possibly some following pages) as long as the records fall within the given range.
To sum up:
Search for the first good result in the tree. Cost: logƒ(ℓ)
Use the found pointer to go to a specific page. Cost: 1
Scan that page and any other relevant pages. Cost: num. of relevant pages
The final cost (expressed in terms of page accesses) is
logƒ(ℓ) + 1 + #relevant-pages
where ƒ is the fan-out and ℓ the number of leaves.
Unfortunately, in our case a tree on search key b must be unclustered, because the relation is already sorted by a.
Case 2: unclustered tree with alt. 2 (or 3)
We also know that B+-trees are not so efficient for range queries when they are unclustered. In fact, with a tree using alternative 2 or 3, the tree stores only pointers to the records, so for each result that falls in the range we'd have to do a page access to a potentially different page (because the relation is ordered differently from the index).
To sum up:
Search for the first good result in the tree. Cost: logƒ(ℓ)
Continue scanning the leaf (and maybe other leaves) and do a separate page access for each tuple that falls in the range. Cost: num. of other relevant leaves + num. of relevant tuples
The final cost (expressed in terms of page accesses) is
logƒ(ℓ) + #other-relevant-leaves + #relevant-tuples
notice that the number of tuples is much bigger than the number of pages!
Case 3: unclustered tree with alt. 1
Using alternative 1, we have all the data in the tree, so to execute the query we:
Search for the first good result in the tree. Cost: logƒ(ℓ)
Continue scanning the leaf (and maybe other leaves). Cost: num. of other relevant leaves
The final cost (expressed in terms of page accesses) is
logƒ(ℓ) + #other-relevant-leaves
which is even smaller than (or at most equal to) the cost of case 1; and unlike case 1, this option is actually available here.
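As a rough worked example (the numbers are invented purely for illustration): suppose ƒ = 100, ℓ = 10,000 leaves, and the range matches 1,000 tuples stored on about 50 data pages. Then case 1 would cost log₁₀₀(10,000) + 1 + 50 = 2 + 1 + 50 = 53 page accesses (but is not available here), case 2 costs about 2 + a few relevant leaves + 1,000 ≈ 1,000 page accesses, and case 3 costs about 2 + 50 ≈ 52 page accesses, since its leaves hold the full records and therefore roughly coincide with the 50 data pages. Only the unclustered alternative-2 (or 3) index blows up.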
I hope I was clear enough.
N.B. The cost is expressed in terms of page accesses because I/O operations from/to secondary storage are the most expensive ones in terms of time (we ignore the cost of scanning a page in main memory and consider just the cost of accessing it).

Data structure for a sum of numbers that keeps the balance updated

I have a table with millions of transactions of a single account. Each transaction contains:
moment - Timestamp when the transaction happened.
sequence - A number to sort transactions that happen at exact same moment.
description, merchant, etc - overall information.
amount - The monetary value of the transaction, which may be positive or negative.
balance - The account balance after the transaction (the sum of this and all previous amounts). This is computed by the system.
What data structure is optimized for quickly displaying or updating the correct balance of all transactions, assuming the user can insert, delete or modify the amount of very old transactions?
My current option is organizing the transactions in a B-tree of order M, then storing the sum of the amounts in each node. Then if some very old transaction is updated, I only update the corresponding node's sum and all its parents up to the root, which is very fast. It also allows me to show the total balance with a single read of the root node. However, in order to display the right balance for later records, I eventually need to read M nodes, which is kind of slow assuming each node is on cloud storage.
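In memory, the same per-node partial-sum idea looks roughly like a Fenwick (binary indexed) tree; a small illustrative Python sketch (not the actual on-cloud B-tree):

class Fenwick:
    """Prefix sums with O(log n) point updates, analogous to the per-node sums above."""
    def __init__(self, n):
        self.tree = [0] * (n + 1)

    def add(self, i, delta):
        """Add delta to the amount of the transaction at 1-based position i."""
        while i < len(self.tree):
            self.tree[i] += delta
            i += i & (-i)

    def balance(self, i):
        """Sum of the amounts of transactions 1..i, i.e. the balance after transaction i."""
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & (-i)
        return total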
Is there a better solution?
The solution with the B-tree may be enhanced further. You may store a list of delta modifications in RAM. This list (which may itself be a binary tree) contains only updates and is sorted by timestamp.
For example, this list may look like the following at some point:
(t1, +5), (t10, -6), (t15, +80)
This means that when you need to display the balance of a transaction with timestamp
less than t1 - do nothing
in [t1, t10) - add 5
in [t10, t15) - subtract 6
in [t15, inf) - add 80
Now suppose that we need to make the modification (t2, -3). We:
Insert a node for t2 into the list at the proper position
Update all nodes to the right of it with the delta (-3)
Set the new node's value to its left neighbor's value plus the delta (+5 - 3 = +2)
List becomes:
(t1, +5), (t2, +2), (t10, -9), (t15, +77)
Eventually, when the delta list becomes large, you will need to apply it to your B-tree.
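A minimal in-memory sketch of such a delta list (plain Python; the names and the flat-list storage are just for illustration, a balanced tree would avoid the linear shift on insert):

import bisect

class DeltaList:
    """In-RAM list of cumulative balance corrections, sorted by timestamp."""
    def __init__(self):
        self.times = []   # sorted timestamps
        self.deltas = []  # cumulative correction for balances in [times[i], times[i+1])

    def correction_for(self, t):
        """Amount to add to a stored balance at timestamp t."""
        i = bisect.bisect_right(self.times, t) - 1
        return self.deltas[i] if i >= 0 else 0

    def apply_modification(self, t, delta):
        """Record that an amount changed by `delta` at timestamp t."""
        i = bisect.bisect_right(self.times, t)
        left = self.deltas[i - 1] if i > 0 else 0
        for j in range(i, len(self.deltas)):
            self.deltas[j] += delta          # every entry at or after t shifts by delta
        self.times.insert(i, t)
        self.deltas.insert(i, left + delta)  # new node = left neighbor's value + delta

Replaying the example: starting from (t1, +5), (t10, -6), (t15, +80), apply_modification(t2, -3) yields (t1, +5), (t2, +2), (t10, -9), (t15, +77), as described above.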

Building an array of numbers, no repeating, order doesn't matter (Gaming stats)

So I play heroes of newerth. I have the desire to make a statistical program that shows which team of 5 heroes vs another 5 heroes wins the most. Given there are 85 heroes and games are 85 choose 5 vs 80 choose 5, that's a lot of combinations.
Essentially I'm going to take the stats data the game servers allow me to get and just put a 1 in an array keyed by the heroes when they get a win: [1,2,3,4,5][6,7,8,9,10][W:1][L:0]
So after I parse and build the array from the historical game data, I can put in what 5 heroes I want to see, and I can get back all the relevant game data telling me which 5 hero lineup has won/lost the most.
What I need help getting started with is a simple algorithm to write out my array. Here's the kind of output I need (I have simplified this to 1-10; in the real code I can just change 10 to x for however many heroes there are):
[1,2,3,4,5][6,7,8,9,10]
[1,2,3,4,6][5,7,8,9,10]
[1,2,3,4,7][5,6,8,9,10]
[1,2,3,4,8][5,6,7,9,10]
[1,2,3,4,9][5,6,7,8,10]
[1,2,3,4,10][5,6,7,8,9]
[1,2,3,5,6][4,7,8,9,10]
[1,2,3,5,7][4,6,8,9,10]
[1,2,3,5,8][4,6,7,9,10]
[1,2,3,5,9][4,6,7,8,10]
[1,2,3,5,10][4,6,7,8,9]
[1,2,3,6,7][4,5,8,9,10]
[1,2,3,6,8][4,5,7,9,10]
[1,2,3,6,9][4,5,7,8,10]
[1,2,3,6,10][4,5,7,8,9]
[1,2,3,7,8][4,5,6,9,10]
[1,2,3,7,9][4,5,6,8,10]
[1,2,3,7,10][4,5,6,8,9]
[1,2,3,8,9][4,5,6,7,10]
[1,2,3,8,10][4,5,6,7,9]
[1,2,3,9,10][4,5,6,7,8]
[1,2,4,5,6][3,7,8,9,10]
[1,2,4,5,7][3,6,8,9,10]
[1,2,4,5,8][3,6,7,9,10]
[1,2,4,5,9][3,6,7,8,10]
[1,2,4,5,10][3,6,7,8,9]
[1,2,4,6,7][3,5,8,9,10]
[1,2,4,6,8]...
[1,2,4,6,9]
[1,2,4,6,10]
[1,2,4,7,8]
[1,2,4,7,9]
[1,2,4,7,10]
[1,2,4,8,9]
[1,2,4,8,10]
[1,2,4,9,10]
...
You get the idea. No repeating, and order doesn't matter. It's essentially cut in half too, since the order of the two arrays doesn't matter either. I just need a list of all the combinations of teams that can be played against each other.
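For the enumeration itself, something along these lines would produce exactly that listing (Python sketch; it splits heroes 1..n into two full teams as in the simplified example above, not the general pick-10-of-85 case):

import itertools

def team_splits(n=10, team_size=5):
    """Yield every split of heroes 1..n into two teams, ignoring team order."""
    heroes = list(range(1, n + 1))
    for team_a in itertools.combinations(heroes, team_size):
        # fixing hero 1 on the first team removes the mirrored duplicates
        if team_a[0] == heroes[0]:
            team_b = [h for h in heroes if h not in team_a]
            yield list(team_a), team_b

for a, b in team_splits():
    print(a, b)   # [1, 2, 3, 4, 5] [6, 7, 8, 9, 10], [1, 2, 3, 4, 6] [5, 7, 8, 9, 10], ...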
EDIT: additional thinking...
After quite a bit of thinking, I have come up with some ideas. Instead of writing out the entire array of [85*84*83*82*81][80*79*78*77*76*75] possible combinations of characters up front (which would have to be enlarged every time new heroes are introduced to keep the array relevant), I could build the array while parsing the information read from the server.
It would be much simpler to just create an element in the array when one is not found, i.e. when that combination has never been played before. Parsing the data would then be a single pass, building the array as it goes. Yes, it might take a while, but the values that are created will be worth the wait. It can also be done over time: starting with a small test case, say 1000 games, and working up to the number of matches that have been played. Another idea would be to start from the current point in time and build the database from there. There is no need to go back to the first games ever played, given the amount of changes to heroes over that time frame; instead go back 2-3 months to give it some foundation and reliable data, and it only gets more accurate with each passing day.
Example parse and build of the array:
get match(x)
if length < 15/25, x++;        // decide what match lengths we want; discard anything shorter than 15 for sure
if players != 10, x++;         // skip the match because it didn't finish with 10 players
if map != normal_mm_map, x++;  // rule out non-MM games and Mid Wars
if mode != mm, x++;            // rule out custom games
// and so forth
match_psr = match(x).get(average_psr);
match_winner = match(x).get(winner);
// hero ids of winners
Wh1 = match(x).get(winner.player1(hero_id))
Wh2 = match(x).get(winner.player2(hero_id))
Wh3 = match(x).get(winner.player3(hero_id))
Wh4 = match(x).get(winner.player4(hero_id))
Wh5 = match(x).get(winner.player5(hero_id))
// hero ids of losers
Lh1 = match(x).get(loser.player1(hero_id))
Lh2 = match(x).get(loser.player2(hero_id))
Lh3 = match(x).get(loser.player3(hero_id))
Lh4 = match(x).get(loser.player4(hero_id))
Lh5 = match(x).get(loser.player5(hero_id))
// sort Wh1-5 by hero id from smallest to largest
// sort Lh1-5 by hero id from smallest to largest
key = [Wh1, Wh2, Wh3, Wh4, Wh5], [Lh1, Lh2, Lh3, Lh4, Lh5]
if (array(key) != null)
    array(key).wins += 1;              // plus something with psr
else
    array.add_element(key, wins = 1);  // first time this matchup has been seen
Any thoughts?
Encode each actor in the game using a simple scheme: 0 ... 84.
You can maintain a 2D matrix of 85*85 entries, one for each pair of actors.
Initialize each entry in this array to zero.
Now use just the upper triangular portion of your matrix.
So, for any two players P1,P2 you have a unique entry in the array, say array[small(p1,p2)][big(p1,p2)].
array(p1,p2) signifies how much p1 won against p2.
Your event loop can be like this (written here as Python):

# wins[small][big] aggregates results between those two heroes:
# positive means the smaller id tends to beat the larger id, negative the opposite.
wins = [[0] * 85 for _ in range(85)]

for H, L in stats:            # H and L are 5-tuples of hero ids: winners, losers
    for h in H:
        for l in L:
            if h < l:
                wins[h][l] += 1    # h (smaller id) won against l
            else:
                wins[l][h] -= 1    # l (smaller id) lost against h
Now, at the end of this loop, you have aggregate information about how the actors perform against each other. The next step is an interesting optimization problem.
Wrong approach: select 5 fields in this matrix such that no two of them share a row or a column and the sum of their absolute values is maximum. I think you can find good optimization algorithms for this problem. Here we would pick five pairs (h1,l1), (h2,l2), (h3,l3) ... such that how often h1 wins against l1 is maximized, but this still ignores whether l1 is also good against h2.
The easier and correct option is to use brute force on the set of (85*84)C5 tuples.

Web stats: Calculating/estimating unique visitors for arbitrary time intervals

I am writing an application which is recording some 'basic' stats -- page views, and unique visitors. I don't like the idea of storing every single view, so have thought about storing totals with a hour/day resolution. For example, like this:
Tuesday 500 views 200 unique visitors
Wednesday 400 views 210 unique visitors
Thursday 800 views 420 unique visitors
Now, I want to be able to query this data set on chosen time periods -- ie, for a week. Calculating views is easy enough: just addition. However, adding unique visitors will not give the correct answer, since a visitor may have visited on multiple days.
So my question is how do I determine or estimate unique visitors for any time period without storing each individual hit. Is this even possible? Google Analytics reports these values -- surely they don't store every single hit and query the data set for every time period!?
I can't seem to find any useful information on the net about this. My initial instinct is that I would need to store 2 sets of values with different resolutions (ie day and half-day), and somehow interpolate these for all possible time ranges. I've been playing with the maths, but can't get anything to work. Do you think I may be on to something, or on the wrong track?
Thanks,
Brendon.
If you are OK with approximations, I think tom10 is onto something, but his notion of a random subsample is not quite right, or at least needs clarification. If I have a visitor that comes on day 1 and day 2 but is sampled only on day 2, that is going to introduce a bias in the estimation.
What I would do is store full information for a random subsample of users (say, all users whose hash(id) % 100 == 1). Then you do the full calculations on the sampled data and multiply by 100. Yes, tom10 said roughly that, but there are two differences: he said to sample "for example" based on the ID, whereas I say that's the only way you should sample, because you are interested in unique visitors. If you were interested in unique IPs or unique ZIP codes or whatever, you would sample accordingly. The quality of the estimation can be assessed using the normal approximation to the binomial if your sample is big enough.
Beyond this, you can try to use a model of user loyalty: for example, you observe that over 2 days 10% of visitors visit on both days, over three days 11% of visitors visit twice and 5% visit once, and so forth up to some maximum number of days. Unfortunately these numbers can depend on the time of the week and the season, and even if you model those, loyalty changes over time as the user base matures, changes in composition, and the service itself changes, so any such model needs to be re-estimated. My guess is that in 99% of practical situations you'd be better served by the sampling technique.
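A minimal sketch of that hash-based sampling (Python; the 1-in-100 rate, the hashing scheme and the names are arbitrary choices for illustration):

import hashlib

SAMPLE_RATE = 100  # keep roughly 1 visitor in 100

def in_sample(visitor_id):
    """Deterministic: the same visitor is either always sampled or never."""
    digest = hashlib.md5(str(visitor_id).encode()).hexdigest()
    return int(digest, 16) % SAMPLE_RATE == 1

def estimate_uniques(daily_logs):
    """daily_logs: iterable of iterables of visitor ids, one inner iterable per day."""
    seen = set()
    for day in daily_logs:
        seen.update(v for v in day if in_sample(v))
    return len(seen) * SAMPLE_RATE  # scale the sampled count back up

Because the sample is keyed on the visitor ID, a repeat visitor is either counted in full across all days or not at all, which is what removes the bias described above.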
You could store a random subsample of the data, for example, 10% of the visitor IDs, then compare these between days.
The easiest way to do this is to store a random subsample of each day for future comparisons, but then, for the current day, temporarily store all your IDs and compare them to the subsampled historical data and determine the fraction of repeats. (That is, you're comparing the subsampled data to a full dataset for a given day and not comparing two subsamples -- it's possible to compare two subsamples and get an estimate for the total but the math would be a bit trickier.)
You don't need to store every single view, just each unique session ID per hour or day depending on the resolution you need in your stats.
You can keep these log files containing session IDs sorted to count unique visitors quickly, by merging multiple hours/days. One file per hour/day, one unique session ID per line.
In *nix, a simple one-liner like this one will do the job:
$ sort -m sorted_sid_logs/2010-09-0[123]-??.log | uniq | wc -l
It counts the number of unique visitors during the first three days of September.
You can calculate the uniqueness factor (UF) for each day and use it to calculate the composite UF for a longer period (a week, for example).
Let's say that you counted:
100 visits and 75 unique session IDs on Monday (you have to store the session IDs at least for a day, or for whatever period you use as the unit).
200 visits and 100 unique session IDs on Tuesday.
If you want to estimate the UF for the period Mon+Tue you can do:
UV = UVmonday + UVtuesday = TVmonday*UFmonday + TVtuesday*UFtuesday
being:
UV = Unique Visitors
TV = Total Visits
UF = Uniqueness Factor
So...
UV = (Sum(TVi*UFi))
UF = UV / TV
TV = Sum(TVi)
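Plugging in the example numbers above: UFmonday = 75/100 = 0.75 and UFtuesday = 100/200 = 0.5, so UV = 100*0.75 + 200*0.5 = 175, TV = 100 + 200 = 300, and the composite UF = 175/300 ≈ 0.58.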
I hope it helps...
This math counts two visits by the same person on different days as two unique visitors. I think it's OK if the only way you have to identify somebody is via the session ID.
