Find all paths to one ultimate parent in a dataset of parent-child ownership, using recursion in Python

I have the following example dataset with 3 columns: child entity, parent entity, and the ownership percentage of the parent in that child:
Child  Parent  Ownership
A      B       50%
C      D       10%
B      C       20%
E      D       30%
B      E       40%
Note that D is the ultimate parent, and I want to use Python to calculate the ownership percentage of D in each of the child entities. To do this, I need to find all paths that lead to D, multiply the ownership percentages up each chain, and add up the results if there are multiple paths leading to D. See the result I want to reach below:
For A: [A->B->C->D], [A->B->E->D]. D_ownership = 50%*20%*10% + 50%*40%*30% = 7%
For B: [B->C->D],[B->E->D]. D_ownership = 20%*10% + 40%*30% = 14%
For C: [C->D]. D_ownership = 10%
For E: [E->D]. D_ownership = 30%
At first I used a nested for loop: a first loop finds the parent of each child, then I use that parent as the new child and loop through the dataset again to find the next-level parent, until I reach D. However, I realized this is way too complicated. I did a lot of research and found that this is a recursive problem, and that maybe I should use breadth-first search or depth-first search to find all the paths. But the dataset I have is not a graph yet and I'm not sure what the nodes would be. So I want to ask for inspiration on how to solve this. I think there's a much easier solution than what I had, but I can't think of it yet.
Thanks all!

Your dataframe represents a weighted directed acyclic graph (DAG), a fancy name for "something that looks kinda like a tree".
At first I used a nested for loop where I first use a loop to find the parent of each child, I use that parent as the new child and loop through the dataset again to find the next level parent until I reach D.
The idea is good, and this will work. However, you can make it a bit simpler and a lot more efficient if you first convert your data to a different data structure representing the same information, but one in which finding the children of each parent, or the parents of each child, can be done directly without a loop. In Python, such a structure is called a dict. So I suggest building a dict, either mapping each parent to its list of children, or mapping each child to its list of parents. In the code below I mapped each parent to its list of children, but the other way around would have worked too.
However, I realized this is way too complicated, and did a lot of research and found that this is a recursive problem and maybe I should use breadth first search or depth first search to find all the paths.
Very good observation. You can solve this problem with either depth-first search or breadth-first search; there is no reason to prefer one over the other here. However, depth-first search is somewhat simpler to implement with a recursive function, so I suggest using depth-first search.
import pandas as pd
from collections import defaultdict

df = pd.DataFrame({'Child': list('ACBEB'), 'Parent': list('BDCDE'), 'Ownership': [.5, .1, .2, .3, .4]})

dag_as_dict = defaultdict(list)  # {parent: list of (child, weight)}
for idx, row in df.iterrows():
    dag_as_dict[row['Parent']].append((row['Child'], row['Ownership']))

print(dag_as_dict)
# defaultdict(<class 'list'>,
#             {'B': [('A', 0.5)],
#              'D': [('C', 0.1), ('E', 0.3)],
#              'C': [('B', 0.2)],
#              'E': [('B', 0.4)]})

def get_weights_of_descendants(dag_as_dict, root, root_weight=1.0, result=None):
    if result is None:
        result = defaultdict(float)
    for child, weight in dag_as_dict[root]:
        new_weight = weight * root_weight
        result[child] += new_weight
        get_weights_of_descendants(dag_as_dict, child, new_weight, result)  # result is mutable so it will be updated by this recursive call
    return result

weights = get_weights_of_descendants(dag_as_dict, 'D')
print(weights)
# defaultdict(<class 'float'>, {'C': 0.1, 'B': 0.14, 'A': 0.07, 'E': 0.3})
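For comparison, the breadth-first version works just as well. Here is a minimal iterative sketch (my addition, not part of the original answer), reusing the dag_as_dict built above:

from collections import deque

def get_weights_bfs(dag_as_dict, root):
    result = defaultdict(float)
    queue = deque([(root, 1.0)])  # (node, accumulated weight along this path)
    while queue:
        node, acc = queue.popleft()
        for child, weight in dag_as_dict[node]:
            result[child] += acc * weight      # add this path's contribution
            queue.append((child, acc * weight))
    return result

print(get_weights_bfs(dag_as_dict, 'D'))
# defaultdict(<class 'float'>, {'C': 0.1, 'E': 0.3, 'B': 0.14, 'A': 0.07})

Both versions enumerate every path from the root, so they produce the same totals; only the order of traversal differs.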


Why are the inputs to my guess_nonlinear() all 1s?

The N2 diagram for my full problem is below.
The N2 diagram for the coupled portion of the problem is below.
I have a DirectSolver handling the coupling between LLTForces and ImplicitLiftingLine, and an LNBGS solver handling the coupling between LiftingLineGroup and TestCL.
The gist for the problem is here: https://gist.github.com/eufren/31c0e569ed703b2aea3e2ef5360610f7
I have implemented guess_nonlinear() on ImplicitLiftingLine, which should use various outputs from LLTGeometry to give a good initial guess for the vortex strengths based on a linearised form of the governing equations.
import numpy as np

def guess_nonlinear(self, inputs, outputs, resids):
    freestream_unit_vector = inputs['freestream_unit_vector']
    freestream_velocity = inputs['freestream_velocity']
    n = inputs['normal_vectors']
    A = inputs['surface_areas']
    l = inputs['bound_vortices']
    ic_tot = inputs['influence_coefficients_total']
    v_inf = freestream_velocity
    v_inf_vec = v_inf * freestream_unit_vector
    lin_numerator = np.pi * v_inf * A * np.sum(n * v_inf_vec, axis=1)
    lin_denominator = (np.linalg.norm(np.cross(v_inf_vec, l), axis=1)
                       - np.pi * v_inf * A * np.sum(np.sum(n * ic_tot, axis=2), axis=1))
    lin_vtx_str = lin_numerator / lin_denominator
    outputs['vortex_strengths'] = lin_vtx_str
However, when the problem is run for the first time, any inputs not explicitly set with p.set_val() are all 1s. This causes guess_nonlinear() to give a bad output, and so the system fails to converge.
As far as I can tell, the execution order for the LLT group is correct, and the geometry components should be being executed before the implicit component. I'm confused as to why this doesn't seem to actually be happening when the code is run, and instead these inputs are taking their default values.
What do I need to change to get this to work properly? Additionally, I've found difficulty in getting LNBGS to converge (hence adding guess_nonlinear()) during optimisation. Only DirectSolver gets all the way through the optimisation without issues, but it's very slow for large numbers of LLT nodes. How can I improve the linear and nonlinear solver selection, and improve the reliability of the iterative solver?
Note: Thanks for providing a testable example. It made figuring out the answer to your question a lot simpler. Your problem was a bit subtle and I would not have been able to give a good answer without runnable code.
Your first question: "Why are all the inputs 1?"
"Short" Answer
You have put the nonlinear solver too high in the model hierarchy, so that its scope included a key precursor component that computes your input values. By moving the solver down to a lower level of the model, I was able to ensure that the precursor component (LTTGeometry) ran and had valid outputs before you got to the guess_nonlinear of the implicit component.
Here is what you had (notice the implicit solver included LTTGeometry even though the data cycle does not require that component):
I moved both the nonlinear solver and the linear solver down into the LTTCycle group, which then allows the LTTGeometry component to execute before getting to the nonlinear solver and guess_nonlinear step:
My fix is only partially correct, since there is a secondary cycle from the TestCL component that also needs a solver and does not have one. However, that cycle still does not involve the LTTGeometry group. So the fully correct fix is to restructure your model to run geometry first, and then put the LTTCycle and TestCL groups together so you can run a solver over just them. That was a bit more hacking than I wanted to do on your test problem, but you can see the general idea from the adjusted N2 above.
Long Answer
The guess_nonlinear sequence in OpenMDAO does NOT run the compute method of explicit components or of groups. It follows the execution hierarchy, and calls any guess_nonlinear that it finds. So that means that any explicit components you have in your model will NOT get executed, their outputs will not get updated with computed values, and those computed values will not get passed to the inputs of downstream components.
Things get a little tricky when you have deep model hierarchies. The guess_nonlinear method is called as the first step in the nonlinear solver process. If you have a NonLinearRunOnce solver at the top level, it will follow the compute chain down the line calling compute or solve_nonlinear on each child and doing a data transfer after each one. If one of those children happens to be a group with a nonlinear solver, then that solver will call guess_nonlinear on its children (grandchildren of the top group with the NonLinearRunOnce solver) as the first step. So any outputs that were computed by the siblings of this group will be valid, but none of the outputs from the grandchild level will have been computed yet.
You may be wondering why not just have the guess_nonlinear process call the compute of any explicit components? There is a difficult-to-balance trade-off here. If you assume that all explicit components are very cheap to run, then it might make sense to run the compute methods --- or it might not. A lot depends on the cyclic data structure. If some early component in the group needs guesses from the later one, then running its compute isn't going to help you much at all. Perhaps more importantly though, not all explicit components are cheap to run. You might have a very expensive computation, and calling compute as part of the guess process would be way too costly.
The compromise here, if you need some kind of top level guess process, is that you can implement guess_nonlinear at the group level. It's less common to do, but it gives you total control over what happens. You can call whatever you need to call in whatever sequence.
So the absolute key thing to remember is that the only data you have available when guess_nonlinear is called is whatever was computed before your containing solver was executed. That means anything that was computed before you got to the model scope of the containing solver (not the scope of the component with the guess method itself).
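To make the group-level guess concrete, here is a minimal, self-contained sketch (my illustration, not the question's model; it assumes a recent OpenMDAO and uses hypothetical component names). The group owns the Newton solver and defines its own guess_nonlinear, so it controls the whole guess sequence:

import numpy as np
import openmdao.api as om

class SquareRootComp(om.ImplicitComponent):
    """Implicit state x defined by the residual x**2 - a = 0."""
    def setup(self):
        self.add_input('a', val=2.0)
        self.add_output('x', val=1.0)
        self.declare_partials('x', ['x', 'a'])

    def apply_nonlinear(self, inputs, outputs, residuals):
        residuals['x'] = outputs['x'] ** 2 - inputs['a']

    def linearize(self, inputs, outputs, partials):
        partials['x', 'x'] = 2.0 * outputs['x']
        partials['x', 'a'] = -1.0

class Cycle(om.Group):
    def setup(self):
        self.add_subsystem('sqrt_comp', SquareRootComp(), promotes=['*'])
        self.nonlinear_solver = om.NewtonSolver(solve_subsystems=False)
        self.linear_solver = om.DirectSolver()

    def guess_nonlinear(self, inputs, outputs, residuals):
        # Group-level guess: you decide what to seed and in what order.
        # Only data computed before this group's solver started is valid here.
        outputs['x'] = np.sqrt(np.abs(inputs['a']))

p = om.Problem(model=Cycle())
p.setup()
p.set_val('a', 9.0)
p.run_model()
print(p.get_val('x'))  # ~[3.]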
Your second question: "How can I speed this up when the number of nodes gets large?"
This one is not possible to give a generic answer to at all. I noticed that you have already specified sparse partial derivatives. That is a great start, but if it's still not fast enough for you then it means you're reaching the limits of what you can do with a DirectSolver. You note that this solver is the only one that gets you through the optimization without issues, which I will take to mean that ScipyKrylov and PetscKrylov are not converging the linear system well for you --- at least not by themselves. That's not surprising, as Krylov solvers almost always require some kind of preconditioner... and this is why I can't offer a generic answer. Setting up efficient linear solvers for larger-scale compute is a tricky subject. If you look into the literature, you'll find some good suggestions. You can also study open source implementations like VSPAero for some tips.
Effectively, you've reached the limit of what simple linear solvers can offer you. From this point forward, OpenMDAO can help a bit by making it easier to implement some preconditioning, but you'll have to suffer the math side yourself.

Clone a Binary Tree with Random Pointers

Can anyone explain the way of cloning a binary tree that has a random pointer in addition to left and right? Every node has the following structure:
struct node {
    int key;
    struct node *left, *right, *random;
};
This is a very popular interview question and I am able to figure out the solution based on hashing (which is similar to cloning a linked list). I tried to understand the solution given in the link (approach 2), but I am not able to figure out what it wants to convey, even after reading the code.
I don't expect the solution based on hashing, as it is intuitive and pretty straightforward. Please explain the solution based on modifying the binary tree and cloning it.
The solution presented is based on the idea of interleaving both trees, the original one and its clone.
For every node A in the original tree, its clone cA is created and inserted as A's left child. The original left child of A is shifted one level down in the tree structure and becomes a left child of cA.
For each node B, which is a right child of its parent P (i.e., B == P->right), a pointer to its clone node cB is copied to a clone of its parent.
     P                           P
    / \                         / \
   /   \                       /   \
  A     B                    cP     B
 /       \                   / \   / \
X         Z                 /   \ /   \
                           A    cB     Z
                          /       \   /
                        cA         cZ
                        /
                       X
                      /
                     cX
Finally we can extract the cloned tree by traversing the interleaved tree and unlinking every other node on each 'left' path (starting from root->left) together with its 'rightmost' descendants path and, recursively, every other 'left' descendant of those and so on.
What's important is that each cloned node is a direct left child of its original node. So in the middle part of the algorithm, after inserting the cloned nodes but before extracting them, we can traverse the whole tree walking on original nodes, and whenever we find a random pointer, say A->random == Z, we can copy the binding into the clones by setting cA->random = cZ, which resolves to something like
A->left->random = A->random->left;
This allows cloning random pointers directly and does not require additional hash maps (at the cost of interleaving new nodes into the original tree and extracting them later).
The interleaving method can be simplified a little, I think.
1) For every node A in the original tree, create clone cA with the same left and right pointers as A. Then, set A's left pointer to cA.
      P                     P
     / \                   /
    /   \                 /
   A     B              cP
  / \                   / \
 /   \                 /   \
X     Z               A     B
                     /     /
                   cA     cB
                   / \
                  X   Z
                 /   /
               cX   cZ
2) Now given a node and its clone (which is just node.left), the random pointer for the clone is: node.random.left (if node.random exists).
3) Finally, the binary tree can be un-interleaved.
I find this interleaving makes reasoning about the code much simpler.
Here is the code:
# Minimal Node class implied by the code below (not in the original post):
class Node:
    def __init__(self, data):
        self.data = data
        self.left = self.right = self.random = None

def clone_and_interleave(root):
    if not root:
        return
    clone_and_interleave(root.left)
    clone_and_interleave(root.right)
    cloned_root = Node(root.data)
    cloned_root.left, cloned_root.right = root.left, root.right
    root.left = cloned_root
    root.right = None  # This isn't necessary, but doesn't hurt either.

def set_randoms(root):
    if not root:
        return
    cloned_root = root.left
    set_randoms(cloned_root.left)
    set_randoms(cloned_root.right)
    cloned_root.random = root.random.left if root.random else None

def unterleave(root):
    if not root:
        return (None, None)
    cloned_root = root.left
    cloned_root.left, root.left = unterleave(cloned_root.left)
    cloned_root.right, root.right = unterleave(cloned_root.right)
    return (cloned_root, root)

def cloneTree(root):
    clone_and_interleave(root)
    set_randoms(root)
    cloned_root, root = unterleave(root)
    return cloned_root
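A quick usage sketch (my addition) to sanity-check it, using the small Node class from above:

# Build a tiny tree: A with children B and C; A.random -> C, B.random -> A.
a, b, c = Node('A'), Node('B'), Node('C')
a.left, a.right = b, c
a.random, b.random = c, a

clone = cloneTree(a)
print(clone.data, clone.left.data, clone.right.data)  # A B C
print(clone.random.data, clone.left.random.data)      # C A
print(clone is a, clone.random is c)                  # False False (fully new nodes)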
The terminology used in those interview questions is absurdly bad. It's the case of one unwitting knuckledragger somewhere calling that pointer the "random" pointer, and everyone just nods and accepts this as if it were some CS mantra from an ivory tower. Alas, it's sheer lunacy.
Either what you have is a tree or it isn’t. A tree is an acyclic directed graph with at most a single edge directed toward any node, and adding extra pointers can’t change it - the things the pointers point to must retain this property.
But when the node has a pointer that can point to any other node, it's not a tree. You've got a proper directed graph, possibly with cycles in it, and looking at it as if it were a tree is silly at this point. It's not a tree. It's just a generic directed graph that you're cloning. So any relevant directed-graph cloning technique will work, but the insistence on using the terms "tree" and "random pointer" obscures this simple fact and confuses matters terribly.
This snafu indicates that whoever came up with the question was not qualified to be doing any such interviewing. This stuff is covered in any decent introductory data structures textbook, so you'd think it shouldn't present some astronomical uphill effort to just articulate what you need in a straightforward manner. Let the interviewees deal with users who can't articulate themselves once they get the job; the data structure interview is neither the place nor the time for that. It reeks of stupidity and carelessness, and leaves a permanently bad aftertaste. It's probably yet another stupid thing that ended up in some "interview question bank" because one poor soul got asked it by a careless idiot once, and now everyone treats it as gospel. It's yet again the blind leading the blind, and cluelessness abounds.
Copying arbitrary graphs is a well-solved problem, and in all cases you need to retain the state of your traversal somehow. Whether it's done by inserting nodes into the original graph to mark the progress (one could call it intrusive marking), or by adding data to the copy in progress and removing it when done, or by using an auxiliary structure such as a hash, or by doing a repeat traversal to check if you made a copy of that node elsewhere, is of secondary importance, since the purpose is always the same: to retain the same state information, just encoding it in various ways, trading off speed and memory use (as always).
When thinking of this problem, you need to tell yourself what sort of state you need to finish the copy, and abstract it away, and implement the copy using this abstract interface. Then you can implement it in a few ways, but at that point the copy itself doesn’t obscure things since you look at this simple abstract state-preserving interface and not at the copy process.
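As a concrete illustration of that abstract state (my sketch, not code from this answer), here the "state of the traversal" is a memo dict from original node identity to its copy:

class GNode:
    """Hypothetical node type: a generic directed graph node."""
    def __init__(self, key):
        self.key = key
        self.left = self.right = self.random = None

def copy_graph(node, memo=None):
    # memo is the abstracted traversal state: id(original) -> copy
    if node is None:
        return None
    if memo is None:
        memo = {}
    if id(node) in memo:
        return memo[id(node)]    # already copied: reuse, don't recurse again
    clone = GNode(node.key)
    memo[id(node)] = clone       # record state BEFORE recursing (handles cycles)
    clone.left = copy_graph(node.left, memo)
    clone.right = copy_graph(node.right, memo)
    clone.random = copy_graph(node.random, memo)
    return clone

Swapping the dict for intrusive marks in the original nodes, or for a repeat traversal, changes only how this state is encoded, not the interface.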
In real life the choice of any particular implementation highly depends on the amount and structure of data being copied, and the extent you have control over it all. If you’re the one controlling the structure of the nodes, then you’ll usually find that they have some padding that you could use to store a bit of state information. Or you’ll find that the memory block allocated for the nodes is actually larger than requested: malloc will often end up providing a block larger than asked for, and all reasonable platforms have APIs that let you retrieve the actual size of the block and thus check if there’s maybe some leftover space just begging to be used. These APIs are not always fast so be careful there of course. But you see where this is going: such optimization requires benchmarks and a clear need driven by demands of the application. Otherwise, use whatever is least likely to be buggy - ideally a C library that provides data structures that you could use right away. If you need a cyclic graph there are libraries that do just that - use them first.
But boy, do I hate that idiotic “random” name of the pointer. Who comes up with this nonsense and why do they pollute so many minds? There’s nothing random about it. And a tree that’s not a tree is not a tree. I’d fail that interviewer in a split second…

How can I create arrays within a loop (within another loop) and pass them back out in Javascript?

I'm using Google Docs Spreadsheet API to keep track of a competition between some of my friends. I have some big ideas, but I'm stumped right now. I'm trying to create 20 different arrays inside of loops (see below) and then evaluate each one outside of the loop, and I really don't want to write 20 "if...then" statements.
NOTE: the following SUMMARY may or may not help you answer my question. You might want to skip down to the code, then read this if you need it :)
Summary of the program: Each player assigns point values in favor of one possible outcome of a set of binary-outcome events. As the events happen, players either gain the points assigned if their outcome occurs, or gain no points if the opposite outcome occurs. My goal is to 1) figure out exactly when a player is eliminated, and 2) highlight all remaining events that must be won for them to have a chance at tying for first.
Instead of trying to somehow evaluate all possibilities (5 players picking, 2^16 outcomes... I have zero computer science knowledge, but this seems like an incredibly huge task even for the modern computer) I've come up with an alternate idea. The script loops through each player, against each other opponent. It calculates the maximum number of points a player can score based on their value assignments and the already determined game. For one player and one opponent, it checks the best possible outcome by the player against that opponent, and if there is any opponent he cannot beat, even in the best case, then he is eliminated.
This part is easy-- after the loop runs inside the loop, I just adjust a global variable that I created earlier, and when the outer loop is done, just grab those variables and write them to the sheet.
Unfortunately, this misses the case of where he could have a best case against each individual opponent, but not against multiple opponents at once.
So the next step is what I'm trying to do now. I'm not even sure I can give a good explanation without just showing you the entire spreadsheet w/script, but I'll try. So what I want to do now is calculate the "value" of each event for each player against a given other player. If both player and opponent assigned points in favor of the same event outcome for one event, the event's value is the difference between the picks (positive if player picked higher, negative if lower), and it's the SUM if they picked opposite event outcomes. Now, I do the same thing as before-- take a best-case scenario for a player against a given opponent-- but now I check by how much the player can beat the opponent in a best-case scenario. Then I evaluate the (absolute value of the) event value against this difference, and if it's greater, then the event is a must win (or must lose if the event value is negative). And, if an event is both a "must-win" and a "must lose" event, then the player is eliminated.
The problem is that this second step requires me to create a new array of values for each player-opponent combination, and then do things with the values after they're created.
I realize one approach would be to create 20 different arrays, and throughout the entire loops, keep checking "if (player == "1" && opponent == 2){}" and populate the arrays accordingly, but this seems kind of ridiculous. And more importantly, this entire project is my attempt at learning javascript, so what's the point in using a time-intensive workaround that doesn't teach me anything new?
I'm trying to understand square bracket notation, since it seems to be the answer to my question, but a lot of people are also suggesting that it's impossible to create variable names by concatenating with the value of another variable... so anyway, here's what I'm trying. I'd really appreciate either a fix to my approach, or a better approach.
for (var player = 1; player < 6; player++) {
  if (player == 1) { /* look up certain columns in the spreadsheet and save them to variables */ }
  // ditto for other players
  for (var opponent = 1; opponent < 6; opponent++) {
    if (player != opponent) {
      if (opponent == 1) { /* save more values to variables */ }
      // ditto for other players
      for (var row = 9; row < 24; row++) {
        // Now the script goes down each row of an array containing the original
        // spreadsheet info, and, based on information determined by the variables
        // created above, get values corresponding to the player and opponent.
        // So what I'd like to do here is create "array[1,2]==" and then "array[1,3]=="
        // and so forth, creating them globally (I think I understand that term by now)
        // so I can refer to them outside of the loops later to do some evaluatin'.
      }
    }
  }
}
// get array[1,2]...array[5,4] and do some operations with them.
Thanks for getting through this little novel... really looking forward to your advice and ideas!
How can I create arrays within a loop (within another loop)?
Code update 2
As you said, "I'm trying to understand square bracket notation", you may take a look at my new demo and the code:
function getTeam() {
  var array = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]; // arrays within arrays
  // array myTeam
  var myTeam = [[], [], [], []];
  var playerNames = ["John", "Bert", "Dave", "Milton"];
  var ages = [];
  var weight = 104;
  // loop over the team array, adding each player (name, age and weight) to the team
  for (var i = 0; i < myTeam.length; i++) {
    // fill the age array in a loop
    for (var j = 0; j < myTeam.length; j++) {
      ages[j] = 23 + j;
    }
    myTeam[i].push([playerNames[i], ages[i], weight]);
  }
  return myTeam;
}
And pass them back out in Javascript
Could you elaborate on this part?
Update
var valuesOfPlayers = [];
for (var player = 1; player < 6; player++) {
  // look up certain columns in the spreadsheet and save them to variables:
  // you could call a function and return the values you
  // collected in an array within an array, as in the demo above
  valuesOfPlayers[player] = lookupColumnValues(player);
  for (var opponent = 1; opponent < 6; opponent++) {
    if (player != opponent) {
      // save more values to variables
      valuesOfPlayers[player] = addValuesToVar(player);
    }
    for (var row = 9; row < 24; row++) {
      // if you collect the values in your first and second if clause,
      // what other information do you want to collect?
      // Please elaborate on this part.
    }
  }
}
One workaround:
I could create an array before the execution of the loops.
At the start of each loop, I could push a string literal to the array containing the value of player and opponent.
After the loops are done, I could split the array into multiple arrays, or just evaluate them in one big array using regular expressions.
I'd still rather create new arrays each time-- seems like it is a more universal way of doing this, and learning how would be more educational for me than using this workaround.

Graph-Traversal: How do I query for "friends and friends of friends" using Gremlin

In my graph database I have Branches and Leaves. Branches can "contain" Leaves and Branches can "contain" Branches.
How, using Gremlin, can I find all leaves for a given branch that are directly or indirectly related to it?
I got this to work in Cypher:
START v=node(1) MATCH v-[:contains*1..2]->i RETURN v,i
Where the *1..2 means "friends and friends of friends".
I thought maybe LoopV was the way forward, but I just get an Exception:
Error reading JArray from JsonReader. Current JsonReader item is not an array: String
You can do the following in Gremlin 1.4+.
g.v(1).out('contains').loop(1){true}{it.out('contains').count() == 0}
This says:
Start at vertex with id 1
Take the outgoing "contains" edges.
Loop over the out('contains') section.
Loop "infinitely" (make sure your tree doesn't have loops in it)
Emit only those vertices touched that don't have more outgoing 'contains'-edges. (i.e. the leaves)
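In plain Python terms (my sketch, not Gremlin), that traversal computes something like this over an adjacency dict of 'contains' edges:

def leaves_under(graph, start):
    """Vertices reachable from start that have no outgoing 'contains' edges."""
    leaves = []
    for child in graph.get(start, []):
        if graph.get(child):                  # has outgoing edges: keep looping
            leaves.extend(leaves_under(graph, child))
        else:                                 # no outgoing edges: a leaf
            leaves.append(child)
    return leaves

branches = {1: [2, 3], 2: [4, 5], 5: [6]}     # hypothetical vertex ids
print(leaves_under(branches, 1))              # [4, 6, 3]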
However, looking at what you wanted from Cypher, it looks like you only want 2 steps. Thus, to do that, simply do:
g.v(1).out('contains').loop(1){it.loops < 3}
Perhaps I misunderstood your question --- either way, that should give you enough to play with.

In a tree data structure, display tree nodes level by level

Question: how can we display tree nodes level by level? Could you please give me a time- and space-efficient solution.
Example:
        A
       / \
      B   C
     / \ / \
    D  E F  G
void PrintTree(struct tree *root);
Output:
You have to print tree nodes level by level
A
B C
D E F G
If you're feeling brutish, and want to think very simply about the level you are at...
You will need:
Two queues
A slight twist on Jack's approach
So, start with root.
Tack its children onto the first queue.
Step through them, tacking their children onto the second queue as you go.
Switch to the second queue, step through, pushing their children onto the first queue.
Wax on, wax off.
Really it's just a slight expansion of the same idea, the breadth first search or sweep, which is worth thinking about as a pattern, since it applies to a variety of data structures. Almost anything that's a tree or trie, and a few things that aren't, in fact!
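In Python, the two-queue dance might look like this (a sketch of my own, with a hypothetical Node class, not code from the answer):

from collections import deque

class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def print_levels(root):
    current, other = deque([root] if root else []), deque()
    while current:
        level = []
        while current:                    # step through this queue...
            node = current.popleft()
            level.append(node.key)
            for child in (node.left, node.right):
                if child:
                    other.append(child)   # ...tacking children onto the other
        print(' '.join(level))
        current, other = other, current   # wax on, wax off

# the example tree from the question
root = Node('A', Node('B', Node('D'), Node('E')),
             Node('C', Node('F'), Node('G')))
print_levels(root)
# A
# B C
# D E F G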
To save space and time on SO:
http://thecodecracker.com/c-programming/bfs-and-dfs/
This kind of visit is called Breadth-first or Level Order traversal. You can see additional info here.
Basically you
first visit the current node
then all the children of that node
then all the children of every child, and so on
This should be achieved easily with a FIFO structure:
push the root
until queue is empty
take first element, visit it, and push all its children to the end of the queue
repeat
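As a Python sketch (my addition, reusing the Node class and tree from the snippet above), with a level-size count added so that each level prints on its own line:

from collections import deque

def print_level_order(root):
    if not root:
        return
    queue = deque([root])                  # push the root
    while queue:                           # until queue is empty
        size = len(queue)                  # nodes in the queue now = one full level
        keys = []
        for _ in range(size):
            node = queue.popleft()         # take first element, visit it
            keys.append(node.key)
            if node.left:
                queue.append(node.left)    # push its children to the end
            if node.right:
                queue.append(node.right)
        print(' '.join(keys))

print_level_order(root)
# A
# B C
# D E F G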
