I have encountered variations of this problem multiple times, and most recently it became a bottleneck in my arithmetic coder implementation. Given N (<= 256) segments of known non-negative size Si laid out in order starting from the origin, and for a given x, I want to find n such that
S0 + S1 + ... + Sn-1 <= x < S0 + S1 + ... + Sn
The catch is that lookups and updates are done at about the same frequency, and almost every update is in the form of increasing the size of a segment by 1. Also, the bigger a segment, the higher the probability it will be looked up or updated again.
Some sort of tree seems like the obvious approach, but I have been unable to come up with any tree implementation that satisfactorily takes advantage of the known domain-specific details.
Given the relatively small size of N, I also tried linear approaches, but they turned out to be considerably slower than a naive binary tree (even after some optimization, like starting from the back of the list for values above half the total).
Similarly, I tested introducing an intermediate step that remaps values so as to keep segments ordered by size, making access faster for the most frequently used ones, but the added overhead exceeded the gains.
Sorry for the unclear title -- despite it being a fairly basic problem, I am not aware of any specific names for it.
I suppose some BST would do... You could add a new numeric member (int or long) to each node that keeps the sum of the values of all left descendants. Then you'll find each item in approximately logarithmic time, and when an item is added, removed or modified you only have to update its ancestors on the return path from the recursion. You could use a self-balancing tree structure, for example an AVL tree, to keep the worst-case search optimal, or a splay tree to speed up searches for the most frequently used items. Take care to update the left-subtree sums during rebalancing or splaying.
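To make this concrete, here is a minimal sketch of that idea (my own naming, not the poster's code; the tree is built balanced once over the fixed set of N segment indices and never rotated, which is enough for N <= 256). Each node stores its own segment size plus the total size of its left subtree.

class Node:
    def __init__(self, index):
        self.index = index
        self.size = 0        # current size of segment S_index
        self.left_sum = 0    # sum of sizes in the left subtree
        self.left = None
        self.right = None

def build(lo, hi):
    """Balanced BST over segment indices lo..hi (inclusive)."""
    if lo > hi:
        return None
    mid = (lo + hi) // 2
    node = Node(mid)
    node.left = build(lo, mid - 1)
    node.right = build(mid + 1, hi)
    return node

def find(root, x):
    """Return n such that S_0 + ... + S_(n-1) <= x < S_0 + ... + S_n."""
    node = root
    while node is not None:
        if x < node.left_sum:
            node = node.left
        elif x < node.left_sum + node.size:
            return node.index
        else:
            x -= node.left_sum + node.size
            node = node.right
    raise ValueError("x is not less than the total size")

def add(root, n, y=1):
    """Increase segment n by y, fixing left-subtree sums on the way down."""
    node = root
    while node is not None:
        if n < node.index:
            node.left_sum += y
            node = node.left
        elif n == node.index:
            node.size += y
            return
        else:
            node = node.right

root = build(0, 7)
for i, s in enumerate([3, 1, 4, 1, 5, 9, 2, 6]):
    add(root, i, s)
print(find(root, 8))   # 3 + 1 + 4 = 8 <= 8 < 9, so segment 3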
You could use a binary tree where each node n contains two integers A_n and U_n, where initially
A_n = S_0 + ... + S_n and U_n = 0.
Let, at any later time, T_n = S_0 + ... + S_n denote the current prefix sum.
When looking for the place of a query x, you walk down the tree, knowing that for each node m the current value of T_m is
T_m = A_m + U_m + sum_{p : ancestor of m such that we visited the right child of p to reach m} U_p.
This solves lookups in O(log N).
For an update of the n-th interval (increasing its size by y), you just look for it in the tree, increasing the value of U_m by y for each node m that you visit along the way. This also solves updates in O(log N).
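As an aside, and not the structure described above: a Fenwick tree (binary indexed tree) is a standard array-based alternative that supports the same two operations, point update and prefix-sum descent, in O(log N), and it is very compact for N <= 256. A sketch:

class Fenwick:
    def __init__(self, n):
        self.n = n
        self.tree = [0] * (n + 1)   # 1-based internally

    def add(self, i, y):
        """Increase segment i (0-based) by y."""
        i += 1
        while i <= self.n:
            self.tree[i] += y
            i += i & -i

    def find(self, x):
        """Return n such that S_0 + ... + S_(n-1) <= x < S_0 + ... + S_n."""
        pos = 0
        bit = 1 << self.n.bit_length()
        while bit:                       # descend over the implicit tree
            nxt = pos + bit
            if nxt <= self.n and self.tree[nxt] <= x:
                x -= self.tree[nxt]
                pos = nxt
            bit >>= 1
        return pos                       # 0-based index of the segment holding x

f = Fenwick(8)
for i, s in enumerate([3, 1, 4, 1, 5, 9, 2, 6]):
    f.add(i, s)
print(f.find(8))   # prints 3, as in the sketch above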
PREMISE
So lately I have been thinking about a problem that is common to databases: optimizing insertion, search, deletion and update of data.
Usually I have seen that most databases nowadays use a B-tree or B+ tree to solve this problem, but those are usually used to store data on disk, and I wanted to work with in-memory data, so I thought about using an AVL tree (the difference should be minimal, because the purpose of B-trees is much the same as that of the AVL tree, but the implementation differs and so do the effects).
Before continuing with the reasoning behind this, I would like to go a level deeper into what I am trying to solve.
So in a modern database, data is stored in a table with a PRIMARY KEY, which tends to be INDEXED (I am not very experienced with indexing, so what I will say is the basic reasoning I put into this problem); usually the PRIMARY KEY is an increasing number starting from 1 (even though nowadays this is considered bad practice).
Normally an AVL tree should be more than enough to solve the problem, because this particular tree is always balanced and offers O(log2(n)) operations, BUT I wanted to take this to a deeper level and try to optimize it even further.
THEORY
So, as the title of the question suggests, I am trying to optimize the AVL tree by merging it with a B-tree.
Basically, every node of this new tree is, let's say, an array of ten elements; every node also stores its height in the tree, and the elements of the array are kept in ascending order.
INSERTION
Insertion initially fills the array of the root node; when the root node is full, it generates the left and right children, which also contain arrays of 10 elements.
Whenever a new node is added, the tree rebalances itself based on the first key of the vectors of the left and right children, also using their heights (note that this is how the AVL tree behaves, except that the AVL tree has just two children per node and single values instead of vectors).
SEARCH
Searching an element works this way: starting from the root, we compare the value K we are searching for with the first and last key of the array of the current node. If the value is in between, we know that it will surely be in the array of the current node, so we can run a binary search with O(log2(n)) complexity on this array of ten elements; otherwise, we go to the left if the key we are searching for is smaller than the first key, or to the right if it is bigger.
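A minimal sketch of this search (with a hypothetical Node layout of my own), taking the question's premise that a key falling between a node's first and last keys must live in that node's array:

from bisect import bisect_left

class Node:
    def __init__(self, keys, left=None, right=None):
        self.keys = keys          # sorted, at most 10 entries
        self.left = left
        self.right = right

def search(node, k):
    while node is not None:
        if node.keys[0] <= k <= node.keys[-1]:
            i = bisect_left(node.keys, k)    # binary search inside the array
            return node if node.keys[i] == k else None
        node = node.left if k < node.keys[0] else node.right
    return None

root = Node([40, 50, 60], Node([10, 20, 30]), Node([70, 80, 90]))
print(search(root, 50) is not None)   # True
print(search(root, 55) is not None)   # False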
DELETION
The same as searching, but we delete the value.
UPDATE
The same as searching, but we update the value.
CONCLUSION
If I am not wrong, this should have a complexity of O(log10(log2(n))), which is always logarithmic, so we shouldn't care about this optimization; but in my opinion this could make the height of the tree much smaller while also providing quick search times.
B-trees and B+ trees are indeed used for disk storage because of their block design. But there is no reason why they could not also be used as in-memory data structures.
The advantages of a B tree include its use of arrays inside a single node. Look-up in a limited vector of maybe 10 entries can be very fast.
Your idea of a compromise between a B-tree and an AVL tree would certainly work, but be aware that:
You need to perform tree rotations as in AVL in order to keep the tree balanced. In B-trees you work with redistributions, merges and splits, but no rotations.
Like with AVL, the tree will not always be perfectly balanced.
You need to describe what will be done when a vector is full and a value needs to be added to it: the node will have to split, and one half will have to be reinjected into the tree as a leaf.
You need to describe what will be done when a vector gets a very low fill factor (due to deletions). If you leave it like that, the tree could degenerate into an AVL tree where every vector has only 1 value, and then the additional vector overhead will make it less efficient than a genuine AVL tree. To keep the fill factor of a vector above a minimum you cannot easily apply the redistribution mechanism with a sibling node, as would be done in B-trees: it would work with leaf nodes, but not with internal nodes. So this needs to be clarified...
You need to describe what will be done when a value in a vector is updated. Of course, you would insert it at its sorted position: but if it becomes the first or last value in that vector, this may violate the order with regard to the left and right children, so there too you may need to define the algorithm more precisely.
Binary search in a vector of 10 may be overkill: a simple left-to-right scan may be faster, as CPUs are optimised for reading consecutive memory. This does not impact the time complexity, since the vector size is capped at 10. So we are talking about doing either at most 4 comparisons (3-4 on average, depending on the binary search implementation) or at most 10 comparisons (5 on average), as the sketch below illustrates.
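For illustration, here are the two in-node strategies side by side (a quick sketch with made-up keys; both are O(1) here because the vector is capped at 10 entries):

from bisect import bisect_left

keys = [3, 7, 12, 19, 25, 31, 40, 52, 60, 71]

def scan(keys, k):                 # at most 10 comparisons, cache-friendly
    for i, v in enumerate(keys):
        if v >= k:
            return i
    return len(keys)

def binary(keys, k):               # at most 4 comparisons
    return bisect_left(keys, k)

assert all(scan(keys, k) == binary(keys, k) for k in range(80))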
If I am not wrong this should have a complexity of O(log10(log2(n))) which is always logarithmic
Actually, if that were true, it would be sub-logarithmic, i.e. O(log log n). But there is a mistake here. The binary search in a vector is not related to n, but to 10. Also, this work comes in addition to finding the node containing that vector. So it is not a logarithm of a logarithm, but a sum of logarithms:
O(log10(n) + log2(10)) = O(log n)
Therefore the time complexity is no different from that of AVL or B-trees, provided that the algorithm is completed with the missing details, keeping within the logarithmic complexity.
You should maybe also consider implementing a pure B-tree or B+ tree: that way you also benefit from some of the advantages that neither the AVL tree nor the in-between structure has:
The leaves of the tree are all at the same level
No rotations are needed
The tree height only changes at one spot: the root.
B+ trees provide a very fast means of iterating over all values in order.
Given the heuristic values h(A)=5, h(B)=1, and using A* graph search, it will put A and B on the frontier with f(A)=2+5=7 and f(B)=4+1=5, then select B for expansion, and then put G on the frontier with f(G)=4+4=8. Next it will select A for expansion, but will do nothing, since both S and B are already expanded and not on the frontier; therefore it will select G next and return a non-optimal solution.
Is my argument correct?
There are two heuristic concepts here:
Admissible heuristic: When for each node n in the graph, h(n) never overestimates the cost of reaching the goal.
Consistent heuristic: When for each node n in the graph and each node m of its successors, h(n) <= h(m) + c(n,m), where c(n,m) is the cost of the arc from n to m.
Your heuristic function is admissible but not consistent, since as you have shown:
h(A) > h(B) + c(A,B), i.e. 5 > 1 + 1 = 2.
If the heuristic is consistent, then the estimated final cost of a partial solution never decreases along the path, i.e. f(n) <= f(m) for a successor m of n. As we can see again:
f(A) = g(A) + h(A) = 7 > f(B) = g(B) + h(B) = 5,
so this heuristic function does not satisfy this property.
With respect to A*:
A* using an admissible heuristic is guaranteed to find the shortest path from the start to the goal.
A* using a consistent heuristic, in addition to finding the shortest path, also guarantees that once a node is explored we have already found the shortest path to that node, and therefore no node ever needs to be re-explored.
So, answering your question: the A* algorithm has to be implemented so that it reopens nodes when a shorter path to an already-explored node is found (also updating the stored path cost), adding that node back to the open set or frontier. Therefore your argument is not correct, since B has to be added to the frontier again (now with the path S->A->B and cost 3).
If you can restrict A* to consistent heuristic functions, then yes, you can discard paths to nodes that have already been explored.
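To make the reopening concrete, here is a sketch of A* with node reopening, on the graph reconstructed from the numbers in this question (S->A = 2, S->B = 4, A->B = 1, B->G = 4; h(A) = 5, h(B) = 1, h(G) = 0; the code structure and names are mine):

import heapq

graph = {'S': [('A', 2), ('B', 4)], 'A': [('B', 1)], 'B': [('G', 4)], 'G': []}
h = {'S': 0, 'A': 5, 'B': 1, 'G': 0}

def astar(start, goal):
    g = {start: 0}
    parent = {start: None}
    frontier = [(h[start], start)]
    while frontier:
        f, n = heapq.heappop(frontier)
        if f > g[n] + h[n]:
            continue                        # stale queue entry, skip it
        if n == goal:
            path = []
            while n is not None:
                path.append(n)
                n = parent[n]
            return list(reversed(path)), g[goal]
        for m, cost in graph[n]:
            if g[n] + cost < g.get(m, float('inf')):
                g[m] = g[n] + cost          # shorter path found: (re)open m
                parent[m] = n
                heapq.heappush(frontier, (g[m] + h[m], m))
    return None

print(astar('S', 'G'))   # (['S', 'A', 'B', 'G'], 7): B is reopened with cost 3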
You maintain a priority queue of objects on the frontier. You then take the best candidate, expand it in all available directions, and put the new nodes in the priority queue. So it's possible for A to be pushed to the back of the queue even though the optimal path in fact goes through it. It's also possible for A to be hemmed in by neighbours that were reached through sub-optimal paths, in which case most implementations won't try to expand it, as you say.
Used this way, A* is only a way of finding a reasonable path; it doesn't guarantee the globally optimal path.
I am trying to solve this problem on SPOJ. I found it in the segment tree section, so I am pretty sure there is a solution that uses a segment tree. But I am unable to come up with the metadata that should be stored in each tree node.
The maximum sum can be computed using Kadane's algorithm, but how do we compute it using a segment tree? If we store just the output of the algorithm for a range, that would be correct for a query on exactly that range, but the parents could not combine it correctly. If we store some more information, like the negative-sum prefix as well as the negative-sum suffix, I am able to pass some of the test cases, but it's not completely correct.
Please give me some pointers on how I should approach the metadata for solving this particular problem.
Thanks for helping.
You can solve it by building a segment tree on the prefix sums
sum[i] = sum[i - 1] + a[i]
and then keeping the following information in a node:
node.min  = the minimum sum[i], x <= i <= y
            ([x, y] being the interval associated with the node)
          = minimum(node.left.min, node.right.min)
node.max  = same, but with maximum
node.best = maximum(node.left.best,
                    node.right.best,
                    node.right.max - node.left.min)
Basically, the best field gives you the sum of the maximum-sum subarray in the associated interval. This is either one of the maximum-sum subarrays of the two child nodes, or a sequence that crosses both child intervals, obtained by subtracting the minimum in the left child from the maximum in the right child. This is what we would also do in a possible linear solution: find the minimum sum[j], j < i, for each i, then compare sum[i] - sum[j] with the global max.
Now, to answer a query you will need to consider the nodes whose associated intervals make up your queried interval and do something similar to how we built the tree. You should try to figure it out on your own, but let me know if you get stuck somewhere.
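To make the build step concrete, here is a minimal sketch (my own naming) of the tree over the prefix sums. Leaves get best = -infinity because a subarray needs two distinct prefix indices j < i; answering arbitrary range queries with the same merge is left as the exercise above.

import math

class Node:
    __slots__ = ("lo", "hi", "mn", "mx", "best", "left", "right")

def build(prefix, lo, hi):
    node = Node()
    node.lo, node.hi = lo, hi
    if lo == hi:
        node.mn = node.mx = prefix[lo]
        node.best = -math.inf          # no pair j < i inside a single index
        node.left = node.right = None
        return node
    mid = (lo + hi) // 2
    node.left = build(prefix, lo, mid)
    node.right = build(prefix, mid + 1, hi)
    node.mn = min(node.left.mn, node.right.mn)
    node.mx = max(node.left.mx, node.right.mx)
    node.best = max(node.left.best,
                    node.right.best,
                    node.right.mx - node.left.mn)  # subarray crossing the midpoint
    return node

a = [1, -2, 3, 4, -5]
prefix = [0]
for v in a:
    prefix.append(prefix[-1] + v)
root = build(prefix, 0, len(a))
print(root.best)   # 7, from the subarray [3, 4]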
Background
I work with very large datasets from Synthetic Aperture Radar satellites. These can be thought of as high dynamic range greyscale images of the order of 10k pixels on a side.
Recently, I've been developing applications of a single-scale variant of Lindeberg's scale-space ridge detection algorithm for detecting linear features in a SAR image. This is an improvement over using directional filters or the Hough Transform, methods that have both been used previously, because it is less computationally expensive than either. (I will be presenting some recent results at JURSE 2011 in April, and I can upload a preprint if that would be helpful.)
The code I currently use generates an array of records, one per pixel, each of which describes a ridge segment in the rectangle to bottom right of the pixel and bounded by adjacent pixels.
struct ridge_t { unsigned char top, left, bottom, right; };
int rows, cols;
struct ridge_t *ridges; /* An array of rows*cols ridge entries */
An entry in ridges contains a ridge segment if exactly two of top, left, bottom and right have values in the range 0 to 128. Suppose I have:
ridge_t entry;
entry.top = 25; entry.left = 255; entry.bottom = 255; entry.right = 76;
Then I can find the ridge segment's start (x1,y1) and end (x2,y2):
float x1, y1, x2, y2;
x1 = (float) col + (float) entry.top / 128.0;
y1 = (float) row;
x2 = (float) col + 1;
y2 = (float) row + (float) entry.right / 128.0;
When these individual ridge segments are rendered, I get an image something like this (a very small corner of a far larger image):
Each of those long curves are rendered from a series of tiny ridge segments.
It's trivial to determine whether two adjacent locations containing ridge segments are connected. If I have ridge1 at (x, y) and ridge2 at (x+1, y), then they are parts of the same line if 0 <= ridge1.right <= 128 and ridge2.left == ridge1.right.
Problem
Ideally, I would like to stitch all of the ridge segments together into lines, so that I can then iterate over each line found in the image and apply further computations. Unfortunately, I'm finding it hard to find an algorithm for doing this that is low-complexity, memory-efficient and suitable for multiprocessing (all important considerations when dealing with really huge images!).
One approach that I have considered is scanning through the image until I find a ridge which only has one linked ridge segment, and then walking the resulting line, flagging any ridges in the line as visited. However, this is unsuitable for multiprocessing, because there's no way to tell if there isn't another thread walking the same line from the other direction (say) without expensive locking.
What do readers suggest as a possible approach? It seems like the sort of thing that someone would have figured out an efficient way to do in the past...
I'm not entirely sure this is correct, but I thought I'd throw it out for comment. First, let me introduce a lockless disjoint set algorithm, which will form an important part of my proposed algorithm.
Lockless disjoint set algorithm
I assume the presence of a two-pointer-sized compare-and-swap operation on your choice of CPU architecture. This is available on x86 and x64 architectures at the least.
The algorithm is largely the same as described on the Wikipedia page for the single-threaded case, with some modifications for safe lockless operation. First, we require the rank and parent elements to both be pointer-sized, and aligned to 2*sizeof(pointer) in memory, for the atomic CAS later on.
Find() need not change; the worst case is that the path compression optimization will fail to have full effect in the presence of simultaneous writers.
Union() however, must change:
function Union(x, y)
redo:
    x = Find(x)
    y = Find(y)
    if x == y
        return
    xSnap = AtomicRead(x) -- read both rank and pointer atomically
    ySnap = AtomicRead(y) -- this operation may be done using a CAS
    if (xSnap.parent != x || ySnap.parent != y)
        goto redo
    -- Ensure x has lower rank (meaning y will be the new root)
    if (xSnap.rank > ySnap.rank)
        swap(xSnap, ySnap)
        swap(x, y)
    -- if same rank, use pointer value as a fallback sort
    else if (xSnap.rank == ySnap.rank && x > y)
        swap(xSnap, ySnap)
        swap(x, y)
    yNew = ySnap
    yNew.rank = max(yNew.rank, xSnap.rank + 1)
    xNew = xSnap
    xNew.parent = y
    if (!CAS(y, ySnap, yNew))
        goto redo
    if (!CAS(x, xSnap, xNew))
        goto redo
    return
This should be safe in that it will never form loops, and will always result in a proper union. We can confirm this by observing that:
First, prior to termination, one of the two roots will always end up with a parent pointing to the other. Therefore, as long as there is no loop, the merge succeeds.
Second, rank always increases. After comparing the order of x and y, we know x had lower rank than y at the time of the snapshot. For a loop to form, another thread would need to have increased x's rank first, and then merged x and y. However, in the CAS that writes x's parent pointer, we check that the rank has not changed; therefore, y's rank must remain greater than x's.
In the event of simultaneous mutation, it is possible that y's rank may be increased, then return to redo due to a conflict. However, this implies that either y is no longer a root (in which case rank is irrelevant) or that y's rank has been increased by another process (in which case the second go around will have no effect and y will have correct rank).
Therefore, there should be no chance of loops forming, and this lockless disjoint-set algorithm should be safe.
And now on to the application to your problem...
Assumptions
I make the assumption that ridge segments can only intersect at their endpoints. If this is not the case, you will need to alter phase 1 in some manner.
I also make the assumption that co-habitation of a single integer pixel location is sufficient for ridge segments to be considered connected. If not, you will need to change the array in phase 1 to hold multiple candidate ridge segment+disjoint-set pairs, and filter through them to find ones that are actually connected.
The disjoint set structures used in this algorithm shall carry a reference to a line segment in their structures. In the event of a merge, we choose one of the two recorded segments arbitrarily to represent the set.
Phase 1: Local line identification
We start by dividing the map into sectors, each of which will be processed as a separate job. Multiple jobs may be processed in different threads, but each job will be processed by only one thread. If a ridge segment crosses a sector boundary, it is split into two segments, one for each sector.
For each sector, an array mapping pixel position to a disjoint-set structure is established. Most of this array will be discarded later, so its memory requirements should not be too much of a burden.
We then proceed over each line segment in the sector, choosing a disjoint set to represent the entire line the segment forms a part of. We look up each endpoint in the pixel-position array to see if a disjoint-set structure has already been assigned. If one of the endpoints is already in the array, we use the assigned disjoint set. If both are in the array, we perform a merge on the two disjoint sets and use the new root as our set. Otherwise, we create a new disjoint set and associate with it a reference to the current line segment. We then write our set's root back into the pixel-position array for each of our endpoints.
This process is repeated for each line segment in the sector; by the end, we will have identified all lines completely within the sector by a disjoint set.
Note that since the disjoint sets are not yet shared between threads, there's no need to use compare-and-swap operations yet; simply use the normal single-threaded union-merge algorithm. Since we do not free any of the disjoint set structures until the algorithm completes, allocation can also be made from a per-thread bump allocator, making memory allocation (virtually) lockless and O(1).
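For reference, the normal single-threaded union-merge mentioned here, as a minimal sketch (path compression plus union by rank; the payload field stands in for the representative line segment described earlier):

class DisjointSet:
    def __init__(self, payload=None):
        self.parent = self
        self.rank = 0
        self.payload = payload     # e.g. a representative line segment

    def find(self):
        root = self
        while root.parent is not root:
            root = root.parent
        node = self                # path compression: repoint to the root
        while node.parent is not root:
            node.parent, node = root, node.parent
        return root

def union(a, b):
    ra, rb = a.find(), b.find()
    if ra is rb:
        return ra
    if ra.rank < rb.rank:
        ra, rb = rb, ra
    rb.parent = ra                 # attach the shallower tree under the deeper
    if ra.rank == rb.rank:
        ra.rank += 1
    return ra

a, b, c = DisjointSet("seg1"), DisjointSet("seg2"), DisjointSet("seg3")
union(a, b); union(b, c)
print(a.find() is c.find())   # True: all three are in one set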
Once a sector is completely processed, all data in the pixel-position array is discarded; however data corresponding to pixels on the edge of the sector is copied to a new array and kept for the next phase.
Since iterating over the entire image is O(x*y), and a disjoint-merge is effectively O(1), this operation is O(x*y) and requires working memory O(m + 2*x*y/k + k^2) = O(x*y/k + k^2), where k is the width of a sector and m is the number of partial line segments in the sector (depending on how often lines cross borders, m may vary significantly, but it will never exceed the number of line segments). The memory carried over to the next phase is O(m + 2*x*y/k) = O(x*y/k).
Phase 2: Cross-sector merges
Once all sectors have been processed, we then move to merging lines that cross sectors. For each border between sectors, we perform lockless merge operations on lines that cross the border (ie, where adjacent pixels on each side of the border have been assigned to line sets).
This operation has running time O(x+y) and consumes O(1) memory (we must retain the memory from phase 1 however). Upon completion, the edge arrays may be discarded.
Phase 3: Collecting lines
We now perform a multi-threaded map operation over all allocated disjoint-set structure objects. We first skip any object which is not a root (ie, where obj.parent != obj). Then, starting from the representative line segment, we move out from there and collect and record any information desired about the line in question. We are assured that only one thread is looking at any given line at a time, as intersecting lines would have ended up in the same disjoint-set structure.
This has O(m) running time, and memory usage dependent on what information you need to collect about these line segments.
Summary
Overall, this algorithm should have O(x*y) running time, and O(x*y/k + k^2) memory usage. Adjusting k gives a tradeoff between transient memory usage on the phase 1 processes, and the longer-term memory usage for the adjacency arrays and disjoint-set structures carried over into phase 2.
Note that I have not actually tested this algorithm's performance in the real world; it is also possible that I have overlooked concurrency issues in the lockless disjoint-set union-merge algorithm above. Comments welcome :)
You could use a non-generalized form of the Hough Transform. It appears to reach an impressive O(N) time complexity on N x N mesh arrays (if you've got access to ~10000x10000 SIMD arrays and your mesh is N x N; note that in your case N would be a ridge struct, or a cluster of A x B ridges, NOT a pixel). More conservative (non-kernel) solutions list the complexity as O(kN^2), where k = [-π/2, π].
However, the Hough Transform does have some steep-ish memory requirements: the space complexity is O(kN), but if you precompute sin() and cos() and provide appropriate lookup tables, it goes down to O(k + N), which may still be too much, depending on how big your N is... but I don't see you getting it any lower.
Edit: The problem of cross-thread/kernel/SIMD/process line elements is non-trivial. My first impulse is to subdivide the mesh into recursive quad-trees (depending on a certain tolerance), check immediate edges and ignore all edge ridge structs (you can actually flag these as "potential long lines" and share them throughout your distributed system); then do the work on everything INSIDE a particular quad and progressively move outward. Here's a graphical representation (green is the first pass, red is the second, etc.). However, my intuition tells me that this is computationally expensive.
If the ridges are resolved well enough that the breaks are only a few pixels, then the standard dilate / find-neighbours / erode steps you would do for finding lines / OCR should work.
Joining longer contours from many segments, and knowing when to create a neck or when to make a separate island, is much more complex.
Okay, so having thought about this a bit longer, I've got a suggestion that seems like it's too simple to be efficient... I'd appreciate some feedback on whether it seems sensible!
1) Since I can easily determine whether each ridge_t ridge segment is connected to zero, one or two adjacent segments, I can colour each one appropriately (LINE_NONE, LINE_END or LINE_MID). This can easily be done in parallel, since there is no chance of a race condition.
2) Once colouring is complete:
for each `LINE_END` ridge segment X found:
    traverse line until another `LINE_END` ridge segment Y found
    if X is earlier in memory than Y:
        change X to `LINE_START`
    else:
        change Y to `LINE_START`
This is also free of race conditions, since even if two threads are simultaneously traversing the same line, they will make the same change.
3) Now every line in the image will have exactly one end flagged as LINE_START. The lines can be located and packed into a more convenient structure in a single thread, without having to do any look-ups to see if the line has already been visited.
It's possible that I should consider whether statistics such as line length should be gathered in step 2), to help with the final re-packing...
Are there any pitfalls that I've missed?
Edit: The obvious problem is that I end up walking the lines twice, once to locate LINE_STARTs and once to do the final re-packing, leading to some computational inefficiency. It still appears to be O(N) in terms of storage and computation time, though, which is a good sign...
When searching in a tree, my understanding of uniform cost search is that for a given node A with child nodes B, C, D and associated costs of (10, 5, 7), my algorithm will choose C, as it has the lowest cost. After expanding C, I see nodes E, F, G with costs of (40, 50, 60). It will choose E, as 40 is the minimum value among the three.
Now, isn't it just the same as doing a Greedy-Search, where you always choose what seems to be the best action?
Also, when defining costs from going from certain nodes to others, should we consider the whole cost from the beginning of the tree to the current node, or just the cost itself from going from node n to node n'?
Thanks
Nope. Your understanding isn't quite right.
The next node to be visited in case of uniform-cost-search would be D, as that has the lowest total cost from the root (7, as opposed to 40+5=45).
Greedy Search doesn't go back up the tree - it picks the lowest value and commits to that. Uniform-Cost will pick the lowest total cost from the entire tree.
In a uniform cost search you always consider all unvisited nodes you have seen so far, not just those connected to the node you just looked at. So in your example, after choosing C, you would find that visiting E has a total cost of 40 + 5 = 45, which is higher than the cost of starting again from the root and visiting D, which has cost 7. So you would visit D next.
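Here is a sketch of uniform-cost search on this example (treating the given numbers as edge costs), which shows D being expanded before E:

import heapq

graph = {'A': [('B', 10), ('C', 5), ('D', 7)],
         'B': [], 'D': [],
         'C': [('E', 40), ('F', 50), ('G', 60)],
         'E': [], 'F': [], 'G': []}

def ucs_order(start):
    """Yield nodes in the order UCS expands them, with their path costs."""
    frontier = [(0, start)]
    visited = set()
    while frontier:
        cost, n = heapq.heappop(frontier)
        if n in visited:
            continue
        visited.add(n)
        yield n, cost
        for m, c in graph[n]:
            if m not in visited:
                heapq.heappush(frontier, (cost + c, m))

print(list(ucs_order('A')))
# [('A', 0), ('C', 5), ('D', 7), ('B', 10), ('E', 45), ('F', 55), ('G', 65)]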
The difference between them is that the Greedy picks the node with the lowest heuristic value while the UCS picks the node with the lowest action cost. Consider the following graph:
If you run both algorithms, you'll get:
UCS
Picks: S (cost 0), B (cost 1), A (cost 2), D (cost 3), C (cost 5), G (cost 7)
Answer: S->A->D->G
Greedy:
*supposing it chooses A instead of B; A and B have the same heuristic value
Picks: S , A (h = 3), C (h = 1), G (h = 0)
Answer: S->A->C->G
So, it's important to differentiate the action cost to get to the node from the heuristic value, which is a piece of information that is added to the node, based on the understanding of the problem definition.
Greedy search (for most of this answer, think of greedy best-first search when I say greedy search) is an informed search algorithm, which means the function that is evaluated to choose which node to expand has the form of f(n) = h(n), where h is the heuristic function for a given node n that returns the estimated value from this node n to a goal state. If you're trying to travel to a place, one example of a heuristic function is one that returns the estimated distance from node n to your destination.
Uniform-cost search, on the other hand, is an uninformed search algorithm, also known as a blind search strategy. This means that the value of the function f for a given node n, f(n), for uninformed search algorithms, takes into consideration g(n), the total action cost from the root node to the node n, that is, the path cost. It doesn't have any information about the problem apart from the problem description, so that's all it can know. You don't have any information that can help you decide how close one node is to a goal state, only to the root node. You can watch the nodes expanding here (Animation of the Uniform Cost Algorithm) and see how the cost from node n to the root is used to choose which nodes to expand.
Greedy search, just like any greedy algorithm, takes locally optimal solutions and uses a function that returns an estimated value from a given node n to the goal state. You can watch the nodes expanding here (Greedy Best First Search | Quick Explanation with Visualization) and see how the return of the heuristic function from node n to the goal state is used to choose which nodes to expand.
By the way, sometimes the path chosen by greedy search is not a global optimum. In the example in the video, for example, node A is never expanded because there are always nodes with smaller values of h(n). But what if A has such a high heuristic value while the actual costs through it are very small, so that the path through A is the global optimum? That can happen: a bad heuristic function can cause this. Getting stuck in a loop is also possible. A*, which is also a best-first search algorithm, fixes this by making use of both the path cost (which implies knowing the nodes already visited) and a heuristic function, that is, f(n) = g(n) + h(n).
It's possible that to this point, it's still not clear to you HOW uniform-cost knows there is another path that looks better locally but not globally. It should become clear after telling you that if all paths have the same cost, uniform cost search is the same thing as the breadth-first search (BFS). It would expand all nodes just like BFS.
UCS cares about history,
Greedy does not.
In your example, after expanding C, the next node would be D according to UCS, because of our history: UCS can't forget the past, and it remembers that the total cost of D is much lower than that of E.
Don't be greedy. Be UCS: if going back is really a better choice, don't be afraid of going back!