What is the difference between SOM (Self Organizing Maps) and K-Means? - artificial-intelligence

There is only one question related to this on Stack Overflow, and it is more about which one is better. I just don't really understand the difference. They both work with vectors that are assigned randomly to clusters, and they both use the centroids of the different clusters to determine the winning output node. So where exactly does the difference lie?

In K-means the nodes (centroids) are independent of each other. The winning node gets the chance to adapt itself, and only itself. In SOM the nodes (centroids) are placed on a grid, so each node is considered to have neighbors: the nodes adjacent or near to it with respect to their position on the grid. So the winning node not only adapts itself but also causes a change in its neighbors. K-means can be considered a special case of SOM where no neighbors are taken into account when modifying the centroid vectors. For more, you can still google it ....
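To make the contrast concrete, here is a minimal sketch of one update step for each method. It assumes the online (one sample at a time) form of K-means and a simple "bubble" neighborhood for the SOM, where every node within `radius` grid steps of the winner gets the same update. The function names, `rate`, and `radius` are made up for illustration; practical SOMs usually weight the update by grid distance and shrink the radius over time.

```python
def nearest_index(centroids, x):
    """Index of the centroid closest to x (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(centroids[i], x)))


def move_towards(centroid, x, rate):
    """Move a centroid a fraction `rate` of the way towards sample x."""
    return [c + rate * (v - c) for c, v in zip(centroid, x)]


def kmeans_online_update(centroids, x, rate):
    """K-means style step: only the winning centroid adapts."""
    winner = nearest_index(centroids, x)
    centroids[winner] = move_towards(centroids[winner], x, rate)
    return winner


def som_update(centroids, grid, x, rate, radius):
    """SOM style step: the winner and its grid neighbours adapt.

    grid[i] is the (row, col) position of node i on the map (hypothetical layout).
    """
    winner = nearest_index(centroids, x)
    for i, pos in enumerate(grid):
        # Distance is measured on the grid, not in the data space.
        grid_distance = max(abs(a - b) for a, b in zip(pos, grid[winner]))
        if grid_distance <= radius:
            centroids[i] = move_towards(centroids[i], x, rate)
    return winner
```

With `radius=0` the SOM step touches only the winner, which is the sense in which K-means is the special case without neighbors.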

Related

creating a cost function in jgrapht

jgrapht supports the idea of putting a weight (a cost) on an edge between two nodes. This can be achieved using the class DefaultWeightedEdge.
In my graph I have the requirement to find not the shortest path but the cheapest one. The cheapest path might be longer (have more hops) than the shortest path.
Therefore, one can use the DijkstraShortestPath algorithm to achieve this.
However, my use case is a bit more complex: It needs to also evaluate costs on actions that need to be executed when arriving at a node.
Let's say you have a graph like a chess board (8x8 fields, each field being a node). All the edges have a weight of 1. To move in a car from the bottom left to the diagonally opposite corner (top right), there are many paths with the cost of 16. You can take a diagonal path in a zigzag style, or you can first travel all nodes to the right and then all nodes upwards.
The difference is: when taking the zigzag, you need to rotate yourself into the direction of movement. You rotate 16 times.
When moving first all the way to the right and then upwards, you need to rotate only once (maybe twice, depending on your start orientation).
So the zigzag path is, from a Dijkstra point of view, perfect. From a logical point of view, it's the worst.
Long story short: How can I put some costs on a node or edge depending on the previous edge/node in that path? I did not find anything related in the source code of jgrapht.
Or is there a better algorithm to use?
This is not a JGraphT issue but a graph algorithm issue. You need to think about how to encode this problem and formalize it in more detail.
Incorporating weights on vertices is in general easy. Say that every vertex represents visiting a customer, which takes a_i time. This can be encoded in the graph by adding a_i/2 to the cost of every incoming arc of node i, as well as a_i/2 to the cost of every outgoing arc.
A cost function where the cost of traveling from j to k depends on the arc (i,j) you used to travel to j is more complicated.
Approach a.: Use a dynamic programming (labeling) algorithm. This is perhaps the easiest. You can define your cost function as a recursive function, where the cost of traversing an arc depends on the cost of the previous arc.
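One way to read Approach a. is as Dijkstra over labels of the form (cell, heading), so that the cost of the next arc can depend on the arc used to arrive. The sketch below is plain Python rather than JGraphT, and everything in it (`cheapest_with_turns`, `step_cost`, `turn_cost`, the square grid with four headings) is an assumption chosen to mirror the chess board example from the question.

```python
import heapq


def cheapest_with_turns(size, start, goal, step_cost=1, turn_cost=1):
    """Cheapest cost from start to goal on a size x size grid.

    A label is (cell, heading); heading -1 means "no move made yet",
    so the very first move never pays a turn cost (an assumption).
    """
    moves = [(1, 0), (0, 1), (-1, 0), (0, -1)]
    best = {(start, -1): 0}
    heap = [(0, start, -1)]
    while heap:
        cost, cell, heading = heapq.heappop(heap)
        if cell == goal:
            return cost
        if cost > best.get((cell, heading), float("inf")):
            continue  # stale heap entry
        for new_heading, (dx, dy) in enumerate(moves):
            nxt = (cell[0] + dx, cell[1] + dy)
            if not (0 <= nxt[0] < size and 0 <= nxt[1] < size):
                continue
            # The arc cost depends on the previous arc through the heading.
            turning = heading != -1 and heading != new_heading
            new_cost = cost + step_cost + (turn_cost if turning else 0)
            if new_cost < best.get((nxt, new_heading), float("inf")):
                best[(nxt, new_heading)] = new_cost
                heapq.heappush(heap, (new_cost, nxt, new_heading))
    return None  # goal not reachable
```

On a 3x3 grid, `cheapest_with_turns(3, (0, 0), (2, 2))` gives 5: four steps plus a single turn, so the L-shaped route beats the zigzag.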
Approach b.: With some tricks you may be able to encode the costs in the graph by adding extra nodes to it. Here's an example:
Given a graph with vertices {a,b,c,d,e}, with arcs: (a,e), (e,b), (c,e), (e,d). This graph represents a crossroad with vertex e being in the middle. Going from a->e->b (straight) is free; however, a turn from a->e->d takes additional time. Similarly, c->e->d (straight) is free and c->e->b (turning) should be penalized.
Split vertex e into 4 new vertices: e1,e2,e3,e4.
Add the following arcs:
(a,e1), (e3,b), (c,e2), (e4,d), (e2, e3), (e1, e3), (e1, e4), (e2, e4).
(e1,e4) and (e2,e3) can have a positive weight to penalize turning.
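The expanded crossroad can be checked with an ordinary shortest path search, since the turn penalty now lives on regular arcs. This is a small sketch in plain Python (not JGraphT); the road arc weight of 1 and the `TURN` penalty of 5 are made-up values for illustration.

```python
import heapq

TURN = 5  # hypothetical penalty for turning at the crossroad

# Arcs from the example: e1/e2 are the entry copies of e, e3/e4 the exit copies.
graph = {
    "a": [("e1", 1)],
    "c": [("e2", 1)],
    "e1": [("e3", 0), ("e4", TURN)],  # a->b straight, a->d turning
    "e2": [("e4", 0), ("e3", TURN)],  # c->d straight, c->b turning
    "e3": [("b", 1)],
    "e4": [("d", 1)],
}


def dijkstra(graph, source):
    """Standard Dijkstra; returns the cheapest known cost to each reachable vertex."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist
```

Starting from a, the cost to b is 2 and the cost to d is 7, so the turn is penalized without any change to the algorithm itself.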

Implementing a basic predator-prey simulation

I am trying to implement a predator-prey simulation, but I am running into a problem.
A predator searches for nearby prey and eats it. If there is no nearby prey, it moves to a random vacant cell.
Basically, the part I am having trouble with is when I advance a "generation."
Say I have a grid that is 3x3, with each cell numbered from 0 to 8.
If I have 2 predators in cells 0 and 1, predator 0 is checked first, and it moves to either cell 3 or 4.
For example, if it goes to cell 3, then the simulation goes on to check predator 1. This may seem correct,
but it kind of "gives priority" to the organisms with lower index values. I've tried using 2 arrays, but that doesn't seem to work either, as it would check places where organisms appear to be but no longer are.
Does anyone have an idea of how to do this "fairly" and "correctly"?
I recently did a similar task in Java. Processing the predators from the top row to the bottom not only gives an "unfair advantage" to lower indices but also creates patterns in the movement of both prey and predators.
I overcame this problem by visiting both rows and columns in random order. This way, every predator/prey has the same chance of being processed at the early stages of a generation.
A way to randomize would be to create a linked list of (row,column) pairs and then shuffle the linked list. At each generation, choose a random index to start from and keep processing.
More as a comment than anything else: if your prey are so dense that this is a common problem, I suspect you don't have a "population" that will live long. Also as a comment: update your predators randomly. That is, instead of stepping through your array of locations, take your list of predators, randomize it, and then update the predators one by one. I think this is necessary, but I don't know if it is sufficient.
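The random update order suggested in these answers could look roughly like the sketch below. It only covers the "eat nearby prey" half of a generation (the move to a random vacant cell is left out), and `hunt_step`, `neighbours`, and the cell numbering are assumptions for illustration, not taken from the question's code.

```python
import random


def hunt_step(predators, prey, neighbours, rng):
    """Let each predator try to eat one adjacent prey, in random order.

    predators: list of cell ids (mutated in place when a predator moves)
    prey: set of cell ids (mutated in place when prey is eaten)
    neighbours: function mapping a cell id to its adjacent cell ids
    Returns a dict {eaten cell: index of the predator that ate it}.
    """
    order = list(range(len(predators)))
    rng.shuffle(order)  # a fresh random order every generation
    eaten_by = {}
    for i in order:
        nearby = [cell for cell in neighbours(predators[i]) if cell in prey]
        if nearby:
            target = rng.choice(nearby)
            prey.remove(target)    # each prey can be eaten only once
            predators[i] = target  # the predator moves onto the prey's cell
            eaten_by[target] = i
    return eaten_by
```

For example, with predators in cells 0 and 2 of a strip and a single prey in cell 1, calling `hunt_step([0, 2], {1}, lambda c: [c - 1, c + 1], random.Random())` lets either predator win, depending on the shuffle.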
This problem is solved with a technique called double buffering, which is also used in computer graphics (in order to prevent the image currently being drawn from disturbing the image currently being displayed on the screen). Use two arrays. The first one holds the current state, and you make all decisions about movement based on the first array, but you perform the movement in the other array. Then, you swap their roles.
Edit: Looks like I didn't read your question thoroughly enough. Double buffering and randomization might both be needed, depending on how complex your rules are (but if there are no rules other than the ones you've described, randomization should suffice). They solve two distinct problems, though:
Double buffering solves the problem of correctness when you have rules where decisions about what will happen to a creature in a cell depend on the contents of neighbouring cells, and the decisions about neighbouring cells also depend on this cell. If you have, e.g., a rule that says that if two predators are adjacent, they will both move away from each other, you need double buffering. Otherwise, after you've moved the first predator, the second one won't see any adjacent predator and will remain in place.
Randomization solves the problem of fairness when there are limited resources, such as when a prey only can be eaten by one predator (which seems to be the problem that concerned you).
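The "adjacent predators move away from each other" rule is small enough to show the effect of double buffering directly. This is a sketch on a one-dimensional strip of cells; the function name `step` and the `buffered` flag are made up for the comparison.

```python
def step(cells, buffered=True):
    """Advance one generation on a 1-D strip; True marks a predator.

    Rule: a predator with exactly one adjacent predator moves one cell away
    from it, if that cell is free. Decisions are read from `current` and
    written to `nxt`; without buffering both names refer to the same list.
    """
    current = list(cells)
    nxt = list(cells) if buffered else current
    n = len(cells)
    for i in range(n):
        if not current[i]:
            continue
        left = i > 0 and current[i - 1]
        right = i < n - 1 and current[i + 1]
        if left == right:
            continue  # no neighbour, or boxed in on both sides
        target = i + 1 if left else i - 1
        if 0 <= target < n and not current[target] and not nxt[target]:
            nxt[i] = False
            nxt[target] = True
    return nxt
```

With predators in cells 3 and 4 of an 8-cell strip, the buffered step moves them to cells 2 and 5. The unbuffered step moves the first one to cell 2 and leaves the second at cell 4, because by then it no longer sees a neighbour.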
How about some sort of round robin method. Put your predators in a circular linked list and keep a pointer to the node that's currently "first". Then, advance that first pointer to the next place in the list each generation. You could insert new predators either at the front or the back of your circular list with ease.

Need help solving a problem using graphs in C

I'm coding a C project for an algorithms class and I really need some help!
Here's the problem:
I have a set of names like this one, N = (James,John,Robert,Mary,Patricia,Linda,Barbara), which are stored in an RB tree.
Starting from this set of names, a series of couples like these is formed:
(James,Mary)
(James,Patricia)
(John,Linda)
(John,Barbara)
(Robert,Linda)
(Robert,Barbara)
Now I need to merge the elements so that they form n subgroups, with the constraint that each pairing is respected and each group has the smallest possible cardinality.
With the couples in the example they will form two groups
(James,Mary,Patricia) and (John,Robert,Barbara,Linda).
The task is to return the maximum number of groups formed and the number of males and females in the group with the maximum cardinality.
In this case it would be 2 2 2
I was thinking about building a graph where every name is represented by a vertex and two vertices share an edge only if they are paired.
I can then use an algorithm (like Kruskal) to find the minimum spanning tree. Is that right?
The problem is that the graph would not be completely connected.
I also need to find a way to map the names to the vertices of the graph and vice versa.
Can the vertices be indexed by a string?
Any help is really appreciated :)
Thanks in advance!
You don't need to find the minimum spanning tree. That is really for finding the "best" edges in a graph that will still keep the graph connected. In other words, you don't care how John and Robert are connected, just that they are.
You say that the problem is that the graph would not be completely connected, but I think that is actually the point. If you represent graph edges by using the couples as you suggest, then the vertices that are connected form the groups that you are looking for.
In your example, James is connected to Mary and also James is connected to Patricia. No other person connects to any of those three vertices (if they did, you would have another couple that included them), which is why they form a single group of (James, Mary, Patricia). Similarly all of John, Robert, Barbara, and Linda are connected to each other.
Your task is really to form the graph and find all of the connected subgraphs that are disjoint from each other.
While not a full algorithm, I hope that helps get you started.
I think you can solve this easily with a DFS that labels connected components, because every person is a node and every relation between two people is an edge. Use an outer loop that runs an explore function on every unvisited node, and have explore assign the same group number to every node it reaches.
e.g.
void dfs(void) {
    int group = 0;  /* id of the component currently being labelled */
    for (int i = 0; i < num_nodes; i++) {
        if (!nodes[i].visited) {
            /* explore() is assumed to mark every node reachable from
               nodes[i] as visited and to tag it with `group` */
            explore(&nodes[i], group);
            group++;
        }
    }
    /* group now holds the number of connected components */
}
Then you simply have to sort the nodes by group and you are done. If you want to track the path, you can use a pre-order number that indicates which node was explored first, second, etc.
(sorry for my bad English)!
The sets of names and pairs of names already form a graph. A data structure with nodes and pointers to other nodes is just another representation, one that you don't necessarily need. Disjoint sets are easier to implement IMO, and their purpose in life is exactly to keep track of sameness as pairs of things are joined together.
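As a rough illustration of the disjoint-set idea, here is a sketch that processes the couples from the question. It is written in Python for brevity rather than in the C the project calls for, and it assumes that the first name in every couple is male and the second female, which the question implies but does not state. The function name `group_couples` is made up.

```python
def group_couples(couples):
    """Return (number of groups, males in largest group, females in largest group)."""
    parent = {}

    def find(name):
        # Follow parent links to the representative, halving the path on the way.
        parent.setdefault(name, name)
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    males, females = set(), set()
    for male, female in couples:  # assumption: couples are (male, female)
        males.add(male)
        females.add(female)
        union(male, female)

    groups = {}
    for name in list(parent):
        groups.setdefault(find(name), set()).add(name)

    largest = max(groups.values(), key=len)
    return len(groups), len(largest & males), len(largest & females)


couples = [("James", "Mary"), ("James", "Patricia"), ("John", "Linda"),
           ("John", "Barbara"), ("Robert", "Linda"), ("Robert", "Barbara")]
print(*group_couples(couples))  # prints: 2 2 2
```

Because the dictionary is keyed by name, the strings themselves act as vertex identifiers, so no separate mapping from names to indices is needed in this version.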

R Tree 50,000 foot overview?

I'm working on a school project that involves taking a lat/long point and finding the top five closest points in a known list of places. The list is to be stored in memory, with the caveat that we must choose an "appropriate data structure" -- that is, we cannot simply store all the places in an array and compare distances one-by-one in a linear fashion. The teacher suggested grouping the place data by US State to prevent calculating the distance for places that are obviously too far away. I think I can do better.
From my research online it seems like an R-Tree or one of its variants might be a neat solution. Unfortunately, that sentence is as far as I've gotten with understanding the actual technique, as the literature is simply too dense for my non-academic head.
Can somebody give me a really high overview of what the process is for populating an R-Tree with lat/long data, and then traversing the tree to find those 5 nearest neighbors of a given point?
Additionally the project is in C, and I don't have to reinvent the wheel on this, so if you've used an existing open source C implementation of an R Tree I'd be interested in your experiences.
UPDATE: This blog post describes a straightforward search algorithm for a regionally partitioned space (like a PR quadtree). Hope that helps a future reader.
Have you considered alternative data structures?
I believe a Point Quadtree would be more effective for your needs than an R-tree. Spatial Index Demos provides demos for a list of possible data structures, including the R-tree and the Point Quadtree. I hope it gives you some insight.
Quad Trees
A quad tree takes a square of space and divides it into four children with half the dimensions along the X and Y axis.
+---+---+
|   |   |   Each square is a child
|   |   |   of the parent; when you
+---+---+   get to leaves a node has
|   |   |   a single point or a list
|   |   |   of points.
+---+---+
This data structure is recursive and you search for points by checking which child holds the point until you get to the leaf. A leaf either has a single member (point with X,Y coords) or a list of members, depending on the implementation. If you fill up a node you split it into 4 and distribute the children. Essentially, the data structure is a generalisation of a binary tree, so it is not necessarily balanced.
Balancing a quad tree may not be necessary for your purposes and is left as an exercise for the reader - try searching on the web for 'balanced quad tree'
Note that this data structure cannot index items that can overlap, but if you're only storing points this won't be a problem.
Finding nearest neighbours in a quad tree
Off the top of my head, here's a quick and dirty algorithm for finding the 'n' nearest neighbours to your point. It's not necessarily optimally efficient, but it will be fairly simple to implement. If someone has a link to a better one, feel free to post it in a comment or answer.
1. Locate the quad tree node containing your point, keeping a list of its parents.
2. Push all of the points in the node into a priority queue based on their distance from your base point (i.e. by the length of the hypotenuse per Pythagoras' theorem). Depending on the implementation there may be one or more per node. For a simple implementation of a priority queue data structure, look up 'binary heap'.
3. If any of the 'n' points are further away than the edges of the bounding box, add the contents of its neighbours. That is, if your base point is close to the edge of the bounding box, it is possible that neighbouring tree nodes contain points that are closer than the points found within your bounding box. You will need to back up the tree to do this, which is why you need to keep track of your parent nodes.
4. When all of the 'n' closest points are closer than the edges of your bounding box, you know that there could not possibly be neighbours that you have missed. Therefore, the 'n' closest points within this box must be your 'n' closest neighbours.
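A compact way to get the same result is a best-first traversal: instead of backing up the tree from the leaf, keep a priority queue of tree nodes ordered by how close their bounding box can possibly be to the query point, and stop once no remaining box can beat the current n-th best point. The sketch below uses that swapped-in technique, not the exact steps above. It is in Python rather than C, treats coordinates as planar (a simplification for lat/long), and all names (`QuadTree`, `capacity`, `max_depth`, `nearest`) are made up for illustration.

```python
import heapq


class QuadTree:
    def __init__(self, x0, y0, x1, y1, capacity=4, depth=0, max_depth=16):
        self.box = (x0, y0, x1, y1)
        self.capacity, self.depth, self.max_depth = capacity, depth, max_depth
        self.points = []      # only leaves hold points
        self.children = None  # None marks a leaf

    def insert(self, p):
        """Insert point p = (x, y); assumed to lie inside the root box."""
        if self.children is not None:
            self._child_for(p).insert(p)
            return
        self.points.append(p)
        if len(self.points) > self.capacity and self.depth < self.max_depth:
            self._split()

    def _split(self):
        x0, y0, x1, y1 = self.box
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        args = (self.capacity, self.depth + 1, self.max_depth)
        self.children = [QuadTree(x0, y0, mx, my, *args), QuadTree(mx, y0, x1, my, *args),
                         QuadTree(x0, my, mx, y1, *args), QuadTree(mx, my, x1, y1, *args)]
        pts, self.points = self.points, []
        for p in pts:
            self._child_for(p).insert(p)

    def _child_for(self, p):
        x0, y0, x1, y1 = self.box
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        return self.children[(1 if p[0] >= mx else 0) + (2 if p[1] >= my else 0)]

    def box_dist2(self, q):
        """Squared distance from q to the nearest point of this node's box."""
        x0, y0, x1, y1 = self.box
        dx = max(x0 - q[0], 0, q[0] - x1)
        dy = max(y0 - q[1], 0, q[1] - y1)
        return dx * dx + dy * dy


def nearest(root, q, k):
    """Return up to k stored points closest to q, nearest first."""
    best = []                  # max-heap of (-squared distance, point)
    frontier = [(0.0, 0, root)]
    tie = 1                    # tie-breaker so nodes are never compared
    while frontier:
        d2, _, node = heapq.heappop(frontier)
        if len(best) == k and d2 > -best[0][0]:
            break              # no remaining box can hold a closer point
        for p in node.points:
            pd2 = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
            if len(best) < k:
                heapq.heappush(best, (-pd2, p))
            elif pd2 < -best[0][0]:
                heapq.heapreplace(best, (-pd2, p))
        for child in node.children or []:
            heapq.heappush(frontier, (child.box_dist2(q), tie, child))
            tie += 1
    return [p for _, p in sorted(best, reverse=True)]
```

Usage would be along the lines of building `QuadTree(0, 0, 128, 128)`, calling `insert((x, y))` for each place, and then `nearest(tree, (x, y), 5)`.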

Spatial Data Structures in C

I do work in theoretical chemistry on a high performance cluster, often involving molecular dynamics simulations. One of the problems my work addresses involves a static field of N-dimensional (typically N = 2-5) hyper-spheres that a test particle may collide with. I'm looking to optimize (read: overhaul) the data structure I use for representing the field of spheres so I can do rapid collision detection. Currently I use a dead simple array of pointers to an N-membered struct (doubles for each coordinate of the center) and a nearest-neighbor list. I've heard of oct- and quad-trees but haven't found a clear explanation of how they work, how to efficiently implement one, or how to then do fast collision detection with one. Given the size of my simulations, memory is (almost) no object, but cycles are.
How best to approach this for your problem depends on several factors that you have not described:
- Will the same hypersphere arrangement be used for many particle collision calculations?
- Are the hyperspheres uniform size?
- What is the movement of the particle (e.g. straight line/curve) and is that movement affected by the spheres?
- Do you consider the particle to have zero volume?
I assume that the particle does not have simple straight line movement as that would be the relatively fast calculation of finding the closest point between a line and a point, which is likely going to be about the same speed as finding which of the boxes the line intersects with (to determine where in the n-tree to examine).
If your hypersphere positions are fixed for a lot of particle collisions then computing a voronoi decomposition/Dirichlet tessellation would give you a fast way of later finding exactly which sphere is closest to your particle for any given point in the space.
However to answer your original question about octrees/quadtrees/2^n-trees, in n dimensions you start with a (hyper)-cube that contains the area of space that you are interested in. This will be subdivided into 2^n hypercubes if you deem the contents to be too complicated. This continues recursively until you have only simple elements (e.g. one hypersphere centroid) in the leaf nodes.
Now that the n-tree is built you use it for collision detection by taking the path of your particle and intersecting it with the outer hypercube. The intersection position will tell you which hypercube in the next level down of the tree to visit next, and you determine the position of intersection with all 2^n hypercubes at that level, following downwards until you reach a leaf node. Once you reach the leaf you can examine interactions between your particle path and the hypersphere stored at that leaf. If you have collision you have finished, otherwise you have to find the exit point of the particle path from the current hypercube leaf and determine which hypercube it moves to next. Continue until you find a collision or entirely leave the overall bounding hypercube.
Efficiently finding the neighbouring hypercube when exiting a hypercube is one of the most challenging parts of this approach. For 2^n trees Samet's approaches {1, 2} can be adapted. For kd-trees (binary trees) an approach is suggested in {3} section 4.3.3.
Efficient implementation can be as simple as storing a list of 2^n pointers (8 for an octree) from each hypercube to its child hypercubes, and marking the hypercube in a special way if it is a leaf (e.g. make all pointers NULL).
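The subdivision described above can be sketched in a few lines. This is Python rather than C, stores at most one hypersphere centre per leaf, and assumes all centres are distinct and lie inside the root hypercube. The names `child_index` and `Node` are made up; here a leaf is marked by `children` being `None` rather than by an array of NULL pointers.

```python
def child_index(point, centre):
    """Pick one of the 2^n children: bit `axis` is set when the point
    lies on the upper side of the cube centre along that axis."""
    index = 0
    for axis, (p, c) in enumerate(zip(point, centre)):
        if p >= c:
            index |= 1 << axis
    return index


class Node:
    """A hypercube given by its centre and half its edge length."""

    def __init__(self, centre, half):
        self.centre, self.half = centre, half
        self.children = None  # None marks a leaf
        self.item = None      # the single centre stored in a leaf

    def insert(self, point):
        if self.children is None:
            if self.item is None:
                self.item = point
                return
            # The leaf is already occupied: subdivide and push the old item down.
            old, self.item = self.item, None
            self.children = [None] * (2 ** len(self.centre))
            self._insert_child(old)
        self._insert_child(point)

    def _insert_child(self, point):
        i = child_index(point, self.centre)
        if self.children[i] is None:
            h = self.half / 2
            centre = tuple(c + (h if (i >> axis) & 1 else -h)
                           for axis, c in enumerate(self.centre))
            self.children[i] = Node(centre, h)  # children are created lazily
        self.children[i].insert(point)
```

The same code works for any number of dimensions, because the number of children is derived from the length of the centre tuple.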
A description of dividing space to create a quadtree (which you can generalise to n-tree) can be found in Klinger & Dyer {4}
As others have mentioned, kd-trees may be more suited than 2^n-trees, as extension to an arbitrary number of dimensions is more straightforward; however, they will result in a deeper tree. It is also easier to adapt the split positions to match the geometry of your hyperspheres with a kd-tree. The description above of collision detection in a 2^n tree is equally applicable to a kd-tree.
{1} Connected Component Labeling Using Quadtrees, Hanan Samet, Journal of the ACM, Volume 28, Issue 3 (July 1981)
{2} Neighbor finding in images represented by octrees, Hanan Samet, Computer Vision, Graphics, and Image Processing, Volume 46, Issue 3 (June 1989)
{3} Convex hull generation, connected component labelling, and minimum distance calculation for set-theoretically defined models, Dan Pidcock, 2000
{4} Experiments in picture representation using regular decomposition, Klinger, A., and Dyer, C.R., Comptr. Graphics and Image Processing 5 (1976), 68-105.
It sounds like you'd want to implement a kd-tree, which would allow you to more quickly search the N-dimensional space. There's some more information and links to implementations at the Stony Brook Algorithm Repository.
Since your field is static (by which I'm assuming you mean that the hyper spheres don't move), then the fastest solution I know of is a Kdtree.
You can either make your own, or use someone else's, like this one:
http://libkdtree.alioth.debian.org/
A Quad tree is a 2 dimensional tree, in which at each level a node has 4 children, each of which covers 1/4 of the area of the parent node.
An Oct tree is a 3 dimensional tree, in which at each level a node has 8 children, each of which contains 1/8th of the volume of the parent node. Here is a picture to help you visualize it: http://en.wikipedia.org/wiki/Octree
If you're doing N dimensional intersection tests, you could generalize this to an N tree.
Intersection algorithms work by starting at the top of the tree and recursively traversing into any child nodes that intersect the object being tested, at some point you get to leaf nodes, which contain the actual objects.
An octree will work as long as you can specify the spheres by their centres - it hierarchically bins points into cubic regions with eight children. Working out neighbours in an octree data structure will require you to do sphere-intersecting-cube calculations (to some extent easier than they look) to work out which cubic regions in an octree are within the sphere.
Finding the nearest neighbours means walking back up the tree until you get a node with more than one populated child and all surrounding nodes included (this ensures the query gets all sides).
From memory, this is the (somewhat naive) basic algorithm for sphere-cube intersection:
i. Is the centre within the cube (this gets the eponymous situation)
ii. Are any of the corners of the cube within radius r of the centre (corners within the sphere)
iii. For each surface of the cube (you can eliminate some of the surfaces by working out which side of the surface the centre lies on) work out (this is all first-year vector arithmetic):
a. A normal of the surface that goes to the centre of the sphere
b. The distance from the centre of the sphere to the intersection of the normal with the plane of the surface (the chord intersects the plane of the cube's surface)
c. Intersection of the plane lies within the side of the cube (one condition of chord intersection to the cube)
iv. Calculate the size of the chord (Sin of Cos^-1 of ratio of normal length to radius of sphere)
v. If the nearest point on the line is less than the distance of the chord and the point lies between the ends of the line the chord intersects one of the edges of the cube (chord intersects cube surface somewhere along one of the edges).
Slightly dimly remembered, but this is something I did for a situation involving spherical regions using an octree data structure (many years ago). You may also wish to check out KD-trees as some of the other posters suggest, but your initial question sounds very similar to what I did.
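For axis-aligned cubes there is also a shorter test than the chord-based steps above: clamp the sphere centre to the box to find the closest point of the box, then compare that distance with the radius. This is a different technique from the one described in the answer, shown here as a sketch that works in any number of dimensions; the function and parameter names are made up.

```python
def sphere_intersects_box(centre, radius, box_min, box_max):
    """True if a (hyper)sphere touches or overlaps an axis-aligned box.

    centre, box_min and box_max are coordinate sequences of equal length.
    """
    dist2 = 0.0
    for c, lo, hi in zip(centre, box_min, box_max):
        closest = min(max(c, lo), hi)  # clamp the centre onto the box along this axis
        dist2 += (c - closest) ** 2
    return dist2 <= radius * radius
```

For the unit cube from (0, 0, 0) to (1, 1, 1), a sphere of radius 1 centred at (2, 2, 2) misses it, since the nearest corner is sqrt(3) away, while the same sphere with radius 2 intersects it.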
