Solving Lights out for AI Course - artificial-intelligence


So I was given the following task: Given that all lights in a 5x5 version of a game are turned on, write an algorithm using UCS / A* / BFS / Greedy best first search that finds a solution.
My first realization was that UCS would be unnecessary, since the cost of moving from one state to another is always 1 (pressing a button flips it and its neighbours). So I wrote BFS instead. It turned out that it runs too long and fills up the queue, even though I took care to delete parent nodes once I was finished with them so as not to exhaust memory. It would run for around 5-6 minutes and then crash because it ran out of memory.
Next, I wrote DFS (even though it was not listed as one of the options), and it did find a solution in 123 seconds at depth 15 (I used depth-limited DFS because I knew that there was a solution at depth 15).
What I am wondering now is: am I missing something? Is there a good heuristic for solving this problem with A* search? I could not come up with anything, because finding one does not seem trivial for this problem.
Thanks very much. Looking forward to some help from you guys
Here is my source code(I think it's pretty straightforward to follow):
struct state
{
bool board[25];
bool clicked[25];
int cost;
int h;
struct state* from;
};
int visited[1<<25];
int dx[5] = {0, 5, -5};
int MAX_DEPTH = 1<<30;
bool found=false;
struct state* MakeStartState()
{
struct state* noviCvor = new struct state();
for(int i = 0; i < 25; i++) noviCvor->board[i] = false, noviCvor->clicked[i] = false;
noviCvor->cost = 0;
//h=...
noviCvor->from = NULL;
return noviCvor;
};
struct state* MakeNextState(struct state* temp, int press_pos)
{
struct state* noviCvor = new struct state();
for(int i = 0; i < 25; i++) noviCvor->board[i] = temp->board[i], noviCvor->clicked[i] = temp->clicked[i];
noviCvor->clicked[press_pos] = true;
noviCvor->cost = temp->cost + 1;
//h=...
noviCvor->from = temp;
int temp_pos;
for(int k = 0; k < 3; k++)
{
temp_pos = press_pos + dx[k];
if(temp_pos >= 0 && temp_pos < 25)
{
noviCvor->board[temp_pos] = !noviCvor->board[temp_pos];
}
}
if( ((press_pos+1) % 5 != 0) && (press_pos+1) < 25 )
noviCvor->board[press_pos+1] = !noviCvor->board[press_pos+1];
if( (press_pos % 5 != 0) && (press_pos-1) >= 0 )
noviCvor->board[press_pos-1] = !noviCvor->board[press_pos-1];
return noviCvor;
};
bool CheckFinalState(struct state* temp)
{
for(int i = 0; i < 25; i++)
{
if(!temp->board[i]) return false;
}
return true;
}
int bijection_mapping(struct state* temp)
{
int temp_pow = 1;
int mapping = 0;
for(int i = 0; i < 25; i++)
{
if(temp->board[i])
mapping+=temp_pow;
temp_pow*=2;
}
return mapping;
}
void BFS()
{
queue<struct state*> Q;
struct state* start = MakeStartState();
Q.push(start);
struct state* temp;
visited[ bijection_mapping(start) ] = 1;
while(!Q.empty())
{
temp = Q.front();
Q.pop();
visited[ bijection_mapping(temp) ] = 2;
for(int i = 0; i < 25; i++)
{
if(!temp->clicked[i])
{
struct state* next = MakeNextState(temp, i);
int mapa = bijection_mapping(next);
if(visited[ mapa ] == 0)
{
if(CheckFinalState(next))
{
printf("SOLUTION FOUND\n");
exit(0);
}
visited[ mapa ] = 1;
Q.push(next);
}
}
}
delete temp;
}
}
PS. As I am not using map anymore(switched to array) for visited states, my DFS solution improved from 123 secs to 54 secs but BFS still crashes.

First of all, you may already recognize that in Lights Out you never have to flip the same switch more than once, and it doesn't matter in which order you flip the switches. You can thus describe the current state in two distinct ways: either in terms of which lights are on, or in terms of which switches have been flipped. The latter, together with the starting pattern of lights, gives you the former.
To employ a graph-search algorithm to solve the problem, you need a notion of adjacency. That follows more easily from the second characterization: two states are adjacent if there is exactly one switch about which they differ. That characterization also directly encodes the length of the path to each node (= the number of switches that have been flipped), and it reduces the number of subsequent moves that need to be considered for each state, since all possible paths to each node are encoded in the pattern of switches.
You could use that in a breadth-first search relatively easily (and this may be what you in fact tried). BFS is equivalent to Dijkstra's algorithm in that case, even without using an explicit priority queue, because you enqueue new nodes to explore in priority (path-length) order.
You can also convert that to an A* search with addition of a suitable heuristic. For example, since each move turns off at most five lights, one could take as the heuristic the number of lights still on after each move, divided by 5. Though that's a bit crude, I'm inclined to think that it would be of some help. You do need a real priority queue for that alternative, however.
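As a rough illustration of the heuristic just described, here is a minimal sketch in C. It assumes the board is packed into the low 25 bits of an integer and that the goal is "all lights off"; the function names are hypothetical and not taken from any posted code.

```c
#include <assert.h>
#include <stdint.h>

/* Count the lights that are on in a 25-bit board (bit i = light i). */
int count_lights(uint32_t lights)
{
    int n = 0;
    while (lights) {
        lights &= lights - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Hypothetical helper: a lower bound on the number of presses still needed,
   assuming the goal is "all lights off". One press changes at most five
   lights, so ceil(n / 5) never overestimates. */
int lights_heuristic(uint32_t lights)
{
    assert((lights >> 25) == 0);            /* only 25 lights exist */
    return (count_lights(lights) + 4) / 5;  /* ceil(n / 5) */
}
```

Rounding up keeps the estimate an integer while still never exceeding the true number of presses.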
As far as implementation goes, do recognize that you can represent both the pattern of lights currently on and the pattern of switches that have been pressed as bit vectors. Each pattern fits in a 32-bit integer, and a list of visited states requires 2^25 bits (4 MiB), which is well within the capacity of modern computing systems. Even if you use that many bytes, instead, you ought to be able to handle it. Moreover, you can perform all needed operations using bitwise arithmetic operators, especially XOR. Thus, this problem (at its given size) ought to be computable relatively quickly.
Update:
As I mentioned in comments, I decided to solve the problem for myself, with -- it seemed to me -- very good success. I used a variety of techniques to achieve good performance and minimize memory usage, and in this case, those mostly were complementary. Here are some of my tricks:
I represented each whole-system state with a single uint64_t. The top 32 bits contain a bitmask of which switches have been flipped, and the bottom 32 contain a bitmask of which lights are on as a result. I wrapped these in a struct along with a single pointer to link them together as elements of a queue. A given state can be tested as a solution with one bitwise-and operation and one integer comparison.
I created a pre-initialized array of 25 uint64_t bitmasks representing the effect of each move. One bit set among the top 32 of each represents the switch that is flipped, and between 3 and five bits set among the bottom 32 represent the lights that are toggled as a result. The effect of flipping one switch can then be computed simply as new_state = old_state ^ move[i].
I implemented plain breadth-first search instead of A*, in part because I was trying to put something together quickly, and in particular because that way I could use a regular queue instead of a priority queue.
I structured my BFS in a way that naturally avoided visiting the same state twice, without having to actually track which states had ever been enqueued. This was based on some insight into how to efficiently generate distinct bit patterns without repeating, with those having fewer bits set generated before those having more bits set. The latter criterion was satisfied fairly naturally by the queue-based approach required anyway for BFS.
I used a second (plain) queue to recycle dynamically-allocated queue nodes after they were removed from the main queue, to minimize the number of calls to malloc().
Overall code was a bit less than 200 lines, including blank and comment lines, data type declarations, I/O, queue implementation (plain C, no STL) -- everything.
Note, by the way, that the priority queue employed in standard Dijkstra and in A* is primarily about finding the right answer (shortest path), and only secondarily about doing so efficiently. Enqueueing and dequeueing from a standard queue can both be O(1), whereas those operations on a priority queue are O(log m) in the number of elements in the queue. A* and BFS both have worst-case queue size upper bounds of O(n) in the total number of states. Thus, BFS will scale better than A* with problem size; the only question is whether the former reliably gives you the right answer, which in this case, it does.
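The 64-bit state encoding and the precomputed move table described above might look roughly like the following. This is only a sketch of the idea, not the code I actually wrote; the names move_mask, init_moves, press and is_solved are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SIDE 5
#define CELLS (SIDE * SIDE)
#define LIGHTS_MASK ((UINT64_C(1) << CELLS) - 1)   /* low 25 bits = lights */

/* move_mask[i]: bit (32 + i) records switch i, low bits are the lights it toggles. */
uint64_t move_mask[CELLS];

void init_moves(void)
{
    for (int r = 0; r < SIDE; r++) {
        for (int c = 0; c < SIDE; c++) {
            int i = r * SIDE + c;
            uint64_t m = (UINT64_C(1) << (32 + i)) | (UINT64_C(1) << i);
            if (r > 0)        m |= UINT64_C(1) << (i - SIDE);  /* light above */
            if (r < SIDE - 1) m |= UINT64_C(1) << (i + SIDE);  /* light below */
            if (c > 0)        m |= UINT64_C(1) << (i - 1);     /* light to the left */
            if (c < SIDE - 1) m |= UINT64_C(1) << (i + 1);     /* light to the right */
            move_mask[i] = m;
        }
    }
}

/* Flipping a switch is a single XOR on the combined state. */
uint64_t press(uint64_t state, int i)
{
    assert(i >= 0 && i < CELLS);
    return state ^ move_mask[i];
}

/* Solved (here: all lights off) is one AND plus one comparison. */
int is_solved(uint64_t state)
{
    return (state & LIGHTS_MASK) == 0;
}
```

Starting from LIGHTS_MASK (all lights on, no switches flipped), every successor of a state is just press(state, i) for some i, and the top 32 bits of any state record which switches were used to reach it.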

Related

Fastest ways to find duplicates

So my interviewer showed me the following code,
struct test {
uint8_t inuse;
int32_t value;
};
#define MAX_LIST_SIZE 100
struct test list[MAX_LIST_SIZE];
int checkAndAdd(int32_t value) {
for(int i=0; i<MAX_LIST_SIZE; i++) {
if(list[i].inuse && list[i].value == value)
return DUPLICATE;
}
for(int i=0; i<MAX_LIST_SIZE; i++) {
if(!list[i].inuse) {
list[i].inuse = 1;
list[i].value = value;
return ADDED;
}
}
return EA_FAIL;
}
and asked me the following questions.
How to make that function faster?
What are the other fastest methods to find duplicates in array?
My answers were
1.
int checkAndAdd(int32_t value) {
int32_t addedIndex = -1;
for(int i=0; i<MAX_LIST_SIZE; i++) {
if(list[i].inuse && list[i].value == value) {
if (addedIndex != -1) {
list[addedIndex].inuse = 0;
list[addedIndex].value = 0;
}
return DUPLICATE;
} else if (!list[i].inuse && (addedIndex == -1)) {
list[i].inuse = 1;
list[i].value = value;
addedIndex = i;
}
}
if (addedIndex != -1)
return ADDED;
return EA_FAIL;
}
2. You can't have a faster duplicate check than O(n).
Were my answers correct? Please suggest any other good approaches and answer to 2. Thanks.

The fastest, most general way to find duplicates is with a hash table. That gives essentially constant-time access (with just a little added overhead to handle hash collisions).
If the set of integers is sufficiently dense, you could alternatively use an array indexed by value, but this is only practical if the percentage of holes is sufficiently small.
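To make the hash-table suggestion concrete, here is a minimal sketch of checkAndAdd backed by an open-addressing table with linear probing. The table size, the multiplicative hash and the return codes are illustrative assumptions; the original snippet does not define DUPLICATE, ADDED or EA_FAIL, and this version does not support removal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_LIST_SIZE 100
#define TABLE_SIZE 256          /* power of two, so the table stays under half full */

enum { ADDED, DUPLICATE, EA_FAIL };   /* assumed return codes */

static struct { bool inuse; int32_t value; } slots[TABLE_SIZE];
static int stored;

/* Multiplicative hash; the top 8 bits select one of the 256 slots. */
static uint32_t hash32(int32_t v)
{
    return ((uint32_t)v * 2654435761u) >> 24;
}

int checkAndAdd(int32_t value)
{
    uint32_t i = hash32(value);
    assert(stored <= MAX_LIST_SIZE);
    /* Probe until we hit the value or an empty slot; there is always an
       empty slot because at most 100 of the 256 slots are ever used. */
    while (slots[i].inuse) {
        if (slots[i].value == value)
            return DUPLICATE;
        i = (i + 1) & (TABLE_SIZE - 1);
    }
    if (stored == MAX_LIST_SIZE)
        return EA_FAIL;
    slots[i].inuse = true;
    slots[i].value = value;
    stored++;
    return ADDED;
}
```

Both the duplicate check and the insertion touch only a handful of slots on average, instead of scanning all 100 entries.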

It's somewhat dependent on data patterns, but if the list is kept sorted, interpolation search is O(log log n) on average for uniformly distributed values.
Using binary search gives you O(log n), and for n <= 100, a maximum of 7 search steps, making interpolation search probably not worthwhile.
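A sketch of the binary-search variant, assuming the values are kept in a sorted array rather than in the original struct list (the names sorted, used and checkAndAddSorted, and the return codes, are hypothetical):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_LIST_SIZE 100

enum { ADDED, DUPLICATE, EA_FAIL };   /* assumed return codes */

int32_t sorted[MAX_LIST_SIZE];   /* values in ascending order */
int used;                        /* number of valid entries */

int checkAndAddSorted(int32_t value)
{
    int lo = 0, hi = used;
    assert(used >= 0 && used <= MAX_LIST_SIZE);
    /* Binary search for the first element that is not less than value. */
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (sorted[mid] < value)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo < used && sorted[lo] == value)
        return DUPLICATE;
    if (used == MAX_LIST_SIZE)
        return EA_FAIL;
    /* Shift the tail up by one slot to keep the array sorted. */
    memmove(&sorted[lo + 1], &sorted[lo], (size_t)(used - lo) * sizeof sorted[0]);
    sorted[lo] = value;
    used++;
    return ADDED;
}
```

The lookup is O(log n), but note that the insertion still moves up to n elements, so this mainly pays off when checks are much more frequent than successful adds.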
Edit to add side note: it's curious that the struct has int32_t but the argument to the function is plain int. Probably not broken (are there any ILP64 systems out there?), but seems a bit sloppy.

If you need to be able to quickly insert, delete, and avoid duplicates: what you want is a set probably implemented with a hash table where the key and value both point to the same data.
Hash tables inherently cannot have duplicates. They're on average O(1) for inserts, deletions, and lookups and O(n) on space. The only downside is there is no inherent order to the values. Since your original code does not appear to be preserving order that would be fine.

If I gave you that as an interview question, I might want to discuss hashes and so on, and that would be a good sign, since hash tables are fundamental data structures. But I'd really be looking to see whether you could merge the two loops, and you did that. Your second answer, that a duplicate check is O(n), is correct for lists but incorrect in general, because it depends on the data structure: if you use a hash, it's O(1).
Please note that in real life O(1) can sometimes, and in fact quite often, turn out to be slower than O(n): by the time you've hashed the value to get a key, done the lookup and traversed the hash bucket's list, you might already have found the single item among the 5 you were searching in a linked list.

Efficiently choose an integer distinct from all elements of a list

I have a linked list of objects each containing a 32-bit integer (and provably fewer than 2^32 such objects) and I want to efficiently choose an integer that's not present in the list, without using any additional storage (so copying them to an array, sorting the array, and choosing the minimum value not in the array would not be an option). However, the definition of the structure for list elements is under my control, so I could add (within reason) additional storage to each element as part of solving the problem. For example, I could add an extra set of prev/next pointers and merge-sort the list. Is this the best solution? Or is there a simpler or more efficient way to do it?

Given the conditions that you outline in the comments, especially your expectation of many identical values, you must expect a sparse distribution of used values.
Consequently, it might actually be best to just guess a value randomly and then check whether it coincides with a value in the list. Even if half the available value range were used (which seems extremely unlikely from your comments), you would only traverse the list twice on average. And you can drastically decrease this factor by simultaneously checking a number of guesses in one pass. Done correctly, the factor should always be close to one.
The advantage of such a probabilistic approach is that you are immune to bad sequences of values. Such sequences are always possible with range-based approaches: if you calculate the min and max of the data, you run the risk that the data contains both 0 and 2^32-1. If you sequentially subdivide an interval, you run the risk of always getting values in the middle of the interval, which can shrink it to zero in 32 steps. With a probabilistic approach, these sequences can't hurt you.
I think I would use something like four guesses for very small lists, and crank it up to roughly 16 as the size of the list approaches the limit. The high starting value is due to the fact that any such algorithm will be memory bound, i.e. your CPU has ample time to check a value while it waits for the next values to arrive from memory, so you had better make good use of that time to reduce the number of passes required.
A further optimization would instantly replace a busted guess with a new one and keep track of where the replacement happened, so that you can avoid a complete second pass through the data. Also, move the busted guess to the end of the list of guesses, so that you only need to check against the start position of the first guess in your loop to stop as early as possible.
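A minimal sketch of the basic multi-guess pass (without the "replace a busted guess on the fly" refinement) could look like this. The node layout, the fixed GUESSES count and the rand32 helper built on rand() are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define GUESSES 4   /* number of candidates checked per pass (assumed) */

struct node {
    uint32_t id;
    struct node *next;
};

/* Assemble 32 random bits; rand() only guarantees 15 bits per call. */
uint32_t rand32(void)
{
    return ((uint32_t)(rand() & 0x7FFF) << 17) ^
           ((uint32_t)(rand() & 0x7FFF) << 2) ^
           ((uint32_t)rand() & 0x3);
}

/* Returns a value that does not occur in the list. Assumes the list holds
   fewer than 2^32 distinct values, otherwise this never terminates. */
uint32_t pick_unused(const struct node *head)
{
    for (;;) {
        uint32_t guess[GUESSES];
        int alive = GUESSES;
        assert(alive > 0);
        for (int g = 0; g < GUESSES; g++)
            guess[g] = rand32();

        /* One pass over the list, dropping every guess that turns up. */
        for (const struct node *p = head; p && alive > 0; p = p->next) {
            for (int g = 0; g < alive; g++) {
                if (guess[g] == p->id) {
                    guess[g] = guess[--alive];  /* overwrite with the last live guess */
                    g--;                        /* re-check the moved guess against p */
                }
            }
        }
        if (alive > 0)
            return guess[0];    /* survived a comparison with every element */
    }
}
```

With a sparsely used value range, almost every call finishes after a single pass.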

If you can spare one pointer in each object, you get an O(n) worst-case algorithm easily (standard divide-and-conquer):
Divide the range of possible IDs equally.
Make a singly-linked list covering each subrange.
If one subrange is empty, choose any id in it.
Otherwise repeat with the elements of the subrange with fewest elements.
Example code using two sub-ranges per iteration:
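unsigned getunusedid(element* h) {
    unsigned start = 0, stop = -1; /* -1 wraps to the largest unsigned value */
    /* Copy the main links into the scratch links without losing the head. */
    for(element* p = h; p; p = p->mainnext)
        p->next = p->mainnext;
    while(h) {
        element *l = 0, *r = 0;
        unsigned cl = 0, cr = 0;
        /* First id of the upper half, so that both halves have equal size. */
        unsigned mid = start + (stop - start) / 2 + 1;
        while(h) {
            element* next = h->next;
            if(h->id < mid) {
                h->next = l;
                cl++;
                l = h;
            } else {
                h->next = r;
                cr++;
                r = h;
            }
            h = next;
        }
        /* Continue with the half that holds fewer elements. */
        if(cl < cr) {
            h = l;
            stop = mid - 1;
        } else {
            h = r;
            start = mid;
        }
    }
    return start;
}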
Some more remarks:
Beware of bugs in the above code; I have only proved it correct, not tried it.
Using more buckets (best keep to a power of 2 for easy and efficient handling) each iteration might be faster due to better data-locality (though only try and measure if it's not fast enough otherwise), as @MarkDickson rightly remarks.
Without those extra-pointers, you need full sweeps each iteration, raising the bound to O(n*lg n).
An alternative would be using 2+ extra-pointers per element to maintain a balanced tree. That would speed up id-search, at the expense of some memory and insertion/removal time overhead.

If you don't mind an O(n) scan for each change in the list and two extra bits per element, whenever an element is inserted or removed, scan through and use the two bits to represent whether an integer (element + 1) or (element - 1) exists in the list.
For example, inserting the element, 2, the extra bits for each 3 and 1 in the list would be updated to show that 3-1 (in the case of 3) and 1+1 (in the case of 1) now exist in the list.
Insertion/deletion time can be reduced by adding a pointer from each element to the next element with the same integer.

I am supposing that integers have random values not controlled by your code.
Add two unsigned integers in your list class:
unsigned int rangeMinId = 0;
unsigned int rangeMaxId = 0xFFFFFFFF ;
Or if not possible to change the List class add them as global variables.
When the list is empty you will always know that the range is free. When you add a new item to the list, check whether its ID is between rangeMinId and rangeMaxId, and if so, change the nearer of the two to this ID.
After a long time, rangeMinId may become equal to rangeMaxId-1; then you need a simple function which traverses the whole list and searches for another free range. But this will not happen very frequently.
Other solutions are more complex and involve using sets, binary trees or sorted arrays.
Update:
The free range search function can be done in O(n*log(n)). An example of such function is given below(I have not extensively tested it). The example is for integer array but easily can be adapted for a list.
int g_Calls = 0;
bool _findFreeRange(const int* value, int n, int& left, int& right)
{
g_Calls ++ ;
int l=left, r=right,l2,r2;
int m = (right + left) / 2 ;
int nl=0, nr=0;
for(int k = 0; k < n; k++)
{
const int& i = value[k] ;
if(i > l && i < r)
{
if(i-l < r-i)
l = i;
else
r = i;
}
if(i < m)
nl ++ ;
else
nr ++ ;
}
if ( (r - l) > 1 )
{
left = l;
right = r;
return true ;
}
if( nl < nr)
{
// check first left then right
l2 = left;
r2 = m;
if(r2-l2 > 1 && _findFreeRange(value, n, l2, r2))
{
left = l2 ;
right = r2 ;
return true;
}
l2 = m;
r2 = right;
if(r2-l2 > 1 && _findFreeRange(value, n, l2, r2))
{
left = l2 ;
right = r2 ;
return true;
}
}
else
{
// check first right then left
l2 = m;
r2 = right;
if(r2-l2 > 1 && _findFreeRange(value, n, l2, r2))
{
left = l2 ;
right = r2 ;
return true;
}
l2 = left;
r2 = m;
if(r2-l2 > 1 && _findFreeRange(value, n, l2, r2))
{
left = l2 ;
right = r2 ;
return true;
}
}
return false;
}
bool findFreeRange(const int* value, int n, int& left, int& right, int maxx)
{
g_Calls = 1;
left = 0;
right = maxx;
if(!_findFreeRange(value, n, left, right))
return false ;
left++;
right--;
return (right - left) >= 0 ;
}
If it returns false, the list is filled and there is no free range (very unlikely). maxx is the maximal limit of the range, in this case 0xFFFFFFFF.
The idea is first to search for the biggest free range in the list and then, if no free hole is found, to recursively search the subranges for holes which may have been missed during the first pass. If the list is sparsely filled, it is very unlikely that the function will be called more than once. However, when the list becomes almost completely filled, the range search can take longer. Thus, in this worst-case scenario, when the list becomes close to filled, it's better to start keeping all free ranges in a list.

This reminds me of the book Programming Pearls, and in particular the very first column, "Cracking the Oyster". What is the real problem you are trying to solve?
If your list is small, then a simple linear search to find max/min would work and it would work quickly.
When your list gets large and linear search becomes unwieldy, you can build a bitmap to represent the unused numbers for much less memory than adding 2 extra pointers at each node in the linked list. In fact, it would only be 2^(32-3) bytes = 512 MB of RAM, compared to your linked list being potentially >10GB.
Then to find an unused number, you can just traverse the bitmap one machine-word at a time, checking if it's non-zero. If it is, then at least one number in that 32- or 64- bit block is unused, and you can inspect the word to find out exactly which bit is set. As you add numbers to the list, all you have to do is clear the corresponding bit in the bitmap.
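Here is a small sketch of that bitmap, scaled down to a 16-bit universe so that it stays tiny; for the full 32-bit range only UNIVERSE_BITS (an assumed parameter) and the storage would change. A set bit means "unused", and all names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define UNIVERSE_BITS 16                      /* demo size; the real problem has 32 */
#define UNIVERSE (UINT32_C(1) << UNIVERSE_BITS)
#define WORDS (UNIVERSE / 32)

uint32_t free_map[WORDS];                     /* bit set = number is unused */

/* Start with every number marked as unused. */
void map_init(void)
{
    memset(free_map, 0xFF, sizeof free_map);
}

/* Clear the bit when a number is added to the list. */
void map_mark_used(uint32_t v)
{
    assert(v < UNIVERSE);
    free_map[v >> 5] &= ~(UINT32_C(1) << (v & 31));
}

/* Scan one word at a time; returns 1 and stores an unused number, or 0 if none. */
int map_find_free(uint32_t *out)
{
    for (uint32_t w = 0; w < WORDS; w++) {
        if (free_map[w] != 0) {
            for (uint32_t b = 0; b < 32; b++) {
                if (free_map[w] & (UINT32_C(1) << b)) {
                    *out = w * 32 + b;
                    return 1;
                }
            }
        }
    }
    return 0;
}
```

Removing an element would additionally require knowing that no other element carries the same number before setting its bit again.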

One possible solution is to take the min and max of the list with a simple O(n) iteration, then pick a number between max and min + (1 << 32). This is simple to do since overflow/underflow behavior is well-defined for unsigned integers:
uint32_t min, max;
// TODO: compute min and max here
// exclude max from choice space (min will be an exclusive upper bound)
max++;
uint32_t choice = rand32() % (min - max) + max; // where rand32 is a random unsigned 32-bit integer
Of course, if it doesn't need to be random, then you can just use one more than the maximum of the list.
Note: the only case where this fails is if min is 0 and max is UINT32_MAX (aka 4294967295).

Ok. Here is one really simple solution. Some of the answers have become too theoretical and complicated for optimization. If you need a quick solution do this:
1. In your List add a member:
unsigned int NextFreeId = 1;
Also add an std::set<unsigned int> ids.
When you add an item to the list, also add the integer to the set and keep track of the NextFreeId:
int insert(unsigned int id)
{
    ids.insert(id);
    if (NextFreeId == id) // will not happen too frequently
    {
        unsigned int nextid = id, previd = id;
        while (true)
        {
            // step outwards from id, stopping at either end of the range
            if (nextid < 0xFFFFFFFF) nextid++;
            if (previd > 0) previd--;
            if (nextid < 0xFFFFFFFF && !ids.count(nextid))
            {
                NextFreeId = nextid;
                break;
            }
            if (previd > 0 && !ids.count(previd))
            {
                NextFreeId = previd;
                break;
            }
            if (previd == 0 && nextid == 0xFFFFFFFF)
                break; // all the range is filled, there is no free id
        }
    }
    return 1;
}
Sets are very efficient at checking whether a value is contained, so the complexity will be O(log(N)). It is quick to implement. Also, the set is searched not every time but only when the NextFreeId gets taken. The list is not traversed at all.

A-star implementation with no path finding

I'm dealing with a task from the ai class that is the following
I need to use the A* algorithm
I have an n-digit display, each digit between 0 and 9
Every digit is identified as C_i, i = 0...n-1
Under every digit there is a button that the agent can press to modify its value.
Every time the agent presses button i, the digits change following 2 rules
1) C_i = (C_i + 1) % 10
2) C_((i+j) % n) = (C_((i+j) % n) + k) % 10
Rule 1) is known to the agent while rule 2) isn't. The constants j and k are unknown to the agent.
My goal is to reach a goal-state starting from a 00...0 Configuration.
The goal-state can be generated applying random actions by the agent, aka pressing buttons randomly, so I'm sure there is at least one way to solve the problem.
My difficulties are:
How do I represent an n-digits display as a node?
How do I choose a right heuristics?
I'm stuck and frustrated with this exam.
(Sorry for English mistakes, I'm italian!)

An A* algorithm is characterized mainly by:
The nodes forming the search tree.
The definition of the "neighbor" concept.
The definition of an optimistic heuristic for a node.
Nodes in the search tree
As you are putting the problem, it seems that each node of the search tree is a configuration of the n digits of the display. Note that this can be easily represented as an array of integers of size n, where each position of the array represents a digit in the display, and the value represents the actual value of the digit in the display. For example, a display of 5 digits showing 25847 could be represented by [2, 5, 8, 4, 7]. Easy, right?
"Neighbor" concept
That is, how do you change the display status? As you said, you can only use one of the n buttons below each digit. So, for every status, you will have exactly n possible "neighbors", or "descendants", if you think of the search tree. You will need a function that gives you the node resulting from pressing a particular button (which would simulate the agent pressing the button). Something like the following (in Java):
static int[] pressButton(int[] node, int button, int j, int k) {
int n = node.length;
int[] newNode = Arrays.copyOf(node, n);
newNode[button] = (newNode[button] + 1) % 10;
newNode[(button + j) % n] = (newNode[(button + j) % n] + k) % 10;
return newNode;
}
Now you have this, you can generate every child of the current node with something as simple as:
for (int i = 0; i < node.length; i++) {
int[] newNode = pressButton(node, i, j, k);
// compute heuristic of the new node
// add it to the A* priority queue
}
Optimistic heuristic
Now you just need a heuristic that provides an optimistic estimation of the distance from a given node to the goal. One simple idea would be to assume that, for a given node, you will have to press the buttons at least as many times as the number of digits that differ from the goal; for example, if the goal is 41243 and the current node is 37253, you will have to press a button at least three times, because there are three digits that differ from the goal. In Java:
static int heuristic(int[] currentNode, int[] goal) {
int h = 0;
for (int i = 0; i < goal.length; i++) {
if (goal[i] != currentNode[i]) {
h = h + 1;
}
}
return h;
}
Note, however, that this heuristic is wrong. For example, if the goal is 81730, your current node is 71230, j = 2 and k = 5, this heuristic would give a value of 2; however, we could reach the goal state from the current node in just 1 step by pressing button one, so the heuristic would be pessimistic in this case. This is because each time a button is pressed, two digits are affected, which could make us get to the solution faster than we thought. To avoid this, we could just subtract one from the heuristic (if it is bigger than one):
static int heuristic(int[] currentNode, int[] goal) {
int h = 0;
for (int i = 0; i < goal.length; i++) {
if (goal[i] != currentNode[i]) {
h = h + 1;
}
}
if (h > 1) {
h = h - 1;
}
return h;
}
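One caveat worth spelling out: subtracting one is only guaranteed to stay optimistic while at most two digits differ. For instance, with n = 4, j = 1 and k = 1, the display 0000 reaches 1111 in two presses (buttons 0 and 2), yet four digits differ and the estimate above would be 3. Since a press changes at most two digits, a bound that never overestimates is the number of differing digits divided by two, rounded up. A minimal sketch of that variant, written in C here with a hypothetical function name:

```c
#include <assert.h>

/* Lower bound on the presses still needed: each press changes at most two
   digits, so at least ceil(differing / 2) presses remain. */
int heuristic_pairs(const int *current, const int *goal, int n)
{
    int differing = 0;
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        if (current[i] != goal[i])
            differing++;
    }
    return (differing + 1) / 2;   /* ceil(differing / 2) */
}
```

For the goal 81730 and the node 71230 this gives 1, which matches the single press that actually solves it.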
More accurate heuristics could be defined based not only on the number of digits that differ from the goal, but also on how much they differ (the bigger the difference, the more times you need to press the button); however, I don't think there is an easy way to define an optimistic heuristic based on this (especially keeping in mind corner cases like j = 0 or j = n, negative k, etc.). If you are allowed to use the values of j and k in your heuristic (which I assumed not, because that's what I interpreted when you said that these are unknown to the agent), maybe there could be room for some sophistication, but even then, I would have a hard time trying to define it.
Finally, given the nature of the problem, it's very easy to reach already visited states when going from one node to another. If you want to keep your tree as an actual tree (and finite), and not as an infinite graph, you will have to use a set of already visited nodes, to avoid creating cycles in the search space.

Optimizing C loops

I'm new to C from many years of Matlab for numerical programming. I've developed a program to solve a large system of differential equations, but I'm pretty sure I've done something stupid as, after profiling the code, I was surprised to see three loops that were taking ~90% of the computation time, despite the fact they are performing the most trivial steps of the program.
My question is in three parts based on these expensive loops:
Initialization of an array to zero. When J is declared to be a double array are the values of the array initialized to zero? If not, is there a fast way to set all the elements to zero?
void spam(){
double J[151][151];
/* Other relevant variables declared */
calcJac(data,J,y);
/* Use J */
}
static void calcJac(UserData data, double J[151][151],N_Vector y)
{
/* The first expensive loop */
int iter, jter;
for (iter=0; iter<151; iter++) {
for (jter = 0; jter<151; jter++) {
J[iter][jter] = 0;
}
}
/* More code to populate J from data and y that runs very quickly */
}
During the course of solving I need to solve matrix equations defined by P = I - gamma*J. The construction of P is taking longer than solving the system of equations it defines, so something I'm doing is likely in error. In the relatively slow loop below, is accessing a matrix that is contained in a structure 'data' the slow component or is it something else about the loop?
for (iter = 1; iter<151; iter++) {
for(jter = 1; jter<151; jter++){
P[iter-1][jter-1] = - gamma*(data->J[iter][jter]);
}
}
Is there a best practice for matrix multiplication? In the loop below, Ith(v,iter) is a macro for getting the iter-th component of a vector held in the N_Vector structure 'v' (a data type used by the Sundials solvers). Particularly, is there a best way to get the dot product between v and the rows of J?
Jv_scratch = 0;
int iter, jter;
for (iter=1; iter<151; iter++) {
for (jter=1; jter<151; jter++) {
Jv_scratch += J[iter][jter]*Ith(v,jter);
}
Ith(Jv,iter) = Jv_scratch;
Jv_scratch = 0;
}

1) No, they're not. You can memset the array as follows:
memset( J, 0, sizeof( double ) * 151 * 151 );
or you can use an array initialiser:
double J[151][151] = { 0.0 };
2) Well, you are using a fairly complex calculation to compute the position in P and the position in J.
You may well get better performance by stepping through as pointers:
for (iter = 1; iter<151; iter++)
{
    /* start of row iter-1 of P, and column 1 of row iter of J */
    double* pP = &P[iter-1][0];
    double* pJ = &data->J[iter][1];
    for(jter = 1; jter<151; jter++, pP++, pJ++ )
    {
        *pP = - gamma * *pJ;
    }
}
This way you move much of the array index calculation outside of the inner loop.
3) The best practice is to try and move as many calculations out of the loop as possible. Much like I did on the loop above.
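Applied to the third loop, that advice could look roughly like the sketch below. It assumes plain double arrays instead of the Sundials N_Vector and the Ith macro, and keeps the question's convention of using indices 1 to 150; the function name matvec is made up.

```c
#include <assert.h>

#define DIM 151

/* out[i] = dot product of row i of J with v, for i = 1 .. DIM-1. */
void matvec(double J[DIM][DIM], const double *v, double *out)
{
    assert(v != 0 && out != 0);
    for (int i = 1; i < DIM; i++) {
        const double *row = J[i];   /* row address computed once per row */
        double sum = 0.0;           /* local accumulator, no reset afterwards */
        for (int j = 1; j < DIM; j++)
            sum += row[j] * v[j];
        out[i] = sum;
    }
}
```

An optimizing compiler will often produce the same machine code for the indexed version, so it is worth measuring before and after.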

First, I'd advise you to split up your question into three separate questions. It's hard to answer all three; I, for example, have not worked much with numerical analysis, so I'll only answer the first one.
First, variables on the stack are not initialized for you. But there are faster ways to initialize them. In your case I'd advise using memset:
static void calcJac(UserData data, double J[151][151],N_Vector y)
{
memset((void*)J, 0, sizeof(double) * 151 * 151);
/* More code to populate J from data and y that runs very quickly */
}
memset is a fast library routine to fill a region of memory with a specific pattern of bytes. It just so happens that setting all bytes of a double to zero sets the double to zero, so take advantage of your library's fast routines (which will likely be written in assembler to take advantage of things like SSE).

Others have already answered some of your questions. On the subject of matrix multiplication: it is difficult to write a fast algorithm for this unless you know a lot about cache architecture and so on (the slowness will be caused by the order in which you access array elements, which causes thousands of cache misses).
You can try Googling for terms like "matrix-multiplication", "cache", "blocking" if you want to learn about the techniques used in fast libraries. But my advice is to just use a pre-existing maths library if performance is key.

"Initialization of an array to zero. When J is declared to be a double array are the values of the array initialized to zero? If not, is there a fast way to set all the elements to zero?"
It depends on where the array is allocated. If it is declared at file scope, or as static, then the C standard guarantees that all elements are set to zero. The same is guaranteed if you set the first element to a value upon initialization, ie:
double J[151][151] = {0}; /* set first element to zero */
By setting the first element to something, the C standard guarantees that all other elements in the array are set to zero, as if the array were statically allocated.
Practically for this specific case, I very much doubt it will be wise to allocate 151*151*sizeof(double) bytes on the stack no matter which system you are using. You will likely have to allocate it dynamically, and then none of the above matters. You must then use memset() to set all bytes to zero.
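For the dynamic case, a minimal sketch could use calloc, which hands back memory with all bytes already zero (equivalent to malloc followed by memset). The Row typedef and the function name are illustrative assumptions, and "all bytes zero means 0.0" relies on the usual IEEE 754 representation of double.

```c
#include <stdlib.h>

#define DIM 151

typedef double Row[DIM];   /* one row of the matrix */

/* Allocate a DIM x DIM matrix on the heap with every byte set to zero.
   Returns NULL if the allocation fails; the caller must free() the result. */
Row *alloc_zeroed_matrix(void)
{
    return calloc(DIM, sizeof(Row));
}
```

The result can be indexed as J[i][j] and passed to a function declared with a double J[151][151] parameter.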
"In the relatively slow loop below, is accessing a matrix that is contained in a structure 'data' the slow component or is it something else about the loop?"
You should ensure that the function called from it is inlined. Otherwise there isn't much else you can do to optimize the loop: what is optimal is highly system-dependent (ie how the physical cache memories are built). It is best to leave such optimization to the compiler.
You could of course obfuscate the code with manual optimization things such as counting down towards zero rather than up, or to use ++i rather than i++ etc etc. But the compiler really should be able to handle such things for you.
As for matrix addition, I don't know of the mathematically most efficient way, but I suspect it is of minor relevance to the efficiency of the code. The big time thief here is the double type. Unless you really have need for high accuracy, I'd consider using float or int to speed up the algorithm.

Embedded C - How to create a cache for expensive external reads?

I am working with a microcontroller that has an external EEPROM containing tables of information.
There is a large amount of information, however there is a good chance that we will request the same information cycle to cycle if we are fairly 'stable' - i.e. if we are at a constant temperature for example.
Reads from the EEPROM take around 1ms, and we do around 30 per cycle. Our cycle is currently about 100ms so there is significant savings to be had.
I am therefore looking at implementing a RAM cache. A hit should be significantly faster than 1ms since the microcontroller core is running at 8 MHz.
The lookup involves a 16-bit address returning 16-bit data. The microcontroller is 32-bit.
Any input on caching would be greatly appreciated, especially if I am totally missing the mark and should be using something else, like a linked list, or even a pre-existing library.
Here is what I think I am trying to achieve:
-A cache made up of an array of structs. The struct would contain the address, data and some sort of counter indicating how often this piece of data has been accessed (readCount).
-The array would be sorted by address normally. I would have an efficient lookup() function to lookup an address and get the data (suggestions?)
-If I got a cache miss, I would sort the array by readCount to determine the least used cached value and throw it away. I would then fill its position with the new value I have looked up from EEPROM. I would then reorder the array by address. Any sorting would use an efficient sort (shell sort? - not sure how to handle this with arrays)
-I would somehow decrement all of the readCount variables so that they would tend to zero if not used. This should preserve constantly used variables.
Here are my thoughts so far (pseudocode, apologies for my coding style):
#define CACHE_SIZE 50
//one piece of data in the cache
struct cacheItem
{
uint16_t address;
uint16_t data;
uint8_t readCount;
};
//array of cached addresses
struct cacheItem cache[CACHE_SIZE];
//function to get data from the cache
uint16_t getDataFromCache(uint16_t address)
{
uint8_t cacheResult;
uint16_t data; //Value to return
struct cacheItem * cacheHit; //Pointer to a successful cache hit
//returns CACHE_HIT if in the cache, else returns CACHE_MISS
cacheResult = lookUpCache(address, &cacheHit); //pass the pointer by address so the lookup can set it
if(cacheResult == CACHE_MISS)
{
//Think this is necessary to easily weed out the least accessed address
sortCacheByReadCount();//shell sort?
removeLastCacheEntry(); //delete the last item that hasn't been accessed for a while
data = getDataFromEEPROM(address); //Expensive EEPROM read
//Add on to the bottom of the cache
appendToCache(address, data, 1); //1 = setting readCount to 1 for new addition
//Think this is necessary to make a lookup function faster
sortCacheByAddress(); //shell sort?
}
else
{
data = cacheHit->data; //We had a hit, so pull the data
cacheHit->readCount++; //Up the importance now
}
return data;
}
//Main function
main(void)
{
testData = getDataFromCache(1234);
}
Am I going down the completely wrong track here? Any input is appreciated.

Repeated sorting sounds expensive to me. I would implement the cache as a hash table on the address. To keep things simple, I would start by not even counting hits but rather evicting old entries immediately on seeing a hash collision:
#define CACHE_SIZE 32 // must be a power of two
struct CacheEntry {
uint16_t address;
uint16_t value;
uint8_t valid; // 0 until the slot has been filled from the EEPROM
};
struct CacheEntry cache[CACHE_SIZE];
// adjust shifts for different CACHE_SIZE
static inline unsigned cacheIndex(uint16_t adr) { return (((adr>>10)+(adr>>5)+adr)&(CACHE_SIZE-1)); }
uint16_t cachedRead( uint16_t address )
{
struct CacheEntry * pCache = cache + cacheIndex( address );
// miss: the slot is empty or currently holds a different address
if( !pCache->valid || address != pCache->address ) {
pCache->value = readEeprom( address );
pCache->address = address;
pCache->valid = 1;
}
return pCache->value;
}
If this proves not effective enough, I would start by fiddling around with the hash function.

Don't be afraid to do more computations, in most cases I/O is slower.
This is the simplest implementation I can think of:
#define CACHE_SIZE 50
something cached_vals[CACHE_SIZE];
short int cached_item_num[CACHE_SIZE];
unsigned char cache_hits[CACHE_SIZE]; // 0 means free.
void inc_hits(int index){
if (cache_hits[index] > 127){
// halve every used counter so that none of them overflows
for (int i = 0; i < CACHE_SIZE; i++)
if (cache_hits[i])
cache_hits[i] = (cache_hits[i] >> 1) + 1; // 0 is reserved as "free" marker
}
cache_hits[index]++;
}
int get_new_space(short int item){
for (int i = 0; i < CACHE_SIZE; i++)
if (!cache_hits[i]) {
cached_item_num[i] = item; // remember which item now lives in this slot
inc_hits(i);
return i;
}
// no free values, dropping the one with lowest count
int min_idx = 0;
for (int i = 1; i < CACHE_SIZE; i++)
if (cache_hits[i] < cache_hits[min_idx])
min_idx = i;
cache_hits[min_idx] = 2; // just to give new values more chances to "survive"
cached_item_num[min_idx] = item;
return min_idx;
}
something* get_item(short int item){
for (int i = 0; i < CACHE_SIZE; i++){
if (cache_hits[i] && cached_item_num[i] == item){ // skip free slots
inc_hits(i);
return cached_vals + i;
}
}
int new_item = get_new_space(item);
read_from_eeprom(item, cached_vals + new_item);
return cached_vals + new_item;
}

Sorting and moving data seems like a bad idea, and it's not clear you gain anything useful from it.
I'd suggest a much simpler approach. Allocate 4*N (for some N) bytes of data, as an array of 4-byte structs each containing an address and the data. To look up a value at address A, you look at the struct at index A mod N; if its stored address is the one you want, then use the associated data, otherwise look up the data off the EEPROM and store it there along with address A. Simple, easy to implement, easy to test, and easy to understand and debug later.
If the location of your current lookup tends to be near the location of previous lookups, that should work quite well -- any time you're evicting data, it's going to be from at least N locations away in the table, which means you're probably not likely to want it again any time soon -- I'd guess that's at least as good a heuristic as "how many times did I recently use this". (If your EEPROM is storing several different tables of data, you could probably just do a cache for each one as the simplest way to avoid collisions there.)
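A rough sketch of that direct-mapped scheme, under a few assumptions: `CACHE_SLOTS` plays the role of N, `eeprom_read` is a hypothetical stand-in for the real driver call (here it just fabricates data and counts calls), and a separate `slot_valid` flag is added so that an empty slot is never mistaken for a cached address 0. That flag is extra to the 4-byte struct described above.

```c
#include <stdint.h>

#define CACHE_SLOTS 64u   /* N: assumed size, tune to the RAM you can spare */

struct slot {
    uint16_t address;
    uint16_t data;
};

static struct slot slots[CACHE_SLOTS];
static uint8_t slot_valid[CACHE_SLOTS];  /* 0 = slot never filled */
static unsigned eeprom_reads;            /* counts simulated slow reads */

/* Hypothetical stand-in for the real (about 1 ms) EEPROM read. */
static uint16_t eeprom_read(uint16_t address)
{
    eeprom_reads++;
    return (uint16_t)(address ^ 0xA5A5u);  /* fake table contents */
}

/* Look at slot (address mod N); refill it from the EEPROM on a miss. */
uint16_t cached_read(uint16_t address)
{
    unsigned i = address % CACHE_SLOTS;

    if (!slot_valid[i] || slots[i].address != address) {
        slots[i].address = address;
        slots[i].data = eeprom_read(address);
        slot_valid[i] = 1;
    }
    return slots[i].data;
}
```

Two addresses that differ by a multiple of `CACHE_SLOTS` evict each other, which is the trade-off this approach accepts in exchange for a lookup with no searching or sorting.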

You said that which entry you need from the table relates to the temperature, and that the temperature tends to remain stable. As long as the temperature does not change too quickly, it is unlikely that you will need an entry from the table which is more than 1 entry away from the previously needed entry.
You should be able to accomplish your goal by keeping just 3 entries in RAM. The first entry is the one you just used. The next entry is the one corresponding to the temperature just below the last temperature measurement, and the other one is the temperature just above the last temperature measurement. When the temperature changes, one of these entries probably becomes the new current one. You can then perform whatever task it is you need using this data, and then go ahead and read the entry you need (higher or lower than the current temperature) after you have finished other work (before reading the next temperature measure).
Since there are only 3 entries in RAM at a time you don't have to be clever about what data structure you need to store them in to access them efficiently, or even keeping them sorted because it will never be that long.
If temperatures can move faster than 1 unit per examination period then you could just increase the size of your cache and maybe have a few more anticipatory entries (in the direction that temperature seems to be heading) than you do trailing entries. Then you may want to store the entries in an efficient structure, though. I wouldn't worry about how recently you accessed an entry, though, because next temperature probability distribution predictions based on current temperature will usually be pretty good. You will need to make sure you handle the case where you are way off and need to read in the entry for a just read temperature immediately, though.
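Here is one way the three-entry window could look, as a sketch only. It assumes that neighbouring temperatures map to neighbouring table indices, that the table has `TABLE_LEN` entries, and that `eeprom_read` is a hypothetical stand-in for the real driver (it fabricates data and counts calls). `window_read` serves the current request; `window_recenter` is meant to be called later, once other work is done, to fetch the one missing neighbour.

```c
#include <stdint.h>

#define TABLE_LEN 256u   /* assumed number of table entries */

static unsigned eeprom_reads;    /* counts simulated slow reads */

/* Hypothetical stand-in for the slow EEPROM read. */
static uint16_t eeprom_read(uint16_t index)
{
    eeprom_reads++;
    return (uint16_t)(index * 3u);   /* fake table contents */
}

static uint16_t win[3];      /* entries center-1, center, center+1 */
static uint16_t win_center;  /* table index held in win[1] */
static int win_loaded;

/* Keep the whole window inside the table. */
static uint16_t clamp_center(uint16_t index)
{
    if (index < 1u) return 1u;
    if (index > TABLE_LEN - 2u) return (uint16_t)(TABLE_LEN - 2u);
    return index;
}

static void window_load(uint16_t center)
{
    for (int k = 0; k < 3; k++)
        win[k] = eeprom_read((uint16_t)(center - 1 + k));
    win_center = center;
    win_loaded = 1;
}

/* Return the entry for index; only a jump outside the window
   costs EEPROM reads (a full reload of three entries). */
uint16_t window_read(uint16_t index)
{
    if (index >= TABLE_LEN) index = (uint16_t)(TABLE_LEN - 1u);
    int off = (int)index - (int)win_center + 1;
    if (!win_loaded || off < 0 || off > 2) {
        window_load(clamp_center(index));
        off = (int)index - (int)win_center + 1;
    }
    return win[off];
}

/* Call when idle: slide the window so index sits in the middle,
   reading only the single new neighbour if it moved by one. */
void window_recenter(uint16_t index)
{
    uint16_t c = clamp_center(index);
    if (!win_loaded) { window_load(c); return; }
    if (c == win_center + 1) {
        win[0] = win[1]; win[1] = win[2];
        win[2] = eeprom_read((uint16_t)(c + 1));
        win_center = c;
    } else if (c + 1 == win_center) {
        win[2] = win[1]; win[1] = win[0];
        win[0] = eeprom_read((uint16_t)(c - 1));
        win_center = c;
    } else if (c != win_center) {
        window_load(c);
    }
}
```

A wider window for faster-moving temperatures would follow the same pattern with a larger array.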

Here are my suggestions:
A replace-oldest or replace-least-recently-used policy would be better, because replacing the least-accessed entry would quickly fill up the cache and then just repeatedly replace the last element.
Do not traverse the whole array; instead, pick a few pseudo-random locations (seeded by the address) as replacement candidates. (The special case of a single location has already been presented by @ruslik.)
My idea would be:
#include <stdint.h>
#include <stdlib.h>
#define CACHE_SIZE 50
//one piece of data in the cache
struct cacheItem
{
uint16_t address;
uint16_t data;
uint8_t whenWritten;
};
//array of cached addresses
struct cacheItem cache[CACHE_SIZE];
// circular cache write counter
uint8_t writecount = 0;
// this suggests a cache location that either contains the actual data or is to be rewritten
struct cacheItem *cacheLocation(uint16_t address) {
struct cacheItem *bestc = NULL, *c;
int bestage = -1, age, i;
srand(address); // I'll use the standard PRNG to pick locations; seeded this way
// it will always give the same sequence for the same address
for(i = 0; i<4; i++) { // any number of iterations you find best
c = &(cache[rand()%CACHE_SIZE]);
if(c->address == address) return c; // FOUND!
age = (writecount - c->whenWritten) & 0xFF; // after age 255 comes age 0 :(
if(age > bestage) {
bestage = age;
bestc = c;
}
}
return bestc; // not found: hand back the oldest of the probed locations
}
....
struct cacheItem *c = cacheLocation(addr);
if(c->address != addr) {
c->address = addr;
c->data = external_read(addr);
c->whenWritten = ++writecount;
}
The cache age wraps from 255 back to 0, but that only slightly randomizes cache replacements, so I did not add a workaround for it.