Optimizing a word parser in C

Context:
I have a code/text editor that I'm trying to optimize. Currently, the bottleneck of the program is the language parser that scans all the keywords (there is more than one parser, but they're all written much the same way).
On my computer, the editor starts to lag on files around 1,000,000 lines of code. On lower-end computers, like a Raspberry Pi, the delay starts happening much sooner (I don't remember exactly, but I think around 10,000 lines of code). And although I've never quite seen documents larger than 1,000,000 lines of code, I'm sure they're out there, and I want my program to be able to edit them.
Question:
This leads me to the question: what's the fastest way to scan for a list of words within a large, dynamic string?
Here's some information that may affect the design of the algorithm:
the keywords
the qualifying characters allowed to be part of a keyword (I call them qualifiers)
the large string
Bottleneck-solution:
This is (roughly) the method I'm currently using to parse strings:
// this is just an example, not an excerpt
// I haven't compiled this, I'm just writing it to
// illustrate how I'm currently parsing strings
struct tokens * scantokens (char * string, char ** tokens, int tcount){
    int result = 0;
    struct tokens * tks = tokens_init ();
    for (int i = 0; string[i]; i++){
        // qualifiers for C are: a-z, A-Z, 0-9, and underscore
        // if it isn't a qualifier, skip it (without running past the terminator)
        while (string[i] && isnotqualifier (string[i])) i++;
        if (!string[i]) break;
        result = 0;
        for (int j = 0; j < tcount; j++){
            // returns 0 for no match
            // returns the length of the keyword if they match
            result = string_compare (&string[i], tokens[j]);
            if (result > 0){ // if the string matches
                token_push (tks, i, i + result); // add the token
                // token_push (data_struct, where_it_begins, where_it_ends)
                break;
            }
        }
        if (result > 0){
            i += result - 1; // -1: the for loop increments i once more
        } else {
            // skip the rest of this identifier; the loop header then skips
            // to the beginning of the next qualifier
            /* ie, go from:
               'some_id + sizeof (int)'
                ^
               to here:
               'some_id + sizeof (int)'
                       ^
            */
            while (string[i + 1] && !isnotqualifier (string[i + 1])) i++;
        }
    }
    if (!tks->len){
        free (tks);
        return 0;
    } else return tks;
}
Possible Solutions:
Contextual Solutions:
I'm considering the following:
Scan the large string once, and add a function that evaluates/adjusts the token markers every time there is user input (instead of re-scanning the entire document over and over). I expect this to fix the bottleneck, because there is much less parsing involved. But it doesn't completely fix the problem, because the initial scan may still take a really long time.
Optimize token-scanning algorithm (see below)
I've also considered, but have rejected, these optimizations:
Scanning only the code that is on the screen. Although this would fix the bottleneck, it would prevent finding user-defined tokens (i.e., variable names, function names, macros) that appear earlier than where the screen starts.
Switching the text to a linked list (a node per line), rather than a monolithic array. This doesn't really help the bottleneck. Although insertions/deletions would be quicker, the loss of indexed access slows down the parser. I also think that a monolithic array is more likely to be cached than a broken-up list.
Hard-coding a scan-tokens function for every language. Although this could be the best optimization for performance, it doesn't seem practical from a software-development point of view.
Architectural solution:
With assembly language, a quicker way to parse these strings would be to load characters into registers and compare them 4 or 8 bytes at a time. There are some additional measures and precautions that would have to be taken into account, such as:
Does the architecture support unaligned memory access?
All strings would have to be of size s, where s % word-size == 0, to prevent read violations
Others?
But these issues seem like they can be easily fixed. The only problem (other than the usual ones that come with writing in assembly language) is that it's not so much an algorithmic solution as a hardware solution.
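For illustration, here is a portable sketch of that word-at-a-time comparison (the equal_words name and the memcpy approach are mine, not part of the question; memcpy sidesteps the unaligned-access issue because compilers lower it to a single load):

#include <stdint.h>
#include <string.h>

/* compare two buffers 8 bytes at a time; both buffers must be padded to a
   multiple of 8 bytes, per the size constraint above */
static int equal_words(const char *a, const char *b, size_t padded_len)
{
    for (size_t i = 0; i < padded_len; i += 8){
        uint64_t wa, wb;
        memcpy(&wa, a + i, 8);
        memcpy(&wb, b + i, 8);
        if (wa != wb)
            return 0;
    }
    return 1;
}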
Algorithmic Solution:
So far, I've considered having the program rearrange the list of keywords to make a binary search algorithm more feasible.
One way I've thought about rearranging them is by switching the dimensions of the list of keywords. Here's an example of that in C:
// some keywords for the C language
auto // keywords[0]
break // keywords[1]
case char const continue // keywords[2], keywords[3], keywords[4]
default do double
else enum extern
float for
goto
if int
long
register return
short signed sizeof static struct switch
typedef
union unsigned
void volatile
while
/* keywords[i] refers to the i-th keyword in the list */
Switching the dimensions of the two-dimensional array would make it look like this:
0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3
1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2
-----------------------------------------------------------------
1 | a b c c c c d d d e e e f f g i i l r r s s s s s s t u u v v w
2 | u r a h o o e o o l n x l o o f n o e e h i i t t w y n n o o h
3 | t e s a n n f u s u t o r t t n g t o g z a r i p i s i l i
4 | o a e r s t a b e m e a o g i u r n e t u t e o i d a l
5 | k i u l r t s r t e o i c c d n g t e
6 | n l e n t n d f c t h e n i
7 | u t e f e l
8 | e r d e
// note that, now, keywords[0] refers to the string "abccccdddeeeffgiilrrsssssstuuvvw"
This makes it more efficient to use a binary search algorithm (or even a plain brute-force algorithm). But it only works for the first character of each keyword; after that, nothing can be considered 'sorted'. This may help with a small set of words, like a programming language's keywords, but it wouldn't be enough for a larger set of words (like the entire English language).
Is there more that can be done to improve this algorithm?
Is there another approach that can be taken to increase performance?
Notes:
This question from SO doesn't help me. The Boyer-Moore-Horspool algorithm (as I understand it) is an algorithm for finding a sub-string within a string. Since I'm parsing for multiple strings, I think there's much more room for optimization.

Aho-Corasick is a very cool algorithm but it's not ideal for keyword matches, because keyword matches are aligned; you can't have overlapping matches because you only match a complete identifier.
For the basic identifier lookup, you just need to build a trie out of your keywords (see note below).
Your basic algorithm is fine: find the beginning of the identifier, and then see if it's a keyword. It's important to improve both parts. Unless you need to deal with multibyte characters, the fastest way to find the beginning of a keyword is to use a 256-entry table, with one entry for each possible character. There are three possibilities:
The character can not appear in an identifier. (Continue the scan)
The character can appear in an identifier but no keyword starts with the character. (Skip the identifier)
The character can start a keyword. (Start walking the trie; if the walk cannot be continued, skip the identifier. If the walk finds a keyword and the next character cannot be in an identifier, skip the rest of the identifier; if it can be in an identifier, try continuing the walk if possible.)
Actually steps 2 and 3 are close enough together that you don't really need special logic.
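Here is a minimal sketch of that table-plus-trie scan; all of the names (cls, trie_node, and so on) and the flat per-character trie layout are mine, not the answer's:

#include <stdio.h>
#include <stdlib.h>

typedef struct trie_node {
    struct trie_node *child[128];   /* ASCII only, per the assumption above */
    int is_keyword;
} trie_node;

static trie_node *trie_new(void){ return calloc(1, sizeof(trie_node)); }

static void trie_add(trie_node *root, const char *word){
    for (; *word; word++){
        if (!root->child[(unsigned char)*word])
            root->child[(unsigned char)*word] = trie_new();
        root = root->child[(unsigned char)*word];
    }
    root->is_keyword = 1;
}

enum { SKIP, IDENT, KWSTART };      /* the three possibilities above */
static unsigned char cls[256];

static void init_classes(const char **kw, int n){
    for (int c = 0; c < 256; c++)
        cls[c] = (c == '_' || (c >= '0' && c <= '9') ||
                  (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) ? IDENT : SKIP;
    for (int i = 0; i < n; i++)
        cls[(unsigned char)kw[i][0]] = KWSTART;
}

static void scan(const char *s, trie_node *root){
    for (size_t i = 0; s[i]; ){
        if (cls[(unsigned char)s[i]] == SKIP){ i++; continue; }   /* case 1 */
        size_t start = i;
        while (s[i] && cls[(unsigned char)s[i]] != SKIP) i++;     /* span the identifier */
        if (cls[(unsigned char)s[start]] == KWSTART){             /* case 3 */
            trie_node *t = root;
            for (size_t j = start; j < i && t; j++)
                t = t->child[(unsigned char)s[j]];
            if (t && t->is_keyword)  /* the walk consumed the whole identifier */
                printf("keyword at %zu..%zu\n", start, i);
        }
        /* case 2, or a failed walk: the identifier was skipped as a unit */
    }
}

int main(void){
    const char *kw[] = { "if", "int", "sizeof", "while" };
    trie_node *root = trie_new();
    for (int i = 0; i < 4; i++) trie_add(root, kw[i]);
    init_classes(kw, 4);
    scan("some_id + sizeof (int)", root);   /* finds sizeof and int, not some_id */
    return 0;
}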
There is some imprecision with the above algorithm because there are many cases where you find something that looks like an identifier but which syntactically cannot be. The most common cases are comments and quoted strings, but most languages have other possibilities. For example, in C you can have hexadecimal floating point numbers; while no C keyword can be constructed just from [a-f], a user-supplied word might be:
0x1.deadbeef
On the other hand, C++ allows user-defined numeric suffixes, which you might well want to recognize as keywords if the user adds them to the list:
274_myType
Beyond all of the above, it's really impractical to parse a million lines of code every time the user types a character in an editor. You need to develop some way of caching tokenization, and the simplest and most common one is to cache by input line. Keep the input lines in a linked list, and with every input line also record the tokenizer state at the beginning of the line (i.e., whether you're in a multi-line quoted string; a multi-line comment, or some other special lexical state). Except in some very bizarre languages, edits cannot affect the token structure of lines preceding the edit, so for any edit you only need to retokenize the edited line and any subsequent lines whose tokenizer state has changed. (Beware of working too hard in the case of multi-line strings: it can create lots of visual noise to flip the entire display because the user types an unterminated quote.)
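As a concrete sketch of that cache (the types and names here are illustrative; tokenize_line() is assumed to rebuild a line's token list and return the lexical state at the end of the line):

enum lex_state { LEX_NORMAL, LEX_IN_COMMENT, LEX_IN_STRING };

struct line {
    char *text;
    enum lex_state state_at_start;  /* tokenizer state entering this line */
    struct tokens *tokens;          /* cached tokens for this line */
    struct line *next;
};

enum lex_state tokenize_line(struct line *ln);  /* assumed to exist */

/* after an edit, retokenize the edited line, then continue only while a
   line's recorded entry state differs from the state we just computed */
void retokenize_from(struct line *edited)
{
    enum lex_state state = edited->state_at_start;
    struct line *ln = edited;
    do {
        ln->state_at_start = state;
        state = tokenize_line(ln);
        ln = ln->next;
    } while (ln && state != ln->state_at_start);
}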
Note: For smallish (hundreds) numbers of keywords, a full trie doesn't really take up that much space, but at some point you need to deal with bloated branches. One very reasonable datastructure, which can be made to perform very well if you're careful about data layout, is a ternary search tree (although I'd call it a ternary search trie.)
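For reference, the lookup in a ternary search trie is only a few lines; the node layout below is my own sketch:

struct tst {
    char split;                 /* character compared at this node */
    struct tst *lo, *eq, *hi;   /* <, ==, > children */
    int is_keyword;             /* set on the node that ends a keyword */
};

/* does word[0..len) spell a keyword? (len must be at least 1) */
int tst_contains(const struct tst *t, const char *word, int len)
{
    int i = 0;
    while (t){
        if (word[i] < t->split)      t = t->lo;
        else if (word[i] > t->split) t = t->hi;
        else {
            if (++i == len) return t->is_keyword;
            t = t->eq;
        }
    }
    return 0;
}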

It will be hard to beat this code.
Suppose your keywords are "a", "ax", and "foo".
Take the list of keywords, sorted, and feed it into a program that prints out code like this:
switch(pc[0]){
break; case 'a':{
    if (0){
    } else if (strncmp(pc, "a", 1)==0 && !alphanum(pc[1])){
        // push "a"
        pc += 1;
    } else if (strncmp(pc, "ax", 2)==0 && !alphanum(pc[2])){
        // push "ax"
        pc += 2;
    }
}
break; case 'f':{
    if (0){
    } else if (strncmp(pc, "foo", 3)==0 && !alphanum(pc[3])){
        // push "foo"
        pc += 3;
    }
    // etc. etc.
}
// etc. etc.
}
Then if you don't see a keyword, just increment pc and try again.
The point is, by dispatching on the first character, you quickly get into the subset of keywords starting with that character.
You might even want to go to two levels of dispatch.
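The generator itself can be tiny. Here is a toy sketch (it hard-codes the sorted keyword list where a real generator would read it from a file):

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *kw[] = { "a", "ax", "foo" };   /* must be sorted */
    int n = sizeof kw / sizeof kw[0];

    puts("switch(pc[0]){");
    for (int i = 0; i < n; ){
        char c = kw[i][0];
        printf("break; case '%c':{\n    if (0){\n", c);
        /* one else-if per keyword sharing this first character */
        for (; i < n && kw[i][0] == c; i++){
            size_t len = strlen(kw[i]);
            printf("    } else if (strncmp(pc, \"%s\", %zu)==0 && !alphanum(pc[%zu])){\n",
                   kw[i], len, len);
            printf("        // push \"%s\"\n        pc += %zu;\n", kw[i], len);
        }
        puts("    }\n}");
    }
    puts("}");
    return 0;
}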
Of course, as always, take some stack samples to see what the time is being used for.
Regardless, if you have data structure classes, you're going to find that consuming a large part of your time, so keep that to a minimum (throw religion to the wind :)

The fastest way to do it would be a finite state machine built for the word set. Use Lex to build the FSM.
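For instance, a minimal flex sketch (the keyword list is abbreviated, and %option noyywrap is flex-specific):

%{
#include <stdio.h>
%}
%option noyywrap
%%
"auto"|"break"|"case"|"if"|"int"|"while"  { printf("keyword: %s\n", yytext); }
[a-zA-Z_][a-zA-Z0-9_]*                    { /* ordinary identifier: ignore */ }
.|\n                                      { /* anything else: ignore */ }
%%
int main(void) { return yylex(); }

Because (f)lex always takes the longest match, the identifier rule automatically wins for words like ifdef, which handles the keyword-boundary problem for free.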

The best algorithm for this problem is probably Aho-Corasick. There already exist C implementations, e.g.,
http://sourceforge.net/projects/multifast/

Related

Dynamically indexing an array in C

Is it possible to create arrays based on their index, as in
int x = 4;
int y = 5;
int someNr = 123;
int foo[x][y] = someNr;
dynamically/on the run, without creating foo[0...3][0...4]?
If not, is there a data structure that allow me to do something similar to this in C?
No.
As written, your code makes no sense at all. You need foo to be declared somewhere, and then you can index into it with foo[x][y] = someNr;. But you can't just make foo spring into existence, which is what it looks like you are trying to do.
Either create foo with the correct sizes (only you can say what they are), int foo[16][16]; for example, or use a different data structure.
In C++ you could use a map<pair<int, int>, int>.
Variable Length Arrays
Even if x and y were replaced by constants, you could not initialize the array using the notation shown. You'd need to use:
int fixed[3][4] = { someNr };
or similar (extra braces, perhaps; more values perhaps). You can, however, declare/define variable length arrays (VLA), but you cannot initialize them at all. So, you could write:
int x = 4;
int y = 5;
int someNr = 123;
int foo[x][y];
for (int i = 0; i < x; i++)
{
for (int j = 0; j < y; j++)
foo[i][j] = someNr + i * (x + 1) + j;
}
Obviously, you can't use x and y as indexes without writing (or reading) outside the bounds of the array. The onus is on you to ensure that there is enough space on the stack for the values chosen as the limits on the arrays (it won't be a problem at 3x4; it might be at 300x400 though, and will be at 3000x4000). You can also use dynamic allocation of VLAs to handle bigger matrices.
VLA support is mandatory in C99, optional in C11 and C18, and non-existent in strict C90.
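For the bigger matrices, a pointer to a VLA gives you the same foo[i][j] syntax with heap storage; a sketch (assuming C99, or C11 with VLA support enabled):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int x = 3000, y = 4000;
    int (*foo)[y] = malloc(x * sizeof *foo);   /* one block, indexed as foo[i][j] */
    if (foo == NULL)
        return 1;
    foo[x - 1][y - 1] = 123;
    printf("%d\n", foo[x - 1][y - 1]);
    free(foo);
    return 0;
}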
Sparse arrays
If what you want is 'sparse array support', there is no built-in facility in C that will assist you. You have to devise (or find) code that will handle that for you. It can certainly be done; Fortran programmers used to have to do it quite often in the bad old days when megabytes of memory were a luxury and MIPS meant millions of instructions per second and people were happy when their computer could do double-digit MIPS (and the Fortran 90 standard was still years in the future).
You'll need to devise a structure and a set of functions to handle the sparse array. You will probably need to decide whether you have values in every row, or whether you only record the data in some rows. You'll need a function to assign a value to a cell, and another to retrieve the value from a cell. You'll need to think what the value is when there is no explicit entry. (The thinking probably isn't hard. The default value is usually zero, but an infinity or a NaN (not a number) might be appropriate, depending on context.) You'd also need a function to allocate the base structure (would you specify the maximum sizes?) and another to release it.
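One minimal shape such code could take (the list-of-cells layout and the default_value field are just one illustrative choice, not the only reasonable design):

#include <stdlib.h>

struct cell { int row, col; double value; struct cell *next; };
struct sparse { struct cell *head; double default_value; };

/* retrieve a cell, falling back to the default for absent entries */
double sparse_get(const struct sparse *m, int row, int col)
{
    for (const struct cell *c = m->head; c; c = c->next)
        if (c->row == row && c->col == col)
            return c->value;
    return m->default_value;
}

/* assign a cell, creating it on first use */
void sparse_set(struct sparse *m, int row, int col, double value)
{
    for (struct cell *c = m->head; c; c = c->next)
        if (c->row == row && c->col == col){ c->value = value; return; }
    struct cell *c = malloc(sizeof *c);
    if (c == NULL) return;   /* out of memory: silently drop in this sketch */
    *c = (struct cell){ row, col, value, m->head };
    m->head = c;
}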
The most efficient way to create a dynamic index of an array is to create an empty array of the same data type as the one held by the array to be indexed.
Let's imagine we are using integers for the sake of simplicity. You can then stretch the concept to any other data type.
The ideal index depth will depend on the length of the data to index; the index length will be somewhere close to the length of the data.
Let's say you have 1 million 64-bit integers in the array to index.
First of all, you should sort the data and eliminate duplicates. That's easy to achieve by using qsort() (the C standard library's sorting function) and a remove-duplicates function such as
uint64_t remove_dupes(char **unord_arr, char **ord_arr, uint64_t arr_size)
{
uint64_t i, j=0;
for (i=1;i<arr_size;i++)
{
if ( strcmp(unord_arr[i], unord_arr[i-1]) != 0 ){
strcpy(ord_arr[j],unord_arr[i-1]);
j++;
}
if ( i == arr_size-1 ){
strcpy(ord_arr[j],unord_arr[i]);
j++;
}
}
return j;
}
Adapt the code above to your needs; you should free() the unordered array once the function has finished copying it into the ordered array. The function above is very fast, though it will return zero entries when the array to order contains a single element, but that's probably something you can live with.
Once the data is sorted and unique, create an index with a length close to that of the data. It does not need to be of an exact length, although sticking to powers of 10 will make everything easier, in the case of integers.
uint64_t* idx = calloc(pow(10, indexdepth), sizeof(uint64_t));
This will create an empty index array.
Then populate the index. Traverse the array to be indexed just once, and every time you detect a change in the leading significant figures (as many of them as the index depth), record the position where that new prefix was first detected.
If you choose an indexdepth of 2 you will have 10² = 100 possible values in your index, typically going from 0 to 99.
When you detect that some number starts with 10 (e.g. 103456), you add an entry to the index. Let's say 103456 was detected at position 733; your index entry would be:
index[10] = 733;
The next entry, beginning with 11, should be added in the next index slot. Let's say the first number beginning with 11 is found at position 2023:
index[11] = 2023;
And so on.
When you later need to find some number in your original array of 1 million entries, you don't have to iterate the whole array; you just check where in your index the numbers sharing its first two significant digits start. Entry index[10] tells you where the first number starting with 10 is stored. You can then iterate forward until you find your match.
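Building that index in C might look like this sketch (prefix() is my own naive helper; note it looks only at leading digits, so 10 and 100001 share a slot, which is exactly the caveat discussed below):

#include <stdint.h>
#include <stdlib.h>

/* first two significant digits of v */
static int prefix(uint64_t v)
{
    while (v >= 100) v /= 10;
    return (int)v;
}

/* data must already be sorted and de-duplicated; returns a 100-entry index
   where idx[p] holds the position of the first value whose leading digits
   are p, or -1 when no value has that prefix */
int64_t *build_index(const uint64_t *data, uint64_t n)
{
    int64_t *idx = malloc(100 * sizeof *idx);
    if (idx == NULL) return NULL;
    for (int p = 0; p < 100; p++) idx[p] = -1;
    for (uint64_t pos = 0; pos < n; pos++){
        int p = prefix(data[pos]);
        if (idx[p] == -1)
            idx[p] = (int64_t)pos;
    }
    return idx;
}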
In my example I employed a small index, thus the average number of iterations you will need to perform is 1000000/100 = 10000.
If you enlarge your index to somewhere close to the length of the data, the number of iterations will tend to 1, making any search blazing fast.
What I like to do is write a small routine that works out the ideal depth of the index from the type and length of the data to index.
Please note that in the example I have posed, 64-bit numbers are indexed by their first index-depth significant figures, thus 10 and 100001 will be stored in the same index segment. That's not a problem on its own; nonetheless, each master has his small book of secrets. Treating the numbers as fixed-length hexadecimal strings can help keep a strict numerical order.
You don't have to change the base, though: you could consider 10 to be 0000010 to keep it in the 00 index segment and keep base-10 numbers ordered. Using different numerical bases is in any case trivial in C, which is of great help for this task.
As you make your index depth larger, the number of entries per index segment shrinks.
Please do note that programming, especially at a lower level like C, consists in great part in understanding the trade-off between CPU cycles and memory use.
Creating the proposed index is a way to reduce the number of CPU cycles required to locate a value, at the cost of using more memory as the index becomes larger. This is nonetheless the way to go nowadays, as massive amounts of memory are cheap.
As SSD speeds get closer to that of RAM, using files to store indexes is worth taking into account. Nevertheless, modern OSs tend to load into RAM as much as they can, so using files would end up performing much the same.

String expansion in OpenCL

I have a simple task of expanding the string FX according to the following rules:
X -> X+YF+
Y -> -FX-Y
In OpenCL, string manipulation is not supported, but the use of an array of characters is. What would a kernel program that expands this string in parallel look like in OpenCL?
More details:
Consider the expansion of 'FX' in the python code below.
axiom = "FX"
def expand(s):
switch = {
"X": "X+YF+",
"Y": "-FX-Y",
}
return switch.get(s, s)
def expand_once(string):
return [expand(c) for c in string]
def expand_n(s, n):
for i in range(n):
s = ''.join(expand_once(s))
return s
expanded = expand_n(axiom, 200)
The result expanded will be the result of expanding the axiom 'FX' 200 times. This is a rather slow process, hence the need to do it in OpenCL for parallelization.
This process results in a string of characters which I will then use to draw a dragon curve.
Below is an example of how I would draw such a dragon curve. This part is not of much importance; the expansion in OpenCL is the crucial part.
import turtles
from PIL import Image

turtles.setposition(5000, 5000)
turtles.left(90)  # Go up to start.
for c in expanded:
    if c == "F":
        turtles.forward(10)
    elif c == "-":
        turtles.left(90)
    elif c == "+":
        turtles.right(90)

# Write out the image.
im = Image.fromarray(turtles.canvas)
im.save("dragon_curve.jpg")
Recursive algorithms like this don't especially lend themselves to GPU acceleration, especially as the data set changes its size on each iteration.
If you do really need to do this iteratively, the challenge is for each work-item to know where in the output string to place its result. One way to do this would be to assign work groups a specific substring of the input, and on every iteration, keep count of the total number of Xs and Ys in each workgroup-sized substring of the output. From this you can calculate how much that substring will expand in one iteration, and if you accumulate those values, you'll know the offset of the output of each substring expansion. Whether this is efficient is another question. :-)
However, your algorithm is actually fairly predictable: you can calculate precisely how large the final string will be given the initial string and number of iterations. The best way to generate this string with OpenCL would be to come up with a non-recursive function which analytically calculates the character at position N given M iterations, and then call that function once per work-item, with the (known!) final length of the string as the work size. I don't know if it's possible to come up with such a function, but it seems like it might be, and if it is possible, this is probably the most efficient way to do it on a GPU.
It seems like this might be possible: as far as I can tell, the result will be highly periodic:
FX
FX+YF+
FX+YF++-FX-YF+
FX+YF++-FX-YF++-FX+YF+--FX-YF+
FX+YF++-FX-YF++-FX+YF+--FX-YF++-FX+YF++-FX-YF+--FX+YF+--FX-YF+
^^^^^^ ^^^^^^^ ^^^^^^^ ^^^^^^^ ^^^^^^^ ^^^^^^^ ^^^^^^^ ^^^^^^^
A* B A B A B A B
As far as I can see, those A blocks are all identical, and so are the Bs (apart from the first A, which is effectively at position -1). You can therefore determine the characters at 14 positions out of every 16 completely deterministically. I strongly suspect it's possible to work out the pattern of +s and -s that connects them too. If you figure that out, the solution becomes pretty easy.
Note though that when you have that function, you probably don't even need to put the result in a giant string: you can just feed your drawing algorithm with that function directly.
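To illustrate that such a closed form plausibly exists: the turn sequence of the dragon curve is the regular paperfolding sequence, whose n-th element depends only on the bit just above the lowest set bit of n. A sketch in plain C (which symbol maps to '+' versus '-' depends on your turtle's conventions):

#include <stdio.h>

/* n-th turn of the dragon curve, n = 1, 2, 3, ... */
static char turn(unsigned n)
{
    return (((n & -n) << 1) & n) ? '+' : '-';
}

int main(void)
{
    for (unsigned n = 1; n <= 15; n++)   /* each work-item could do one n */
        putchar(turn(n));
    putchar('\n');
    return 0;
}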

Looping through all character combinations with increasing number of elements

What I want to achieve:
I have a function where I want to loop through all possible combinations of printable ASCII characters, starting with a single character, then two characters, then three, etc.
The part that makes this difficult for me is that I want this to work for as many characters as I can (leave it overnight).
For the record: I know that abc really is 97 98 99, so a numeric representation is fine if that's easier.
This works for few characters:
I could create a list of all possible combinations for n characters, and just loop through it, but that would require a huge amount of memory already when n = 4. This approach is literally impossible for n > 5 (at least on a normal desktop computer).
In the script below, all I do is increment a counter for each combination. My real function does more advanced stuff.
If I had unlimited memory I could do (thanks to Luis Mendo):
counter = 0;
some_function = @(x) 1;
number_of_characters = 1;
max_time = 60;
max_number_of_characters = 8;
tic;
while toc < max_time && number_of_characters < max_number_of_characters
    number_of_characters = number_of_characters + 1;
    vectors = [repmat({' ':'~'}, 1, number_of_characters)];
    n = numel(vectors);
    combs = cell(1, n);
    [combs{end:-1:1}] = ndgrid(vectors{end:-1:1});
    combs = cat(n+1, combs{:});
    combs = reshape(combs, [], n);
    for ii = 1:size(combs, 1)
        counter = counter + some_function(combs(ii, :));
    end
end
Now, I want to loop through as many combinations as possible in a certain amount of time, 5 seconds, 10 seconds, 2 minutes, 30 minutes, so I'm hoping to create a function that's only limited by the available time, and uses only some reasonable amount of memory.
Attempts I've made (and failed at) for more characters:
I've considered pre-computing the combinations for two or three letters using one of the approaches above, and use a loop only for the last characters. This would not require much memory, since it's only one (relatively small) array, plus one or more additional characters that gets looped through.
I manage to scale this up to 4 characters, but beyond that I start getting into trouble.
I've tried to use an iterator that just counts upwards. Every time I hit any(mod(number_of_ascii .^ 1:n, iterator) == 0) I increment the m'th character by one. So, the last character just repeats the cycle !"# ... ~, and every time it hits tilde, the second character increments. Every time the second character hits tilde, the third character increments etc.
Do you have any suggestions for how I can solve this?
It looks like you're basically trying to count in base 26 (or base 52 if you need caps). Each number in that base corresponds to a specific string of characters. For example,
0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,10,11,12,...
Here, the capitals A through P are just symbols used as the extra digits of the base-26 system. The above simply represents this string of characters:
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,ba,bb,bc,...
Then, you can simply do this:
symbols = ['0','1','2','3','4','5','6','7','8','9','A','B','C','D','E',...
           'F','G','H','I','J','K','L','M','N','O','P'];
characters = ['a','b','c','d','e','f','g','h','i','j','k','l',...
              'm','n','o','p','q','r','s','t','u','v','w','x','y','z'];
count = 0;
while(true)
    str_base26 = dec2base(count, 26);
    % char-by-char lookup of the base-26 digits into 'characters'
    idx = str_base26 - '0' + 1;                                         % '0'..'9' -> 1..10
    idx(str_base26 >= 'A') = str_base26(str_base26 >= 'A') - 'A' + 11;  % 'A'..'P' -> 11..26
    actual_str = characters(idx)
    count = count + 1;
end
Of course, this does not represent strings that begin with the symbol mapped to zero ('a' here, like leading zeros). But that should be pretty simple to handle.
You were not far off with your idea of an iterator that just counts upward.
What you need with this idea is a map from the integers to ASCII strings. As StewieGriffin suggested, you'd just need to work in base 95 (94 printable characters plus whitespace).
Why whitespace: you need something that will be mapped to 0 and be equivalent to it. Whitespace is the perfect candidate. You'd then just skip the strings containing any whitespace. If you don't do that and start directly at !, you won't be able to represent strings like !! or !ab.
First let's define a function that will map (1:1) integers to string :
function [outstring, toskip] = dec2ASCII(m)
    out = [];
    while m ~= 0
        out = [mod(m, 95) out];
        m = (m - out(1)) / 95;
    end
    if any(out == 0)
        toskip = 1;
    else
        toskip = 0;
    end
    outstring = char(out + 32);
end
And then in your main script:
counter = 1;
some_function = @(x) 1;
max_time = 60;
max_number_of_characters = 8;
currString = '';
tic;
while numel(currString) <= max_number_of_characters && toc < max_time
    [currString, toskip] = dec2ASCII(counter);
    if ~toskip
        some_function(currString);
    end
    counter = counter + 1;
end
Some random outputs of the dec2ASCII function :
dec2ASCII(47)
ans =
O
dec2ASCII(145273)
ans =
0)2
In terms of performance I can't really elaborate, as I don't know what you want to do with your some_function. The only thing I can say is that the running time of dec2ASCII is around 2*10^(-5) s.
Side note: iterating like this will be very limited in terms of speed. With some_function doing nothing, you'd just be able to cycle through 4 characters in around 40 minutes, and 5 characters would already take up to 64 hours. Maybe you'd want to reduce the amount of work you pass through the function you iterate on.
This code, though, is easily parallelizable, so if you want to check more combinations, I'd suggest trying to do it in a parallel manner.

Haskell : Increment index in a loop

I have a function that calculates f(n) in Haskell.
I have to write a loop so that it will start calculating values from f(0) to f(n), and will every time compare the value of f(i) with some fixed value.
I am an expert in OOP, hence I am finding it difficult to think in the functional way.
For example, I have to write something like
while (number < f(i))
i++
How would I write this in Haskell?
The standard approach here is
Create an infinite list containing all values of f(n).
Search this list until you find what you're after.
For example,
takeWhile (number <) $ map f [0..]
If you want to give up after you reach "n", you can easily add that as a separate step:
takeWhile (number <) $ take n $ map f [0..]
or, alternatively,
takeWhile (number <) $ map f [0 .. n]
You can do all sorts of other filtering, grouping and processing in this way. But it requires a mental shift. It's a bit like the difference between writing a for-loop to search a table, versus writing an SQL query. Think about Haskell as a bit like SQL, and you'll usually see how to structure your code.
You can generate the list of the is such that f i is larger than your number:
[ i | i<-[0..] , f i > number ]
Then, you can simply take the first one, if that's all you want:
head [ i | i<-[0..] , f i > number ]
Often, many idiomatic loops in imperative programming can be rephrased as list comprehensions, or expressed through map, filter, foldl, foldr. In the general case, when the loop is more complex, you can always exploit recursion instead.
Keep in mind that a "blind" translation from imperative to functional programming will often lead to non-idiomatic, hard-to-read code, as it would be the case when translating in the opposite direction. Still, I find it relieving that such translation is always possible.
If you are new to functional programming, I would advise against learning it by translating what you know about imperative programming. Rather, start from scratch following a good book (LYAH is a popular choice).
The first thing that's weird from a functional approach is that it's unclear what the result of your computation is. Do you care about the final result of f (i)? Perhaps you care about i itself. Without side effects everything needs to have a value.
Let's assume you want the final value of the function f (i) as soon as some comparison fails. You can simulate your own while loops using recursion and guards!
while :: Int -> Int -> (Int -> Int) -> Int
while start number f
  | val >= number = val
  | otherwise     = while (start + 1) number f
  where
    val = f start
Instead of explicit recursion, you can use until, e.g.
findGreaterThan :: (Int -> Int) -> Int -> Int -> (Int, Int)
findGreaterThan f init max = until (\(v, _) -> v >= max) (\(_, i) -> (f i, i + 1)) (init, 0)
this returns a pair containing the first value to fail the condition and the number of iterations of the given function.

Iterating with respect to two variables in haskell

OK, continuing with my solving of the problems on Project Euler, I am still beginning to learn Haskell and programming in general.
I need to find the lowest number divisible by the numbers 1:20
So I started with:
divides :: Int -> Int -> Bool
divides d n = rem n d == 0
divise n a | divides n a == 0 = n : divise n (a+1)
| otherwise = n : divise (n+1) a
What I want to happen is for it to keep moving up for values of n until one magically is evenly divisible by [1..20].
But this doesn't work and now I am stuck as from where to go from here. I assume I need to use:
[1..20]
for the value of a but I don't know how to implement this.
Well, having recently solved the Euler problem myself, I'm tempted to just post my answer for that, but for now I'll abstain. :)
Right now, the flow of your program is a bit chaotic, to sound like a feng-shui person. Basically, you're trying to do one thing: increment n until 1..20 divides n. But really, you should view it as two steps.
Currently, your code is saying: "if a doesn't divide n, increment n. If a does divide n, increment a". But that's not what you want it to say.
You want (I think) to say "increment n, and see if it divides [Edit: with ALL numbers 1..20]. If not, increment n again, and test again, etc." What you want to do, then, is have a sub-test: one that takes a number, and tests it against 1..20, and then returns a result.
Hope this helps! Have fun with the Euler problems!
Edit: I really, really should remember all the words.
Well, as an algorithm, this kinda sucks.
Sorry.
But you're getting misled by the list. I think what you're trying to do is iterate through all the available numbers until you find one that everything in [1..20] divides. In your implementation above, if a doesn't divide n, you never go back and check the divisors b < a against n+1.
An easy implementation of your algorithm would be:
lcmAll :: [Int] -> Maybe Int
lcmAll nums = find (\n -> all (\d -> d `divides` n) nums) [1..]
(using find from Data.List and all from the Prelude).
A better algorithm would be to find the lcm's pairwise, using foldl:
lcmAll :: [Int] -> Int
lcmAll = foldl lcmPair 1

lcmPair :: Int -> Int -> Int
lcmPair a b = lcmPair' a b
  where lcmPair' a' b' | a' < b'   = lcmPair' (a + a') b'
                       | a' > b'   = lcmPair' a' (b + b')
                       | otherwise = a'
Of course, you could use the lcm function from the Prelude instead of lcmPair.
This works because the least common multiple of any set of numbers is the same as the least common multiple of [the least common multiple of two of those numbers] and [the rest of the numbers]
The function 'divise' never stops; it doesn't have a base case. Both branches call divise, thus they are both recursive. You're also using the function divides as if it returned an Int (like rem does), but it returns a Bool.
I see you have already started to divide the problem into parts; this is usually good for understanding and making the code easier to read.
Another thing that can help is to write the types of the functions. If your function works but you're not sure of its type, try :i myFunction in ghci. Here I've fixed the type error in divides (although other errors remain):
*Main> :i divise
divise :: Int -> Int -> [Int] -- Defined at divise.hs:4:0-5
Did you want it to return a list?
Leaving you to solve the problem, try to further divide the problem into parts. Here's a naive way to do it:
A function that checks if one number is evenly divisible by another. This is your divides function.
A function that checks if a number is dividable by all numbers [1..20].
A function that iterates over all numbers and tries them on the function in #2.
Here's my quick, more Haskell-y approach, using your algorithm:
Prelude> let divisibleByUpTo i n = all (\x -> (i `rem` x) == 0) [1..n]
Prelude> take 1 $ filter (\x -> snd x == True) $ map (\x -> (x, divisibleByUpTo x 4)) [1..]
[(12,True)]
divisibleByUpTo returns a boolean if the number i is divisible by every integer up to and including n, similar to your divides function.
The next line probably looks pretty difficult to a Haskell newcomer, so I'll explain it bit-by-bit:
Starting from the right, we have map (\x -> (x, divisibleByUpTo x 4)) [1..] which says for every number x from 1 upwards, do divisibleByUpTo x 4 and return it in a tuple of (x, divisibleByUpTo x 4). I'm using a tuple so we know which number exactly divides.
Left of that, we have filter (\x -> snd x == True); meaning only return elements where the second item of the tuple is True.
And at the leftmost of the statement, we take 1 because otherwise we'd have an infinite list of results.
This will take quite a long time for a value of 20. Like others said, you need a better algorithm -- consider how for a value of 4, even though our "input" numbers were 1-2-3-4, ultimately the answer was only the product of 3*4. Think about why 1 and 2 were "dropped" from the equation.
