What I want to achieve:
I have a function where I want to loop through all possible combinations of printable ascii-characters, starting with a single character, then two characters, then three etc.
The part that makes this difficult for me is that I want this to work for as many characters as I can (leave it overnight).
For the record: I know that abc really is 97 98 99, so a numeric representation is fine if that's easier.
This works for a few characters:
I could create a list of all possible combinations for n characters, and just loop through it, but that would require a huge amount of memory already when n = 4. This approach is literally impossible for n > 5 (at least on a normal desktop computer).
In the script below, all I do is increment a counter for each combination. My real function does more advanced stuff.
If I had unlimited memory I could do (thanks to Luis Mendo):
counter = 0;
some_function = @(x) 1;
number_of_characters = 1;
max_time = 60;
max_number_of_characters = 8;
tic;
while toc < max_time && number_of_characters < max_number_of_characters
    number_of_characters = number_of_characters + 1;
    vectors = repmat({' ':'~'}, 1, number_of_characters);
    n = numel(vectors);
    combs = cell(1, n);
    [combs{end:-1:1}] = ndgrid(vectors{end:-1:1});
    combs = cat(n+1, combs{:});
    combs = reshape(combs, [], n);
    for ii = 1:size(combs, 1)
        counter = counter + some_function(combs(ii, :));
    end
end
Now, I want to loop through as many combinations as possible in a certain amount of time, 5 seconds, 10 seconds, 2 minutes, 30 minutes, so I'm hoping to create a function that's only limited by the available time, and uses only some reasonable amount of memory.
Attempts I've made (and failed at) for more characters:
I've considered pre-computing the combinations for two or three letters using one of the approaches above, and using a loop only for the last characters. This would not require much memory, since it's only one (relatively small) array, plus one or more additional characters that get looped through.
I manage to scale this up to 4 characters, but beyond that I start getting into trouble.
I've tried to use an iterator that just counts upwards. Every time the iterator hits a multiple of a power of the alphabet size, i.e. any(mod(iterator, number_of_ascii .^ (1:n)) == 0), I increment the m'th character by one. So the last character just repeats the cycle !"# ... ~, and every time it hits tilde, the second character increments. Every time the second character hits tilde, the third character increments, and so on.
Do you have any suggestions for how I can solve this?
It looks like you're basically trying to count in base 26 (or base 52 if you need capitals). Each number in that base corresponds to a specific string of characters. For example,
0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,10,11,12,...
Here, capital A through P are just symbols used as digits of the base-26 system. The sequence above simply represents this string of characters:
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,ba,bb,bc,...
Then, you can simply do this:
symbols = ['0','1','2','3','4','5','6','7','8','9','A','B','C','D','E',...
'F','G','H','I','J','K','L','M','N','O','P']
characters = ['a','b','c','d','e','f','g','h','i','j','k','l',...
'm','n','o','p','q','r','s','t','u','v','w','x','y','z']
count = 0;
while true
    str_base26 = dec2base(count, 26);
    % char-by-char lookup of str_base26 into the character set
    [~, idx] = ismember(str_base26, symbols);
    actual_str = characters(idx);
    count = count + 1;
end
Of course, it does not produce strings with leading zero digits (nothing maps to "aa", for instance, since dec2base omits leading zeros). But handling that should be pretty simple.
You were not far off with your idea of an iterator that just counts upward.
What you need with this idea is a map from the integers to ASCII characters. As StewieGriffin suggested, you'd just need to work in base 95 (94 characters plus whitespace).
Why whitespace? You need something that will be mapped to 0 and be equivalent to it, and whitespace is the perfect candidate. You then just skip the strings containing any whitespace. If you don't do that and start directly at !, you won't be able to represent strings like !! or !ab.
First let's define a function that will map (1:1) integers to strings:
function [outstring, toskip] = dec2ASCII(m)
    out = [];
    while m ~= 0
        out = [mod(m, 95) out];
        m = (m - out(1)) / 95;
    end
    toskip = any(out == 0);
    outstring = char(out + 32);
end
And then in your main script:
counter = 1;
some_function = @(x) 1;
max_time = 60;
max_number_of_characters = 8;
currString = '';
tic;
while numel(currString) <= max_number_of_characters && toc < max_time
    [currString, toskip] = dec2ASCII(counter);
    if ~toskip
        some_function(currString);
    end
    counter = counter + 1;
end
Some random outputs of the dec2ASCII function:
dec2ASCII(47)
ans =
O
dec2ASCII(145273)
ans =
0)2
In terms of performance I can't really elaborate, as I don't know what you want to do in your some_function. The only thing I can say is that the running time of dec2ASCII is around 2*10^(-5) s.
Side note: iterating like this will be very limited in terms of speed. Even with some_function doing nothing, you'd just be able to cycle through all 4-character strings in around 40 minutes, and 5 characters would already take up to 64 hours. Maybe you'd want to reduce the amount of work done inside the function you iterate with.
This code, though, is easily parallelizable, so if you want to check more combinations, I'd suggest trying to do it in a parallel manner.
I have a program that reads an array of letters (it can be any text). Then I need to compare the 1st and the 4th element of each line, but the program doesn't let me do this. How can I access those elements in order to compare them?
Program acmp387;
uses crt;
var
  n, i, answer : integer;
  letters : array[1..1000] of string;
Begin
  read(n);
  for i := 1 to n do
  begin
    read(letters[i]);
    if ord(letters[i][1]) = ord(letters[i][4]) then
      answer := answer + 1;
  end;
  writeln(answer);
  readkey;
End.
I'm interested in this line:
if ord(letters[i][1]) = ord(letters[i][4])
Your access is OK, provided all strings have at least four characters; for strings with 0 to 3 characters there may be an error or message. Maybe you have a problem running your program and it does not behave as expected.
Your program will work as expected if you replace the read statements with readln. A read statement makes sense only in limited situations; in interactive programs you will almost always use readln. With these changes and the input
5
abcdef
abcabc
0101010101010101
10011001
123456
you will get the displayed result 2 (the lines abcabc and 10011001 meet the criterion and increment answer).
Context:
I have a code/text editor that I'm trying to optimize. Currently, the bottleneck of the program is the language parser that scans for all the keywords (there is more than one parser, but they're all written much the same way).
On my computer, the editor starts to lag on files of around 1,000,000 lines of code. On lower-end computers, like a Raspberry Pi, the lag starts much sooner (I don't remember exactly, but I think around 10,000 lines of code). And although I've never quite seen documents larger than 1,000,000 lines of code, I'm sure they're out there, and I want my program to be able to edit them.
Question:
This leads me to the question: what's the fastest way to scan for a list of words within a large, dynamic string?
Here's some information that may affect the design of the algorithm:
the keywords
the characters allowed to be part of a keyword (I call them qualifiers)
the large string
Bottleneck-solution:
This is (roughly) the method I'm currently using to parse strings:
// this is just an example, not an excerpt
// I haven't compiled this, I'm just writing it to
// illustrate how I'm currently parsing strings
struct tokens * scantokens (char * string, char ** tokens, int tcount){
    struct tokens * tks = tokens_init ();
    for (int i = 0; string[i]; i++){
        // qualifiers for C are: a-z, A-Z, 0-9, and underscore
        // if it isn't a qualifier, skip it (without running past the end)
        while (string[i] && isnotqualifier (string[i])) i++;
        if (!string[i]) break;
        int result = 0;
        for (int j = 0; j < tcount; j++){
            // returns 0 for no match
            // returns the length of the keyword if they match
            result = string_compare (&string[i], tokens[j]);
            if (result > 0){ // if the string matches
                token_push (tks, i, i + result); // add the token
                // token_push (data_struct, where_it_begins, where_it_ends)
                break;
            }
        }
        if (result > 0){
            i += result;
        } else {
            // skip to the next non-qualifier
            // then skip to the beginning of the next qualifier
            /* ie, go from:
               'some_id + sizeof (int)'
                ^
               to here:
               'some_id + sizeof (int)'
                         ^
            */
        }
    }
    if (!tks->len){
        free (tks);
        return 0;
    } else return tks;
}
Possible Solutions:
Contextual Solutions:
I'm considering the following:
Scan the large string once, and add a function to evaluate/adjust the token markers every time there is user input (instead of re-scanning the entire document over and over). I expect this will fix the bottleneck because there is much less parsing involved. But it doesn't completely solve the problem, because the initial scan may still take a really long time.
Optimize token-scanning algorithm (see below)
I've also considered, but have rejected, these optimizations:
Scanning only the code that is on the screen. Although this would fix the bottleneck, it would limit the ability to find user-defined tokens (i.e. variable names, function names, macros) that appear earlier than where the screen starts.
Switching the text to a linked list (a node per line), rather than a monolithic array. This doesn't really help the bottleneck. Although insertions/deletions would be quicker, the loss of indexed access slows down the parser. I also think a monolithic array is more likely to be cached than a broken-up list.
Hard-coding a scan-tokens function for every language. Although this could be the best optimization for performance, it doesn't seem practical from a software development point of view.
Architectural solution:
With assembly language, a quicker way to parse these strings would be to load characters into registers and compare them 4 or 8 bytes at a time. There are some additional measures and precautions that would have to be taken into account, such as:
Does the architecture support unaligned memory access?
All strings would have to be of size s, where s % word-size == 0, to prevent reading violations
Others?
But these issues seem like they can be easily fixed. The only problem (other than the usual ones that come with writing in assembly language) is that it's not so much an algorithmic solution as it is a hardware solution.
Algorithmic Solution:
So far, I've considered having the program rearrange the list of keywords to make a binary search more feasible.
One way I've thought of rearranging them is by switching the dimensions of the list of keywords. Here's an example of that in C:
// some keywords for the C language
auto // keywords[0]
break // keywords[1]
case char const continue // keywords[2] .. keywords[5]
default do double
else enum extern
float for
goto
if int
long
register return
short signed sizeof static struct switch
typedef
union unsigned
void volatile
while
/* keywords[i] refers to the i-th keyword in the list
*
*/
Switching the dimensions of the two-dimensional array would make it look like this:
0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3
1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2
-----------------------------------------------------------------
1 | a b c c c c d d d e e e f f g i i l r r s s s s s s t u u v v w
2 | u r a h o o e o o l n x l o o f n o e e h i i t t w y n n o o h
3 | t e s a n n f u s u t o r t t n g t o g z a r i p i s i l i
4 | o a e r s t a b e m e a o g i u r n e t u t e o i d a l
5 | k i u l r t s r t e o i c c d n g t e
6 | n l e n t n d f c t h e n i
7 | u t e f e l
8 | e r d e
// note that, now, keywords[0] refers to the string "abccccdddeeeffgiilrrsssssstuuvvw"
This makes it more efficient to use a binary search algorithm (or even a plain brute-force algorithm). But it only works for the first character in each keyword; afterwards nothing can be considered 'sorted'. This may help with small sets of words like a programming language's keywords, but it wouldn't be enough for a larger set of words (like the entire English language).
Is there more that can be done to improve this algorithm?
Is there another approach that can be taken to increase performance?
Notes:
This question from SO doesn't help me. The Boyer-Moore-Horspool algorithm (as I understand it) is an algorithm for finding a sub-string within a string. Since I'm parsing for multiple strings I think there's much more room for optimization.
Aho-Corasick is a very cool algorithm but it's not ideal for keyword matches, because keyword matches are aligned; you can't have overlapping matches because you only match a complete identifier.
For the basic identifier lookup, you just need to build a trie out of your keywords (see note below).
Your basic algorithm is fine: find the beginning of the identifier, and then see if it's a keyword. It's important to improve both parts. Unless you need to deal with multibyte characters, the fastest way to find the beginning of a keyword is to use a 256-entry table, with one entry for each possible character. There are three possibilities:
The character can not appear in an identifier. (Continue the scan)
The character can appear in an identifier but no keyword starts with the character. (Skip the identifier)
The character can start a keyword. (Start walking the trie; if the walk cannot be continued, skip the identifier. If the walk finds a keyword and the next character cannot be in an identifier, skip the rest of the identifier; if it can be in an identifier, try continuing the walk if possible.)
Actually steps 2 and 3 are close enough together that you don't really need special logic.
There is some imprecision with the above algorithm because there are many cases where you find something that looks like an identifier but which syntactically cannot be. The most common cases are comments and quoted strings, but most languages have other possibilities. For example, in C you can have hexadecimal floating point numbers; while no C keyword can be constructed just from [a-f], a user-supplied word might be:
0x1.deadbeef
On the other hand, C++ allows user-defined numeric suffixes, which you might well want to recognize as keywords if the user adds them to the list:
274_myType
Beyond all of the above, it's really impractical to parse a million lines of code every time the user types a character in an editor. You need to develop some way of caching tokenization, and the simplest and most common one is to cache by input line. Keep the input lines in a linked list, and with every input line also record the tokenizer state at the beginning of the line (i.e., whether you're in a multi-line quoted string; a multi-line comment, or some other special lexical state). Except in some very bizarre languages, edits cannot affect the token structure of lines preceding the edit, so for any edit you only need to retokenize the edited line and any subsequent lines whose tokenizer state has changed. (Beware of working too hard in the case of multi-line strings: it can create lots of visual noise to flip the entire display because the user types an unterminated quote.)
Note: For smallish numbers of keywords (hundreds), a full trie doesn't really take up that much space, but at some point you need to deal with bloated branches. One very reasonable data structure, which can be made to perform very well if you're careful about data layout, is a ternary search tree (although I'd call it a ternary search trie).
It will be hard to beat this code.
Suppose your keywords are "a", "ax", and "foo".
Take the list of keywords, sorted, and feed it into a program that prints out code like this:
switch(pc[0]){
break; case 'a':{
    if (0){
    } else if (strncmp(pc, "a", 1)==0 && !alphanum(pc[1])){
        // push "a"
        pc += 1;
    } else if (strncmp(pc, "ax", 2)==0 && !alphanum(pc[2])){
        // push "ax"
        pc += 2;
    }
}
break; case 'f':{
    if (0){
    } else if (strncmp(pc, "foo", 3)==0 && !alphanum(pc[3])){
        // push "foo"
        pc += 3;
    }
    // etc. etc.
}
// etc. etc.
}
Then if you don't see a keyword, just increment pc and try again.
The point is, by dispatching on the first character, you quickly get into the subset of keywords starting with that character.
You might even want to go to two levels of dispatch.
Of course, as always, take some stack samples to see what the time is being used for.
Regardless, if you use data structure classes, you're going to find them consuming a large part of your time, so keep that to a minimum (throw religion to the wind :)
The fastest way to do it would be a finite state machine built for the word set. Use Lex to build the FSM.
The best algorithm for this problem is probably Aho-Corasick. There already exist C implementations, e.g.,
http://sourceforge.net/projects/multifast/
I am creating a distributed password cracker that will use brute force, so I need every combination of strings.
To distribute the work, the server will give each client a range of strings, like "aaaa" to "bxyz". I am assuming the string length will be four, so I need to check every string between these two bounds.
I am trying to generate these strings in C, but I'm failing to work out the logic; I also searched on Google to no benefit. Any ideas?
EDIT
Sorry brothers, I would like to edit it.
I want combinations of strings within a range. Let's suppose between aaaa and aazz; that would be strings like aaaa aaab aaac aaad ... aazx aazy aazz. My character space is just uppercase and lowercase English letters, which is 52 characters. I want to check every combination of 4 characters, but the server will distribute ranges of strings among its clients. My question was: if one client gets the range between aaaa and aazz, how do I generate the strings between just these bounds?
If your strings can contain any character of the ASCII table, you'll have, as an upper limit, 256 characters, or 2^8 characters.
Since your strings are 4 characters long, you'll have 2^8 * 2^8 * 2^8 * 2^8 combinations,
or (2^8)^4 = 2^32 combinations.
Simply split the range of numbers and start the combinations in each machine.
You'll probably be interested in this: Calculating Nth permutation step?
Edit:
Considering your edit, your space of combinations would be 52^4 = 7,311,616 combinations.
Then you simply need to divide these tasks among the machines: 7,311,616 / n = r, with r being the number of permutations checked by each machine; the last machine may compute r + (7,311,616 mod n) combinations.
Since you know the amount of combinations to build in each machine, you'll have to execute the following, in each machine:
function check_permutations(begin, end, chars) {
for (i = begin; i < end; i++) {
nth_perm = nth_permutation(chars, i);
check_permutation(nth_perm); // your function of verification
}
}
The function nth_permutation() is not hard to derive, and I'm quite sure you can get it in the link I've posted.
After this, you would simply start a process with such a function as check_permutations, giving the begin, end, and the vector of characters chars.
You can generate a tree containing all the permutations, e.g. like in this pseudocode (note the base case, which the recursion needs to terminate):
strings(root, len)
    if (len == 0) return
    for (c = 'a' to 'z')
        root->next[c] = c
        strings(&root->next[c], len - 1)
Invoke with strings(root, 4).
After that you can traverse the tree to get all the permutations.
I'm trying to understand how does memmove work. I'm taking an example where I have data in memory in this manner.
Start at 0
First Memory Block(A) of size 10
Hence A -> (0,10), where 0 is where it starts and 10 its length.
Thus B-> (10,20)
C-> (30,50)
D-> (80,10)
Let's say that we have a variable X which records where can insert next which would be 90 in the example given above.
Now if I want to delete B, then I would like to move C and D to free space occupied by B.
input is input array.
So input array will look like having first 10 characters belonging to block A, next 20 belonging to block B etc.
This I think can be done using memmove as follows:
memmove(input+start(B),input+start(B)+length(B),X-(start(B)+length(B))
Now I want to try for reverse order.
So we start from behind
Start at 100
First memory block(A) of size 10
A -> (100,10), where 100 is where it starts and 10 its length
B-> (90,20)
C-> (70,50)
D-> (20,10)
Similar to first example, let's say we have a variable X where we record where we can insert next. This would be 10 for the example in reverse order.
Now if I want to delete block B, then I would like C and D to overlap in B's space. This would be memmove in reverse order.
I think this can be done in this manner:
memmove(input+start(B)-(start(B)-length(B)-X),input+X,start(B)-length(B)-X)
As per Alex's comment, I think I've not kept the correct ordering of data. The data would be like:
A->(90,10)
B->(70,20)
C->(40,30)
D->(20,20)
and X which would be D's starting address i.e at 20
Now if we want to delete B,memmove would look something like this.
memmove(input+X+length(B), input+X,start(B)-X)
Are there better ways to do this?
Note this is not for homework.
C and D together occupy 50+10 = 60 bytes, so why 20 in memmove(input+start(B), input+start(B)+length(B), 20)?
As for the other part: in C, objects don't start with their last byte (the first byte is at the lowest address and the last byte at the highest). This part is confusing.