Binary to unary Turing machine - theory

How to implement a Turing machine that converts its input from binary to unary?
Example:
Given input: 101
The output should be: 1^5 = 11111

The problem statement does not specifically request a single-tape Turing machine, which greatly simplifies matters. Since every multi-tape Turing machine has an equivalent single-tape machine, we'll just worry about the multi-tape definition and leave transforming it to a single-tape variety as an exercise.
We will use a three-tape Turing machine that works as follows:
1. The first tape is the input tape, and we'll only read from it.
2. The second tape is our scratch tape. Here we'll record the unary value of the bit position we're currently looking at in the input string (so 1, 11, 1111, etc.).
3. The third tape is our output tape. We'll record our answer here.
4. We begin by going to the last symbol of the input tape. We also initialize the scratch tape with the value 1.
5. If the input symbol we're currently looking at is 1, copy the scratch tape to the end of the output tape; otherwise, do nothing.
6. Move the input tape head one space to the left and copy the scratch tape to the end of the scratch tape itself, effectively doubling the number of 1s on it.
7. Return to step 5 and repeat until the input tape head reaches the blank at the front of the single-sided tape.*
Example of how this would work:
Input:
#101#   #########   ######
 ^       ^           ^
After initialization:
#101#   #1#######   ######
   ^     ^           ^
After first iteration of steps 5 & 6:
#101#   #11######   #1####
  ^      ^           ^
After second iteration of steps 5 & 6:
#101#   #1111####   #1####
 ^       ^           ^
After third iteration of steps 5 & 6:
#101#   #11111111   #11111
^        ^           ^
I always assume we have a marker at the front of single-sided tapes so we know not to go over the edge. If you don't want to assume this, you can always enforce it yourself by shifting the input over one space and writing a blank at the front of the input tape, as the first step before doing all the other stuff.


Could someone explain to me what this line of code means?

I was wondering what this code really means. I would like to know what it does, in what order, and what the ? and : signs mean; all explained.
printf(" %c", ((sq & 8) && (sq += 7)) ? '\n' : pieces[board[sq] & 15]);
Thanks.
The first argument, " %c", is the format string: it tells printf to print a space followed by a single character.
The second argument supplies the character that the function prints.
In this case, the second argument is a ternary operator. You can read the link provided, but in short, it's basically a short-hand for an if-else block. This is what it looks like in your example:
((sq & 8) && (sq += 7)) ? '\n' : pieces[board[sq] & 15]
Let's separate it into three parts:
((sq & 8) && (sq += 7))
'\n'
pieces[board[sq] & 15]
The first part is a condition (if);
this expression (sq & 8) uses what is called a bitwise AND operation (read more here). 8 in binary is 1000, so this checks whether sq has a 1 in that bit position (it could be 1000, 11000, 101000, etc.); if it does, the expression equals 8 (any nonzero value counts as true), and if it doesn't, it equals 0 (which means false).
&& means AND: both the left and the right expression need to be true.
sq += 7 adds 7 to sq and evaluates to the new value of sq; if that value is nonzero, it counts as true.
The second part, '\n', is returned (and in your case printed) if the condition is true; otherwise the third part, pieces[board[sq] & 15], is printed.
This is fairly obfuscated code, so it's best to try to understand it in the context in which it appears. By writing it this way, the author is in effect telling you "you don't really need to understand the details". So let's try to understand what this does from the top down, inferring the details of the context, rather than from the bottom up.
printf prints -- in this case " %c", which is a space and a single character. The single character will either be (from the ?-: ternary expression)
a newline '\n'
a piece from space sq on the board
Which it will be depends on the condition before the ?: it first tests a single bit of sq (the & 8 does a bitwise AND with a constant that has one set bit), and if that bit is set, adds 7 to sq and prints the newline [1], while if it is not set, it prints the piece.
So now we really need to know the context. This is probably in a loop that starts with sq = 0 and increments sq each time (i.e., something like for (int sq = 0; ...some condition...; ++sq)). So what it is doing is printing out the pieces on some row of the board, and when it gets to the end of the row, prints a newline and goes on to the next row. A lot of this depends on how exactly the board array is organized; it would seem to be a 1D array with a 2D board embedded in it: the first row at indexes 0..7, the second at indexes 16..23, the third at indexes 32..39, and so on [2].
[1] Technically, when the bit is set, it tests the result of adding 7, but that will be true unless sq was -7, which is probably impossible from the context (a loop that starts at 0 and only increments from there).
[2] The gaps here are inferred from the test in the line of code: those indexes with bit 3 set (for which sq & 8 will be true) are not valid board squares, but are instead "gaps" between the rows. They might be used for something else, elsewhere in the code.
Ok, thank you all! I've looked at it and it now works as expected. Thanks!

Determine needed # of extra bytes to conduct buffer overflow attack (homework)

For assistance in completing Level 0 of this buffer overflow project (.pdf) in Assembly, I'm using this guide.
Goal = provide a string longer than getbuf can handle (getbuf() takes an input of 32 characters), causing an overflow and pushing the rest of the string onto the stack, where you can then control where the function getbuf returns after execution (we want to call the smoke() function).
When I get to step 2 of the guide, I need to calculate how many additional characters my input string should have, on top of the original 32-character input, in order to execute buffer overflow and call smoke().
(Step 2 Excerpt):
In my case, I find startingAddressOfBufVariable to be 0x55683ddc and %ebp is 0x55683df0.
I calculate buffer size = addressAt%ebp+4 - startingAddressOfBufVariable, which is 0x55683df0+4 - 0x55683ddc = 0x18 = 24.
In the next step, I'm supposed to subtract 32 from that result to determine how many additional characters (on top of the original 32) that my input string should have. However, 24 - 32 = -8. I get a negative number! I'm not sure what to do with that. Subtract 8 characters from my 32-character input string? I'm trying to conduct overflow, so that doesn't make sense.
For testing/guessing purposes, I moved on with the guide anyway, pretending that the -8 result I got was actually a positive 8, and intending to add 8 characters on top of my 32-character string.
Knowing the address of my smoke() function to be 0x08048b2b, I created my input file as such, per the instructions in step 4 (though why did they change aa to 61?):
perl -e 'print "AA "x32, "BB "x4, "CC "x4, "2b 8b 04 08" '>hex3
(Step 4 Excerpt:)
So, am I using incorrect math in Step 2 of the guide? Are the addresses I'm using in the math incorrect? If they are correct, how do I interpret the -8 result, and what does it mean in terms of modifying my character input to execute the overflow attack?
They write in the example C code that the buffer size is 12, but it would make sense that you have to infer the actual buffer size from the registers. Maybe try again with a base string size of 0x18 (24) instead of 32?

Loops, "have n items to process but only n-1 update steps"

Consider the following loop:
marker_stream = 0
for character in input_file:
    if character != ',':
        marker_stream |= 1
    marker_stream <<= 1
For each character in input_file, this loop does a processing step, stores the result of the processing step (either a 0 or 1 bit) in marker_stream, and then shifts marker_stream over by one position to prepare for the next iteration.
Here's the problem: I want to process each character in the input file, but I only want to shift marker_stream number of characters in the input file - 1 times. The loop above shifts marker_stream one too many times.
Now, I know I could add marker_stream >>= 1 after the for loop, or I could maintain some flag that says whether or not the character we're currently processing is the last character in the file, but neither of these solutions seem that great. The flag solution involves flags (yuck), and the extra line solution could be confusing if the processing loop was longer.
I'm looking for a more elegant solution to this problem, and, more generally the "I have n items to process but have an update step I want to run only n-1 times" problem.
Process the first element in the file separately; treat the file as a one-entry head with a tail holding the rest of the entries:
// Process head
entry <- readNext(inFile)
write(entry, outFile)

// Process tail
while (NOT inFile.endOfFile)
    write(separator, outFile)
    entry <- readNext(inFile)
    write(entry, outFile)
endwhile
The head entry does not follow a separator; the tail entries all follow a separator. You get your 'n-1' effect by treating the single head entry differently from the n-1 tail entries in the file.

Optimizing a word parser

Context:
I have a code/text editor that I'm trying to optimize. Currently, the bottleneck of the program is the language parser that scans for all the keywords (there is more than one parser, but they're written generally the same).
On my computer, the editor starts to lag on files around 1,000,000 lines of code. On lower-end computers, like the Raspberry Pi, the delay starts happening much sooner (I don't remember exactly, but I think around 10,000 lines of code). And although I've never quite seen documents larger than 1,000,000 lines of code, I'm sure they're out there, and I want my program to be able to edit them.
Question:
This leads me to the question: what's the fastest way to scan for a list of words within a large, dynamic string?
Here's some information that may affect the design of the algorithm:
the keywords
the qualifying characters allowed to be part of a keyword (I call them qualifiers)
the large string
Bottleneck-solution:
This is (roughly) the method I'm currently using to parse strings:
// this is just an example, not an excerpt
// I haven't compiled this, I'm just writing it to
// illustrate how I'm currently parsing strings
struct tokens * scantokens (char * string, char ** tokens, int tcount){
    int result = 0;
    struct tokens * tks = tokens_init ();
    for (int i = 0; string[i]; i++){
        // qualifiers for C are: a-z, A-Z, 0-9, and underscore
        // if it isn't a qualifier, skip it
        while (string[i] && isnotqualifier (string[i])) i++;
        if (!string[i]) break;
        for (int j = 0; j < tcount; j++){
            // returns 0 for no match
            // returns the length of the keyword if they match
            result = string_compare (&string[i], tokens[j]);
            if (result > 0){ // if the string matches
                token_push (tks, i, i + result); // add the token
                // token_push (data_struct, where_it_begins, where_it_ends)
                break;
            }
        }
        if (result > 0){
            i += result;
        } else {
            // skip to the next non-qualifier,
            // then skip to the beginning of the next qualifier
            /* ie, go from:
               'some_id + sizeof (int)'
                ^
               to here:
               'some_id + sizeof (int)'
                          ^
            */
        }
    }
    if (!tks->len){
        free (tks);
        return 0;
    } else return tks;
}
Possible Solutions:
Contextual Solutions:
I'm considering the following:
Scan the large string once, and add a function to evaluate/adjust the token markers every time there is user input (instead of re-scanning the entire document over and over). I expect this to fix the bottleneck because there is much less parsing involved. But it doesn't completely fix the problem, because the initial scan may still take a really long time.
Optimize token-scanning algorithm (see below)
I've also considered, but have rejected, these optimizations:
Scanning only the code that is on the screen. Although this would fix the bottleneck, it would prevent finding user-defined tokens (i.e., variable names, function names, macros) that appear before the start of the screen.
Switching the text to a linked list (a node per line), rather than a monolithic array. This doesn't really help the bottleneck. Although insertions/deletions would be quicker, the loss of indexed access slows down the parser. I also think a monolithic array is more likely to stay in cache than a broken-up list.
Hard-coding a scan-tokens function for every language. Although this could be the best optimization for performance, it doesn't seem practical from a software development point of view.
Architectural solution:
With assembly language, a quicker way to parse these strings would be to load characters into registers and compare them 4 or 8 bytes at a time. There are some additional measures and precautions that would have to be taken into account, such as:
Does the architecture support unaligned memory access?
All strings would have to be of size s, where s % word-size == 0, to prevent reading violations
Others?
But these issues seem like they can be easily fixed. The only problem (other than the usual ones that come with writing in assembly language) is that it's not so much an algorithmic solution as it is a hardware solution.
Algorithmic Solution:
So far, I've considered having the program rearrange the list of keywords to make a binary search algorithm more feasible.
One way I've thought about rearranging them is by switching the dimensions of the list of keywords. Here's an example of that in C:
// some keywords for the C language
// some keywords for the C language
auto                       // keywords[0]
break                      // keywords[1]
case char const continue   // keywords[2] .. keywords[5]
default do double          // keywords[6] .. keywords[8]
else enum extern
float for
goto
if int
long
register return
short signed sizeof static struct switch
typedef
union unsigned
void volatile
while
/* keywords[i] refers to the i-th keyword in the list
 *
 */
Switching the dimensions of the two-dimensional array would make it look like this:
    0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3
    1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2
    ---------------------------------------------------------------
1 | a b c c c c d d d e e e f f g i i l r r s s s s s s t u u v v w
2 | u r a h o o e o o l n x l o o f n o e e h i i t t w y n n o o h
3 | t e s a n n f   u s u t o r t   t n g t o g z a r i p i s i l i
4 | o a e r s t a   b e m e a   o     g i u r n e t u t e o i d a l
5 |   k     t i u   l     r t           s r t e o i c c d n g   t e
6 |           n l   e     n             t n   d f c t h e   n   i
7 |           u t                       e               f   e   l
8 |           e                         r                   d   e
// note that, now, keywords[0] refers to the string "abccccdddeeeffgiilrrsssssstuuvvw"
This makes it more efficient to use a binary search algorithm (or even a plain brute-force scan). But it only works for the first character of each keyword; after that, nothing can be considered 'sorted'. This may help with a small set of words, like a programming language's keywords, but it wouldn't be enough for a larger set of words (like the entire English language).
Is there more that can be done to improve this algorithm?
Is there another approach that can be taken to increase performance?
Notes:
This question from SO doesn't help me. The Boyer-Moore-Horspool algorithm (as I understand it) is an algorithm for finding a single sub-string within a string. Since I'm parsing for multiple strings at once, I think there's much more room for optimization.
Aho-Corasick is a very cool algorithm but it's not ideal for keyword matches, because keyword matches are aligned; you can't have overlapping matches because you only match a complete identifier.
For the basic identifier lookup, you just need to build a trie out of your keywords (see note below).
Your basic algorithm is fine: find the beginning of the identifier, and then see if it's a keyword. It's important to improve both parts. Unless you need to deal with multibyte characters, the fastest way to find the beginning of a keyword is to use a 256-entry table, with one entry for each possible character. There are three possibilities:
The character can not appear in an identifier. (Continue the scan)
The character can appear in an identifier but no keyword starts with the character. (Skip the identifier)
The character can start a keyword. (Start walking the trie; if the walk cannot be continued, skip the identifier. If the walk finds a keyword and the next character cannot be in an identifier, skip the rest of the identifier; if it can be in an identifier, try continuing the walk if possible.)
Actually steps 2 and 3 are close enough together that you don't really need special logic.
There is some imprecision with the above algorithm because there are many cases where you find something that looks like an identifier but which syntactically cannot be. The most common cases are comments and quoted strings, but most languages have other possibilities. For example, in C you can have hexadecimal floating point numbers; while no C keyword can be constructed just from [a-f], a user-supplied word might be:
0x1.deadbeef
On the other hand, C++ allows user-defined numeric suffixes, which you might well want to recognize as keywords if the user adds them to the list:
274_myType
Beyond all of the above, it's really impractical to parse a million lines of code every time the user types a character in an editor. You need to develop some way of caching tokenization, and the simplest and most common one is to cache by input line. Keep the input lines in a linked list, and with every input line also record the tokenizer state at the beginning of the line (i.e., whether you're in a multi-line quoted string; a multi-line comment, or some other special lexical state). Except in some very bizarre languages, edits cannot affect the token structure of lines preceding the edit, so for any edit you only need to retokenize the edited line and any subsequent lines whose tokenizer state has changed. (Beware of working too hard in the case of multi-line strings: it can create lots of visual noise to flip the entire display because the user types an unterminated quote.)
Note: For smallish (hundreds) numbers of keywords, a full trie doesn't really take up that much space, but at some point you need to deal with bloated branches. One very reasonable data structure, which can be made to perform very well if you're careful about data layout, is a ternary search tree (although I'd call it a ternary search trie).
It will be hard to beat this code.
Suppose your keywords are "a", "ax", and "foo".
Take the list of keywords, sorted, and feed it into a program that prints out code like this:
switch(pc[0]){
break; case 'a':{
    if (0){
    } else if (strncmp(pc, "a", 1)==0 && !alphanum(pc[1])){
        // push "a"
        pc += 1;
    } else if (strncmp(pc, "ax", 2)==0 && !alphanum(pc[2])){
        // push "ax"
        pc += 2;
    }
}
break; case 'f':{
    if (0){
    } else if (strncmp(pc, "foo", 3)==0 && !alphanum(pc[3])){
        // push "foo"
        pc += 3;
    }
    // etc. etc.
}
// etc. etc.
}
Then if you don't see a keyword, just increment pc and try again.
The point is, by dispatching on the first character, you quickly get into the subset of keywords starting with that character.
You might even want to go to two levels of dispatch.
Of course, as always, take some stack samples to see what the time is being used for.
Regardless, if you have data structure classes, you're going to find that consuming a large part of your time, so keep that to a minimum (throw religion to the wind :)
The fastest way to do it would be a finite state machine built to the word set. Use Lex to build the FSM.
The best algorithm for this problem is probably Aho-Corasick. There already exist C implementations, e.g.,
http://sourceforge.net/projects/multifast/

To check whether a given string sequence is a palindrome or not using stacks

I want to know how to store the characters or integers popped from the stack so that I can compare them with the original string or int value.
For example :
n = 1221;
n = n / 1000; // gives the first digit, in this case 1; then divide the
              // remainder each time by 100, 10, and 1
Suppose I store each digit that I get in a variable; for example, say I store the 1 from the division above in a variable named s, and push it onto the stack.
After this, I pop the values back out; when I do so, how can I check whether each one is equal to the corresponding digit of the original number? I can check it with an if condition, e.g. for a 3-digit number:
(i == p && check for the other two digits)
but I don't want that; I want this to work for a number of any size.
Please don't send the source code for how to do it; just give me a few snippets or a few hints. Also, please let me know how you came up with the solution.
Thanks!
Don't send the source code on how to do it
ok ;)
just give me a few snippets or a few hints
Recursion
and while you give the solution, let me know how you came up with it: whether you had seen a program like this before, knew the algorithm, or came up with a solution just now when you saw the question. Thanks
I have been doing this too long to remember where I first saw it =\
You can use recursion to solve the problem, like Justin mentioned.
Once you start your recursion, you divide the number by the appropriate divisor (1000, 100, 10, 1) and store each quotient on the stack. You do this until the end.
Once you reach the units place, you store the units digit, and then start popping out of the stack.
Have an if-else ladder to return an integer from the recursive function.
Each rung of the ladder returns the integer built so far, shifted left one decimal place with the popped digit added in.
You can do your check in the main function.
1 -> 1221 (122 shifted left one digit, plus 1)
2 -> 122  (12 shifted left one digit, plus 2)
2 -> 12   (1 shifted left one digit, plus 2)
1 -> 1
Hope this helps.
Thanks
Aditya
