DFA and regular languages - theory

I've been thinking about the following and I think the answer's in the affirmative.
Is it true that every subset of a DFA-acceptable language that is regular is also DFA-acceptable?

No. Counterexample: the alphabet is the decimal digits. A DFA accepts all natural numbers. Subset: the prime numbers, which no DFA accepts, because the primes are not a regular language.
Natural numbers can be expressed as a regular language (and therefore a DFA can be constructed for them):
0|([1-9][0-9]*)

Finite automata, deterministic as well as nondeterministic, accept exactly the regular languages: the language of every finite automaton is regular, and every regular language has a DFA. So if the subset of a language is itself regular, then yes, a DFA accepts it.
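
To make the regex-to-DFA correspondence concrete, here is a minimal sketch of a DFA for 0|([1-9][0-9]*). The state names ("start", "zero", "number") and the function name accepts are made up for illustration; a missing table entry stands for the implicit dead state.

```python
# Hypothetical state names; the table encodes the regex 0|([1-9][0-9]*).
DIGITS = "0123456789"

TRANSITIONS = {
    ("start", "0"): "zero",                                  # a lone 0
    **{("start", d): "number" for d in "123456789"},        # leading nonzero digit
    **{("number", d): "number" for d in DIGITS},             # any further digits
}
ACCEPTING = {"zero", "number"}


def accepts(text: str) -> bool:
    """Run the DFA over text and report whether it ends in an accepting state."""
    state = "start"
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:  # no transition defined: implicit dead state
            return False
    return state in ACCEPTING
```

For example, accepts("1234") holds while accepts("007") does not, because the "zero" state has no outgoing transitions.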

Related

LISP arithmetics implementation

I'm making a toy lisp interpreter with D and I don't know the theory of Lisp very well.
I was wondering if Lisp can implement basic arithmetic functions (+, -, ×, ÷) by itself.
Most Lisp/Scheme dialects implement them with the built-ins of a C- or Java-like host language and expose them to Lisp code (duplicated implementations?).
I want to write the arithmetic functions purely in Lisp.
Is it possible?
Unless you want to use Church numerals or the like, at some point you're going to have to get into the hardware arithmetic instructions (add, sub, mul, div) one way or another.
If going down the hardware instructions route, then depending on your Lisp implementation, it may be implemented using C code (especially for an interpreter-based implementation), or those instructions may be emitted directly (for a JIT compiler-based implementation).
If you're trying to be as first-principles as possible, you can implement multiplication and division using addition and subtraction instructions (in a pinch, you can implement them the same way you were taught to in school, though you're using word-sized digits—that is, for a 32-bit machine, each digit is base-4294967296 instead of base-10).
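
As a rough illustration of the schoolbook method with word-sized digits, the sketch below multiplies two numbers stored as little-endian lists of base-2^32 digits. It is written in Python only for brevity; the names to_limbs, from_limbs and schoolbook_mul are hypothetical, and it assumes a single-digit product (32 x 32 bits giving 64 bits) is available as a primitive. The two conversion helpers lean on Python's own big integers and exist only to check the result.

```python
BASE = 1 << 32  # one "digit" per 32-bit machine word


def to_limbs(n: int) -> list[int]:
    """Split a non-negative int into little-endian base-2**32 digits."""
    limbs = []
    while n:
        limbs.append(n % BASE)
        n //= BASE
    return limbs or [0]


def from_limbs(limbs: list[int]) -> int:
    """Inverse of to_limbs, used only to verify results."""
    value = 0
    for limb in reversed(limbs):
        value = value * BASE + limb
    return value


def schoolbook_mul(a: list[int], b: list[int]) -> list[int]:
    """Multiply digit by digit, carrying into the next column as on paper."""
    result = [0] * (len(a) + len(b))
    for i, x in enumerate(a):
        carry = 0
        for j, y in enumerate(b):
            total = result[i + j] + x * y + carry
            result[i + j] = total % BASE   # digit that stays in this column
            carry = total // BASE          # overflow moves one column up
        result[i + len(b)] += carry
    while len(result) > 1 and result[-1] == 0:
        result.pop()                       # drop leading zero digits
    return result
```

Usage: from_limbs(schoolbook_mul(to_limbs(x), to_limbs(y))) should equal x * y for non-negative x and y.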
The very simple solution would always be to use your host's numeric tower, but I understand your desire to keep the set of primitives small. The result, however, is a language like the first LISPs, which had a bad reputation for performance.
As an alternative to Chris's Church numerals you can model numbers using lists. E.g. 1234 can be (+ 4 3 2 1). Now you either have a low-level numeric type as a primitive, or the digits you see are simply self-evaluating symbols that your math functions know how to interpret. If you have a low-level numeric type you can add an exponent so it becomes (+ 0 4 3 2 1) for 1234 and (+ 1 4 3 2 1) for 12340 and (+ -11 4 3 2 1) for 0.00000001234. All arithmetic would be list iterations using exactly the math you know from school. It's more effective than Church numerals, slightly more efficient, and easier to print and read.
I have used this in my little Lisp interpreter that only has lists and symbols.
If you are interested in implementing bignums in Lisp I can recommend this series by André van Meulebrouck:
https://web.archive.org/web/20101208222557/http://www.mactech.com/articles/mactech/Vol.08/08.03/BigNums/index.html
The link above points to web.archive.org. For some reason the original link, while still live, shows no content ( http://www.mactech.com/articles/mactech/vol.08/08.03/bignums/index.html )
All of Peano arithmetic is based on three functions:
zero, represented in Lisps as zerop or zero? depending on dialect;
successor, represented in Lisps by add1; if your dialect doesn't have it you're going to have to implement + in your bootstrap language anyway;
identity, represented in Lisp by equal.
With those three functions you can build the whole of arithmetic, but it is not going to be easy!
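
To give a feel for what "not going to be easy" means, here is a small sketch that builds addition and multiplication from only those three primitives. It is in Python rather than Lisp, Python ints merely stand in for whatever number representation the interpreter has, and the names is_zero, add1, add and mul are illustrative. Since there is no predecessor primitive, the loops count upwards and stop on equality.

```python
# The three assumed primitives; everything else is built on top of them.
def is_zero(n: int) -> bool:      # zerop / zero?
    return n == 0


def add1(n: int) -> int:          # successor
    return n + 1


def equal(a: int, b: int) -> bool:
    return a == b


def add(a: int, b: int) -> int:
    """a + b: apply add1 to a exactly b times, counting up from zero."""
    if is_zero(b):
        return a
    result, steps = a, 0
    while not equal(steps, b):
        result, steps = add1(result), add1(steps)
    return result


def mul(a: int, b: int) -> int:
    """a * b: add a to an accumulator exactly b times."""
    result, steps = 0, 0
    while not equal(steps, b):
        result, steps = add(result, a), add1(steps)
    return result
```

Every operation here takes time proportional to the magnitude of its operands, which is the performance cost mentioned above.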
In my opinion you would be wise to build your Lisp arithmetic primitives (add, subtract, multiply, divide) in your implementation language. This is particularly so if you want to have first class ratios and bignums.

Writing my own float parser

I am trying to write a parser in C and part of its job is to convert a series of characters into a double. Up to now I have been using strtod but I find it to be quite dangerous and it won't handle cases where the number is at the end of the buffer, which is not null terminated.
I thought I'd write my own. If I have a string representation of a number of the form a.b, would I be naive to think that I can just calculate (double)a + ((double)b / (double)10^n), where n is the number of digits in b?
For example, 23.4563:
a = 23
b = 4563
final answer: 23 + (4563/10000)
Or would that produce inaccurate results with regard to the IEEE format of floats?
It is hard to read floating-point numerals accurately, in the sense that there are various problems that must be carefully addressed, and many people fail to do so. However, it is a solved problem. To start, see How to read floating point numbers accurately, June 1990, by William D. Clinger.
I agree with Roddy, you are likely better off copying the data into a buffer and using existing library functions. (However, you should check that your C implementation provides correctly rounded conversion of floating-point numerals. The C standard does not require it, and some implementations do not provide it.)
You may be interested in this answer of mine to a somewhat related question.
The parser in that answer converts decimal floating point numbers (represented as strings) into IEEE-754 floats and doubles with proper rounding.
As far as I remember, about the only issue in the code is that it may not handle the cases when the exponent part is too big (doesn't fit into an integer) and should amount to returning either an error or INF.
Otherwise, it should give you a good idea of what to do (if you have any idea at all of what you're doing:).
As already said, it's difficult, you need extra precision, etc...
But if you have restricted inputs, and want to know if you can still correctly convert these restricted decimals to binary with a semi-naive algorithm and standard IEEE 754 ops, you might be interested in my answer to
How to manually parse a floating point number from a string
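
As one illustration of the restricted-input case, the sketch below (Python for brevity, not the code from the linked answers) treats a.b as the single integer ab divided by 10^n. If that integer is below 2^53 and n is at most 22, both operands are exactly representable as doubles, so one IEEE 754 division rounds only once. This also avoids the two roundings in a + b/10^n. The function name parse_decimal_fast and the decision to raise an error outside the fast path are assumptions of this sketch; signs and exponents are not handled.

```python
MAX_EXACT_INT = 1 << 53   # integers below this are exactly representable
MAX_EXACT_POW10 = 22      # 10**22 is the largest power of ten exact in a double


def parse_decimal_fast(text: str) -> float:
    """Parse 'a.b' made of ASCII digits only (no sign, no exponent)."""
    whole, _, frac = text.partition(".")
    digits = whole + frac
    if not digits or any(c not in "0123456789" for c in digits):
        raise ValueError(f"not a plain decimal numeral: {text!r}")
    significand = int(digits)
    if significand >= MAX_EXACT_INT or len(frac) > MAX_EXACT_POW10:
        raise ValueError("outside the fast path; a full algorithm is needed")
    # Both operands are exact, so this single division is correctly rounded.
    return float(significand) / float(10 ** len(frac))
```

For 23.4563 this computes 234563 / 10000 in one step instead of 23 + 4563 / 10000.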

Storing and printing integer values greater than 2^64

I am trying to write a program for finding Mersenne prime numbers. Using the unsigned long long type I was able to determine the value of the 9th Mersenne prime, which is (2^61)-1. For larger values I would need a data type that could store integer values greater than 2^64.
I should be able to use operators like *, *=, > ,< and % with this data type.
You cannot do what you want with C's native types; however, there are libraries that let you handle arbitrarily large numbers, like the GNU Multiple Precision Arithmetic Library.
To store large numbers, there are many choices, which are given below in order of decreasing preferences:
1) Use third-party libraries developed by others on github, codeflex etc for your mentioned language, that is, C.
2) Switch to another language with large-number support: Python, which has built-in arbitrary-precision integers, Java, which provides BigInteger, or C++.
3) Develop your own data structures, maybe in terms of strings (where a 100-character string holds 100 decimal digits), with custom operations like addition, subtraction, multiplication etc., much as the complex number library in C++ was developed. This choice suits research and educational purposes.
What all these people are basically saying is that a 64-bit CPU cannot add such huge numbers with a single instruction; you need an algorithm that adds them, and such an algorithm has to treat the two numbers in pieces.
The libraries they listed will do that for you; a good exercise would be to develop one yourself (just the algorithm/function, to learn how it's done).
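
A minimal sketch of that "in pieces" addition, written in Python for brevity: each number is a little-endian list of 64-bit words, and the carry out of one word is fed into the next. The names add_limbs, WORD_BITS and WORD_MASK are made up for this example; in C the same loop would use uint64_t words and detect the carry from wrap-around.

```python
WORD_BITS = 64
WORD_MASK = (1 << WORD_BITS) - 1


def add_limbs(a: list[int], b: list[int]) -> list[int]:
    """Add two little-endian lists of 64-bit words, propagating the carry."""
    result = []
    carry = 0
    for i in range(max(len(a), len(b))):
        x = a[i] if i < len(a) else 0
        y = b[i] if i < len(b) else 0
        total = x + y + carry
        result.append(total & WORD_MASK)  # the part that fits in one word
        carry = total >> WORD_BITS        # 0 or 1, handed to the next word
    if carry:
        result.append(carry)
    return result
```

For instance, adding 1 to a word holding 2^64 - 1 yields [0, 1], that is, 2^64 stored in two words.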
There is no standard way to get a data type wider than 64 bits. You should check the documentation of your system; some define 128-bit integers. However, to really have flexible-size integers, you should use another representation, an array for instance. Then it's up to you to define the operators =, <, >, etc.
Fortunately, libraries such as GMP permit you to use arbitrary-length integers.
Take a look at the GNU MP Bignum Library.
Use double :)
it will solve your problem!
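
Before relying on this, it is worth checking how a double treats such values. An IEEE 754 double has a 53-bit significand, so most integers above 2^53 cannot be stored exactly. The sketch below uses Python, assuming its float is an IEEE 754 double (true on common platforms); the helper name survives_double is hypothetical.

```python
def survives_double(n: int) -> bool:
    """True if n comes back unchanged after a round trip through a double."""
    return int(float(n)) == n


mersenne_61 = 2**61 - 1          # exact: Python ints have arbitrary precision
as_double = float(mersenne_61)   # rounded to the nearest representable double
```

Here survives_double(mersenne_61) is False, because the nearest double to 2^61 - 1 is 2^61 itself, so operations like % would silently work on the wrong number.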

Interview : Hash function: sine function

I was asked this interview question. I am not sure what the correct answer for it is (and the reasoning behind the answer):
Is sin(x) a good hash function?
If you mean sin(), it's not a good hashing function because:
it's quite predictable and for some x it's no better than just x itself. There should be no seemingly apparent relationship between the key and the hash of the key.
it does not produce an integer value. You cannot index/subscript arrays with floating-point indices and there must be some kind of array in the hash table.
floating-point is very implementation-specific and even if you make a hash function out of sin(), it may not work with a different compiler or on a different kind of CPU/computer.
sin() may be much slower than some simpler integer-arithmetic function.
Not really.
It's horribly slow.
You'll need to convert the result to some integer type anyway to avoid the insanity of floating-point equality comparisons. (Not actually the usual precision problems that are endemic to FP equality comparisons and which arise from calculating two things slightly different ways; I mean specifically the problems caused by things like the fact that 387-derived FPUs store extra bits of precision in their registers, so if a comparison is done between two freshly-calculated values in registers you could get a different answer than if exactly one of the operands was loaded into a register from memory.)
It's almost flat near the peaks and troughs, so the quantisation step (multiplying by some large number and rounding to an integer) will produce many hash values near the min and max, rather than an even distribution.
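
The clustering near the peaks and troughs can be seen with a small experiment. This is only a sketch under simple assumptions: keys are the integers 0..n-1 taken as radians, and the quantisation maps sin(x) from [-1, 1] onto a fixed number of equal-width buckets. The function names sine_bucket and bucket_counts are made up for illustration.

```python
import math
from collections import Counter


def sine_bucket(key: int, buckets: int) -> int:
    """Quantise sin(key) into one of `buckets` equal-width slots."""
    scaled = (math.sin(key) + 1.0) / 2.0            # map [-1, 1] onto [0, 1]
    return min(int(scaled * buckets), buckets - 1)  # clamp the sin == 1 edge


def bucket_counts(n_keys: int, buckets: int) -> list[int]:
    """Count how many of the keys 0..n_keys-1 land in each bucket."""
    counts = Counter(sine_bucket(k, buckets) for k in range(n_keys))
    return [counts[b] for b in range(buckets)]
```

With ten buckets, the first and last buckets end up holding noticeably more keys than the middle ones, which is the uneven distribution described above.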
Based off of mathematical knowledge:
Sine(x) is periodic, so it's going to reach the same number from different values of x, so Sine(x) would be awful as a hashing function because you will get multiple values hashing to the exact same point. There are **infinitely many values between 0 and 2*pi for the return value, but then past that the values will repeat. So 0 & pi & 2*pi will all hash to the same point.
If you could make the increment small enough and have Sine(x) multiplied by say x^2 or something of that nature it'd be mediocre at best, but then again, if you were to do that why not just use x^2 anyway and toss out the periodic function altogether.
**infinitely: a large enough number that I'm not willing to count.
NOTE: Sine(x) will have values that are small and could be affected by rounding error.
NOTE: Any value taken from a sine function should be multiplied by an integer and then either modded or the floor or ceiling taken so that the value can be used as an array offset, etc.
sin(x) is trigonometric function which repeats itself after every 360 degrees, so it's going to be a poor hash function as the hash will be repeated too often.
A simple refutation:
sin(0) == sin(360) == sin(720) == sin(..)
This is not a property of a good hash function.
Even if you decide to use it, it's difficult to represent the value returned by sin.
Sin function:
sin x = x - x^3/3! + x^5/5! - ...
This can't be accurately represented due to floating-point precision issues, which means that for the same value it may produce two different hashes!
Another point to note:
For sine(x) as a hash function: keys in a given close range will have hash values in a close range too, which is not desirable. A good hash function distributes hash values evenly irrespective of the nature of the keys.
Hash values generally have to be integers to be useful. Since sin doesn't generate integers it wouldn't be appropriate.
Let's say we have a string s. It can be expressed as a number in hexadecimal and fed to the function. If you added 2 pi it would cease to be a valid input, as it wouldn't be an integer anymore (only non-negative integers are accepted by the function). You have to find a string that gives a collision, not just multiply the hex expression of the string with 2 pi. And adding (concatenating?) 2 pi directly to the string wouldn't help finding a collision. There might be another way though but not that trivial.
I think sin(x) can make an excellent cryptographic hash function,
if used wisely. The input should be a natural number in radians
and never contain pi. We must use arbitrary-precision arithmetic.
For every natural number x (radians), sin(x)
is always a transcendental irrational number and there is no other
natural number with the same sine. But there's a catch: An attacker could gain
information about the input, by computing the arcsin of the hash.
In order to prevent this, we ignore the integer part and some of the
first digits from the fractional part, keeping only the next n (say 100) digits,
making such an attack computationally infeasible.
It seems that a small change in the input gives a completely different result,
which is a desirable property.
The result of the function seems statistically random, again a good property.
I'm not sure how to prove that it is collision-resistant but I can't see why
it couldn't be. Also, I can't think of a way to find a specific input that results
in a specific hash. I'm not saying that we should blindly believe that it is
certainly a good crypt. hash function. I just think that it seems like a
good candidate to be one. We should give it a chance
and focus on proving that it is. And it might be a very good one.
To those that might say it is slow: Yes, it is. And that's good when hashing passwords.
Here I'm attaching some Perl code for this idea. It runs on Linux with bash and bc.
(bc is a command-line arbitrary-precision calculator, included in most distros)
I'll be checking this page for any answers, since this interests me a lot.
Don't be harsh though, I'm just a CS undergrad, willing to learn more.
use warnings;
use strict;
my $input='5AFF36B7';#Input for bc (as an uppercase hex number)
$input='1'.$input;#put '1' in front of input, so that 0x0 , 0x00 , 0x1 , 0x01 , etc ... ,
#all give different nonzero results
#call bc through a pipe (works in a plain POSIX sh, not only bash), keep result in $a
#scale and obase are set while ibase is still 10; ibase=16 must come last
my $a=`echo "scale=256;obase=16;ibase=16;s($input)" | bc -l -q`;
#keep only fractional part
$a=~tr/a-zA-Z0-9//cd;#Clean up string, keep only alphanumerics
my @m = $a =~ /./g;#Convert string to array of chars
#PRINT OUTPUT
#We ignore some digits, for security reasons:
#If we don't ignore any of the first digits, an attacker could gain information
#about the input by computing the inverse of sin (the arcsin of the hash)
#By ignoring enough of the first digits, it becomes computationally
#infeasible to compute arcsin
#Also, to avoid problems with roundoff error, we ignore some of the last digits
for (my $c=100;$c<200;$c++){
print $m[$c];
}

Difference between Turing-Decidable and Co-Turing-Decidable

I am really struggling with understanding the difference between these two. From my textbook, it essentially describes the difference by saying
a language is co-Turing-recognizable if it is the complement of a Turing-recognizable language.
I guess the part of this definition I don't understand is: what does it mean when it is a complement of a turing-recognizable language?
How exactly do you determine if it is a complement of another language?
(A note- the terms "Turing decidable" and "co-Turing decidable" are the same thing. However, "Turing-recognizable" and "co-Turing-recognizable" are not the same, and it's this that I've decided to cover in my answer. The reason for this is that if a language is decidable, then its complement must be decidable as well. The same is not true of recognizable languages.)
Intuitively, a language is Turing-recognizable if there is some computer program that, given a string in the language, can confirm that the string is indeed within the language. This program might loop infinitely if the string isn't in the language, but it's guaranteed to always eventually accept if you give it a string in the language.
While it's true that a language is co-Turing-recognizable if it's the complement of a language that's Turing-recognizable, this definition doesn't shed much light on what's going on. Intuitively, if a language is co-Turing-recognizable, it means that there is a computer program that, given a string not in the language, will eventually confirm that the string is not in the language. It might loop infinitely if the string is indeed within the language, though. The reason for this is simple - if some string w isn't contained within a co-Turing-recognizable language, then that string w must be contained within the complement of that co-Turing-recognizable language, which (by definition) has to be Turing-recognizable. Since w is in the Turing-recognizable complement, there must be some program that can confirm that w is indeed in the complement. This program therefore can confirm that w is not in the original co-Turing-recognizable language.
In short, Turing-recognizability means that there is a program that can confirm that a string w is in a language, and co-Turing-recognizability means that there is a program that can confirm that a string w is not in the language.
Hope this helps!
Let me explain, in slightly different words, why decidable and co-decidable mean the same thing. Experienced readers, please let me know if I have gone wrong:
Suppose a set of strings S forms L, so that its complement S’ forms L’. L being decidable means we have an algorithm / TM which confirms that any string s∈S belongs to L and any s'∈S' does not belong to L. The same algorithm tells us that s∈S does not belong to L’ and s'∈S' belongs to L’. In other words, exactly the same definition holds for L’. So taking the complement gives the concept of a decidable language no new meaning. Hence, decidable and co-decidable languages are said to be the same.
A language is recognizable iff there is a Turing machine that halts and accepts exactly the strings in that language; for strings not in the language, the TM either rejects or does not halt at all. Note: there is no requirement that the Turing machine halt for strings not in the language.
A language is decidable iff there is a Turing machine that halts on every input, accepting strings in the language and rejecting strings not in the language.
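
The difference can be sketched with a program that only ever says "yes". The example below searches for an integer root of a two-variable polynomial: if a root exists the search eventually finds it and accepts, and if none exists the unbounded search never returns. For general multivariate polynomials this is the classic recognizable-but-undecidable problem (Hilbert's tenth problem); the two-variable form here is only meant to show the shape of a recognizer, not to claim undecidability for that special case. The function name recognize_has_root and the max_bound parameter are assumptions added so the sketch can be run safely.

```python
from itertools import count, product
from typing import Callable, Optional


def recognize_has_root(poly: Callable[[int, int], int],
                       max_bound: Optional[int] = None) -> Optional[bool]:
    """Semi-decide whether poly(x, y) == 0 has an integer solution.

    Returns True as soon as a root is found. With max_bound=None the search
    is unbounded and never returns when no root exists (a true recognizer).
    With a finite max_bound it returns None when the budget runs out.
    """
    bounds = count(0) if max_bound is None else range(max_bound + 1)
    for bound in bounds:
        values = range(-bound, bound + 1)
        for x, y in product(values, repeat=2):
            # Only test points on the new outer "ring" to avoid rechecking.
            if max(abs(x), abs(y)) == bound and poly(x, y) == 0:
                return True
    return None  # budget exhausted: no verdict, which is NOT a rejection
```

A decider, by contrast, would have to return False for inputs with no root, and a co-recognizer is the mirror image: guaranteed to answer only when the string is not in the language.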

Resources