Several questions about permutations

How many unique permutations are there for all 255 ASCII characters? Ranging from 1 character in length to 255.
How did you calculate that?
About how much space (in GB, TB, PB, etc.) would that file take up?
Roughly how long would it take for a single computer to generate that file?
Does a project like this exist that possibly uses a bunch of computers on a network or the internet to generate all these permutations?
Would that project be feasible and could a GPU be used to generate them faster than a CPU?

There are 255 + 255^2 + 255^3 + ... + 255^255 "permutations" (strictly, strings with repetition allowed) of 255 characters in lengths from one to 255: each position can hold any of the 255 characters, so there are 255^k strings of length k.
PS: There are only 128 ASCII characters (codes 0 through 127).
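For the "how did you calculate that" part: the sum is a geometric series, so it has a closed form, 255 + 255^2 + ... + 255^255 = (255^256 - 255) / 254, a number with roughly 614 decimal digits. A minimal sketch that evaluates it exactly, assuming the GMP bignum library is available (link with -lgmp):

    /* Count of all strings of length 1..255 over a 255-symbol alphabet:
     * sum_{k=1}^{255} 255^k = (255^256 - 255) / 254 (geometric series). */
    #include <gmp.h>
    #include <stdio.h>

    int main(void)
    {
        mpz_t total;
        mpz_init(total);

        mpz_ui_pow_ui(total, 255, 256);      /* 255^256 */
        mpz_sub_ui(total, total, 255);       /* 255^256 - 255 */
        mpz_divexact_ui(total, total, 254);  /* the division is exact */

        gmp_printf("%Zd\n", total);          /* ~614-digit result */

        mpz_clear(total);
        return 0;
    }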

How many bytes will be required to store number in binary and text files respectively

If I want to store a number, let's say 56789 in a file, how many bytes will be required to store it in binary and text files respectively? I want to know how bytes are allocated to data in binary and text files.
It depends on:
text encoding and number system (decimal, hexadecimal, many more...)
signed/not signed
single integer or multiple (require separators)
data type
target architecture
use of compressed encodings
In ASCII a character takes 1 byte. In UTF-8 a character takes 1 to 4 bytes, but digits always take 1 byte. In UTF-16 each character takes 2 or 4 bytes.
Non-ASCII encodings may also add 2-3 bytes at the start of the file for a BOM (byte order mark); whether one is written depends on the editor and/or settings used when the file was created.
But let's assume you store the data in a simple ASCII file, or the discussion becomes needlessly complex.
Let's also assume you use the decimal number system.
In hexadecimal you use digits 0-9 and letters a-f to represent numbers. A decimal (base-10) number like 34234324423 would be 7F88655C7 in hexadecimal (base-16): 11 digits in the first system, just 9 in the second. The minimum base is 2 (digits 0 and 1) and the most common large base is 64 (Base64). Technically, with the 95 printable ASCII characters you could go up to roughly base-95, but that is very uncommon.
Each digit (0-9) will take one byte. If you have signed integers, an additional minus sign will lead the digits, so negative numbers cost 1 additional byte.
In some circumstances you may want to store several numerals. You will then need a separator to tell the numerals apart. A comma (,), colon (:), semicolon (;), pipe (|) or newline (LF, CR, or CRLF on Windows, which takes 2 bytes) have all been observed in the wild as legitimate separators of numerals.
What is a numeral? The concept or idea of the quantity 8 that is IN YOUR HEAD is the number. Any representation of that concept on stone, paper, magnetic tape, or in pixels on a screen is just that: a REPRESENTATION. Such symbols, which stand for what you understand in your brain, are numerals. Please don't ever confuse numbers with numerals; this distinction is foundational in mathematics and computer science.
In these cases you want to count one additional character per numeral for the separator (or maybe one per numeral minus one). It depends on whether you terminate each numeral with a marker or only separate the numerals from each other:
Example (three digits and three newlines): 6 bytes
1<LF>
2<LF>
3<LF>
Example (three digits and two commas): 5 bytes
1,2,3
Example (four digits and one comma): 5 bytes
2134,
Example (sign and one digit): 2 bytes
-3
If you store the data in a binary format (not to be confused with the binary number system, which would still be a text format) the occupied memory depends on the integer type (or, better, bit length of the integer).
An octet (0..255) will occupy 1 byte. No separators or leading signs required.
A 16-bit integer will occupy 2 bytes. For C and C++ the underlying architecture and data model must be taken into account: a long on a typical 32-bit architecture takes 4 bytes, while the very same code, compiled for a 64-bit (LP64) architecture, takes 8 bytes.
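To illustrate, here is a quick sketch that prints the sizes the compiler actually uses; the fixed-width types from <stdint.h> are the portable way to pin a size down:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* Fixed-width types have the same size on every platform. */
        printf("int16_t : %zu bytes\n", sizeof(int16_t));  /* 2 */
        printf("int32_t : %zu bytes\n", sizeof(int32_t));  /* 4 */
        printf("int64_t : %zu bytes\n", sizeof(int64_t));  /* 8 */

        /* These depend on the compiler/architecture (data model). */
        printf("int     : %zu bytes\n", sizeof(int));
        printf("long    : %zu bytes\n", sizeof(long));     /* 4 on 32-bit and Win64, 8 on LP64 */
        printf("void *  : %zu bytes\n", sizeof(void *));
        return 0;
    }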
There are exceptions to those flat rules. As an example, Google's protobuf uses a zig-zag VarInt implementation that leverages variable length encoding.
Here is a VarInt implementation in C/C++.
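That implementation isn't reproduced here, but a minimal sketch of the idea (illustrative only, not the linked code or the actual protobuf source) looks like this: zig-zag folds negative values into small positive ones, and the varint then writes 7 payload bits per byte with the high bit as a continuation flag, so small magnitudes need fewer bytes.

    #include <stddef.h>
    #include <stdint.h>

    /* Zig-zag: map signed to unsigned so small magnitudes stay small
     * (0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...).
     * Relies on arithmetic right shift, as on all mainstream compilers. */
    static uint32_t zigzag32(int32_t n)
    {
        return ((uint32_t)n << 1) ^ (uint32_t)(n >> 31);
    }

    /* Varint: 7 bits per byte, high bit set while more bytes follow.
     * Returns the number of bytes written (1..5 for 32-bit values). */
    static size_t varint32_encode(uint32_t v, uint8_t *out)
    {
        size_t i = 0;
        while (v >= 0x80) {
            out[i++] = (uint8_t)(v | 0x80);
            v >>= 7;
        }
        out[i++] = (uint8_t)v;
        return i;
    }

For example, varint32_encode(zigzag32(-3), buf) emits the single byte 0x05, whereas a plain int32_t would always cost 4 bytes.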
EDIT: added Thomas Weller's suggestion
Beyond the actual file CONTENT you will have to store metadata about the file (for bookkeeping: the first sector, the filename, access permissions and more). This metadata is usually not shown as part of the file's size, but it does occupy space on disk.
If you store each numeral in a separate file, such as the numeral 10 in the file result-10, these metadata entries will occupy more space than the numerals themselves.
If you store ten, a hundred, thousands, or millions of numerals in one file, that overhead becomes increasingly irrelevant.
More about metadata here.
EDIT: to be clearer about file overhead
The overhead is relevant under some circumstances, as discussed above.
But it is not a differentiator between textual and binary formats. As doug65536 says, however you store the data, if the filesystem structure is the same, it does not matter.
A file is a file, regardless of whether it contains binary data or ASCII text.
Still, the above reasoning applies regardless of the format you choose.
The number of digits needed to store a number n in a given number base is ceil(log(n)/log(base)) (or, to be exact when n is a power of the base, floor(log(n)/log(base)) + 1).
Storing as decimal would be base 10, storing as hexadecimal text would be base 16. Storing as binary would be base 2.
You would usually need to round up to a multiple of eight or power of two when storing as binary, but it is possible to store a value with an unusual number of bits in a packed format.
Given your example number (ignoring negative numbers for a moment):
56789 in base 2 needs 15.793323887 bits (16)
56789 in base 10 needs 4.754264221 decimal digits (5)
56789 in base 16 needs 3.948330972 hex digits (4)
56789 in base 64 needs 2.632220648 characters (3)
Representing sign needs an additional character or bit.
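A small sketch of that digit-count formula, using the floor(...) + 1 form and checking it against the example values (link with -lm):

    #include <math.h>
    #include <stdio.h>

    /* Digits needed to write n (n >= 1) in the given base:
     * floor(log_base(n)) + 1, which matches ceil(log_base(n))
     * except when n is an exact power of the base. */
    static int digits_needed(unsigned long long n, int base)
    {
        return (int)floor(log((double)n) / log((double)base)) + 1;
    }

    int main(void)
    {
        unsigned long long n = 56789;
        printf("base  2: %d digits\n", digits_needed(n, 2));   /* 16 */
        printf("base 10: %d digits\n", digits_needed(n, 10));  /* 5  */
        printf("base 16: %d digits\n", digits_needed(n, 16));  /* 4  */
        printf("base 64: %d digits\n", digits_needed(n, 64));  /* 3  */
        return 0;
    }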
To compare binary with text, assume a byte is 8 bits and that each ASCII character in a text encoding occupies one byte (8 bits). A byte has a range of 0 to 255; a decimal digit has a range of 0 to 9, so each text byte encodes only about 3.32 bits of the number (log(10)/log(2)), whereas a binary encoding stores a full 8 bits of the number per byte. Encoding numbers as text therefore takes about 2.4x as much space. If you pad your numbers so they line up in fields, text becomes a very poor storage encoding: a typical 10-digit field occupies 80 bits but carries only about 33 bits of binary-encoded data.
I am not too well versed in this subject; however, I believe it would not just be a matter of the content, but also of the METADATA attached. But if you were just talking about the number itself, you could store it in ASCII or in binary form.
In binary, 56789 converts to 1101110111010101; there is a 'simple' way to work this out on paper, but http://www.binaryhexconverter.com/decimal-to-binary-converter is a website you can use to convert it.
1101110111010101 has 16 digits, therefore 16 bits, which is two bytes.
If you store the value itself in binary form, it fits in those 2 bytes as a 16-bit integer; a typical 32-bit int would spend 4 bytes on it, and a 64-bit integer 8 bytes. If instead you write the 16-character string 1101110111010101 into an ASCII text file, each character costs 1 byte, so that representation takes 16 bytes.
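A quick sketch to see those sizes on a concrete machine: snprintf gives the text lengths, sizeof the binary ones.

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint16_t n = 56789;
        char text[32];

        /* Text representations: one byte per ASCII character. */
        printf("decimal text: %d bytes\n", snprintf(text, sizeof text, "%" PRIu16, n)); /* 5 */
        printf("hex text    : %d bytes\n", snprintf(text, sizeof text, "%" PRIX16, n)); /* 4 */

        /* Binary representations: fixed size regardless of the value. */
        printf("uint16_t    : %zu bytes\n", sizeof(uint16_t)); /* 2 */
        printf("uint32_t    : %zu bytes\n", sizeof(uint32_t)); /* 4 */
        printf("uint64_t    : %zu bytes\n", sizeof(uint64_t)); /* 8 */
        return 0;
    }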
Before you post any question, you should do your research.
The size of the file depends on many factors, but for the sake of simplicity: in text format, a number occupies 1 byte per digit when using UTF-8 encoding, while a binary value of the long data type takes 4 bytes (on platforms where long is 32 bits).

logically Understanding a compression algorithm

This idea has been floating around in my head for 3 years and I am having problems applying it.
I wanted to create a compression algorithm that cuts the file size in half,
e.g. 8 MB to 4 MB.
With some searching and experience in programming I understood the following.
Let's take a .txt file with the letters (a,b,c,d).
Using the IO.File.ReadAllBytes function, it gives the following array of bytes: (97 | 98 | 99 | 100), which according to this: https://en.wikipedia.org/wiki/ASCII#ASCII_control_code_chart is the decimal value of each letter.
What I thought about was: how to mathematically cut this 4-member array down to a 2-member array by combining each 2 members into a single member. But you can't simply combine two numbers mathematically and then reverse them back, as there are many possibilities, e.g.
80 | 90: 90+80=170, but there is no way to know that 170 was the result of 80+90 rather than, say, 100+70 or 110+60.
And even if you could overcome that, you would be limited by the maximum value of a byte (255) in a single member of the array.
I understand that most compression algorithms use binary compression and are successful, but imagine cutting a file size in half. I would like to hear your ideas on this.
Best Regards.
It's impossible to make a compression algorithm that makes every file shorter. The proof is called the "counting argument", and it's easy:
There are 256^L possible files of length L.
Let's say there are N(L) possible files with length < L.
If you do the math, you find that 256^L = 255*N(L)+1
So. You obviously cannot compress every file of length L, because there just aren't enough shorter files to hold them uniquely. If you made a compressor that always shortened a file of length L, then MANY files would have to compress to the same shorter file, and of course you could only get one of them back on decompression.
In fact, there are more than 255 times as many files of length L as there are shorter files, so you can't even compress most files of length L. Only a small proportion can actually get shorter.
This is explained pretty well (again) in the comp.compression FAQ:
http://www.faqs.org/faqs/compression-faq/part1/section-8.html
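A quick sketch that checks that identity for small L (64-bit arithmetic holds up to L = 7):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* N(L) = number of files shorter than L = 1 + 256 + ... + 256^(L-1).
         * Verify that 256^L = 255*N(L) + 1. */
        for (int L = 1; L <= 7; L++) {
            uint64_t shorter = 0, p = 1;
            for (int k = 0; k < L; k++) {
                shorter += p;   /* add 256^k */
                p *= 256;
            }
            /* p is now 256^L */
            printf("L=%d: 256^L = %llu, 255*N(L)+1 = %llu\n", L,
                   (unsigned long long)p,
                   (unsigned long long)(255 * shorter + 1));
        }
        return 0;
    }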
EDIT: So maybe you're now wondering what this compression stuff is all about...
Well, the vast majority of those "all possible files of length L" are random garbage. Lossless data compression works by assigning shorter representations (the output files) to the files we actually use.
For example, Huffman encoding works character by character and uses fewer bits to write the most common characters. "e" occurs in text more often than "q", for example, so it might spend only 3 bits to write "e"s, but 7 bits to write "q"s. Bytes that hardly ever occur, like character 131, may be written with 9 or 10 bits -- longer than the 8-bit bytes they came from. On average you can compress simple English text by almost half this way.
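To make the saving concrete, here is a toy sketch with made-up code lengths (purely illustrative; a real Huffman coder derives a valid prefix code from the actual frequencies). It just totals the bits for fixed 8-bit bytes versus variable-length codes:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *text = "see the queen sneeze";   /* toy sample text */

        long fixed_bits = 8L * (long)strlen(text);
        long var_bits = 0;
        for (const char *p = text; *p; p++) {
            switch (*p) {
            case 'e':                     var_bits += 3; break;  /* very common */
            case 's': case 'n': case ' ': var_bits += 4; break;
            case 't': case 'h': case 'u': var_bits += 5; break;
            case 'q': case 'z':           var_bits += 7; break;  /* rare */
            default:                      var_bits += 8; break;
            }
        }
        printf("fixed 8-bit: %ld bits, variable-length: %ld bits\n",
               fixed_bits, var_bits);
        return 0;
    }

For this sample the variable-length total comes out at around half the fixed-width total, which is roughly the ratio quoted above for simple English text.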
LZ and similar compressors (like PKZIP, etc) remember all the strings that occur in the file, and assign shorter encodings to strings that have already occurred, and longer encodings to strings that have not yet been seen. This works even better since it takes into account more information about the context of every character encoded. On average, it will take fewer bits to write "boy" than "boe", because "boy" occurs more often, even though "e" is more common than "y".
Since it's all about predicting the characteristics of the files you actually use, it's a bit of a black art, and different kinds of compressors work better or worse on different kinds of data -- that's why there are so many different algorithms.

Finding the number of occurrences of each character in a String or character array

I am going over some interview preparation material and I was wondering what the best way to solve this problem would be if the characters in the String or character array can be Unicode characters. If they were strictly ASCII, you could make an int array of size 256, map each ASCII character to an index, and that position in the array would represent the number of occurrences. If the string has Unicode characters, is that still possible, i.e. is the Unicode character range a reasonable size that you could represent using the indexes of an integer array? Since Unicode characters can be more than 1 byte in size, what data type would you use to represent them? What would be the most optimal solution for this case?
Since Unicode only defines code points in the range [0, 2^21), you only need an array of 2^21 (i.e. about 2 million) elements, which should fit comfortably into memory.
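A rough sketch of that approach, assuming the input has already been decoded into an array of code points (uint32_t values):

    #include <stdint.h>
    #include <stdlib.h>

    #define MAX_CODEPOINT (1u << 21)   /* all Unicode code points lie below 2^21 */

    /* Returns a heap-allocated histogram: counts[cp] = occurrences of cp.
     * The table is 2^21 * 4 bytes, i.e. about 8 MB. Caller frees. */
    uint32_t *count_codepoints(const uint32_t *cp, size_t n)
    {
        uint32_t *counts = calloc(MAX_CODEPOINT, sizeof *counts);
        if (!counts)
            return NULL;
        for (size_t i = 0; i < n; i++)
            if (cp[i] < MAX_CODEPOINT)
                counts[cp[i]]++;
        return counts;
    }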
An array wouldn't be practical when using Unicode. This is because Unicode defines (fewer than) 2^21 characters.
Instead, consider using two parallel vectors, one for the character and one for the count. The setup would look something like this:
<'c', '$', 'F', '¿', '¤'> //unicode characters
< 1 , 3 , 1 , 9 , 4 > //number of times each character has appeared.
EDIT
After seeing Kerrek's answer, I must admit, an array of size 2 million would be reasonable. The amount of memory it would take up would be in the Megabyte range.
But as it's for an interview, I wouldn't recommend having an array 2 million elements long, especially if many of those slots will be unused (not all Unicode characters will appear, most likely). They're probably looking for something a little more elegant.
SECOND EDIT
As per the comments here, Kerrek's answer does indeed seem to be more efficient as well as easier to code.
While others here are focusing on data structures, you should also know that the notion of "Unicode character" is somewhat ill-defined. That's a potential interview trap. Consider: are å and å the same character? The first one is a "latin small letter a with ring above" (codepoint U+00E5). The second one is a "latin small letter a" (codepoint U+0061) followed by a "combining ring above" (U+030A). Depending on the purpose of the count, you might need to consider these as the same character.
You might want to look into Unicode normalization forms. It's great fun.
Convert string to UTF-32.
Sort the 32-bit characters.
Getting character counts is now trivial.
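A sketch of that approach, assuming the string is already available as an array of UTF-32 code points: sort with qsort, then count runs of equal values.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    static int cmp_u32(const void *a, const void *b)
    {
        uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
        return (x > y) - (x < y);
    }

    /* Sort the code points; each run of equal values yields one count. */
    static void print_counts(uint32_t *cp, size_t n)
    {
        qsort(cp, n, sizeof *cp, cmp_u32);
        for (size_t i = 0; i < n; ) {
            size_t j = i;
            while (j < n && cp[j] == cp[i])
                j++;
            printf("U+%04X: %zu\n", (unsigned)cp[i], j - i);
            i = j;
        }
    }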

LZW Compression with Entire unicode library

I am trying to do this problem:
Assume we have an initial alphabet of the entire Unicode character set,
instead of just all the possible byte values. Recall that unicode
characters are unsigned 2-byte values, so this means that each
2 bytes of uncompressed data will be treated as one symbol, and
we'll have an alphabet with over 60,000 symbols. (Treating symbols as
2-byte Unicodes, rather than a byte at a time, makes for better
compression in the case of internationalized text.) And, note, there's
nothing that limits the number of bits per code to at most 16. As you
generalize the LZW algorithm for this very large alphabet, don't worry
if you have some pretty long codes.
With this, give the compressed version of this four-symbol sequence,
using our project assumptions, including an EOD code, and grouping
into 4-byte ints. (These three symbols are Unicode values,
represented numerically.) Write your answer as 3 8-digit hex values,
space separated, using capital hex digits, not lowercase.
32767 32768 32767 32768
The problem I am having is that I don't know the entire range of the alphabet, so when doing LZW compression I don't know what values the new codes will have. Stemming from that problem, I also don't know what the EOD code will be.
Also, it seems to me that the compressed data will only take two integers.
The problem statement is ill-formed.
In Unicode, as we know it today, code points (those numbers that represent characters, composable parts of characters and other useful but more sneaky things) cannot all be numbered from 0 to 65535 so as to fit into 16 bits. There are more than 100 thousand Chinese, Japanese and Korean characters in Unicode. Clearly, you'd need 17+ bits just for those. So Unicode clearly cannot be the correct option here.
OTOH, there exists a sort of "abridged" version of Unicode, the Universal Character Set, whose UCS-2 encoding uses 16-bit code points and can technically represent at most 65536 characters and the like. Characters with codes greater than 65535 are, well, unlucky: you can't have them with UCS-2.
So, if it's really UCS-2, you can download its specification (ISO/IEC 10646, I believe) and figure out exactly which codes out of those 64K are used and thus should form your initial LZW alphabet.

How to simply generate a random base64 string compatible with all base64 encodings

In C, I was asked to write a function to generate a random Base64 string of length 40 characters (30 bytes?).
But I don't know the Base64 flavor, so it needs to be compatible with many versions of Base64.
What can I do? What is the best option?
All the Base64 encodings agree on some things, such as the use of [0-9A-Za-z], which are 62 characters. So you won't get a full 64^40 possible combinations, but you can get 62^40, which is still quite a lot! You could just generate a random number for each digit, mod 62. Or slice it up more carefully to reduce the amount of entropy needed from the system. For example, given a 32-bit random number, take 6 bits at a time (0..63); if those bits are 62 or 63, discard them, otherwise map them to one Base64 digit. This way you only need about eight 32-bit integers to make a 40-character string.
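A minimal sketch of that rejection idea, using rand() purely for illustration (as the next paragraph notes, anything security-sensitive should use a proper CSPRNG instead):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* The 62 characters every common Base64 variant agrees on. */
    static const char SAFE64[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";

    /* Fill out[0..len-1] with random characters from SAFE64 and terminate it.
     * Rejection sampling: take 6-bit values, throw away 62 and 63. */
    static void random_b64ish(char *out, size_t len)
    {
        size_t i = 0;
        while (i < len) {
            int v = rand() & 0x3F;     /* 6 bits: 0..63 */
            if (v < 62)                /* reject 62 and 63 to stay unbiased */
                out[i++] = SAFE64[v];
        }
        out[len] = '\0';
    }

    int main(void)
    {
        char buf[41];
        srand((unsigned)time(NULL));   /* NOT suitable for security purposes */
        random_b64ish(buf, 40);
        printf("%s\n", buf);
        return 0;
    }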
If this system has security considerations, you need to consider the consequences of generating "unusual" Base64 numbers (e.g. an attacker could detect that your Base64 numbers are special in having only 62 symbols with just a small corpus--does that matter?).
