What is the longest (printed length) version 4 GUID/UUID when displayed in a non-fixed-width font? (Obviously in a monospace font all GUIDs look the same length.)
For instance, I want to size a table cell that will hold a GUID, and I want it to be slightly larger than the largest possible value.
After a lot of testing of which character shows up the longest, I discovered that:
dddddddd-dddd-4ddd-bddd-dddddddddddd
Shows up the longest in both lowercase and uppercase.
The long answer is that it depends on the font, but in most "normal" fonts (such as Arial, Helvetica, Calibri, etc.), D tends to be the widest character (or a very close second) in both upper and lower case.
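If you want to verify which GUID is widest in the exact font you render with, you can measure advance widths directly. Here is a minimal sketch in Python using Pillow; the font path and size are placeholders for whatever your UI actually uses:

from PIL import ImageFont

# Placeholder path and size: point this at the font file your UI actually renders with.
font = ImageFont.truetype("arial.ttf", 16)

candidates = [
    "dddddddd-dddd-4ddd-bddd-dddddddddddd",
    "DDDDDDDD-DDDD-4DDD-BDDD-DDDDDDDDDDDD",
    "bbbbbbbb-bbbb-4bbb-bbbb-bbbbbbbbbbbb",
    "00000000-0000-4000-8000-000000000000",
]

for s in candidates:
    # getlength() returns the advance width of the rendered string in pixels
    print(f"{font.getlength(s):8.1f}  {s}")

Whichever candidate prints the largest width is the one to size the table cell against (plus a little padding).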
I'm building a reading app and need to create a model for books. I'm considering making the entire book's contents an array of strings, with each page represented by a single string. For a book with hundreds of pages, though, this doesn't seem like the best way to structure the data. Is this an appropriate approach, or is there a better way?
One option is to map every word appearing in the book to a short bit string. To do this efficiently, you could count the frequency of each word and encode the words using Huffman codes. This way, common words get shorter codes.
For instance, the word "the" could be encoded as 001 while the word "magnificent" could be encoded as 011001101110.
Hence, each page would now be a sequence of these bit codes.
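A rough sketch of this scheme in Python, with a toy two-page "book" standing in for real data; the code table is built once over the whole book and then each page is encoded word by word:

import heapq
from collections import Counter
from itertools import count

def build_huffman_codes(word_counts):
    """Return a dict word -> bit string, with shorter codes for more frequent words."""
    # Heap entries are (frequency, tiebreaker, tree), where tree is a word or a (left, right) pair.
    tiebreak = count()
    heap = [(freq, next(tiebreak), word) for word, freq in word_counts.items()]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate case: only one distinct word in the book
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (t1, t2)))
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix
    walk(heap[0][2], "")
    return codes

# Toy example: count words over the whole book, then encode each page.
book_pages = ["the cat sat on the mat", "the dog saw the magnificent cat"]
counts = Counter(word for page in book_pages for word in page.split())
codes = build_huffman_codes(counts)

encoded_pages = ["".join(codes[w] for w in page.split()) for page in book_pages]
print(codes)
print(encoded_pages)  # each page is now one bit string instead of raw text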
Improvement study
For simplicity, let's assume we encode our words with a balanced binary tree, i.e. fixed-length codes, which is an upper bound on the Huffman codes in terms of average length. According to Wikipedia, the number of words in the English dictionary is about 350K. Let us consider the worst case for this new approach, in which our text uses all 350K words. Then each word would be encoded with log2(350K) ≈ 19 bits.
Now imagine you have a book of 100 pages with 1500 characters per page. Each char is 1 byte, so you would end up with 100 * 1500 * 1 = 150 KB to encode your book.
With the binary-tree compression, we first need to divide the text into words: according to this post, the average word length in English is 4.7 characters. Dividing the total number of characters in the book (150K) by 4.7 gives roughly 32K words. The encoded size of the book would then be 19 bits * 32K = 608 Kbits = 76 KB, roughly half the original size. This uses a balanced binary tree, which is a very rough upper bound on Huffman codes, but it shows that even a worst-case scenario with this upper bound is already an improvement. The gain from the actual Huffman codes would need a more careful study, but it would be much larger (I would guess a size reduction factor of 10 or 20).
Of course, you would also need to account for the size of the word-to-code map, but for long books it shouldn't have much impact.
My SQL Server database was created & designed by a freelance developer.
I see the database getting quite big and I want to ensure that the column datatypes are the most efficient at keeping the database as small as possible.
Most columns were created as
VARCHAR (255), NULL
This covers columns whose contents are:
Numerics with a maximum length of 2 digits
Numerics that will never be more than 3 digits, or blank
Alpha values that contain just 1 letter, or are blank
Then there are a number of columns which are alphanumeric with a maximum of 10 characters, and others which are alphanumeric with a maximum of 25 characters.
There is one big alphanumeric column which can be up to 300 characters.
There has also been an amendment for a column which shows the time taken, in seconds, to race an event: under 1000 seconds and up to 2 decimal places.
This is set as DECIMAL (18,2) NULL
The question is: can I reduce the size of the database by changing the column data types, or was the original design optimal for the purpose?
You should definitely strive to use the most appropriate data types for all columns, and in this regard the freelance developer did a very poor job: both from a consistency and usability point of view (just try to sum up the numbers in a VARCHAR(255) column, or sort by their numeric value - horribly bad design), and also from a performance point of view.
Numerics with a maximum length of 2 digits
Numerics that will never be more than 3 digits, or blank
-> If you don't need any fractional decimal places (only whole numbers), use INT
Alpha values that contain just 1 letter, or are blank
-> In this case, I'd use a CHAR(1) (or NCHAR(1) if you need to be able to handle Unicode characters, like Hebrew, Arabic, Cyrillic or East Asian languages). Since it's really only ever 1 character (or nothing), there's no need or point in using a variable-length string datatype, since that only adds at least 2 bytes of overhead per string stored.
There is one big alphanumeric column which can be up to 300 characters.
-> That's a great candidate for a VARCHAR(300) column (or again: NVARCHAR(300) if you need to support Unicode). Here I'd definitely use a variable-length string type to avoid padding the column with spaces up to the defined length when you store fewer characters.
Forgive me for the lack of official phrasing; this is a problem given orally in class, as opposed to being written in a problem set. Using the English alphabet with no spaces, commas, periods, etc (and thus only working with twenty-six letters possible), how many possible orderings are there of a string of fifty characters that contain the combination "Johndoe" at some location in the set?
Edit: I was a little quick to answer and overlooked something very obvious. Please see the new answer below.
This is more suited for something like the Math or Stats Stack Exchange. Having said that, this gives 26^(50-7) * (50-7+1) combinations. To see why, ask yourself: how many 50-letter strings over the 26 letters exist? Now, we reduce this set by adding the restriction that a 7-letter contiguous word must exist within any candidate string. This has the effect of "fixing" 7 letters and making them unable to vary. However, we can place this 7-letter string anywhere, and there are 44 starting positions for it ("johndoe" starting at position 0, at position 1, all the way up to position 43, since "johndoe" will not fit starting at position 44).
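Written out, the count described above is (50 - 7 + 1) * 26^(50-7) = 44 * 26^43, which is roughly 3 * 10^62. Keep the edit note in mind: a string in which "Johndoe" appears at more than one position is counted once per position here, so this is an overcount of the true answer.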
I want a quick and dirty way of determining what language the user is writing in. I know that there is a Google API which will detect the difference between French and Spanish (even though they both use mostly the same alphabet), but I don't want the latency. Essentially, I know that the Latin alphabet has a lot of confusion as to what language it is using. Other alphabets, however, don't. For example, if there is a character using hiragana (part of the Japanese writing system) there is no confusion as to the language. Therefore, I don't need to ask Google.
Therefore, I would like to be able to do something simple like say that שלום uses the Hebrew alphabet and こんにちは uses Japanese characters. How do I get that alphabet string?
"Bonjour", "Hello", etc. should return "Latin" or "English" (Then I'll ask Google for the real language). "こんにちは" should return "Hiragana" or "Japanese". "שלום" should return "Hebrew".
I'd suggest looking at the Unicode "Script" property. The latest database can be found here.
For a quick and dirty implementation, I'd try scanning all of the characters in the target text and looking up the script name for each one. Pick whichever script has the most characters.
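A rough sketch of that scan in Python. The standard library's unicodedata module doesn't expose the Unicode Script property directly, so this uses the first word of each character's Unicode name (LATIN, HEBREW, HIRAGANA, ...) as a stand-in; for the real Script property you'd parse Scripts.txt from the database linked above or use a library that ships it:

import unicodedata
from collections import Counter

def guess_script(text):
    """Very rough guess: tally the first word of each character's Unicode name."""
    tally = Counter()
    for ch in text:
        if ch.isalpha():                      # skip spaces, digits, punctuation
            name = unicodedata.name(ch, "")   # e.g. "HIRAGANA LETTER KO"
            if name:
                tally[name.split()[0]] += 1   # "HIRAGANA", "HEBREW", "LATIN", ...
    return tally.most_common(1)[0][0] if tally else None

print(guess_script("Bonjour"))       # LATIN  -> hand this one off to Google for the language
print(guess_script("こんにちは"))     # HIRAGANA
print(guess_script("שלום"))          # HEBREW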
Use an N-gram model trained on a sufficiently large set of training data. A full example describing this technique can be found at this page, among others:
http://phpir.com/language-detection-with-n-grams/
Although the article assumes you are implementing in PHP, and by "language" it means something like English, Italian, etc., the technique can be implemented in C if you need to, and instead of training on "language" in the sense of English, etc., just train on your notion of "alphabet". For example, look at all of your "Latin alphabet" strings together and consider their n-grams for n=2:
Bonjour: "Bo", "on", "nj", "jo", "ou", "ur"
Hello: "He", "el", "ll", "lo"
With enough training data, you will discover dominant combinations that are likely for all Latin text, for example, perhaps "Bo" and "el" are quite probable for text written in the "Latin alphabet". Likewise, these combinations are probably quite rare in text that is written in the "Hiragana alphabet". Similar discoveries will be made with any other alphabet classification for which you can provide sufficient training data.
This technique is closely related to Markov chains (a character n-gram model is essentially a Markov chain); searching for these keywords will give more ideas for implementation. For "quick and dirty" I would use n=2 and gather just enough training data that the least common letter from each alphabet is encountered at least once, e.g. at least one 'z' and at least one 'ぅ' (small hiragana u).
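A small sketch of that idea in Python, with tiny hard-coded training strings standing in for real training data (in practice each alphabet needs far more text than this):

from collections import Counter

def bigrams(text):
    text = text.lower()
    return [text[i:i + 2] for i in range(len(text) - 1)]

# Toy training data per "alphabet"; real profiles need much more text.
training = {
    "Latin": "bonjour hello goodbye merci thank you salut",
    "Hiragana": "こんにちは さようなら ありがとう おはよう",
    "Hebrew": "שלום להתראות תודה בוקר טוב",
}
profiles = {name: Counter(bigrams(text)) for name, text in training.items()}

def classify(text):
    """Score each alphabet by how often its profile has seen the input's bigrams."""
    grams = bigrams(text)
    scores = {name: sum(profile[g] for g in grams) for name, profile in profiles.items()}
    return max(scores, key=scores.get)

print(classify("Hello"))         # expected: Latin
print(classify("こんにちは"))     # expected: Hiragana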
EDIT:
For a simpler solution than N-grams, use only basic statistical tests -- min, max and average -- to compare your Input (a string given by the user) with an Alphabet (a string of all the characters in one of the alphabets you are interested in).
Step 1. Place all the numerical values of the Alphabet (e.g. utf8 codes) in an array. For example, if the Alphabet to be tested against is "Basic Latin", make an array DEF := {32, 33, 34, ..., 122}.
Step 2. Place all the numerical values of the Input into an array, for example, make an array INP := {73, 102, 32, ...}.
Step 3. Calculate a score for the input based on INP and DEF. If INP really comes from the same alphabet as DEF, then I would expect the following statements to be true:
min(INP) >= min(DEF)
max(INP) <= max(DEF)
avg(INP) - avg(DEF) < EPS, where EPS is a suitable constant
If all statements are true, the score should be close to 1.0. If all are false, the score should be close to 0.0. After this "Score" routine is defined, all that's left is to repeat it on each alphabet you are interested in and choose the one which gives the highest score for a given Input.
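A sketch of that scoring routine in Python; the alphabet ranges and the EPS threshold below are chosen arbitrarily for illustration:

def score(inp, alphabet_def, eps=40):
    """Score how well the input's code points fit inside one alphabet's range."""
    INP = [ord(c) for c in inp]
    DEF = sorted(alphabet_def)
    checks = [
        min(INP) >= min(DEF),
        max(INP) <= max(DEF),
        abs(sum(INP) / len(INP) - sum(DEF) / len(DEF)) < eps,
    ]
    return sum(checks) / len(checks)   # 1.0 if all checks hold, 0.0 if none do

# Example alphabet definitions as code-point ranges (rough, illustration only).
alphabets = {
    "Basic Latin": list(range(32, 123)),
    "Hebrew": list(range(0x0590, 0x0600)),
    "Hiragana": list(range(0x3040, 0x30A0)),
}

def best_alphabet(inp):
    return max(alphabets, key=lambda name: score(inp, alphabets[name]))

print(best_alphabet("Hello"))        # Basic Latin
print(best_alphabet("שלום"))         # Hebrew
print(best_alphabet("こんにちは"))    # Hiragana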
I have a requirement to adapt an existing, non-Unicode C project to display Chinese characters. As there is a short deadline, and I'm new(ish) to C and encoding, I've gone down the route of changing the system locale to Simplified Chinese PRC in order to support the display of Chinese text in the GUI application. This has in turn changed the encoding (in Visual Studio 2010) in the project to Chinese Simplified (GB2312).
Everything works except that special characters, e.g. the degree sign, superscript 2, etc., are displayed as question marks. I believe this is because we used to pass \260, i.e. the octal value of the degree symbol in the ASCII table, and this no longer equates to anything in the GB2312 table.
The workflow for displaying a degree symbol in a keypad was as follows:
display_function( data, '\260' ); //Pass the octal value of the degree symbol to the keypad
This display_function is used to translate the integer inputs into strings for display on the keypad:
data->[ pos ] = (char) ch;
Essentially I need to get this (and other special chars) to display correctly. Is there a way to pass this symbol using the current setup?
According to the char list for GB2312 the symbol is supported, so my current thinking is to work out the octal value of the symbol and keep the existing functions as they are. These currently pass the values around as chars. Using the table below:
http://ash.jp/code/cn/gb2312tbl.htm
and the following formula to obtain the octal value:
octal number associated with the row, multiplied by 10 and added to the octal number associated with the column.
I believe this would be A1E0 x 10 + 3 = 414403.
However, when I try and pass this to display_function I get "error C2022: '268' : too big for character".
Am I going about this wrong? I'd prefer not to change the existing functions as they are in widespread use, but do I need to change the function to use a wide char?
Apologies if the above is convoluted and filled with incorrect assumptions! I've been trying to get my head round this for a week or two and encodings, char sets and locales just seem to get more and more confusing!
thanks in advance
If the current functions support only 8-bit characters, and you need to use them to display 16-bit characters, then your guess is probably correct: you may have to change the functions to use something like "wchar" instead of "char".
Maybe also duplicate them under another name to preserve compatibility for other users, in case these functions are used in other projects.
But if it's only one project, then you may want to consider replacing "char" with "wchar" in almost all places in the project.
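As a quick illustration of why the old single-char approach breaks: in the original single-byte code page the degree sign was one byte (octal \260), but in GB2312 it is a two-byte sequence, so it cannot travel through a plain char parameter. A small Python check of this (the exact bytes come from the codec; per the table linked in the question they should be around 0xA1 0xE3):

# The degree sign fits in one byte in the old single-byte code page (Latin-1 used here
# as an example), but needs two bytes in GB2312 -- too big for a single C char.
degree = "\u00b0"
print(degree.encode("latin-1"))  # b'\xb0' -> one byte (octal \260)
print(degree.encode("gb2312"))   # two bytes, e.g. b'\xa1\xe3'

So either the display functions move to wide (or multi-byte) strings as suggested above, or the two GB2312 bytes would have to be passed and handled as a pair rather than as one char.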