compiler design - lexical analysis: how many columns does a \t take? - lexical-analysis

My lexical analyzer must keep position information for each token, for example its row and column. If I come across source code like this: \t \t int myInt; , how can I know the column of the token? I don't know how many columns a \t takes.
Thanks

Experimental evidence suggests that gcc, at least, simply counts the tab character as a single character.
If you wanted to find the column number where the text is conventionally displayed, you would have to choose a tab width (probably 8 or 4) and advance to the next tab stop at each tab. But the problem is of course that different users choose different tab widths, and having the compiler arbitrarily choose a tab width of its own would likely just add to the confusion.
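A minimal sketch of both conventions, written in Python for brevity. The function name display_column and its parameters are made up for illustration, columns are 0-based, and the tab width is something you would have to pick yourself:

```python
def display_column(line: str, index: int, tab_width: int = 8) -> int:
    """Return the 0-based display column of line[index], expanding tabs."""
    col = 0
    for ch in line[:index]:
        if ch == "\t":
            # A tab advances to the next tab stop (a multiple of tab_width).
            col += tab_width - (col % tab_width)
        else:
            col += 1
    return col


line = "\t \t int myInt;"
raw = line.index("int")  # the "count a tab as one character" convention
print(raw, display_column(line, raw, 8), display_column(line, raw, 4))
# prints: 4 17 9
```

The raw index is the only value that does not depend on anyone's editor settings, which is probably why counting a tab as one character is the less surprising choice for a compiler.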

Related

I'm trying to understand the K&R's exercise 1-21 [duplicate]

This question already has answers here:
K&R answer book exercise 1.21
This is the question I am trying to understand:
Write a program 'entab' that replaces strings of blanks by the minimum
number of tabs and blanks to achieve the same spacing. When either a
tab or a single blank would suffice to reach a tab stop, which should
be given preference?
Decoding the question:
a. It's a program that injects 'tabs' in the input.
b. If a string consists of continuous blank spaces, then these blank
spaces have to be replaced with the minimum number of tabs and spaces.
How should the program behave for the following inputs:
hey*****
******hi
hey**************hi!
hi####hey!
hi***how****are*you?
hi#**hey!
What should be the criteria to decide on the minimum number of tabs, and combination of min tabs and spaces?
'#' for tab, '*' for blank space, TABSTOP = 8.
What does the statement mean: "when either a tab or a single blank would suffice to reach a tab stop."
Note: I have gone through answers of this duplicate question, but I
don't find them helpful.
The linked possible-duplicate question focuses on doing the arithmetic correctly. I'm guessing your problem is more primitive than that: you don't really know what a tab stop is.
When K&R wrote their exercises, they could expect the reader to have experience with typewriters, which is where tab stops come from. The tab key caused the carriage to slide over to the position of the next stop, which was a real physical thing that literally stopped the carriage from moving further until another key was pressed. (Later typewriters, becoming more computer-like, had programmed stop locations instead of physical stops.)
The purpose of tabs was tabulation (making tables). The tab stops were set at the horizontal locations separating the columns of the table, and then the table was entered by pressing the tab key after each value. For example, if I want to type this table:
Character        ASCII
Tab              9
Linefeed         10
Carriage Return  13
Space            32
Without using tabs, I have to type space many times after the word "Tab", not as many times after "Linefeed", and only a few times after "Carriage Return". But if I set a tab stop at the position where the second column starts, it becomes easier. I press the tab key once after the word "Tab", and the carriage advances to the correct place for the 9, press the tab key once after the word "Linefeed", and it advances to the correct place for the 10, etc. If I needed a third column, I would set another tab stop a little bit to the right of the second column.
You can experience this in a text editor - not a fancy IDE that assigns all kinds of unrelated functions to the tab key, but a plain one like vi. Or even a terminal emulator, running a program that does nothing (cat > /dev/null). Type several lines of words of various lengths, with a tab after each one, and observe how they line up, and what happens when one of your words is long enough to occupy 2 table columns. Keep playing with it until you have an intuitive understanding of what the tab character does.
Modern text editors and terminal emulators usually have tab stops set every 8 characters. That's what "tabstop = 8" means. The stops are at columns 8,16,24,32,..., or if you believe the leftmost column is 1 instead of 0, columns 9,17,25,33,... Tab stops are actually programmable on vt100-ish terminals, but that's a rarely used feature.
Back in your text editor, at the start of a new line, type 1 2 Tab 3 4. You get 12 and 34 separated by some whitespace made of a tab character. Then start another new line and type the same thing with a space before the tab: 1 2 space Tab 3 4. The second line looks exactly like the first, but this time the whitespace between 12 and 34 is made of a space and a tab. Make a third line that looks the same without using any tabs, by typing space until it lines up.
Those 3 lines are examples of possible input to the exercise 1-21 program. The first one, with a single tab and no spaces, uses the minimum number of characters so that's what you want to output for all 3 inputs.
If you need help figuring out a general formula for how many tabs and spaces to output, see the other question. Here's a rough description that leaves some details for you to think about: As you read the input, keep track of what column you're in. When you get to a space or tab, read to the end of the sequence of spaces and tabs, remembering what column you were in when the sequence started. At the end of the sequence, you know what column the cursor is in, and what column you want to move it to, and you must calculate the optimal sequence of spaces and tabs to make that movement.
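If you want to check your own solution against something, here is one possible sketch of that description, in Python rather than C so it does not give away the exercise verbatim. The name entab_line is made up, it handles a single line without its newline, and it arbitrarily prefers a tab when a tab and a single blank would both reach the stop:

```python
TABSTOP = 8


def entab_line(line: str, tabstop: int = TABSTOP) -> str:
    out = []
    col = 0  # column where the next output character will land
    i = 0
    while i < len(line):
        if line[i] not in " \t":
            out.append(line[i])
            col += 1
            i += 1
            continue
        # Read the whole run of blanks and tabs to find the target column.
        target = col
        while i < len(line) and line[i] in " \t":
            if line[i] == "\t":
                target += tabstop - (target % tabstop)
            else:
                target += 1
            i += 1
        # Emit tabs while the next tab stop is not past the target...
        while col + (tabstop - col % tabstop) <= target:
            out.append("\t")
            col += tabstop - col % tabstop
        # ...then cover whatever distance is left with blanks.
        out.append(" " * (target - col))
        col = target
    return "".join(out)


# The three lines from the editor experiment all collapse to the same output.
for sample in ["12\t34", "12 \t34", "12      34"]:
    print(repr(entab_line(sample)))
```

Changing the `<=` in the tab loop is one way to experiment with the "which should be given preference" part of the question.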

Displaying special characters with Chinese Locale in c

I have a requirement to adapt an existing, non-Unicode C project to display Chinese characters.  As there is a short deadline, and I'm new(ish) to C and encoding, I've gone down the route of changing the system locale to Simplified Chinese PRC in order to support display of Chinese text in the GUI application.  This has in turn changed the encoding (in Visual Studio 2010) in the project to Chinese Simplified (GB2312).
Everything works except that special characters, eg: degree sign, superscript 2, etc are displayed as question marks.  I believe this is because we used to pass \260 i.e. the octal value of the degree symbol in the ascii table, and this no longer equates to anything in the gb2312 table. 
The workflow for displaying a degree symbol in a keypad was as follows: 
display_function( data, '\260' ); //Pass the octal value of the degree symbol to the keypad 
This display_function is used to translate the integer inputs into strings for display on the keypad: 
data->[ pos ] = (char) ch; 
Essentially I need to get this (and other special chars) to display correctly.  Is there a way to pass this symbol using the current setup? 
According to the char list for GB2312 the symbol is supported, so my current thinking is to work out the octal value of the symbol and keep the existing functions as they are.  These currently pass the values around as chars.  Using the table below: 
http://ash.jp/code/cn/gb2312tbl.htm. 
and the following formula to obtain the octal value: 
octal number associated with the row, multiplied by 10 and added to the octal number associated with the column. 
I believe this would be A1E0 x 10 + 3 = 414403. 
However, when I try and pass this to display_function I get "error C2022: '268' : too big for character".
Am I going about this wrong?  I'd prefer not to change the existing functions as they are in widespread use, but do I need to change the function to use a wide char? 
Apologies if the above is convoluted and filled with incorrect assumptions!  I've been trying to get my head round this for a week or two and encodings, char sets and locales just seem to get more and more confusing!
thanks in advance
If the current functions support only 8-bit characters, and you need to use them to display 16-bit characters, then your guess is probably correct: you may have to change the functions to use something like wchar_t instead of char.
You might also duplicate them under another name to keep compatibility for other users, in case these functions are used in other projects.
But if it's only one project, then you may want to check whether you can replace char with wchar_t in almost all places in the project.

WPF RichTextBox TextChanged event - how to find deleted or inserted text?

While creating a customized editor with RichTextBox, I've faced the problem of finding the deleted/inserted text from the information provided with the TextChanged event.
The instance of TextChangedEventArgs has some useful data, but I guess it does not cover all the needs. Suppose a scenario in which multiple paragraphs are inserted, and at the same time, the selected text (which itself spanned multiple paragraphs) has been deleted.
With the instance of TextChangedEventArgs, you have a collection of text changes, and each change only provides you with the number of removed or added symbols and the position of it.
The only solution I have in mind is, to keep a copy of document, and apply the given list of changes on it. But as the instances of TextChange only give us the number of inserted/removed symbols (and not the symbols), so we need to put some special symbol (for example, '?') to denote unknown symbols while we transform our original copy of document.
After applying all changes to the original copy of document, we can then compare it with the richtextbox's updated document and find the mappings between unknown symbols and the real ones. And finally, get what we want !!!
Has anybody tried this before? I need your suggestions on the whole strategy, and what you think about this approach.
Regards
It primarily depends on your use of the text changes. When the sequence includes both inserts and deletes it is theoretically impossible to know the details of each insert, since some of the symbols inserted may have subsequently been deleted. Therefore you have to choose what results you really want:
For some purposes you must know the exact sequence of changes even if some of the inserted symbols must be left as "?".
For other purposes you must know exactly how the new text differs from the old but not the exact sequence in which the changes were made.
I will describe techniques to achieve each of these results. I have used both techniques in the past, so I know they are effective.
To get the exact sequence
This is more appropriate if you are implementing a history or undo log or searching for specific actions.
For these uses, the process you describe is probably best, with one possible change: Instead of "finding the mappings between the unknown symbols and the real ones", simply run the scan forward to find the text of each "Delete" then run it backward to find the text of each "Insert".
In other words:
Start with the initial text and process the changes in order. For each insert, insert '?' symbols. For each delete, remove the specified number of symbols and record them as the text deleted.
Start with the final text and process the changes in reverse order. For each delete, insert '?' symbols. For each insert, remove the specified number of symbols and record them as the text inserted.
When this is complete, all of your "Insert" and "Delete" change entries will have the associated text to the best of our knowledge, and any text that was inserted and immediately deleted will be '?' symbols.
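The two passes can be sketched on plain strings. This is only an illustration in Python, not WPF code: the Change tuple is a made-up stand-in for TextChange, and it assumes each offset refers to the text as it was just before that change was applied:

```python
from typing import NamedTuple


class Change(NamedTuple):
    offset: int   # position of the change
    removed: int  # number of symbols removed at offset
    added: int    # number of symbols inserted at offset


def recover_texts(old: str, new: str, changes: list[Change]):
    # Forward pass: replay the changes on the old text to learn what each
    # delete removed. Inserted symbols are unknown here, so use '?'.
    deleted = []
    text = list(old)
    for c in changes:
        deleted.append("".join(text[c.offset:c.offset + c.removed]))
        text[c.offset:c.offset + c.removed] = ["?"] * c.added

    # Backward pass: undo the changes on the new text to learn what each
    # insert added. Deleted symbols are unknown here, so use '?'.
    inserted = [""] * len(changes)
    text = list(new)
    for i in range(len(changes) - 1, -1, -1):
        c = changes[i]
        inserted[i] = "".join(text[c.offset:c.offset + c.added])
        text[c.offset:c.offset + c.added] = ["?"] * c.removed
    return deleted, inserted


# "abc" -> insert "XY" at 1 -> "aXYbc" -> delete "Yb" at 2 -> "aXc"
changes = [Change(1, 0, 2), Change(2, 2, 0)]
print(recover_texts("abc", "aXc", changes))
# prints: (['', '?b'], ['X?', ''])
```

The 'Y' was inserted and then deleted within the same batch, so it shows up as '?' on both sides, which is the limitation described above.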
To get the difference
This is more appropriate for revision marking or version comparison.
For these uses, simply use the text change information to compute a set of integer ranges in which changes might be found, then use a standard diff algorithm to find the actual changes. This tends to be very efficient in processing incremental changes but still gives you the best updates.
This is particularly nice when you paste in a replacement paragraph that is almost identical to the original: Using the text change information will indicate the whole paragraph is new, but using diff (ie. this technique) will mark only those symbol runs that are actually different.
The code for computing the change range is simple: Represent the change as four integers (oldstart, oldend, newstart, newend). Run through each change:
If changestart is before newstart, reduce newstart to changestart and reduce oldstart an equal amount
If changeend is after newend, increase newend to changeend and increase oldend an equal amount
Once this is done, extract range [oldstart, oldend] from the old document and the range [newstart, newend] from the new document, then use the standard diff algorithm to compare them.
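Here is a rough sketch of the range computation followed by a diff, in Python with the standard difflib module standing in for "the standard diff algorithm". The function names are made up, changes are assumed to be (offset, removed, added) tuples whose offsets refer to the text just before each change, and the ranges are half-open rather than inclusive:

```python
import difflib


def changed_ranges(changes):
    """Return (oldstart, oldend, newstart, newend) covering every change."""
    old_start = old_end = new_start = new_end = None
    for offset, removed, added in changes:
        start, end = offset, offset + removed
        if new_start is None:
            old_start = new_start = start
            old_end = new_end = end
        else:
            if start < new_start:
                # Text before the range is untouched, so both starts move.
                old_start -= new_start - start
                new_start = start
            if end > new_end:
                # Text after the range is untouched, so both ends move.
                old_end += end - new_end
                new_end = end
        # The change itself grows or shrinks the new-side range.
        new_end += added - removed
    return old_start, old_end, new_start, new_end


def diff_changed_region(old, new, changes):
    o1, o2, n1, n2 = changed_ranges(changes)
    if o1 is None:
        return []
    matcher = difflib.SequenceMatcher(None, old[o1:o2], new[n1:n2])
    return [
        (tag, old[o1 + i1:o1 + i2], new[n1 + j1:n1 + j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# "abcdefgh" -> replace "c" with "XYZ" -> remove "fg" -> "abXYZdeh"
print(diff_changed_region("abcdefgh", "abXYZdeh", [(2, 1, 3), (7, 2, 0)]))
```

Only the slice between the computed bounds is handed to the diff, which is where the efficiency for small incremental edits comes from.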

Phone Number Columns in a Database

In the last 3 companies I've worked at, the phone number columns are of type varchar(n). The reason being that they might want to store extensions (ext. 333). But in every case, the "-" characters are stripped out when inserting and updating. I don't understand why the ".ext" characters are okay to store but not the "-" character. Has any one else seen this and what explanation can you think of for doing it this way? If all you want to store is the numbers, then aren't you better off using an int field? Conversely, if you want to store the number as a string/varchar, then why not keep all the characters and not bother with formatting on display and cleaning on write?
I'm also interested in hearing about other ways in which phone number storage is implemented in other places.
Quick test: are you going to add/subtract/multiply/divide Phone Numbers? Nope. Similarly to SSNs, Phone Numbers are discrete pieces of data that can contain actual numbers, so a string type is probably most appropriate.
one point with storing phone numbers is a leading 0.
eg: 01202 8765432
in an int column, the 0 will be stripped off, which makes the phone number invalid.
I would hazard a guess that the - is swapped for spaces because the dashes don't actually mean anything
eg: 123-456-789 = 123 456 789 = 123456789
Personally, I wouldn't strip out any characters, as depending on where the phone number is from, it could mean different things. Leave the phone number in the exact format it was entered, as obviously that's the way the person who typed it in is used to seeing it.
It doesn't really matter how you store it, as long as it's consistent. The norm is to strip out formatting characters, but you can also store country code, area code, exchange, and extension separately if you have a need to query on those values. Again, the requirement is that it's consistent - otherwise querying it is a PITA.
Another reason I can think of not to store phone numbers as 'numbers' but as strings of characters, is that often enough part of the software stack you'd use to access the database (PHP, I am looking at you) wouldn't support big enough integers (natively) to be able to store some of the longer and/or exotic phone numbers.
Largest number that 32-bits can carry, without sign, is 4294967295. That wouldn't work for just any Russian mobile phone number, take, for instance, the number 4959261234.
So you have yourself an extra inconvenience of finding a way to carry more than 32-bits worth of number data. Even though databases have long supported very large integers, you only need one bad link in the chain for a showstopper. Like PHP, again.
Stripping some characters and allowing others may have an impact if the database table is going to drive another system, e.g. IP Telephony of some sort. Depending on the systems involved, it may be legitimate to have ext.333 as a suffix, whereas the developers may not have accounted for "-" in the string (and yes, I am guessing here...)
As for storing as a varchar rather than an int, this is just plain-ole common sense to me. As mentioned before, leading zeros may be stripped in an int field, the query on an int field may perform implicit math functions (which could also explain stripping "-" from the text, you don't want to enter 555-1234 and have it stored as -679 do you?)
In short, I don't know the exact reasoning, but can deduce some possibilities.
I'd opt to store the digits as a string and add the various "()" and "-" in my display code. It does get more difficult with international numbers. We handle it by having various "internationalized" display formats depending on country.
What I like to do if I know the phone numbers are only going to be within a specific region, such as North America, is to change the entry into 4 fields: 3 digits for area code, 3 for prefix, 4 for line, and maybe 5 for extension. I then insert these as 1 field with '-' and maybe an 'e' to designate the extension. Any searching of course also needs to follow the same process. This ensures I get more regular data and even allows for the number to be used for actually making a phone call, once the - and the extension are removed. I can also get back to the original 4 fields easily.
Good stuff! It seems that the main point is that the formatting of the phone number is not actually part of the data but is instead an aspect of the source country. Still, by keeping the extension part of the number as is, one might be breaking the model of separating the formatting from the data. I doubt that all countries use the same syntax/format to describe an extension. Additionally, if integrating with a phone system is a (possible) requirement, then it might be better to store the extension separately and build the message as it is expected. But Mark also makes a good point that if you are consistent, then it probably won't matter how you store it since you can query and process it consistently as well.
Thank you Eric for the link to the other question.
When an automated telephone system uses a field to make a phone call, it may not be able to tell which characters it should dial and which it should ignore. A human being may see a "(" or ")" or "-" character and know these are delimiters separating the area code (NPA), the exchange (NXX), and the line number. Remember, though, that each character represents a binary pattern that an automated dialer would enter unless it is pre-programmed to ignore it. To account for this, it is better to store only the equivalent of the characters a user would press on the phone handset, and better still to store the individual values in separate columns so the dialer can use individual fields without having to parse the string.
Even if you are not using dialing automation, it is good practice to store data in a form you won't need to update in the future. It is much easier to add characters between fields than to strip them out of strings.
On the question of string vs. integer datatypes, as noted above, strings are the proper way to store phone numbers because of variations between countries. There is an important caveat, though: when aggregating statistics for reporting (e.g. counting numbers or calls), character strings are MUCH slower to count than integers. To account for this, it's important to add an integer identity column that you can use for counting instead of the varchar or char field.

Is there a standard for storing normalized phone numbers in a database?

What is a good data structure for storing phone numbers in database fields? I'm looking for something that is flexible enough to handle international numbers, and also something that allows the various parts of the number to be queried efficiently.
Edit: Just to clarify the use case here: I currently store numbers in a single varchar field, and I leave them just as the customer entered them. Then, when the number is needed by code, I normalize it. The problem is that if I want to query a few million rows to find matching phone numbers, it involves a function, like
where dbo.f_normalizenum(num1) = dbo.f_normalizenum(num2)
which is terribly inefficient. Also queries that are looking for things like the area code become extremely tricky when it's just a single varchar field.
[Edit]
People have made lots of good suggestions here, thanks! As an update, here is what I'm doing now: I still store numbers exactly as they were entered, in a varchar field, but instead of normalizing things at query time, I have a trigger that does all that work as records are inserted or updated. So I have ints or bigints for any parts that I need to query, and those fields are indexed to make queries run faster.
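For anyone who wants to see the shape of that approach, here is a small sketch using Python's built-in sqlite3 module. It is not the asker's actual schema or trigger: the table, the column names, and the digits-only normalize() function are all assumptions, and the normalization happens in application code at insert time instead of in a database trigger:

```python
import re
import sqlite3


def normalize(raw: str) -> str:
    """Hypothetical normalizer: keep only the digits."""
    return re.sub(r"\D", "", raw)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE phones (raw TEXT NOT NULL, digits TEXT NOT NULL)")
# The index is on the normalized column, so lookups never call a function per row.
conn.execute("CREATE INDEX idx_phones_digits ON phones (digits)")


def insert_phone(raw: str) -> None:
    # Normalize once, at write time, and keep the original text as entered.
    conn.execute(
        "INSERT INTO phones (raw, digits) VALUES (?, ?)", (raw, normalize(raw))
    )


def find_matches(raw: str) -> list[str]:
    rows = conn.execute(
        "SELECT raw FROM phones WHERE digits = ? ORDER BY rowid", (normalize(raw),)
    )
    return [row[0] for row in rows]


for entered in ["(212) 555-1234", "212.555.1234", "212 555 9999"]:
    insert_phone(entered)

print(find_matches("212-555-1234"))
# prints: ['(212) 555-1234', '212.555.1234']
```

The point is only that the expensive normalization runs once per write, so the comparison in the WHERE clause can use a plain indexed column.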
First, beyond the country code, there is no real standard. About the best you can do is recognize, by the country code, which nation a particular phone number belongs to and deal with the rest of the number according to that nation's format.
Generally, however, phone equipment and such is standardized so you can almost always break a given phone number into the following components
C Country code 1-10 digits (right now 4 or less, but that may change)
A Area code (Province/state/region) code 0-10 digits (may actually want a region field and an area field separately, rather than one area code)
E Exchange (prefix, or switch) code 0-10 digits
L Line number 1-10 digits
With this method you can potentially separate numbers such that you can find, for instance, people that might be close to each other because they have the same country, area, and exchange codes. With cell phones that is no longer something you can count on though.
Further, inside each country there are differing standards. You can always depend on a (AAA) EEE-LLLL in the US, but in another country you may have exchanges in the cities (AAA) EE-LLL, and simply line numbers in the rural areas (AAA) LLLL. You will have to start at the top in a tree of some form, and format them as you have information. For example, country code 0 has a known format for the rest of the number, but for country code 5432 you might need to examine the area code before you understand the rest of the number.
You may also want to handle vanity numbers such as (800) Lucky-Guy, which requires recognizing that, if it's a US number, there's one too many digits (and you may need the full representation for advertising or other purposes) and that in the US the letters map to the numbers differently than in Germany.
You may also want to store the entire number separately as a text field (with internationalization) so you can go back later and re-parse numbers as things change, or as a backup in case someone submits a bad method to parse a particular country's format and loses information.
KISS - I'm getting tired of many of the US web sites. They have some cleverly written code to validate postal codes and phone numbers. When I type my perfectly valid Norwegian contact info I find that quite often it gets rejected.
Leave it a string, unless you have some specific need for something more advanced.
The Wikipedia page on E.164 should tell you everything you need to know.
Here's my proposed structure, I'd appreciate feedback:
The phone database field should be a varchar(42) with the following format:
CountryCode - Number x Extension
So, for example, in the US, we could have:
1-2125551234x1234
This would represent a US number (country code 1) with area-code/number (212) 555 1234 and extension 1234.
Separating out the country code with a dash makes the country code clear to someone who is perusing the data. This is not strictly necessary because country codes are "prefix codes" (you can read them left to right and you will always be able to unambiguously determine the country). But, since country codes have varying lengths (between 1 and 4 characters at the moment) you can't easily tell at a glance the country code unless you use some sort of separator.
I use an "x" to separate the extension because otherwise it really wouldn't be possible (in many cases) to figure out which was the number and which was the extension.
In this way you can store the entire number, including country code and extension, in a single database field, that you can then use to speed up your queries, instead of joining on a user-defined function as you have been painfully doing so far.
Why did I pick a varchar(42)? Well, first off, international phone numbers will be of varied lengths, hence the "var". I am storing a dash and an "x", so that explains the "char", and anyway, you won't be doing integer arithmetic on the phone numbers (I guess) so it makes little sense to try to use a numeric type. As for the length of 42, I used the maximum possible length of all the fields added up (four fields of up to 10 digits each), based on Adam Davis' answer, and added 2 for the dash and the "x".
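Reading such a value back is straightforward. A small sketch in Python, where parse_stored is a made-up helper and the input is assumed to already be in the CountryCode-NumberxExtension format described above:

```python
def parse_stored(value: str):
    """Split 'CountryCode-NumberxExtension' into its three parts."""
    country, _, rest = value.partition("-")   # first dash ends the country code
    number, _, extension = rest.partition("x")  # 'x' starts the extension, if any
    return country, number, extension or None


print(parse_stored("1-2125551234x1234"))
# prints: ('1', '2125551234', '1234')
print(parse_stored("44-2071234567"))
# prints: ('44', '2071234567', None)
```

Because the stored number part contains only digits, the first dash and the first "x" are unambiguous separators.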
Look up E.164. Basically, you store the phone number as a code starting with the country prefix and an optional pbx suffix. Display is then a localization issue. Validation can also be done, but it's also a localization issue (based on the country prefix).
For example, +12125551212+202 would be formatted in the en_US locale as (212) 555-1212 x202. It would have a different format in en_GB or de_DE.
There is quite a bit of info out there about ITU-T E.164, but it's pretty cryptic.
Storage
Store phones in RFC 3966 (like +1-202-555-0252, +1-202-555-7166;ext=22). The main differences from E.164 are
No limit on the length
Support of extensions
To optimise speed of fetching the data, also store the phone number in the National/International format, in addition to the RFC 3966 field.
Don't store the country code in a separate field unless you have a serious reason for that. Why? Because you shouldn't ask for the country code on the UI.
Mostly, people enter the phones as they hear them. E.g. if the local format starts with 0 or 8, it'd be annoying for the user to do a transformation on the fly (like, "OK, don't type '0', choose the country and type the rest of what the person said in this field").
Parsing
Google has your back here. Their libphonenumber library can validate and parse any phone number. There are ports to almost any language.
So let the user just enter "0449053501" or "04 4905 3501" or "(04) 4905 3501". The tool will figure out the rest for you.
See the official demo to get a feeling for how much it helps.
I personally like the idea of storing a normalized varchar phone number (e.g. 9991234567) then, of course, formatting that phone number inline as you display it.
This way all the data in your database is "clean" and free of formatting.
Perhaps storing the phone number sections in different columns, allowing for blank or null entries?
Ok, so based on the info on this page, here is a start on an international phone number validator:
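function validatePhone(phoneNumber) {
    // Remove the formatting characters allowed in a phone number:
    // parentheses, dots, dashes, spaces, '+', and the 'x' extension marker.
    var stripped = phoneNumber.replace(/[().\- +x]/g, '');
    if (stripped === '') {
        return false;              // empty input, or nothing but formatting
    }
    if (!/^\d+$/.test(stripped)) {
        return false;              // parseInt() would accept "123abc", so test every character
    }
    return stripped.length <= 40;  // same upper bound as before
}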
Loosely based on a script from this page: http://www.webcheatsheet.com/javascript/form_validation.php
The standard for formatting numbers is E.164. You should always store numbers in this format. You should never allow the extension number in the same field with the phone number; those should be stored separately. As for numeric vs alphanumeric, it depends on what you're going to be doing with that data.
I think free text (maybe varchar(25)) is the most widely used standard. This will allow for any format, either domestic or international.
I guess the main driving factor may be how exactly you're querying these numbers and what you're doing with them.
I find most web forms correctly allow for the country code, area code, then the remaining 7 digits but almost always forget to allow entry of an extension. This almost always ends up making me utter angry words, since at work we don't have a receptionist, and my ext.# is needed to reach me.
I would have to check, but I think our DB schema is similar. We hold a country code (it might default to the US, not sure), area code, 7 digits, and extension.
What about storing a freetext column that shows a user-friendly version of the telephone number, then a normalised version that removes spaces, brackets and expands '+'. For example:
User friendly: +44 (0)181 4642542
Normalized: 00441814642542
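A minimal sketch of that normalisation in Python. The function name and the intl_prefix parameter are made up for illustration, and "00" as the expansion of '+' is an assumption that only holds where 00 is the international access prefix:

```python
import re


def normalise(friendly: str, intl_prefix: str = "00") -> str:
    # Drop the optional trunk digit that is written as "(0)".
    s = friendly.replace("(0)", "")
    # Remove spaces, brackets, dashes and dots.
    s = re.sub(r"[\s()\-.]", "", s)
    # Expand a leading '+' into the international access prefix.
    if s.startswith("+"):
        s = intl_prefix + s[1:]
    return s


print(normalise("+44 (0)181 4642542"))
# prints: 00441814642542
```

Passing a different intl_prefix covers dialing plans that do not use 00.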
I would go for a freetext field and a field that contains a purely numeric version of the phone number. I would leave the representation of the phone number to the user and use the normalized field specifically for phone number comparisons in TAPI-based applications or when trying to find double entries in a phone directory.
Of course it does not hurt providing the user with an entry scheme that adds intelligence like separate fields for country code (if necessary), area code, base number and extension.
Where are you getting the phone numbers from? If you're getting them from part of the phone network, you'll get a string of digits and a number type and plan, eg
441234567890 type/plan 0x11 (which means international E.164)
In most cases the best thing to do is to store all of these as they are, and normalise for display, though storing normalised numbers can be useful if you want to use them as a unique key or similar.
User friendly: +44 (0)181 464 2542 normalised: 00441814642542
The (0) is not valid in the international format. See the ITU-T E.123 standard.
The "normalised" format would not be useful to US readers as they use 011 for international access.
I've used 3 different ways to store phone numbers depending on the usage requirements.
If the number is being stored just for human retrieval and won't be used for searching, it's stored in a string type field exactly as the user entered it.
If the field is going to be searched on then any extra characters, such as +, spaces and brackets etc are removed and the remaining number stored in a string type field.
Finally, if the phone number is going to be used by a computer/phone application, then in this case it would need to be entered and stored as a valid phone number usable by the system, this option of course, being the hardest to code for.

Resources