We are building a german alexa skill where one of our intents uses a DateSlot. If we ask with the name of a Day (e.g. Dienstag, Mittwoch), Alexa understands our input and correctly call our api with this day. If we ask for "morgen" (stands for tommorrow), Alexa understands "morgan" and can not map the input to a date and so Alexa asks for a different input.
How can we ensure, that Alexa interprets our voice input as a german word and not as an english word?
Not sure about German vs. English, but we ran into a similar issue with slot types. We originally did not understand that slot types are not a finite list of inputs, but training values. We ended up using the JaroWinklerDistance method in the natural package. We made a list of out expected slot types to attempting matching against and went with the best match. Our slot types were for car makes, types, and colors. So you can imagine the invalid values we had coming in.
Possibly you can do something similar by making a new intent to capture your known invalid values and do the natural.JaroWinklerDistance matching. You would need to add more utterances to match your DateSlot utterances which expect another slot with 'morgen' and other values you incorrectly receive. Then, with you intent handing, detect the string value and match through your known list of values. You can always fail out if something entirely unexpected happens, as it does now when you receive 'morgan'. We had to play with the threshold for the best match to determine if we used it or not.
Or, you can build up a list of items you constantly get and map them to valid values. We also had a slot that expected 1, 2, 3, 4, or 5. During development we were getting too, four, two, to, etc... We ended up mapping these known values in a list and translating them to what we expected. For this situation we had a fairly limited list of values we had to translate for, but it worked well.
We are storing wine data in our database. The vintage of the wine might be a number like 2010 or a string such as Non-Vintage.
2010 means the grapes were harvested in 2010.
Non-Vintage means the grapes were harvested across an unknown time-period.
At first we decided to store the field as a string, since 2010 and Non-Vintage are both potentially strings. However, we need to be able to sort the years or perform some arithmetic (i.e. year > 2010).
We are considering either:
Store the data as a number and "Non-Vintage" would be assigned 0. However, we'd have to provide weird validations everywhere in the app for handling the 0 value.
Store the year as a number and provide a boolean "non_vintage" field for non-vintage wines.
The data pulled from the database will be delivered to an AngularJS front-end via an API. The Javascript code will have to parse through and use the year at various points... i.e. "show me all wines where year > 2010".
Anyone have any thoughts on which is better and why?
are you using mysql? you can cast strings as ints in your mysql statements.
select * from table order by cast(string AS signed) asc;
That would mean you can store as a string.
Option 1
Since you're not really treating the year as a number, nor do you need to .. the simplest option you have is to just create a string and have at it.
This has some advantages and disads right off the bat:
Advantage:
simplest, nothing complex. any database can handle this.
logic is fairly simple, although a side case to handle your "Non-Vintage" is needed.
Disadvantage:
potentially allows values you don't want/expect/etc. :
ie: "NonVintage", "Non-vantage", "Unknown", "Spider-man!" ... O.o
This might be mitigated by whatever your process for putting values in might be (ie if it's mostly automated, this might be a smaller issue) :)
========================================
Option 2
A more stricter way would be to use 2 columns.
A number and a string.
Store the vintage year in a 4-digit number field, and you know it'll always be a "proper" year. (you could add a check constraint to prevent years < 1000 if you want ;) )
Store the code "NV" in a 2 (or 3?) digit string "code" column. This gives you good flexibility going forward in case other requirements in future start asking for additional types or such.
You could just do a boolean - it would work, however, if things changed in future (they always do), you'd have to redesign .. the string code column has no real hard disad on the boolean and gives you simple flexibility going forward ;)
========================
It would depend on your system and what you know of it, and how the data's coming in ... but I'd probably lean towards option 2 (number + string) myself.
In the last 3 companies I've worked at, the phone number columns are of type varchar(n). The reason being that they might want to store extensions (ext. 333). But in every case, the "-" characters are stripped out when inserting and updating. I don't understand why the ".ext" characters are okay to store but not the "-" character. Has any one else seen this and what explanation can you think of for doing it this way? If all you want to store is the numbers, then aren't you better off using an int field? Conversely, if you want to store the number as a string/varchar, then why not keep all the characters and not bother with formatting on display and cleaning on write?
I'm also interested in hearing about other ways in which phone number storage is implemented in other places.
Quick test: are you going to add/subtract/multiply/divide Phone Numbers? Nope. Similarly to SSNs, Phone Numbers are discrete pieces of data that can contain actual numbers, so a string type is probably most appropriate.
one point with storing phone numbers is a leading 0.
eg: 01202 8765432
in an int column, the 0 will be stripped of, which makes the phone number invalid.
I would hazard a guess at the - being swapped for spaces is because they dont actually mean anything
eg: 123-456-789 = 123 456 789 = 123456789
Personally, I wouldn't strip out any characters, as depending on where the phone number is from, it could mean different things. Leave the phone number in the exact format it was entered, as obviously that's the way the person who typed it in is used to seeing it.
It doesn't really matter how you store it, as long as it's consistent. The norm is to strip out formatting characters, but you can also store country code, area code, exchange, and extension separately if you have a need to query on those values. Again, the requirement is that it's consistent - otherwise querying it is a PITA.
Another reason I can think of not to store phone numbers as 'numbers' but as strings of characters, is that often enough part of the software stack you'd use to access the database (PHP, I am looking at you) wouldn't support big enough integers (natively) to be able to store some of the longer and/or exotic phone numbers.
Largest number that 32-bits can carry, without sign, is 4294967295. That wouldn't work for just any Russian mobile phone number, take, for instance, the number 4959261234.
So you have yourself an extra inconvenience of finding a way to carry more than 32-bits worth of number data. Even though databases have long supported very large integers, you only need one bad link in the chain for a showstopper. Like PHP, again.
Stripping some characters and allowing others may have an impact if the database table is going to drive another system, e.g. IP Telephony of some sort. Depending on the systems involved, it may be legitimate to have etc.333 as a suffix, whereas the developers may not have accounted for "-" in the string (and yes, I am guessing here...)
As for storing as a varchar rather than an int, this is just plain-ole common sense to me. As mentioned before, leading zeros may be stripped in an int field, the query on an int field may perform implicit math functions (which could also explain stripping "-" from the text, you don't want to enter 555-1234 and have it stored as -679 do you?)
In short, I don't know the exact reasoning, but can deduce some possibilities.
I'd opt to store the digits as a string and add the various "()" and "-" in my display code. It does get more difficult with international numbers. We handle it by having various "internationalized" display formats depending on country.
What I like to do if I know the phone numbers are only going to be within a specific region, such as North America, is to change the entry into 4 fields. 3 for area code, 3 for prefix, 3 for line, and maybe 5 for extension. I then insert these as 1 field with '-' and maybe an 'e' to designate extension. Any searching of course also needs to follow the same process. This ensures I get more regular data and even allows for the number to be used for actually making a phone call, once the - and the extension are removed. I can also get back to original 4 fields easily.
Good stuff! It seems that the main point is that the formatting of the phone number is not actually part of the data but is instead an aspect of the source country. Still, by keeping the extension part of the number as is, one might be breaking the model of separating the formatting from the data. I doubt that all countries use the same syntax/format to describe an extension. Additionally, if integrating with a phone system is a (possible) requirement, then it might be better to store the extension separately and build the message as it is expected. But Mark also makes a good point that if you are consistent, then it probably won't matter how you store it since you can query and process it consistently as well.
Thank you Eric for the link to the other question.
When an automated telephone system uses a field to make a phone call it may not be able to tell what characters it should use and which it should ignore in dialing. A human being may see a "(" or ")" or "-" character and know these are considered delimiters separating the area code, npa, and nxx of the phone number. Remember though that each character represents a binary pattern that, unless pre-programmed to ignore, would be entered by an automated dialer. To account for this it is better to store the equivalent of only the characters a user would press on the phone handset and even better that the individual values be stored in separate columns so the dialer can use individual fields without having to parse the string.
Even if not using dialing automation it is a good practice to store things you dont need to update in the future. It is much easier to add characters between fields than strip them out of strings.
In comment of using a string vs. integer datatype as noted above strings are the proper way to store phone numbers based on variations between countries. There is an important caveat to that though in that while aggregating statistics for reporting (i.e. SUM of how many numbers or calls) character strings are MUCH slower to count than integers. To account for this its important to add an integer as an identity column that you can use for counting instead of the varchar or char field datatype.
What is a good data structure for storing phone numbers in database fields? I'm looking for something that is flexible enough to handle international numbers, and also something that allows the various parts of the number to be queried efficiently.
Edit: Just to clarify the use case here: I currently store numbers in a single varchar field, and I leave them just as the customer entered them. Then, when the number is needed by code, I normalize it. The problem is that if I want to query a few million rows to find matching phone numbers, it involves a function, like
where dbo.f_normalizenum(num1) = dbo.f_normalizenum(num2)
which is terribly inefficient. Also queries that are looking for things like the area code become extremely tricky when it's just a single varchar field.
[Edit]
People have made lots of good suggestions here, thanks! As an update, here is what I'm doing now: I still store numbers exactly as they were entered, in a varchar field, but instead of normalizing things at query time, I have a trigger that does all that work as records are inserted or updated. So I have ints or bigints for any parts that I need to query, and those fields are indexed to make queries run faster.
First, beyond the country code, there is no real standard. About the best you can do is recognize, by the country code, which nation a particular phone number belongs to and deal with the rest of the number according to that nation's format.
Generally, however, phone equipment and such is standardized so you can almost always break a given phone number into the following components
C Country code 1-10 digits (right now 4 or less, but that may change)
A Area code (Province/state/region) code 0-10 digits (may actually want a region field and an area field separately, rather than one area code)
E Exchange (prefix, or switch) code 0-10 digits
L Line number 1-10 digits
With this method you can potentially separate numbers such that you can find, for instance, people that might be close to each other because they have the same country, area, and exchange codes. With cell phones that is no longer something you can count on though.
Further, inside each country there are differing standards. You can always depend on a (AAA) EEE-LLLL in the US, but in another country you may have exchanges in the cities (AAA) EE-LLL, and simply line numbers in the rural areas (AAA) LLLL. You will have to start at the top in a tree of some form, and format them as you have information. For example, country code 0 has a known format for the rest of the number, but for country code 5432 you might need to examine the area code before you understand the rest of the number.
You may also want to handle vanity numbers such as (800) Lucky-Guy, which requires recognizing that, if it's a US number, there's one too many digits (and you may need to full representation for advertising or other purposes) and that in the US the letters map to the numbers differently than in Germany.
You may also want to store the entire number separately as a text field (with internationalization) so you can go back later and re-parse numbers as things change, or as a backup in case someone submits a bad method to parse a particular country's format and loses information.
KISS - I'm getting tired of many of the US web sites. They have some cleverly written code to validate postal codes and phone numbers. When I type my perfectly valid Norwegian contact info I find that quite often it gets rejected.
Leave it a string, unless you have some specific need for something more advanced.
The Wikipedia page on E.164 should tell you everything you need to know.
Here's my proposed structure, I'd appreciate feedback:
The phone database field should be a varchar(42) with the following format:
CountryCode - Number x Extension
So, for example, in the US, we could have:
1-2125551234x1234
This would represent a US number (country code 1) with area-code/number (212) 555 1234 and extension 1234.
Separating out the country code with a dash makes the country code clear to someone who is perusing the data. This is not strictly necessary because country codes are "prefix codes" (you can read them left to right and you will always be able to unambiguously determine the country). But, since country codes have varying lengths (between 1 and 4 characters at the moment) you can't easily tell at a glance the country code unless you use some sort of separator.
I use an "x" to separate the extension because otherwise it really wouldn't be possible (in many cases) to figure out which was the number and which was the extension.
In this way you can store the entire number, including country code and extension, in a single database field, that you can then use to speed up your queries, instead of joining on a user-defined function as you have been painfully doing so far.
Why did I pick a varchar(42)? Well, first off, international phone numbers will be of varied lengths, hence the "var". I am storing a dash and an "x", so that explains the "char", and anyway, you won't be doing integer arithmetic on the phone numbers (I guess) so it makes little sense to try to use a numeric type. As for the length of 42, I used the maximum possible length of all the fields added up, based on Adam Davis' answer, and added 2 for the dash and the 'x".
Look up E.164. Basically, you store the phone number as a code starting with the country prefix and an optional pbx suffix. Display is then a localization issue. Validation can also be done, but it's also a localization issue (based on the country prefix).
For example, +12125551212+202 would be formatted in the en_US locale as (212) 555-1212 x202. It would have a different format in en_GB or de_DE.
There is quite a bit of info out there about ITU-T E.164, but it's pretty cryptic.
Storage
Store phones in RFC 3966 (like +1-202-555-0252, +1-202-555-7166;ext=22). The main differences from E.164 are
No limit on the length
Support of extensions
To optimise speed of fetching the data, also store the phone number in the National/International format, in addition to the RFC 3966 field.
Don't store the country code in a separate field unless you have a serious reason for that. Why? Because you shouldn't ask for the country code on the UI.
Mostly, people enter the phones as they hear them. E.g. if the local format starts with 0 or 8, it'd be annoying for the user to do a transformation on the fly (like, "OK, don't type '0', choose the country and type the rest of what the person said in this field").
Parsing
Google has your back here. Their libphonenumber library can validate and parse any phone number. There are ports to almost any language.
So let the user just enter "0449053501" or "04 4905 3501" or "(04) 4905 3501". The tool will figure out the rest for you.
See the official demo, to get a feeling of how much does it help.
I personally like the idea of storing a normalized varchar phone number (e.g. 9991234567) then, of course, formatting that phone number inline as you display it.
This way all the data in your database is "clean" and free of formatting
Perhaps storing the phone number sections in different columns, allowing for blank or null entries?
Ok, so based on the info on this page, here is a start on an international phone number validator:
function validatePhone(phoneNumber) {
var valid = true;
var stripped = phoneNumber.replace(/[\(\)\.\-\ \+\x]/g, '');
if(phoneNumber == ""){
valid = false;
}else if (isNaN(parseInt(stripped))) {
valid = false;
}else if (stripped.length > 40) {
valid = false;
}
return valid;
}
Loosely based on a script from this page: http://www.webcheatsheet.com/javascript/form_validation.php
The standard for formatting numbers is e.164, You should always store numbers in this format. You should never allow the extension number in the same field with the phone number, those should be stored separately. As for numeric vs alphanumeric, It depends on what you're going to be doing with that data.
I think free text (maybe varchar(25)) is the most widely used standard. This will allow for any format, either domestic or international.
I guess the main driving factor may be how exactly you're querying these numbers and what you're doing with them.
I find most web forms correctly allow for the country code, area code, then the remaining 7 digits but almost always forget to allow entry of an extension. This almost always ends up making me utter angry words, since at work we don't have a receptionist, and my ext.# is needed to reach me.
I find most web forms correctly allow for the country code, area code, then the remaining 7 digits but almost always forget to allow entry of an extension. This almost always ends up making me utter angry words, since at work we don't have a receptionist, and my ext.# is needed to reach me.
I would have to check, but I think our DB schema is similar. We hold a country code (it might default to the US, not sure), area code, 7 digits, and extension.
What about storing a freetext column that shows a user-friendly version of the telephone number, then a normalised version that removes spaces, brackets and expands '+'. For example:
User friendly: +44 (0)181 4642542
Normalized: 00441814642542
I would go for a freetext field and a field that contains a purely numeric version of the phone number. I would leave the representation of the phone number to the user and use the normalized field specifically for phone number comparisons in TAPI-based applications or when trying to find double entries in a phone directory.
Of course it does not hurt providing the user with an entry scheme that adds intelligence like separate fields for country code (if necessary), area code, base number and extension.
Where are you getting the phone numbers from? If you're getting them from part of the phone network, you'll get a string of digits and a number type and plan, eg
441234567890 type/plan 0x11 (which means international E.164)
In most cases the best thing to do is to store all of these as they are, and normalise for display, though storing normalised numbers can be useful if you want to use them as a unique key or similar.
User friendly: +44 (0)181 464 2542 normalised: 00441814642542
The (0) is not valid in the international format. See the ITU-T E.123 standard.
The "normalised" format would not be useful to US readers as they use 011 for international access.
I've used 3 different ways to store phone numbers depending on the usage requirements.
If the number is being stored just for human retrieval and won't be used for searching its stored in a string type field exactly as the user entered it.
If the field is going to be searched on then any extra characters, such as +, spaces and brackets etc are removed and the remaining number stored in a string type field.
Finally, if the phone number is going to be used by a computer/phone application, then in this case it would need to be entered and stored as a valid phone number usable by the system, this option of course, being the hardest to code for.