I'm designing a database table and asking myself this question: How long should the firstname field be?
Does anyone have a list of reasonable lengths for the most common fields, such as first name, last name, and email address?
I just queried my database with millions of customers in the USA.
The maximum first name length was 46. I go with 50. (Of course, only 500 of those were over 25, and they were all cases where data imports resulted in extra junk winding up in that field.)
Last name was similar to first name.
Email addresses maxed out at 62
characters. Most of the longer ones
were actually lists of email
addresses separated by semicolons.
Street address maxes out at 95
characters. The long ones were all
valid.
Max city length was 35.
This should be a decent statistical spread for people in the US. If you have localization to consider, the numbers could vary significantly.
UK Government Data Standards Catalogue details the UK standards for this kind of thing.
It suggests 35 characters for each of Given Name and Family Name, or 70 characters for a single field to hold the Full Name, and 255 characters for an email address. Amongst other things..
W3C's recommendation:
If designing a form or database that will accept names from people
with a variety of backgrounds, you should ask yourself whether you
really need to have separate fields for given name and family name.
… Bear in mind that names in some cultures can be quite a lot longer
than your own. … Avoid limiting the field size for names in your
database. In particular, do not assume that a four-character
Japanese name in UTF-8 will fit in four bytes – you are likely to
actually need 12.
https://www.w3.org/International/questions/qa-personal-names
For database fields, VARCHAR(255) is a safe default choice, unless you can actually come up with a good reason to use something else. For typical web applications, performance won't be a problem. Don't prematurely optimize.
Some almost-certainly correct column lengths
Min Max
Hostname 1 255
Domain Name 4 253
Email Address 7 254
Email Address [1] 3 254
Telephone Number 10 15
Telephone Number [2] 3 26
HTTP(S) URL w domain name 11 2083
URL [3] 6 2083
Postal Code [4] 2 11
IP Address (incl ipv6) 7 45
Longitude numeric 9,6
Latitude numeric 8,6
Money[5] numeric 19,4
[1] Allow local domains or TLD-only domains
[2] Allow short numbers like 911 and extensions like 16045551212x12345
[3] Allow local domains, tv:// scheme
[4] http://en.wikipedia.org/wiki/List_of_postal_codes. Use max 12 if storing dash or space
[5] http://stackoverflow.com/questions/224462/storing-money-in-a-decimal-column-what-precision-and-scale
A long rant on personal names
A personal name is either a Polynym (a name with multiple sortable components), a Mononym (a name with only one component), or a Pictonym (a name represented by a picture - this exists due to people like Prince).
A person can have multiple names, playing roles, such as LEGAL, MARITAL, MAIDEN, PREFERRED, SOBRIQUET, PSEUDONYM, etc. You might have business rules, such as "a person can only have one legal name at a time, but multiple pseudonyms at a time".
Some examples:
names: [
{
type:"POLYNYM",
role:"LEGAL",
given:"George",
middle:"Herman",
moniker:"Babe",
surname:"Ruth",
generation:"JUNIOR"
},
{
type:"MONONYM",
role:"SOBRIQUET",
mononym:"The Bambino" /* mononyms can be more than one word, but only one component */
},
{
type:"MONONYM",
role:"SOBRIQUET",
mononym:"The Sultan of Swat"
}
]
or
names: [
{
type:"POLYNYM",
role:"PREFERRED",
given:"Malcolm",
surname:"X"
},
{
type:"POLYNYM",
role:"BIRTH",
given:"Malcolm",
surname:"Little"
},
{
type:"POLYNYM",
role:"LEGAL",
given:"Malik",
surname:"El-Shabazz"
}
]
or
names:[
{
type:"POLYNYM",
role:"LEGAL",
given:"Prince",
middle:"Rogers",
surname:"Nelson"
},
{
type:"MONONYM",
role:"SOBRIQUET",
mononym:"Prince"
},
{
type:"PICTONYM",
role:"LEGAL",
url:"http://upload.wikimedia.org/wikipedia/en/thumb/a/af/Prince_logo.svg/130px-Prince_logo.svg.png"
}
]
or
names:[
{
type:"POLYNYM",
role:"LEGAL",
given:"Juan Pablo",
surname:"Fernández de Calderón",
secondarySurname:"García-Iglesias" /* hispanic people often have two surnames. it can be impolite to use the wrong one. Portuguese and Spaniards differ as to which surname is important */
}
]
Given names, middle names, surnames can be multiple words such as "Billy Bob" Thornton, or Ralph "Vaughn Williams".
I would say to err on the high side. Since you'll probably be using varchar, any extra space you allow won't actually use up any extra space unless somebody needs it. I would say for names (first or last), go at least 50 chars, and for email address, make it at least 128. There are some really long email addresses out there.
Another thing I like to do is go to Lipsum.com and ask it to generate some text. That way you can get a good idea of just what 100 bytes looks like.
I pretty much always use a power of 2 unless there is a good reason not to, such as a customer facing interface where some other number has special meaning to the customer.
If you stick to powers of 2 it keeps you within a limited set of common sizes, which itself is a good thing, and it makes it easier to guess the size of unknown objects you may encounter. I see a fair number of other people doing this, and there is something aesthetically pleasing about it. It generally gives me a good feeling when I see this, it means the designer was thinking like an engineer or mathematician. Though I'd probably be concerned if only prime numbers were used. :)
These might be useful to someone;
youtube max channel length = 20
facebook max name length = 50
twitter max handle length = 15
email max length = 255
http://www.interoadvisory.com/2015/08/6-areas-inside-of-linkedin-with-character-limits/
I wanted to find the same and the UK Government Data Standards mentioned in the accepted answer sounded ideal. However none of these seemed to exist any more - after an extended search I found it in an archive here: http://webarchive.nationalarchives.gov.uk/+/http://www.cabinetoffice.gov.uk/govtalk/schemasstandards/e-gif/datastandards.aspx. Need to download the zip, extract it and then open default.htm in the html folder.
+------------+---------------+---------------------------------+
| Field | Length (Char) | Description |
+------------+---------------+---------------------------------+
|firstname | 35 | |
|lastname | 35 | |
|email | 255 | |
|url | 60+ | According to server and browser |
|city | 45 | |
|address | 90 | |
+------------+---------------+---------------------------------+
Edit: Added some spacing
Just looking though my email archives, there are a number of pretty long "first" names (of course what is meant by first is variable by culture). One example is Krishnamurthy - which is 13 letters long. A good guess might be 20 to 25 letters based on this. Email should be much longer since you might have firstname.lastname#somedomain.com. Also, gmail and some other mail programs allow you to use firstname.lastname+sometag#somedomain.com where "sometag" is anything you want to put there so that you can use it to sort incoming emails. I frequently run into web forms that don't allow me to put in my full email address without considering any tags. So, if you need a fixed email field maybe something like 25.25+15#20.3 in characters for a total of 90 characters (if I did my math right!).
I usually go with:
Firstname: 30 chars
Lastname: 30 chars
Email: 50 chars
Address: 200 chars
If I am concerned about long fields for the names, I might sometimes go with 50 for the name fields too, since storage space is rarely an issue these days.
If you need to consider localisation (for those of us outside the US!) and it's possible in your environment, I'd suggest:
Define data types for each component of the name - NOTE: some cultures have more than two names! Then have a type for the full name,
Then localisation becomes simple (as far as names are concerned).
The same applies to addresses, BTW - different formats!
it is varchar right? So it then doesn't matter if you use 50 or 25, better be safe and use 50, that said I believe the longest I have seen is about 19 or so. Last names are longer
Related
I have a textfile of a threaded discussion. I want to use python to iterate over the names of the people who posted in the discussion, select all the text posted by each person, and count the total number of words and total instances of a particular word within the selected text.
So, for instance, JoeX may have posted 5 times in the discussion. There is a delimiter that appears between posts (the words: Manage Discussion Entry). So, I need to find the name "JoeX", select the text until "Manage Discussion Entry", count the words in that block and the number of instances of phraseX, then continue through the textfile doing the same tasks each time JoeX posted...updating the total count and instances of phraseX count until the end of the textfile is reached.
Here is the code I've put together so far where ListOfNames is a pre-created list based on the names of the posters to the discussion and "impact" is phraseX:
for Name in ListOfNames:
count = 0
impact = 0
with open('canvassaveastext.html', 'r') as f:
for word in f.split(): # Split the contents into words and iterate through each.
if word[0] == Name: # Check if the word is the name we are looking for
RightOfName = f.partition(Name)[2]
NameWordString = RightOfName.partition("Manage Discussion Entry")[0]
impactcount = NameWordString.count("impact")
NameWordCount = NameWordString.count(" ")+1
count=count+NameWordCount
impact=impact+impactcount
print Name, count, impact
Sample input looks like this:
JOE Apr 22 at 5:31pm Various information here based on what the person wanted to say in their post // Manage Discussion Entry <#> MATT Apr 23 at 5:31pm Various information here based on what the person wanted to say in their post // Manage Discussion Entry <#> JENNIFER Apr 24 at 5:31pm Various information here based on what the person wanted to say in their post // Manage Discussion Entry <#> JOE Apr 25 at 5:31pm Various information here based on what the person wanted to say in their post // Manage Discussion Entry <#>
Expected output would look like this:
Joe 14 0
Right now the program outputs counts of 0 (e.g. Joe 0 0) so clearly there is a logic error somewhere.
Your help would be much appreciated! Thank you.
I have to create a database combined with 4 types of xls files, for example A, B, C and D. Every year new file is created, starting from 2004. A have 7 sheets with 800-1000 rows, B - D have one sheet with max 200 rows.
Everyone knows that people are lazy, so in excel files, address data are stored differently in each sheet. One of them, from 2008, have address data stored separately, but every other sheets have this data combined into one column.
Sooo, here is a question - how should I design a datatable? Something like this?
+---------+----------+----------+-------------+--------------------------------+
| Street | House Nr | City | Postal Code | Combined Address |
+---------+----------+----------+-------------+--------------------------------+
| Street1 | 20 | Somwhere | 00-000 | null |
| Street2 | 98 | Elswhere | 99-999 | null |
| null | null | null | null | Somwhere 00-000, street3 29 |
| null | null | null | null | st. Street2 65 12-345 Elswhere |
+---------+----------+----------+-------------+--------------------------------+
There will be a lot of nulls, so maybe best solution would be 2 different tables?
Most important thing is that users will search by using this data, and in the future add data into that database without excel files.
There are at least two different angles of view here: Normalization and efficiency, leading to different results.
Normalization
If this is the most important criterion you would make even three tables. Obviously Combined Address needs a place of it's own, but also Postal Code and City have to be stored into another table, because there is a dependency between them. Just one of the two, most probably Postal Code will stay. (Yes, there even is sth. about Street and Postal Code too, but I'm clearly not going to be pedantic.)
Efficiency
Normalization as an end in itself doesn't necessarily make the best result. If you permit yourself to be a bit sloppy on that and leave it the way it is in the model you posted, things might become easier in coding. You could use a trigger to make sure Combined Address is never null or use a (materialized) view that pretends it is and just search in Combined Address for the time being.
Imagine the effort if you use different tables and there is a need for referencing these addresses in other tables (Which table to use when? How to provide a unique id? Clearly a problem.).
So, decide on what's more important.
If I'm not mistaken we are taking about some 2000 rows or some 8000 rows if it is '7 sheets with 800-1000 rows each' actually. Even if the latter applies this isn't a number that makes data correction impracticable. If the number of different input pattern in the combined column is low, you might be able to do this (partly) automatically and just have some-one prove reading.
So you might want to think about a future redesign as well and choose what's more convenient in this case.
I'm looking for an efficient way of storing sets of objects that have occurred together during events, in such a way that I can generate aggregate stats on them on a day-by-day basis.
To make up an example, let's imagine a system that keeps track of meetings in an office. For every meeting we record how many minutes long it was and in which room it took place.
I want to get stats broken down both by person as well as by room. I do not need to keep track of the individual meetings (so no meeting_id or anything like that), all I want to know is daily aggregate information. In my real application there are hundreds of thousands of events per day so storing each one individually is not feasible.
I'd like to be able to answer questions like:
In 2012, how many minutes did Bob, Sam, and Julie spend in each conference room (not necessarily together)?
Probably fine to do this with 3 queries:
>>> query(dates=2012, people=[Bob])
{Board-Room: 35, Auditorium: 279}
>>> query(dates=2012, people=[Sam])
{Board-Room: 790, Auditorium: 277, Broom-Closet: 71}
>>> query(dates=2012, people=[Julie])
{Board-Room: 190, Broom-Closet: 55}
In 2012, how many minutes did Sam and Julie spend MEETING TOGETHER in each conference room? What about Bob, Sam, and Julie all together?
>>> query(dates=2012, people=[Sam, Julie])
{Board-Room: 128, Broom-Closet: 55}
>>> query(dates=2012, people=[Bob, Sam, Julie])
{Board-Room: 22}
In 2012, how many minutes did each person spend in the Board-Room?
>>> query(dates=2012, rooms=[Board-Room])
{Bob: 35, Sam: 790, Julie: 190}
In 2012, how many minutes was the Board-Room in use?
This is actually pretty difficult since the naive strategy of summing up the number of minutes each person spent will result in serious over-counting. But we can probably solve this by storing the number separately as the meta-person Anyone:
>>> query(dates=2012, rooms=[Board-Room], people=[Anyone])
865
What are some good data structures or databases that I can use to enable this kind of querying? Since the rest of my application uses MySQL, I'm tempted to define a string column that holds the (sorted) ids of each person in the meeting, but the size of this table will grow pretty quickly:
2012-01-01 | "Bob" | "Board-Room" | 2
2012-01-01 | "Julie" | "Board-Room" | 4
2012-01-01 | "Sam" | "Board-Room" | 6
2012-01-01 | "Bob,Julie" | "Board-Room" | 2
2012-01-01 | "Bob,Sam" | "Board-Room" | 2
2012-01-01 | "Julie,Sam" | "Board-Room" | 3
2012-01-01 | "Bob,Julie,Sam" | "Board-Room" | 2
2012-01-01 | "Anyone" | "Board-Room" | 7
What else can I do?
Your question is a little unclear because you say you don't want to store each individual meeting, but then how are you getting the current meeting stats (dates)? In addition any table given the right indexes can be very fast even with alot of records.
You should be able to use a table like log_meeting. I imagine it could contain something like:
employee_id, room_id, date (as timestamp), time_in_meeting
Where foreign keys to employee id to employee table, and room id key to room table
If you index employee id, room id, and date you should have a pretty quick lookup as mysql multiple-column indexes go left to right such that you gain index on (employee id, employee id + room id, and employee id + room id + timestamp) when do searches. This is explained more in the multi-index part of:
http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
By refusing to store meetings (and related objects) individually, you are loosing the original source of information.
You will not be able to compensate for this loss of data, unless you memorize on a regular basis the extensive list of all potential daily (or monthly or weekly or ...) aggregates that you might need to question later on!
Believe me, it's going to be a nightmare ...
If the number of people are constant and not very large you can then assign a column to each person for present or not and store the room, date and time in 3 more columns this can remove the string splitting problems.
Also by the nature of your question I feel first of all you need to assign Ids to everything rooms,people, etc. No need for long repetitive string in DB. Also try reducing any string operation and work using individual data in each column for better intersection performance. Also you can store a permutation all the people in a table and assign a id for them then use one of those ids in the actual date and time table. But all techniques will require that something be constant either people or rooms.
I do not understand whether you know all "questions" in design time or it's possible to add new ones during development/production time - this approach would require to keep all data all the time.
Well if you would know all your questions it seems like classic "banking system" which recalculates data on daily basis.
How I think about it.
Seems like you have limited number of rooms, people, days etc.
Gather logging data on daily basis, one table per day. Just one event, one database row, all information (field) what you need.
Start to analyse data using some crone script at "midnight".
Update stats for people, rooms, etc. Just increment number of hours spent by Bob in xyz room etc. All what your requirements need.
As analyzed data are limited and relatively small as you analyzed (compress) them, your system can contain also various queries as indexes would be relatively small etc.
You could be able to use scalable map/reduce algorithm.
You can't avoid storing the atomic facts as follows: (the meeting room, the people, the duration, the day), which is probably only a weak consolidation when the same people meet multiple times in the same room on the same day. Maybe that happens a lot in your office :).
Making groups comparable is an interesting problem, but as long as you always compose the member strings the same, you can probably do it with string comparisons. This is not "normal" however. To normalise you'll need a relation table (many to many) and compose a temporary table out of your query set so it joins quickly, or use an "IN" clause and a count aggregate to ensure everyone is there (you'll see what I mean when you try it).
I think you can derive the minutes the board room was in use as meetings shouldn't overlap, so a sum will work.
For storage efficiency, use integer keys for everything with lookup tables. Dereference the integers during the query parsing, or just use good old joins if you are feeling traditional.
That's how I would do it anyway :).
You'll probably have to store individual meetings to get the data you need anyway.
However you'll have to make sure you aggregate and anonymise it properly before creating your reports. Make sure to separate concerns and access levels to stay within the proper legal limits on data.
How to get a Carrier Name in Orders API of Amazon MWS for MFN Account which is already shipped? is it possible to do this using Order API?
I don't think there currently is any API to retrieve tracking information (or even just the carrier name) through MWS.
For the sake of completeness: To submit shipping information including carrier names and tracking numbers, you can use the SubmitFeed API with FeedType=_POST_ORDER_FULFILLMENT_DATA_. The corresponding XSD (OrderFulfillment.xsd) defines the following values as valid CarrierCodes: USPS, UPS, FedEx, DHL, Fastway, GLS, GO!, Hermes Logistik Gruppe, Royal Mail, Parcelforce, City Link, TNT, Target, SagawaExpress, NipponExpress, YamatoTransport . All other carriers must use the CarrierName field.
While I'm sure it's available in MWS, it would be much easier to use the carriers shipping number structure to determine the carrier.
IE...
Fedex can have 12 or 15 digit tracking numbers and the barcode can be 22 digits.
UPS has 1Z in front of their tracking.
USPS format is 20 digits (e.g. 9999 9999 9999 9999 9999), or a combination of 13 alphabetic and numeric characters, usually starting with 2 alphabets, following by 9 digits, and ending by "US" (e.g. EA 999 999 999 US
build regular expression to handle these easily.
Address records are probably used in most database, but I've seen a number of slightly different sets of fields used to store them. The number of fields seems to vary from 3-7, and sometimes all fields are simple labelled address1..addressN, other times given specific meaning (town, city, etc).
This is UK specific, though I'm open to comments about the rest of the world too. Here you need the first line of the address (actually just the number) and the post code to identify the address - everything else is mostly an added bonus.
I'm currently favouring:
Address 1
Address 2
Address 3
Town
County
Post Code
We could add Country if we ever needed it (unlikely).
What do you think? Is this too little, too much?
The Post Office suggests (http://www.postoffice.co.uk/portal/po/content1?catId=19100182&mediaId=19100267) 7 lines:
Addressees Name
Company/Organisation
Building Name
Number of building and name of thoroughfare
Locality Name
Post Town
Post Code
They then say you do not need to include a County name provided the Post Town and Postcode are used.
The BSI have BS 7666 - that covers all addressing. I recommend you look there.
The 2000 version recommends
An address shall be based upon a logical data model comprising the following entities:
addressable object, with sub-types:
primary addressable object;
secondary addressable object;
street;
locality;
town;
administrative area, a.k.a. district;
county;
postcode.
See: http://landregistry.data.gov.uk/def/common/BS7666Address
I don't know whether this is minimal (I doubt it) but the heading on my cheque book says something pretty close to:
Lloyds TSB
Isle of Man Offshore Centre
Peveril Buildings
Peveril Square
Douglas
Isle of Man
IM99 0XX
United Kingdom
This causes fits when I try to enter it into the US banking system.
If I were you, I'd call Royal Mail and ask them... or look on their website for postcode lookup as a best practice.
There's different types of addresses, and each different type has a slightly different structure. Forward sorting offices have a different postal address structure than a residential home with a street number. What if the house has a name instead of a number? There are so many factors to consider.
Since I moved to Canada I had to do something similar and it's far more complicated than a straightforward residential address which generally has:
Street Number if applicable
Street Number Suffix if applicable
House Name
Street Name
Street Type
Street Direction if applicable
Unit Number for flats, townhouses or other types of building/location
Minor Municipality (Village)
Major Municipality (Major Town/City)
County
PostCode
Country if you include Scotland, Wales, Northern Ireland (and now I noticed Eire)
Then you get businesses that have their own Delivery Route, PO Boxes, Forward Sortation Offices...
It gets complicated in a real hurry.
Best bet - give Royal Mail a call and they should be able to give you information on their standard address templates.
EDIT: Your 3 field method isn't a bad one...particularly. However, data sanitization may be a significant issue using the field setup you have and you may need a fairly complex strategy for making sure that the address entered is valid. It's far easier to sanitize single dedicated fields to make sure input is correct than it is to parse various address tokens out of combined fields.
Another simpler way to gain this info is to go on the Royal Mail website and check their postcode lookup page.
On their main postcode lookup, they use 4 fields and I guess they have some form of validation on the street name/type field. They separate the house number and name and I guess they only allow major municipality. I'm assuming the county/country are assumed. If you break out their advanced search, they give you two extra fields for flat number and business name.
Given that some fields are combined on their site, you have to assume that there's some amount of validation to make sure that data entered can be gainfully used.
Premises elements
Sub Building Name
Building Name
Building Number
Organisation Name
Department Name
PO Box Number
Thoroughfare elements
Dependent Thoroughfare Name
Dependent Thoroughfare Descriptor
Thoroughfare Name
Thoroughfare Descriptor
Locality elements
Double Dependent Locality
Dependent Locality
Post Town
Postcode element
Postcode
This answer may be a few years late, but it's aimed at those like myself looking for guidance on how to correctly format postal addresses for both storing in a database (or the likes of it) and for printing purposes.
Taken from Royal Mail Doc, link below - conveniently titled the 'Programmers Guide'
Technical specififcation for users of PAF
Page 27 - 42 was most helpful for me.
It's very likely that a "UK" will be opened to Eire as well, and in some lines of business there will be legal differences, generally between Scotland / NI / the channel islands and England and Wales.
In short, I would add country to the list. Otherwise it's fine (no fewer certainly), though of course any address is traceable from a building reference, a post code and a country alone.
Where we live in France its just 3 lines:-
myname
village/location name
6 digit postcode followed by post town name in uppercase
Even from UK that's all that is required