Avoiding duplicate addresses in a database table - database

I'm trying to avoid reinventing the wheel when it comes to storing street addresses in a table only once. Uniqueness constraints won't work in some common situations:
100 W 5th Ave
100 West 5th Ave
100 W 5th
200 N 6th Ave Suite 405
200 N 6th Ave #405
I could implement some business logic or a trigger to normalize all fields before inserting and use uniqueness constraints across several fields in the table, but it would be easy to miss some cases with something that varies as much as street addresses.
What would be best would be a universal identifier for each address, perhaps based on GPS coordinates. Before storing a new address look up its GUID and see if the GUID already exists in the Address table.
An organization like Mapquest, the Postal Serice, FedEx, or the US government probably has a system like this.
Has anyone found a good solution to this?
Here's my Address table now (generated by JPA):
CREATE TABLE address
(
id bigint NOT NULL,
"number" character varying(255),
dir character varying(255),
street character varying(255),
"type" character varying(255),
trailingdir character varying(255),
unit character varying(255),
city character varying(255),
state character varying(255),
zip integer,
zip4 integer,
CONSTRAINT address_pkey PRIMARY KEY (id)
)

Look up the address in Google maps and use the spelling they use.

You need support for regular expressions like syntax. You can come up with some kind of automata function that will parse tokens and try matching them and then expand or contract them into abbreviations. I'd look into glob() like functions that give support to *? etc on unix as a quick dirty fix.

I wasn't looking for address validation or normalization, although address validation is a good idea. I need a unique identifier for each street address to avoid duplicate records.
It looks like geocoding can provide a solution. With geocoding the input can be a street address and the output will be latitude and longitude coordinates with enough precision to resolve a specific building.
There's a more serious problem with street address ambiguity than I thought. This is from the Wikipedia page on geocoding:
"...there are multiple 100 Washington Streets in Boston, Massachusetts because several cities have been annexed without changing street names."
The Wikipedia page on geocoding has a list of resources (many free) to perform geocoding.

I settled on the USC WebGIS service due to their nice web service interface and being easy to sign up for.
Geocodes aren't suitable as a unique key for street addresses, though, for a number of reasons. For example, geocoding cannot distinguish between different units in a condominium complex or apartment building.
I decided to use the parsed address from the geocoding result and put unique constraints on the street number, street name, unit, city, state, and zip. It's not perfect, but it works for what I'm doing.

Related

How to get the bounding coordinates for a US postal(zip) code?

Is there a service/API that will take a postal/zip code and return the bounding(perimeter) coordinates so I can build a Geometry object in a MS SQL database?
By bounding coordinates, I mean I would like to retrieve a list of GPS coordinates that construct a polygon that defines the US zip code.
An elaboration of my comment, that ZIP codes are not polygons....
We often think of ZIP codes as areas (polygons) because we say, "Oh, I live in this ZIP code..." which gives the impression of a containing region, and maybe the fact that ZIP stands for "Zone Improvement Plan" helps the false association with polygons.
In actuality, ZIP codes are lines which represent, in a sense, mail carrier routes. Geometrically, lines do not have area. Just as lines are strings of points along a coordinate plane, ZIP code lines are strings of delivery points in the abstract space of USPS-designated addresses.
They are not correlated to geographical coordinates. What you will find, though, is that they appear to be geographically oriented because it would be inefficient for carriers to have a route completely irrelevant of distance and location.
What is this "abstract space of USPS-designated addresses"? That's how I am describing the large and mysterious database of deliverable locations maintained by the US Postal Service. Addresses are not allotted based on geography, but on the routes that carriers travel which usually relates to streets and travelability.
Some 5-digit ZIP codes are only a single building, or a complex of buildings, or even a single floor of a building (yes, multiple zip codes can be at a single coordinate because their delivery points are layered vertically). Some of these -- among others -- are "unique" ZIPs. Companies and universities frequently get their own ZIP codes for marketing or organizational purposes. For instance, the ZIP code "12345" belongs to General Electric up in Schenectady, NY. (Edit: In a previous version of Google Maps, when you follow that link, you'd notice that the placement marker was hovering, because it points to a ZIP code, which is not a coordinate. While most US ZIP codes used to show a region on Google Maps, these types cannot because the USPS does not "own" them, so to speak, and they have no area.)
Just for fun, let's try verifying an address in a unique ZIP code. Head over to SmartyStreets and punch in a bogus address in 12345, like:
Street: 999 Sdf sdf
ZIP Code: 12345
When you try to verify that, notice that... it's VALID! Why? The USPS will deliver a piece to the receptacle for that unique ZIP code, but at that point, it's up to GE to distribute it. Pretty much anything internal to the ZIP code is irrelevant to the USPS, including the street address (technically "delivery line 1"). Many universities function in a similar manner. Here's more information regarding that.
Now, try the same bogus address, but without a ZIP code, and instead do the city/state:
Street: 999 Sdf sdf
City: Schenectady
State: NY
It doesn't validate. This is because even though Schenectady contains 12345, where the address is "valid," it geometrically intersects with the "real" ZIP codes for Schenectady.
Take another instance: military. Certain naval ships have their own ZIP codes. Military addresses are an entirely different class of addresses using the same namespace. Ships move. Geographical coordinates don't.
ZIP precision is another fun one. 5-digit ZIP codes are the least "precise" (though the term "specific" might be more meaningful here, since ZIP codes don't pinpoint anything). 7- and 9-digit ZIP codes are the most specific, often down to block or neighborhood-level in urban areas. But since each ZIP code is a different size, it's really hard to tell what actual distances you're talking.
A 9-digit ZIP code might be portioned to a floor of a building, so there you have overlapping ZIP codes for potentially hundreds of addresses.
Bottom line: ZIP codes don't, contrary to popular belief, provide geographical or boundary data. They vary widely and are actually quite un-helpful unless you're delivering mail or packages... but the USPS' job was to design efficient carrier routes, not partition the population into coordinate regions so much.
That's more the job of the census bureau. They've compiled a list of cartographic boundaries since ZIP codes are "convenient" to work with. To do this, they sectioned bunches of addresses into census blocks. Then, they aggregated USPS ZIP code data to find the relation between their census blocks (which has some rough coordinate data) and the ZIP codes. Thus, we have approximations of what it would look like to plot a line as a polygon. (Apparently, they converted a 1D line into a 2D polygon by transforming a 2D polygon based on its contents to fit linear data -- for each non-unique, regular ZIP code.)
From their website (link above):
A ZIP Code tabulation area (ZCTA) is a statistical geographic entity
that approximates the delivery area for a U.S. Postal Service
five-digit or three-digit ZIP Code. ZCTAs are aggregations of census
blocks that have the same predominant ZIP Code associated with the
addresses in the U.S. Census Bureau's Master Address File (MAF).
Three-digit ZCTA codes are applied to large contiguous areas for which
the U.S. Census Bureau does not have five-digit ZIP Code information
in its MAF. ZCTAs do not precisely depict ZIP Code delivery areas, and
do not include all ZIP Codes used for mail delivery. The U.S. Census
Bureau has established ZCTAs as a new geographic entity similar to,
but replacing, data tabulations for ZIP Codes undertaken in
conjunction with the 1990 and earlier censuses.
The USCB's dataset is incomplete, and at times inaccurate. Google still has holes in their data, too (the 12345 is a somewhat good example) -- but Google will patch it eventually by going over each address and ZIP code by hand. They do this already, but haven't made all their map data perfect quite yet. Naturally, access to this data is limited to API terms, and it's very expensive to raise these.
Phew. I'm beat. I hope that helps clarify things. Disclaimer: I used to be a developer at SmartyStreets. More information on geocoding with address data.
Even more information about ZIP codes.
What you are asking for is a service to provide "Free Zip code Geocoding". There are a few out there with varying quality. You're going to have a bad time coding something like this yourself because of a few reasons:
Zip codes can be assigned to a single building or to a post office.
Zip codes are NOT considered a polygonal area. Projecting Zip codes to a polygonal area will require you to make an educated guess where the boundary is between one zipcode and the next.
ZIP code address data specifies only a center location for the ZIP code. Zip code data provides the general vicinity of an address. Mailing addresses that exist between one zipcode and another can be in dispute on which zipcode it actually is in.
A mailing address may be physically closer to zipcode 11111, yet its official zip code is a more distant zip code point 11112.
Google Maps has a geocoding API:
The google maps API is client-side javascript. You can directly query the geocoding system from php using an http request. However, google maps only gives you what the United States Postal Service gives them. A point representing the center of the zipcode.
https://developers.google.com/maps/#Geocoding_Examples
map city/zipcode polygons using google maps
Thoughts on projecting a zipcode to its lat/long bounding box
There are approximately 43,000 ZIP Codes in the United States. This number fluctuates from month to month, depending on the number of changes made. The zipcodes used by the USPS are not represented as polygons and do not have hard and fast boundaries.
The USPS (United States Postal Service) is the authority that defines each zipcode lat/long. Any software which resolves a zipcode to a geographical location would be in need of weekly updates. One company called alignstar provides demographics and GIS data of zipcodes ( http://www.alignstar.com/data.html ).
Given a physical (mailing) address, find the geographical coordinates in order to display that location on a map.
If you want to reliably project what shape the zipcode is in, you are going to need to brute force it and ask: "give me every street address by zipcode", then paint boxes around those mis-shapen blobs. Then you can get a general feel for what geographical areas the zipcodes cover.
http://vterrain.org/Culture/geocoding.html
If you were to throw millions of mailing address points into an algorithm resolving every one to a lat/long, you might be able to build a rudimentary blob bounding box of that zipcode. You would have to re-run this algorithm and it would theoretically heal itself whenever the zipcode numbers move.
Other ideas
http://shop.delorme.com/OA_HTML/DELibeCCtpSctDspRte.jsp?section=10075
http://www.zip-codes.com/zip-code-map-boundary-data.asp
step 1:download cb_2018_us_zcta510_500k.zip
https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html
if you want to keep them in mysql
step 2: in your mysql create a db of named :spatialdata
run this command
ogr2ogr -f "MySQL" MYSQL:"spatialdata,host=localhost,user=root" -nln "map" -a_srs "EPSG:4683" cb_2018_us_zcta510_500k.shp -overwrite -addfields -fieldTypeToString All -lco ENGINE=MyISAM
i uploaded the file on github(https://github.com/sahilkashyap64/USA-zipcode-boundary/blob/master/USAspatialdata.zip)
In the your "spatialdata db" there will be 2 table named map & geometry_columns .
In 'map' there will be a column named "shape".
shape column is of type "geometry" and it contains polygon/multipolygon files
In 'geometry_columns' there will will be srid defined
how to check if point falls in the polygon
SELECT * FROM map WHERE ST_Contains( map.SHAPE, ST_GeomFromText( 'POINT(63.39550 -148.89730 )', 4683 ) )
and want to show boundary on a map
select zcta5ce10 as zipcode, ST_AsGeoJSON(SHAPE) sh from map where ST_Contains( map.SHAPE, ST_GeomFromText( 'POINT(34.1116 -85.6092 )', 4683 ) )
"ST_AsGeoJSON" this returns spatial data as geojson.
Use http://geojson.tools/
"HERE maps" to check the shape of geojson
if you want to generate topojson
mapshaper converts shapefile to topojson (no need to convert it to kml file)
npx -p mapshaper mapshaper-xl cb_2018_us_zcta510_500k.shp snap -simplify 0.1% -filter-fields ZCTA5CE10 -rename-fields zip=ZCTA5CE10 -o format=topojson cb_2018_us_zcta510_500k.json
If you want to convert shapefile to kml
`ogr2ogr -f KML tl_2019_us_zcta510.kml -mapFieldType Integer64=Real tl_2019_us_zcta510.shp
I have used mapbox gl to display 2 zipcodes
example: https://sahilkashyap64.github.io/USA-zipcode-boundary/
code :https://github.com/sahilkashyap64/USA-zipcode-boundary
SQL Server Solution
Download the Shape files from the US Census:
https://catalog.data.gov/dataset/2019-cartographic-boundary-shapefile-2010-zip-code-tabulation-areas-for-united-states-1-500000
I then found this repository to import the shape file to SQL Server, it was very fast and required no additional coding: https://github.com/xfischer/Shape2SqlServer
Then I could write my own script to find out which zip codes are in a polygon I created:
DECLARE #polygon GEOMETRY;
DECLARE #isValid bit = 0;
DECLARE #p nvarchar(2048) = 'POLYGON((-120.1547 39.2472,-120.3758 39.1950,-120.2124 38.7734,-119.6590 38.8162,-119.6342 39.3672,-120.1836 39.2525,-120.1547 39.2472))'
SET #polygon = GEOMETRY::STPolyFromText(#p,4326)
SET #isValid = #polygon.STIsValid()
IF (#isValid = 1)
SET #polygon = #polygon.MakeValid();
SET #isValid = #polygon.STIsValid()
IF (#isValid = 1)
BEGIN
SELECT * FROM cb_2019_us_zcta510_500k
WHERE geom.STIntersects(#polygon) = 1
END
ELSE
SELECT 'Polygon not valid'
I think this is what you need it uses US Census as repository: US Zipcode
Boundaries API: https://www.boundaries-io.com
Above API shows US Boundaries(GeoJson) by zipcode,city, and state. you should use the API programatically to handle large results.
Disclaimer,I work here
I think the world geoJson link and the google map geocode api can help you.
example: you can use the geocode api to code the zip,you will get the city,state,country,then,you search from the world and us geoJson get the boundry,I have an example of US State boundry,like dsdlink

Normalization 3NF

I an reading through some examples of normalization, however I have come across one that I do not understand.
The website the example is located here: http://cisnet.baruch.cuny.edu/holowczak/classes/3400/normalization/#allinone
The part I do not understand is "Third Normal Form"
In my head I see the transitive dependencies in EMPLOYEE_OFFICE_PHONE (Name, Office, Floor, Phone) as the following Name->->Office|Floor and Name->->Office|Phone
The author splits the table EMPLOYEE_OFFICE_PHONE (Name, Office, Floor, Phone) into EMPLOYEE_OFFICE (Name, Office, Floor) and EMPLOYEE_PHONE (Office, Phone)
From my judgement in the beginning, I still see the transitive dependency in Name->->Office|Floor so I don't understand why it is in 3NF. Was I wrong to state that there is a transitive dependency in Name->->Office|Floor?
Reasoning for transitivity:
Here is my list of the functional dependencies
Name -> Office
Name -> Floor
Name -> Phone
Office -> Phone
Office -> Floor (Is this the incorrect one? and why?
Thank-you your help everyone!
5) you assume a naming sheme here ... offices 4xx have to be on floor 4 ... 5xx have to be on floor 5 ... if such a scheme exists, you can have your dependency ... as long as this is not part of the specification ... no. 5 is out of the game ...
1. Name -> Office
2. Name -> Floor
3. Name -> Phone
4. Office -> Phone
5. Office -> Floor (Is this the incorrect one? and why?
(1) You and the author and I agree that Name->Office.
(2) You and the author agree that Name->Floor. While that's true based solely on the sample data, it's also true that Office->Floor. I'd explore this kind of issue by asking this question: "If an office is empty, do I still know what floor that office is on?" (Yes)
Those things suggest there's a transitive dependency, Name->Office, and Office->Floor. So I would disagree with you and with the author on this one.
(3) You say Name->Phone. The author says Office->Phone. The author also says that "each office has exactly one phone number." So given one value for Office, I know one and only one value for Phone. And given one value for Name, I know one and only one value for Phone. I'd explore this issue by asking, "If I move to a different office, does my phone number follow me?" If it does, then Name->Phone. If it doesn't, then Office->Phone.
There isn't enough information here to answer that question, and I've worked in offices that worked each of those two ways, so real-world experience doesn't help us very much, either. I'd have to side with the author in this case, although I think it's not very well thought through for a normalization example.
(4) This is really just an extension of (3) above.
(5) See (2) above. This doesn't have anything to do with a naming scheme, and you don't need to assume that offices numbered 5xx are on the 5th floor. The only relevant question is this: Given one value for Office, is there one and only one value for Floor? (Yes) I might explore this issue by asking "Can one office be on more than one floor?" (In the real world, that's remotely possible. But the sample data doesn't support that possibility.)
Some additional FDs, based solely on the sample data.
Phone->Office
Phone->Floor
Office->Name
First of all,let me define 3NF clearly :-
A relation is in 3NF if following conditions are satisfied:-
1.)Relation is in 2NF
2.)No non prime attribute is transitively dependent on the primary key.
In other words,a relation is in 3NF is one of the following conditions is satisfied for every functional dependency X->Y:-
1.)X is superkey
2.)Y is a prime attribute
For Your Question,if the following FDs are present :-
Name -> Office
Name -> Floor
Name -> Phone
Office -> Phone
Then we cannot say anything about Office and Floor.You can verify this by applying and checking any of the Armstrong Inference Rules.When you apply these rules, you will find that you cannot infer anything about office and floor.

Best way to store user-submitted item names (and their synonyms)

Consider an e-commerce application with multiple stores. Each store owner can edit the item catalog of his store.
My current database schema is as follows:
item_names: id | name | description | picture | common(BOOL)
items: id | item_name_id | picture | price | description | picture
item_synonyms: id | item_name_id | name | error(BOOL)
Notes: error indicates a wrong spelling (eg. "Ericson"). description and picture of the item_names table are "globals" that can optionally be overridden by "local" description and picture fields of the items table (in case the store owner wants to supply a different picture for an item). common helps separate unique item names ("Jimmy Joe's Cheese Pizza" from "Cheese Pizza")
I think the bright side of this schema is:
Optimized searching & Handling Synonyms: I can query the item_names & item_synonyms tables using name LIKE %QUERY% and obtain the list of item_name_ids that need to be joined with the items table. (Examples of synonyms: "Sony Ericsson", "Sony Ericson", "X10", "X 10")
Autocompletion: Again, a simple query to the item_names table. I can avoid the usage of DISTINCT and it minimizes number of variations ("Sony Ericsson Xperia™ X10", "Sony Ericsson - Xperia X10", "Xperia X10, Sony Ericsson")
The down side would be:
Overhead: When inserting an item, I query item_names to see if this name already exists. If not, I create a new entry. When deleting an item, I count the number of entries with the same name. If this is the only item with that name, I delete the entry from the item_names table (just to keep things clean; accounts for possible erroneous submissions). And updating is the combination of both.
Weird Item Names: Store owners sometimes use sentences like "Harry Potter 1, 2 Books + CDs + Magic Hat". There's something off about having so much overhead to accommodate cases like this. This would perhaps be the prime reason I'm tempted to go for a schema like this:
items: id | name | picture | price | description | picture
(... with item_names and item_synonyms as utility tables that I could query)
Is there a better schema you would suggested?
Should item names be normalized for autocomplete? Is this probably what Facebook does for "School", "City" entries?
Is the first schema or the second better/optimal for search?
Thanks in advance!
References: (1) Is normalizing a person's name going too far?, (2) Avoiding DISTINCT
EDIT: In the event of 2 items being entered with similar names, an Admin who sees this simply clicks "Make Synonym" which will convert one of the names into the synonym of the other. I don't require a way to automatically detect if an entered name is the synonym of the other. I'm hoping the autocomplete will take care of 95% of such cases. As the table set increases in size, the need to "Make Synonym" will decrease. Hope that clears the confusion.
UPDATE: To those who would like to know what I went ahead with... I've gone with the second schema but removed the item_names and item_synonyms tables in hopes that Solr will provide me with the ability to perform all the remaining tasks I need:
items: id | name | picture | price | description | picture
Thanks everyone for the help!
The requirements you state in your comment ("Optimized searching", "Handling Synonyms" and "Autocomplete") are not things that are generally associated with an RDBMS. It sounds like what you're trying to solve is a searching problem, not a data storage and normalization problem. You might want to start looking at some search architectures like Solr
Excerpted from the solr feature list:
Faceted Searching based on unique field values, explicit queries, or date ranges
Spelling suggestions for user queries
More Like This suggestions for given document
Auto-suggest functionality
Performance Optimizations
If there were more attributes exposed for mapping, I would suggest using a fast search index system. No need to set aliases up as the records are added, the attributes simply get indexed and each search issued returns matches with a relevance score. Take the top X% as valid matches and display those.
Creating and storing aliases seems like a brute-force, labor intensive approach that probably won't be able to adjust to the needs of your users.
Just an idea.
One thing that comes to my mind is sorting the characters in the name and synonym throwing away all white space. This is similar to the solution of finding all anagrams for a word. The end result is ability to quickly find similar entries. As you pointed out, all synonyms should converge into one single term, or name. The search is performed against synonyms using again sorted input string.

Store user name as separate first/last name, or as a single full name String?

I have a web app, I'd like the user to supply their real name, for friend searches. I'm not sure whether to store this as two separate fields in my user class, or as a single field:
class User {
#Persistent
private String mFirstName;
#Persistent
private String mLastName;
}
.. or ..
class User {
#Persistent
private String mFullName;
}
I'm only going to use it to let users search for people. For example, they might search for "John", or "John Doe", or "Doe". I'm not sure what the app engine query engine allows us to do here, in terms of partial matches and such - has anyone gone through this and can recommend a good solution? I'm leaning towards just storing the full name to make searches easier,
Thanks
It's not just a question of how you end up storing names, it also matters how you ask for names on your web form.
If you prompt with a first and last field, the vast majority of inputs will probably conform, but you'll still have many exceptions (prefixes, middle names, suffixes, punctuation, etc).
If you clearly prompt with separate prefix, first name, middle name, last name, suffix fields, there'll be even fewer exceptions, but users might get peeved or confused.
You might even offer both: an easy one-field input or preparsed multifield input. Explore what other web sites do and see if you find them appealing/confusing/whatever.
Also keep in mind that if you input separate fields you can always easily join them later, but if you input only a single field you won't have the typist's help if you need to parse it later.
Short answer: store a full name.
Long answer here (Falsehoods programmers believe about names, by Patrick McKenzie).
I’m going to list assumptions your systems probably make about names. All of these assumptions are wrong.
(#1) People have exactly one canonical full name.
(#20) People have last names, family names, or anything else which is shared by folks recognized as their relatives.

Does anyone know of a good library for mapping a person's name to his or her gender? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 8 years ago.
Improve this question
I am looking for a library or database that can provide guesses about whether a person is male or female based on his or her name or nickname. Something like
john => "M",
mary => "F",
alex => "A", #ambiguous
I am looking for something that supports names other than English names (such as Japanese, Indian, etc.).
Before I get another answer along the lines of "you are going to offend people by assuming their sex/gender" let me be clear, my application does not interact with anyone. It does not send emails or contact anyone in anyway. There are no users to ask. In many cases, the person in question is dead, and the only information I have is name, birth date, and date of death. The reason I want to know the sex of the individual is to make the grammar of the output nicer and to aid in possible searches that may come latter.
gender.c is an open source C program that does a good job.
It comes with data for 44568 first names from all around the world.
There is good documentation and a description of the file format (basically plain text)
so it should not be to difficult to read it from your own application.
Here is what the author says:
A few words on quality of data
The dictionary of first names has been prepared with utmost care.
For example, the Turkish, Indian and Korean names in this dictionary
have all been independently classified by several native speakers.
I also took special care to list only those names which can currently
be found.
The lesson from this?
Any modifications should be done very cautiously (and they must also
adhere to the sorting required by the search algorithm).
For example, knowing that "Sascha" is a boy's name in Germany,
the author never assumed the English "Sasha" to be a girl's name.
Knowing that "Jan" is a boy's name in Germany, I never assumed it to be
also a English short form of "Janet".
Another case in point is the name "Esra". This is a boy's name in
Germany, but a girl's name in Turkey.
The program calculates a probability for the name being male of female.
It can do so with the name as input alone or with the name and country of origin,
which gives significantly better results.
You can download it from the website of the German computer magazine c't
40 000 Namen.
The article is in German but don't worry, all documentation is English.
Here is the direct ftp link 0717-182.zip if you are not interested in the article.
The zip-File contains the source code, an windows executable, the database
and the documentation.
The gender of a name is something that cannot be inferred programmatically in the general case. You need a name database. Here is a free name database from the US Census Bureau.
EDIT: The link for the 2010 name is dead but there are working links and a libraries in the comments.
"I tell ya, life ain't easy for a boy named 'Sue.'"
...So, why make it any harder? If you need to know the sex, just ask... Otherwise, don't worry about it.
I've builded a free API that gives a probabilistic guess on the gender based on a first name. Instead of using any of the above mentioned approaches, i instead use a huge dataset of profiles from social networks to provide a probabilistic guess along with a certainty factor. It also supports optional filtering through country or language id's. It's getting better by the day as more profiles are added to the dataset.
It's free to use at http://genderize.io
ONE thing you should consider is using a tool that takes demographics into account, as naming conventions will rely heavily on this.
Example
http://api.genderize.io?name=kim
{"name":"kim","gender":"female","probability":"0.89","count":1440}
http://api.genderize.io?name=kim&country_id=dk
{"name":"kim","gender":"male","probability":"0.95","count":44,"country_id":"dk"}
Here are two oddball approaches that may not even work, and likely wouldn't work en masse without violating the terms of a license:
Use the Facebook API (which I know virtually nothing about, it may not even be possible) to perform two searches: one for FB male users with that first name, and one for female. Use the two numbers to decide the probability of gender.
Much looser but more scalable, use the Google API and search for the name plus the gender-specific pronouns, and compare the numbers. For instance, there are 592,000,000 results for searching for "Richard his" (not as a phrase), but only 179,000,000 for "Richard her".
Given your stated constraints, your best option is to re-phrase whatever it is you're writing to be gender-neutral unless you know what gender they want to be called in each instance.
If writing in English, remember that singular “they” is grammatically fine as a gender-neutral third-person singular pronoun.
A good example is the title of this question. As is currently:
… mapping a person's name to his or her sex?
That would be less awkward if written:
… mapping a person's name to their sex?
It's also poor practice to assume that users must be male or female. There are a small but significant number of "intersex" people, most of whom are heartily sick of not having a box to tick..
bignose: interesting on the "singular they". I didn't realize it had such a long history.
It's not a service, but a little app with a database:
http://www.codeproject.com/KB/cpp/genderizer.aspx
And this tool is in german:
http://www.faq-o-matic.net/2011/06/01/zu-einem-vornamen-das-geschlecht-finden/
And another one in VB:
http://www.vbarchiv.net/tipps/tipp_1925-geschlecht-anhand-des-vornamens-ermitteln.html
I think in combination with some "Most used firstname in 2011" lists you should be able to build something decent.
The python package SexMachine will do that for you. Given any first name it returns if it's male, female or unisex. It relies on the data from the gender.c program by Jorg Michael.
The only thing you'll get from trying to automate it is a bunch of unhappy users. From that census data:
JAMES, JOHN, ROBERT, MICHAEL, WILLIAM, DAVID, RICHARD, CHARLES, JOSEPH, THOMAS, CHRISTOPHER, DANIEL, PAUL, MARK, DONALD, GEORGE, KENNETH, STEVEN, EDWARD, BRIAN, RONALD, ANTHONY, KEVIN, JASON, MATTHEW, GARY, TIMOTHY, JOSE, LARRY, JEFFREY, FRANK, SCOTT, ERIC, STEPHEN, ANDREW, RAYMOND, GREGORY, JOSHUA, JERRY, DENNIS, WALTER, PATRICK, PETER, HAROLD, HENRY, CARL, ARTHUR, RYAN, JOE, JUAN, JACK, ALBERT, JUSTIN, TERRY, GERALD, KEITH, SAMUEL, WILLIE, LAWRENCE, ROY, BRANDON, ADAM, FRED, BILLY, LOUIS, JEREMY, AARON, RANDY, EUGENE, CARLOS, RUSSELL, BOBBY, VICTOR, MARTIN, JESSE, SHAWN, CLARENCE, SEAN, CHRIS, JOHNNY, JIMMY, ANTONIO, TONY, LUIS, MIKE, DALE, CURTIS, NORMAN, ALLEN, GLENN, TRAVIS, LEE, MELVIN, KYLE, FRANCIS, JESUS, RAY, JOEL, EDDIE, TROY, ALEXANDER, MARIO, FRANCISCO, MICHEAL, OSCAR, JAY, ALEX, JON, RONNIE, TOMMY, LEON, LEO, WESLEY, DEAN, DAN, LEWIS, COREY, MAURICE, VERNON, ROBERTO, CLYDE, SHANE, SAM, LESTER, CHARLIE, TYLER, GENE, BRETT, ANGEL, LESLIE, CECIL, ANDRE, ELMER, GABRIEL, MITCHELL, ADRIAN, KARL, CORY, CLAUDE, JAMIE, JESSIE, CHRISTIAN, LONNIE, CODY, JULIO, KELLY, JIMMIE, JORDAN, JAIME, CASEY, JOHNNIE, SIDNEY, JULIAN, DARYL, VIRGIL, MARSHALL, PERRY, MARION, TRACY, RENE, FREDDIE, AUSTIN, JACKIE, JOEY, EVAN, DANA, DONNIE, SHANNON, ANGELO, SHAUN, LYNN, CAMERON, BLAKE, KERRY, JEAN, IRA, RUDY, BENNIE, ROBIN, LOREN, NOEL, DEVIN, KIM, GUADALUPE, CARROLL, SAMMY, MARTY, TAYLOR, ELLIS, DALLAS, LAURENCE, DREW, JODY, FRANKIE, PAT, MERLE, TERRELL, DARNELL, TOMMIE, TOBY, VAN, COURTNEY, JAN, CARY, SANTOS, AUBREY, MORGAN, LOUIE, STACY, MICAH, BILLIE, LOGAN, DEMETRIUS, ROBBIE, KENDALL, ROYCE, MICKEY, DEVON, ASHLEY, CAREY, SON, MARLIN, ALI, SAMMIE, MICHEL, RORY, KRIS, AVERY, ALEXIS, GERRY, STACEY, CARMEN, SHELBY, RICKIE, BOBBIE, OLLIE, DENNY, DION, ODELL, MARY, COLBY, HOLLIS, KIRBY, CRUZ, MERRILL, LANE, CLEO, BLAIR, NUMBERS, CLAIR, BERNIE, JOAN, DOMINIQUE, TRISTAN, JAME, GALE, LAVERNE, ALVA, STEVIE, ERIN, AUGUSTINE, YOUNG, JOHNIE, ARIEL, DUSTY, LINDSEY, TRACEY, SCOTTIE, SANDY, SYDNEY, GAIL, DORIAN, LAVERN, REFUGIO, IVORY, ANDREA, SANG, DEON, CAROL, YONG, BERRY, TRINIDAD, SHIRLEY, MARIA, CHANG, ROSARIO, DANNIE, FRANCES, THANH, CONNIE, TORY, LUPE, DEE, SUNG, CHI, QUINN, MINH, THEO, LOU, CHUNG, VALENTINE, JAMEY, WHITNEY, SOL, CHONG, PARIS, OTHA, LACY, DONG, ANTONIA, KELLEY, CARROL, SHAYNE, VAL, JUDE, BRITT, HONG, LEIGH, GAYLE, JAE, NICKY, LESLEY, MAN, KASEY, JEWELL, PATRICIA, LAUREN, ELISHA, MICHAL, LINDSAY, and JEWEL
are all names that work for both males and females. If a girl's name is Robert and everyone, including your software, keeps on calling her a man, she'd be rather pissed.
Although databases are probably the most practical solution, if you want to have some fun maybe you could try writing a neural net (or using a neural net library) that takes in the name and outputs one of those 3 options (F,M,A).
You could train it using the datasets that exist in the databases suggested by other answers, as well as with any other data you have.
This solution would allow you to handle names not specifically categorised previously, and also handle different languages. You might want to pass the language (if you know it) as an input to the neural net as well.
I don't know that I can say neural nets (or any other machine learning) would do a good job of categorising though.
It's culture/region dependent: take Andrea, for Italians is only masculine, for Sweden is a female name while Andreas is for men; Shawn is ambiguous in English.
If a language has declination, like Latin or Russian, the final letters will change according to grammatical rules,
Another source of ambiguities is Family names identical to Personal names.
In my opinion it's impossibile to solve in general.
The idea will clearly not work in most languages.
However if you could tell the nationality beforehand you could have more luck.
In most Slav languages (e.g. russian, polish, bulgarian) you could safely assume that all surnames ending with -va -cha -ska (-a in general are feminine) while -v -ch -shi are masculine.
In fact any surname has feminine and masculine form depending on the ending.
The same names used in other countries (e.g. US) might use only the masculine form though.
The same could be said for first names (-a -ya are feminine) but it is not 100% accurate.
But in general you would hardly get a library that is sufficiently accurate.
I haven't used it, but IBM has a Global Name Analytics library (for a price!) that seems pretty comprehensive.
The Z Directory (at vettrasoft.com) has a C-language function, works something like so:
void func()
{
char c = z_guess_sex_byfirstname ("Lon");
switch(c)
{
case 'M': std::cout << "It's a boy!\n"; break;
case 'F': std::cout << "It's a girl!\n"; break;
case 'B': std::cout << "this name is for both sexes\n"; break;
case '?': std::cout << "sex unknown sorry\n"; break;
}
}
it's database driven, the table has something like 10,000+ names I think, but you need to
download and install the z directory (includes many other topo items like countries, geographical landmarks, airports, states, area codes, postal-zip codes, etc along with
c++ functions and objects to access the data). However the names are very English-language
oriented. The table is a work in progress and gradually updated.
Name-gender maps can work but in multicultural countries it's more like guessing. I can give you one example: Marian in Polish is a typical masculine name, whereas the same name in Great Britain is a female name. In the era of people immigrating all over the world, I'm not sure such database would be very accurate. Good luck!
Some cultures have unisex names - like mine. What do you do then? I think the answer is plain and simple - don't assume - you could cause offence. Just ask if its needed, otherwise gender neutrality.
Well, not anymore. IBM patented that idea a while ago.
So if you're looking for any level of flexability (something other than a list of names), you'll either have to (gasp!) ask the user, or simply pay IBM for the rights :)
In any case, such autodetection is annoying for many people who have gender-ambiguous names, or even just mean parents. Let's not make this any harder for them.
It's not free, but this is a nice library that I have used before:
NetGender for .NET allows you to
quickly and easily build Name
Verification, Parsing and Gender
Determination into your custom
applications. Accurately verify
whether a particular field contains a
valid individual or company. NetGender
uses a 100,000+, ethnically diverse,
Name Dictionary in combination with an
8,000+ Company Name Dictionary to
ensure precise gender determination.
http://www.softwarecompany.com/dotnet/netgender.htm
It's interesting that you say you have birth date. That could help. I've seen databases of histories of name popularity.
In the film Splash (1984), it was funny that Darryl Hannah's character chooses the name "Madison" from a Madison Avenue street sign, because obviously "Madison" is not a girl's name.
24 years later, Madison is the 4th most popular name for girl babies!
Name history from the gov't. (Check out Mary's sad decline through the last 100 years.)
When I wrote to the White House as a child, Richard Nixon (or, perhaps a secretary) responded to me with some photos of the historic place, addressed to "Miss Rhett Anderson." "Miss Rhett?" It doesn't even make sense! Can we REALLY not tell the difference between Clark Gable's Rhett (with a mustache, in Gone With The Wind!) and Vivian Lee's Scarlett? I shall never forgive him, despite Neil Young's assurance that "even Richard Nixon has got soul."
I'm pretty sure no such service could exist with an acceptable level of accuracy. Here are the problems which I think are insurmountable:
There are plenty of names which are for both men and women.
There's a lot of different names in this world, even if you only consider one country.
There is the "A Boy Named Sue" issue, raised so eloquently by Johnny Cash :-)
Check out http://genderchecker.com/
You can have a look at my python gender detection project https://github.com/muatik/genderizer
It tries to detect authors' genders looking their names and/or sample text(for example tweets) of them.
And it also supports mongodb, memcached for performance.
This is not really a programming problem - it comes down to getting a probability table.
AFAIK there are no public databases in distilled forms. You could either build this from census data, or buy the data from someone.
For example, this is someone who sells the probability table for Canada.
IMHO, it is a generally bad idea to determine sex from an individuals name. A lot of names are intersexual (good grief, is this even a word ?? :-), and also they may be one sex in one culture and another in another.
A few stupid examples, just a few that came to mind (from my part of the world, CE)
Vanja - female, in eastern countries from here, mostly male
Alex - intersex (short for Sandra, female, and Sandro, male)
Robin - in western cultures, can be both
In some parts of the world, a persons sex can be determined by looking at how the name ends. For example, Marija, Sandra, Ivana, Petra, Sara, Lucija, Ana - you can see that most of these female names end in "ja" or "ra". There are other examples as well.
Still, I think it's better just to ask the user for sex.
Got this from hacker news discussion about this
I know of no such service. You can perhaps find the data you are looking for, however. The US government publishes data about the prevalence of names and the gender of the person they're attached to. The Social Security Administration has such a page, and the census may as well, but I haven't taken the time to look. Perhaps other world governments do similar things.
I know of no such service, however ..
you could start with a raw list of person names or
guess gender according to some rules (e.g. -o => male, -ela, -a => female)
In some countries (e.g. germany) the name a person can be given is limited by law - maybe there are some publications concerning that matter, which could be harvested (but I don't know of any in the moment).
What I would do is make a hack which takes the name and searches it against the facebook api. Then looks at the resulting users and count how many of them are female or male. You then can return a percentage. Not so insurmountable anymore. :)
Just ask people, and if they are nice they will give you their 'M's or 'F's , and if they are not then give'em an 'A' .

Resources