Does anyone know of a good library for mapping a person's name to his or her gender?

I am looking for a library or database that can provide guesses about whether a person is male or female based on his or her name or nickname. Something like
john => "M",
mary => "F",
alex => "A", #ambiguous
I am looking for something that supports names other than English names (such as Japanese, Indian, etc.).
Before I get another answer along the lines of "you are going to offend people by assuming their sex/gender", let me be clear: my application does not interact with anyone. It does not send emails or contact anyone in any way. There are no users to ask. In many cases, the person in question is dead, and the only information I have is name, birth date, and date of death. The reason I want to know the sex of the individual is to make the grammar of the output nicer and to aid in possible searches that may come later.

gender.c is an open source C program that does a good job.
It comes with data for 44568 first names from all around the world.
There is good documentation and a description of the file format (basically plain text),
so it should not be too difficult to read it from your own application.
Here is what the author says:
A few words on quality of data
The dictionary of first names has been prepared with utmost care.
For example, the Turkish, Indian and Korean names in this dictionary
have all been independently classified by several native speakers.
I also took special care to list only those names which can currently
be found.
The lesson from this?
Any modifications should be done very cautiously (and they must also
adhere to the sorting required by the search algorithm).
For example, knowing that "Sascha" is a boy's name in Germany,
the author never assumed the English "Sasha" to be a girl's name.
Knowing that "Jan" is a boy's name in Germany, I never assumed it to be
also a English short form of "Janet".
Another case in point is the name "Esra". This is a boy's name in
Germany, but a girl's name in Turkey.
The program calculates a probability for the name being male or female.
It can do so with the name as input alone or with the name and country of origin,
which gives significantly better results.
You can download it from the website of the German computer magazine c't, from the article "40 000 Namen".
The article is in German, but don't worry: all of the documentation is in English.
If you are not interested in the article, the direct FTP link is 0717-182.zip.
The zip file contains the source code, a Windows executable, the database
and the documentation.

The gender of a name is something that cannot be inferred programmatically in the general case. You need a name database. Here is a free name database from the US Census Bureau.
EDIT: The link for the 2010 names is dead, but there are working links and libraries in the comments.
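A minimal sketch of what a lookup against such a name database might look like. The two sets below are tiny hypothetical stand-ins for real male and female first-name lists, and the function name `guess` and the "?" code for unlisted names are assumptions of this example.

```python
# Tiny hypothetical stand-ins for real male and female first-name lists.
MALE_NAMES = {"john", "alex"}
FEMALE_NAMES = {"mary", "alex"}

def guess(name):
    """Return "M", "F", "A" (listed for both) or "?" (not listed at all)."""
    key = name.strip().lower()
    in_male = key in MALE_NAMES
    in_female = key in FEMALE_NAMES
    if in_male and in_female:
        return "A"
    if in_male:
        return "M"
    if in_female:
        return "F"
    return "?"

for name in ["John", "Mary", "Alex", "Zed"]:
    print(name, "=>", guess(name))
```

With real lists the overlap between the two sets is much larger, so the "A" branch would be hit far more often than in this toy data.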

"I tell ya, life ain't easy for a boy named 'Sue.'"
...So, why make it any harder? If you need to know the sex, just ask... Otherwise, don't worry about it.

I've built a free API that gives a probabilistic guess at the gender based on a first name. Rather than using any of the approaches mentioned above, I use a huge dataset of profiles from social networks to provide a probabilistic guess along with a certainty factor. It also supports optional filtering by country or language ID. It's getting better by the day as more profiles are added to the dataset.
It's free to use at http://genderize.io
One thing you should consider is using a tool that takes demographics into account, as naming conventions depend heavily on them.
Example
http://api.genderize.io?name=kim
{"name":"kim","gender":"female","probability":"0.89","count":1440}
http://api.genderize.io?name=kim&country_id=dk
{"name":"kim","gender":"male","probability":"0.95","count":44,"country_id":"dk"}
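As a rough sketch, a response like the ones above could be folded into the M/F/A codes from the question. The helper name `to_code` and the 0.9 cutoff are assumptions of this example, not part of the API, and the JSON strings are copied from the sample responses above rather than fetched live.

```python
import json

def to_code(response_text, threshold=0.9):
    """Map a genderize.io-style JSON response to "M", "F" or "A" (ambiguous).

    threshold is a hypothetical cutoff: below it the guess counts as ambiguous.
    """
    data = json.loads(response_text)
    gender = data.get("gender")
    # The sample responses carry the probability as a string, so convert it.
    probability = float(data.get("probability") or 0)
    if gender not in ("male", "female") or probability < threshold:
        return "A"
    return "M" if gender == "male" else "F"

worldwide = '{"name":"kim","gender":"female","probability":"0.89","count":1440}'
denmark = '{"name":"kim","gender":"male","probability":"0.95","count":44,"country_id":"dk"}'

print(to_code(worldwide))  # 0.89 is below the cutoff, so "A"
print(to_code(denmark))    # 0.95 for male, so "M"
```

Where to put the cutoff is a judgment call: a lower threshold gives fewer "A" results at the cost of more wrong guesses.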

Here are two oddball approaches that may not even work, and likely wouldn't work en masse without violating the terms of a license:
Use the Facebook API (which I know virtually nothing about, it may not even be possible) to perform two searches: one for FB male users with that first name, and one for female. Use the two numbers to decide the probability of gender.
Much looser but more scalable, use the Google API and search for the name plus the gender-specific pronouns, and compare the numbers. For instance, there are 592,000,000 results for searching for "Richard his" (not as a phrase), but only 179,000,000 for "Richard her".
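Turning two such hit counts into a probability is simple arithmetic. Here is a small sketch using the "Richard" numbers quoted above; the function name `male_probability` is made up for this example, and nothing here calls a real search API.

```python
def male_probability(his_count, her_count):
    """Share of hits that co-occur with "his" rather than "her"."""
    total = his_count + her_count
    if total == 0:
        return None  # no evidence either way
    return his_count / total

p = male_probability(592_000_000, 179_000_000)
print(round(p, 3))  # roughly 0.77 for "Richard"
```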

Given your stated constraints, your best option is to re-phrase whatever it is you're writing to be gender-neutral unless you know what gender they want to be called in each instance.
If writing in English, remember that singular “they” is grammatically fine as a gender-neutral third-person singular pronoun.
A good example is the title of this question. As is currently:
… mapping a person's name to his or her sex?
That would be less awkward if written:
… mapping a person's name to their sex?

It's also poor practice to assume that users must be male or female. There are a small but significant number of "intersex" people, most of whom are heartily sick of not having a box to tick.
bignose: interesting on the "singular they". I didn't realize it had such a long history.

It's not a service, but a little app with a database:
http://www.codeproject.com/KB/cpp/genderizer.aspx
And this tool is in German:
http://www.faq-o-matic.net/2011/06/01/zu-einem-vornamen-das-geschlecht-finden/
And another one in VB:
http://www.vbarchiv.net/tipps/tipp_1925-geschlecht-anhand-des-vornamens-ermitteln.html
I think in combination with some "Most used firstname in 2011" lists you should be able to build something decent.

The Python package SexMachine will do that for you. Given any first name, it returns whether it's male, female or unisex. It relies on the data from the gender.c program by Jorg Michael.

The only thing you'll get from trying to automate it is a bunch of unhappy users. From that census data:
JAMES, JOHN, ROBERT, MICHAEL, WILLIAM, DAVID, RICHARD, CHARLES, JOSEPH, THOMAS, CHRISTOPHER, DANIEL, PAUL, MARK, DONALD, GEORGE, KENNETH, STEVEN, EDWARD, BRIAN, RONALD, ANTHONY, KEVIN, JASON, MATTHEW, GARY, TIMOTHY, JOSE, LARRY, JEFFREY, FRANK, SCOTT, ERIC, STEPHEN, ANDREW, RAYMOND, GREGORY, JOSHUA, JERRY, DENNIS, WALTER, PATRICK, PETER, HAROLD, HENRY, CARL, ARTHUR, RYAN, JOE, JUAN, JACK, ALBERT, JUSTIN, TERRY, GERALD, KEITH, SAMUEL, WILLIE, LAWRENCE, ROY, BRANDON, ADAM, FRED, BILLY, LOUIS, JEREMY, AARON, RANDY, EUGENE, CARLOS, RUSSELL, BOBBY, VICTOR, MARTIN, JESSE, SHAWN, CLARENCE, SEAN, CHRIS, JOHNNY, JIMMY, ANTONIO, TONY, LUIS, MIKE, DALE, CURTIS, NORMAN, ALLEN, GLENN, TRAVIS, LEE, MELVIN, KYLE, FRANCIS, JESUS, RAY, JOEL, EDDIE, TROY, ALEXANDER, MARIO, FRANCISCO, MICHEAL, OSCAR, JAY, ALEX, JON, RONNIE, TOMMY, LEON, LEO, WESLEY, DEAN, DAN, LEWIS, COREY, MAURICE, VERNON, ROBERTO, CLYDE, SHANE, SAM, LESTER, CHARLIE, TYLER, GENE, BRETT, ANGEL, LESLIE, CECIL, ANDRE, ELMER, GABRIEL, MITCHELL, ADRIAN, KARL, CORY, CLAUDE, JAMIE, JESSIE, CHRISTIAN, LONNIE, CODY, JULIO, KELLY, JIMMIE, JORDAN, JAIME, CASEY, JOHNNIE, SIDNEY, JULIAN, DARYL, VIRGIL, MARSHALL, PERRY, MARION, TRACY, RENE, FREDDIE, AUSTIN, JACKIE, JOEY, EVAN, DANA, DONNIE, SHANNON, ANGELO, SHAUN, LYNN, CAMERON, BLAKE, KERRY, JEAN, IRA, RUDY, BENNIE, ROBIN, LOREN, NOEL, DEVIN, KIM, GUADALUPE, CARROLL, SAMMY, MARTY, TAYLOR, ELLIS, DALLAS, LAURENCE, DREW, JODY, FRANKIE, PAT, MERLE, TERRELL, DARNELL, TOMMIE, TOBY, VAN, COURTNEY, JAN, CARY, SANTOS, AUBREY, MORGAN, LOUIE, STACY, MICAH, BILLIE, LOGAN, DEMETRIUS, ROBBIE, KENDALL, ROYCE, MICKEY, DEVON, ASHLEY, CAREY, SON, MARLIN, ALI, SAMMIE, MICHEL, RORY, KRIS, AVERY, ALEXIS, GERRY, STACEY, CARMEN, SHELBY, RICKIE, BOBBIE, OLLIE, DENNY, DION, ODELL, MARY, COLBY, HOLLIS, KIRBY, CRUZ, MERRILL, LANE, CLEO, BLAIR, NUMBERS, CLAIR, BERNIE, JOAN, DOMINIQUE, TRISTAN, JAME, GALE, LAVERNE, ALVA, STEVIE, ERIN, AUGUSTINE, YOUNG, JOHNIE, ARIEL, DUSTY, LINDSEY, TRACEY, SCOTTIE, SANDY, 
SYDNEY, GAIL, DORIAN, LAVERN, REFUGIO, IVORY, ANDREA, SANG, DEON, CAROL, YONG, BERRY, TRINIDAD, SHIRLEY, MARIA, CHANG, ROSARIO, DANNIE, FRANCES, THANH, CONNIE, TORY, LUPE, DEE, SUNG, CHI, QUINN, MINH, THEO, LOU, CHUNG, VALENTINE, JAMEY, WHITNEY, SOL, CHONG, PARIS, OTHA, LACY, DONG, ANTONIA, KELLEY, CARROL, SHAYNE, VAL, JUDE, BRITT, HONG, LEIGH, GAYLE, JAE, NICKY, LESLEY, MAN, KASEY, JEWELL, PATRICIA, LAUREN, ELISHA, MICHAL, LINDSAY, and JEWEL
are all names that work for both males and females. If a girl's name is Robert and everyone, including your software, keeps on calling her a man, she'd be rather pissed.

Although databases are probably the most practical solution, if you want to have some fun maybe you could try writing a neural net (or using a neural net library) that takes in the name and outputs one of those 3 options (F,M,A).
You could train it using the datasets that exist in the databases suggested by other answers, as well as with any other data you have.
This solution would allow you to handle names not specifically categorised previously, and also handle different languages. You might want to pass the language (if you know it) as an input to the neural net as well.
I don't know that I can say neural nets (or any other machine learning) would do a good job of categorising though.

It's culture- and region-dependent: take Andrea, which is only masculine in Italy but a female name in Sweden, where Andreas is the male form; Shawn is ambiguous in English.
If a language has declension, like Latin or Russian, the final letters will change according to grammatical rules.
Another source of ambiguity is family names that are identical to personal names.
In my opinion it's impossible to solve in general.

The idea will clearly not work in most languages.
However if you could tell the nationality beforehand you could have more luck.
In most Slavic languages (e.g. Russian, Polish, Bulgarian) you could safely assume that surnames ending in -va, -cha, -ska (-a in general) are feminine, while those ending in -v, -ch, -shi are masculine.
In fact any surname has a feminine and a masculine form, depending on the ending.
The same names used in other countries (e.g. US) might use only the masculine form though.
The same could be said for first names (-a -ya are feminine) but it is not 100% accurate.
But in general you would hardly get a library that is sufficiently accurate.

I haven't used it, but IBM has a Global Name Analytics library (for a price!) that seems pretty comprehensive.

The Z Directory (at vettrasoft.com) has a C-language function that works something like this:
#include <iostream>  // for std::cout; the Z Directory header that declares
                     // z_guess_sex_byfirstname is needed as well

void func()
{
    char c = z_guess_sex_byfirstname("Lon");
    switch (c)
    {
    case 'M': std::cout << "It's a boy!\n"; break;
    case 'F': std::cout << "It's a girl!\n"; break;
    case 'B': std::cout << "this name is for both sexes\n"; break;
    case '?': std::cout << "sex unknown sorry\n"; break;
    }
}
It's database-driven; the table has something like 10,000+ names, I think, but you need to
download and install the Z Directory (which includes many other topo items like countries, geographical landmarks, airports, states, area codes, postal/zip codes, etc., along with
C++ functions and objects to access the data). However, the names are very English-language
oriented. The table is a work in progress and is gradually being updated.

Name-gender maps can work, but in multicultural countries it's more like guessing. I can give you one example: Marian in Polish is a typical masculine name, whereas the same name in Great Britain is a female name. In an era of people immigrating all over the world, I'm not sure such a database would be very accurate. Good luck!

Some cultures have unisex names, like mine. What do you do then? I think the answer is plain and simple: don't assume, because you could cause offence. Just ask if it's needed; otherwise, stay gender-neutral.

Well, not anymore. IBM patented that idea a while ago.
So if you're looking for any level of flexibility (something other than a list of names), you'll either have to (gasp!) ask the user, or simply pay IBM for the rights :)
In any case, such autodetection is annoying for many people who have gender-ambiguous names, or even just mean parents. Let's not make this any harder for them.

It's not free, but this is a nice library that I have used before:
NetGender for .NET allows you to
quickly and easily build Name
Verification, Parsing and Gender
Determination into your custom
applications. Accurately verify
whether a particular field contains a
valid individual or company. NetGender
uses a 100,000+, ethnically diverse,
Name Dictionary in combination with an
8,000+ Company Name Dictionary to
ensure precise gender determination.
http://www.softwarecompany.com/dotnet/netgender.htm

It's interesting that you say you have birth date. That could help. I've seen databases of histories of name popularity.
In the film Splash (1984), it was funny that Daryl Hannah's character chooses the name "Madison" from a Madison Avenue street sign, because obviously "Madison" is not a girl's name.
24 years later, Madison is the 4th most popular name for girl babies!
Name history from the gov't. (Check out Mary's sad decline through the last 100 years.)
When I wrote to the White House as a child, Richard Nixon (or, perhaps, a secretary) responded to me with some photos of the historic place, addressed to "Miss Rhett Anderson." "Miss Rhett?" It doesn't even make sense! Can we REALLY not tell the difference between Clark Gable's Rhett (with a mustache, in Gone With The Wind!) and Vivien Leigh's Scarlett? I shall never forgive him, despite Neil Young's assurance that "even Richard Nixon has got soul."

I'm pretty sure no such service could exist with an acceptable level of accuracy. Here are the problems which I think are insurmountable:
There are plenty of names which are for both men and women.
There are a lot of different names in this world, even if you only consider one country.
There is the "A Boy Named Sue" issue, raised so eloquently by Johnny Cash :-)

Check out http://genderchecker.com/

You can have a look at my Python gender detection project: https://github.com/muatik/genderizer
It tries to detect authors' genders by looking at their names and/or sample text of theirs (for example, tweets).
It also supports MongoDB and memcached for performance.

This is not really a programming problem - it comes down to getting a probability table.
AFAIK there are no public databases in distilled forms. You could either build this from census data, or buy the data from someone.
For example, this is someone who sells the probability table for Canada.

IMHO, it is generally a bad idea to determine sex from an individual's name. A lot of names are intersexual (good grief, is this even a word?? :-), and a name may also belong to one sex in one culture and to the other in another.
A few stupid examples, just a few that came to mind (from my part of the world, CE):
Vanja - female here; in countries east of here, mostly male
Alex - intersex (short for Sandra, female, and Sandro, male)
Robin - in western cultures, can be both
In some parts of the world, a person's sex can be determined by looking at how the name ends. For example, Marija, Sandra, Ivana, Petra, Sara, Lucija, Ana: you can see that most of these female names end in "ja" or "ra". There are other examples as well.
Still, I think it's better just to ask the user for sex.

Got this from hacker news discussion about this

I know of no such service. You can perhaps find the data you are looking for, however. The US government publishes data about the prevalence of names and the gender of the person they're attached to. The Social Security Administration has such a page, and the census may as well, but I haven't taken the time to look. Perhaps other world governments do similar things.

I know of no such service, however...
you could start with a raw list of person names, or
guess gender according to some rules (e.g. -o => male; -ela, -a => female).
In some countries (e.g. Germany) the names a person can be given are limited by law. There may be publications on that matter that could be harvested (but I don't know of any at the moment).
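The rule-based idea can be sketched in a few lines. The rule table below holds only the three endings mentioned above; the function name `guess_by_ending` and the "A" fallback for names that match no rule are assumptions of this example.

```python
# Suffix rules from the answer above, longest suffix first.
RULES = [("ela", "F"), ("a", "F"), ("o", "M")]

def guess_by_ending(name):
    """Guess "M" or "F" from the ending of a first name, else "A"."""
    key = name.strip().lower()
    for suffix, code in RULES:
        if key.endswith(suffix):
            return code
    return "A"  # no rule applies

for name in ["Marco", "Daniela", "Andrea", "John"]:
    print(name, "=>", guess_by_ending(name))
```

Note that "Andrea" comes out as "F" here, which is wrong for Italian names, so rules like these only make sense when the language of the name is known.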

What I would do is make a hack which takes the name and searches for it through the Facebook API. Then look at the resulting users and count how many of them are female or male. You can then return a percentage. Not so insurmountable anymore. :)

Just ask people, and if they are nice they will give you their 'M's or 'F's , and if they are not then give'em an 'A' .

Related

Extract Array of Values for Watson Dialog Variables

In the DevPost Watson Developer Challenge for Conversational Applications post, I saw that Watson is (maybe) able to analyze the following phrase: "I want to visit Tokyo, Sydney, Manchester, and Reykjavik during a trip that takes 30 days".
Is there a better way to extract that array of locations without having to predefine a maximum number of location variables (i.e. set location1 - 5) and manually specify various grammar items like $ (Locations)={location1} * (Locations)={location2} * (Locations)={location3} * (Locations)={location4} as per the Pizza example dialog? I would like to follow up with a comment such as "That's a lot" if there are more than 4 locations, or "Sure" if there are fewer.
You could try something like Alchemy or Relationship Extraction to identify all of the locations, and then simply add them to the user profile in Dialog. But today, the best way to do this within a broader conversation is to do it the same way the pizza sample does, as you outlined above.

Python searching a dataset

I've got this as a homework question and don't know how I should go about it.
Firstly, I've been given a dataset with a list of employees' names, addresses, emails etc. with about 50 employees in total.
You are asked to write an application to provide information about the members of staff. Your program should prompt the user to enter search criteria. Any members of staff who match the search criteria should be printed to the screen in the following format:
Position Designation Room and Extension Name and Email Address
(columns are tab-separated)
Matching information............
You will have to amend the dataset for processing, and you may choose to hold this in a separate file, although this is not necessary. Your program should satisfy certain constraints:
You should compare each column in the dataset against the search criteria.
The comparisons should not be case-sensitive.
All output should be in initial capitals, apart from the email address.
If a match is found, the results line should be printed and the columns should all line up.
Where there is no match, a message should be printed, without a title line.
You should save (1) your program, and (2) a paragraph to explain HOW you accomplished the processing of the dataset.
You should also run these test cases on your application:
Search for ‘brenda’
Search for all clerical staff.
Search for ‘BredNa’
Find Dr Carr’s position
Which office is Neil located in?
So, firstly, how should I read this dataset? Should I read it in as a text file, or create a tuple, a dictionary, etc.?
staff = [['prof.liam maguire','head of school','academic','MS127','75605','lguire#ulster.ac.uk'],
['prof. martin McGinnity','director of intelligent systems research centre','academic','MS112','75616','tinnity#ulster.ac.uk'],
['dr laxmidhar Behera','reader','academic','MS107','75276','lra#ulster.ac.uk'],
['dr girijesh Prasad','professor','academic','MS137','75645','gad#ulster.ac.uk'],
['dr kevin Curran','senior lecturer','academic','MS130','75565','krran#ulster.ac.uk'],
['mr aiden McCaughey','Senior Lecturer','academic','MG126','75131','aughey#ulster.ac.uk'],
['dr tom Lunney','postgraduate courses co-ordinator (Senior Lecturer)','academic','MG121D','75388','tfney#ulster.ac.uk'],
['dr heather Sayers','undergraduate courses co-ordinator (Senior Lecturer)','academic','MG121C','75148','hmyers#ulster.ac.uk'],
['dr liam Mc Daid','senior lecturer','academic','MS016','75452','ljid#ulster.ac.uk'],
['mr derek Woods','senior lecturer','academic','MS134','75380','dnoods#ulster.ac.uk'],
['dr ammar Belatreche','lecturer','academic','MS104','75185','aatreche#ulster.ac.uk'],
['mr michael Callaghan','lecturer','academic','MS132','75771','mjllaghan#ulster.ac.uk'],
['dr sonya Coleman','lecturer','academic','MS133','75030','saeman#ulster.ac.uk'],
['dr joan Condell','lecturer','academic','MS131','75024','jdell#ulster.ac.uk'],
['dr damien Coyle','lecturer','academic','MS103','75170','dhle#ulster.ac.uk'],
['mr martin Doherty','lecturer','academic','MG121A','75552','merty#ulster.ac.uk'],
['dr jim Harkin','lecturer','academic','MS108','75128','jgrkin#ulster.ac.uk'],
['dr yuhua Li','lecturer','academic','MS106','75528','yi#ulster.ac.uk'],
['dr sandra Moffett','lecturer','academic','MS015','75381','soffett#ulster.ac.uk'],
['mrs mairin Nicell','lecturer','academic','MG127','75007','micell#ulster.ac.uk'],
['mrs maeve Paris','lecturer','academic','MG040','75212','m#ulster.ac.uk'],
['dr jose Santos','lecturer','academic','MG035','75034','jantos#ulster.ac.uk'],
['dr nH. Siddique','lecturer','academic','MG037','75340','nhique#ulster.ac.uk'],
['dr zumao Weng','lecturer','academic','MG050','75358','zmng#ulster.ac.uk'],
['dr shane Wilson','lecturer','academic','MG038','75527','s.on#ulster.ac.uk'],
['dr caitriona carr','technical Services Engineer','computing and Technical Support','MG121B','75003','crr#ulster.ac.uk'],
['mr neil McDonnell','technical Services Supervisor','computing and Technical Support','MS030 / MF143','75360','ndonnell#ulster.ac.uk'],
['mr paddy McDonough','technical Services Engineer','computing and Technical Support','MS034','75322','p.ugh#ulster.ac.uk'],
['mr bernard McGarry','network Assistant','computing and Technical Support','MG132','75644','bgrry#ulster.ac.uk'],
['mr stephen Friel','secretary','clerical staff','MG048','75148','siel#ulster.ac.uk'],
['ms emma McLaughlin','secretary','clerical staff','MG048','75153','eughlin1#ulster.ac.uk'],
['mrs. brenda Plummer','secretary','clerical staff','MS126','75605','blmmer#ulster.ac.uk'],
['miss paula Sheerin','secretary','clerical staff','MS111','75616','perin#ulster.ac.uk'],
['mrs michelle Stewart','secretary','clerical staff','MG048','75382','mwart#ulster.ac.uk']]
matches = []
criterion = input("please enter search criterion: ")
criterion = criterion.lower()
for person in staff:
    for characteristic in person:
        # lower-case both sides so the comparison is not case-sensitive
        if criterion in characteristic.lower():
            matches.append(person)
            break
if len(matches) == 0:
    print("No Match")
else:
    print("POSITION |||DESIGNATION ||| EXT & ROOM NO||| NAME & EMAIL")
    for i in matches:
        # everything in initial capitals apart from the email address
        print(i[1].title(), ': ', i[2].title(), ':', i[3].upper() + i[4], ':', i[0].title(), i[5])
This is what I've come up with so far and it seems to work. Are there any improvements you would make?
Thank you for being honest and telling us that this is a homework question. StackOverflow discourages straight-up giving out answers to homework questions, but we can guide you towards the correct answer.
With regards to "amend the dataset for processing": This implies that the data is not currently in a consistent format. The first thing you need to do is look at the data you've been given, and decide the best representation for the data.
I'd recommend a columnar tab-delimited data file- this is easily created in Microsoft Excel by placing the data in a spreadsheet, and saving it as text. (Excel will complain that it will lose all the various things that make it a spreadsheet instead of a text file, but that's okay- you want a text file.) Save the updated file.
Excel produces what's called a tab-delimited text file: a 2-dimensional grid of data (like the shape of a spreadsheet), with one row of data per line (that is, the line-break character separates rows, and text editors display it by starting a new line) and the tab character (written in a Python string as the escape \t, but really a single character) separating cells within each row. This is also known as tab-separated values, or TSV. Closely related is comma-separated values, or CSV, which is another option in Excel. CSV can also stand for character-separated values, the general term for any text file that represents a grid of data by using some character (',' for comma-separated, '\t' for tab-separated) to separate the fields within each record.
CSVs are a very common file format, so Python is ready to help you out here. Python has a library, csv, designed to read these files for you. If you used Excel's text format, you'll need to tell it that your dialect is excel-tab, which is the csv module's name for tab-delimited files as Excel outputs them.
You'll need to construct a csv.reader to read your formatted data file. Use the order you put the columns in to understand the lists that you get when you read the CSV one line at a time- the order of the columns and the order of the items in each row is the same, so use that information to index correctly into the list to find each field.
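A minimal sketch of reading such a file with csv.reader and the excel-tab dialect. To keep it self-contained, the "file" here is an in-memory string with two rows taken from the question's data (emails left out); the file name in the comment is hypothetical, and the column order is just the one assumed for this example.

```python
import csv
import io

# Stand-in for a file saved from Excel as tab-delimited text; with a real
# file you would use open("staff.txt", newline="") (hypothetical file name).
data = io.StringIO(
    "Mr Stephen Friel\tSecretary\tClerical Staff\tMG048\t75148\n"
    "Dr Yuhua Li\tLecturer\tAcademic\tMS106\t75528\n"
)

rows = []
for row in csv.reader(data, dialect="excel-tab"):
    rows.append(row)  # each row is a list of strings, in column order

# The column order tells you which index holds which field.
name, position, designation, room, extension = rows[0]
print(position, room)
```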
Once you've read a row, what do you want to do with it?
You have a choice of storage formats in your program:
Save every record to a list (acting like a list of lists, since each record acts like a list). Now it's loaded, and when you want to search it, you iterate over your entire list-of-lists and use equality testing to find matches. This can be done with a list comprehension, which is almost certainly what your teacher is looking for.
Additionally, create a dict for each column of the file, and store every record in every dictionary: each dict uses that column's value as the key and maps it to the matching record. There's a catch here! A dict can only store one value for each key, but you'll definitely have different personnel in the same "designation" (multiple professors, multiple clerical staff, etc.), and there's no way to be sure that no two people will have the same name, either. So your indexing dicts will have to store lists of records, not just single records.
The second approach is much faster for repeated queries, since you're organizing all the records at the start for fast lookup. The first is much easier to implement, however, and is more likely to be what your teacher expects. I'd recommend implementing the first, understanding it, and then if you have time, implement the second.
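To make the two storage options concrete, here is a small sketch of both. The three records come from the question's data, cut down to three columns; the function names `search` and `build_index` are invented for this example, and whole-field equality is only one possible way to match.

```python
from collections import defaultdict

# Three records in the column order name, position, designation.
records = [
    ["Dr Yuhua Li", "Lecturer", "Academic"],
    ["Mr Stephen Friel", "Secretary", "Clerical Staff"],
    ["Ms Emma McLaughlin", "Secretary", "Clerical Staff"],
]

def search(records, criterion):
    """First approach: scan every record with a list comprehension."""
    wanted = criterion.lower()
    return [r for r in records if any(wanted == field.lower() for field in r)]

def build_index(records, column):
    """Second approach: map each column value to a list of matching records."""
    index = defaultdict(list)
    for r in records:
        index[r[column].lower()].append(r)
    return index

by_designation = build_index(records, 2)
print(len(search(records, "SECRETARY")))       # two secretaries
print(len(by_designation["clerical staff"]))   # the same two people
```

The index stores a list per key precisely because several people share a designation; a plain dict of single records would silently keep only the last one.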
The user interface for all this is up to you, of course, but this should put you well on your way to implementing the core of the program. Good luck.
This is how I would go about it:
staff_details = [["Prof. Liam Maguire","Head of School","Academic","MS127","75605","lp.maguire#ulster.ac.uk"],
["Prof. Martin McGinnity","Director of Intelligent Systems Research Centre","Academic","MS112","75616","tm.mcginnity#ulster.ac.uk"],
["Dr Laxmidhar Behera","Reader","Academic","MS107","75276", "l.behera#ulster.ac.uk"],
["Dr Girijesh Prasad","Professor","Academic","MS137","75645","g.prasad#ulster.ac.uk"],
["Dr Kevin Curran","Senior Lecturer","Academic","MS130","75565","kj.curran#ulster.ac.uk"],
["Mr Aiden McCaughey","Senior Lecturer","Academic","MG126","75131","a.mccaughey#ulster.ac.uk"],
["Dr Tom Lunney","Postgraduate Courses’ Co-ordinator (Senior Lecturer) ","Academic","MG121D","75388","tf.lunney#ulster.ac.uk"],
["Dr Heather Sayers","Undergraduate Courses’ Co-ordinator (Senior Lecturer) ","Academic","MG121C","75148","hm.sayers#ulster.ac.uk"],
["Dr Liam Mc Daid","Senior Lecturer","Academic","MS016","75452","lj.mcdaid#ulster.ac.uk"],
["Mr Derek Woods","Senior Lecturer","Academic","MS134","75380","dn.woods#ulster.ac.uk"],
["Dr Ammar Belatreche","Lecturer","Academic","MS104","75185","a.belatreche#ulster.ac.uk"],
["Mr Michael Callaghan","Lecturer","Academic","MS132","75771","mj.callaghan#ulster.ac.uk"],
["Dr Sonya Coleman","Lecturer","Academic","MS133","75030","sa.coleman#ulster.ac.uk"],
["Dr Joan Condell","Lecturer","Academic","MS131","75024","j.condell#ulster.ac.uk"],
["Dr Damien Coyle","Lecturer","Academic","MS103","75170","dh.coyle#ulster.ac.uk"],
["Mr Martin Doherty","Lecturer","Academic","MG121A","75552","m.doherty#ulster.ac.uk"],
["Dr Jim Harkin","Lecturer","Academic","MS108","75128","jg.harkin#ulster.ac.uk"],
["Dr Yuhua Li","Lecturer","Academic","MS106","75528","y.li#ulster.ac.uk"],
["Dr Sandra Moffett","Lecturer","Academic","MS015","75381","sm.moffett#ulster.ac.uk"],
["Mrs Mairin Nicell","Lecturer","Academic","MG127","75007","ma.nicell#ulster.ac.uk"],
["Mrs Maeve Paris","Lecturer","Academic","MG040","75212","m.paris#ulster.ac.uk"],
["Dr Jose Santos","Lecturer","Academic","MG035","75034","ja.santos#ulster.ac.uk"],
["Dr NH. Siddique","Lecturer","Academic","MG037","75340","nh.siddique#ulster.ac.uk"],
["Dr Zumao Weng","Lecturer","Academic","MG050 ","75358","zm.weng#ulster.ac.uk"],
["Dr Shane Wilson","Lecturer","Academic","MG038","75527","s.wilson#ulster.ac.uk"],
["Dr Caitriona Carr","Technical Services Engineer","Computing and Technical Support","MG121B","75003","c.carr#ulster.ac.uk"],
["Mr Neil McDonnell","Technical Services Supervisor","Computing and Technical Support","MS030 / MF143","75360", "n.mcdonnell#ulster.ac.uk"],
["Mr Paddy McDonough","Technical Services Engineer","Computing and Technical Support","MS034","75322","p.mcdonough#ulster.ac.uk"],
["Mr Bernard McGarry","Network Assistant","Computing and Technical Support","MG132","75644","bg.mcgarry#ulster.ac.uk"],
["Mr Stephen Friel","Secretary","Clerical Staff","MG048","75148","s.friel#ulster.ac.uk"],
["Ms Emma McLaughlin","Secretary","Clerical Staff","MG048","75153","e.mclaughlin1#ulster.ac.uk"],
["Mrs. Brenda Plummer","Secretary","Clerical Staff","MS126","75605","bl.plummer#ulster.ac.uk"],
["Miss Paula Sheerin","Secretary","Clerical Staff","MS111","75616","p.sheerin#ulster.ac.uk"],
["Mrs Michelle Stewart","Secretary","Clerical Staff","MG048","75382","m.stewart#ulster.ac.uk"]]
search_result = []
search_input = input("Please enter a search criterion: ")
for person in staff_details:
    for characteristic in person:
        # lower-case both sides so the comparison is not case-sensitive
        if search_input.lower() in characteristic.lower():
            search_result.append(person)
            break
if len(search_result) == 0:
    print("No staff members match your search criterion of ->", search_input)
else:
    print("We have a match!")
    print("{0:<30} {1:<40} {2:<40} {3:<50}".format("Position:", "Designation:", "Room and Extension:", "Name and Email:"))
    print("-" * 160)
    for align in search_result:
        print("{0:<30} {1:<40} {2:<40} {3:<50}".format((align[1]), (align[2]), (align[3] + ", Ext:" + align[4]), align[0] + "(" + align[5] + ")"))
I hope this helps you out!
I assume you have your dataset as a plain text file (or copyable text in an email, etc.). Then you have several choices:
Create a text file where each line stores info about one employee in the format specified: "Name", "Position", etc. In this case, to do the search you'll need to scan through the file and print the matching line, then repeat the matching part.
Use Python data types to store the info in memory, e.g. a list of dicts with keys "Name", "Position", etc. Then the search will become a little more complex to do (just a little, really), but you'll be able to format the output in any way you like.
But first you'll need to fill the list with data by reading the text file (or hard-code it manually, if you're desperate).
You could somewhat combine the approaches by forming a dict only from matching lines of your file.
You could use a real database engine such as MySQL, but that would be overkill for this homework.
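The list-of-dicts option might look roughly like this. The field names in `FIELDS`, the helper name `find` and the two sample lines (taken from the question's data, without emails) are assumptions of this sketch, not a required layout.

```python
FIELDS = ["Name", "Position", "Designation", "Room", "Extension"]

# Two tab-separated lines, as they might appear in the text file.
lines = [
    "Mrs. Brenda Plummer\tSecretary\tClerical Staff\tMS126\t75605",
    "Dr Jim Harkin\tLecturer\tAcademic\tMS108\t75128",
]

# One dict per employee, keyed by field name.
staff = [dict(zip(FIELDS, line.split("\t"))) for line in lines]

def find(staff, criterion):
    """Return every employee with a field containing the criterion, ignoring case."""
    wanted = criterion.lower()
    return [p for p in staff if any(wanted in v.lower() for v in p.values())]

for person in find(staff, "brenda"):
    # Named fields make it easy to pick the output order and column widths.
    print("{Position:<12}{Room:<8}{Name}".format(**person))
```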

Normalization 3NF

I am reading through some examples of normalization, but I have come across one that I do not understand.
The example is located here: http://cisnet.baruch.cuny.edu/holowczak/classes/3400/normalization/#allinone
The part I do not understand is "Third Normal Form".
In my head I see the transitive dependencies in EMPLOYEE_OFFICE_PHONE (Name, Office, Floor, Phone) as the following Name->->Office|Floor and Name->->Office|Phone
The author splits the table EMPLOYEE_OFFICE_PHONE (Name, Office, Floor, Phone) into EMPLOYEE_OFFICE (Name, Office, Floor) and EMPLOYEE_PHONE (Office, Phone)
From my judgement in the beginning, I still see the transitive dependency in Name->->Office|Floor so I don't understand why it is in 3NF. Was I wrong to state that there is a transitive dependency in Name->->Office|Floor?
Reasoning for transitivity:
Here is my list of the functional dependencies
Name -> Office
Name -> Floor
Name -> Phone
Office -> Phone
Office -> Floor (Is this the incorrect one? And why?)
Thank you for your help, everyone!
5) You assume a naming scheme here: offices 4xx have to be on floor 4, offices 5xx have to be on floor 5. If such a scheme exists, you can have your dependency. As long as it is not part of the specification, no. 5 is out of the game.
1. Name -> Office
2. Name -> Floor
3. Name -> Phone
4. Office -> Phone
5. Office -> Floor (Is this the incorrect one? And why?)
(1) You and the author and I agree that Name->Office.
(2) You and the author agree that Name->Floor. While that's true based solely on the sample data, it's also true that Office->Floor. I'd explore this kind of issue by asking this question: "If an office is empty, do I still know what floor that office is on?" (Yes)
Those things suggest there's a transitive dependency, Name->Office, and Office->Floor. So I would disagree with you and with the author on this one.
(3) You say Name->Phone. The author says Office->Phone. The author also says that "each office has exactly one phone number." So given one value for Office, I know one and only one value for Phone. And given one value for Name, I know one and only one value for Phone. I'd explore this issue by asking, "If I move to a different office, does my phone number follow me?" If it does, then Name->Phone. If it doesn't, then Office->Phone.
There isn't enough information here to answer that question, and I've worked in offices that worked each of those two ways, so real-world experience doesn't help us very much, either. I'd have to side with the author in this case, although I think it's not very well thought through for a normalization example.
(4) This is really just an extension of (3) above.
(5) See (2) above. This doesn't have anything to do with a naming scheme, and you don't need to assume that offices numbered 5xx are on the 5th floor. The only relevant question is this: Given one value for Office, is there one and only one value for Floor? (Yes) I might explore this issue by asking "Can one office be on more than one floor?" (In the real world, that's remotely possible. But the sample data doesn't support that possibility.)
Some additional FDs, based solely on the sample data.
Phone->Office
Phone->Floor
Office->Name
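One way to test a candidate FD against sample data is to check that no left-hand value ever appears with two different right-hand values. The sketch below does that; the rows are invented for illustration and are not the sample data from the linked page, and fd_holds is a hypothetical helper name.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if each combination of lhs values maps to one rhs value in rows."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        # setdefault stores val the first time key is seen, then returns the stored value.
        if seen.setdefault(key, val) != val:
            return False
    return True

# Invented rows, one employee per office.
rows = [
    {"Name": "Smith", "Office": "400", "Floor": 4, "Phone": "555-1234"},
    {"Name": "Jones", "Office": "220", "Floor": 2, "Phone": "555-9876"},
    {"Name": "Green", "Office": "160", "Floor": 1, "Phone": "555-2222"},
]
print(fd_holds(rows, ["Office"], ["Floor"]))
print(fd_holds(rows, ["Office"], ["Name"]))

# A second person moving into office 400 breaks Office->Name but not Office->Floor.
shared = rows + [{"Name": "Brown", "Office": "400", "Floor": 4, "Phone": "555-1234"}]
print(fd_holds(shared, ["Office"], ["Name"]))
print(fd_holds(shared, ["Office"], ["Floor"]))
```

Sample data can only refute an FD, never prove it: Office->Name holds in the first set of rows purely by accident, which is why the questions about the real-world rules matter more than the data.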
First of all, let me define 3NF clearly:
A relation is in 3NF if the following conditions are satisfied:
1.) The relation is in 2NF.
2.) No non-prime attribute is transitively dependent on the primary key.
In other words, a relation is in 3NF if at least one of the following conditions is satisfied for every non-trivial functional dependency X->Y:
1.) X is a superkey
2.) Y is a prime attribute
For your question, if the following FDs are present:
Name -> Office
Name -> Floor
Name -> Phone
Office -> Phone
Then we cannot say anything about Office and Floor. You can verify this by applying Armstrong's inference rules: when you apply them to these FDs, you will find that you cannot infer anything about Office and Floor.
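Rather than applying the rules by hand, you can compute the attribute closure, which gives the same answer as repeatedly applying Armstrong's rules. A minimal sketch, with closure as a hypothetical helper name and the four FDs listed above as input:

```python
def closure(attrs, fds):
    """Return all attributes functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If we already have the whole left side, we gain the right side.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

fds = [
    ({"Name"}, {"Office"}),
    ({"Name"}, {"Floor"}),
    ({"Name"}, {"Phone"}),
    ({"Office"}, {"Phone"}),
]

print(sorted(closure({"Office"}, fds)))
print(sorted(closure({"Name"}, fds)))
```

Floor is not in the closure of Office, so Office -> Floor does not follow from the four FDs; it has to be stated (or rejected) as a separate fact about the business rules.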

Database of common name aliases / nicknames of people

I'm involved with a SQL / .NET project that will be searching through a list of names. I'm looking for a way to return some results on similar first names of people. If searching for "Tom" the results would include Thom, Thomas, etc. It is not important whether this be a file or a web service. Example Design:
Table "Names" has Name and NameID
Table "Nicknames" has Nickname, NicknameID and NameID
Example output:
You searched for "John Smith"
You show results Jon Smith, Jonathan Smith, Johnny Smith, ...
Are there any databases out there (public or paid) suited to this type of task to populate a relationship between nicknames and names?
I'm adding another source for anyone who comes across this question via Google. This project provides a very good lookup for this purpose.
https://github.com/carltonnorthern/nickname-and-diminutive-names-lookup
It's somewhat simpler and less complete than pdNickName but on the other hand it's free and easy to use.
A Google search on "Database of Nicknames" turned up pdNickName (for pay).
In addition, I think you only need a single table for this job, not two, with NameID, Name, and MasterNameID. All the nicknames go into the Name column. One name is considered the "canonical" one. All the nickname records use the MasterNameID column to point back to that record, with the canonical name pointing to itself.
Your two table schema contains no additional information and, depending on how you fill in the nickname table, you might need extra code to handle the canonical cases.
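A rough sketch of that single-table layout, using Python's built-in sqlite3 module in place of SQL Server so it can be run as-is. The table name Names, the sample rows, and the related_names helper are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Names (
        NameID       INTEGER PRIMARY KEY,
        Name         TEXT NOT NULL,
        MasterNameID INTEGER NOT NULL REFERENCES Names(NameID)
    )
    """
)
# Canonical names point to themselves; nicknames point to the canonical row.
conn.executemany(
    "INSERT INTO Names (NameID, Name, MasterNameID) VALUES (?, ?, ?)",
    [(1, "Thomas", 1), (2, "Tom", 1), (3, "Thom", 1),
     (4, "John", 4), (5, "Johnny", 4)],
)

def related_names(conn, name):
    """Return every name that shares a canonical name with the given one."""
    rows = conn.execute(
        """
        SELECT other.Name
        FROM Names AS hit
        JOIN Names AS other ON other.MasterNameID = hit.MasterNameID
        WHERE hit.Name = ? COLLATE NOCASE
        ORDER BY other.NameID
        """,
        (name,),
    ).fetchall()
    return [row[0] for row in rows]

print(related_names(conn, "Tom"))
```

The self-join means the same query works whether the search term is a nickname or the canonical name, which is the "extra code" the two-table version would need.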
I just found this site.
It looks like you could script it pretty easily.
http://www.behindthename.com/php/extra.php?terms=steve&extra=r&gender=m
I just wish I could automatically narrow this to English.
Another commercial name matching database is: http://www.basistech.com/name-indexer/
It looks quite professional (though potentially expensive).
They claim to support the following languages:
Arabic, Chinese (Simplified), Chinese (Traditional), Persian (Farsi / Dari), English, Japanese, Korean, Pashto, Russian, Urdu
Here is a GitHub repo with a CSV of related names, and you can contribute back:
The first few lines show the format:
aaron,ron
abel,abe
abednego,bedney
abijah,ab,bige
abigail,ab,abbie,abby,gail
abner,ab,abbie,abby
abraham,abe,abram,bram
absalom,ab,abbie,app
There is a database out there called pdNickName (found at http://www.peacockdata2.com/products/pdnickname/). It contains everything you need, at a cost of $500.
Similar format as Stan James's csv, but folded two ways for lookups:
Name to nickname: https://github.com/MrCsabaToth/SOEMPI/blob/master/openempi/conf/name_to_nick.csv
Nickname to name: https://github.com/MrCsabaToth/SOEMPI/blob/master/openempi/conf/nick_to_name.csv
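If you only have the one-way CSV, folding it both ways yourself takes a few lines. This is a sketch that assumes each row is a name followed by its nicknames, as in the sample rows quoted earlier; I have not checked that the two linked files use exactly this layout, and build_lookups is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict

# A few rows in the "name,nick1,nick2,..." layout quoted earlier in this thread.
SAMPLE = """aaron,ron
abel,abe
abigail,ab,abbie,abby,gail
abner,ab,abbie,abby
"""

def build_lookups(fileobj):
    """Return (name_to_nick, nick_to_name) dicts of sets from a nickname CSV."""
    name_to_nick = defaultdict(set)
    nick_to_name = defaultdict(set)
    for row in csv.reader(fileobj):
        cells = [cell.strip().lower() for cell in row if cell.strip()]
        if not cells:
            continue  # skip empty lines
        name, nicks = cells[0], cells[1:]
        for nick in nicks:
            name_to_nick[name].add(nick)
            nick_to_name[nick].add(name)
    return name_to_nick, nick_to_name

name_to_nick, nick_to_name = build_lookups(io.StringIO(SAMPLE))
print(sorted(name_to_nick["abigail"]))
print(sorted(nick_to_name["abby"]))
```

Sets are used because one nickname can belong to several names ("abby" maps to both "abigail" and "abner" in the sample).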
To select similar-sounding names, use SOUNDEX (see MSDN):
SELECT SOUNDEX ('Tom')
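If you want the same kind of code on the .NET or scripting side, the classic American Soundex is easy to write yourself. The sketch below is in Python for brevity and is my own implementation of the textbook rules; a database's built-in SOUNDEX may differ in edge cases.

```python
def soundex(name):
    """Return the four-character American Soundex code for name."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit

    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""

    result = letters[0].upper()
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = codes.get(ch, "")  # vowels (and y) get no code
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]  # pad or truncate to one letter plus three digits

for name in ("Tom", "Thom", "Thomas"):
    print(name, soundex(name))
```

Note that "Tom" and "Thom" share a code (T500) but "Thomas" does not (T520), so Soundex on its own catches spelling variants rather than true nicknames.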
This is a good choice: https://github.com/onyxrev/common_nickname_csv
id, name, nickname
1, Aaron, Erin
2, Aaron, Ron
3, Aaron, Ronnie
4, Abel, Ab
5, Abel, Abe
6, Abel, Eb
7, Abel, Ebbie
8, Abiel, Ab
9, Abigail, Abby
10, Abigail, Gail

Searching through descriptions

There's a movie whose name I can't remember. It's about a carnival or amusement park with a horror house and a bunch of teens who are murdered one by one by something with a clown's mask. I saw this movie about 20 years ago, and its sequel, but can't remember it exactly. (And I also forgot its title.) As a result, I started wondering about how to solve something technical.
Assume that I have a database with the story plot and other data of each and every movie published. (Something like the IMDb.) And I would have an edit field where a user can just enter a description in plain text. The system would then start analysing this text to find the movie(s) that would qualify to this description.
For example (different movie), I enter this in the edit field: "Some movie about an Egyptian king who attacks a bunch of indians on horseback, but he's badly wounded and his horse dies while he lost this battle."
The system should then report the movie "Alexander" from 2004 as answer, but possibly a few more. (Even allowing a few errors in the description.)
To create such a system, where a description gets analysed to find a matching record by searching through descriptions, what techniques would I need for something as complex as that? Not that I want to build something like that right now; I'm asking more out of curiosity, in case I ever want to pick up an interesting new project.
(I wanted to award extra points for those who recognise the movie I've mentioned in the beginning. But one Google-attempt later and I found it myself!)
Btw, it's not the search engine itself that interests me, but analysing the description to get to something a search engine will understand! With the example movie, it's human logic that helped me to find the title. (And it's annoying that this movie isn't for sale in the Netherlands.) Human logic will always be a requirement but it's about analysing the user input, which is in the form of a story or description, with possible errors.
You should check out document classification.
A few document classification techniques
Naive Bayes classifier
tf–idf
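To give an idea of how tf-idf could rank plot summaries against a free-text description, here is a small pure-Python sketch. The three summaries and their titles are invented, the smoothed idf formula is one common variant among several, and a real system would more likely use an existing search or machine-learning library.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def rank(query, plots):
    """Return (score, title) pairs sorted by tf-idf cosine similarity to query."""
    docs = {title: Counter(tokenize(text)) for title, text in plots.items()}
    n = len(docs)
    df = Counter()
    for counts in docs.values():
        df.update(counts.keys())  # document frequency: one count per document
    # Smoothed idf: words found in fewer summaries get a larger weight.
    idf = {term: math.log((1 + n) / (1 + d)) + 1 for term, d in df.items()}

    def vector(counts):
        # Words that never occur in any summary are dropped.
        return {t: c * idf[t] for t, c in counts.items() if t in idf}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    q = vector(Counter(tokenize(query)))
    scores = [(cosine(q, vector(counts)), title) for title, counts in docs.items()]
    return sorted(scores, reverse=True)

# Invented one-line summaries; the titles are placeholders, not real movies.
plots = {
    "king-epic": "A king leads his army into battle on horseback and is badly wounded.",
    "clown-horror": "Teens at a carnival are murdered one by one by a killer in a clown mask.",
    "romance": "A young man falls in love at a wedding by the swimming pool.",
}
for score, title in rank("a king on horseback loses a battle", plots):
    print(round(score, 3), title)
```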
From what I can tell from your own comments, Google is the technique to be used. ;-) But, honestly, I think more or less any search engine would do.
Edit: heh, you removed your comment, but I do remember you mentioned Google as the one deserving extra points.
Edit+: well, you mentioned Google again, but I don't want to remove my first edit. ;-)
Pure speculation: Would something trivial such as taking every word of more than 4 letters in the description "Egyptian, Indian, horse battle etc." and fuzzy matching against a database of such summaries work? Perhaps with some normalisation eg. king == leader == emperor?
Hmmm ... Young Man, Girlfriend, swimming pool, mother, wedding does that get us to The Graduate? Well I guess with a small amount of specifics "Robinson" it might.
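Still speculating, that idea could look something like the sketch below: normalise synonyms, keep words of more than 4 letters, and count how many query words fuzzily match a word in the summary. The synonym table, the 0.8 cutoff, and both summaries are made up, and difflib is just the standard-library stand-in for a proper fuzzy matcher.

```python
import difflib
import re

# Hand-made synonym table mapping variants to one canonical word (an assumption).
SYNONYMS = {"king": "ruler", "leader": "ruler", "emperor": "ruler"}

def keywords(text):
    """Normalise synonyms, then keep only words of more than 4 letters."""
    words = (SYNONYMS.get(w, w) for w in re.findall(r"[a-z]+", text.lower()))
    return {w for w in words if len(w) > 4}

def score(query, summary):
    """Count query keywords that fuzzily match some keyword of the summary."""
    summary_words = list(keywords(summary))
    hits = 0
    for word in keywords(query):
        # cutoff=0.8 tolerates small typos such as swapped letters.
        if difflib.get_close_matches(word, summary_words, n=1, cutoff=0.8):
            hits += 1
    return hits

query = "Egyptain emperor, horse battle"  # typo left in on purpose
epic = "An Egyptian king fights a battle on horseback"
horror = "Teens at a carnival are murdered by a killer in a clown mask"
print(score(query, epic), score(query, horror))
```

The weak spot shows up immediately: "horse" and "horseback" are too different for the cutoff, so you would probably want stemming or a proper synonym source on top of this.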
You can do lots of interesting stuff with the imdb keyword search:
http://akas.imdb.com/keyword/carnival/clown/murder/
You can specify multiple keywords, it suggests movies and more keywords which are in similar context with your given keywords.
The data contained in IMDb is publicly available for non-commercial use and can be downloaded as text files. You could build a database from it.

Resources