Plain, computer parseable lists of common first names? - dataset

I need a list of common first names for people, like "Bill", "Gordon", "Jane", etc. Is there some free list of lots of known names, instead of me having to type them out? Something that I can easily parse with the programme to fill in an array for example?
I'm not worried about:
Knowing if a name is masculine or feminine (or both)
If the dataset has a whole pile of false positives
If there are names that aren't on it, obviously no dataset like this will be complete.
If there are 'duplicates', i.e. I don't care if the dataset lists "Bill" and "William" and "Billy" as different names. I'd rather have more data than less
I don't care about knowing the popularity the name
I know Wikipedia has a list of most popular given names, but that's all in a HTML page and manged up with horrible wiki syntax. Is there a better way to get some sample data like this without having to screen scrape wikipedia?

A CSV from the General Register Office of Scotland with all the forenames registered there in 2007.
Another large set of first names in CSV format and SQL format too (but they didn't say which DB dumped the SQL).
GitHub page with the top 1000 baby names from 1880 to 2009, already parsed into a CSV for you from the Social Security Administration.
CSV of baby names and meanings from a Princeton CS page.
That ought to be enough to get you started, I'd think.

You can easily consume the Wikipedia API (http://en.wikipedia.org/w/api.php) to retrieve the list of pages in specific category, looks like Category:Given names is something you want to start from.
http://en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmnamespace=0&cmlimit=500&cmtitle=Category:Given_names
The part of result from this URL looks like this:
<cm pageid="5797824" ns="0" title="Abdou" />
<cm pageid="5797863" ns="0" title="Abdu" />
<cm pageid="859035" ns="0" title="Abdul Aziz" />
<cm pageid="6504818" ns="0" title="Abdul Qadir" />
Look at the API and select appropriate format and query parameters, and check categories.
P.S.
BTW, The wiki-text from page you linked to contain names in a form that easy to extract using regexp... As well as titles of links in the rendered HTML page have “(name)” attached to the name itself.

Social Security Administration - Beyond the Top 1000 Names Data Files
The above is a comprehensive list of first names in use in the US. The zip files contain national and state-level data by year of birth in CSV format. It includes the number of occurrences (minimum 5) and gender. For example, the national file for 2010 includes 33,838 baby names.

Related

Using a thesaurus with full text search is not working, possibly I am loading US english instead of British English

I've created a thesaurus file and loaded it as per the example.
I have a database with a table that is full text indexed.
I loaded it using 1033 which is 'English' according to the MSDN article here
It took a long time to load (4 minutes) it's pretty huge, so I know that it loaded it.
There is also a file for British english which is possibly what my SQL is using which would explain why it didn't work. However the number isn't listed on that chart so I do not know how to load it.
Assuming that's the problem, all is dandy. However I also can't find anything that suggests the thesaurus will simply work automatically on any free text query eg. CONTAINS so I don't know if I have to do something to my free text catalog (couldn't see anything, but you never know).
Any ideas?
Additional information:
In my tsenu.xml file:
<XML ID="Microsoft Search Thesaurus">
<thesaurus xmlns="x-schema:tsSchema.xml">
<diacritics_sensitive>0</diacritics_sensitive>
<expansion>
<sub>John</sub>
<sub>Jon</sub>
</expansion>
</thesaurus>
</XML>
I then ran this:
EXEC sys.sp_fulltext_load_thesaurus_file 1033;
select * from Cats where contains(CatName , 'Jon', language 1033)
There are records for 'Jon' and 'John' but the results are not using the thesaurus to show me alternatives.
try
WHERE CONTAINS(CatName , 'FORMSOF (THESAURUS, Jon)')
using THESAURUS can be a bit tricky sometime, I havent used FTS for a long time but I remember sometimes , it will only return THESAURUS values not the actual value if it is contained in the data. for example if you are looking for thesaurus values for 'Jon' if there is a 'Jon' is data it will not return 'Jon' but will return all the Thesaurus for Jon, strange but this is how it works.
if you are looking for two words is something like
WHERE CONTAINS(CatName , '"Word1" AND "Word2"')
To search for John Smith I think you can do something like
WHERE CONTAINS(NAME, 'FORMSOF (THESAURUS, Jon) OR "Smith"')
Make sure to specify which language you want to use by using the optional language parameter when using Contains or FreeText
Contains and Free Text Information
I also looked at the chart and British English LCID is 2057.

What is the best way to handle Chinese names in Salesforce?

Salesforce separates family and personal names according to western convention.
That is the first name is personal and the last is a family name. This can be changed by changing the salesforce locale (say from the US to China) so that a the first name is the familyname and the last name(s) are personal
So in the vanilla SF John Smith appears as John Smith. If you switch to the Chinese localisation it would appear as Smith John.
Equally in vanilla Lim Keat Song would appear as Keat Song Lim, but would be correct in the chinese localisation as Lim Keat Song.
My problem is that about 30% of my contacts have East Asian names and so neither localisation is entirely satisfactory.
What are the the best ways of resolving this on a standard contact object?
I've asked the question on salesforce and as far as I can see there isn't much on this on google.
I'm asking this because whilst I can solve it - probably along the lines of the SD question - it's probably a known problem and I would like to find the best solution rather than reinventing the wheel.
Just for your idea.
We add two custom fields as Formula on the Contact object.
One custom field called Last Name refer to the Standard Last Name field on the Contact,
the other called First Name refer to the Standard First Name field on the Contact.
Therefore, you don't need to do any data import to these custom fields, but only define how to display your customized Last Name and First Name fields in the Contact display.
And this layout of Last Name and First Name will not change with users' localisation setting.
This may be not a perfect way to solve this, I wish it would help.

How to extract all the IUPAC names mentioned in the data available from Pubchem(NCBI) into a text file?

I want to build lists of prefixes and suffixes of some length from all the IUPAC names mentioned in Pubchem Database,so that I can use them further in my project as a feature.So I want all the IUPAC chemical names in a text file or in some format where I can extract these lists.
Thanks.
Sounds you need something like this Nist species list
You can search for most also in the Webbook but I failed to find a download link for the complete set.
In our lab we got a Cd(?) with the mass spectral database which contained the (complete? - well it got like 250.000 substances) database as text file. Maybe you can get that through some of the vendors.
The pubchem site offers you to download a dump of their data by ftp. Why not use that?
PubChem data can be downloaded via ftp from the PubChem site. A complete description of the available data can be obtained here: https://pubchemdocs.ncbi.nlm.nih.gov/downloads
Of particular interest for the question of IUPAC names, the data are downloadable from the "Compound Extras" section of the ftp site: ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/
The README-Extras file in this location describes the data in detail. For the IUPAC names, the following information is provided:
CID-IUPAC.gz:
This is a listing of all CIDs with their computed IUPAC names.
It is a gzipped text file with CID, tab, IUPAC on each line. Note
that the names may contain UTF8 characters.
A download today (23-Apr-2020) contains 102,586,778 rows. An excerpt of the information is shown below.
> head CID-IUPAC
1 3-acetyloxy-4-(trimethylazaniumyl)butanoate
2 (2-acetyloxy-3-carboxypropyl)-trimethylazanium
3 5,6-dihydroxycyclohexa-1,3-diene-1-carboxylic acid
4 1-aminopropan-2-ol
5 (3-amino-2-oxopropyl) dihydrogen phosphate
6 1-chloro-2,4-dinitrobenzene
7 9-ethylpurin-6-amine
8 2,3-dihydroxy-3-methylpentanoic acid
9 (2,3,4,5,6-pentahydroxycyclohexyl) dihydrogen phosphate
11 1,2-dichloroethane

Import CSV to class structure as the user defines

I have a contact manager program and I would like to offer the feature to import csv files. The problem is that different data sources order the fields in different ways. I thought of programming an interface for the user to tell it the field order and how to handle exceptions.
Here is an example line in one of many possible field orders:
"ID#","Name","Rank","Address1","Address2","City","State","Country","Zip","Phone#","Email","Join Date","Sponsor ID","Sponsor Name"
"Z1234","Call, Anson","STU","1234 E. 6578 S.","","Somecity","TX","United States","012345","000-000-0000","someemail#gmail.com","5/24/2010","z12343","Quantum Independence"
Notice that in one data field "Name" there is a comma to separate last name and first name and in another there is not.
My plan is to have a line for each field (ie ID, Name, City etc.) and a statement "import to" and list box with options like: Don't Import, Business>Join Date, First Name, Zip
and the program recognizes those as properties of an object...
I'd also like the user to be able to record preset field orders so they can re-use them for csv files from the same download source.
Then I also need it to check if a record all ready exists (is there a record for Anson Call all ready?) and allow the user to tell it what to do if there is a record (ie mailing address may have changes, so if that field is filled overwrite it, or this mailing address is invalid, leave the current data untouched for this person, overwrite the rest).
While I'm capable of coding this...i'm not very excited about it and I'm wondering if there's a tool or set of tools out there to all ready perform most of this functionality...
I hope this makes sense...
Is there a header row?
usually in CSV files, the first line is the header.
If so you could use the header line to determine the order, just have a list of column names, and only prompt the user if a column name does not match.(this could then be auto added into the predefined list).
EDIT:
even if a header does not exist, its simple enough to add one. The file can be manually edited. Alternatively in your program let the user define it (from your predefined list)
I can't find any tools all ready out there and no one has replied otherwise, so for the sake of leaving a question answered until otherwise notified the answer is: no.

Make SQL Server index small numbers

We're using SQL Server 2005 in a project. The users of the system have the ability to search some objects by using 'keywords'. The way we implement this is by creating a full-text catalog for the significant columns in each table that may contain these 'keywords' and then using CONTAINS to search for the keywords the user inputs in the search box in that index.
So, for example, let say you have the Movie object, and you want to let the user search for keywords in the title and body of the article, then we'd index both the Title and Plot column, and then do something like:
SELECT * FROM Movies WHERE CONTAINS(Title, keywords) OR CONTAINS(Plot, keywords)
(It's actually a bit more advanced than that, but nothing terribly complex)
Some users are adding numbers to their search, so for example they want to find 'Terminator 2'. The problem here is that, as far as I know, by default SQL Server won't index short words, thus doing a search like this:
SELECT * FROM Movies WHERE CONTAINS(Title, '"Terminator 2"')
is actually equivalent to doing this:
SELECT * FROM Movies WHERE CONTAINS(Title, '"Terminator"') <-- notice the missing '2'
and we are getting a plethora of spurious results.
Is there a way to force SQL Server to index small words? Preferably, I'd rather index only numbers like 1, 2, 21, etc. I don't know where to define the indexing criteria, or even if it's possible to be as specific as that.
Well, I did that, removed the "noise-words" from the list, and now the behaviour is a bit different, but still not what you'd expect.
A search won't for "Terminator 2" (I'm just making this up, my employer might not be really happy if I disclose what we are doing... anyway, the terms are a bit different but the principle the same), I don't get anything, but I know there are objects containing the two words.
Maybe I'm doing something wrong? I removed all numbers 1 ... 9 from my noise configuration for ENG, ENU and NEU (neutral), regenerated the indexes, and tried the search.
These "small words" are considered "noise words" by the full text index. You can customize the list of noise words. This blog post provides more details. You need to repopulate your full text index when you change the noise words file.
I knew about the noise words file, but I'm not why your "Terminator 2" example is still giving you issues. You might want to try asking this on the MSDN Database Engine forum where people that specialize in this sort of thing hang out.
You can combine CONTAINS (or CONTAINSTABLE) with simple where conditions:
SELECT * FROM Movies WHERE CONTAINS(Title, '"Terminator 2"') and Title like '%Terminator 2%'
While the CONTAINS find all Terminator the where will eliminate 'Terminator 1'.
Of course the engine is smart enough to start with the CONTAINS not the like condition.

Resources