How to extract all the IUPAC names mentioned in the data available from PubChem (NCBI) into a text file? - database

I want to build lists of prefixes and suffixes of some length from all the IUPAC names in the PubChem database, so that I can use them as a feature in my project. So I want all the IUPAC chemical names in a text file, or in some other format from which I can extract these lists.
Thanks.

Sounds like you need something like this NIST species list.
You can also search for most of them in the WebBook, but I failed to find a download link for the complete set.
In our lab we got a CD(?) with the mass spectral database, which included the database as a text file (complete? Well, it had about 250,000 substances). Maybe you can get that through one of the vendors.

The PubChem site lets you download a dump of its data by FTP. Why not use that?

PubChem data can be downloaded via ftp from the PubChem site. A complete description of the available data can be obtained here: https://pubchemdocs.ncbi.nlm.nih.gov/downloads
Of particular interest for the question of IUPAC names, the data are downloadable from the "Compound Extras" section of the ftp site: ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/
The README-Extras file in this location describes the data in detail. For the IUPAC names, the following information is provided:
CID-IUPAC.gz:
This is a listing of all CIDs with their computed IUPAC names.
It is a gzipped text file with CID, tab, IUPAC on each line. Note
that the names may contain UTF8 characters.
A download today (23-Apr-2020) contains 102,586,778 rows. An excerpt of the information is shown below.
> head CID-IUPAC
1 3-acetyloxy-4-(trimethylazaniumyl)butanoate
2 (2-acetyloxy-3-carboxypropyl)-trimethylazanium
3 5,6-dihydroxycyclohexa-1,3-diene-1-carboxylic acid
4 1-aminopropan-2-ol
5 (3-amino-2-oxopropyl) dihydrogen phosphate
6 1-chloro-2,4-dinitrobenzene
7 9-ethylpurin-6-amine
8 2,3-dihydroxy-3-methylpentanoic acid
9 (2,3,4,5,6-pentahydroxycyclohexyl) dihydrogen phosphate
11 1,2-dichloroethane
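
To get from this file to the prefix and suffix lists the question asks for, something like the following Python sketch might work. It assumes the "CID, tab, IUPAC" layout described in the README; the function names and the fixed affix length are illustrative choices, not part of PubChem's tooling.

```python
import gzip
from collections import Counter


def iter_iupac_names(lines):
    """Yield the IUPAC name from each 'CID<TAB>name' line, skipping malformed rows."""
    for line in lines:
        cid, sep, name = line.rstrip("\n").partition("\t")
        if sep and name:
            yield name


def affix_counts(names, length):
    """Count prefixes and suffixes of exactly `length` characters."""
    prefixes, suffixes = Counter(), Counter()
    for name in names:
        if len(name) >= length:
            prefixes[name[:length]] += 1
            suffixes[name[-length:]] += 1
    return prefixes, suffixes


def affixes_from_file(path, length):
    """Stream a gzipped CID-IUPAC file without loading it into memory."""
    # The README notes that names may contain UTF-8 characters.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return affix_counts(iter_iupac_names(handle), length)


# Three rows from the excerpt above, with the tab written out explicitly.
sample = [
    "4\t1-aminopropan-2-ol\n",
    "6\t1-chloro-2,4-dinitrobenzene\n",
    "11\t1,2-dichloroethane\n",
]
prefixes, suffixes = affix_counts(iter_iupac_names(sample), 3)
```

For the full download, `affixes_from_file("CID-IUPAC.gz", 3)` would stream the file row by row; with roughly 100 million rows, counting in a single pass like this avoids holding all the names in memory.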

Related

Mixed Service Codes/Descriptions database

I am facing an issue regarding a significantly large database that I have to reorganize. There are two columns, one consists of the Service Code of an item and next is a column containing the Description of the relevant item. Below is an example:
TSB Trim Booklet
LMN Loading Manual
GLM Grain Loading Manual
etc.
There are a total of 170 different items.
The problem is this: in a different Excel file, there is a column containing only the Descriptions of the items (about 16,000 rows, in mixed order), without the 3-letter Service Code.
How can I link them quickly?
Assumptions: you want to take the service code from file 1 and apply it to the descriptions from file 2 and a single description always has the same service code.
Use the following formula in file 2 (the big one you want to add service codes to)
=INDEX([file1]Sheetname!$A:$A,MATCH([file2]Sheetname!A2,[file1]Sheetname!$B:$B,0))
Where
[file1]Sheetname!$A:$A
is the column with service codes in the file/sheet with both the code and the description
[file2]Sheetname!A2
is the cell with description in the file/sheet with just descriptions
and
[file1]Sheetname!$B:$B
is the column with descriptions in the file/sheet with both the code and the description
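
If the two sheets are exported to text first, the same lookup can be sketched outside Excel. The Python below is only an illustration of what the INDEX/MATCH formula does; the function names and the "#N/A" placeholder are assumptions. Unlike MATCH, the dictionary lookup here is case-sensitive, and it trims surrounding whitespace.

```python
def build_lookup(code_rows):
    """Map description -> service code from (code, description) pairs (file 1)."""
    return {description.strip(): code.strip() for code, description in code_rows}


def attach_codes(descriptions, lookup, missing="#N/A"):
    """Pair every description from file 2 with its code; unmatched rows get `missing`."""
    return [(lookup.get(d.strip(), missing), d) for d in descriptions]


# File 1: the 170-item table of codes and descriptions (three rows shown).
file1 = [("TSB", "Trim Booklet"), ("LMN", "Loading Manual"), ("GLM", "Grain Loading Manual")]
# File 2: descriptions only, in mixed order; the last one is a made-up non-match.
file2 = ["Grain Loading Manual", "Trim Booklet", "Trim Booklet", "Unknown Item"]

result = attach_codes(file2, build_lookup(file1))
```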

text file of all titles / topic titles in Freebase

I need a .txt file containing the title of every topic / item, each on its own line.
How can I make this if I have already downloaded a Freebase RDF dump?
If possible, I also need a separate text file with each topic's / item's description, one description per line.
How can I do that?
I would greatly appreciate it if someone could help me make either of these files from a Freebase rdf dump.
Thanks in Advance!
Filter the RDF dump on the predicate/property ns:type.object.name. If you only want a particular language, also filter by that language e.g. #en.
EDIT: I missed the second part about descriptions being desired as well. Here's a three-part regex which will get you all the lines with:
English names
English descriptions
a type of /common/topic
Combining the three is left as an exercise for the reader.
zegrep $'\tns:(((type\\.object\\.name|common\\.topic\\.description)\t.*#en)|type\\.object\\.type\tns:common\\.topic)\\.$' freebase-rdf-2013-06-30-00-00.gz | gzip > freebase-rdf-2013-06-30-00-00-names-descriptions.gz
It seems to have a performance issue that I'll have to look at. A simple grep of the entire file takes ~11 min on my laptop, but this has been running for several times that long. I'll have to look at it later though...
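
As an alternative to a single large regex, the same filtering could be done with plain string comparisons while streaming the dump once, writing names and descriptions to the two separate files the question asks for. The Python below is a rough sketch under assumptions taken from the regex above: tab-separated subject, predicate and object, a trailing "." on each statement, and "#en" as the language marker. The sample lines in the usage example are made up to fit that shape, and the function names are hypothetical. It does not apply the /common/topic type filter, which would need the subjects to be joined across lines.

```python
import gzip

NAME = "ns:type.object.name"
DESCRIPTION = "ns:common.topic.description"
LANG_SUFFIX = "#en"  # language marker as written in the answer's regex


def classify(line):
    """Return ('name' | 'description', subject, value), or None for other lines."""
    parts = line.rstrip("\n").split("\t", 2)
    if len(parts) != 3 or not parts[2].endswith("."):
        return None
    subject, predicate, obj = parts
    obj = obj[:-1].rstrip()  # drop the statement-terminating "."
    if not obj.endswith(LANG_SUFFIX):
        return None
    if predicate == NAME:
        return ("name", subject, obj)
    if predicate == DESCRIPTION:
        return ("description", subject, obj)
    return None


def split_dump(dump_path, names_path, descriptions_path):
    """Stream the gzipped dump once, writing one value per line to each output file."""
    with gzip.open(dump_path, "rt", encoding="utf-8") as dump, \
         open(names_path, "w", encoding="utf-8") as names, \
         open(descriptions_path, "w", encoding="utf-8") as descriptions:
        for line in dump:
            hit = classify(line)
            if hit:
                target = names if hit[0] == "name" else descriptions
                target.write(hit[2] + "\n")


# Hypothetical lines shaped to match the regex in the answer.
name_line = 'ns:m.01\tns:type.object.name\t"Example"#en.\n'
type_line = "ns:m.01\tns:type.object.type\tns:common.topic.\n"
```

The values are written as they appear in the dump, including quotes and the language marker, so a further cleanup step would still be needed to get bare titles.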

Reading and writing to xls and doc files in c

I have this particular problem where I have to write a C program that reads numerical data from a text file. The data is tab-delimited. Here is a sample from the text file.
1 23099 345565 345569
2 908 66766 66768
This is client data, with one row per client. The columns are customer number, previous balance, previous reading, and current reading. I then have to generate a .doc document that summarizes all this information and calculates the balance.
I can write a function that does the calculation, but how do I create an xls document and a Word document that summarize the results from within the program? The text file contains only numerical data. Any ideas?
The easiest way is to create a CSV file rather than an xls file.
Office can open CSV files with good results.
And it is far easier to write an ASCII text file with comma-separated values
than to write a closed format like the MS Office formats.
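
The CSV idea is the same in any language; the sketch below uses Python for brevity, and in C the same output could be produced with one fprintf call per row. The column headers and the extra "units_used" column (current reading minus previous reading) are assumptions added for illustration. The rule for computing the balance is not given in the question, so it is left out.

```python
import csv
import io


def rows_to_csv(tab_text):
    """Convert tab-delimited client rows to CSV text with a header and a usage column."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(
        ["customer_no", "previous_balance", "previous_reading", "current_reading", "units_used"]
    )
    for line in tab_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        customer, balance, previous, current = (int(field) for field in line.split("\t"))
        writer.writerow([customer, balance, previous, current, current - previous])
    return out.getvalue()


# The two sample rows from the question, with tabs written out explicitly.
sample = "1\t23099\t345565\t345569\n2\t908\t66766\t66768\n"
print(rows_to_csv(sample))
```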
The simplest way to create a spreadsheet that contains formulas and formatting, and can be opened by Excel, is to create an XML Spreadsheet file.

Need to extract/consolidate info from database files

Here's a summary of my problem:
Our company's old software had a large database of contacts in it.
We switched to a new program and have no way to easily transfer those contacts to it.
The contacts database appears to have 4 files which can all be opened in Excel, but not MSAccess. The four files contain the following:
File 1: A nicely formatted spreadsheet of names and some other BASIC info for each contact. There is an ID number on each one, but the numbers do not seem to correspond to anything in File 2.
File 2: Info on each contact, but not in rows. Instead it looks something like this :
JHGH_CONTACT_BLOB: 1426367745
EMAIL: SMITH
WEB:
PHONE_COUNT: 1
FAX_COUNT: 0
ADDRESS_COUNT: 0
NOTE_COUNT: 0
555-7364
(I changed some info for privacy reasons)
Each blob of info is on a separate spreadsheet row. Each starts off with the same first line, even the number is the same, so it can't be some sort of ID number.
File 3: A file containing a lot of gobbledygook, interspersed with a few readable bits of text here and there. The readable text looks like it belongs to the database (ie, it is info on contacts like place of work and other notes.)
File 4: Contains one row and one column labeled ID, with the number 12725 in it.
I need to somehow get the info from File 2, into the nicely formatted file 1. In essence, I need to add the phone numbers, emails etc included in a messy fashion in file 2 on their proper rows in file 1.
This probably makes little sense and I thank you for even reading down this far. If you have any suggestions, I'd love to hear them.
Thanks
We have established that you have a DBF file, an FPT file and a CDX file. These are likely to all relate to Visual FoxPro (a now discontinued Microsoft product).
The .dbf file can be opened in Excel via the standard file open dialog by changing "Files of type" to "dBase files (*.dbf)". Going by your original post, Excel seems to be able to open this sensibly in the first place.
The combination of all three files might be accessible by downloading this OLE DB provider for FoxPro which would let you access the database from Excel using the methods outlined here
You can get more info on the specific file structures at the following links: DBF, FPT and CDX. The DBF contains most of the data, the FPT contains binary memo data and the CDX is an index file.

Plain, computer parseable lists of common first names?

I need a list of common first names for people, like "Bill", "Gordon", "Jane", etc. Is there some free list of lots of known names, instead of me having to type them out? Something that I can easily parse with the programme to fill in an array for example?
I'm not worried about:
Knowing if a name is masculine or feminine (or both)
If the dataset has a whole pile of false positives
If there are names that aren't on it, obviously no dataset like this will be complete.
If there are 'duplicates', i.e. I don't care if the dataset lists "Bill" and "William" and "Billy" as different names. I'd rather have more data than less
I don't care about knowing the popularity of the name
I know Wikipedia has a list of most popular given names, but that's all in an HTML page and mangled up with horrible wiki syntax. Is there a better way to get some sample data like this without having to screen-scrape Wikipedia?
A CSV from the General Register Office of Scotland with all the forenames registered there in 2007.
Another large set of first names in CSV format and SQL format too (but they didn't say which DB dumped the SQL).
GitHub page with the top 1000 baby names from 1880 to 2009, already parsed into a CSV for you from the Social Security Administration.
CSV of baby names and meanings from a Princeton CS page.
That ought to be enough to get you started, I'd think.
You can easily use the Wikipedia API (http://en.wikipedia.org/w/api.php) to retrieve the list of pages in a specific category. Category:Given names looks like a good place to start.
http://en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmnamespace=0&cmlimit=500&cmtitle=Category:Given_names
The part of result from this URL looks like this:
<cm pageid="5797824" ns="0" title="Abdou" />
<cm pageid="5797863" ns="0" title="Abdu" />
<cm pageid="859035" ns="0" title="Abdul Aziz" />
<cm pageid="6504818" ns="0" title="Abdul Qadir" />
Look at the API and select appropriate format and query parameters, and check categories.
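
A minimal Python sketch of consuming that result follows. It builds the same query URL with XML output requested and pulls the title attribute out of each cm element. The wrapper elements in the sample and the function names are assumptions; the network fetch itself is left out, and results beyond the first 500 entries would need the API's continuation mechanism.

```python
import urllib.parse
import xml.etree.ElementTree as ET

API = "http://en.wikipedia.org/w/api.php"


def category_url(category, limit=500):
    """Build the categorymembers query URL from the answer, asking for XML output."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmnamespace": 0,
        "cmlimit": limit,
        "cmtitle": category,
        "format": "xml",
    }
    return API + "?" + urllib.parse.urlencode(params)


def titles_from_xml(xml_text):
    """Return the title attribute of every <cm> element in an API response."""
    root = ET.fromstring(xml_text)
    return [cm.get("title") for cm in root.iter("cm")]


# The <cm> rows are from the answer; the enclosing elements are assumed.
sample = """<api><query><categorymembers>
<cm pageid="5797824" ns="0" title="Abdou" />
<cm pageid="5797863" ns="0" title="Abdu" />
<cm pageid="859035" ns="0" title="Abdul Aziz" />
</categorymembers></query></api>"""
print(titles_from_xml(sample))
```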
P.S.
BTW, the wikitext of the page you linked to contains the names in a form that is easy to extract with a regexp. In addition, the link titles in the rendered HTML page have “(name)” appended to the name itself.
Social Security Administration - Beyond the Top 1000 Names Data Files
The above is a comprehensive list of first names in use in the US. The zip files contain national and state-level data by year of birth in CSV format. It includes the number of occurrences (minimum 5) and gender. For example, the national file for 2010 includes 33,838 baby names.
