text file of all titles / topic titles in Freebase - file

I need a text file to contain every title / title of each topic / title of each item in a .txt file each on its own line.
How can I do this or make this if I have already downloaded a freebase rdf dump?
If possible, I also need a separate text file with each topic's / item's description on a single line each description on its own line.
How can I do that?
I would greatly appreciate it if someone could help me make either of these files from a Freebase rdf dump.
Thanks in Advance!

Filter the RDF dump on the predicate/property ns:type.object.name. If you only want a particular language, also filter by that language e.g. #en.
EDIT: I missed the second part about descriptions being desired as well. Here's a three part regex which will get you all the lines with:
English names
English descriptions
a type of /commmon/topic
Combining the three is left as an exercise for the reader.
zegrep $'\tns:(((type\\.object\\.name|common\\.topic\\.description)\t.*#en)|type\\.object\\.type\tns:common\\.topic)\\.$' freebase-rdf-2013-06-30-00-00.gz | gzip > freebase-rdf-2013-06-30-00-00-names-descriptions.gz
It seems to have a performance issue that I'll have to look at. A simple grep of the entire file takes ~11 min on my laptop, but this has been running several times that. I'll have to look at it later though...

Related

CSV file not recognized as csv, reason nominal value not declared in header

I am trying to load a dataset in weka, I have tried many solutions such as arff format, comas etc. but it was all a failure. Could any of you give me a working solution or load this dataset according to the format.
Here is a link to dataset
Instead of using Weka's functionality for reading CSV files, you could use ADAMS (developed at the same university; I'm the lead developer) instead.
Download the adams-ml-app snapshot and then use the Weka Investigator to load/save the file:
Load it as ADAMS Spreadsheets (.csv, .csv.gz)
Save it as Arff data files (.arff, .arff.gz) or Simple ARFF data files (.arff, .arff.gz)
The Reviews column contains an erroneous 3.0M, which prevents it from becoming numeric.
If you want to have an introduction to the Weka Investigator, then take a look at my talk from the Weka User Conference 2021: Taking Weka to the next level with ADAMS .
There are too many issues with lines in this file.
In line 23, I eliminated the odd looking brackets.
I removed all single quotes (')
I eliminated all repeated double quotes ("")
In line 10474 the first two fields (before the number) didn't seem to be separated, so I added a comma.
This allowed the file to go through initial screening, but...
The file contains a lot of odd emojis. I started to eliminate them one by one, but there are clearly more of these than I wish to deal with.
Each time I got rid of one, it would read farther into the file, then stop at the next one.
If I just try to read the top of the file, the first 20 lines before we get to any of these problems, it reads fine.
My partial editing can be found here: https://www.dropbox.com/s/ij707mb23dt1jvz/googleplaystore3.csv?dl=0
I think if you clear up the remaining emojis the file should be usable.

How to extract all the IUPAC names mentioned in the data available from Pubchem(NCBI) into a text file?

I want to build lists of prefixes and suffixes of some length from all the IUPAC names mentioned in Pubchem Database,so that I can use them further in my project as a feature.So I want all the IUPAC chemical names in a text file or in some format where I can extract these lists.
Thanks.
Sounds you need something like this Nist species list
You can search for most also in the Webbook but I failed to find a download link for the complete set.
In our lab we got a Cd(?) with the mass spectral database which contained the (complete? - well it got like 250.000 substances) database as text file. Maybe you can get that through some of the vendors.
The pubchem site offers you to download a dump of their data by ftp. Why not use that?
PubChem data can be downloaded via ftp from the PubChem site. A complete description of the available data can be obtained here: https://pubchemdocs.ncbi.nlm.nih.gov/downloads
Of particular interest for the question of IUPAC names, the data are downloadable from the "Compound Extras" section of the ftp site: ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/
The README-Extras file in this location describes the data in detail. For the IUPAC names, the following information is provided:
CID-IUPAC.gz:
This is a listing of all CIDs with their computed IUPAC names.
It is a gzipped text file with CID, tab, IUPAC on each line. Note
that the names may contain UTF8 characters.
A download today (23-Apr-2020) contains 102,586,778 rows. An excerpt of the information is shown below.
> head CID-IUPAC
1 3-acetyloxy-4-(trimethylazaniumyl)butanoate
2 (2-acetyloxy-3-carboxypropyl)-trimethylazanium
3 5,6-dihydroxycyclohexa-1,3-diene-1-carboxylic acid
4 1-aminopropan-2-ol
5 (3-amino-2-oxopropyl) dihydrogen phosphate
6 1-chloro-2,4-dinitrobenzene
7 9-ethylpurin-6-amine
8 2,3-dihydroxy-3-methylpentanoic acid
9 (2,3,4,5,6-pentahydroxycyclohexyl) dihydrogen phosphate
11 1,2-dichloroethane

How can I use SQL Server's full text search across multiple rows at once?

I'm trying to improve the search functionality on my web forums. I've got a table of posts, and each post has (among other less interesting things):
PostID, a unique ID for the individual post.
ThreadID, an ID of the thread the post belongs to. There can be any number of posts per thread.
Text, because a forum would be really boring without it.
I want to write an efficient query that will search the threads in the forum for a series of words, and it should return a hit for any ThreadID for which there are posts that include all of the search words. For example, let's say that thread 9 has post 1001 with the word "cat" in it, and also post 1027 with the word "hat" in it. I want a search for cat hat to return a hit for thread 9.
This seems like a straightforward requirement, but I don't know of an efficient way to do it. Using the regular FREETEXT and CONTAINS capabilities for N'cat AND hat' won't return any hits in the above example because the words exist in different posts, even though those posts are in the same thread. (As far as I can tell, when using CREATE FULLTEXT INDEX I have to give it my index on the primary key PostID, and can't tell it to index all posts with the same ThreadID together.)
The solution that I currently have in place works, but sucks: maintain a separate table that contains the entire concatenated post text of every thread, and make a full text index on THAT. I'm looking for a solution that doesn't require me to keep a duplicate copy of the entire text of every thread in my forums. Any ideas? Am I missing something obvious?
As far as i can see there is no "easy" way of doing this.
I would create a stored procedure which simply splits up the search words and starts looking for the first word and put the threadid's in a table variable. Then you look for the other words (if any) in the threadids you just collected (inner join).
If intrested i can write a few bits of code but im guessing you wont need it.
What are you searching for?
CAT HAT as a complete word, in which case:
CONTAINS(*,'"CAT HAT")
CAT OR HAT then..
CONTAINS (*,'CAT OR HAT')
Searching for "CAT HAT" and expecting just the post with CAT in doesn't make any sense. If the problem is parsing what the user types, you could just replace SPACES with OR (to search any of the words, AND if both required). The OR will give you both posts for thread 9.
SELECT DISTINCT ThreadId
FROM Posts
WHERE CONTAINS (*,'"CAT OR HAT")
Better still you could , if it helps, use the brilliant irony (http://irony.codeplex.com/) which translates (parses) a search string into a Fulltext query. Might help for you.
Requires the use of google syntax for the original search which can only be a good thing as most people are used to typing in google searches.
Plus here is an article on how to use it.
http://www.sqlservercentral.com/articles/Full-Text+Search+(2008)/64248/

Where can I download a free, text-rich dataset?

I want to do a bit of lightweight testing and bench-marking for full-text search, so the dataset should have the qualities:
10,000 - 100,000 records.
good dispersion of English words.
In CSV or Excel format--i.e. I don't want to access it via API.
Something like books or movies with title and description fields would be perfect. I browsed the UCI Machine Learning Repo, but it was too number-oriented.
You can try
- CKAN
- or search for "Open Data"
Or, see Tim Berners-Lee discussing a quick survey of a few Open Data sets.
If you do not find one, you can create one using the LOREM IPSUM generator
T-SQL equivalent of =rand()
You can also get the full StackOverflow data dump
https://blog.stackoverflow.com/2009/06/stack-overflow-creative-commons-data-dump/
Use the Gutenberg Project. You have access to thousands of English book in plain text. That's what I've used once and was happy with it.

Plain, computer parseable lists of common first names?

I need a list of common first names for people, like "Bill", "Gordon", "Jane", etc. Is there some free list of lots of known names, instead of me having to type them out? Something that I can easily parse with the programme to fill in an array for example?
I'm not worried about:
Knowing if a name is masculine or feminine (or both)
If the dataset has a whole pile of false positives
If there are names that aren't on it, obviously no dataset like this will be complete.
If there are 'duplicates', i.e. I don't care if the dataset lists "Bill" and "William" and "Billy" as different names. I'd rather have more data than less
I don't care about knowing the popularity the name
I know Wikipedia has a list of most popular given names, but that's all in a HTML page and manged up with horrible wiki syntax. Is there a better way to get some sample data like this without having to screen scrape wikipedia?
A CSV from the General Register Office of Scotland with all the forenames registered there in 2007.
Another large set of first names in CSV format and SQL format too (but they didn't say which DB dumped the SQL).
GitHub page with the top 1000 baby names from 1880 to 2009, already parsed into a CSV for you from the Social Security Administration.
CSV of baby names and meanings from a Princeton CS page.
That ought to be enough to get you started, I'd think.
You can easily consume the Wikipedia API (http://en.wikipedia.org/w/api.php) to retrieve the list of pages in specific category, looks like Category:Given names is something you want to start from.
http://en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmnamespace=0&cmlimit=500&cmtitle=Category:Given_names
The part of result from this URL looks like this:
<cm pageid="5797824" ns="0" title="Abdou" />
<cm pageid="5797863" ns="0" title="Abdu" />
<cm pageid="859035" ns="0" title="Abdul Aziz" />
<cm pageid="6504818" ns="0" title="Abdul Qadir" />
Look at the API and select appropriate format and query parameters, and check categories.
P.S.
BTW, The wiki-text from page you linked to contain names in a form that easy to extract using regexp... As well as titles of links in the rendered HTML page have “(name)” attached to the name itself.
Social Security Administration - Beyond the Top 1000 Names Data Files
The above is a comprehensive list of first names in use in the US. The zip files contain national and state-level data by year of birth in CSV format. It includes the number of occurrences (minimum 5) and gender. For example, the national file for 2010 includes 33,838 baby names.

Resources