Where can I download a free, text-rich dataset? - database

I want to do a bit of lightweight testing and benchmarking for full-text search, so the dataset should have the following qualities:
10,000 - 100,000 records.
good dispersion of English words.
In CSV or Excel format--i.e. I don't want to access it via API.
Something like books or movies with title and description fields would be perfect. I browsed the UCI Machine Learning Repo, but it was too number-oriented.

You can try
- CKAN
- or search for "Open Data"
Or, see Tim Berners-Lee discussing a quick survey of a few Open Data sets.

If you do not find one, you can create one using the LOREM IPSUM generator
T-SQL equivalent of =rand()
You can also get the full StackOverflow data dump
https://blog.stackoverflow.com/2009/06/stack-overflow-creative-commons-data-dump/

Use Project Gutenberg. It gives you access to thousands of English books in plain text. I've used it once and was happy with it.
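If you go the plain-text route, you still need to turn each book into CSV records. Here is a minimal Python sketch that cuts one book into paragraph-sized rows with a title and a description column. The function names, the column names and the `min_words` cut-off are my own choices for illustration, and the sample text is a stand-in for a downloaded file.

```python
import csv
import io
import re


def text_to_records(title, text, min_words=5):
    """Split one plain-text book into (id, title, description) rows, one per paragraph."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = re.split(r"\n\s*\n", text)
    records = []
    for paragraph in paragraphs:
        # Collapse hard line wraps and runs of whitespace into single spaces.
        body = " ".join(paragraph.split())
        # Skip very short fragments such as chapter numbers.
        if len(body.split()) >= min_words:
            records.append((len(records) + 1, title, body))
    return records


def write_csv(records, handle):
    """Write the rows as CSV, preceded by a header line."""
    writer = csv.writer(handle)
    writer.writerow(["id", "title", "description"])
    writer.writerows(records)


# Stand-in for the contents of a downloaded book.
sample = (
    "It was the best of times,\nit was the worst of times.\n\n"
    "Short line.\n\n"
    "It was the age of wisdom, it was the age of foolishness."
)
buffer = io.StringIO()
write_csv(text_to_records("Sample Book", sample), buffer)
print(buffer.getvalue())
```

Run it over a few dozen books and you should land in the 10,000 - 100,000 record range; swap the `io.StringIO` buffer for a real file opened with `newline=""` to get the CSV on disk.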

Related

Using a thesaurus with full-text search is not working; possibly I am loading US English instead of British English

I've created a thesaurus file and loaded it as per the example.
I have a database with a table that is full text indexed.
I loaded it using 1033 which is 'English' according to the MSDN article here
It took a long time to load (4 minutes), as it's pretty huge, so I know that it loaded.
There is also a file for British English, which is possibly what my SQL Server is using; that would explain why it didn't work. However, its number isn't listed on that chart, so I do not know how to load it.
Assuming that's the problem, all is dandy. However, I also can't find anything that suggests the thesaurus will simply work automatically on any free text query, e.g. CONTAINS, so I don't know if I have to do something to my free text catalog (couldn't see anything, but you never know).
Any ideas?
Additional information:
In my tsenu.xml file:
<XML ID="Microsoft Search Thesaurus">
<thesaurus xmlns="x-schema:tsSchema.xml">
<diacritics_sensitive>0</diacritics_sensitive>
<expansion>
<sub>John</sub>
<sub>Jon</sub>
</expansion>
</thesaurus>
</XML>
I then ran this:
EXEC sys.sp_fulltext_load_thesaurus_file 1033;
select * from Cats where contains(CatName , 'Jon', language 1033)
There are records for 'Jon' and 'John' but the results are not using the thesaurus to show me alternatives.
Try:
WHERE CONTAINS(CatName , 'FORMSOF (THESAURUS, Jon)')
Using THESAURUS can be a bit tricky sometimes. I haven't used FTS for a long time, but I remember that sometimes it will return only the thesaurus values and not the actual value, even if it is contained in the data. For example, if you are looking for thesaurus values for 'Jon' and there is a 'Jon' in the data, it will not return 'Jon' but will return all the thesaurus entries for Jon. Strange, but this is how it works.
If you are looking for two words, use something like
WHERE CONTAINS(CatName , '"Word1" AND "Word2"')
To search for John Smith I think you can do something like
WHERE CONTAINS(NAME, 'FORMSOF (THESAURUS, Jon) OR "Smith"')
Make sure to specify which language you want to use by using the optional language parameter when using Contains or FreeText
Contains and Free Text Information
I also looked at the chart and British English LCID is 2057.
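If the thesaurus grows beyond a couple of entries, generating the file is less error-prone than editing it by hand. Here is a rough Python sketch that reproduces the layout of the tsenu.xml shown in the question; `build_thesaurus_xml` is a made-up helper name, and it only covers expansion sets.

```python
from xml.sax.saxutils import escape


def build_thesaurus_xml(expansion_sets, diacritics_sensitive=False):
    """Return thesaurus XML in the same layout as the tsenu.xml from the question."""
    lines = [
        '<XML ID="Microsoft Search Thesaurus">',
        '<thesaurus xmlns="x-schema:tsSchema.xml">',
        "<diacritics_sensitive>%d</diacritics_sensitive>" % int(diacritics_sensitive),
    ]
    for terms in expansion_sets:
        # One <expansion> element per group of interchangeable terms.
        lines.append("<expansion>")
        for term in terms:
            # Escape &, < and > so the file stays well-formed XML.
            lines.append("<sub>%s</sub>" % escape(term))
        lines.append("</expansion>")
    lines.append("</thesaurus>")
    lines.append("</XML>")
    return "\n".join(lines)


print(build_thesaurus_xml([["John", "Jon"]]))
```

After saving the output over the thesaurus file for your language, it still has to be loaded with sys.sp_fulltext_load_thesaurus_file as in the question. As far as I remember, SQL Server wants the file saved as Unicode with a byte order mark, so check the encoding when you write it out.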

text file of all titles / topic titles in Freebase

I need a .txt file containing the title of every topic / item, each on its own line.
How can I make this if I have already downloaded a Freebase RDF dump?
If possible, I also need a separate text file with each topic's / item's description, one description per line.
How can I do that?
I would greatly appreciate it if someone could help me make either of these files from a Freebase rdf dump.
Thanks in Advance!
Filter the RDF dump on the predicate/property ns:type.object.name. If you only want a particular language, also filter by that language e.g. #en.
EDIT: I missed the second part about descriptions being desired as well. Here's a three part regex which will get you all the lines with:
English names
English descriptions
a type of /common/topic
Combining the three is left as an exercise for the reader.
zegrep $'\tns:(((type\\.object\\.name|common\\.topic\\.description)\t.*#en)|type\\.object\\.type\tns:common\\.topic)\\.$' freebase-rdf-2013-06-30-00-00.gz | gzip > freebase-rdf-2013-06-30-00-00-names-descriptions.gz
It seems to have a performance issue that I'll have to look at. A simple grep of the entire file takes ~11 min on my laptop, but this has been running several times that. I'll have to look at it later though...
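If you would rather end up with the bare titles or descriptions than with the raw matching lines, a small streaming script can do the same filtering and strip the RDF syntax. This is only a sketch under assumptions read off the regex above: each line is subject, predicate and object separated by tabs, the object is a quoted literal followed by the language marker, and the line ends with a period. The function names are mine, and the `#en` marker is taken from the answer; pass a different `lang_suffix` if your dump tags languages differently.

```python
import gzip

NAME = "ns:type.object.name"
DESCRIPTION = "ns:common.topic.description"


def extract_literals(lines, predicate, lang_suffix="#en"):
    """Yield (subject, text) for every triple with the given predicate and language."""
    for line in lines:
        parts = line.rstrip("\n").split("\t", 2)
        if len(parts) != 3 or parts[1] != predicate:
            continue
        subject, obj = parts[0], parts[2].rstrip()
        # Drop the period that terminates the triple.
        if obj.endswith("."):
            obj = obj[:-1].rstrip()
        if not obj.endswith(lang_suffix):
            continue
        literal = obj[: -len(lang_suffix)]
        # Remove the quotes around the literal, if present.
        if len(literal) >= 2 and literal[0] == '"' and literal[-1] == '"':
            literal = literal[1:-1]
        yield subject, literal


def write_literals(dump_path, out_path, predicate):
    """Stream a gzipped dump and write one literal per line to out_path."""
    with gzip.open(dump_path, "rt", encoding="utf-8") as dump, \
            open(out_path, "w", encoding="utf-8") as out:
        for _subject, text in extract_literals(dump, predicate):
            out.write(text + "\n")


# Made-up lines in the assumed format, just to show the behaviour.
sample = [
    'ns:m.01\tns:type.object.name\t"Cat"#en.\n',
    'ns:m.01\tns:type.object.name\t"Chat"#fr.\n',
    'ns:m.01\tns:common.topic.description\t"A small mammal."#en.\n',
    'ns:m.01\tns:type.object.type\tns:common.topic.\n',
]
print(list(extract_literals(sample, NAME)))
print(list(extract_literals(sample, DESCRIPTION)))
```

Calling `write_literals` once with `NAME` and once with `DESCRIPTION` gives the two text files asked for. It does not restrict the output to /common/topic subjects; for that you would first collect the subjects that carry that type and filter on them.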

How can I use SQL Server's full text search across multiple rows at once?

I'm trying to improve the search functionality on my web forums. I've got a table of posts, and each post has (among other less interesting things):
PostID, a unique ID for the individual post.
ThreadID, an ID of the thread the post belongs to. There can be any number of posts per thread.
Text, because a forum would be really boring without it.
I want to write an efficient query that will search the threads in the forum for a series of words, and it should return a hit for any ThreadID for which there are posts that include all of the search words. For example, let's say that thread 9 has post 1001 with the word "cat" in it, and also post 1027 with the word "hat" in it. I want a search for cat hat to return a hit for thread 9.
This seems like a straightforward requirement, but I don't know of an efficient way to do it. Using the regular FREETEXT and CONTAINS capabilities for N'cat AND hat' won't return any hits in the above example because the words exist in different posts, even though those posts are in the same thread. (As far as I can tell, when using CREATE FULLTEXT INDEX I have to give it my index on the primary key PostID, and can't tell it to index all posts with the same ThreadID together.)
The solution that I currently have in place works, but sucks: maintain a separate table that contains the entire concatenated post text of every thread, and make a full text index on THAT. I'm looking for a solution that doesn't require me to keep a duplicate copy of the entire text of every thread in my forums. Any ideas? Am I missing something obvious?
As far as I can see there is no "easy" way of doing this.
I would create a stored procedure which simply splits up the search words, looks for the first word, and puts the matching ThreadIDs in a table variable. Then you look for the other words (if any) in the ThreadIDs you just collected (inner join).
If interested I can write a few bits of code, but I'm guessing you won't need it.
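To make the idea concrete, here is the logic of that procedure sketched in Python rather than T-SQL. The in-memory list of posts and the naive whitespace word match stand in for the Posts table and the per-word full-text lookup; `threads_matching_all` is just an illustrative name.

```python
def threads_matching_all(posts, words):
    """Return the ThreadIDs whose posts, taken together, contain every search word.

    posts is a list of (post_id, thread_id, text) tuples.
    """
    matching = None
    for word in words:
        needle = word.lower()
        # Stand-in for one full-text lookup: threads with a post containing this word.
        hits = {
            thread_id
            for _post_id, thread_id, text in posts
            if needle in text.lower().split()
        }
        # Equivalent of the inner join: keep only threads that matched every word so far.
        matching = hits if matching is None else matching & hits
        if not matching:
            break
    return matching or set()


posts = [
    (1001, 9, "my cat sat on the mat"),
    (1027, 9, "that is a nice hat"),
    (1050, 10, "another cat story"),
]
print(threads_matching_all(posts, ["cat", "hat"]))  # {9}
```

Thread 9 is returned even though "cat" and "hat" sit in different posts, while thread 10 drops out at the second word.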
What are you searching for?
CAT HAT as a complete word, in which case:
CONTAINS(*,'"CAT HAT"')
CAT OR HAT then..
CONTAINS (*,'CAT OR HAT')
Searching for "CAT HAT" and expecting just the post with CAT in doesn't make any sense. If the problem is parsing what the user types, you could just replace SPACES with OR (to search any of the words, AND if both required). The OR will give you both posts for thread 9.
SELECT DISTINCT ThreadId
FROM Posts
WHERE CONTAINS (*,'CAT OR HAT')
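The "replace spaces with OR" step happens in application code before the query runs. A minimal Python sketch of it follows; the function name is hypothetical, and it only handles plain words, not phrases or prefix terms.

```python
def build_contains_condition(user_input, operator="OR"):
    """Turn 'cat hat' into a CONTAINS search condition such as '"cat" OR "hat"'."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be AND or OR")
    terms = []
    for word in user_input.split():
        # Drop double quotes so a word cannot close the quoted term early.
        cleaned = word.replace('"', "")
        if cleaned:
            terms.append('"%s"' % cleaned)
    return (" %s " % operator).join(terms)


print(build_contains_condition("cat hat"))         # "cat" OR "hat"
print(build_contains_condition("cat hat", "AND"))  # "cat" AND "hat"
```

Pass the resulting string to the query as a parameter, e.g. CONTAINS(*, @condition), rather than concatenating it into the SQL text.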
Better still, you could use the brilliant Irony (http://irony.codeplex.com/), which translates (parses) a search string into a full-text query. It might help you.
It requires the use of Google syntax for the original search, which can only be a good thing as most people are used to typing in Google searches.
Plus, here is an article on how to use it:
http://www.sqlservercentral.com/articles/Full-Text+Search+(2008)/64248/

Plain, computer parseable lists of common first names?

I need a list of common first names for people, like "Bill", "Gordon", "Jane", etc. Is there some free list of lots of known names, instead of me having to type them out? Something that I can easily parse with the programme to fill in an array for example?
I'm not worried about:
Knowing if a name is masculine or feminine (or both)
If the dataset has a whole pile of false positives
If there are names that aren't on it, obviously no dataset like this will be complete.
If there are 'duplicates', i.e. I don't care if the dataset lists "Bill" and "William" and "Billy" as different names. I'd rather have more data than less
I don't care about knowing the popularity of the name
I know Wikipedia has a list of most popular given names, but that's all in an HTML page and mangled up with horrible wiki syntax. Is there a better way to get some sample data like this without having to screen scrape Wikipedia?
A CSV from the General Register Office of Scotland with all the forenames registered there in 2007.
Another large set of first names in CSV format and SQL format too (but they didn't say which DB dumped the SQL).
GitHub page with the top 1000 baby names from 1880 to 2009, already parsed into a CSV for you from the Social Security Administration.
CSV of baby names and meanings from a Princeton CS page.
That ought to be enough to get you started, I'd think.
You can easily consume the Wikipedia API (http://en.wikipedia.org/w/api.php) to retrieve the list of pages in a specific category; Category:Given names looks like a good place to start.
http://en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmnamespace=0&cmlimit=500&cmtitle=Category:Given_names
Part of the result from this URL looks like this:
<cm pageid="5797824" ns="0" title="Abdou" />
<cm pageid="5797863" ns="0" title="Abdu" />
<cm pageid="859035" ns="0" title="Abdul Aziz" />
<cm pageid="6504818" ns="0" title="Abdul Qadir" />
Look at the API and select the appropriate format and query parameters, and check the categories.
P.S.
BTW, the wikitext of the page you linked to contains the names in a form that is easy to extract using a regexp. Also, the titles of links in the rendered HTML page have “(name)” attached to the name itself.
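Pulling the names out of an XML response like the one above takes only a few lines. A small Python sketch follows; the wrapper elements around the `<cm>` entries in the sample are my assumption about the full response, and the code does not depend on them.

```python
import xml.etree.ElementTree as ET


def titles_from_categorymembers(xml_text):
    """Return the title attribute of every <cm> element in an API response."""
    root = ET.fromstring(xml_text)
    # iter() walks the whole tree, so the exact nesting of <cm> does not matter.
    return [cm.get("title") for cm in root.iter("cm")]


# The <cm> lines are the ones quoted above; the wrapper is assumed.
sample = """<api><query><categorymembers>
<cm pageid="5797824" ns="0" title="Abdou" />
<cm pageid="5797863" ns="0" title="Abdu" />
<cm pageid="859035" ns="0" title="Abdul Aziz" />
<cm pageid="6504818" ns="0" title="Abdul Qadir" />
</categorymembers></query></api>"""

print(titles_from_categorymembers(sample))
```

With cmlimit=500 you only get the first batch of pages, so a real script would fetch the following pages of results as well.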
Social Security Administration - Beyond the Top 1000 Names Data Files
The above is a comprehensive list of first names in use in the US. The zip files contain national and state-level data by year of birth in CSV format. It includes the number of occurrences (minimum 5) and gender. For example, the national file for 2010 includes 33,838 baby names.
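Loading one of those files into an array is straightforward. Here is a minimal Python sketch, assuming each row is name, sex, count in that order (check the readme in the zip); the sample rows are made up for illustration and `load_names` is a hypothetical helper.

```python
import csv
import io


def load_names(handle, min_count=5):
    """Read name,sex,count rows and return the distinct names in file order."""
    names = []
    seen = set()
    for row in csv.reader(handle):
        # Skip blank or malformed rows.
        if len(row) < 3 or not row[2].strip().isdigit():
            continue
        name, count = row[0].strip(), int(row[2])
        if count < min_count or not name:
            continue
        # A name can be listed once per sex; keep a single copy.
        if name not in seen:
            seen.add(name)
            names.append(name)
    return names


# Made-up sample rows in the assumed layout.
sample = io.StringIO("Jane,F,120\nBill,M,95\nGordon,M,40\nJane,M,7\n")
print(load_names(sample))  # ['Jane', 'Bill', 'Gordon']
```

For a real file, replace the `io.StringIO` sample with the file opened using `newline=""`.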

Make SQL Server index small numbers

We're using SQL Server 2005 in a project. The users of the system have the ability to search some objects by using 'keywords'. The way we implement this is by creating a full-text catalog for the significant columns in each table that may contain these 'keywords' and then using CONTAINS to search for the keywords the user inputs in the search box in that index.
So, for example, let's say you have the Movie object, and you want to let the user search for keywords in the title and plot of the movie. Then we'd index both the Title and Plot columns, and then do something like:
SELECT * FROM Movies WHERE CONTAINS(Title, keywords) OR CONTAINS(Plot, keywords)
(It's actually a bit more advanced than that, but nothing terribly complex)
Some users are adding numbers to their search, so for example they want to find 'Terminator 2'. The problem here is that, as far as I know, by default SQL Server won't index short words, thus doing a search like this:
SELECT * FROM Movies WHERE CONTAINS(Title, '"Terminator 2"')
is actually equivalent to doing this:
SELECT * FROM Movies WHERE CONTAINS(Title, '"Terminator"') -- notice the missing '2'
and we are getting a plethora of spurious results.
Is there a way to force SQL Server to index small words? Preferably, I'd rather index only numbers like 1, 2, 21, etc. I don't know where to define the indexing criteria, or even if it's possible to be as specific as that.
Well, I did that, removed the "noise-words" from the list, and now the behaviour is a bit different, but still not what you'd expect.
When I search for "Terminator 2" (I'm just making this up; my employer might not be really happy if I disclose what we are doing... anyway, the terms are a bit different but the principle is the same), I don't get anything, but I know there are objects containing the two words.
Maybe I'm doing something wrong? I removed all numbers 1 ... 9 from my noise configuration for ENG, ENU and NEU (neutral), regenerated the indexes, and tried the search.
These "small words" are considered "noise words" by the full text index. You can customize the list of noise words. This blog post provides more details. You need to repopulate your full text index when you change the noise words file.
I knew about the noise words file, but I'm not sure why your "Terminator 2" example is still giving you issues. You might want to try asking this on the MSDN Database Engine forum, where people who specialize in this sort of thing hang out.
You can combine CONTAINS (or CONTAINSTABLE) with simple where conditions:
SELECT * FROM Movies WHERE CONTAINS(Title, '"Terminator 2"') and Title like '%Terminator 2%'
While the CONTAINS finds every title containing Terminator, the LIKE condition will eliminate 'Terminator 1'.
Of course the engine is smart enough to start with the CONTAINS not the like condition.

Resources