List of single-word English nouns?

I'm trying to find a list of popular English nouns; any list ranging from about 6,000 to 20,000 words would be fine. I have a list at the moment consisting of almost 150,000 entries, but many of these are not really nouns, such as 'I', 'You', 'Main' etc., and it also contains entries consisting of multiple words. I could write a short script to try to filter this list, but I thought there might be a better option available. I have tried a few places on the web aimed at things like English language learning, but most of them only offer things like 'top 1500 nouns'.

Some time back I used some input from Stanford NLP (http://nlp.stanford.edu/); I think it's worth looking at. Even if they don't provide a static database, they definitely have tools that will help you clean up your list.
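If you do end up filtering the list yourself, a part-of-speech tagger does most of the work. Here's a minimal sketch using NLTK rather than Stanford's tools (the tagger resource name can vary between NLTK versions, and tagging a word out of context is only approximate):

import nltk

nltk.download("averaged_perceptron_tagger", quiet=True)

def filter_nouns(words):
    """Keep entries that are a single word and tagged as a noun (NN/NNS)."""
    nouns = []
    for w in words:
        if " " in w:  # drop multi-word entries
            continue
        tag = nltk.pos_tag([w])[0][1]  # tag the word in isolation
        if tag in ("NN", "NNS"):
            nouns.append(w)
    return nouns

print(filter_nouns(["apple", "you", "mule deer", "wings"]))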

Related

How to do Norvig spell check for Chinese characters mixed with English letters?

I have a list of product names written in a mixture of English letters, numbers, and Chinese characters stored in my database.
There is a table called products with the fields name_en and name_zh, amongst others.
E.g.
AB 10"机翼
Peter Norvig has a fantastic algorithm for spell check but it only works for English.
I was wondering if there's a way to do something similar for a narrow list of terms containing Chinese characters?
E.g. misspellings such as
A10机翼
AB 10鸡翼
AB 10鸡一
AB 10木几翼
should all prompt AB 10"机翼 as the correct spelling.
How do I do this?
You have a much more complex problem than Norvig's:
Chinese Input-method
The misspellings in your case (at least in your example) are mostly caused by the pinyin input method: the same typed pinyin, "jiyi" (English: airplane wing), can produce different Chinese phrases:
机翼
鸡翼
鸡一
几翼
Chinese Segmentation
Also, in Chinese, to break a long sentence into small tokens with semantic meaning you need to do segmentation. For example:
飞机模型零件 -> before segmentation
飞机-模型-零件 -> after segmentation, you get three phrases separated by '-'.
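For reference, that segmentation step is a one-liner with an off-the-shelf segmenter; a sketch assuming the jieba package:

import jieba  # a widely used Chinese segmenter

print(jieba.lcut("飞机模型零件"))  # e.g. ['飞机', '模型', '零件']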
Work on the token-level
You can probably experiment starting from a list of misspellings; I guess you can collect a bunch of them from your user logs. Take one misspelling at a time, using your example:
AB 10鸡翼
First break it into tokens:
A-B-10-鸡翼
(here you probably need a Chinese segmentation algorithm to realize that 鸡翼 should be treated as a single token).
Then you should try to find its nearest neighbor in your product db using the edit distance idea. Note that:
you do not remove/edit/replace one character at a time, but remove/edit/replace one token at a time;
when editing/replacing, limit the candidates to near neighbors of the original token, e.g. 鸡翼 -> 机翼, 几翼, 机一.
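A sketch of that idea: a Levenshtein distance where the unit is a token rather than a character (how the product names were tokenized is assumed here, and the near-neighbor restriction on replacements is only noted in a comment):

def token_edit_distance(a, b):
    """Levenshtein distance over token lists instead of character strings."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            # a full version would only allow a replacement when b[j-1] is a
            # near neighbor of a[i-1] (e.g. same pinyin), as described above
            d[i][j] = min(d[i - 1][j] + 1,         # remove a token
                          d[i][j - 1] + 1,         # insert a token
                          d[i - 1][j - 1] + cost)  # replace a token
    return d[m][n]

query = ["AB", "10", "鸡翼"]
catalog = [["AB", '10"', "机翼"], ["CD", "20", "机身"]]
print(min(catalog, key=lambda name: token_edit_distance(query, name)))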
Build Lucene index
You can also try to tackle the problem in a different way, starting from your correct product names. Treat each product name as a document and pre-build a Lucene index from them. Then for each user query, the query-matching problem is converted into a search problem: issue the query to the search engine to find the best-matching documents in your db. In this case, I believe Lucene would take care of the segmentation and tokenization for you (if not, you would need to extend its functionality to suit your own needs).
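To make the shape of that concrete, here is a toy sketch of the index-then-search idea (real Lucene would replace all of this, and the whitespace tokenizer below stands in for a proper analyzer with Chinese segmentation):

from collections import defaultdict

products = ['AB 10"机翼', "CD 20机身"]

def tokenize(name):
    # stand-in analyzer; a real one would also segment the Chinese part
    return name.lower().split()

index = defaultdict(set)  # token -> ids of product names containing it
for pid, name in enumerate(products):
    for tok in tokenize(name):
        index[tok].add(pid)

def search(query):
    # score each product by how many query tokens it shares
    scores = defaultdict(int)
    for tok in tokenize(query):
        for pid in index.get(tok, ()):
            scores[pid] += 1
    return [products[pid]
            for pid, _ in sorted(scores.items(), key=lambda kv: -kv[1])]

print(search("AB 10 鸡翼"))  # best-effort match on the shared tokens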

effective way to store and access an existing list of videogame characters with counters and synergies in C++

I'm attempting to make a DotA drafting companion application in C++.
I'll give you a little run down of how it will work. In DotA there are two teams of 5 players. No character can be picked twice in a single match, meaning teams cannot share a character. It is important to keep in mind both your team's and opponent's picks. So what this will do is suggest characters for you to pick based on your and your opponent's teams.
To clarify, the existing list is not some sort of database, just a menagerie of suggestions on various websites, forums, and videos. I would be organizing the list myself, and so my question is ultimately what format should I create this list in?
So for instance a popular character is Phantom Assassin, who is strong with characters who can reduce armor and weak against characters who can do lots of damage.
So a hero class might look like this
#include <string>
#include <vector>
#include "CImg.h"

class DotaHero
{
    std::string Name = "Phantom Assassin";
    std::vector<std::string> Counters{"Lina", "Lion"};                    // weak against
    std::vector<std::string> Friends{"Templar Assassin", "Shadow Demon"}; // strong with
    cimg_library::CImg<unsigned char> Portrait{"PA.jpg"};                 // hero image
};
My program would allow you to input the first few heroes and then display a list of characters as a suggestion.
Should I...
Create a class for each and every hero in the game
Stream from a text file with each hero delineated by counters/friends
Create some sort of database to store each hero's counters/friends in
Please feel free to give any suggestions!
A very difficult question.
The question of what you store, and how you express relationships between characters, is going to define fundamentally how your tool works and how effective it is.
Off the top of my head, I would create 2 weighted graphs:
1.) friends.
2.) counters.
Each node represents a hero, and each edge of the graph carries a weight for how strong a counter or how strong a friend you consider that hero. As heroes are picked, find the highest-weighted neighbor node across all of the currently picked heroes, and that would be your top suggestion. That is the naive approach based only on the next pick (not always the best; a good way to get into trouble). Add layers and create weightings for 2nd-order neighbors, and you can start doing non-deterministic analysis of the data set to get more sophisticated results.
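A small sketch of that naive next-pick step (in Python for brevity; in C++ the same shape is two std::unordered_map adjacency maps, and all the weights below are invented):

friends = {
    "Phantom Assassin": {"Templar Assassin": 3, "Shadow Demon": 2},
    "Templar Assassin": {"Phantom Assassin": 3},
}
counters = {  # counters["X"] = heroes that are strong against X
    "Phantom Assassin": {"Lina": 4, "Lion": 2},
}

def suggest(my_picks, enemy_picks):
    """Score unpicked heroes by friendship with our picks plus counter
    strength against theirs; return the highest-weighted suggestion."""
    taken = set(my_picks) | set(enemy_picks)
    scores = {}
    for hero in my_picks:
        for ally, w in friends.get(hero, {}).items():
            if ally not in taken:
                scores[ally] = scores.get(ally, 0) + w
    for hero in enemy_picks:
        for counter, w in counters.get(hero, {}).items():
            if counter not in taken:
                scores[counter] = scores.get(counter, 0) + w
    return max(scores, key=scores.get) if scores else None

print(suggest(["Templar Assassin"], ["Phantom Assassin"]))  # e.g. Lina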
Just a jumping off point. It would be a truly fascinating problem.

Separate Real Words from Random Strings

I'm storing a list of keywords that have been used throughout all searches on a site, and I'm getting a lot of random strings in the keywords field. Here's a sample of the data that I'm getting back:
fRNPRXiPtjDrfTDKH
boom
Mule deer
gVXOFEzRWi
cbFXZcCoSiKcmrvs
Owner Financed ,owner Financed
I'm trying to find a way in SQL or ColdFusion to figure out if something has valid English words, or if it's a random set of characters. I've tried doing some digging for n-gram analysis, but can't seem to come up with any useful solutions that I can run directly on my servers.
UPDATE: The code is now on jsFiddle: http://jsfiddle.net/ybanrab/s6Bs5/1/. It may be interesting to copy in a page of news copy as the corpus and then paste in your test data.
I'd suggest trying to analyse the probabilities of the individual characters following each other. Below is an example I've written in JavaScript, but it ought to translate to T-SQL or ColdFusion pretty easily.
The idea is that you feed in good phrases (the corpus) and analyse the frequency of letters following other letters. If you feed it "this thin the" you'll get something like this:
{
    t: {h: 3},
    h: {i: 2, e: 1},
    i: {s: 1, n: 1},
    s: {},
    n: {},
    e: {}
}
You'll get the most accuracy by feeding in hand-picked, known-good inputs from the data you're analysing, but you may also get good results by feeding in plain English. In the example below I'm computing the table on the fly, but you can obviously store it once you're happy with it.
You then run the sample string against the probabilities to give it a score. This version ignores case, word starting letter, length etc, but you could use them as well if you want.
You then just need to decide on a threshold score and filter like that.
I'm fairly sure this kind of analysis has a name, but my google-fu is weak today.
You can paste the code below into a script block to get an idea of how well (or not) it works.
var corpus = ["boom", "Mule Deer", "Owner Financed ,owner Financed", "This is a valid String", "The quick brown fox jumped over the lazy dog"];
var probs = {};
var previous = undefined;

//Compute the probability of one letter following another
corpus.forEach(function(phrase) {
    phrase.split(" ").forEach(function(word) {
        word.toLowerCase().split("").forEach(function(chr) {
            //set up an entry in the probabilities table
            if (!probs[chr]) {
                probs[chr] = {};
            }
            //If this isn't the first letter in the word, record this letter as following the previous one
            if (previous) {
                if (!probs[previous][chr]) {
                    probs[previous][chr] = 0;
                }
                probs[previous][chr]++;
            }
            //keep track of the previous character
            previous = chr;
        });
        //reset previous as we're moving onto a different word
        previous = undefined;
    });
});

function calculateProbability(suspect) {
    var score = 0;
    var previous = undefined;
    suspect.toLowerCase().split("").forEach(function(chr) {
        if (previous && probs[previous] && probs[previous][chr]) {
            //Add the score if there is one, otherwise zero
            score += probs[previous][chr];
        }
        previous = chr;
    });
    return score / suspect.length;
}

console.log(calculateProbability("boom"));
console.log(calculateProbability("Mood"));
console.log(calculateProbability("Broom"));
console.log(calculateProbability("sajkdkas dak"));
The best thing to do is to check your words against frequency lists: dictionaries won't work because they don't contain grammatical inflections, proper nouns, compounds, and a whole load of other stuff that's valid.
The problem with naive checking against n-gram data is that there is a lot of noise in the lower-frequency words. The easiest thing to do, which should give you the correct answer in the overwhelming majority of cases, is to truncate a frequency-counted word list from somewhere suitably large (Google n-grams, Wikipedia, etc.) at the top 50,000 or 100,000 words. Adjust the threshold as appropriate to get the results you're looking for; then you can just check whether any/all of your query terms appear in the list.
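A minimal sketch of that check, assuming a frequency-sorted file with one "word count" pair per line (the filename is hypothetical):

import itertools

TOP_N = 50_000

with open("word_frequencies.txt", encoding="utf-8") as f:
    common = {line.split()[0].lower()
              for line in itertools.islice(f, TOP_N) if line.strip()}

def looks_real(query):
    """True if every term in the query is among the top-N words."""
    return all(term.lower() in common for term in query.split())

print(looks_real("Mule deer"))
print(looks_real("fRNPRXiPtjDrfTDKH"))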
If you want to know if the query is grammatical, or sensible as a unit rather than its constituent parts, that's a whole other question of course.
There are some non-dictionary words that can be valid searches (e.g. gethostbyname is a valid and meaningful search here on SO, but not a dictionary word). On the other hand, there are dictionary words that have absolutely nothing to do with your website.
Instead of trying to guess what is a word and what isn't, you could simply check whether a search query produced a non-empty result. Queries with empty results must be either completely off-topic or gibberish.
It sounds like you are looking for a Bayesian filter.
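In the spirit of that suggestion, a minimal sketch: compare the likelihood of a string's character bigrams under a model trained on known-good text versus a uniform "random keystrokes" model (the tiny corpus and the 27-character alphabet here are purely illustrative):

import math
from collections import Counter

corpus = "the quick brown fox jumped over the lazy dog mule deer boom"
bigrams = Counter(zip(corpus, corpus[1:]))
total = sum(bigrams.values())
V = 27  # a-z plus space, for the uniform random model

def log_odds_real(s):
    # positive score: more like the corpus; negative: more like random keys
    s = s.lower()
    score = 0.0
    for pair in zip(s, s[1:]):
        p_real = (bigrams[pair] + 1) / (total + V * V)  # add-one smoothing
        p_random = 1 / (V * V)
        score += math.log(p_real / p_random)
    return score

print(log_odds_real("broom"))       # positive with this corpus
print(log_odds_real("gVXOFEzRWi"))  # negative with this corpus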

Is there a way to rank the difficulty of pronunciation of a word?

I'm trying to build a collection of English words that are difficult to pronounce.
I was wondering if there is an algorithm of some kind or a theory, that can be used to show how difficult a word is to pronounce.
Does this appear to you as something that can be computed?
As this seems to be a very subjective thing, let me make it more objective: let's say, the hardest words for text-to-speech technologies to pronounce.
One approach would be to build a list with two versions of each word: one the correct spelling, the other the word spelled using the simplest phonetic spelling. Apply a distance function to the two versions (like the Levenshtein distance, http://en.wikipedia.org/wiki/Levenshtein_distance). The greater the distance between the two versions, the harder the word would be to pronounce.
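A sketch of that approach (the phonetic respellings below are invented for illustration; a real list would come from a pronunciation resource):

def levenshtein(a, b):
    """Plain character-level Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

pairs = {"yacht": "yot", "grasp": "grasp", "colonel": "kernel"}
for spelling, phonetic in pairs.items():
    # larger distance = spelling is further from how the word sounds
    print(spelling, levenshtein(spelling, phonetic))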
Great problem! Off the top of my head, you could create a system which contains all the letters of the phonetic alphabet, with weights between every combination of them based on difficulty (this is highly subjective, so it may need multiple people testing and taking averages, etc.). Then have a list of all the words in an English dictionary stored on disk, and run a script which cycles through each entry, scrapes Wikipedia for the phonetic spelling, and ranks its difficulty. This could take into consideration the length of the word as well as the difficulty of joining the phonetics, then order the list by difficulty.
That's what I would try to do :P
To a certain extent...
Speech programs, for example, use a system of phonetics to try to pronounce words.
For example, "grasp" would be split into:
Gr-A-Sp
However, for foreign words (or words that don't follow this pattern), exception lists have to be kept, e.g. "yacht".
Suggestion
Fortunately, pronunciation as a process depends on two factors:
1. the phones making up the word and the locations of vowels and semivowels, i.e. /a/, /ae/, /e/, /i/, /o/, /u/, /w/, /j/...
2. the length of the word.
The first relates to the mechanics of producing each phone: the velum, cheeks and tongue have to be repositioned to produce the various sounds (nasal, etc.), which makes some words more difficult to pronounce because a lot of movement may be required. Refer to books on phonetics to find the articulatory positions for each phone.
Algorithm
a weighted spanning tree, with the weights being the difficulty of pronouncing two consecutive phones, e.g. /l/ and /r/, or /sh/ and /s/.
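A sketch of that scoring step: sum hand-assigned difficulty weights over consecutive phone pairs, plus a length term (every weight and the phone notation here are made up; a real transcription could come from a pronunciation lexicon such as CMUdict):

# difficulty of moving between two consecutive phones; unlisted pairs get 1
transition_difficulty = {
    ("l", "r"): 5, ("r", "l"): 5,
    ("sh", "s"): 4, ("s", "sh"): 4,
    ("g", "r"): 2,
}

def pronunciation_difficulty(phones):
    pair_cost = sum(transition_difficulty.get(p, 1)
                    for p in zip(phones, phones[1:]))
    return pair_cost + len(phones)  # longer words are harder

print(pronunciation_difficulty(["g", "r", "a", "s", "p"]))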
good luck.

Display vs. Search vs. Sort strings in a database

Let's say I've got a database full of music artists. Consider the following artists:
The Beatles -
"The" is officially part of the name, but we don't want to sort it with the "T"s if we are alphabetizing. We can't easily store it as "Beatles, The" because then we can't search for it properly.
Beyoncé -
We need to allow the user to search for "Beyonce" (without the diacritic mark) and get the proper results back. No user is going to know how, or take the time, to type the special diacritical character on the last "e" when searching, yet we obviously want to display it correctly when we output it.
What is the best way around these problems? It seems wasteful to keep an "official name", a "search name", and a "sort name" in the database since a very large majority of entries will all be exactly the same, but I can't think of any other options.
The library science folks have a standard answer for this. The ALA Filing Rules cover all of these cases in a perfectly standard way.
You're talking about the grammatical sort order. This is a debatable topic. Some folks would take issue with your position.
Generally, you transform the title to a normalized form ("Beatles, The"), leave it stored that way, and sort on that.
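An alternative sketch is deriving the sort form on the fly, so only one name needs storing (the article list here covers just English; the ALA rules above handle many more cases):

ARTICLES = ("The ", "A ", "An ")

def sort_name(title):
    """Move a leading article to the end: "The Beatles" -> "Beatles, The"."""
    for article in ARTICLES:
        if title.startswith(article):
            return title[len(article):] + ", " + article.strip()
    return title

artists = ["The Beatles", "Beyoncé", "A Tribe Called Quest"]
print(sorted(artists, key=lambda t: sort_name(t).lower()))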
You can read about cataloging rules here: http://en.wikipedia.org/wiki/Library_catalog#Cataloging_rules
For "extended" characters, you have several choices. For some folks, é is a first-class letter and the diacritical is part of it. They aren't confused. For other folks, all of the diacritical characters map onto unadorned characters. This mapping is a feature of some Unicode processing tools.
You can read about Unicode diacritical stripping here: http://lexsrv3.nlm.nih.gov/SPECIALIST/Projects/lvg/current/docs/designDoc/UDF/unicode/NormOperations/stripDiacritics.html
http://www.siao2.com/2005/02/19/376617.aspx
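As a concrete sketch of that stripping, Unicode decomposition followed by dropping the combining marks gives you a search form while the stored name keeps its accents:

import unicodedata

def strip_diacritics(s):
    """Decompose to NFD, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics("Beyoncé"))  # Beyonce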
