Spanish profanity black-list - blacklist

I've been tasked with implementing a blacklist-based profanity filter for a Rails app. I know there are a ton of issues with blacklist-based filtering, but the decision was made above my head. Challenge: I'm looking for a good list of Spanish profanity to run into the filter. For English, we're building on a list which exhaustively lists conjugations/plurals/etc, one per line of a text file. Does such a list exist in the public domain for Spanish?

Finding good lists and having them tuned is difficult. It also sounds like you are doing a lot of manual work that can be automated (i.e. conjugation). I did a lot of this for my company's profanity filter named CleanSpeak and much of this can be automated using POS identifiers for words and in many cases you can manually do POS tagging or find a POS source.
You'll also need to consider the quality of the lists and the up-keep and management of a filter. A lot of people think it is simple and then realize that it is extremely difficult to prevent false-positives.
All that said, we found the majority of our lists for other languages difficult to come by online and ended up paying to have many of the built or purchased from other companies. The lists we did find online ended up being nearly worthless once we had them translated. We also attempted to take out blacklist and have that translated, which was a complete failure because most English profanities don't have equivalents in other languages. I would suggest purchasing lists or working with students at your local university to generate lists. A number of our customers found this method relatively good and not overly expensive.
I would also suggest that you take a look at some of the resources out there that define the best ways to manage User Generated Content. These will help guide you through any build vs. buy decisions.

Related

Database that allows full text search in O(1)

I have a database of documents where searching quickly for keywords and patterns would be very useful to have.
I know of "Burrows–Wheeler transform"/FM-index. I wonder if there are any programs or database programs based on BWT or similar methods in order to search a corpus in O(1) and hopefully more advantages.
Any ideas?
There is a great book by Witten/Moffat/Bell (1994) Managing Gigabytes; this describes in detail everything you need to know about indexing and retrieval. I think their sourcecode is also available, or has been made available in an information retrieval library.
However, it doesn't include the Burrows-Wheeler transform, as that was only invented in the same year.

Semantic Search Engine

I want to design a Semantic Search engine for my final year Master's degree. I have been doing a fair amount of reading both casually on the web and academic papers so I am not a total noob in this field.
My aim is to build a semantic search engine, which parses out the HTML content into its equivatlent RDF triples,stores the triples in a triplestore, through which the engine will try to respond to the query fired using SPARQL. I want to do something out of the box unlike the other students . So, I decided to build a semantic search engine.
Right now, I had a running search engine using Solr which performs keyword search, what I want to do is the semantic search. I know some open source tools regarding Web 3.0 but not sure whether they will be compatible with Solr or not.
So, can you please provide me some help for building the same.
Thanks.
Regards
Although it sounds hard, but you will not be able to capture everything.
You need a lot of data. Of course, there already is a lot of data arranged in formats like owl and rdf which you may use (e.g. WordNet, Yago, GeoNames etc), but although they are of huge size, they only focus on very small portions of a possible discourse universe.
Developing a good semantic search takes a lot of resources and brain power. Projects, like for example KompParse at the German Research Center for Artificial Intelligence, which only focus on a small part of human conversation (gossip or buying furniture) have been running for several years with several employees by now and are still just "ok".
Understanding semantics has already been implemented in different search engines, take google for example, or wolfram alpha. So this topic might not even be as much "out of the box" as you think.
So I will go with user723630 and strongly advise you, to focus on a smaller topic. You will still achieve a lot, but you will not get frustrated.

NLP libraries for simple POS tagging

I'm a student who's working on a summer project in NLP. I'm fairly new to the field, so I apologize if there's a really obvious solution. The project is in C, both due to my familiarity with it, and the computationally intensive nature of the project (my corpus is a plaintext dump of wikipedia).
I'm working on an approach to relationship extraction, exploiting the consistency principle to try to learn (to within some error threshold) a set of rules dictating which clusters of grammar objects imply a connection between those objects.
One of the first steps in the algorithm involves finding the set of all possible grammar objects a given word can refer to (POS disambiguation is done implicitly by the algorithm at a later step). I've looked at several parsers, but they all seem to do the disambiguation step themselves, which (from my end) is counterproductive. I'm looking for something off the shelf that (ideally) gives me a one-command way to turn up this information.
Does such a thing exist? If not, is there an existent dictionary containing this information that's trivially machine parseable?
Thank you for your help.
Look at CMU Sphinx. An open source NLP project. I think its in C++ but you can integrate it or at least get the idea of how to go about things.
What about calling an external POS tagger as a shell script or wrapping it in an http service if you feel frisky?
Java and Python have the vast majority of NLP libraries so it makes sense to take advantage of that. If you can use NLTK in a script to tag stuff, call this script from C, that makes it much easier.

Silverlight demo for bioinformatics

I am a beginner Silverlight programmer preparing for the interview in medical research company. Job sounds damn interesting and I would like to get there.
To show my skills and interest, I want to write a program related to the topic.
What would you suggest?
First ideas: simple statistical analysis of input data, image collections (for example, find HD DNA image and put it in Silverlight Deep Zoom), lab inventory program..
Have a look at http://research.microsoft.com/en-us/projects/bio/default.aspx the Microsoft Biology Foundation, part of Microsoft Research. Its code is OpenSource (sic) and you will find many applications there. The apps cover most of the basics, sequences, etc. and have some nice display tools.
Collection/Maintenance/Retrieval of data is very important for any organization. Try this tutorial:
Silverlight tutorial: Creating a data centric web app with DataGrid, LINQ, and WCF Web Service
You placed a "bioinformatics" tag on your question. Many bioinformatics companies consider Perl programming to be quite important.
I suggest that you perform a search on "bioinformatics Perl" and take a look at books and sites that are retrieved. Perhaps you could park yourself at a local bookstore and peruse some of those titles. Free Perl interpreters are available.
You do have a basic understanding of genetics, yes? Be very familiar with some of the terminology, so you won't have to stare off into space while you pick from genotype/phenotype or mRNA/RNA/DNA or recall what a codon is.
It wouldn't hurt to nose around PubMed and get a basic understanding of what genomes are out there and what statistical tests can be performed on them.
I like your statistics idea. Perhaps you could write a program that tells you whether to accept or reject a null hypothesis based on numbers you read in from a file. Or perhaps you could figure out how to use the statistics portion of Entrez Gene in PubMed.
Best wishes for the interview.

looking for a good project to work on as my graduation project in the university that involves Ai / Machine Learning, please help me

I need help to chose a project to work on for my master graduation, The project must involve Ai / Machine learning or Business intelegence.. but if there is any other suggestion out of these topics it is Ok, please help me.
One of the most rapid growing areas in AI today is Computer Vision. There are many practical needs where the results of your Master Thesis can be helpful. You can try research something like Emotion Detection, Eye-Tracking, etc.
An appropriate work for a MS in CS in any good University can highlight the current status of research on this field, compare different approaches and algorithms. As a practical part, it makes also a lot of fun when your program recognizes your mood properly :)
Netflix
If you want to work more on non trivial datasets (not google size, but not trivial either and with real application), with an objective measure of success, why not working on the netflix challenge (the first one) ? You can get all the data for free, you have many papers on it, as well as pretty good way to compare your results vs other peoples (since everyone used exactly the same dataset, and it was not so easy to "cheat", contrary to what happens quite often in the academic literature). While not trivial in size, you can work on it with only one computer (assuming it is recent enough), and depending on the type of algorithms you are using, you can implement them in a language which is not C/C++, at least for prototyping (for example, I could get decent results doing things entirely in python).
Bonus point, it passes the "family" test: easy to tell your parents what you are working on, which is always a pain in my experience :)
Music-related tasks
A bit more original: something that is both cool, not trivial but not too complicated in data handling is anything around music, like music genre recognition (classical / electronic / jazz / etc...). You would need to know about signal processing as well, though - I would not advise it if you cannot get easy access to professors who know about the topic.
I can use the same answer I used on a previous, similar question:
Russ Greiner has a great list of project topics for his machine learning course, so that's a great place to start.
Both GAs and ANNs are learners/classifiers. So I ask you the question, what is an interesting "thing" to learn? Maybe it's:
Detecting cancer
Predicting the outcome between two sports teams
Filtering spam
Detecting faces
Reading text (OCR)
Playing a game
The sky is the limit, really!
Since it has a business tie in - given some input set determine probable business fraud from the input (something the SEC seems challenged in doing). We now have several examples (Madoff and others). Or a system to estimate investment risk (there are lots of such systems apparently but were any accurate in the case of Lehman for example).
A starting point might be the Chen book Genetic Algorithms and Genetic Programming in Computational Finance.
Here's an AAAI writeup of an award to the National Association of Securities Dealers for a system thatmonitors NASDAQ insider trading.
Many great answers posted already, but I wanted to add my 2 cents.There is one hot topic in which big companies all around are investing lots of resources into, and is still a very challenging topic with lots of potential: Automated detection of fake news.
This is even more relevant nowadays where most of us are connecting though social media and there's a huge crisis looming over.
Fake news, content removal, source reliability... The problem is huge and very exciting. It is as I said challenging as it can be seen from many perspectives (from analising images to detect fakes using adversarial netwotks to detecting fake written news based on text content (NLP) or using graph theory to find sources) and the possbilities for a research proyect are endless.
I suggest you read some general articles (e.g this or this) or have a look at research articles from the last couple of years (a quick google seach will throw you a lot of related stuff).
I wish I had the opportunity of starting over a project based on this topic. I think it's going to be of the upmost relevance in the next few years.

Resources