Efficient Way of Filtering URLs by looking through keyword list - c

What is the best way to filter urls by comparing where a keyword is inside the url or not?
I have a list of keywords (a kind of blacklist) which contains 50000 words.
The search method uses the following steps:
While (end of keywords)
1. Get the keyword from database
2. Check whether the keyword is in the url
3. Redirect the user to a specific page.
When I use this method, the cpu usage becomes around %90. Is there an efficient way to do this? It seems that I can't use regex, since the keyword always changes.

The problem is multi pattern search and can be effectively solved with Aho-Coracisk algorithm. This algorithm searches a set of strings simultaneously. The complexity of the algorithm is linear in the length of the keyowords plus the length of the URL plus the number of output matches.

Check whether the keyword is in the url
[...]
Is there an efficient way to do this?
The vice versa will be much more efficient: split the URL onto the keywords and lookup them in the database.
To speed up the database lookup, you can use a variety of methods. For example, sort the database and do a binary search, use a trie structure, hash table, etc etc.

Aho-Corasick algorithm is the best solution for this problem.
Here is the python implementation Aho-Corasick
Below is a code sample
import ahocorasick
A = ahocorasick.Automaton()
for index, word in enumerate('asim sinan yuksel uksel sel sina sim asi as nan an in ina uks .com .co www. http//'.split()):
A.add_word(word, (index, word))
A.make_automaton()
for item in A.iter('http://wwww.asimsinanyuksel.com'):
print(item)

Related

Creating New Matching Logic in Informatica (Ratcliffe - Obershelp)

I am conducting a matching project in Informatica 10.2.1 wherein I need to identify matching strings within product descriptions. Ratcliffe-Obershelp is the matching strategy I need to implement.
I've heard Ratcliffe-Obershelp yields greater results than Jaro - Winkler but I am not sure how to code this into a transformation in Informatica since it is not built in.
No code to show as I don't even know where to start.
I'd expect this to be a transformation/group of transformations that would reproduce the matching score that Ratcliffe-Obershelp creates on a per-line basis.
If I understand correctly, the matching logic performs operations in a loop iterating over the input strings. It is not possible to implement such "loop over string" in Expression Transformation using built-in functions. I see two options:
create DECODE function with multiple conditions for each possible length. - This will be ugly. And can be possible assuming only that we start at the begining of each string - implementing full substring comparison will be... so ugly I can't imagine :)
use Java Transformation - as much as I have putting Java into mappings, there are some cases where it's justified. This look like one of the few. Here's some JS reference

SOLR: size of term dictionary and how to prune it

Two related questions:
Q1. I would like to find out the term dictionary size (in number of terms) of a core.
One thing I do know how to do is to list the file size of *.tim. For example:
> du -ch *.tim | tail -1
1,3G total
But how can I convert this to number of terms? Even a rough estimate would suffice.
Q2. A typical technique in search is to "prune" the index by removing all rare (very low frequency) terms. The objective is not to prune the size of the index, but the size of the actual term dictionary. What would be the simpler way to do this in SOLR, or programatically in SOLRj?
More exactly: I wish to eliminate these terms (tokens) from an existing index (term dictionary and all the other places in the index). The result should be similar to 1) adding the terms to a stop word list, 2) re-indexing an entire collection, 3) removing the terms from the stop word list.
You can get information in the Schema Browser page and click in "Load Term info", in the luke admin handler https://wiki.apache.org/solr/LukeRequestHandler and also, in then stats component https://cwiki.apache.org/confluence/display/solr/The+Stats+Component.
To prune the index, you could do it by do a facet of the field, and get the terms with low frecuency. Then, get the docs and update the document without this term (this could be difficult because it's depends the analyzers and tokenizers of your field). Also, you can use the lucene libraries to open the index and do it programmatically.
You can check the number and distribution of your terms with the AdminUI under the collection's Schema Browser screen. You need to Load Term Info:
Or you can use Luke which allows you to look inside the Lucene index.
It is not clear what you mean to 'remove'. You can add them to the stopwords in the analyzer chain for example if you want to avoid indexing them.

Separate Real Words from Random Strings

I'm storing a list of keywords that have been used throughout all searches on a site, and I'm getting a lot of random strings in the keywords field. Here's a sample of the data that I'm getting back:
fRNPRXiPtjDrfTDKH
boom
Mule deer
gVXOFEzRWi
cbFXZcCoSiKcmrvs
Owner Financed ,owner Financed
I'm trying to find a way in SQL or ColdFusion to figure out if something has valid English words, or if it's a random set of characters. I've tried doing some digging for n-gram analysis, but can't seem to come up with any useful solutions that I can run directly on my servers.
UPDATE: The code is now on jsFiddle: http://jsfiddle.net/ybanrab/s6Bs5/1/ it may be interesting to copy and paste a page of news copy and paste in your test data
I'd suggest trying to analyse the probabilities of the individual characters following each other. Below is an example in JavaScript I've written but that ought to translate to T-SQL or ColdFusion pretty easily.
The idea is that you feed in good phrases (the corpus) and analyse the frequency of letters following other letters. If you feed it "this thin the" you'll get something like this:
{
t:{h:3},
h:{i:2,e:1},
i:{s:1,n:1},
s:{},
n:{}
}
You'll get most accuracy by feeding in hand-picked known good inputs from the data you're analysing, but you may also get good results by feeding in plain english. In the example below I'm computing this, but you can obviously store this once you're happy with it.
You then run the sample string against the probabilities to give it a score. This version ignores case, word starting letter, length etc, but you could use them as well if you want.
You then just need to decide on a threshold score and filter like that.
I'm fairly sure this kind of analysis has a name, but my google-fu is weak today.
You can paste the code below into a script block to get an idea of how well (or not) it works.
var corpus=["boom","Mule Deer", "Owner Financed ,owner Financed", "This is a valid String","The quick brown fox jumped over the lazy dog"];
var probs={};
var previous=undefined;
//Compute the probability of one letter following another
corpus.forEach(function(phrase){
phrase.split(" ").forEach(function(word){
word.toLowerCase().split("").forEach(function(chr){
//set up an entry in the probabilities table
if(!probs[chr]){
probs[chr]={};
}
//If this isn't the first letter in the word, record this letter as following the previous one
if(previous){
if(!probs[previous][chr]){
probs[previous][chr]=0;
}
probs[previous][chr]++;
}
//keep track of the previous character
previous=chr;
});
//reset previous as we're moving onto a different word
previous=undefined;
})
});
function calculateProbability(suspect){
var score=0;
var previous=undefined;
suspect.toLowerCase().split("").forEach(function(chr){
if(previous && probs[previous] && probs[previous][chr]){
//Add the score if there is one, otherwise zero
score+=probs[previous][chr];
}
previous=chr;
});
return score/suspect.length;
}
console.log(calculateProbability("boom"));
console.log(calculateProbability("Mood"));
console.log(calculateProbability("Broom"));
console.log(calculateProbability("sajkdkas dak"));
The best thing to do is to check your words against frequency lists: dictionaries won't work because they don't contain grammatical inflections, proper nouns, compounds, and a whole load of other stuff that's valid.
The problem with naive checking against n-gram data is there is a lot of noise in the lower frequency words. The easiest thing to do which should give you the correct answer in the overwhelming majority of cases is to truncate a list of frequency counted words from somewhere suitably large (Google n-gram, Wikipedia, etc) at the top 50,000 or 100,000 words. Adjust the threshold as appropriate to get the results you're looking for, but then you can just check if any/all of your query terms appear in this list.
If you want to know if the query is grammatical, or sensible as a unit rather than its constituent parts, that's a whole other question of course.
There are some non-dictionary-words that can be valid searches (e.g. gethostbyname is a valid and meaningful search here on SO, but not a dictionary word). On the other hand, there are dictionary words that have absolutely nothing to do with your website.
Instead of trying to guess what is a word and what isn't, you could simply check if the search query produced a non-empty result. Those with empty results must be complete off-topic or gibberish.
It sounds like you are looking for a
Bayesian Filter

how to do fuzzy search in big data

I'm new to that area and I wondering mostly what the state-of-the-art is and where I can read about it.
Let's assume that I just have a key/value store and I have some distance(key1,key2) defined somehow (not sure if it must be a metric, i.e. if the triangle inequality must hold always).
What I want is mostly a search(key) function which returns me all items with keys up to a certain distance to the search-key. Maybe that distance-limit is configureable. Maybe this is also just a lazy iterator. Maybe there can also be a count-limit and an item (key,value) is with some probability P in the returned set where P = 1/distance(key,search-key) or so (i.e., the perfect match would certainly be in the set and close matches at least with high probability).
One example application is fingerprint matching in MusicBrainz. They use the AcoustId fingerprint and have defined this compare function. They use the PostgreSQL GIN Index and I guess (although I haven't fully understood/read the acoustid-server code) the GIN Partial Match Algorithm but I haven't fully understand wether that is what I asked for and how it works.
For text, what I have found so far is to use some phonetic algorithm to simplify words based on their pronunciation. An example is here. This is mostly to break the search-space down to a smaller space. However, that has several limitations, e.g. it must still be a perfect match in the smaller space.
But anyway, I am also searching for a more generic solution, if that exists.
There is no (fast) generic solution, each application will need different approach.
Neither of the two examples actually does traditional nearest neighbor search. AcoustID (I'm the author) is just looking for exact matches, but it searches in a very high number of hashes in hope that some of them will match. The phonetic search example uses metaphone to convert words to their phonetic representation and is also only looking for exact matches.
You will find that if you have a lot of data, exact search using huge hash tables is the only thing you can realistically do. The problem then becomes how to convert your fuzzy matching to exact search.
A common approach is to use locality-sensitive hashing (LSH) with a smart hashing method, but as you can see in your two examples, sometimes you can get away with even simpler approach.
Btw, you are looking specifically for text search, the simplest way you can do it split your input to N-grams and index those. Depending on how your distance function is defined, that might give you the right candidate matches without too much work.
I suggest you take a look at FLANN Fast Approximate Nearest Neighbors. Fuzzy search in big data is also known as approximate nearest neighbors.
This library offers you different metric, e.g Euclidian, Hamming and different methods of clustering: LSH or k-means for instance.
The search is always in 2 phases. First you feed the system with data to train the algorithm, this is potentially time consuming depending on your data.
I successfully clustered 13 millions data in less than a minute though (using LSH).
Then comes the search phase, which is very fast. You can specify a maximum distance and/or the maximum numbers of neighbors.
As Lukas said, there is no good generic solution, each domain will have its tricks to make it faster or find a better way using the inner property of the data your using.
Shazam uses a special technique with geometrical projections to quickly find your song. In computer vision we often use the BOW: Bag of words, which originally appeared in text retrieval.
If you can see your data as a graph, there are other methods for approximate matching using spectral graph theory for instance.
Let us know.
Depends on what your key/values are like, the Levenshtein algorithm (also called Edit-Distance) can help. It calculates the least number of edit operations that are necessary to modify one string to obtain another string.
http://en.wikipedia.org/wiki/Levenshtein_distance
http://www.levenshtein.net/

How to Implement a dictionary in C/C++ with autoCorrect, auto-complete, spellcheck

I have to write a C/C++ Code for a dictionary implementation with the following features:
There are basically definitions (1 or more) for words.
1) Insertion
2) Searching (As fast as possible)
3) Auto Complete
4) Auto Correct
5) Spell Check
So I need to know HOW TO DO SO?
Which data structures should be the most efficient? Trie or hast table or something else
Which searching technique to use...?
How to implement auto-complete and spell Checking effectively..?
You would typically use a tree of words, arranged according to edit distance from one another, such as a BK tree.
IIRC, the idea is to have a balanced tree with each word linked through edges numbered according to edit distance. If you want to find the nearest match for a word, you compute it's edit distance to the root word, then follow the root word's link of the same number, and repeat the process until you reach a leaf node which is either the same word, or the closest match.
EDIT: in hindsight, that article I linked does a much better job of explaining it than I did. I'd just recommend reading through it for a good explanation of the approach.
Certainly you need a database with a list of words, then you need to split your text up into words and see if they exist in the database.
For Autocomplete you can just check that the text entered so far matches words in the dictionary (with a LIKE txt+'%' clause), implemented with an AJAX call.

Resources