We have an application that will be generating 4-character random strings for guest WiFi usage. So you walk into a hotel, get your room key and your WiFi password. I want to make these generated passwords as simple as possible to save calls to the helpdesk, but not so simple that they are easily guessed.
The problem is that inevitably you'll end up with passwords like "POOP" or "DICK". I think a simple solution is to have a database table of the "forbidden" words and, upon generation, check each password against the database first to make sure it isn't a banned word.
I have looked at probably dozens of filtered/banned/censored word lists, but I can't find one that is sufficiently detailed so as to include things like DIKK and P00P, and I don't exactly want to use my time today to try to think of every possible offensive 4-letter combination and type them all out manually.
Does anyone have a good resource/word list that would contain these "potentially-offensive" strings?
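To make the table lookup idea concrete, here is a minimal sketch of the check, assuming the banned words are held in an in-memory set rather than a real database table. The `BANNED` entries, the `LOOKALIKES` mapping and the function names are all hypothetical; the mapping only folds digit and symbol look-alikes (so P00P matches POOP) and would not catch respellings such as DIKK, which would still need their own entries.

```javascript
// Hypothetical blocklist; in practice this would be loaded from the database table.
const BANNED = new Set(["POOP", "DICK"]);

// Assumed look-alike substitutions (digits and symbols commonly read as letters).
const LOOKALIKES = { "0": "O", "1": "I", "3": "E", "4": "A", "5": "S", "7": "T", "$": "S", "@": "A" };

// Map a candidate to a canonical form so that P00P and POOP compare equal.
function canonical(candidate) {
  return candidate
    .toUpperCase()
    .split("")
    .map((ch) => LOOKALIKES[ch] || ch)
    .join("");
}

// True when the canonical form of the candidate is on the blocklist.
function isBanned(candidate) {
  return BANNED.has(canonical(candidate));
}

console.log(isBanned("P00P")); // true
console.log(isBanned("poop")); // true
console.log(isBanned("WXYZ")); // false
```

A generator would simply draw a new string whenever `isBanned` returns true.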
First I wrote this as a comment. But then I realized it actually answers your question about skipping offensive words:
Consider generating random strings without vowels. You won't get any actual English word, so you will avoid both ordinary words like 'tree' and misspelled ones like 'fukc'.
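A minimal sketch of that idea, assuming Node.js and its built-in `crypto.randomInt`. The alphabet and the `generateCode` name are my own choices for illustration; Y is left out as well because it often behaves like a vowel.

```javascript
const { randomInt } = require("node:crypto");

// Consonants only; Y is excluded too because it often acts as a vowel.
const CONSONANTS = "BCDFGHJKLMNPQRSTVWXZ";

// Build a code by drawing each character uniformly from the consonant alphabet.
function generateCode(length = 4) {
  let code = "";
  for (let i = 0; i < length; i++) {
    code += CONSONANTS[randomInt(CONSONANTS.length)];
  }
  return code;
}

console.log(generateCode()); // four random consonants
```

With 20 letters and 4 positions this gives 20^4 = 160,000 possible codes. Consonant-only strings can still spell an unfortunate abbreviation, so a small blocklist may remain useful alongside this.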
I suggest you use numbers too; it will be "more secure" and it will eliminate this problem.
I'm storing a list of keywords that have been used throughout all searches on a site, and I'm getting a lot of random strings in the keywords field. Here's a sample of the data that I'm getting back:
fRNPRXiPtjDrfTDKH
boom
Mule deer
gVXOFEzRWi
cbFXZcCoSiKcmrvs
Owner Financed ,owner Financed
I'm trying to find a way in SQL or ColdFusion to figure out if something has valid English words, or if it's a random set of characters. I've tried doing some digging for n-gram analysis, but can't seem to come up with any useful solutions that I can run directly on my servers.
UPDATE: The code is now on jsFiddle: http://jsfiddle.net/ybanrab/s6Bs5/1/ It may be interesting to paste in a page of news copy as the corpus and then paste in your test data.
I'd suggest trying to analyse the probabilities of the individual characters following each other. Below is an example in JavaScript I've written but that ought to translate to T-SQL or ColdFusion pretty easily.
The idea is that you feed in good phrases (the corpus) and analyse the frequency of letters following other letters. If you feed it "this thin the" you'll get something like this:
{
t:{h:3},
h:{i:2,e:1},
i:{s:1,n:1},
s:{},
n:{},
e:{}
}
You'll get the most accuracy by feeding in hand-picked known good inputs from the data you're analysing, but you may also get good results by feeding in plain English. In the example below I'm computing the table on every run, but you can obviously store it once you're happy with it.
You then run the sample string against the probabilities to give it a score. This version ignores case, word starting letter, length etc, but you could use them as well if you want.
You then just need to decide on a threshold score and filter like that.
I'm fairly sure this kind of analysis has a name, but my google-fu is weak today.
You can paste the code below into a script block to get an idea of how well (or not) it works.
var corpus=["boom","Mule Deer", "Owner Financed ,owner Financed", "This is a valid String","The quick brown fox jumped over the lazy dog"];
var probs={};
var previous=undefined;
//Compute the probability of one letter following another
corpus.forEach(function(phrase){
    phrase.split(" ").forEach(function(word){
        word.toLowerCase().split("").forEach(function(chr){
            //set up an entry in the probabilities table
            if(!probs[chr]){
                probs[chr]={};
            }
            //If this isn't the first letter in the word, record this letter as following the previous one
            if(previous){
                if(!probs[previous][chr]){
                    probs[previous][chr]=0;
                }
                probs[previous][chr]++;
            }
            //keep track of the previous character
            previous=chr;
        });
        //reset previous as we're moving onto a different word
        previous=undefined;
    })
});
function calculateProbability(suspect){
    var score=0;
    var previous=undefined;
    suspect.toLowerCase().split("").forEach(function(chr){
        if(previous && probs[previous] && probs[previous][chr]){
            //Add the score if there is one, otherwise zero
            score+=probs[previous][chr];
        }
        previous=chr;
    });
    return score/suspect.length;
}
console.log(calculateProbability("boom"));
console.log(calculateProbability("Mood"));
console.log(calculateProbability("Broom"));
console.log(calculateProbability("sajkdkas dak"));
The best thing to do is to check your words against frequency lists: dictionaries won't work because they don't contain grammatical inflections, proper nouns, compounds, and a whole load of other stuff that's valid.
The problem with naive checking against n-gram data is there is a lot of noise in the lower frequency words. The easiest thing to do which should give you the correct answer in the overwhelming majority of cases is to truncate a list of frequency counted words from somewhere suitably large (Google n-gram, Wikipedia, etc) at the top 50,000 or 100,000 words. Adjust the threshold as appropriate to get the results you're looking for, but then you can just check if any/all of your query terms appear in this list.
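As a rough sketch of the list lookup, assuming the truncated frequency list has already been loaded into a set. The five-word `COMMON_WORDS` set is only a stand-in for a real 50,000 or 100,000 word list, and the function names and the "at least half the terms" rule are hypothetical choices to tune.

```javascript
// Tiny stand-in for a real frequency list truncated at the top 50,000 or 100,000 words.
const COMMON_WORDS = new Set(["boom", "mule", "deer", "owner", "financed"]);

// Fraction of the query's terms that appear in the word list.
function knownWordFraction(query, words = COMMON_WORDS) {
  const terms = query.toLowerCase().split(/[^a-z]+/).filter(Boolean);
  if (terms.length === 0) return 0;
  const known = terms.filter((t) => words.has(t)).length;
  return known / terms.length;
}

// Hypothetical rule: keep a query if at least half of its terms are known words.
function looksLikeWords(query, minFraction = 0.5) {
  return knownWordFraction(query) >= minFraction;
}

console.log(looksLikeWords("Mule deer"));                      // true
console.log(looksLikeWords("Owner Financed ,owner Financed")); // true
console.log(looksLikeWords("fRNPRXiPtjDrfTDKH"));              // false
```

Setting `minFraction` to 1 corresponds to requiring all query terms to be in the list; a small positive value corresponds to requiring any.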
If you want to know if the query is grammatical, or sensible as a unit rather than its constituent parts, that's a whole other question of course.
There are some non-dictionary-words that can be valid searches (e.g. gethostbyname is a valid and meaningful search here on SO, but not a dictionary word). On the other hand, there are dictionary words that have absolutely nothing to do with your website.
Instead of trying to guess what is a word and what isn't, you could simply check if the search query produced a non-empty result. Those with empty results must be completely off-topic or gibberish.
It sounds like you are looking for a Bayesian filter.
For example, replacing @ with at. At least one study demonstrated its effectiveness:
To our surprise, none of the crawlers that visited our departmental research and course and research web pages led to any spam on email addresses containing the at.
Another experiment demonstrated the same thing, showing that using at and dot reduced spam by two orders of magnitude.
The first study speculated that spammers obtain enough plain-text email addresses to ignore the obfuscated ones. But parsing at in addition to @ should be trivial. Why don't spammers account for such simple obfuscation?
I am no expert, but it intuitively makes sense that the @ symbol is much less commonly used in non-email-related text. The @ sign is what sets an email address apart from all the other text. If you simply use at, it blends in with normal English.
At is a pretty common word after all :P I'm sure it's still possible to parse out the "at" version of an email, but it takes a much more difficult regex.
Why would they bother if they can find hundreds of thousands of plaintext ones? Additionally, what if an email address contained the word at or dot? Take dottie@gmail.com: a simple find and replace of dot would change it to .tie@gmail.com.
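The find-and-replace pitfall can be illustrated with a small sketch. Both functions below are hypothetical examples of what a harvester might try, not a description of any real tool: the first replaces every occurrence of "dot" and "at", the second only replaces them when they stand alone between whitespace.

```javascript
// Naive approach: replace every occurrence, with no regard for word boundaries.
function naiveDeobfuscate(text) {
  return text.replace(/dot/g, ".").replace(/at/g, "@").replace(/\s+/g, "");
}

// Slightly more careful: only treat "at" and "dot" as separators when surrounded by whitespace.
function boundedDeobfuscate(text) {
  return text.replace(/\s+dot\s+/g, ".").replace(/\s+at\s+/g, "@");
}

const obfuscated = "dottie at gmail dot com";
console.log(naiveDeobfuscate(obfuscated));   // .tie@gmail.com
console.log(boundedDeobfuscate(obfuscated)); // dottie@gmail.com
```

Even the bounded version rewrites ordinary prose: "meet me at noon" becomes "meet me@noon", which is the "blends in with normal English" problem in practice.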
I am parsing the domain name out of a string by strchr() the last . (dot) and counting back until the dot before that (if any), then I know I have my domain.
This is a rather nasty piece of code and I was wondering if anyone has a better way.
The possible strings I might get are:
domain.com
something.domain.com
some.some.domain.com
You get the idea. I need to extract the "domain.com" part.
Before you tell me to go search in google, I already did. No answer, hence I am asking here.
Thank you for your help
EDIT:
The string I have contains a full hostname. This usually is in the form of whatever.domain.com but can also take other forms and as someone mentioned it can also have whatever.domain.co.uk. Either way, I need to parse the domain part of the hostname: domain.com or domain.co.uk
Did you mean strrchr()?
I would probably approach this by doing:
1. strrchr to get the last dot in the string, save a pointer here, replace the dot with a NUL ('\0').
2. strrchr again to get the next to last dot in the string. The character after this is the start of the name you are looking for (domain.com).
3. Using the pointer you saved in #1, put the dot back where you wrote the NUL.
Beware that names can sometimes end with a dot, if this is a valid part of your input set, you'll need to account for it.
Edit: To handle the flexibility you need in terms of example.co.uk and others, the function described above would take an additional parameter telling it how many components to extract from the end of the name.
You're on your own for figuring out how to decide how many components to extract -- as Philip Potter mentions in a comment below, this is a Hard Problem.
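The logic of the steps above, including the extra "how many components" parameter, can be sketched as follows. This is written in JavaScript rather than C purely to show the control flow: `lastIndexOf` plays the role of `strrchr`, and the `lastLabels` name and its handling of a single trailing dot are assumptions of mine.

```javascript
// Return the last `count` dot-separated components of a hostname.
// If the name has fewer components than requested, it is returned whole.
function lastLabels(hostname, count) {
  // A fully qualified name may end with a dot; drop it before searching.
  const name = hostname.endsWith(".") ? hostname.slice(0, -1) : hostname;
  let pos = name.length;
  for (let i = 0; i < count; i++) {
    if (pos === 0) return name;
    // Search backwards for the next dot, starting just before the previous one.
    pos = name.lastIndexOf(".", pos - 1);
    if (pos === -1) return name;
  }
  return name.slice(pos + 1);
}

console.log(lastLabels("domain.com", 2));            // domain.com
console.log(lastLabels("some.some.domain.com", 2));  // domain.com
console.log(lastLabels("whatever.domain.co.uk", 3)); // domain.co.uk
```

Unlike the C version, nothing is modified in place, so there is no dot to restore afterwards. Choosing `count` for a given name is still the hard part.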
This isn't a reply to the question itself, but an idea for an alternate approach:
In the context of already very nasty code, I'd argue that a good way to make it less nasty, and to get a good facility for parsing domain names and the like, is to use PCRE or a similar library for regular expressions. That will definitely help you out if you also want to validate that the TLD exists, for instance.
It may take some effort to learn initially, but if you need to make changes to existing matching/parsing code, or create more code for string matching - I'd argue that a regex-lib may simplify this a lot in the long term. Especially for more advanced matching.
Another library I recall that supports regex is GLib.
Not sure what flavor of C, but you probably want to tokenize the domain using "." as the separator.
Try this: http://www.metalshell.com/source_code/31/String_Tokenizer.html
As for the domain name, not sure what your end goal is, but domains can have lots and lots of nodes, you could have a domain name foo.baz.biz.boz.bar.co.uk.
If you just want the last 2 nodes, then use above and get the last two tokens.
Let's say I've got a database full of music artists. Consider the following artists:
The Beatles -
"The" is officially part of the name, but we don't want to sort it with the "T"s if we are alphabetizing. We can't easily store it as "Beatles, The" because then we can't search for it properly.
Beyoncé -
We need to allow the user to search for "Beyonce" (without the diacritic mark) and get the proper results back. No user is going to know how, or take the time, to type the special diacritical character on the last "e" when searching, yet we obviously want to display it correctly when we need to output it.
What is the best way around these problems? It seems wasteful to keep an "official name", a "search name", and a "sort name" in the database since a very large majority of entries will all be exactly the same, but I can't think of any other options.
The library science folks have a standard answer for this. The ALA Filing Rules cover all of these cases in a perfectly standard way.
You're talking about the grammatical sort order. This is a debatable topic. Some folks would take issue with your position.
Generally, you transform the title to a normalized form: "Beatles, The". Generally, you leave it that way. Then sort.
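A minimal sketch of that transformation, assuming only the English articles "The", "A" and "An" need to be moved. The `filingForm` name and the article list are mine; real filing rules cover many more cases.

```javascript
// Move a leading English article to the end: "The Beatles" -> "Beatles, The".
function filingForm(name) {
  const match = /^(the|an|a)\s+(.+)$/i.exec(name);
  return match ? `${match[2]}, ${match[1]}` : name;
}

const artists = ["The Beatles", "Beyoncé", "A Perfect Circle", "Cream"];

// Sort by the derived filing form while keeping the official names for display.
const sorted = [...artists].sort((a, b) => filingForm(a).localeCompare(filingForm(b)));

console.log(filingForm("The Beatles")); // Beatles, The
console.log(sorted);
```

Because the filing form is derived from the official name, it can be computed when needed or stored as a separate column, whichever suits the query patterns.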
You can read about cataloging rules here: http://en.wikipedia.org/wiki/Library_catalog#Cataloging_rules
For "extended" characters, you have several choices. For some folks, é is a first-class letter and the diacritical is part of it. They aren't confused. For other folks, all of the diacritical characters map onto unadorned characters. This mapping is a feature of some Unicode processing tools.
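For the second choice, here is a minimal sketch of mapping accented characters onto unadorned ones using the standard `String.prototype.normalize` method: decompose to NFD, then drop the combining marks. The function names are hypothetical.

```javascript
// Decompose accented characters (NFD) and remove the combining marks.
function stripDiacritics(text) {
  return text.normalize("NFD").replace(/\p{M}/gu, "");
}

// A case-insensitive, accent-insensitive key for matching search input.
function searchKey(text) {
  return stripDiacritics(text).toLowerCase();
}

console.log(stripDiacritics("Beyoncé"));                    // Beyonce
console.log(searchKey("Beyoncé") === searchKey("beyonce")); // true
```

Letters that have no canonical decomposition, such as ø, pass through unchanged, so this covers only characters built from a base letter plus a combining mark.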
You can read about Unicode diacritical stripping here: http://lexsrv3.nlm.nih.gov/SPECIALIST/Projects/lvg/current/docs/designDoc/UDF/unicode/NormOperations/stripDiacritics.html
http://www.siao2.com/2005/02/19/376617.aspx
In the last 3 companies I've worked at, the phone number columns are of type varchar(n). The reason being that they might want to store extensions (ext. 333). But in every case, the "-" characters are stripped out when inserting and updating. I don't understand why the "ext." characters are okay to store but not the "-" character. Has anyone else seen this, and what explanation can you think of for doing it this way? If all you want to store is the numbers, then aren't you better off using an int field? Conversely, if you want to store the number as a string/varchar, then why not keep all the characters and not bother with formatting on display and cleaning on write?
I'm also interested in hearing about other ways in which phone number storage is implemented in other places.
Quick test: are you going to add/subtract/multiply/divide Phone Numbers? Nope. Similarly to SSNs, Phone Numbers are discrete pieces of data that can contain actual numbers, so a string type is probably most appropriate.
One point with storing phone numbers is the leading 0,
e.g. 01202 8765432
In an int column, the 0 will be stripped off, which makes the phone number invalid.
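A quick illustration of the effect, using a JavaScript number as a stand-in for an integer column:

```javascript
const entered = "01202 8765432";

// Keep only the digits, as a string.
const digits = entered.replace(/\D/g, "");

// Converting to a number silently drops the leading zero.
const asNumber = Number(digits);

console.log(digits);                      // 012028765432
console.log(String(asNumber));            // 12028765432
console.log(String(asNumber) === digits); // false
```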
I would hazard a guess that the - is swapped for spaces because it doesn't actually mean anything,
e.g. 123-456-789 = 123 456 789 = 123456789
Personally, I wouldn't strip out any characters, as depending on where the phone number is from, it could mean different things. Leave the phone number in the exact format it was entered, as obviously that's the way the person who typed it in is used to seeing it.
It doesn't really matter how you store it, as long as it's consistent. The norm is to strip out formatting characters, but you can also store country code, area code, exchange, and extension separately if you have a need to query on those values. Again, the requirement is that it's consistent - otherwise querying it is a PITA.
Another reason I can think of not to store phone numbers as 'numbers' but as strings of characters, is that often enough part of the software stack you'd use to access the database (PHP, I am looking at you) wouldn't support big enough integers (natively) to be able to store some of the longer and/or exotic phone numbers.
The largest number that 32 bits can carry, without sign, is 4294967295. That wouldn't work for many Russian phone numbers; take, for instance, the number 4959261234.
So you have yourself an extra inconvenience of finding a way to carry more than 32-bits worth of number data. Even though databases have long supported very large integers, you only need one bad link in the chain for a showstopper. Like PHP, again.
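The overflow is easy to demonstrate. JavaScript numbers themselves are not limited to 32 bits, so this sketch uses a `Uint32Array` as a stand-in for an unsigned 32-bit variable or column:

```javascript
const UINT32_MAX = 0xffffffff; // 4294967295
const phone = 4959261234;

console.log(phone > UINT32_MAX); // true: the number does not fit in 32 bits

// Storing it in an unsigned 32-bit slot wraps it around modulo 2^32.
const slot = Uint32Array.of(phone);
console.log(slot[0]); // 664293938
```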
Stripping some characters and allowing others may have an impact if the database table is going to drive another system, e.g. IP telephony of some sort. Depending on the systems involved, it may be legitimate to have ext.333 as a suffix, whereas the developers may not have accounted for "-" in the string (and yes, I am guessing here...)
As for storing as a varchar rather than an int, this is just plain-ole common sense to me. As mentioned before, leading zeros may be stripped in an int field, the query on an int field may perform implicit math functions (which could also explain stripping "-" from the text, you don't want to enter 555-1234 and have it stored as -679 do you?)
In short, I don't know the exact reasoning, but can deduce some possibilities.
I'd opt to store the digits as a string and add the various "()" and "-" in my display code. It does get more difficult with international numbers. We handle it by having various "internationalized" display formats depending on country.
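A minimal sketch of this split between stored digits and display formatting. The function names and the single 10-digit North American pattern are assumptions for illustration; a real table of per-country formats would be much larger.

```javascript
// Reduce whatever the user typed to digits only; this is what gets stored.
function toDigits(input) {
  return input.replace(/\D/g, "");
}

// Hypothetical display formats keyed by country; only one pattern is shown here.
function formatForDisplay(digits, country = "US") {
  if (country === "US" && digits.length === 10) {
    return `(${digits.slice(0, 3)}) ${digits.slice(3, 6)}-${digits.slice(6)}`;
  }
  // Unknown country or length: show the stored digits unchanged.
  return digits;
}

const stored = toDigits("555-123-4567");
console.log(stored);                   // 5551234567
console.log(formatForDisplay(stored)); // (555) 123-4567
```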
What I like to do if I know the phone numbers are only going to be within a specific region, such as North America, is to change the entry into 4 fields: 3 digits for area code, 3 for prefix, 4 for line, and maybe 5 for extension. I then insert these as 1 field with '-' and maybe an 'e' to designate extension. Any searching of course also needs to follow the same process. This ensures I get more regular data and even allows for the number to be used for actually making a phone call, once the - and the extension are removed. I can also get back to the original 4 fields easily.
Good stuff! It seems that the main point is that the formatting of the phone number is not actually part of the data but is instead an aspect of the source country. Still, by keeping the extension part of the number as is, one might be breaking the model of separating the formatting from the data. I doubt that all countries use the same syntax/format to describe an extension. Additionally, if integrating with a phone system is a (possible) requirement, then it might be better to store the extension separately and build the message as it is expected. But Mark also makes a good point that if you are consistent, then it probably won't matter how you store it since you can query and process it consistently as well.
Thank you Eric for the link to the other question.
When an automated telephone system uses a field to make a phone call, it may not be able to tell which characters it should use and which it should ignore in dialing. A human being may see a "(" or ")" or "-" character and know these are delimiters separating the area code (NPA), the exchange (NXX), and the line number. Remember, though, that each character represents a binary pattern that, unless the dialer is pre-programmed to ignore it, would be entered by an automated dialer. To account for this, it is better to store only the equivalent of the characters a user would press on the phone handset, and even better to store the individual values in separate columns so the dialer can use individual fields without having to parse the string.
Even if you are not using dialing automation, it is good practice to store data in a form you won't need to clean up in the future. It is much easier to add characters between fields than to strip them out of strings.
On the question of string vs. integer datatypes noted above: strings are the proper way to store phone numbers, given the variations between countries. There is an important caveat, though: when aggregating statistics for reporting (i.e. a SUM of how many numbers or calls), character strings are MUCH slower to count than integers. To account for this, it's important to add an integer identity column that you can use for counting instead of the varchar or char field.