Parsing a domain name - c

I am parsing the domain name out of a string by using strchr() to find the last . (dot) and counting back until the dot before that (if any); then I know I have my domain.
This is a rather nasty piece of code and I was wondering if anyone has a better way.
The possible strings I might get are:
domain.com
something.domain.com
some.some.domain.com
You get the idea. I need to extract the "domain.com" part.
Before you tell me to go search in google, I already did. No answer, hence I am asking here.
Thank you for your help
EDIT:
The string I have contains a full hostname. This usually is in the form of whatever.domain.com but can also take other forms and as someone mentioned it can also have whatever.domain.co.uk. Either way, I need to parse the domain part of the hostname: domain.com or domain.co.uk

Did you mean strrchr()?
I would probably approach this by doing:
1. strrchr to get the last dot in the string, save a pointer here, replace the dot with a NUL ('\0').
2. strrchr again to get the next to last dot in the string. The character after this is the start of the name you are looking for (domain.com).
3. Using the pointer you saved in #1, put the dot back where you set it NUL.
Beware that names can sometimes end with a dot; if this is a valid part of your input set, you'll need to account for it.
Edit: To handle the flexibility you need in terms of example.co.uk and others, the function described above would take an additional parameter telling it how many components to extract from the end of the name.
You're on your own for figuring out how to decide how many components to extract -- as Philip Potter mentions in a comment below, this is a Hard Problem.
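The steps above, extended with the component count, could be sketched roughly as follows. The function name `last_labels` and the limit of 16 components are made up for illustration, and the sketch assumes a writable buffer without a trailing dot.

```c
#include <assert.h>
#include <string.h>

/*
 * Return a pointer to the last `n` dot-separated labels of `host`.
 * Dots are temporarily replaced by NUL and then restored, so `host`
 * must be writable. If `host` has `n` or fewer labels, the whole
 * string is returned. A trailing dot is NOT handled here.
 */
static char *last_labels(char *host, int n)
{
    char *saved[16];            /* dots we overwrote (hypothetical limit) */
    int count = 0;
    char *start = host;

    assert(host != NULL);
    if (n < 1 || n > 16)
        return host;

    while (count < n) {
        char *dot = strrchr(host, '.');
        if (dot == NULL)
            break;              /* ran out of dots: keep the whole string */
        if (count == n - 1) {   /* this dot sits just before the n-th label */
            start = dot + 1;
            break;
        }
        *dot = '\0';            /* hide the tail so strrchr finds the previous dot */
        saved[count++] = dot;
    }

    while (count > 0)           /* put the dots back */
        *saved[--count] = '.';

    return start;
}
```

With a buffer holding some.some.domain.com, asking for 2 components yields domain.com; with www.example.co.uk the caller would have to ask for 3.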

This isn't a reply to the question itself, but an idea for an alternate approach:
In the context of already very nasty code, I'd argue that a good way to make it less nasty, and provide a good facility for parsing domain names and the like, is to use PCRE or a similar library for regular expressions. That will definitely help you out if you also want to validate that the TLD exists, for instance.
It may take some effort to learn initially, but if you need to make changes to existing matching/parsing code, or create more code for string matching - I'd argue that a regex-lib may simplify this a lot in the long term. Especially for more advanced matching.
Another library I recall which supports regex, is glib.

Not sure what flavor of C, but you probably want to tokenize the domain using "." as the separator.
Try this: http://www.metalshell.com/source_code/31/String_Tokenizer.html
As for the domain name, not sure what your end goal is, but domains can have lots and lots of nodes; you could have a domain name like foo.baz.biz.boz.bar.co.uk.
If you just want the last 2 nodes, then use the above and get the last two tokens.
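As a rough sketch of the tokenizing idea using only the standard library's strtok (rather than the linked tokenizer), something like the following keeps the last two tokens. The name `last_two_tokens` and the 256-byte scratch buffer are assumptions made for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Split a copy of `host` on '.' and write the last two tokens, joined
 * by a dot, into `out`. Returns 0 on success, -1 if the input does not
 * fit the scratch buffer or `out`, or contains no tokens.
 */
static int last_two_tokens(const char *host, char *out, size_t outsize)
{
    char copy[256];             /* strtok modifies its input, so use a copy */
    const char *prev = NULL;    /* second-to-last token seen so far */
    const char *last = NULL;    /* last token seen so far */
    char *tok;
    int written;

    assert(host != NULL && out != NULL);
    if (strlen(host) >= sizeof copy)
        return -1;
    strcpy(copy, host);

    for (tok = strtok(copy, "."); tok != NULL; tok = strtok(NULL, ".")) {
        prev = last;
        last = tok;
    }
    if (last == NULL)
        return -1;

    if (prev != NULL)
        written = snprintf(out, outsize, "%s.%s", prev, last);
    else
        written = snprintf(out, outsize, "%s", last);

    return (written < 0 || (size_t)written >= outsize) ? -1 : 0;
}
```

For something.domain.com this gives domain.com, but for foo.baz.biz.boz.bar.co.uk it gives co.uk, which is exactly why a fixed count of two is not enough in general.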

Related

Generating 4-Digit Passwords, But Skipping Bad Words

We have an application that will be generating 4-digit random strings for guest WiFi usage. So you walk into a hotel, get your room key and your WiFi password. I want to make these generated passwords as simple as possible to save calls to the helpdesk, but not so simple that they are easily guessed.
The problem is that inevitably you'll end up with passwords like "POOP" or "DICK". I think a simple solution is to have a database table of the "forbidden" words, and upon generation check the new password against the database first to make sure it isn't a banned word.
I have looked at probably dozens of filtered/banned/censored word lists, but I can't find one that is sufficiently detailed so as to include things like DIKK and P00P, and I don't exactly want to use my time today to try to think of every possible offensive 4-letter combination and type them all out manually.
Does anyone have a good resource/word list that would contain these "potentially-offensive" strings?
First I wrote this as a comment. But then I realized it actually answers your question about skipping offensive words:
Consider generating random strings without vowels. You won't get any actual English words, so you will avoid words like 'tree' as well as near-misses like 'fukc'.
I suggest you use numbers too; it will be "more secure" and you will eliminate this problem.
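A minimal sketch combining the two suggestions above might draw characters from an alphabet with no vowels plus a few digits. The exact alphabet here is an assumption (Y is left out, as are the digits 0, 1, 3 and 4, which are often read as O, I, E and A), and `make_code` is a made-up name. Note that rand() is only a placeholder and is not a good source of randomness for anything security-sensitive.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Consonants only, plus digits that are not commonly read as vowels.
   This particular alphabet is an assumption; adjust as needed. */
static const char ALPHABET[] = "BCDFGHJKLMNPQRSTVWXZ256789";

/*
 * Fill `out` with `len` random characters from ALPHABET and terminate
 * it. `out` must have room for len + 1 bytes. The caller is expected
 * to seed the generator (e.g. with srand) beforehand.
 */
static void make_code(char *out, size_t len)
{
    size_t n = sizeof ALPHABET - 1;     /* exclude the terminating NUL */
    size_t i;

    assert(out != NULL);
    for (i = 0; i < len; i++)
        out[i] = ALPHABET[(size_t)rand() % n];  /* slight modulo bias; fine for a sketch */
    out[len] = '\0';
}
```

Calling make_code(buf, 4) on a 5-byte buffer produces a 4-character code that cannot spell an English word containing a vowel.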

How to avoid appengine search QueryErrors resulting from syntax errors in the query string?

When I enter "I am searching for an entire string on google, it doesn't barf up an error at me. Instead, it tries to guess what I mean. What's the best way to make sure that whatever the user enters as a query, they'll always get something (preferably something useful) back when their query is passed into search.Index("myIndex").search()?
Well, there's no easy way to do that out of the box.
What Google does is run very complex algorithms to infer not only what you said, but "what you actually meant". This of course is a very complex problem to solve, and certainly not something you can do by simply flipping a switch.
One thing that you can do though, is to start using the ~ character in your searches. This is called "Stemming":
To search for common variations of a word, like plural forms and verb
endings, use the ~ stem operator (the tilde character). This is a
prefix operator which must precede a value with no intervening space.
The value ~cat will match "cat" or "cats," and likewise ~dog matches
"dog" or "dogs." The stemming algorithm is not fool-proof. The value
~care will match "care" and "caring," but not "cares" or "cared."
Stemming is only used when searching text and HTML fields.
There is a bunch of other stuff you can do to make your searches "smarter". Check the following link:
https://developers.google.com/appengine/docs/python/search/query_strings

Facebook content collation and non-western encoded characters

If a user writes a string of text in Arabic into a Facebook comment and saves, what is the collation type of the data storage?
I don't believe that they use a MySQL table for comments, but I've just messed with the topic using a localhost MySQL table, where I stored some Arabic in a binary character.
It transformed the text into some presumably escaped sequence of characters, but once saved, it stayed that way.
If you consider i18n, even when I have Facebook set to English, typing in other non-western encoded characters still saves and displays correctly.
Any insight into how they've achieved this?
First; I don't know for sure but I don't believe MySQL comes into play anywhere for this.
The right thing to do is store it UTF-8 in <some-system>, period. Which might as well be MySQL I guess. I don't know specifics but I do believe MySQL (and PHP for that matter**) are not really up-to-par with UTF-8/Unicode support and so they might manifest some "glitches". For example, you need to execute "SET NAMES utf8" or some crazy stuff first thing after opening the connection for utf8 to work at all (which might be why your test didn't work). Also, I remember something about MySQL not supporting 4-byte encoded UTF-8 characters, only up to 3. Don't know if that is true currently, but I vaguely remember something about it. [edit] Should be fixed in 5.5+
I don't know about Arabic but they might be the 4-byte kind. [edit] They should need 2 or 3 bytes.
And while we're on glitches: about PHP I remember stuff like strlen() returning bytes instead of actual characters etc. If I'm not mistaken it has some mb_XXX functions (multibyte string) that should handle UTF-8 better. [edit] Turns out it does.
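The bytes-versus-characters point is not specific to PHP. As an illustration in C (the function name `utf8_codepoints` is made up, and the input is assumed to be valid UTF-8), code points can be counted by skipping continuation bytes:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Count Unicode code points in a NUL-terminated UTF-8 string by
 * counting every byte that is not a continuation byte (10xxxxxx).
 * Assumes the input is valid UTF-8.
 */
static size_t utf8_codepoints(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the five-letter Arabic word written as the bytes "\xD9\x85\xD8\xB1\xD8\xAD\xD8\xA8\xD8\xA7", strlen reports 10 while utf8_codepoints reports 5, since each of these letters takes 2 bytes in UTF-8.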
I don't see how i18n and setting facebook to English (or Swahili for that matter) would affect this at all. It's just the language used in the interface (and maybe/probably affecting datetime formatting etc.) and has nothing to do with user-generated content.
Oh, almost forgot the obligatory The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)-link
** Just mentioning it because it usually goes hand-in-hand with MySQL.

Why does simple email obfuscation work so well?

For example, replacing @ with at. At least one study demonstrated its effectiveness:
To our surprise, none of the crawlers that visited our departmental research and course and research web pages led to any spam on email addresses containing the at.
Another experiment demonstrated the same thing, showing that using at and dot reduced spam by two orders of magnitude.
The first study speculated that spammers obtain enough plain-text email addresses to ignore the obfuscated ones. But parsing at in addition to @ should be trivial. Why don't spammers account for such simple obfuscation?
I am no expert... but it intuitively makes sense that the @ symbol is much less commonly used in non-email-related text. The @ sign is what sets an email address apart from all the other text. If you simply use at, it blends in with normal English.
At is a pretty common word after all :P I'm sure it's still possible to parse out the "at" version of an email, but it takes a much more difficult regex.
Why would they bother if they can find hundreds of thousands of plaintext ones? Additionally, what if an email had the word at or dot in it? With an address like dottie@gmail.com, a simple find and replace would change it to .tie@gmail.com
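To make the find-and-replace pitfall concrete, here is a small sketch of a naive substring replacement. The helper `replace_all` is hypothetical and only meant to show how the local part of an address gets mangled.

```c
#include <assert.h>
#include <string.h>

/*
 * Copy `src` into `out`, replacing every occurrence of `from` with
 * `to`. Returns 0 on success, -1 if `out` is too small.
 */
static int replace_all(const char *src, const char *from, const char *to,
                       char *out, size_t outsize)
{
    size_t fromlen, tolen, used = 0;

    assert(src != NULL && from != NULL && to != NULL && out != NULL);
    fromlen = strlen(from);
    tolen = strlen(to);
    assert(fromlen > 0);
    if (outsize == 0)
        return -1;

    while (*src != '\0') {
        if (strncmp(src, from, fromlen) == 0) {
            if (used + tolen >= outsize)    /* keep room for the NUL */
                return -1;
            memcpy(out + used, to, tolen);
            used += tolen;
            src += fromlen;
        } else {
            if (used + 1 >= outsize)
                return -1;
            out[used++] = *src++;
        }
    }
    out[used] = '\0';
    return 0;
}
```

Replacing "dot" with "." in "dottie at gmail dot com" gives ".tie at gmail . com", whereas replacing " dot " (with the surrounding spaces) leaves the name intact. A harvester would need that kind of context-aware matching to avoid corrupting addresses.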

Display vs. Search vs. Sort strings in a database

Let's say I've got a database full of music artists. Consider the following artists:
The Beatles -
"The" is officially part of the name, but we don't want to sort it with the "T"s if we are alphabetizing. We can't easily store it as "Beatles, The" because then we can't search for it properly.
Beyoncé -
We need to allow the user to be able to search for "Beyonce" (without the diacritic mark) and get the proper results back. No user is going to know how or take the time to type the special diacritical character on the last "e" when searching, yet we obviously want to display it correctly when we need to output it.
What is the best way around these problems? It seems wasteful to keep an "official name", a "search name", and a "sort name" in the database since a very large majority of entries will all be exactly the same, but I can't think of any other options.
The library science folks have a standard answer for this. The ALA Filing Rules cover all of these cases in a perfectly standard way.
You're talking about the grammatical sort order. This is a debatable topic. Some folks would take issue with your position.
Generally, you transform the title to a normalized form: "Beatles, The". Generally, you leave it that way. Then sort.
You can read about cataloging rules here: http://en.wikipedia.org/wiki/Library_catalog#Cataloging_rules
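As a rough illustration of the "Beatles, The" normalization only (this is not an implementation of the ALA Filing Rules, and `make_sort_key` is a made-up name), a sort key could be derived like this. The sketch handles just a leading, exactly capitalized "The ".

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write a filing form of `name` into `out`: a leading "The " is moved
 * to the end ("The Beatles" -> "Beatles, The"); other names are copied
 * unchanged. Returns 0 on success, -1 if `out` is too small.
 */
static int make_sort_key(const char *name, char *out, size_t outsize)
{
    static const char article[] = "The ";
    size_t alen = sizeof article - 1;
    int written;

    assert(name != NULL && out != NULL);
    if (strncmp(name, article, alen) == 0 && name[alen] != '\0')
        written = snprintf(out, outsize, "%s, The", name + alen);
    else
        written = snprintf(out, outsize, "%s", name);

    return (written < 0 || (size_t)written >= outsize) ? -1 : 0;
}
```

The key can be computed when a row is written, or on the fly when sorting, so the display name itself never has to change.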
For "extended" characters, you have several choices. For some folks, é is a first-class letter and the diacritical is part of it. They aren't confused. For other folks, all of the diacritical characters map onto unadorned characters. This mapping is a feature of some Unicode processing tools.
You can read about Unicode diacritical stripping here: http://lexsrv3.nlm.nih.gov/SPECIALIST/Projects/lvg/current/docs/designDoc/UDF/unicode/NormOperations/stripDiacritics.html
http://www.siao2.com/2005/02/19/376617.aspx
