Web client classification using Artificial Neural Network - artificial-intelligence

I have a high traffic web site.
I want to create software which analyses client requests on the fly and decides whether they come from a real user or a botnet bot. For training the neural network to identify legitimate ("good") users I can use logs from periods with no DDoS activity. Once trained, the network would distinguish real users from bots.
What I have:
request URI (and order)
cookie
user agent
request frequency.
Any ideas on how to best design ANN for this task and how to tune it?
Edit: [in response to comments about the overly broad scope of this question]
I currently have a working C# program which blocks clients on the basis of the frequency of identical requests. Now I'd like to improve its "intelligence" with a classifier based on a neural network.
I don't know how to normalize these inputs for ANN and I need suggestions in this specific area.

This isn't really suited to neural networks. Neural networks are great provided (as a rough guide):
You can spare the processing power,
The data is not temporal,
The input data is finite.
I don't think your problem really meets any of these.

Re: normalizing the inputs: you map your input data to a set of symbols (which are then turned into numbers), or you map the inputs to a floating point number where the number represents some degree of intensity. You can map any kind of data to any kind of scheme, but you would really only want to use ANNs when the problem solution is nonlinear (i.e., all the data for one classification CAN'T be clustered on one side of a line with all the data for the other classification on the other side). In both cases you end up with a vector of inputs associated with an output ([BOT, HUMAN], or [BOT, HUMAN, UNKNOWN], or [BOT, PROBABLY-BOT, PROBABLY-HUMAN, HUMAN], etc.).
How do you distinguish between two users coincidentally submitting the exact same book request sequentially in time (let's assume you are selling books)?
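To make that mapping concrete, here is a minimal sketch of how the four inputs listed in the question (URI, cookie, user agent, request frequency) could be turned into a normalized vector. The URI buckets, user-agent vocabulary, and the frequency squashing are illustrative assumptions, not a prescribed design:

    # Minimal sketch: turning one request log entry into a normalized ANN input vector.
    import math

    # Hypothetical vocabularies, built offline from "clean" (no-DDoS) logs.
    URI_BUCKETS  = {"/": 0, "/search": 1, "/product": 2, "/checkout": 3}
    KNOWN_AGENTS = {"Mozilla/5.0": 0, "Opera/9.80": 1, "Googlebot/2.1": 2}

    def to_input_vector(uri, has_cookie, user_agent, req_per_minute):
        # One-hot encode the URI bucket; unknown URIs fall into an "other" slot.
        uri_vec = [0.0] * (len(URI_BUCKETS) + 1)
        uri_vec[URI_BUCKETS.get(uri, len(URI_BUCKETS))] = 1.0

        # One-hot encode the user agent the same way.
        ua_vec = [0.0] * (len(KNOWN_AGENTS) + 1)
        ua_vec[KNOWN_AGENTS.get(user_agent, len(KNOWN_AGENTS))] = 1.0

        # Cookie presence is already binary; request frequency is squashed into
        # [0, 1) so one very aggressive client cannot dominate the scale.
        freq = 1.0 - math.exp(-req_per_minute / 60.0)

        return uri_vec + ua_vec + [1.0 if has_cookie else 0.0, freq]

    # Example: a cookieless client hitting /search 300 times per minute.
    print(to_input_vector("/search", False, "Mozilla/5.0", 300))

The exponential squashing is just one choice; min-max scaling of the frequency over your training logs would serve the same purpose.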

Related

How to secure fingerprint data?

Our apartment association is planning to implement biometric gate passes (fingerprint turnstiles) for all residents. But residents are worried about the privacy of the fingerprint data stored in the databases. This data resides on the association's hard disks, which are intended to be accessed by some contract employees working in our apartment complex.
How can I make sure the data is secure and not misused or sold?
I found something here, can someone explain how?
Actually, a fingerprint template is nothing but the features of the finger, such as crosses, deltas, parallel lines, curves, etc. So, using the fingerprint template provided by regular attendance/access-control machines, you cannot generate an image of the real fingerprint. The templates are not unique; every time you register, you will get a different string. Also, only the machines have the matching algorithm, so a match can be validated only if the user gives a thumb impression to the machine. So you will not have the security threats. If you see anything as a threat, say specifically what threat, and we can provide you a solution.

Save Information as a Data Net

My aim is to write an intelligent ChatBot. It should store known information similarly to the way the human brain does.
That is why I am looking for a file type which stores data as a net of connected keywords. What file type or database system could achieve this?
Further information:
The information input will be Wikipedia, Google search, and facts taught by a human during a conversation.
I could give more specific information about my requirements and wishes, but I don't know whether any approach to this even exists. Maybe there are more useful specifications than my own thoughts.
Just one example: the connections should have weights. Requesting an information net should increase the weights of the used connections.
What I expect is that the ChatBot could get real associations (or ideas) using the data net.
As an extension to my above comments:
A graph is definitely the way you want to go in terms of data representation...it maps perfectly to your problem description.
What you seem to be asking is how you can [persistently] store this information on disk (rather than in memory). That completely depends on what constraints you have. There is the "graph database", which is more geared to storing graphs than, say, relational or hierarchical databases, and would perform far better than pushing your adjacency matrix or list to a flat file. Here's the wikipedia entry:
http://en.wikipedia.org/wiki/Graph_database
Now, there is the issue of what happens when you have so many nodes and edges that you can't load them all into memory at once, and unfortunately if you have nodes that are connected to every other node, that can be a problem (because you won't be able to load the complete/valid graph). I can't answer that right now, but I'm sure there are paradigms to address this problem. I will update my answer after some digging.
Edit: You'll probably have to consult someone who knows more about graph databases. It's possible that there are ways to load chunks of the graph from the database without loading the whole thing. If that is your issue, you may want to rework this into a question about working with large graphs stored in graph databases and post it again in a more specific form, tagged with graphs, databases, algorithms, and the like.
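While that digging happens, here is a minimal in-memory sketch of the weighted keyword net described in the question (the class name, weights, and reinforcement increment are illustrative assumptions); a graph database would only change where the nodes and edges are persisted, not the idea:

    # In-memory sketch of a weighted keyword graph whose edge weights grow
    # every time a connection is used, as requested in the question.
    from collections import defaultdict

    class KeywordNet:
        def __init__(self):
            self.edges = defaultdict(dict)   # node -> {neighbour: weight}

        def connect(self, a, b, weight=1.0):
            self.edges[a][b] = self.edges[a].get(b, 0.0) + weight
            self.edges[b][a] = self.edges[b].get(a, 0.0) + weight

        def associate(self, start, top_n=3):
            """Return the strongest associations and reinforce the used links."""
            neighbours = sorted(self.edges[start].items(),
                                key=lambda kv: kv[1], reverse=True)[:top_n]
            for node, _ in neighbours:
                self.edges[start][node] += 0.1   # requesting the net strengthens
                self.edges[node][start] += 0.1   # the connections that were used
            return [node for node, _ in neighbours]

    net = KeywordNet()
    net.connect("tennis", "sport")
    net.connect("tennis", "racket")
    print(net.associate("tennis"))   # ['sport', 'racket']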

Graph Data Structures with millions of nodes (Social network)

In the context of design of a social network using Graphs data structure, where you can perform a BFS to find a connection from one person to another, I have some questions pertaining to it.
If there are a million users, the topology would indeed be much more complicated and interconnected than the graphs we normally design, and I am trying to comprehend how you could solve these problems.
In the real world, servers fail. How does this affect you?
How could you take advantage of caching?
Do you search until the end of the graph (infinite)? How do you decide when to give up?
In real life, some people have more friends of friends than others, and are therefore more likely to make a path between you and someone else. How could you use this data to pick where you start traversing?
Your question seems interesting and curious :)
1) Well... of course, data is stored on disks, not in RAM.
Disks have systems that guard against failure, RAID-5 for example.
Redundancy is the key: if one system fails there is another system ready to take its place.
There is also redundancy combined with workload sharing: two computers work in parallel and share their jobs, but if one stops, the other takes the full workload.
In places like Google or Facebook the redundancy is not 2, it is more like 1200000000 :)
And consider also that the data is not in a single server farm; Google has several datacenters connected together, so if one building explodes, another one will take its place, for example.
2) Not an easy question at all, but usually these systems have big caches for the disk arrays too, so reading and writing data on disk is faster than on our laptops :)
Data can be processed in parallel by several concurrent systems, and this is the key to the speed of services like Facebook.
3) The graph is not infinite, so exploring it fully is indeed possible with current technology.
The computational complexity of exploring all connections and all nodes on a graph is O(n + m) where n is the number of vertices and m the number of edges.
This means it is linear in the number of registered users and in the number of connections between users. And RAM these days is very cheap.
Since the growth is linear, it is easy to add resources when needed.
Add more computers as you get richer :)
Consider also that no one performs a real search over every node; everything in Facebook is quite "local". You can view the direct friends of one person, not the friends of friends of friends... it would not be useful.
Getting the number of vertices directly connected to a vertex, if the data structure is well designed, is very easy and fast. In SQL it would be a simple SELECT, and if the tables are well indexed it will be very fast and also not very dependent on the total number of users (see the concept of hash tables).
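To illustrate the "when do you give up?" point, here is a small sketch of a depth-limited breadth-first search (the friends data and max_depth value are assumptions for illustration). Capping the depth is what keeps the linear O(n + m) cost manageable in practice:

    # Depth-limited BFS: stop looking for a connection beyond max_depth hops.
    from collections import deque

    def shortest_path(graph, start, goal, max_depth=3):
        """graph: dict mapping a user to an iterable of friends."""
        queue = deque([(start, [start])])
        visited = {start}
        while queue:
            user, path = queue.popleft()
            if user == goal:
                return path
            if len(path) > max_depth:        # give up beyond max_depth hops
                continue
            for friend in graph.get(user, ()):
                if friend not in visited:
                    visited.add(friend)
                    queue.append((friend, path + [friend]))
        return None                          # no connection within max_depth hops

    # Toy example data.
    friends = {"alice": ["bob"], "bob": ["carol"], "carol": ["dave"]}
    print(shortest_path(friends, "alice", "carol"))   # ['alice', 'bob', 'carol']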

Booking logic and architecture, database sync: Hotels, tennis courts reservation system

Imagine that you want to design a tennis booking system.
You have 5 tennis clubs as partners with no online API allowing you to check on their side whether a court is booked or not: you have to build this part as well.
Every time a booking is made on their side you want it to be known by our system, probably using a POST request from the tennis partner to our server.
Every time a booking is made on our website, we want to push the booking to their system. The difficulty is that their system needs to be online and accessible from outside. Their IP may change, so we have to use a DNS updater.
In case their system is not available we still accept the booking and fall back to an asynchronous email with a 'confirm booking / reject booking' link sent to the club.
I find the whole process quite complex and was wondering how online hotel booking systems and hotels work. Do they all have their data open and online?
The good thing is that the data will grow large and fits nicely into some NoSQL ;) like CouchDB.
There are several questions here, let me try and address each one...
Since this appears to be an internet application with federated servers, using the implied HTTP protocol makes a lot of sense. This could be done via form POSTs, GETs, or even RESTful submission of some custom data structure. In the end, the exact approach to use will need to come down to the size and complexity of the information being communicated. Many architectures employ these approaches and often combine them with encrypted, signed, and/or encoded payloads for security. One shortfall to consider with these approaches is that they will require you to clearly communicate all request/response message formats, field ranges, and variations, since these mechanisms are not really self-describing. On the other hand, these patterns use very common protocols, are easily understood, easy to implement, and are typically lean on the wire.
In contrast, architectures with very complex structures often choose to use WSDL-based web services. Also driven by common standards, these tend to be self-describing and inherently versionable, although they can take more time and energy to implement. There are a lot of advantages to web services, driven by the many WS-* standards, which may be worth investigating further in your case.
As for the reservation process... many similar architectures will employ an orchestration model such as the following:
Find open booking spaces
Make a reservation for a booking space. This places an expiring lock on a space while the requestor fills in all required booking information, which guards against race conditions that could lead to multiple bookings for the same space
Once all required booking information is received and validated, the booking is confirmed and permanently locked from use by other requestors (a rough sketch of this flow follows below)
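A minimal sketch of that expiring-lock step, assuming a simple in-memory store (a real system would do the same work inside a database transaction, and the TTL value is an arbitrary assumption):

    # Expiring lock on a booking slot: reserve() holds a slot for a limited time,
    # confirm() turns a still-valid lock into a permanent booking.
    import time
    import uuid

    LOCK_TTL_SECONDS = 300           # time the requestor has to fill in details
    locks = {}                       # slot -> (lock_token, expiry_timestamp)
    confirmed = set()                # permanently booked slots

    def reserve(slot):
        """Place an expiring lock on a slot; return a token, or None if taken."""
        now = time.time()
        if slot in confirmed:
            return None
        _held, expiry = locks.get(slot, (None, 0))
        if expiry > now:             # someone else holds a live lock
            return None
        token = str(uuid.uuid4())
        locks[slot] = (token, now + LOCK_TTL_SECONDS)
        return token

    def confirm(slot, token):
        """Turn a live lock into a permanent booking."""
        held, expiry = locks.get(slot, (None, 0))
        if held == token and expiry > time.time():
            confirmed.add(slot)
            del locks[slot]
            return True
        return False                 # lock expired or was never held

    t = reserve("court-1@2024-06-01T10:00")
    print(confirm("court-1@2024-06-01T10:00", t))   # True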
As for the SQL-style DB comment, I can't really say given the amount of information supplied. With that said, my instincts tell me a SQL-style DB is completely reasonable for this problem set. I have databases with many petabytes and very high SLAs. You implied a need for high availability, and SQL-based databases have a few decades of proven support behind them in this area.
Hope this helps.
I think you will find most on-line hotel reservation systems aren't really on-line. My experience is that the companies (not the hotels themselves) offering on-line booking systems also insist that the hotel itself books its rooms on-line using the same system.
Everything works fine as long as connectivity is not an issue - and in the small-motel scenario it normally will be. Of course the bigger hotels use the same system the airlines do, and they have dedicated communications links for the purpose. The reservations are of course maintained on one central computer with appropriate backup links, etc.
It is very easy for individual tennis clubs to offer their own real-time online booking systems using their own database/website with programs like MyCourts. However, once you want to link more than one club's facilities, you really don't have much option other than a centralized server that both the user and the club have to use to reserve facilities.

Common web problems where Neural Networks could help

I was wondering if you creative minds out there could think of some situations or applications in the web environment where Neural Networks would be suitable or an interesting spin.
Edit: Some great ideas here. I was thinking more web centric. Maybe bot detectors or AI in games.
To name a few:
Any type of recommendation system (whether it's movies, books, or targeted advertisement)
Systems where you want to adapt behaviour to user preferences (spam detection, for example)
Recognition tasks (intrusion detection)
Computer Vision oriented tasks (image classification for search engines and indexers, specific objects detection)
Natural Language Processing tasks (document/article classification, again search engines and the like)
The game located at 20q.net is one of my favorite web-based neural networks. You could adapt this idea to create a learning system that knows how to play a simple game and slowly learns how to beat humans at it. As it plays human opponents, it records data on game situations, the actions taken, and whether or not the NN won the game. Every time it plays, win or lose, it gets a little better. (Note: don't try this with too simple a game like checkers; an overly simple game can have every possible combination of moves pre-computed, which defeats the purpose of using the NN.)
Any sort of classification system based on multiple criteria might be worth looking at. I have heard of some company developing a NN that looks at employee records and determines which ones are the least satisfied or the most likely to quit.
Neural networks are also good for doing certain types of language processing, including OCR or converting text to speech. Try creating a system that can decipher captchas, either from the graphical representation or the audio representation.
If you screen-scrape or accept other sites' item sales info for price comparison, a NN can be used to flag possible errors in the item description for a human to then eyeball.
Often, as one example, computer hardware descriptions are wrong about the capacity, speed, or features portrayed. Your NN will learn that a video card description should generally not contain a "RAID 10" string. If there is a trend to add RAID to GPUs, then your NN will learn this over time as the eyeball-er accepts such adverts, teaching the NN that this is now a new class of hardware.
This hardware example can be extended to other industries.
Web advertising based on consumer choice prediction
Forecasting a user's Web browsing direction at micro scale and over the very short term (the current session). This idea is quite similar to, a generalisation of, the first one. A user browsing the Web could be offered suggestions of other potentially interesting websites. The suggestions could be relevance-ranked according to a prediction calculated in real time during the user's activity. For instance, a list of proposed links or categories or tags could be displayed in the form of a cloud, with font size indicating rank score. Each and every click a user makes is an input to the forecasting system, so the forecast is constantly refined to provide the user with suggestions that match the user's interests as accurately as possible.
Ignoring the "common web problems" angle and taking the "interesting spin" view instead.
One of the many ways that a NN can be viewed/configured is as a giant self-adjusting, multi-input, multi-output kind of case flow control.
So when you want to offer match-ups that are fuzzy (not to be confused with fuzzy logic per se, which is another area of maths/computing), a NN may offer a usable alternative.
So to save energy, you offer a lift club site, one-offs or regular trips. People enter where they are, where they want to go and at what time. Sort by city and display in browse control.
Using a NN you could, over time, match transport owners to transport seekers by watching which owners and seekers link up, since an owner may not live in the same suburb that a seeker resides in. The NN learns over time what variance in physical distance between owners and seekers appears to be acceptable, so it can then expand its search area when offering a seeker potential owners.
An idea.
Search! Recognize! Classify! Basically everything search engines do nowadays could benefit from a dose of neural networks and fuzzy logic. This applies in particular to multimedia content (e.g. content-indexing images and videos) since that's where current search technologies are lagging behind.
One thing that always amazes me is that we still don't have any pseudo-intelligent firewalling technology. Something that says "hey his range of urls is making too much requests when they are not supposed to", blocks them, and sends a report to an administrator. That could be done with a neural network.
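As a baseline for that idea, here is a toy sliding-window rate check (the window size and threshold are arbitrary assumptions); a neural network would essentially learn the threshold and the notion of "not supposed to" instead of hard-coding them:

    # Toy sliding-window rate limiter: flag a client that makes too many
    # requests within the window; a NN could replace the fixed threshold.
    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 10
    MAX_REQUESTS = 100                       # fixed threshold (the dumb baseline)
    recent = defaultdict(deque)              # client ip -> request timestamps

    def should_block(ip, now=None):
        now = time.time() if now is None else now
        q = recent[ip]
        q.append(now)
        while q and q[0] < now - WINDOW_SECONDS:   # drop timestamps outside window
            q.popleft()
        if len(q) > MAX_REQUESTS:
            # here one would block the range and report to an administrator
            return True
        return False

    # Simulate 150 requests from one client arriving in the same second.
    print(any(should_block("10.0.0.1", now=1000.0) for _ in range(150)))   # True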
On the nasty part of things, some virus makers could find lucrative uses for neural networks. Adaptive trojans that "recognize" credit card numbers on a hard drive (instead of looking for certain cookies) or that "learn" how to mask themselves from detectors automatically.
I've been having fun trying to implement a bot based on a neural net for the Diplomacy board game, interacting via DAIDE protocols. It turns out to be extremely tricky, so I've turned to XCS to simplify the problem.
Suppose eBay used neural nets to predict how likely a particular item was to sell, predict what the best day to list items of that type would be, suggest a starting price or "Buy It Now" price, or grade your description based on how likely it was to attract buyers? All of those could be useful features, if they worked well enough.
Neural net applications are great for representing discrete choices and the whole behavior of how an individual acts (or how groups of individuals act) when mucking around on the web.
Take news reading for instance:
Back in the olden days, you picked up usually one newspaper (a choice), picked a section (a choice), scanned a page and chose an article (a choice), and read the basics or the entire article (another choice).
Now you choose which news site to visit and continue as above, but now you can drop one paper, pick up another, click on ads, change sections, and keep going with few limits.
The whole use of the web and the choices people make based on their demographics, interests, experience, politics, time of day, location, etc. is a very rich area for NN application. This is especially relevant to news organizations, web page design, ad revenue, and may even be an under explored area.
Of course, it's very hard to predict what one person will do, but put 10,000 of them that are the same age, income, gender, time of day, etc. together and you might be able to predict behavior that will lead to better designs. Imagine a newspaper (or even a game) that could be scaled to people's needs based on demographics. An ad man's dream!
How about connecting users to the closest DNS, and making sure there are as few bounces as possible between the request and the destination?
Friend recommendation in social apps (LinkedIn, Facebook, etc.)
