Traffic profiling: how to distinguish between streaming, downloading, and other services?

I'm a libpcap and Wireshark novice. For my school project I have to distinguish between different types of traffic (SMTP, web traffic, VoIP, online gaming, downloading, streaming, ...).
At first I relied on port numbers (25 for SMTP, 80/443 for HTTP/HTTPS, ...), but some problems came up: more and more sites support HTTPS (so, no more payload inspection), and the port number alone can't tell me the important differences (port 443 may carry different types of services).
So I thought of classifying traffic according to some known behaviours. For example, downloading and streaming have different bandwidth (bitrate) profiles: the first has constant high bandwidth, while the second has spikes of high bandwidth that drop back to zero once you have the "piece" you need.
Because of my unfamiliarity with the topic, this is the only such behaviour I've found on the Web.
Can anyone point me in the right direction?

1. Use Wireshark to partition your traffic into sessions.
2. For those where categorization is clear based on protocol/port, categorize directly (e.g. port 25 = SMTP should be a given).
3. For those that need further analysis, find appropriate features, such as:
average packet size,
packet size std deviation/variance,
packets per second in upstream/downstream direction,
overall amount of data,
up/downstream data amount ratio,
up/downstream packet number ratio,
and much more you could think of.
4. With the numerical values for the features from 3., build vectors and apply all your classification knowledge: maybe this is a case for support vector machines? Maybe you just look at the clusters you might see and come to conclusions? Maybe you just generate "known" traffic of all relevant kinds, map it into that vector space, and categorize each unknown session as the Euclidean-distance-closest known traffic type? Maybe you pre-condition your vectors with what you learn from a principal component analysis?
As you can see in 4., there are a lot of tools for classification, and you will need some proficiency in classification theory to deal with your problem.
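To make the nearest-centroid option from 4. concrete, here is a minimal sketch in Go. The five features and all centroid values are invented for illustration; with real captures you would compute centroids from labeled sessions and normalize each feature first (e.g. z-scores), since the raw scales differ wildly.

package main

import (
	"fmt"
	"math"
)

// Features per session: avg packet size, packet size std dev,
// packets/s upstream, packets/s downstream, up/down byte ratio.
type FeatureVector [5]float64

func euclidean(a, b FeatureVector) float64 {
	var sum float64
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return math.Sqrt(sum)
}

// classify returns the label of the Euclidean-closest known centroid.
func classify(unknown FeatureVector, known map[string]FeatureVector) string {
	best, bestDist := "", math.Inf(1)
	for label, centroid := range known {
		if d := euclidean(unknown, centroid); d < bestDist {
			best, bestDist = label, d
		}
	}
	return best
}

func main() {
	// Hypothetical centroids, computed beforehand from labeled captures.
	known := map[string]FeatureVector{
		"download":  {1400, 120, 8, 900, 0.01},
		"streaming": {1300, 400, 10, 450, 0.02},
		"voip":      {160, 30, 50, 50, 1.0},
	}
	session := FeatureVector{1350, 380, 9, 500, 0.02}
	fmt.Println("classified as:", classify(session, known)) // streaming
}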

Related

Historic Azure Map data

My company is interested in using Azure Maps for traffic data, specifically data on the traffic density surrounding a garage location. Keeping the garage location in the center, we are trying to find the traffic flow (is it heavy traffic, light traffic, road closed, a traffic jam, etc.), and we are also trying to find the speed limit of each road. My question is: does anyone know if Azure Maps can provide this information?
Thank you in advance.
Historical traffic data is not currently available in Azure Maps. However this is something that we are investigating as a potential future feature.
Real-time traffic data is available. Details on all the traffic services can be found here: https://learn.microsoft.com/en-us/rest/api/maps/traffic The traffic flow segment service sounds like it might be what you are looking for. The vector tiles could also be used, and would be more efficient if you needed to analyze a large number of roads or a large area, but would be more dev work. The flow data has a free-flow speed, which is not the speed limit but the speed traffic generally travels at (usually close to the speed limit). The actual speed limit data can be retrieved using the reverse geocoding service: https://learn.microsoft.com/en-us/rest/api/maps/search/getsearchaddressreverse Be sure to set the returnSpeedLimit option.
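As a hedged sketch in Go of that reverse-geocoding call with returnSpeedLimit set, following the linked docs (the subscription key and coordinates are placeholders, and the exact response layout should be checked against the docs):

package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	q := url.Values{}
	q.Set("api-version", "1.0")
	q.Set("query", "47.6062,-122.3321") // lat,lon near the garage (example)
	q.Set("returnSpeedLimit", "true")   // ask for the posted speed limit
	q.Set("subscription-key", "YOUR-AZURE-MAPS-KEY")

	resp, err := http.Get("https://atlas.microsoft.com/search/address/reverse/json?" + q.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body)) // JSON; the speed limit appears in the address result when available
}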

"Standard" approach to collecting data from/distributing data to multiple devices/servers?

I'll start with the scenario I am most interested in: we have multiple devices (2-10) which all need to know about a growing set of data (thousands to hundreds of thousands of small chunks, say 100-1000 bytes each). Data can be generated on any device, and we want every device to be able to get all the data (edit: ...eventually. Devices are not connected and/or online all the time, but they synchronize now and then). No data needs to be deleted or modified.
There are of course a few naive approaches to handle this, but I think they all have some major drawbacks. Naively sending everything I have to everyone else will lead to poor performance, with lots of old data being sent again and again. Sending an inventory first and then letting other devices request what they are missing won't do much good for small data. So maybe having each device remember when and with whom it last talked could be a worthwhile tradeoff? As long as the number of partners is relatively small, saving the date of our last sync does not use much space, and it should be easy to just send what has been added since then. But that's all just conjecture.
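A minimal sketch (in Go, with made-up types) of that last idea: an append-only log plus a per-peer high-water mark, so each sync ships only what was added since the last one.

package main

import "fmt"

type Chunk struct {
	Seq  int    // local, monotonically increasing sequence number
	Data []byte // 100-1000 bytes in the scenario above
}

type Device struct {
	ID       string
	Log      []Chunk        // nothing is ever deleted or modified
	LastSent map[string]int // peer ID -> highest Seq already sent to them
}

// DeltaFor returns only the chunks the peer has not yet received from us.
func (d *Device) DeltaFor(peer string) []Chunk {
	from := d.LastSent[peer] // zero value 0 means "never synced"
	var delta []Chunk
	for _, c := range d.Log {
		if c.Seq > from {
			delta = append(delta, c)
		}
	}
	return delta
}

// Receive appends chunks obtained from a peer to the local log.
func (d *Device) Receive(chunks []Chunk) {
	for _, c := range chunks {
		d.Log = append(d.Log, Chunk{Seq: len(d.Log) + 1, Data: c.Data})
	}
}

func main() {
	a := &Device{ID: "a", LastSent: map[string]int{}}
	a.Log = []Chunk{{1, []byte("x")}, {2, []byte("y")}}

	delta := a.DeltaFor("b") // first sync: everything
	fmt.Println("sending", len(delta), "chunks to b")
	a.LastSent["b"] = 2 // remember what b now has
}

With more than two devices syncing transitively, this per-peer counter generalizes to version vectors (one counter per originating device), which is the usual search term for the theory behind this.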
This could be a very broad topic, and I am also interested in the problem as a whole: (decentralized) version control probably does something similar to what I want, as does software that syncs photos from a user's smartphone, tablet, and camera to online storage, and so on. Somehow they're all different though, and there are many factors to keep in mind, like data size, bandwidth, consistency requirements, processing power, or how many devices have accumulated new data between syncs. So what is the theory about this?
Where do I have to look to find papers and such about what works and what doesn't? Or is each case just so different from all the others that there are no good all-round solutions?
Clarification: I'm not looking for ready-made software solutions/products. It's more like the question of which search algorithm to use to find paths in a graph: computer science books will probably tell you it depends on the features of the graph (directed? weighted? hypergraph? Euclidean?) or on whether you will eventually need every possible path or just a few; there are different algorithms for whatever you need. I also considered posting this question on https://cs.stackexchange.com/.
In your situation, I would investigate a messaging service that implements the AMQP standard, such as RabbitMQ or OpenAMQ. Each time a new chunk is emitted, it should be sent to the AMQP broker, which will broadcast it to all device queues. Then the message may be pushed to the consumers or pulled from the queue.
You can also consider Kafka for streaming data from several producers to several consumers. Another possibility is ZeroMQ. It depends on your specific needs.
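As a concrete sketch of the broker setup (using the github.com/rabbitmq/amqp091-go client; broker URL, exchange and queue names are assumptions for illustration): publish each new chunk to a fanout exchange, and give every device its own durable queue bound to that exchange.

package main

import amqp "github.com/rabbitmq/amqp091-go"

func main() {
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		panic(err)
	}

	// Fanout exchange: every bound queue gets a copy of each message.
	if err := ch.ExchangeDeclare("chunks", "fanout", true, false, false, false, nil); err != nil {
		panic(err)
	}

	// Each device declares and binds its own durable queue, e.g. "device-42".
	q, _ := ch.QueueDeclare("device-42", true, false, false, false, nil)
	ch.QueueBind(q.Name, "", "chunks", false, nil)

	// Publish a new chunk; the broker copies it into every device queue.
	err = ch.Publish("chunks", "", false, false, amqp.Publishing{
		ContentType: "application/octet-stream",
		Body:        []byte("new data chunk"),
	})
	if err != nil {
		panic(err)
	}
}

The durable, named queue per device matters for your scenario: since devices are offline much of the time, the broker holds their copies until they reconnect.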
Have you considered using Amazon Simple Notification Service (SNS) to solve this problem?
You can create a topic for each group of devices you want to keep in sync. Whenever there is an update to the dataset, a device can publish to the topic, which in turn will be pushed to all devices via SNS.
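A minimal publish sketch with the AWS SDK for Go (v1); the topic ARN is a placeholder. For devices that are mostly offline you would typically pair SNS with an SQS queue per device, so missed notifications are retained until the device reconnects.

package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sns"
)

func main() {
	// One topic per device group; the ARN below is a placeholder.
	sess := session.Must(session.NewSession())
	svc := sns.New(sess)

	_, err := svc.Publish(&sns.PublishInput{
		TopicArn: aws.String("arn:aws:sns:us-east-1:123456789012:device-group-1"),
		Message:  aws.String("new data chunk available"),
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("published")
}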

Realtime game on Google Cloud : Channel API or Compute Engine?

We need to develop a multiplayer game with real-time performance.
It needs to work worldwide (servers in America, Europe, Asia) and support huge traffic, using Google Cloud services for the hosting.
We're thinking of references like Jam with Chrome, Chrome Maze or Cube Slam.
The game :
2 players compete in a race
We need to simultaneously display the progression of the 2 players
Each match could last around 30 to 45 seconds
The hosting :
We will obviously host the website on AppEngine, automagically scaling,
but are thinking about 2 solutions for the real-time servers :
Using websocket servers with Compute Engine
Like they did for Jam with Chrome, Maze, etc.
Developing our own websocket servers (technology TBD), deploying on datacenters in Europe, US, Asia, handling scaling, syncing between them, computing latency issues on servers and clients, etc.
But it's technically challenging, as we are very short on time and are missing a sysadmin/network guy for now.
Or using Channel API
We understand that it's not a websocket platform, and that real-time performance is lower.
But it would be much simpler and safer for us, given the time we have.
So, we would also like to know more about that.
In any case, we think we could use some graphical tricks on front ends, to make it look like real-time, but it really depends if we have a 100~500ms or a 500ms~10s latency.
Some questions :
What would the latency range values look like for the different solutions ?
(Jam w/ Chrome got 100ms with GCE, could Channel API reach several seconds ?)
How would Channel API servers handle high traffic, how does scaling work, could the latency go very high ? (no info about that on Channel docs ?)
What if someone in France plays with someone in the US, connecting to different servers that have to sync with each other: how do we deal with that?
Any advice or experience to share ?
Any interesting reading or viewing ? (seen some but not very precise)
Any other solution ?
Thank you for any helping comment !
EDIT :
Only 2 players connected together, potentially from different world zone, no broadcasting needed.
We could find some front-side tricks to avoid server-side processing. This is a race between 2 players, so we actually just need to compare their progression, and exact winner resolution is not that important as there is nothing real to win; this is more for fun.
If you need a server for processing the data:
I would definitely go with websockets at Compute Engine!
The Channels API is much slower, and also quite unpredictable (latency differs from message to message)! Data has to go to the Channels server, which sends it to the App Engine instance, which has to do a request back to the Channels server, which will push the message to the client. There is too much going on there if you want to keep latency down!
Here is a Channels API stress test:
http://channelapistresstest.appspot.com/
Try clicking the "send 5" button a lot, and you will see latency numbers going up to several seconds.
The Channels API is also quite expensive under heavy load (it probably does not scale well, even if Google of course can solve that with more instances).
When keeping latency down, geolocation is quite important. With a websocket server on Compute Engine, you can send your European visitors to Google's European datacenter and your American visitors to the US datacenter (using the geolocation headers that App Engine provides). You have no such control with the Channels API (or App Engine, which all your messages are relayed through). Maybe Google has edge servers for the Channels API (I don't know), but if your App Engine instance is on the other side of the planet, that does not matter.
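As an illustration of that geo-routing, here is a sketch in Go of an App Engine front-end handler that picks a regional websocket endpoint from the X-AppEngine-Country request header (a real header App Engine adds to inbound requests); the region mapping and hostnames are made-up examples.

package main

import "net/http"

// Map ISO country codes to the nearest regional websocket server.
var regionWS = map[string]string{
	"US": "wss://us.game.example.com/ws",
	"FR": "wss://eu.game.example.com/ws",
	"DE": "wss://eu.game.example.com/ws",
}

func wsEndpoint(w http.ResponseWriter, r *http.Request) {
	country := r.Header.Get("X-AppEngine-Country")
	endpoint, ok := regionWS[country]
	if !ok {
		endpoint = "wss://us.game.example.com/ws" // default region
	}
	w.Write([]byte(endpoint)) // the client then opens its websocket here
}

func main() {
	http.HandleFunc("/ws-endpoint", wsEndpoint)
	http.ListenAndServe(":8080", nil)
}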
If you do NOT need a server for processing the data:
You should establish a peer-to-peer connection with WebRTC, sending stuff directly between the users' browsers. That is what Cube Slam does. (WebRTC requires some initial handshaking ("signaling") so the two peers can find each other, and the Channels API would work fine for that: it's just a couple of messages to establish the peer-to-peer connection.)
The WebRTC DataChannels API will give you a nice websocket-like interface, like channel.onmessage = function(e) { yadayada()... }; and channel.send("yadayada");, to send your data between the peers.
Occasionally, WebRTC is not able to make a peer-to-peer connection. Then it will fall back to a TURN server, which relays traffic between the peers. Cube Slam is using TURN servers running on ComputeEngine (in both Europe and America to keep latency down), but that is just the fallback when true peer-to-peer is not possible.
It also depends on other things, like scalability.
Ingress is built on App Engine and, apart from the occasional cache glitch, it is pretty impressive.
Remember that the Channel API uses Google Talk, which is the service that Hangouts is built on. Scalable and real-time.
Personally, if your traffic levels are going to be erratic and unpredictable, go App Engine. If you think they can be controlled and predicted, use Compute Engine or something else.
Alfred's answer is the best in the frame of the question I asked.
Thank you very much !
However, I forgot to mention a few important points, and the scope has changed a bit:
We have very little development time (about 1 week only)
This is for a campaign that will last 3 weeks only (we'll need to keep it online a few months afterward, but it's not like we need a long-lasting architecture)
We need to make it work for the broadest browser audience possible (WebRTC only runs on Chrome & Firefox for now)
Given these points, we eventually came up with a 3rd solution:
Using a real-time PaaS.
It's much easier and faster to develop, much cheaper since we don't need a solid backend developer and a system/network admin, and we can concentrate more on the project than on the infrastructure and platform.
There are a couple of services out there that seem good, already hosting MMORPGs and the like worldwide, with low latency and good scaling systems.
Here is a list of providers :
https://github.com/leggetter/realtime-web-technologies-guide/blob/master/guide.md

Use Google Go's Goroutines To Create A Bayes Network

I have a large dataset of philosophic arguments, each of which connect to other arguments as proof or disproof of a given statement. A root statement can have many proofs and disproofs, each of which may also have proofs and disproofs. Statements can also be used in multiple graphs, and graphs can be analyzed under a "given context" or assumption.
I need to construct a Bayesian network of related arguments, so that each node propagates influence fairly and accurately to its connected arguments. I need to be able to calculate the probability of chains of connected nodes concurrently, with each node requiring datastore lookups that must block to get results; the process is mostly I/O bound, and my datastore connection can run asynchronously in Java, Go and Python {Google App Engine}. Once each lookup completes, it propagates the effects to all other connected nodes until the probability delta drops below a threshold of irrelevance {currently 0.1%}. Each step of the process must calculate chains of connections, then sum up all the results across all queries to adjust validity results, with results chained outward to any connected arguments.
In order to avoid recursing infinitely, I was thinking of using an A*-like process in goroutines to propagate updates through the argument maps, with a heuristic based on compounding influence which ignores nodes once the probability of influence dips below, say, 0.1%. I'd tried to set up the calculations with SQL triggers, but it got complex and messy way too fast. Then I moved to Google App Engine to take advantage of asynchronous NoSQL, and it was better, but still too slow. I need to run the updates fast enough to get a snappy UI, so when a user creates or votes for or against a proof or disproof, they can see the results reflected in the UI immediately.
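For what it's worth, here is a sketch in Go of just the propagation bookkeeping described above: a best-first walk keyed by the magnitude of the pending delta, which stops expanding once deltas fall below the 0.1% threshold. The graph, the weights, and the idea that influence simply compounds multiplicatively are illustrative assumptions; this is not the Bayesian update itself.

package main

import (
	"container/heap"
	"fmt"
	"math"
)

type update struct {
	node  string
	delta float64 // pending change in validity probability
}

// maxHeap orders pending updates by |delta|, largest first.
type maxHeap []update

func (h maxHeap) Len() int            { return len(h) }
func (h maxHeap) Less(i, j int) bool  { return math.Abs(h[i].delta) > math.Abs(h[j].delta) }
func (h maxHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *maxHeap) Push(x interface{}) { *h = append(*h, x.(update)) }
func (h *maxHeap) Pop() interface{} {
	old := *h
	u := old[len(old)-1]
	*h = old[:len(old)-1]
	return u
}

const threshold = 0.001 // 0.1%: deltas below this are ignored

type edge struct {
	to string
	w  float64 // influence weight in (0,1)
}

// Hypothetical argument graph.
var edges = map[string][]edge{
	"root":   {{"proof1", 0.8}, {"disproof1", 0.5}},
	"proof1": {{"proof2", 0.6}},
}

func propagate(start string, delta float64, apply func(string, float64)) {
	h := &maxHeap{{start, delta}}
	heap.Init(h)
	for h.Len() > 0 {
		u := heap.Pop(h).(update)
		if math.Abs(u.delta) < threshold {
			break // largest pending delta is below the cutoff, so all are
		}
		apply(u.node, u.delta) // adjust the node's validity here
		for _, e := range edges[u.node] {
			heap.Push(h, update{e.to, u.delta * e.w}) // influence compounds
		}
	}
}

func main() {
	propagate("root", 0.2, func(n string, d float64) {
		fmt.Printf("update %s by %+.4f\n", n, d)
	})
}

Because every edge weight is below 1, deltas shrink at each hop, so even cyclic argument graphs stop generating work once they cross the threshold, which is the property that avoids infinite recursion.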
I think Go is the language of choice to support the concurrency I need, but I'm open to suggestions. The client is a monolithic JavaScript app that just uses XHR and websockets to push and pull argument maps {and their updates} in real time. I have a Java prototype that can compute large chains in 10~15 s, but performance monitoring shows that most of my runtime is wasted on synchronization and overhead from ConcurrentHashMap.
If there are other highly-concurrent languages worth trying out, please let me know. I know java, python, go, ruby and scala, but will learn any language if it suits my needs.
Similarly, if there are open source implementations of huge Bayesian networks, please leave a suggestion.
I think it's a bit difficult to tell what you are asking about. Maybe you can elaborate on your question.
Goroutines are quite cheap, and are a perfect match for modern web applications which use XHR or Websockets heavily (and other I/O bound applications which have to wait for database responses and stuff like that). Additionally, the go runtime is also able to execute those goroutines in parallel, so that Go is also a good fit for CPU bound tasks, which should take advantage of multiple cores and the speed of a natively compiled language.
But you should also keep in mind that goroutines and channels aren't free. They still require some amount of memory, and each synchronization point (e.g. a channel send or receive) comes at a cost. That's normally not a problem, since the synchronization is extremely cheap in comparison to, say, a database query, but it might not be suited for building efficient Bayesian networks, especially if the actual work of each goroutine/node is negligible in comparison to the synchronization overhead.
Your primary goal for every concurrent program should be to avoid shared mutability as far as possible. So a Bayesian network modeled with goroutines and channels might be a good educational example and a great way to measure the performance of Go's channel implementation, but it's probably not the best fit for your problem.
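To make that trade-off concrete, here is a toy "one goroutine per node" sketch in Go, where each node receives deltas on its own channel and forwards a damped delta downstream. Every send and receive is a synchronization point, which is exactly the overhead that can dominate when the per-node work is this small. The chain shape and all numbers are made up.

package main

import (
	"fmt"
	"sync"
)

type node struct {
	name   string
	in     chan float64
	out    []chan float64 // channels of downstream neighbors
	weight float64        // influence damping per hop
}

func (n *node) run(wg *sync.WaitGroup) {
	defer wg.Done()
	for delta := range n.in {
		if delta >= 0.001 { // the 0.1% relevance threshold
			fmt.Printf("%s adjusted by %+.4f\n", n.name, delta)
			for _, out := range n.out {
				out <- delta * n.weight // one synchronization per edge
			}
		}
	}
	for _, out := range n.out {
		close(out) // propagate shutdown downstream
	}
}

func main() {
	a := &node{name: "root", in: make(chan float64, 1), weight: 0.5}
	b := &node{name: "proof1", in: make(chan float64, 1), weight: 0.5}
	c := &node{name: "proof2", in: make(chan float64, 1), weight: 0.5}
	a.out = []chan float64{b.in}
	b.out = []chan float64{c.in}

	var wg sync.WaitGroup
	wg.Add(3)
	go a.run(&wg)
	go b.run(&wg)
	go c.run(&wg)

	a.in <- 0.2 // inject one update at the root
	close(a.in)
	wg.Wait()
}

Note that this shutdown scheme only works cleanly for tree-shaped graphs; shared or cyclic structure would need reference counting or a separate quit signal, which is more of the synchronization overhead the answer above warns about.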

Calculate server requirements based on programming specs

Have you ever encountered something easy to develop, but then stopped for a while to think about the server requirements for your project? That's my case.
I want to compete with a gaming site, they have multiplayer Flash games like poker, rummy, backgammon, and other card games, 8 games in total. For each game they have rooms and tables.
I'll use Silverlight with sockets. I already managed to develop the policy server, the socket server app using WinForms, and the client socket app in Silverlight. I own a VPS for tests, so there is no problem developing what I want; the problem is how to calculate the server requirements (RAM, bandwidth, internet speed) based on the following:
The server should support 24,000 users/day, or 1,000 users/hour
Each game room should have its own tables where users can play
Users should not lose scores, and game speed should be fast in general
I just wonder how to handle the following situation: if 1,000 users are connected through a socket connection to a room full of tables, and one user leaves a table, all 1,000 users must be updated and the UI should reflect the change. Let's say I update the clients by sending a small message of 100 bytes to each user; this costs 100 bytes * 1,000 users = 100 KB, and that's just for 1 UI change, for 1 game and 1 room, not counting all my other games and rooms. Also, 1,000 iterations sending bytes to clients could be very time-consuming. I am a developer, but not experienced in these situations. Please advise; numbers would be great.
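Writing that arithmetic out as a small Go estimate (the 100-byte payload and 1,000 users come from the question; the per-packet overhead and update rate are assumptions):

package main

import "fmt"

func main() {
	const (
		payload   = 100.0  // bytes per update message (from the question)
		overhead  = 40.0   // assumed TCP/IP header bytes per packet
		users     = 1000.0 // users connected to one room (from the question)
		updatesPS = 10.0   // assumed UI-relevant events per second per room
	)

	perEvent := (payload + overhead) * users // bytes sent for one table change
	perSecond := perEvent * updatesPS        // bytes/s for one busy room

	fmt.Printf("one event:  %.0f KB\n", perEvent/1000) // ~140 KB
	fmt.Printf("per second: %.2f MB/s (%.1f Mbit/s)\n",
		perSecond/1e6, perSecond*8/1e6) // ~1.4 MB/s, ~11.2 Mbit/s per room
}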
Until you've built -- and optimized -- your actual applications, you cannot predict much about the hardware required for some level of performance.
You have to finish the apps first. Then you can measure their performance under load. Then you can decide how much to spend on what levels of performance.
The best answer I can offer you is to run stress tests and see how much load a single server can support. While running those tests, monitor memory, IO, CPU and disk activity (if relevant) to understand which resource is running out first.
We deploy our applications on Amazon's EC2 cloud infrastructure. That lets us easily (within minutes) add or remove capacity as needed. Perhaps it's worth considering for your situation.
Always follow these two rules
“The First Rule of Program Optimization: Don't do it. The Second Rule of Program Optimization (for experts only!): Don't do it yet.” - Michael A. Jackson
First of all, you should think more about how and when to send which information to which clients. Not every client needs to be informed about every table change.
There is only so much information that a client needs, and you have to decide when and how it will be transmitted. You should also pack the information into meaningful packets: what's happening at a table is only interesting to that table (see the sketch below).
You also need to profile your application to make sure you know what resources it consumes. Card games should not eat up that many resources. But the important point is to FIRST build it, and when you HAVE a bottleneck, then try to fix it.
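A sketch of that idea in Go: keep a membership list per table and broadcast table events only to it, so one player leaving notifies a handful of clients instead of all 1,000 in the room. All types and names are made up.

package main

import "fmt"

type Client struct{ ID string }

func (c *Client) Send(msg string) {
	// In the real app this would write to the client's socket.
	fmt.Printf("-> %s: %s\n", c.ID, msg)
}

type Table struct {
	Name    string
	Members map[string]*Client // only these clients get table events
}

func (t *Table) Broadcast(msg string) {
	for _, c := range t.Members {
		c.Send(msg)
	}
}

func (t *Table) Leave(id string) {
	delete(t.Members, id)
	t.Broadcast("player " + id + " left") // scoped to the table, not the room
}

func main() {
	t := &Table{Name: "poker-1", Members: map[string]*Client{}}
	for _, id := range []string{"alice", "bob", "carol"} {
		t.Members[id] = &Client{ID: id}
	}
	t.Leave("bob") // only alice and carol are notified
}

Room-level state (who sits at which table) can then be refreshed lazily or batched into periodic digests instead of being pushed on every single event.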
It's very difficult to guess at these things at this point.
From a pragmatic standpoint, you may eventually want to look into a) a cloud-hosting type service for better bandwidth price-scaling for you, or b) a very experienced full-service hosting company that can help you calculate your needs based on prior experience.
Disclaimer: I work for Rackspace Hosting which provides both of the above.
