How is MapReduce a good method to analyse http server logs? - distributed

I've been looking at MapReduce for a while, and it seems to be a very good way to implement fault-tolerant distributed computing. I read a lot of papers and articles on that topic, installed Hadoop on an array of virtual machines, and did some very interesting tests. I really think I understand the Map and Reduce steps.
But here is my problem : I can't figure out how it can help with http server logs analysis.
My understanding is that big companies (Facebook for instance) use MapReduce for the purpose of computing their http logs in order to speed up the process of extracting audience statistics out of these. The company I work for, while smaller than Facebook, has a big volume of web logs to compute everyday (100Go growing between 5 and 10 percent every month). Right now we process these logs on a single server, and it works just fine. But distributing the computing jobs instantly come to mind as a soon-to-be useful optimization.
Here are the questions I can't answer right now, any help would be greatly appreciated :
Can the MapReduce concept really be applied to weblogs analysis ?
Is MapReduce the most clever way of doing it ?
How would you split the web log files between the various computing instances ?
Thank you.
Nicolas

Can the MapReduce concept really be applied to weblogs analysis ?
Yes.
You can split your hudge logfile into chunks of say 10,000 or 1,000,000 lines (whatever is a good chunk for your type of logfile - for apache logfiles I'd go for a larger number), feed them to some mappers that would extract something specific (like Browser,IP Address, ..., Username, ... ) from each log line, then reduce by counting the number of times each one appeared (simplified):
192.168.1.1,FireFox x.x,username1
192.168.1.1,FireFox x.x,username1
192.168.1.2,FireFox y.y,username1
192.168.1.7,IE 7.0,username1
You can extract browsers, ignoring version, using a map operation to get this list:
FireFox
FireFox
FireFox
IE
Then reduce to get this :
FireFox,3
IE,1
Is MapReduce the most clever way of doing it ?
It's clever, but you would need to be very big in order to gain any benefit... Splitting PETABYTES of logs.
To do this kind of thing, I would prefer to use Message Queues, and a consistent storage engine (like a database), with processing clients that pull work from the queues, perform the job, and push results to another queue, with jobs not being executed in some timeframe made available for others to process. These clients would be small programs that do something specific.
You could start with 1 client, and expand to 1000... You could even have a client that runs as a screensaver on all the PCs on a LAN, and run 8 clients on your 8-core servers, 2 on your dual core PCs...
With Pull: You could have 100 or 10 clients working, multicore machines could have multiple clients running, and whatever a client finishes would be available for the next step. And you don't need to do any hashing or assignment for the work to be done. It's 100% dynamic.
http://img355.imageshack.us/img355/7355/mqlogs.png
How would you split the web log files between the various computing instances ?
By number of elements or lines if it's a text-based logfile.
In order to test MapReduce, I'd like to suggest that you play with Hadoop.

Can the MapReduce concept really be applied to weblogs analysis ?
Sure. What sort of data are you storing?
Is MapReduce the most clever way of doing it ?
It would allow you to query across many commodity machines at once, so yes it can be useful. Alternatively, you could try Sharding.
How would you split the web log files between the various computing instances ?
Generally you would distribute your data using a consistent hashing algorithm, so you can easily add more instances later. You should hash by whatever would be your primary key in an ordinary database. It could be a user id, an ip address, referer, page, advert; whatever is the topic of your logging.

Related

Tech-stack for querying and alerting on GB scale (streaming and at rest) datasets

Trying to scope out a project that involves data ingestion and analytics, and could use some advice on tooling and software.
We have sensors creating records with 2-3 fields, each one producing ~200 records per second (~2kb/second) and will send them off to a remote server once per minute resulting in about ~18 mil records and 200MB of data per day per sensor. Not sure how many sensors we will need but it will likely start off in the single digits.
We need to be able to take action (alert) on recent data (not sure the time period guessing less than 1 day), as well as run queries on the past data. We'd like something that scales and is relatively stable .
Was thinking about using elastic search (then maybe use x-pack or sentinl for alerting). Thought about Postgres as well. Kafka and Hadoop are definitely overkill. We're on AWS so we have access to tools like kinesis as well.
Question is, what would be an appropriate set of software / architecture for the job?
Have you talked to your AWS Solutions Architect about the use case? They love this kind of thing, they'll be happy to help you figure out the right architecture. It may be a good fit for the AWS IoT services?
If you don't go with the managed IoT services, you'll want to push the messages to a scalable queue like Kafka or Kinesis (IMO, if you are processing 18M * 5 sensors = 90M events per day, that's >1000 events per second. Kafka is not overkill here; a lot of other stacks would be under-kill).
From Kinesis you then flow the data into a faster stack for analytics / querying, such as HBase, Cassandra, Druid or ElasticSearch, depending on your team's preferences. Some would say that this is time series data so you should use a time series database such as InfluxDB; but again, it's up to you. Just make sure it's a database that performs well (and behaves itself!) when subjected to a steady load of 1000 writes per second. I would not recommend using a RDBMS for that, not even Postgres. The ones mentioned above should all handle it.
Also, don't forget to flow your messages from Kinesis to S3 for safe keeping, even if you don't intend to keep the messages forever (just set a lifecycle rule to delete old data from the bucket if that's the case). After all, this is big data and the rule is "everything breaks, all the time". If your analytical stack crashes you probably don't want to lose the data entirely.
As for alerting, it depends 1) what stack you choose for the analytical part, and 2) what kinds of triggers you want to use. From your description I'm guessing you'll soon end up wanting to build more advanced triggers, such as machine learning models for anomaly detection, and for that you may want something that doesn't poll the analytical stack but rather consumes events straight out of Kinesis.

Which approach and database to use in performance-critical solution

I have the following scenario:
Around 70 million of equipments send a signal every 3~5 minutes to
the server sending its id, status (online or offiline), IP, location
(latitude and longitude), parent node and some other information.
The other information might not be in an standard format (so no schema for me) but I still need to query it.
The equipments might disappear for some time (or forever) not sending
signals in the process. So I need a way to "forget" the equipments if
they have not sent a signal in the last X days. Also new equipments
might come online at any time.
I need to query all this data. Like knowing how many equipments are offline on a specific region or over
an IP range. There won't be many queries running at the same time.
Some of the queries need to run fast (less than 3 min per query) and
at the same time as the database is updating. So I need indexes on
the main attributes (id, status, IP, location and parent node). The
query results do not need to be 100% accurate, eventual consistency
is fine as long as it doesn't take too long (more than 20 min on
avarage) for them to appear in the queries results.
I don't need
persistence at all, if the power goes out it's okay to lose
everything.
Given all this I thought of using a noSQL approach maybe MongoDB or CouchDB since I have experience with MapReduce and Javascript but I don't know which one is better for my problem (I'm gravitating towards CouchDB) or if they are fit at all to handle this massive workload. I don't even know if I actually need a "traditional" database since I don't need persistence to disk (maybe a main-memory approach would be better?), but I do need a way to build custom queries easily.
The main problem I detect are the following:
Need to insert/update lots of tuples really fast and I don't know
beforehand if the signal I receive is already in the database or not.
Almost all of the signals will be in the same state as they were the
last time, so maybe query by id and check to see if the tuple changed if not do nothing, if it did update?
Forgeting offline equipments. A batch job that runs during the night
removing expired tuples would solve this problem.
There won't be many queries running at the same time, but they need
to run fast. So I guess I need to have a cluster that perform a
single query on multiple nodes of the cluster (does CouchDB MapReduce
splits the workload to multiple nodes of the cluster?). I'm not
enterily sure I need a cluster though, could a single more expensive
machine handle all the load?
I have never used a noSQL system before, but I have theoretical
knowledge of the subject.
Does this make sense?
Apache Flume for collecting the signals.
It is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store. Easy to configure and scale. Store the data in HDFS as files using Flume.
Hive for batch queries.
Map the data files in HDFS as external tables in Hive warehouse. Write SQL like queries using HiveQL whenever you need offline-batch processing.
HBase for random real-time reads/writes.
Since HDFS, being a FS, lacks the random read/write capability, you would require a DB to serve that purpose. Looking at your use case HBase seems good to me. I would not say MongoDB or CouchDB as you are not dealing with documents here and both these are document-oriented databases.
Impala for fast, interactive queries.
Impala allows you to run fast, interactive SQL queries directly on your data stored in HDFS or HBase. Unlike Hive it does not use MapReduce. It instead leverages the power of MPP so it's good for real time stuff. And it's easy to use since it uses the same metadata, SQL syntax (Hive SQL), ODBC driver etc as Hive.
HTH
Depending on the type of analysis, CouchDB, HBase of Flume may be all be good choices. For strictly numeric "write-once" metrics data graphite is a very popular open source solution.

How to gear towards scalability for a start up e-commerce portal?

I want to scale an e-commerce portal based on LAMP. Recently we've seen huge traffic surge.
What would be steps (please mention in order) in scaling it:
Should I consider moving onto Amazon EC2 or similar? what could be potential problems in switching servers?
Do we need to redesign database? I read, Facebook switched to Cassandra from MySql. What kind of code changes are required if switched to Cassandra? Would Cassandra be better option than MySql?
Possibility of Hadoop, not even sure?
Any other things, which need to be thought of?
Found this post helpful. This blog has nice articles as well. What I want to know is list of steps I should consider in scaling this app.
First, I would suggest making sure every resource served by your server sets appropriate cache control headers. The goal is to make sure truly dynamic content gets served fresh every time and any stable or static content gets served from somebody else's cache as much as possible. Why deliver a product image to every AOL customer when you can deliver it to the first and let AOL deliver it to all the others?
If you currently run your webserver and dbms on the same box, you can look into moving the dbms onto a dedicated database server.
Once you have done the above, you need to start measuring the specifics. What resource will hit its capacity first?
For example, if the webserver is running at or near capacity while the database server sits mostly idle, it makes no sense to switch databases or to implement replication etc.
If the webserver sits mostly idle while the dbms chugs away constantly, it makes no sense to look into switching to a cluster of load-balanced webservers.
Take care of the simple things first.
If the dbms is the likely bottle-neck, make sure your database has the right indexes so that it gets fast access times during lookup and doesn't waste unnecessary time during updates. Make sure the dbms logs to a different physical medium from the tables themselves. Make sure the application isn't issuing any wasteful queries etc. Make sure you do not run any expensive analytical queries against your transactional database.
If the webserver is the likely bottle-neck, profile it to see where it spends most of its time and reduce the work by changing your application or implementing new caching strategies etc. Make sure you are not doing anything that will prevent you from moving from a single server to multiple servers with a load balancer.
If you have taken care of the above, you will be much better prepared for making the move to multiple webservers or database servers. You will be much better informed for deciding whether to scale your database with replication or to switch to a completely different data model etc.
1) First thing - measure how many requests per second can serve you most-visited pages. For well-written PHP sites on average hardware it must be in 200-400 requests per second range. If you are not there - you have to optimize the code by reducing number of database requests, caching rarely changed data in memcached/shared memory, using PHP accelerator. If you are at some 10-20 requests per second, you need to get rid of your bulky framework.
2) Second - if you are still on Apache2, you have to switch to lighthttpd or nginx+apache2. Personally, I like the second option.
3) Then you move all your static data to separate server or CDN. Make sure it is served with "expires" headers, at least 24 hours.
4) Only after all these things you might start thinking about going to EC2/Hadoop, build multiple servers and balancing the load (nginx would also help you there)
After steps 1-3 you should be able to serve some 10'000'000 hits per day easily.
If you need just 1.5-3 times more, I would go for single more powerfull server (8-16 cores, lots of RAM for caching & database).
With step 4 and multiple servers you are on your way to 0.1-1billion hits per day (but for significantly larger hardware & support expenses).
Find out where issues are happening (or are likely to happen if you don't have them now). Knowing what is your biggest resource usage is important when evaluating any solution. Stick to solutions that will give you the biggest improvement.
Consider:
- higher than needed bandwidth use x user is something you want to address regardless of moving to ec2. It will cost you money either way, so its worth a shot at looking at things like this: http://developer.yahoo.com/yslow/
- don't invest into changing databases if that's a non issue. Find out first if that's really the problem, and even if you are having issues with the database it might be a code issue i.e. hitting the database lots of times per request.
- unless we are talking about v. big numbers, you shouldn't have high cpu usage issues, if you do find out where they are happening / optimization is worth it where specific code has a high impact in your overall resource usage.
- after making sure the above is reasonable, you might get big improvements with caching. In bandwith (making sure browsers/proxy can play their part on caching), local resources usage (avoiding re-processing/re-retrieving the same info all the time).
I'm not saying you should go all out with the above, just enough to make sure you won't get the same issues elsewhere in v. few months. Also enough to find out where are your biggest gains, and if you will get enough value from any scaling options. This will also allow you to come back and ask questions about specific problems, and how these scaling options relate to those.
You should prepare by choosing a flexible framework and be sure things are going to change along the way. In some situations it's difficult to predict your user's behavior.
If you have seen an explosion of traffic recently, analyze what are the slowest pages.
You can move to cloud, but EC2 is not the best performing one. Again, be sure there's no other optimization you can do.
Database might be redesigned, but I doubt all of it. Again, see the problem points.
Both Hadoop and Cassandra are pretty nifty, but they might be overkill.

Is GAE a viable platform for my application? (if not, what would be a better option?)

Here's the requirement at a very high level.
We are going to distribute desktop agents (or browser plugins) to collect certain information from tons of users (in thousands or possibly millions down the road).
These agents collect data and periodically upload it to a server app.
The server app will allow for analyzing collected data (filter, sort etc based on 4-5 attributes) and summarize in form of charts etc.
We should also be able to export some of the collected data (csv or pdf)
We are looking for an platform to host the server app. GAE seems attractive because of low administrative cost and scalability (as users base increases, the platform will handle the scale... hopefully!).
Is GAE a viable option for us?
One important consideration is that sometimes the volume of uploads from the agents can exceed 50MB per upload cycle. We will have users in places where Internet connections could be very slow too. Apparently GAE has a limit on the duration a request can last. The upload volume may cause the request (transferring data from an agent to the server) to last longer than 30 seconds. How would one handle such situation?
Thanks!
The time of the upload is not considered part of the script execution time, so no worries there.
Google App Engine is very good to perform a vast number of smaller jobs but not so much to do complex long running background jobs (because of the 30 sec limit + even smaller database connection time limit). So probably GAE would be a very good platform to GATHER the data but not for actually ANALYZING it. You probably would like to separate these two.
We went ahead an implemented the first version on GAE anyway. The experience has been very much what is described here http://www.carlosble.com/?p=719
For a proof-of-concept prototype, what we have built so far is acceptable. However, we have decided not to go with GAE (at least in its current shape) for the production version. The pains somewhat outweigh the benefits in our case.
The problems we faced were numerous. Unlike my experience dealing with J2EE stacks, when you run into an issue, many a times it is a dead end. Workarounds are very complicated and ugly, if you can find one.
By writing good prototypes one could figure out whether GAE is right for the solution being built, however, the hype is a problem. Many newcomers are going get overly excited about GAE due to its hype and end up failing badly. Because they will choose GAE for all kinds of purpose that it is not suitable for.

Calculate server requirements based on programming specs

Have you ever encountered something so easy to develop but stopped a while to think of server requirements for your project ? It is my case.
I want to compete with a gaming site, they have multiplayer Flash games like poker, rummy, backgammon, and other card games, 8 games in total. For each game they have rooms and tables.
I'll use Silverlight with Sockets. I already managed to develop the policy server, the Socket Server app using WinForms, the Client Socket app in Silverlight. I own a VPS for tests, so there is no problem in developing what I want, the problem is How to calculate server requirements, RAM, bandwidth, internet speed based on the following requirements:
Server should support 24.000 users / day or 1000 users / hour
Each game room should have it's own tables where users can play
Users should not lose scores and game speed should be fast in general
I just wonder how to handle the following situation: if 1000 users are connected through Socket connection to a room full of tables and one user leave a table, all 1000 users must be updated and UI should reflect the changes. Let's say that I'll update the clients by sending a small Message of 100 bytes to each user, this will eat 100 bytes * 1000 users = 100 kb, and this just for 1 UI change, for 1 Game and for 1 Room, not counting all my other games and rooms. Also 1000 iterations that sends bytes to clients should be very time consuming.I am a developer, but not experienced in those situations. Please advice. Numbers will be great.
Until you've built -- and optimized -- your actual applications, you cannot predict much about the hardware required for some level of performance.
You have to finish the apps first. Then you can measure their performance under load. Then you can decide how much to spend on what levels of performance.
The best answer I can offer you is to run stress tests and see how much load a single server can support. While running those tests, monitor memory, IO, CPU and disk activity (if relevant) to understand which resource is running out first.
We deploy our applications on Amazon's EC2 cloud infrastructure. That lets us easily (within minutes) add or remove capacity as needed. Perhaps it's worth considering for your situation.
Always follow these two rules
“The First Rule of Program Optimization: Don't do it. The Second Rule of Program Optimization (for experts only!): Don't do it yet.” - Michael A. Jackson
First of all you should think more about how and when to send what information to which clients. Not every client needs to be informed about every table change.
That there are only so much informations that a client needs, and you need to decide when/how it will be transmitted. Also you should pack the informations into meaningfull packets. Whats happening at a table is only interesting for that table.
Also you need to profile your application to make sure you know what ressources it consumes. Cardgames should not eat up so much ressources. But the important point is to FIRST build it, and when you HAVE a bottleneck, then try to fix it.
It's very difficult to guess at these things at this point.
From a pragmatic standpoint, you may eventually want to look into a) a cloud-hosting type service for better bandwidth price-scaling for you, or b) a very experienced full-service hosting company that can help you calculate your needs based on prior experience.
Disclaimer: I work for Rackspace Hosting which provides both of the above.

Resources