Hadoop on periodically generated files

I would like to use Hadoop to process input files which are generated every n minutes. How should I approach this? For example, I have temperature measurements from cities in the USA arriving every 10 minutes, and I want to compute average temperatures per day, per week, and per month.
PS: So far I have considered Apache Flume to gather the readings. It would collect data from multiple servers and write it periodically to HDFS, from where I can read and process it.
But how can I avoid processing the same files again and again?

You should consider a Big Data stream processing platform like Storm (which I'm very familiar with; there are others, though), which might be better suited to the kinds of aggregations and metrics you mention.
Either way, however, you're going to implement something which keeps the entire set of processed data in a form that makes it very easy to apply the delta of just-gathered data and produce your latest metrics. Another output of this merge is a new set of data to which you'll apply the next hour's data. And so on.
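Roughly, that merge pattern might look like the sketch below (the file names, CSV layout, and JSON state file are made-up assumptions for illustration, not tied to Storm or any particular framework): each freshly gathered batch is folded into a running per-city, per-day aggregate, so only the delta is ever reprocessed.

```python
import csv
import json
import os

STATE_FILE = "daily_state.json"   # hypothetical: running (sum, count) per city per day

def load_state():
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return {}

def apply_delta(state, delta_csv):
    """Fold one newly arrived batch into the running per-city, per-day aggregates.

    Each row of the delta file is assumed to be: city, ISO timestamp, temperature.
    """
    with open(delta_csv, newline="") as f:
        for city, ts, temp in csv.reader(f):
            day = ts[:10]                      # e.g. "2024-03-01"
            key = f"{city}|{day}"
            acc = state.setdefault(key, {"sum": 0.0, "count": 0})
            acc["sum"] += float(temp)
            acc["count"] += 1
    return state

def daily_averages(state):
    return {k: v["sum"] / v["count"] for k, v in state.items() if v["count"]}

if __name__ == "__main__":
    state = load_state()
    state = apply_delta(state, "delta_last_10_minutes.csv")   # hypothetical batch file
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)
    print(daily_averages(state))
```

Weekly and monthly averages can then be derived from the daily aggregates without ever touching the raw files again.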

Related

DynamoDB: How to distribute workload over the month?

TL;DR
I have a table with about 2 million WRITEs over the month and 0 READs. Every 1st day of a month, I need to read all the rows written on the previous month and generate CSVs + statistics.
How to work with DynamoDB in this scenario? How to choose the READ throughput capacity?
Long description
I have an application that logs client requests. It has about 200 clients. On the 1st day of every month, each client needs to receive a CSV with all the requests they've made. They also need to be billed, and for that we need to calculate some stats on the requests they've made, grouped by type of request.
So at the end of the month, a client receives a report like:
I've already come up with two solutions, but I'm not convinced by either of them.
1st solution: on the last day of the month, I increase the READ throughput capacity and then run a MapReduce job. When the job is done, I decrease the capacity back to the original value.
Cons: not fully automated, risk of the DynamoDB capacity not being available when the job starts.
2nd solution: I can break the generation of CSVs + statistics to small jobs in a daily or hourly routine. I could store partial CSVs on S3 and on every 1st day of a month I could join those files and generate a new one. The statistics would be much easier to generate, just some calculations derived from the daily/hourly statistics.
Cons: I feel like I'm turning something simple into something complex.
Do you have a better solution? If not, what solution would you choose? Why?
Having been in a similar place myself, the approach I used, and now recommend to you, is to process the raw data:
as often as you reasonably can (start with daily)
to a format as close as possible to the desired report output
with as much calculation/CPU intensive work done as possible
leaving as little to do at report time as possible.
This approach is entirely scalable - the incremental frequency can be:
reduced to as small a window as needed
parallelised if required
It also makes it possible to re-run past months' reports on demand, as the report generation time should be quite small.
In my example, I shipped denormalized, pre-processed (financial calculations) data every hour to a data warehouse, then reporting just involved a very basic (and fast) SQL query.
This had the additional benefit of spreading the load on the production database server across lots of small bites, instead of bringing it to its knees once a week at invoice time (30,000 invoices produced every week).
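As a rough sketch of that shape of pipeline (the field names, CSV layout, and "reports" directory are assumptions, not your schema): a daily job condenses the raw request rows into per-client, per-type counts, and the month-end step is just a merge of the stored daily partials plus the CSV writing.

```python
import csv
import os
from collections import Counter

def rollup_day(raw_rows):
    """Condense one day's raw request rows into (client, request_type) -> count.

    raw_rows is assumed to be an iterable of dicts with 'client' and 'type' keys.
    """
    counts = Counter()
    for row in raw_rows:
        counts[(row["client"], row["type"])] += 1
    return counts

def merge_month(daily_partials):
    """Month-end step: just sum the daily partials; this is all that's left at report time."""
    total = Counter()
    for partial in daily_partials:
        total.update(partial)
    return total

def write_client_csvs(month_counts, out_dir="reports"):
    """One CSV per client: request_type, count."""
    os.makedirs(out_dir, exist_ok=True)
    by_client = {}
    for (client, req_type), n in month_counts.items():
        by_client.setdefault(client, []).append((req_type, n))
    for client, rows in by_client.items():
        with open(os.path.join(out_dir, f"{client}.csv"), "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["request_type", "count"])
            writer.writerows(sorted(rows))
```

The daily partials themselves can live wherever is convenient (S3, a separate DynamoDB table, flat files); the point is that the expensive read of the raw table happens in small daily slices rather than all at once.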
I would use Kinesis to produce daily, near-real-time billing.
For this purpose I would create a dedicated DynamoDB table just for the calculated data
(another option is to run it on flat files).
Then I would add a process which sends events to Kinesis just after you update the regular DynamoDB table.
Thus, when you reach the end of the month, you can just execute whatever post-billing calculations you have and create your CSV files from the already-calculated table.
I hope that helps.
Take a look at Dynamic DynamoDB. It will increase/decrease the throughput when you need it without any manual intervention. The good news is you will not need to change the way the export job is done.

How to store and retrieve large numbers of data points for graphical visualization?

I'm thinking about building a web-based data logging and visualization service. The basic idea is that at some timed interval something (e.g. a sensor) reports a value (e.g. temperature) to the server. The server records this value into a database. There would be a web-based UI that allows me to view this data on a time-based graph. Ideally this graph would have various resolutions (last 30 seconds, last week, last year, etc). In a super ideal world, I would be able to zoom into the data for any point in time.
The problem is that the sensors are going to generate enormous amounts of data. For example, a sensor that reports a value every 5 seconds will generate about 18k values a day. I'm imagining a system that has thousands of sensors. Over time, this becomes lots of data.
The naive solution is to throw this data into a relational database and retrieve it in the various ways I want, but that won't scale.
The simple solution is to reduce the amount of data by performing periodic roll-ups of the data. New data might go into a table that has data points every 5 seconds. Every hour, some system pumps this data into another table that has data points every minute and the original data is deleted. This repeats for a few levels. The downside to this is that the further back in time you go, the less detailed the data is. That's probably fine. I would imagine that I would need enormous amounts of hardware to support full resolution of data over all time as compared to a system with this sort of rollup.
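For what it's worth, one level of that roll-up is only a few lines; the sketch below (the sample format and bucket size are assumptions) collapses raw 5-second samples into per-minute averages, and the same pattern cascades to hourly and daily tables.

```python
from collections import defaultdict

def rollup(samples, bucket_seconds=60):
    """Collapse raw samples into one averaged point per sensor per time bucket.

    samples: iterable of (sensor_id, unix_timestamp, value) tuples.
    Returns {(sensor_id, bucket_start_ts): average_value}.
    """
    sums = defaultdict(lambda: [0.0, 0])          # (sensor, bucket) -> [sum, count]
    for sensor_id, ts, value in samples:
        bucket = int(ts) - int(ts) % bucket_seconds
        acc = sums[(sensor_id, bucket)]
        acc[0] += value
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}

# Example: two minutes of 5-second readings rolled up to 1-minute resolution.
raw = [("sensor-1", 1700000000 + i * 5, 20.0 + i * 0.1) for i in range(24)]
print(rollup(raw, bucket_seconds=60))
```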
Is there a better way to do this? Is there an existing solution? I have to imagine this is a fairly common problem.
You probably want a fixed-size database like RRDTool: http://oss.oetiker.ch/rrdtool/
Also Graphite is built on top of a similar datastore implementation: http://graphite.wikidot.com/

Architecture and pattern for large scale, time series based, aggregation operation

I will try to describe my challenge and operation:
I need to calculate stock price indices over historical periods. For example, I will take 100 stocks and calculate their aggregated average price each second (or even more frequently) for the last year.
I need to create many different indices like this, where the stocks are picked dynamically out of ~30,000 different instruments.
The main consideration is speed. I need to output a few months of this kind of index as fast as I can.
For that reason, I think a traditional RDBMS is too slow, so I am looking for a sophisticated and original solution.
Here is something I had in mind, using a NoSQL or column-oriented approach:
Distribute all stocks into some kind of time:price key-value pairs, with matching time rows across all of them. Then use some sort of map-reduce pattern to select only the required stocks and aggregate their prices while reading them line by line.
I would like some feedback on my approach, suggestions for tools and use cases, or a suggestion of a completely different design pattern. My guidelines for the solution are price (I would like to use open source), the ability to handle huge amounts of data, and, again, fast lookup (I don't care about inserts since the data is only loaded once and never changes).
Update: by fast lookup I don't mean real time, but a reasonably quick operation. Currently it takes me a few minutes to process each day of data, which translates to a few hours per yearly calculation. I want to achieve this within minutes or so.
In the past, I've worked on several projects that involved the storage and processing of time series using different storage techniques (files, RDBMS, NoSQL databases). In all these projects, the essential point was to make sure that the time series samples are stored sequentially on the disk. This made sure reading several thousand consecutive samples was quick.
Since you seem to have a moderate number of time series (approx. 30,000) each having a large number of samples (1 price a second), a simple yet effective approach could be to write each time series into a separate file. Within the file, the prices are ordered by time.
You then need an index for each file so that you can quickly find certain points of time within the file and don't need to read the file from the start when you just need a certain period of time.
With this approach you can take full advantage of today's operating systems which have a large file cache and are optimized for sequential reads (usually reading ahead in the file when they detect a sequential pattern).
Aggregating several time series involves reading a certain period from each of these files into memory, computing the aggregated numbers and writing them somewhere. To fully leverage the operating system, read the full required period of each time series one by one and don't try to read them in parallel. If you need to compute a long period, then don’t break it into smaller periods.
You mention that you have 25,000 prices a day when you reduce them to a single one per second. It seems to me that in such a time series, many consecutive prices would be the same as few instruments are traded (or even priced) more than once a second (unless you only process S&P 500 stocks and their derivatives). So an additional optimization could be to further condense your time series by only storing a new sample when the price has indeed changed.
On a lower level, the time series files could be organized as binary files consisting of sample runs. Each run starts with the time stamp of the first price and the length of the run. After that, the prices for the consecutive seconds follow. The file offset of each run could be stored in the index, which could be implemented with a relational DBMS (such as MySQL). This database would also contain all the meta data for each time series.
(Do stay away from memory mapped files. They're slower because they aren’t optimized for sequential access.)
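As a rough sketch of that run layout (the exact header fields, the in-memory index, and the file naming here are my assumptions; in practice the offsets would live in the RDBMS index table described above):

```python
import struct

RUN_HEADER = struct.Struct("<qI")   # first-sample timestamp (int64), run length (uint32)
PRICE = struct.Struct("<d")         # one double per consecutive second

def append_run(path, index, start_ts, prices):
    """Append one run of per-second prices and record its file offset in the index."""
    with open(path, "ab") as f:
        offset = f.tell()
        f.write(RUN_HEADER.pack(start_ts, len(prices)))
        for p in prices:
            f.write(PRICE.pack(p))
    index.append((start_ts, offset))   # in practice: an index table in MySQL

def read_run(path, offset):
    """Read one run back, given its offset from the index."""
    with open(path, "rb") as f:
        f.seek(offset)
        start_ts, n = RUN_HEADER.unpack(f.read(RUN_HEADER.size))
        prices = [PRICE.unpack(f.read(PRICE.size))[0] for _ in range(n)]
    return start_ts, prices

index = []
append_run("AAPL.runs", index, 1700000000, [191.2, 191.2, 191.3])
print(read_run("AAPL.runs", index[0][1]))
```

Reading a period then becomes: look up the run offsets covering that period in the index, seek once, and stream the runs sequentially.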
If the scenario you described is the ONLY requirement, then there are "low tech" simple solutions which are cheaper and easier to implement. The first that comes to mind is LogParser. In case you haven't heard of it, it is a tool which runs SQL queries on simple CSV files. It is unbelievably fast - typically around 500K rows/sec, depending on row size and the IO throughput of the HDs.
Dump the raw data into CSVs, run a simple aggregate SQL query via the command line, and you are done. Hard to believe it can be that simple, but it is.
More info about logparser:
Wikipedia
Coding Horror
What you really need is a relational database with built-in time series functionality. IBM released one very recently: Informix 11.7 (note that it must be 11.7 to get this feature). Even better news is that for what you are doing, the free version, Informix Innovator-C, will be more than adequate.
http://www.freeinformix.com/time-series-presentation-technical.html

Storing Signals in a Database

I'm designing an application that receives information from roughly 100k sensors that measure time-series data. Each sensor measures a single integer data point once every 15 minutes, saves a log of these values, and sends that log to my application once every 4 hours. My application should maintain about 5 years of historical data. The packet I receive once every 4 hours is of the following structure:
Date and time of the sequence start
Number of samples to arrive (assume this is fixed for the sake of simplicity, although in practice there may be partials)
The sequence of samples, each of exactly 4 bytes
My application's main usage scenario is showing graphs of composite signals at certain dates. When I say "composite" signals I mean that for example I need to show the result of adding Sensor A's signal to Sensor B's signal and subtracting Sensor C's signal.
My dilemma is how to store this time-series data in my database. I see two options, assuming I use a relational database:
Store every sample in a row of its own: when I receive a signal, break it into samples, and store each sample separately with its timestamp. Assume the timestamps can be normalized across signals.
Store every 4-hour signal as a separate row with its starting time. In this case, whenever a signal arrives, I just add it as a BLOB to the database.
There are obvious pros and cons for each of the options, including storage size, performance, and complexity of the code "above" the database.
I wondered if there are best practices for such cases.
Many thanks.
Storing each sample in its own row sounds simple and logical to me. Don't be too hasty to optimize unless there is actually a good reason for it. Maybe you should do some tests with dummy data to see if any optimization is really necessary.
I think storing the data in the form that makes it easiest to carry out your main goal is likely the least painful overall. In this case, it's likely the more efficient as well.
Since your main goal appears to be to display the information in interesting and flexible ways, I'd go with separate rows for each data point. Most of the effort required to write this program well will likely be on the display side, so you should minimize the complexity on that side as much as possible.
Storing data in BLOBs is good if the content isn't relevant and you would never want to run queries against it. In this case, your data will be the contents of the database, and therefore, very relevant.
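To make that concrete, here is a minimal sketch with SQLite (the table and column names are assumptions, and SQLite just stands in for whatever relational database you use): with one row per sample, the composite signal A + B - C falls out of a single query, which would be much more awkward against BLOBs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE samples (
        sensor_id TEXT    NOT NULL,
        ts        INTEGER NOT NULL,   -- normalized timestamp
        value     INTEGER NOT NULL,
        PRIMARY KEY (sensor_id, ts)
    )
""")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?, ?)",
    [(s, t, v) for t in (0, 900, 1800)
               for s, v in (("A", 10), ("B", 3), ("C", 1))],
)

# Composite signal A + B - C, one value per timestamp.
rows = conn.execute("""
    SELECT ts,
           SUM(CASE sensor_id WHEN 'A' THEN value
                              WHEN 'B' THEN value
                              WHEN 'C' THEN -value END) AS composite
    FROM samples
    WHERE sensor_id IN ('A', 'B', 'C')
    GROUP BY ts
    ORDER BY ts
""").fetchall()
print(rows)   # [(0, 12), (900, 12), (1800, 12)]
```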
I think you should:
1. Store every sample in a row of its own: when I receive a signal, break it into samples, and store each sample separately with its timestamp. Assume the timestamps can be normalized across signals.
I see two database operations here: the first is to store the data as it comes in, and the second is to retrieve the data in a (potentially large) number of ways.
As Kieveli says, since you'll be using discrete parts of the data (as opposed to all of the data all at once), storing it as a blob won't help you when it comes time to read it. So for the first task, storing the data line by line would be optimal.
This might also be "good enough" when querying the data. However, performance could become an issue given the sheer volume [100,000 sensors x 4 samples per hour x 24 hours = 9,600,000 rows per day; x 1,826 days = 17,529,600,000 or so rows in five years]. To my mind, if you want to write flexible queries against that kind of data, you'll want some form of star schema structure (as gets used in data warehouses).
Whether you load the data directly into the warehouse, or let it build up "row by row" to be added to the warehouse every day/week/month/whatever, depends on time, effort, available resources, and so on.
A final suggestion: when you set up a test environment for your new code, load it with several years of (dummy) data, to see how it will perform.
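If you do go the star-schema route, a stripped-down version might look like the sketch below (the table names, dimensions, and SQLite backend are illustrative assumptions only); it also doubles as a way to load a test environment with a bit of dummy data, as suggested above.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_sensor (sensor_key INTEGER PRIMARY KEY, sensor_name TEXT, site TEXT);
    CREATE TABLE dim_time   (time_key   INTEGER PRIMARY KEY, day TEXT, hour INTEGER);
    CREATE TABLE fact_sample (
        sensor_key INTEGER REFERENCES dim_sensor(sensor_key),
        time_key   INTEGER REFERENCES dim_time(time_key),
        value      INTEGER NOT NULL
    );
    CREATE INDEX idx_fact ON fact_sample (sensor_key, time_key);
""")

# Dummy data: a handful of sensors and one day's worth of hourly time keys.
conn.executemany("INSERT INTO dim_sensor VALUES (?, ?, ?)",
                 [(i, f"sensor-{i}", "site-A") for i in range(5)])
conn.executemany("INSERT INTO dim_time VALUES (?, ?, ?)",
                 [(h, "2024-01-01", h) for h in range(24)])
conn.executemany("INSERT INTO fact_sample VALUES (?, ?, ?)",
                 [(s, h, random.randint(0, 100)) for s in range(5) for h in range(24)])

# Typical warehouse-style query: hourly average per site.
print(conn.execute("""
    SELECT d.site, t.hour, AVG(f.value)
    FROM fact_sample f
    JOIN dim_sensor d ON d.sensor_key = f.sensor_key
    JOIN dim_time   t ON t.time_key   = f.time_key
    GROUP BY d.site, t.hour
    ORDER BY t.hour
""").fetchall()[:3])
```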

Storage of many log files

I have a system which is receiving log files from different places through http (>10k producers, 10 logs per day, ~100 lines of text each).
I would like to store them so that I can compute misc. statistics over them nightly, export them (ordered by date of arrival or first line content), and so on.
My question is : what's the best way to store them ?
Flat text files (with proper locking), one file per uploaded file, one directory per day/producer
Flat text files, one (big) file per day for all producers (problem here will be indexing and locking)
Database table with text (MySQL is preferred for internal reasons) (problem here with DB purging, as deletes can take very long!)
Database Table with one record per line of text
Database with sharding (one table per day), allowing a simple data purge (this is essentially partitioning; however, the version of MySQL I have access to, i.e. the one supported internally, does not support it)
Document-based DB à la CouchDB or MongoDB (problems could be with indexing / maturity / speed of ingestion)
Any advice ?
(Disclaimer: I work on MongoDB.)
I think MongoDB is the best solution for logging. It is blazingly fast, as in, it can probably insert data faster than you can send it. You can do interesting queries on the data (e.g., ranges of dates or log levels) and index any field or combination of fields. It's also nice because you can randomly add more fields to logs ("oops, we want a stack trace field for some of these") and it won't cause problems (as it would with flat text files).
As far as stability goes, a lot of people are already using MongoDB in production (see http://www.mongodb.org/display/DOCS/Production+Deployments). We just have a few more features we want to add before we go to 1.0.
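For what it's worth, a minimal sketch of that usage with the Python driver (assuming a local mongod and made-up database, collection, and field names):

```python
from datetime import datetime, timedelta
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")   # assumes a local mongod
logs = client.logdb.logs

# Index the fields you will query on (e.g. date ranges and log level).
logs.create_index([("ts", ASCENDING), ("level", ASCENDING)])

# Inserting a log line; extra fields (like a stack trace) can be added at any time.
logs.insert_one({
    "ts": datetime.utcnow(),
    "producer": "server-42",
    "level": "ERROR",
    "message": "disk almost full",
    "stacktrace": "...",        # optional, schema-less
})

# Query: all ERROR entries from the last 24 hours.
since = datetime.utcnow() - timedelta(hours=24)
for doc in logs.find({"ts": {"$gte": since}, "level": "ERROR"}):
    print(doc["producer"], doc["message"])
```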
I'd pick the very first solution.
I don't see why you would need a DB at all. It seems like all you need is to scan through the data. Keep the logs in their most "raw" state, then process them and create a tarball for each day.
The only reason to aggregate would be to reduce the number of files. On some file systems, if you put more than N files in a directory, the performance decreases rapidly. Check your filesystem and if it's the case, organize a simple 2-level hierarchy, say, using the first 2 digits of producer ID as the first level directory name.
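A quick sketch of such a two-level layout (the producer-ID format and root directory are assumptions):

```python
import os

def log_path(root, producer_id, day, filename):
    """Two-level layout: first two digits of the producer ID, then the producer, then the day.

    Keeps any single directory from accumulating tens of thousands of entries.
    """
    prefix = str(producer_id).zfill(2)[:2]     # e.g. producer 12345 -> "12"
    path = os.path.join(root, prefix, str(producer_id), day)
    os.makedirs(path, exist_ok=True)
    return os.path.join(path, filename)

print(log_path("/var/logs", 12345, "2024-01-01", "upload-0001.log"))
```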
I would write one file per upload, and one directory/day as you first suggested. At the end of the day, run your processing over the files, and then tar.bz2 the directory.
The tarball will still be searchable, and will likely be quite small as logs can usually compress quite well.
For total data, you are talking about 1GB [corrected 10MB] a day uncompressed. This will likely compress to 100MB or less. I've seen 200x compression on my log files with bzip2. You could easily store the compressed data on a file system for years without any worries. For additional processing you can write scripts which can search the compressed tarball and generate more stats.
Since you would like to store them to be able to compute misc. statistics over them nightly and export them (ordered by date of arrival or first line content), and since you're expecting 100,000 files a day at a total of 10,000,000 lines:
I'd suggest:
Store all the files as regular text files using the following layout: yyyymmdd/producerid/fileno.
At the end of the day, clear the database and load all the text files for the day.
After loading the files, it would be easy to get the stats from the database, and post them in any format needed. (maybe even another "stats" database). You could also generate graphs.
To save space, you could compress the daily folder. Since they're text files, they would compress well.
So you would only be using the database to be able to easily aggregate the data. You could also reproduce the reports for an older day if the process didn't work, by going through the same steps.
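A rough sketch of that nightly step, assuming the yyyymmdd/producerid/fileno layout above and using SQLite as a throwaway daily database purely for illustration (the suggestion above is MySQL; any RDBMS would do):

```python
import glob
import os
import sqlite3

def load_day(day):     # day is a directory name like "20240101"
    conn = sqlite3.connect(f"stats-{day}.db")
    conn.execute("DROP TABLE IF EXISTS lines")   # "clear the database"
    conn.execute("CREATE TABLE lines (producer TEXT, fileno TEXT, line TEXT)")
    for path in glob.glob(os.path.join(day, "*", "*")):
        producer, fileno = path.split(os.sep)[-2:]
        with open(path) as f:
            conn.executemany("INSERT INTO lines VALUES (?, ?, ?)",
                             ((producer, fileno, line.rstrip("\n")) for line in f))
    conn.commit()
    return conn

def daily_stats(conn):
    # Example statistic: number of files and lines per producer.
    return conn.execute("""
        SELECT producer, COUNT(DISTINCT fileno) AS files, COUNT(*) AS lines
        FROM lines
        GROUP BY producer
        ORDER BY lines DESC
    """).fetchall()
```

Re-running an older day is then just a matter of pointing the same script at that day's (possibly compressed) folder.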
In my experience, a single large table performs much faster than several linked tables, if we're talking about a database solution, particularly for write and delete operations. For example, splitting one table into three linked tables can decrease performance 3-5 times. This is very rough; of course it depends on the details, but generally that is the risk, and it gets worse when data volumes get very large. The best way, IMO, to store log data is not as flat text but in a structured form, so that you can do efficient queries and formatting later. Managing log files can be a pain, especially when there are lots of them coming from many sources and locations. Check out our solution; IMO it can save you lots of development time.
