I have the following scenario:
Measurements are uploaded through a web service in the form of files
Those files are later copied to HDFS
Each measurement contains a number of features (values), for one or more parameters
Measurements might have a different number of values
Measurements are processed using machine learning algorithms on Hadoop
Not all measurements are processed at once, only those for a certain user and a certain time period (e.g. perform processing on files from user X uploaded during period Y-Z)
Intermediate results are stored on HDFS, as well as the final result
My question is related to the second point ("Those files are later copied to HDFS"). I'm worried that a large number of small files (e.g. 1 MB each) could be a problem.
My idea is to store those files in a database, so I would avoid the problem with small files and also be able to query the data (select data for a user for a period). Is that a better approach?
If the answer is yes, which databases can I use? I need the database to meet these requirements:
Compatible with Hadoop (Big data)
Rows may contain different number of values (like in case of time series)
Retrieve measurements for certain user for certain period
Records are input to MapReduce job
I think HBase is a perfect fit for your needs.
I also had the "small file problem" and I solved it using HBase.
Storing small files directly in HDFS is bad practice and can become a problem.
From the HBase project site:
Apache HBase is the Hadoop database. Use it when you need random,
realtime read/write access to your Big Data. This project's goal is
the hosting of very large tables -- billions of rows X millions of
columns -- atop clusters of commodity hardware.
HBase is made for Hadoop
Rows can store different columns in a column family, and updated values carry a timestamp, so you can go back through the history of a cell
HBase and Hadoop are made for MapReduce jobs (rows can be the input/output of a job)
In my case I had a lot of small files (200 KB to 1 MB), and now I store these files in a table with some columns for header/information, a column for the binary content of the file, and the file name as the key (the file name is a UUID)
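To support the "measurements for user X during period Y-Z" requirement, one option is a composite row key instead of a bare UUID. The sketch below is not what I described above; it is an assumption-laden illustration that relies only on the fact that HBase sorts row keys lexicographically and that scans take an inclusive start row and an exclusive stop row. The function names, the `|` separator and the millisecond timestamp layout are all hypothetical, and user IDs are assumed not to contain `|`.

```python
from datetime import datetime, timezone


def make_row_key(user_id: str, uploaded_at: datetime, file_id: str) -> bytes:
    """Build a key that sorts by user first, then by upload time."""
    millis = int(uploaded_at.timestamp() * 1000)
    # Zero-pad so that lexicographic order matches numeric order.
    return f"{user_id}|{millis:013d}|{file_id}".encode("utf-8")


def scan_range(user_id: str, start: datetime, stop: datetime) -> tuple[bytes, bytes]:
    """Return (start_row, stop_row) covering uploads in [start, stop)."""
    start_ms = int(start.timestamp() * 1000)
    stop_ms = int(stop.timestamp() * 1000)
    return (
        f"{user_id}|{start_ms:013d}".encode("utf-8"),
        f"{user_id}|{stop_ms:013d}".encode("utf-8"),
    )
```

The two byte strings returned by `scan_range` would be passed as the start and stop rows of a scan (or of the MapReduce table input), so only the rows of that user and period are read.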
My goal is to generate a file from the table, so that it can afterwards be checked for data (monthly calculations). What I have done so far is to create a backup using the Pipeline option from DynamoDB to an S3 bucket, but:
It is taking too long: the pipeline has been running for more than 24 hours, because the table I am exporting is 7 GB in DynamoDB (and that size is compressed, so the backup will take even longer to finish);
I will need to do this monthly, which means that I only need the data between the first and last day of the month. While the Pipeline can create a backup, I could not find an option to export only the changes made to the table within a specific time range;
The files that the Pipeline exports are around 10 MB each, which means hundreds of files instead of a couple (for example 100 MB or 1 GB files).
So I am interested in whether there is a different way to make a full backup of the current data and afterwards export the month-to-month changes (something like a monthly incremental backup), without ending up with millions of 10 MB files.
Any comments, clarifications, code samples, corrections are appreciated.
Thanks for your time.
You have, basically, two options:
Implement your own logic with DynamoDB Streams and process the changes yourself
Use a combination of AWS Glue for ETL processing and, possibly, AWS Athena to query your data from S3. Be careful to use the Apache Parquet format for better query performance, and cache your results somewhere else
We are working on HomeKit-enabled IoT devices. HomeKit is designed for consumer use and does not have the ability to collect metrics (power, temperature, etc.), so we need to implement it separately.
Let's say we have 10,000 devices. Each sends one collection of metrics every 5 seconds, so each second we need to receive 10000/5 = 2000 collections. The end user needs to see graphs of each metric over a specified period of time (1 week, month, year, etc.). Each day the system will receive 172.8 million records. This raises a lot of questions.
First of all, there's no need to store all the data, as the user only needs graphs for the specified period, so some aggregation is needed. What database solution fits this? I believe no RDBMS will handle such an amount of data. Then, how do we get average values of the metrics to present to the end user?
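To make the numbers and the aggregation idea concrete, here is a minimal sketch. It first reproduces the arithmetic above and then averages raw samples into fixed-width time buckets, which is one simple way to get the per-period averages for graphs. The `downsample` function and the `(timestamp, value)` sample layout are assumptions for illustration, not part of any particular database.

```python
from collections import defaultdict

DEVICES = 10_000
INTERVAL_S = 5

per_second = DEVICES // INTERVAL_S   # collections received per second
per_day = per_second * 86_400        # collections received per day


def downsample(samples, bucket_s):
    """Average (timestamp_seconds, value) samples into fixed-width buckets."""
    sums = defaultdict(lambda: [0.0, 0])  # bucket start -> [total, count]
    for ts, value in samples:
        bucket = ts - ts % bucket_s       # start of the bucket this sample falls in
        acc = sums[bucket]
        acc[0] += value
        acc[1] += 1
    return {bucket: total / count for bucket, (total, count) in sorted(sums.items())}
```

For a yearly graph one might keep only hourly or daily buckets, which shrinks 17,280 raw samples per device per day down to 24 values or a single value.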
AWS has shared a time-series data processing architecture:
Very simplified, I think of it this way:
Devices push data directly to DynamoDB using HTTP API
Metrics are stored in one table per 24 hours
At the end of the day, a procedure runs on Elastic MapReduce and produces ready-made JSON files with the data required to show graphs per time period.
Old tables are stored in Redshift for further applications.
Has anyone already done something similar before? Maybe there is a simpler architecture?
This requires big data infrastructure such as:
1) Hadoop cluster
2) Spark
3) HDFS
4) HBase
You can use Spark to read the data as a stream. The streamed data can be stored in HDFS, which lets you store large files across the Hadoop cluster. You can then use a MapReduce job to extract the required data set from HDFS and store it in HBase, the Hadoop database. HDFS is a distributed, scalable store for big data. Finally, you can use query tools to query HBase.
IoT data --> Spark --> HDFS --> MapReduce --> HBase --> Query HBase
The reason I am suggesting this architecture is scalability. The input data can grow with the number of IoT devices. In the above architecture the infrastructure is distributed, and the cluster can scale out by adding nodes.
This is a proven architecture in big data analytics applications.
I have a "database choice" and architecture question.
Use-case:
Clients will upload large .json files (or another format like .tsv, it is irrelevant) where each line holds data about one of their customers (e.g. name, address, etc.)
We need to stream this data later on to process it and store the results, which will also be a large file where each line holds data about one customer (approximately the same size as the uploaded file).
My requirements:
Streaming should be as fast as possible (e.g. > 1000 rps), and we could have multiple processes running in parallel (for multiple clients)
The database should be scalable and fault tolerant. Because many GB of data could easily be uploaded, it should be easy for me to automatically add new commodity instances (using AWS) if storage gets low.
The database should have some kind of replication, because we don't want to lose data.
No index is required since we are just streaming data.
Which database would you suggest for this problem? We tried uploading the data to Amazon S3 and letting them take care of scaling etc., but reading/streaming from it is slow.
Thanks,
Ivan
Initially uploading the files to S3 is fine, but then pick them up and push each line to Kinesis (or MSK or even Kafka on EC2s if you prefer); from there, you can hook up the stream processing framework of your choice (Flink, Spark Streaming, Samza, Kafka Streams, Kinesis KCL) to do transformations and enrichment, and finally you’ll want to pipe the results into a storage stack that will allow streaming appends. A few obvious candidates:
HBase
Druid
Keyspaces for Cassandra
Hudi (or maybe LakeFS?) on top of S3
Which one you choose is kind of up to your needs downstream in terms of query flexibility, latency, integration options/standards, etc.
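For the "push each line" step, most stream services accept several records per request, so it usually pays to send lines in batches rather than one at a time. The sketch below only shows the batching; the `batched_lines` name is hypothetical, the default of 500 records per batch is an assumption that should be checked against the limits of whichever service you pick, and the actual send call is deliberately left out.

```python
from typing import Iterable, Iterator, List


def batched_lines(lines: Iterable[str], max_records: int = 500) -> Iterator[List[str]]:
    """Yield non-empty lines in batches of at most max_records."""
    batch: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines in the uploaded file
        batch.append(line)
        if len(batch) == max_records:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller batch
```

Because it works on any iterable of lines, it can be fed directly from an open file object, so the uploaded file never has to fit in memory.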
I have a large dataset (>40 GB) which I want to store in S3 and then query with Athena.
As suggested by this blog post, I could store my data in the following hierarchical directory structure to enable using MSCK REPAIR to automatically add partitions while creating a table from my dataset.
s3://yourBucket/pathToTable/<PARTITION_COLUMN_NAME>=<VALUE>/<PARTITION_COLUMN_NAME>=<VALUE>/
However, this requires me to split my dataset into many smaller data files and each will be stored under a nested folder depending on the partition keys.
Although partitioning could reduce the amount of data scanned by Athena and therefore speed up a query, would managing a large number of small files cause performance issues for S3? Is there a tradeoff here I need to consider?
Yes, you may see a significant drop in efficiency with small files and lots of partitions.
There is a good explanation here, with suggestions on file sizes and the number of partitions: files should be larger than 128 MB to compensate for the overhead.
Also, I performed some experiments on a very small dataset (1 GB), partitioning my data by minute, hour and day. The amount of scanned data decreases when you make the partitions smaller, but the time spent on the query increases a lot (40 times slower in some experiments).
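If you already have many small files, one way to act on the 128 MB guideline is to merge them in groups before querying. The sketch below only plans the grouping from a list of file names and sizes; `plan_compaction` is a hypothetical helper, and the actual merging (for example rewriting each group as one file) is not shown.

```python
def plan_compaction(file_sizes, target_bytes=128 * 1024 * 1024):
    """Group (name, size_in_bytes) pairs into batches that reach target_bytes."""
    batches, current, current_size = [], [], 0
    for name, size in file_sizes:
        current.append(name)
        current_size += size
        if current_size >= target_bytes:
            batches.append(current)
            current, current_size = [], 0
    if current:
        batches.append(current)  # the last batch may be smaller than the target
    return batches
```

Running this per partition keeps the partition layout intact while reducing the number of objects the query engine has to open.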
I will try to get into it without veering too much into the realm of opinion.
For the use cases where I have used Athena, 40 GB is actually a very small dataset by the standards of what the underlying technology (Presto) is designed to handle. According to the Presto web page, Facebook uses the underlying technology to query their 300 PB data warehouse. I routinely use it on datasets between 500 GB and 1 TB in size.
As for the underlying storage, S3 has been used to host Dropbox and Netflix, so I doubt most enterprises could come anywhere near taxing the storage infrastructure. Where you may have heard about performance issues with S3 is in relation to websites storing many small pieces of static content in files scattered across S3. In that case, a delay in retrieving one of these small pieces of content might affect the user experience on the larger site.
Related Reading:
Presto
We are currently developing a tool to count wildlife passing through defined areas. The gadget that automatically counts the animals will send data (weather, # of animals passing, etc.) at 5-minute intervals via HTTP to our API. There will be hundreds of these measurement stations, and the system should be scalable.
Now the question arose whether to use a filesystem or an RDBMS to save this data.
Pro DB
save exact time and date when the entry was created
directly related to area# via indexed key
Pro Filesystem
Collecting data is not as resource intensive since for every API call only 1 line will be appended to the file
Properties of the data:
only related to 1 DB entry (the area #)
the measurement stations are in remote areas, so we have to account for outages
What will be done with the data
Give an overview over time periods per area #
act as an early warning system if the # of animals is surprisingly low/high
Probably by using a cronjob and comparing to similar data
We are thinking of choosing an RDBMS to save the data, but I am worried that after millions of entries the DB will slow down and eventually stop working. This question was asked here, where 360M entries was not really considered "big data", so I'm not quite sure about my task either.
Should we choose one of these recommended technologies (MongoDB ...), or can this task be handled by PostgreSQL or MySQL?
I have created such a system for marine buoys. The devices send data over GPRS / Iridium using HTTP or raw TCP sockets (to minimize bandwidth).
The receiving server stores the data in a DB table, with the data provided and a timestamp.
The data is then parsed and records are created in another table.
The devices can also request UTC time from the server, thus not needing an RTC.
Before anything is stored in the "raw" table, a row is appended to a text file. This is purely for logging, or for being able to recover from database downtime.
As for database type, I'd recommend a regular RDBMS. Define markers for your data. We use 4-digit codes, which give headroom for 10000 types of measured values.
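The "append to a text file first, then store in the raw table" step can be sketched roughly as follows. This is not our production code: SQLite stands in for whatever RDBMS you choose, and the table name, column names and function names are all made up for illustration.

```python
import json
import sqlite3
from datetime import datetime, timezone


def init_db(conn: sqlite3.Connection) -> None:
    """Create the hypothetical 'raw' table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw ("
        "id INTEGER PRIMARY KEY, station_id TEXT, received_at TEXT, payload TEXT)"
    )


def store_raw(conn: sqlite3.Connection, log_path: str, station_id: str, payload: str) -> None:
    """Log the incoming message to a text file, then insert it into the raw table."""
    received_at = datetime.now(timezone.utc).isoformat()
    record = {"station_id": station_id, "received_at": received_at, "payload": payload}
    # 1. Append to the text log first, so the record survives database downtime.
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")
    # 2. Then store the unparsed message together with the server timestamp.
    conn.execute(
        "INSERT INTO raw (station_id, received_at, payload) VALUES (?, ?, ?)",
        (station_id, received_at, payload),
    )
    conn.commit()
```

If the insert fails, the line is already in the log file, so the missing rows can be replayed into the database later. Parsing into the second table would then run as a separate step over the raw rows.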