database to be used for log storing - database

I am starting up with a log monitoring tool which captures audit logs,firewall logs and many other logs.I Have an issue in choosing the right kind of database for this project as number of logs generated per second is at least 500 which has to be stored.
Let's assume that this should be able to support 1BN+ log entries per month. The two factors that will likely come into play most are ability to write quickly and also the ability display reports quickly.

A common stack used for log storing is the ELK stack composed of Elastic search, Logstash, and Kibana.
Elastic search is used to store the documents and can execute queries.
Logstash monitors and parses the logs.
Kibana is used to create reports based off of the data in elastic search.
There are other options out there. Splunk is a paid solution that does all of the above. Graylog is another solution that is similar to splunk.

Related

Continuously updated database shared between multiple AWS EC2 instances

For a small personal project, I've been scraping some data every 5 minutes and saving it in a SQL database. So far I've been using a tiny EC2 AWS instance in combination with a 100GB EBS storage. This has been working great for the scraping, but is becoming unusable for analysing the resulting data, as the EC2 instance doesn't have enough memory.
The data analysis only happens irregularly, so it would feel a waste to pay 24/7 to have a bigger EC2 instance, so I'm looking for something more flexible. From reading around I've learned:
You can't connect EBS to two EC2 instances at the same time, so spinning up a second temporary big instance whenever analysis needed isn't an option.
AWS EFS seems a solution, but is quite a lot more expensive and considering my limited knowledge, I'm not a 100% sure this is the ideal solution.
The serverless options like Amazon Athena look great, but this is based on S3 which is a no-go for data that needs continuous updating (?).
I assume this is quite a common usecase for AWS, so I'm hoping to try to get some pointers in the right direction. Are there options I'm overlooking that fit my problem? Is EFS the right way to go?
Thanks!
Answers by previous users are great. Let's break them down in options. It sounds to me that your initial stack is a Custom SQL Database you installed in EC2.
Option 1 - RDS Read Replicas
Move your DB to RDS, this would give you a lot of goodies, but the main one we are looking for is Read Replicas if your reading/s grows you can create additional read replicas and put them behind a load balancer. This setup is the lowest hanging fruit without too many code changes.
Option 2 - EFS to Share Data between EC2 Instances
Using EFS is not straightforward, to no fault of EFS. Some databases save unique IDs to the filesystem, meaning you can't share the hard drive. EFS is a service and will add some lag to every read/write operation. Depending on how your installed Database distribution it might not even be possible.
Option 3 - Athena and S3
Having the workers save to S3 instead of SQL is also doable, but it means rewriting your web scraping tool. You can call S3 -> PutObject on the same key multiple times, and it will overwrite the previous object. Then you would need to rewrite your analytics tool to query S3. This option is excellent, and it's likely the cheapest in 'operation cost,' but it means that you have to be acquainted with S3, and more importantly, Athena. You would also need to figure out how you will save new data and the best file format for your application. You can start with regular JSON or CSV blobs and then later move to Apache Parquet for lower cost. (For more info on how that statement means savings see here: https://aws.amazon.com/athena/pricing/)
Option 4 - RedShift
RedShift is for BigData, I would wait until querying regular SQL is a problem (multiple seconds per query), and then I would start looking into it. Sure it would allow you query very for cheap, but you would probably have to set up a Pipeline that listens to SQL (or is triggered by it) and then updates RedShift. Reason is because RedShift scales depending on your querying needs, and you can spin up multiple machines easily to make querying faster.
As far as I can see S3 and Athena is good option for this. I am not sure about your concern NOT to use S3, but once you can save scraped data in S3 and you can analyse them with Athena (Pay Per Query model).
Alternatively, you can use RedShift to save data and analyse which has on demand service similar to ec2 on demand pricing model.
Also, you may use Kenisis Firehose which can be used to analyse data real time as and when you ingest them.
Your scraping workers should store data in Amazon S3. That way, worker instances can be scaled (and even turned off) without having to worry about data storage. Keep process data (eg what has been scraped, where to scrape next) in a database such as DynamoDB.
When you need to query the data saved to Amazon S3, Amazon Athena is ideal if it is stored in a readable format (CSV, ORC, etc).
However, if you need to read unstructured data, your application can access the files directly S3 by either downloading and using them, or reading them as streams. For this type of processing, you could launch a large EC2 instance with plenty of resources, then turn it off when not being used. Better yet, launch it as a Spot instance to save money. (It means your system will need to cope with potentially being stopped mid-way.)

Is google Datastore recommended for storing logs?

I am investigating what might be the best infrastructure for storing log files from many clients.
Google App engine offers a nice solution that doesn't make the process a IT nightmare: Load balancing, sharding, server, user authentication - all in once place with almost zero configuration.
However, I wonder if the Datastore model is the right for storing logs. Each log entry should be saved as a single document, where each clients uploads its document on a daily basis and can consists of 100K of log entries each day.
Plus, there are some limitation and questions that can break the requirements:
60 seconds timeout on bulk transaction - How many log entries per second will I be able to insert? If 100K won't fit into the 60 seconds frame - this will affect the design and the work that needs to be put into the server.
5 inserts per entity per seconds - Is a transaction considered a single insert?
Post analysis - text search, searching for similar log entries cross clients. How flexible and efficient is Datastore with these queries?
Real time data fetch - getting all the recent log entries.
The other option is to deploy an elasticsearch cluster on goole compute and write the server on our own which fetches data from ES.
Thanks!
Bad idea to use datastore and even worse if you use entity groups with parent/child as a comment mentions when comparing performance.
Those numbers do not apply but datastore is not at all designed for what you want.
bigquery is what you want. its designed for this specially if you later want to analyze the logs in a sql-like fashion. Any more detail requires that you ask a specific question as it seems you havent read much about either service.
I do not agree, Data Store is a totally fully managed no sql document store database, you can store the logs you want in this type of storage and you can query directly in datastore, the benefits of using this instead of BigQuery is the schemaless part, in BigQuery you have to define the schema before inserting the logs, this is not necessary if you use DataStore, think of DataStore as a MongoDB log analysis use case in Google Cloud.

Automatically push engine datastore data to bigquery tables

To move data from datastore to bigquery tables I currently follow a manual and time consuming process, that is, backing up to google cloud storage and restoring to bigquery. There is scant documentation on the restoring part so this post is handy http://sookocheff.com/posts/2014-08-04-restoring-an-app-engine-backup/
Now, there is a seemingly outdated article (with code) to do it https://cloud.google.com/bigquery/articles/datastoretobigquery
I've been, however, waiting for access to this experimental tester program that seems to automate the process, but gotten no access for months https://docs.google.com/forms/d/1HpC2B1HmtYv_PuHPsUGz_Odq0Nb43_6ySfaVJufEJTc/viewform?formkey=dHdpeXlmRlZCNWlYSE9BcE5jc2NYOUE6MQ
For some entities, I'd like to push the data to big query as it comes (inserts and possibly updates). For more like biz intelligence type of analysis, a daily push is fine.
So, what's the best way to do it?
There are three ways of entering data into bigquery:
through the UI
through the command line
via API
If you choose API, then you can have two different ways: "batch" mode or streaming API.
If you want to send data "as it comes" then you need to use the streaming API. Every time you detect a change on your datastore (or maybe once every few minutes, depending on your needs), you have to call the insertAll method of the API. Please notice you need to have a table created beforehand with the structure of your datastore. (This can be done via API if needed too).
For your second requirement, ingesting data once a day, you have the full code in the link you provided. All you need to do is adjust the JSON schema to those of your data store and you should be good to do.

Managing high-volume writes to SQL Server database

I have a web service that is used to manage files on a filesystem that are also tracked in a Microsoft SQL Server database. We have a .NET system service that watches for files that are added using the FileSystemWatcher class. When a file-added callback comes from FileSystemWatcher, metadata about the file is added to our database, and it works fairly well.
I've now come to a bit of a scalability problem. I'm adding large quantities of files to the filesystem in rapid succession, and this ends up hammering the database with file adds which results in locking up my web front-end.
I have yet to work on database scability issues, so I'm trying to come up with mitigate tactics. I was thinking of perhaps caching file adds and only writing them off to the database every five minutes or so, but I'm not sure how practical that is. This is data that needs to find its way into our database at some point anyway, and so it's going to have to get hammered at some point. Maybe I could limit the number of file db entries written per second to a certain amount, but then I risk having that amount be less than the rate at which files are added. How can I best tackle this?
Have you thought about using something like SQL Server Service Broker? That way you could push through tons of entries in a burst and it would level out the inserts into your database.
Basically you'd be pushing messages onto a queue which would then be consumed by a receiver stored procedure that would perform the insert for you. You could limit the maximum number of receivers executing to help with the responsiveness issues in your web interface.
There's a nice intro paper here. Although it's for 2005, not much has changed between 2005 and the newer versions of SQL Server.
You have a performance problem and you should approach it with a performance investigation methodology like Waits and Queues. Once you identify the actual problem, we can discuss solutions.
This is just a guess but, assuming the notification 'update metadata' code is a stright forward insert, the likely problem is that you're generating one transaction per notification. This results in commit flush waits, see Diagnosing Transaction Log Performance . Batch commit (aggregate multiple notifications before committing) is the canonical solution.
first option is using Caching to handle high-volume data. or using clusters for analysis high volume data. please click here for more information.

Retrieving information from aggregated weblogs data, how to do it?

I would like to know how to retrieve data from aggregated logs? This is what I have:
- about 30GB daily of uncompressed log data loaded into HDFS (and this will grow soon to about 100GB)
This is my idea:
- each night this data is processed with Pig
- logs are read, split, and custom UDF retrieves data like: timestamp, url, user_id (lets say, this is all what I need)
- from log entry and loads this into HBase (log data will be stored infinitely)
Then if I want to know which users saw particular page within given time range I can quickly query HBase without scanning whole log data with each query (and I want fast answers - minutes are acceptable). And there will be multiple querying taking place simultaneously.
What do you think about this workflow? Do you think, that loading this information into HBase would make sense? What are other options and how do they compare to my solution?
I appreciate all comments/questions and answers. Thank you in advance.
With Hadoop you are always doing one of two things (either processing or querying).
For what you are looking to-do I would suggest using HIVE http://hadoop.apache.org/hive/. You can take your data and then create a M/R job to process and push that data how you like it into HIVE tables. From there (you can even partition on data as it might be appropriate for speed to not look at data not required as you say). From here you can query out your data results as you like. Here is very good online tutorial http://www.cloudera.com/videos/hive_tutorial
There are a lots of ways to solve this but it sounds like HBase is a bit overkill unless you want to setup all the server required for it to run as an exercise to learn it. HBase would be good if you have thousands of people simultaneously looking to get at the information.
You might also want to look into FLUME which is new import server from Cloudera . It will get your files from some place straight to HDFS http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3b2-flume/

Resources