I'm starting to learn about big data and Apache Spark, and I have a question.
In the future I'll need to collect data from IoT devices, and this data will come to me as time-series data. I've been reading about Time Series Databases (TSDBs) and found some open-source options like Atlas, KairosDB, OpenTSDB, etc.
I actually need Apache Spark, so I want to know: can I use a Time Series Database together with Apache Spark? Does it make any sense? Please remember that I'm very new to the concepts of big data, Apache Spark and everything else I've mentioned in this question.
If I can use a TSDB with Spark, how can I achieve that?
I'm an OpenTSDB committer. I know this is an old question, but I wanted to answer. My suggestion would be to write your incoming data to OpenTSDB, assuming you just want to store the raw data and process it later. Then, with Spark, execute OpenTSDB queries using the OpenTSDB classes.
You can also write data with those classes; I think you want the IncomingDataPoint construct, but I don't have the details at hand at the moment. Feel free to contact me on the OpenTSDB mailing list with more questions.
You can see how OpenTSDB handles the incoming "put" request here; you should be able to do the same thing in your code for writes:
https://github.com/OpenTSDB/opentsdb/blob/master/src/tsd/PutDataPointRpc.java#L42
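If you'd rather not depend on the OpenTSDB internals directly, the HTTP /api/put endpoint accepts the same data points as JSON. Here's a minimal sketch in Java; the host, port, metric name and tags are placeholders for illustration:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class OpenTsdbPutExample {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint; point this at your own TSD instance.
        String putUrl = "http://tsdb-host:4242/api/put";

        // One data point in OpenTSDB's JSON format: metric, Unix timestamp,
        // numeric value and a set of tags.
        String dataPoint = "{"
                + "\"metric\": \"sensor.temperature\","
                + "\"timestamp\": " + (System.currentTimeMillis() / 1000) + ","
                + "\"value\": 22.5,"
                + "\"tags\": {\"device\": \"device-001\"}"
                + "}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(putUrl))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(dataPoint))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // A 204 (or 200 with details) means the data point was accepted.
        System.out.println("HTTP status: " + response.statusCode());
    }
}
```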
You can see the Splicer project submitting OpenTSDB queries here; a similar method could be used in your Spark project, I think:
https://github.com/turn/splicer/blob/master/src/main/java/com/turn/splicer/tsdbutils/SplicerQueryRunner.java#L87
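To give a rough idea of how the query side could look from Spark, here's a hedged sketch that issues OpenTSDB /api/query requests from Spark tasks over plain HTTP. The host, metric names and query body are assumptions for illustration; the Splicer code linked above shows a more complete, class-based approach:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Arrays;
import java.util.List;

public class TsdbQueryFromSpark {
    // Hypothetical OpenTSDB endpoint; adjust for your deployment.
    private static final String TSDB_QUERY_URL = "http://tsdb-host:4242/api/query";

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("tsdb-query-example");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Metric names are placeholders; list whatever you stored in OpenTSDB.
            List<String> metrics = Arrays.asList("sys.cpu.user", "sys.mem.free");

            // Each task issues its own HTTP query and returns the raw JSON response.
            JavaRDD<String> responses = sc.parallelize(metrics).map(metric -> {
                String body = "{\"start\":\"1h-ago\",\"queries\":"
                        + "[{\"aggregator\":\"sum\",\"metric\":\"" + metric + "\"}]}";
                HttpRequest request = HttpRequest.newBuilder()
                        .uri(URI.create(TSDB_QUERY_URL))
                        .header("Content-Type", "application/json")
                        .POST(HttpRequest.BodyPublishers.ofString(body))
                        .build();
                return HttpClient.newHttpClient()
                        .send(request, HttpResponse.BodyHandlers.ofString())
                        .body();
            });

            // From here you would parse the JSON into rows and continue with normal Spark processing.
            responses.collect().forEach(System.out::println);
        }
    }
}
```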
Recently I had a chance to get to know the flink-table-store project, and I was attracted by the idea behind it at first glance.
After reading the docs, a question has been on my mind for a while. It's about the design of the file storage.
It looks like the file storage could be implemented on top of other popular open-source libraries rather than building a completely new (LSM-tree-based) component. Hudi or Iceberg looks like a good choice, since they both support saving and querying change logs.
If it were done that way, there would be no need to create a connector for every related computation engine (Spark, Hive or Trino), since those are already supported by Hudi and Iceberg. That looks like a better solution to me than reinventing the wheel.
So here are my questions: Is there any issue with writing the data as Hudi or Iceberg? Why weren't they chosen in the original design?
I'm looking for an explanation of the design.
Flink Table Store is a new project created to natively support update/delete operations on DFS tables using data snapshots.
These features are already available in Apache Hudi (the first open lakehouse format), Delta Lake (the lakehouse format developed and maintained by Databricks) and Apache Iceberg, all of which are evolving quickly.
Tables created with these tools can be queried from different tools/engines (Spark, Flink, Trino, Athena, Spectrum, Dremio, ...), but supporting all of those tools forces some design compromises that can affect performance. Flink Table Store, on the other hand, is created and optimized for Flink, so it gives you the best performance with Apache Flink compared with the other three projects.
Is there any issue with writing the data as Hudi or Iceberg?
Not at all; a lot of companies use Hudi and Iceberg with Spark, Flink and Trino in production, and they have no issues.
Why weren't they chosen in the original design?
If you want to create tables readable by the other tools, you should avoid Flink Table Store and choose one of the other options. But the main idea of Flink Table Store is to create internal tables used to transform your streaming data, similar to KTables in Kafka Streams: you write your streaming data to Flink Table Store tables, transform it across multiple stages, and at the end write the result to a Hudi or Iceberg table so it can be queried by the different tools.
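To make that last point a bit more concrete, here is a rough Flink SQL sketch (via the Java Table API) of the intended usage: an internal table-store table acting as an intermediate stage of a streaming pipeline. The catalog type, warehouse path and table/column names are assumptions based on the flink-table-store docs at the time of writing, and the exact options may differ between releases:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class TableStoreStageSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Register a table-store catalog; the warehouse path is a placeholder.
        tEnv.executeSql(
                "CREATE CATALOG store WITH ("
                        + " 'type' = 'table-store',"
                        + " 'warehouse' = 'file:/tmp/table_store'"
                        + ")");
        tEnv.executeSql("USE CATALOG store");

        // An internal, KTable-like intermediate table that only Flink reads and writes.
        tEnv.executeSql(
                "CREATE TABLE IF NOT EXISTS orders_by_customer ("
                        + " customer_id STRING,"
                        + " order_total DOUBLE,"
                        + " PRIMARY KEY (customer_id) NOT ENFORCED"
                        + ")");

        // Intermediate stage: continuously upsert aggregates from some source table.
        // The `orders` source is assumed to be defined elsewhere in the pipeline:
        // tEnv.executeSql(
        //         "INSERT INTO orders_by_customer "
        //                 + "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id");

        // Final stage: the last job of the pipeline would INSERT INTO a Hudi or Iceberg
        // table (registered in a separate catalog) so that Spark/Trino/etc. can query it.
    }
}
```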
I am working on migrating my old indexing tool to Solr (version 7), but I am not sure how I should index my files into Solr.
The data in our system lives in Oracle DB, MySQL and Cassandra. Updates to these DBs are not very frequent (2-3 times in 24 hrs), and they will be the source for my Solr index.
In one of the collections I will have around 300k-400k records and in another somewhere around 5k.
I could come up with two methods:
Create an ETL pipeline from the different data sources using Apache Storm.
Use Kafka Connect source and sink connectors.
Which of the two is better for a system like ours? Or are both methods overkill for a system like ours?
The data is small enough that you should just do whatever you're comfortable with: either use an existing tool or write a small indexer in a language you have experience with. There is no need to overthink this at that stage.
Beyond that, it's usually impossible to make a recommendation without in-depth knowledge of your situation, except for very specific questions.
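To show how small such an indexer can be at this scale, here's a hedged SolrJ sketch that pushes documents in batches. The Solr URL, collection name and fields are made up for illustration, and the database read loop is stubbed out:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SimpleIndexer {
    public static void main(String[] args) throws Exception {
        // Placeholder URL and collection; adjust for your Solr 7 deployment
        // (CloudSolrClient would be the choice for a SolrCloud cluster).
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/products").build()) {

            List<SolrInputDocument> batch = new ArrayList<>();

            // In a real indexer this loop would iterate over rows fetched
            // from Oracle/MySQL/Cassandra with plain JDBC or driver queries.
            for (int i = 0; i < 1000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Integer.toString(i));
                doc.addField("name_s", "record " + i);
                batch.add(doc);

                // Send documents in batches to keep memory and request sizes bounded.
                if (batch.size() == 500) {
                    solr.add(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                solr.add(batch);
            }
            solr.commit();
        }
    }
}
```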
Problem in a nutshell:
There's a huge amount of input data in JSON format. Right now it's about 1 TB, but it's going to grow. I was told that we're going to have a cluster.
I need to process this data, make a graph out of it and store it in a database. So every time I get a new JSON, I have to traverse the whole graph in a database to complete it.
Later I'm going to have a thin client in a browser, where I'm going to visualize some parts of the graph, search in it, traverse it, do some filtering, etc. So this system is not high load, just a lot of processing and data.
I have no experience in distributed systems, NoSQL databases and other "big data"-like stuff. During my little research I found out that there are too many of them and right now I'm just lost.
What I've got on my whiteboard at the moment:
Apache Spark's GraphX (GraphFrames) for distributed computing on top of some storage (HDFS, Cassandra, HBase, ...) and resource manager (YARN, Mesos, Kubernetes, ...).
Some graph database. I think it's good to use a graph query language like Cypher in Neo4j or Gremlin in JanusGraph/TitanDB (see the short Gremlin sketch after this list). Neo4j is good, but it only has clustering in the Enterprise Edition and I need something open source. So now I'm thinking about the latter ones, which come with Gremlin + Cassandra + Elasticsearch by default.
Maybe I don't need any of these: just store the graph as an adjacency matrix in some RDBMS like Postgres and that's it.
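To make option 2 concrete for myself, this is roughly what a Gremlin traversal looks like in Java against an in-memory TinkerGraph; the labels and properties are just an example I made up, and with JanusGraph the same traversal would run by swapping the graph instance:

```java
import java.util.List;

import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph;

public class GremlinSketch {
    public static void main(String[] args) {
        // In-memory toy graph; with JanusGraph the traversal source would come
        // from JanusGraphFactory instead, but the traversal code stays the same.
        TinkerGraph graph = TinkerGraph.open();
        GraphTraversalSource g = graph.traversal();

        // Build a tiny graph: two entities connected by an edge.
        Vertex alice = g.addV("person").property("name", "alice").next();
        Vertex bob = g.addV("person").property("name", "bob").next();
        g.addE("knows").from(alice).to(bob).iterate();

        // Traverse: who does alice know?
        List<Object> known = g.V().has("person", "name", "alice")
                .out("knows")
                .values("name")
                .toList();

        System.out.println(known); // [bob]
    }
}
```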
I don't know if I need Spark in option 2 or 3. Do I need it at all?
My boss told me to check out Elasticsearch, but I guess I can only use it as an additional full-text search engine.
Thanks for any reply!
Let us start with a couple of follow-up questions:
1) 1 TB is not a huge amount of data if that is also (close to) the total amount of data. Is it? How much new data are you expecting, and at what rate will it arrive?
2) Why would you have to traverse the whole graph if each JSON document merely refers to a small part of the graph? It's either new data or an update of existing data (which you should be able to pinpoint), isn't it?
Yes, that's how you use a graph database...
The rest sort of depends on your answer to 1). If we're talking about IoT-scale numbers of arriving events (tens of thousands per second, sustained), you might need a big data solution. If not, your main problem is getting the initial load done, and it's smooth sailing from there ;-).
Hope this helps.
Regards,
Tom
So I'm designing this blog engine, and I'm trying to store just my blog data, without considering comments, a membership system or any other type of multi-user data.
The blog revolves around two types of data. The first is the actual blog post entry, which consists of a title, the post body and metadata (mostly dates and statistics), so it's really simple and can be represented by a simple JSON object. The second is the blog admin configuration and personal information. The comment system and similar features will be implemented using Disqus.
My main concern here is the ability of such an engine to scale with traffic spikes (I know you might argue this, but let's take it for granted). Since I started this project I've been moving along well with the rest of my stack, except for the data layer. I've been having a dilemma choosing the database: I considered MongoDB, but some reviews, articles and benchmarks suggested slow reads once collections reach a certain size. Next I looked at Redis and its persistence features, RDB and AOF; while Redis is good at both fast reading and writing, I'm wary of using it because I'm not familiar with it. And this whole search keeps running into things like "PostgreSQL 9.4 is now faster than MongoDB for storing JSON documents", etc.
So is there any way I can settle this issue for good, considering that I only need to represent my data in a key-value structure, I only require fast reads (not writes), and I need it to be fault tolerant?
Thank you
If I were you I would start small and not try to optimize for big data just yet. A lot of the blog posts about the downsides of NoSQL solutions are about large data sets, or about people trying to do relational things with a database designed for denormalized data.
My list of databases to consider:
Mongo. It has huge community support and based on recent funding - it's going to be around for a while. It runs very well on a single instance and a basic replica set. It's easy to set up and free, so it's worth spending a day or two running your own tests to settle the issue once and for all. Don't trust a blog.
Couchbase. Supports key/value storage and also has persistence to disk. http://www.couchbase.com/couchbase-server/features Also has had some recent funding so hopefully that means stability. =)
CouchDB/PouchDB. You can use PouchDB purely on the client side and it can connect to a server side CouchDB. CouchDB might not have the same momentum as Mongo or Couchbase, but it's an actively supported product and does key/value with persistence to disk.
Riak. http://basho.com/riak/. Another NoSQL that scales and is a key/value store.
You can install and run a proof-of-concept on all of the above products in a few hours. I would recommend this for the following reasons:
A given database might scale and hit your points, but be unpleasant to use. Consider picking a database that feels fun! Sort of akin to picking Ruby/Python over Java because the syntax is nicer.
Your use case and domain will be fairly unique. Worth testing various products to see what fits best.
Each database has quirks and you won't find those until you actually try one. One might have quirks that are passable, one will have quirks that are a show stopper.
The benefit of trying all of them is that they all support schemaless data, so if you write JSON, you can use all of them! No need to create objects in your code for each database.
If you abstract the database correctly in code, swapping out data stores won't be that painful. In other words, your code will be happier if you make it easy to swap out data stores.
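To illustrate that last point, here's one way the abstraction could look. It's a hedged sketch with a MongoDB-backed implementation using the sync Java driver; the database and collection names are arbitrary, and the same interface could sit in front of Couchbase, CouchDB, Riak or Postgres just as well:

```java
import org.bson.Document;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.ReplaceOptions;

// Small abstraction so the blog engine never talks to a specific database directly.
interface PostStore {
    void save(String id, String json);   // upsert a post stored as a JSON document
    String find(String id);              // return the JSON, or null if missing
}

// One possible implementation; swapping in another store means writing
// another class behind the same interface, not touching the blog code.
class MongoPostStore implements PostStore {
    private final MongoCollection<Document> posts;

    MongoPostStore(MongoClient client) {
        // Database and collection names are arbitrary for this sketch.
        this.posts = client.getDatabase("blog").getCollection("posts");
    }

    @Override
    public void save(String id, String json) {
        Document doc = Document.parse(json).append("_id", id);
        posts.replaceOne(Filters.eq("_id", id), doc, new ReplaceOptions().upsert(true));
    }

    @Override
    public String find(String id) {
        Document doc = posts.find(Filters.eq("_id", id)).first();
        return doc == null ? null : doc.toJson();
    }
}

class Demo {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            PostStore store = new MongoPostStore(client);
            store.save("hello-world", "{\"title\": \"Hello\", \"body\": \"First post\"}");
            System.out.println(store.find("hello-world"));
        }
    }
}
```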
This is only an option for really simple CMSes, but it sounds like that's what you're building.
If your blog is super-simple as you describe and your main concern is very high traffic then the best option might be to avoid a database entirely and have your CMS generate static files instead. By doing this, you eliminate all your database concerns completely.
It's not the best option if you're doing anything dynamic or complex, but in this small use case it might fit the bill.
I want to switch my Rails project from Solr to Elasticsearch (just for fun), but I'm not sure about the best approach to indexing the documents. Right now I'm using Resque (a background job) for this task, but I've been reading about "rivers" in Elasticsearch and they look promising.
Can anyone with experience on this topic give me some tips? Performance results? Scalability?
Thanks in advance
P.S.: Although it's just for fun at the moment, I'm planning to migrate a larger production project from Solr to Elasticsearch.
It's hard to understand your situation/concerns from your question. With Elasticsearch, you either push data in or use a river to pull it in.
When you push the data in, you're in control of how your feeder operates, how it processes documents, and how the whole pipeline looks (gather data > language analysis > etc. > index). Using a river may be a convenient way to quickly pull some data into Elasticsearch from a certain source (CouchDB, an RDBMS), or to continuously pull data, e.g. from a RabbitMQ stream.
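For the "push" route, the mechanics are just HTTP: serialize your record as JSON and index it. In a Rails app you'd normally go through a client library rather than raw HTTP, but the underlying request looks roughly like this sketch; the host, index name, id and document are placeholders, and the exact URL shape varies between Elasticsearch versions:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PushToElasticsearch {
    public static void main(String[] args) throws Exception {
        // Placeholder cluster address, index name and document id.
        String url = "http://localhost:9200/articles/_doc/1";

        String document = "{\"title\": \"Hello\", \"body\": \"Indexed by pushing over HTTP\"}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(url))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(document))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // Elasticsearch replies with the index name, id and result ("created"/"updated").
        System.out.println(response.body());
    }
}
```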
Since you're considering Elasticsearch in the context of a Rails project, you'll probably try out the Tire gem at some point. Supposing you're using an ActiveModel-compatible ORM (for SQL or NoSQL databases), importing is as easy as:
$ rake environment tire:import CLASS=MyClass
See the Tire documentation and the relevant Railscasts episode for more information.