I wonder why there is an AvroKeyValueSinkWriter for Flink, but no simple AvroSinkWriter with a regular (non key-value) Schema.
I use this to generate near-streaming Avro files, which I batch once an hour into Parquet files.
I use Flink's BucketingSink.
The key-value schema is giving me a hard time when generating Parquet.
Did I miss something? Thanks!
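For reference, here is roughly the wiring I mean, with the value record wrapped in a Tuple2 and a throwaway string key just to satisfy the key-value schema. This is a simplified sketch; the class and property-constant names are from the flink-connector-filesystem module as I remember them, so treat them as assumptions rather than working code.

import java.util.HashMap;
import java.util.Map;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.connectors.fs.AvroKeyValueSinkWriter;
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
import org.apache.flink.streaming.connectors.fs.bucketing.DateTimeBucketer;

public class AvroBucketingSketch {

    public static void addSink(DataStream<Tuple2<String, GenericRecord>> records,
                               Schema valueSchema) {
        Map<String, String> properties = new HashMap<>();
        // Both schemas are mandatory; here the key is just a throwaway string.
        properties.put(AvroKeyValueSinkWriter.CONF_OUTPUT_KEY_SCHEMA,
                Schema.create(Schema.Type.STRING).toString());
        properties.put(AvroKeyValueSinkWriter.CONF_OUTPUT_VALUE_SCHEMA,
                valueSchema.toString());

        BucketingSink<Tuple2<String, GenericRecord>> sink =
                new BucketingSink<>("hdfs:///data/avro");           // assumed base path
        sink.setBucketer(new DateTimeBucketer<>("yyyy-MM-dd--HH")); // hourly buckets
        sink.setWriter(new AvroKeyValueSinkWriter<String, GenericRecord>(properties));

        records.addSink(sink);
    }
}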
You will not find much help with anything Flink-related.
The documentation relies on Javadoc, and the examples are almost one-liners: word count and other nonsense.
I have yet to see what a "pro" Flink coder can do, or to learn the right way to do some of the simplest tasks. An example that reads from Kafka, parses an Avro or JSON record, and then writes specific data to a file system or HDFS would be great. You won't find any such examples.
You would think that by now, searching the net would turn up some solid, complex examples.
Most of these projects require you to read through all the source code and try to figure out an approach.
In the end it is just easier to use Spring Boot and jam code into a service than to buy into Flink, and to some degree Spark.
Best of luck to you.
Recently I had a chance to get to know the flink-table-store project. I was attracted by the idea behind it at first glance.
After reading the docs, a question has been on my mind for a while. It's about the design of the file storage.
It looks like it could be implemented on top of other popular open-source libraries instead of creating a totally new (LSM-tree-based) component. Hudi or Iceberg looks like a good choice, since they both support saving and querying change logs.
If it were done that way, there would be no need to create a component for each related computation engine (Spark, Hive or Trino), since they are already supported by Hudi and Iceberg. That looks like a better solution to me than reinventing the wheel.
So here are my questions: Is there any issue with writing data as Hudi or Iceberg? Why weren't they chosen in the initial design?
Looking for a design explanation.
Flink Table Store is a new project created to natively support update/delete operations on DFS tables using data snapshots.
These features are already available in Apache Hudi (the first open lakehouse format), Delta Lake (the lakehouse format developed and maintained by Databricks) and Apache Iceberg, all of which evolve quickly.
The tables created with these tools can be queried from different tools/engines (Spark, Flink, Trino, Athena, Spectrum, Dremio, ...), but to support all of these tools they make some design trade-offs that can affect performance, while Flink Table Store is created and optimized for Flink, so it gives you the best performance with Apache Flink compared with the other three projects.
Is there any issue writing data as Hudi or Iceberg?
Not at all; a lot of companies use Hudi and Iceberg with Spark, Flink and Trino in production, and they have no issues.
Why weren't they chosen in the initial design?
If you want to create tables readable by other tools, you should avoid Flink Table Store and choose between the other options. But the main idea of Flink Table Store is to create internal tables used to transform your streaming data, similar to KTables in Kafka Streams: you write your streaming data to Flink Table Store tables, transform it in multiple stages, and at the end write the result to a Hudi or Iceberg table so it can be queried by the different tools. A rough sketch of that pattern follows.
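To illustrate that staging pattern, here is a rough Flink SQL sketch driven from Java. The catalog type, warehouse path, table and catalog names are all illustrative assumptions (they depend on the exact flink-table-store and Iceberg/Hudi versions you use), and the raw_orders source table is assumed to be defined elsewhere (e.g. backed by Kafka).

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class TableStoreStagingSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
            TableEnvironment.create(EnvironmentSettings.newInstance().inStreamingMode().build());

        // Catalog backed by the table store; the 'type' and 'warehouse' values are assumptions.
        tEnv.executeSql(
            "CREATE CATALOG ts WITH ('type' = 'table-store', 'warehouse' = 'file:/tmp/table_store')");

        // Internal, KTable-like staging table with a primary key so it can absorb updates.
        tEnv.executeSql(
            "CREATE TABLE IF NOT EXISTS ts.`default`.order_totals ("
          + " order_id BIGINT, total DOUBLE, PRIMARY KEY (order_id) NOT ENFORCED)");

        // Stage 1: continuously aggregate the raw stream into the staging table.
        tEnv.executeSql(
            "INSERT INTO ts.`default`.order_totals "
          + "SELECT order_id, SUM(amount) FROM raw_orders GROUP BY order_id");

        // Stage 2: publish the final result to an external Iceberg (or Hudi) table
        // so other engines can query it.
        tEnv.executeSql(
            "INSERT INTO iceberg_catalog.db.order_totals "
          + "SELECT order_id, total FROM ts.`default`.order_totals");
    }
}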
I'd like to suggest a quick improvement you could make to the online documentation regarding serialization.
As a matter of fact, you did an amazing job both on the implementation and on the documentation. The way Flink automatically figures out how best to serialize objects is very smart and powerful.
While developing a real-time analytics project that is going to leverage Flink, I encountered an issue that is more related to missing documentation than to Flink itself.
I'd like to suggest amending the docs, as it could spare other people several hours of despair in the future :)
I had a couple of classes that needed custom serializers. I created Kryo serializers and plugged them in with registerTypeWithKryoSerializer.
What was not clear in the current docs is that since some of those classes are POJOs, Flink prefers its POJO serializer over the GenericType path that would use my Kryo serializers.
Once I understood that, after several hours of deep debugging, I just made sure those classes were no longer POJOs, and all of a sudden my serializers were used.
So on one side you could think about always preferring registered custom serializers over the POJO serializer. But in the very short term I'd just suggest amending the docs.
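To make the pitfall concrete, here is a minimal, hypothetical example of the registration I mean (the MyEvent class and its serializer are made up). Because MyEvent has a public no-arg constructor and public fields, Flink classifies it as a POJO and uses the PojoSerializer, so the Kryo registration below is silently ignored.

import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.Serializer;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KryoRegistrationSketch {

    // Hypothetical event type; the public no-arg constructor and public fields
    // make Flink's type extractor classify it as a POJO.
    public static class MyEvent {
        public String id;
        public long timestamp;
        public MyEvent() {}
    }

    // Hypothetical custom Kryo serializer for MyEvent.
    public static class MyEventKryoSerializer extends Serializer<MyEvent> {
        @Override
        public void write(Kryo kryo, Output output, MyEvent event) {
            output.writeString(event.id);
            output.writeLong(event.timestamp);
        }

        @Override
        public MyEvent read(Kryo kryo, Input input, Class<MyEvent> type) {
            MyEvent event = new MyEvent();
            event.id = input.readString();
            event.timestamp = input.readLong();
            return event;
        }
    }

    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // This registration only takes effect for types that fall back to Kryo
        // (i.e. GenericTypeInfo); types recognized as POJOs never reach it.
        env.getConfig().registerTypeWithKryoSerializer(MyEvent.class, MyEventKryoSerializer.class);
    }
}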
Let me know what you think, and congrats on this amazing piece of work.
For previous projects we used Storm or Spark Streaming, but Flink is miles ahead for real-time streaming analytics.
Thanks and keep up the good work!
So the current quick workaround is to make sure your objects are not POJOs.
That way they are serialized via the GenericType path, which uses Kryo and therefore picks up your custom serializers.
Very useful for debugging this kind of issue is:
env.getConfig().disableGenericTypes();
That stops task startup with an exception, allowing you to check what kind of serializers and type hints are being used.
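A cheap way to check the same thing without even submitting a job is to ask Flink's type extraction directly; its toString tells you which path will be taken (the class name here is hypothetical):

import org.apache.flink.api.common.typeinfo.TypeInformation;

public class SerializerCheck {

    // Hypothetical type to inspect.
    public static class MyEvent {
        public String id;
        public long timestamp;
        public MyEvent() {}
    }

    public static void main(String[] args) {
        // Prints something like "PojoType<...>" or "GenericType<...>", which tells
        // you up front whether the POJO serializer or the Kryo path will be used.
        System.out.println(TypeInformation.of(MyEvent.class));
    }
}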
I'm starting to learn about big data and Apache Spark, and I have a question.
In the future I'll need to collect data from IoT devices, and this data will come to me as time series data. I was reading about Time Series Databases (TSDB) and I have found some open-source options like Atlas, KairosDB, OpenTSDB, etc.
I actually need Apache Spark, so I want to know: can I use a Time Series Database on top of Apache Spark? Does it make any sense? Please remember that I'm very new to the concepts of big data, Apache Spark and everything else I've mentioned in this question.
If I can run a TSDB with Spark, how can I achieve that?
I'm an OpenTSDB committer. I know this is an old question, but I wanted to answer. My suggestion would be to write your incoming data to OpenTSDB, assuming you just want to store the raw data and process it later. Then, with Spark, execute OpenTSDB queries using the OpenTSDB classes.
You can write data with those classes as well; I think you want the IncomingDataPoint construct, though I don't have the details at hand at the moment. Feel free to contact me on the OpenTSDB mailing list with more questions.
You can see how OpenTSDB handles the incoming "put" request here; you should be able to do the same thing in your code for writes:
https://github.com/OpenTSDB/opentsdb/blob/master/src/tsd/PutDataPointRpc.java#L42
You can see the Splicer project submitting OpenTSDB queries here; I think a similar method could be used in your Spark project:
https://github.com/turn/splicer/blob/master/src/main/java/com/turn/splicer/tsdbutils/SplicerQueryRunner.java#L87
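If you would rather not depend on OpenTSDB's internal classes, the HTTP API is another option for writes: OpenTSDB exposes an /api/put endpoint that accepts JSON data points. A rough, self-contained Java sketch (host, port, metric and tag names are made up):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class OpenTsdbPutSketch {
    public static void main(String[] args) throws Exception {
        // One data point in the JSON shape that /api/put expects.
        String dataPoint = "{"
            + "\"metric\": \"sensor.temperature\","            // hypothetical metric
            + "\"timestamp\": " + (System.currentTimeMillis() / 1000) + ","
            + "\"value\": 21.5,"
            + "\"tags\": {\"device\": \"device-42\"}"          // hypothetical tag
            + "}";

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://tsdb-host:4242/api/put"))  // assumed host/port
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(dataPoint))
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        // A 204 No Content status means the point was accepted.
        System.out.println("HTTP status: " + response.statusCode());
    }
}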
I want to switch my Rails project from Solr to Elasticsearch (just for fun), but I'm not sure about the best approach to indexing the documents. Right now I'm using Resque (a background job) for this task, but I've been reading about "rivers" in Elasticsearch and they look promising.
Can anyone with experience on this topic give me some tips? Performance results? Scalability?
Thanks in advance.
P.S.: Although it's just for fun at the moment, I have in mind migrating a larger production project from Solr to Elasticsearch.
It's hard to understand your situation/concerns from your question. With Elasticsearch, you either push data in, or use a river to pull it in.
When you are pushing the data in, you're in control of how your feeder operates, how it processes documents, and how the whole pipeline looks (gather data > language analysis > etc. > index). Using a river may be a convenient way to quickly pull some data into Elasticsearch from a certain source (CouchDB, an RDBMS), or to continuously pull data, e.g. from a RabbitMQ stream.
Since you're considering Elasticsearch in the context of a Rails project, you'll probably try out the Tire gem at some point. Supposing you're using an ActiveModel-compatible ORM (for SQL or NoSQL databases), importing is as easy as:
$ rake environment tire:import CLASS=MyClass
See the Tire documentation and the relevant Railscasts episode for more information.
We've been discussing the design of a data warehouse strategy within our group for meeting testing, reproducibility, and data syncing requirements. One of the suggested ideas is to adopt a NoSQL approach using an existing tool rather than try to re-implement a whole lot of the same on a file system. I don't know if a NoSQL approach is even the best fit for what we're trying to accomplish, but perhaps if I describe what we need/want, you all can help.
Most of our files are large, 50+ GB in size, held in a proprietary, third-party format. We need to be able to access each file by a name/date/source/time/artifact combination. Essentially a key-value-pair style look-up.
When we query for a file, we don't want to have to load all of it into memory. The files are really too large and would swamp our server. We want to be able to somehow get a reference to the file and then use a proprietary, third-party API to ingest portions of it.
We want to easily add, remove, and export files from storage.
We'd like to set up automatic file replication between two servers (we can write a script for this). That is, sync the contents of one server with another. We don't need a distributed system where it only appears as if we have one server; we'd like complete replication.
We also have other, smaller files that have a tree-type relationship with the big files. One file's content will point to the next, and so on. It's not a "spoked wheel," it's a full-blown tree.
We'd prefer a Python, C or C++ API to work with a system like this, but most of us are experienced with a variety of languages. We don't mind as long as it works, gets the job done, and saves us time. What do you think? Is there something out there like this?
Have you had a look at MongoDB's GridFS?
http://www.mongodb.org/display/DOCS/GridFS+Specification
You can query files by the default metadata, plus your own additional metadata. Files are broken into small chunks and you can specify which portions you want. Also, files are stored in a collection (similar to an RDBMS table), and you get Mongo's replication features to boot.
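A minimal sketch with the MongoDB Java driver's GridFS API, showing an upload with custom metadata and a partial read by skipping into the download stream. The database, bucket, file and metadata names are made up, and real code would want proper error handling.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.gridfs.GridFSBucket;
import com.mongodb.client.gridfs.GridFSBuckets;
import com.mongodb.client.gridfs.GridFSDownloadStream;
import com.mongodb.client.gridfs.model.GridFSUploadOptions;
import org.bson.Document;
import org.bson.types.ObjectId;

import java.io.FileInputStream;
import java.io.InputStream;

public class GridFsSketch {
    public static void main(String[] args) throws Exception {
        MongoClient client = MongoClients.create("mongodb://localhost:27017");
        MongoDatabase db = client.getDatabase("artifacts");           // hypothetical DB name
        GridFSBucket bucket = GridFSBuckets.create(db, "bigfiles");   // hypothetical bucket name

        // Upload: attach your own metadata (name/date/source/time/artifact)
        // so the file can be looked up by that key combination later.
        GridFSUploadOptions options = new GridFSUploadOptions()
            .metadata(new Document("source", "sensor-A").append("artifact", "run-2014-07-01"));
        try (InputStream in = new FileInputStream("/data/big.bin")) {
            ObjectId id = bucket.uploadFromStream("big.bin", in, options);

            // Partial read: GridFSDownloadStream is a plain InputStream, so you can
            // skip to an offset and read only the portion you need.
            try (GridFSDownloadStream download = bucket.openDownloadStream(id)) {
                download.skip(1024L * 1024L * 1024L); // jump 1 GB into the file
                byte[] buffer = new byte[64 * 1024];
                int read = download.read(buffer);
                System.out.println("Read " + read + " bytes from the middle of the file");
            }
        }
        client.close();
    }
}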
What's wrong with a proven cluster file system? Lustre and Ceph are good candidates.
If you're looking for an object store, Hadoop was built with this in mind. In my experience Hadoop is a pain to work with and maintain.
For me, both Lustre and Ceph have some problems that databases like Cassandra don't have. I think the core question here is what disadvantages Cassandra and other databases like it would have as an FS backend.
Performance could obviously be one. What about space usage? Consistency?