I am trying to dive into the new Stateful Functions approach, and I have already tried creating a savepoint manually (https://ci.apache.org/projects/flink/flink-statefun-docs-release-2.1/deployment-and-operations/state-bootstrap.html#creating-a-savepoint).
It works like a charm, but I can't find a way to do it automatically. For example, I have a couple of million keys and I need to write them all to the savepoint.
Is your question about how to replace the env.fromElements in the example with something that reads from a file or some other data source? Flink's DataSet API, which is what's used here, can read from any HadoopInputFormat. See DataSet Connectors for details.
There are easy-to-use shortcuts for common cases. If you just want to read data from a file using a TextInputFormat, that would look like this:
env.readTextFile(path)
and to read from a CSV file using a CsvInputFormat:
env.readCsvFile(path)
See Data Sources for more about working with these shortcuts.
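To make that concrete, here is a minimal sketch of swapping env.fromElements for a file-based source. The class name and path below are made up; the rest of the bootstrap pipeline from the linked docs would consume `keys` where the example consumes env.fromElements(...):

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class BootstrapFromFile {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // One key per line; the path is a placeholder for your own file.
        DataSet<String> keys = env.readTextFile("hdfs:///bootstrap/keys.txt");

        // Hand `keys` to the same transformations the bootstrap example
        // applies to env.fromElements(...). Printing here just triggers
        // execution so the sketch is runnable on its own.
        keys.print();
    }
}
```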
If I've misinterpreted the question, please clarify your concerns.
I am trying to save data as it arrives, in a streaming fashion (with the least amount of delay), to my database, which is InfluxDB. Currently I save it in batches.
Current setup - interval-based
Currently I have an Airflow instance where I read the data from a REST API every 5 minutes and then save it to InfluxDB.
Desired setup - continuous
Instead of saving data every 5 minutes, I would like to establish a connection via a WebSocket (I guess) and save the data as it arrives. I have never done this before and I am confused about how it is actually done. Some questions I have are:
Once I write the code for it, do I keep it running like a daemon?
Do I need to use something like Telegraf for this, or is that not really the case (example article)?
Instead of Airflow (since it is for batch processing), do I need to use something like Apache Beam or Spark?
As you can see, I am quite lost on where to start, what to read, and how to make sense of all this. Any advice on direction and/or guidance for a setup would be very much appreciated.
If I understand correctly, you are keen to code a Java service that processes the incoming data, so one solution is to implement a WebSocket endpoint with, for example, Jetty.
From there you receive the data (in JSON format, for example) and write it to the database using the influxdb-java library, which lets you create and manage the data.
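A minimal sketch of that combination, assuming an embedded Jetty 9 server and the influxdb-java client; the port, credentials, database, and measurement names are all made up:

```java
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.websocket.api.annotations.OnWebSocketMessage;
import org.eclipse.jetty.websocket.api.annotations.WebSocket;
import org.eclipse.jetty.websocket.server.WebSocketHandler;
import org.eclipse.jetty.websocket.servlet.WebSocketServletFactory;
import org.influxdb.InfluxDB;
import org.influxdb.InfluxDBFactory;
import org.influxdb.dto.Point;

import java.util.concurrent.TimeUnit;

@WebSocket
public class IngestSocket {

    // Connection details and database name are placeholders.
    private static final InfluxDB INFLUX =
            InfluxDBFactory.connect("http://localhost:8086", "user", "password");
    static {
        INFLUX.setDatabase("mydb");
    }

    @OnWebSocketMessage
    public void onMessage(String message) {
        // In a real service you would parse `message` (e.g. JSON) into
        // proper fields instead of storing the raw payload.
        INFLUX.write(Point.measurement("incoming")
                .time(System.currentTimeMillis(), TimeUnit.MILLISECONDS)
                .addField("payload", message)
                .build());
    }

    public static void main(String[] args) throws Exception {
        Server server = new Server(8080); // port is arbitrary
        server.setHandler(new WebSocketHandler() {
            @Override
            public void configure(WebSocketServletFactory factory) {
                factory.register(IngestSocket.class);
            }
        });
        server.start();
        server.join(); // blocks for the lifetime of the service
    }
}
```

This also answers the daemon question above: run as a plain Java process, server.join() keeps it alive indefinitely, typically supervised by something like systemd or a container.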
I don't know Airflow or how you produce the data, so maybe there are built-in tools (InfluxDB sinks) that can save you some work in your context.
I hope this gives you some guidelines to start digging further.
This is just pie-in-the-sky brainstorming; I'm not expecting concrete answers but hoping for some pointers.
I'm imagining a workflow where we trigger a savepoint and inspect the savepoint files to look at the state for specific operators -- as a debugging aid, perhaps, or as a simpler(?) way of achieving what we might do with queryable state...
Assuming that could work, how about the possibility of modifying / fixing the data in the savepoint to be used when restarting the same or a modified version of the job?
Or perhaps generating a savepoint more or less from scratch to define the initial state for a new job? Sort of in lieu of feeding data in to backfill state?
Do such facilities exist already? My guess is no, based on what I've been able to find so far. How would I go about accomplishing something like that? My high-level idea so far goes something like:
savepoint -->
SavepointV2Serializer.deserialize -->
write to JSON -->
manually inspect / edit the files, or
other tooling that works with JSON to inspect / modify -->
SavepointV2Serializer.serialize -->
new savepoint
I haven't actually written any code yet, so I really don't know how feasible that is. Thoughts?
You want to use the State Processor API, which is coming soon as part of Flink 1.9. This will make it possible to read, write, and modify savepoints using Flink's batch DataSet API.
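For instance, reading a list state out of an existing savepoint might look roughly like this; the savepoint path, operator uid, state name, and state type are all placeholders:

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.runtime.state.memory.MemoryStateBackend;
import org.apache.flink.state.api.ExistingSavepoint;
import org.apache.flink.state.api.Savepoint;

public class ReadSavepoint {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Load an existing savepoint for inspection.
        ExistingSavepoint savepoint = Savepoint.load(
                env, "hdfs:///savepoints/savepoint-1234", new MemoryStateBackend());

        // Read a list state registered under a given operator uid and state name.
        DataSet<Long> counts =
                savepoint.readListState("my-operator-uid", "count-state", Types.LONG);

        counts.print();
    }
}
```

The same API offers writers for producing new or modified savepoints, which covers the "generate initial state from scratch" idea as well.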
I am trying to process some files on HDFS and write the results back to HDFS too. The files are already prepared before the job starts. The thing is, I want to write to different paths and files based on the file content. I am aware that BucketingSink (doc here) is provided to achieve this in Flink streaming. However, it seems that the DataSet API does not have a similar facility. I have found some Q&As on Stack Overflow (1, 2, 3). Now I think I have two options:
Use Hadoop API: MultipleTextOutputFormat or MultipleOutputs;
Read the files as a stream and use BucketingSink.
My question is how to choose between them, or is there another solution? Any help is appreciated.
EDIT: This question may be a duplicate of this.
We faced the same problem. We too are surprised that DataSet does not support addSink().
I recommend not switching to streaming mode; you might give up some optimizations (e.g., memory pools) that are available in batch mode.
You may have to implement your own OutputFormat to do the bucketing: extend OutputFormat[YOUR_RECORD] (or RichOutputFormat[YOUR_RECORD]), where you can still use a BucketAssigner[YOUR_RECORD, String] to open/write/close the output streams.
That's what we did and it's working great.
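For reference, a rough sketch of what that could look like; this is not their actual code, and the record type, bucket rule, and paths are placeholders (a plain string split stands in for a real BucketAssigner):

```java
import org.apache.flink.api.common.io.RichOutputFormat;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.core.fs.FSDataOutputStream;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Sketch: one output stream per bucket, with the bucket id derived
// from each record as it is written.
public class BucketingOutputFormat extends RichOutputFormat<String> {

    private final String basePath;
    private transient Map<String, FSDataOutputStream> streams;
    private transient int taskNumber;

    public BucketingOutputFormat(String basePath) {
        this.basePath = basePath;
    }

    @Override
    public void configure(Configuration parameters) {}

    @Override
    public void open(int taskNumber, int numTasks) {
        this.taskNumber = taskNumber;
        this.streams = new HashMap<>();
    }

    @Override
    public void writeRecord(String record) throws IOException {
        // Placeholder bucket rule: first comma-separated field of the record.
        String bucket = record.split(",", 2)[0];

        FSDataOutputStream out = streams.get(bucket);
        if (out == null) {
            // One part file per bucket per parallel task.
            Path path = new Path(basePath, bucket + "/part-" + taskNumber);
            out = path.getFileSystem().create(path, FileSystem.WriteMode.OVERWRITE);
            streams.put(bucket, out);
        }
        out.write((record + "\n").getBytes(StandardCharsets.UTF_8));
    }

    @Override
    public void close() throws IOException {
        for (FSDataOutputStream out : streams.values()) {
            out.close();
        }
    }
}
```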
I hope Flink will support this in batch mode soon.
Ideally, I could even use SQL statements for CRUD; in that case, I could easily show some data in the UI based on specific change conditions. Thanks in advance.
After a few days of research, I found a method. On Linux, we can use MongoDB plus a memory-mapped filesystem such as tmpfs. A guide is here: English original and Chinese translation.
On Windows, however, this is a bit more troublesome and needs the help of some extra tooling. If that is too much, it is honestly simpler to write your own file helper to manage your data (that goes for me too).
I want to store a set of data (like drop tables for a game) that can be edited and "forked" the way a coding project can (like an open-source project, just data, so that if I stop updating it, someone can continue with it). I also want that data to be easy to use in code (for example, the same way you can query a database in code to get your values) for people who make companion apps for said game.
What type of data storage would be the best for this scenario?
EDIT: By type of data storage I mean something like XML or JSON, or a database like Access or SQL, as well as NoSQL.
It is a very general question, but I get the feeling that you're looking for something like GitHub. If you don't know what that is, you should probably look into it. GitHub hosts git repositories (and even supports svn clients), lets you edit your data quite easily, and lets you look back at previous versions. Hope this helps!
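For example, if the data lived as JSON files in such a repository, a companion app could load it with an off-the-shelf parser like Gson. The file name and schema here are invented, just to show the shape of the idea:

```java
import com.google.gson.Gson;

import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class DropTableLoader {

    // Hypothetical schema for one drop entry.
    static class Drop {
        String item;
        double chance;
    }

    // Hypothetical schema for a whole drop table.
    static class DropTable {
        String monster;
        List<Drop> drops;
    }

    public static void main(String[] args) throws Exception {
        // drop_tables.json is a made-up file name living in the shared repo.
        try (Reader reader = Files.newBufferedReader(Paths.get("drop_tables.json"))) {
            DropTable table = new Gson().fromJson(reader, DropTable.class);
            System.out.println(table.monster + " has " + table.drops.size() + " drops");
        }
    }
}
```

A plain-text format like JSON also diffs and merges cleanly, which is what makes the git/GitHub forking model work well for data.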