This is pie-in-the-sky brainstorming; I'm not expecting concrete answers, just hoping for some pointers.
I'm imagining a workflow where we trigger a savepoint and inspect the savepoint files to look at the state for specific operators -- as a debugging aid, perhaps, or as a simpler(?) way of achieving what we might do with queryable state...
Assuming that could work, how about the possibility of modifying / fixing the data in the savepoint to be used when restarting the same or a modified version of the job?
Or perhaps generating a savepoint more or less from scratch to define the initial state for a new job? Sort of in lieu of feeding data in to backfill state?
Do such facilities exist already? My guess is no, based on what I've been able to find so far. How would I go about accomplishing something like that? My high-level idea so far goes something like:
savepoint -->
SavepointV2Serializer.deserialize -->
write to JSON -->
manually inspect / edit the files, or use other tooling that works with JSON to inspect / modify -->
SavepointV2Serializer.serialize -->
new savepoint
I haven't actually written any code yet, so I really don't know how feasible that is. Thoughts?
You want to use the State Processor API, which is coming soon as part of Flink 1.9. This will make it possible to read, write, and modify savepoints using Flink's batch DataSet API.
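To make that concrete, here is a minimal sketch against the Flink 1.9 State Processor API that reads keyed state out of a savepoint as a DataSet. The savepoint path, the operator uid "my-operator", and the "count" state name are placeholders for illustration, not something the API or your job defines:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.state.memory.MemoryStateBackend;
import org.apache.flink.state.api.ExistingSavepoint;
import org.apache.flink.state.api.Savepoint;
import org.apache.flink.state.api.functions.KeyedStateReaderFunction;
import org.apache.flink.util.Collector;

public class InspectSavepoint {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment bEnv = ExecutionEnvironment.getExecutionEnvironment();

        // Load an existing savepoint; the path is a placeholder.
        ExistingSavepoint savepoint = Savepoint.load(
                bEnv, "hdfs:///savepoints/savepoint-abc123", new MemoryStateBackend());

        // Read the keyed state registered under the (assumed) operator uid "my-operator".
        DataSet<String> state = savepoint.readKeyedState("my-operator", new CountReader());
        state.print();
    }

    // Reads an (assumed) ValueState<Long> named "count" for every key.
    static class CountReader extends KeyedStateReaderFunction<String, String> {
        ValueState<Long> count;

        @Override
        public void open(Configuration parameters) {
            count = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("count", Long.class));
        }

        @Override
        public void readKey(String key, Context ctx, Collector<String> out) throws Exception {
            out.collect(key + " -> " + count.value());
        }
    }
}

The write/modify direction works similarly: you bootstrap a DataSet into operator or keyed state and write the result out as a new savepoint (see the State Processor API documentation for the exact calls).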
Is there any way to preserve the load history when re-creating pipes (using CREATE OR REPLACE)?
We do a lot of automated CI/CD on Snowflake, and sometimes pipes need to get re-created. When this happens, the load history is lost. Right now, the accepted workaround is a manual process, which doesn't work very well in an automated workflow.
This makes refreshing pipes dangerous, as duplicate data could be loaded. There is also a danger of losing some notifications/files while the pipe is being re-created -- with or without the manual workaround, automated or not -- which is unacceptable, for obvious reasons.
I wish there was a simple parameter to enable this. Something like:
CREATE OR REPLACE PIPE my_pipe
PRESERVE_HISTORY = [ TRUE | FALSE ]
AS <copy_statement>
An alternative to this would be an option/parameter for pipes to share the load history with the table instead. This way, when the pipe is re-created (but the table isn't), the load history is preserved. If the table is dropped/truncated, then the load history for both the table and the pipe would be lost.
Another option would be the ability to modify pipes using an ALTER command instead, but currently this is very limited. This way, we wouldn't even need to re-create the pipe in the first place.
EDIT: I tried automating the manual process with a procedure, but there's still a chance of losing notifications.
Creating a pipe creates a new object with its own history; I don't see how preserving that history across a re-create would be feasible.
Why do you need to re-create the pipes?
Your other option is to manage the source files: after content is ingested by a pipe, remove the files that were ingested. The new pipe won't even see the already-ingested files. This, of course, can be automated too.
Since preserving the load history doesn't seem possible currently, I explored a few alternatives:
tl;dr: Here is the solution.
Deleting/Removing/Moving the files after ingestion
Thank you #patrick_at_snowflake for the recommendation!
This turned out to be a bit tricky to do with high reliability, because there's no simple way to tie the ingestion of files in Snowflake to their deletion/removal in cloud storage (i.e. lifecycle management policies are not aware of whether or not the files were ingested successfully by Snowpipe).
It could be possible to monitor the ingestion using a stream or COPY_HISTORY as a trigger for the deletion/removal of the files, but this is not simple (would probably require the use of an external function).
Refreshing a subset of the pipe
Thanks #GregPavlik for the suggestion!
The idea here would be to save the timestamp at which the initial pipe is paused/dropped. This timestamp could then be used to refresh the new pipe with a "safe" subset of the staged files (in order to avoid re-ingesting the same files and creating duplicate records).
I think this is a great idea (my favorite so far), but I also had monitoring in mind and wanted to confirm that this would work, so I continued exploring alternatives for a while.
Replaying the missed notifications
I asked a separate question about this here.
The idea would be to simply replay the notifications that were processed by neither the initial pipe nor the new pipe.
However, this doesn't seem possible either.
Monitoring the load of every staged file
Finally, I arrived at this solution.
This is the one I went with, as it not only allows me to refresh missing files, but also to monitor the loading of all staged files as a whole (no matter the source of the failure).
I was already working on monitoring Snowpipe as part of a project, so this solution added another layer of monitoring.
I am trying to dive into the new Stateful Functions approach, and I have already tried creating a savepoint manually (https://ci.apache.org/projects/flink/flink-statefun-docs-release-2.1/deployment-and-operations/state-bootstrap.html#creating-a-savepoint).
It works like a charm, but I can't find a way to do it automatically. For example, I have a couple million keys and I need to write them all to the savepoint.
Is your question about how to replace the env.fromElements call in the example with something that reads from a file or another data source? Flink's DataSet API, which is what's used here, can read from any HadoopInputFormat. See DataSet Connectors for details.
There are easy-to-use shortcuts for common cases. If you just want to read data from a file using a TextInputFormat, that would look like this:
env.readTextFile(path)
and to read from a CSV file using a CsvInputFormat:
env.readCsvFile(path)
See Data Sources for more about working with these shortcuts.
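For example, if each line of a text file held one record to bootstrap, the env.fromElements call from the linked example could be replaced with something along these lines. This is only a rough sketch; the file path and the comma-separated "key,value" format are assumptions:

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class ReadBootstrapKeys {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // One record per line, e.g. "user-id,initial-value"; path and format are placeholders.
        DataSet<Tuple2<String, Long>> bootstrapData = env
                .readTextFile("file:///path/to/keys.txt")
                .map(line -> {
                    String[] parts = line.split(",");
                    return Tuple2.of(parts[0], Long.parseLong(parts[1]));
                })
                .returns(Types.TUPLE(Types.STRING, Types.LONG));

        // This DataSet can then stand in for env.fromElements(...) in the
        // linked Stateful Functions bootstrap example.
        bootstrapData.print();
    }
}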
If I've misinterpreted the question, please clarify your concerns.
I am using the Java post tool for Solr to upload and index a directory of documents. There are several thousand documents. Solr only does a commit at the very end of the process, and sometimes things stop before it completes, so I lose all the work.
Does anyone have a technique to fetch the name of each doc and call post on it, so you get a commit for each document rather than one large commit of all the docs at the end?
From the help page for the post tool:
Other options:
..
-params "<key>=<value>[&<key>=<value>...]" (values must be URL-encoded; these pass through to Solr update request)
This should allow you to use -params "commitWithin=1000" to make sure each document shows up within one second of being added to the index.
Committing after each document is overkill and will hurt performance; in any case, it's quite strange that you have to resubmit everything from the start if something goes wrong. I suggest seriously rethinking the indexing strategy you're using instead of investigating a different way to commit.
That said, if you have no other way to change the commit configuration, I suggest configuring autoCommit in your Solr collection/index or using the commitWithin parameter, as suggested by #MatsLindh. Just check whether the tool you're using allows you to add this parameter.
autoCommit
These settings control how often pending updates will be automatically pushed to the index. An alternative to autoCommit is to use commitWithin, which can be defined when making the update request to Solr (i.e., when pushing documents), or in an update RequestHandler.
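If you ever end up driving the indexing from your own code rather than the post tool, the same commitWithin idea looks roughly like this with SolrJ. This is just a sketch; the Solr URL, collection, and field names are made up for illustration:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class IndexWithCommitWithin {
    public static void main(String[] args) throws Exception {
        // URL and collection name are placeholders.
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycollection").build()) {

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");
            doc.addField("title", "Example document");

            // Ask Solr to make the document visible within 1000 ms of being added,
            // without issuing an explicit commit per document.
            solr.add(doc, 1000);
        }
    }
}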
I want to store a set of data (like drop tables for a game) that can be edited and "forked" the way a coding project can (like an open source project -- just data, so if I stop updating it, someone can continue with it). I also want that data to be easy to use in code (for example, the same way you can query a database in code to get your values) for people who make companion apps for said game.
What type of data storage would be the best for this scenario?
EDIT: By type of data storage I mean something like XML or JSON, or a database like Access or SQL, as well as NoSQL.
It is a very general question, but I get the feeling that you're looking for something like GitHub. If you don't know what that is, then you should probably look into it. GitHub hosts Git repositories (and also supports Subversion clients), lets you edit your code quite easily, and lets you look back at previous versions of it. Hope this helps!
I'm doing a personal, just-for-fun project that uses screen scraping to give me a System Tray notification whenever a row in an HTML table is added, modified or deleted.
Having done this before, I thought: well, let's go with the usual regular-expression approach and be done with it. But being a curious person, I started to wonder whether there's something else out there with a different paradigm that is just as simple to use.
I know about the DOM, XPath, and all the XML-ish approaches. I'm looking for something outside the box -- something that can even be defined as a set of rules, so you could build a plugin system to aggregate various sites.
See Options for HTML Scraping
Here's an idea: assuming your main use case is getting a notification whenever an HTML file changes, why not use a standard diff tool and then loop through the changed lines, applying your rules?
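A rough sketch of that idea, assuming a Unix-like diff is on the PATH and with applyRule standing in for whatever notification logic you would plug in:

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class DiffWatcher {
    public static void main(String[] args) throws Exception {
        // Compare the previously saved copy against the freshly downloaded one;
        // the file names are placeholders.
        Process diff = new ProcessBuilder("diff", "old.html", "new.html").start();

        try (BufferedReader out = new BufferedReader(
                new InputStreamReader(diff.getInputStream()))) {
            String line;
            while ((line = out.readLine()) != null) {
                if (line.startsWith("> ")) {        // line added or changed in new.html
                    applyRule("added", line.substring(2));
                } else if (line.startsWith("< ")) { // line removed or changed from old.html
                    applyRule("removed", line.substring(2));
                }
            }
        }
        diff.waitFor();
    }

    // Stand-in for the rule engine / notification hook.
    static void applyRule(String kind, String htmlLine) {
        System.out.println(kind + ": " + htmlLine);
    }
}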
Also, if this is a situation where you have access to the server and the files you're watching, you might be able to put everything under source control with CVS (or similar) and just watch for commits. If you want to use this approach for random sites on the web, just write a script that periodically downloads the HTML for the appropriate URLs, commits it to source control, and watches the diffs.
Not very practical, but outside the box.
If you can convert the source into valid XHTML/XML using something like SgmlReader or HtmlTidy, then you could use XSLT. Simply create an XSL template for each site you wish to scrape.
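A minimal sketch of that in Java, assuming the page has already been tidied into well-formed XHTML and that site-rules.xsl is your per-site template (both file names are placeholders):

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.File;

public class ScrapeWithXslt {
    public static void main(String[] args) throws Exception {
        // One XSL template per site you want to scrape.
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new File("site-rules.xsl")));

        // Input must already be well-formed XHTML/XML (e.g. run through HtmlTidy first).
        transformer.transform(
                new StreamSource(new File("page.xhtml")),
                new StreamResult(new File("extracted.xml")));
    }
}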