Preserving the load history when re-creating pipes in Snowflake

Is there any way to preserve the load history when re-creating pipes (using CREATE OR REPLACE)?
We do a lot of automated CI/CD on Snowflake, and sometimes pipes need to get re-created. When this happens, the load history is lost. Right now, the accepted workaround is a manual process, which doesn't work very well in an automated workflow.
This makes refreshing pipes dangerous, as duplicate data could be loaded. There is also a danger of losing some notifications/files while the pipe is being re-created -- with or without the manual process, automated or not (which is unacceptable, for obvious reasons).
I wish there was a simple parameter to enable this. Something like:
CREATE OR REPLACE PIPE my_pipe
PRESERVE_HISTORY = [ TRUE | FALSE ]
AS <copy_statement>
An alternative to this would be an option/parameter for pipes to share the load history with the table instead. This way, when the pipe is re-created (but the table isn't), the load history is preserved. If the table is dropped/truncated, then the load history for both the table and the pipe would be lost.
Another option would be the ability to modify pipes using an ALTER command instead, but currently this is very limited. This way, we wouldn't even need to re-create the pipe in the first place.
EDIT: I tried automating the manual process with a procedure, but there's still a chance of losing notifications.

Creating a pipe creates a new object with its own history, so I don't see how this would be feasible.
Why do you need to re-create the pipes?
Your other option is to manage the source files: after content is ingested by a pipe, remove the files that were ingested. The new pipe won't even know about those files. This, of course, can be automated too.

Since preserving the load history doesn't seem possible currently, I explored a few alternatives:
tl;dr: Here is the solution.
Deleting/Removing/Moving the files after ingestion
Thank you #patrick_at_snowflake for the recommendation! 🙏
This turned out to be a bit tricky to do with high reliability, because there's no simple way to tie the ingestion of files in Snowflake to their deletion/removal in cloud storage (i.e. life cycle management policies are not aware of whether or not the files were ingested successfully by Snowpipe).
It could be possible to monitor the ingestion using a stream or COPY_HISTORY as a trigger for the deletion/removal of the files, but this is not simple (would probably require the use of an external function).
Refreshing a subset of the pipe
Thanks #GregPavlik for the suggestion! 🙏
The idea here would be to save the timestamp at which the initial pipe is paused/dropped. This timestamp could then be used to refresh the new pipe with a "safe" subset of the staged files (in order to avoid re-ingesting the same files and creating duplicate records).
I think this is a great idea (my favorite so far), but I also had monitoring in mind and wanted to confirm that this would work, so I continued exploring alternatives for a while.
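As a rough sketch of that idea, the Python helper below only builds the SQL statements; it does not connect to Snowflake. The function name and the way the cut-off timestamp is recorded are assumptions made for illustration. `PIPE_EXECUTION_PAUSED` and `REFRESH MODIFIED_AFTER` are existing `ALTER PIPE` options, but whether a last-modified cut-off is "safe" depends on how files land in your stage, and files still queued in the old pipe at pause time are not covered here.

```python
from datetime import datetime, timezone


def recreate_pipe_statements(pipe_name: str, copy_statement: str,
                             paused_at: datetime) -> list[str]:
    """Build the SQL to re-create a pipe and refresh only newer files.

    `paused_at` is the (timezone-aware) timestamp saved when the old
    pipe was paused; how it is stored is left to the caller.
    """
    # ISO 8601 with an explicit UTC offset, e.g. 2024-01-02T03:04:05+00:00
    cutoff = paused_at.astimezone(timezone.utc).isoformat(timespec="seconds")
    return [
        # Stop the old pipe from processing further files.
        f"ALTER PIPE {pipe_name} SET PIPE_EXECUTION_PAUSED = TRUE",
        # Re-create the pipe; its load history starts empty.
        f"CREATE OR REPLACE PIPE {pipe_name} AS {copy_statement}",
        # Queue only staged files modified after the saved cut-off.
        f"ALTER PIPE {pipe_name} REFRESH MODIFIED_AFTER = '{cutoff}'",
    ]
```

The statements would then be executed in order by whatever CI/CD tooling already runs your Snowflake deployments.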
Replaying the missed notifications
I asked a separate question about this here.
The idea would be to simply replay the notifications that were processed by neither the initial pipe nor the new pipe.
However this doesn't seem possible either.
Monitoring the load of every staged file
Finally, I arrived at this solution.
This is the one I went with, as it allows not only refreshing missing files but also monitoring the loading of all staged files as a whole (no matter the source of the failure).
I was already working on monitoring Snowpipe as part of a project, so this solution added another layer of monitoring. 👍
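The linked solution is not reproduced here. As a hedged illustration of the general idea only, the sketch below compares the files present in a stage with the files that have a load record and reports the difference. It assumes you have already fetched both lists yourself (for example from `LIST @stage` and the `COPY_HISTORY` table function); the function name and the prefix handling are hypothetical.

```python
def find_unloaded_files(staged_paths: list[str], loaded_names: list[str],
                        stage_prefix: str) -> list[str]:
    """Return staged files that have no matching load record.

    `staged_paths` are full paths as listed from the stage,
    `loaded_names` are file names relative to the stage location,
    `stage_prefix` is the part of each staged path to strip.
    """
    staged = {p.removeprefix(stage_prefix).lstrip("/") for p in staged_paths}
    loaded = {n.lstrip("/") for n in loaded_names}
    # Anything staged but never loaded is a candidate for a refresh.
    return sorted(staged - loaded)
```

The resulting list can feed both an alert and a targeted refresh of the missing files.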

Related

How to deal with issues when storing uploaded files in the file system for a web app?

I am building a web application where the users can create reports and then upload some images for the created reports. Those images will be rendered in the browser when the user clicks a button on the report page. The images are confidential and only authorized users will be able to access them.
I am aware of the pros and cons of storing images in database, in filesystem or a service like amazon S3. For my application, I am inclined to keep the images in the filesystem and paths of the images in the database. That means I have to deal with the problems arising around distributed transaction management. I need some advice on how to deal with these problems.
1- I believe one of the proper solutions is to use technologies like JTA and XADisk. I am not very knowledgeable about these technologies, but I believe two-phase commit is how atomicity is achieved. I am using MySQL as the database, and it seems that two-phase commit is supported by MySQL. The problem with this approach is that XADisk does not seem to be an active project, there is not much documentation about it, and I am not very knowledgeable about the ins and outs of this approach. I am not sure if I should invest in it.
2- I believe I can get away with some of the problems arising from the violation of ACID properties for my application. While uploading images, I can first write the files to disk; if this operation succeeds, I can update the paths in the database. If the database transaction fails, I can delete the files from the disk. I know that is still not bulletproof: a power outage might occur just after the db transaction, or the disk might not be responsive for a while, etc. I know there are also concurrency issues; for instance, if one user tries to modify the uploaded image and another tries to delete it at the same time, there will be some problems. Still, the chances of concurrent updates in my application will be relatively low.
I believe I can live with orphan files on the disk or orphan image paths in the db if such exceptional cases occur. If a file path exists in the db and not in the file system, I can show a notification to the user on the report page and they might try to re-upload the image. Orphan files in the file system would not be too much of a problem; I might run a process to detect such files from time to time. Still, I am not very comfortable with this approach.
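A minimal sketch of option 2, written in Python with `sqlite3` standing in for MySQL so that it runs on its own. The table name `report_images` and the function name are hypothetical. It writes the file first, records the path in a transaction, and removes the file again if the database step fails; as noted above, a crash between the two steps can still leave an orphan file.

```python
import os
import sqlite3


def save_image(conn: sqlite3.Connection, report_id: int,
               path: str, data: bytes) -> None:
    """Write the file first, then record its path; undo the write on DB failure."""
    with open(path, "wb") as fh:
        fh.write(data)
        fh.flush()
        os.fsync(fh.fileno())  # ask the OS to push the bytes to disk
    try:
        with conn:  # commits on success, rolls back on an exception
            conn.execute(
                "INSERT INTO report_images (report_id, path) VALUES (?, ?)",
                (report_id, path),
            )
    except sqlite3.Error:
        os.remove(path)  # compensate so no orphan file is left behind
        raise
```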
3- The last option might be to not store file paths in the db at all. I can structure the filesystem such that I can infer the file path in code and load all images at once. For instance, I can create a folder with the name of report id for each report. When a request has been made to load images of the report, I can load the images at once since I know the report id. That might end up with huge number of folders in the filesystem and I am not sure if such a design is acceptable. Concurrency issues will still exist in this scheme.
I would appreciate some advice on which approach I should follow.

I believe you are trying to be ultra-correct, and maybe not that much is needed, but I faced a similar situation some time ago and explored different possibilities. I disliked options along the lines of your option 1, but for options 2 and 3 I had different successful approaches.
Let's sum up first the list of concerns:
You want the file to be saved
You want the file path to be linked to the corresponding entity (i.e the report)
You don't want a file path to be linked to a file that doesn't exist
You don't want files in the filesystem not linked to any report
And the different approaches:
1. Using DB
You can ensure transactions in the DB with pretty much any relational database, and with S3 you get read-after-write consistency for newly uploaded objects: if you PUT an object and get a 200 OK, it will be readable. Now, how to put all this together? You need to keep track of the process. I can think of 2 ways:
1.1 With a progress table
1. The upload request is saved to a table with anything needed to identify this file: report id, temp uploaded file path, destination path, and a status column.
2. You save the file.
3. If the file save fails, you can update the record in the table, or delete it.
4. If saving the file is successful, in a transaction:
   - update the progress table with successful status
   - update the table where you actually save the relationship report-image
5. Have a cron job that checks the progress table rather than the filesystem. If there is any orphan file in the filesystem, it was definitely added to the table (that was point 1). Here you can decide whether to delete the file or, if you have enough info, continue the aborted process by triggering point 4.
A variant is to use the same report-image relationship table with some extra status columns.
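The progress-table flow could be sketched roughly as follows. This is an illustration under assumptions, not the original implementation: it uses Python's `sqlite3` so it is self-contained, and the table names, column names and function names are all hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE upload_progress (
    id INTEGER PRIMARY KEY,
    report_id INTEGER NOT NULL,
    dest_path TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending'
);
CREATE TABLE report_images (
    report_id INTEGER NOT NULL,
    path TEXT NOT NULL
);
"""


def register_upload(conn, report_id, dest_path):
    """Record the upload request before touching the filesystem."""
    with conn:
        cur = conn.execute(
            "INSERT INTO upload_progress (report_id, dest_path) VALUES (?, ?)",
            (report_id, dest_path),
        )
    return cur.lastrowid


def complete_upload(conn, upload_id):
    """After the file is saved: mark success and link the image atomically."""
    with conn:  # both statements commit together or not at all
        conn.execute(
            "UPDATE upload_progress SET status = 'done' WHERE id = ?",
            (upload_id,),
        )
        conn.execute(
            "INSERT INTO report_images (report_id, path) "
            "SELECT report_id, dest_path FROM upload_progress WHERE id = ?",
            (upload_id,),
        )


def pending_uploads(conn):
    """What the cron job inspects to resume or clean up unfinished uploads."""
    return conn.execute(
        "SELECT id, dest_path FROM upload_progress WHERE status = 'pending'"
    ).fetchall()
```

The cron job only needs `pending_uploads`: each row tells it which path to check on disk and whether to finish the upload or delete the leftover file.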
1.2 With a queue system
Like RabbitMQ, SQS, AMQ, etc
A very similar approach could be taken with any queue system instead of a db table. I won't give many details because it depends more on your real infrastructure, but just the general idea.
The upload request goes to a queue, you send a message with anything you may need to identify this file, report id, and if you want a tentative final path.
You upload the file
A worker reads pending messages in the queue and does the work. The message is marked as consumed only when everything goes well.
If something fails, naturally the message will come back to the queue
In the next time a message is read, the worker can have enough info to see if there is work to resume, or even a file to delete if resuming is not possible
In both cases, concurrency problems won't be straightforward to manage, but they can be managed (relying on DB locks in the first case, and FIFO queues in the second), always with some application logic.
2. Without DB
To some extent a system without a database would be perfectly acceptable, if we can defend it as a proper convention over configuration design.
You have to deal with 3 things:
Save files
Read files
Make sure that the structure of the filesystem is manageable
Let's start with 3:
Folder structure
In general, something like one folder per report id will be too simple, maybe hard to maintain, and ultimately too flat. This will cause issues: if we have an images folder with one folder per report, and tomorrow you have, let's say, 200k reports, the images folder will have 200k elements, and even an ls will take too much time. The same goes for any programming language trying to access it. That will kill you.
You can think about something more sophisticated. Personally, I like an approach that I learnt from Magento 1 more than 10 years ago and have used a lot since then: a folder structure that starts from fixed outer rules and is then extended with rules derived from the file name itself.
We want to save a product image. The image name is: myproduct.jpg
The first rule is: for product images I use /media/catalog/product
Then, to avoid having many images in the same folder, I create one folder for every letter of the image name, up to some number of letters. Let's say 3. So my final path will be something like /media/catalog/product/m/y/p/myproduct.jpg
Like this, it is clear where to save any new image. You can do something similar using your report ids, categories, or anything that makes sense for you. The final objective is to avoid too flat a structure, and to create a tree that makes sense to you and that can also be automated easily.
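The letter-per-folder rule can be expressed in a few lines. This Python sketch is an illustration, not Magento's code; the function name, the lower-casing and the skipping of non-alphanumeric characters are assumptions added so that the generated folder names stay predictable.

```python
import posixpath


def sharded_path(base_dir: str, file_name: str, depth: int = 3) -> str:
    """Place a file under one sub-folder per leading character of its name."""
    stem = posixpath.splitext(file_name)[0]
    # Keep only letters/digits so every shard is a valid folder name.
    shards = [ch for ch in stem.lower() if ch.isalnum()][:depth]
    return posixpath.join(base_dir, *shards, file_name)


print(sharded_path("/media/catalog/product", "myproduct.jpg"))
# prints /media/catalog/product/m/y/p/myproduct.jpg
```

The same function works with a report id instead of a file name, as long as the ids are spread evenly over their leading characters.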
And that takes us to the next part:
Read and write.
I implemented a similar system before quite successfully. It allowed me to save files easy, and to retrieve them easily, with locations that were purely dynamic. The parts here were:
S3 (but you can do with any filesystem)
A small microservice acting as a proxy for both read and write.
Some namespace system and attached logic.
The logic is quite simple. The namespace lets me know where the file will be saved. For example, the namespace can be companyname/reports/images.
Let's say I develop a microservice for read and write:
For saving a file, it receives:
namespace
entity id (i.e. your report)
file to upload
And it will do:
based on the rules I have for that namespace, plus the id and file name, it will save the file in the resulting folder
it doesn't return the physical location. That remains unknown to the client.
Then, for reading, clients will use a URL that uses also convention. For example you can have something like
https://myservice.com/{NAMESPACE}/{entity_id}
And based on the logic, the microservice will know where to find that in the storage and return the image.
If you have more than one image per report, you can do different things, such as:
- you may want to have a third slug in the path such as https://myservice.com/{NAMESPACE}/{entity_id}/1 https://myservice.com/{NAMESPACE}/{entity_id}/2 etc...
- if it is for your internal application usage, you can have one endpoint that returns the list of all eligible images, lets say https://myservice.com/{NAMESPACE}/{entity_id} returns an array with all image urls
I implemented this with a quite simple yml config to define the logic, and very simple code reading that config. That gave me a lot of flexibility: for example, saving reports in totally different paths, servers or S3 buckets if they belong to different companies or are different report types.

Need advice with multiple people working on an Access Database

This may seem like an ignorant issue, but I am inexperienced with Access. This is for a school project. I am in a group of 5 people who are all working on a database. We were wondering: what is the most efficient way for multiple people to work on a database without sending the database file to and fro or just putting a copy on a file sharing service? Is there a way we can all log in and modify the same database?

There is a way to do this: put the database file in a shared location where everyone can access it. You then need to change the database locking options so that multiple users can make changes to it.
In Access 2013 go to File->Options->Client Settings->Advanced and change the Open Mode to Shared, and also change Record locking to Edited Records. This will enable users to make changes, but not to the same record at the same time.
If you want users to be able to change the same record, select No Locks, but this is not a good idea: if two users change a record at the same time, only the last change will be saved.

As for file sharing service, well, this expert and this one too say it doesn't work.

Separate start directory for each client

I have a select-based server system, where I can manage multiple clients. The server automatically reads and responds to the client, which is great. But there's a minor issue. For instance, when user#1 changes directory (coded with chdir), all of the other users are affected by this change. I would really like to prevent that from happening.

There are two ways to solve this:
Fork off a separate process to handle each connection. This process can have its own state, including current working directory. The disadvantages are that you'll need to refactor your code quite a lot, and if you have a lot of concurrent connections then it can be a performance problem. This is harder on Windows than *nix, but not impossible.
Keep the current directory as a per-connection setting within your program, and (re)set the directory before executing every user command.
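The second approach can be sketched as below. The original server presumably calls `chdir` from C; this Python version is only an illustration of the same idea, and the class and function names are hypothetical. Each connection keeps its own directory, and the process-wide directory is set from it right before a command runs, which is safe in a single-threaded select loop because only one command executes at a time.

```python
import os


class ClientSession:
    """Per-connection state, including that client's current directory."""

    def __init__(self, start_dir: str) -> None:
        self.cwd = os.path.abspath(start_dir)

    def change_dir(self, target: str) -> None:
        """Handle a client's 'cd' by updating only this session."""
        new_dir = os.path.normpath(os.path.join(self.cwd, target))
        if not os.path.isdir(new_dir):
            raise NotADirectoryError(new_dir)
        self.cwd = new_dir


def run_for_client(session: ClientSession, command):
    """(Re)set the process directory to this client's, then run the command."""
    os.chdir(session.cwd)
    return command()
```

In the select loop, look up the `ClientSession` for the socket that became readable and route every command through `run_for_client`.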

Database - Version Control - Managing dropped/deleted objects

We want to clean up our database schema and drop/delete objects which are no longer being used.
We suspect that sometime in the future we'll want to resurrect the removed functionality.
We've discussed the following options for dealing with dropped objects in version control:
Deleting the .sql files from source control once they are gone from the database and relying on the version history to store the definitions. Our concern with this approach is that sometime over the years source control will be moved and we will lose the history. It also seems difficult to know what to look for to recover if we can't see all the dropped objects.
Leaving the .sql files in source control but updating the definitions to "drop proc {someproc}". With this approach we are concerned about leaving objects in version control which no longer exist, and also about the risk of losing the history if the VCS were moved.
Creating a new repo for dropped objects and migrating .sql files to this repo once they have been dropped from SQL Server.
We're working in a windows environment and are fairly new to working with VCS for databases. Currently GIT + SSDT.
Currently option 3 is our preferred approach.

I see this a lot with database code, what happens is over time people end up with stuff in the database that is either not used or just does not work (think a proc that references a table and the table is modified but not the proc).
The thing to do is to get everything in source control (which it looks like you have) and then create a tag or branch of all the code before and after deleting it so you can get it back.
Two things normally transpire, either the code was genuinely never used or it was used at year end and when you find out, the world is about to fall on your head so better have a quick way to get it back.
Of course if you had a full suite of tests then even the year end process would be safe :)
I personally wouldn't use option 3, I would just keep the history in the main branch so you keep the history with it.
ed

There are a lot of good tools for versioning database changes. This question has a good chance of being closed as "Too broad", but I'll try to suggest that you:
Read about, understand and try adding Liquibase to your development toolbox
Adapt your workflow to this additional layer: technically it will be one more file (a changelog, in Liquibase terms) in each changeset where you change DDL and/or data.
These changelogs provide a good, smooth way of moving back and forth in a linear history of database changes. They are not so good (or I don't know The Right Way) for jumping directly between nodes of a diverged history, but that doesn't seem to be your case.
Of your listed options, this is closest to option 1 (but it stores changes to the database in version control, not states).

Just to note another option, in SSDT you can mark the file property as Build Action = None. The file won't be included in the dacpac when this build option is selected. But I tend to agree with the idea that you should rely on your VCS to handle history.

Creating the Front End MDE

I created a database for tracking metrics, with some automation tricks (email, .doc, .ppt presentations, etc.), a very large main table, and lots of forms/GUI. This is the first time I have ever worried about an MDE/front-end for the thing. So if you would be so kind as to answer a few questions, or offer any advice, it would be greatly appreciated (I would hate for all this work to not be utilized).
What is the first thing I need to do? It is the 2000 version, which must be converted to 03 to create the MDE, but does that get done before I use the database splitter?
Will the number of objects in the database affect the ability to do this? I have something like 80 forms, 70 queries, 20+ macros, 12 tables, etc., but does the number of objects prevent some of this from working well once the front end is there?
When I split the database, can I continue to work/make changes and such on the "back end", and have those changes directly affect the front end?
These may be some basic questions, but I don't know the answer so.....Thanks!

Here is my 2 ¢.
Question 1 - I have never used the database splitter as I feel I have more control doing it manually. If you do it manually you can do it to a version that does not have a database splitter. But if you do use the splitter then--yes--you will have to upgrade to a version that has a splitter before doing it.
To do it manually here are the steps.
Backup everything.
Create a copy of your file in the same directory. So if you have a MyApp.mdb, create a copy in the same directory with a new name, such as MyAppDATA.mdb.
Open the new DATA file (MyAppDATA.mdb) and delete all of the objects EXCEPT the TABLES.
Open the App file (MyApp.mdb) and delete all of the tables.
Also in MyApp.mdb...go to the File/Get External Data/Link Tables menu to link the tables in MyAppDATA.mdb to MyApp.mdb. Select All and create the links.
That should do it. And if you screw up you made a backup...right?
A couple of tips and gotchas...be sure that you go to Tools/Options and that you are NOT showing System and Hidden tables. You just don't want to delete system tables from MyApp. Another way to do it is do NOT delete tables that start with MSys or USys.
Question 2 - It does not matter how many objects you have. In fact, you don't have that many objects anyway.
Question 3 - Yes...you will make backend changes in MyAppData.mdb and when you open MyApp.mdb those changes will auto-magically be there to see and query against etc. (In the query designer you may need to save/close/reopen to see new fields if you made the mod while in the query). The EXCEPTION to that is New Tables You will have to use the File/Get External Data/Link Tables option to create links to new tables.
One thing to remember (and that I hope you already realize) is that the one downside of splitting the database is that when you deploy the front end file that usually the relative path to the data will vary from machine to machine and there is no automatic re-linking of tables in access. If your target clients have full access you can always use Tools/Database Utilities/Linked Table Manager to refresh the links to the right location. If you can't do that then you will have to do one of the following:
1. Write code that does the automatic re-linking for you. Basically it will check the links...if invalid it will prompt the user for the data location (or look it up in an INI file) and re-link the tables.
2. Always deploy your app to the same location on all machines. If you have commercial visions for your application this won't work...I mention it for academic reasons. It might be doable for a limited deployment where you have a lot of control over file placement on each machine.
3. Put the Data file (MyAppDATA.mdb) onto a network share and link the table across the network using a drive mapping or UNC (\\myserver\mydata\ApplicationData\MyAppData.mdb). The latter is preferred but both of them run the same risks as number two.
Seth
PS This answer assumes Access 2003.
PPS If you have commercial visions for your application then the table linking has got to be REALLY robust.
PPPS I agree with the commenter that you may want to take the plunge and do SQL if it is in your skill set.

One thing that hasn't been discussed, and that's the issue of whether the compile to MDE could fail. Basically, if your code compiles in your front-end MDB, it will convert to an MDE. But I've noticed that lots of people never compile.
Some hints for keeping your VBA code in good shape:
in VBE options, turn off COMPILE ON DEMAND.
add the COMPILE button to your standard VBE toolbar and USE IT OFTEN.
periodically, backup your MDB and decompile/recompile it.
Also, remember that you must keep the MDB source, as the VBA code is not editable in an MDE and not recoverable by any good method.
EDIT:
Steps for a decompile:
backup your MDB.
start an instance of Access with the /decompile commandline argument. For instance, I have a shortcut on my desktop that has this as the target:
"C:\Program Files\Microsoft Office\OFFICE11\MSACCESS.EXE" /decompile
having opened that instance of Access, open the MDB you want to decompile. You will see nothing happen. DO NOTHING FURTHER IN THIS INSTANCE OF ACCESS -- close this instance of Access (the reason for this is that Michael Kaplan, who knows a thing or two about this, recommended that you never do any work in an Access instance opened with the decompile switch because he said there was no guarantee that the Access application code executed under those circumstances in a way that was fully safe for all kinds of Access work).
open the just-decompiled MDB holding down the shift key (you want to be sure that startup routines don't run because that would likely recompile the product before you've finished your cleanup) and compact the MDB (holding down the shift key again).
open the code editor and compile the project (DEBUG -> COMPILE [db name] for those who haven't done step #2 in my original compiling instructions at the top of the post before the edit).
compact the MDB (doesn't matter if you bypass startup, since it's already fully compiled).
Why so many steps?
Because the purpose of the decompile is to get rid of the compiled p-code in order to start afresh from the canonical VBA code. Following the steps above ensures that you have completely cleared the data pages storing the compiled code before you recompile. The reason for this is that without the compact step after the decompile, under some very rare circumstances, the code can behave strangely. I can't imagine that the old discarded p-code is being used again, but there's something about the pointers between the canonical code and the compiled code that apparently doesn't get completely flushed by a decompile without a compact.

This would be a comment to Seth's answer, but my rep isn't high enough to comment yet.
Seth did a great job answering your questions, I just wanted to add a bit more to part #1 about using the Database Splitter. The Database Splitter in the Tools menu works fine. Doing it manually is alright too, but it's a whole lot faster and easier to use the Database Splitter. I've used it a dozen times and never encountered any issues after using it.
http://www.databasedev.co.uk/split_a_database.html has a decent page about some of the pros, cons of splitting your database.
http://www.accessmvp.com/TWickerath/articles/multiuser.htm also has some good info when dealing with a split database in a multi-user environment.

Seth gave you a very good answer. But I'll add a few comments.
The number of objects only becomes relevant when you get close to about 1000 forms, reports and modules which have code. There's a limit about there. If you do get that message when trying to make an MDE then you almost certainly have a code error and need to compile to find the error
Another resource is "Splitting your app into a front end and back end Tips"
See the Auto FE Updater downloads page to make the process of distributing new FEs relatively painless. The utility also supports Terminal Server/Citrix quite nicely.