Snowflake selective removal of load metadata

I have a large table in Snowflake (terabytes per month) which gets loaded from an external S3 stage. I loaded data into it with a wrong COPY command for a month. I know the pattern of the S3 objects that got loaded, and I store the object names in one of the table's columns.
Is there any way for me to selectively remove the already-loaded metadata for this table, so that the table forgets these specific files were already loaded?
The idea is that I could then delete those records using the S3 object names, fix my COPY command, and load those S3 objects back in.
The other option I have is to load these S3 objects into another table and perform an UPSERT.
But I am just checking whether there is any option for selective removal of load metadata on a Snowflake table.
Any answer is welcome. Thanks in advance.

I don't think you can alter the metadata (which really means I don't feel you can, but I also have not looked too hard, as I have other options).
What you can do is run
COPY INTO
[ FILES = ( '<file_name>' [ , '<file_name>' ] [ , ... ] ) ]
listing the set of files explicitly, and use FORCE = TRUE to make it ignore the load-metadata list.
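For example, something along these lines (a minimal sketch only; the table, stage, prefix and column names are placeholders, not taken from the question):
-- delete the rows loaded by the wrong COPY command, assuming the
-- S3 object name is stored in a column called SRC_FILE
DELETE FROM my_table WHERE src_file LIKE 'bad_prefix/%';

-- reload just those files, bypassing the load-metadata check
COPY INTO my_table
FROM @my_s3_stage
FILES = ('bad_prefix/file1.csv', 'bad_prefix/file2.csv')
FORCE = TRUE;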
Which might involve as much hoop jumping as loading into another table and then merging in. It depends on whether you are trying to have it all happen auto-magically as part of your "normal process" and thus do all the other normal things.
But that begs the question: if you have hundreds or thousands of files and are relying on the metadata/file age to prune the list of files down to just the new ones, your COPY INTO will suffer a performance hit from doing this all the time. One way around this is to use high-water marks of dates and step forward into the now, which means you could reset the high-water mark (this is how we handled the task). Our code dynamically loaded blocks in batches, so it would normally be working on the "current hour", but if it was weeks or months behind (because the watermark had been reset) it would load month-sized chunks, so the data was not too much for the warehouse (and because we were load balancing many tables)...
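As a rough sketch of that month-chunk idea (assuming the S3 objects sit under date-based prefixes like yyyy/mm/, which is an assumption, not something stated in the question), the stage listing can be narrowed to one chunk at a time:
-- only list and load one month's prefix, driven by the high-water mark
COPY INTO my_table
FROM @my_s3_stage/2020/05/
PATTERN = '.*[.]csv';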
Anyway, those are some other options/thoughts.

Related

Best way to handle large amount of inserts to Azure SQL database using TypeORM

I have an API created with Azure Functions (TypeScript). These functions receive arrays of JSON data, convert them to TypeORM entities and insert them into an Azure SQL database. I recently ran into an issue where the array had hundreds of items, and I got an error:
The incoming request has too many parameters. The server supports a maximum of 2100 parameters. Reduce the number of parameters and resend the request
I figured that saving all of the data at once using the entity manager causes the issue:
const connection = await createConnection();
await connection.manager.save(ARRAY OF ENTITIES);
What would be the best scalable solution to handle this? I've got a couple of ideas, but I have no idea if they're any good, especially performance-wise.
Begin transaction -> Start saving the entities individually inside forEach loop -> Commit
Split the array into smaller arrays -> Begin transaction -> Save the smaller arrays individually -> Commit
Something else?
Right now the array sizes are in the tens or hundreds, but occasional arrays with 10k+ items are also a possibility.
One way you can massively scale is to let the DB deal with that problem, e.g. use External Tables. The DB does the parsing; your code only orchestrates.
E.g.
Make the data to be inserted available in ADLS (Data Lake):
instead of calling your REST API with all the data (in the body or query params as an array), the caller would write the data to an ADLS location as a csv/json/parquet/... file, OR
the caller remains unchanged, and your Azure Function writes the data to a csv/json/parquet/... file in an ADLS location (instead of writing to the DB).
Make the DB read and load the data from ADLS:
First CREATE EXTERNAL TABLE tmpExtTable ... LOCATION = <ADLS-location>
Then INSERT INTO actualTable SELECT * FROM tmpExtTable
See the formats supported by EXTERNAL FILE FORMAT.
You need not delete and re-create the external table each time. Whenever you run a SELECT on it, the DB will go and parse the data in ADLS. But that's your choice.
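A rough sketch of that pattern (PolyBase-style DDL; the data source, file format, columns and paths below are placeholders, a DATABASE SCOPED CREDENTIAL is omitted, and the exact syntax and support depend on which Azure SQL flavor you are on):
-- external data source pointing at the ADLS location
CREATE EXTERNAL DATA SOURCE AdlsSource
WITH (LOCATION = 'abfss://container@account.dfs.core.windows.net');

-- how the files should be parsed
CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT, FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

-- external table over the folder of incoming files
CREATE EXTERNAL TABLE tmpExtTable (
    Id INT,
    Payload NVARCHAR(4000)
)
WITH (LOCATION = '/incoming/', DATA_SOURCE = AdlsSource, FILE_FORMAT = CsvFormat);

-- load into the real table; the DB does the parsing
INSERT INTO actualTable
SELECT * FROM tmpExtTable;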
I ended up doing this the easy way, as TypeORM already provides the ability to save in chunks. It might not be the most optimal way, but at least I got away from the "too many parameters" error.
// Save all data in chunks of 100 entities
await connection.manager.save(ARRAY OF ENTITIES, { chunk: 100 });

Performance issue, flat file vs database storage

I am working on a project that requires me to develop a system that provides JSON output. The following is the flow:
1a) Some tables will be updated via the administration panel (my company side)
1b) Some related tables will be updated via the administration panel (partner side)
-- Let's say SuperHeroes & Males were updated in 1a), and Studios & Years were updated in 1b)
2) A client browses our site and requests an information set which:
Has an enabled and not-deleted row in SuperHeroes (Ant-Man)
Has an enabled and not-deleted row in Males linked to SuperHeroes (Scott Lang)
Joins the above records, then checks whether they are linked to an existing row in Studios (Marvel)
And linked to an existing row in Years (2015)
3) A very small amount of data will be output as a JSON string like the following: { id:1,type:marvel },{ id:1,type:dc }
All rows in the above 4 tables can be updated/deleted at any time without notification [no foreign keys either].
I am thinking of updating the information in a flat file every time 1a) is performed (since we can update the system on my company's side but not the partner's side, and they refuse to save any extra information into a flat file, so the situation is that we have no easy way to know if the Studios or Years tables have been modified).
Then, when the JSON is requested, I would first load the information from the flat file (all data to be output is stored in this file), then use a simple SQL statement to check whether a linked record exists in Studios & Years.
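For that SQL check I have something like this in mind (just a sketch; the parameter and key column names are made up, since the exact schema is not fixed yet):
SELECT CASE
         WHEN EXISTS (SELECT 1 FROM Studios WHERE Id = @StudioId)
          AND EXISTS (SELECT 1 FROM Years   WHERE Id = @YearId)
         THEN 1 ELSE 0
       END AS IsLinked;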
I have done my research and am getting confused. I concluded that when the amount of data is small, a flat file will be fine, but beware of the file growing larger and larger (the flat file we are talking about will in no way exceed 50 rows at a time, and it should not be modified frequently).
Some answers said a database is good at querying data (I think so too, and the requirement performs a SQL check as well).
So I don't know if it's a good idea when my data amounts are small but I still need some communication with the database.
I appreciate your time and your help; all ideas & hints are welcome, thanks!
Your conclusion regarding the amount of data is absolutely correct, and a file should handle those 50 rows, but...
Using a database as storage should give you more options in the future, e.g.:
you'll be able to produce any output thanks to the separation of data and representation (today it's JSON, but what if you have to produce XML at some point? Will you add files that store XML next to the JSON files? And then CSV or any other format?)
transactions will guarantee ACID
scalability and performance - if your dataset gets bigger (you never know :)), many DBs offer possibilities like table partitioning, partitioning-based clustering or replication
Wrong choices regarding architecture and technology made at the beginning of a project will always backfire.

Do nothing with (no) match output in the lookup transformation (SSIS)

I'm new to SSIS as well as Stackoverflow.
Here's my situation.
I'm building a database and an archive database which need to be synced daily. The records in the database need to be copied to the archive. I use SSIS and daily jobs to do this. Obviously, I don't want SSIS to load all the data every time, only the new records (those that aren't in the archive yet). I want to use the lookup transformation to achieve this. I'm testing it and it works: it only copies the new data to the "no match output" (which is my archive). But I linked the "match output" to a new destination, and as there are many columns and records, it would be far too much to keep all that redundant data (of course I can purge the data, but I don't want to have those extra columns in the first place!). I actually don't want the "match output" to be sent anywhere. How do I do this? Or is there a solution that is more efficient than what I'm doing now (sending the matched outputs to new destinations and deleting those columns or records later on)?
P.S.
I already found this similar question on Stack Overflow (except it's the other way around; that asker wants to do nothing with the "no match output"): Sending no match output rows to nowhere
But the thing is that I don't want to download/use a "trash destination"; I'd rather use only what is already built into SSIS itself. And I don't understand how the derived column transformation could solve the problem.
There are no other answers on that question, so that's why I made a new thread.
Can anyone help me with this? (and excuse my English, it's not my native language)
Just don't map the match output. If that gives an error, map it to a Row Count transformation; that way you can keep track of the amount of duplicate data being handled.
It would be even better to filter this out in the source component though, for performance reasons.
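For example, the source component could select only the rows that are not yet in the archive (a sketch only; the table and key names are placeholders, and it assumes both tables are reachable from the same connection):
SELECT s.*
FROM dbo.SourceTable AS s
WHERE NOT EXISTS (
    SELECT 1
    FROM ArchiveDb.dbo.ArchiveTable AS a
    WHERE a.BusinessKey = s.BusinessKey
);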

Import Access database table design properties but not data

I've got a large number of Access databases that need to have the same table design changes (and a few new tables created) in each of them. Is there any way to take my most recent (properly designed) database, export the design properties, and import them into each of the other databases, overwriting changes and creating any new fields, tables, etc. as needed?
My research has only led me to the Database Documenter, which seems to be helpful only in cases where I'd update the properties manually. I also know I could potentially copy each table over manually, specifying 'Structure Only' in each case, but that would be a rather daunting task and I'm unsure exactly what would be copied using this method.
Let me see if I have the outline...
Open Proper.mdb
For Each OtherMDB In Folder1
    Open OtherMDB
    For Each ProperTable In Proper.mdb
        If ProperTable Is absent from OtherMDB Then
            Add ProperTable to OtherMDB
        Else
            For Each ProperField In ProperTable.Fields
                If ProperField Is absent from OtherTable.Fields Then
                    Add ProperField to OtherTable
                ElseIf ProperField.Type <> OtherTable.Fields("xx").Type Then ' is this a possibility? wanting to change field type?
                    Change Field.Type
                End If
            Next ProperField
        End If
    Next ProperTable
    Close OtherMDB
Next OtherMDB
I found a utility called DBWeigher which is able to analyze and compare two Access databases and automatically generate the VB code necessary to apply the changes between the two. From there I quickly ran through the changes manually and was able to see firsthand what changes would be made prior to running them through the DBConsole.
For anyone trying to update older Access databases (especially when they're at different stages and could have some variances), I can't recommend checking this lightweight utility out enough.

How to attach and view PDF documents in an Access database

I have a very simple database in Access, but for each record I need to attach a scanned document (probably a PDF). What is the best way to do this? The database should not just link to a file on the PC, but should copy and keep the file with it, meaning that if the original file goes missing or the database is moved or copied, the file should still be accessible from within the database. Is this possible? And what is the easiest way of doing it? If needed I can write a macro, I just don't know where to start. Also, when I display a report of the table, I would like to see just thumbnails of the documents.
Thank you.
As the other answerers have noted, storing file data inside a database table can be a questionable practice. That said, I wouldn't personally rule it out, though if you are going to take that option, I'd strongly suggest splitting out the file data into its own table in its own backend file. For example:
Create a new database file called Scanned files.mdb (or Scanned files.accdb).
Add a single table called Scans with fields such as FileID (AutoNumber, primary key), MainTableID (matches whatever is the primary key of the main table in the main database file), FileName (Text), FileExt (Text) and FileData ('OLE Object', really just a BLOB - don't actually use OLE Objects, because they will bloat the database horribly). A DDL sketch of this table follows these steps.
Back in the frontend, add a reference to Scans as a linked table.
Use a bit of VBA to upload and extract files from the Scans table (if you're interested in the mechanics of this, post a separate question).
Use the VBA Shell routine (if you must) or ShellExecute from the Windows API (= the better option IMO) to open extracted data.
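For reference, a minimal sketch of that Scans table as Jet/ACE DDL, runnable via CurrentDb.Execute or a DDL query (the text field sizes are arbitrary assumptions):
CREATE TABLE Scans (
    FileID      COUNTER CONSTRAINT PK_Scans PRIMARY KEY,
    MainTableID LONG,
    FileName    TEXT(255),
    FileExt     TEXT(10),
    FileData    LONGBINARY
);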
If you are using the newer ACCDB format, then you have the 'Attachment' field type available, as smk081 suggests. This basically does most of the above steps for you; however, doing things 'by hand' gives you greater flexibility - for example, it allows giving each file a 'DateScanned' or 'DateEffective' field.
That said, your requirement for thumbnails will require explicit coding whatever option you take. It might be possible to leverage the Windows file-previewing API, though I'd make sure thumbnails are a definite requirement before investigating this - Access VBA is powerful enough to encourage attempts at complex solutions, but frequently not clean and modern enough to allow fulfilling them in a particularly maintainable fashion.
There is an Attachment type under Data Type when you go into Design View of your table. You can add an attachment field here. When you go into the Datasheet view of the table you can select this field for a particular row and a window will open for you to specify the attachment. This will cause your database to quickly grow in size if you add a lot of large attachments.
You can use an OLE field in a table, but I would really suggest you not use this approach. The database is going to be HUGE in no time, and you're going to regret it.
Instead, you should consider adding a field that stores the path to the file, and keep the files in one folder on your network. Then you can use a SHELL() command to open the file. What's the difference between restoring an Access database and restoring PDF files if something goes wrong? This will keep your database at a manageable size and reduce the possibility of corruption.
