Migrating SQL Server NVARCHAR(MAX) to Snowflake

Looking for some past experience from others on this.
Scenario: We have a SQL Server 2016 environment with tables defined with NVARCHAR(MAX) columns, and many records are getting very close to the 2 GB limit allowed by SQL Server.
We would like to migrate this data to Snowflake, but with VARCHAR limited to 16 MB there doesn't seem to be a suitable data type.
The data migration needs to retain the complete data set, so any truncation of data is not acceptable.
Running a transformation to divide the 2 GB of data into 16 MB chunks is not feasible either, as that would generate 100+ columns on the Snowflake end.
Let me know what others have done in this situation.
Regards
Nick

I would take a look at Snowflake's unstructured file support. As others have noted, a 2GB varchar sounds a lot like a file.
https://docs.snowflake.com/en/user-guide/unstructured-intro.html
This method supports files of unlimited size (so you no longer have to worry about the file size approaching the limit) and gives users the ability to download the files as needed by providing a URL (including time-limited URLs).
This can be used for any file type including video, audio, pdfs, etc.
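For what it's worth, here's a minimal sketch of that approach using the snowflake-connector-python package. The connection parameters, stage name (doc_stage) and file path are placeholders, and the internal stage is created with server-side encryption so that pre-signed URLs work on it:

    # Sketch only: upload an oversized value as a staged file instead of a VARCHAR,
    # then hand out a time-limited download URL.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account",      # placeholder credentials
        user="my_user",
        password="my_password",
        warehouse="my_wh",
        database="my_db",
        schema="public",
    )
    cur = conn.cursor()

    # Internal stage with a directory table; SNOWFLAKE_SSE encryption is needed
    # if you want pre-signed URLs on an internal stage.
    cur.execute("""
        CREATE STAGE IF NOT EXISTS doc_stage
          DIRECTORY = (ENABLE = TRUE)
          ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE')
    """)

    # Upload the large text as a file rather than a column value.
    cur.execute("PUT file:///tmp/row_12345.txt @doc_stage AUTO_COMPRESS = FALSE")
    cur.execute("ALTER STAGE doc_stage REFRESH")   # refresh the directory table

    # Time-limited URL (1 hour here) for downloading the file.
    cur.execute("SELECT GET_PRESIGNED_URL(@doc_stage, 'row_12345.txt', 3600)")
    print(cur.fetchone()[0])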

Related

Compressing large text data before storing into db?

I have an application which retrieves many large log files from systems on a LAN.
Currently I put all log files into PostgreSQL; the table has a column of type TEXT, and I don't plan any searches on this text column because another external process retrieves all files nightly and scans them for sensitive patterns.
So the column value could also be a BLOB or a CLOB, but my question is the following:
the database already has its own compression system, but could I improve on that compression manually, e.g. with common compressor utilities? And above all, what if I manually pre-compress the large file and then store it as binary in the table: is that pointless because the database already provides internal compression?
I don't know who would compress the data more efficiently, you or the db; it depends on the algorithm used, etc. But what is sure is that if you compress it, asking the db to compress it again will be a waste of CPU. Once data is compressed, compressing it again yields less gain each time, until you eventually end up consuming more space.
The internal compression used in PostgreSQL is designed to err on the side of speed, particularly for decompression. Thus, if you don't actually need that, you will be able to reach higher compression ratios if you compress it in your application.
Note also that if the database does the compression, the data will travel between the database and the application server in uncompressed format - which may or may not be a problem depending on your network.
As others have mentioned, if you do this, be sure to turn off the built-in compression, or you're wasting cycles.
The question you need to ask yourself is: do you really need more compression than the database provides, and can you spare the CPU cycles for this on your application server? The only way to find out how much more compression you can get on your data is to try it out. Unless there's a substantial gain, don't bother with it.
My guess is that if you do not need any searching or querying ability, you could reduce disk usage by zipping the file and then storing the binary data directly in the database.
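As a rough sketch of the pre-compress-in-the-application idea (assuming psycopg2 and a hypothetical logs table; connection details and names are placeholders), it could look something like this:

    # Compress in the application, store as bytea, and tell PostgreSQL not to
    # compress the column again.
    import zlib
    import psycopg2

    conn = psycopg2.connect("dbname=logs_db user=app")
    cur = conn.cursor()

    cur.execute("""
        CREATE TABLE IF NOT EXISTS logs (
            id       serial PRIMARY KEY,
            filename text NOT NULL,
            payload  bytea NOT NULL
        )
    """)
    # EXTERNAL = still stored out-of-line (TOAST) but without PostgreSQL's own
    # compression, since we already compressed the data ourselves.
    cur.execute("ALTER TABLE logs ALTER COLUMN payload SET STORAGE EXTERNAL")

    with open("/var/log/app/huge.log", "rb") as f:
        raw = f.read()
    compressed = zlib.compress(raw, level=9)   # favour ratio over speed

    cur.execute(
        "INSERT INTO logs (filename, payload) VALUES (%s, %s)",
        ("huge.log", psycopg2.Binary(compressed)),
    )
    conn.commit()

    # Reading it back is just the inverse:
    cur.execute("SELECT payload FROM logs WHERE filename = %s", ("huge.log",))
    original = zlib.decompress(bytes(cur.fetchone()[0]))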

How to store data crawled from a website

I want to crawl a website and store the content on my computer for later analysis. However, my OS's file system has a limit on the number of subdirectories, meaning storing the original folder structure is not going to work.
Suggestions?
Map the URL to some filename so I can store the pages flatly? Or just shove them into a database like SQLite to avoid file system limitations?
It all depends on the effective amount of text and/or web pages you intend to crawl. A generic solution is probably to:
- Use an RDBMS (SQL Server or the like) to store the meta-data associated with the pages. Such info would be stored in a simple table (maybe with a very few support/related tables) containing fields such as URL, file name (where you'll be saving it), offset in the file where the page is stored (the idea is to keep several pages in the same file), date of crawl, size, and a few other fields.
- Use flat file storage for the text proper. The file name and path matter little (i.e. the path may be shallow and the name cryptic/automatically generated); this name/path is stored in the meta-data. Several crawled pages are stored in the same flat file, to limit the OS overhead of managing too many files. The text itself may be compressed (ZIP etc.) on a per-page basis (there's little extra compression gain to be had by compressing bigger chunks), allowing per-page handling within the file (no need to decompress all the text before it!). The decision to use compression depends on various factors; the compression/decompression overhead is typically minimal, CPU-wise, and offers a nice saving on HD space and generally on disk I/O performance.
The advantage of this approach is that the DBMS remains small, but is available for SQL-driven queries (of an ad-hoc or programmed nature) to search on various criteria. There is typically little gain (and a lot of headache) associated with storing many/big files within the SQL server itself. Furthermore as each page gets processed / analyzed, additional meta-data (such as say title, language, most repeated 5 words, whatever) can be added to the database.
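A minimal sketch of that layout, assuming SQLite for the meta-data and one flat container file holding many compressed pages (all file names are illustrative):

    # Meta-data in SQLite, page bodies appended to a flat container file,
    # per-page zlib compression, offset + size recorded for random access.
    import sqlite3
    import time
    import zlib

    meta = sqlite3.connect("pages.db")
    meta.execute("""
        CREATE TABLE IF NOT EXISTS pages (
            id          INTEGER PRIMARY KEY,
            url         TEXT NOT NULL,
            container   TEXT NOT NULL,     -- which flat file holds the page
            page_offset INTEGER NOT NULL,  -- where the compressed page starts
            page_size   INTEGER NOT NULL,  -- compressed size in bytes
            crawled_at  REAL NOT NULL
        )
    """)

    def store_page(url, html, container="pages_0001.bin"):
        """Append one compressed page to a container file and record its location."""
        blob = zlib.compress(html.encode("utf-8"))   # per-page compression
        with open(container, "ab") as f:
            f.seek(0, 2)                  # be explicit: position at end of file
            offset = f.tell()
            f.write(blob)
        cur = meta.execute(
            "INSERT INTO pages (url, container, page_offset, page_size, crawled_at) "
            "VALUES (?, ?, ?, ?, ?)",
            (url, container, offset, len(blob), time.time()),
        )
        meta.commit()
        return cur.lastrowid

    def load_page(page_id):
        """Read one page back without decompressing anything else in the file."""
        container, offset, size = meta.execute(
            "SELECT container, page_offset, page_size FROM pages WHERE id = ?",
            (page_id,),
        ).fetchone()
        with open(container, "rb") as f:
            f.seek(offset)
            return zlib.decompress(f.read(size)).decode("utf-8")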
Having it in a database will help you search through the content and page metadata. You can also try in-memory databases or memcached-like storage to speed it up.
Depending on the processing power of the PC which will do the data mining, you could add the scraped data to a compressed archive like 7zip, zip, or a tarball. You'll be able to keep the directory structure intact and may end up saving a great deal of disk space, if that happens to be a concern.
On the other hand, an RDBMS like SQLite will balloon out really fast but won't mind ridiculously long directory hierarchies.

Storing Images: DB or File System

I read some posts on this subject but I still don't understand what the best solution is in my case.
I'm starting to write a new web app, and the backend is going to serve about 1-10 million images (average size 200-500 kB per image).
My site will provide content and images to 100-1000 users at the same time.
I'd like also to keep Provider costs as low as possible (but this is a secondary requirement).
I'm thinking that file system space is less expensive compared to the cost of DB storage.
Personally I like the idea of having all my images in the DB, but any suggestions will be really appreciated :)
Do you think that in my case the DB approach is the right choice?
Putting all of those images in your database will make it very, very large. This means your DB engine will be busy caching all those images (a task it's not really designed for) when it could be caching hot application data instead.
Leave the file caching up to the OS and/or your reverse proxy - they'll be better at it.
Some other reasons to store images on the file system:
Image servers can run even when the database is busy or down.
File systems are made to store files and are quite efficient at it.
Dumping data in your database means slower backups and other operations.
No server-side code needed to serve up an image, just plain old IIS/Apache.
You can scale up faster with dirt-cheap web servers, or potentially to a CDN.
You can perform related work (generating thumbnails, etc.) without involving the database.
Your database server can keep more of the "real" table data in memory, which is where you get your database speed for queries. If it uses its precious memory to keep image files cached, that buys you hardly anything speed-wise versus having more of the photo index in memory.
Most large sites use the filesystem.
See Store pictures as files or in the database for a web app?
When dealing with binary objects, follow a document-centric approach in your architecture and do not store documents like PDFs and images in the database; you will eventually have to refactor them out when you start seeing all kinds of performance issues with your database. Just store the file on the file system and keep the path in a table in your database. There is also a physical limitation on the size of the data type that you would use to serialize and save it in the database. Just store it on the file system and access it from there.
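A minimal sketch of that path-in-the-database approach, using SQLite as a stand-in for whatever RDBMS you use; the table name, image root and URL below are made up for illustration:

    # Bytes on disk, only the relative path in the database; the web server
    # (or a CDN) serves the files directly.
    import os
    import shutil
    import sqlite3
    import uuid

    IMAGE_ROOT = "/var/data/images"   # assumption: served directly by IIS/Apache/nginx

    db = sqlite3.connect("app.db")
    db.execute("""
        CREATE TABLE IF NOT EXISTS images (
            id        INTEGER PRIMARY KEY,
            title     TEXT,
            rel_path  TEXT NOT NULL      -- path relative to IMAGE_ROOT
        )
    """)

    def save_image(src_path, title):
        """Copy the file under IMAGE_ROOT and store only its relative path."""
        ext = os.path.splitext(src_path)[1]
        name = uuid.uuid4().hex
        # Two-level directory fan-out keeps any single folder from growing huge.
        rel_path = os.path.join(name[:2], name[2:4], name + ext)
        dest = os.path.join(IMAGE_ROOT, rel_path)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        shutil.copy2(src_path, dest)
        cur = db.execute(
            "INSERT INTO images (title, rel_path) VALUES (?, ?)", (title, rel_path)
        )
        db.commit()
        return cur.lastrowid

    def image_url(image_id, base_url="https://img.example.com/"):
        """The app only builds a URL; no database or app code touches the bytes."""
        (rel_path,) = db.execute(
            "SELECT rel_path FROM images WHERE id = ?", (image_id,)
        ).fetchone()
        return base_url + rel_path.replace(os.sep, "/")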
Your first sentence says that you've read some posts on the subject, so I won't bother putting in links to articles that cover this. In my experience, and based on what you've posted as far as the number of images and sizes of the images, you're going to pay dearly in DB performance if you store them in the DB. I'd store them on the file system.
What database are you using? MS SQL Server 2008 provides FILESTREAM storage, which its documentation describes as follows:
It allows storage of and efficient access to BLOB data using a combination of SQL Server 2008 and the NTFS file system. It covers choices for BLOB storage, configuring Windows and SQL Server for using FILESTREAM data, considerations for combining FILESTREAM with other features, and implementation details such as partitioning and performance.
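As a rough, hedged sketch only: FILESTREAM must already be enabled on the instance and the database needs a FILESTREAM filegroup; the driver string, server name and table below are placeholders, not a prescribed schema.

    # Create a FILESTREAM-backed table and insert one image through plain T-SQL.
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;"
        "DATABASE=Imaging;Trusted_Connection=yes;",
        autocommit=True,
    )
    cur = conn.cursor()

    # A FILESTREAM column requires a ROWGUIDCOL with a UNIQUE constraint; the
    # BLOB bytes live as files on NTFS but stay queryable through T-SQL.
    cur.execute("""
        IF OBJECT_ID('dbo.ImageStore') IS NULL
        CREATE TABLE dbo.ImageStore (
            ImageId   uniqueidentifier ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
            FileName  nvarchar(260) NOT NULL,
            ImageData varbinary(max) FILESTREAM NULL
        )
    """)

    with open(r"C:\incoming\photo_0001.jpg", "rb") as f:
        cur.execute(
            "INSERT INTO dbo.ImageStore (FileName, ImageData) VALUES (?, ?)",
            ("photo_0001.jpg", f.read()),
        )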
We use FileNet, a server optimized for imaging. It's very expensive. A cheaper solution is to use a file server.
Please don't consider storing large files on a database server.
As others have mentioned, store references to the large files in the database.

What storage location, SQL Server or file system, would result in better performance in saving tiff images?

Our system needs to store TIFF images of ~3 kB in size. We receive ~300 images at a given time and need to process them quickly. Once ~100,000 images have been received, the images are transferred off our system to another archival system or purged.
I am looking for best performance in regards to the initial save of the image files. The task of transferring the images for archival is less performance critical.
What storage location, SQL Server or file system, would result in better performance in saving tiff images?
Are there any other considerations or gotchas to be aware of?
Storing the images in the filesystem will give you better performance. You just need to put an entry into a relevant database table for the tiff image attachments - and use that to get the path of the image on the filesystem.
You might want to further boost performance by hosting the images on a web server - IIS (if relevant) - and have your client applications (again, if relevant) retrieve them directly from there instead.
In my experience SQL Server has been decent at storing blobs in the database. As long as I follow best practices related to queries, normalization, etc., I have found them to work well.
For some reason, I personally do not want to store huge PDF and DOC and JPG files in my database, but then, that is exactly what Microsoft SharePoint does, and does well.
I'd definitely consider putting blobs in my db.
The SQL Server 2008 version has a new feature called FILESTREAM. Part of their documentation also has a section on best practices, in which the MS folks state that FILESTREAM should come into play if the BLOB objects are typically larger than 1 MB.
That MSDN page states:
When to Use FILESTREAM
If the following conditions are true, you should consider using FILESTREAM:
- Objects that are being stored are, on average, larger than 1 MB. For smaller objects, storing varbinary(max) BLOBs in the database often provides better streaming performance.
So I guess with a 3 KB TIFF, you could store that nicely inside a VARBINARY(MAX) field in your SQL Server 2005 table. Since it's even smaller than the 8k page size for SQL Server, that'll fit nicely!
You might also want to consider putting your BLOBs into their own table and reference your "base" data row from there. That way, if you only need to query the base data (your ints, varchars etc.), your query won't be bogged down by BLOBs being stored intermingled with other stuff.
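A minimal sketch of that layout using pyodbc; the driver string, table name and key value are illustrative, not a prescribed schema:

    # Keep the BLOB in its own table, keyed by the same id as the "base" row.
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;"
        "DATABASE=Imaging;Trusted_Connection=yes;",
        autocommit=True,
    )
    cur = conn.cursor()

    # The base table holds the queryable columns; the BLOB lives here so scans
    # over the base data never drag the image bytes along.
    cur.execute("""
        IF OBJECT_ID('dbo.ImageBlob') IS NULL
        CREATE TABLE dbo.ImageBlob (
            ImageId  int PRIMARY KEY,            -- same key as the base table row
            Content  varbinary(max) NOT NULL
        )
    """)

    with open(r"C:\incoming\scan_0001.tif", "rb") as f:
        tiff_bytes = f.read()          # ~3 kB, comfortably under the 8 KB page size

    cur.execute(
        "INSERT INTO dbo.ImageBlob (ImageId, Content) VALUES (?, ?)",
        (12345, tiff_bytes),           # 12345 = key of an existing base row
    )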
Marc
The satellite catalog system at INPE/Brazil stores references to TIFF images kept in the filesystem, but those images are a little bigger - around 100 MB. If a file must be displayed in the browser, the PHP code reads the TIFF content from disk and renders it.

BLOB Storage - 100+ GB, MySQL, SQLite, or PostgreSQL + Python

I have an idea for a simple application which will monitor a group of folders and index any files it finds. A GUI will allow me to quickly tag new files and move them into a single database for storage, and also provide an easy mechanism for querying the db by tag, name, file type, and date. At the moment I have about 100+ GB of files on a couple of removable hard drives, and the database will be at least that big. If possible I would like to support full-text search of the embedded binary and text documents. This will be a single-user application.
Not trying to start a DB war, but which open source DB is going to work best for me? I am pretty sure SQLite is off the table, but I could be wrong.
I'm still researching this option for one of my own projects, but CouchDB may be worth a look.
Why store the files in the database at all? Simply store your meta-data and a filename. If you need to copy them to a new location for some reason, just do that as a file system copy.
Once you remove the file contents then any competent database will be able to handle the meta-data for a few hundred thousand files.
My preference would be to store the document with the metadata. One reason is relational integrity: you can't easily move or modify the files without the action being brokered by the db. I am sure I can handle these problems, but it isn't as clean as I would like, and my experience has been that most vendors can handle huge amounts of binary data in the database these days. I guess I was wondering if PostgreSQL or MySQL have any obvious advantages in these areas; I am primarily familiar with Oracle. Anyway, thanks for the response; if the DB knows where the external file is, it will also be easy to bring the file in at a later date if I want. Another aspect of the question was whether either database is easier to work with when using Python. I'm assuming that is a wash.
I always hate to answer "don't", but you'd be better off indexing with something like Lucene (PyLucene). That and storing the paths in the database rather than the file contents is almost always recommended.
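PyLucene is one way to do that; as a stand-in illustration of the same idea (index the extracted text, store only paths), here is a sketch using SQLite's built-in FTS5 module instead. It assumes an SQLite build with FTS5, and the table and column names are made up:

    # Full-text index over extracted text; the files themselves stay on disk.
    import sqlite3

    db = sqlite3.connect("index.db")

    # FTS5 virtual table: `body` is searchable, `path` is stored but not indexed.
    db.execute("""
        CREATE VIRTUAL TABLE IF NOT EXISTS docs
        USING fts5(body, path UNINDEXED)
    """)

    def index_document(path, extracted_text):
        """Store only the extracted text plus the file's path, not the file itself."""
        db.execute("INSERT INTO docs (body, path) VALUES (?, ?)", (extracted_text, path))
        db.commit()

    # Full-text query; results point back at files that stay on disk.
    index_document("/mnt/drive1/reports/q3.txt", "quarterly revenue summary ...")
    for (path,) in db.execute("SELECT path FROM docs WHERE docs MATCH ?", ("revenue",)):
        print(path)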
To add to that, none of those database engines will store LOBs in a separate dataspace (they'll be embedded in the table's data space), so any of those engines should perform nearly equally well (well, except SQLite). You need to move to Informix, DB2, SQL Server or others to get that kind of binary object handling.
Pretty much any of them would work (even though SQLite wasn't meant to be used in a concurrent multi-user environment, which could be a problem...) since you don't want to index the actual contents of the files.
The only limiting factor is the maximum "packet" size of the given DB (by packet I'm referring to a query/response). Usually this limit is around 2 MB, meaning that your files must be smaller than 2 MB. Of course you could increase this limit, but the whole process is rather inefficient, since for example to insert a file you would have to:
- Read the entire file into memory
- Transform the file into a query (which usually means hex encoding it - thus doubling the size from the start)
- Execute the generated query (which itself means - for the database - that it has to parse it)
I would go with a simple DB and the associated files stored using a naming convention which makes them easy to find (for example based on the primary key). Of course this design is not "pure", but it will perform much better and is also easier to use.
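A minimal sketch of the simple-DB-plus-PK-named-files idea; SQLite and the blobs/ directory are illustrative choices, not a prescribed layout:

    # Meta-data and tags in a small DB; each stored file is named after its
    # primary key, so the naming convention doubles as the lookup.
    import os
    import shutil
    import sqlite3

    BLOB_DIR = "blobs"
    os.makedirs(BLOB_DIR, exist_ok=True)

    db = sqlite3.connect("catalog.db")
    db.execute("""
        CREATE TABLE IF NOT EXISTS files (
            id        INTEGER PRIMARY KEY,
            name      TEXT NOT NULL,
            tags      TEXT,            -- e.g. comma-separated tags
            added_at  TEXT DEFAULT CURRENT_TIMESTAMP
        )
    """)

    def add_file(src_path, tags=""):
        """Insert the meta-data row first, then name the stored copy after its id."""
        cur = db.execute(
            "INSERT INTO files (name, tags) VALUES (?, ?)",
            (os.path.basename(src_path), tags),
        )
        db.commit()
        file_id = cur.lastrowid
        shutil.copy2(src_path, os.path.join(BLOB_DIR, str(file_id)))
        return file_id

    def open_file(file_id):
        """The primary key is the naming convention, so lookup is trivial."""
        return open(os.path.join(BLOB_DIR, str(file_id)), "rb")

    # Querying by tag stays a plain SQL query against the small meta-data DB:
    for row in db.execute("SELECT id, name FROM files WHERE tags LIKE ?", ("%invoice%",)):
        print(row)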
Why are you wasting time emulating something that the filesystem should be able to handle? More storage + grep is your answer.
