File system query

Is there an easy way to query data stored in a file system?
We are storing data in the file system (instead of a database), as XML files.
Is there a way to query the content of the file system?
Since the data is growing day by day, we are finding it difficult to query the content of those files.
Can anyone suggest a tool or method to query the data in the existing file system?

Of course you can write an application that provides a query interface, parses the query, and then searches through the file system.
But seeing that a database is perfect for this kind of thing, why not migrate to a database? You can use or write a file watcher to load the data into a database as it's written to the file system (so you don't have to change the system that's creating the files), and make your job a whole lot easier.
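As a rough sketch of that file-watcher idea (watchdog and sqlite3 below are only stand-ins for "a file watcher" and "whatever database you migrate to", and the XML element names are assumptions about your document format):

# Watch the share, parse each new XML file, and load the queryable fields into a database.
import sqlite3
import xml.etree.ElementTree as ET
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

# check_same_thread=False because watchdog fires callbacks from its own thread
db = sqlite3.connect("index.db", check_same_thread=False)
db.execute("CREATE TABLE IF NOT EXISTS documents (path TEXT, customer TEXT, created TEXT)")

class XmlLoader(FileSystemEventHandler):
    def on_created(self, event):
        if event.is_directory or not event.src_path.endswith(".xml"):
            return
        root = ET.parse(event.src_path).getroot()
        db.execute(
            "INSERT INTO documents VALUES (?, ?, ?)",
            (event.src_path, root.findtext("customer"), root.findtext("created")),
        )
        db.commit()

observer = Observer()
observer.schedule(XmlLoader(), path="/data/xml", recursive=True)
observer.start()  # from here on, plain SQL against index.db replaces scanning the files
observer.join()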

Related

What is the fastest way to extract 1 terabyte of data from tables in SQL Server to Parquet files without Hadoop

I need to extract two tables from a SQL Server database to Apache Parquet files (I don't use Hadoop, only Parquet files). The options I know of are:
Load the data into a Pandas dataframe and save it to a Parquet file. But this method doesn't stream the data from SQL Server to Parquet, and I only have 6 GB of RAM.
Use TurboODBC to query SQL Server, convert the data to Apache Arrow on the fly, and then convert it to Parquet. Same problem as above: TurboODBC doesn't currently stream.
Does a tool or library exist that can easily and "quickly" extract the 1 TB of data from tables in SQL Server to Parquet files?
The missing functionality you are looking for is retrieval of the result in batches with Apache Arrow in turbodbc, instead of the whole table at once: https://github.com/blue-yonder/turbodbc/issues/133 You can either help with the implementation of this feature or use fetchnumpybatches to retrieve the result in a chunked fashion in the meantime.
In general, I would recommend not exporting the data as one big Parquet file but as many smaller ones; this will make working with them much easier. Nearly all engines/tools that can consume Parquet are able to handle multiple files as one big dataset. You can then also split your query into multiple queries that write out the Parquet files in parallel. If you limit each export to a chunk that is smaller than your total main memory, you should also be able to use fetchallarrow and write to Parquet in one go.
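For illustration, a minimal sketch of that "split the query and use fetchallarrow per chunk" approach (the DSN, table name, key column, and chunk boundaries are assumptions; adjust them for your schema):

# Export one Parquet file per key range so no chunk has to exceed available RAM.
import pyarrow.parquet as pq
from turbodbc import connect

connection = connect(dsn="MySqlServerDsn")  # assumed ODBC DSN for the SQL Server instance
cursor = connection.cursor()

chunk_boundaries = [(0, 1_000_000), (1_000_000, 2_000_000)]  # example key ranges

for i, (lower, upper) in enumerate(chunk_boundaries):
    cursor.execute(
        "SELECT * FROM my_big_table WHERE id >= ? AND id < ?",  # assumed table and key column
        [lower, upper],
    )
    table = cursor.fetchallarrow()  # one chunk, materialized as an Arrow table
    pq.write_table(table, f"export_part_{i:04d}.parquet")

Several such scripts can run in parallel over disjoint key ranges, which matches the "write out the Parquet files in parallel" suggestion above.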
I think the odbc2parquet command line utility might be what you are looking for.
It utilizes ODBC bulk queries to retrieve data from SQL Server fast (like turbodbc).
It only keeps one batch in memory at a time, so you can write Parquet files which are larger than your system memory.
It allows you to split the result into multiple files if desired.
Full disclosure: I am the author, so I might be biased towards the tool.

Automated file import with SSIS package

I am very new to SSIS and its capabilities. I am busy building a new project that will upload files to a database. The problem I am facing is that the files and tables differ from one another.
So what I did was create a separate mapping table that maps each file's columns to the columns of the table the data needs to be stored in. I want the user to manage this part when they receive a new file or the file layout changes somehow.
As far as I know, SSIS lets you map each file to a table, and the import can be scheduled as a task.
My question is: will SSIS be able to handle this, or should I handle this process in code?
Many thanks in advance
I would say it all depends on the amount of data that would be imported into your SQL Server. For large data sets (normally 10,000+ rows) it becomes a necessity to utilize SSIS, as you would see performance gains in your application. Here is a simple example of creating an SSIS package using code. For smaller data operations I would suggest using a combination of this and this. Or, to create a dynamic table on your SQL Server based on the file format, look at this.
SSIS can be very picky about file formats, so if the files are completely different, then it probably isn't the tool for the job. For flat files, SSIS requires the ordering of columns to be the same.
If you know that your files will only ever arrive in one of 5 formats (for example), it wouldn't be much trouble to write 5 packages to import them. If any new file could have a totally different schema, I don't think SSIS would be the right tool for the job.
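If you do end up handling it in code, the mapping table described in the question translates fairly directly into a small script. A minimal sketch, assuming a hypothetical column_mapping table with file_column, table_column and target_table columns, and pandas with SQLAlchemy for the database side:

# Rename the file's columns according to the user-maintained mapping table, then append.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mssql+pyodbc://user:password@MyDsn")  # assumed DSN and credentials

def import_file(csv_path: str, target_table: str) -> None:
    mapping = pd.read_sql("SELECT file_column, table_column, target_table FROM column_mapping", engine)
    mapping = mapping[mapping.target_table == target_table]
    rename = dict(zip(mapping.file_column, mapping.table_column))

    df = pd.read_csv(csv_path)
    df = df[list(rename)].rename(columns=rename)  # keep only the mapped columns, renamed
    df.to_sql(target_table, engine, if_exists="append", index=False)

import_file("new_upload.csv", "Sales")

Because the mapping lives in a table the user can edit, a new file layout only needs new mapping rows rather than a new package.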

Concatenate VARBINARY Data in MSSQL

We have an application that stores Word and PDF documents in a share on a server. I'm looking into the possibility of storing these as BLOBs in the associated Microsoft SQL database instead, which seems like it's probably a good idea.
Separately, an idea which I'm investigating is the possibility of allowing users to easily view all of the documents in the share associated with a case (let's imagine they're grouped into folders by case) as one continuous stream on a tablet, as if they were all one big PDF file.
I think I've worked out how to do the latter, running a web service to convert the Word documents to PDFs and then concatenate them and the extant PDFs. But that's if we continue to store the documents as files in an NTFS share. What if we stored the documents as BLOBs in MSSQL instead?
Is there a way to concatenate BLOB data so that for every, say, 10 BLOB records (which might represent Word or PDF files), I could create an 11th record which was a concatenation of the other 10 as one giant PDF?
SQL Server blobs are not an effective way of storing files. SQL Server 2008 brought about a better mechanism for this called FILESTREAM (http://technet.microsoft.com/en-us/library/gg471497.aspx), which stores the files directly on the file system while still being managed by SQL Server.
As for the documents: you would not be able to simply concatenate the binary PDF data to form one continuous file, but there are several libraries you could use to merge the PDFs, potentially on the fly. Doing it on the fly would also remove the need to store the concatenated document.
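For example, merging blobs pulled from SQL Server into a single PDF in memory might look like the sketch below (pypdf is just one such library; the table and column names are assumptions, and it assumes the Word documents have already been converted to PDF as described in the question):

# Merge per-case PDF blobs into one combined PDF, entirely in memory.
import io
import pyodbc
from pypdf import PdfWriter

conn = pyodbc.connect("DSN=MyCaseDb")  # assumed DSN
cursor = conn.cursor()
cursor.execute("SELECT PdfData FROM CaseDocuments WHERE CaseId = ? ORDER BY SortOrder", 42)

writer = PdfWriter()
for (blob,) in cursor:
    writer.append(io.BytesIO(blob))  # appends every page of this document

merged = io.BytesIO()
writer.write(merged)  # the combined PDF, ready to stream to the tablet client

Generating the combined document on request like this avoids having to maintain that 11th "concatenated" record at all.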

Export images from SQL Server

I'm building a huge inventory and sales management program on top of Dynamics CRM 2011. I've got a lot done but I'm kinda stuck on one part:
Images are stored in the database encoded as base64 with a MimeType column. I'm wondering how I might extract those images programmatically on a schedule to be sent as part of a data transfer to synchronize another DB.
I have a SQL Server Agent job that exports a view I created. I'm thinking about writing a program that takes the resulting CSV, uses it to get a list of products we need to pull images for, then queries the DB and saves the files as, say, productserial-picnum.ext.
Is that the best way to do it? Is there an easier way to pull the images out of the DB and into files?
I'm hoping it will be able to export only images that have changed, based on, say, a Last Modified column or something.
I don't know C# at all; I know VB, PHP and JavaScript well enough to do some damage, though.
You should be able to achieve this in T-SQL itself:
OPEN a cursor over the qualifying records (e.g. WHERE LastModified > the date of the last export)
For each record:
Select the binary data into @BinaryData
Convert @BinaryData to @VarcharData (something like the line below will work)
SET @VarcharData = CAST(N'' AS XML).value('xs:base64Binary(xs:hexBinary(sql:variable("@BinaryData")))', 'VARCHAR(MAX)')
Write @VarcharData to a file (on the server, or a network drive the Agent account is allowed to write to)
Close the file
Next record
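Alternatively, since the question already leans towards writing a small program, here is a minimal sketch of that route (the table and column names are assumptions, not the real CRM schema; the question says the image body is stored as base64 text, so it is decoded before writing):

# Export only the images modified since the last run, decoding the base64 body to files.
import base64
import datetime
import pyodbc

MIME_EXT = {"image/jpeg": "jpg", "image/png": "png", "image/gif": "gif"}

last_run = datetime.datetime(2024, 1, 1)  # load this from wherever you persist the last run time

conn = pyodbc.connect("DSN=CrmDb")  # assumed DSN
cursor = conn.cursor()
cursor.execute(
    "SELECT ProductSerial, PicNum, Body, MimeType FROM ProductImages WHERE ModifiedOn > ?",
    last_run,
)

for serial, picnum, body, mime in cursor:
    ext = MIME_EXT.get(mime, "bin")
    with open(f"{serial}-{picnum}.{ext}", "wb") as f:
        f.write(base64.b64decode(body))  # the column holds base64 text, per the question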

Save Access Report as PDF/Binary

I am using an Access 2007 (VBA, ADP) front end with a SQL Server 2005 back end. I have a report that I want to save as a PDF and store as a binary file in SQL Server:
Report Opened
Report Closed - Closed Event Triggered
Report Saved as PDF and uploaded into SQL Server table as Binary File
Is this possible and how would I achieve this?
There are different opinions on whether it's a good idea to store binary files in database tables or not. Some say it's fine; some prefer to save the files in the file system and only store the location of the file in the DB.
I'm one of those who say it's fine - we have a >440 GB SQL Server 2005 database in which we store PDF files and images. It runs perfectly well and we don't have any problems with it (for example with speed... that's usually the main argument of the "file system" people).
If you don't know how to save the files in the database, google "GetChunk" and "AppendChunk" and you will find examples like this one.
Concerning database design:
It's best if you make two tables: one with only an ID and the blob field (where the PDF files are stored), and one with the ID and additional fields for filtering.
If you do it this way, all the searching and filtering happens on the small table, and only once you know the ID of the file you want do you hit the big table, exactly one time, to load the file.
We do it like this, and as I said before, the database contains nearly 450 GB of files and we have no speed problems at all.
The easiest way to do this is to save the report out to disk as a PDF (if you don't know how to do that, I recommend this thread on the MSDN forums). After that, you'll need to use ADO to import the file using OLE embedding into a binary type of field. I'm rusty on that, so I can't give specifics, but Google searching has been iffy so far.
I'd recommend against storing PDF files in Access databases -- Jet has a strict limit on database size, and PDFs can fill up that limit if you're not careful. A better bet is to use OLE linking to the file and retrieve it from disk each time the user asks for it.
The last bit of advice is to use an ObjectFrame to show the PDF on disk, which MSDN covers very well here.
