I'm trying to implement a file system using ZooKeeper with a MongoDB backend to store the data.
I'm trying to wrap my head around what should be stored in ZooKeeper: ACLs, file metadata, database locations?
I am looking for the best way to move 2 TB of data from on-premises to Snowflake. The data is in zipped files of ~150 MB each, and similar files will be generated on an ongoing basis. As we don't have a cloud account (only a Snowflake account), we cannot use cloud-native storage like S3 or Azure Blob. We also want to use the public internet to establish connectivity from the on-premises network to the Snowflake DB in the cloud (no VPN or Direct Connect is available, and no third-party tool is to be used).
How can we best ensure that the data is secure while in transit from on-premises to the Snowflake DB in the cloud?
And how can the data be loaded into Snowflake without using S3 or Azure Blob storage?
Since you do not have an external cloud storage account to store these files in, one option I can see is to use SnowSQL to upload the files into Snowflake's internal stage locations with the PUT command. Have a look at the PUT command documentation at the following URL:
https://docs.snowflake.com/en/sql-reference/sql/put.html
It can upload files to a named internal stage as well as to the user and table internal stages.
There is an optional PARALLEL parameter which specifies the number of threads to use for uploading files; increasing the number of threads can improve performance when uploading large files. Larger files are automatically split into chunks, staged concurrently, and reassembled in the target stage. A single thread can upload multiple chunks.
Uploaded files are automatically encrypted with 128-bit or 256-bit keys. The CLIENT_ENCRYPTION_KEY_SIZE account parameter specifies the key size used to encrypt the files.
Given the 2 TB of files to upload, you should experiment with multiple files of smaller sizes.
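To make this concrete, here is a minimal sketch of staging and loading one of the zipped files, using the Snowflake JDBC driver rather than the SnowSQL CLI (both run the same PUT/COPY statements). The account, credentials, table name, and file path are placeholders, and the JDBC driver is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.Properties;

public class StageUpload {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("user", "MY_USER");          // placeholder credentials
        props.put("password", "MY_PASSWORD");
        props.put("db", "MY_DB");
        props.put("schema", "PUBLIC");
        props.put("warehouse", "MY_WH");

        // Placeholder account URL.
        String url = "jdbc:snowflake://myaccount.snowflakecomputing.com";
        try (Connection conn = DriverManager.getConnection(url, props);
             Statement stmt = conn.createStatement()) {
            // PUT copies the local file into the table's internal stage over TLS;
            // PARALLEL controls the number of upload threads for large files.
            stmt.execute("PUT file:///data/export/part_0001.csv.gz @%MY_TABLE PARALLEL=8");
            // COPY INTO loads the staged (and automatically encrypted) file into the table.
            stmt.execute("COPY INTO MY_TABLE FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP)");
        }
    }
}
```

The same two statements can be scripted in SnowSQL for the files generated on an ongoing basis.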
You can use any of the Snowflake connectors to move data directly from your on-premises servers to Snowflake: https://docs.snowflake.com/en/user-guide/conns-drivers.html
You can also start simply with the command-line interface SnowSQL, using PUT commands: https://docs.snowflake.com/en/user-guide/snowsql.html
All traffic to/from Snowflake is encrypted in transit with SSL: https://resources.snowflake.com/snowflake/automatic-encryption-of-data
I can create/restore Solr backups via CollectionAdminRequest.Backup and CollectionAdminRequest.Restore.
It looks like it's possible via the HTTP API, e.g.:
http://localhost:8983/solr/gettingstarted/replication?command=details&wt=xml
But is it possible to list all backups and drop one by name from SolrJ?
I'm using Solr 7.5.
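For reference, this is roughly what my backup/restore calls look like today (collection name, backup name, and backup location are placeholders):

```java
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class BackupExample {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client =
                 new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
            // Create a backup of the collection in the configured backup location.
            CollectionAdminRequest.backupCollection("gettingstarted", "backup_2019_01_01")
                .setLocation("/backups/solr")
                .process(client);

            // Restore the backup into a new collection.
            CollectionAdminRequest.restoreCollection("gettingstarted_restored", "backup_2019_01_01")
                .setLocation("/backups/solr")
                .process(client);
        }
    }
}
```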
From what I found, it's not possible to do this from SolrJ directly.
So I ended up working with HDFS directly: I configured Solr to use HDFS as the backup storage, and from my code I access it via the HDFS client, which lets me list and remove backups, as in the sketch below.
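A minimal sketch of that approach, assuming the Hadoop client is available and using placeholder values for the NameNode URI and backup location:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SolrBackupCleaner {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf)) {
            // Each backup created via CollectionAdminRequest.Backup is a directory
            // under the configured backup location.
            Path backupRoot = new Path("/solr/backups");
            for (FileStatus status : fs.listStatus(backupRoot)) {
                System.out.println("Found backup: " + status.getPath().getName());
            }
            // Drop a backup by name by deleting its directory recursively.
            fs.delete(new Path(backupRoot, "backup_2019_01_01"), true);
        }
    }
}
```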
I am running Solr on 5 different instances. Making a change to the schema/dataconfig file is a big task, as I need to make the change on each server.
Can I load the schema file from a single server, so that the same path can be defined in each solrconfig and changes are reflected on every Solr instance?
If you're running Solr on multiple instances, you should really consider moving to a cluster based installation (i.e. SolrCloud). This will give you a common schema across servers and allow you to easily create collections and make changes across all nodes in your network at the same time.
You can use a shared file system, but it'll still require you to access each server (which you can do through the core admin API if you want to automate it; see the sketch below) to reload the core so it picks up any changes to the schema.
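A small sketch of that automation with SolrJ's CoreAdminRequest; the host list and core name are placeholders:

```java
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

public class ReloadCores {
    public static void main(String[] args) throws Exception {
        String[] hosts = {
            "http://solr1:8983/solr", "http://solr2:8983/solr",
            "http://solr3:8983/solr", "http://solr4:8983/solr",
            "http://solr5:8983/solr"
        };
        for (String host : hosts) {
            try (HttpSolrClient client = new HttpSolrClient.Builder(host).build()) {
                // Reload the core so it picks up the schema/dataconfig files
                // shared on the common path.
                CoreAdminRequest.reloadCore("my_core", client);
            }
        }
    }
}
```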
I'm looking for a tutorial or something similar that allows me to learn Presto step by step.
The idea is to start integrating files and MS SQL Server, which is my area of knowledge.
Unfortunately, since it is a relatively new area, I didn't find anything more than the Facebook page or the Presto.io page; however, that is not good enough for someone who wants to start learning the big data world from scratch.
I would appreciate your help and/or orientation in this area.
Presto has 2 primary use cases:
querying data stored in a cluster (on Hadoop's HDFS) or in a cloud (e.g. Amazon S3)
data federation, i.e. querying (and joining) data from multiple data sources (e.g. HDFS, S3, traditional RDBMS like PostgreSQL or SQL Server)
As far as SQL Server support is concerned, Presto has supported connecting to SQL Server since https://github.com/prestosql/presto/commit/072440cbb2c8df2a689c4c903dd325013eae41a0.
When it comes to querying files, Presto uses Hive's Metastore to keep track of metadata (everything besides actually reading the data), so the files must reside on HDFS or S3 to be accessible (other cloud data stores, like Azure Blob, are, AFAIK, not supported yet).
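Once a SQL Server catalog is configured on the Presto cluster, you can query it from Java through the Presto JDBC driver. This is only a hedged sketch: the coordinator host, catalog name ("sqlserver"), schema, and table are placeholders, and the presto-jdbc driver is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PrestoQuery {
    public static void main(String[] args) throws Exception {
        // Placeholder coordinator, catalog, and schema.
        String url = "jdbc:presto://presto-coordinator:8080/sqlserver/dbo";
        try (Connection conn = DriverManager.getConnection(url, "demo_user", null);
             Statement stmt = conn.createStatement();
             // Reads a SQL Server table through the sqlserver catalog; the same
             // session could also join it against Hive/HDFS tables for federation.
             ResultSet rs = stmt.executeQuery("SELECT id, name FROM customers LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getLong("id") + " " + rs.getString("name"));
            }
        }
    }
}
```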
I have a standard WinForms application that connects to a SQL Server. The application allows users to upload documents which are currently stored in the database, in a table using an image column.
I need to change this approach so the documents are stored as files and a link to the file is stored in the database table.
With the current approach, when a user uploads a document they are shielded from how it is stored: since they have a connection to the database, they do not need to know anything about where the files are stored, and no special directory permissions etc. are required. If I set up a network share for the documents, I want to avoid any IT issues such as users needing access to this directory to upload or read existing documents.
What are the options available to do this? I thought of having a temporary database where the documents are uploaded to in the same way as the current approach and then a process running on the server to save these to the file store. This database could then be deleted and recreated to reclaim any space. Are there any better approaches?
ADDITIONAL INFO: There is no web server element to my application, so I do not think a WCF service is possible.
Is there a reason why you want to get the files out of the database in the first place?
How about still saving them in SQL Server, but using a FILESTREAM column instead of IMAGE?
Quote from the link:
FILESTREAM enables SQL Server-based applications to store unstructured
data, such as documents and images, on the file system. Applications
can leverage the rich streaming APIs and performance of the file
system and at the same time maintain transactional consistency between
the unstructured data and corresponding structured data.
FILESTREAM integrates the SQL Server Database Engine with an NTFS file
system by storing varbinary(max) binary large object (BLOB) data as
files on the file system. Transact-SQL statements can insert, update,
query, search, and back up FILESTREAM data. Win32 file system
interfaces provide streaming access to the data.
FILESTREAM uses the NT system cache for caching file data. This helps
reduce any effect that FILESTREAM data might have on Database Engine
performance. The SQL Server buffer pool is not used; therefore, this
memory is available for query processing.
So you would get the best of both worlds:
The files would be stored as files on the hard disk (probably faster compared to storing them in the database), but you don't have to care about file shares, permissions etc.
Note that you need at least SQL Server 2008 to use FILESTREAM.
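To illustrate, here is a minimal sketch of writing a document into a FILESTREAM column with a plain parameterized INSERT. The table definition, connection string, and file path are placeholders, and the snippet uses JDBC purely for illustration; the equivalent ADO.NET code in the WinForms application looks the same from the T-SQL side.

```java
// Assumed placeholder table:
//   CREATE TABLE Documents (
//       Id UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
//       Name NVARCHAR(260),
//       Content VARBINARY(MAX) FILESTREAM
//   );
import java.io.FileInputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class FilestreamInsert {
    public static void main(String[] args) throws Exception {
        Path doc = Paths.get("C:/temp/contract.pdf"); // placeholder document
        String url = "jdbc:sqlserver://dbserver;databaseName=Docs;integratedSecurity=true";
        try (Connection conn = DriverManager.getConnection(url);
             FileInputStream in = new FileInputStream(doc.toFile());
             PreparedStatement ps = conn.prepareStatement(
                 "INSERT INTO Documents (Name, Content) VALUES (?, ?)")) {
            ps.setString(1, doc.getFileName().toString());
            // SQL Server keeps the bytes on the NTFS file system, but the insert
            // stays transactional like any other row.
            ps.setBinaryStream(2, in, Files.size(doc));
            ps.executeUpdate();
        }
    }
}
```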
I can tell you how I implemented this task. I wrote a WCF service which is used to send archived files. So, if I were you, I would create such a service, which should be able to both save files and send them back. This is easy, and you must also make sure that the user under whose context the WCF service runs has permission to read and write files.
You could just have your application pass the object to a procedure (perhaps a CLR procedure) in the database, which then writes the data out to the location of your choosing without storing the file contents. That way you still have a layer of abstraction between the file store and the application, but you don't need a process which cleans up after you.
Alternatively, a WCF/web service could be created which the application connects to. A web method could accept the file contents and write them to the correct place; it could return the path to the file or some file identifier.