Preprocessing a large dataset in Databricks Community Edition

I have a 16 GB dataset and want to use it in Databricks. However, in Community Edition the DBFS limit is 10 GB.
Could you please advise how to preprocess the data so that it can be moved from the driver to DBFS?

The simplest approach is not to use DBFS at all (it's designed only for temporary data), but to host the data and results in your own environment, such as an AWS S3 bucket or ADLS (transfer costs could be higher).
If you can't do that, then the solution depends on other factors: what the input file format is, whether it's compressed or uncompressed, and so on.
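For example, here is a minimal sketch of reading a dataset straight from S3 in a Databricks notebook, so nothing ever has to land in DBFS. The bucket, paths, and secret-scope names are placeholders, and it assumes AWS credentials are available (on Community Edition you may have to supply keys directly instead of via a secret scope):

```python
# Minimal sketch: process a large dataset directly from S3 instead of copying it to DBFS.
# Bucket, paths, and secret names below are placeholders - substitute your own.
access_key = dbutils.secrets.get(scope="aws", key="access_key")
secret_key = dbutils.secrets.get(scope="aws", key="secret_key")

spark.conf.set("fs.s3a.access.key", access_key)
spark.conf.set("fs.s3a.secret.key", secret_key)

# Read, preprocess, and write the results back to S3.
df = spark.read.option("header", "true").csv("s3a://my-bucket/raw/16gb-dataset/")
df_clean = df.dropna()
df_clean.write.mode("overwrite").parquet("s3a://my-bucket/processed/dataset/")
```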

Related

What is the most secure and fastest way of transferring a large amount of data (SQL Server backup files) to Azure Blob storage or an Azure VM hosting SQL?

I am considering options apart from Azure Import/Export.
The data is more than 2 TB, but divided into multiple files: I would have 8-10 files ranging from 30 GB to 500 GB, and we can use certificates for any web service that can be utilized for this transfer.
AzCopy is usually recommended for transferring data of about 100 GB or less.
Azure Data Factory with Data Management Gateway is another option, by creating a pipeline and a gateway and transferring the data through them.
Please suggest if there is any other way to do this transfer.
Based on my experience, AzCopy will give you the best performance and is recommended even when the data size is greater than 20 TB. Read the following KB article for more information.
One thing you should determine is which datacenter gives you the lowest BLOB storage latency from your location. Use this URL for that purpose; the lower the values you get there, the better.
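If you prefer to script the upload instead of (or alongside) AzCopy, here is a rough sketch using the azure-storage-blob Python SDK; the account URL, container, SAS token, and file names are placeholders, and the concurrency value is only illustrative:

```python
from azure.storage.blob import BlobServiceClient

# Placeholders - substitute your real account URL, SAS token, and container.
service = BlobServiceClient(
    account_url="https://mystorageaccount.blob.core.windows.net",
    credential="<sas-token>")
container = service.get_container_client("sqlbackups")

# Upload each backup file; the SDK splits large files into blocks and
# uploads them in parallel according to max_concurrency.
for backup in ["db_full_1.bak", "db_full_2.bak"]:
    with open(backup, "rb") as data:
        container.upload_blob(name=backup, data=data,
                              overwrite=True, max_concurrency=8)
```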

SQL Server scalability question

We are trying to build an application which will have to store billions of records (1 trillion+).
A single record will contain text data and metadata about the text document.
Please help me understand the storage limitations. Can a database such as SQL Server or Oracle support this much data, or do I have to look for some other filesystem-based solution? What are my options?
Since the central server has to handle incoming load from many clients, how will parallel insertions and searches scale? How do I distribute data over multiple databases or tables? I am a little green on database specifics for such a scaled environment.
Initially, while filling the database, the insert load will be high; later, as the database grows, the search load will increase and inserts will reduce.
The total size of the data will cross 1000 TB.
Thanks.
1 trillion+ ... a single record will contain text data and metadata about the text document. Please help me understand the storage limitations.
I hope you have a BIG budget for hardware. This is big as in "millions".
A trillion documents, at 1,024 bytes of total storage per document (VERY unlikely to be realistic when you say "text"), comes to about 950 terabytes of data. "Storage limitations" means you are talking about a high-end SAN here. Using a non-redundant setup of 2 TB discs, that is roughly 450 discs. Do the maths. Add redundancy / RAID to that and you are talking a major hardware investment. And this assumes only 1 KB per document. If you average 16 KB per document, that is... around 7200 2 TB discs.
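To make that back-of-the-envelope estimate reproducible, here is the same calculation as a tiny script; the per-document sizes and the 2 TB (binary) disc size are the assumptions stated above, and the exact disc counts come out a little higher than the rounded figures in the answer:

```python
# Back-of-the-envelope storage estimate for one trillion documents.
DOCS = 10**12                 # 1 trillion records
DISC = 2 * 2**40              # one 2 TB disc (binary terabytes), non-redundant

for per_doc in (1 * 1024, 16 * 1024):   # 1 KB and 16 KB per document
    total = DOCS * per_doc
    print(f"{per_doc // 1024:>2} KB/doc -> {total / 2**40:,.0f} TB total, "
          f"~{total / DISC:,.0f} discs before redundancy/RAID")
```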
That is a hardware problem to start with. SQL Server does not scale that high, and you cannot do it in a single system anyway. The normal approach for a document store like this would be a clustered storage system (a clustered or otherwise distributed file system) plus a central database for the keywords / tagging, possibly with replication of the database for distributed search, depending on the load / inserts.
Whatever it ends up being, the storage / backup requirements are enormous. Large project here, large budget.
IO load is going to be another issue, hardware-wise. You will need a large machine and a TON of IO bandwidth into it. I have seen 8 Gb links overloaded on a SQL Server (fed by an HP EVA with 190 discs), and I can imagine you will run into something similar. You will want hardware with as much RAM as technically possible, regardless of the price, unless you store the blobs outside the database.
SQL row compression may come in VERY handy. Full text search will be a problem.
the total size of the data will cross 1000 TB.
No. Seriously, I think it will be bigger than that. 1000 TB would assume the documents are small, like the XML form of a travel ticket.
According to the MSDN page on SQL Server limitations, it can accommodate 524,272 terabytes in a single database, although it can only accommodate 16 TB per file, so for 1000 TB you'd be looking at implementing partitioning. If the files themselves are large and are just going to be treated as binary blobs, you might also want to look at FILESTREAM, which actually keeps the files on the file system but maintains SQL Server notions such as transactions, backup, etc.
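To give a concrete flavour of what "implementing partitioning" can look like, here is a rough sketch (issued through pyodbc, with entirely hypothetical connection details, object names, and boundary values; a real design at this scale would map the scheme to many filegroups rather than PRIMARY):

```python
import pyodbc

# Hypothetical connection string and object names - illustration only.
conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;"
                      "DATABASE=DocStore;UID=app;PWD=secret")
cur = conn.cursor()

# Split rows by document ID range so no single partition/file grows unbounded.
cur.execute("""
    CREATE PARTITION FUNCTION pf_doc_range (bigint)
        AS RANGE RIGHT FOR VALUES (250000000000, 500000000000, 750000000000);
""")
cur.execute("""
    CREATE PARTITION SCHEME ps_doc_range
        AS PARTITION pf_doc_range ALL TO ([PRIMARY]);
""")
cur.execute("""
    CREATE TABLE dbo.Documents (
        DocID    bigint         NOT NULL,
        Metadata nvarchar(max)  NULL,
        Content  varbinary(max) NULL,
        CONSTRAINT PK_Documents PRIMARY KEY CLUSTERED (DocID)
    ) ON ps_doc_range (DocID);
""")
conn.commit()
```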
All of the above is for SQL Server. Other products (such as Oracle) should offer similar facilities, but I couldn't list them.
In the SQL Server space you may want to take a look at SQL Server Parallel Data Warehouse, which is designed for 100s TB / Petabyte applications. Teradata, Oracle Exadata, Greenplum, etc also ought to be on your list. In any case you will be needing some expert help to choose and design the solution so you should ask that person the question you are asking here.
When it comes to the database, it's quite tricky, and multiple components can be involved in getting performance: a Redis cache, sharding, read replicas, etc.
The post below describes simplified DB scalability options.
http://www.cloudometry.in/2015/09/relational-database-scalability-options.html

Best strategy for storing documents in SQL Server 2008

One of our teams is going to be developing an application to store records in a SQL2008 database and each of these records will have an associated PDF file. There is currently about 340GB of files, with most (70%) being about 100K, but some are several Megabytes in size. Data is mostly inserted and read, but the files are updated on occasion. We are debating between the following options:
Store the files as BLOBs in the database.
Store the files outside the database and store the paths in the database.
Use SQL2008's Filestream feature to store the files.
We have read the Microsoft best practices regarding FILESTREAM data, but since the files vary in size, we are not sure which path to choose. We are leaning toward option 3 (FILESTREAM), but have some questions:
Which architecture would you choose given the amount of data and file sizes noted above?
Data access will be done using SQL authentication, not Windows authentication, and the web server will likely not be able to access the files using the Windows API. Would this make FILESTREAM perform worse than the other two options?
Since the SQL backups include the filestream data, this would lead to very large database backups. How do others handle backing up databases with a large amount of filestream data?
OK, here we go. Option 2 is a really bad idea: you end up with untestable integrity constraints and backups that, by definition, are not guaranteed to be consistent, because you cannot take point-in-time backups of the database and the file system together. Not a problem in MOST scenarios, but it turns into one the moment you need a more complicated (point-in-time) recovery.
Options 1 and 3 are pretty equal, albeit with some implications.
FILESTREAM can use a lot more disc space. Basically, every version has a GUID; if you make updates, the old files stay around until the next backup.
OTOH, the files do not count toward the database size (in Express Edition they do not count against the 10 GB limit, should you use it), and access via a file share is possible further down the line. This is added flexibility.
In-database storage has the most limited options regarding access (there is no way for the web server to just open the file after getting a path from SQL; it has to funnel the complete file through the SQL protocol layer), but it has the advantage of fewer files to manage. Putting the blobs into a separate table, and that table onto a separate set of spindles, may be strategically a good idea.
Regarding your questions:
1: I would go with in-database storage. Try out both, FILESTREAM and not; as you use the same API either way, it is a simple change in the table definition.
2: Yes, worse than direct file access, but it would be more protected than direct file access. Otherwise I do not think FILESTREAM and BLOB storage make a significant difference.
3: Where do you see a huge backup here? Sorry to ask, but 340 GB is not exactly a large database, and you need to back it up ANYWAY. Better to do it in one consistent state, which is what you achieve with DB storage, plus integrity (no one accidentally deleting unused documents without cleaning up the database). The DB is not significantly larger than it would be with the files split out, and it is a simple, one-place backup.
In the end, the question is DB integrity and ease of backing things up. Win for SQL Server unless you get really large, and that means something like 360 terabytes of data.
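As an illustration of "the same API anyway", here is a minimal pyodbc sketch for writing and reading a PDF through a varbinary(max) column. The table, column, and file names are made up; if the column were declared with the FILESTREAM attribute instead, this client code would not change:

```python
import pyodbc

# Assumes a hypothetical table:
#   CREATE TABLE dbo.Documents (DocID int IDENTITY PRIMARY KEY,
#                               Name nvarchar(260), Content varbinary(max))
conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;"
                      "DATABASE=Docs;UID=app;PWD=secret")
cur = conn.cursor()

# Store a PDF as a BLOB.
with open("invoice-0001.pdf", "rb") as f:
    cur.execute("INSERT INTO dbo.Documents (Name, Content) VALUES (?, ?)",
                "invoice-0001.pdf", pyodbc.Binary(f.read()))
conn.commit()

# Read it back; plain varbinary(max) or FILESTREAM, the T-SQL is identical.
row = cur.execute("SELECT Content FROM dbo.Documents WHERE Name = ?",
                  "invoice-0001.pdf").fetchone()
with open("invoice-0001-copy.pdf", "wb") as f:
    f.write(row.Content)
```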
Store the files outside the database and store the paths in the database, because it takes too much space to store the files in the database.
I would definitely recommend (3) - this is the sort of scenario that this feature is specifically built to handle, and it is handled very well in my opinion.
This white paper has lots of useful information - http://msdn.microsoft.com/en-us/library/cc949109(SQL.100).aspx - and from a security point of view mentions that...
There are two security requirements for using the FILESTREAM feature. Firstly, SQL Server must be configured for integrated security. Secondly, if remote access will be used, then the SMB port (445) must be enabled through any firewall systems.
With regard to Backups, see the accepted answer to this question - SQL Server FILESTREAM limitation
I've used an Index/Content method that you haven't listed, but it might help. You have a table of files stored as binary blobs, each with a unique ID or row number. A second SQL table provides the index: the name of the file, the path to it, keywords, file type, file size, checksum... whatever you need. This is the best approach I have seen for working with thousands of uploaded documents. The index is required to view the file, since it would just be binary data to the user if they had no idea what the file type is. We store the data in two separate databases, which allows the index to sit on one server and the file store to spread across multiple servers for easy expansion. At that point the index table/database contains the name of, or a key to, the server the file is on. If the user has access to read that particular index table, then they have access to the file.
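Here is a rough sketch of building the kind of index record described above; all field names are illustrative rather than a fixed schema, and SHA-256 stands in for whatever checksum you prefer:

```python
import hashlib
import mimetypes
import os

def build_index_record(path, content_server):
    """Collect the metadata the index table would hold for one stored file.

    `content_server` identifies which file-store server/database holds the
    actual blob; every field name here is illustrative.
    """
    with open(path, "rb") as f:
        data = f.read()
    return {
        "file_name": os.path.basename(path),
        "file_type": mimetypes.guess_type(path)[0] or "application/octet-stream",
        "file_size": len(data),
        "checksum_sha256": hashlib.sha256(data).hexdigest(),
        "content_server": content_server,
    }

print(build_index_record("report.pdf", content_server="filestore-02"))
```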
This scenario is easy: the FILESTREAM recommendation says it is best when the files are (on average) larger than 1 MB, which is not your case; for smaller objects, storing varbinary(max) BLOBs in the database often provides better streaming performance.
Since you will be accessing the files directly from SQL Server and not from the filesystem, you should store them as BLOBs.
Read When to Use FILESTREAM: http://technet.microsoft.com/en-us/library/bb933993%28v=sql.105%29.aspx
Have you looked at the RBS (Remote BLOB Store) solution? If you use the FILESTREAM RBS provider, it will internally keep your blobs as FILESTREAM files or varbinary(max) values, depending on which gives better performance for the blob size.
Remote BLOB Store Provider Library Implementation Specification
SQL Remote Blob Storage Team Blog

Database that consumes less disk space

I'm looking at solutions to store a massive quantity of information while consuming the least possible disk space.
The information structure is very simple and the queries will also be very simple.
I've looked at solutions like Apache Cassandra and relational databases, but couldn't find a comparison where disk usage is mentioned.
Any ideas on this would be great.
Speaking of Apache Cassandra: it's just a disk space hog. 200 MB of logs resulted in 1.2 GB of files produced by Cassandra, and the keyspace was just 4 columns with 200-character strings.
Take a look at Oracle Berkeley DB - very simple robust database (key/value):
"Berkeley DB enables the development of custom data management solutions, without the overhead traditionally associated with such custom projects. Berkeley DB provides a collection of well-proven building-block technologies that can be configured to address any application need from the handheld device to the datacenter, from a local storage solution to a world-wide distributed one, from kilobytes to petabytes."
Redis might be worth a look if you can store your data as key-value pairs.
The newest version of Microsoft's SQL Server (2008) supports several levels of compression (row compression and page compression, in addition to backup compression). It might be worth investigating.
Some relevant resources:
Linchi Shea shows that compression can sometimes improve performance
Official MS Best Practices doc for SQL 2008 compression
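For a concrete idea of what enabling page compression involves, here is a rough sketch issued through pyodbc with hypothetical connection details and table names; sp_estimate_data_compression_savings lets you check the expected gain before actually rebuilding:

```python
import pyodbc

conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;"
                      "DATABASE=Archive;UID=app;PWD=secret")
cur = conn.cursor()

# Estimate how much space PAGE compression would save on a hypothetical table.
cur.execute("""
    EXEC sp_estimate_data_compression_savings
         @schema_name = 'dbo', @object_name = 'Events',
         @index_id = NULL, @partition_number = NULL,
         @data_compression = 'PAGE';
""")
for row in cur.fetchall():
    print(row)

# If the estimate looks good, rebuild the table with page compression enabled
# (note that data compression requires Enterprise Edition in SQL Server 2008).
cur.execute("ALTER TABLE dbo.Events REBUILD WITH (DATA_COMPRESSION = PAGE);")
conn.commit()
```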

Using SQL Server as Image store

Is SQL Server 2008 a good option to use as an image store for an e-commerce website? It would be used to store product images of various sizes and angles. A web server would output those images, reading the table by a clustered ID. The total image size would be around 10 GB, but it will need to scale. I see a lot of benefits over using the file system, but I am worried that SQL Server, not having an O(1) lookup, is not the best solution, given that the site has a lot of traffic. Would that even be a bottleneck? What are some thoughts, or perhaps other options?
10 GB is not a huge amount of data, so you can probably use the database to store it without any big issues; performance-wise it's best to use the filesystem, while for safety and manageability (backups and consistency) it's better to use the DB.
Happily, SQL Server 2008 allows you to have your cake and eat it too, with:
The FILESTREAM Attribute
In SQL Server 2008, you can apply the FILESTREAM attribute to a varbinary column, and SQL Server then stores the data for that column on the local NTFS file system. Storing the data on the file system brings two key benefits:
Performance matches the streaming performance of the file system.
BLOB size is limited only by the file system volume size.
However, the column can be managed just like any other BLOB column in SQL Server, so administrators can use the manageability and security capabilities of SQL Server to integrate BLOB data management with the rest of the data in the relational database—without needing to manage the file system data separately.
Defining the data as a FILESTREAM column in SQL Server also ensures data-level consistency between the relational data in the database and the unstructured data that is physically stored on the file system. A FILESTREAM column behaves exactly the same as a BLOB column, which means full integration of maintenance operations such as backup and restore, complete integration with the SQL Server security model, and full-transaction support.
Application developers can work with FILESTREAM data through one of two programming models; they can use Transact-SQL to access and manipulate the data just like standard BLOB columns, or they can use the Win32 streaming APIs with Transact-SQL transactional semantics to ensure consistency, which means that they can use standard Win32 read/write calls to FILESTREAM BLOBs as they would if interacting with files on the file system.
In SQL Server 2008, FILESTREAM columns can only store data on local disk volumes, and some features such as transparent encryption and table-valued parameters are not supported for FILESTREAM columns. Additionally, you cannot use tables that contain FILESTREAM columns in database snapshots or database mirroring sessions, although log shipping is supported.
Check out this white paper from MS Research (http://research.microsoft.com/research/pubs/view.aspx?msr_tr_id=MSR-TR-2006-45)
They detail exactly what you're looking for. The short version is that any file size over 1 MB starts to degrade performance compared to saving the data on the file system.
I doubt that O(log n) for lookups would be a problem. You say you have 10GB of images. Assuming an average image size of say 50KB, that's 200,000 images. Doing an indexed lookup in a table for 200K rows is not a problem. It would be small compared to the time needed to actually read the image from disk and transfer it through your app and to the client.
It's still worth considering the usual pros and cons of storing images in a database versus storing paths in the database to files on the filesystem. For example:
Images in the database obey transaction isolation, automatically delete when the row is deleted, etc.
Database with 10GB of images is of course larger than a database storing only pathnames to image files. Backup speed and other factors are relevant.
You need to set MIME headers on the response when you serve an image from a database through an application (see the sketch after this list).
The images on a filesystem are more easily cached by the web server (e.g. Apache mod_mmap), or could be served by a leaner web server like lighttpd. This is actually a pretty big benefit.
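As a minimal illustration of the MIME-header point, here is a sketch using Flask and pyodbc; the route, table, and column names are all made up:

```python
import io

import pyodbc
from flask import Flask, abort, send_file

app = Flask(__name__)

def get_connection():
    # Hypothetical connection string.
    return pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;"
                          "DATABASE=Shop;UID=app;PWD=secret")

@app.route("/images/<int:image_id>")
def product_image(image_id):
    row = get_connection().cursor().execute(
        "SELECT Content, MimeType FROM dbo.ProductImages WHERE ImageID = ?",
        image_id).fetchone()
    if row is None:
        abort(404)
    # The browser needs an explicit Content-Type, since there is no file
    # extension for it to infer the type from.
    return send_file(io.BytesIO(row.Content), mimetype=row.MimeType)
```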
For something like an e-commerce web site, I would be more likely to go with storing the images as blobs in the database. While you don't want to engage in premature optimization, the benefit of having the images easily organized alongside the data, as well as very portable, is one automatic win for something like e-commerce.
If the images are indexed, then lookup won't be a big problem. I'm not sure, but I don't think filesystem lookup is O(1) either; it may be closer to O(n) if the directory entries aren't indexed by the file system.
What worries me in this setup is the size of the database, but if it's managed correctly that won't be a big problem, and a big advantage is that you have only one thing to back up (the database) and don't have to worry about files on disk.
Normally a good solution is to store the images themselves on the filesystem, and the metadata (file name, dimensions, last updated time, anything else you need) in the database.
Having said that, there's no "correct" solution to this.
