My boss wants me to send him a large SQL Server query result. The result is around 60 million rows. Is there any advice on how to send this? File format? Compression? Thanks for any help!
What is the row size?
Typically when I need to export large amounts of data from SQL Server, I use BoostSSMS scripting templates. You could also use SSIS to generate the file, especially if this is an ETL process (since you can debug and step through the process). I find SSMS's 'results to file' totally inadequate compared to BoostSSMS in terms of speed. My preferred format is tab-delimited with single-quote text qualifiers.
If you're sending the data somewhere, it's probably best to compress it (which will drastically reduce the file size). Which method to use is going to depend on your target and their requirements but I've never had any problems with plain old ZIP compression.
Edit - your boss says he just wants to view it. Ask him what exactly he's looking for, whether that's some specific combination of values or doing analysis on the data. That matters because I can guarantee he isn't going to read 60 million rows line by line. If he's looking for something specific, filter your query down using those conditions. If he's doing analysis on it, do the aggregation in your SQL and give him the results.
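To make that concrete, here's a minimal sketch of the "aggregate server-side, then ship a small compressed file" approach, in Python with pyodbc (an assumption on my part; the server, table, and column names are all hypothetical):

    import csv
    import pyodbc
    import zipfile

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;"
    )
    cursor = conn.cursor()

    # Do the aggregation server-side; only the summary leaves the database.
    cursor.execute("""
        SELECT region, COUNT(*) AS row_count, SUM(amount) AS total
        FROM dbo.BigTable
        GROUP BY region
    """)

    # Write a tab-delimited file, then ZIP it for sending.
    with open("summary.tsv", "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow([col[0] for col in cursor.description])
        writer.writerows(cursor.fetchall())

    with zipfile.ZipFile("summary.zip", "w", zipfile.ZIP_DEFLATED) as z:
        z.write("summary.tsv")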
I am working with an organisation that is currently using Excel as its data store. Their data might scale up in the near future, so I was wondering at what point the data becomes too large for Excel to handle efficiently. The data stored is just strings and integers of limited length (strings under 50 characters, integers between 1 and 100,000).
Is there some kind of general guideline for how big an Excel sheet can be before it becomes inefficient and you should use a database like SQL Server instead?
I have usually found that once the Excel file size crosses 30 MB it becomes very slow and unresponsive. To handle larger data sizes I started using an add-in called Power Pivot, which is very fast! See instructions on how to enable it here - https://support.office.com/en-us/article/Start-the-Power-Pivot-in-Microsoft-Excel-add-in-a891a66d-36e3-43fc-81e8-fc4798f39ea8
Efficiency is subjective and depends on your goals. The thing is, an RDBMS like SQL Server will always be more efficient, because you get optimized searches instead of scans over a flat file. Relations between tables are also handled well, and if you design your DB schema well you get benefits like concurrent access and the avoidance of redundancy and inconsistency. It is worth the effort in the long run.
Currently I am storing JSON that contains some transformations in my database as VARCHAR(MAX). One of our techs is asking to store the original JSON it was transformed from.
I'm afraid that if I add another JSON column it is going to bloat the page size and lead to slower access times. On the other hand, this table is not going to be really big (about 100 rows max, with each JSON column taking 4-6 KB) and could get accessed as much as 4 or 5 times a minute.
Am I being a stingy gatekeeper mercilessly abusing our techs or a sagacious architect keeping the system scalable?
Also, I'm curious about the (relatively) new FILESTREAM/BLOB types. From what I've read, I get the feeling that BLOBs are stored in some separate place such that relational queries aren't slowed down at all. Would switching VARCHAR to FILESTREAM help?
Generally, BLOB storage is preferred when the objects being stored are, on average, larger than 1 MB.
I think you should be fine keeping them in the same database. 100 rows is not much for a database.
Also, what is the use case for keeping the original as well as the transformed JSON? If the original JSON is not going to be used as part of normal processing and is only kept for reference, I would suggest keeping a separate table, dumping the original JSON there with a reference key, and using the original only when needed.
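A rough sketch of that side-table idea (Python with pyodbc assumed; the connection string, table, and column names are all hypothetical):

    import pyodbc

    conn = pyodbc.connect("DSN=mydb")  # hypothetical DSN
    # Original JSON lives off the hot table, linked back by key.
    conn.cursor().execute("""
        CREATE TABLE dbo.TransformOriginal (
            TransformId INT NOT NULL
                REFERENCES dbo.Transform (TransformId),
            OriginalJson VARCHAR(MAX) NOT NULL
        )
    """)
    conn.commit()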
Your use case doesn't sound very demanding. 4-6 KB and fewer than 100 rows, or even 1000 for that matter, is still pretty light. Though I know the expected use case almost never ends up being the actual use case. If people use the table for things other than the JSON field, you might not want them pulling back the JSON because of the potential size and unnecessary bloat.
Good thing SQL Server has some other, less complex options to help us out. https://msdn.microsoft.com/en-us/library/ms173530(v=sql.100).aspx
I would suggest looking at the table option Large Value Types out of Row, as it is future-compatible, while the text in row option is deprecated. Essentially these options store large text fields off of the primary page, allowing the core row data to live where it needs to live and the extra stuff to have a different home.
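For what it's worth, that option is set per table with sp_tableoption. A minimal sketch (pyodbc assumed; the DSN and table name are hypothetical):

    import pyodbc

    conn = pyodbc.connect("DSN=mydb")  # hypothetical DSN
    # Push large VARCHAR(MAX) values off the primary data page.
    conn.cursor().execute(
        "EXEC sp_tableoption 'dbo.Transform', 'large value types out of row', 1"
    )
    conn.commit()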
I'm importing a huge amount of data using BCP, and I was curious about what options it has to improve the speed of this operation.
I read about parallel loading in a couple of places, and also saw that you can tell it not to bother sorting the data or checking constraints (both viable options for me, since the source is another database with good integrity).
I haven't seen examples of these options being used, though (as in, I don't know which command-line switch enables parallel loading or disables constraint checking).
Does anyone know a good resource for learning this, or can someone give me a couple trivial examples? And please don't point me to the BCP parameters help page, I couldn't make heads or tails of it with regard to these specific options.
Any help is greatly appreciated!
You need to read The Data Loading Performance Guide. There is no magical 'load faster' command-line switch; it is a very complicated balance of doing the right thing in the right context. It depends on whether you load a heap or a B-tree, whether there is already data or the table is empty, whether it has secondary indexes, whether minimal logging is possible under the database's recovery model, whether the table is partitioned, whether the data is pre-sorted, and that is just the surface. The linked white paper has all the details.
It looks like the parallel loading you are talking about is just running multiple instances of the BCP utility against the same table. You would be responsible for partitioning the data beforehand. You use it by specifying the TABLOCK table hint. From MSDN:
Bulk update (BU) locks allow processes to bulk copy data concurrently into the same table while preventing other processes that are not bulk copying data from accessing the table.
So it's really just a special lock for BCP.
To further increase performance, you may read further into the -a flag on the BCP parameters page.
-a allows you to specify a larger packet size (between 4096 and 65535) to increase the amount of data sent to the server per network packet.
I would also suggest using the -e flag with an error file if you intend to run multiple BCP processes, to help keep track of any errors encountered.
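A rough sketch of the whole pattern, launching several bcp processes from Python (the file, server, and table names are hypothetical; -c, -T, -h, -a and -e are the standard bcp flags discussed above):

    import subprocess

    # Data must be pre-split into chunks; each process loads one chunk
    # into the same table under a bulk-update (BU) lock via TABLOCK.
    chunks = ["chunk1.dat", "chunk2.dat", "chunk3.dat", "chunk4.dat"]
    procs = []
    for i, chunk in enumerate(chunks):
        procs.append(subprocess.Popen([
            "bcp", "mydb.dbo.BigTable", "in", chunk,
            "-S", "myserver", "-T",   # server, trusted connection
            "-c",                     # character-format data
            "-h", "TABLOCK",          # allows concurrent bulk loads
            "-a", "65535",            # larger network packet size
            "-e", f"errors_{i}.log",  # per-process error file
        ]))

    for p in procs:
        p.wait()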
I am dealing with large amounts of scientific data that are stored in tab separated .tsv files. The typical operations to be performed are reading several large files, filtering out only certain columns/rows, joining with other sources of data, adding calculated values and writing the result as another .tsv.
The plain text is used for its robustness, longevity and self-documenting character. Storing the data in another format is not an option, it has to stay open and easy to process. There is a lot of data (tens of TBs), and it is not affordable to load a copy into a relational database (we would have to buy twice as much storage space).
Since I am mostly doing selects and joins, I realized I basically need a database engine with .tsv based backing store. I do not care about transactions, since my data is all write-once-read-many. I need to process the data in-place, without a major conversion step and data cloning.
As there is a lot of data to be queried this way, I need to process it efficiently, utilizing caching and a grid of computers.
Does anyone know of a system that would provide database-like capabilities, while using plain tab-separated files as backend? It seems to me like a very generic problem, that virtually all scientists get to deal with in one way or the other.
There is a lot of data (tens of TBs), and it is not affordable to load a copy into a relational database (we would have to buy twice as much storage space).
You know your requirements better than any of us, but I would suggest you think again about this. If you have 16-bit integers (0-65535) stored in a .tsv file, your storage efficiency is about 33%: it takes up to 5 bytes to store a 16-bit integer as text, plus a delimiter, for 6 bytes total, whereas the native integer takes 2 bytes. For floating-point data the efficiency is even worse.
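A quick way to see the arithmetic (Python, purely illustrative):

    import struct

    value = 54321
    as_text = f"{value}\t".encode()       # b'54321\t' -> 6 bytes on disk
    as_binary = struct.pack("<H", value)  # native 16-bit integer -> 2 bytes
    print(len(as_text), len(as_binary))   # 6 2, i.e. ~33% storage efficiency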
I would consider taking the existing data and, instead of storing it raw, processing it in the following two ways (a rough sketch follows below):
Store it compressed in a well-known compression format (e.g. gzip or bzip2) onto your permanent archiving media (backup servers, tape drives, whatever), so that you retain the advantages of the .tsv format.
Process it into a database which has good storage efficiency. If the files have a fixed and rigorous format (e.g. column X is always a string, column Y is always a 16-bit integer), then you're probably in good shape. Otherwise, a NoSQL database might be better (see Stefan's answer).
This would create an auditable (but perhaps slowly accessible) archive with low risk of data loss, and a quickly-accessible database that doesn't need to be concerned with losing the source data, since you can always re-read it into the database from the archive.
This should reduce your storage space; you should not need twice as much, as you feared.
Indexing is going to be the hard part; you'd better have a good idea of what subset of the data you need to be able to query efficiently.
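A minimal sketch of the two-pronged approach in Python (file, table, and column names are hypothetical; SQLite stands in for "a database with good storage efficiency" purely as an illustration):

    import csv
    import gzip
    import shutil
    import sqlite3

    # 1. Archive: a compressed copy keeps the .tsv's openness and longevity.
    with open("results.tsv", "rb") as src, gzip.open("results.tsv.gz", "wb") as dst:
        shutil.copyfileobj(src, dst)

    # 2. Query store: typed columns, indexed on whatever you filter/join by.
    db = sqlite3.connect("results.db")
    db.execute("CREATE TABLE results (sample TEXT, reading INTEGER)")
    with open("results.tsv", newline="") as f:
        rows = ((s, int(v)) for s, v in csv.reader(f, delimiter="\t"))
        db.executemany("INSERT INTO results VALUES (?, ?)", rows)
    db.execute("CREATE INDEX idx_sample ON results (sample)")
    db.commit()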
One of these NoSQL DBs might work. I highly doubt any are configurable to sit on top of flat, delimited files. You might look at one of the open-source projects and write your own database layer.
Scalability begins at a point beyond tab-separated ASCII.
Just be practical - don't academicise it - convention frees your fingers as well as your mind.
I would upvote Jason's recommendation if I had the reputation. My only addition is that if you do not store it in a different format, like the database Jason was suggesting, you pay the parsing cost on every operation instead of just once, when you initially process it.
You can do this with LINQ to Objects if you are in a .NET environment: streaming/deferred execution, a functional programming model, and all of the SQL operators. The joins work in a streaming model, but one side of the join gets pulled into memory, so you need a large-table-joined-to-smaller-table situation.
The ease of shaping the data and the ability to write your own expressions would really shine in a scientific application.
LINQ against a delimited text file is a common demonstration of LINQ. You need to provide the ability to feed LINQ a tabular model. Google "LINQ for text files" for some examples (e.g., see http://www.codeproject.com/KB/linq/Linq2CSV.aspx, http://www.thereforesystems.com/tutorial-reading-a-text-file-using-linq/, etc.).
Expect a learning curve, but it's a good solution for your problem. One of the best treatments of the subject is Jon Skeet's C# in Depth. Pick up the "MEAP" version from Manning for early access to his latest edition.
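The same streaming-join idea, sketched in Python rather than LINQ (file names and column positions are hypothetical): the small table is pulled into memory as a lookup, while the large .tsv is streamed row by row.

    import csv

    # Small side of the join: loaded entirely into a dict.
    with open("samples.tsv", newline="") as f:
        samples = {row[0]: row[1] for row in csv.reader(f, delimiter="\t")}

    # Large side: streamed, never fully in memory.
    with open("readings.tsv", newline="") as f, \
         open("joined.tsv", "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        for row in csv.reader(f, delimiter="\t"):
            key = row[0]
            if key in samples:          # join + filter in one pass
                writer.writerow(row + [samples[key]])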
I've done work like this before with large mailing lists that needed to be cleansed, deduped and appended. You are invariably IO-bound. Try solid-state drives, particularly Intel's "E" series, which has very fast write performance, and RAID them as parallel as possible. We also used grids, but had to adjust the algorithms to multi-pass approaches that would reduce the data.
Note I would agree with the other answers that stress loading into a database and indexing if the data is very regular. In that case, you're basically doing ETL, which is a well-understood problem in the warehousing community. If, however, the data is ad hoc, with scientists just dropping their results in a directory, you need "agile/just-in-time" transformations, and most transformations are single-pass select ... where ... join, then you're approaching it the right way.
You can do this with VelocityDB. It is very fast at reading tab-separated data into C# objects and databases. The entire Wikipedia text is a 33 GB XML file; it takes 18 minutes to read in and persist as objects (one per Wikipedia topic) and store in compact databases. Many samples showing how to read in tab-separated text files are included in the download.
The question's already been answered, and I agree with the bulk of the statements.
At our centre, we have a standard talk we give, "so you have 40TB of data", as scientists newly find themselves in this situation all the time now. The talk is nominally about visualization, but primarily about managing large amounts of data for those who are new to it. The basic points we try to get across:
Plan your I/O:
- Use binary files
- Use large files, as much as possible
- Use file formats that can be read in parallel, with subregions extracted (a sketch follows below)
- Avoid zillions of files
- Especially avoid zillions of files in a single directory
Data management must scale:
- Include metadata for provenance
- Reduce the need to re-do work
- Practice sensible data management
- Use a hierarchy of data directories only if that will always work
- Prefer databases and formats that allow metadata
Use scalable, automatable tools:
- For large data sets, parallel tools: ParaView, VisIt, etc.
- Scriptable tools: gnuplot, Python, R, ParaView/VisIt...
- Scripts provide reproducibility!
We have a fair amount of stuff on large-scale I/O generally, as this is an increasingly common stumbling block for scientists.
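As one concrete instance of "file formats that can be read in parallel, subregions extracted": HDF5 via h5py, sketched below (the file and dataset names are hypothetical; any similar binary format works). Reading a slice touches only that part of the file, never the whole dataset.

    import h5py

    with h5py.File("simulation.h5", "r") as f:
        dset = f["temperature"]               # large on-disk 2-D array
        subregion = dset[1000:2000, 500:600]  # only this slab is read from disk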
I have a system which is receiving log files from different places over HTTP (>10k producers, 10 logs per day, ~100 lines of text each).
I would like to store them to be able to compute misc. statistics over them nightly, and export them (ordered by date of arrival or first-line content)...
My question is: what's the best way to store them?
Flat text files (with proper locking), one file per uploaded file, one directory per day/producer
Flat text files, one (big) file per day for all producers (problem here will be indexing and locking)
Database table with text (MySQL is preferred for internal reasons) (problem with DB purges, as DELETE can be very slow!)
Database Table with one record per line of text
Database with sharding (one table per day), allowing simple data purging. (This is partitioning; however, the version of MySQL I have access to, i.e. the one supported internally, does not support it.)
Document-based DB à la CouchDB or MongoDB (problems could be with indexing / maturity / speed of ingestion)
Any advice ?
(Disclaimer: I work on MongoDB.)
I think MongoDB is the best solution for logging. It is blazingly fast; it can probably insert data faster than you can send it. You can do interesting queries on the data (e.g., ranges of dates or log levels) and index any field or combination of fields. It's also nice because you can add more fields to logs on the fly ("oops, we want a stack trace field for some of these") and it won't cause problems (as it would with flat text files).
As far as stability goes, a lot of people are already using MongoDB in production (see http://www.mongodb.org/display/DOCS/Production+Deployments). We just have a few more features we want to add before we go to 1.0.
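A minimal sketch of what that looks like from Python with pymongo (field and collection names are hypothetical): schemaless inserts, plus an index on the fields you query by.

    from datetime import datetime, timezone
    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017")["logs"]
    # Index the fields used for nightly stats and exports.
    db.entries.create_index([("producer", 1), ("received", 1)])

    db.entries.insert_one({
        "producer": "host-042",
        "received": datetime.now(timezone.utc),
        "lines": ["first line", "second line"],
        # Extra fields can be added later without any migration:
        "stack_trace": None,
    })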
I'd pick the very first solution.
I don't see why you would need a DB at all. It seems like all you need is to scan through the data. Keep the logs in the most "raw" state, then process them, and then create a tarball for each day.
The only reason to aggregate would be to reduce the number of files. On some file systems, if you put more than N files in a directory, performance degrades rapidly. Check your filesystem, and if that's the case, organize a simple two-level hierarchy, say, using the first two digits of the producer ID as the first-level directory name.
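A tiny sketch of that layout (IDs and paths are hypothetical): the first two digits of the producer ID fan the files out so no single directory holds too many entries.

    import os

    def log_path(root: str, producer_id: str, filename: str) -> str:
        # e.g. /logs/20240115/10/1042/upload_007.log
        return os.path.join(root, producer_id[:2], producer_id, filename)

    print(log_path("/logs/20240115", "1042", "upload_007.log"))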
I would write one file per upload, and one directory/day as you first suggested. At the end of the day, run your processing over the files, and then tar.bz2 the directory.
The tarball will still be searchable, and will likely be quite small as logs can usually compress quite well.
For total data, you are talking about roughly 1 GB a day uncompressed (10 million lines at ~100 bytes each). This will likely compress to 100 MB or less; I've seen 200x compression on my log files with bzip2. You could easily store the compressed data on a file system for years without any worries. For additional processing you can write scripts which search the compressed tarball and generate more stats.
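A rough sketch of the end-of-day step in Python (the paths are hypothetical): build the day's tar.bz2, and later search inside it without unpacking anything to disk.

    import tarfile

    # Compress the finished day.
    with tarfile.open("logs-20240115.tar.bz2", "w:bz2") as tar:
        tar.add("logs/20240115")

    # Search the archive later, streaming each member.
    with tarfile.open("logs-20240115.tar.bz2", "r:bz2") as tar:
        for member in tar:
            if member.isfile():
                f = tar.extractfile(member)
                for line in f:
                    if b"ERROR" in line:
                        print(member.name, line.decode().rstrip())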
Since you would like to store them to be able to compute misc. statistics over them nightly and export them (ordered by date of arrival or first-line content), and you're expecting 100,000 files a day at a total of 10,000,000 lines:
I'd suggest:
Store all the files as regular text files using the following layout: yyyymmdd/producerid/fileno.
At the end of the day, clear the database and load all the text files for the day.
After loading the files, it would be easy to get the stats from the database, and post them in any format needed. (maybe even another "stats" database). You could also generate graphs.
To save space, you could compress the daily folder. Since they're text files, they would compress well.
So you would only be using the database to be able to easily aggregate the data. You could also reproduce the reports for an older day if the process didn't work, by going through the same steps.
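A minimal sketch of that nightly cycle (Python with mysql-connector-python assumed; the paths, table, and credentials are hypothetical):

    import glob
    import mysql.connector

    conn = mysql.connector.connect(
        user="stats", password="...", database="logs"
    )
    cur = conn.cursor()

    # Fresh table for the day; yesterday's load is cleared first.
    cur.execute("DROP TABLE IF EXISTS daily_lines")
    cur.execute(
        "CREATE TABLE daily_lines (producer VARCHAR(32), line TEXT)"
    )

    # Layout from above: yyyymmdd/producerid/fileno
    for path in glob.glob("/logs/20240115/*/*"):
        producer = path.split("/")[-2]
        with open(path, encoding="utf-8", errors="replace") as f:
            rows = [(producer, line.rstrip("\n")) for line in f]
        cur.executemany(
            "INSERT INTO daily_lines (producer, line) VALUES (%s, %s)", rows
        )
    conn.commit()

    # Stats come straight out of SQL once everything is loaded.
    cur.execute("SELECT producer, COUNT(*) FROM daily_lines GROUP BY producer")
    print(cur.fetchall())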
In my experience, a single large table performs much faster than several linked tables, if we're talking about a database solution, particularly on write and delete operations. For example, splitting one table into three linked tables can decrease performance 3-5 times. This is very rough; of course it depends on details, but generally this is the risk, and it gets worse as data volumes get very large. The best way, IMO, to store log data is not as flat text but in a structured form, so that you can do efficient queries and formatting later. Managing log files can be a pain, especially when there are lots of them coming from many sources and locations. Check out our solution; IMO it can save you lots of development time.