NZ_Migrate -status seems incorrect - netezza

quick question,
One of the arguments for NZ_Migrate is -status which shows you the bytes moved, an average, and time elapsed.
However, I have seen this multiple times now and am witnessing it right now, I am moving a table from our production server to our dev server and the table size is 75GB used, 76GB allocated. I am watching NZ_Migrate scroll right now with status updates and it is still inserting into the table but it is showing a total of 242,580,717,568 moved so far, which is way over the table size. Am I missing something? Does it move something over besides just the table?

The data size reported for a table by "used" and "allocated" represent data that has been compressed.
There are two modes of transfer for nz_migrate (-format ascii|binary ) with ascii being the default.
In ascii mode the 75GB of data will be transferred uncompressed, so if we go with the Netezza rule of thumb for compression you might see 75*4 GB actually transferred.
If I run nz_migrate on a sample table (4,839 MB used) local to my system as in binary mode, -stats reports about what you'd expect ( 4,931,452,928 bytes ) right before it ends.
If I do this with ascii mode, status reports well over 11,612,192,768 bytes along the way).
As a side note, if your prod and dev server are the same size then you may greatly benefit from using binary mode.

Related

Why does SQLite store hundreds of null bytes?

In a database I'm creating, I was curious why the size was so much larger than the contents, and checked out the hex code. In a 4 kB file (single row as a test), there are two major chunks that are roughly 900 and 1000 bytes, along with a couple smaller ones that are all null bytes 0x0
I can't think of any logical reason it would be advantageous to store thousands of null bytes, increasing the size of the database significantly.
Can someone explain this to me? I've tried searching, and haven't been able to find anything.
The structure of a SQLite database file (`*.sqlite) is described in this page:
https://www.sqlite.org/fileformat.html
SQLite files are partitioned into "pages" which are between 512 and 65536 bytes long - in your case I imagine the page size is probably 1KiB. If you're storing data that's smaller than 1KiB (as you are with your single test row, which I imagine is maybe 100 bytes long?) then that leaves 900 bytes left - and unused (deallocated) space is usually zeroed-out before (and after) use.
It's the same way computer working memory (RAM) works - as RAM also uses paging.
I imagine you expected the file to be very compact with a terse internal representation; this is the case with some file formats - such as old-school OLE-based Office documents but others (and especially database files) require a different file layout that is optimized simultaneously for quick access, quick insertion of new data, and is also arranged to help prevent internal fragmentation - this comes at the cost of some wasted space.
A quick thought-experiment will demonstrate why mutable (i.e. non-read-only) databases cannot use a compact internal file structure:
Think of a single database table as being like a CSV file (and CSVs themselves are compact enough with very little wasted space).
You can INSERT new rows by appending to the end of the file.
You can DELETE an existing row by simply overwriting the row's space in the file with zeroes. Note that you cannot actually "delete" the space by "moving" data (like using the Backspace key in Notepad) because that means copying all of the data in the file around - this is largely a bad idea.
You can UPDATE a row by checking to see if the new row's width will fit in the current space (and overwrite the remaining space with zeros), or if not, then append a new row at the end and overwrite the existing row (a-la INSERT then DELETE)
But what if you have two database tables (with different columns) and need to store them in the same file? One approach is to simply mix each table's rows in the same flat file - but for other reasons that's a bad idea. So instead, inside your entire *.sqlite file, you create "sub-files", that have a known, fixed size (e.g. 4KiB) that store only rows for a single table until the sub-file is full; they also store a pointer (like a linked-list) to the next sub-file that contains the rest of the data, if any. Then you simply create new sub-files as you need more space inside the file and set-up their next-file pointers. These sub-files are what a "page" is in a database file, and is how you can have multiple read/write database tables contained within the same parent filesystem file.
Then in addition to these pages to store table data, you also need to store the indexes (which is what allows you to locate a table row near-instantly without needing to scan the entire table or file) and other metadata, such as the column-definitions themselves - and often they're stored in pages too. Relational (tabular) database files can be considered filesystems in their own right (just encapsulated in a parent filesystem... which could be inside a *.vhd file... which could be buried inside a varbinary database column... inside another filesystem), and even the database systems themselves have been compared to operating-systems (as they offer an environment for programs (stored procedures) to run, they offer IO services, and so on - it's almost circular if you look at the old COBOL-based mainframes from the 1970s when all of your IO operations were restricted to just computer record management operations (insert, update, delete).

SSIS 2012, reading (small) nvarchar(max), sluggish performance

I'm reading some data from a SQL Server 2012. The source table has some 600 million rows, and two of the headache columns are nvarchar(max).
Problem: The performance drops from 100k rows per second to 800 per second when including these blob columns. Is there some way in make things go faster with blobs in SSIS?
Max datalength for the varchar(max) is 250 bytes and avg datalength is 50 bytes. The content is XML stored as text. I could possibly cast them to a nvarchar(2000), but I need later to work with a table with bigger xml-text objects.
The destination is simply a RowCount.
What I see is
SSIS works on the data in chunks. While SSIS is working, the source server is idle: No CPU, IO or Pagelookups. I don't know what SSIS is doing internally.
I have 40 GB of RAM on the server, SSD disks, should be enough resources.
I've tried to set the Blob and Buffer Storage Path to a explicit folder, but I can't see anything going on there.
The working set of the SSIS package fine, nothing going on there. It consumes only some 300-400 MB, that's it.
No paging going on.
I'm using the SQLNCLI11.1 provider, 32K packet size.
Networking is fine, I've ruled out that part. The waitstats on the source server says vaguly it's waiting for network, but that's simple because the client app is not consuming anything. Again, if I leave out the blob column the network delivers 400 mbpbs constantly (1 gb network ).
It seems to me as SSIS simply doesn't handle blobs very well. It's very very conservative (because it cannot know how much space the data will occupy from the buffer). From the doc I can read the SSIS handles blobs differently from ordinairy columns, that they may be stored on disk, read back to memory again and then sent down the pipeline. Whatever happens, the performance is simply a disaster.
I've played around with the buffer size and maxbufferrows, it doesn't really make that much difference. The average performance is crap whatever I choose to do.
I've googled "everything", this is my last resort. What have I missed so far?

How to efficiently utilize 10+ computers to import data

We have flat files (CSV) with >200,000,000 rows, which we import into a star schema with 23 dimension tables. The biggest dimension table has 3 million rows. At the moment we run the importing process on a single computer and it takes around 15 hours. As this is too long time, we want to utilize something like 40 computers to do the importing.
My question
How can we efficiently utilize the 40 computers to do the importing. The main worry is that there will be a lot of time spent replicating the dimension tables across all the nodes as they need to be identical on all nodes. This could mean that if we utilized 1000 servers to do the importing in the future, it might actually be slower than utilize a single one, due to the extensive network communication and coordination between the servers.
Does anyone have suggestion?
EDIT:
The following is a simplification of the CSV files:
"avalue";"anothervalue"
"bvalue";"evenanothervalue"
"avalue";"evenanothervalue"
"avalue";"evenanothervalue"
"bvalue";"evenanothervalue"
"avalue";"anothervalue"
After importing, the tables look like this:
dimension_table1
id name
1 "avalue"
2 "bvalue"
dimension_table2
id name
1 "anothervalue"
2 "evenanothervalue"
Fact table
dimension_table1_ID dimension_table2_ID
1 1
2 2
1 2
1 2
2 2
1 1
You could consider using a 64bit hash function to produce a bigint ID for each string, instead of using sequential IDs.
With 64-bit hash codes, you can store 2^(32 - 7) or over 30 million items in your hash table before there is a 0.0031% chance of a collision.
This would allow you to have identical IDs on all nodes, with no communication whatsoever between servers between the 'dispatch' and the 'merge' phases.
You could even increase the number of bits to further lower the chance of collision; only, you would not be able to make the resultant hash fit in a 64bit integer database field.
See:
http://en.wikipedia.org/wiki/Fowler_Noll_Vo_hash
http://code.google.com/p/smhasher/wiki/MurmurHash
http://www.partow.net/programming/hashfunctions/index.html
Loading CSV data into a database is slow because it needs to read, split and validate the data.
So what you should try is this:
Setup a local database on each computer. This will get rid of the network latency.
Load a different part of the data on each computer. Try to give each computer the same chunk. If that isn't easy for some reason, give each computer, say, 10'000 rows. When they are done, give them the next chunk.
Dump the data with the DB tools
Load all dumps into a single DB
Make sure that your loader tool can import data into a table which already contains data. If you can't do this, check your DB documentation for "remote table". A lot of databases allow to make a table from another DB server visible locally.
That allows you to run commands like insert into TABLE (....) select .... from REMOTE_SERVER.TABLE
If you need primary keys (and you should), you will also have the problem to assign PKs during the import into the local DBs. I suggest to add the PKs to the CSV file.
[EDIT] After checking with your edits, here is what you should try:
Write a small program which extract the unique values in the first and second column of the CSV file. That could be a simple script like:
cut -d";" -f1 | sort -u | nawk ' { print FNR";"$0 }'
This is a pretty cheap process (a couple of minutes even for huge files). It gives you ID-value files.
Write a program which reads the new ID-value files, caches them in memory and then reads the huge CSV files and replaces the values with the IDs.
If the ID-value files are too big, just do this step for the small files and load the huge ones into all 40 per-machine DBs.
Split the huge file into 40 chunks and load each of them on each machine.
If you had huge ID-value files, you can use the tables created on each machine to replace all the values that remained.
Use backup/restore or remote tables to merge the results.
Or, even better, keep the data on the 40 machines and use algorithms from parallel computing to split the work and merge the results. That's how Google can create search results from billions of web pages in a few milliseconds.
See here for an introduction.
This is a very generic question and does not take the database backend into account. Firing with 40 or 1000 machines on a database backend that can not handle the load will give you nothing. Such a problem is truly to broad to answer it in a specific way..you should get in touch with people inside your organization with enough skills on the DB level first and then come back with a more specific question.
Assuming N computers, X files at about 50GB files each, and a goal of having 1 database containing everything at the end.
Question: It takes 15 hours now. Do you know which part of the process is taking the longest? (Reading data, cleansing data, saving read data in tables, indexing… you are inserting data into unindexed tables and indexing after, right?)
To split this job up amongst the N computers, I’d do something like (and this is a back-of-the-envelope design):
Have a “central” or master database. Use this to mangae the overall process, and to hold the final complete warehouse.
It contains lists of all X files and all N-1 (not counting itself) “worker” databases
Each worker database is somehow linked to the master database (just how depends on RDBMS, which you have not specified)
When up and running, a "ready" worker database polls the master database for a file to process. The master database dolls out files to worker systems, ensuring that no file gets processed by more than one at a time. (Have to track success/failure of loading a given file; watch for timeouts (worker failed), manage retries.)
Worker database has local instance of star schema. When assigned a file, it empties the schema and loads the data from that one file. (For scalability, might be worth loading a few files at a time?) “First stage” data cleansing is done here for the data contained within that file(s).
When loaded, master database is updated with a “ready flagy” for that worker, and it goes into waiting mode.
Master database has it’s own to-do list of worker databases that have finished loading data. It processes each waiting worker set in turn; when a worker set has been processed, the worker is set back to “check if there’s another file to process” mode.
At start of process, the star schema in the master database is cleared. The first set loaded can probably just be copied over verbatim.
For second set and up, have to read and “merge” data – toss out redundant entries, merge data via conformed dimensions, etc. Business rules that apply to all the data, not just one set at a time, must be done now as well. This would be “second stage” data cleansing.
Again, repeat the above step for each worker database, until all files have been uploaded.
Advantages:
Reading/converting data from files into databases and doing “first stage” cleansing gets scaled out across N computers.
Ideally, little work (“second stage”, merging datasets) is left for the master database
Limitations:
Lots of data is first read into worker database, and then read again (albeit in DBMS-native format) across the network
Master database is a possible chokepoint. Everything has to go through here.
Shortcuts:
It seems likely that when a workstation “checks in” for a new file, it can refresh a local store of data already loaded in the master and add data cleansing considerations based on this to its “first stage” work (i.e. it knows code 5484J has already been loaded, so it can filter it out and not pass it back to the master database).
SQL Server table partitioning or similar physical implementation tricks of other RDBMSs could probably be used to good effect.
Other shortcuts are likely, but it totally depends upon the business rules being implemented.
Unfortunately, without further information or understanding of the system and data involved, one can’t tell if this process would end up being faster or slower than the “do it all one one box” solution. At the end of the day it depends a lot on your data: does it submit to “divide and conquer” techniques, or must it all be run through a single processing instance?
The simplest thing is to make one computer responsible for handing out new dimension item id's. You can have one for each dimension. If the dimension handling computers are on the same network, you can have them broadcast the id's. That should be fast enough.
What database did you plan on using with a 23-dimensional starscheme? Importing might not be the only performance bottleneck. You might want to do this in a distributed main-memory system. That avoids a lot of the materalization issues.
You should investigate if there are highly correlating dimensions.
In general, with a 23 dimensional star scheme with large dimensions a standard relational database (SQL Server, PostgreSQL, MySQL) is going to perform extremely bad with datawarehouse questions. In order to avoid having to do a full table scan, relational databases use materialized views. With 23 dimensions you cannot afford enough of them. A distributed main-memory database might be able to do full table scans fast enough (in 2004 I did about 8 million rows/sec/thread on a Pentium 4 3 GHz in Delphi). Vertica might be an other option.
Another question: how large is the file when you zip it? That provides a good first order estimate of the amount of normalization you can do.
[edit] I've taken a look at your other questions. This does not look like a good match for PostgreSQL (or MySQL or SQL server). How long are you willing to wait for query results?
Rohita,
I'd suggest you eliminate a lot of the work from the load by sumarising the data FIRST, outside of the database. I work in a Solaris unix environment. I'd be leaning towards a korn-shell script, which cuts the file up into more managable chunks, then farms those chunks out equally to my two OTHER servers. I'd process the chunks using a nawk script (nawk has an efficient hashtable, which they call "associative arrays") to calculate the distinct values (the dimensions tables) and the Fact table. Just associate each new-name-seen with an incrementor-for-this-dimension, then write the Fact.
If you do this through named pipes you can push, process-remotely, and readback-back the data 'on the fly' while the "host" computer sits there loading it straight into tables.
Remember, No matter WHAT you do with 200,000,000 rows of data (How many Gig is it?), it's going to take some time. Sounds like you're in for some fun. It's interesting to read how other people propose to tackle this problem... The old adage "there's more than one way to do it!" has never been so true. Good luck!
Cheers. Keith.
On another note you could utilize Windows Hyper-V Cloud Computing addon for Windows Server:http://www.microsoft.com/virtualization/en/us/private-cloud.aspx
It seems that your implementation is very inefficient as it's loading at the speed of less than 1 MB/sec (50GB/15hrs).
Proper implementation on a modern single server (2x Xeon 5690 CPUs + RAM that's enough for ALL dimensions loaded in hash tables + 8GB ) should give you at least 10 times better speed i.e at least 10MB/sec.

How to instantly query a 64Go database

Ok everyone, I have an excellent challenge for you. Here is the format of my data :
ID-1 COL-11 COL-12 ... COL-1P
...
ID-N COL-N1 COL-N2 ... COL-NP
ID is my primary key and index. I just use ID to query my database. The datamodel is very simple.
My problem is as follow:
I have 64Go+ of data as defined above and in a real-time application, I need to query my database and retrieve the data instantly. I was thinking about 2 solutions but impossible to set up.
First use sqlite or mysql. One table is needed with one index on ID column. The problem is that the database will be too large to have good performance, especially for sqlite.
Second is to store everything in memory into a huge hashtable. RAM is the limit.
Do you have another suggestion? How about to serialize everything on the filesystem and then, at each query, store queried data into a cache system?
When I say real-time, I mean about 100-200 query/second.
A thorough answer would take into account data access patterns. Since we don't have these, we just have to assume equal probably distribution that a row will be accessed next.
I would first try using a real RDBMS, either embedded or local server, and measure the performance. If this this gives 100-200 queries/sec then you're done.
Otherwise, if the format is simple, then you could create a memory mapped file and handle the reading yourself using a binary search on the sorted ID column. The OS will manage pulling pages from disk into memory, and so you get free use of caching for frequently accessed pages.
Cache use can be optimized more by creating a separate index, and grouping the rows by access pattern, such that rows that are often read are grouped together (e.g. placed first), and rows that are often read in succession are placed close to each other (e.g. in succession.) This will ensure that you get the most back for a cache miss.
Given the way the data is used, you should do the following:
Create a record structure (fixed size) that is large enough to contain one full row of data
Export the original data to a flat file that follows the format defined in step 1, ordering the data by ID (incremental)
Do a direct access on the file and leave caching to the OS. To get record number N (0-based), you multiply N by the size of a record (in byte) and read the record directly from that offset in the file.
Since you're in read-only mode and assuming you're storing your file in a random access media, this scales very well and it doesn't dependent on the size of the data: each fetch is a single read in the file. You could try some fancy caching system but I doubt this would gain you much in terms of performance unless you have a lot of requests for the same data row (and the OS you're using is doing poor caching). make sure you open the file in read-only mode, though, as this should help the OS figure out the optimal caching mechanism.

How do databases deal with data tables that cannot fit in memory?

Suppose you have a really large table, say a few billion unordered rows, and now you want to index it for fast lookups. Or maybe you are going to bulk load it and order it on the disk with a clustered index. Obviously, when you get to a quantity of data this size you have to stop assuming that you can do things like sorting in memory (well, not without going to virtual memory and taking a massive performance hit).
Can anyone give me some clues about how databases handle large quantities of data like this under the hood? I'm guessing there are algorithms that use some form of smart disk caching to handle all the data but I don't know where to start. References would be especially welcome. Maybe an advanced databases textbook?
Multiway Merge Sort is a keyword for sorting huge amounts of memory
As far as I know most indexes use some form of B-trees, which do not need to have stuff in memory. You can simply put nodes of the tree in a file, and then jump to varios position in the file. This can also be used for sorting.
Are you building a database engine?
Edit: I built a disc based database system back in the mid '90's.
Fixed size records are the easiest to work with because your file offset for locating a record can be easily calculated as a multiple of the record size. I also had some with variable record sizes.
My system needed to be optimized for reading. The data was actually stored on CD-ROM, so it was read-only. I created binary search tree files for each column I wanted to search on. I took an open source in-memory binary search tree implementation and converted it to do random access of a disc file. Sorted reads from each index file were easy and then reading each data record from the main data file according to the indexed order was also easy. I didn't need to do any in-memory sorting and the system was way faster than any of the available RDBMS systems that would run on a client machine at the time.
For fixed record size data, the index can just keep track of the record number. For variable length data records, the index just needs to store the offset within the file where the record starts and each record needs to begin with a structure that specifies it's length.
You would have to partition your data set in some way. Spread out each partition on a separate server's RAM. If I had a billion 32-bit int's - thats 32 GB of RAM right there. And thats only your index.
For low cardinality data, such as Gender (has only 2 bits - Male, Female) - you can represent each index-entry in less than a byte. Oracle uses a bit-map index in such cases.
Hmm... Interesting question.
I think that most used database management systems using operating system mechanism for memory management, and when physical memory ends up, memory tables goes to swap.

Resources