Why is PostgreSQL eating up all my precious HD space?

Why is PostgreSQL eating up all my precious HD space? - database

I just finished transferring as much link-structure data concerning wikipedia (English) as I could. Basically, I downloaded a bunch of SQL dumps from wikipedia's latest dump repository. Since I am using PostgreSQL instead of MySQL, I decided to load all these dumps into my db using pipeline shell commands.
Anyway, one of these tables has 295 million rows: the pagelinks table; it contains all intra-wiki hyperlinks. From my laptop, using pgAdmin III, I sent the following command to my database server (another computer):
SELECT pl_namespace, COUNT(*) FROM pagelinks GROUP BY (pl_namespace);
Its been at it for an hour or so now. The thing is that the postmaster seems to be eating up more and more of my very limited HD space. I think it ate about 20 GB as of now. I had previously played around with the postgresql.conf file in order to give it more performance flexibility (i.e. let it use more resources) for it is running with 12 GB of RAM. I think I basically quadrupled most bytes and such related variables of this file thinking it would use more RAM to do its thing.
However, the db does not seem to use much RAM. Using the Linux system monitor, I am able to see that the postmaster is using 1.6 GB of shared memory (RAM). Anyway, I was wondering if you guys could help me better understand what it is doing for it seems that I really do not understand how PostgreSQL uses HD resources.
Concerning the metastructure of wikipedia databases, they provide a good schema that may be of use or even but of interest to you.
Feel free to ask me for more details, thx.

It's probably the GROUP BY that's causing the problem. In order to do grouping, the database has to sort the rows to put duplicate items together. An index probably won't help. A back-of-the-envelope calculation:
Assuming each row takes 100 bytes of space, that's 29,500,000,000 bytes, or about 30GB of storage. It can't fit all that in memory, so your system is thrashing, which slows operations down by a factor of 1000 or more. Your HD space may be disappearing into swap space, if it's using swap files.
If you only need to do this calculation once, try breaking it apart into smaller subsets of the data. Assuming pl_namespace is numeric and ranges from 1-295million, try something like this:
SELECT pl_namespace, COUNT(*)
FROM pagelinks
WHERE pl_namespace between 1 and 50000000
GROUP BY (pl_namespace);
Then do the same for 50000001-100000000 and so forth. Combine your answers together using UNION or simply tabulate the results with an external program. Forget what I wrote about an index not helping GROUP BY; here, an index will help the WHERE clause.

What exactly is claiming that it's only taking 9.5MB of RAM? That sounds unlikely to me - the shared memory almost certainly is RAM which is being shared between different Postgres processes. (From what I remember, each client ends up as a separate process, although it's been a while so I could be very wrong.)
Do you have an index on the pl_namespace column? If there's an awful lot of distinct results, I could imagine that query being pretty heavy on a 295 million row table with no index. Having said that, 10GB is an awful lot to swallow. Do you know which files it's writing to?

Ok so here is the gist of it:
the GROUP BY clause made the index' invalid, so the postmaster (postgresql server process) decided to create a bunch of tables (23GB of tables) that were located in the directory $PGDATA/base/16384/pgsql_tmp.
When modifying the postgresql.conf file, I had given permission to postgreSQL to use 1.6 GB of RAM (which I will now double for it has access to 11.7 GB of RAM); the postmaster process was indeed using up 1.6 GB of RAM, but that wasn't enough, thus the pgsql_tmp directory.
As was pointed out by Barry Brown, since I was only executing this SQL command to get some statistical information about the distribution of the links among the pagelinks.namespaces, I could have queried a subset of the 296 million pagelinks (this is what they do for surveys).
When the command returned the result set, all temporary tables were automatically deleted as if nothing had happened.
Thx for your help guys!

Related

How to store 7.3 billion rows of market data (optimized to be read)?

I have a dataset of 1 minute data of 1000 stocks since 1998, that total around (2012-1998)*(365*24*60)*1000 = 7.3 Billion rows.
Most (99.9%) of the time I will perform only read requests.
What is the best way to store this data in a db?
1 big table with 7.3B rows?
1000 tables (one for each stock symbol) with 7.3M rows each?
any recommendation of database engine? (I'm planning to use Amazon RDS' MySQL)
I'm not used to deal with datasets this big, so this is an excellent opportunity for me to learn. I will appreciate a lot your help and advice.
Edit:
This is a sample row:
'XX', 20041208, 938, 43.7444, 43.7541, 43.735, 43.7444, 35116.7, 1, 0, 0
Column 1 is the stock symbol, column 2 is the date, column 3 is the minute, the rest are open-high-low-close prices, volume, and 3 integer columns.
Most of the queries will be like "Give me the prices of AAPL between April 12 2012 12:15 and April 13 2012 12:52"
About the hardware: I plan to use Amazon RDS so I'm flexible on that

So databases are for situations where you have a large complicated schema that is constantly changing. You only have one "table" with a hand-full of simple numeric fields. I would do it this way:
Prepare a C/C++ struct to hold the record format:
struct StockPrice
{
char ticker_code[2];
double stock_price;
timespec when;
etc
};
Then calculate sizeof(StockPrice[N]) where N is the number of records. (On a 64-bit system) It should only be a few hundred gig, and fit on a $50 HDD.
Then truncate a file to that size and mmap (on linux, or use CreateFileMapping on windows) it into memory:
//pseduo-code
file = open("my.data", WRITE_ONLY);
truncate(file, sizeof(StockPrice[N]));
void* p = mmap(file, WRITE_ONLY);
Cast the mmaped pointer to StockPrice*, and make a pass of your data filling out the array. Close the mmap, and now you will have your data in one big binary array in a file that can be mmaped again later.
StockPrice* stocks = (StockPrice*) p;
for (size_t i = 0; i < N; i++)
{
stocks[i] = ParseNextStock(stock_indata_file);
}
close(file);
You can now mmap it again read-only from any program and your data will be readily available:
file = open("my.data", READ_ONLY);
StockPrice* stocks = (StockPrice*) mmap(file, READ_ONLY);
// do stuff with stocks;
So now you can treat it just like an in-memory array of structs. You can create various kinds of index data structures depending on what your "queries" are. The kernel will deal with swapping the data to/from disk transparently so it will be insanely fast.
If you expect to have a certain access pattern (for example contiguous date) it is best to sort the array in that order so it will hit the disk sequentially.

I have a dataset of 1 minute data of 1000 stocks [...] most (99.9%) of the time I will perform only read requests.
Storing once and reading many times time-based numerical data is a use case termed "time series". Other common time series are sensor data in the Internet of Things, server monitoring statistics, application events etc.
This question was asked in 2012, and since then, several database engines have been developing features specifically for managing time series. I've had great results with the InfluxDB, which is open sourced, written in Go, and MIT-licensed.
InfluxDB has been specifically optimized to store and query time series data. Much more so than Cassandra, which is often touted as great for storing time series:
Optimizing for time series involved certain tradeoffs. For example:
Updates to existing data are a rare occurrence and contentious updates never happen. Time series data is predominantly new data that is never updated.
Pro: Restricting access to updates allows for increased query and write performance
Con: Update functionality is significantly restricted
In open sourced benchmarks,
InfluxDB outperformed MongoDB in all three tests with 27x greater write throughput, while using 84x less disk space, and delivering relatively equal performance when it came to query speed.
Queries are also very simple. If your rows look like <symbol, timestamp, open, high, low, close, volume>, with InfluxDB you can store just that, then query easily. Say, for the last 10 minutes of data:
SELECT open, close FROM market_data WHERE symbol = 'AAPL' AND time > '2012-04-12 12:15' AND time < '2012-04-13 12:52'
There are no IDs, no keys, and no joins to make. You can do a lot of interesting aggregations. You don't have to vertically partition the table as with PostgreSQL, or contort your schema into arrays of seconds as with MongoDB. Also, InfluxDB compresses really well, while PostgreSQL won't be able to perform any compression on the type of data you have.

Tell us about the queries, and your hardware environment.
I would be very very tempted to go NoSQL, using Hadoop or something similar, as long as you can take advantage of parallelism.
Update
Okay, why?
First of all, notice that I asked about the queries. You can't -- and we certainly can't -- answer these questions without knowing what the workload is like. (I'll co-incidentally have an article about this appearing soon, but I can't link it today.) But the scale of the problem makes me think about moving away from a Big Old Database because
My experience with similar systems suggests the access will either be big sequential (computing some kind of time series analysis) or very very flexible data mining (OLAP). Sequential data can be handled better and faster sequentially; OLAP means computing lots and lots of indices, which either will take lots of time or lots of space.
If You're doing what are effectively big runs against many data in an OLAP world, however, a column-oriented approach might be best.
If you want to do random queries, especially making cross-comparisons, a Hadoop system might be effective. Why? Because
you can better exploit parallelism on relatively small commodity hardware.
you can also better implement high reliability and redundancy
many of those problems lend themselves naturally to the MapReduce paradigm.
But the fact is, until we know about your workload, it's impossible to say anything definitive.

Okay, so this is somewhat away from the other answers, but... it feels to me like if you have the data in a file system (one stock per file, perhaps) with a fixed record size, you can get at the data really easily: given a query for a particular stock and time range, you can seek to the right place, fetch all the data you need (you'll know exactly how many bytes), transform the data into the format you need (which could be very quick depending on your storage format) and you're away.
I don't know anything about Amazon storage, but if you don't have anything like direct file access, you could basically have blobs - you'd need to balance large blobs (fewer records, but probably reading more data than you need each time) with small blobs (more records giving more overhead and probably more requests to get at them, but less useless data returned each time).
Next you add caching - I'd suggest giving different servers different stocks to handle for example - and you can pretty much just serve from memory. If you can afford enough memory on enough servers, bypass the "load on demand" part and just load all the files on start-up. That would simplify things, at the cost of slower start-up (which obviously impacts failover, unless you can afford to always have two servers for any particular stock, which would be helpful).
Note that you don't need to store the stock symbol, date or minute for each record - because they're implicit in the file you're loading and the position within the file. You should also consider what accuracy you need for each value, and how to store that efficiently - you've given 6SF in your question, which you could store in 20 bits. Potentially store three 20-bit integers in 64 bits of storage: read it as a long (or whatever your 64-bit integer value will be) and use masking/shifting to get it back to three integers. You'll need to know what scale to use, of course - which you could probably encode in the spare 4 bits, if you can't make it constant.
You haven't said what the other three integer columns are like, but if you could get away with 64 bits for those three as well, you could store a whole record in 16 bytes. That's only ~110GB for the whole database, which isn't really very much...
EDIT: The other thing to consider is that presumably the stock doesn't change over the weekend - or indeed overnight. If the stock market is only open 8 hours per day, 5 days per week, then you only need 40 values per week instead of 168. At that point you could end up with only about 28GB of data in your files... which sounds a lot smaller than you were probably originally thinking. Having that much data in memory is very reasonable.
EDIT: I think I've missed out the explanation of why this approach is a good fit here: you've got a very predictable aspect for a large part of your data - the stock ticker, date and time. By expressing the ticker once (as the filename) and leaving the date/time entirely implicit in the position of the data, you're removing a whole bunch of work. It's a bit like the difference between a String[] and a Map<Integer, String> - knowing that your array index always starts at 0 and goes up in increments of 1 up to the length of the array allows for quick access and more efficient storage.

It is my understanding that HDF5 was designed specifically with the time-series storage of stock data as one potential application. Fellow stackers have demonstrated that HDF5 is good for large amounts of data: chromosomes, physics.

I think any major RDBMS would handle this. At the atomic level, a one table with correct partitioning seems reasonable (partition based on your data usage if fixed - this is ikely to be either symbol or date).
You can also look into building aggregated tables for faster access above the atomic level. For example if your data is at day, but you often get data back at the wekk or even month level, then this can be pre-calculated in an aggregate table. In some databases this can be done though a cached view (various names for different DB solutions - but basically its a view on the atomic data, but once run the view is cached/hardened intoa fixed temp table - that is queried for subsequant matching queries. This can be dropped at interval to free up memory/disk space).
I guess we could help you more with some idea as to the data usage.

Here is an attempt to create a Market Data Server on top of the Microsoft SQL Server 2012 database which should be good for OLAP analysis, a free open source project:
http://github.com/kriasoft/market-data

First, there isn't 365 trading days in the year, with holidays 52 weekends (104) = say 250 x the actual hours of day market is opened like someone said, and to use the symbol as the primary key is not a good idea since symbols change, use a k_equity_id (numeric) with a symbol (char) since symbols can be like this A , or GAC-DB-B.TO , then in your data tables of price info, you have, so your estimate of 7.3 billion is vastly over calculated since it's only about 1.7 million rows per symbol for 14 years.
k_equity_id
k_date
k_minute
and for the EOD table (that will be viewed 1000x over the other data)
k_equity_id
k_date
Second, don't store your OHLC by minute data in the same DB table as and EOD table (end of day) , since anyone wanting to look at a pnf, or line chart, over a year period , has zero interest in the by the minute information.

Let me recommend that you take a look at apache solr, which I think would be ideal for your particular problem. Basically, you would first index your data (each row being a "document"). Solr is optimized for searching and natively supports range queries on dates. Your nominal query,
"Give me the prices of AAPL between April 12 2012 12:15 and April 13 2012 12:52"
would translate to something like:
?q=stock:AAPL AND date:[2012-04-12T12:15:00Z TO 2012-04-13T12:52:00Z]
Assuming "stock" is the stock name and "date" is a "DateField" created from the "date" and "minute" columns of your input data on indexing. Solr is incredibly flexible and I really can't say enough good things about it. So, for example, if you needed to maintain the fields in the original data, you can probably find a way to dynamically create the "DateField" as part of the query (or filter).

You should compare the slow solutions with a simple optimized in memory model. Uncompressed it fits in a 256 GB ram server. A snapshot fits in 32 K and you just index it positionally on datetime and stock. Then you can make specialized snapshots, as open of one often equals closing of the previous.
[edit] Why do you think it makes sense to use a database at all (rdbms or nosql)? This data doesn't change, and it fits in memory. That is not a use case where a dbms can add value.

If you have the hardware, I recommend MySQL Cluster. You get the MySQL/RDBMS interface you are so familiar with, and you get fast and parallel writes. Reads will be slower than regular MySQL due to network latency, but you have the advantage of being able to parallelize queries and reads due to the way MySQL Cluster and the NDB storage engine works.
Make sure that you have enough MySQL Cluster machines and enough memory/RAM for each of those though - MySQL Cluster is a heavily memory-oriented database architecture.
Or Redis, if you don't mind a key-value / NoSQL interface to your reads/writes. Make sure that Redis has enough memory - its super-fast for reads and writes, you can do basic queries with it (non-RDBMS though) but is also an in-memory database.
Like others have said, knowing more about the queries you will be running will help.

You will want the data stored in a columnar table / database. Database systems like Vertica and Greenplum are columnar databases, and I believe SQL Server now allows for columnar tables. These are extremely efficient for SELECTing from very large datasets. They are also efficient at importing large datasets.
A free columnar database is MonetDB.

If your use case is to simple read rows without aggregation, you can use Aerospike cluster. It's in memory database with support of file system for persistence. It's also SSD optimized.
If your use case needs aggregated data, go for Mongo DB cluster with date range sharding. You can club year vise data in shards.

Database choice: High-write, low-read

I'm building a component for recording historical data. Initially I expect it to do about 30 writes/second, and less than 1 read/second.
The data will never be modified, only new data will be added. Reads are likely to be done with fresh records.
The demand is likely to increase rapidly, expecting around 80 writes/second in one year time.
I could choose to distribute my component and use a common database such as MySql, or I could go with a distributed database such as MongoDb. Either way, I'd like the database to handle writes very well.
The database must be free. Open source would be a plus :-)
Note: A record is plain text in variable size, typically 50 to 500 words.

Your question can be solved a few different ways, so let's break it down and look at the individual requirements you've laid out:
Writes - It sounds like the bulk of what you're doing is append only writes at a relatively low volume (80 writes/second). Just about any product on the market with a reasonable storage backend is going to be able to handle this. You're looking at 50-500 "words" of data being saved. I'm not sure what constitutes a word, but for the sake of argument let's assume that a word is an average of 8 characters, so your data is going to be some kind of metadata, a key/timestamp/whatever plus 400-4000 bytes of words. Barring implementation specific details of different RDBMSes, this is still pretty normal, we're probably writing at most (including record overhead) 4100 bytes per record. This maxes out at 328,000 bytes per second or, as I like to put it, not a lot of writing.
Deletes - You also need the ability to delete your data. There's not a lot I can say about that. Deletes are deletes.
Reading - Here's where things get tricky. You mention that it's mostly primary keys and reads are being done on fresh data. I'm not sure what either of these mean, but I don't think that it matters. If you're doing key only lookups (e.g. I want record 8675309), then life is good and you can use just about anything.
Joins - If you need the ability to write actual joins where the database handles them, you've written yourself out of the major non-relational database products.
Data size/Data life - This is where things get fun. You've estimated your writes at 80/second and I guess at 4100 bytes per record or 328,000 bytes per second. There are 86400 seconds in a day, which gives us 28,339,200,000 bytes. Terrifying! That's 3,351,269.53125 KB, 27,026 MB, or roughly 26 GB / day. Even if you're keeping your data for 1 year, that's 9633 GB, or 10TB of data. You can lease 1 TB of data from a cloud hosting provider for around $250 per month or buy it from a SAN vendor like EqualLogic for about $15,000.
Conclusion: I can only think of a few databases that couldn't handle this load. 10TB is getting a bit tricky and requires a bit of administration skill, and you might need to look at certain data lifecycle management techniques, but almost any RDBMS should be up to this task. Likewise, almost any non-relational/NoSQL database should be up to this task. In fact, almost any database of any sort should be up to the task.
If you (or your team members) already have skills in a particular product, just stick with that. If there's a specific product that excels in your problem domain, use that.
This isn't the type of problem that requires any kind of distributed magical unicorn powder.

Ok for MySQL I would advice you to use InnoDB without any indexes, expect on primary keys, even then, if you can skip them it would be good, in order to make input flow uninterrupted.
Indexes optimize reading, but descrease the writing capabilities.
You also may use PostgreSQL. Where you also need to skip indexes as well but you wont have a engine selection and its capabilities are also very strong for writing.
This approach you want is actually used in some solutions, but with two db servers, or at least two databases. The first is receiving a lot of new data (your case), while the second communicates with the first and store it in a well-structured database (with indexes, rules, etc). And then when you need to read or make a snapshot of the data you refer the second server (or second database), where you can use transactions and so on.
You should take a look and refer at Oracle Express (I think this was its name) and SQL Server Express Edition. The last two have better performance, but also some limitations. To have more detailed picture.

How to efficiently utilize 10+ computers to import data

We have flat files (CSV) with >200,000,000 rows, which we import into a star schema with 23 dimension tables. The biggest dimension table has 3 million rows. At the moment we run the importing process on a single computer and it takes around 15 hours. As this is too long time, we want to utilize something like 40 computers to do the importing.
My question
How can we efficiently utilize the 40 computers to do the importing. The main worry is that there will be a lot of time spent replicating the dimension tables across all the nodes as they need to be identical on all nodes. This could mean that if we utilized 1000 servers to do the importing in the future, it might actually be slower than utilize a single one, due to the extensive network communication and coordination between the servers.
Does anyone have suggestion?
EDIT:
The following is a simplification of the CSV files:
"avalue";"anothervalue"
"bvalue";"evenanothervalue"
"avalue";"evenanothervalue"
"avalue";"evenanothervalue"
"bvalue";"evenanothervalue"
"avalue";"anothervalue"
After importing, the tables look like this:
dimension_table1
id name
1 "avalue"
2 "bvalue"
dimension_table2
id name
1 "anothervalue"
2 "evenanothervalue"
Fact table
dimension_table1_ID dimension_table2_ID
1 1
2 2
1 2
1 2
2 2
1 1

You could consider using a 64bit hash function to produce a bigint ID for each string, instead of using sequential IDs.
With 64-bit hash codes, you can store 2^(32 - 7) or over 30 million items in your hash table before there is a 0.0031% chance of a collision.
This would allow you to have identical IDs on all nodes, with no communication whatsoever between servers between the 'dispatch' and the 'merge' phases.
You could even increase the number of bits to further lower the chance of collision; only, you would not be able to make the resultant hash fit in a 64bit integer database field.
See:
http://en.wikipedia.org/wiki/Fowler_Noll_Vo_hash
http://code.google.com/p/smhasher/wiki/MurmurHash
http://www.partow.net/programming/hashfunctions/index.html

Loading CSV data into a database is slow because it needs to read, split and validate the data.
So what you should try is this:
Setup a local database on each computer. This will get rid of the network latency.
Load a different part of the data on each computer. Try to give each computer the same chunk. If that isn't easy for some reason, give each computer, say, 10'000 rows. When they are done, give them the next chunk.
Dump the data with the DB tools
Load all dumps into a single DB
Make sure that your loader tool can import data into a table which already contains data. If you can't do this, check your DB documentation for "remote table". A lot of databases allow to make a table from another DB server visible locally.
That allows you to run commands like insert into TABLE (....) select .... from REMOTE_SERVER.TABLE
If you need primary keys (and you should), you will also have the problem to assign PKs during the import into the local DBs. I suggest to add the PKs to the CSV file.
[EDIT] After checking with your edits, here is what you should try:
Write a small program which extract the unique values in the first and second column of the CSV file. That could be a simple script like:
cut -d";" -f1 | sort -u | nawk ' { print FNR";"$0 }'
This is a pretty cheap process (a couple of minutes even for huge files). It gives you ID-value files.
Write a program which reads the new ID-value files, caches them in memory and then reads the huge CSV files and replaces the values with the IDs.
If the ID-value files are too big, just do this step for the small files and load the huge ones into all 40 per-machine DBs.
Split the huge file into 40 chunks and load each of them on each machine.
If you had huge ID-value files, you can use the tables created on each machine to replace all the values that remained.
Use backup/restore or remote tables to merge the results.
Or, even better, keep the data on the 40 machines and use algorithms from parallel computing to split the work and merge the results. That's how Google can create search results from billions of web pages in a few milliseconds.
See here for an introduction.

This is a very generic question and does not take the database backend into account. Firing with 40 or 1000 machines on a database backend that can not handle the load will give you nothing. Such a problem is truly to broad to answer it in a specific way..you should get in touch with people inside your organization with enough skills on the DB level first and then come back with a more specific question.

Assuming N computers, X files at about 50GB files each, and a goal of having 1 database containing everything at the end.
Question: It takes 15 hours now. Do you know which part of the process is taking the longest? (Reading data, cleansing data, saving read data in tables, indexing… you are inserting data into unindexed tables and indexing after, right?)
To split this job up amongst the N computers, I’d do something like (and this is a back-of-the-envelope design):
Have a “central” or master database. Use this to mangae the overall process, and to hold the final complete warehouse.
It contains lists of all X files and all N-1 (not counting itself) “worker” databases
Each worker database is somehow linked to the master database (just how depends on RDBMS, which you have not specified)
When up and running, a "ready" worker database polls the master database for a file to process. The master database dolls out files to worker systems, ensuring that no file gets processed by more than one at a time. (Have to track success/failure of loading a given file; watch for timeouts (worker failed), manage retries.)
Worker database has local instance of star schema. When assigned a file, it empties the schema and loads the data from that one file. (For scalability, might be worth loading a few files at a time?) “First stage” data cleansing is done here for the data contained within that file(s).
When loaded, master database is updated with a “ready flagy” for that worker, and it goes into waiting mode.
Master database has it’s own to-do list of worker databases that have finished loading data. It processes each waiting worker set in turn; when a worker set has been processed, the worker is set back to “check if there’s another file to process” mode.
At start of process, the star schema in the master database is cleared. The first set loaded can probably just be copied over verbatim.
For second set and up, have to read and “merge” data – toss out redundant entries, merge data via conformed dimensions, etc. Business rules that apply to all the data, not just one set at a time, must be done now as well. This would be “second stage” data cleansing.
Again, repeat the above step for each worker database, until all files have been uploaded.
Advantages:
Reading/converting data from files into databases and doing “first stage” cleansing gets scaled out across N computers.
Ideally, little work (“second stage”, merging datasets) is left for the master database
Limitations:
Lots of data is first read into worker database, and then read again (albeit in DBMS-native format) across the network
Master database is a possible chokepoint. Everything has to go through here.
Shortcuts:
It seems likely that when a workstation “checks in” for a new file, it can refresh a local store of data already loaded in the master and add data cleansing considerations based on this to its “first stage” work (i.e. it knows code 5484J has already been loaded, so it can filter it out and not pass it back to the master database).
SQL Server table partitioning or similar physical implementation tricks of other RDBMSs could probably be used to good effect.
Other shortcuts are likely, but it totally depends upon the business rules being implemented.
Unfortunately, without further information or understanding of the system and data involved, one can’t tell if this process would end up being faster or slower than the “do it all one one box” solution. At the end of the day it depends a lot on your data: does it submit to “divide and conquer” techniques, or must it all be run through a single processing instance?

The simplest thing is to make one computer responsible for handing out new dimension item id's. You can have one for each dimension. If the dimension handling computers are on the same network, you can have them broadcast the id's. That should be fast enough.
What database did you plan on using with a 23-dimensional starscheme? Importing might not be the only performance bottleneck. You might want to do this in a distributed main-memory system. That avoids a lot of the materalization issues.
You should investigate if there are highly correlating dimensions.
In general, with a 23 dimensional star scheme with large dimensions a standard relational database (SQL Server, PostgreSQL, MySQL) is going to perform extremely bad with datawarehouse questions. In order to avoid having to do a full table scan, relational databases use materialized views. With 23 dimensions you cannot afford enough of them. A distributed main-memory database might be able to do full table scans fast enough (in 2004 I did about 8 million rows/sec/thread on a Pentium 4 3 GHz in Delphi). Vertica might be an other option.
Another question: how large is the file when you zip it? That provides a good first order estimate of the amount of normalization you can do.
[edit] I've taken a look at your other questions. This does not look like a good match for PostgreSQL (or MySQL or SQL server). How long are you willing to wait for query results?

Rohita,
I'd suggest you eliminate a lot of the work from the load by sumarising the data FIRST, outside of the database. I work in a Solaris unix environment. I'd be leaning towards a korn-shell script, which cuts the file up into more managable chunks, then farms those chunks out equally to my two OTHER servers. I'd process the chunks using a nawk script (nawk has an efficient hashtable, which they call "associative arrays") to calculate the distinct values (the dimensions tables) and the Fact table. Just associate each new-name-seen with an incrementor-for-this-dimension, then write the Fact.
If you do this through named pipes you can push, process-remotely, and readback-back the data 'on the fly' while the "host" computer sits there loading it straight into tables.
Remember, No matter WHAT you do with 200,000,000 rows of data (How many Gig is it?), it's going to take some time. Sounds like you're in for some fun. It's interesting to read how other people propose to tackle this problem... The old adage "there's more than one way to do it!" has never been so true. Good luck!
Cheers. Keith.

On another note you could utilize Windows Hyper-V Cloud Computing addon for Windows Server:http://www.microsoft.com/virtualization/en/us/private-cloud.aspx

It seems that your implementation is very inefficient as it's loading at the speed of less than 1 MB/sec (50GB/15hrs).
Proper implementation on a modern single server (2x Xeon 5690 CPUs + RAM that's enough for ALL dimensions loaded in hash tables + 8GB ) should give you at least 10 times better speed i.e at least 10MB/sec.

how many rows can a be executed at one time unless to get time out

i work on sql 2005 server
I have almost 350 000 insert scripts.. The insert script has 10 columns to be inserted.
So how many rows should I select to be executed at one click. "Execute" click..
Please tell me an average number according to an average system configuration..
Win XP
Cor 2 Duo
3,66 Gb ram

Ok, lets get some things straight here:
Win XP Cor 2 Duo 3,66 Gb ram
Not average but outdated. On top it totally missed the most important nubmer for a db, which is speed/number of discs.
i work on sql 2005 server I have
almost 350 000 insert scripts..
I seriously doubt you haver 350.000 insert SCRIPTS. THIs would be 350.000 FILES that contain insert commands. This is a lot of files.
The insert script has 10 columns to be
inserted.
I order a pizza. How much fuel does my car require per km? Same relation. 10 columns is nice, but you dont say how many insert commands your scripts contain.
So, at the end the only SENSIBLE interpretation is you have to insert 350.000 rows, and try to do it from a program (i.e. there are no scripts to start with), but this is pretty much absolutely NOT what you say.
So how many rows should I select to be
executed at one click
How many pizzas should I order with one telephone? THe click here is irrelevant. It woud also not get faster when you use a command line program to do the isnerts.
The question is how to get inserts into the db fastest.
For normal SQL:
Batch the inserts. Like 50 or 100 into one statement (yes, you can write more than one insert into one command).
Submit them interleaved async, preparing the next statement while the prevous one executes.
This is very flexible as yuo can do real sql etc.
for real mass inserts:
Forget the idea of writing insertsstatements. Prepare the data properly as per table structure, use SqlBulkCopy to mass insert them.
Less flexible - but a LOT faster.
The later approach on my SMALL (!) database computer would handle this in about 3-5 seconds when the fields are small (a field dan be a 2gb binary data thing, you know). I handle about 80.000 row isnerts per second without a lot of optimization, but i have small and a little less fields. This is 4 processor cores (irrelvant, they never get busy), 8gb RAM (VERY small for a datbase server, irrelevant as well in this context), and 6 vlociraptors for the data in a Raid 10 (again, a small configuration for a database, b ut very relevant). I get a peak insert in the 150mb per second range here in activity monitor. I will do a lot of optimization here, as i open /close a db connection at the moment every 20.000 items... bad batching.
But then, you dont seem to have a database system at all, just a database installed on a low end workstation, an this means your IO is going to be REALLY slow compared to database servers, and insert speed / update speed is IO bound. Desktop discs suck, and you have data AND logs on the same discs.
But.... at the end you dont really say us anything about your problem.
And... the timeout CAN be set programmatically on the connection object.

I'm pretty sure the timeout can be set by the user by going Server Properties -> Connections -> Remote query timeout. If you set this sufficiently high (or to 0 which should mean it never times out) then you can run as many scripts as you like.
Obviously this is only ok if the database is not yet live - and you're simply needing to populate. If the data is coming from another MS SQL Server however you might just want to take a Full backup and restore - this will be both simpler and quicker.
This may be of help.

The general rule of thumb is to not exceed 0.1 seconds per UI operation for excellent performance. You are going to need to benchmark to find out what that is.

Column Stores: Comparing Column Based Databases

I've really been struggling to make SQL Server into something that, quite frankly, it will never be. I need a database engine for my analytical work. The DB needs to be fast and does NOT need all the logging and other overhead found in typical databases (SQL Server, Oracle, DB2, etc.)
Yesterday I listened to Michael Stonebraker speak at the Money:Tech conference and I kept thinking, "I'm not really crazy. There IS a better way!" He talks about using column stores instead of row oriented databases. I went to the Wikipedia page for column stores and I see a few open source projects (which I like) and a few commercial/open source projects (which I don't fully understand).
My question is this: In an applied analytical environment, how do the different column based DB's differ? How should I be thinking about them? Anyone have practical experience with multiple column based systems? Can I leverage my SQL experience with these DBs or am I going to have to learn a new language?
I am ultimately going to be pulling data into R for analysis.
EDIT: I was requested for some clarification in what exactly I am trying to do. So, here's an example of what I would like to do:
Create a table that has 4 million rows and 20 columns (5 dims, 15 facts). Create 5 aggregation tables that calculate max, min, and average for each of the facts. Join those 5 aggregations back to the starting table. Now calculate the percent deviation from mean, percent deviation of min, and percent deviation from max for each row and add it to the original table. This table data does not get new rows each day, it gets TOTALLY replaced and the process is repeated. Heaven forbid if the process must be stopped. And the logs... ohhhhh the logs! :)

The short answer is that for analytic data, a column store will tend to be faster, with less tuning required.
A row store, the traditional database architecture, is good at inserting small numbers of rows, updating rows in place, and querying small numbers of rows. In a row store, these operations can be done with one or two disk block I/Os.
Analytic databases typically load thousands of records at a time; sometimes, as in your case, they reload everything. They tend to be denormalized, so have a lot of columns. And at query time, they often read a high proportion of the rows in the table, but only a few of these columns. So, it makes sense from an I/O standpoint to store values of the same column together.
Turns out that this gives the database a huge opportunity to do value compression. For instance, if a string column has an average length of 20 bytes but has only 25 distinct values, the database can compress to about 5 bits per value. Column store databases can often operate without decompressing the data.
Often in computer science there is an I/O versus CPU time tradeoff, but in column stores the I/O improvements often improve locality of reference, reduce cache paging activity, and allow greater compression factors, so that CPU gains also.
Column store databases also tend to have other analytic-oriented features like bitmap indexes (yet another case where better organization allows better compression, reduces I/O, and allows algorithms that are more CPU-efficient), partitions, and materialized views.
The other factor is whether to use a massively parallel (MMP) database. There are MMP row-store and column-store databases. MMP databases can scale up to hundreds or thousands of nodes, and allow you to store humungous amounts of data, but sometimes have compromises like a weaker notion of transactions or a not-quite-SQL query language.
I'd recommend that you give LucidDB a try. (Disclaimer: I'm a committer to LucidDB.) It is open-source column store database, optimized for analytic applications, and also has other features such as bitmap indexes. It currently only runs on one node, but utilizes several cores effectively and can handle reasonable volumes of data with not much effort.

4 million rows times 20 columns times 8 bytes for a double is 640 mb. Following the rule of thumb that R creates three temporary copies for every object, we get to around 2 gb. That is not a lot by today's standard.
So this should be doable in memory on a suitable 64-bit machine with a 'decent' amount of ram (say 8 gb or more). Installing Ubuntu or Debian (possibly in the server version) can be done in a few minutes.

I have some experience with Infobright Community edition --- column-or. db, based on mysql.
Pro:
you can use mysql interfaces/odbc mysql drivers, from R too
fast enough queries on big chunks of data selection (because of KnowledgeGrid & data packs)
very fast native data loader and connectors for ETL (talend, kettle)
optimized exactly that operations what I (and I think most of us) use (selection by factor levels, joining etc)
special "lookup" option for optimized storing R factor variables ;) (ok, char/varchar variables with relatively small levels number/rows number)
FOSS
paid support option
?
Cons:
no insert/update operations in Community edition (yet?), data loading only via native data loader/ETL connectors
no utf-8 official support (collation/sort etc), planned for q3 2009
no functions in aggregate queries f.e. select month (date) from ...) yet, planned for July(?) 2009, but because of column storage, I prefer simply create date columns for every aggregation levels (week number, month, ...) I need
cannot installed on existing mysql server as storage engine (because of own optimizer, if I understood correctly), but you may install Infobright & mysql on different ports if you need
?
Resume:
Good FOSS solution for daily analytical tasks, and, I think, your tasks as well.

Here is my 2 cents: SQL server does not scale well. We attempted to use SQL server to store financial data in real time (i.e. prices ticks coming in for 100 symbols). It worked perfectly for the first 2 weeks - then it went slower and slower as the database size increased, and finally ground to a halt, too slow to insert each price as it was received. We tried to work around it by moving data from the active database to offline storage every night, but ultimately the project was abandoned as it just didn't work.
Bottom line: if you're planning on storing a lot of data ( >1GB) you need something that scales properly, and that probably means a column database.

It looks like an implementation change (2-D array in column-major order, instead of row-major order), rather than an interface change.
Think "strategy" pattern, rather than being an entire paradigm shift. Of course, I've never used these products, so they may in fact force a paradigm shift down your throat. I don't know why, though.

We might be better able to help you reach an informed decision if you described [1] your specific goal and [2] the issues you're running into with SQL Server.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight