Hypothetical scenario - I have a 10 node Greenplum cluster with 100 TB of data in 1000 tables that needs to be offloaded to S3 for reasons. Ideally, the final result is a .csv file that corresponds to each table in the database.
I have three possible approaches, each with positives and negatives.
COPY - There is an existing question that already answers the how, but the problem with psql COPY in a distributed architecture is that everything has to funnel through the master, creating a bottleneck for moving 100 TB of data.
gpcrondump - This would create 10 files per table, and the format is tab-delimited, which would require some post-gpcrondump ETL to unite the files into a single .csv, but it takes full advantage of the distributed architecture and automatically logs successful/failed transfers.
EWT (external writable tables) - Takes advantage of the distributed architecture and writes each table to a single file without holding it in local memory until the full file is built, yet will probably be the most complicated of the scripts to write, because the ETL has to be implemented inline; it can't be done separately after the dump.
All the options are going to have different issues with table locks as we move through the database, and with figuring out which tables failed so we can re-address them for a complete data transfer.
Which approach would you use and why?
I suggest you use the S3 protocol.
http://www.pivotalguru.com/?p=1503
http://gpdb.docs.pivotal.io/43160/admin_guide/load/topics/g-s3-protocol.html
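If your GPDB version supports writable s3 external tables, the unload runs in parallel from every segment. A minimal sketch (bucket, region, and config-file path are placeholders; check the docs above for the exact syntax your version supports):

```sql
CREATE WRITABLE EXTERNAL TABLE sales_s3 (LIKE sales)
LOCATION ('s3://s3-us-east-1.amazonaws.com/my-bucket/sales/ config=/home/gpadmin/s3.conf')
FORMAT 'CSV';

INSERT INTO sales_s3 SELECT * FROM sales;
```

Note that each segment writes its own part file to the bucket, so you would still need a small post-step (e.g. a concatenate job) if you truly need one .csv per table.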
I am writing some processes to pre-format certain data for another downstream process to consume. The pre-formatting essentially involves gathering data from several permanent tables in one DB, applying some logic, and saving the results into another DB.
The problem I am running into is the volume of data: the resulting data set that I need to commit has about 132.5 million rows. The commit itself takes almost 2 hours. I can cut that by switching the recovery model to SIMPLE, but it's still quite substantial (whereas generating the 132.5 million rows into a temp table only takes 9 minutes).
I have been reading up on the best methods to migrate large data sets, but most of the solutions implicitly assume that the source data already resides in a single file/data table (which is not the case here). Some solutions, like the SSMS import/export task, make it difficult to embed the logic I need to apply.
I am wondering if anyone here has some solutions.
Assuming you're on SQL Server 2014 or later, the temp table is not flushed to disk immediately, so the difference is probably just disk speed.
Try making the target table a Clustered Columnstore to optimize for compression and minimize IO.
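A minimal sketch (table and column names are placeholders): create the target table as usual, then add the columnstore index. On 2014+, bulk inserts of 102,400+ rows per batch go straight into compressed rowgroups, which helps exactly this kind of large commit.

```sql
CREATE TABLE dbo.PreFormatted (
    Id       bigint       NOT NULL,
    Payload  varchar(100) NOT NULL,
    LoadDate date         NOT NULL
);

-- compresses the table and cuts IO for large scans and loads
CREATE CLUSTERED COLUMNSTORE INDEX cci_PreFormatted
    ON dbo.PreFormatted;
```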
Relative newcomer to SQL and pg here, so this is a fairly open question about backing up daily data from a stream. Specific commands/scripts would be appreciated if it's simple; otherwise I'm happy to be directed to more specific articles/tutorials on how to implement what needs to be done.
Situation
I'm logging various data streams from some external servers, amounting to a few GB per day, every day. I want to be able to store this data on larger hard drives, which will then be used to pull information from for analysis at a later date.
Hardware
x1 SSD (128GB) (OS + application)
x2 HDD (4TB each) (storage, 2nd drive for redundancy)
What needs to be done
The current plan is to have the SSD store a temporary database consisting of the daily logged data. When server load is low (early morning), dump the entire temporary database onto two separate backup instances on each of the two storage disks. The motivation for storing a temp db is to reduce the load on the harddrives. Furthermore, the daily data is small enough that it will be able to copy over to the storage drives before server load picks up.
Questions
Is this an acceptable method?
Is it better/safer to just push data directly to one of the storage drives, consider that the primary database, and automate a scheduled backup from that drive to the 2nd storage drive?
What specific commands would be required to do this while ensuring data integrity (i.e. new data will still be being logged while a backup is in progress)?
At a later date when budget allows the hardware will be upgraded but the above is what's in place for now.
thanks!
First rule when building a backup system - do the simplest thing that works for you.
Running pg_dump will ensure data integrity. You will want to pay attention to what the last item backed up is to make sure you don't delete anything newer than that. After deleting the data you may well want to run a CLUSTER or VACUUM FULL on various tables if you can afford the logging.
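A minimal nightly sketch, assuming the database is named streamdb and the two HDDs are mounted at /mnt/hdd1 and /mnt/hdd2 (all placeholders):

```shell
# pg_dump takes a consistent snapshot, so logging can continue while it runs
pg_dump -Fc -f /mnt/hdd1/backups/streamdb_$(date +%F).dump streamdb
cp /mnt/hdd1/backups/streamdb_$(date +%F).dump /mnt/hdd2/backups/
```

The custom format (-Fc) is compressed and lets you restore individual tables later with pg_restore.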
Another option would be to have an empty template database and do something like:
Halt application + disconnect
Rename database from "current_db" to "old_db"
CREATE DATABASE current_db TEMPLATE my_template_db
Copy over any other bits you need (sequence numbers etc)
Reconnect the application
Dump old_db + copy backups to other disks.
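The steps above might look something like this in psql, run as a superuser with the application stopped (database names are placeholders; pg_terminate_backend with the pid column assumes PostgreSQL 9.2+):

```sql
-- kick out any remaining sessions so the rename can take an exclusive lock
SELECT pg_terminate_backend(pid)
FROM   pg_stat_activity
WHERE  datname = 'current_db' AND pid <> pg_backend_pid();

ALTER DATABASE current_db RENAME TO old_db;
CREATE DATABASE current_db TEMPLATE my_template_db;
-- copy over sequence values etc., reconnect the application,
-- then pg_dump old_db and copy the dump to the other disks
```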
If what you actually want is two separate live databases, one small quick one and one larger for long-running queries then investigate tablespaces. Create two tablespaces - the default on the big disks and the "small" one on your SSD. Put your small DB on the SSD. Then you can just copy from one table to another with foreign-data-wrappers (FDW) or dump/restore etc.
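A sketch of that tablespace layout (paths and names are placeholders):

```sql
-- the big disks hold the default tablespace; the SSD gets a fast one
CREATE TABLESPACE ssd_fast LOCATION '/mnt/ssd/pgdata';
CREATE DATABASE small_db TABLESPACE ssd_fast;

-- later, individual tables can be migrated off the SSD:
ALTER TABLE daily_log SET TABLESPACE pg_default;
```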
We have a product that uses a MySQL database as the data-store. The data-store holds large amount of data. The problem we are facing is that the response time of the application is very slow. The database queries are very basic with very simple joins, if any. The root cause for the slow response time according to some senior employees is the database operations on the huge data-store.
Another team in our company had worked on a project in the past where they processed large fixed-format files using Hadoop and dumped the contents of these files into database tables. Borrowing from this project, some of the team members feel that we can migrate from using a MySQL database to simple fixed-format files that will hold the data instead. There will be one file corresponding to each table in the database instead. We can then build another data interaction layer that provides interfaces for performing DML operations on the contents in these files. This layer will be developed using Hadoop and the MapReduce programming model.
At this point, several questions come to my mind.
1. Does the problem statement fit into the kind of problems that are solved using Hadoop?
2. How will the application ask the data interaction layer to fetch/update/delete the required data? As far as my understanding goes, the files containing the data will reside on HDFS. We will spawn a Hadoop job that will process the required file (similar to a table in the db) and fetch the required data. This data will be written to an output file on HDFS. We will have to parse this file to get the required content.
3. Will the approach of using fixed format-files and processing them with Hadoop truly solve the problem?
I have managed to set up a simple node cluster with two Ubuntu machines but after playing around with Hadoop for a while, I feel that the problem statement is not a good fit for Hadoop. I could be completely wrong and therefore want to know whether Hadoop fits into this scenario or is it just a waste of time as the problem statement is not in line with what Hadoop is meant for?
I would suggest going straight to Hive (http://hive.apache.org/). It is a SQL engine / data warehouse built on top of Hadoop MapReduce.
In a nutshell: you get Hadoop's scalability, and Hadoop's high latency.
I would consider storing the bulk of the data there, doing all the required transformations, and moving only the summarized data to MySQL to serve queries. Translating user requests directly into Hive queries is usually a bad idea: they are too slow, and running jobs in parallel is not trivial.
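As a sketch of that division of labor (table and column names are hypothetical), the heavy aggregation runs inside Hadoop via Hive, and only the small rollup leaves the cluster:

```sql
-- HiveQL: summarize the raw data where it lives
CREATE TABLE daily_summary AS
SELECT event_date, event_type, COUNT(*) AS events
FROM   raw_events
GROUP  BY event_date, event_type;
```

The resulting daily_summary table is small enough to ship to MySQL (e.g. with Sqoop) and serve user queries from there.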
If you are planning to update data often, then storing it directly in Hadoop may not be a good option for you. To update a file in Hadoop you may have to rewrite it, then delete the old file and copy the new one into HDFS.
However, if you are just searching and joining the data, then it's a good option. If you use Hive, you can write SQL-like queries.
In Hadoop your workflow could be something like the following:
You will run a Hadoop job for your queries.
Your Hadoop program will parse the query and execute some job to join and read files based on your queries and input parameters.
Your output will be generated in HDFS.
You will copy the output to the local file system, then show the output to your program.
We have a number of files generated from a test, with each file having almost 60,000 lines of data. The requirement is to calculate a number of parameters from the data present in these files. There could be two ways of processing the data:
Each file is read line-by-line and processed to obtain required parameters
The file data is bulk copied into the database tables and required parameters are calculated with the help of aggregate functions in the stored procedure.
I was trying to figure out the overheads of both methods. While a database is meant to handle such situations, I am concerned about overheads that may become a problem as the database grows larger.
Will it affect the retrieval rate from the tables, consequently making the calculations slower? Would file processing therefore be a better solution, taking the database size into account? Would database partitioning solve the problem for a large database?
Did you consider using map-reduce (say under Hadoop, maybe with HBase) to perform these tasks? If you're looking for high throughput with big data volumes, this is a very scalable approach. Of course, not every problem can be addressed effectively using this paradigm, and I don't know the details of your calculation.
If you set up indexes correctly you won't suffer performance issues. Additionally, there is nothing stopping you loading the files into a table and running the calculations and then moving the data into an archive table or deleting it altogether.
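A sketch of that load-aggregate-archive cycle in T-SQL (file path, table, and column names are all placeholders):

```sql
-- load one ~60k-line file
BULK INSERT dbo.TestRuns
FROM 'C:\data\run001.txt'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');

-- calculate the required parameters with aggregate functions
INSERT INTO dbo.RunParameters (RunId, AvgValue, MaxValue)
SELECT RunId, AVG(Value), MAX(Value)
FROM   dbo.TestRuns
GROUP  BY RunId;

-- move the raw rows out so the working table stays small
INSERT INTO dbo.TestRunsArchive SELECT * FROM dbo.TestRuns;
TRUNCATE TABLE dbo.TestRuns;
```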
You can run a query directly against the text file from SQL Server:
SELECT * FROM OPENROWSET('MSDASQL',
'Driver={Microsoft Text Driver (*.txt; *.csv)};DefaultDir=C:\;',
'SELECT * FROM [text.txt];')
Ad hoc distributed queries need to be enabled to run this.
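Enabling them looks like this:

```sql
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'Ad Hoc Distributed Queries', 1;
RECONFIGURE;
```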
Or, as you mentioned, you can load the data into a table (using SSIS, BCP, the query above, etc.). You did not mention what it means for the database to be larger; 60k lines in a table is not much, so it will perform well.
What is the fastest method to fill a database table with 10 million rows? I'm asking about the technique, but also about any specific database engine that would allow for a way to do this as fast as possible. I'm not requiring this data to be indexed during this initial data-table population.
Using SQL to load a lot of data into a database will usually result in poor performance. In order to do things quickly, you need to go around the SQL engine. Most databases (including Firebird I think) have the ability to backup all the data into a text (or maybe XML) file and to restore the entire database from such a dump file. Since the restoration process doesn't need to be transaction aware and the data isn't represented as SQL, it is usually very quick.
I would write a script that generates a dump file by hand, and then use the database's restore utility to load the data.
After a bit of searching I found FBExport, that seems to be able to do exactly that - you'll just need to generate a CSV file and then use the FBExport tool to import that data into your database.
The fastest method is probably running an INSERT statement with a SELECT FROM. I've generated test data to populate tables from other databases, and even from the same database, a number of times. But it all depends on the nature and availability of your own data. In my case I had enough rows of collected data that a few select/insert routines with random row selection, applied half-cleverly against real data, yielded decent test data quickly. In some cases where table data was uniquely identifying, I used intermediate tables and frequency-distribution sorting to eliminate things like uncommon names (removing instances where a count with GROUP BY was less than or equal to 2).
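One engine-agnostic trick along these lines is to seed a table with a handful of rows and then repeatedly double it by selecting from itself (table and column names are placeholders):

```sql
-- each pass doubles the row count: ~24 passes from a single row
-- exceeds 10 million rows
INSERT INTO target_table (col_a, col_b)
SELECT col_a, col_b FROM target_table;
```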
Also, Red Gate actually provides a utility to do just what you're asking. It's not free, and I think it's SQL Server-specific, but their tools are top notch and well worth the cost. There's also a free trial period.
If you don't want to pay for their utility, you could conceivably build your own pretty quickly. What they do is not magic by any means; a decent developer should be able to knock out a similarly-featured (though alpha/hardcoded) version of the app in a day or two...
You might be interested in the answers to this question. It looks at uploading a massive CSV file to a SQL server (2005) database. For SQL Server, it appears that a SSIS DTS package is the fastest way to bulk import data into a database.
It entirely depends on your DB. For instance, Oracle has something called direct path load (http://download.oracle.com/docs/cd/B10501_01/server.920/a96652/ch09.htm), which effectively disables indexing, and if I understand correctly, builds the binary structures that will be written to disk on the -client- side rather than sending SQL over.
Combined with partitioning and rebuilding indexes per partition, we were able to load a 1 billion row (I kid you not) database in a relatively short order. 10 million rows is nothing.
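A minimal sketch of a direct path load with SQL*Loader (table, columns, file, and credentials are all placeholders):

```
-- load.ctl
LOAD DATA
INFILE 'rows.csv'
APPEND INTO TABLE big_table
FIELDS TERMINATED BY ','
(id, name, amount)
```

Invoked with something like: sqlldr scott/tiger control=load.ctl direct=true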
Use MySQL or MS SQL and built-in functions to generate records inside the database engine, or generate a text file (in a CSV-like format) and then use bulk copy functionality.