relational PostgreSQL database from tab delimited - database

Trying to create a relational database from a healthcare related tsv in PostgreSQL. I created a database and created the 3 tables I need and that worked, so now I have three tables that I need to import the relevant data into.
I primarily work in Python, so I'm not great at SQL. I tried importing the entire text file just to see if it worked. It said it completed but when I use SELECT * FROM on any of the tables they're all blank.
So I then tried breaking up the tsv into the relevant columns of each table and saving them as csvs and importing them that way. That also didn't work.
I've tried googling around but have had no luck. I am on a mac if it matters.
Any help is greatly appreciated!

Related

Extract data from thousands of Excel files into database

We use SharePoint 2013 as a library to hold thousands of Excel files, with almost never consistent formatting, to manage projects occurring on servers. Somewhere in these maybe formatted as table objects is a common set of server names.
Somehow, without being able to change this process in the short term, I need to pull data from all these files to identify how many projects are targeting a particular server.
I've got access to SQL Server 2016 enterprise, and wondering if something like PolyBase could help with this? I also wonder about SSIS but I don't expect any tables to look exactly like another one.
Other tools may be an option, but I'm not sure what can handle this scale and variety. I think daily updates to the data would be enough, but even so it's still a mess.
How do I pull thousands of varied excel tables into a database? Is this even possible?
Any longer term solution that doesn't allow them to format and annotate like excel is unlikely to actually be adopted.
The less you know in advance, the more difficult it will be...
Some ideas:
Technology
read about FROM OPENROWSET which allows to read from an Excel
read about linked server
Use Excel and its great abilities through VBA to iterate through all your Excel-Sheets, open them, analyse them and fill proper tables. Within Excel you know most about your messy data...
Target structure
You might create thousands of tables, each representing one single sheet in all your Excel files. You could query these tables with dynamically created SQL (using meta-data of INFORMATION_SCHEMA) or think about Full-Text-Search
You might import each sheet into one single XML-structure (SELECT * ... FOR XML PATH('...')). In this case you'd need a target table with columns for Path and name of your Excel, Name of the sheet and an XML column for your data. Another approach was to represent each File on one XML and include all sheets there. Try to define common naming for all your data. Querying XML allows to query columns without knowing their actual names (XQuery with XPath using *).
If your Excels are xlsx already, you might open them with UNZIP and take the existing XML as-is.
To be honest: I do not think that any tool can do the magic to import such a wide range of mess automatically...

Best way to import large excel file into SQL Server

We are trying to devise an optimal method for importing very large Excel files into SQL database. Using SSIS is somewhat troublesome because it scans top X records to determine the format of the file, but rows further down may be different, so it takes a lot of trial and error, with us having to bring the unusual columns to the top so SSIS can "learn".
When we get new file formats to import, they conform to specification in terms of row formatting etc - so we can say we know the schema in advance. The SQL destination tables have the same schema, with couple of extra columns such as date inserted and original filename.
Is there an easier way to create format definitions for new files we are going to insert? We don't have to use SSIS, we are open to any other tool, with a view for as much automation as possible. There's a question of testing the sanity of data we will import, we were planning on doing basic queries against staging datasets such as "less than 1% of records can miss postal code" etc.
Many thanks
Maybe you can import data as text and after that you can convert that using Derived Column transformation. You can read data from Excel as Text using IMEX option in Connection String. More information about this parameter you can find here.

SqlServer to Postgres DDL migration/conversion

We're looking to migrate from MSSQL to Postgres. I'm intending on using sql servers bcp tool for generating csv that we'll import into postgres with the bulk copy features. We are however, having trouble getting the DDL migrated. We've. I've gotten it to work by massaging the DDL generated by MMSQL by hand but We need something automated since we have a moving target (still adding tables, columns etc.) and will need to do this more than once.
We're open to commercial and open source products but have not found anything that does everything we need: Tables, serial columns, indexes (unique, multi column etc), defaults and foreign key constraints.
Check this link http://dbconvert.com/convert-mssql-to-postgre-pro.php?DB=6
Specifically look at the "SQL Azure to PostgreSQL" feature. Hopefully that will handle your table DDL.
NOTE: I have not used this just ran across it at http://www.postgresql.org/ in the latest news section a few days ago.
Finally settled on using http://www.enterprisedb.com/products-services-training/products-overview/postgres-plus-solution-pack/migration-toolkit. It's the most thorough and does data and ddl. There have been some issues with escaping backslashes that are followed by numbers since postgres thinks these are unicode escape sequences.

Run a query from two data sets programmatically

I am trying to reconcile data from a website and a database programmatically. Right now my process is manual. I download data from the website, download data from my database, and reconcile using an Excel vlookup. Within Excel, I am only reconciling 1 date for many items.
I'd like to programmatically reconcile the data for multiple dates and multiple items. The problem is that I have to download the data from the website manually. I have heard of people doing "outer joins" and "table joins" but I do not know where to begin. Is this something that I code in VBA or notepad?
Generally I do this by bulk inserting the website data into a staging table and then write select statments to join that table to my data in the database. You may need to do clean up first to be able to match the records if they are stored differently.
Python is a scripting language. http://www.python.org
There are tools to allow you to read Excel spreadsheets. For example:
http://michaelangela.wordpress.com/2008/07/06/python-excel-file-reader/
You can also use Python to talk to your database server.
http://pymssql.sourceforge.net/
http://www.oracle.com/technology/pub/articles/devlin-python-oracle.html
http://sourceforge.net/projects/pydb2/
Probably the easiest way to automate this is to save the excel files you get to disk, and use Python to read them, comparing that data with what is in your database.
This will not be a trivial project, but it is very flexible and straight forward. Trying to do it all in SQL will be, IMHO, a recipe for frustration, especially if you are new to SQL.
Alternatively:
You could also do this by using VBA to read in your excel files and generate SQL INSERT statements that are compatible with your DB schema. Then use SQL to compare them.

MaxDB Data and Schema Export to SQL Server 2005/8

I am tasked with exporting the data contained inside a MaxDB database to SQL Server 200x. I was wondering if anyone has gone through this before and what your process was.
Here is my idea but its not automated.
1) Export data from MaxDB for each table as a CSV.
2) Clean the CSV to remove ? (which it uses for nulls) and fix the date strings.
3) Use SSIS to import the data into tables in SQL Server.
I was wondering if anyone has tried linking MaxDB to SQL Server or what other suggestions or ideas you have for automating this.
Thanks.
AboutDev.
I managed to find a solution to this. There is an open source MaxDB library that will allow you to connect to it through .Net much like the SQL provider. You can use that to get schema information and data, then write a little code to generate scripts to run in SQL Server to create tables and insert the data.
MaxDb Data Provider for ADO.NET
If this is a one time thing, you don't have to have it all automated.
I'd pull the CSVs into SQL Server tables, and keep them forever, will help with any questions a year from now. You can prefix them all the same, "Conversion_" or whatever. There are no constraints or FKs on these tables. You might consider using varchar for every column (or the ones that cause problems, or not at all if the data is clean), just to be sure there are no data type conversion issues.
pull the data from these conversion tables into the proper final tables. I'd use a single conversion stored procedure to do everything (but I like tsql). If the data isn't that large millions and millions of rows or less, just loop through and build out all the tables, printing log info as necessary, or inserting into exception/bad data tables as necessary.

Resources