Convert MapInfo files into PostGIS tables

I have files:
Postcode Area Boundaries.MID
"AB","Aberdeen"
POLYGON((383707 866131,383465 866240,383302 866387,383202 866505,383089 866586,.....))
"AL","St Albans"
POLYGON((526248 203348,....))
.......
Postcode Area Boundaries.MIF
VERSION 300
DELIMITER ","
COORDSYS EARTH PROJECTION 8,79,"m",-2,49,0.99960127,400000,-100000
COLUMNS 2
POSTCODEAREA CHAR (0)
POSTCODEAREANAME CHAR (0)
DATA
POINT 384222.000 819125.000
POINT 518661.000 209809.000
POINT 407763.000 285514.000
POINT 371106.000 147748.000
.... as many points as there are areas in the MID file (about 126)
Does anyone know how I can import it into PostGIS? What tables should I create?

If you don't specify a target table, ogr2ogr will create one for you. The following command should create a new table in your db:
ogr2ogr -f PostgreSQL PG:"<dbconn>" "Postcode Area Boundaries.MIF"
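For more control you can name the target table and pin the spatial reference explicitly. A hedged variant (the connection parameters and table name are placeholders, and EPSG:27700 assumes the COORDSYS above is British National Grid):
ogr2ogr -f PostgreSQL PG:"host=localhost dbname=mydb user=postgres" "Postcode Area Boundaries.MIF" -nln postcode_areas -a_srs EPSG:27700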

Related

Split data from strings into columns

I have a column with a long string. The data needs to be split into columns; the strings vary in length and don't always contain the same number of columns. I'm not exactly sure how to do this, so I was looking for some advice here.
Let's say I have this string:
VS5~MedCond1~35.4|VS4~MedCond2~16|VS1~MedCond3~155|VS2~MedCond4~70|SPO2~MedCond5~100|VS3~MedCond6~64|FiO2~MedCond7~21|MAP~MedCond8~98|
And in some cases the string might not have all the medical conditions just some of them.
I need to split this into columns, where the column name is the token between the tildes (i.e. MedCond1) and the value is the text to the right of that tilde but before the pipe, ending up like this:
MedCond1 MedCond2 MedCond3 MedCond4 MedCond5 MedCond6 MedCond7 MedCond8
======== ======== ======== ======== ======== ======== ======== ========
35.4     16       155      70       100      64       21       98
I need to do this for a lot of rows within a large table. As I said, not all the columns are always present, but the names don't change: one row might have MedCond1-8, while another has only MedCond3, 4, and 7.
Here is a query I created that is close to what I want, but it isn't dynamic, so it picks up the values along with extra bits of the string:
select MainCol,
       case when charindex('MedCond1', MainCol) > 0
            then substring(MainCol, charindex('MedCond1', MainCol) + 9, 4)
       end as [MedCond1]
from MedTable
Will return
MedCond1
========
35.3
40.2
33.6
33|V <--- Problem
As you can see, the numeric value is sometimes picked up along with an additional part of the string, because the substring length is hard-coded. The value is sometimes 4 characters long with a decimal place, sometimes 2 characters with no decimal place. I would like to make this dynamic: the pipe defines the end of the data I need, and the start is defined by the tilde at the end of the column name.
Thanks for any thoughts on making this dynamic
Andrew
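A direct fix for the query above is to compute the SUBSTRING length dynamically by locating the terminating pipe with a second CHARINDEX. A sketch using the sample names from the question:
select MainCol,
       case when charindex('MedCond1~', MainCol) > 0 then
            substring(MainCol,
                      charindex('MedCond1~', MainCol) + 9,
                      -- length = position of the next pipe minus the value's start position
                      charindex('|', MainCol + '|', charindex('MedCond1~', MainCol) + 9)
                        - (charindex('MedCond1~', MainCol) + 9))
       end as [MedCond1]
from MedTable
Searching for 'MedCond1~' (9 characters) avoids a false hit if one column name is a prefix of another, and appending '|' to MainCol guards against a missing trailing pipe.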
This data looks like a table itself. It could have been stored in SQL Server as XML. SQL Server supports xml fields and allows querying them. In fact, one could try to convert this string to XML, then query it:
declare @medTable table (item nvarchar(2000));
insert into @medTable
values ('VS5~MedCond1~35.4|VS4~MedCond2~16|VS1~MedCond3~155|VS2~MedCond4~70|SPO2~MedCond5~100|VS3~MedCond6~64|FiO2~MedCond7~21|MAP~MedCond8~98|');
-- Step 1: Replace '|' with </item><item> tags and '~' with </tag><tag> tags.
-- This returns an xml value for each medTable row.
with items as (
    select xmlField = cast('<item><tag>'
        + replace(
            replace(item, '|', '</tag></item><item><tag>'),
            '~', '</tag><tag>')
        + '</tag></item>' as xml)
    from @medTable
)
-- Step 2: Select the individual tags and display them as fields.
select
    y.item.value('(tag/text())[1]', 'nvarchar(20)'),
    y.item.value('(tag/text())[2]', 'nvarchar(20)'),
    y.item.value('(tag/text())[3]', 'nvarchar(20)')
from items outer apply items.xmlField.nodes('item') as y(item)
The result is:
-------------------- -------------------- -------
VS5 MedCond1 35.4
VS4 MedCond2 16
VS1 MedCond3 155
VS2 MedCond4 70
SPO2 MedCond5 100
VS3 MedCond6 64
FiO2 MedCond7 21
MAP MedCond8 98
NULL NULL NULL
It would be better to perform this conversion when loading the data, though. It's easier, for example, to make the replacements in C# or SSIS and store a complete xml value in the database.
You can modify this query too, to generate the xml value and store it in the database:
declare @medTable2 table (xmlField xml);
with items as (
    select xmlField = cast('<item><tag>' + replace(replace(item, '|', '</tag></item><item><tag>'), '~', '</tag><tag>') + '</tag></item>' as xml)
    from @medTable
)
insert into @medTable2
select items.xmlField
from items;
-- Query the new table from now on
select
    y.item.value('(tag/text())[1]', 'nvarchar(20)'),
    y.item.value('(tag/text())[2]', 'nvarchar(20)'),
    y.item.value('(tag/text())[3]', 'nvarchar(20)')
from @medTable2 as t outer apply t.xmlField.nodes('item') as y(item)
OK, let me take a stab at this. The solution I'm outlining is not purely SQL Server; it uses a round trip via a text file.
The approach uses the following steps:
Unpivot the data delimited by the pipe symbols (to create more than one line of output for each line of input)
Round-trip the data from SQL Server to a text file and back
Separate the data into columns on the tilde ~ symbol delimiter
Pivot the data back into columns
The key benefit of this approach is the unpivot operation, which allows you to handle missing columns like MedCond2 naturally by the absence of an equivalent row. It also eliminates nearly all string manipulation, save for the one REPLACE function in step 1 below.
Given a single row's contents like the following:
VS5~MedCond1~35.4|VS4~MedCond2~16|VS1~MedCond3~155|VS2~MedCond4~70|SPO2~MedCond5~100|VS3~MedCond6~64|FiO2~MedCond7~21|MAP~MedCond8~98|
Step 1 (Unpivot): Find and replace all instances of the pipe symbol with a newline character. So, REPLACE(column, '|', CHAR(13)) will give you the following lines of text (i.e. multiple lines of text in a single database row) for a single input row:
VS5~MedCond1~35.4
VS4~MedCond2~16
VS1~MedCond3~155
VS2~MedCond4~70
SPO2~MedCond5~100
VS3~MedCond6~64
FiO2~MedCond7~21
MAP~MedCond8~98
Step 2 (Round-trip): Write the above output to a text file, using your tool of choice (SSIS, SQLCMD, etc.) and ensure that the newline character defined is the same as that used in the REPLACE command in step 1.
The purpose of this step is to concatenate multiple lines within the same row with other lines in different rows.
Note that step 1 can be eliminated by defining the row delimiter for steps 2 & 3 as the pipe symbol. I've put in the additional step 1 using newlines only to make it easier to understand and debug.
Step 3 (Separate columns): Import the text file back into SQL Server using the same tool, and define the column delimiter as the tilde ~ symbol, row delimiter same as in steps 1/2.
ColA MedCondTitle MedCondValue
------ ------------- -------------
VS5 MedCond1 35.4
VS4 MedCond2 16
VS1 MedCond3 155
VS2 MedCond4 70
SPO2 MedCond5 100
VS3 MedCond6 64
FiO2 MedCond7 21
MAP MedCond8 98
Step 4 (Pivot): Now you'd have a trivially simple step of pivoting rows to columns, which can be achieved with a statement of the form:
SUM(CASE WHEN MedCondTitle='MedCond1' THEN MedCondValue ELSE 0 END) as MedCond1
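For completeness, a hedged sketch of step 4 as a full statement, assuming the step 3 output was imported into a table named ImportedRows with a SourceRowId key carried through the round trip (both names are illustrative); MAX is used instead of SUM so the values can stay as text:
SELECT SourceRowId,
       MAX(CASE WHEN MedCondTitle = 'MedCond1' THEN MedCondValue END) AS MedCond1,
       MAX(CASE WHEN MedCondTitle = 'MedCond2' THEN MedCondValue END) AS MedCond2,
       -- ...repeat through MedCond8; conditions absent from a row simply yield NULL
       MAX(CASE WHEN MedCondTitle = 'MedCond8' THEN MedCondValue END) AS MedCond8
FROM ImportedRows
GROUP BY SourceRowId;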

Getting Source only and mismatch data with tablediff utility

I'm using the tablediff utility to transfer data from several source databases to a destination database, and I get a result listing all the differences between the source and destination databases, something like this:
Dest. Only N'1027' N'799' N'91443' N'1'
Mismatch N'103A' N'799' N'13010' N'1' DATE_CURRENT DATE_OPERATION MATRICULE_UTILISATEUR QTE QTE_FINAL QTE_INIT QTE_OPERATION REFERENCE_DOCUMENT TYPE_DOCUMENT
Src. Only N'103A' N'310' N'30129' N'1'
The generated SQL file contains statements that delete the Dest. Only rows, update the Mismatch rows, and insert the Src. Only rows.
My question is: is there any way, using tablediff, to get only the Mismatch and Src. Only rows?
At the end of your tablediff command, add the following:
-dt -et DiffResults
It will drop an existing table with name DiffResults and create a new one in the destination server and database.
Then you can query the DiffResults table to get the desired rows.
In my test I run the following
SELECT * FROM DiffResults
WHERE MSdifftool_ErrorDescription in ('Mismatch','Src. Only')
or
SELECT * FROM DiffResults
WHERE MSdifftool_ErrorCode in (0,2) -- 0 is for 'Mismatch'; 1 is for 'Dest. Only' and 2 is for 'Src. Only'
Some more details can be found here - https://technet.microsoft.com/en-us/library/ms162843.aspx
If you want to use results from command line, you can pipe output with findstr:
tablediff <your parameters> | findstr /i "^Mismatch ^Src"

How to avoid re-inserting data into a SQL table when re-running an SSIS package that loads data from a flat file?

I have a flat file with the following data
Id,FirstName,LastName,Address,PhoneNumber
1,ben,afflick,xyz Address,5014123746
3,christina,smith,test address,111000110
1,ben,afflick,xyz Address,5014123746
3,christina,smith,test address,111000110
4,nash,gordon,charlotte NC ADDRESS,111200110
I have created an SSIS package that has a flat file source, an aggregate transformation that makes sure only unique rows from the flat file are inserted (not duplicate records), and a SQL table as my destination.
Everything is fine when I run the package; I get the output below in the SQL table:
Id FName LName Address phoneNumber
1 ben afflick xyz Address 5014123746
4 nash gordon charlotte NC ADDRESS 111200110
3 christina smith test address 111000110
But when I add some new data to the flat file, as below,
Id,FirstName,LastName,Address,PhoneNumber
1,ben,afflick,xyz Address,5014123746
3,christina,smith,test address,111000110
1,ben,afflick,xyz Address,5014123746
3,christina,smith,test address,111000110
4,nash,gordon,charlotte NC ADDRESS,111200110
5,abc,xyz,New York,9999988888
and re-run the package, the data that is already present in the table gets re-inserted, as below:
1 ben afflick xyz Address 5014123746
4 nash gordon charlotte NC ADDRESS 111200110
3 christina smith test address 111000110
1 ben afflick xyz Address 5014123746
5 abc xyz New York 9999988888
4 nash gordon charlotte NC ADDRESS 111200110
3 christina smith test address 111000110
But I DO NOT want this; I don't want data that is already present to be inserted again.
I want only the newly added data to be inserted into the SQL table.
Can someone please help me achieve this?
Another method is to load your file into a staging table in the database and then use a merge statement to insert data into your destination table.
In practice this would look like a data flow from your flat file to your staging table, and then an Execute SQL Task containing a merge statement.
You can then update any matching values as well, should you wish to.
merge into table_a
using stage_a
on stage_a.key = table_a.key
when not matched then insert (a,b,c,d) values (stage_a.a, stage_a.b, stage_a.c, stage_a.d);
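If you also want to update the matching rows, a hedged extension of the same sketch (keeping the illustrative table and column names) could read:
merge into table_a
using stage_a
on stage_a.key = table_a.key
when matched then
    update set a = stage_a.a, b = stage_a.b, c = stage_a.c, d = stage_a.d
when not matched then
    insert (a,b,c,d) values (stage_a.a, stage_a.b, stage_a.c, stage_a.d);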
Your data flow task would work as follows: the Flat File source reads the CSV file and passes the data to a Lookup transformation, which checks for existing data in the destination table. If there is no matching record, the row from the CSV file is sent to the OLE DB Destination; otherwise, the row is discarded.
Lookup transformation link:
http://www.codeproject.com/Tips/574437/Term-Lookup-Transformation-in-SSIS
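For reference, the Lookup's no-match path is conceptually equivalent to this set-based insert (a sketch; the Staging and dbo.Customers names are illustrative):
INSERT INTO dbo.Customers (Id, FirstName, LastName, Address, PhoneNumber)
SELECT s.Id, s.FirstName, s.LastName, s.Address, s.PhoneNumber
FROM Staging AS s
LEFT JOIN dbo.Customers AS c ON c.Id = s.Id   -- existing rows match here
WHERE c.Id IS NULL;                           -- keep only rows not yet present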

Slow copying of virtual table rtree

The database is copied from the file system into an in-memory database. Measuring the copying process showed that the virtual table needs a lot of time; my assumption is that this has to do with the shadow tables.
First, here are the results:
Attached to database on file system in 7 milliseconds.
Created in-memory table-1 in 2 milliseconds.
Created in-memory table-2 in 2 milliseconds.
Copied table-1 in 2953 milliseconds.
Copied table-2 in 3086 milliseconds.
Created in-memory coordinates table in 4 milliseconds.
Copied coordinates table in 78813 milliseconds.
Detached from database on file system in 12 milliseconds.
Completed time was 84880 milliseconds.
The table schema on the hard disk looks like this:
CREATE VIRTUAL TABLE coordinates USING rtree(
    id,
    min_latitude,
    max_latitude,
    min_longitude,
    max_longitude
);
The copying process looks like this in Java:
statement = connection.createStatement();
statement.execute("ATTACH '" + databasePath + "' AS fs");
statement.execute("CREATE VIRTUAL TABLE mCoordinates USING rtree(id, min_latitude, max_latitude, min_longitude, max_longitude)");
statement.execute("INSERT INTO mCoordinates (id, min_latitude, max_latitude, min_longitude, max_longitude) SELECT id, min_latitude, max_latitude, min_longitude, max_longitude FROM fs.coordinates");
Now my question: is there a chance to get better performance?
Constructing 150,000 R-tree entries from scratch just takes some time.
However, if both databases have the same page size, you could just copy the contents of the shadow tables:
BEGIN;
DELETE FROM mCoordinates_node;
INSERT INTO mCoordinates_node SELECT * FROM fs.coordinates_node;
INSERT INTO mCoordinates_parent SELECT * FROM fs.coordinates_parent;
INSERT INTO mCoordinates_rowid SELECT * FROM fs.coordinates_rowid;
COMMIT;
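Whether that precondition holds can be checked up front by comparing the page sizes (a small sketch, using the fs alias from the ATTACH statement above):
PRAGMA main.page_size;  -- page size of the in-memory database
PRAGMA fs.page_size;    -- page size of the attached file database
The shadow-table copy is only safe if the two values match.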

How do I output the results of a HiveQL query to CSV?

We would like to put the results of a Hive query into a CSV file. I thought the command should look like this:
insert overwrite directory '/home/output.csv' select books from table;
When I run it, it says it completed successfully, but I can never find the file. How do I find this file, or should I be extracting the data in a different way?
Although it is possible to use INSERT OVERWRITE to get data out of Hive, it might not be the best method for your particular case. First let me explain what INSERT OVERWRITE does, then I'll describe the method I use to get tsv files from Hive tables.
According to the manual, your query will store the data in a directory in HDFS. The format will not be csv.
Data written to the filesystem is serialized as text with columns separated by ^A and rows separated by newlines. If any of the columns are not of primitive type, then those columns are serialized to JSON format.
A slight modification (adding the LOCAL keyword) will store the data in a local directory.
INSERT OVERWRITE LOCAL DIRECTORY '/home/lvermeer/temp' select books from table;
When I run a similar query, here's what the output looks like.
[lvermeer@hadoop temp]$ ll
total 4
-rwxr-xr-x 1 lvermeer users 811 Aug 9 09:21 000000_0
[lvermeer@hadoop temp]$ head 000000_0
"row1""col1"1234"col3"1234FALSE
"row2""col1"5678"col3"5678TRUE
Personally, I usually run my query directly through Hive on the command line for this kind of thing, and pipe it into the local file like so:
hive -e 'select books from table' > /home/lvermeer/temp.tsv
That gives me a tab-separated file that I can use. Hope that is useful for you as well.
Based on this patch-3682, I suspect a better solution is available when using Hive 0.11, but I am unable to test this myself. The new syntax should allow the following.
INSERT OVERWRITE LOCAL DIRECTORY '/home/lvermeer/temp'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
select books from table;
If you want a CSV file, you can modify Lukas' solution as follows (assuming you are on a Linux box):
hive -e 'select books from table' | sed 's/[[:space:]]\+/,/g' > /home/lvermeer/temp.csv
This is the most CSV-friendly way I found to output the results of HiveQL.
You don't need any grep or sed commands to format the data; Hive supports it natively, you just need to add the outputformat flag:
hive --outputformat=csv2 -e 'select * from <table_name> limit 20' > /path/toStore/data/results.csv
You should use a CREATE TABLE AS SELECT (CTAS) statement to create a directory in HDFS with the files containing the results of the query. After that you will have to export those files from HDFS to your regular disk and merge them into a single file.
You also might have to do some trickery to convert the files from '\001'-delimited to CSV. You could use a custom CSV SerDe or postprocess the extracted file.
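A hedged sketch of that CTAS step (books_csv is an illustrative table name, and the source follows the question's example):
CREATE TABLE books_csv
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
AS SELECT books FROM table;
The result files then land under the table's directory in the Hive warehouse, ready to be exported and merged.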
You can use INSERT … DIRECTORY …, as in this example:
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/ca_employees'
SELECT name, salary, address
FROM employees
WHERE state = 'CA';
OVERWRITE and LOCAL have the same interpretations as before and paths are interpreted following the usual rules. One or more files will be written to /tmp/ca_employees, depending on the number of reducers invoked.
If you are using HUE this is fairly simple as well. Simply go to the Hive editor in HUE, execute your hive query, then save the result file locally as XLS or CSV, or you can save the result file to HDFS.
I was looking for a similar solution, but the ones mentioned here would not work. My data had all variations of whitespace (space, newline, tab) chars and commas.
To make the column data tsv safe, I replaced all \t chars in the column data with a space, and executed python code on the commandline to generate a csv file, as shown below:
hive -e 'tab_replaced_hql_query' | python -c 'exec("import sys;import csv;reader = csv.reader(sys.stdin, dialect=csv.excel_tab);writer = csv.writer(sys.stdout, dialect=csv.excel)\nfor row in reader: writer.writerow(row)")'
This created a perfectly valid csv. Hope this helps those who come looking for this solution.
You can use hive string function CONCAT_WS( string delimiter, string str1, string str2...strn )
for ex:
hive -e "select CONCAT_WS(',', cola, colb, colc..., coln) from Mytable" > /home/user/Mycsv.csv
I had a similar issue and this is how I was able to address it.
Step 1 - Loaded the data from Hive table into another table as follows
DROP TABLE IF EXISTS TestHiveTableCSV;
CREATE TABLE TestHiveTableCSV
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n' AS
SELECT <column list> FROM TestHiveTable;
Step 2 - Copied the blob from Hive warehouse to the new location with appropriate extension
Start-AzureStorageBlobCopy `
    -DestContext $destContext `
    -SrcContainer "Source Container" `
    -SrcBlob "hive/warehouse/TestHiveTableCSV/000000_0" `
    -DestContainer "Destination Container" `
    -DestBlob "CSV/TestHiveTable.csv"
hive --outputformat=csv2 -e "select * from yourtable" > my_file.csv
or
hive --outputformat=csv2 -e "select * from yourtable" > [your_path]/file_name.csv
For tsv, just change csv to tsv in the above queries and run your queries
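For example (assuming your Hive version supports the tsv2 output format):
hive --outputformat=tsv2 -e "select * from yourtable" > my_file.tsv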
The default separator is "^A". In the Python language, it is "\x01".
When I want to change the delimiter, I use SQL like:
SELECT col1, delimiter, col2, delimiter, col3, ... FROM table
Then I regard delimiter + "^A" as the new delimiter.
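For example, with a comma as the injected delimiter column (my_table and its columns are illustrative):
SELECT col1, ',', col2, ',', col3 FROM my_table;
-- each output line then looks like: val1^A,^Aval2^A,^Aval3
-- so the columns can be recovered by splitting on the "^A,^A" sequence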
I tried various options, but this would be one of the simplest solution for Python Pandas:
hive -e 'select books from table' | grep "|" > temp.csv
import pandas as pd
df = pd.read_csv("temp.csv", sep='|')
You can also use tr "|" "," to convert "|" to ","
Similar to Ray's answer above, Hive View 2.0 in Hortonworks Data Platform also allows you to run a Hive query and then save the output as csv.
If you are doing it from Windows, you can use the Python script hivehoney to extract table data to a local CSV file.
It will:
Login to bastion host.
pbrun.
kinit.
beeline (with your query).
Save echo from beeline to a file on Windows.
Execute it like this:
set PROXY_HOST=your_bastion_host
set SERVICE_USER=you_func_user
set LINUX_USER=your_SOID
set LINUX_PWD=your_pwd
python hh.py --query_file=query.sql
Just to cover the follow-up steps after kicking off the query:
INSERT OVERWRITE LOCAL DIRECTORY '/home/lvermeer/temp'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
select books from table;
In my case, the data generated under the temp folder is in deflate format, and it looks like this:
$ ls
000000_0.deflate
000001_0.deflate
000002_0.deflate
000003_0.deflate
000004_0.deflate
000005_0.deflate
000006_0.deflate
000007_0.deflate
Here's the command to decompress the deflate files and put everything into one csv file:
hadoop fs -text "file:///home/lvermeer/temp/*" > /home/lvermeer/result.csv
I may be late to this one, but this might help:
echo "COL_NAME1|COL_NAME2|COL_NAME3|COL_NAME4" > SAMPLE_Data.csv
hive -e '
select distinct concat(COL_1, "|",
COL_2, "|",
COL_3, "|",
COL_4)
from table_Name where clause if required;' >> SAMPLE_Data.csv
This shell command prints the output in CSV format to output.txt without the column headers:
$ hive --outputformat=csv2 -f 'hivedatascript.hql' --hiveconf hive.cli.print.header=false > output.txt
Use the command:
hive -e "use [database_name]; select * from [table_name] LIMIT 10;" > /path/to/file/my_file_name.csv
I had a huge dataset whose details I was trying to organize, to determine the types of attacks and the number of each type. An example that I used in practice that worked (with a little more detail) goes something like this:
hive -e "use DataAnalysis;
select attack_cat,
case when attack_cat == 'Backdoor' then 'Backdoors'
when length(attack_cat) == 0 then 'Normal'
when attack_cat == 'Backdoors' then 'Backdoors'
when attack_cat == 'Fuzzers' then 'Fuzzers'
when attack_cat == 'Generic' then 'Generic'
when attack_cat == 'Reconnaissance' then 'Reconnaissance'
when attack_cat == 'Shellcode' then 'Shellcode'
when attack_cat == 'Worms' then 'Worms'
when attack_cat == 'Analysis' then 'Analysis'
when attack_cat == 'DoS' then 'DoS'
when attack_cat == 'Exploits' then 'Exploits'
when trim(attack_cat) == 'Fuzzers' then 'Fuzzers'
when trim(attack_cat) == 'Shellcode' then 'Shellcode'
when trim(attack_cat) == 'Reconnaissance' then 'Reconnaissance' end,
count(*) from actualattacks group by attack_cat;">/root/data/output/results2.csv
