Problem exporting from mongo and then importing to SQL Server - sql-server

Question: how do I export from mongo such that I can import into SQL Server if I use $unwind?
I need to use $unwind, which means I can't use mongoexport.exe. mongo.exe produces different JSON output, as shown below, and I can't load that output into SQL Server. I would export as CSV, but my data includes commas. I would use $out to first copy my data to a new collection and then use mongoexport, but I'm querying a production server in the cloud where I only have read access.
To illustrate my problem, I created a collection with one record that has a date field "edited_on". You can see here that the mongoexport output starts with [{"_id":{"$oid":... while the mongo output starts with { "_id" : ObjectId(….
*** MONGOEXPORT
The command:
mongoexport --quiet --host localhost:27017 --db "zzz" -c
"Test_Structures" --fields edited_on --type json --jsonArray --out
C:\export_test.json
The output:
[{"_id":{"$oid":"5aaa1d85b8078250f1000c0e"},"edited_on":{"$date":"2018-03-15T07:15:17.583Z"}}]
I can import this data into SQL with OPENROWSET along with OPENJSON.
Described here: https://www.mssqltips.com/sqlservertip/5295/different-ways-to-import-json-files-into-sql-server/
*** MONGO
The command:
mongo localhost/UW --quiet -eval "db.Test_Structures.aggregate( {
$project: { _id: 1 , edited_on: 1} } )" > C:\aggregate_test.json
The output:
{ "_id" : ObjectId("5aaa1d85b8078250f1000c0e"), "edited_on" :
ISODate("2018-03-15T07:15:17.583Z") }

My coworker answered my question. Use replace() to remove the text in the JSON file that causes the problems, as follows.
Declare @JSON varchar(max)
SELECT @JSON = BulkColumn
FROM OPENROWSET (BULK 'C:\aggregate_test.json', SINGLE_CLOB) as j
SET @JSON = replace(replace(replace(@JSON,'objectid(',''),'isodate(',''),'")','"')
SELECT * FROM OPENJSON (@JSON) With (...)
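The same clean-up can be done before the file ever reaches SQL Server. The following is a minimal sketch in Python, not part of the original answer: the function names are hypothetical, and it assumes the only shell-specific wrappers in the output are ObjectId("...") and ISODate("...") around a quoted string, as in the sample above. It strips the wrappers and parses the documents, which the mongo shell prints one after another without an enclosing array.

```python
import json
import re

# Matches the shell-only wrappers ObjectId("...") and ISODate("...") and
# captures the quoted string inside the parentheses.
_SHELL_WRAPPER = re.compile(r'\b(?:ObjectId|ISODate)\(\s*("(?:[^"\\]|\\.)*")\s*\)')


def shell_to_json(text):
    """Replace ObjectId("x") / ISODate("x") with the plain JSON string "x"."""
    return _SHELL_WRAPPER.sub(r"\1", text)


def load_shell_documents(text):
    """Parse documents printed back to back by the shell into a list of dicts."""
    cleaned = shell_to_json(text)
    decoder = json.JSONDecoder()
    docs = []
    pos = 0
    while True:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(cleaned) and cleaned[pos].isspace():
            pos += 1
        if pos >= len(cleaned):
            break
        doc, pos = decoder.raw_decode(cleaned, pos)
        docs.append(doc)
    return docs
```

Writing json.dumps(load_shell_documents(text)) to a file gives a single JSON array that OPENJSON can read directly. Other shell types such as NumberLong(...) are not handled by this sketch and would need their own pattern.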

Related

Query internal stage Snowflake

Following the steps in the documentation I created a stage and a file format in Snowflake, then staged a csv file with PUT
USE IA;
CREATE OR REPLACE STAGE csv_format_2;
CREATE OR REPLACE FILE FORMAT csvcol26 type='csv' field_delimiter='|';
PUT file://H:\\CSV_SWF_file_format_stage.csv @IA.public.csv_format_2
When I tried to query the staged object:
SELECT a.$1 FROM @csv_format_2 (FORMAT=>'csvcol26', PATTERN=>'CSV_SWF_file_format_stage.csv.gz') a
I got:
SQL Error [2] [0A000]: Unsupported feature 'TABLE'.
Any idea on this error?
The first argument should be FILE_FORMAT instead of FORMAT:
SELECT a.$1
FROM @csv_format_2 (FILE_FORMAT=>'csvcol26',PATTERN=>'CSV_SWF_file_format_stage.csv.gz') a;
Related: Querying Data in Staged Files
Query staged data files using a SELECT statement with the following syntax:
SELECT [<alias>.]$<file_col_num>[.<element>] [ , [<alias>.]$<file_col_num>[.<element>] , ... ]
FROM { <internal_location> | <external_location> }
[ ( FILE_FORMAT => '<namespace>.<named_file_format>', PATTERN => '<regex_pattern>' ) ]
[ <alias> ]

Trying to Export Tables to CSVs from SQL Server

I ran the following script to try to get all tables in my DB exported (trying to backup the data in CSVs).
SELECT 'sqlcmd -S . -d '+DB_NAME()+' -E -s, -W -Q "SET NOCOUNT ON; SELECT * FROM '+table_schema+'.'+TABLE_name+'" > "C:\Temp\'+Table_Name+'.csv"'
FROM [INFORMATION_SCHEMA].[TABLES]
I saved the results as a batch file and ran the batch file as Administrator.
That runs without an error, but I get no data exported. All it does is create blank CSV files.
I ran this as well: 'EXEC sp_configure 'remote access',1 reconfigure'.
Still, nothing is exported. CSVs are created, but no data is exported...
Any thoughts?
I ended up using R to do the task...
library("RODBC")
conn <- odbcDriverConnect('driver={SQL Server};server=Server_Name;database=DB_Name;trusted_connection=true')
data <- sqlQuery(conn, "SELECT * FROM DB.dbo.TBL#1")
write.csv(data,file=paste("C:/Users/TBL#1.csv",sep=""),row.names=FALSE)
data <- sqlQuery(conn, "SELECT * FROM DB.dbo.TBL#2")
write.csv(data,file=paste("C:/Users/TBL#2.csv",sep=""),row.names=FALSE)
Gotta love the IT teams in corporate America...especially when they lock down your system so tight, you need to come up with all kinds of weird hacks just so you can do the job that you were hired to do...
Is there a word for negative synergy?
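For comparison, here is a minimal sketch of the same export in Python, not taken from the answer above. It only assumes a DB-API connection; sqlite3 stands in for SQL Server so the example runs without a server (with SQL Server the connection would come from a driver such as pyodbc). The function names and the sample table are hypothetical. The csv module quotes any field that contains a comma, so embedded commas do not break the file.

```python
import csv
import os
import sqlite3
import tempfile


def export_query_to_csv(conn, query, path):
    """Run query on a DB-API connection and write header plus rows to path."""
    cursor = conn.cursor()
    cursor.execute(query)
    # cursor.description holds one entry per result column; item 0 is the name.
    header = [col[0] for col in cursor.description]
    rows = cursor.fetchall()
    cursor.close()
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
    return len(rows)


def demo():
    """Export a one-row sample table and return the CSV text that was written."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (id INTEGER, name TEXT)")
    conn.execute("INSERT INTO people VALUES (1, 'Smith, John')")
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "people.csv")
        export_query_to_csv(conn, "SELECT id, name FROM people", path)
        with open(path, newline="", encoding="utf-8") as handle:
            return handle.read()
```

To cover every table, the function can be called in a loop over the table names returned by a query against INFORMATION_SCHEMA.TABLES.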

Issues using "-f" flag in CQLSH to run a query.cql file

I'm using cqlsh to add data to Cassandra with the BATCH query and I can load the data with a query using the "-e" flag but not from a file using the "-f" flag. I think that's because the file is local and Cassandra is remote. Details below:
This is a sample of my query (there are more rows to insert, obviously):
BEGIN BATCH;
INSERT INTO keyspace.table (id, field1) VALUES ('1','value1');
INSERT INTO keyspace.table (id, field1) VALUES ('2','value2');
APPLY BATCH;
If I enter the query via the "-e" flag then it works no problem:
>cqlsh -e "BEGIN BATCH; INSERT INTO keyspace.table (id, field1) VALUES ('1','value1'); INSERT INTO keyspace.table (id, field1) VALUES ('2','value2'); APPLY BATCH;" -u username -p password -k keyspace 99.99.99.99
But if I save the query to a text file (query.cql) and call as below, I get the following output:
>cqlsh -f query.cql -u username -p password -k keyspace 99.99.99.99
Using 3 child processes
Starting copy of keyspace.table with columns ['id', 'field1'].
Processed: 0 rows; Rate: 0 rows/s; Avg. rate: 0 rows/s
0 rows imported from 0 files in 0.076 seconds (0 skipped).
Cassandra obviously accepts the command but doesn't read the file. I'm guessing that's because Cassandra is located on a remote server and the file is located locally. The Cassandra instance I'm using is a managed service with other users, so I don't have access to it to copy files into folders.
How do I run this query on a remote instance of Cassandra where I only have CLI access?
I want to be able to use another tool to build the query.cql file and have a batch job run the command with the "-f" flag but I can't work out how I'm going wrong.
You're executing a local cqlsh client so it should be able to access your local query.cql file.
Try removing BEGIN BATCH and APPLY BATCH so that query.cql contains only the 2 INSERT statements, then retry.
One other solution to insert data quickly is to provide a csv file and use the COPY command inside cqlsh. Read this blog post: http://www.datastax.com/dev/blog/new-features-in-cqlsh-copy
Scripting the insert by generating one cqlsh -e '...' call per line is feasible, but it will be horribly slow.

Read stored procedure select results into pandas dataframe

Given:
CREATE PROCEDURE my_procedure
@Param INT
AS
SELECT Col1, Col2
FROM Table
WHERE Col2 = @Param
I would like to be able to use this as:
import pandas as pd
import pyodbc
query = 'EXEC my_procedure @Param = {0}'.format(my_param)
conn = pyodbc.connect(my_connection_string)
df = pd.read_sql(query, conn)
But this throws an error:
ValueError: Reading a table with read_sql is not supported for a DBAPI2 connection. Use an SQLAlchemy engine or specify an sql query
SQLAlchemy does not work either:
import sqlalchemy
engine = sqlalchemy.create_engine(my_connection_string)
df = pd.read_sql(query, engine)
Throws:
ValueError: Could not init table 'my_procedure'
I can in fact execute the statement using pyodbc directly:
cursor = conn.cursor()
cursor.execute(query)
results = cursor.fetchall()
df = pd.DataFrame.from_records(results)
Is there a way to send these procedure results directly to a DataFrame?
Use read_sql_query() instead.
Looks like @joris (+1) already had this in a comment directly under the question but I didn't see it because it wasn't in the answers section.
Use the SQLA engine--apart from SQLAlchemy, Pandas only supports SQLite. Then use read_sql_query() instead of read_sql(). The latter tries to auto-detect whether you're passing a table name or a fully-fledged query but it doesn't appear to do so well with the 'EXEC' keyword. Using read_sql_query() skips the auto-detection and allows you to explicitly indicate that you're using a query (there's also a read_sql_table()).
import pandas as pd
import sqlalchemy
query = 'EXEC my_procedure @Param = {0}'.format(my_param)
engine = sqlalchemy.create_engine(my_connection_string)
df = pd.read_sql_query(query, engine)
https://code.google.com/p/pyodbc/wiki/StoredProcedures
I am not a Python expert, but SQL Server sometimes returns counts for statement executions. For instance, an update will tell you how many rows were updated.
Just use 'SET NOCOUNT ON;' at the front of your batch call. This will remove the counts for inserts, updates, and deletes.
Make sure you are using the correct native client module.
Take a look at this stack overflow example.
It has both a adhoc SQL and call stored procedure example.
Calling a stored procedure python
Good luck
This worked for me after I added SET NOCOUNT ON, thanks @CRAFTY DBA
sql_query = """SET NOCOUNT ON; EXEC db_name.dbo.StoreProc '{0}';""".format(input)
df = pandas.read_sql_query(sql_query , conn)
Using ODBC syntax for calling stored procedures (with parameters instead of string formatting) works for loading dataframes using pandas 0.14.1 and pyodbc 3.0.7. The following examples use the AdventureWorks2008R2 sample database.
First confirm expected results calling the stored procedure using pyodbc:
import pandas as pd
import pyodbc
connection = pyodbc.connect(driver='{SQL Server Native Client 11.0}', server='ServerInstance', database='AdventureWorks2008R2', trusted_connection='yes')
sql = "{call dbo.uspGetEmployeeManagers(?)}"
params = (3,)
cursor = connection.cursor()
rows = cursor.execute(sql, params).fetchall()
print(rows)
Should return:
[(0, 3, 'Roberto', 'Tamburello', '/1/1/', 'Terri', 'Duffy'), (1, 2, 'Terri', 'Duffy',
'/1/', 'Ken', 'Sánchez')]
Now use pandas to load the results into a dataframe:
df = pd.read_sql(sql=sql, con=connection, params=params)
print(df)
Should return:
   RecursionLevel  BusinessEntityID FirstName    LastName OrganizationNode  \
0               0                 3   Roberto  Tamburello            /1/1/
1               1                 2     Terri       Duffy              /1/

  ManagerFirstName ManagerLastName
0            Terri           Duffy
1              Ken         Sánchez
EDIT
Since you can't update to pandas 0.14.1, load the results from pyodbc using pandas.DataFrame.from_records:
# get column names from pyodbc results
columns = [column[0] for column in cursor.description]
df = pd.DataFrame.from_records(rows, columns=columns)
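The two ideas in this answer, passing parameters separately from the SQL text and taking column names from cursor.description, can be tried without a SQL Server instance. The sketch below is an illustration under stated assumptions: sqlite3 stands in for pyodbc (both use "?" placeholders), a plain SELECT stands in for the stored procedure call, and the function and table names are hypothetical.

```python
import sqlite3


def fetch_records(conn, sql, params=()):
    """Execute sql with bound parameters and return rows as a list of dicts."""
    cursor = conn.cursor()
    # The driver binds params itself, so values are never pasted into the SQL.
    cursor.execute(sql, params)
    # Column names come from the first item of each description entry.
    columns = [col[0] for col in cursor.description]
    records = [dict(zip(columns, row)) for row in cursor.fetchall()]
    cursor.close()
    return records


def demo():
    """Query a small in-memory table whose data contains a quote character."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (Col1 TEXT, Col2 INTEGER)")
    conn.executemany(
        "INSERT INTO people VALUES (?, ?)",
        [("O'Brien", 1), ("Duffy", 2)],
    )
    return fetch_records(conn, "SELECT Col1, Col2 FROM people WHERE Col2 = ?", (1,))
```

The returned list of dicts can be passed to pandas.DataFrame(...) to get a frame with named columns. Binding the parameter also sidesteps the quoting problems that string formatting runs into when a value contains a quote.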

How do I output the results of a HiveQL query to CSV?

We would like to write the results of a Hive query to a CSV file. I thought the command should look like this:
insert overwrite directory '/home/output.csv' select books from table;
When I run it, it says it completed successfully but I can never find the file. How do I find this file, or should I be extracting the data in a different way?
Although it is possible to use INSERT OVERWRITE to get data out of Hive, it might not be the best method for your particular case. First let me explain what INSERT OVERWRITE does, then I'll describe the method I use to get tsv files from Hive tables.
According to the manual, your query will store the data in a directory in HDFS. The format will not be csv.
Data written to the filesystem is serialized as text with columns separated by ^A and rows separated by newlines. If any of the columns are not of primitive type, then those columns are serialized to JSON format.
A slight modification (adding the LOCAL keyword) will store the data in a local directory.
INSERT OVERWRITE LOCAL DIRECTORY '/home/lvermeer/temp' select books from table;
When I run a similar query, here's what the output looks like.
[lvermeer@hadoop temp]$ ll
total 4
-rwxr-xr-x 1 lvermeer users 811 Aug 9 09:21 000000_0
[lvermeer@hadoop temp]$ head 000000_0
"row1""col1"1234"col3"1234FALSE
"row2""col1"5678"col3"5678TRUE
Personally, I usually run my query directly through Hive on the command line for this kind of thing, and pipe it into the local file like so:
hive -e 'select books from table' > /home/lvermeer/temp.tsv
That gives me a tab-separated file that I can use. Hope that is useful for you as well.
Based on this patch-3682, I suspect a better solution is available when using Hive 0.11, but I am unable to test this myself. The new syntax should allow the following.
INSERT OVERWRITE LOCAL DIRECTORY '/home/lvermeer/temp'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
select books from table;
If you want a CSV file then you can modify Lukas' solutions as follows (assuming you are on a linux box):
hive -e 'select books from table' | sed 's/[[:space:]]\+/,/g' > /home/lvermeer/temp.csv
This is the most CSV-friendly way I found to output the results of HiveQL.
You don't need any grep or sed commands to format the data. Hive supports this itself; you just need to add the outputformat option.
hive --outputformat=csv2 -e 'select * from <table_name> limit 20' > /path/toStore/data/results.csv
You should use CREATE TABLE AS SELECT (CTAS) statement to create a directory in HDFS with the files containing the results of the query. After that you will have to export those files from HDFS to your regular disk and merge them into a single file.
You also might have to do some trickery to convert the files from '\001' - delimited to CSV. You could use a custom CSV SerDe or postprocess the extracted file.
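As one way to do that post-processing, here is a minimal sketch in Python. It is an illustration, not part of the answer: the function name is hypothetical, and it assumes the extracted file uses Hive's default '\001' field separator with one row per line and no newlines inside fields.

```python
import csv
import io

# Hive's default field separator, written '\001' or ^A, is the byte 0x01.
HIVE_DEFAULT_DELIMITER = "\x01"


def hive_text_to_csv(lines, out, delimiter=HIVE_DEFAULT_DELIMITER):
    """Convert delimiter-separated lines to CSV on out; return the row count."""
    writer = csv.writer(out)
    count = 0
    for line in lines:
        # Drop the line ending, then split the row into its fields.
        writer.writerow(line.rstrip("\r\n").split(delimiter))
        count += 1
    return count


def demo():
    """Convert two sample rows, one with an embedded comma, and return the CSV."""
    buffer = io.StringIO()
    hive_text_to_csv(["row1\x01a,b\x011234\n", "row2\x01c\x015678\n"], buffer)
    return buffer.getvalue()
```

With real files, lines would be an open file handle for a part file such as 000000_0, and out a file opened with newline="". The csv writer quotes fields that contain commas, which a plain search-and-replace of the delimiter would not do.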
You can use INSERT … DIRECTORY …, as in this example:
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/ca_employees'
SELECT name, salary, address
FROM employees se
WHERE se.state = 'CA';
OVERWRITE and LOCAL have the same interpretations as before and paths are interpreted following the usual rules. One or more files will be written to /tmp/ca_employees, depending on the number of reducers invoked.
If you are using HUE this is fairly simple as well. Simply go to the Hive editor in HUE, execute your hive query, then save the result file locally as XLS or CSV, or you can save the result file to HDFS.
I was looking for a similar solution, but the ones mentioned here would not work. My data had all variations of whitespace (space, newline, tab) chars and commas.
To make the column data tsv safe, I replaced all \t chars in the column data with a space, and executed python code on the commandline to generate a csv file, as shown below:
hive -e 'tab_replaced_hql_query' | python -c 'exec("import sys;import csv;reader = csv.reader(sys.stdin, dialect=csv.excel_tab);writer = csv.writer(sys.stdout, dialect=csv.excel)\nfor row in reader: writer.writerow(row)")'
This created a perfectly valid csv. Hope this helps those who come looking for this solution.
You can use hive string function CONCAT_WS( string delimiter, string str1, string str2...strn )
For example:
hive -e "select CONCAT_WS(',',cola,colb,colc...,coln) from Mytable" > /home/user/Mycsv.csv
I had a similar issue and this is how I was able to address it.
Step 1 - Loaded the data from Hive table into another table as follows
DROP TABLE IF EXISTS TestHiveTableCSV;
CREATE TABLE TestHiveTableCSV
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n' AS
SELECT Column List FROM TestHiveTable;
Step 2 - Copied the blob from Hive warehouse to the new location with appropriate extension
Start-AzureStorageBlobCopy `
-DestContext $destContext `
-SrcContainer "Source Container" `
-SrcBlob "hive/warehouse/TestHiveTableCSV/000000_0" `
-DestContainer "Destination Container" `
-DestBlob "CSV/TestHiveTable.csv"
hive --outputformat=csv2 -e "select * from yourtable" > my_file.csv
or
hive --outputformat=csv2 -e "select * from yourtable" > [your_path]/file_name.csv
For tsv, just change csv to tsv in the above queries and run your queries
The default separator is "^A". In python language, it is "\x01".
When I want to change the delimiter, I use SQL like:
SELECT col1, delimiter, col2, delimiter, col3, ... FROM table
Then, regard delimiter+"^A" as a new delimiter.
I tried various options, but this would be one of the simplest solutions for Python Pandas:
hive -e 'select books from table' | grep "|" > temp.csv
df=pd.read_csv("temp.csv",sep='|')
You can also use tr "|" "," to convert "|" to ","
Similar to Ray's answer above, Hive View 2.0 in Hortonworks Data Platform also allows you to run a Hive query and then save the output as csv.
In case you are doing it from Windows you can use Python script hivehoney to extract table data to local CSV file.
It will:
Login to bastion host.
pbrun.
kinit.
beeline (with your query).
Save echo from beeline to a file on Windows.
Execute it like this:
set PROXY_HOST=your_bastion_host
set SERVICE_USER=you_func_user
set LINUX_USER=your_SOID
set LINUX_PWD=your_pwd
python hh.py --query_file=query.sql
To cover the steps that follow after kicking off the query:
INSERT OVERWRITE LOCAL DIRECTORY '/home/lvermeer/temp'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
select books from table;
In my case, the generated data under temp folder is in deflate format,
and it looks like this:
$ ls
000000_0.deflate
000001_0.deflate
000002_0.deflate
000003_0.deflate
000004_0.deflate
000005_0.deflate
000006_0.deflate
000007_0.deflate
Here's the command to unzip the deflate files and put everything into one csv file:
hadoop fs -text "file:///home/lvermeer/temp/*" > /home/lvermeer/result.csv
I may be late to this one, but this may help:
echo "COL_NAME1|COL_NAME2|COL_NAME3|COL_NAME4" > SAMPLE_Data.csv
hive -e '
select distinct concat(COL_1, "|",
COL_2, "|",
COL_3, "|",
COL_4)
from table_Name where clause if required;' >> SAMPLE_Data.csv
This shell command prints the output format in csv to output.txt without the column headers.
$ hive --outputformat=csv2 -f 'hivedatascript.hql' --hiveconf hive.cli.print.header=false > output.txt
Use the command:
hive -e "use [database_name]; select * from [table_name] LIMIT 10;" > /path/to/file/my_file_name.csv
I had a huge dataset in which I was trying to determine the types of attacks and the number of each type. An example that worked in my practice (and has a little more detail) goes something like this:
hive -e "use DataAnalysis;
select attack_cat,
case when attack_cat == 'Backdoor' then 'Backdoors'
when length(attack_cat) == 0 then 'Normal'
when attack_cat == 'Backdoors' then 'Backdoors'
when attack_cat == 'Fuzzers' then 'Fuzzers'
when attack_cat == 'Generic' then 'Generic'
when attack_cat == 'Reconnaissance' then 'Reconnaissance'
when attack_cat == 'Shellcode' then 'Shellcode'
when attack_cat == 'Worms' then 'Worms'
when attack_cat == 'Analysis' then 'Analysis'
when attack_cat == 'DoS' then 'DoS'
when attack_cat == 'Exploits' then 'Exploits'
when trim(attack_cat) == 'Fuzzers' then 'Fuzzers'
when trim(attack_cat) == 'Shellcode' then 'Shellcode'
when trim(attack_cat) == 'Reconnaissance' then 'Reconnaissance' end,
count(*) from actualattacks group by attack_cat;">/root/data/output/results2.csv
