Getting Source only and mismatch data with tablediff utility - sql-server

I'm using tablediff utility to transfert data from serval databases sources to a destination database and I get a result having all the differences between the source and destination databases with something like this
Dest. Only N'1027' N'799' N'91443' N'1'
Mismatch N'103A' N'799' N'13010' N'1' DATE_CURRENT DATE_OPERATION MATRICULE_UTILISATEUR QTE QTE_FINAL QTE_INIT QTE_OPERATION REFERENCE_DOCUMENT TYPE_DOCUMENT
Src. Only N'103A' N'310' N'30129' N'1'
so the generated sql file contain delete the Dest. Only rows, update the Mismatch rows and insert the Src. Only rows
My question is: Is there any way using tablediff to get the result of only Mismatch and Src. Only rows??

In the end of your tablediff tool command add the following
-dt -et DiffResults
It will drop an existing table with name DiffResults and create a new one in the destination server and database.
Then you can query the DiffResults table to get the desired rows.
In my test I run the following
SELECT * FROM DiffResults
WHERE MSdifftool_ErrorDescription in ('Mismatch','Src. Only')
or
SELECT * FROM DiffResults
WHERE MSdifftool_ErrorCode in (0,2) -- 0 is for 'Mismatch'; 1 is for 'Dest. Only' and 2 is for 'Src. Only'
Some more details can be found here - https://technet.microsoft.com/en-us/library/ms162843.aspx

If you want to use results from command line, you can pipe output with findstr:
tablediff <your parameters> | findstr /i "^Mismatch ^Src"

Related

SSIS foreach loop to group all unique customers in a table and write them to their own file

I have a table which stores all of my customers and their invoices (less than 5k total), I want to to use a foreach loop container to write each one of these (customers) to their own file listing their own invoices.
I have used a foreach loop container to read/load/write files before so I understand that part but how do I apply the foreach loop on the AccountNumber as the enumerator?
For each file, I only want that customers info.
My table:
AccountNumber InvoiceNumber OriginalCharge
A255 2017-11 225.00
A255 2017-12 13.50
A255 2018-01 25.00
D870 2017-09 7.25
D870 2017-10 10.00
R400 2016-12 100.00
R400 2017-03 5.00
R400 2017-04 7.00
R400 2017-09 82.00
So this would produce 3 files and would include the invoices/original charge for the given customers.
File 1 = Customer A255
File 2 = Customer D870
File 3 = Customer R400
Or should I approach this differently?
Environment: SQL Server 2014
SSIS-2012
Thanks!
You'll need to apply a few different recipes to make this work.
Dynamic file name
Source query parameterization
Shredding record set
Assumptions
You have three SSIS Variables:
CurrentAccountNumber String (initial value of A255)
rsAccountNumbers Object
FileNameOutput String EvaluateAsExpression = True "C:\\ssisdata\output\\" + #[User::CurrentAccountNumber] + ".txt"
The package would look something like
[Execute SQL Task] -> [Foreach (Ado.net) Enumerator] -> [Data Flow Task]
Execute SQL Task
Set the resultset type to Full
Your source query would be SELECT DISTINCT AccountNumber FROM dbo.Invoices;
In the Results tab, assuming OLE DB Connection Manager, click add result button and use a "name" of 0 and the variable becomes User::rsAccountNumbers
Foreach (Ado.net) Enumerator
Set your enumerator type as Ado.NET and single table. Use the variable User::rsAccountNumbers and assign the zeroeth element to our variable CurrentAccountNumber
Run the package as is to verify the Execute SQL Task is returning a resultset that the Foreach can shred. Observe that each loop in the enumerator results in the value of our Variable FileNameOutput changing (C:\ssisdata\output\A255.txt, C:\ssisdata\output\D870.txt, etc)
Data flow task
This a simple flow
[OLE DB Source] -> [Flat File Destination]
Configure your OLE DB Source to be a Query SELECT * FROM dbo.Invoices WHERE D.AccountNumber = ?;
Click the Parameter button. Configure the name 0 to be #[User::CurrentAccountNumber]
Flat File Destination - Connect the Source to the destination, create a new
Flat File Connection Manager and connect the columns.
Dynamic file name
The final piece will be to edit the Flat File Connection manager created above to use the variable FileNameOutput instead of the hard coded value you indicated. Right click on the Flat File Connection manager and select Properties. In the resulting properties window, find the Expressions property and click the ellipses (...) In the lefthand window, find ConnectionString and in the righthand window, use #[User::FileNameOutput]
F5 and the package should fire up and generate an output file per account number.

Use result of sql query to replace text on multiple xml files

i have a table on sql like this:
CD_MATERIAL | CD_IDENTIFICACAO
1 | 002323
2 | 00322234
... | ...
AND SO ON (5000+ lines)
I need to use that info to search and replace multiple external xml files on a folder (all the tags on those XML had numbers like the CD_IDENTIFICACAO from sql query, i need to replace with corresponding cd_material from sql query "ex.: 002323 becomes 1)
I used this query to extract all the cd_identificacao to use on Notepad++:
declare #result varchar(max)
select #result = COALESCE(#result + '', '') + CONCAT('(',CD_IDENTIFICACAO,')|') from TBL_MATERIAIS WHERE CD_IDENTIFICACAO <> '' ORDER BY CD_MATERIAL
select #result
That would bring me ex.:
(1TEC45D025)|(1TEC800039)|(999999999)|(542251)|(2TEC58426)|(234852)
and changed the parameters to get the replace ex.:
(? 2000)|(? 2001)|(? 2002)|(? 2003)|(? 2004)|(? 2005)
but i don't know how to add a number (increment) on front of "?" so notepad++ would understand it (search and replace would have 5000+ results, so it's not pratical to manually add the increment).
I was able to get a workaround for this. I've used this query to get all the the terms for find and replace i needed (1 per line)
select concat('<cProd>',cd_identificacao,'</cProd>'), concat('<cProd>',cd_material,'</cProd>') from tbl_materiais where cd_identificacao <> '' order by cd_material
That would result in:
<cProd>1TEC460054</cProd> <cProd>1</cProd>
<cProd>1TEC240035</cProd> <cProd>2</cProd>
(i added the tag too to make sure no other information could be replaced as there were many number combinations that could lead to incorrect replacement)
then pasted it on a txt and i used the notepad++ to replace the space between column 1 and 2 for /r/n wich would result in:
<cProd>1TEC460054</cProd>
<cProd>1</cProd>
<cProd>1TEC240035</cProd>
<cProd>2</cProd>
then i used "Ecobyte Replace Text" Tool, pasted my result file as new selection in bottom frame, loaded all my files on a new replace group on top frame (on properties of the group, u can change directory and options), then executed the replacement, it worked perfectly.
Thx.

how to avoid re-inserting data into sql table while re-running SSIS package that loads data from flat file to SQL table?

I have a flat file with the following data
Id,FirstName,LastName,Address,PhoneNumber
1,ben,afflick,xyz Address,5014123746
3,christina,smith,test address,111000110
1,ben,afflick,xyz Address,5014123746
3,christina,smith,test address,111000110
4,nash,gordon,charlotte NC ADDRESS,111200110
I have created a SSIS package that has flat file source , and an aggregate function that makes sure that only unique rows are inserted and not duplicate records from the flat file , and SQL table as my destination.
Everything is fine when i run the package , i am getting below output in SQL table
Id FName LName Address phoneNumber
1 ben afflick xyz Address 5014123746
4 nash gordon charlotte NC ADDRESS 111200110
3 christina smith test address 111000110
But when i add some new data to the flat file as below
Id,FirstName,LastName,Address,PhoneNumber
1,ben,afflick,xyz Address,5014123746
3,christina,smith,test address,111000110
1,ben,afflick,xyz Address,5014123746
3,christina,smith,test address,111000110
4,nash,gordon,charlotte NC ADDRESS,111200110
5,abc,xyz,New York,9999988888
and re-run the package the data that is already present in the table is getting re-inserted as below
1 ben afflick xyz Address 5014123746
4 nash gordon charlotte NC ADDRESS 111200110
3 christina smith test address 111000110
1 ben afflick xyz Address 5014123746
5 abc xyz New York 9999988888
4 nash gordon charlotte NC ADDRESS 111200110
3 christina smith test address 111000110
But i DO NOT want this , i don't want data to be inserted that is already present.
I want only the newly added data to get inserted into the SQL table.
Can someone please help me to achieve this?
Another method is to load your file into a staging table in the database and then use a merge statement to insert data into your destination table.
In practice this would look like a data flow from your flat file to your staging table and then a execute sql task containing a merge statement.
You can then update any matching values as well, should you wish to.
merge into table_a
using stage_a
on stage_a.key = table_a.key
when not matched then insert (a,b,c,d) values ( a,b,c,d )
Your data flow task would look something like this. Here, the Flat file source reads the CSV file and then passes the data to Lookup transformation. This transformation will check for existing data in the destination table. If there are no matching records, then the data from CSV file will be sent to the OLE DB Destination otherwise, the data will be discarded only.
lookup Transformation link ---
http://www.codeproject.com/Tips/574437/Term-Lookup-Transformation-in-SSIS

Attempting to use a SUB-SELECT within logparser to retrieve folder information

My end goal is to return data using log parser in a table similar to this
PATH QTY SIZE(KB)
C:\Path\dir 1200 150223
I can picture the query in my mind but I think I am missing something. (It's probably obvious). Here is my query as I have now:
C:\scripts>logparser -i:fs "SELECT f.* FROM (SELECT path FROM C:\DOWNLOADS\*.* WHERE ATTRIBUTES LIKE 'D%') f"
I receive this error: "Error: Syntax Error: : expecting FROM keyword instead of token '*'"
If I change my code a bit to the following, I get another curious error...
C:\scripts>logparser -i:fs "SELECT * FROM (SELECT path FROM C:\DOWNLOADS\*.* WHERE ATTRIBUTES LIKE 'D%')"
The error I receive is: "Cannot open : Error searching for files in folder C:\scripts(SELECT path FROM C:\DOWNLOADS: The filename, directory name, or volume label syntax is incorrect.
I'd like to return the size of the different sub directories below the c:\downloads path. I would adjust the wildcards to further narrow my results.
EDIT - More info
I am hoping to return data from a structure similar to this:
TopFolder
|_SubFolder
| |_SubSubFolder1
| |_SubSubFolder2
| |_SubSubFolder3
|_OtherFolder
Return a table or some form of data like this:
_FolderName___Qty____AvgSize____MaxSize____MinSize
SubSubFolder1 250 334533 45000 445
SubSubFolder2 123 4443 2233 344
....
Well, first of all LogParser does not support nested SELECT's. This said, it's not clear why you need a sub-select. If you want to limit the depth of the directory search to 2, for example, then you can use '-recurse:2'. On the other hand, if you want to sum up all files below a directory, you need to SELECT SUM(Size) and GROUP BY with EXTRACT_PREFIX, specifying the level; for example:
SELECT EXTRACT_PREFIX(Path, 2, '\\') AS MyFolder, SUM(Size)
FROM C:\Downloads\*.*
GROUP BY MyFolder
Taking the information provided by #Gabriele I came up with the following logparser query which fits my needs...
logparser "select EXTRACT_PREFIX(Path,2,'\') as FolderName,Count() AS ItemCount,DIV(sum(size),1024) as KBSize from c:\downloads*. WHERE Attributes NOT LIKE 'D%' group by FolderName HAVING ItemCount > 1" -i:fs
My end goal is to collect stats on network folders populated by automated processes to track progress. The "c:\downloads" source would be changed for another source.

How do I output the results of a HiveQL query to CSV?

we would like to put the results of a Hive query to a CSV file. I thought the command should look like this:
insert overwrite directory '/home/output.csv' select books from table;
When I run it, it says it completeld successfully but I can never find the file. How do I find this file or should I be extracting the data in a different way?
Although it is possible to use INSERT OVERWRITE to get data out of Hive, it might not be the best method for your particular case. First let me explain what INSERT OVERWRITE does, then I'll describe the method I use to get tsv files from Hive tables.
According to the manual, your query will store the data in a directory in HDFS. The format will not be csv.
Data written to the filesystem is serialized as text with columns separated by ^A and rows separated by newlines. If any of the columns are not of primitive type, then those columns are serialized to JSON format.
A slight modification (adding the LOCAL keyword) will store the data in a local directory.
INSERT OVERWRITE LOCAL DIRECTORY '/home/lvermeer/temp' select books from table;
When I run a similar query, here's what the output looks like.
[lvermeer#hadoop temp]$ ll
total 4
-rwxr-xr-x 1 lvermeer users 811 Aug 9 09:21 000000_0
[lvermeer#hadoop temp]$ head 000000_0
"row1""col1"1234"col3"1234FALSE
"row2""col1"5678"col3"5678TRUE
Personally, I usually run my query directly through Hive on the command line for this kind of thing, and pipe it into the local file like so:
hive -e 'select books from table' > /home/lvermeer/temp.tsv
That gives me a tab-separated file that I can use. Hope that is useful for you as well.
Based on this patch-3682, I suspect a better solution is available when using Hive 0.11, but I am unable to test this myself. The new syntax should allow the following.
INSERT OVERWRITE LOCAL DIRECTORY '/home/lvermeer/temp'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
select books from table;
If you want a CSV file then you can modify Lukas' solutions as follows (assuming you are on a linux box):
hive -e 'select books from table' | sed 's/[[:space:]]\+/,/g' > /home/lvermeer/temp.csv
This is most csv friendly way I found to output the results of HiveQL.
You don't need any grep or sed commands to format the data, instead hive supports it, just need to add extra tag of outputformat.
hive --outputformat=csv2 -e 'select * from <table_name> limit 20' > /path/toStore/data/results.csv
You should use CREATE TABLE AS SELECT (CTAS) statement to create a directory in HDFS with the files containing the results of the query. After that you will have to export those files from HDFS to your regular disk and merge them into a single file.
You also might have to do some trickery to convert the files from '\001' - delimited to CSV. You could use a custom CSV SerDe or postprocess the extracted file.
You can use INSERT … DIRECTORY …, as in this example:
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/ca_employees'
SELECT name, salary, address
FROM employees
WHERE se.state = 'CA';
OVERWRITE and LOCAL have the same interpretations as before and paths are interpreted following the usual rules. One or more files will be written to /tmp/ca_employees, depending on the number of reducers invoked.
If you are using HUE this is fairly simple as well. Simply go to the Hive editor in HUE, execute your hive query, then save the result file locally as XLS or CSV, or you can save the result file to HDFS.
I was looking for a similar solution, but the ones mentioned here would not work. My data had all variations of whitespace (space, newline, tab) chars and commas.
To make the column data tsv safe, I replaced all \t chars in the column data with a space, and executed python code on the commandline to generate a csv file, as shown below:
hive -e 'tab_replaced_hql_query' | python -c 'exec("import sys;import csv;reader = csv.reader(sys.stdin, dialect=csv.excel_tab);writer = csv.writer(sys.stdout, dialect=csv.excel)\nfor row in reader: writer.writerow(row)")'
This created a perfectly valid csv. Hope this helps those who come looking for this solution.
You can use hive string function CONCAT_WS( string delimiter, string str1, string str2...strn )
for ex:
hive -e 'select CONCAT_WS(',',cola,colb,colc...,coln) from Mytable' > /home/user/Mycsv.csv
I had a similar issue and this is how I was able to address it.
Step 1 - Loaded the data from Hive table into another table as follows
DROP TABLE IF EXISTS TestHiveTableCSV;
CREATE TABLE TestHiveTableCSV
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n' AS
SELECT Column List FROM TestHiveTable;
Step 2 - Copied the blob from Hive warehouse to the new location with appropriate extension
Start-AzureStorageBlobCopy
-DestContext $destContext
-SrcContainer "Source Container"
-SrcBlob "hive/warehouse/TestHiveTableCSV/000000_0"
-DestContainer "Destination Container"
-DestBlob "CSV/TestHiveTable.csv"
hive --outputformat=csv2 -e "select * from yourtable" > my_file.csv
or
hive --outputformat=csv2 -e "select * from yourtable" > [your_path]/file_name.csv
For tsv, just change csv to tsv in the above queries and run your queries
The default separator is "^A". In python language, it is "\x01".
When I want to change the delimiter, I use SQL like:
SELECT col1, delimiter, col2, delimiter, col3, ..., FROM table
Then, regard delimiter+"^A" as a new delimiter.
I tried various options, but this would be one of the simplest solution for Python Pandas:
hive -e 'select books from table' | grep "|" ' > temp.csv
df=pd.read_csv("temp.csv",sep='|')
You can also use tr "|" "," to convert "|" to ","
Similar to Ray's answer above, Hive View 2.0 in Hortonworks Data Platform also allows you to run a Hive query and then save the output as csv.
In case you are doing it from Windows you can use Python script hivehoney to extract table data to local CSV file.
It will:
Login to bastion host.
pbrun.
kinit.
beeline (with your query).
Save echo from beeline to a file on Windows.
Execute it like this:
set PROXY_HOST=your_bastion_host
set SERVICE_USER=you_func_user
set LINUX_USER=your_SOID
set LINUX_PWD=your_pwd
python hh.py --query_file=query.sql
Just to cover more following steps after kicking off the query:
INSERT OVERWRITE LOCAL DIRECTORY '/home/lvermeer/temp'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
select books from table;
In my case, the generated data under temp folder is in deflate format,
and it looks like this:
$ ls
000000_0.deflate
000001_0.deflate
000002_0.deflate
000003_0.deflate
000004_0.deflate
000005_0.deflate
000006_0.deflate
000007_0.deflate
Here's the command to unzip the deflate files and put everything into one csv file:
hadoop fs -text "file:///home/lvermeer/temp/*" > /home/lvermeer/result.csv
I may be late to this one, but would help with the answer:
echo "COL_NAME1|COL_NAME2|COL_NAME3|COL_NAME4" > SAMPLE_Data.csv
hive -e '
select distinct concat(COL_1, "|",
COL_2, "|",
COL_3, "|",
COL_4)
from table_Name where clause if required;' >> SAMPLE_Data.csv
This shell command prints the output format in csv to output.txt without the column headers.
$ hive --outputformat=csv2 -f 'hivedatascript.hql' --hiveconf hive.cli.print.header=false > output.txt
Use the command:
hive -e "use [database_name]; select * from [table_name] LIMIT 10;" > /path/to/file/my_file_name.csv
I had a huge dataset whose details I was trying to organize and determine the types of attacks and the numbers of each type. An example that I used on my practice that worked (and had a little more details) goes something like this:
hive -e "use DataAnalysis;
select attack_cat,
case when attack_cat == 'Backdoor' then 'Backdoors'
when length(attack_cat) == 0 then 'Normal'
when attack_cat == 'Backdoors' then 'Backdoors'
when attack_cat == 'Fuzzers' then 'Fuzzers'
when attack_cat == 'Generic' then 'Generic'
when attack_cat == 'Reconnaissance' then 'Reconnaissance'
when attack_cat == 'Shellcode' then 'Shellcode'
when attack_cat == 'Worms' then 'Worms'
when attack_cat == 'Analysis' then 'Analysis'
when attack_cat == 'DoS' then 'DoS'
when attack_cat == 'Exploits' then 'Exploits'
when trim(attack_cat) == 'Fuzzers' then 'Fuzzers'
when trim(attack_cat) == 'Shellcode' then 'Shellcode'
when trim(attack_cat) == 'Reconnaissance' then 'Reconnaissance' end,
count(*) from actualattacks group by attack_cat;">/root/data/output/results2.csv

Resources