How to fetch record count from Azure Data Lake Analytics files (files like TXT and CSV)

adl://rs06ipadl01.azuredatalakestore.net/FIA/RDS/old/BANNER/2018/06/15/old_Banner.csv
I need to fetch the record count from the above file.

You can use the built-in extractors like Extractors.Csv and Extractors.Text to get the file content, then use COUNT to count the records. A simple example:
DECLARE @inputFile string = @"input/input124.csv";
DECLARE @outputFile string = @"output/output.csv";

// Get the file
@input =
    EXTRACT col1 string,
            col2 string,
            col3 int
    FROM @inputFile
    USING Extractors.Csv(skipFirstNRows: 1); // skip header row if you have one

// Count the records
@output = SELECT COUNT(*) AS records FROM @input;

// Output the result
OUTPUT @output
TO @outputFile
USING Outputters.Csv(quoting: false);
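If the schema of old_Banner.csv is not known up front, you can instead pull each line into a single string column and count those. A sketch using the path from the question; the output path and the control-character column delimiter (chosen so the whole line lands in one column) are assumptions:
DECLARE @inputFile string = @"adl://rs06ipadl01.azuredatalakestore.net/FIA/RDS/old/BANNER/2018/06/15/old_Banner.csv";
DECLARE @outputFile string = @"/FIA/RDS/old/BANNER/2018/06/15/old_Banner_count.csv"; // assumed output location

// Each whole line becomes one column because '\u0001' should not occur in the data
@lines =
    EXTRACT line string
    FROM @inputFile
    USING Extractors.Text(delimiter: '\u0001', skipFirstNRows: 1); // drop the header row if there is one

@count = SELECT COUNT(*) AS records FROM @lines;

OUTPUT @count
TO @outputFile
USING Outputters.Csv(quoting: false);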

Related

Snowflake: copy into table from external stage: add file name and line number into the table

I have a table created
CREATE OR REPLACE TABLE EMPLOYEE_SKILL
(
FILE_NAME_FULL_S3_PATH VARCHAR,
LINE_NUMBER VARCHAR,
SKILL_ID NUMBER,
EMP_ID NUMBER,
SKILL_NAME VARCHAR(50),
SKILL_LEVEL VARCHAR(50)
);
Note that I have added two columns FILE_NAME_FULL_S3_PATH and LINE_NUMBER. I want these two details of the data ingestion also.
I am trying to ingest into the above table from about 1000 files inside an S3 bucket.
I am using this command
copy into EMPLOYEE_SKILL
from s3://test_bucket/emp/ credentials=(aws_key_id='XXXXXXXXXXXXXX' aws_secret_key='YYYYYYYYYY')
file_format = (type = csv field_delimiter = '|' skip_header = 1)
on_error = 'continue';
How do I make sure the first two columns (the full S3 path of the file and the line number) are populated automatically?
You would need to do something like this:
copy into EMPLOYEE_SKILL(FILE_NAME_FULL_S3_PATH,
LINE_NUMBER,
SKILL_ID,
EMP_ID,
SKILL_NAME,
SKILL_LEVEL )
from (select metadata$filename, metadata$file_row_number, t.$1, t.$2, t.$3, t.$4 from s3://test_bucket/emp/) t
credentials=(aws_key_id='XXXXXXXXXXXXXX' aws_secret_key='YYYYYYYYYY')
file_format = (type = csv field_delimiter = '|' skip_header = 1)
on_error = 'continue';
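If the raw s3:// path is not accepted inside the transformation subquery, the usual pattern is to create a named external stage and reference it with @. A sketch under that assumption (the stage name emp_stage is illustrative):
create or replace stage emp_stage
url = 's3://test_bucket/emp/'
credentials = (aws_key_id='XXXXXXXXXXXXXX' aws_secret_key='YYYYYYYYYY');

copy into EMPLOYEE_SKILL (FILE_NAME_FULL_S3_PATH, LINE_NUMBER, SKILL_ID, EMP_ID, SKILL_NAME, SKILL_LEVEL)
from (
    select metadata$filename, metadata$file_row_number, t.$1, t.$2, t.$3, t.$4
    from @emp_stage t
)
file_format = (type = csv field_delimiter = '|' skip_header = 1)
on_error = 'continue';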

Searching for multiple patterns in a string in T-SQL

In T-SQL my dilemma is that I have to parse a potentially long string (up to 500 characters) for any of over 230 possible values and remove them from the string for reporting purposes. These values are a column in another table; they are all upper case and 4 characters long, with the exception of two that are 5 characters long.
Examples of these values are:
USFRI
PROME
AZCH
TXJS
NYDS
XVIV ...
Example of string before:
"Offered to XVIV and USFRI as back ups. No response as of yet."
Example of string after:
"Offered to and as back ups. No response as of yet."
Pretty sure it will have to be a UDF, but I'm unable to come up with anything other than stripping ALL the upper case characters out of the string with PATINDEX, which is not the objective.
This is unavoidably kludgy, but one way is to split your string into rows; once you have a set of words the rest is easy: simply re-aggregate while ignoring the matching values*:
with t as (
select 'Offered to XVIV and USFRI as back ups. No response as of yet.' s
union select 'Another row AZCH and TXJS words.'
), v as (
select * from (values('USFRI'),('PROME'),('AZCH'),('TXJS'),('NYDS'),('XVIV'))v(v)
)
select t.s OriginalString, s.Removed
from t
cross apply (
select String_Agg(j.[value], ' ') within group(order by Convert(tinyint,j.[key])) Removed
from OpenJson(Concat('["',replace(s, ' ', '","'),'"]')) j
where not exists (select * from v where v.v = j.[value])
)s;
* Requires a fully-supported version of SQL Server.
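On SQL Server 2022 or Azure SQL, where STRING_SPLIT accepts an enable_ordinal argument, the same idea works without the JSON detour. A sketch over the same sample data:
with t as (
    select 'Offered to XVIV and USFRI as back ups. No response as of yet.' s
    union select 'Another row AZCH and TXJS words.'
), v as (
    select * from (values('USFRI'),('PROME'),('AZCH'),('TXJS'),('NYDS'),('XVIV'))v(v)
)
select t.s OriginalString, s.Removed
from t
cross apply (
    -- enable_ordinal = 1 adds an ordinal column, which preserves the word order
    select String_Agg(w.[value], ' ') within group (order by w.ordinal) Removed
    from String_Split(t.s, ' ', 1) w
    where not exists (select * from v where v.v = w.[value])
) s;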
Build a function to do the cleaning of one sentence, then call that function from your query, something like this: SELECT Col1, dbo.fn_ReplaceValue(Col1) AS cleanValue, * FROM MySentencesTable. Your fn_ReplaceValue will be something like the code below. You could also create the table variable outside the function and pass it in as a parameter to speed up the process, but this way it is all self-contained.
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE FUNCTION fn_ReplaceValue(@sentence VARCHAR(500))
RETURNS VARCHAR(500)
AS
BEGIN
    DECLARE @ResultVar VARCHAR(500)
    DECLARE @allValues TABLE (rowID INT, sValues VARCHAR(15))
    DECLARE @id INT = 0
    DECLARE @ReplaceVal VARCHAR(10)
    DECLARE @numberOfValues INT = (SELECT COUNT(*) FROM MyValuesTable)

    -- Populate the table variable with all values to strip out
    INSERT @allValues
    SELECT ROW_NUMBER() OVER (ORDER BY MyValuesCol) AS rowID, MyValuesCol
    FROM MyValuesTable

    SET @ResultVar = @sentence

    -- Strictly less than: @id is incremented before it is used, and looking up a
    -- rowID that does not exist would set @ReplaceVal to NULL and wipe the result
    WHILE (@id < @numberOfValues)
    BEGIN
        SET @id = @id + 1
        SET @ReplaceVal = (SELECT sValues FROM @allValues WHERE rowID = @id)
        SET @ResultVar = REPLACE(@ResultVar, @ReplaceVal, SPACE(0))
    END

    RETURN @ResultVar
END
GO
I suggest creating a table (either temporary or permanent), and loading these 230 string values into this table. Then use it in the following delete:
DELETE
FROM yourTable
WHERE col IN (SELECT col FROM tempTable);
If you just want to view your data sans these values, then use:
SELECT *
FROM yourTable
WHERE col NOT IN (SELECT col FROM tempTable);

Read flat file having array in Azure Data Factory

I need to read a pipe-delimited file where we have an array repeating 30 times. I need to access these array elements, change their sequence, and send them in the output file.
E.g.
Tanya|1|Pen|2|Book|3|Eraser
Raj|11|Eraser|22|Bottle
In the above example, first field is the Customer name. After that we have an array of items ordered - Order ID and Item name.
Could you please suggest how to read these array elements individually so they can be processed further?
You can use a Copy activity in an Azure Data Factory pipeline in which the source is a DelimitedText dataset and the sink is a JSON dataset. If your source and destination files are located in Azure Blob Storage, create a Linked Service to connect them to Azure Data Factory.
In the source dataset properties, select Pipe (|) as the Column delimiter.
In the sink dataset (JSON type) settings, specify the output location path where the file will be saved after conversion.
In the Copy data activity's sink tab, select the File pattern as Array of files, then trigger the pipeline.
If you can control the output file format, then you're better off using two (or three) delimiters: one for the person and the other for the order items. If you can get a third, then use that to split the order item from the order line.
Assuming the data format:
Person Name | 1:M [Order]
And each Order is
order line | item name
You can simply ingest the entire row into a single nvarchar(max) column and then use SQL to break out the data you need.
The following is one such example.
declare @tbl table (d nvarchar(max));
insert into @tbl values ('Tanya|1|Pen|2|Book|3|Eraser'), ('Raj|11|Eraser|22|Bottle');

declare @base table (id int, person varchar(100), total_orders int, raw_orders varchar(max));
declare @output table (id int, person varchar(100), item_id int, item varchar(100));

with a as
(
    select
        CHARINDEX('|', d) idx
        , d
        , ROW_NUMBER() over (order by d) as id /*Or newid()*/
    from @tbl
), b as
(
    select
        id
        , SUBSTRING(d, 0, idx) person
        , SUBSTRING(d, idx + 1, LEN(d) - idx + 1) order_array
    from a
), c as
(
    select id, person, order_array
        , (select count(1) from string_split(order_array, '|')) / 2 orders
    from b
)
insert into @base (id, person, total_orders, raw_orders)
select id, person, orders, order_array from c;

declare @total_persons int = (select count(1) from @base);
declare @person_enu int = 1;

while @person_enu <= @total_persons
BEGIN
    declare @total_orders int = (select total_orders from @base where id = @person_enu);
    declare @raw_orders nvarchar(max) = (select raw_orders from @base where id = @person_enu);
    declare @order_enu int = 1;
    declare @i int = 1;

    print CONCAT('Person ', @person_enu, '. Total orders: ', @total_orders);

    while @order_enu <= @total_orders
    begin
        --declare @id int = (select value from string_split(@raw_orders, '|', 1) where ordinal = @i);
        --declare @val varchar(100) = (select value from string_split(@raw_orders, '|', 1) where ordinal = @i + 1);
        --print concat('Will process order ', @order_enu);
        --print concat('ID:', @i, ' Value:', @i + 1)
        --print concat('ID:', @id, ' Value:', @val)
        INSERT INTO @output (id, person, item_id, item)
        select b.id, b.person, n.value [item_id], v.value [item]
        from @base b
        cross apply string_split(b.raw_orders, '|', 1) n
        cross apply string_split(b.raw_orders, '|', 1) v
        where b.id = @person_enu and n.ordinal = @i and v.ordinal = @i + 1;

        set @order_enu += 1;
        set @i += 2;
    end

    set @person_enu += 1;
END

select * from @output;
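For the two sample rows, the final SELECT returns the following @output (Raj gets id 1 because the ROW_NUMBER above orders by the raw string; without an ORDER BY the row order is not guaranteed):
id  person  item_id  item
1   Raj     11       Eraser
1   Raj     22       Bottle
2   Tanya   1        Pen
2   Tanya   2        Book
2   Tanya   3        Eraser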

Split contents of file into Array[String] in Scala

I have a SQL file that contains several queries. Since it is a SQL file, it has queries delimited with semicolon (;). I want to read the SQL file and have the queries as an Array[String] in Scala.
For example, I have queries.sql file that contains queries like:
select * from table1;
select col1,col2,col3 from table1 where col1 = ''
col2 = '';
select count(*) from table;
I want the output to look like this:
Array("select * from table1","select col1,col2,col3 from table1 where col1 = '' col2 =' '","select count(*) from table")
You might want to try this:
import scala.io.Source
val theArrayYouWant = Source.fromFile(<filename>).getLines().mkString(" ").split(";").map(_.trim)

SQL - Separate string into columns

I am bulk inserting a CSV file into SQL Server 2012. The data is currently pipe (|) delimited and comes in as one long string for each row. I'd like to separate the data into the different columns at each pipe.
Here is how the data looks as its imported:
ID|ID2|Person|Person2|City|State
"1"|"ABC"|"Joe"|"Ben"|"Boston"|"MA"
"2"|"ABD"|"Jack"|"Tim"|"Nashua"|"NH"
"3"|"ADC"|"John"|"Mark"|"Hartford"|"CT"
I'd like to separate the data into the columns at each pipe:
ID ID2 Person Person2 City State
1 ABC Joe Ben Boston MA
2 ABD Jack Tim Nashua NH
3 AFC John Mark Hartford CT
I'm finding it difficult to use CHARINDEX and SUBSTRING functions because of the number of columns in the data. I've also tried to use PARSENAME, since that is a 2012 function, but that's not working either, as all the columns come out as NULL with PARSENAME.
The file contains about 300k rows. I've found a solution using XML, but it is very slow; it takes about a minute to separate the data.
Here's the slow xml solution:
CREATE TABLE #tbl(iddata varchar(200))

DECLARE @i int = 0
WHILE @i < 100000
BEGIN
    SET @i = @i + 1
    INSERT INTO #tbl(iddata)
    SELECT '"1"|"ABC"|"Joe"|"Ben"|"Boston"|"MA"'
    UNION ALL
    SELECT '"2"|"ABD"|"Jack"|"Tim"|"Nashua"|"NH"'
    UNION ALL
    SELECT '"3"|"AFC"|"John"|"Mark"|"Hartford"|"CT"'
END
;WITH XMLData
AS
(
SELECT idData,
CONVERT(XML,'<IDs><id>'
+ REPLACE(iddata,'|', '</id><id>') + '</id></IDs>') AS xmlname
FROM (
SELECT REPLACE(iddata,'"','') as iddata
FROM #tbl
)x
)
SELECT xmlname.value('/IDs[1]/id[1]','varchar(100)') AS ID,
xmlname.value('/IDs[1]/id[2]','varchar(100)') AS ID2,
xmlname.value('/IDs[1]/id[3]','varchar(100)') AS Person,
xmlname.value('/IDs[1]/id[4]','varchar(100)') AS Person2,
xmlname.value('/IDs[1]/id[5]','varchar(100)') AS City,
xmlname.value('/IDs[1]/id[6]','varchar(100)') AS State
FROM XMLData
This will do it for you.
CREATE TABLE #Import (
ID NVARCHAR(MAX),
ID2 NVARCHAR(MAX),
Person NVARCHAR(MAX),
Person2 NVARCHAR(MAX),
City NVARCHAR(MAX),
State NVARCHAR(MAX))
SET QUOTED_IDENTIFIER OFF
BULK INSERT #Import
FROM 'C:\MyFile.csv'
WITH
(
FIRSTROW = 2,
FIELDTERMINATOR = '|',
ROWTERMINATOR = '\n',
ERRORFILE = 'C:\myRubbishData.log'
)
select * from #Import
DROP TABLE #Import
Unfortunately using BULK INSERT will not deal with text qualifiers, so you will end up with "ABC" rather than ABC.
Either remove the text qualifiers from the csv file, or run a replace on your table once the data has been imported.
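For the second option, a sketch of the post-import clean-up against the #Import table from the answer above (run it before the DROP TABLE), stripping the double quotes from every column:
UPDATE #Import
SET ID      = REPLACE(ID, '"', ''),
    ID2     = REPLACE(ID2, '"', ''),
    Person  = REPLACE(Person, '"', ''),
    Person2 = REPLACE(Person2, '"', ''),
    City    = REPLACE(City, '"', ''),
    State   = REPLACE(State, '"', '')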
To save you the pain and misery of having to deal with pipes, I would strongly recommend that you process your input file to convert those pipes into commas, and then use SQL Server's built-in capability to parse CSV into a table.
If you are using Java, replacing the pipes would literally take just one line of code:
String line = "\"1\"|\"ABC\"|\"Joe\"|\"Ben\"|\"Boston\"|\"MA\"";
line = line.replace("|", ","); // use replace, not replaceAll: "|" is a regex alternation and replaceAll would insert a comma at every position
// then write this line back out to file
BULK INSERT YourTable
FROM 'input.csv'
WITH
(
FIRSTROW = 2,
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n',
ERRORFILE = 'C:\CSVDATA\SchoolsErrorRows.csv',
TABLOCK
)
If you cannot work with the BULK suggestions (due to rights) you might speed your query up by about 30% with this:
SELECT AsXml.value('x[1]','varchar(100)') AS ID
,AsXml.value('x[2]','varchar(100)') AS ID2
,AsXml.value('x[3]','varchar(100)') AS Person
,AsXml.value('x[4]','varchar(100)') AS Person2
,AsXml.value('x[5]','varchar(100)') AS City
,AsXml.value('x[6]','varchar(100)') AS State
FROM #tbl
CROSS APPLY(SELECT CAST('<x>' + REPLACE(SUBSTRING(iddata,2,LEN(iddata)-2),'"|"','</x><x>') + '</x>' AS XML)) AS a(AsXml)
