Tab based split on lines misses empty columns - Perl - database

I have a tab separated text file. I read line by line and column by column. I make few changes in each column and write the line to a new file. When I read each column using split function of perl
my #aLastOldElements = split(/\t/, $_);
I miss out on empty columns in the end. For example if file has 33 tab separated columns, out of which 10 in the end are empty. The split function creates array of size 23. I want to have all the columns. Because this way the header of file (33 columns) doesn't match the data (23 columns) and I get errors while writing the file to the database.

split accepts an optional third parameter for the maximum number of fields to return. If this is present, empty trailing fields will not be discarded:
perl -E '#arr = split(/ /, "foo bar ", 100); say scalar #arr'
14
So long as the tabs to separate the empty fields at the end of the line are present, this should always give you 33 fields in the array, even if the last 10 are empty. (In my example, there are 14 fields returned because the string contains 13 separators, even though the specified limit was 100.)
Edit: In answer to the question in the first comment:
perl -wE '#arr = split(/\t/, "foo\tbar\t\thello\t", 100); say $_ || "(empty field)" for #arr'
foo
bar
(empty field)
hello
(empty field)

If you know that the columns should be there, whether or not they have any data, you can just ensure the result yourself.
my #aLastOldElements = split(/\t/, $_);
my $short_fall = 33 - #aLastOldElements;
if ( $short_fall > 0 ) {
push #aLastOldElements => ( '' ) x $short_fall;
}

Related

How to loop assigning characters in a string to variable?

I need to take a string and assign each character to a new string variable for a Text To Speech engine to read out each character separately, mainly to control the speed at which it's read out by adding pauses in between each character.
The string contains a number which can vary in length from 6 digits to 16 digits, and I've put the below code together for 6 digits but would like something neater to handle any different character count.
I've done a fair bit of research but can't seem to find a solution, plus I'm new to Groovy / programming.
OrigNum= "12 34 56"
Num = OrigNum.replace(' ','')
sNum = Num.split("(?!^)")
sDigit1 = sNum[0]
sDigit2 = sNum[1]
sDigit3 = sNum[2]
sDigit4 = sNum[3]
sDigit5 = sNum[4]
sDigit6 = sNum[5]
Edit: The reason for needing a new variable for each character is the app that I'm using doesn't let the TTS engine run any code. I have to specifically declare a variable beforehand for it to be read out
Sample TTS input: "The number is [var:sDigit1] [pause] [var:sDigit2] [pause]..."
I've tried using [var:sNum[0]] [var:sNum[1]] to read from the map instead but it is not recognised.
Read this about dynamically creating variable names.
You could use a map in your stuation, which is cleaner and more groovy:
Map digits = [:]
OrigNum.replaceAll("\\s","").eachWithIndex { digit, index ->
digits[index] = digit
}
println digits[0] //first element == 1
println digits[-1] //last element == 6
println digits.size() // 6
Not 100% sure what you need, but to convert your input String to output you could use:
String origNum = "12 34 56"
String out = 'The number is ' + origNum.replaceAll( /\s/, '' ).collect{ "[var:$it]" }.join( ' [pause] ' )
gives:
The number is [var:1] [pause] [var:2] [pause] [var:3] [pause] [var:4] [pause] [var:5] [pause] [var:6]

Regex string with 2+ different numbers and some optional characters in Snowflake syntax

I would like to check if a specific column in one of my tables meets the following conditions:
String must contain at least three characters
String must contain at least two different numbers [e.g. 123 would work but 111 would not]
Characters which are allowed in the string:
Numbers (0-9)
Uppercase letters
Lowercase letters
Underscores (_)]
Dashes (-)
I have some experience with Regex but am having issues with Snowflake's syntax. Whenever I try using the '?' regex character (to mark something as optional) I receive an error. Can someone help me understand a workaround and provide a solution?
What I have so far:
SELECT string,
LENGTH(string) AS length
FROM tbl
WHERE REGEXP_LIKE(string,'^[0-9]+{3,}[-+]?[A-Z]?[a-z]?$')
ORDER BY length;
Thanks!
Your regex looks a little confusing and invalid, and it doesn't look like it quite meets your needs either. I read this expression as a string that:
Must start with one or more digits, at least 3 or more times
The confusing part to me is the '+' is a quantifier, which is not quantifiable with {3,} but somehow doesn't produce an error for me
Optionally followed by either a dash or plus sign
Followed by an uppercase character zero or one times (giving back as needed)
Followed by and ending with a lowercase character zero or one times (giving back as needed)
Questions
You say that your string must contain 3 characters and at least 2 different numbers, numbers are characters but I'm not sure if you mean 3 letters...
Are you considering the numbers to be characters?
Does the order of the characters matter?
Can you provide an example of the error you are receiving?
Notes
Checking for a second digit that is not the same as the first involves the concept of a lookahead with a backreference. Snowflake does not support backreferences.
One thing about pattern matching with regular expressions is that order makes a difference. If order is not of importance to you, then you'll have multiple patterns to match against.
Example
Below is how you can test each part of your requirements individually. I've included a few regexp_substr functions to show how extraction can work to check if something exists again.
Uncomment the WHERE clause to see the dataset filtered. The filters are written as expressions so you can remove any/all of the regexp_* columns.
select randstr(36,random(123)) as r_string
,length(r_string) AS length
,regexp_like(r_string,'^[0-9]+{3,}[-+]?[A-Z]?[a-z]?$') as reg
,regexp_like(r_string,'.*[A-Za-z]{3,}.*') as has_3_consecutive_letters
,regexp_like(r_string,'.*\\d+.*\\d+.*') as has_2_digits
,regexp_substr(r_string,'(\\d)',1,1) as first_digit
,regexp_substr(r_string,'(\\d)',1,2) as second_digit
,first_digit <> second_digit as digits_1st_not_equal_2nd
,not(regexp_instr(r_string,regexp_substr(r_string,'(\\d)',1,1),1,2)) as first_digit_does_not_appear_again
,has_3_consecutive_letters and has_2_digits and first_digit_does_not_appear_again as test
from table(generator(rowcount => 10))
//where regexp_like(r_string,'.*[A-Za-z]{3,}.*') // has_3_consecutive_letters
// and regexp_like(r_string,'.*\\d+.*\\d+.*') // has_2_digits
// and not(regexp_instr(r_string,regexp_substr(r_string,'(\\d)',1,1),1,2)) // first_digit_does_not_appear_again
;
Assuming the digits need to be contiguous, you can use a javascript UDF to find the number in a string with with the largest number of distinct digits:
create or replace function f(S text)
returns float
language javascript
returns null on null input
as
$$
const m = S.match(/\d+/g)
if (!m) return 0
const lengths = m.map(m=> [...new Set (m.split(''))].length)
const max_length = lengths.reduce((a,b) => Math.max(a,b))
return max_length
$$
;
Combined with WHERE-clause, this does what you want, I believe:
select column1, f(column1) max_length
from t
where max_length>1 and length(column1)>2 and column1 rlike '[\\w\\d-]+';
Yielding:
COLUMN1 | MAX_LENGTH
------------------------+-----------
abc123def567ghi1111_123 | 3
123 | 3
111222 | 2
Assuming this input:
create or replace table t as
select * from values ('abc123def567ghi1111_123'), ('xyz111asdf'), ('123'), ('111222'), ('abc 111111111 abc'), ('12'), ('asdf'), ('123 456'), (null);
The function is even simpler if the digits don't have to be contiguous (i.e. count the distinct digits in a string). Then core logic changes to:
const m = S.match(/\d/g)
if (!m) return 0
const length = [...new Set (m)].length
return length
Hope that's helpful!

Find a list of max values in a text file using awk

I am new to awk and I cannot figure out the correct syntax for the task I am working on.
I have a text file which looks something like this (the content is always sorted but is not always the same, so I cannot hard code the index of the array):
27 abc123
27 abd333
27 dce123
23 adb234
21 abc789
18 bcd213
So apparently the max is 27. However, I want my output to be:
27 abc123
27 abd333
27 dce123
and not the first row only.
The second column is just there, my code always sorts the text file based on the first column.
My code right now set the max as the first value (27 for example), and as it reads through the lines, it stores only the rows with the max values in an array and eventually print out the output.
awk 'BEGIN {max=$1} {if(($1)==max) a[NR]=($0)} END {for (i in a) print a[i]}' file
You can't read fields in a BEGIN block, since it's executed before the file is read.
To find the first record, use the pattern NR == 1. NR is the number of the current record. To find the other records, just check whether $1 equals the max value.
NR == 1 { max = $1 }
$1 == max { print }
Since your input is always sorted, you can optimise this program by exiting after reading all the records with the max value:
$1 != max { exit }

Print words from the corresponding line numbers

Hello Everyone,
I have two files File1 and File2 which has the following data.
File1:
TOPIC:topic_0 30063951.0
2 19195200.0
1 7586580.0
3 2622580.0
TOPIC:topic_1 17201790.0
1 15428200.0
2 917930.0
10 670854.0
and so on..There are 15 topics and each topic have their respective weights. And the first column like 2,1,3 are the numbers which have corresponding words in file2. For example,
File 2 has:
1 i
2 new
3 percent
4 people
5 year
6 two
7 million
8 president
9 last
10 government
and so on.. There are about 10,470 lines of words. So, in short I should have the corresponding words in the first column of file1 instead of the line numbers. My output should be like:
TOPIC:topic_0 30063951.0
new 19195200.0
i 7586580.0
percent 2622580.0
TOPIC:topic_1 17201790.0
i 15428200.0
new 917930.0
government 670854.0
My Code:
import sys
d1 = {}
n = 1
with open("ap_vocab.txt") as in_file2:
for line2 in in_file2:
#print n, line2
d1[n] = line2[:-1]
n = n + 1
with open("ap_top_t15.txt") as in_file:
for line1 in in_file:
columns = line1.split(' ')
firstwords = columns[0]
#print firstwords[:-8]
if firstwords[:-8] == 'TOPIC':
print columns[0], columns[1]
elif firstwords[:-8] != '\n':
num = columns[0]
print d1[n], columns[1]
This code is running when I type print d1[2], columns[1] giving the second word in file2 for all the lines. But when the above code is printed, it is giving an error
KeyError: 10472
there are 10472 lines of words in the file2. Please help me with what I should do to rectify this. Thanks in advance!
In your first for loop, n is incremented with each line until reaching a final value of 10472. You are only setting values for d1[n] up to 10471 however, as you have placed the increment after you set d1 for your given n, with these two lines:
d1[n] = line2[:-1]
n = n + 1
Then on the line
print d1[n], columns[1]
in your second for loop (for in_file), you are attempting to access d1[10472], which evidently doesn't exist. Furthermore, you are defining d1 as an empty Dictionary, and then attempting to access it as if it were a list, such that even if you fix your increment you will not be able to access it like that. You must either use a list with d1 = [], or will have to implement an OrderedDict so that you can access the "last" key as dictionaries are typically unordered in Python.
You can either:
Alter your increment so that you do set a value for d1 in the d1[10472] position, or simply set the value for the last position after your for loop.
Depending on what you are attempting to print out, you could replace your last line with
print d1[-1], columns[1]
to print out the value for the final index position you currently have set.

Reading a text file in MATLAB line by line

I have a CSV file, I want to read this file and do some pre-calculations on each row to see for example that row is useful for me or not and if yes I save it to a new CSV file.
can someone give me an example?
in more details this is how my data looks like: (string,float,float) the numbers are coordinates.
ABC,51.9358183333333,4.183255
ABC,51.9353866666667,4.1841
ABC,51.9351716666667,4.184565
ABC,51.9343083333333,4.186425
ABC,51.9343083333333,4.186425
ABC,51.9340916666667,4.18688333333333
basically i want to save the rows that have for distances more than 50 or 50 in a new file.the string field should also be copied.
thanks
You could actually use xlsread to accomplish this. After first placing your sample data above in a file 'input_file.csv', here is an example for how you can get the numeric values, text values, and the raw data in the file from the three outputs from xlsread:
>> [numData,textData,rawData] = xlsread('input_file.csv')
numData = % An array of the numeric values from the file
51.9358 4.1833
51.9354 4.1841
51.9352 4.1846
51.9343 4.1864
51.9343 4.1864
51.9341 4.1869
textData = % A cell array of strings for the text values from the file
'ABC'
'ABC'
'ABC'
'ABC'
'ABC'
'ABC'
rawData = % All the data from the file (numeric and text) in a cell array
'ABC' [51.9358] [4.1833]
'ABC' [51.9354] [4.1841]
'ABC' [51.9352] [4.1846]
'ABC' [51.9343] [4.1864]
'ABC' [51.9343] [4.1864]
'ABC' [51.9341] [4.1869]
You can then perform whatever processing you need to on the numeric data, then resave a subset of the rows of data to a new file using xlswrite. Here's an example:
index = sqrt(sum(numData.^2,2)) >= 50; % Find the rows where the point is
% at a distance of 50 or greater
% from the origin
xlswrite('output_file.csv',rawData(index,:)); % Write those rows to a new file
If you really want to process your file line by line, a solution might be to use fgetl:
Open the data file with fopen
Read the next line into a character array using fgetl
Retreive the data you need using sscanf on the character array you just read
Perform any relevant test
Output what you want to another file
Back to point 2 if you haven't reached the end of your file.
Unlike the previous answer, this is not very much in the style of Matlab but it might be more efficient on very large files.
Hope this will help.
You cannot read text strings with csvread.
Here is another solution:
fid1 = fopen('test.csv','r'); %# open csv file for reading
fid2 = fopen('new.csv','w'); %# open new csv file
while ~feof(fid1)
line = fgets(fid1); %# read line by line
A = sscanf(line,'%*[^,],%f,%f'); %# sscanf can read only numeric data :(
if A(2)<4.185 %# test the values
fprintf(fid2,'%s',line); %# write the line to the new file
end
end
fclose(fid1);
fclose(fid2);
Just read it in to MATLAB in one block
fid = fopen('file.csv');
data=textscan(fid,'%s %f %f','delimiter',',');
fclose(fid);
You can then process it using logical addressing
ind50 = data{2}>=50 ;
ind50 is then an index of the rows where column 2 is greater than 50. So
data{1}(ind50)
will list all the strings for the rows of interest.
Then just use fprintf to write out your data to the new file
here is the doc to read a csv : http://www.mathworks.com/access/helpdesk/help/techdoc/ref/csvread.html
and to write : http://www.mathworks.com/access/helpdesk/help/techdoc/ref/csvwrite.html
EDIT
An example that works :
file.csv :
1,50,4.1
2,49,4.2
3,30,4.1
4,71,4.9
5,51,4.5
6,61,4.1
the code :
File = csvread('file.csv')
[m,n] = size(File)
index=1
temp=0
for i = 1:m
if (File(i,2)>=50)
temp = temp + 1
end
end
Matrix = zeros(temp, 3)
for j = 1:m
if (File(j,2)>=50)
Matrix(index,1) = File(j,1)
Matrix(index,2) = File(j,2)
Matrix(index,3) = File(j,3)
index = index + 1
end
end
csvwrite('outputFile.csv',Matrix)
and the output file result :
1,50,4.1
4,71,4.9
5,51,4.5
6,61,4.1
This isn't probably the best solution but it works! We can read the CSV file, control the distance of each row and save it in a new file.
Hope it will help!

Resources