How to get the SSIS expression for leftover odd rows right? - sql-server

I'm trying to come up with an expression in a Conditional Split that returns whatever rows are left over.
For example, I have a flat file with 10 rows. If I split the file into groups of 2 rows, I get 5 files with 2 rows each. But if I split it by 3, I get only 3 files with 3 rows each, so where does the leftover row (the modulus remainder) go? I tried the Conditional Split and it was fine, but the leftover-rows output did not come out right. Please take a look at my images and tell me if I have issues with my expression in the Conditional Split and the For Loop.
I have these variables:
@Counter = 0
@ReCount = total count that comes from the Row Count transformation
@SplitRow = the given number to divide by
@EndCount = @ReCount / @SplitRow
I need the right expression for the leftover rows (the modulus remainder) whenever the row count doesn't divide evenly.

Given that you already know the rownumber for each row, the rowcount of the entire file, and the split number of rows in each destination file, the expression that flags any leftover rows from your splitting is: any row for which the following returns true:
rownumber > (rowcount - (rowcount % split))
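As a quick check with the numbers from the question: with 10 rows split by 3, 10 % 3 = 1, so the condition becomes rownumber > (10 - 1) and only row 10 lands in the leftover output. Written against the variables above (plus an assumed @RowNumber variable holding each row's running number, e.g. populated by a Script Component, since the question doesn't show one), the Conditional Split expression would look like:
@[User::RowNumber] > (@[User::ReCount] - (@[User::ReCount] % @[User::SplitRow]))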

Related

Is there an Excel array formula to return 1 for the first occurrence of a value, and 0 for subsequent occurrences

I need an Excel formula that returns 1 for every unique text value in a sequence and 0 for duplicates in the same sequence.
I included a screenshot.
Desired output: 1 for every unique text value and 0 for its duplicates, but for the next number in ID the counting restarts from zero.
Thanks
Gianluca
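One way to get this behavior is COUNTIFS with expanding ranges; a sketch, assuming the IDs are in column A and the text values in column B, with data starting in row 2 (enter in C2 and fill down):
=IF(COUNTIFS($A$2:A2, A2, $B$2:B2, B2) = 1, 1, 0)
The expanding ranges count how many times the current ID/text pair has occurred so far, so the formula returns 1 on the first occurrence within an ID and 0 on its later duplicates, restarting naturally for each new ID.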

MATLAB Extract all rows between two variables with a threshold

I have a cell array called BodyData in MATLAB that has around 139 columns and 3,500-odd rows of skeletal tracking data.
I need to extract all rows between two string values (these are timestamps of when an event happened) that I have,
e.g.
BodyData{} =
Column:          1          2        3
'10:15:15.332'   'BASE05'   ...
...
'10:17:33.230'   'BASE05'   ...
The two timestamps should match a value in the array but might also be within a few ms of those in the array e.g.
TimeStamp1 = '10:15:15.560'
TimeStamp2 = '10:17:33.233'
I have several questions!
How can I return an array of all the data between the two string values, plus or minus a small threshold of, say, 100 ms?
Can I also add another condition that the string value in column 2 must match, and otherwise ignore the row? For example, only return the rows between the two timestamps whose second column is 'BASE02'.
Many thanks,
The best approach to the first part of your problem is probably to change from strings to numeric date values. In MATLAB this can be done quite painlessly with datenum.
For the second part you can just use logical indexing... this is where you put a condition (i.e. that the second column is 'BASE02') within the indexing expression.
A self-contained example:
% some example data:
BodyData = {'10:15:15.332', 'BASE05', 'foo'; ...
            '10:15:16.332', 'BASE02', 'bar'; ...
            '10:15:17.332', 'BASE05', 'foo'; ...
            '10:15:18.332', 'BASE02', 'foo'; ...
            '10:15:19.332', 'BASE05', 'bar'};
% create column vector of numeric times, and define start/end times
dateValues = datenum(BodyData(:, 1), 'HH:MM:SS.FFF');
startTime = datenum('10:15:16.100', 'HH:MM:SS.FFF');
endTime = datenum('10:15:18.500', 'HH:MM:SS.FFF');
% select data in range, and where second column is 'BASE02'
BodyData(dateValues > startTime & dateValues < endTime & strcmp(BodyData(:, 2), 'BASE02'), :)
Returns:
ans =
'10:15:16.332' 'BASE02' 'bar'
'10:15:18.332' 'BASE02' 'foo'
References: datenum manual page, matlab help page on logical indexing.
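The ±100 ms tolerance from the question isn't applied above; a minimal sketch of adding it, reusing the variables from the example (datenum values are in days, so divide seconds by 86400):
% 100 ms tolerance in datenum units (days)
tol = 0.100 / 86400;
% widen the window by the tolerance on both sides
inRange = dateValues > (startTime - tol) & dateValues < (endTime + tol);
BodyData(inRange & strcmp(BodyData(:, 2), 'BASE02'), :)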

Link two tables based on conditions in matlab

I am using MATLAB to prepare my dataset to run it through certain data mining models, and I am facing an issue with linking the data between two of my tables.
I have two tables, A and B, which contain sequential recordings of certain values at certain timestamps, and I want to create a third table, C, in which I will add columns of both A and B in the same rows according to some conditions.
Tables A and B don't have the same number of rows (A has more measurements), but they both have two columns:
1st column: time of the recording (hh:mm:ss) and
2nd column: recorded value at that time
Columns of A and B are going to be added to table C when all the following conditions hold:
The time difference between A and B is more than 3 sec but less than 5 sec.
The recorded value of A is 40%-50% of the recorded value of B.
Any help would be greatly appreciated.
For the first condition you need something like [row,col,val] = find((A(:,1)-B(:,1)) > threeSec & (A(:,1)-B(:,1)) < fiveSec), where threeSec and fiveSec are those durations in datenum units and you need datenum or equivalent to transform your timestamps first; note the elementwise & (the short-circuit && only works on scalars). For the second condition this works the same: use [row,col,val] = find(A(:,2) > 0.4*B(:,2) & A(:,2) < 0.5*B(:,2))
datenum allows you to transform your arrays, so do that first:
A(:,1) = datenum(A(:,1));
B(:,1) = datenum(B(:,1));
you might need to check the documentation on datenum, regarding the format your string is in.
time1 = [3 5] / 86400;
creates the thresholds for 3 and 5 seconds; datenum values are in days, so divide seconds by 86400 (the datenum([0 0 0 0 0 s]) form is a common pitfall here, since datenum([0 0 0]) is not zero). All combined:
A(:,1) = datenum(A(:,1));
B(:,1) = datenum(B(:,1));
time1 = [3 5] / 86400; % 3 and 5 seconds, in days
[row1,col1,val1] = find((A(:,1)-B(:,1)) > time1(1) & (A(:,1)-B(:,1)) < time1(2));
[row2,col2,val2] = find(A(:,2) > 0.4*B(:,2) & A(:,2) < 0.5*B(:,2));
The variables row and col you might not need when you only want the values, though. val1 contains the values of condition 1, val2 those of condition 2. If you want both conditions to hold at the same time, use both in the find command:
[row3,col3,val3] = find((A(:,1)-B(:,1)) > time1(1) & ...
    (A(:,1)-B(:,1)) < time1(2) & A(:,2) > 0.4*B(:,2) & ...
    A(:,2) < 0.5*B(:,2));
The actual adding of your two arrays based on the conditions:
C = A(row3,2)+B(row3,2);
Thank you for your response and help! However, for the time I followed a different approach: converting hh:mm:ss to seconds, which will make the comparison easier later on:
dv1 = datevec(A, 'dd.mm.yyyy HH:MM:SS.FFF ');
secs = [3600,60,1];
dv1(:,6) = floor(dv1(:,6));
timestamp = dv1(:,4:6)*secs.';
Now I am working on combining both the time and weight conditions in a piece of code that will run. Should I use an if condition inside a for loop, or is a for loop not necessary?
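For what it's worth, a for loop is generally unnecessary here; the logical indexing from the answer above combines both conditions in one vectorized expression. A sketch, assuming A and B have already been trimmed to the same number of rows, with the first column holding the seconds computed above and the second column the recorded values:
% elementwise comparisons; & (not &&) is required for arrays
dt = A(:,1) - B(:,1);        % time difference in seconds
match = dt > 3 & dt < 5 & A(:,2) > 0.4*B(:,2) & A(:,2) < 0.5*B(:,2);
C = A(match,2) + B(match,2); % combine the matching rows, as in the answer above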

Print words from the corresponding line numbers

Hello Everyone,
I have two files, File1 and File2, which have the following data.
File1:
TOPIC:topic_0 30063951.0
2 19195200.0
1 7586580.0
3 2622580.0
TOPIC:topic_1 17201790.0
1 15428200.0
2 917930.0
10 670854.0
and so on. There are 15 topics, and each topic has its respective weights. The numbers in the first column, like 2, 1, 3, have corresponding words in File2. For example,
File 2 has:
1 i
2 new
3 percent
4 people
5 year
6 two
7 million
8 president
9 last
10 government
and so on. There are about 10,470 lines of words. So, in short, I should have the corresponding words in the first column of File1 instead of the line numbers. My output should be like:
TOPIC:topic_0 30063951.0
new 19195200.0
i 7586580.0
percent 2622580.0
TOPIC:topic_1 17201790.0
i 15428200.0
new 917930.0
government 670854.0
My Code:
import sys

d1 = {}
n = 1
with open("ap_vocab.txt") as in_file2:
    for line2 in in_file2:
        #print n, line2
        d1[n] = line2[:-1]
        n = n + 1

with open("ap_top_t15.txt") as in_file:
    for line1 in in_file:
        columns = line1.split(' ')
        firstwords = columns[0]
        #print firstwords[:-8]
        if firstwords[:-8] == 'TOPIC':
            print columns[0], columns[1]
        elif firstwords[:-8] != '\n':
            num = columns[0]
            print d1[n], columns[1]
This code runs when I type print d1[2], columns[1], giving the second word in File2 for all the lines. But when the code above is run, it gives an error:
KeyError: 10472
There are 10,472 lines of words in File2. Please help me with what I should do to rectify this. Thanks in advance!
In your first for loop, n is incremented with each line until it reaches a final value of 10472. However, you only set values for d1[n] up to 10471, because you placed the increment after setting d1 for the given n, with these two lines:
d1[n] = line2[:-1]
n = n + 1
Then on the line
print d1[n], columns[1]
in your second for loop (over in_file), you are attempting to access d1[10472], which evidently doesn't exist. Note also that n never changes inside the second loop, so every data line would look up the same key; you most likely meant to look up the line number you just stored in num, i.e. d1[int(num)].
You can either:
Alter your increment so that you do set a value in the d1[10472] position, or simply set the value for the last position after your for loop.
Depending on what you are attempting to print out, you could replace your last line with
print d1[n - 1], columns[1]
to print the value for the final key you currently have set (d1 has integer keys, so d1[-1] would not work here).
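For the output shown in the question, the second loop's lookup needs the line number from the current row rather than the loop counter; a minimal corrected sketch (file names and Python 2 print syntax as in the question):
d1 = {}
with open("ap_vocab.txt") as in_file2:
    # map each 1-based line number to its word
    for n, line2 in enumerate(in_file2, start=1):
        d1[n] = line2.rstrip('\n')

with open("ap_top_t15.txt") as in_file:
    for line1 in in_file:
        columns = line1.split()  # split() also drops the trailing newline
        if not columns:
            continue             # skip blank lines
        if columns[0].startswith('TOPIC'):
            print columns[0], columns[1]
        else:
            # replace the leading line number with its corresponding word
            print d1[int(columns[0])], columns[1]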

How many times is the DBRecordReader getting created?

Below is my understanding of how the Hadoop framework processes text files. Please correct me if I am going wrong somewhere.
Each mapper acts on an input split which contains some records.
For each input split, a record reader is created, which starts reading records from the input split.
If there are n records in an input split, the map method of the mapper is called n times, each time reading a key-value pair via the record reader.
Now coming to the database perspective:
I have a database on a single remote node. I want to fetch some data from a table in this database. I would configure the parameters using DBConfiguration and specify the input table using DBInputFormat. Now say my table has 100 records in all, and I execute an SQL query that generates 70 records in the output.
I would like to know:
How are the InputSplits created in the above case (database)?
What does the input split creation depend on: the number of records my SQL query generates, or the total number of records in the table (database)?
How many DBRecordReaders are created in the above case (database)?
How are the InputSplits created in the above case (database)?
// Split the rows into n-number of chunks and adjust the last chunk
// accordingly
for (int i = 0; i < chunks; i++) {
    DBInputSplit split;
    if ((i + 1) == chunks)
        split = new DBInputSplit(i * chunkSize, count);
    else
        split = new DBInputSplit(i * chunkSize, (i * chunkSize) + chunkSize);
    splits.add(split);
}
That is the how; but to understand what it depends on, let's take a look at chunkSize:
statement = connection.createStatement();
results = statement.executeQuery(getCountQuery());
results.next();
long count = results.getLong(1);
int chunks = job.getConfiguration().getInt("mapred.map.tasks", 1);
long chunkSize = (count / chunks);
So chunkSize takes count = SELECT COUNT(*) FROM tableName and divides it by chunks = mapred.map.tasks, or 1 if that is not defined in the configuration.
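Plugging in some assumed numbers to make the arithmetic concrete:
// Assumed for illustration: count = 100 rows, mapred.map.tasks = 3
// chunkSize = 100 / 3 = 33 (integer division)
// i = 0 -> new DBInputSplit(0, 33)                 // rows [0, 33)
// i = 1 -> new DBInputSplit(33, 66)                // rows [33, 66)
// i = 2 -> last chunk -> new DBInputSplit(66, 100) // rows [66, 100), absorbs the remainder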
Then finally, each input split will have a RecordReader created to handle the type of database you are reading from, for instance MySQLDBRecordReader for a MySQL database.
For more info, check out the source.
It appears @Engineiro explained it well by taking the actual Hadoop source. Just to answer: the number of DBRecordReaders is equal to the number of map tasks.
To explain further, the Hadoop map-side framework creates an instance of DBRecordReader for each map task, in the case where the child JVM is not reused for further map tasks. In other words, the number of input splits equals the value of mapred.map.tasks in the case of DBInputFormat. So each map task's record reader has the meta information to construct the query to get a subset of data from the table. Each record reader executes a pagination type of SQL similar to the one below.
SELECT * FROM (SELECT a.*,ROWNUM dbif_rno FROM ( select * from emp ) a WHERE rownum <= 6 + 7 ) WHERE dbif_rno >= 6
The above SQL is for the second map task, returning the rows between 6 and 13.
To generalize for any type of input format, the number of record readers equals the number of map tasks.
This post covers all of this in more detail: http://blog.cloudera.com/blog/2009/03/database-access-with-hadoop/
