SSIS Text file processing subset of rows - sql-server

I have data in text files that I am trying to process. When I connect to the data source using a Flat File Source, it processes only a subset of the rows in the file. I am using SQL Server 2008 R2. Below is sample data from one of the text files.
PGH90961 Deep South Motors JAGUAR X-TYPE 2.2 D JAGUAR X400 JXX MZJAH511X6BE93589 R/C 2663 TNP990GP 55Q79510 1 0 16/02/2011MR D.J. JOHNSON NULL 0126675530 0827001220 E D.J. JOHNSON JOE De MAGGIO 178928
PGH90961 Deep South Motors FREELANDER DIESEL USED LANDROVER LUL MZLFA23B87H034028 7H034028 WCH432GP 55W74850 14 0 1
PGH90961 Deep South Motors JAGUAR X TYPE 3.0 Jaguar XT 3.0 JXT3.0 MZJAD53G25WE21791 R/C PBW641GP 55Q79630 1 0 16/02/2011MR D BUTCH 0116091630 0116301630 0834559627 E Mr.D BUTCH null JOE De MAGGIO 70091
PJP90961 Deep South Motors JAGUAR X TYPE 3.0 Jaguar XT 3.0 JXT3.0 MZJAD53G25WE21791 R/C PBW641GP 55Q77100 1 0 1
How can I possibly process all rows? Thank you.

Related

Mapping between XML:LANG and LCID in SQL Server full text search

I have searched the whole internet and books and cannot find the mapping between xml:lang and lcid in full text search for multilingual content in SQL Server.
Here is a sample, not confirmed in any Microsoft doc:
Lcid xml:lang
1025 ar
1026 bg
1027 ca
1028 zh-TW
1030 da
1031 de
1033 en
1036 fr
I found some unofficial sources, but I cannot be sure they are correct:
https://github.com/henrylearn2rock/lcids-xml-lang
https://github.com/Apress/pro-full-text-search-in-sql-server-2008/blob/5916c12181cc23becf26b6c99152b1a62c28f6e2/iFTS_Samples/Sample%20Database/Populate_Tables1.sql
Thanks
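For illustration only, the sample pairs above can be kept in a small lookup table. This is a minimal sketch that contains just the eight pairs listed in the question, which are not confirmed by any Microsoft documentation; the names `LCID_TO_XML_LANG`, `XML_LANG_TO_LCID` and `xml_lang_for_lcid` are hypothetical.

```python
# Sample pairs copied from the question; NOT confirmed by Microsoft documentation.
LCID_TO_XML_LANG = {
    1025: "ar",
    1026: "bg",
    1027: "ca",
    1028: "zh-TW",
    1030: "da",
    1031: "de",
    1033: "en",
    1036: "fr",
}

# Reverse table for looking up an LCID from an xml:lang tag.
XML_LANG_TO_LCID = {lang: lcid for lcid, lang in LCID_TO_XML_LANG.items()}


def xml_lang_for_lcid(lcid):
    """Return the xml:lang tag for an LCID, or None if it is not in the sample."""
    return LCID_TO_XML_LANG.get(lcid)
```

Returning `None` for unknown values keeps the gap visible instead of guessing a tag that the sample does not contain.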

How should I format/set up my dataset/dataframe? And factor -> numeric problems

New to R and new to this forum. I tried searching, and I hope I don't embarrass myself by failing to identify previous answers.
So I have my data, and I intend to fit some kind of GLMMs in the end, but that's far in the future; first I'm going to fit some simple GLMs/LMs to learn what I'm doing.
First, about my data:
I have data sampled from 2 "general areas" on opposite sides of the country.
In these general areas there are roughly 50 trakts placed (in a grid, random starting point).
Trakts have been revisited each year for a duration of 4 years.
A trakt contains 16 sample plots. I intend to work at trakt level, so I use the means of the 16 sample plots for each trakt.
2x4x50 = 400 rows (the actual number is 373 rows after removing trakts where not enough plots could be sampled due to terrain etc.)
The data in my Excel file is currently divided like this:
rows = trakts
columns = the measured variables
I have 8-10 columns I want to use.
Short example of how the data looks now:
V1 - predictor, 4 different columns
V2 - response variable = proportional data, 1-4 columns depending on which hypothesis I end up testing
The GLMM in the end would look something like (V2~V1+V1+V1,(area,year))
Area Year Trakt V1 V2
A 2015 1 25.165651 0
A 2015 2 11.16894652 0.1
A 2015 3 18.231 0.16
A 2014 1 3.1222 N/A
A 2014 2 6.1651 0.98
A 2014 3 8.651 1
A 2013 1 6.16416 0.16
B 2015 1 9.12312 0.44
B 2015 2 22.2131 0.17
B 2015 3 12.213 0.76
B 2014 1 1.123132 0.66
B 2014 2 0.000 0.44
B 2014 3 5.213265 0.33
B 2013 1 2.1236 0.268
How should I get started on this?
8 different files?
Nested by trakts (do I start nesting now or later when I'm doing GLMMs)?
I load my data into R through the read.table function.
If I run sapply(dataframe, class),
V1 and V2 are factors, everything else integer.
If I run sapply(dataframe, mode),
everything is numeric.
So finally to my actual problems: I have been trying to do normality tests (only tried Shapiro so far), but I keep getting errors that imply my data is not numeric.
Also, when I run a normality test, do I run only one column and evaluate it before moving on to the next column, or should I run several columns? The entire dataset?
Should I, in my case, run independent normality tests for each of my areas and years?
Hope it didn't end up too cluttered.
Best regards

Sequences in Graph Database

All,
I am new to the graph database area and want to know whether this type of example is applicable to a graph database.
Say I am looking at a baseball game. When each player goes to bat, there are 3 possible outcomes: hit, strikeout, or walk.
For each batter and throughout the baseball season, what I want to figure out is the counts of the sequences.
For example, for batters that went to the plate n times, how many had a particular sequence (e.g., hit/walk/strikeout or hit/hit/hit/hit), and how many of the same batters repeated the same sequence, indexed by time? To explain further, time would let me know whether a particular sequence (e.g., hit/walk/strikeout or hit/hit/hit/hit) occurred at the beginning of the season, in the middle, or in the later half.
For a key-value type database, the raw data would look as follows:
Batter   Time   Game  Event      Bat
-------  -----  ----  ---------  ---
Charles  April  1     Hit        1
Charles  April  1     strikeout  2
Charles  April  1     Walk       3
Doug     April  1     Walk       1
Doug     April  1     Hit        2
Doug     April  1     strikeout  3
Charles  April  2     strikeout  1
Charles  April  2     strikeout  2
Doug     May    5     Hit        1
Doug     May    5     Hit        2
Doug     May    5     Hit        3
Doug     May    5     Hit        4
Hence, my output would appear as follows:
Sequence                 Freq  Unique Batters  Time
-----------------------  ----  --------------  -----
hit                      5000  600             April
walk/strikeout           3000  350             April
strikeout/strikeout/hit  2000  175             April
hit/hit/hit/hit/hit      1000  80              April
hit                      6000  800             May
walk/strikeout           3500  425             May
strikeout/strikeout/hit  2750  225             May
hit/hit/hit/hit/hit      1250  120             May
.                        .     .               .
.                        .     .               .
.                        .     .               .
.                        .     .               .
If this is feasible for a graph database, would it also scale? What if instead of 3 possible outcomes for a batter, there were 10,000 potential outcomes with 10,000,000 batters?
More so, the 10,000 unique outcomes would be sequenced in a combinatoric setting (e.g. 10,000 CHOOSE 2, 10,000 CHOOSE 3, etc.).
My question then is: if a graph database is appropriate, how would you propose setting up a solution?
Many thanks in advance.
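To make the desired output concrete, here is a minimal sketch in plain Python (not a graph database solution) that derives the Sequence / Freq / Unique Batters / Time columns from the sample rows. It assumes that a "sequence" means any contiguous run of events by one batter within one game, and that event names are compared case-insensitively; neither assumption is stated in the question, and the names `ROWS` and `sequence_counts` are hypothetical.

```python
from collections import Counter, defaultdict

# Sample rows from the question: (batter, month, game, event, at-bat number).
ROWS = [
    ("Charles", "April", 1, "Hit", 1),
    ("Charles", "April", 1, "strikeout", 2),
    ("Charles", "April", 1, "Walk", 3),
    ("Doug", "April", 1, "Walk", 1),
    ("Doug", "April", 1, "Hit", 2),
    ("Doug", "April", 1, "strikeout", 3),
    ("Charles", "April", 2, "strikeout", 1),
    ("Charles", "April", 2, "strikeout", 2),
    ("Doug", "May", 5, "Hit", 1),
    ("Doug", "May", 5, "Hit", 2),
    ("Doug", "May", 5, "Hit", 3),
    ("Doug", "May", 5, "Hit", 4),
]


def sequence_counts(rows):
    """Return {(month, sequence): (frequency, unique_batters)}.

    Assumption: a sequence is a contiguous run of events by one batter
    within one game, normalised to lower case.
    """
    # Group events per (batter, month, game) so they can be ordered by at-bat.
    games = defaultdict(list)
    for batter, month, game, event, bat in rows:
        games[(batter, month, game)].append((bat, event.lower()))

    freq = Counter()
    batters = defaultdict(set)
    for (batter, month, _game), events in games.items():
        ordered = [event for _bat, event in sorted(events)]
        # Enumerate every contiguous sub-sequence of this game's events.
        for start in range(len(ordered)):
            for end in range(start + 1, len(ordered) + 1):
                key = (month, "/".join(ordered[start:end]))
                freq[key] += 1
                batters[key].add(batter)
    return {key: (freq[key], len(batters[key])) for key in freq}


counts = sequence_counts(ROWS)
```

With the sample rows, `counts[("April", "hit")]` is `(2, 2)` and `counts[("May", "hit/hit/hit/hit")]` is `(1, 1)`. Enumerating every contiguous run grows quadratically with the number of at-bats per game, so at the larger sizes mentioned in the question the definition of a sequence would probably need to be restricted.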

Calculating Average for specific column in a 2D array

I am new to Python and need your help. I need to calculate the average for a specific column in a very large array. I would like to use the numpy.average function (open to any other suggestions) but cannot figure out a way to select a column by its header (e.g., an average for a Flavor_Score column):
Beer_name  Tester  Flavor_Score  Overall_Score
Coors      Jim     2.0           3.0
Sam Adams  Dave    4.0           4.5
Becks      Jim     3.5           3.5
Coors      Dave    2.0           2.2
Becks      Dave    3.5           3.7
Do I have to transpose the array to get average calculations done for a column? It seems there are many functions for rows in pandas and numpy but relatively few for columns (I could be wrong, of course).
Second question for the same array: is the best way to use the answer from the first question (calculating the average Flavor_Score) to calculate the average Flavor_Score for a specific beer (among different testers)?
Beer_test = "Coors"
for i in Beer_name:
    if i == Beer_test:
        pass  # recurring average calculation
    else:
        pass
I would hope there is a built-in function for this.
Very much appreciate your help!
OK, here is an example of how to do that with pandas.
import pandas as pd

# Test data
df = pd.DataFrame({'Beer_name': ['Coors', 'Sam Adams', 'Becks', 'Coors', 'Becks'],
                   'Tester': ['Jim', 'Dave', 'Jim', 'Dave', 'Dave'],
                   'Flavor_Score': [2, 4, 3.5, 2, 3.5],
                   'Overall_Score': [3, 4.5, 3.5, 2.2, 3.7]})
# Call mean on the DataFrame; numeric_only=True skips the text columns
df.mean(numeric_only=True)
Flavor_Score     3.00
Overall_Score    3.38
Then you can use the groupby feature:
df.groupby('Beer_name').mean(numeric_only=True)
           Flavor_Score  Overall_Score
Beer_name
Becks               3.5            3.6
Coors               2.0            2.6
Sam Adams           4.0            4.5
You can also see how it looks by tester.
df.groupby(['Beer_name','Tester']).mean()
                  Flavor_Score  Overall_Score
Beer_name Tester
Becks     Dave             3.5            3.7
          Jim              3.5            3.5
Coors     Dave             2.0            2.2
          Jim              2.0            3.0
Sam Adams Dave             4.0            4.5
Good beer !
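If pandas is not available, the two questions (average of a column selected by its header, and the same average for one beer) can also be sketched with the standard library alone. This is only an illustrative alternative to the pandas answer; the names `HEADER`, `ROWS` and `column_mean` are hypothetical, and the data is the sample table from the question.

```python
from statistics import mean

# Sample table from the question, kept as a header row plus data rows.
HEADER = ["Beer_name", "Tester", "Flavor_Score", "Overall_Score"]
ROWS = [
    ["Coors", "Jim", 2.0, 3.0],
    ["Sam Adams", "Dave", 4.0, 4.5],
    ["Becks", "Jim", 3.5, 3.5],
    ["Coors", "Dave", 2.0, 2.2],
    ["Becks", "Dave", 3.5, 3.7],
]


def column_mean(header, rows, column, beer=None):
    """Average one column, selected by header name; optionally for one beer."""
    col = header.index(column)        # position of the requested column
    name = header.index("Beer_name")  # position of the beer name column
    values = [row[col] for row in rows if beer is None or row[name] == beer]
    return mean(values)


overall_flavor = column_mean(HEADER, ROWS, "Flavor_Score")
coors_flavor = column_mean(HEADER, ROWS, "Flavor_Score", beer="Coors")
```

No transposing is needed: the header name is translated to a column position once, and each row is indexed at that position.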

Counting Unique Values for Multiple Columns in Excel

I've attempted using Pivot tables and SUMPRODUCT & COUNTIF formulas after looking through possible solutions but haven't found anything positive yet. Below is the input data:
Level 1  Level 2  Level 3  Level 4  Level 5
Tom      Liz
Tom      Liz      Mel
Tom      Liz      Dan
Tom      Liz      Dan      Ian
Tom      Liz      Dan      Ken
Tom      Tim
Tom      Tim      Fab
Tom      Tim      Fab      Ken
Tom      Tim      Fab      Ken      Jan
Eve
Expected output data is below. The intent is to not have to feed in a pre-loaded list of names. The expectation is that the program could determine the counts based on the input data alone:
Counts
-------
Tom: 9
Eve: 1
Liz: 5
Tim: 4
Mel: 1
Dan: 3
Fab: 3
Ian: 1
Ken: 3
Jan: 1
Any help towards this is appreciated....thanks!
UPDATE: A preloaded list with the list of Names CAN be used to generate the counts. The above description was updated accordingly.
First enter the following UDF in a standard module:
Public Function ListUniques(rng As Range) As Variant
    Dim r As Range, ary(1 To 9999, 1 To 1) As Variant
    Dim i As Long, C As Collection
    Dim v As Variant
    Set C = New Collection
    ' Adding a duplicate key raises an error; it is ignored so each value is stored once
    On Error Resume Next
    For Each r In rng
        v = r.Value
        If v <> "" Then
            C.Add v, CStr(v)
        End If
    Next r
    On Error GoTo 0
    ' Fill the result array, padding unused rows with empty strings
    For i = 1 To 9999
        If i > C.Count Then
            ary(i, 1) = ""
        Else
            ary(i, 1) = C.Item(i)
        End If
    Next i
    ListUniques = ary
End Function
Then highlight a section of a column, say G1 through G50, and enter the array formula:
=listuniques(A2:E11)
Array formulas must be entered with Ctrl + Shift + Enter rather than just the Enter key.
If done correctly, the selected cells in column G should list each name once.
Finally, in H1 enter:
=COUNTIF($A$2:$E$11,G1)
and copy down
NOTE
User Defined Functions (UDFs) are very easy to install and use:
ALT-F11 brings up the VBE window
ALT-I ALT-M opens a fresh module
paste the stuff in and close the VBE window
If you save the workbook, the UDF will be saved with it.
If you are using a version of Excel later than 2003, you must save the file as .xlsm rather than .xlsx
To remove the UDF:
bring up the VBE window as above
clear the code out
close the VBE window
To use the UDF from Excel:
=myfunction(A1)
To learn more about macros in general, see:
http://www.mvps.org/dmcritchie/excel/getstarted.htm
and
http://msdn.microsoft.com/en-us/library/ee814735(v=office.14).aspx
and for specifics on UDFs, see:
http://www.cpearson.com/excel/WritingFunctionsInVBA.aspx
Macros must be enabled for this to work!
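As a cross-check outside Excel, the expected counts from the question can be reproduced with a short Python sketch. It is not part of the VBA approach above; it assumes the rows are exactly those shown in the question, and the names `ROWS` and `count_names` are hypothetical.

```python
from collections import Counter

# Input rows from the question; each row lists Level 1 to Level 5, left to right.
ROWS = [
    ["Tom", "Liz"],
    ["Tom", "Liz", "Mel"],
    ["Tom", "Liz", "Dan"],
    ["Tom", "Liz", "Dan", "Ian"],
    ["Tom", "Liz", "Dan", "Ken"],
    ["Tom", "Tim"],
    ["Tom", "Tim", "Fab"],
    ["Tom", "Tim", "Fab", "Ken"],
    ["Tom", "Tim", "Fab", "Ken", "Jan"],
    ["Eve"],
]


def count_names(rows):
    """Count every non-empty cell across all rows and columns."""
    return Counter(name for row in rows for name in row if name)


counts = count_names(ROWS)
```

The result matches the expected output in the question (Tom: 9, Liz: 5, Tim: 4, and so on) without a pre-loaded list of names.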
