MAX Partitioning in ETL using SSIS

I have a query in SQL Server where I am using MAX OVER (PARTITION BY):
MAX(Duration) OVER (PARTITION BY [Variable8]
ORDER BY [RouterCallKeySequenceNumber] ASC) AS MaxDuration
I would like to implement this in ETL using SSIS.
I have tried to implement it similarly to how Row Number can be implemented: I added a Sort transformation that sorts by Variable8 and RouterCallKeySequenceNumber, and then added a Script transformation:
string _variable8 = "";
int _max_duration;

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    _max_duration = Row.Duration;
    if (Row.Variable8 != _variable8)
    {
        _max_duration = Row.Duration;
        Row.maxduration = _max_duration;
        _variable8 = Row.Variable8;
    }
    else
    {
        if (Row.Duration >= _max_duration)
        {
            Row.maxduration = _max_duration;
        }
    }
}
This is the data that I have:
| Variable8   | RouterCallKeySequenceNumber | Duration |
|:------------|----------------------------:|---------:|
| 153084-2490 |                           0 |      265 |
| 153084-2490 |                           1 |      161 |
| 153084-2490 |                           2 |      197 |
The solution that I need is as below:
| Variable8   | RouterCallKeySequenceNumber | Duration | Max Duration |
|:------------|----------------------------:|---------:|-------------:|
| 153084-2490 |                           0 |      265 |          265 |
| 153084-2490 |                           1 |      161 |          265 |
| 153084-2490 |                           2 |      197 |          265 |
But this does not return the desired value.
I would appreciate any help you can provide.
Thanks
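For what it's worth, here is a minimal sketch of how the script component's per-row logic could be written so that it mirrors the windowed query (assuming the input is sorted by Variable8 and then RouterCallKeySequenceNumber, and using the same column names as the script above). Note that a single forward pass can only emit a running maximum, which is also what MAX(...) OVER (PARTITION BY ... ORDER BY ...) computes; producing the final partition-wide maximum on every row would instead need a blocking step, such as an Aggregate transformation merged back onto the detail rows.
// State carried across rows; requires the Sort transformation above.
string _variable8 = null;
int _max_duration;

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    if (Row.Variable8 != _variable8)
    {
        // First row of a new partition: reset the running maximum.
        _variable8 = Row.Variable8;
        _max_duration = Row.Duration;
    }
    else if (Row.Duration > _max_duration)
    {
        // A later row has a larger Duration: raise the running maximum.
        _max_duration = Row.Duration;
    }

    // Assign on every row, not only when the maximum changes.
    Row.maxduration = _max_duration;
}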

Related

Stopping a custom SQLCLR aggregate early

Let's say I have a table with the following columns and values:
| BucketID | Value |
|:---------|------:|
| 1        |     3 |
| 1        |     2 |
| 1        |     1 |
| 2        |     0 |
| 2        |     1 |
| 2        |     5 |
Let's pretend I want to partition over BucketId and multiply the values in the partition:
SELECT DISTINCT BucketId, MULT(Value) OVER (PARTITION BY BucketId)
Now, there is no built-in MULT aggregate function, so I am writing my own using SQLCLR:
using System;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

[Serializable]
[SqlUserDefinedAggregate(Format.Native, Name = "MULT")]
public struct MULT
{
    private int runningSum;

    public void Init()
    {
        runningSum = 1;
    }

    public void Accumulate(SqlInt32 value)
    {
        runningSum *= (int)value;
    }

    public void Merge(MULT other)
    {
        runningSum *= other.runningSum;
    }

    public SqlInt32 Terminate()
    {
        return new SqlInt32(runningSum);
    }
}
What my question boils down to is this: suppose I hit a 0 in Accumulate or Merge; there is no point continuing on. In that case, how can I return 0 as soon as I hit it?
You cannot force a termination since there is no way to control the workflow. SQL Server will call the Terminate() method as each group finishes processing its set.
However, since the state of the UDA is maintained across each row processed in a group, you can simply check runningSum to see if it's already 0 and, if so, skip any computations. This would save a slight amount of processing time.
In the Accumulate() method, the first step will be to check if runningSum is 0, and if true, simply return;. You should also check to see if value is NULL (something you are not currently checking for). After that, check the incoming value to see if it is 0 and if true, then set runningSum to 0 and return;.
public void Accumulate(SqlInt32 value)
{
    if (runningSum == 0 || value.IsNull)
    {
        return;
    }
    if (value.Value == 0)
    {
        runningSum = 0;
        return;
    }
    runningSum *= value.Value;
}
Note: Do not cast value to int. All Sql* types have a Value property that returns the expected native .NET type.
Finally, in the Merge() method, check runningSum to see if it is 0 and, if true, simply return;. Then check other.runningSum to see if it is 0, and if true, set runningSum to 0.
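Put together, a sketch of that Merge() logic (assuming the parameter type is the aggregate struct itself, MULT):
public void Merge(MULT other)
{
    // Already 0: nothing can change the result, so skip the work.
    if (runningSum == 0)
    {
        return;
    }
    // A 0 in the other partial aggregate zeroes the whole product.
    if (other.runningSum == 0)
    {
        runningSum = 0;
        return;
    }
    runningSum *= other.runningSum;
}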

AutoFixture: Issue when creating value types

Debugging the test methods Test1 and Test2 of the following code with xUnit.net, and putting a breakpoint at the end of CreateValueAndReferenceType(), you see that the variable valueType is the same in both runs, whereas the variable referenceType differs. The former is surprising to me, and an issue as well (I added the line with the string type only for completeness).
public class MyFixture : Fixture
{
    public void CreateValueAndReferenceType()
    {
        var valueType = this.Create<int>();
        var referenceType = this.Create<string>();
    }
}

public class TestClass1
{
    [Fact]
    public void Test1()
    {
        var myFixture = new MyFixture();
        myFixture.CreateValueAndReferenceType();
    }
}

public class TestClass2
{
    [Fact]
    public void Test2()
    {
        var myFixture = new MyFixture();
        myFixture.CreateValueAndReferenceType();
    }
}
What you're seeing is, I think, a basic issue related to pseudo-random number generation in .NET (IIRC, other platforms have similar issues). In essence, System.Random is deterministic, but initialised with a random seed that, among other things, depends on the computer's current time. If you create instances of Random in a tight loop, the code executes faster than the precision of the system clock. Something like this:
for (int i = 0; i < 10; i++)
Console.Write(new Random().Next(0, 9));
will often produce output like this:
5555555555
Most values in AutoFixture are generated by various Random instances; the exception is the string type, whose values are generated by Guid.NewGuid().ToString().
I think that the reason you're seeing this is because of xUnit.net's parallel execution.
In order to pinpoint the problem, I rephrased the issue so that it doesn't rely on debugging or inheritance:
public static class Reporter
{
    public static void CreateValueAndReferenceType(
        IFixture fixture,
        ITestOutputHelper @out)
    {
        var valueType = fixture.Create<int>();
        var referenceType = fixture.Create<string>();
        @out.WriteLine("valueType: {0}", valueType);
        @out.WriteLine("referenceType: {0}", referenceType);
    }
}

public class TestClass1
{
    private readonly ITestOutputHelper @out;

    public TestClass1(ITestOutputHelper @out)
    {
        this.@out = @out;
    }

    [Fact]
    public void Test1()
    {
        Reporter.CreateValueAndReferenceType(new Fixture(), this.@out);
    }
}

public class TestClass2
{
    private readonly ITestOutputHelper @out;

    public TestClass2(ITestOutputHelper @out)
    {
        this.@out = @out;
    }

    [Fact]
    public void Test2()
    {
        Reporter.CreateValueAndReferenceType(new Fixture(), this.@out);
    }
}
When you run this with the xUnit.net console runner, you can see the issue nicely reproduced:
$ packages/xunit.runner.console.2.1.0/tools/xunit.console 37925109/bin/Debug/Ploeh.StackOverflow.Q37925109.dll -diagnostics
-parallel all
xUnit.net Console Runner (64-bit .NET 4.0.30319.42000)
Discovering: Ploeh.StackOverflow.Q37925109 (app domain = on [shadow copy], method display = ClassAndMethod)
Discovered: Ploeh.StackOverflow.Q37925109 (running 2 test cases)
Starting: Ploeh.StackOverflow.Q37925109 (parallel test collections = on, max threads = 4)
Ploeh.StackOverflow.Q37925109.TestClass2.Test2 [PASS]
Output:
valueType: 246
referenceType: cc39f570-046a-4a0a-8adf-ab7deadd0e26
Ploeh.StackOverflow.Q37925109.TestClass1.Test1 [PASS]
Output:
valueType: 246
referenceType: 87455351-03f7-4640-99fb-05af910da267
Finished: Ploeh.StackOverflow.Q37925109
=== TEST EXECUTION SUMMARY ===
Ploeh.StackOverflow.Q37925109 Total: 2, Errors: 0, Failed: 0, Skipped: 0, Time: 0,429s
In the above example, you'll notice that I've explicitly invoked the runner with -parallel all, but I didn't have to do that, since it's the default.
If, on the other hand, you turn off parallelisation with -parallel none, you'll see that the values are different:
$ packages/xunit.runner.console.2.1.0/tools/xunit.console 37925109/bin/Debug/Ploeh.StackOverflow.Q37925109.dll -diagnostics
-parallel none
xUnit.net Console Runner (64-bit .NET 4.0.30319.42000)
Discovering: Ploeh.StackOverflow.Q37925109 (app domain = on [shadow copy], method display = ClassAndMethod)
Discovered: Ploeh.StackOverflow.Q37925109 (running 2 test cases)
Starting: Ploeh.StackOverflow.Q37925109 (parallel test collections = off, max threads = 4)
Ploeh.StackOverflow.Q37925109.TestClass2.Test2 [PASS]
Output:
valueType: 203
referenceType: 1bc75a33-5542-4d9f-b42d-57ed85dc418d
Ploeh.StackOverflow.Q37925109.TestClass1.Test1 [PASS]
Output:
valueType: 117
referenceType: 6a508699-dc35-4bcd-8a7b-15eba64b24b4
Finished: Ploeh.StackOverflow.Q37925109
=== TEST EXECUTION SUMMARY ===
Ploeh.StackOverflow.Q37925109 Total: 2, Errors: 0, Failed: 0, Skipped: 0, Time: 0,348s
What I think happens is that because of the parallelism, both Test1 and Test2 are executed in parallel, and essentially within the same tick.
One workaround is to place both tests in the same test class:
public class TestClass1
{
    private readonly ITestOutputHelper @out;

    public TestClass1(ITestOutputHelper @out)
    {
        this.@out = @out;
    }

    [Fact]
    public void Test1()
    {
        Reporter.CreateValueAndReferenceType(new Fixture(), this.@out);
    }

    [Fact]
    public void Test2()
    {
        Reporter.CreateValueAndReferenceType(new Fixture(), this.@out);
    }
}
This produces two different integer values, because (IIRC) xUnit.net only runs different test classes in parallel:
$ packages/xunit.runner.console.2.1.0/tools/xunit.console 37925109/bin/Debug/Ploeh.StackOverflow.Q37925109.dll -diagnostics
-parallel all
xUnit.net Console Runner (64-bit .NET 4.0.30319.42000)
Discovering: Ploeh.StackOverflow.Q37925109 (app domain = on [shadow copy], method display = ClassAndMethod)
Discovered: Ploeh.StackOverflow.Q37925109 (running 2 test cases)
Starting: Ploeh.StackOverflow.Q37925109 (parallel test collections = on, max threads = 4)
Ploeh.StackOverflow.Q37925109.TestClass1.Test2 [PASS]
Output:
valueType: 113
referenceType: e8c30ad8-f2c8-4767-9e9f-69b55c50e659
Ploeh.StackOverflow.Q37925109.TestClass1.Test1 [PASS]
Output:
valueType: 232
referenceType: 3eb60bf3-4d43-4a91-aef2-42f7e23e35b3
Finished: Ploeh.StackOverflow.Q37925109
=== TEST EXECUTION SUMMARY ===
Ploeh.StackOverflow.Q37925109 Total: 2, Errors: 0, Failed: 0, Skipped: 0, Time: 0,360s
This theory is also corroborated by the fact that if you repeat the experiment sufficiently many times, you'll see the numbers being different once in a while. Here are the integer results from 25 test runs:
33 33
92 92
211 211
13 13
9 9
160 160
55 55
155 155
137 137
161 161
242 242
183 183
237 237
151 151
104 104
254 254
123 123
244 244
144 144
223 9
196 196
126 126
199 199
221 221
132 132
Notice that all but one of the test runs have equal numbers.

Shapefile quadtree in shapelib

In shapelib, I've noticed that quite an amount of code is devoted to handling the shapefile quadtree; for instance, the tool shptreedump (in the shapelib source code):
warmerda@gdal[207]% shptreedump -maxdepth 6 eg_data/polygon.shp
( SHPTreeNode
  Min = (471127.19,4751545.00)
  Max = (489292.31,4765610.50)
  Shapes(0):
  ( SHPTreeNode
    Min = (471127.19,4751545.00)
    Max = (481118.01,4765610.50)
    Shapes(0):
    ( SHPTreeNode
      Min = (471127.19,4751545.00)
      Max = (481118.01,4759281.03)
      Shapes(0):
      ( SHPTreeNode
        Min = (471127.19,4751545.00)
        Max = (476622.14,4759281.03)
        Shapes(0):
        ( SHPTreeNode
          Min = (471127.19,4751545.00)
          Max = (476622.14,4755799.81)
          Shapes(0):
          ( SHPTreeNode
            Min = (471127.19,4751545.00)
            Max = (474149.41,4755799.81)
            Shapes(6): 395 397 402 404 405 422
          )
          ( SHPTreeNode
            Min = (473599.92,4751545.00)
            Max = (476622.14,4755799.81)
            Shapes(10): 392 394 403 413 414 417 426 433 434 447
          )
        ) ...
I think I've become quite familiar with the shapefile format after reading the ESRI Shapefile Technical Description, but I can't see any internal tree structure in the format itself. So my question is: what is the shapefile quadtree for? And, if possible, could you explain the shapefile quadtree implementation?
Thanks.
If you look at the end of your quoted text, right where you stopped, there are lots of closing parentheses: good old Lisp-style encoding:
(R (st1 (st21 () () () ()) () () ()) (st2) (st3) (st4))
R stands for the root of the tree; then you have four subtrees in parentheses, plus the actual data. I denoted the four subtrees by st1...st4; st21 stands for the first subtree on the second level. The subtrees can be labeled, or, if any of them is empty, denoted by (). It is easy to parse and print.
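For context, each node in the dump carries a bounding box (the Min/Max pair), a list of shape IDs (Shapes(n): ...), and nested child nodes; that is what makes the tree useful as a spatial index (shapelib can serialize it to a .qix file). A rough C# sketch of the structure the dump implies, with names of my own choosing rather than shapelib's (the real SHPTreeNode is a C struct):
using System.Collections.Generic;

class ShpTreeNode
{
    public double MinX, MinY, MaxX, MaxY;                        // node bounding box (Min/Max above)
    public List<int> ShapeIds = new List<int>();                 // Shapes(n): IDs stored at this node
    public List<ShpTreeNode> Children = new List<ShpTreeNode>(); // nested ( SHPTreeNode ... ) blocks

    // The point of the index: collect candidate shapes for a query window,
    // pruning every subtree whose bounding box cannot intersect it.
    public void Search(double qMinX, double qMinY, double qMaxX, double qMaxY, List<int> hits)
    {
        if (MaxX < qMinX || MinX > qMaxX || MaxY < qMinY || MinY > qMaxY)
            return; // no overlap: skip this whole subtree
        hits.AddRange(ShapeIds);
        foreach (ShpTreeNode child in Children)
            child.Search(qMinX, qMinY, qMaxX, qMaxY, hits);
    }
}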

ExtJS problem: calculation of rows

I have a table:
| Fields | class 1 | class 2 | class 3 | class 4 |
|:-------|--------:|--------:|--------:|--------:|
| a1     |      10 |     240 |     340 |     401 |
| a2     |      12 |     270 |     340 |     405 |
| a3     |      12 |     270 |     340 |     405 |
| a4     |      15 |     270 |     360 |     405 |
| a5     |      17 |     720 |     530 |     450 |
I have this in a grid as well as in a JSON store. What I have to do is perform a mathematical calculation each time the grid is refreshed by "table name".reconfigure(..... , ....).
Consider the column "class 1":
value(a5) = (value(a1) + 2*value(a2) + 3*value(a3)) / value(a4)
Can anybody please help me with this problem?
I will be very thankful for the help :)
As I'm not sure what aspect of the problem you are having difficulty with, I'll address both at a high level.
Generally speaking you want to have your reconfigure method update the Ext Store, which will then trigger an event that the Grid should handle. Basically, change the Store and your Grid will be updated automatically.
As far as generating the correct new row... it seems fairly straightforward - a rough pass:
/* for each field foo_X through foo_N: */
var lastElementIndex = store.getCount() - 1;
var total = 0;
for (var i = 0; i <= lastElementIndex; i++) {
    if (i != lastElementIndex) {
        // weight each row by its 1-based position: value(a1) + 2*value(a2) + ...
        total += store.getAt(i).get(foo_X) * (i + 1);
    } else {
        // the last row is the divisor: (weighted sum) / value(a4)
        total = total / store.getAt(i).get(foo_X);
    }
}
/* construct your json object with the field foo */
/* after looping through all your fields, create your record and add it to the Store */

How to unpack a pyspark WrappedArray

I have a pyspark query which returns a WrappedArray:
det_port_arr = vessel_arrival_depart_df.select(vessel_arrival_depart_df['DetectedPortArrival'])
det_port_arr.show(2, truncate=False)
det_port_arr.dtypes
The output is a DataFrame with a single column, but that column is a struct which contains an array of structs:
|DetectedPortArrival |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[WrappedArray([portPoi,5555,BEILUN [CNBEI],marinePort], [portPoi,5729,NINGBO [CNNBO],marinePort], [portPoi,5730,NINGBO PT [CNNBG],marinePort]),device,Moored]|
|null |
[('DetectedPortArrival',
'struct<poiMeta:array<struct<poiCategory:string,poiId:bigint,poiName:string,poiType:string>>,sourceType:string,statusType:string>')]
If I try to select the poiMeta member of the struct:
temp = vessel_arrival_depart_df.select(vessel_arrival_depart_df['DetectedPortArrival']['poiMeta'])
temp.show(truncate=False)
print type(temp)
I obtain
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|DetectedPortArrival.poiMeta |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[[portPoi,5555,BEILUN [CNBEI],marinePort], [portPoi,5729,NINGBO [CNNBO],marinePort], [portPoi,5730,NINGBO PT [CNNBG],marinePort]] |
|null |
Here are the data types:
temp.dtypes
[('DetectedPortArrival.poiMeta',
'array<struct<poiCategory:string,poiId:bigint,poiName:string,poiType:string>>')]
But here's the problem: I don't seem to be able to query that column DetectedPortArrival.poiMeta:
df2 = temp.selectExpr("DetectedPortArrival.poiMeta")
df2.show(2)
AnalysisExceptionTraceback (most recent call last)
<ipython-input-46-c7f0041cffe9> in <module>()
----> 1 df2 = temp.selectExpr("DetectedPortArrival.poiMeta")
2 df2.show(3)
/opt/spark/spark-2.1.0-bin-hadoop2.4/python/pyspark/sql/dataframe.py in selectExpr(self, *expr)
996 if len(expr) == 1 and isinstance(expr[0], list):
997 expr = expr[0]
--> 998 jdf = self._jdf.selectExpr(self._jseq(expr))
999 return DataFrame(jdf, self.sql_ctx)
1000
/opt/spark/spark-2.1.0-bin-hadoop2.4/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
/opt/spark/spark-2.1.0-bin-hadoop2.4/python/pyspark/sql/utils.py in deco(*a, **kw)
67 e.java_exception.getStackTrace()))
68 if s.startswith('org.apache.spark.sql.AnalysisException: '):
---> 69 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
70 if s.startswith('org.apache.spark.sql.catalyst.analysis'):
71 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
AnalysisException: u"cannot resolve '`DetectedPortArrival.poiMeta`' given input columns: [DetectedPortArrival.poiMeta]; line 1 pos 0;\n'Project ['DetectedPortArrival.poiMeta]\n+- Project [DetectedPortArrival#268.poiMeta AS DetectedPortArrival.poiMeta#503]\n +- Project [asOf#263, vesselId#264, DetectedPortArrival#268, DetectedPortDeparture#269]\n +- Sort [asOf#263 ASC NULLS FIRST], true\n +- Project [smfPayloadData#1.paired.shipmentId AS shipmentId#262, smfPayloadData#1.timestamp.asOf AS asOf#263, smfPayloadData#1.paired.vesselId AS vesselId#264, smfPayloadData#1.paired.vesselName AS vesselName#265, smfPayloadData#1.geolocation.speed AS speed#266, smfPayloadData#1.geolocation.detectedPois AS detectedPois#267, smfPayloadData#1.events.DetectedPortArrival AS DetectedPortArrival#268, smfPayloadData#1.events.DetectedPortDeparture AS DetectedPortDeparture#269]\n +- Filter ((((cast(smfPayloadData#1.paired.vesselId as double) = cast(9776183 as double)) && isnotnull(smfPayloadData#1.paired.shipmentId)) && (length(smfPayloadData#1.paired.shipmentId) > 0)) && (isnotnull(smfPayloadData#1.paired.vesselId) && (isnotnull(smfPayloadData#1.events.DetectedPortArrival) || isnotnull(smfPayloadData#1.events.DetectedPortDeparture))))\n +- SubqueryAlias smurf_processed\n +- Relation[smfMetaData#0,smfPayloadData#1,smfTransientData#2] parquet\n"
Any suggestions as to how to query that column?
Can't you just select the column based on its index? Something like
temp.select(temp.columns[0]).show()
Best regards
