I'm trying to build a hierarchy from two columns where by one column is a client identifier and the other is its direct parent but I have an issue because a client can have a parent which can have another parent.
At the moment I have a number of merge statements in a macro that
%macro hierarchy (level);
data sourcedata (drop = PARENT_ID);
merge sourcedata (in = a)
inputdata (in = b);
by CLIENT_ID;
if a;
length Parent_L&level $32.;
Parent_L&level = PARENT_ID;
if Parent_L&level = Parent_L%eval(&level-1) then Parent_L&level ="";
CLIENT_ID = Parent_L&level;
proc sort;
by Parent_L&level;
run;
%mend;
%hierarchy(2)
%hierarchy(3)
%hierarchy(4)
my output looks like
client_ID Parent_L1 Parent_L2 Parent_L3 Parent_L4
clientA clientB . . .
ClientE clientA clientB . .
what I'm looking for is a way to do this until the last parent_Ln is all blank as I'm not sure how many levels I need to go to
thanks
The length you need is the depth of the tree.
It is a bit cumbersome to compute that in SQL, since you really need a hash table, or at least arrays.
You could use a hash in data step or ds2 to access the parent from the child.
If you have SAS/OR, then it is easier:
you can solve a shortest path and produce your table from
the paths from the roots of the forest.
For this example I will use the dataset from the link Reeza provided:
data have;
infile datalines delimiter='|';
input subject1 subject2;
datalines;
2 | 4
2 | 5
2 | 6
2 | 8
4 | 7
4 | 11
6 | 9
6 | 10
6 | 12
10 | 15
10 | 16
13 | 14
16 | 17
;
proc optmodel printlevel=0;
set<num,num> REFERRALS;
set CLIENTS = union{<u,v> in REFERRALS} {u, v};
set ROOTS = CLIENTS diff setof{<u,v> in REFERRALS} v;
read data have into REFERRALS=[subject1 subject2];
/* source, sink, seq, tail, head. */
set<num,num,num,num,num> PATHS;
num len{ROOTS,CLIENTS} init .;
num longest = max(of len[*]);
solve with NETWORK / graph_direction = directed
links = ( include = REFERRALS )
out = ( sppaths = PATHS spweights = len )
shortpath = ( source = ROOTS )
;
num parent{ci in CLIENTS, i in 1 .. longest} init .;
for{<u, v, i, s, t> in PATHS}
parent[v, len[u,v] - i + 1] = s;
print parent;
create data want from [client]=CLIENTS
{i in 1 .. longest} <COL('Parent_L'||i) = parent[client,i]>;
quit;
Related
I have a data frame where passengerId and path are Strings. The path represents the flight path of the passenger so passenger 10096 started in country CO and traveled to country BM. I need to find out the longest amount of flights each passenger has without traveling to the UK.
+-----------+--------------------+
|passengerId| path|
+-----------+--------------------+
| 10096| co,bm|
| 10351| pk,uk|
| 10436| co,co,cn,tj,us,ir|
| 1090| dk,tj,jo,jo,ch,cn|
| 11078| pk,no,fr,no|
| 11332|sg,cn,co,bm,sg,jo...|
| 11563|us,sg,th,cn,il,uk...|
| 1159| ca,cl,il,sg,il|
| 11722| dk,dk,pk,sg,cn|
| 11888|au,se,ca,tj,th,be...|
| 12394| dk,nl,th|
| 12529| no,be,au|
| 12847| cn,cg|
| 13192| cn,tk,cg,uk,uk|
| 13282| co,us,iq,iq|
| 13442| cn,pk,jo,us,ch,cg|
| 13610| be,ar,tj,no,ch,no|
| 13772| be,at,iq|
| 13865| be,th,cn,il|
| 14157| sg,dk|
+-----------+--------------------+
I need to get it like this.
val data = List(
(1,List("UK","IR","AT","UK","CH","PK")),
(2,List("CG","IR")),
(3,List("CG","IR","SG","BE","UK")),
(4,List("CG","IR","NO","UK","SG","UK","IR","TJ","AT")),
(5,List("CG","IR"))
I'm trying to use this solution but I can't make this list of lists. It also seems like the input used in the solution has each country code as a separate item in the list, while my path column has the country codes listed as a single element to describe the flight path.
If the goal is just to generate the list of destinations from a string, you can simply use split:
df.withColumn("path", split('path, ","))
If the goal is to compute the maximum number of steps without going to the UK, you could do something like this:
df
// split the string on 'uk' and generate one row per sub journey
.withColumn("path", explode(split('path, ",?uk,?")))
// compute the size of each sub journey
.withColumn("path_size", size(split('path, ",")))
// retrieve the longest one
.groupBy("passengerId")
.agg(max('path_size) as "max_path_size")
I have the following data
100///t1001///t2///t0.119///t2342342342///tHi\nthere!///n103///t1002///t2///t0.119///t2342342342///tHello
there!
1010///t10077///t2///t0.119///t2342342342///tHi\nthere!///n1044///t1003///t2///t0.119///t2342342342///tHello there!
In a file, I have multiple lines of of the above formatted data. Each line is delimited by ///n and ///t. For each line, there are four records that are delimited by ///n. Inside each record, there are four columns that are delimited by ///t. Now, I need to parse this into a Dataframe. So basically for the above two lines; since each line has 2 records with 6 columns; there should be 12 records in the Dataframe. Each record follows the same format.
I tried parsing this using a combination of split and amp but did not get the correct output
You can process it using string transformations, like:
// Sample of input data
val str1 = "100///t1001///t2///t0.119///t2342342342///tHi\nthere!///n103///t1002///t2///t0.119///t2342342342///tHello there!"
val str2 = "1010///t10077///t2///t0.119///t2342342342///tHi\nthere!///n1044///t1003///t2///t0.119///t2342342342///tHello there!"
val df = Seq(str1, str2).toDF
// Process:
val output = df.as[String].flatMap(row=>{
val fields = row.split("///n").map(record=>{
val fields = record.split("///t").toList
(fields(0), fields(1), fields(2), fields(3), fields(4), fields(5))
}).toList
fields
}).toDF("column_1", "column_2", "column_3", "column_4", "column_5", "column_6")
Result:
+--------+--------+--------+--------+----------+------------+
|column_1|column_2|column_3|column_4| column_5| column_6|
+--------+--------+--------+--------+----------+------------+
| 100| 1001| 2| 0.119|2342342342| Hi |
| |there! |
| 103| 1002| 2| 0.119|2342342342|Hello there!|
| 1010| 10077| 2| 0.119|2342342342| Hi |
| | there!|
| 1044| 1003| 2| 0.119|2342342342|Hello there!|
+--------+--------+--------+--------+----------+------------+
I am not sure yet what the problem is, I am trying to go through a ResizeArray and matching the item with the data type, and depending on this, take away the value in a specific field (iSpace) from thespace(which is how much space the inventory has), before returning the final value.
A snippet of my code :
let spaceleft =
let mutable count = 0 //used to store the index to get item from array
let mutable thespace = 60 //the space left in the inventory
printf "Count: %i \n" inventory.Count //creates an error
while count < inventory.Count do
let item = inventory.[count]
match item with
|Weapon weapon ->
thespace <- (thespace - weapon.iSpace)
|Bomb bomb ->
thespace <-(thespace - bomb.iSpace)
|Potion pot ->
thespace <- (thespace - pot.iSpace)
|Armour arm ->
thespace <- (thespace - arm.iSpace)
count <- count+1
thespace
I get an error about Int32, that has to do with the
printf "Count: %i \n" inventory.Count
line
Another problem is that thespace doesn't seem to change, and always returns as 60, although I have checked and inventory is not empty, it always has at least two items, 1 weapon and 1 armour, so thespace should atleast decrease yet it never does.
Other snippets that may help:
let inventory = ResizeArray[]
let initialise =
let mutable listr = roominit
let mutable curroom = 3
let mutable dead = false
inventory.Add(Weapon weap1)
inventory.Add(Armour a1)
let spacetogo = spaceleft //returns 60, although it should not
Also, apart from the iniitialise function, other functions seem not to be able to add items to the inventory properly, eg:
let ok, input = Int32.TryParse(Console.ReadLine())
match ok with
|false ->
printf "The weapon was left here \n"
complete <- false
|true ->
if input = 1 && spaceleft>= a.iSpace then
inventory.Add(Weapon a)
printf "\n %s added to the inventory \n" a.name
complete <- true
else
printf "\n The weapon was left here \n"
complete <- false
complete
You have spaceLeft as a constant value. To make it a function you need to add unit () as a parameter. Here's that change including a modification to make it much simpler (I've included my dummy types):
type X = { iSpace : int }
type Item = Weapon of X | Bomb of X | Potion of X | Armour of X
let inventory = ResizeArray [ Weapon {iSpace = 2}; Bomb {iSpace = 3} ]
let spaceleft () =
let mutable thespace = 60 //the space left in the inventory
printf "Count: %i \n" inventory.Count
for item in inventory do
let itemSpace =
match item with
| Weapon w -> w.iSpace
| Bomb b -> b.iSpace
| Potion p -> p.iSpace
| Armour a -> a.iSpace
thespace <- thespace - itemSpace
thespace
spaceleft () // 55
The above code is quite imperative. If you want to make it more functional (and simpler still) you can use Seq.sumBy:
let spaceleft_functional () =
printf "Count: %i \n" inventory.Count
let spaceUsed =
inventory
|> Seq.sumBy (function
| Weapon w -> w.iSpace
| Bomb b -> b.iSpace
| Potion p -> p.iSpace
| Armour a -> a.iSpace)
60 - spaceUsed
Just adding to the accepted answer: you can also match against record labels, as long as your inner types are records. Combine with an intrinsic type extension on the outer DU:
type X = { iSpace : int }
type Y = { iSpace : int }
type Item = Weapon of X | Bomb of Y | Potion of X | Armour of X
let inventory = ResizeArray [ Weapon {iSpace = 2}; Bomb {iSpace = 3} ]
let itemSpace = function
| Weapon { iSpace = s } | Bomb { iSpace = s }
| Potion { iSpace = s } | Armour { iSpace = s } -> s
type Item with static member (+) (a, b) = a + itemSpace b
60 - (Seq.fold (+) 0 inventory)
// val it : int = 55
Otherwise, you could resort to member constraint invocation expressions.
let inline space (x : ^t) = (^t : (member iSpace : int) (x))
I tried below code for drop records that contains garbage value with multiple occurrences and multiple columns,But I want to remove garbage value form string with multiple occurrences in multiple columns.
Sample Code :-
filter_list = ['$','#','%','#','!','^','&','*','null']
def filterfn(*x):
remove_garbage = list(chain(*[[filter not in elt for filter in
filter_list] for elt in x]))
return(reduce(lambda x,y: x and y, remove_garbage, True))
filter_udf = f.udf(filterfn, BooleanType())
original = original.filter(filter_udf(*[col for col in compulsory_fields]))
original.show()
In this example "original" is my original dataframe and "compulsory_fields" this is my array(it stores as multiple columns).
Sample Input :-
id name salary
# Yogita 1000
2 Neha ##
3 #Jay$deep## 8000
4 Priya 40$00&
5 Bhavana $$%&^
6 $% $$&&
Sample Output :-
id name salary
3 Jaydeep 8000
4 priya 4000
Your requirements are not completely clear to me, but it seems you want to output records that are valid after removing the "garbage" characters. You can achieve this by adding a clean_special_characters udf that removes the special characters before running your filter_udf:
import pyspark.sql.functions as f
from itertools import chain
from pyspark.sql.functions import regexp_replace,col
from pyspark.sql.types import BooleanType,StringType
rdd = sc.parallelize((
('#','Yogita','1000'),
('2', 'Neha', '##'),
('3', '#Jay$deep##','8000'),
('4', 'Priya', '40$00&'),
('5', 'Bhavana', '$$%&^'),
('6', '$%','$$&&'))
)
original = rdd.toDF(['id','name','salary'])
filter_list = ['$','#','%','#','!','^','&','*','null']
compulsory_fields = ['id','name','salary']
def clean_special_characters(input_string):
cleaned_input = input_string.translate({ord(c): None for c in filter_list if len(c)==1})
if cleaned_input == '':
return 'null'
return cleaned_input
clean_special_characters_udf = f.udf(clean_special_characters, StringType())
original = original.withColumn('name', clean_special_characters_udf(original.name))
original = original.withColumn('salary', clean_special_characters_udf(original.salary))
def filterfn(*x):
remove_garbage = list(chain(*[[filter not in elt for filter in
filter_list] for elt in x]))
return(reduce(lambda x,y: x and y, remove_garbage, True))
filter_udf = f.udf(filterfn, BooleanType())
original = original.filter(filter_udf(*[col for col in compulsory_fields]))
original.show()
This outputs:
+---+-------+------+
| id| name|salary|
+---+-------+------+
| 3|Jaydeep| 8000|
| 4| Priya| 4000|
+---+-------+------+
I have a little problem with the performance of one of my applications, basically:
An external system gives me a big structure as an Object(,).
This structure only has an column per row.
MyData(0,0) = 'COL1-ROW1 | COL2-ROW1 | COL3-ROW1'
MyData(1,0) = 'COL1-ROW2 | COL2-ROW2 | COL3-ROW2'
MyData(2,0) = 'COL1-ROW3 | COL2-ROW3 | COL3-ROW3'
MyData(3,0) = 'COL1-ROW4 | COL2-ROW4 | COL3-ROW4'
MyData(0,1) ' Doesn't exists.
There are some method in LINQ to convert this structure to an one dimensional array of strings?
It would be awesome if you could divide by columns, given a specific character.
Something like this:
NewData(0,0) = COL1-ROW1
NewData(0,1) = COL2-ROW1
NewData(0,2) = COL3-ROW1
NewData(1,0) = COL1-ROW2
...
NewData(3,2) = COL3-ROW3
Seems I found the answer from myself; here my solution:
Dim vMyData(1000, 0) As Object
For x = 0 To 1000
vMyData(x, 0) = String.Format("ROW{0}COL1|ROW{0}COL2|ROW{0}COL3|ROW{0}COL4", x)
Next
Dim vQuery = From TempResult In vMyData
Select Value = TempResult.ToString.Split("|")
Dim vMyNewArray As New ArrayList(vQuery.ToArray)
Now; exists some method to trim each value of the Split("|")?
[UPDATE TO THE PREVIOUS QUESTION]:
From TempResult In vMyData Select Value = Array.ConvertAll(TempResult.ToString.Split("|"), Function(vVal) vVal.ToString.Trim)