F# remove duplicates from a string [] list - arrays

I have a program that produces a string[] list, and I'm trying to remove near-duplicate arrays from the list. An example of the list is...
[
[|
"Jackson";
"Stentzke";
"22";
"001"
|];
[|
"Jackson";
"Stentzke";
"22";
"002"
|];
[|
"Alec";
"Stentzke";
"18";
"003"
|]
]
Basically, I'm trying to write a function that reads over the list and removes every entry that has a near-identical counterpart. So the final returned string[] list should look like...
[
[|
"Alec";
"Stentzke";
"18";
"003"
|]
]
I've tried a number of functions to get this result, or something close to it that I can work with. My current attempt is this...
let removeDuplicates (arrayList: string[]list) =
let list = arrayList|> List.map(fun aL ->
let a = arrayList|> List.map(fun aL2 ->
try
match (aL.GetValue(0).Equals(aL2.GetValue(0))) && (aL.GetValue(2).Equals(aL2.GetValue(2))) && (aL.GetValue(3).Equals(aL2.GetValue(3))) with
| false -> aL2
| _ -> [|""|]
with
| ex -> [|""|]
)
a
)
list |> List.concat |> List.distinct
But all this returns is a reversed version of the input string[] list.
Does anyone know how to remove near duplicated arrays from a list?

I believe your code and comments don't match up very well. Considering your comment "the first, second and third values are the same", I believe this can get you on the right track:
let removeDuplicates (arrayList: string[] list) =
    // List.distinctBy keeps the first array seen for each key and returns a list
    arrayList |> List.distinctBy (fun elem -> (elem.[0], elem.[1], elem.[2]))
The result of this against your input data is a two element list containing:
[
[|
"Jackson";
"Stentzke";
"22";
"001"
|];
[|
"Alec";
"Stentzke";
"18";
"003"
|]
]

You should create a dictionary/map keyed on the fields you consider identical, then remove every key that occurs more than once. Here's a simple and mechanical way, assuming xs is the list you specified above:
type DataRec = { key:string
                 fname:string
                 lname:string
                 id1:string
                 id2:string }
let dataRecs =
    xs |> List.map (fun x ->
        { key = x.[0] + x.[1] + x.[2]
          fname = x.[0]; lname = x.[1]; id1 = x.[2]; id2 = x.[3] })
dataRecs
|> Seq.groupBy (fun x -> x.key)
|> Seq.filter (fun x -> Seq.length (snd x) = 1)
|> Seq.collect snd
|> Seq.map (fun x -> [|x.fname; x.lname; x.id1; x.id2|])
|> Seq.toList
Output:
val it : string [] list = [[|"Alec"; "Stentzke"; "18"; "003"|]]
It basically creates a key from the first three items, groups by it, keeps only the groups that contain exactly one item, and then maps back to an array.
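The same idea can probably be written without the intermediate record type. The following is only a sketch: the helper name removeAllNearDuplicates is hypothetical, and it assumes (as the answers above do) that two arrays are "near duplicates" when their first three fields match.

```fsharp
// Hypothetical helper: keep only the arrays whose first three fields occur exactly once.
let removeAllNearDuplicates (arrayList: string[] list) =
    // Assumption: the first three fields form the "near duplicate" key.
    let key (a: string[]) = a.[0], a.[1], a.[2]
    // Count how many arrays share each key.
    let counts = arrayList |> List.countBy key |> Map.ofList
    // Drop every array whose key was seen more than once.
    arrayList |> List.filter (fun a -> counts.[key a] = 1)

let sample =
    [ [| "Jackson"; "Stentzke"; "22"; "001" |]
      [| "Jackson"; "Stentzke"; "22"; "002" |]
      [| "Alec"; "Stentzke"; "18"; "003" |] ]

printfn "%A" (removeAllNearDuplicates sample)
```

Unlike distinctBy, which keeps the first array of each group, this drops both "Jackson" rows and leaves only the "Alec" row.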

Using some Linq:
let comparer (atMost) =
    { new System.Collections.Generic.IEqualityComparer<string[]> with
        member __.Equals(a, b) =
            Seq.zip a b
            |> Seq.sumBy (fun (a',b') -> System.StringComparer.InvariantCulture.Compare(a', b') |> abs |> min 1)
            |> ((>=) atMost)
        member __.GetHashCode(a) = 1
    }
System.Linq.Enumerable.GroupBy(data, id, comparer 1)
|> Seq.choose (fun g -> match Seq.length g with | 1 -> Some g.Key | _ -> None)
The comparer allows for atMost : int number of differences between two arrays.
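To make the difference count inside Equals easier to see, here is a rough standalone sketch. The names differences and withinTolerance are hypothetical, and it uses an ordinal string comparison rather than the invariant-culture comparison in the answer, which is an assumption made for simplicity.

```fsharp
// Hypothetical helper: number of positions at which two arrays differ.
// Seq.zip stops at the shorter array, so extra trailing items are ignored.
let differences (a: string[]) (b: string[]) =
    Seq.zip a b
    |> Seq.filter (fun (x, y) -> System.String.CompareOrdinal(x, y) <> 0)
    |> Seq.length

// Two arrays count as "equal" when they differ in at most atMost positions.
let withinTolerance atMost a b = differences a b <= atMost

let jackson1 = [| "Jackson"; "Stentzke"; "22"; "001" |]
let jackson2 = [| "Jackson"; "Stentzke"; "22"; "002" |]
let alec = [| "Alec"; "Stentzke"; "18"; "003" |]

printfn "%b %b" (withinTolerance 1 jackson1 jackson2) (withinTolerance 1 jackson1 alec)
```

With a tolerance of 1, the two "Jackson" rows land in the same group, while the "Alec" row differs in three positions and stays on its own.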

Related

F# CSV - for each row create an array from columns data

I have a CSV file where the first column is a title and the next 700+ columns are int data.
Title D1 D2 D3 D4 .. D700
Name1 0 1 7 5 48
I try to use CsvProvider to read the file and then convert data to my custom type
type DigitRecord = { Title:string; Digits:int[] }
The problem is I don't know how to put all column data (except the first one with a title) into a int[] array.
let dataRecords =
CSV.Rows
|> Seq.map (fun record -> {Title = record.Title; Digits = ???})
I want to get a record with Title=Name1 and Digits=[|0; 1; 7; 5; ...; 48|]
I'm a newbie in F#, so I'd be grateful for any help!
I think the easiest way is to use CsvFile like this:
let readData (path : string) seps =
    CsvFile.Load(path, seps).Rows
    |> Seq.map
        (fun row -> row.Columns.[0], row.Columns |> Array.skip 1 |> Array.map int)
    |> Seq.map
        (fun (title, digits) -> {Title = title; Digits = digits})
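The row-to-record step itself does not depend on the CSV library. As a minimal sketch, assuming the columns are separated by spaces or tabs as in the sample above, a single line could be parsed like this (parseRow is a hypothetical helper, not part of FSharp.Data):

```fsharp
type DigitRecord = { Title: string; Digits: int[] }

// Hypothetical helper: parse one whitespace-separated line into a DigitRecord.
let parseRow (line: string) =
    let cols = line.Split([| ' '; '\t' |], System.StringSplitOptions.RemoveEmptyEntries)
    // First column is the title; every remaining column is converted to int.
    { Title = cols.[0]; Digits = cols |> Array.skip 1 |> Array.map int }

let record = parseRow "Name1 0 1 7 5 48"
printfn "%A" record
```

Array.map int throws if a column is not a valid integer, so real data may need validation first.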

process subarrays asynchronously and reduce the results to a single array

Input: the input is in the form of an array of arrays.
let items = [|
[|"item1"; "item2"|]
[|"item3"; "item4"|]
[|"item5"; "item6"|]
[|"item7"; "item8"|]
[|"item9"; "item10"|]
[|"item11"; "item12"|]
|]
Asynchronous action that returns asynchronous result or error
let action (item: string) : Async<Result<string, string>> =
async {
return Ok (item + ":processed")
}
Attempt: process one subarray at a time, with the items of each subarray in parallel
let result = items
|> Seq.map (Seq.map action >> Async.Parallel)
|> Async.Parallel // wrong? process root items sequentially
|> Async.RunSynchronously
Expectations:
a) Process one subarray at a time in parallel, then process the second subarray in parallel and so on. (In other words sequential processing for the root items and parallel processing for subitems)
b) Then collect all the results and merge them into a singly dimensioned results array while maintaining the order.
c) Preferably using built-in methods provided by Array, Seq, List, Async etc. instead of any custom operators (that'd be last resort)
d) Optional - If it's not possible to have something within the chain, then as a last resort perhaps convert the result subarrays into single array at the end and return to the caller, if that leads to a cleaner and minimalistic approach which I prefer.
Attempt 2
let result2 = items
|> Seq.map (Seq.map action >> Async.Parallel)
|> Async.Parallel // wrong? is it processing root items sequentially
|> Async.RunSynchronously
|> Array.collect id
Array.iter (fun (item: Result<string, string>) ->
match item with
| Ok r -> Console.WriteLine(r)
| Error e -> Console.WriteLine(e)
) result2
Edit
let action (item: string) : Async<Result<string, string>> =
async {
return Ok (item + ":processed")
}
let items = [| "item1"; "item2"; "item3"; "item4"; "item5"; "item6"; "item7"; "item8"; "item9"; "item10"|]
let result = items
|> Seq.chunkBySize 2
|> Seq.map (Seq.map action >> Async.Parallel)
|> Seq.map Async.RunSynchronously
|> Seq.toArray
|> Array.collect id
let result = items |> Array.map ( Array.map action >> Async.Parallel)
|> Array.map Async.RunSynchronously
|> Array.collect id
Edit: Note that the majority of operations defined on Seq can also be found on Array, and vice versa. If you start with an array, you can use Array operations all the way down.
let items = [| "item1"; "item2"; "item3"; "item4"; "item5"; "item6"; "item7"; "item8"; "item9"; "item10"|]
let result =
    items
    |> Array.chunkBySize 2
    |> Array.map (Array.map action >> Async.Parallel >> Async.RunSynchronously)
    |> Array.concat
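If calling Async.RunSynchronously once per chunk is undesirable, the sequencing can probably be kept inside a single async workflow instead. This is only a sketch: processChunks is a hypothetical name, and action is repeated from the question so the snippet stands alone.

```fsharp
let action (item: string) : Async<Result<string, string>> =
    async { return Ok (item + ":processed") }

// Hypothetical helper: run the items of each chunk in parallel,
// one chunk after another, collecting the results in order.
let processChunks (chunks: string[][]) : Async<Result<string, string>[]> =
    async {
        let results = ResizeArray<Result<string, string>>()
        for chunk in chunks do
            // Wait for the whole chunk before starting the next one.
            let! chunkResults = chunk |> Array.map action |> Async.Parallel
            results.AddRange chunkResults
        return results.ToArray()
    }

let chunks = [| [| "item1"; "item2" |]; [| "item3"; "item4" |] |]
let processed = processChunks chunks |> Async.RunSynchronously
printfn "%A" processed
```

Async.Parallel returns results in input order, so the flattened array keeps the original ordering, and the caller decides where (or whether) to block.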

F# Two Arrays - First Array Product Filtered By Index On Second Array

I have three arrays - first is a float array, second is a string array, and fltr is a string array. I need to generate a product of the elements in the first array filtered by whether the matching index in the second array contains all the characters in the elements of the filter array:
module SOQN =
open System
let first = [| 2.00; 3.00; 5.00; 7.00; 11.00 |]
let second = [| "ABCD"; "ABCE"; "ABDE"; "ACDE"; "BCDE" |]
let fltr = [| "AC"; "BD"; "CE" |]
let result =
first
|> Array.filter second // filter for elements containing characters in second array
|> Seq.reduce (fun x y -> x * y)
// Expected Result: let result = [| 42.00; 110.00; 231.00 |]
How do I generate the array of products?
Something like this
let first = [| 2.00; 3.00; 5.00; 7.00; 11.00 |]
let second = [| "ABCD"; "ABCE"; "ABDE"; "ACDE"; "BCDE" |]
let fltr = "AC"
Array.zip first second
|> Array.filter (fun (_, s) ->
Seq.forall (fun c -> s.Contains (string c)) fltr)
|> Array.map fst
|> Array.reduce (*)
The following snippet (though not idiomatic) provides the complete answer I was seeking and includes @xuanduc987's solution:
module SOANS =
open System
let first = [| 2.00; 3.00; 5.00; 7.00; 11.00 |]
let second = [| "ABCD"; "ABCE"; "ABDE"; "ACDE"; "BCDE" |]
let fltr = [| "AC"; "BD"; "CE" |]
let filterProduct (first:float[]) (second:string[]) (fltr:string) =
Array.zip first second
|> Array.filter (fun (_, s) ->
Seq.forall (fun c -> s.Contains (string c)) fltr)
|> Array.map fst
|> Array.reduce (*)
let third =
[for i in [0..fltr.Length - 1] do
yield (filterProduct first second fltr.[i])]
|> List.toArray
printfn "Third: %A" third
// Expected Result: Third: [| 42.0; 110.0; 231.0 |]
// Actual Result Third: [| 42.0; 110.0; 231.0 |]
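A somewhat more idiomatic variant of the same snippet might map over the filter array directly instead of building a list by index. This is a sketch under one stated change: it uses Array.fold with 1.0 as the seed instead of Array.reduce, so a filter that matches nothing yields 1.0 rather than throwing.

```fsharp
let first = [| 2.00; 3.00; 5.00; 7.00; 11.00 |]
let second = [| "ABCD"; "ABCE"; "ABDE"; "ACDE"; "BCDE" |]
let fltr = [| "AC"; "BD"; "CE" |]

// Product of the values whose label contains every character of chars.
let filterProduct (values: float[]) (labels: string[]) (chars: string) =
    Array.zip values labels
    |> Array.filter (fun (_, s) -> chars |> Seq.forall (fun c -> s.Contains(string c)))
    |> Array.map fst
    |> Array.fold ( * ) 1.0

// One product per filter string.
let third = fltr |> Array.map (filterProduct first second)
printfn "Third: %A" third
```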

f# Finding the difference between 2 obj[]lists

I have two obj[] lists, list1 and list2. list1 has a length of 8 and list2 has a length of 10. Some arrays exist only in list1, and the same goes for list2, but some arrays exist in both. I'm wondering how to get the arrays that exist only in list1. At the moment, when I run my code I get a list of the arrays that exist in both lists, but it's missing the data unique to list1. How do I get that unique list1 data? Any suggestions?
let getProdOnly (index:int)(list1:obj[]list)(list2:obj[]list) =
let mutable list3 = list.Empty
for i = 0 to list1.Length-1 do
for j = 0 to list2.Length-1 do
if list1.Item(i).GetValue(index).Equals(list2.Item(j).GetValue(index)) then
System.Diagnostics.Debug.WriteLine("Exists in List 1 and 2")
else
list3 <- list1.Item(i)
Something like this:
let ar1 = [|1;2;3|]
let ar2 = [|2;3;4|]
let s1 = ar1 |> Set.ofArray
let s2 = ar2 |> Set.ofArray
Set.difference s1 s2
//val it : Set<int> = set [1]
There are also a bunch of Array-related functions, like compareWith, distinct and exists, if you want to work with arrays directly.
But as was pointed out in previous answers, this type of imperative code is not very idiomatic. Try to avoid mutable variables and loops. It could probably be rewritten with Array.map, for example.
If you want the elements unique to one list, this is the easiest way to do it in F# 4.0:
list1
|> List.except list2
which will remove all the elements of list2 from list1. Note that except also calls a distinct, so you might need to watch out for that.
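List.except compares whole elements. If, as in the question, only the value at a given index should decide whether an array "exists" in the other list, a key-based filter may be closer to what is wanted. The sketch below is one possible way to do that; onlyInFirst is a hypothetical name, and it assumes the values at that index support ordinary .NET equality and hashing.

```fsharp
open System.Collections.Generic

// Hypothetical helper: arrays of list1 whose value at index
// does not appear at the same index in any array of list2.
let onlyInFirst (index: int) (list1: obj[] list) (list2: obj[] list) =
    // Collect the keys of list2 once, so each lookup is a hash lookup.
    let keys = HashSet<obj>(list2 |> List.map (fun a -> a.[index]), HashIdentity.Structural)
    list1 |> List.filter (fun a -> not (keys.Contains a.[index]))

let l1 = [ [| box 1; box "a" |]; [| box 2; box "b" |]; [| box 3; box "c" |] ]
let l2 = [ [| box 2; box "x" |]; [| box 4; box "y" |] ]
let unique = onlyInFirst 0 l1 l2
printfn "%A" unique
```

Unlike List.except, this keeps duplicates that occur in list1 and preserves their order.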
First I took your code with minor changes and added some printf debugging to see what it does.
let getProdOnly2 (index:int)(list1:obj[] list)(list2:obj[] list) =
let mutable list3 : obj[] list= list.Empty
for i = 0 to list1.Length-1 do
for j = 0 to list2.Length-1 do
if list1.[i].[index] = list2.[j].[index] then
printfn "equal"
System.Diagnostics.Debug.WriteLine("Exists in List 1 and 2")
list3
else
printfn "add %A %A" (list1.Item(i)) (list2.Item(j))
list3 <- list1.Item(i) :: list3
list3
list3
It adds an element every time it finds an element in list2 that is not equal to the current element of list1.
So my approach is to take list1 and keep (or rather, filter for) the elements that are not part of list2.
let getProdOnly3 (index:int)(list1:obj[] list)(list2:obj[] list) =
list1
|> List.filter (fun el1 ->
list2
|> List.fold (fun acc el2 -> acc && (el2<>el1)) true )
I tested the code with the following lists
let list1 = [ [| 1;2;3;4|]
[| 1;2;3;4|]
[| 2;3;4|]
[| 3;4;5|] ] |> List.map (fun a -> a |> Array.map (fun e -> box e))
let list2 = [ [| 2;3;4|]
[| 3;4;5|] ] |> List.map (fun a -> a |> Array.map (fun e -> box e))
Unlike s952163's answer, my result will contain duplicate entries if list1 contains duplicate entries; I do not know whether that is wanted or unwanted behaviour.

How to generate an array with a dynamic type in F#

I'm trying to write a function to return a dynamically typed array. Here is an example of one of the ways I've tried:
let rng = System.Random()
type ConvertType =
| AsInts
| AsFloat32s
| AsFloats
| AsInt64s
type InputType =
| Ints of int[]
| Float32s of float32[]
| Floats of float[]
| Int64s of int64[]
let genData : int -> int -> ConvertType -> InputType * int[] =
fun (sCount:int) (rCount:int) (ct:ConvertType) ->
let source =
match ct with
| AsInts -> Array.init sCount (fun _ -> rng.Next()) |> Array.map (fun e -> int e) |> Ints
| AsFloat32s -> Array.init sCount (fun _ -> rng.Next()) |> Array.map (fun e -> float32 e) |> Float32s
| AsFloats -> Array.init sCount (fun _ -> rng.Next()) |> Array.map (fun e -> float e) |> Floats
| AsInt64s -> Array.init sCount (fun _ -> rng.Next()) |> Array.map (fun e -> int64 e) |> Int64s
let indices = Array.init rCount (fun _ -> rng.Next sCount) |> Array.sort
source, indices
The problem I have is that further down, where I use the function, I need the array to be of the primitive type, e.g. float32[], and not InputType.
I have also tried doing it via an interface created by an inline function and using generics. I couldn't get that to work how I wanted either but I could have just been doing it wrong.
Edit:
Thanks for the great reply(s), I'll have to try that out today. I'm adding the edit because I solved my problem, although I didn't solve it how I wanted (i.e. like the answer). So an FYI for those who may look at this, I did the following:
let counts = [100; 1000; 10000]
let itCounts = [ 1000; 500; 200]
let helperFunct =
fun (count:int) (numIt:int) (genData : int -> int -> ('T[] * int[] )) ->
let c2 = int( count / 2 )
let source, indices = genData count c2
....
[<Test>]
let ``int test case`` () =
let genData sCount rCount =
let source = Array.init sCount (fun _ -> rng.Next())
let indices = Array.init rCount (fun _ -> rng.Next sCount) |> Array.sort
source, indices
(counts, itCounts) ||> List.iter2 (fun s i -> helperFunct s i genData)
.....
Then each subsequent test case would be something like:
[<Test>]
let ``float test case`` () =
let genData sCount rCount =
let source = Array.init sCount (fun _ -> rng.Next()) |> Array.map (fun e -> float e)
let indices = Array.init rCount (fun _ -> rng.Next sCount) |> Array.sort
source, indices
.....
But, the whole reason I asked the question was that I was trying to avoid rewriting that genData function for every test case. In my real code, this temporary solution kept me from having to break up some stuff in the "helperFunct."
I think that your current design is actually quite good. By listing the supported types explicitly, you can be sure that people will not try to call the function to generate data of type that does not make sense (say byte).
You can write a generic function using System.Convert (which lets you convert value to an arbitrary type, but may fail if this does not make sense). It is also (very likely) going to be less efficient, but I have not measured that:
let genDataGeneric<'T> sCount rCount : 'T[] * int[] =
let genValue() = System.Convert.ChangeType(rng.Next(), typeof<'T>) |> unbox<'T>
let source = Array.init sCount (fun _ -> genValue())
let indices = Array.init rCount (fun _ -> rng.Next sCount) |> Array.sort
source, indices
You can then write
genDataGeneric<float> 10 10
but people can also write genDataGeneric<bool> and the code will either crash or produce nonsense, so that's why I think that your original approach had its benefits too.
Alternatively, you could parameterize the function to take a converter that turns int (which is what you get from rng.Next) to the type you want:
let inline genDataGeneric convertor sCount rCount : 'T[] * int[] =
let genValue() = rng.Next() |> convertor
let source = Array.init sCount (fun _ -> genValue())
let indices = Array.init rCount (fun _ -> rng.Next sCount) |> Array.sort
source, indices
Then you can write just genDataGeneric float 10 10, which is still quite elegant and will be efficient too (I added inline because I think it can help here).
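As a rough usage sketch of the converter-based version: the function body is repeated from above so the snippet stands alone, and the byte converter is a made-up example showing that any int -> 'T function can be passed in.

```fsharp
let rng = System.Random()

let inline genDataGeneric convertor sCount rCount : 'T[] * int[] =
    let genValue() = rng.Next() |> convertor
    let source = Array.init sCount (fun _ -> genValue())
    let indices = Array.init rCount (fun _ -> rng.Next sCount) |> Array.sort
    source, indices

// The element type follows from the converter that is passed in.
let (floats: float32[]), floatIndices = genDataGeneric float32 5 3
let (longs: int64[]), _ = genDataGeneric int64 5 3

// Hypothetical custom converter: fold the random int into the byte range.
let (bytes: byte[]), _ = genDataGeneric (fun n -> byte (n % 256)) 5 3

printfn "%A %A" floats floatIndices
```

The values are random, so only the array lengths and the sorted, in-range indices are predictable.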
EDIT: Based on the Leaf's comment, I tried using the overloaded operator trick and it turns out you can also do this (this blog has the best explanation of what the hell is going on in the following weird snippet!):
// Specifies conversions that we want to allow
type Overloads = Overloads with
static member ($) (Overloads, fake:float) = fun (n:int) -> float n
static member ($) (Overloads, fake:int64) = fun (n:int) -> int64 n
let inline genDataGeneric sCount rCount : 'T[] * int[] =
let convert = (Overloads $ Unchecked.defaultof<'T>)
let genValue() = rng.Next() |> convert
let source = Array.init sCount (fun _ -> genValue())
let indices = Array.init rCount (fun _ -> rng.Next sCount) |> Array.sort
source, indices
let (r : float[] * _) = genDataGeneric 10 10 // This compiles & works
let (r : byte[] * _) = genDataGeneric 10 10 // This does not compile
This is interesting, but note that you still have to specify the type somewhere (either by usage later or using a type annotation). If you're using type annotation, then the earlier code is probably simpler and gives less confusing error messages. But the trick used here is certainly interesting.

Resources