Extract groups matched by a regex to an array in Scala

I have the following problem. I have a string
val line: String = "PE018201804527901"
that matches this regex:
(.{2})(.{4})(.{9})(.{2})
I need to extract each group matched by the regex into an Array.
The result should be:
Array("PE", "0182", "018045279", "01")
I tried this:
val regex = """(.{2})(.{4})(.{9})(.{2})""".r
val x = regex.findAllIn(line).toArray
but it doesn't work: it returns the whole match as a single element instead of the individual groups.

regex.findAllIn(line).subgroups.toArray

Note that findAllIn does not anchor the regex pattern, so it will also find a match inside a much longer string. If you only want to accept strings of exactly 17 characters, you can use a match block like this:
val line = "PE018201804527901"
val regex = """(.{2})(.{4})(.{9})(.{2})""".r
val results = line match {
case regex(g1, g2, g3, g4) => Array(g1, g2, g3, g4)
case _ => Array[String]()
}
// Demo printing
results.foreach { m =>
println(m)
}
// PE
// 0182
// 018045279
// 01
It also handles the no-match scenario by returning an empty string array.
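To make the anchoring difference concrete, here is a minimal sketch. The `padded` input is made up for illustration and is not part of the question; it simply wraps the 17-character code in extra characters.

```scala
val regex = """(.{2})(.{4})(.{9})(.{2})""".r

// Hypothetical input: the 17-character code surrounded by extra characters.
val padded = "xxPE018201804527901yy"

// findFirstIn is unanchored, so it still finds 17 characters inside the longer string.
val unanchored: Option[String] = regex.findFirstIn(padded)

// A pattern match requires the whole string to match, so this input is rejected.
val anchored: Array[String] = padded match {
  case regex(g1, g2, g3, g4) => Array(g1, g2, g3, g4)
  case _                     => Array.empty[String]
}
```

Which behavior is preferable depends on whether surrounding text should count as valid input.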
If you need to get all matches and all groups, then you will need to grab the groups into a list and then add the list to a list buffer (scala.collection.mutable.ListBuffer):
import scala.collection.mutable.ListBuffer
val line = "PE018201804527901%E018201804527901"
val regex = """(.{2})(.{4})(.{9})(.{2})""".r
val results = ListBuffer[List[String]]()
val mi = regex.findAllIn(line)
while (mi.hasNext) {
  mi.next() // advance to the next match; its groups are then read from the iterator
  results += List(mi.group(1), mi.group(2), mi.group(3), mi.group(4))
}
// Demo printing
results.foreach { m =>
println("------")
println(m)
m.foreach { l => println(l) }
}
Results:
------
List(PE, 0182, 018045279, 01)
PE
0182
018045279
01
------
List(%E, 0182, 018045279, 01)
%E
0182
018045279
01
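As a possible alternative to the mutable buffer, the same nested result can probably be built with `findAllMatchIn`, which yields one `Match` object per hit. This is only a sketch under the same inputs as above, not a claim about which variant is preferable.

```scala
val line = "PE018201804527901%E018201804527901"
val regex = """(.{2})(.{4})(.{9})(.{2})""".r

// findAllMatchIn yields one Match per hit; subgroups returns that match's capture groups.
val results: List[List[String]] =
  regex.findAllMatchIn(line).map(_.subgroups).toList
```

Because each `Match` carries its own groups, no iterator state has to be read after calling `next()`.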

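Your solution, @sheunis, was very helpful. In the end I solved it with this method:
import scala.collection.mutable.ListBuffer
import scala.util.matching.Regex
def extractFromRegex(regex: Regex, line: String): Array[String] = {
  val list = ListBuffer[String]()
  // Collect the subgroups of every match, in order
  for (m <- regex.findAllIn(line).matchData;
       e <- m.subgroups)
    list += e
  list.toArray
}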
I did so because your solution, with this code:
val line: String = """PE0182"""
val regex = """(.{2})(.{4})""".r
val t = regex.findAllIn(line).subgroups.toArray
throws the following exception:
Exception in thread "main" java.lang.IllegalStateException: No match available
at java.util.regex.Matcher.start(Matcher.java:372)
at scala.util.matching.Regex$MatchIterator.start(Regex.scala:696)
at scala.util.matching.Regex$MatchData$class.group(Regex.scala:549)
at scala.util.matching.Regex$MatchIterator.group(Regex.scala:671)
at scala.util.matching.Regex$MatchData$$anonfun$subgroups$1.apply(Regex.scala:553)
at scala.util.matching.Regex$MatchData$$anonfun$subgroups$1.apply(Regex.scala:553)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at scala.util.matching.Regex$MatchData$class.subgroups(Regex.scala:553)
at scala.util.matching.Regex$MatchIterator.subgroups(Regex.scala:671)
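The trace suggests that `subgroups` was read from the `MatchIterator` before the iterator had been advanced to a match. If only the first match is needed, a sketch like the following avoids depending on iterator state altogether; `firstGroups` is a hypothetical helper name, not something from the answers above.

```scala
import scala.util.matching.Regex

// Hypothetical helper: capture groups of the first match, or an empty array if none.
def firstGroups(regex: Regex, line: String): Array[String] =
  regex.findFirstMatchIn(line) match {
    case Some(m) => m.subgroups.toArray
    case None    => Array.empty[String]
  }

val groups = firstGroups("""(.{2})(.{4})""".r, "PE0182")
```

Returning an empty array for the no-match case mirrors the match-block answer earlier in the thread.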

Related

How to add every element of a list at the end of every element of another list in scala?

I would like to append each element of one list to the end of the corresponding element of another list.
I have:
val Cars_tmp: List[String] = List("Cars|10|Paris|5|Type|New|", "Cars|15|Paris|3|Type|New|")
=> Result: List[String] = List("Cars|10|Paris|5|Type|New|", "Cars|15|Paris|3|Type|New|")
val Values_tmp: List[String] = Cars_tmp.map(r => ((r.split("[|]")(1).toInt) / (r.split("[|]")(3).toInt)).toString).toList
=> Result: List[String] = List(2, 5)
I would like to get the following result (the first element of Values_tmp is concatenated with the first element of Cars_tmp, the second element of Values_tmp with the second element of Cars_tmp, and so on):
List("Cars|10|Paris|5|Type|New|2", "Cars|15|Paris|3|Type|New|5")
I tried this:
Values_tmp.foldLeft( Seq[String](), Cars_tmp) { case ((acc, rest), elmt) => ((rest :+ elmt)::acc) }
and I get the following error:
<console>:28: error: type mismatch;
found : scala.collection.immutable.IndexedSeq[Any]
required: List[String]
Thank you for your help.
Try to avoid zip, it "fails" silently when the iterables do not have the same size. (In your code, it seems obvious that the 2 lists have the same size, but for more complex code, this is not obvious.)
You can compute the "value" you need and concatenate it on the fly:
val Cars_tmp: List[String] = List("Cars|10|Paris|5|Type|New|", "Cars|15|Paris|3|Type|New|")
def getValue(str: String): String = {
val Array(_, a, _, b, _, _) = str.split('|') // Note the single quote for the split.
(a.toInt / b.toInt).toString
}
Cars_tmp.map(str => str + getValue(str))
I proposed an implementation of getValue that destructures the Array with a pattern, but you can keep your own implementation:
def getValue(r: String) = ((r.split("[|]")(1).toInt)/ (r.split("[|]")(3).toInt)).toString
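For comparison, here is a sketch of the zip variant that the answer cautions against, with an explicit size check added so that a length mismatch fails loudly. The names `cars`, `values`, `truncated` and `combined` are illustrative only.

```scala
val cars = List("Cars|10|Paris|5|Type|New|", "Cars|15|Paris|3|Type|New|")
val values = List("2", "5")

// zip silently truncates to the shorter input ...
val truncated = List(1, 2, 3).zip(List("a"))

// ... so check the sizes explicitly before pairing the two lists.
require(cars.size == values.size, "lists must have the same length")
val combined = cars.zip(values).map { case (car, value) => car + value }
```

Computing the value inside a single `map`, as in the answer, sidesteps the size question entirely because there is only one list.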

Scala - Split array within array, extract certain information and apply to case class

uri:23e6b806-7a39-4836-bae2-f369673defef offset:1
uri:z65e9d4e-a099-41a1-a9fe-3cf74xbb01a4 offset:2
uri:2beff8bf-1019-4265-9da4-30c696397e08 offset:3
uri:3b1df8bb-69f6-4892-a516-523fd285d659 offset:4
uri:4f961415-b847-4d2c-9107-87617671c47b offset:5
uri:015ba25c-c145-456a-bae7-ebe999cb8e0f offset:6
uri:z1f9592f-64d0-443d-ad0d-38c386dd0adb offset:7
The above is an array of arrays.
Each line is an element of the outer array, and each element is itself an array: I got it by splitting each line on the comma, which removes the comma. What I am trying to do is extract only the uri and offset and apply them to a case class.
case class output2(uri: String, offset: Int)
All I want is the actual values, so each instance of the case class, the uri and offset would be in the below format -
e1af5db7-3aad-4ab0-ac3a-55686fccf6ae
1
I'm trying to find a simple way to do this.
No need to split() each line on the comma. Make the comma part of the recognized input pattern.
val data = Array("uri:23e6b806-7a39-4836-bae2-f369673defef,offset:1"
,"uri:z65e9d4e-a099-41a1-a9fe-3cf74xbb01a4,offset:2"
,"poorly formatted data will be ignored"
,"uri:2beff8bf-1019-4265-9da4-30c696397e08,offset:3"
,"uri:3b1df8bb-69f6-4892-a516-523fd285d659,offset:4"
,"uri:4f961415-b847-4d2c-9107-87617671c47b,offset:5"
,"uri:015ba25c-c145-456a-bae7-ebe999cb8e0f,offset:6"
,"uri:z1f9592f-64d0-443d-ad0d-38c386dd0adb,offset:7")
case class Data(uri:String, offset:Int)
val dataRE = "uri:([^,]+),offset:(\\d+)".r
val rslt:Array[Data] = data.collect{case dataRE(uri, os) => Data(uri, os.toInt)}
You can also validate the GUID while building your data, using a regex like this:
val regexp = """uri:(\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b) offset:([0-9]+)""".r
val regexp(pattern, value) = "uri:23e6b806-7a39-4836-bae2-f369673defef offset:1"
output2(pattern, value.toInt)
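Note that a pattern-style `val regexp(pattern, value) = ...` definition throws a `scala.MatchError` when the line does not match. A sketch of an `Option`-returning variant, assuming the same regex, could look like this; `Output2` and `parseLine` are names introduced here for illustration.

```scala
// Illustrative case class mirroring the one in the question.
case class Output2(uri: String, offset: Int)

val regexp = """uri:(\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b) offset:([0-9]+)""".r

// Hypothetical helper: None for lines that do not match, instead of a MatchError.
def parseLine(line: String): Option[Output2] = line match {
  case regexp(uri, offset) => Some(Output2(uri, offset.toInt))
  case _                   => None
}

val ok  = parseLine("uri:23e6b806-7a39-4836-bae2-f369673defef offset:1")
// 'z' and 'x' are not hex digits, so this sample line is rejected.
val bad = parseLine("uri:z65e9d4e-a099-41a1-a9fe-3cf74xbb01a4 offset:2")
```

With this shape, `lines.flatMap(parseLine)` would keep only the well-formed entries.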
I'd do it this way:
import scala.io.Source
case class Output(uri: String, offset: Int)
val lines = Source
  .fromFile("input")
  .getLines()
  .toList
def parseUri(s: String): Option[String] =
s.splitAt(s.indexOf(":") + 1)._2 match {
case "" => None
case uri => Some(uri)
}
def parseOffset(s: String): Option[Int] =
s.splitAt(s.indexOf(":") + 1)._2 match {
case "" => None
case offset => Some(offset.toInt)
}
def parseOutput(xs: Array[String]): Option[Output] = for {
uri <- parseUri(xs(0))
offset <- parseOffset(xs(1))
} yield {
Output(uri, offset)
}
lines.map(_.split(" ")).flatMap { x =>
parseOutput(x)
}

Create Tuple out of Array(Array[String]) of Varying Sizes using Scala

I am new to Scala and I am trying to make tuple pairs out of an RDD of type Array(Array[String]) that looks like:
(122abc,223cde,334vbn,445das),(221bca,321dsa),(231dsa,653asd,698poq,897qwa)
I am trying to create tuple pairs out of these arrays so that the first element of each array is the key and every other element of the array is a value. For example, the output would look like:
122abc 223cde
122abc 334vbn
122abc 445das
221bca 321dsa
231dsa 653asd
231dsa 698poq
231dsa 897qwa
I can't figure out how to separate the first element from each array and then map it to every other element.
If I'm reading it correctly, the core of your question is how to separate the head (first element) of each inner array from the tail (remaining elements), which you can do with the head and tail methods. RDDs behave a lot like Scala lists, so you can do all of this with what looks like pure Scala code.
Given the following input RDD:
val input: RDD[Array[Array[String]]] = sc.parallelize(
Seq(
Array(
Array("122abc","223cde","334vbn","445das"),
Array("221bca","321dsa"),
Array("231dsa","653asd","698poq","897qwa")
)
)
)
The following should do what you want:
val output: RDD[(String,String)] =
input.flatMap { arrArrStr: Array[Array[String]] =>
arrArrStr.flatMap { arrStrs: Array[String] =>
arrStrs.tail.map { value => arrStrs.head -> value }
}
}
In fact, because of how the flatMap/map calls are composed, you could rewrite it as a for-comprehension:
val output: RDD[(String,String)] =
for {
arrArrStr: Array[Array[String]] <- input
arrStr: Array[String] <- arrArrStr
str: String <- arrStr.tail
} yield (arrStr.head -> str)
Which one you go with is ultimately a matter of personal preference (though in this case, I prefer the latter, as you don't have to indent code as much).
For verification:
output.collect().foreach(println)
Should print out:
(122abc,223cde)
(122abc,334vbn)
(122abc,445das)
(221bca,321dsa)
(231dsa,653asd)
(231dsa,698poq)
(231dsa,897qwa)
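One caveat: `head` and `tail` throw on an empty collection. If empty rows are possible, a pattern match on `+:` is one way to skip them. The sketch below works on plain Scala collections rather than an RDD, and `toKeyValuePairs` is a hypothetical helper name.

```scala
// Hypothetical helper working on plain collections rather than an RDD.
def toKeyValuePairs(rows: Seq[Seq[String]]): Seq[(String, String)] =
  rows.flatMap {
    // Split each row into its first element (key) and the remaining elements (values).
    case key +: values => values.map(value => key -> value)
    // An empty row yields no pairs instead of throwing.
    case _             => Seq.empty[(String, String)]
  }

val pairs = toKeyValuePairs(Seq(
  Seq("122abc", "223cde", "334vbn", "445das"),
  Seq("221bca", "321dsa"),
  Seq.empty[String],
  Seq("231dsa", "653asd", "698poq", "897qwa")
))
```

A row with a single element also produces no pairs, since it has a key but no values.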
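This is a classic fold operation, but folding in Spark means calling aggregate, which takes both a per-element function and a function that merges partial results:
// Start with an empty array
data.aggregate(Array.empty[(String, String)])(
  // seqOp: `arr.drop(1).map(e => (arr.head, e))` creates tuples of the
  // first element of a row and each of its remaining elements.
  // Append these to the aggregate array.
  (acc, arr) => acc ++ arr.drop(1).map(e => (arr.head, e)),
  // combOp: merge the partial results computed on different partitions
  (left, right) => left ++ right
)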
The same solution in a non-Spark environment:
scala> val data = Array(Array("122abc","223cde","334vbn","445das"),Array("221bca","321dsa"),Array("231dsa","653asd","698poq","897qwa"))
scala> data.foldLeft(Array.empty[(String, String)]) { case (acc, arr) =>
| acc ++ arr.drop(1).map(e => (arr.head, e))
| }
res0: Array[(String, String)] = Array((122abc,223cde), (122abc,334vbn), (122abc,445das), (221bca,321dsa), (231dsa,653asd), (231dsa,698poq), (231dsa,897qwa))
Convert your input elements to a Seq, then write a wrapper that returns the elements as pairs.
Try the code below:
val seqs = Seq("122abc","223cde","334vbn","445das")++
Seq("221bca","321dsa")++
Seq("231dsa","653asd","698poq","897qwa")
Write a wrapper that converts a Seq into pairs:
def toPairs[A](xs: Seq[A]): Seq[(A,A)] = xs.zip(xs.tail)
Now pass your Seq as the argument and it will give you the pairs:
toPairs(seqs).mkString(" ")
After converting the result to a string you will get output like:
res8: String = (122abc,223cde) (223cde,334vbn) (334vbn,445das) (445das,221bca) (221bca,321dsa) (321dsa,231dsa) (231dsa,653asd) (653asd,698poq) (698poq,897qwa)
Now you can convert the string however you want. Note that this pairs each element with its successor in the flattened sequence, not each array's first element with its remaining elements.
Using df and explode.
val df = Seq(
Array("122abc","223cde","334vbn","445das"),
Array("221bca","321dsa"),
Array("231dsa","653asd","698poq","897qwa")
).toDF("arr")
val df2 = df
  .withColumn("key", 'arr(0))
  .withColumn("values", explode('arr))
  .filter('key =!= 'values)
  .drop('arr)
  .withColumn("tuple", struct('key, 'values))
df2.show(false)
df2.rdd.map( x => Row( (x(0),x(1)) )).collect.foreach(println)
Output:
+------+------+---------------+
|key |values|tuple |
+------+------+---------------+
|122abc|223cde|[122abc,223cde]|
|122abc|334vbn|[122abc,334vbn]|
|122abc|445das|[122abc,445das]|
|221bca|321dsa|[221bca,321dsa]|
|231dsa|653asd|[231dsa,653asd]|
|231dsa|698poq|[231dsa,698poq]|
|231dsa|897qwa|[231dsa,897qwa]|
+------+------+---------------+
[(122abc,223cde)]
[(122abc,334vbn)]
[(122abc,445das)]
[(221bca,321dsa)]
[(231dsa,653asd)]
[(231dsa,698poq)]
[(231dsa,897qwa)]
Update1:
Using paired rdd
val df = Seq(
Array("122abc","223cde","334vbn","445das"),
Array("221bca","321dsa"),
Array("231dsa","653asd","698poq","897qwa")
).toDF("arr")
import scala.collection.mutable // brings the `mutable` prefix used below into scope
import org.apache.spark.rdd.PairRDDFunctions
val rdd1 = df.rdd.map( x => { val y = x.getAs[mutable.WrappedArray[String]]("arr")(0); (y,x)} )
val pair = new PairRDDFunctions(rdd1)
pair.flatMapValues( x => x.getAs[mutable.WrappedArray[String]]("arr") )
.filter( x=> x._1 != x._2)
.collect.foreach(println)
Results:
(122abc,223cde)
(122abc,334vbn)
(122abc,445das)
(221bca,321dsa)
(231dsa,653asd)
(231dsa,698poq)
(231dsa,897qwa)

Swift 3 simple search engine

I have an array of Strings containing user inputted values. I have a string containing several words (the number of words in the string varies). I want to increment an Int every time one of the words in the string matches a word in the array.
I'm using this method:
var searchWordsArr: [String] = [] // filled by user input
var message: String = "" // random number of words
var numRelevantWords = 0
var i = 0
while i < self.searchWordsArr.count {
i+=1
if message.contains(self.searchWordsArr[i-1]) {
numRelevantWords += 1
}
}
In my first example, the string contained 25 words and the array contained 3 words. The 3 words came up a total of 12 times in the string. Using the above method, the value of numRelevantWords was 2.
I would use regex. Some day soon, Swift will have native regular expressions. But until it does, you have to resort to Foundation:
let words = ["the", "cat", "sat"]
let input = "The cat sat on the mat and the mat sat on the catastrophe"
var result = 0
let pattern = "\\b(" + words.joined(separator:"|") + ")\\b"
do {
let regex = try NSRegularExpression(pattern: pattern, options: .caseInsensitive)
let match = regex.matches(in: input, options: [], range: NSRange(location: 0, length: input.utf16.count))
result += match.count // 7, that's your answer
} catch {
print("illegal regex")
}
Add them to an NSCountedSet:
let searchWords = NSCountedSet()
searchWords.add("Bob")
searchWords.add("John")
searchWords.add("Bob")
print(searchWords.count(for: "Bob")) // 2
"pure" Swift (no Foundation)
let message = "aa bb aaaa ab ac aa aa"
let words = message.characters.split(separator: " ").map(String.init)
let search:Set = ["aa", "bb", "aa"]
let found = words.reduce(0) { (i, word) -> Int in
var i = i
i += search.contains(word) ? 1 : 0
return i
}
print(found, "word(s) from the message are in the search set")
prints
4 word(s) from the message are in the search set
UPDATE (see discussion)
using Set
var search: Set<String> = [] // an empty set with Element == String
search.insert("aa") // insert word to set
using Array
var search: Array<String> = [] // an empty array
search.append("aa") // append word to array
maybe you are looking for
let message = "the cat in the hat"
let words = message.characters.split(separator: " ").map(String.init)
let search:Set = ["aa", "bb", "pppp", "the"]
let found = search.reduce(0) { (i, word) -> Int in
var i = i
i += words.contains(word) ? 1 : 0
return i
}
print(found, "word(s) from the search set found in the message")
prints
1 word(s) from the search set found in the message
If you would like to produce the same result as the accepted answer:
let words = ["the", "cat", "sat"]
let input = "The cat sat on the mat and the mat sat on the catastrophe"
let inputWords = input.lowercased().characters.split(separator: " ").map(String.init)
let found = inputWords.reduce(0) { (i, word) -> Int in
var i = i
i += words.contains(word.lowercased()) ? 1 : 0
return i
}
print(found, "word(s) from the search set found in the message")
prints
7 word(s) from the search set found in the message

F# concatenate int array option to string

I have a data contract (WCF) with a field defined as:
[<DataContract(Namespace = _Namespace.ws)>]
type CommitRequest =
{
// Excluded for brevity
...
[<field: DataMember(Name="ExcludeList", IsRequired=false) >]
ExcludeList : int array option
}
From the entries in the ExcludeList, I want to create a comma-separated string (to reduce the number of network hops to the database to update the status). I have tried the following two approaches; neither creates the desired string, and both results are empty:
// Logic to determine if we need to execute this block works correctly
try
// Use F# concat
let strList = request.ExcludeList.Value |> Array.map string
let idString = String.concat ",", strList
// Next try using .NET Join
let idList = String.Join ((",", (request.ExcludeList.Value.Select (fun f -> f)).Distinct).ToString ())
with | ex ->
...
Both compile and execute but neither give me anything in the string. Would greatly appreciate someone pointing out what I am doing wrong here.
let intoArray : int array option = Some [| 1; 23; 16 |]
let strList = intoArray.Value |> Array.map string
let idString = String.concat "," strList // no comma between arguments: `String.concat ",", strList` builds a tuple instead of calling the function
// Next try using .NET Join
let idList = System.String.Join (",", strList) // that also works
Output:
>
val intoArray : int array option = Some [|1; 23; 16|]
val strList : string [] = [|"1"; "23"; "16"|]
val idString : string = "1,23,16"
val idList : string = "1,23,16"
