Creating a Random Feature Array in Spark DataFrames

When creating an ALS model, we can extract a userFactors DataFrame and an itemFactors DataFrame. These DataFrames contain a column with an Array.
I would like to generate some random data and union it to the userFactors DataFrame.
Here is my code:
val df1: DataFrame = Seq((123, 456, 4.0), (123, 789, 5.0), (234, 456, 4.5), (234, 789, 1.0)).toDF("user", "item", "rating")
val model1 = (new ALS()
  .setImplicitPrefs(true)
  .fit(df1))
val iF = model1.itemFactors
val uF = model1.userFactors
I then create a random DataFrame using a VectorAssembler with this function:
def makeNew(df: DataFrame, rank: Int): DataFrame = {
  var df_dummy = df
  var i: Int = 0
  var inputCols: Array[String] = Array()
  for (i <- 0 to rank) {
    df_dummy = df_dummy.withColumn("feature".concat(i.toString), rand())
    inputCols = inputCols :+ "feature".concat(i.toString)
  }
  val assembler = new VectorAssembler()
    .setInputCols(inputCols)
    .setOutputCol("userFeatures")
  val output = assembler.transform(df_dummy)
  output.select("user", "userFeatures")
}
I then create the DataFrame with new user IDs and add the random vectors and bias:
val usersDf: DataFrame = Seq((567), (678)).toDF("user")
var usersFactorsNew: DataFrame = makeNew(usersDf, 20)
The problem arises when I union the two DataFrames.
usersFactorsNew.union(uF) produces the error:
org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. struct<type:tinyint,size:int,indices:array<int>,values:array<double>> <> array<float> at the second column of the second table;;
If I print the schema, the uF DataFrame has a feature vector of type Array[Float] and the usersFactorsNew DataFrame has a feature vector of type Vector.
My question is how to change the type of the Vector to an Array in order to perform the union.
I tried writing this udf with little success:
val toArr: org.apache.spark.ml.linalg.Vector => Array[Double] = _.toArray
val toArrUdf = udf(toArr)
Perhaps the VectorAssembler is not the best option for this task. However, at the moment, it is the only option I have found. I would love to get some recommendations for something better.

Instead of creating a dummy dataframe and using VectorAssembler to generate a random feature vector, you can simply use a UDF directly. The userFactors from the ALS model will return an Array[Float], so the output from the UDF should match that.
import scala.util.Random
import org.apache.spark.sql.functions.{lit, udf}

val createRandomArray = udf((rank: Int) => {
  Array.fill(rank)(Random.nextFloat)
})
Note that this will give numbers in the interval [0.0, 1.0) (the same range as the rand() used in the question). If other numbers are required, modify as needed.
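For example, here is a sketch (my own addition, not part of the original answer) of how the range could be shifted to [-1.0, 1.0) instead:

// scale Random.nextFloat from [0.0, 1.0) to [-1.0, 1.0)
val createRandomArrayScaled = udf((rank: Int) => {
  Array.fill(rank)(Random.nextFloat * 2.0f - 1.0f)
})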
Using a rank of 3 and the usersDf:
val usersFactorsNew = usersDf.withColumn("userFeatures", createRandomArray(lit(3)))
will give a dataframe like the following (with random feature values, of course):
+----+----------------------------------------------------------+
|user|userFeatures |
+----+----------------------------------------------------------+
|567 |[0.6866711267486822,0.7257031656127676,0.983562255688249] |
|678 |[0.7013908820314967,0.41029552817665327,0.554591149586789]|
+----+----------------------------------------------------------+
Unioning this dataframe with the uF dataframe should now be possible.
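For instance, a minimal sketch of the union itself (my addition; note that union matches columns by position and keeps the column names of the left-hand side):

// assumes createRandomArray was called with the model's rank (10 by default),
// so the generated arrays have the same length as the ALS factors
val allFactors = usersFactorsNew.union(uF)
allFactors.show(false)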
The reason the UDF did not work is that it produces an Array[Double], while you need an Array[Float] for the union. It should be possible to fix with a map(_.toFloat).
val toArr: org.apache.spark.ml.linalg.Vector => Array[Float] = _.toArray.map(_.toFloat)
val toArrUdf = udf(toArr)
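Applied to the VectorAssembler output from the question, a minimal sketch could look like this (the answer below does essentially the same thing inside makeNew; col comes from org.apache.spark.sql.functions):

// convert the assembled Vector column into array<float> so it lines up with uF
val usersFactorsArr = makeNew(usersDf, 20)
  .select(col("user"), toArrUdf(col("userFeatures")).as("features"))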

Your whole process is correct, and even the udf function works successfully. All you need to do is change the last part of the makeNew function to:
def makeNew(df: DataFrame, rank: Int): DataFrame = {
  var df_dummy = df
  var i: Int = 0
  var inputCols: Array[String] = Array()
  for (i <- 0 to rank) {
    df_dummy = df_dummy.withColumn("feature".concat(i.toString), rand())
    inputCols = inputCols :+ "feature".concat(i.toString)
  }
  val assembler = new VectorAssembler()
    .setInputCols(inputCols)
    .setOutputCol("userFeatures")
  val output = assembler.transform(df_dummy)
  output.select(col("id"), toArrUdf(col("userFeatures")).as("features"))
}
and you should be good to go. When you do the following (note that I created usersDf with an id column rather than a user column):
val usersDf: DataFrame = Seq((567), (678)).toDF("id")
var usersFactorsNew: DataFrame = makeNew(usersDf, 20)
usersFactorsNew.union(uF).show(false)
you should be getting
+---+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|id |features |
+---+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|567|[0.8259185719733708, 0.327713892339658, 0.049547223031371046, 0.056661808506210054, 0.5846626163454274, 0.038497936270104005, 0.8970865088803417, 0.8840660648882804, 0.837866669938156, 0.9395263094918058, 0.09179528484355126, 0.4915430644129799, 0.11083447052043116, 0.5122858182953718, 0.4302683812966408, 0.3862741815833828, 0.6189322403095068, 0.3000371006293433, 0.09331299668168902, 0.7421838728601371, 0.855867963988993]|
|678|[0.7686514248005568, 0.5473580740023187, 0.072945344124282, 0.36648594574355287, 0.9780202082328863, 0.5289221651923784, 0.3719451099963028, 0.2824660794505932, 0.4873197501260199, 0.9364676464120849, 0.011539929543513794, 0.5240615794930654, 0.6282546154521298, 0.995256022569878, 0.6659179561266975, 0.8990775317754092, 0.08650071017556926, 0.5190186149992805, 0.056345335742325475, 0.6465357505620791, 0.17913532817943245] |
|123|[0.04177388548851013, 0.26762014627456665, -0.19617630541324615, 0.34298020601272583, 0.19632814824581146, -0.2748605012893677, 0.07724890112876892, 0.4277132749557495, 0.1927199512720108, -0.40271613001823425] |
|234|[0.04139673709869385, 0.26520395278930664, -0.19440513849258423, 0.3398836553096771, 0.1945556253194809, -0.27237895131111145, 0.07655145972967148, 0.42385169863700867, 0.19098000228405, -0.39908021688461304] |
+---+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
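As one more option for the "something better" the question asks about, here is a sketch of my own (not from either answer): the random array<float> column can also be built directly with Spark SQL functions, without a VectorAssembler or a UDF:

import org.apache.spark.sql.functions.{array, rand}

// build rank independent random columns, cast them to float, and collect them
// into a single array<float> column matching the type of the ALS factors
def makeNewDirect(df: DataFrame, rank: Int): DataFrame = {
  val randCols = Seq.fill(rank)(rand().cast("float"))
  df.withColumn("features", array(randCols: _*))
}

Since rand() already produces values in [0.0, 1.0), casting to float gives the same element type as userFactors, so no further conversion is needed before the union.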

Related

I'm trying to convert a Pandas DataFrame to a HuggingFace DatasetDict

I have a pandas dataframe with 20k rows containing two columns named English and te. I changed the English column name to en. I'm trying to split the dataset into train, validation and test sets, and I want to convert it into raw_datasets. The output I'm expecting is:
DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 18000
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 1000
    })
})
I want to be able to write code like raw_datasets["train"][0] and have it return output like the below:
{'translation': {'en': 'Membership of Parliament: see Minutes',
'to': 'Componenţa Parlamentului: a se vedea procesul-verbal'}}
The data must be in a DatasetDict, similar to what we get when loading a dataset from Hugging Face. Below is the code I've written, but it's not working:
import pandas as pd
from collections import namedtuple

Dataset = namedtuple('Dataset', ['features', 'num_rows'])
DatasetDict = namedtuple('DatasetDict', ['train', 'validation', 'test'])

def create_dataset_dict(df):
    # Rename the column
    df = df.rename(columns={'English': 'en'})
    # Split the data into train, validation and test
    train_df = df.iloc[:18000, :]
    validation_df = df.iloc[18000:19000, :]
    test_df = df.iloc[19000:, :]
    # Create the dataset dictionaries
    train = Dataset(features=['translation'], num_rows=18000)
    validation = Dataset(features=['translation'], num_rows=1000)
    test = Dataset(features=['translation'], num_rows=1052)
    # Create the final dataset dictionary
    datasets = DatasetDict(train=train, validation=validation, test=test)
    return datasets

def preprocess_dataset(df):
    df = df.rename(columns={'English': 'en'})
    train_df = df.iloc[:18000, :]
    validation_df = df.iloc[18000:19000, :]
    test_df = df.iloc[19000:, :]
    train_dict = [{'translation': {'en': row['en'], 'te': row['te']}} for _, row in train_df.iterrows()]
    validation_dict = [{'translation': {'en': row['en'], 'te': row['te']}} for _, row in validation_df.iterrows()]
    test_dict = [{'translation': {'en': row['en'], 'te': row['te']}} for _, row in test_df.iterrows()]
    return DatasetDict(train=train_dict, validation=validation_dict, test=test_dict)

df = pd.read_csv('eng-to-te.csv')
raw_datasets = preprocess_dataset(df)
The above code is not working. Can anyone help me with this?

Spark GraphX - How to pass an array to filter graph edges?

I am using Scala on Spark 2.1.0 GraphX. I have an array as shown below:
scala> TEMP1Vertex.take(5)
res46: Array[org.apache.spark.graphx.VertexId] = Array(-1895512637, -1745667420, -1448961741, -1352361520, -1286348803)
If I had to filter the edge table for a single value, let's say for source ID -1895512637:
val TEMP1Edge = graph.edges.filter { case Edge(src, dst, prop) => src == -1895512637}
scala> TEMP1Edge.take(5)
res52: Array[org.apache.spark.graphx.Edge[Int]] = Array(Edge(-1895512637,-2105158920,89), Edge(-1895512637,-2020727043,3), Edge(-1895512637,-1963423298,449), Edge(-1895512637,-1855207100,214), Edge(-1895512637,-1852287689,339))
scala> TEMP1Edge.count
17/04/03 10:20:31 WARN Executor: 1 block locks were not released by TID = 1436:[rdd_36_2]
res53: Long = 126
But when I pass an array which contains a set of unique source IDs, the code runs successfully but it doesn't return any values as shown below:
scala> val TEMP1Edge = graph.edges.filter { case Edge(src, dst, prop) => src == TEMP1Vertex}
TEMP1Edge: org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[Int]] = MapPartitionsRDD[929] at filter at <console>:56
scala> TEMP1Edge.take(5)
17/04/03 10:29:07 WARN Executor: 1 block locks were not released by TID = 1471:
[rdd_36_5]
res60: Array[org.apache.spark.graphx.Edge[Int]] = Array()
scala> TEMP1Edge.count
17/04/03 10:29:10 WARN Executor: 1 block locks were not released by TID = 1477:
[rdd_36_5]
res61: Long = 0
I suppose that TEMP1Vertex is of type Array[VertexId], so I think your code should look like this:
val TEMP1Edge = graph.edges.filter {
  case Edge(src, _, _) => TEMP1Vertex.contains(src)
}
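As a side note (my addition, not part of the original answer): if the array of vertex IDs is large, converting it to a Set first keeps the per-edge lookup constant time instead of a linear scan:

// a sketch: Set membership is O(1) per edge, versus scanning the whole array
val vertexSet = TEMP1Vertex.toSet
val TEMP1Edge = graph.edges.filter { case Edge(src, _, _) => vertexSet.contains(src) }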

numpy slicing using user-defined input

I have (in a larger project) data contained in a numpy array.
Based on user input I need to move a selected axis (dimAxisNr) to the first dimension of the array and slice one or more (including the first) dimension based on user input (such as Select2 and Select0 in the example).
Using this input I generate a DataSelect which contains the information needed to slice. But the output size of the sliced array is different from the one using inline indexing. So basically I need a way to generate the '37:40:2' and '0:2' from an input list.
import numpy as np
dimAxisNr = 1
Select2 = [37,39]
Select0 = [0,1]
plotData = np.random.random((102,72,145,2))
DataSetSize = np.shape(plotData)
DataSelect = [slice(0,item) for item in DataSetSize]
DataSelect[2] = np.array(Select2)
DataSelect[0] = np.array(Select0)
def shift(seq, n):
    n = n % len(seq)
    return seq[n:] + seq[:n]
#Sort and Slice the data
print(np.shape(plotData))
print(DataSelect)
plotData = np.transpose(plotData, np.roll(range(plotData.ndim),-dimAxisNr))
DataSelect = shift(DataSelect,dimAxisNr)
print(DataSelect)
print(np.shape(plotData))
plotData = plotData[DataSelect]
print(np.shape(plotData))
plotDataDirect = plotData[slice(0, 72, None), 37:40:2, slice(0, 2, None), 0:2]
print(np.shape(plotDataDirect))
I'm not sure I've understood your question at all...
But if the question is "How do I generate a slice based on a list of indices like [37,39,40,23]?",
then I would answer: you don't have to; just use the list as-is to select the right indices, like so:
a = np.random.rand(4,5)
print(a)
indices = [2,3,1]
print(a[0:2,indices])
Note that the ordering of the list matters: [2,3,1] yields a different result from [1,2,3].
Output :
>>> a
array([[ 0.47814802, 0.42069094, 0.96244966, 0.23886243, 0.86159478],
[ 0.09248812, 0.85569145, 0.63619014, 0.65814667, 0.45387509],
[ 0.25933109, 0.84525826, 0.31608609, 0.99326598, 0.40698516],
[ 0.20685221, 0.1415642 , 0.21723372, 0.62213483, 0.28025124]])
>>> a[0:2,[2,3,1]]
array([[ 0.96244966, 0.23886243, 0.42069094],
[ 0.63619014, 0.65814667, 0.85569145]])
I have found the answer to my question. I need to use numpy.ix_.
Here is the working code:
import numpy as np
dimAxisNr = 1
Select2 = [37,39]
Select0 = [0,1]
plotData = np.random.random((102,72,145,2))
DataSetSize = np.shape(plotData)
DataSelect = [np.arange(0,item) for item in DataSetSize]
DataSelect[2] = Select2
DataSelect[0] = Select0
#print(list(37:40:2))
def shift(seq, n):
    n = n % len(seq)
    return seq[n:] + seq[:n]
#Sort and Slice the data
print(np.shape(plotData))
print(DataSelect)
plotData = np.transpose(plotData, np.roll(range(plotData.ndim),-dimAxisNr))
DataSelect = shift(DataSelect,dimAxisNr)
plotDataSlice = plotData[np.ix_(*DataSelect)]
print(np.shape(plotDataSlice))
plotDataDirect = plotData[slice(0, 72, None), 37:40:2, slice(0, 2, None), 0:1]
print(np.shape(plotDataDirect))

Practicing with Spark's join and Scala on Array[String]

I am new to both Spark and Scala, and I'm trying to practice the join command in Spark.
I have two csv files:
Ads.csv is
5de3ae82-d56a-4f70-8738-7e787172c018,AdProvider1
f1b6c6f4-8221-443d-812e-de857b77b2f4,AdProvider2
aca88cd0-fe50-40eb-8bda-81965b377827,AdProvider1
940c138a-88d3-4248-911a-7dbe6a074d9f,AdProvider3
983bb5e5-6d5b-4489-85b3-00e1d62f6a3a,AdProvider3
00832901-21a6-4888-b06b-1f43b9d1acac,AdProvider1
9a1786e1-ab21-43e3-b4b2-4193f572acbc,AdProvider1
50a78218-d65a-4574-90de-0c46affbe7f3,AdProvider5
d9bb837f-c85d-45d4-95f2-97164c62aa42,AdProvider4
611cf585-a8cf-43e9-9914-c9d1dc30dab5,AdProvider1
Impression.csv is:
5de3ae82-d56a-4f70-8738-7e787172c018,Publisher1
f1b6c6f4-8221-443d-812e-de857b77b2f4,Publisher2
aca88cd0-fe50-40eb-8bda-81965b377827,Publisher1
940c138a-88d3-4248-911a-7dbe6a074d9f,Publisher3
983bb5e5-6d5b-4489-85b3-00e1d62f6a3a,Publisher3
00832901-21a6-4888-b06b-1f43b9d1acac,Publisher1
9a1786e1-ab21-43e3-b4b2-4193f572acbc,Publisher1
611cf585-a8cf-43e9-9914-c9d1dc30dab5,Publisher1
I want to join them with the first ID as the key and two values.
So I read them in like this:
val ads = sc.textFile("ads.csv")
ads: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21
val impressions = sc.textFile("impressions.csv")
impressions: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at textFile at <console>:21
Ok, so I have to make key,value pairs:
val adPairs = ads.map(line => line.split(","))
val impressionPairs = impressions.map(line => line.split(","))
res11: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[6] at map at <console>:23
res13: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[7] at map at <console>:23
But I can't join them:
val result = impressionPairs.join(adPairs)
<console>:29: error: value join is not a member of org.apache.spark.rdd.RDD[Array[String]]
val result = impressionPairs.join(adPairs)
Do I need to convert the pairs into another format?
You are almost there, but what you need is to transform the Array[String] into key-value pairs, like this:
val adPairs = ads.map(line => {
  val substrings = line.split(",")
  (substrings(0), substrings(1))
})
(and the same for impressionPairs)
That will give you RDDs of type RDD[(String, String)], which can then be joined :)
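For completeness, a sketch of what that looks like end to end (my own illustration):

// both RDDs are now RDD[(String, String)], keyed by the ID in the first column
val adPairs = ads.map { line =>
  val fields = line.split(",")
  (fields(0), fields(1))
}
val impressionPairs = impressions.map { line =>
  val fields = line.split(",")
  (fields(0), fields(1))
}

// join by key; the result is RDD[(String, (String, String))],
// e.g. (id, (publisher, adProvider))
val result = impressionPairs.join(adPairs)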

Speeding up a pattern-matching algorithm in Scala on a big CSV file

I'm currently trying to filter a large database using Scala. I've written a simple piece of code to match an ID in one database to a list of IDs in another.
Essentially I want to go through database A and if the ID number in the ID column matches one from database B, to extract that entry from Database A.
The code I've written works fine, but it's slow (it has to run over a couple of days), and I'm trying to find a way to speed it up. It may be that it can't be sped up by much, or it may be much faster with better coding.
So any help would be much appreciated.
Below is a description of the databases and a copy of the code.
Database A is approximately 10 GB in size with over 100 million entries, and database B has a list of approximately 50,000 IDs.
Each database looks as follows:
Database A:
ID, DataX, date
10, 100,01012000
15, 20, 01012008
5, 32, 01012006
etc...
Database B:
ID
10
15
12
etc...
My code is as follows:
import scala.io.Source
import java.io._
object filter extends App {

  def ext[T <: Closeable, R](resource: T)(block: T => R): R = {
    try { block(resource) }
    finally { resource.close() }
  }

  val key = io.Source.fromFile("C:\\~Database_B.csv").getLines()
  val key2 = new Array[String](50000)
  key.copyToArray(key2)

  ext(new BufferedWriter(new OutputStreamWriter(new FileOutputStream("C:\\~Output.csv")))) {
    writer =>
      val line = io.Source.fromFile("C:\\~Database_A.csv").getLines.drop(1)
      while (line.hasNext) {
        val data = line.next
        val array = data.split(",").map(_.trim)
        val idA = array(0)
        val dataX = array(1)
        val date = array(2)
        key2.map { idB =>
          if (idA == idB) {
            val print = (idA + "," + dataX + "," + date)
            writer.write(print)
            writer.newLine()
          } else None
        }
      }
  }
}
First, there are far more efficient ways to do this than writing a Scala program. Loading the two tables into a database and doing a join would take about 10 minutes (including data loading) on a modern computer.
Assuming you have to use Scala, there is an obvious improvement: store your keys in a HashSet and use keys.contains(x) instead of traversing all the keys. This gives you O(1) lookups instead of the O(N) you have now, which should speed up your program significantly.
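A minimal sketch of that idea (the full version appears in the code further below):

// load the ~50,000 IDs from Database B once into a Set for O(1) membership tests
val keySet: Set[String] = io.Source.fromFile("C:\\~Database_B.csv").getLines().toSet

// then, inside the loop over Database A, replace the key2.map { ... } block with:
if (keySet.contains(idA)) {
  writer.write(idA + "," + dataX + "," + date)
  writer.newLine()
}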
Minor point -- use string interpolation instead of concatenation, i.e.
s"$idA,$dataX,$date"
// instead of
idA + "," + dataX + "," + date
Try this:
import scala.io.Source
import java.io._
object filter extends App {

  def ext[T <: Closeable, R](resource: T)(block: T => R): R = {
    try { block(resource) }
    finally { resource.close() }
  }

  // convert to a Set
  val key2 = io.Source.fromFile("C:\\~Database_B.csv").getLines().toSet

  ext(new BufferedWriter(new OutputStreamWriter(new FileOutputStream("C:\\~Output.csv")))) {
    writer =>
      val lines = io.Source.fromFile("C:\\~Database_A.csv").getLines.drop(1)
      for (data <- lines) {
        val array = data.split(",").map(_.trim)
        array match {
          case Array(idA, dataX, date) =>
            if (key2.contains(idA)) {
              val print = (idA + "," + dataX + "," + date)
              writer.write(print)
              writer.newLine()
            }
          case _ => // invalid input
        }
      }
  }
}
The IDs are now stored in a set, which gives better lookup performance.
