TF-IDF with a custom list

I have a list of raw strings that looks like this:
listtocheck = ['fadsfsfgblahsdfgsfg','adfaghelloggfg','gagfghellosdfhere','blahsgsdfgsdfhellohsdfhgshstring']
and I want to perform TF-IDF on these using a separate list of items I have (not listtocheck itself).
mylist = ['blah','hello','here','string']
I am vectorising this list as follows:
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer(analyzer = 'char_wb', ngram_range=(2,3))
listvec = tf.fit_transform(mylist)
This gives me the tf-idf of the items in mylist. What I would like to be able to do is count the number of times the n-grams from mylist appear in each item of listtocheck, and then perform TF-IDF based on the total number of times each n-gram appears across all of the strings in listtocheck.

In order to achieve this I had to first call .fit() on mylist and then call .transform() on listtocheck.
Here is the code I used in the end:
from sklearn.feature_extraction.text import TfidfVectorizer

def create_vec(listtocheck, mylist):
    tf = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 3))
    tf.fit(mylist)
    X = tf.transform(listtocheck)
    return X

vecs = create_vec(listtocheck, mylist)
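To sanity-check that the n-gram vocabulary really comes from mylist and not from listtocheck, here is a minimal sketch reusing the names above (it assumes a recent scikit-learn, where the feature-name accessor is get_feature_names_out(); older releases use get_feature_names()):

# Sketch: same fit/transform as create_vec, done inline so the vectorizer can be inspected
tf = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 3))
tf.fit(mylist)                        # vocabulary (character n-grams) comes from mylist only
vecs = tf.transform(listtocheck)      # rows correspond to the strings in listtocheck
print(tf.get_feature_names_out())     # the n-grams learned from mylist
print(vecs.shape)                     # (len(listtocheck), number of n-grams)
print(vecs.toarray())                 # dense view of the tf-idf weights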

Related

Find objects that include an array that contains all elements of a second array

I'm trying to filter a set of objects based on values in one of their elements, using another array. I've got it working with filter just fine if the search is "OR" - it returns all objects for which at least one of the strings in the search array is found.
But I can't figure out how to make it work as an AND search - returning only the objects that match ALL of the strings in the search array.
Example:
struct Schedule {
    let title: String
    let classTypes: [String]
}
let schedule1 = Schedule(title: "One", classTypes: ["math","english","chemistry","drama"])
let schedule2 = Schedule(title: "Two", classTypes: ["pe","math","biology"])
let schedule3 = Schedule(title: "Three", classTypes: ["english","history","math","art"])
let schedules = [schedule1, schedule2, schedule3]
let searchArray = ["math", "english"]
//works for OR - "math" or "english"
var filteredSchedules = schedules.filter { $0.classTypes.contains(where: { searchArray.contains($0) }) }
I'd like to find a way for it to use the same search array
let searchArray = ["math", "english"]
But only return items 1 & 3 - as they both have BOTH math and english in the list.
There are good examples of AND conditions when the AND is across different search criteria (car type and colour), but I've been unable to find an example where the criteria are based dynamically on the items in an array. For context, I could have dozens of schedules with 20+ class types.
You can work with a Set; isSubset(of:) returns true if the schedule's classTypes contains all the elements of searchSet:
let searchSet = Set(searchArray)
var filteredSchedules = schedules.filter { searchSet.isSubset(of: $0.classTypes) }
As suggested by @LeoDabus, it might be worth changing the type of classTypes to a Set instead of an array (if order doesn't matter), since the values seem to be unique. Then the filtering can be done in the opposite direction, without the need to convert searchArray each time:
var filteredSchedules = schedules.filter { $0.classTypes.isSuperset(of: searchArray) }
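For completeness, a sketch of what the Set-backed model might look like (a hypothetical variant of the struct above; a Set still accepts the same array literals, but the order of classTypes is no longer preserved):

struct Schedule {
    let title: String
    let classTypes: Set<String>  // Set instead of [String]
}

// The schedule1/2/3 literals above still compile, and the filter becomes:
let matchingAll = schedules.filter { $0.classTypes.isSuperset(of: searchArray) }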

Kotlin - Find matching objects in array

Let's say I have an array of strings and I want to get a list with objects that match, such as:
var locales = Locale.getAvailableLocales()
val filtered = locales.filter { l -> l.language == "en" }
except, instead of a single value I want to compare it with another list, like:
val lang = listOf("en", "fr", "es")
How do I do that? I'm looking for a one-liner solution without any loops. Thanks!
Like this:
var locales = Locale.getAvailableLocales()
val filtered = locales.filter { l -> lang.contains(l.language)}
As pointed out in the comments, you can skip naming the lambda parameter and use the it keyword, giving either of the following:
val filtered1 = locales.filter{ lang.contains(it.language) }
val filtered2 = locales.filter{ it.language in lang }
Just remember to use a suitable data structure for the languages, such as a Set, so that the contains() lookup has low time complexity.
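A minimal sketch of that last point, assuming the same JVM Locale API as above; keeping the languages in a Set makes each in / contains() check constant time:

import java.util.Locale

fun main() {
    val lang = setOf("en", "fr", "es")  // Set instead of List for O(1) lookups
    val filtered = Locale.getAvailableLocales().filter { it.language in lang }
    println(filtered.map { it.toLanguageTag() })
}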

Slicing the first row of a Dataframe into an Array[String]

import org.apache.spark.sql.functions.broadcast
import org.apache.spark.sql.SparkSession._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.SparkContext._
import org.apache.spark.{SparkConf,SparkContext}
import java.io.File
import org.apache.commons.io.FileUtils
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.expressions.Window
import scala.runtime.ScalaRunTime.{array_apply, array_update}
import scala.collection.mutable.Map
object SimpleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("SimpleApp").setMaster("local")
    val sc = new SparkContext(conf)
    val input = "file:///home/shahid/Desktop/sample1.csv"
    val hdfsOutput = "hdfs://localhost:9001/output.csv"
    val localOutput = "file:///home/shahid/Desktop/output"
    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read.format("com.databricks.spark.csv").load(input)
    var colLen = df.columns.length
    val df1 = df.filter(!(col("_c1") === ""))
I am capturing the top row into a val named headerArr.
    val headerArr = df1.head
I wanted this val to be an Array[String].
println("class = "+headerArr.getClass)
What can I do to either cast this headerArr into an Array[String], or get this top row directly into an Array[String]?
    val fs = org.apache.hadoop.fs.FileSystem.get(new java.net.URI("hdfs://localhost:9001"), sc.hadoopConfiguration)
    fs.delete(new org.apache.hadoop.fs.Path("/output.csv"), true)
    df1.write.csv(hdfsOutput)
    val fileTemp = new File("/home/shahid/Desktop/output/")
    if (fileTemp.exists)
      FileUtils.deleteDirectory(fileTemp)
    df1.write.csv(localOutput)
    sc.stop()
  }
}
I have also tried df1.first, but both return the same type.
The result of the above code on the console is as follows:
class = class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
Help needed.
Thank you for your time. xD
Given the following dataframe:
val df = spark.createDataFrame(Seq(("a", "hello"), ("b", "world"))).toDF("id", "word")
df.show()
+---+-----+
| id| word|
+---+-----+
| a|hello|
| b|world|
+---+-----+
You can get the first row as you already mentioned, and then turn this result into a Seq, which is actually backed by a subtype of Array and which you can then "cast" to an array without copying:
// returns: WrappedArray(a, hello)
df.first.toSeq.asInstanceOf[Array[_]]
Casting is usually not a good practice in a language with very good static typing such as Scala, so you'd probably want to stick to the Seq unless you really have a need for an Array.
Notice that thus far we always ended up not with an array of strings but with an array of objects, since the Row object in Spark has to accommodate various types. If you want to get a collection of strings, you can iterate over the fields and extract the strings:
// returns: Vector(a, hello)
for (i <- 0 until df.first.length) yield df.first.getString(i)
This of course will cause a ClassCastException if the Row contains non-strings. Depending on your needs, you may also want to consider using a Try to silently drop non-strings within the for-comprehension:
import scala.util.Try
// same return type as before
// non-string members will be filtered out of the end result
for {
  i <- 0 until df.first.length
  field <- Try(df.first.getString(i)).toOption
} yield field
Until now we have returned an IndexedSeq, which is suitable for efficient random access (i.e. it has constant access time to any item in the collection), and in particular a Vector. Then again, you may really need to return an Array. To return an Array[String] you may want to call toArray on the Vector, which unfortunately copies the whole thing.
You can skip this step and directly output an Array[String] by explicitly using flatMap instead of relying on the for-comprehension and using collection.breakOut:
// returns: Array[String] -- silently keeping strings only
0.until(df.first.length).
  flatMap(i => Try(df.first.getString(i)).toOption)(collection.breakOut)
To learn more about builders and collection.breakOut you may want to have a read here.
Well, my problem wasn't solved in the best way, but I tried a way out:
val headerArr = df1.first
var headerArray = new Array[String](colLen)
for (i <- 0 until colLen) {
  headerArray(i) = headerArr(i).toString
}
But I am still open to new suggestions, although I am slicing the dataframe into a val of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema and then transferring the elements to an Array[String] with an iteration.
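A more compact variant of the same workaround, as a sketch (it assumes every cell in the first row is non-null and renders sensibly with toString):

// Sketch: Row.toSeq exposes the cells of the first row as a Seq[Any];
// mapping toString over it and calling toArray yields an Array[String].
val headerArray: Array[String] = df1.head.toSeq.map(_.toString).toArray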

Sorting a dictionary by key and converting it into an array

I have a dictionary of prices and quantities. I am getting updates on the prices and values multiple times a second, so I don't want to store them in an array because dictionaries are much faster.
let mainPriceValDict = [Double:Double]()
The data is coming in as an array of JSON, so I am using Codable to parse the JSON and put it into a dictionary. When I use the data, it needs to be sorted in ascending and/or descending order because I am looping through each price in order to reach a certain total quantity. The format of the array I am looping through is as follows:
let loopingArray = [PriceQuantityEntry]()
struct PriceQuantityEntry {
    let price: Double
    let size: Double
}
I want to sort the prices, which are the keys in the dictionary above, and convert them into an array of PriceQuantityEntry. What is the best way to do this, in ascending and descending order? I have tried first sorting all the keys and then grabbing the associated values and putting them into the array in order, but this seems like more processing than the task actually requires.
I think the best way to do this would be to put a custom initializer in the struct to convert the dictionary value to a value of type PriceQuantityEntry but I am not exactly sure how that would work with the sorting.
This is what I am currently doing to get it to work. I just feel like there is a more efficient way for it to be done. If you feel like I should keep the structure as an array instead of converting it to a dict, let me know.
loopingArray = self.mainPriceValDict.sorted { $0.0 < $1.0 }.map { PriceQuantityEntry(price: $0.0, size: $0.1) }
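For the descending pass mentioned above, the same shape works with the comparison flipped; a sketch reusing the names from the question:

// Sketch: sort the dictionary entries by key in descending order,
// then map each (price, size) pair into a PriceQuantityEntry.
let descending = mainPriceValDict
    .sorted { $0.key > $1.key }
    .map { PriceQuantityEntry(price: $0.key, size: $0.value) }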
If you are getting a lot of updates to individual entries, both a dictionary and an array may cause the whole structure to be copied in memory every time an entry is changed.
I would suggest using objects (classes) instead of structures. This will allow you to use both an array and a dictionary to reference the object instances. The dictionary will provide direct access for updates and the array will allow sequential processing in forward or backward order.
[EDIT] Example:
class PriceQuantityEntry
{
    static var all: [PriceQuantityEntry] = []
    static var prices: [Double: PriceQuantityEntry] = [:]

    var price: Double
    var size: Double

    init(price: Double, size: Double)
    {
        self.price = price
        self.size = size
        PriceQuantityEntry.all.append(self)
        // PriceQuantityEntry.all.resort() // on demand or when new prices added
        PriceQuantityEntry.prices[price] = self
    }

    class func update(price: Double, with size: Double)
    {
        if let instance = PriceQuantityEntry.prices[price]
        { instance.size = size }
        else
        {
            let _ = PriceQuantityEntry(price: price, size: size)
            PriceQuantityEntry.resort()
        }
    }

    class func resort()
    {
        PriceQuantityEntry.all.sort { $0.price < $1.price }
    }
}
// if adding multiple initial entries before updates ...
let _ = PriceQuantityEntry(price: 1, size: 3)
let _ = PriceQuantityEntry(price: 1.25, size: 2)
let _ = PriceQuantityEntry(price: 0.95, size: 1)
PriceQuantityEntry.resort()

// for updates ...
PriceQuantityEntry.update(price: 1, with: 2)

// going through the list ...
var count: Double = 0
var total: Double = 0
var quantity: Double = 5
for entry in PriceQuantityEntry.all
{
    total += min(entry.size, quantity - count) * entry.price
    count = min(quantity, count + entry.size)
    if count == quantity { break }
}

Saving users and items features to HDFS in Spark Collaborative filtering RDD

I want to extract users and items features (latent factors) from the result of collaborative filtering using ALS in Spark. The code I have so far:
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.mllib.recommendation.Rating
// Load and parse the data
val data = sc.textFile("myhdfs/inputdirectory/als.data")
val ratings = data.map(_.split(',') match { case Array(user, item, rate) =>
  Rating(user.toInt, item.toInt, rate.toDouble)
})
// Build the recommendation model using ALS
val rank = 10
val numIterations = 10
val model = ALS.train(ratings, rank, numIterations, 0.01)
// extract users latent factors
val users = model.userFeatures
// extract items latent factors
val items = model.productFeatures
// save to HDFS
users.saveAsTextFile("myhdfs/outputdirectory/users") // does not work as expected
items.saveAsTextFile("myhdfs/outputdirectory/items") // does not work as expected
However, what gets written to HDFS is not what I expect. I expected each line to have a tuple (userId, Array_of_doubles). Instead I see the following:
[myname#host dir]$ hadoop fs -cat myhdfs/outputdirectory/users/*
(1,[D@3c3137b5)
(3,[D@505d9755)
(4,[D@241a409a)
(2,[D@c8c56dd)
.
.
It is dumping the hash value of the array instead of the entire array. I did the following to print the desired values:
for (user <- users) {
  val (userId, lf) = user
  val str = "user:" + userId + "\t" + lf.mkString(" ")
  println(str)
}
This does print what I want but I can't then write to HDFS (this prints on the console).
What should I do to get the complete array written to HDFS properly?
Spark version is 1.2.1.
@JohnTitusJungao is right, and the following lines also work as expected:
users.saveAsTextFile("myhdfs/outputdirectory/users")
items.saveAsTextFile("myhdfs/outputdirectory/items")
And this is the reason: userFeatures returns an RDD[(Int, Array[Double])]. The array values are rendered as the symbols you see in the output, e.g. [D@3c3137b5, where [D stands for an array of double, followed by @ and a hex code; this is produced by the default Java toString for this type of object. More on that here.
val users: RDD[(Int, Array[Double])] = model.userFeatures
To solve that, you'll need to turn the array into a string:
val users: RDD[(Int, String)] = model.userFeatures.mapValues(_.mkString(","))
The same goes for items.
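A minimal sketch of the same fix applied to the item factors, reusing the output path from the question:

// Sketch: join each factor array into a comma-separated string before saving
val items: RDD[(Int, String)] = model.productFeatures.mapValues(_.mkString(","))
items.saveAsTextFile("myhdfs/outputdirectory/items")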
