I have a logic question here. I need to print a string in the format "user#email.com group#email.com" for each user whose job title, department, and manager match the job title, department, and manager in the group CSV. If there is more than one matching group, the string should be printed once per matching group. What would be the best approach here? Create arrays from both CSV files and then compare them in a loop?
Example
GroupCSV:
Group,Job title,Department,Manager (email address)
somesalesgroup#dundermifflin.com,Senior Sales Manager,Sales,michael.scott#dundermifflin.com
anothersalesgroup#dundermifflin.com,Senior Sales Manager,Sales,michael.scott#dundermifflin.com
UserCSV:
First name,Last name,Location,Start date,Job title,Department,Manager (email address)
Jim,Halpert,Scranton,7/1/2021,Senior Sales Manager,Sales,michael.scott#dundermifflin.com
Dwight,Schrute,Scranton,7/1/2021,Assistant to the Regional Manager,Sales,michael.scott#dundermifflin.com
I would like to have an output like:
[jim.halpert#dundermifflin#takeaway.com somesalesgroup#dundermifflin.com]
[jim.halpert#dundermifflin#takeaway.com anothersalesgroup#dundermifflin.com]
At the moment I've got this:
var match [][]string
for _, u := range userRows {
	for _, g := range groupRows {
		if u[0] == g[0] {
			match = append(match, string{g, u})
		}
	}
}
But I'm not sure what may be wrong here (string{g, u}).
Resolved this way (the composite literal needs the explicit slice type []string, and its elements must be strings rather than whole rows):
var match [][]string
for _, u := range userRows {
	for _, g := range groupRows {
		if u[6] == g[1] && u[2] == g[0] {
			match = append(match, []string{u[5], g[2]})
		}
	}
}
The way I can think of is to convert GroupCSV into a map keyed by (job title, department, manager), and then read UserCSV and look each user up in that map.
If reading each CSV file is one for loop, that's two loops in total, instead of a nested loop over every user/group pair.
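A minimal sketch of that idea (assuming both files have already been read into [][]string with encoding/csv and the header rows dropped; the key struct and the userEmail helper are illustrative, since the user CSV has no email column):

type key struct{ title, dept, manager string }

// index every group address by (job title, department, manager)
groups := make(map[key][]string)
for _, g := range groupRows {
	k := key{title: g[1], dept: g[2], manager: g[3]}
	groups[k] = append(groups[k], g[0]) // g[0] is the group address
}

// one pass over the users; print one line per matching group
for _, u := range userRows {
	k := key{title: u[4], dept: u[5], manager: u[6]}
	for _, group := range groups[k] {
		fmt.Printf("%s %s\n", userEmail(u), group) // userEmail: hypothetical helper building the address from first/last name
	}
}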
I am trying to find the highest sales between two given dates.
this is my ad_report.csv file, with headers:
date,impressions,clicks,sales,ad_spend,keyword_id,asin
2017-06-19,4451,1006,608,24.87,UVOLBWHILJ,63N02JK10S
2017-06-18,5283,3237,1233,85.06,UVOLBWHILJ,63N02JK10S
2017-06-17,0,0,0,21.77,UVOLBWHILJ,63N02JK10S
...
Below is all the working code I have that returns the row with the highest value, but not between the given dates.
require 'csv'
require 'date'
# get directory of the current file
LIB_DIR = File.dirname(__FILE__)
# get the absolute path of the ad_report & product_report CSV
# and set to a var
AD_CSV_PATH = File.expand_path('data/ad_report.csv', LIB_DIR)
PROD_CSV_PATH = File.expand_path('data/product_report.csv', LIB_DIR)
# create a CSV::Table for the ad_report and product_report CSVs
ad_report_table = CSV.parse(File.read(AD_CSV_PATH), headers: true)
prod_report_table = CSV.parse(File.read(PROD_CSV_PATH), headers: true)
## finds the row with the highest sales
sales_row = ad_report_table.max_by { |row| row[3].to_i }
At this point I can get the row that has the greatest sale, and all the data from that row, but it is not in the expected range.
Below I am trying to use a range with preset dates.
## range of date for items between
first_date = Date.new(2017, 05, 02)
last_date = Date.new(2017, 05, 31)
range = (first_date...last_date)
puts sales_row
Below is pseudocode of what I feel I am supposed to do, but there is probably a better method.
## check for highest sales
## return sales if between date
## else reject col if
## loop this until it returns date between
## return result
You could do this by creating a range containing two dates and then use Range#cover? to test if the date is in the range:
range = Date.new(2015, 1, 1)..Date.new(2020, 1, 1)
rows.select do |row|
  range.cover?(Date.parse(row[0]))
end.max_by { |row| row[3].to_i }
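Applied to the table from the question, a small sketch (assuming the ad_report_table and headers shown above):

range = Date.new(2017, 6, 17)..Date.new(2017, 6, 19)
best = ad_report_table
  .select { |row| range.cover?(Date.parse(row['date'])) }
  .max_by { |row| row['sales'].to_i }
puts best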
Although the Tin Man is completely right in that you should use a database instead.
You could obtain the desired value as follows. I have assumed that the field of interest ('sales') represents integer values. If not, change .to_i to .to_f below.
Code
require 'csv'
def greatest(fname, max_field, date_field, date_range)
  largest = nil
  # stream the file row by row, keeping only the best row seen so far
  CSV.foreach(fname, headers: true) do |csv|
    largest = { row: csv.to_a, value: csv[max_field].to_i } if
      date_range.cover?(csv[date_field]) &&
      (largest.nil? || csv[max_field].to_i > largest[:value])
  end
  # return the winning row as a hash, or nil if no row fell in the range
  largest.nil? ? nil : largest[:row].to_h
end
Examples
Let's first create a CSV file.
str =<<~END
date,impressions,clicks,sales,ad_spend,keyword_id,asin
2017-06-19,4451,1006,608,24.87,UVOLBWHILJ,63N02JK10S
2017-06-18,5283,3237,1233,85.06,UVOLBWHILJ,63N02JK10S
2017-06-17,0,0,0,21.77,UVOLBWHILJ,63N02JK10S
2017-06-20,4451,1006,200000,24.87,UVOLBWHILJ,63N02JK10S
END
fname = 't.csv'
File.write(fname, str)
#=> 263
Now find the record within a given date range for which the value of "sales" is greatest.
greatest(fname, 'sales', 'date', '2017-06-17'..'2017-06-19')
#=> {"date"=>"2017-06-18", "impressions"=>"5283", "clicks"=>"3237",
# "sales"=>"1233", "ad_spend"=>"85.06", "keyword_id"=>"UVOLBWHILJ",
# "asin"=>"63N02JK10S"}
greatest(fname, 'sales', 'date', '2017-06-17'..'2017-06-25')
#=> {"date"=>"2017-06-20", "impressions"=>"4451", "clicks"=>"1006",
# "sales"=>"200000", "ad_spend"=>"24.87", "keyword_id"=>"UVOLBWHILJ",
# "asin"=>"63N02JK10S"}
greatest(fname, 'sales', 'date', '2017-06-22'..'2017-06-25')
#=> nil
I read the file line-by-line (using CSV#foreach) to keep memory requirements to a minimum, which could be essential if the file is large.
Notice that, because the date is in "yyyy-mm-dd" format, it is not necessary to convert two dates to Date objects to compare them; that is, they can be compared as strings (e.g. '2017-06-17' <= '2017-06-18' #=> true).
I have what appears to be a strange bug in either the gocql driver for Cassandra, or in the Cassandra database itself.
I am trying to do a simple write and then a read-all request in two separate functions. I would expect to get all entries from the read-all request, but I am only getting the last entry from Cassandra.
Here is how I am doing the write:
util.CassSession, _ = util.CassCluster.CreateSession()
defer util.CassSession.Close()
keySpaceMeta, _ := util.CassSession.KeyspaceMetadata("platypus")
valC, exists := keySpaceMeta.Tables["cassmessage"]
if exists {
	fmt.Println("cassmessage exists!!!")
} else {
	fmt.Println("cassmessage doesn't exist!")
}
if valC != nil {
	fmt.Println("return from valC cassmessage: ", valC)
}
insertString := `INSERT INTO cassmessage
	(messagefrom, messageto, messagecontent)
	VALUES('` + sendMsgReq.MessageFrom + `', '` +
	sendMsgReq.MessageTo + `', '` + sendMsgReq.MessageContent + `')`
fmt.Println("insertString value: ", insertString)
err := util.CassSession.Query(insertString).Exec()
if err != nil {
	fmt.Println("there was an error in appending data to cassmessage: ", err)
} else {
	fmt.Println("inserted data into cassmessage successfully")
}
the terminal output from the above:
app_1 | [17:59:43][WEBSERVER] : cassmessage exists!!!
app_1 | [17:59:43][WEBSERVER] : return from valC cassmessage:
&{platypus cassmessage [] []
[0xc000400140] [] map[messagefrom:0xc0004000a0
messageto:0xc000400140 messagecontent:0xc000400000]
[messagecontent messagefrom messageto]}
app_1 | [17:59:43][WEBSERVER] : inserted data into cassmessage successfully
I am not entirely sure what the output of valC means, although it appears to include some sort of memory address, which is a good sign. I also see that I am not getting any error from the write Exec call, which is hopeful.
Here is how I am doing the read:
util.CassSession, _ = util.CassCluster.CreateSession()
defer util.CassSession.Close()
keySpaceMeta, _ := util.CassSession.KeyspaceMetadata("platypus")
valC, exists := keySpaceMeta.Tables["cassmessage"]

queryString := `SELECT messageto, messagecontent, messagefrom FROM cassmessage WHERE messagefrom='` + mailReq.Email + `'`
// returns nothing, should return many rows
queryString2 := `SELECT messageto, messagecontent, messagefrom FROM cassmessage`
// returns only last entry, should return many rows
queryString3 := `SELECT * FROM cassmessage WHERE messagefrom='` + mailReq.Email + `'`
// returns nothing, should return many rows
queryAllString := `SELECT * FROM cassmessage`
// returns only last entry, should return many rows

var messageto string
var messagecontent string
var messagefrom string
iter := util.CassSession.Query(queryAllString).Iter()
for iter.Scan(&messageto, &messagecontent, &messagefrom) {
	fmt.Println("Iter messageto: %v", messageto)
	fmt.Println("Iter messagecontent: %v", messagecontent)
	fmt.Println("Iter messagefrom: %v", messagefrom)
}
the terminal output from above:
app_1 | [18:09:54][WEBSERVER] : Iter messageto: %v xyz#xyz.com
app_1 | [18:09:54][WEBSERVER] : Iter messagecontent: %v a
app_1 | [18:09:54][WEBSERVER] : Iter messagefrom: %v abc#abc.com
This is not what I expect, as this is the output from the read after multiple writes to the database. If you look at the comments on the various queryString values I have tried, two of them return nothing when I expect all entries to be returned, and two of them return only the last written entry (they are all equivalent queries, to my knowledge).
Does anyone know why I cannot return multiple entries using Iter or why my four different values on the different query strings I have tried are returning different results?
Thank you.
I maybe shouldn't, but I'm going to keep this here in case someone else runs into the same problem. I wasn't making sure that the primary key in my table was unique: in Cassandra an INSERT is effectively an upsert, so every write with the same primary key overwrote the previous row, which is why only the last entry ever came back. Doing something like this:
util.CassSession.Query("CREATE TABLE cassmessage(" +
	"messageto text, messagefrom text, messagecontent text, uniqueID text, PRIMARY KEY (uniqueID))").Exec()
managed to fix the issue.
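As a minimal sketch of a write under that schema (assuming the session from above; gocql.TimeUUID() is just one convenient way to generate a unique id, and the ? bind parameters avoid building the CQL by string concatenation):

err := util.CassSession.Query(
	`INSERT INTO cassmessage (uniqueID, messagefrom, messageto, messagecontent) VALUES (?, ?, ?, ?)`,
	gocql.TimeUUID().String(), sendMsgReq.MessageFrom, sendMsgReq.MessageTo, sendMsgReq.MessageContent,
).Exec()
if err != nil {
	fmt.Println("there was an error inserting into cassmessage: ", err)
}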
Thanks to everyone who took a look and helped. Cheers!
I have multiple CSV files that contain the name and the price of products. A product may or may not appear in more than one file. I have to find the highest and the lowest price across these files for each product.
I joined products from both files into one array:
Dir["./*.csv"].each do |file|
  CSV.foreach(file, headers: true) do |row|
    tmpRow = row.to_s.chomp + "," + file # saving name of the input file
    list.push(tmpRow.chomp.split(","))
  end
end
The array list looks like this:
[["5893105","2.38", "weightOrSomethingIrrelevant", "./FIAT_2.csv"]]
This is the main algorithm:
while list[0] do
  if list[0] != nil
    tmpPart = list[0][0]
    tmpParts = list.select { |part, price| part == tmpPart }
    tmpParts.each do |tp|
      tmpPrices.push(tp[1])
    end
    list[0][2].to_f != 0.0 ? tmpWeight = list[0][2].to_s : tmpWeight = "Undefined"
    tmpMaxPrice = tmpParts.select { |part, price| part == tmpPart && price == tmpPrices.max }
    tmpMinPrice = tmpParts.select { |part, price| part == tmpPart && price == tmpPrices.min }
    result.push([tmpPart, tmpWeight, tmpPrices.max, tmpMaxPrice[0].last, tmpPrices.min, tmpMinPrice[0].last])
    tmpPart = ""
    list = list - tmpParts
    tmpParts = []
    tmpPrices = []
    tmpMaxPrice = []
    tmpMinPrice = []
    tmpWeight = ""
  end
end
The input files are huge (over 200,000 rows), so I am having problems with the efficiency of my algorithm (it processes about one row per half second).
I am wondering if there is any better way to write this app.
I would split this into several parts:
1) I suggest you have a table which represents the files (file name, location, line number, etc.) and, connected to that, a product table holding the row data from each file
2) a script / function to ingest the files and store the rows as DB records
3) a script / function to analyse the rows and find products by name, using the DB and pulling the price info out with MIN/MAX (see the sketch after this list)
This could later be improved to deal with naming inconsistencies, products vs product occurrences, etc.
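A minimal sketch of steps 2 and 3, assuming the sqlite3 gem and a name,price,... row layout as in the question (the file and table names are illustrative):

require "csv"
require "sqlite3"

db = SQLite3::Database.new("products.db")
db.execute("CREATE TABLE IF NOT EXISTS products (name TEXT, price REAL, source_file TEXT)")

# step 2: ingest every CSV row as a DB record (one transaction keeps it fast)
db.transaction do
  Dir["./*.csv"].each do |file|
    CSV.foreach(file, headers: true) do |row|
      db.execute("INSERT INTO products VALUES (?, ?, ?)", [row[0], row[1].to_f, file])
    end
  end
end

# step 3: let the database compute the extremes per product
db.execute("SELECT name, MIN(price), MAX(price) FROM products GROUP BY name") do |name, min, max|
  puts "#{name}: min #{min}, max #{max}"
end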
I'm trying to read an input file in Scala whose structure I know; however, I only need every 9th entry. So far I have managed to read the whole thing using:
val lines = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0))
val fields = lines.map(line => line.split(","))
The issue: this leaves me with a huge dataset (we're talking 20 GB of data). Not only have I been forced to write some very ugly code to convert between RDD[Array[String]] and Array[String], but it has essentially made my code useless.
I've tried different approaches and mixes between using
.map()
.flatMap() and
.reduceByKey()
however nothing actually put my collected "cells" into the format I need them in.
Here's what is supposed to happen: Reading a folder of text files from our server, the code should read each "line" of text in the format:
*---------*
| NASDAQ: |
*---------*
exchange, stock_symbol, date, stock_price_open, stock_price_high, stock_price_low, stock_price_close, stock_volume, stock_price_adj_close
and only keep hold of the stock_symbol, as that is the identifier I'm counting. So far my attempts have been to turn the entire thing into an array and collect only every 9th index from the first one into a collected_cells var. The issue is, based on my calculations and real-life results, that code would take 335 days to run (no joke).
Here's my current code for reference:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SparkNum {
  def main(args: Array[String]) {
    // Do some Scala voodoo
    val sc = new SparkContext(new SparkConf().setAppName("Spark Numerical"))

    // Set input file as per HDFS structure + input args
    val lines = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0))
    val fields = lines.map(line => line.split(","))

    var collected_cells: Array[String] = new Array[String](0)
    //println("[MESSAGE] Length of CC: " + collected_cells.length)

    val divider: Long = 9
    val array_length = fields.count / divider
    val casted_length = array_length.toInt
    val indexedFields = fields.zipWithIndex
    val indexKey = indexedFields.map { case (k, v) => (v, k) }

    println("[MESSAGE] Number of lines: " + array_length)
    println("[MESSAGE] Casted length of: " + casted_length)

    for (i <- 1 to casted_length) {
      println("[URGENT DEBUG] Processing line " + i + " of " + casted_length)
      var index = 9 * i - 8
      println("[URGENT DEBUG] Index defined to be " + index)
      collected_cells :+ indexKey.lookup(index)
    }

    println("[MESSAGE] collected_cells size: " + collected_cells.length)

    val single_cells = collected_cells.flatMap(collected_cells => collected_cells)
    val counted_cells = single_cells.map(cell => (cell, 1)).reduceByKey { case (x, y) => x + y }

    // val result = counted_cells.reduceByKey((a,b) => (a+b))
    // val inmem = counted_cells.persist()
    //
    // // Collect driver into file to be put into user archive
    // inmem.saveAsTextFile("path to server location")
    // ==> Not necessary to save the result as processing time is recorded, not output
  }
}
The bottom part is currently commented out as I was trying to debug it, but it acts as pseudocode for what I need done. I may want to point out that I am barely familiar with Scala at all, and hence things like the _ notation confuse the life out of me.
Thanks for your time.
There are some concepts that need clarification in the question:
When we execute this code:
val lines = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0))
val fields = lines.map(line => line.split(","))
That does not result in a huge array of the size of the data. That expression represents a transformation of the base data. It can be further transformed until we reduce the data to the information set we desire.
In this case, we want the stock_symbol field of a record encoded as a CSV line:
exchange, stock_symbol, date, stock_price_open, stock_price_high, stock_price_low, stock_price_close, stock_volume, stock_price_adj_close
I'm also going to assume that the data file contains a banner like this:
*---------*
| NASDAQ: |
*---------*
The first thing we're going to do is remove anything that looks like this banner. In fact, I'm going to assume that the first field is the name of a stock exchange, which starts with a letter. We will do this before we do any splitting, resulting in:
val lines = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0))
val validLines = lines.filter(line => !line.isEmpty && line.head.isLetter)
val fields = validLines.map(line => line.split(","))
It helps to write out the types of the variables, to have peace of mind that we have the data types we expect. As we progress in our Scala skills that might become less important. Let's rewrite the expression above with types (RDD comes from org.apache.spark.rdd.RDD):
val lines: RDD[String] = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0))
val validLines: RDD[String] = lines.filter(line => !line.isEmpty && line.head.isLetter)
val fields: RDD[Array[String]] = validLines.map(line => line.split(","))
We are interested in the stock_symbol field, which positionally is the element #1 in a 0-based array:
val stockSymbols:RDD[String] = fields.map(record => record(1))
If we want to count the symbols, all that's left is to issue a count:
val totalSymbolCount = stockSymbols.count()
That's not very helpful because we have one entry for every record. Slightly more interesting questions would be:
How many different stock symbols do we have?
val uniqueStockSymbols = stockSymbols.distinct.count()
How many records for each symbol do we have?
val countBySymbol = stockSymbols.map(s => (s,1)).reduceByKey(_+_)
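To inspect that result on the driver, here is a small sketch (collecting is only safe because the set of distinct symbols is small):
val top10 = countBySymbol.collect().sortBy(-_._2).take(10)
top10.foreach { case (symbol, n) => println(s"$symbol: $n") }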
In Spark 2.0, CSV support for DataFrames and Datasets is available out of the box.
Given that our data does not have a header row with the field names (as is usual in large datasets), we need to provide a column name for each of the nine fields of the record:
val stockDF = sparkSession.read.csv("/tmp/quotes_clean.csv").toDF("exchange", "symbol", "date", "open", "high", "low", "close", "volume", "adj_close")
We can answer our questions very easily now (the $ syntax needs import sparkSession.implicits._, and count comes from org.apache.spark.sql.functions):
val uniqueSymbols = stockDF.select("symbol").distinct().count
val recordsPerSymbol = stockDF.groupBy($"symbol").agg(count($"symbol"))
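For a quick look at the aggregate, show (a standard DataFrame action) prints a tabular sample to stdout:
recordsPerSymbol.show(10) // prints the first 10 (symbol, count(symbol)) rows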
I'm currently trying to filter a large database using Scala. I've written a simple piece of code to match an ID in one database to a list of IDs in another.
Essentially I want to go through database A and, if the ID in the ID column matches one from database B, extract that entry from database A.
The code I've written works fine, but it's slow (it has to run over a couple of days), and I'm trying to find a way to speed it up. It may be that it can't be sped up by much, or it may be that it can be much, much faster with better code.
So any help would be much appreciated.
Below is a description of the databases and a copy of the code.
Database A is approximately 10 GB in size with over 100 million entries, and database B has a list of approximately 50,000 IDs.
Each database looks like as follows:
Database A:
ID, DataX, date
10, 100,01012000
15, 20, 01012008
5, 32, 01012006
etc...
Database B:
ID
10
15
12
etc...
My code is as follows:
import scala.io.Source
import java.io._

object filter extends App {
  def ext[T <: Closeable, R](resource: T)(block: T => R): R = {
    try { block(resource) }
    finally { resource.close() }
  }

  val key = io.Source.fromFile("C:\\~Database_B.csv").getLines()
  val key2 = new Array[String](50000)
  key.copyToArray(key2)

  ext(new BufferedWriter(new OutputStreamWriter(new FileOutputStream("C:\\~Output.csv")))) { writer =>
    val line = io.Source.fromFile("C:\\~Database_A.csv").getLines.drop(1)
    while (line.hasNext) {
      val data = line.next
      val array = data.split(",").map(_.trim)
      val idA = array(0)
      val dataX = array(1)
      val date = array(2)
      key2.map { idB =>
        if (idA == idB) {
          val print = (idA + "," + dataX + "," + date)
          writer.write(print)
          writer.newLine()
        } else None
      }
    }
  }
}
First, there are far more efficient ways to do this than writing a Scala program. Loading the two tables into a database and doing a join will take about 10 minutes (including data loading) on a modern computer.
Assuming you have to use Scala, there is an obvious improvement: store your keys as a HashSet and use keys.contains(x) instead of traversing all keys. This gives you O(1) lookup instead of the O(N) you have now, which should speed up your program significantly.
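A minimal sketch of that change, reusing the file from the question:
val keys: Set[String] = io.Source.fromFile("C:\\~Database_B.csv").getLines().toSet
// later, inside the loop over database A:
if (keys.contains(idA)) { /* write the matching row */ }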
Minor point -- use string interpolation instead of concatenation, i.e.
s"$idA,$dataX,$date"
// instead of
idA + "," + dataX + "," + date
Try this:
import scala.io.Source
import java.io._

object filter extends App {
  def ext[T <: Closeable, R](resource: T)(block: T => R): R = {
    try { block(resource) }
    finally { resource.close() }
  }

  // convert to a Set
  val key2 = io.Source.fromFile("C:\\~Database_B.csv").getLines().toSet

  ext(new BufferedWriter(new OutputStreamWriter(new FileOutputStream("C:\\~Output.csv")))) { writer =>
    val lines = io.Source.fromFile("C:\\~Database_A.csv").getLines.drop(1)
    for (data <- lines) {
      val array = data.split(",").map(_.trim)
      array match {
        case Array(idA, dataX, date) =>
          if (key2.contains(idA)) {
            val print = (idA + "," + dataX + "," + date)
            writer.write(print)
            writer.newLine()
          }
        case _ => // invalid input
      }
    }
  }
}
The IDs are now stored in a Set, which gives O(1) lookups and much better performance.