I have a string like this: "Men's Beech River Cable T-Shirt". How can I get the category from this string?
str1 = "Men's Beech River Cable T-Shirt"
str2 = "MEN'S GOOSE EYE MOUNTAIN DOWN VEST"
cat1 = str1.split.last # T-Shirt
cat2 = str2.split.last # VEST
TOPS = %w(jacket vest coat blazer parka sweater shirt polo t-shirt)
Desired result:
category_str1 = "Tops" # Since T-Shirt (shirt) is in TOPS constant.
category_str2 = "Tops" # Since vest is in TOPS const.
I don't know how to describe my problem better, I hope you understand it from example provided.
str = "Men's Beech River Cable T-Shirt"
cat_orig = str.split.last # T-Shirt
TOPS = %w(jacket vest coat blazer parka sweater shirt polo)
RE_TOPS = Regexp.union(TOPS)
category = "Tops" if RE_TOPS =~ cat_orig.downcase
Note there are no commas in the %w() style array syntax.
str = "Men's Beech River Cable T-Shirt"
cat_orig = str.split.last # T-Shirt
TOPS = %w(jacket vest coat blazer parka sweater shirt polo) # suppressed the comma to get a clean array
category = "Tops" if !cat_orig[/(#{TOPS.join("|")})/i].nil?
The join on the TOPS array builds an alternation regex of the form:
(jacket|vest|coat|blazer|parka|sweater|shirt|polo)
If any of those words is present in cat_orig, the return value is the matched word; if not, it is nil.
Note the trailing i after the regex, which makes it case-insensitive.
The best way to do this is through a hash, not an array. Let's say your categories look something like this:
categories = { "TOPS" => ["shirt", "coat", "blazer"],
"COOKING" => ["knife", "fork", "pan"] }
We can then loop through each category and find if their values include the word in the string
categories.each do |key, value|
puts key if str.downcase.split(' ').any? { |word| categories[key].include?(word) }
end
Loop through each category, and find if the category has a word that the string has.
Note: This does not yet search for substrings.
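To cover the substring case mentioned in the note (so that "T-Shirt" matches the keyword "shirt"), a minimal sketch could look like the following. The CATEGORIES constant and the category_for helper are hypothetical names introduced for illustration, and the sketch assumes, as in the question, that the category word is the last word of the title.

```ruby
# Hypothetical category table; the keywords come from the examples above.
CATEGORIES = {
  "Tops"    => %w(jacket vest coat blazer parka sweater shirt polo),
  "Cooking" => %w(knife fork pan)
}.freeze

# Returns the first category whose keyword occurs anywhere in the last
# word of the title (case-insensitive), or nil when nothing matches.
def category_for(title)
  last_word = title.split.last.to_s.downcase
  found = CATEGORIES.find do |_name, keywords|
    keywords.any? { |keyword| last_word.include?(keyword) }
  end
  found && found.first
end

puts category_for("Men's Beech River Cable T-Shirt")     # Tops
puts category_for("MEN'S GOOSE EYE MOUNTAIN DOWN VEST")  # Tops
```

Because String#include? is a plain substring test, a short keyword such as "pan" would also match a word like "pants", so the keyword lists may need tuning for real data.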
I'm trying to figure out how to extract records from a file that contains only one occurrence of a trainer and only one occurrence of a jockey.
Essentially, the record would imply that the jockey has only one ride for the day and it is for trainer X who has only one runner for the day.
Here are some "sample data":
ALLAN DENHAM,MUSWELLBROOK,RACE 5,MOPITTS (10),JEFF PENZA,B,5
ALLAN KEHOE,MUSWELLBROOK,RACE 3,FOXY FIVE (5),KOBY JENNINGS,C,3
ALLAN KEHOE,MUSWELLBROOK,RACE 4,BANGALLEY LAD (3),KOBY JENNINGS,BBB,4
ANDREW ROBINSON,MUSWELLBROOK,RACE 6,TROPHIES GALORE (4),DARRYL MCLELLAN,AAA,6
BEN HILL,MUSWELLBROOK,RACE 4,WHALER BILL (10),GRANT BUCKLEY,BB,4
BEN HILL,MUSWELLBROOK,RACE 5,MR BILL (5),GRANT BUCKLEY,BB,4
BJORN BAKER,MUSWELLBROOK,RACE 3,MISS JAY FOX (9),ALYSHA COLLETT,BB,3
BRETT CAVANOUGH,MUSWELLBROOK,RACE 3,OFFICE AFFAIR (10),RACHAEL MURRAY,B,3
BRETT THOMPSON,MUSWELLBROOK,RACE 7,COSTAS (2),RONALD SIMPSON,BB,7
CODY MORGAN,MUSWELLBROOK,RACE 6,BAJAN GOLD (5),JEFF PENZA,BB,6
CODY MORGAN,MUSWELLBROOK,RACE 7,RAPID EAGLE (9),DARRYL MCLELLAN,B,7
In the sample data, the first record that would meet my criteria would be the following:
BJORN BAKER,MUSWELLBROOK,RACE 3,MISS JAY FOX (9),ALYSHA COLLETT,BB,3
Note: BJORN BAKER only appears once and ALYSHA COLLETT only appears once.
In the sample data, trainer ALLAN DENHAM has only one runner for the day but jockey JEFF PENZA has 2 rides, one for trainer ALLAN DENHAM and one for trainer CODY MORGAN, so this does not meet my criteria.
Another record that would meet my criteria would be the following record:
BRETT CAVANOUGH,MUSWELLBROOK,RACE 3,OFFICE AFFAIR (10),RACHAEL MURRAY,B,3
Note: BRETT CAVANOUGH only appears once and RACHAEL MURRAY only appears once.
BRETT THOMPSON,MUSWELLBROOK,RACE 7,COSTAS (2),RONALD SIMPSON,BB,7
Note: BRETT THOMPSON only appears once and RONALD SIMPSON only appears once.
And so on...
I've loaded the "sample data" (top of page) into an array in Perl and have investigated how to use hash, etc. in order to extract the unique records but I cannot figure out how to extract the required records based on the uniqueness of the combination of the two elements (i.e. one trainer + the one corresponding jockey)
use Data::Dumper;
$infile = "TRAINER-JOCKEY-SAMPLE.txt";
open my $infile, "<:encoding(utf8)", $infile or die "$infile: $!";
my @recs = <$infile>;
close $infile;
my %uniques;
for my $rec (@recs)
{
my ($trainer, $racecourse, $racenum, $hnameandnum, $jockey, $TDRating, $rnum) = split(",", $rec);
++$uniques{$trainer}{$jockey};
}
print Dumper(\%uniques);
for my $trainer (sort keys %uniques)
{
my $answer = join ',', sort keys %{ $uniques{$trainer} };
print "$trainer has unique values $answer\n";
}
Note: I need to print the entire record when successful (see below):
BJORN BAKER,MUSWELLBROOK,RACE 3,MISS JAY FOX (9),ALYSHA COLLETT,BB,3
Your help would be greatly appreciated.
Both the trainer and the jockey have to appear just once in the list (unless the input has duplicate lines).
So, let's just count the occurrences of trainers. To be able to match them to jockeys, we'll store the mapping from jockeys to trainers in a hash of hashes.
Once we build the two structures, just select the jockeys with only one associated trainer and check that the trainer appeared just once, which had to be with the jockey they were associated to.
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
my (%jockeys, %trainers);
while (<>) {
my ($trainer, $jockey) = (split /,/)[0, 4]; # field 0 is the trainer, field 4 the jockey
++$trainers{$trainer};
undef $jockeys{$jockey}{$trainer};
}
for my $jockey (keys %jockeys) {
next if 1 < keys %{ $jockeys{$jockey} };
my $trainer = (keys %{ $jockeys{$jockey} })[0];
say "$jockey,$trainer" if 1 == $trainers{$trainer};
}
Update: To print the whole lines, we need to store them somewhere, too. We can slightly modify the program by remembering the whole lines in another hash; we can use either the trainer or the jockey as the key.
#!/usr/bin/perl
use warnings;
use strict;
my (%jockeys, %trainers, %full);
while (<>) {
my ($trainer, $jockey) = (split /,/)[0, 4]; # field 0 is the trainer, field 4 the jockey
++$trainers{$trainer};
undef $jockeys{$jockey}{$trainer};
$full{$jockey} = $_;
}
for my $jockey (keys %jockeys) {
next if 1 < keys %{ $jockeys{$jockey} };
my $trainer = (keys %{ $jockeys{$jockey} })[0];
print $full{$jockey} if 1 == $trainers{$trainer};
}
I have a class People with three properties
class People
attr_accessor :first_name, :last_name, :age
end
And I have two arrays:
a = [p1, p2]
b = [p3, p4]
Is there any easy way to combine these two arrays in a new array and remove the item with a condition like:
p1.first_name + p1.last_name == p3.first_name + p3.last_name
After that, all the remaining items should belong to array a.
For example
p1.first_name = "Ada"
p1.last_name = "Wang"
p1.age = 28
p2.first_name = "Leon"
p2.last_name = "S"
p2.age = 28
p3.first_name = "Ada"
p3.last_name = "Wang"
p3.age = 18
p4.first_name = "Mario"
p4.last_name = "M"
p4.age = 80
The result should be [p1], the 28-year-old Ada Wang.
I'm not sure I get your point, but maybe this is a possible option.
c = a + b
c.uniq! { |e| [e.first_name, e.last_name] } # compare on both names; && would compare last_name only
Call Array#uniq! with a block on c which is the concatenation of a and b.
If arrays a and b themselves do not contain people with matching first and last names then this would work:
b.each do |p|
# add p only if nobody in a already has the same first and last name
unless a.any? { |q| q.first_name == p.first_name && q.last_name == p.last_name }
a.push(p)
end
end
To check for other fields simply replace first_name / last_name with other variables.
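Since the question can be read two ways, here is a small sketch of both readings using the sample people from the question. The Struct is only a stand-in for the People class, and the names merged and shared are made up for illustration.

```ruby
# Hypothetical stand-in for the People class from the question.
People = Struct.new(:first_name, :last_name, :age)

p1 = People.new("Ada", "Wang", 28)
p2 = People.new("Leon", "S", 28)
p3 = People.new("Ada", "Wang", 18)
p4 = People.new("Mario", "M", 80)

a = [p1, p2]
b = [p3, p4]

# Reading 1: merge both arrays, keyed by full name. uniq keeps the first
# occurrence, so entries from a win over same-named entries from b.
merged = (a + b).uniq { |person| [person.first_name, person.last_name] }

# Reading 2: keep only the people from a whose full name also appears in b.
names_in_b = b.map { |person| [person.first_name, person.last_name] }
shared = a.select { |person| names_in_b.include?([person.first_name, person.last_name]) }

puts merged.map(&:first_name).inspect  # ["Ada", "Leon", "Mario"]
puts shared.map(&:age).inspect         # [28]
```

The second reading reproduces the [p1] result given in the question.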
I'm very new to Scala on Spark and wondering how you might create key value pairs, with the key having more than one element. For example, I have this dataset for baby names:
Year, Name, County, Number
2000, JOHN, KINGS, 50
2000, BOB, KINGS, 40
2000, MARY, NASSAU, 60
2001, JOHN, KINGS, 14
2001, JANE, KINGS, 30
2001, BOB, NASSAU, 45
And I want to find the most frequently occurring name for each county, regardless of the year. How might I go about doing that?
I did accomplish this using a loop. Refer to below. But I'm wondering if there is shorter way to do this that utilizes Spark and Scala duality. (i.e. can I decrease computation time?)
val names = sc.textFile("names.csv").map(l => l.split(","))
val uniqueCounty = names.map(x => x(2)).distinct.collect
for (i <- 0 to uniqueCounty.length-1) {
val county = uniqueCounty(i).toString;
val eachCounty = names.filter(x => x(2) == county).map(l => (l(1), l(3).trim.toInt)).reduceByKey((a,b) => a + b).sortBy(-_._2); // Number is column index 3; convert it so the values are summed, not concatenated
println("County:" + county + eachCounty.first)
}
Here is the solution using RDD. I am assuming you need top occurring name per county.
val data = Array((2000, "JOHN", "KINGS", 50),(2000, "BOB", "KINGS", 40),(2000, "MARY", "NASSAU", 60),(2001, "JOHN", "KINGS", 14),(2001, "JANE", "KINGS", 30),(2001, "BOB", "NASSAU", 45))
val rdd = sc.parallelize(data)
//Reduce the uniq values for county/name as combo key
val uniqNamePerCountyRdd = rdd.map(x => ((x._3,x._2),x._4)).reduceByKey(_+_)
// Group names per county.
val countyNameRdd = uniqNamePerCountyRdd.map(x=>(x._1._1,(x._1._2,x._2))).groupByKey()
// Sort descending by total and take the top name alone per county
countyNameRdd.mapValues(x => x.toList.sortBy(-_._2).take(1)).collect
Output:
res8: Array[(String, List[(String, Int)])] = Array((KINGS,List((JOHN,64))), (NASSAU,List((MARY,60))))
You could use the spark-csv and the Dataframe API. If you are using the new version of Spark (2.0) it is slightly different. Spark 2.0 has a native csv data source based on spark-csv.
Use spark-csv to load your csv file into a Dataframe.
val df = sqlContext.read.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.load(new File(getClass.getResource("/names.csv").getFile).getAbsolutePath)
df.show
Gives output:
+----+----+------+------+
|Year|Name|County|Number|
+----+----+------+------+
|2000|JOHN| KINGS| 50|
|2000| BOB| KINGS| 40|
|2000|MARY|NASSAU| 60|
|2001|JOHN| KINGS| 14|
|2001|JANE| KINGS| 30|
|2001| BOB|NASSAU| 45|
+----+----+------+------+
DataFrames provide a set of operations for structured data manipulation. You can use some basic operations to get your result.
import org.apache.spark.sql.functions._
df.select("County","Number").groupBy("County").agg(max("Number")).show
Gives output:
+------+-----------+
|County|max(Number)|
+------+-----------+
|NASSAU| 60|
| KINGS| 50|
+------+-----------+
Is this what you are trying to achieve?
Notice the import org.apache.spark.sql.functions._, which is needed for the max() function passed to agg().
More information about Dataframes API
EDIT
For correct output:
df.registerTempTable("names")
//there is probably a better query for this
sqlContext.sql("SELECT * FROM (SELECT Name, County,count(1) as Occurrence FROM names GROUP BY Name, County ORDER BY " +
"count(1) DESC) n").groupBy("County", "Name").max("Occurrence").limit(2).show
Gives output:
+------+----+---------------+
|County|Name|max(Occurrence)|
+------+----+---------------+
| KINGS|JOHN| 2|
|NASSAU|MARY| 1|
+------+----+---------------+
I'm writing a little program to generate some bogus top-ten sales numbers for book sales. I'm trying to do this in as compact a fashion as possible and do it without using MySQL or another DB.
I have written out what I want to happen. I've created a bogus catalog array and a bogus sales array corresponding sales to the index of the catalog entries. That part all works great.
I want to create a third array that includes all the titles from the catalog array with the sales numbers from the sales array, like a join in a DB, but without any DB. I can't figure out how to do that part of it though. I think once I have it in there I can sort it the way I want, but making that third array is killing me. I cannot figure out what I'm doing wrong or how to do it right.
So given the following code:
require 'random_word'
class BestOnline
def initialize
@catalog = Array.new
@sales = Array.new
@topten = Array.new
inventory = rand(50) + 10
days = rand(1..50)
now = Time.now
yesterday = now - 86400
saleshistory = now - (days * 86400)
(1..inventory).each do
@catalog << {
:title => "#{RandomWord.adjs.next.capitalize} #{RandomWord.nouns.next.capitalize}",
:price => rand(5.99..29.99).round(2)}
end
(0..days).each do
@sales << {
:id => rand(0...@catalog.count), # exclusive range, so the id is always a valid catalog index
:salescount => rand(0..24),
:date => rand(saleshistory..now) }
end
end
def bestsellers
@sales.each do
# THIS DOESNT WORK AND I'M STUCK AS HOW TO FIX IT.
# @topten << {
# :title => @catalog[:id],
# :salescount => @sales[:salescount]
# }
end
puts @topten.group_by{ |tt| tt[:salescount]}.sort_by{ |k,v| -k}.first(10)
end
end
BestOnline.new.bestsellers
How can I create a third array that contains the titles and number of sales and output the result of the top-ten books sold?
Try this out:
def bestsellers
@sales.each do |sale|
@topten << {
title: @catalog[sale[:id]][:title],
salescount: sale[:salescount] }
end
@topten.sort! { |x, y| y[:salescount] <=> x[:salescount] }
puts @topten.first(10)
end
I suggest you write:
def bestsellers(sales)
sales.max_by(10) { |h| h[:salescount] }
end
puts bestsellers(sales)
Enumerable#max_by was permitted to have an argument in Ruby v2.2.
There are several problems with the way you've structured your code. Now that you have running code (by incorporating @fbonds66's answer), I suggest you post it at SO's sister-site Code Review. The purpose of CR is to suggest improvements to working code. If you read through some of the questions and answers there I think you will be impressed.
I was doing the dereferencing wrong trying to build the 3rd array of the 1st two:
@sales.each do |sale|
@topten << {
:title => @catalog[sale[:id]][:title],
:salescount => sale[:salescount]
}
end
I needed to work on the hash returned from .each as |sale| and use correct syntax to get what I was after from the other arrays.
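If the same book can show up in several sales entries, the counts probably need to be summed per book before ranking. The sketch below illustrates that idea with small, fixed, made-up data in place of the randomly generated arrays; the local names catalog, sales, totals and topten are assumptions for illustration only.

```ruby
# Hypothetical fixed data standing in for the randomly generated arrays.
catalog = [
  { title: "Quiet Harbor", price: 9.99 },
  { title: "Amber Field",  price: 14.50 },
  { title: "Hollow Crown", price: 7.25 }
]
sales = [
  { id: 0, salescount: 5 },
  { id: 2, salescount: 12 },
  { id: 0, salescount: 9 },
  { id: 1, salescount: 3 }
]

# Sum the sales per catalog index, so repeated sales of one book add up.
totals = Hash.new(0)
sales.each { |sale| totals[sale[:id]] += sale[:salescount] }

# "Join" each total with its catalog entry, then keep the ten largest.
topten = totals
  .map { |id, count| { title: catalog[id][:title], salescount: count } }
  .max_by(10) { |row| row[:salescount] }

topten.each { |row| puts "#{row[:title]}: #{row[:salescount]}" }
```

Enumerable#max_by with a count returns the rows already sorted from highest to lowest, so no separate sort step is needed here.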
I know this is probably dead simple, but I've got some data such as this in one file:
Artichoke
Green Globe, Imperial Star, Violetto
24" deep
Beans, Lima
Bush Baby, Bush Lima, Fordhook, Fordhook 242
12" wide x 8-10" deep
that I'd like to be able to format into a nice TSV type of table, to look something like this:
Name | Varieties | Container Data
----------|------------- |-------
some data here nicely padded with even spacing and right aligned text
Try String#rjust(width):
"hello".rjust(20) #=> "               hello"
I wrote a gem to do exactly this: http://tableprintgem.com
No one has mentioned the "coolest" / most compact way -- using the % operator -- for example: "%10s %10s" % [1, 2]. Here is some code:
xs = [
["This code", "is", "indeed"],
["very", "compact", "and"],
["I hope you will", "find", "it helpful!"],
]
m = Array.new(xs.first.size, 0) # one width per column, starting at 0
xs.each { |_| _.each_with_index { |e, i| s = e.size; m[i] = s if s > m[i] } }
xs.each { |x| puts m.map { |_| "%#{_}s" }.join(" " * 5) % x }
Gives:
      This code          is          indeed
           very     compact             and
I hope you will        find     it helpful!
Here is the code made more readable:
max_lengths = Array.new(xs.first.size, 0) # one width per column, starting at 0
xs.each do |x|
x.each_with_index do |e, i|
s = e.size
max_lengths[i] = s if s > max_lengths[i]
end
end
xs.each do |x|
format = max_lengths.map { |_| "%#{_}s" }.join(" " * 5)
puts format % x
end
This is a reasonably full example that assumes the following
Your list of products is contained in a file called veg.txt
Your data is arranged across three lines per record with the fields on consecutive lines
I am a bit of a noob to rails so there are undoubtedly better and more elegant ways to do this
#!/usr/bin/ruby
class Vegetable
@@max_name ||= 0
@@max_variety ||= 0
@@max_container ||= 0
attr_reader :name, :variety, :container
def initialize(name, variety, container)
@name = name
@variety = variety
@container = container
@@max_name = set_max(@name.length, @@max_name)
@@max_variety = set_max(@variety.length, @@max_variety)
@@max_container = set_max(@container.length, @@max_container)
end
def set_max(current, max)
current > max ? current : max
end
def self.max_name
@@max_name
end
def self.max_variety
@@max_variety
end
def self.max_container()
@@max_container
end
end
products = []
File.open("veg.txt") do | file|
while name = file.gets
name = name.strip
variety = file.gets.to_s.strip
container = file.gets.to_s.strip
veg = Vegetable.new(name, variety, container)
products << veg
end
end
format="%#{Vegetable.max_name}s\t%#{Vegetable.max_variety}s\t%#{Vegetable.max_container}s\n"
printf(format, "Name", "Variety", "Container")
printf(format, "----", "-------", "---------")
products.each do |p|
printf(format, p.name, p.variety, p.container)
end
The following sample file
Artichoke
Green Globe, Imperial Star, Violetto
24" deep
Beans, Lima
Bush Baby, Bush Lima, Fordhook, Fordhook 242
12" wide x 8-10" deep
Potatoes
King Edward, Desiree, Jersey Royal
36" wide x 8-10" deep
Produced the following output
Name Variety Container
---- ------- ---------
Artichoke Green Globe, Imperial Star, Violetto 24" deep
Beans, Lima Bush Baby, Bush Lima, Fordhook, Fordhook 242 12" wide x 8-10" deep
Potatoes King Edward, Desiree, Jersey Royal 36" wide x 8-10" deep
another gem: https://github.com/visionmedia/terminal-table
Terminal Table is a fast and simple, yet feature rich ASCII table generator written in Ruby.
I have a little function to print a 2D array as a table. Each row must have the same number of columns for this to work. It's also easy to tweak to your needs.
def print_table(table)
# Calculate widths
widths = []
table.each{|line|
c = 0
line.each{|col|
widths[c] = (widths[c] && widths[c] > col.length) ? widths[c] : col.length
c += 1
}
}
# Indent the last column left.
last = widths.pop()
format = widths.collect{|n| "%#{n}s"}.join(" ")
format += " %-#{last}s\n"
# Print each line.
table.each{|line|
printf format, *line
}
end
Kernel.sprintf should get you started.
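As a starting point, here is a minimal sketch using Kernel#format (an alias of sprintf) on the sample records from the question. It assumes the three-line records have already been read into arrays of strings; the names rows, table, widths and row_format are made up for illustration.

```ruby
# Sample records copied from the question: name, varieties, container data.
rows = [
  ["Artichoke", "Green Globe, Imperial Star, Violetto", "24\" deep"],
  ["Beans, Lima", "Bush Baby, Bush Lima, Fordhook, Fordhook 242", "12\" wide x 8-10\" deep"]
]
table = [["Name", "Varieties", "Container Data"]] + rows

# Each column is as wide as its longest cell, header included.
widths = table.transpose.map { |column| column.map(&:length).max }

# Build one format string with a right-aligned "%<width>s" per column.
row_format = widths.map { |w| "%#{w}s" }.join(" | ")

lines = table.map { |row| format(row_format, *row) }
puts lines
```

Changing "%#{w}s" to "%-#{w}s" would left-align the cells instead.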