Clearing up confusion with Maps/Collections (Groovy) - arrays

I define a collection that is supposed to map two parts of a line in a tab-separated text file:
def fileMatches = [:].withDefault{[]}
new File('C:\\BRUCE\\ForensicAll.txt').eachLine { line ->
def (source, matches) = line.split (/\t/)[0, 2]
fileMatches[source] << (matches as int)}
I end up with entries such as filename:[984, 984] and I want [filename : 984] . I don't understand how fileMatches[source] << (matches as int) works. How might I get the kind of collection I need?

I'm not quite sure I understand what you are trying to do. How, for example, would you handle a line where the two values are different instead of the same (which seems to be what is implied by your code)? A map requires unique keys, so you can't use filename as a key if it has multiple values.
That said, you could get the result you want with the data implied by your result using:
def fileMatches = [:]
new File('C:\\BRUCE\\ForensicAll.txt').eachLine { line ->
def (source, matches) = line.split(/\t/)[0,2]
fileMatches[source] = (matches as int)
}
But this will clobber the data (i.e., for a given filename you will always end up with the value from its last line in your file). If that's not what you want, you may want to rethink your data structure here.
Alternatively, assuming you want unique values, you could do:
def fileMatches = [:].withDefault { [] as Set }
new File('C:\\BRUCE\\ForensicAll.txt').eachLine { line ->
def (source, matches) = line.split(/\t/)[0,2]
fileMatches[source] << (matches as int)
}
This will result in something like [filename:[984]] for the example data and, e.g., [filename:[984, 987]] for files having those two values in the two columns you are checking.
Again, it really depends on what you are trying to capture. If you could provide more detail on what you are trying to accomplish, your question may become answerable...

Related

How to store a specific element from a csv into a variable?

I have a csv file called Energy.csv and it's very simple:
Building A,150
Building B,160
I would like to import the csv into Ruby and print the second row, second column element (160). This is what I have so far but I don't know how to improve this code.
require 'csv'
class CSVImport
energyusageA = Array.new
energyusageB = Array.new
CSV.foreach('CSV/Energy.csv') do |row|
energyusageA = row[1]
energyusageB = row[2]
end
puts energyusageB[1]
end
To read this in quickly if it's a small file and memory isn't a constraint:
CSV.open('CSV/energy.csv').read[1][1]
Where that pulls the second row's second value as everything's zero-indexed in Ruby.
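As a quick check of the zero-based indexing, here is a minimal sketch that parses the same two rows from an in-memory string instead of a file. The data string is only a stand-in for the contents of Energy.csv.

```ruby
require 'csv'

# Stand-in for the contents of Energy.csv (an assumption for illustration).
data = "Building A,150\nBuilding B,160\n"

rows = CSV.parse(data)   # array of rows, each row an array of strings
value = rows[1][1]       # second row, second column (both zero-indexed)

puts value               # prints 160
```

Note that the value comes back as the string "160"; call to_i on it if you need a number.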
In your code you have it wrapped inside of a class definition, but that doesn't really make sense unless you're defining methods. Yes, you can run executable code in that context, but that's reserved for other situations like metaprogramming.
A Ruby-style design looks like this:
class CSVReader
def initialize(path)
@path = path
end
def value(row: 1, col: 1)
CSV.open(@path).read[row][col]
end
end
Where you can call it like:
CSVReader.new('CSV/energy.csv').value
CSVReader.new('CSV/energy.csv').value(row: 4, col: 2)
And so on.
I think you were quite close.
To collect the second column of every row, you can do the following:
require 'csv'
second_column = Array.new
CSV.foreach('energy.csv') do |row|
second_column << row[1]
end
To get the second row's value, you just need to access the second element (since Ruby is 0-based, its index is 1):
second_column[1] gives you "160".
Edit: I think I need to explain one more thing: the difference between = (assignment) and << (appending).
The = operator assigns the right-hand side to the variable.
The << operator appends the right-hand side to the end of an Array.
You can try it out on the following test:
test = Array.new
test = 'Yo' => this assigns the string to test (*"Yo"* will be stored in the *test* variable, replacing the array)
OR
test << 'Yo' => this appends the string 'Yo' to the empty array, so the *Array* will look like this: *["Yo"]*.
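The difference can be seen by running both forms side by side. This is a small sketch; the second variable name, list, is only there so the two results can be compared.

```ruby
# Assignment rebinds the variable; the original empty array is discarded.
test = Array.new
test = 'Yo'
p test        # prints "Yo"

# Appending keeps the array and adds elements to the end of it.
list = Array.new
list << 'Yo'
list << 'Hey'
p list        # prints ["Yo", "Hey"]
```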

Having trouble putting my header back on my CSV file

Here is my code:
require 'csv'
contents = CSV.read('/Users/namename/Desktop/test.csv')
arr = []
first_row = contents[0]
contents.shift
contents.each do |row|
if row[12].to_s =~ /PO Box/i or row[12].to_s =~ /^[[:digit:]]/
#File.open('out.csv','a').puts('"'+row.join('","')+'"')
arr << row
else
row[12], row[13] = row[13], row[12]
#File.open('out.csv','a').puts('"'+row.join('","')+'"')
arr << row
end
end
arr.unshift(first_row)
arr.each do |row|
File.open('out.csv', 'a').puts('"' + row.join('","') + '"')
end
First I .shift so that my header fields don't catch the pattern (and ultimately swap) in the first conditional of the first .each loop. Then I conditionally swap cell values that match the pattern, and then store the correctly shifted values in an array. After this, I .unshift to attempt to put back my header fields I stored in first_row, but when I view the resulting out.csv file I get all my headers in the middle. Why?
Example data:
https://gist.github.com/anonymous/e1017d3ba81634d9e1227e7fe49536cb
The root of your problem is that you're not using the features provided by the CSV module.
First, CSV.read takes a :headers option that will catch the headers for you so you don't have to worry about them and, as a bonus, lets you access fields by header name instead of numeric index (handy if the CSV fields' order is changed). With the :headers option, CSV.read returns a CSV::Table object, which has another benefit I'll discuss in a moment.
Second, you're generating your own faux-CSV output instead of letting the CSV module do it. This, in particular, is needless and dangerous:
...puts('"' + row.join('","') + '"')
If any of your column values has quotation marks or newlines, which need to be escaped, this will fail, badly. You could use CSV.generate_line(row) instead, but you don't need to if you've used the headers: option above. Like I said, it returns a CSV::Table object, which has a to_csv method, and that method accepts a :force_quotes option. That will quote every field just like you want—and, more importantly, safely.
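To illustrate the quoting problem, here is a minimal sketch that compares the hand-joined line with CSV.generate_line by parsing each one back. The row contents are made up for the example.

```ruby
require 'csv'

# Hypothetical row whose second field contains quotation marks and a comma.
row = ['Acme', 'Suite "B", Floor 2', 'PO Box 12']

naive = '"' + row.join('","') + '"'                  # hand-rolled quoting
safe  = CSV.generate_line(row, force_quotes: true)   # embedded quotes are doubled

# The hand-rolled line either fails to parse or parses into different fields.
naive_ok = begin
  CSV.parse_line(naive) == row
rescue CSV::MalformedCSVError
  false
end

puts naive_ok                      # prints false
puts CSV.parse_line(safe) == row   # prints true
```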
Armed with the above knowledge, the code becomes much saner:
require "csv"
contents = CSV.read('/Users/namename/Desktop/test.csv', headers: true)
contents.each do |row|
next unless row["DetailActiveAddressLine1"] =~ /PO Box|^[[:digit:]]/i
row["DetailActiveAddressLine1"], row["DetailActiveAddressLine2"] =
row["DetailActiveAddressLine2"], row["DetailActiveAddressLine1"]
end
File.open('out.csv', 'w') do |file| # 'w' overwrites; 'a' would append another header block on every rerun
file.write(contents.to_csv(force_quotes: true))
end
If you'd like, you can see a version of the code in action (without file access, of course) on Ideone: http://ideone.com/IkdCpb
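In case the link is unavailable, here is a comparable sketch that runs entirely in memory. The sample data and the shortened header names Line1 and Line2 are assumptions standing in for the real file and its DetailActiveAddressLine1/2 columns; the swap condition mirrors the code above.

```ruby
require 'csv'

# Hypothetical sample data; the header names are shortened stand-ins.
data = <<~CSV_DATA
  Name,Line1,Line2
  Ann,Suite 4,PO Box 9
  Bob,12 Main St,Apt 3
CSV_DATA

table = CSV.parse(data, headers: true)   # returns a CSV::Table

table.each do |row|
  # Same test as above: skip rows whose first address line does not match.
  next unless row["Line1"] =~ /PO Box|^[[:digit:]]/i
  row["Line1"], row["Line2"] = row["Line2"], row["Line1"]
end

out = table.to_csv(force_quotes: true)   # header row is emitted first
puts out
```

Because the headers live in the table rather than in the data rows, there is nothing to shift off and put back.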

How can I write strings to an h5 in matlab?

I've managed to answer my own question. This code will write cell arrays of any shape containing strings. The datasets can be modified/overwritten by simply calling again with a different input.
https://www.mathworks.com/matlabcentral/fileexchange/24091-hdf5-read-write-cellstr-example
%Okay, Matlab's h5write(filename, dataset, data) function doesn't work for
%strings. It hasn't worked with strings for years. The forum post that
%comes up first in Google about it is from 2009. Yeah. This is terrible,
%and evidently it's not getting fixed. So, low level functions. Fun fun.
%
%What I've done here is adapt examples, one from the hdf group's website
%https://support.hdfgroup.org/HDF5/examples/api18-m.html called
%"Read / Write String Datatype (Dataset)", the other by Jason Kaeding.
%
%I added functionality to check whether the file exists and either create
%it anew or open it accordingly. I wanted to be able to likewise check the
%existence of a dataset, but it looks like this functionality doesn't exist
%in the API, so I'm doing a try-catch to achieve the same end. Note that it
%appears you can't just create a dataset or group deep in a hierarchy: You
%have to create each level. Since I wanted to accept dataset names in the
%same format as h5read(), in the event the dataset doesn't exist, I loop
%over the parts of the dataset's path and try to create all levels. If they
%already exist, then this action throws errors too; hence a second
%try-catch.
%
%I've made it more advanced than h5create()/h5write() in that it all
%happens in one call and can accept data inputs of variable size. I take
%care of updating the dataset's extent to accommodate changing data array
%sizes. This is important for applications like adding a new timestamp
%every time the file is modified.
%
%@author Pavel Komarov pavel@gatech.edu 941-545-7573
function h5createwritestr(filename, dataset, str)
%"The class of input data must be cellstring instead of char when the
%HDF5 class is VARIABLE LENGTH H5T_STRING.", but also I don't want to
%force the user to put braces around single strings, so this.
if ischar(str)
str = {str};
end
%check whether the specified .h5 exists and either create or open
%accordingly
if ~exist(filename, 'file')
file = H5F.create(filename, 'H5F_ACC_TRUNC', 'H5P_DEFAULT', 'H5P_DEFAULT');
else
file = H5F.open(filename, 'H5F_ACC_RDWR', 'H5P_DEFAULT');
end
%set variable length string type
vlstr_type = H5T.copy('H5T_C_S1');
H5T.set_size(vlstr_type,'H5T_VARIABLE');
% There is no way to check whether a dataset exists, so just try to
% open it, and if that fails, create it.
try
dset = H5D.open(file, dataset);
H5D.set_extent(dset, fliplr(size(str)));
catch
%create the intermediate groups one at a time because evidently the
%API's functions aren't smart enough to be able to do this themselves.
slashes = strfind(dataset, '/');
for i = 2:length(slashes)
url = dataset(1:(slashes(i)-1));%pull out the url of the next level
try
H5G.create(file, url, 1024);%1024 "specifies the number of
catch %bytes to reserve for the names that will appear in the group"
end
end
%create a dataspace for cellstr
H5S_UNLIMITED = H5ML.get_constant_value('H5S_UNLIMITED');
spacerank = max(1, sum(size(str) > 1));
dspace = H5S.create_simple(spacerank, fliplr(size(str)), ones(1, spacerank)*H5S_UNLIMITED);
%create a dataset plist for chunking. (A dataset can't be unlimited
%unless the chunk size is defined.)
plist = H5P.create('H5P_DATASET_CREATE');
chunksize = ones(1, spacerank);
chunksize(1) = 2;
H5P.set_chunk(plist, chunksize);% 2 strings per chunk
dset = H5D.create(file, dataset, vlstr_type, dspace, plist);
%close things
H5P.close(plist);
H5S.close(dspace);
end
%write data
H5D.write(dset, vlstr_type, 'H5S_ALL', 'H5S_ALL', 'H5P_DEFAULT', str);
%close file & resources
H5T.close(vlstr_type);
H5D.close(dset);
H5F.close(file);
end
I found a bug! The line that computes spacerank should instead read:
spacerank = length(size(str));
Now it works flawlessly as far as I can tell.

Find pairs with distance in ruby array

I have a big array with a sequence of values.
To check if the values in place x have an influence on the values on place x+distance
I want to find all the pairs
pair = [values[x], values[x+distance]]
The following code works
pairs_with_distance = []
values.each_cons(1+distance) do |sequence|
pairs_with_distance << [sequence[0], sequence[-1]]
end
but it looks complicated, and I wonder if I can make it shorter and clearer.
You can make the code shorter by using map directly:
pairs_with_distance = values.each_cons(1 + distance).map { |seq|
[seq.first, seq.last]
}
I prefer something like the example below, because it has short, readable lines of code, and because it separates the steps -- an approach that allows you to give meaningful names to intermediate calculations (groups in this case). You can probably come up with better names based on the real domain of the application.
values = [11,22,33,44,55,66,77]
distance = 2
groups = values.each_cons(1 + distance)
pairs = groups.map { |seq| [seq.first, seq.last] }
p pairs
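For comparison, here is a sketch of a different technique that avoids building the intermediate windows: line the array up against a copy of itself shifted by distance. It assumes distance is no larger than the array's size.

```ruby
values = [11, 22, 33, 44, 55, 66, 77]
distance = 2

# Pair each element with the one `distance` positions later.
# Array#zip pads with nil past the end, so trim the left side first.
left  = values.first(values.size - distance)
right = values.drop(distance)
pairs = left.zip(right)

p pairs  # prints [[11, 33], [22, 44], [33, 55], [44, 66], [55, 77]]
```

Whether this reads more clearly than each_cons is a matter of taste; both produce the same pairs.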

How to stop a loop in Groovy

From the collection fileMatches, I want to assign the maps with the 10 greatest values to a new collection called topTen. So I try to make a collection:
def fileMatches = [:].withDefault{[]}
new File('C:\\BRUCE\\ForensicAll.txt').eachLine { line ->
def (source, matches) = line.split (/\t/)[0, 2]
fileMatches[source] << (matches as int)
}
I want to iterate through my collection and grab the 10 maps with greatest values. One issue I might be having is that the output of this doesn't look quite like I imagined. One entry for example:
C:\cygwin\home\pro-services\git\projectdb\project\stats\top.h:[984, 984]
The advice so far has been excellent, but I'm not sure if my collection is arranged to take advantage of the suggested solutions (I have filename:[984, 984] when maybe I want [filename, 984] as the map entries in my collection). I don't understand this stuff quite yet (like how fileMatches[source] << (matches as int) works, as it produces the line I posted immediately above, with source:[matches, matches] being the output).
Please advise, and thanks for the help!
Here is another approach, using Groovy's collection methods. It does what you want fairly simply:
def fileMatches = [um: 123, dois: 234, tres: 293, quatro: 920, cinco: 290];
def topThree;
topThree = fileMatches.sort({tup1, tup2 -> tup2.value <=> tup1.value}).take(3);
Result:
[quatro:920, tres:293, cinco:290]
You might find it easier to use some of the built-in collection methods that Groovy provides, e.g.:
fileMatches.sort { a, b -> b.someFilename <=> a.someFilename }[0..9]
or
fileMatches.sort { it.someFilename }[-1..-10]
The range on the end there will cause an error if you have fewer than 10 entries, so it may need some adjusting if that's your case.

Resources