Saving numbers to file and plotting them, in a loop

Given the following loop, I want to save G.temp and S.temp to file and plot them at a later stage.
for (p in seq_along(time(S))) {
  G.temp <- window(G, start = p, end = p + 1)
  S.temp <- window(S, start = p, end = p)
  print(max(as.numeric(G.temp) - as.numeric(S.temp)))
}

You may want to read one of the introductions to R to learn about the data types into which you can write your temporary results. Then use a standard function like write.table to export them.
Your installation of R ships with "An Introduction to R", which also covers basic plotting questions.
For your second question, data import/export is covered in the "R Data Import/Export" manual that also ships with R.

Related

Looping structure in R for mtcars

I have homework and am very new to R; I skipped an introductory course, which I now realise was maybe not a good idea. I just need help with these questions.
Using the "mtcars" dataset in R, answer the following questions. For each question you need to provide the exact R code in RStudio. You should provide a copy of the plots (make a screenshot of the R plot).
a) Create a scatterplot of any chosen variables, with an appropriate title and labels. (You need to add the R code here.)
b) Use a loop structure to show 4 different plots with different colours and symbols. All 4 plots should appear in one window.
No idea where to even begin

What is the best way to load a large folder of files into Julia to compare the columns of each file?

New to Julia and programming in general, so this is a two-part question. Suppose I have a folder with 3,000 CSV files. Each file is roughly 7,000 x 7. (The number of rows may vary from file to file, but the number of columns is constant.) I am trying to read each of these files into a 3000 x N x M tensor or other data structure in Julia to compare the outputs by column. (This would mostly involve comparing the sum of the lags in each column vector of each file.)
Question 1: What is the most efficient data structure for working through this data? I would essentially be calculating the max of the sum of the lags of each column for all files. I've been told by a more experienced user that I should be using NamedArrays for this, and I was wondering if anyone could provide some insight into why. Would DataFrames be able to perform similar calculations?
Question 2: Is there an efficient way to read all these files into named arrays? I can read these files into DataFrames with Glob using the following code.
Folder="/Users/Desktop/Data"
Files=glob("*.csv", Folder)
df=DataFrame.(CSV.File.(Files))
But I don't know how to read them into NamedArrays directly. Any insights would be greatly appreciated, thanks!

Minimize a linear programming system in C

I need to minimize a huge linear programming system where all the related data (objective function, constraints) are stored in memory in arrays and structures, not in LP file format or a CPLEX-specific format.
I saw that there are many solvers, like here and here, but the problem is: how can I minimize the model without loading it from a file in a special format?
I did the same work previously in R and Python by solving the model directly after building it, without having to save it to a special file first and then pass that file to the solver. Here is an example in Python:
from lpsolve55 import *
from lp_maker import *
from lp_solve import *

# Build the model from in-memory data, then solve it and read back the objective
lp = lp_maker(obj_func, constraints, rhs, sense_equality)
solvestat = lpsolve('solve', lp)
obj = lpsolve('get_objective', lp)
I think this is possible to do in C, but I don't know where to find out how.
One option is to use the APIs that commercial solvers like CPLEX and Gurobi provide for C/C++. Essentially, these APIs let you build the model in logical chunks (objective function, constraints, etc.). The APIs do the work of translating the logic of the model to the matrices and vectors that the solver actually needs in order to solve the model.
Another approach is to use a modeling language like AMPL or GAMS. AMPL, for example, also provides a C/C++ API.
Which one you choose probably depends on what solver you plan to use and how often you need to modify your model and/or data programmatically.
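For a rough feel of the same idea, and since the question's own example is Python, here is a minimal sketch using scipy.optimize.linprog rather than the commercial C APIs mentioned above; the coefficient arrays are made-up placeholders, but the point is that the model is built and solved entirely from in-memory arrays, and the C APIs follow the same pattern of passing coefficient arrays to a solve routine.
# Minimal sketch (not the CPLEX/Gurobi C API): solve an LP directly from
# in-memory arrays with SciPy. All numbers below are illustrative placeholders.
import numpy as np
from scipy.optimize import linprog

c = np.array([1.0, 2.0])                    # minimize c @ x
A_ub = np.array([[-1.0, 1.0],               # subject to A_ub @ x <= b_ub
                 [1.0, 2.0]])
b_ub = np.array([1.0, 4.0])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(res.status, res.x, res.fun)           # status 0 means an optimum was found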

Import CSV file into python, then turn it into numpy array, then feed it to sklearn algorithm

Sklearn algorithms require features and a label to learn from.
I have a CSV file which contains some data. This data is actually from a challenge on the HackerEarth website, in which participants need to create a learning algorithm that learns from data on a massive number of individuals from an affiliate network and their ad click performance, and then predicts the future performance of other individuals in the affiliate network, which allows the company to optimize its ad performance.
The features in this data include id, date, siteid, offerid, category, merchant, countrycode, type of browser, type of device, and the number of clicks their ads have gotten.
https://www.hackerearth.com/practice/algorithms/string-algorithm/string-searching/practice-problems/machine-learning/predict-ad-clicks/
So my plan is to use the first 7 pieces of information as my features and ad clicks as the label. Unfortunately, the countrycode, browser and device information is text (e.g. Google Chrome, Desktop), not integers that can be turned into an array.
Q1: Is there a way for sklearn to accept not just numpy arrays but also words as features? Am I supposed to use a vectorizer for this? If so, how would I do it? If not, can I just replace the word data with numbers (Google Chrome replaced by 1, Firefox replaced by 2) and still have it work? (I am using a Naive Bayes algorithm.)
Q2: Would a Naive Bayes algorithm be suitable for this task? Since this competition requires participants to create a program that predicts the probability of individuals in the affiliate network having their ads clicked, I assume Naive Bayes would be best suited.
Training data : https://drive.google.com/open?id=1vWdzm0uadoro3WcpWmJ0SVEebeaSsHvr
Testing data : https://drive.google.com/open?id=1M8gR1ZSpNEyVi5W19y0d_qR6EGUeGBQl
My messy code and horrible attempt at this challenge, which I don't think will be much help:
from sklearn.naive_bayes import GaussianNB
import pandas as pd
import numpy as np

# Numeric array of the whole file; text columns (countrycode, browser, device) become nan
data = np.genfromtxt('smaller.csv', delimiter=',')
dat = pd.read_csv('smaller.csv', delimiter=',')
print(dat['siteid'])

feature = []
label = []
for i in range(1, 17):              # skip the header in row 0
    feature.append(data[i][2:8])    # columns 2-7 as features
    label.append(data[i][9])        # column 9 (number of clicks) as the label

clf = GaussianNB()
clf.fit(feature, label)
print(clf.predict([data[18][2:8]]))
print(data[18])
Answer for Question 1: No. Sklearn only works with numerical data, so you need to convert your text to numbers.
Now, to convert text to numbers you can follow multiple approaches. The first is, as you said, to simply assign numbers to the categories. But you need to take into account whether the text data really has an order matching the numbers assigned to it; if it does not, one-hot encoding is most often used instead (a short sketch follows the link below). Please see the scikit-learn documentation for that:
- http://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features
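As a rough sketch of what that looks like in code (the column names browser and device below are just stand-ins for the question's text columns, not the actual dataset), you can use pandas' get_dummies or, with a reasonably recent scikit-learn, OneHotEncoder:
# Hedged sketch: one-hot encode text columns so sklearn can use them.
# 'browser' and 'device' are placeholder column names, not the real data.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'browser': ['Chrome', 'Firefox', 'Chrome'],
                   'device':  ['Desktop', 'Mobile', 'Desktop']})

# Option 1: pandas creates one 0/1 column per category
encoded = pd.get_dummies(df, columns=['browser', 'device'])
print(encoded)

# Option 2: scikit-learn's OneHotEncoder (recent versions accept strings directly)
enc = OneHotEncoder()
X = enc.fit_transform(df[['browser', 'device']]).toarray()
print(X)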
Answer to Question 2: It depends on the data and the task at hand.
No single algorithm is capable of handling every type of data optimally.
Most of the time we need to compare multiple algorithms and see which gives the best result for our data. See this example:
http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html#sphx-glr-auto-examples-classification-plot-classifier-comparison-py
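A tiny sketch of that kind of comparison (X and y here are synthetic placeholder data, and the two classifiers are arbitrary examples, not a recommendation):
# Hedged sketch: compare two classifiers with 5-fold cross-validation.
# X and y are synthetic stand-ins for your prepared features and labels.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=6, random_state=0)

for clf in (GaussianNB(), LogisticRegression(max_iter=1000)):
    scores = cross_val_score(clf, X, y, cv=5)
    print(type(clf).__name__, round(scores.mean(), 3))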
Even within a single algorithm we need to check various parameter values and tune them for the maximum score. This is called grid search. See this example:
http://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html#sphx-glr-auto-examples-model-selection-plot-grid-search-digits-py
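And a minimal grid-search sketch along the same lines (again with synthetic placeholder data; the SVC and its parameter grid are only examples):
# Hedged sketch: tune hyperparameters with GridSearchCV on placeholder data.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=6, random_state=0)

param_grid = {'C': [0.1, 1, 10], 'gamma': ['scale', 0.01]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)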
Hope this clears your doubts. Make sure to go through the scikit-learn documentation and examples:
http://scikit-learn.org/stable/user_guide.html
http://scikit-learn.org/stable/auto_examples/index.html
They are among the best out there.

What is the most efficient way to read a CSV file into an Accelerate (or Repa) Array?

I am interested in playing around with the Accelerate library, and I would like to perform some operations on data stored inside of a CSV file. I've read this excellent introduction to Accelerate, but I'm not sure how I can go about reading CSVs into Accelerate efficiently. I've thought about this, and the only thing I can think of is to parse the entire CSV file into one long list, and then feed the entire list into Accelerate.
My data sets will be quite large, and it doesn't seem efficient to read a 1 GB+ file into memory only to copy it somewhere else. I noticed there is a CSV enumerator package on Hackage, but I'm not sure how to use it with Accelerate's generate function. Another constraint is that it seems the dimensions of the Array, or at least the number of elements, must be known before generating an array using Accelerate.
Has anyone dealt with this kind of problem before?
Thanks!
I am not sure if this is 100% applicable to accelerate or repa, but here is one way I've handled this for Vector in the past:
-- | A hopefully-efficient sink that incrementally grows a vector from the input stream
sinkVector :: (PrimMonad m, GV.Vector v a) => Int -> ConduitM a o m (Int, v a)
sinkVector by = do
    v <- lift $ GMV.new by
    go 0 v
  where
    -- i is the index of the next element to be written by go
    -- also exactly the number of elements in v so far
    go i v = do
      res <- await
      case res of
        Nothing -> do
          v' <- lift $ GV.freeze $ GMV.slice 0 i v
          return $! (i, v')
        Just x -> do
          v' <- case GMV.length v == i of
            True  -> lift $ GMV.grow v by
            False -> return v
          lift $ GMV.write v' i x
          go (i+1) v'
It basically allocates "by" empty slots and proceeds to fill them. Once it hits the ceiling, it grows the underlying vector again. I haven't benchmarked anything, but it appears to perform OK in practice. I am curious to see whether there will be other, more efficient answers here.
Hope this helps in some way. I do see there's a fromVector function in repa and perhaps that's your golden ticket in combination with this method.
I haven't tried reading CSV files into repa, but I recommend using cassava (http://hackage.haskell.org/package/cassava). IIRC I had a 1.5 GB file which I used to create my stats. With cassava, my program ran in a surprisingly small amount of memory. Here's an extended example of its usage:
http://idontgetoutmuch.wordpress.com/2013/10/23/parking-in-westminster-an-analysis-in-haskell/
In the case of repa, if you add rows incrementally to an array (which it sounds like you want to do), then one would hope the space usage would also grow incrementally. It is certainly worth an experiment. Possibly also contact the repa folks. Please report back on your results :-)
