How to read the same column from multiple files and collect it in an array - arrays

I have 9 CSV files, each containing the same number of columns (61) and the same column headers. The files are follow-ups of each other: each column is a signal reading that has been recorded over a long period of time and hence divided into multiple files. I need to graph the collected data for every single column. To do that, I thought I would read one column at a time from all files, store the data in an array, and graph it against time.
Since the data load is too much (the system takes a reading every 5 seconds for a month), I want to keep only one reading every 30 minutes, which equals reading 1 row per 360 rows (30 × 60 / 5).
I've tried plotting everything without skipping rows and it takes forever because of the data load.
file_list = glob.glob('*.csv')
cols = [0,1] # add more columns here
df = pd.DataFrame()
for f in file_list:
    df = df.append(
        pd.read_csv(f, delimiter='\s+', header=None, usecols=cols),
        ignore_index=True,
    )
arr = df.values
This is what I tried in order to read only specific columns from multiple files, but I receive this message: "Usecols do not match columns, columns expected but not found: [1]"

The snippet below does a parallel read followed by a concatenation, assuming file_list contains a list of files that can be read with the read_file function below:
import multiprocessing as mp
import pandas as pd

def read_file(file):
    return pd.read_csv(file)

pool = mp.Pool(mp.cpu_count())  # one worker per CPU; you can try other values
df = pd.concat(pool.map(read_file, file_list))
pool.close()
pool.join()

Related

Looping dataframes through an mmer model, saving the output in 1 dataframe

I badly need advice on how to write code that runs all the dataframes through a sommer model. In addition, the varcomp output should be stored in another data frame in order to compute the h2.
To give you an idea here's the scenario.
I have 388 dataframes each having 3 columns, 2 columns are random effects variable (REV) whereas the other column is the phenotype.
The code I am using is:
ans1<-mmer(Phenotype~1, random= C1 + C2 + C1:C2, data= dataframe1)
summary(ans1)$VarComp
The last line gives you the variance components of the model.
I need to save the VarComp object in a dataframe, with the Phenotype as the column name, in order to compute the h2 at the end of the analysis.
Thank you so much

Lua - how to loop fewer times than there are entries in a table?

Lua newbie here, having trouble with all the different ways of looping.
I adjusted a template script for mapping audio files from my filesystem into an audio sampler software, where each audio file goes into 1 group --> zone --> file.
This works as intended, but now, for a "Lite" version of the software package, I only need some of the samples from my computer, and this is where I'm having problems with my loop functions.
This is the function causing issues:
for index, file in next, samples do
...
The table "samples" consists of 180 samples, but I'm only creating 38 groups and zones in the audio software to put audio files into, so obviously this returns an error.
ERROR:
" ...- PATH.LINE: bad argument #2 to '--index' (invalid index, size is 58 got 58)"
The 58 makes sense because it is 38 sample groups + 18 empty groups + 2 template groups that get duplicated.
Since that is still smaller than 180, the loop wants to keep going, and I'm not sure how to tell it to stop there.
CODE:
local samples = {}
local root = 0
-- [[ FILE SYSTEM ]]
local i = 1
for _, p in filesystem.directoryRecursive(folderPath) do
    if filesystem.isRegularFile(p) then
        if filesystem.extension(p) == '.wav' or filesystem.extension(p) == '.aif' or filesystem.extension(p) == '.aiff' then
            samples[i] = p
            i = i + 1
        end
    end
end
After this I create the number of groups I need in the audio software (38).
A group contains a zone, which contains a sample.
-- [[ ZONES & FILES ]]
-- Create zones and place one file in each of the created groups.
-- file is a string property of the object Zones
-- samples is a table, populated with paths to our samples
for index, file in next, samples do
    -- Initialize the zone variable.
    local z = Zone()
    -- Add a zone for each sample.
    instrument.groups[index + 1].zones:add(z)
    -- Populate the attached zone with a sample from our table.
    z.file = file
    -- Detect and set the root note.
    local detectedPitch = mir.detectPitch(index)
    z.rootKey = math.floor(detectedPitch + 0.5)
end
So my question is: how do I loop through the table samples, but only up to 38 rather than 180?
I could do
for index = 1, #NUM_FACT_LAYERS do
but what do I do about file then?
samples only holds the paths to the files, but z.file needs the path string of the current entry, so z.file = file won't work in this case.
I guess that's why the previous script used the generic for ... in loop.
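Since samples is an array-style table indexed 1..180, a numeric for can recover the path simply by indexing; a minimal sketch, with shortened stand-in data and NUM_GROUPS standing in for your 38:

```lua
-- Minimal sketch: iterate only the first NUM_GROUPS entries of an
-- array-style table. `samples` here is shortened stand-in data; in the
-- real script the loop body would do z.file = file and so on.
local samples = { "a.wav", "b.wav", "c.wav", "d.wav", "e.wav" }
local NUM_GROUPS = 3

for index = 1, NUM_GROUPS do
    local file = samples[index]  -- same value the generic for-in supplies
    print(index, file)
end
```

The remaining entries of samples are simply never visited, so no "invalid index" can occur on the sampler side.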

sum of variables in SPSS-statistics25 for multiple external files

I have 50 external Excel files. For each of these files, let's say #I, I import data as follows in the SPSS Statistics 25 syntax:
GET DATA /TYPE=XLSX
/FILE='file#I.xlsx'
/SHEET=name 'Sheet2'
/CELLRANGE=full
/READNAMES=on
/ASSUMEDSTRWIDTH=32767.
EXECUTE.
DATASET NAME DataSet1 WINDOW=FRONT.
Then, I rank the variables included in file #I (WA and CI) and select at most one single case, as follows:
RANK VARIABLES= WA CI (D)
/RANK
/PRINT=YES
/TIES=LOW.
COUNT SVAR= RWA RCI (1).
SELECT IF( SVAR=2).
EXECUTE.
The task is the following:
I should print the sum of the values of RWA, looping over each Excel file #I. RWA can have the value 1 or be empty. If there are no selected cases (RWA is empty), the contribution to the sum should be 0. The final outcome should be the number of times RWA and RCI share the TOP rank across the 50 Excel files.
How can I do this in a smart way?
Since I can't see the real data files, the following is a little in the dark, but I think it should be a viable strategy (you might as well try :)):
* first defining a macro to stack all the files together.
define stackFiles ()
GET DATA /TYPE=XLSX /FILE='file1.xlsx'
/SHEET=name 'Sheet2' /CELLRANGE=full /READNAMES=on /ASSUMEDSTRWIDTH=32767 /keep WA CI.
compute source=1.
exe.
dataset name gen.
!do !i=2 !to 50
GET DATA /TYPE=XLSX /FILE=!concat("'file", !i, ".xlsx'")
/SHEET=name 'Sheet2' /CELLRANGE=full /READNAMES=on /ASSUMEDSTRWIDTH=32767/keep WA CI.
compute source=!i.
exe.
add files /file=gen /file=*.
exe.
!doend.
!enddefine.
* now run the macro.
stackFiles .
* now for the rest of the analysis.
* first split the data by source file, then rank and select.
sort cases by source.
split file by source.
RANK VARIABLES= WA CI (D) /RANK /PRINT=YES /TIES=LOW.
COUNT SVAR= RWA RCI (1).
SELECT IF SVAR=2.
EXECUTE.
At this point you have up to 50 rows remaining - 0 or 1 from each original file. You can count or sum them using DESCRIPTIVES on RWA.
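For that last step, a sketch of the syntax (with SPLIT FILE turned off first so DESCRIPTIVES totals over the rows from all files rather than per file):

```
* Sum the 0/1 RWA indicator over all remaining rows.
split file off.
DESCRIPTIVES VARIABLES=RWA /STATISTICS=SUM.
```

The printed sum is the number of files in which RWA and RCI share the top rank.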

How to read a flat file and load it into 2 different SQL tables(different table structure) using SSIS 2019

I have a flat file that doesn't have a header record. The data, except for the trailing record, is like a fixed-width flat file with no delimiters.
Data in flat file looks like:
TOM ROLLS
DAVECHILLS
TOTAL2XYZ
Fixed-width layout (for the first 2 lines of the flat file data shown above):
ColumnName Start position End position
Name 1 4
Last_name 5 9
I want to load the data (up to the trailing record) into data_table, and the trailing record (starting with TOTAL) into another table. The data in the total table should look like
c1 c2
2 XYZ
For the data table, I am currently using "fixed width" and dividing the data into different columns, and it is working fine. Can you please help me load the last trailing record into a different table (the Total table discussed above)?
You have not provided enough data for me to test: I can find several methods to load one row and accomplish what you are asking, but those methods would not necessarily work with multiple rows, depending on the structure of your source data.
On the surface, it appears that you simply need to make another flat file connection and define the start and end positions to extract only the data for the second table.
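Outside SSIS, the routing logic itself is small; a sketch in Python (function name mine) of the same split the two flat-file connections would implement, using the question's column positions and assuming the trailer layout is 'TOTAL' + a 1-character c1 + c2:

```python
# Illustrative sketch of the routing: fixed-width slicing for data rows,
# with the trailing TOTAL record sent to a separate table.
def split_flat_file(lines):
    data_rows, total_rows = [], []
    for line in lines:
        if line.startswith('TOTAL'):
            # assumed trailer layout: 'TOTAL' + c1 (1 char) + c2 (rest)
            total_rows.append({'c1': line[5:6], 'c2': line[6:].strip()})
        else:
            # Name = positions 1-4, Last_name = positions 5-9 (1-based)
            data_rows.append({'Name': line[0:4].strip(),
                              'Last_name': line[4:9].strip()})
    return data_rows, total_rows
```

In SSIS terms, the `startswith('TOTAL')` check corresponds to a Conditional Split, and the two slicings correspond to the column definitions of the two flat file connections.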

How to create runtime variable for reading csv file header using Pandas

I have a CSV file. It logs some data depending on the test condition.
The header row of this CSV file is like below:
UTC Time(s)  SVID-1  Constel-1  Status-1  Zij-1  SVID-2  Constel-2  Status-2  Zij-2  SVID-3  Constel-3  Status-3  Zij-3  .......
10102        1       G          P         0      2       G          P         0.3    3       G          A         --
.....
.....
Apart from the UTC Time column, the other columns may increase or decrease depending on the test condition or the number of satellites I use.
If an extra satellite is introduced or removed, the corresponding SVID, Constel, Status, and Zij columns will or will not be present.
I am interested to know whether it is possible to create a runtime variable for each column without looking into the CSV file header.
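pandas discovers the header at read time, so the column set never needs to be known in advance; a minimal sketch, with inline sample data standing in for the real log file (if the real file is whitespace-delimited rather than comma-separated, pass sep=r'\s+' to read_csv):

```python
import io
import pandas as pd

# Sketch: let pandas discover whatever header this particular file has,
# then build one "runtime variable" per column, keyed by header name.
sample = io.StringIO(
    "UTC Time(s),SVID-1,Constel-1,Status-1,Zij-1\n"
    "10102,1,G,P,0\n"
)
df = pd.read_csv(sample)

# One entry per column actually present in this file.
columns = {name: df[name] for name in df.columns}
```

`columns['SVID-1']` now works for any satellite that happens to be present, and iterating `df.columns` adapts automatically when columns appear or disappear between test runs.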
