Splitting two large CSV files while preserving relations between file A and B across the resulting files

I have two large CSV files (around 1 GB). They share a relation with each other (ID is, let's say, like a foreign key). The structure is simple, line by line, but CSV cells with a line break inside the value string can appear:
37373;"SOMETXT-CRLF or other line break-";3838383;"sasa ssss"
One file is the P file and the other is the T file. T is about 70% of the size of P (P > T). I must cut them into smaller parts, since they are too big for the program I have to import them into... I cannot simply use split -l 100000, since I would lose the ID=ID relations, which must be preserved! A relation can be 1:1, 2:3, 4:6 or 1:5. So naive file splitting is not an option; we must check where we create a new file. Here is an example with a simplified CSV structure and a place where I want the files to be cut (the lines above the cut go to a separate P|T__00x file, and we continue until P or T ends). Lines are sorted in both files, so there is no need to search for IDs across the whole file!
File "P" (empty lines for clearness):
CSV_FILE_P;HEADER;GOES;HERE
564788402;1;1;"^";"01"
564788402;2;1;"^";"01"
564788402;3;1;"^";"01"

575438286;1;1;"^";"01"
575438286;1;1;"^";"01"
575438286;2;1;"37145859"
575438286;2;1;"37145859"
575438286;3;1;"37145859"
575438286;3;1;"37145859"

575439636;1;1;"^"
575439636;1;1;"^"

# let's say the ~100k line limit of file P is somewhere here, and there are no more lines with ID 575439636, so we cut.

575440718;1;1;"^"
575440718;1;1;"^"
575440718;2;1;"10943890"
575440718;2;1;"10943890"
575440718;3;1;"10943890"
575440718;3;1;"10943890"

575441229;1;1;"^";"01"
575441229;1;1;"^";"01"
575441229;2;1;"4146986"
575441229;2;1;"4146986"
575441229;3;1;"4146986"
575441229;3;1;"4146986"
File T (empty lines for clarity):
CSV_FILE_T;HEADER;GOES;HERE
564788402;4030000;1;"0204"

575438286;6102000;1;"0408"
575438286;6102000;0;"0408"

575439636;7044010;1;"0408"
575439636;7044010;0;"0408"

# we must cut here since the bigger file "P" has reached its 100k limit,
# and we end here because the lines with ID 575439636 are over.

575440718;6063000;1;"0408"
575440718;6063000;0;"0408"

575441229;8001001;0;"0408"
575441229;8001001;1;"0408"
Can you please help me split those two files into many separate files of 100,000 (or so) lines, T_001 and a corresponding P_001 file and so on, so that IDs match between the file parts? I believe awk will be the best tool, but I do not have much experience in this field. And the last thing: the CSV header should be preserved in each of the files.
I have a powerful AIX machine to cope with this (Linux is also possible, since AIX commands are sometimes limited).

You can parse the leading IDs with awk and then check whether the current ID is the same as the last one. Only when it differs are you allowed to close the current output file and open a new one. At that point, record the ID so you know where to cut the next file. You can track this ID in a text file or in memory; I've done it in memory, and although with big files like this you could run into trouble, it's easier than opening multiple files and reading from them.
Then you just need to distinguish between the first file (output and recording) and the second file (output using the prerecorded data).
The code does a very brute-force check for the possibility of a CRLF in a field: if the line does not begin with what looks like an ID, it outputs the line and does no further testing on it. That is a problem if the CRLF is followed immediately by a number and a semicolon! This is probably unlikely, though...
Run with: gawk -f parser.awk P T
I don't promise this works!
BEGIN {
    MAXLINES = 100000   # approximate number of lines per output chunk
    block = 0           # current chunk number; 0 means no file started yet
    trackprevious = 0   # set while processing the second file
}
FNR == 1 {
    # First line of each input file is the CSV header
    csvheader = $0
    if (FILENAME == "-")
    {
        _error = 1
        print "Error: Need filename on command line"
        exit 1
    }
    if (trackprevious)
    {
        _error = 1
        print "Only one file can track another"
        exit 1
    }
    if (block >= 1)
    {
        # Starting the second file: record the final ID of the first
        # file's last block, then switch to tracking mode
        close(outputname)
        Tracking[block] = idval
        print "Output for " FILENAME " tracks previous file"
        trackprevious = 1
    }
    else
    {
        print "Chunking output (" MAXLINES ") for " FILENAME
    }
    linecount = 0
    idval = 0
    block = 1
    outputprefix = FILENAME "_block"
    outputname = sprintf("%s_%03d", outputprefix, block)
    print csvheader > outputname
    next
}
/^[0-9]+;/ {
    # Line begins with what looks like an ID
    linecount++
    newidval = $0
    sub(/;.*$/, "", newidval)   # keep only the leading ID field
    newidval = newidval + 0     # make it a number
    startnewfile = 0
    if (trackprevious && (idval != newidval) && (idval == Tracking[block]))
    {
        # Second file: the ID changed and we just passed the boundary ID
        # recorded for this block by the first file
        startnewfile = 1
    }
    else if (!trackprevious && (idval != newidval) && (linecount > MAXLINES))
    {
        # First file: over the line limit and the ID changed;
        # record the last ID value found before the new file
        Tracking[block] = idval
        startnewfile = 1
    }
    if (startnewfile)
    {
        close(outputname)
        block++
        outputname = sprintf("%s_%03d", outputprefix, block)
        print csvheader > outputname
        linecount = 1
    }
    print $0 > outputname
    idval = newidval
    next
}
{
    # No leading ID: assume a continuation of a field with an embedded
    # line break and pass it through to the current output file
    linecount++
    print $0 > outputname
}
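To sanity-check the result afterwards, here is a minimal verification sketch in Python, assuming the chunk names the script above produces (P_block_001, T_block_001, ...) and that every ID present in a T chunk should also appear in its P counterpart:
import glob

def chunk_ids(path):
    # Collect the leading IDs in one chunk; skips the CSV header and
    # continuation lines that do not start with a digit.
    ids = set()
    with open(path) as f:
        next(f)  # skip the CSV header
        for line in f:
            head = line.split(";", 1)[0]
            if head.isdigit():
                ids.add(head)
    return ids

for t_chunk in sorted(glob.glob("T_block_*")):
    p_chunk = t_chunk.replace("T_block", "P_block")
    missing = chunk_ids(t_chunk) - chunk_ids(p_chunk)
    if missing:
        print("%s: %d IDs not found in %s" % (t_chunk, len(missing), p_chunk))
    else:
        print("%s: OK" % t_chunk)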

Related

Reading and Writing Arrays from Multiple HDF Files in IDL

I am fairly new to IDL, and I am trying to write code that will take a MODIS HDF file (level 3 data, MOD14A1 and MYD14A1 to be specific), read the array, and then write the data from the array, preferably to a CSV file, though ASCII would work too. I have code that will allow me to do this for one file, but I want to be able to do this for multiple files. Essentially, I want it to read one HDF array, write it to a CSV, move to the next HDF file, and then write that array to the same CSV file in the next row. Any help here would be greatly appreciated. I've supplied the code I have so far, which does this for one file.
filename = dialog_pickfile(filter = filter, path = path, title = title)
csv_file = 'Data.csv'
sd_id = HDF_SD_START(filename, /READ)
; read "FirePix", "MaxT21"
attr_index = HDF_SD_ATTRFIND(sd_id, 'FirePix')
HDF_SD_ATTRINFO, sd_id, attr_index, DATA = FirePix
attr_index = HDF_SD_ATTRFIND(sd_id, 'MaxT21')
HDF_SD_ATTRINFO, sd_id, attr_index, DATA = MaxT21
index = HDF_SD_NAMETOINDEX(sd_id, 'FireMask')
sds_id = HDF_SD_SELECT(sd_id, index)
HDF_SD_GETDATA, sds_id, FireMask
HDF_SD_ENDACCESS, sds_id
index = HDF_SD_NAMETOINDEX(sd_id, 'MaxFRP')
sds_id = HDF_SD_SELECT(sd_id, index)
HDF_SD_GETDATA, sds_id, MaxFRP
HDF_SD_ENDACCESS, sds_id
HDF_SD_END, sd_id
help, FirePix
print, FirePix, format = '(8I8)'
print, MaxT21, format = '("MaxT21:", F6.1, " K")'
help, FireMask, MaxFRP
WRITE_CSV, csv_file, FirePix
After I run this, and choose the correct file, this is the output I am getting:
FIREPIX LONG = Array[8]
0 4 0 0 3 12 3 0
MaxT21: 402.1 K
FIREMASK BYTE = Array[1200, 1200, 8]
MAXFRP LONG = Array[1200, 1200, 8]
The "FIREPIX" array is the one I want stored into a csv.
Thanks in advance for any help!
Instead of using WRITE_CSV, it is fairly simple to use the primitive IO routines to write a comma-separated array, i.e.:
openw, lun, csv_file, /get_lun
; the following line produces a single line in the output CSV file
printf, lun, strjoin(strtrim(firepix, 2), ', ')
; TODO: do the above line as many times as necessary
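; a minimal sketch of that loop (an untested assumption): if the files were
; chosen via dialog_pickfile(/MULTIPLE_FILES), `filenames` is an array, and
; the HDF_SD_* reads above can be repeated per element:
;   for i = 0, n_elements(filenames) - 1 do begin
;     ; ... read FirePix from filenames[i] as above ...
;     printf, lun, strjoin(strtrim(firepix, 2), ', ')
;   endfor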
free_lun, lun

How to loop over multiple files, ignore empty files without raising an exception?

I'm writing a program that will accept multiple files at once, and I'm trying to change it so that if an empty file is entered, it ignores that file but continues to read all the other files, producing output without raising any exceptions related to the empty file.
For example:
File 1 = contains text that will work with this program
File 2 = is empty
This is a piece of my program:
from sys import argv
script, filenames = argv[0], argv[1:]
for file in filenames:
    with open(file) as f:
        var = f.read()
    print "\n\nYou File Name: '(%r)'" % (file)
    var1 = var.split()
    var2 = len(var1)
    print '\n\nThe Total Number of Words: "({0:,})"'.format(var2)
    var3 = var.split()[0]
    var4 = len(var3)
    print '\n\nThe First Word and Length: "(%s)" ({0:,})'.format(var4) % (var3)
If I run this program using File 2, I will get the following error:
var3 = var.split()[0]
IndexError: list index out of range
Is there a way to run File 1 and File 2 together, get the output for File 1, and then print a message for File 2 saying that it's an unrecognizable file? I tried adding try/except, but it still wasn't working correctly.
Use if / else to check the length of your file:
for file in filenames:
    with open(file) as f:
        var = f.read()
    print "\n\nYou File Name: '(%r)'" % (file)
    if len(var) > 0:
        var1 = var.split()
        var2 = len(var1)
        print '\n\nThe Total Number of Words: "({0:,})"'.format(var2)
        var3 = var.split()[0]
        var4 = len(var3)
        print '\n\nThe First Word and Length: "(%s)" ({0:,})'.format(var4) % (var3)
    else:
        print 'File empty'
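The try/except route the asker mentioned can work too; a minimal sketch of it (same Python 2 style as above), treating the IndexError from indexing the empty split as the signal to skip the rest:
for file in filenames:
    with open(file) as f:
        var = f.read()
    print "\n\nYour File Name: '(%r)'" % (file)
    try:
        var3 = var.split()[0]  # raises IndexError on an empty file
    except IndexError:
        print 'File empty'
        continue
    print '\n\nThe Total Number of Words: "({0:,})"'.format(len(var.split()))
    print '\n\nThe First Word and Length: "(%s)" ({0:,})'.format(len(var3)) % (var3)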

Python Simple PiggyBank Program

This is my Python Program that I have been having some issues with:
# -*- coding: cp1252 -*-
from time import gmtime, strftime
print("Welcome to the PiggyBank version 1.")
num_write = int(input("How much money would you like to store in your PiggyBank?"))
f = open("PiggyBanks_Records.txt", "w")
current_time = strftime("%Y-%m-%d %H:%M:%S", gmtime())
convert_1 = str(current_time)
convert_2 = str(int(num_write))
add_1 = ("\n" + convert_1 + " £" + convert_2)
add_2 = ("\n" + add_1) #Tried to make it so new line is added every time the program is run
final_record = str(add_2)
print("Final file written to the PiggyBank: " + final_record)
#Write to File
f.write(final_record)
f.close()
Right now, whenever the program writes to the file, it overwrites it. I would prefer to keep a history of the amounts added. If anyone can help, I want the string written to the .txt file to move down by one line each time, essentially growing forever. I am also open to any suggestions on how I can shorten this code.
You need to open your file in append mode:
f = open("PiggyBanks_Records.txt", "a")
Using the 'w' write option with open automatically looks for the specified file and deletes its contents if it already exists, or creates it if it doesn't. Use 'a' instead to add/append to the file.
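Since the asker also wanted suggestions for shortening the code, here is a minimal sketch of the same program using append mode and a with block (which closes the file automatically):
from time import gmtime, strftime

print("Welcome to the PiggyBank version 1.")
num_write = int(input("How much money would you like to store in your PiggyBank?"))
record = "\n" + strftime("%Y-%m-%d %H:%M:%S", gmtime()) + " £" + str(num_write)
print("Final record written to the PiggyBank: " + record)
with open("PiggyBanks_Records.txt", "a") as f:  # "a" appends instead of overwriting
    f.write(record)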

Read from text file and assign data to new variable

This Python 3 program allows people to choose from a list of employee names.
The data held in the text file looks like this: ('larry', 3, 100)
(being the person's name, weeks worked and payment).
I need a way to assign each part of the text file to a new variable,
so that the user can enter a new number of weeks and the program calculates the new payment.
Below is my code and my attempt at figuring it out.
import os
choices = [f for f in os.listdir(os.curdir) if f.endswith(".txt")]
print (choices)
emp_choice = input("choose an employee:")
file = open(emp_choice + ".txt")
data = file.readlines()
name = data[0]
weeks_worked = data[1]
weekly_payment= data[2]
new_weeks = int(input ("Enter new number of weeks"))
new_payment = new_weeks * weekly_payment
print (name + "will now be paid" + str(new_payment))
Currently you are assigning the first three lines from the file to name, weeks_worked and weekly_payment. But what you want (I think) is to separate a single line, formatted as ('larry', 3, 100) (does each file have only one line?).
So you probably want code like:
from re import compile
# your code to choose file
line_format = compile(r"\s*\(\s*'([^']*)'\s*,\s*(\d+)\s*,\s*(\d+)\s*\)")
file = open(emp_choice + ".txt")
line = file.readline() # read the first line only
match = line_format.match(line)
if match:
    name, weeks_worked, weekly_payment = match.groups()
else:
    raise Exception('Could not match %s' % line)
# your code to update information
The regular expression looks complicated, but is really quite simple:
\(...\) matches the parentheses in the line
\s* matches optional spaces (it's not clear to me whether you have spaces in various places between words, so this matches just in case)
\d+ matches a number (1 or more digits)
[^']* matches anything except a quote (so it matches the name)
(...) (without the \ backslashes) indicates a group that you want to read afterwards by calling .groups()
These are built from simpler parts (like *, + and \d), which are described at http://docs.python.org/2/library/re.html
If you want to repeat this for many lines, you probably want something like:
name, weeks_worked, weekly_payment = [], [], []
for line in file.readlines():
    match = line_format.match(line)
    if match:
        name.append(match.group(1))
        weeks_worked.append(match.group(2))
        weekly_payment.append(match.group(3))
    else:
        raise ...
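One caveat: match.groups() returns strings, so the numeric fields need converting before the arithmetic in the question (otherwise new_weeks * weekly_payment just repeats the string). A small sketch of the update step, reusing the question's variable names:
new_weeks = int(input("Enter new number of weeks: "))
new_payment = new_weeks * int(weekly_payment)  # convert the captured string first
print(name + " will now be paid " + str(new_payment))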

Looping through A Folder and all its subdirectories

OK. Many troubleshooting hours... and many error "dings" later, I'm still having the same problem. Due to my beginner skills, I'm having trouble achieving the following segment of my project.
I will be as detailed as possible so I can hopefully nail it this time:
On my computer I have a folder C:\data which contains many different subfolders.
The subfolders are named by dates in MMDDYY fashion, for example "040312".
In each subfolder are Excel files named after baseball teams. Each subfolder may contain a different combination of xls files.
I am trying to write code that achieves the following objectives:
1.) Loop through all the subfolders of the C:\data folder looking for xls files that have the filenames Angels.xls, Diamondbacks.xls, etc.
2.) If a file is found in a subfolder, import the spreadsheet data and generate a plot of the data titled "Score" and "Allow".
3.) If a file is not found in a given subfolder, skip it and continue to the next file to be located.
4.) Then save the generated plot in the same folder the spreadsheet was imported from, as a .fig and a .bmp file.
I've gotten hints to use various functions like genpath and dir, but the code I've been fumbling through isn't able to achieve my goals:
a) the script doesn't import the Excel files from all the subfolders
b) the script won't save the .fig or .bmp file in the associated subfolder
Here is the code I have been fumbling through:
% I know all of this is wrong wrong wrong. Please help to adjust my code to
% achieve the objectives outlined above!
addpath(genpath('c:\data'))
folder = 'c:\data';
subdirs = dir(folder);
subdirs(~[subdirs.isdir]) = [];
numberOfFolders = length(subdirs);
if numberOfFolders <= 0
    uiwait(warndlg('Number of folders = 0!'))
end
wantedfiles = {'Angels' 'Diamondbacks' 'Orioles' 'Royals' 'Yankees' 'Mets' 'Giants'};
for K = 1 : numberOfFolders
    thissubdir = subdirs(K).name;
    if strcmp(thissubdir, '.') || strcmp(thissubdir, '..')
        continue;
    end
    subdirpath = [folder '\' thissubdir];
    for L = 1 : length(wantedfiles)
        for wantedfiles = {'Angels' 'Diamondbacks' 'Orioles' 'Royals' 'Yankees' 'Mets' 'Giants'};
            folder = '';
            fileToRead1 = [wantedfiles{1} '.xls'];
            sheetName = 'Sheet1';
            if exist(fileToRead1, 'file') == 0
                % File does not exist
                % Skip to bottom of loop and continue with the loop
                continue;
            end
            % This is to import the data and organize it
            % All of this code I had auto-generated from importing files manually
            [numbers, strings, raw] = xlsread(fileToRead1, sheetName);
            if ~isempty(numbers)
                newData1.data = numbers;
            end
            if ~isempty(strings) && ~isempty(numbers)
                [strRows, strCols] = size(strings);
                [numRows, numCols] = size(numbers);
                likelyRow = size(raw,1) - numRows;
                % Break the data up into a new structure with one field per column.
                if strCols == numCols && likelyRow > 0 && strRows >= likelyRow
                    newData1.colheaders = strings(likelyRow, :);
                end
            end
            % Create new variables in the base workspace from those fields.
            for i = 1:size(newData1.colheaders, 2)
                assignin('base', genvarname(newData1.colheaders{i}), newData1.data(:,i));
            end
            % Now I execute the plotting of data
            subplot(2,1,1), plot(Score,Allow)
            title([wantedfiles{1} 'Testing to see if it works']);
            subplot(2,1,2), plot(Allow,Score)
            title('Well, did it?');
            % Here I save the generated plots, but they don't save where I want them to
            saveas(gcf, [wantedfiles{1} ' did it work.fig']);
            saveas(gcf, [wantedfiles{1} ' did it work.bmp']);
        end
    end
end
% At the end of the script I still was unable to loop over the files that I wanted
rmpath(genpath('c:\data'));
