Parsing variable length input file into objects - suggestions? - text-parsing

I have been given the task of writing a small ATM program. The program receives an input file and runs through it, carrying out instructions.
The input file is of the following format:
8000
 
12345678 1234 1234
500 100
B
W 100
 
87654321 4321 4321
100 0
W 10
The first line is the total cash held in the ATM, followed by a blank line. The remaining input represents zero or more user sessions. Each session consists of:
The user's account number, correct PIN and the PIN they actually entered. These are separated by spaces.
Then, on a new line, the customer's current balance and overdraft facility.
Then, one or more transactions, each on a separate line. These can be one of the following types:
Balance inquiries, represented by the operation code B.
Cash withdrawals, represented by the operation code W followed by an amount.
A blank line marks the end of a user session.
I am able to write the part of the program that carries out the transactions and outputs results.
What I need help with is parsing the input file in a meaningful way (possibly into objects). I'm having trouble with the fact that the input is of variable length, which makes looping difficult.
Can anyone push me in the right direction? I'm not just being lazy and looking for the answer; I just need a nudge. I've been stuck on this for half a day now.
Thanks a million.

You can use regular expressions to parse each line; it looks like each line type can be matched uniquely.
12345678 1234 1234 = ^(\d+)\s(\d+)\s(\d+)$
500 100 = ^(\d+)\s(\d+)$
B = ^B$
W 100 = ^W\s(\d+)$
Since the first line is known, just convert it to an Integer manually.
Then walk the file line by line. After each blank line, try the following lines against each regular expression until one matches, and use the capture groups () to extract the relevant data and handle it accordingly. When you hit a blank line, reset your state and start matching again.
Read up on event-driven applications, which is what this is; the apparent loop in reading the file is a red herring.
The blank lines mark the start of a logical set of events. Each line then represents an atomic event, which should map easily to a function/method call.
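Putting the answer above together, here is a minimal sketch in Python. The session structure, dictionary keys, and pattern names are my own invention, not part of the original answer; the point is only the state machine driven by blank lines and per-line patterns.

```python
import re

# One pattern per line type; names here are illustrative, not prescribed.
SESSION_HEADER = re.compile(r"^(\d+)\s(\d+)\s(\d+)$")   # account, correct PIN, entered PIN
BALANCE_LINE   = re.compile(r"^(\d+)\s(\d+)$")          # balance, overdraft
BALANCE_OP     = re.compile(r"^B$")
WITHDRAW_OP    = re.compile(r"^W\s(\d+)$")

def parse(lines):
    """Return (atm_total, sessions); each session is a dict."""
    atm_total = int(lines[0])          # the first line is known: total cash in the ATM
    sessions, current = [], None
    for line in lines[1:]:
        line = line.strip()
        if not line:                   # blank line: session boundary, reset state
            if current:
                sessions.append(current)
            current = None
            continue
        if current is None:            # first line of a session: the header
            m = SESSION_HEADER.match(line)
            current = {"account": m.group(1), "pin": m.group(2),
                       "entered": m.group(3), "transactions": []}
        elif "balance" not in current: # second line: balance and overdraft
            m = BALANCE_LINE.match(line)
            current["balance"], current["overdraft"] = int(m.group(1)), int(m.group(2))
        elif BALANCE_OP.match(line):   # remaining lines: transactions
            current["transactions"].append(("B", None))
        else:
            m = WITHDRAW_OP.match(line)
            current["transactions"].append(("W", int(m.group(1))))
    if current:                        # file may end without a trailing blank line
        sessions.append(current)
    return atm_total, sessions
```

Because the variable-length parts (sessions, transactions) are delimited by blank lines and distinguishable line shapes, the loop never needs to know the lengths in advance.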

Related

How do I get to write a code with JS to repeat an input 50 times

One of your friends has been told by her teacher to write her (or his) name 50 times. As a JavaScript student, you have decided to help her by automating the name writing, because writing her name manually from 1 to 50 is tedious work. So you tasked yourself with writing a program that displays your friend's name 50 times.
Either run a for loop, or use the string method called repeat.

How to isolate numbers separated by stars in Lua?

In some web service, I receive this
"time":"0.301*0.869*1.387*2.93*3.653*3.956*4.344*6.268*6.805*7.712*9.099*9.784*11.071*11.921*13.347*14.253*14.965*16.313*16.563*17.426*17.62*18.114"
I want to separate the numbers and insert them into a table like this. How?
0.301
0.869
1.387
2.93
3.653
3.956
4.344
6.268
6.805
7.712
9.099
9.784
11.071
11.921
13.347
14.253
14.965
16.313
16.563
17.426
17.62
18.114
A little string-matching should get the job done:
local str = [["time":"0.301*0.869*1.387*2.93*3.653*3.956*4.344*6.268*6.805*7.712*9.099*9.784*11.071*11.921*13.347*14.253*14.965*16.313*16.563*17.426*17.62*18.114"]]
local list = {}
for num in str:gmatch("%**(%d+%.%d+)") do
    table.insert(list, tonumber(num))
end
A Little Explanation
I'll first briefly summarize what some of the symbols here are:
%d matches a digit.
%. matches a literal period.
+ matches one or more of the preceding item.
%* matches a literal star.
* (when the percent sign isn't in front) matches zero or more of the preceding item.
Now, let's put this together to look at it from the start:
%** This matches zero or more stars, so a leading star is allowed but optional. It needs to be optional because the first number you wanted does not have a star in front of it.
%d+ means to look for a sequence of digit(s) until something else pops up. In our case, this would be like the '18' in '18.114' or the '1' in '1.387'
%. as I said means we want the next thing found to be a period.
%d+ means we want another sequence of digit(s). Such as the 114 in 18.114
So, what do the parentheses mean? They define a capture: gmatch returns only the part of the match inside the parentheses, so the stars are consumed by the pattern but left out of the result.
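For comparison, the same extraction is a one-liner in other regex dialects. A Python sketch (using a shortened version of the payload; not part of the original answer): here the star separators need no special handling, because findall simply collects every decimal-number token, much as the Lua capture does.

```python
import re

# Shortened version of the "time" payload from the question.
s = '"time":"0.301*0.869*1.387*2.93*3.653"'

# Grab every decimal-number token; the '*' separators are simply skipped over.
nums = [float(n) for n in re.findall(r"\d+\.\d+", s)]
```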

How to format data in (a) CSV file(s) so that it can easily be imported in R?

Edit:
So, this format would work:
featureID charge xcoordinate ycoordinate
1 2 5105.9217 336.125209180674
1 2 5108.7642 336.124751115092
2 0 2434.9217 145.893331325278
But what if I have a column with multiple values that are linked? Say a quality column links a machine to a quality, and the column looks like this:
MachineQuality
[[{1:1224}, {2:3453}], [{1:2242}, {2:4142}]]
Now if I want to split that up, like I did with the coordinates of the convex hull, I would need 2 rows instead of 1. But wouldn't I then need 2 rows for every row that is already there (so 4, because there are already 2 extra rows for the coordinates), like this:
featureID charge xcoordinate ycoordinate quality1 quality2
1 2 5105.9217 336.125209180674 1224 3453
1 2 5105.9217 336.125209180674 2242 4142
1 2 5108.7642 336.124751115092 1224 3453
1 2 5108.7642 336.124751115092 2242 4142
[...]
Would it have to be like this?
I'm very new to R; my knowledge doesn't go much further than making a vector and some simple plots. I'm going to use R for an internship project over the next couple of months, and during that time I will (hopefully) learn some of the ins and outs of R. However, before I start I need to produce the data that I'm going to do the statistics on, so I need to know beforehand how to format my output CSV data so that I can easily read it in once I start my analysis.
One thing that I've been asked to do is make a CSV file out of the data so that it can be read in by R. The example CSV files for importing with R that I've seen all look like this
featureID Charge value
1 2 10
2 0 9
However, my data mostly consists of columns whose values themselves contain multiple values. To clarify:
As an example, my data consists of "features" that, among other information, have a "convexhull". This convex hull consists of paired x and y coordinates. So my data could look like this (showing only two coordinates; there can be many):
featureID Charge Convexhull
1 2 [[{'y': '336.125209180674'}, {'x': '5105.9217'}], [{'y': '336.124751115092'}, {'x': '5108.7642'}]]
Is it possible to get this into one CSV file and read it into R correctly (so that the paired x and y coordinates are preserved)? If so, what should the CSV file look like? For example, I've seen CSV files with multiple values that look like this:
featureID charge xcoordinate ycoordinate
1 2 5105.9217 336.125209180674
5108.7642 336.124751115092
2 0 2434.9217 145.893331325278
But I can't find if this is easily imported by R.
If this is not doable in one CSV file, are the CSV files easily imported independently, with a primary key idea, like database linking?
The only critical things are that you have a unique character separating your data columns and that each column is the same length. As long as the second row in your last example is filled in that will import fine.
You need to consider what you want to do with the data after it's in R to decide how you might want any other special formatting beforehand. But, as long as the column separator is a unique character and the columns are of equal length then it will import.
(You can violate the unique separator requirement if your entries are wrapped in quotes. And if you want to get really fancy you could "import" almost anything. But if someone's asking you to format the data then they probably want a rectangular data.frame compatible layout. They probably want unique values in each column (no columns of points). But that's between you and them.)
long vs. wide form. Your last example is known as long form (except all cells should be filled in) and your first example is roughly wide form as discussed on the ?reshape page and illustrated in the examples at the end of that page. You likely want to stick with long form. For an alternative see the reshape2 package.
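If long form is what you end up writing out, producing it just means repeating the per-feature fields on every coordinate row. A sketch (Python here; the column names come from the question, but the nested input structure is my assumption):

```python
import csv, io

# Each feature: (featureID, charge, [(x, y), ...]) -- structure assumed from the question.
features = [
    (1, 2, [(5105.9217, 336.125209180674), (5108.7642, 336.124751115092)]),
    (2, 0, [(2434.9217, 145.893331325278)]),
]

buf = io.StringIO()
w = csv.writer(buf)
w.writerow(["featureID", "charge", "xcoordinate", "ycoordinate"])
for fid, charge, hull in features:
    for x, y in hull:
        # Repeat the id/charge on every row so each cell is filled in: long form.
        w.writerow([fid, charge, x, y])
```

The resulting file has equal-length columns and a unique separator, so R's read.csv can ingest it directly.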
save & load. Note that if you are only writing it out to read it back in to R later (as opposed to communicating it to some other software) you could use save and load which don't require any change to the object at all.
json. Another possibility given the form of your example is that you might want to look at the rjson package .

hadoop split file in equally size

I'm trying to learn how to divide a file stored in HDFS into splits and read each split in a different process (on different machines).
What I expect is that with a SequenceFile containing 1200 records and 12 processes, I would see around 100 records per process. I divide the file by getting the length of the data, dividing by the number of processes, and deriving a chunk/beg/end for each split, which I then pass to e.g. SequenceFileRecordReader and read records from in a simple while loop. The code is below.
private InputSplit getSplit(int id) throws IOException {
    ...
    for (FileStatus file : status) {
        long len = file.getLen();
        BlockLocation[] locations =
            fs.getFileBlockLocations(file, 0, len);
        if (0 < len) {
            long chunk = len / n;
            long beg = (id * chunk) + (long) 1;
            long end = (id) * chunk;
            if (n == (id + 1)) end = len;
            return new FileSplit(file, beg, end,
                locations[locations.length - 1].getHosts());
        }
    }
    ...
}
However, the result shows that the sum of total records counted by each process is different from the records stored in file. What is the right way to divide the SequenceFile into chunk evenly and distribute them to different hosts?
Thanks.
I can't help but wonder why you are trying to do such a thing. Hadoop automatically splits your files, and 1200 records split into chunks of 100 doesn't sound like a lot of data. If you elaborate on what your problem is, someone might be able to help you more directly.
Here are my two ideas:
Option 1: Use Hadoop's automatic splitting behavior
Hadoop automatically splits your files. The number of blocks a file is split up into is the total size of the file divided by the block size. By default, one map task will be assigned to each block (not each file).
In your conf/hdfs-site.xml configuration file, there is a dfs.block.size parameter. Most people set this to 64 or 128 MB. However, if you are trying to do something tiny, like 100 sequence-file records per block, you could set this really low, say to 1000 bytes. I've never heard of anyone wanting to do this, but it is an option.
Option 2: Use a MapReduce job to split your data.
Have your job use an "identity mapper" (basically implement Mapper and don't override map). Also, have your job use an "identity reducer" (basically implement Reducer and don't override reduce). Set the number of reducers to the number of splits you want. Say you have three sequence files you want split into a total of 25 files: load those 3 files and set the number of reducers to 25. Records will get randomly sent to each reducer, and what you end up with is close to 25 equal splits.
This works because the identity mappers and reducers effectively don't do anything, so your records will stay the same. The records get sent to random reducers, and then they will get written out, one file per reducer into part-r-xxxx files. Each of those files will contain your sequence file(s) split into somewhat even chunks.
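On the question's own approach: one likely source of the miscount is that FileSplit's constructor takes a start offset and a length, not an end offset, so splits built from beg/end pairs will not tile the file. The arithmetic for near-equal contiguous (start, length) chunks can be sanity-checked with a quick sketch (Python here for brevity; this is the partitioning idea only, not Hadoop code):

```python
def splits(total_len, n):
    """Divide [0, total_len) into n contiguous (start, length) chunks;
    the last chunk absorbs the remainder."""
    chunk = total_len // n
    out = []
    for i in range(n):
        start = i * chunk
        length = total_len - start if i == n - 1 else chunk
        out.append((start, length))
    return out
```

Note that even with correct byte ranges, SequenceFile records don't align to arbitrary offsets; the record reader uses sync markers to find record boundaries, which is another reason per-split counts are only approximately equal.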

Reading REAL's from file in FORTRAN 77 - odd results

I'm currently messing around in FORTRAN 77 and I've run into a problem I can't seem to figure out. I'm trying to read from a file that looks similar to this:
000120 Description(s) here 18 7 10.15
000176 Description(s) here 65 20 56.95
...
The last column in each row is a monetary amount (never greater than 100). I am trying to read the file using code similar to the following:
integer pid, qty, min_qty
real price
character*40 descrip
open(unit=2, file='inventory.dat', status='old')
read(2, 100, IOSTAT=iend) pid, descrip, qty, min_qty, price
100 format(I11, A25, I7, I6, F5)
Everything seems to be read just fine except the last column. When I check the value of price for, say, the second line, instead of getting 56.95 I get something like 56.8999999999.
Now, I understand I might get trailing 9's because floating point isn't totally precise, but shouldn't it be closer to 95 cents? Maybe I'm doing something wrong; I'm not sure. Hopefully I'm not stuck with my program running like this! Any help is greatly appreciated!
Is that exactly the code you use to read the file? Do you have "X" edit descriptors to align the columns, such as (I11, A25, 2X, I7, 3X, I6, 3X, F5.2) (with made-up widths)? If the alignment is off by one and you read only "56.9" of "56.95", then floating-point imprecision could easily give you 56.89999..., which is very close to 56.9.
You could also read the whole line into a string and then read the numbers from sub-strings; this only requires identifying precisely where each field starts. Once the sub-string contains only spaces and numbers, you can use a less finicky list-directed read: read (string(30:80), *) qty, min_qty, price.
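The alignment hypothesis is easy to test outside Fortran. A quick sketch (Python here, with an assumed column layout) shows that a parse of the numeric tail recovers 56.95, while a read that drops the final digit yields 56.9, whose nearest binary double prints with extra digits at higher precision, exactly the symptom described:

```python
# A record shaped like those in the question; the column layout is assumed.
line = "000176 Description(s) here               65    20 56.95"

# Free-form parse of the numeric tail, analogous to the suggested
#   read (string(30:80), *) qty, min_qty, price
qty, min_qty, price = line.split()[-3:]
qty, min_qty, price = int(qty), int(min_qty), float(price)

# An off-by-one read that drops the final digit parses "56.9", not "56.95".
misread = float(line[:-1].split()[-1])
```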
