I have a large text file of numbers (137 MB) and am looking to use Groovy to open the file, read it line by line, modify the numbers, and then place them into a database (as strings). Each line contains 2 related items that need to be written to separate database columns.
My text file looks as such:
A.12345
A.14553
A.26343
B.23524
C.43633
C.23525
So the flow would be:
Step 1. The file is opened
Step 2. Line 1 is read
Step 3. Line 1 is split into a letter/number pair [:]
Step 4. The number is divided by 10
Step 5. The letter is written to the letter database (as a string)
Step 6. The number is written to the number database (as a string)
Step 7. The letter:number pair is also written to a separate comma-separated text file
Step 8. Proceed to the next line (line 2)
Output text file should look like this:
A,1234.5
A,1455.3
A,2634.3
B,2352.4
C,4363.3
C,2352.5
Database for numbers should look like this:
1:1234.5
2:1455.3
3:2634.3
4:2352.4
5:4363.3
6:2352.5
*lead numbers are database index locations, for relational purpose
Database for letters should look like this:
1:A
2:A
3:A
4:B
5:C
6:C
*lead numbers are database index locations, for relational purpose
I have been able to do most of this; the issue I am running into is not being able to use the .eachLine( line -> ) function correctly... and I have NO clue how to output the values to the databases.
There is one more thing I am quite dense about, and that is the instance where the script encounters an error. The text file has TONS of entries (around 9000000) so I am wondering if there is a way to make it so if the script fails or anything happens that I can restart the script from the last modified line.
Meaning, the script has an error (my computer gets shut down somehow) and stops running at line 125122 of the text file (after completing the modification of line 125122)... how do I make it so that when I start the script a second time, it resumes at line 125123?
Here is my sample code so far:
//openfile
myFile = new File("C:\\file.txt")
//set fileline to target
printFileLine = { it }
//set target to argument
numArg = myFile.eachLine( printFileLine )
//set argument to array split at "."
numArray = numArg.split(".")
//set int array for numbers after the first char, which is a letter
def intArray = numArray[2] { it as int } as int
//set string array for numbers after the first char, which is a letter
def letArray = numArray[1] { it as string }
//No clue how to write to a database or file... or do the persistence thing.
Any help would be appreciated.
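I would use a loop to cycle over every line within the text file, and I would use Java methods for manipulating the strings.
def file = new File('C:\\file.txt')
StringBuilder sb = new StringBuilder()
file.eachLine { line ->
    // reset the StringBuilder and load the current line, e.g. "A.12345"
    sb.setLength(0)
    sb.append(line)
    // replace the "." separator with a comma: "A,12345"
    sb.setCharAt(1, ',' as char)
    // divide by 10 by inserting a decimal point before the last digit: "A,1234.5"
    sb.insert(sb.length() - 1, '.')
}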
You could then write each line to a new text file, example here. You could use a simple counter (e.g. counter = 0; and then counter++;) to store the latest line that has been read/written and use that if an error occurs. You could catch possible errors within a try/catch statement if you are regularly getting crashes also.
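The counter idea can be sketched as follows. This is a Python illustration rather than Groovy, and the function names and the checkpoint file are assumptions made for the example, not part of the original script. It appends each converted line to the output file and records the number of the last completed line, so a restart skips everything already done.

```python
import os
from decimal import Decimal


def transform(line):
    """Turn 'A.12345' into ('A', '1234.5')."""
    letter, digits = line.strip().split(".", 1)
    # scaleb(-1) moves the decimal point one place left without float rounding.
    return letter, str(Decimal(digits).scaleb(-1))


def process(src_path, out_path, checkpoint_path):
    """Append converted lines to out_path, resuming after the last recorded line."""
    done = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as cp:
            done = int(cp.read().strip() or 0)

    with open(src_path) as src, open(out_path, "a") as out:
        for line_no, line in enumerate(src, start=1):
            if line_no <= done or not line.strip():
                continue  # already processed on an earlier run, or blank
            letter, number = transform(line)
            out.write(f"{letter},{number}\n")
            out.flush()
            # Record progress only after the output line has been written.
            with open(checkpoint_path, "w") as cp:
                cp.write(str(line_no))
```

Rewriting the checkpoint after every line is the simplest form but is slow for millions of lines; recording it every few thousand lines is a reasonable trade-off if re-processing a small batch after a crash is acceptable.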
This guide should give you a good start with working with a database (presuming SQL).
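For the database side, here is a minimal Python sketch that assumes SQLite (via the standard sqlite3 module) and two hypothetical tables, letters and numbers, that share a row id. The question does not say which database is used, so the table and column names are illustrative only.

```python
import sqlite3


def store_pairs(conn, pairs):
    """Insert (letter, number) string pairs into two tables that share a row id."""
    conn.execute("CREATE TABLE IF NOT EXISTS letters (id INTEGER PRIMARY KEY, letter TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS numbers (id INTEGER PRIMARY KEY, number TEXT)")
    with conn:  # one transaction: commit on success, roll back on error
        for letter, number in pairs:
            cur = conn.execute("INSERT INTO letters (letter) VALUES (?)", (letter,))
            # Reuse the id generated for the letter so the two rows stay related.
            conn.execute(
                "INSERT INTO numbers (id, number) VALUES (?, ?)",
                (cur.lastrowid, number),
            )


conn = sqlite3.connect(":memory:")
store_pairs(conn, [("A", "1234.5"), ("A", "1455.3"), ("B", "2352.4")])
rows = conn.execute(
    "SELECT l.id, l.letter, n.number FROM letters l "
    "JOIN numbers n ON n.id = l.id ORDER BY l.id"
).fetchall()
print(rows)
```

Grouping many inserts into one transaction matters for a file of this size, since committing after every row is usually much slower.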
Warning, all of this code is untested and should hopefully give you more direction. There are probably many other ways to solve this differently, so keep an open mind.
I would really appreciate any help you could provide me solving this problem.
I have two input .csv files in python. One of them, is a file that has different machine codes (one per line):
epdmolpd37901v,
epzsttom35101v,
epdsrptb36301v,
ppdpedbm07903,
ppdtrnod07202a,
These codes have a way to be translated:
The first letter is the environment of the machines:
P - Production
E - External User Acceptance
The following 2 letters are the location:
PD - Primary Data Center
PZ - Primary DMZ
The following 5 letters are the description, and that successively...
In another .csv file I have two-column rows that hold the translations of these acronyms:
P,Production
E,External User Acceptance
PD, Primary Data Center
PZ, Primary DMZ
etc...
How do I generate a different .csv file that will write the following information?
epdmolpd37901v, External User Acceptance, Primary Data Center, etc...
epzsttom35101v, External User Acceptance, Primary DMZ, etc...
etc...
The code has to read the first character from one .csv file, then go to the other .csv file and search the first column for the same character and its translation, and finally write the translation of the full code to a new .csv file.
Can anyone help me with this? I am a beginner and would really appreciate any guidance.
My code right now (needs some work):
Thank you very much for your time!
You do not specify language of your choice, so I would analyze this as pseudocode. I assume here that acronyms are UNIQUE so you can't have "P" as first position and "P" at a different one.
Read original file into a list "of"
Read translation file into a list "tf"
; do translation file first
for each line in translation file {
    split the line at "," and put each part in an array or hash list
    ; allows you to look up a value
}
; create a function that has input "acronym" and output "translation"
function translateit(acr as string) : as string
    { look it up in structure and return string }
; create the output file "outfile"
outfile = create_a_file()
; do the original file
for each line in original file {
    writetooutput (original line)
    writetooutput( translateit (left 1 ) ) ; char 1
    writetooutput( translateit (mid 2,2) ) ; char 2,3
    ; ...
}
This way you "translate" the original line, piece by piece, as desired.
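Since the question mentions Python, the pseudocode above could look roughly like this. It is a sketch under the same assumption that acronyms are unique; the function names are made up for the example, and only the first two pieces (environment and location) are translated because the remaining positions are not fully described in the question.

```python
import csv
import io


def load_translations(lines):
    """Build {acronym: meaning} from rows like 'PD, Primary Data Center'."""
    table = {}
    for row in csv.reader(lines):
        if len(row) >= 2:
            table[row[0].strip().upper()] = row[1].strip()
    return table


def translate_code(code, table):
    """Return the code followed by the translation of each known piece."""
    code = code.strip().rstrip(",")
    # char 1 = environment, chars 2-3 = location; later slices follow the same pattern
    pieces = [code[0:1], code[1:3]]
    return [code] + [table.get(p.upper(), "UNKNOWN") for p in pieces]


table = load_translations([
    "P,Production",
    "E,External User Acceptance",
    "PD, Primary Data Center",
    "PZ, Primary DMZ",
])
out = io.StringIO()  # stands in for the output .csv file
writer = csv.writer(out, lineterminator="\n")
for line in ["epdmolpd37901v,", "epzsttom35101v,"]:
    writer.writerow(translate_code(line, table))
print(out.getvalue())
```

The machine codes are lowercase while the acronyms are uppercase, so the lookup normalises both to uppercase.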
I'm trying to format a data file so that my other program will properly handle it. I am trying to handle the following data and I am getting a very weird error that I can't seem to put my finger on.
https://snap.stanford.edu/data/wiki-RfA.html
I am trying to format the data as [SRC TGT VOT], so I'd like the first two lines of my output file to be
1 2 1
3 2 1
because user 1 (stored in dictionary of users first) votes for user 2 with VOT 1 and then user 3 votes for user 2 with VOT 1. My problem is that when I try to run my code below, I always end up getting a very strange "invalid ascii sequence" error- can anyone help me identify the issue or perhaps find a way around this? It'd obviously be best if I could learn what I am doing wrong. Thank you!
Note, I understand that this is a bit specific of a question and I appreciate any help- I'm sort of baffled by this error and don't know how to resolve it at the moment.
f=open("original_vote_data.txt") #this is the file linked above
arr=readlines(f)
i=edge_count=src=tgt=vot=1
dict=Dict{ASCIIString, Int64}()
edges=["" for k=1:198275]
while i<1586200
    src_temp=(arr[i])[5:end-2]
    if (haskey(dict, src_temp))
        new_src= dict[src_temp]
    else
        dict[src_temp]=src
        new_src=src
        src=src+1
    end
    tgt_temp=(arr[i+1])[5:end-2]
    if (haskey(dict, tgt_temp))
        new_tgt= dict[tgt_temp]
    else
        dict[tgt_temp]=tgt
        new_tgt=tgt
        tgt=tgt+1
    end
    vot_temp=(arr[i+2])[5]
    edges[edge_count]=string(new_src)* " " * string(new_tgt)* " " *string(vot_temp)
    edge_count=edge_count+1
    i=i+8
end
Here we go - I'll write up my comment as an answer since it seems to have solved the question.
My hunch that the error stemmed from the fourth line (dict=Dict{ASCIIString, Int64}) was based on the fact that ASCIIStrings will error if you try to store non-ASCII characters in them. Since this file is coming from an international site, it's not unlikely that there are users with unicode characters in their names (or elsewhere in the data). So the simple fix is to change all instances of ASCIIString to UTF8String.
Just to make this answer a bit more complete, I downloaded the file and tried running the program. The simplest way to debug this is to run the script at top-level in the REPL and then inspect the program state after the error. After the error is thrown, i==3017. Now just try running each line of the while loop incrementally. You'll quickly see that line 3017 contains "SRC:Guðsþegn\n" — unicode, as I suspected. When you try to create a new entry in dict with that as the key, the error should have a backtrace to setindex! in dict.jl, where you'll see that it's trying to convert the key (a UTF8String) to an ASCIIString. So changing the dictionary type to have UTF8String keys solves the problem.
As it turns out, the edges array only contains strings of three integers (or sometimes a hyphen), so the ASCIIString there is ok, but still a little dangerous. I'd probably store that information in a more dedicated array of ints instead of converting it to a space-separated string: you know the first two elements in the string are ints, but the last element is unvalidated text from the file itself… which may be unicode or a space itself (which could mess up processing down the line).
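The suggestion to keep the edges as integers rather than a space-separated string could be sketched like this. It is written in Python rather than Julia, the function name is hypothetical, and it assumes each record contains SRC:, TGT: and VOT: lines as in the linked data set. It also uses a single id counter for all users, which is a design choice of this sketch.

```python
def parse_votes(lines):
    """Map user names to integer ids and return (src, tgt, vote) integer triples."""
    ids = {}      # user name -> integer id, assigned in order of first appearance
    edges = []
    record = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line[:4] in ("SRC:", "TGT:", "VOT:"):
            record[line[:3]] = line[4:]
        if len(record) == 3:
            src = ids.setdefault(record["SRC"], len(ids) + 1)
            tgt = ids.setdefault(record["TGT"], len(ids) + 1)
            # int() validates the vote and keeps the sign of "-1".
            edges.append((src, tgt, int(record["VOT"])))
            record = {}
    return ids, edges
```

When reading the real file, opening it with `open(path, encoding="utf-8")` lets names such as Guðsþegn through unchanged, and parsing the whole VOT value avoids the stray hyphen that comes from taking only its first character.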
I have a huge list of addresses and details I need to convert into an Excel spreadsheet and I think the best way would be to read the data and then write a second document that separates the lines so that they are tab-delimited whilst recognizing blank lines (between data entries) to preserve each separate address.
It is in the format:
AddressA1
AddressB1
Postcode1
Name1
PhoneNumber1
AddressA2
AddressB2
Postcode2
Name2
Name2
PhoneNumber2
AddressA3
AddressB3
Postcode3
Name3
PhoneNumber3
So the difficulty also comes when there are multiple names for a company, but I can hand format those if necessary (ideally they want to take on the same address as each other).
The resulting text document then, wants to be tab-delimited to:
Name|AddressA|AddressB|Postcode|Phone Number
I am thinking this would be easiest to do within a simple .bat command? or should I open the list in excel and run a script through that..?
I'm thinking if I can run through where it adds each entry to an array ($address $name etc) then I can use that to build a new text file by writing $name[i] tab $address[$i] etc
There are hundreds of entries and putting it in by hand is proving.. difficult.
I have some experience in MEL (basically C++) so I do understand programming in general, but somewhat at a loss in how .bat and Excel (VB?) handle and define empty lines and tabs.
The first step would be to bring the data into an Excel file. Once the data has been imported, we can repackage it to meet your specs. To import the file:
Sub BringFileIn()
    Dim TextLine As String, CH As String
    Close #1
    Open "C:\TestFolder\question.txt" For Input As #1
    Dim s As String
    Dim I As Long, J As Long
    J = 1
    I = 1
    Do While Not EOF(1)
        Line Input #1, TextLine
        Cells(I, J) = TextLine
        I = I + 1
    Loop
    Close #1
End Sub
Any text editor that can do regex search and replace across multiple lines can do the job nicely.
I have written a hybrid JScript/batch utility called REPL.BAT that performs a regex search and replace on stdin and writes the result to stdout. It is pure script that works on any modern Windows machine from XP forward - No 3rd party executable required. Full documentation is embedded within the script.
Assuming REPL.BAT is in your current directory, or better yet, somewhere within your PATH, then:
type file.txt|repl "\r?\n" "\t" mx|repl "\t\t" "\n" x|repl "^(([^\t]*\t){4})([^\t]*)$" "$1\t$3" x >newFile.txt
The above modifies the file in 3 steps and writes the result to a new file, leaving the original intact:
convert all newlines into tabs
convert consecutive tabs into newlines
insert an empty column (tab) before the last column on any line that contains only 5 columns.
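The same record-splitting idea can be sketched in Python for comparison. The sketch assumes each record is separated by a blank line, starts with two address lines and a postcode, ends with the phone number, and has one or more names in between; the function name is made up for the example. Instead of padding an empty column, it emits one row per name so that each name keeps the shared address.

```python
import re


def records_to_rows(text):
    """Yield [name, address_a, address_b, postcode, phone], one row per name."""
    for block in re.split(r"\n\s*\n", text.replace("\r\n", "\n")):
        fields = [ln.strip() for ln in block.split("\n") if ln.strip()]
        if len(fields) < 5:
            continue  # incomplete record; leave it for manual review
        address_a, address_b, postcode = fields[:3]
        phone = fields[-1]
        for name in fields[3:-1]:
            yield [name, address_a, address_b, postcode, phone]


sample = (
    "AddressA1\nAddressB1\nPostcode1\nName1\nPhoneNumber1\n"
    "\n"
    "AddressA2\nAddressB2\nPostcode2\nName2a\nName2b\nPhoneNumber2\n"
)
for row in records_to_rows(sample):
    print("\t".join(row))
```

The tab-separated output can be pasted or imported straight into Excel.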
Here's a method using only Word and Excel. I used the data that you posted. I am assuming that Name2 is the only optional field.
Paste your text into Word.
Replace all paragraph marks with a special character. (Ctrl-h, Search for ^p, Replace with |)
Replace all line breaks with a different special character. (Ctrl-h, Special character, search for Manual line break, replace with ;)
This is what it looks like in Word:
AddressA1;AddressB1;Postcode1;Name1;PhoneNumber1|AddressA2;AddressB2;Postcode2;Name2;Name2;PhoneNumber2|AddressA3;AddressB3;Postcode3;Name3;PhoneNumber3||
Then convert text to table (Insert -> Table -> Convert text to table), delimiting by | with a single column. This gives 3 rows (plus 2 blank rows) of 1 column.
Then copy the table.
Now in Excel:
Paste the table. (It'll be one row in each row, with all of your fields in column A.)
Convert the text to columns (Data tab, Text to columns, Delimited, check semicolon)
Sort by column E. The phone numbers should be grouped together.
Cut the phone numbers in column E and paste them into column F.
I ran a query on an MS SQL database using SQL Server Management Studio, and some of the fields contained new lines. I chose to save the result as a CSV, and apparently MS SQL isn't smart enough to give me a correctly formatted CSV file.
Some of these fields with new lines are wrapped in quotes, but some aren't, and I'm not sure why (it seems to quote fields if they contain more than one new line, but not if they contain only one; thanks Microsoft, that's useful).
When I try to open this CSV in Excel, some of the rows are wrong because of the new lines: it thinks that one row is two rows.
How can I fix this?
I was thinking I could use a regex. Maybe something like:
/,[^,]*\n[^,]*,/
Problem with this is it matches the last element of one line and the 1st of the next line.
Here is an example csv that demonstrates the issue:
field a,field b,field c,field d,field e
1,2,3,4,5
test,computer,I like
pie,4,8
123,456,"7
8
9",10,11
a,b,c,d,e
A simple regex replacement won't work, but here's a solution based on preg_replace_callback:
function add_quotes($matches) {
    return preg_replace('~(?<=^|,)(?>[^,"\r\n]+\r?\n[^,]*)(?=,|$)~',
                        '"$0"',
                        $matches[0]);
}
$row_regex = '~^(?:(?:(?:"[^"]*")+|[^,]*)(?:,|$)){5}$~m';
$result = preg_replace_callback($row_regex, 'add_quotes', $source);
The secret to $row_regex is knowing ahead of time how many columns there are. It starts at the beginning of a line (^ in multiline mode) and consumes the next five things that look like fields. It's not as efficient as I'd like, because it always overshoots on the last column, consuming the "real" line separator and the first field of the next row before backtracking to the end of the line. If your documents are very large, that might be a problem.
If you don't know in advance how many columns there are, you can discover that by matching just the first row and counting the matches. Of course, that assumes the row doesn't contain any of the funky fields that caused the problem. If the first row contains column headers you shouldn't have to worry about that, or about legitimate quoted fields either. Here's how I did it:
preg_match_all('~\G,?[^,\r\n]++~', $source, $cols);
$row_regex = '~^(?:(?:(?:"[^"]*")+|[^,]*)(?:,|$)){' . count($cols[0]) . '}$~m';
Your sample data contains only linefeeds (\n), but I've allowed for DOS-style \r\n as well. (Since the file is generated by a Microsoft product, I won't worry about the older-Mac style CR-only separator.)
See an online demo
If you want a programmatic solution in Java, open the file using the OpenCSV library. If it is a manual operation, open the file in a text editor such as Vim and run a replace command. If it is a batch operation, you can use a Perl command to clean up the CRLFs.
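Another way to use the known column count, sketched here in Python with the standard csv module: parse the file normally (which already handles the properly quoted multi-line fields), then glue short rows onto the next row until the expected number of columns is reached. The function name is hypothetical, and the approach assumes the header row is intact and that a stray newline never falls in the last column, where it cannot be told apart from a real row break.

```python
import csv
import io


def repair_rows(rows, n_cols):
    """Merge rows that were split by an unquoted newline inside a field."""
    pending = []
    for row in rows:
        if not row:
            continue  # skip empty physical lines
        if not pending:
            pending = list(row)
        else:
            # The first cell of this row continues the last cell of the previous one.
            pending[-1] = pending[-1] + "\n" + row[0]
            pending.extend(row[1:])
        if len(pending) >= n_cols:
            yield pending
            pending = []
    if pending:
        yield pending  # trailing incomplete row, kept for inspection


sample = (
    "field a,field b,field c,field d,field e\n"
    "1,2,3,4,5\n"
    "test,computer,I like\n"
    "pie,4,8\n"
    '123,456,"7\n8\n9",10,11\n'
    "a,b,c,d,e\n"
)
rows = list(csv.reader(io.StringIO(sample, newline="")))
fixed = list(repair_rows(rows, n_cols=len(rows[0])))
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(fixed)  # the writer quotes multi-line fields
print(out.getvalue())
```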
I need to load a file into Lua variables.
Let's say I got
name address email
There is a space between each. I need the text file, which has many such lines in it, to be loaded into some kind of object, or at least each line should be cut into an array of strings divided by spaces.
Is this kind of job possible in Lua and how should I do this? I'm pretty new to Lua but I couldn't find anything relevant on Internet.
You want to read about Lua patterns, which are part of the string library. Here's an example function (not tested):
function read_addresses(filename)
    local database = { }
    for l in io.lines(filename) do
        local n, a, e = l:match '(%S+)%s+(%S+)%s+(%S+)'
        table.insert(database, { name = n, address = a, email = e })
    end
    return database
end
This function just grabs three substrings made up of nonspace (%S) characters. A real function would have some error checking to make sure the pattern actually matches.
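To illustrate the kind of error checking meant here, this is a sketch of the same parse in Python rather than Lua (the structure carries over directly). The function name and the choice to collect rejected line numbers are assumptions made for the example. Note that it is stricter than the Lua pattern above: it requires the whole line to be exactly three space-separated tokens.

```python
import re

# Three runs of non-space characters separated by whitespace, and nothing else.
LINE_RE = re.compile(r"(\S+)\s+(\S+)\s+(\S+)")


def read_addresses(lines):
    """Return (database, rejected): parsed records and the numbers of bad lines."""
    database, rejected = [], []
    for line_no, line in enumerate(lines, start=1):
        m = LINE_RE.fullmatch(line.strip())
        if m is None:
            rejected.append(line_no)  # wrong number of fields; report instead of guessing
            continue
        name, address, email = m.groups()
        database.append({"name": name, "address": address, "email": email})
    return database, rejected
```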
To expand on uroc's answer:
local file = io.open("filename.txt")
if file then
    for line in file:lines() do
        -- unpack (table.unpack in Lua 5.2 and later) turns the table returned by split into separate variables
        local name, address, email = unpack(line:split(" "))
        -- do something with that data
    end
    file:close()
else
    print("could not open filename.txt")
end
-- You'll need a split method; I recommend the Python-like version at http://lua-users.org/wiki/SplitJoin
-- (not provided here because of possible license issues)
This however won't cover the case that your names have spaces in them.
If you have control over the format of the input file, you will be better off storing the data in Lua format as described here.
If not, use the io library to open the file and then use the string library like:
local f = assert(io.open("foo.txt")) -- stop with an error if the file cannot be opened
while true do
    local l = f:read()
    if not l then break end
    print(l) -- use the string library to split the string
end
f:close()