Recording and seeking to CSV file positions in Golang - file

I need to read a CSV file and record the locations of lines with certain values into an array, then later go back and retrieve those lines in no particular order and with good performance, so random access.
My program uses csv.NewReader(file), but I see no way to get or set the file offset that it uses. I tried file.Seek(0, io.SeekCurrent) to return the file position, but it doesn't change between calls to reader.Read(). I also tried fmt.Printf("%+v %+v\n", reader, file) to see if anything stores the reader's file position, but I don't see it. I also don't know the best way to use the file position if I do find it.
Here's what I need to do:
file,_ = os.Open("stuff.csv")
reader = csv.NewReader(file)
//read file and record locations
for {
line,_ = reader.Read()
if wantToRememberLocation(line) {
locations = append(locations, getLocation()) //need this function
}
}
//then revisit certain lines
for {
reader.GoToLine(locations[random]) //need this function
line,_ = reader.Read()
doStuff(line)
}
Is there even a way to do this with the csv library, or will I have to write my own using more primitive file io functions?

Here's a solution using TeeReader. This example just saves all the positions and goes back and rereads some of them.
//set up some vars and readers to record position and length of each line
type Record struct {
Pos int64
Len int
}
var records []Record
var buf bytes.Buffer
var pos int64
file, err := os.Open("stuff.csv")
if err != nil {
	log.Fatal(err)
}
tr := io.TeeReader(file, &buf)
cr := csv.NewReader(tr)
//this assumes every record is exactly one line: no quoted newlines, no blank lines
for {
	//read csv data
	data, err := cr.Read()
	if err != nil {
		break
	}
	doStuff(data)
	//the raw bytes of this row are waiting in buf
	lineBytes, _ := buf.ReadBytes('\n')
	//record position and length; the length determines the position of the next row
	records = append(records, Record{Pos: pos, Len: len(lineBytes)})
	pos += int64(len(lineBytes))
}
//read random lines from file
for {
	i := someLineNumber()
	//use original file to fill a byte slice with exactly one line
	line := make([]byte, records[i].Len)
	if _, err := file.ReadAt(line, records[i].Pos); err != nil && err != io.EOF {
		break
	}
	//need a new parser for every line so it starts at the beginning
	lineParser := csv.NewReader(bytes.NewReader(line))
	data, _ := lineParser.Read()
	doStuff(data)
}
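As an alternative sketch: since Go 1.19, csv.Reader has an InputOffset() method that reports the byte offset of the end of the most recently read row, which is also the start of the next one. Assuming that Go version is available, the positions can be recorded without a TeeReader, and each record can be re-parsed through an io.SectionReader. The span type and the helper names below are made up for illustration.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strings"
)

// span is a hypothetical helper type: where one record starts and how long it is.
type span struct {
	Pos, Len int64
}

// indexRecords reads all records once and notes the byte range of each.
func indexRecords(r io.Reader) ([]span, error) {
	cr := csv.NewReader(r)
	var spans []span
	for {
		start := cr.InputOffset() // offset where the next record begins
		if _, err := cr.Read(); err == io.EOF {
			return spans, nil
		} else if err != nil {
			return nil, err
		}
		spans = append(spans, span{Pos: start, Len: cr.InputOffset() - start})
	}
}

// recordAt re-parses one record through random access (an *os.File works too).
func recordAt(ra io.ReaderAt, s span) ([]string, error) {
	return csv.NewReader(io.NewSectionReader(ra, s.Pos, s.Len)).Read()
}

// fieldsAt is a demo wrapper: index the content, then fetch record i.
func fieldsAt(content string, i int) ([]string, error) {
	spans, err := indexRecords(strings.NewReader(content))
	if err != nil {
		return nil, err
	}
	if i < 0 || i >= len(spans) {
		return nil, fmt.Errorf("record %d out of range", i)
	}
	return recordAt(strings.NewReader(content), spans[i])
}

func main() {
	// The third record contains a quoted newline and is still found.
	fields, err := fieldsAt("a,b\nc,d\n\"x\ny\",z\n", 2)
	fmt.Printf("%q %v\n", fields, err)
}
```

With a real file, the same *os.File can be passed to indexRecords first and to recordAt afterwards, because ReadAt does not depend on the current file offset.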

os.Open returns a File, which implements io.Seeker.
So you can do this to rewind the stream to the beginning:
_, err = file.Seek(0, io.SeekStart)
https://golang.org/src/os/file.go

Related

Expanding a temporary slice if more bytes are needed

I'm generating random files programmatically in a directory, at least temporaryFilesTotalSize worth of random data (a bit more, who cares).
Here's my code:
var files []string
for size := int64(0); size < temporaryFilesTotalSize; {
fileName := random.HexString(12)
filePath := dir + "/" + fileName
file, err := os.Create(filePath)
if err != nil {
return nil, err
}
size += rand.Int63n(1 << 32) // random dimension up to 4GB
raw := make([]byte, size)
_, err = rand.Read(raw)
if err != nil {
panic(err)
}
file.Write(raw)
file.Close()
files = append(files, filePath)
}
Is there any way I can avoid that raw := make([]byte, size) allocation in the for loop?
Ideally I'd like to keep a slice on the heap and only grow if a bigger size is required. Any way to do this efficiently?
First of all, you should know that generating random data and writing it to disk is at least an order of magnitude slower than allocating contiguous memory for a buffer. This definitely falls under the "premature optimization" category. Eliminating the creation of the buffer inside the iteration will not make your code noticeably faster.
Reusing the buffer
But to reuse the buffer, move it outside of the loop, create the biggest needed buffer, and slice it in each iteration to the needed size. It's OK to do this, because we'll overwrite the whole part we need with random data.
Note that I changed the size generation somewhat (there is likely an error in your code: the generated temporary files keep growing, because you use the accumulated size for each new one).
Also note that writing a file with contents prepared in a []byte is easiest done using a single call to os.WriteFile().
Something like this:
bigRaw := make([]byte, 1 << 32)
for totalSize := int64(0); ; {
size := rand.Int63n(1 << 32) // random dimension up to 4GB
totalSize += size
if totalSize >= temporaryFilesTotalSize {
break
}
raw := bigRaw[:size]
rand.Read(raw) // It's documented that rand.Read() always returns nil error
filePath := filepath.Join(dir, random.HexString(12))
if err := os.WriteFile(filePath, raw, 0666); err != nil {
panic(err)
}
files = append(files, filePath)
}
Solving the task without an intermediate buffer
Since you are writing big files (GBs), allocating that big buffer is not a good idea: running the app will require GBs of RAM! We could improve it with an inner loop to use smaller buffers until we write the expected size, which solves the big memory issue, but increases complexity. Luckily for us, we can solve the task without any buffers, and even with decreased complexity!
We should somehow "channel" the random data from a rand.Rand to the file directly, something similar to what io.Copy() does. Note that rand.Rand implements io.Reader, and os.File implements io.ReaderFrom, which suggests we could simply pass a rand.Rand to file.ReadFrom(), and the file itself would get the data to be written directly from rand.Rand.
This sounds good, but the ReadFrom() reads data from the given reader until EOF or error. Neither will ever happen if we pass rand.Rand. And we do know how many bytes we want to be read and written: size.
To our "rescue" comes io.LimitReader(): we pass an io.Reader and a size to it, and the returned reader will supply no more than the given number of bytes, and after that will report EOF.
Note that creating our own rand.Rand will also be faster as the source we pass to it will be created using rand.NewSource() which returns an "unsynchronized" source (not safe for concurrent use) which in turn will be faster! The source used by the default/global rand.Rand is synchronized (and so safe for concurrent use–but is slower).
Perfect! Let's see this in action:
r := rand.New(rand.NewSource(time.Now().Unix()))
for totalSize := int64(0); ; {
size := r.Int63n(1 << 32)
totalSize += size
if totalSize >= temporaryFilesTotalSize {
break
}
filePath := filepath.Join(dir, random.HexString(12))
file, err := os.Create(filePath)
if err != nil {
return nil, err
}
if _, err := file.ReadFrom(io.LimitReader(r, size)); err != nil {
panic(err)
}
if err = file.Close(); err != nil {
panic(err)
}
files = append(files, filePath)
}
Note that if os.File did not implement io.ReaderFrom, we could still use io.Copy(), providing the file as the destination, and a limited reader (used above) as the source.
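A minimal sketch of that io.Copy() variant, written against a plain io.Writer so it can be tried without touching the disk. The names writeRandom and randomLen are made up for this example.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"math/rand"
)

// writeRandom streams exactly n random bytes from r into w without
// allocating an n-byte buffer; io.Copy works through a small internal
// buffer, or through w's ReadFrom method when w has one.
func writeRandom(w io.Writer, r *rand.Rand, n int64) (int64, error) {
	return io.Copy(w, io.LimitReader(r, n))
}

// randomLen is a demo helper: it writes n bytes into memory and
// reports how many arrived (-1 on any mismatch or error).
func randomLen(n int64) int64 {
	var buf bytes.Buffer
	r := rand.New(rand.NewSource(1))
	written, err := writeRandom(&buf, r, n)
	if err != nil || written != int64(buf.Len()) {
		return -1
	}
	return written
}

func main() {
	fmt.Println(randomLen(1000))
}
```

Passing an *os.File as w gives the same result as the file.ReadFrom() call, just through the generic interface.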
Final note: closing the file (or any resource) is best done using defer, so it'll get called no matter what. Using defer in a loop is a bit tricky though, as deferred functions run at the end of the enclosing function, and not at the end of the loop's iteration. So you may wrap it in a function. For details, see `defer` in the loop - what will be better?
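One way to do that wrapping, as a sketch: move the per-file work into its own function so the deferred Close runs once per call, which means once per loop iteration. The names createRandomFile and makeFiles are invented here, and the demo writes small files into a temporary directory instead of using temporaryFilesTotalSize.

```go
package main

import (
	"fmt"
	"io"
	"math/rand"
	"os"
	"path/filepath"
)

// createRandomFile writes size random bytes to path. The deferred Close
// runs when this function returns, not when the caller's loop ends.
func createRandomFile(path string, r *rand.Rand, size int64) (err error) {
	file, err := os.Create(path)
	if err != nil {
		return err
	}
	defer func() {
		// Report a Close error unless an earlier error already occurred.
		if cerr := file.Close(); err == nil {
			err = cerr
		}
	}()
	_, err = file.ReadFrom(io.LimitReader(r, size))
	return err
}

// makeFiles is a demo helper: it creates n files of the given size in a
// fresh temporary directory and returns their total size on disk.
func makeFiles(n int, size int64) (int64, error) {
	dir, err := os.MkdirTemp("", "randfiles")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)
	r := rand.New(rand.NewSource(1))
	var total int64
	for i := 0; i < n; i++ {
		path := filepath.Join(dir, fmt.Sprintf("file%d", i))
		if err := createRandomFile(path, r, size); err != nil {
			return 0, err
		}
		info, err := os.Stat(path)
		if err != nil {
			return 0, err
		}
		total += info.Size()
	}
	return total, nil
}

func main() {
	fmt.Println(makeFiles(3, 1024))
}
```

Returning errors from the helper instead of panicking also lets the caller decide whether to stop or clean up.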

Golang dynamic sizing slice when reading a file using bufio.Read

I have a problem where I need to use bufio.Reader.Read to read a TSV file line by line, and I need to record how many bytes each line I've read is.
The problem is that I can't just initialize an empty slice, pass it into Read, and expect the slice to contain the entire line of the file.
file, _ := os.Open("file.tsv")
reader := bufio.NewReader(file)
b := make([]byte, 10)
for {
bytesRead, err:= reader.Read(b)
fmt.Println(bytesRead, b)
if err != nil {
break
}
}
So, for this example, since I specified the slice to be 10 bytes, the reader will read at most 10 bytes even if the line is bigger than 10 bytes.
However:
file, _ := os.Open("file.tsv")
reader := bufio.NewReader(file)
b := []byte{} //or var b []byte
for {
bytesRead, err:= reader.Read(b)
fmt.Println(bytesRead, b)
if err != nil {
break
}
}
This will always read 0 bytes, and I assume it's because the buffer has length 0 or capacity 0.
How do I read a file line by line, save the entire line in a variable or buffer, and return exactly how many bytes I've read?
Thanks!
If you want to read line by line, and you're using a buffered reader, use the buffered reader's ReadBytes method.
line,err := reader.ReadBytes('\n')
This will give you a full line, one line at a time, regardless of byte length.
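To also get the byte count the question asks for, the length of the returned slice is enough, since ReadBytes keeps the delimiter. A small sketch (lineLengths and lengthsOf are made-up names; a string stands in for the TSV file):

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// lineLengths returns the byte length of every line, delimiter included.
func lineLengths(r io.Reader) ([]int, error) {
	br := bufio.NewReader(r)
	var lengths []int
	for {
		line, err := br.ReadBytes('\n')
		if len(line) > 0 {
			// A final line without '\n' arrives together with io.EOF.
			lengths = append(lengths, len(line))
		}
		if err == io.EOF {
			return lengths, nil
		}
		if err != nil {
			return lengths, err
		}
	}
}

// lengthsOf is a demo wrapper that runs lineLengths over a string.
func lengthsOf(s string) []int {
	lengths, _ := lineLengths(strings.NewReader(s))
	return lengths
}

func main() {
	fmt.Println(lengthsOf("ab\ncde\nf")) // prints [3 4 1]
}
```

The same function works on an *os.File, since it only needs an io.Reader.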

Count similar array value

I'm trying to learn Go (or Golang) and can't seem to get it right. I have 2 text files, each containing a list of words. I'm trying to count the number of words that are present in both files.
Here is my code so far :
package main
import (
"fmt"
"log"
"net/http"
"bufio"
)
func stringInSlice(str string, list []string) bool {
for _, v := range list {
if v == str {
return true
}
}
return false
}
func main() {
// Texts URL
var list = "https://gist.githubusercontent.com/alexcesaro/c9c47c638252e21bd82c/raw/bd031237a56ae6691145b4df5617c385dffe930d/list.txt"
var url1 = "https://gist.githubusercontent.com/alexcesaro/4ebfa5a9548d053dddb2/raw/abb8525774b63f342e5173d1af89e47a7a39cd2d/file1.txt"
//Create storing arrays
var buffer [2000]string
var bufferUrl1 [40000]string
// Set a sibling counter
var sibling = 0
// Read and store text files
wordList, err := http.Get(list)
if err != nil {
log.Fatalf("Error while getting the url : %v", err)
}
defer wordList.Body.Close()
wordUrl1, err := http.Get(url1)
if err != nil {
log.Fatalf("Error while getting the url : %v", err)
}
defer wordUrl1.Body.Close()
streamList := bufio.NewScanner(wordList.Body)
streamUrl1 := bufio.NewScanner(wordUrl1.Body)
streamList.Split(bufio.ScanLines)
streamUrl1.Split(bufio.ScanLines)
var i = 0;
var j = 0;
//Fill arrays with each lines
for streamList.Scan() {
buffer[i] = streamList.Text()
i++
}
for streamUrl1.Scan() {
bufferUrl1[j] = streamUrl1.Text()
j++
}
//ERROR OCCURRING HERE :
// This code if i'm not wrong is supposed to compare through all the range of bufferUrl1 -> bufferUrl1 values with buffer values, then increment sibling and output FIND
for v := range bufferUrl1{
if stringInSlice(bufferUrl1, buffer) {
sibling++
fmt.Println("FIND")
}
}
// For testing purposes, these lines print both arrays
// fmt.Println(buffer)
// fmt.Println(bufferUrl1)
}
But right now, my build doesn't even succeed. I'm only greeted with this message:
.\hello.go:69: cannot use bufferUrl1 (type [40000]string) as type string in argument to stringInSlice
.\hello.go:69: cannot use buffer (type [2000]string) as type []string in argument to stringInSlice
bufferUrl1 is an array: [40000]string. You meant to pass each string in bufferUrl1, but v is the first range variable, which is the index. The string is the second variable; the code below ignores the index using _.
type [2000]string is different from []string. In Go, arrays and slices are not the same. Read Go Slices: usage and internals. I've changed both variable declarations to use slices with the same initial length using make.
These are changes you need to make to compile.
Declarations:
// Create storing slices
buffer := make([]string, 2000)
bufferUrl1 := make([]string, 40000)
and the loop on Line 69:
for _, s := range bufferUrl1 {
if stringInSlice(s, buffer) {
sibling++
fmt.Println("FIND")
}
}
As a side-note, consider using a map instead of a slice for buffer for more efficient lookup instead of looping through the list in stringInSlice.
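A rough sketch of that map-based lookup, using a set built from the word list. The function name countCommon and the sample words are made up; like the loop above, it counts every entry of the second list that appears in the first, duplicates included.

```go
package main

import "fmt"

// countCommon reports how many entries of words also appear in list.
// The set gives constant-time lookups instead of scanning list each time.
func countCommon(list, words []string) int {
	set := make(map[string]struct{}, len(list))
	for _, w := range list {
		set[w] = struct{}{}
	}
	n := 0
	for _, w := range words {
		if _, ok := set[w]; ok {
			n++
		}
	}
	return n
}

func main() {
	list := []string{"a", "b", "c"}
	words := []string{"b", "x", "c", "b"}
	fmt.Println(countCommon(list, words)) // prints 3
}
```

Note that with slices preallocated to a fixed length, unused entries are empty strings and would count as matches; appending each scanned line to an initially empty slice avoids that.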
https://play.golang.org/p/UcaSVwYcIw has the fix for the comments below (you won't be able to make HTTP requests from the Playground).

Jump to specific line in file in Go

In Go is it possible to jump to particular line number in a file and delete it? Something like linecache in python.
I'm trying to match some substrings in a file and remove the corresponding lines. The matching part I've taken care of and I have an array with line numbers I need to delete but I'm stuck on how to delete the matching lines in the file.
This is an old question, but if anyone is looking for a solution, I wrote a package (github.com/stoicperlman/fls) that handles going to any line in a file. It can open a file and seek to any line position without reading the whole file into memory and splitting.
import "github.com/stoicperlman/fls"
// This is just a wrapper around os.OpenFile. Alternatively
// you could open from os.File and use fls.LineFile(file) to get f
f, err := fls.OpenFile("test.log", os.O_CREATE|os.O_WRONLY, 0600)
defer f.Close()
// return beginning of line 1/beginning of file
// equivalent to f.Seek(0, io.SeekStart)
pos, err := f.SeekLine(0, io.SeekStart)
// return beginning of line 2
pos, err = f.SeekLine(1, io.SeekStart)
// return beginning of last line
pos, err = f.SeekLine(0, io.SeekEnd)
// return beginning of second to last line
pos, err = f.SeekLine(-1, io.SeekEnd)
Unfortunately I'm not sure how you would delete, this just handles getting you to the correct position in the file. For your case you could use it to go to the line you want to delete and save the position. Then seek to the next line and save that as well. You now have the bookends of the line to delete.
// might want lineToDelete - 1
// this acts like 0 based array
pos1, err := f.SeekLine(lineToDelete, io.SeekStart)
// skip ahead 1 line
pos2, err := f.SeekLine(1, io.SeekCurrent)
// pos2 will be the position of the first character in next line
// might want pos2 - 1 depending on how the function works
DeleteBytesFromFileFunction(f, pos1, pos2)
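DeleteBytesFromFileFunction is only a placeholder. One possible way to fill it in, using nothing but the standard library, is to move everything after the deleted range forward and then truncate the file. This is a sketch under the assumption that the file is opened for both reading and writing (without O_APPEND) and that the tail fits in memory; deleteRange and deleteDemo are made-up names.

```go
package main

import (
	"fmt"
	"io"
	"os"
)

// deleteRange removes the bytes in [start, end) from f by moving the
// tail of the file forward and truncating what is left over.
func deleteRange(f *os.File, start, end int64) error {
	info, err := f.Stat()
	if err != nil {
		return err
	}
	if start < 0 || end < start || end > info.Size() {
		return fmt.Errorf("invalid range [%d, %d)", start, end)
	}
	// Reads the whole tail into memory; copy in chunks for very large files.
	tail := make([]byte, info.Size()-end)
	if _, err := f.ReadAt(tail, end); err != nil && err != io.EOF {
		return err
	}
	if _, err := f.WriteAt(tail, start); err != nil {
		return err
	}
	return f.Truncate(start + int64(len(tail)))
}

// deleteDemo writes content to a temporary file, deletes the range,
// and returns what the file contains afterwards.
func deleteDemo(content string, start, end int64) (string, error) {
	f, err := os.CreateTemp("", "deleterange")
	if err != nil {
		return "", err
	}
	defer os.Remove(f.Name())
	defer f.Close()
	if _, err := f.WriteString(content); err != nil {
		return "", err
	}
	if err := deleteRange(f, start, end); err != nil {
		return "", err
	}
	out, err := os.ReadFile(f.Name())
	return string(out), err
}

func main() {
	// Bytes 4..8 hold "two\n", the second line.
	out, err := deleteDemo("one\ntwo\nthree\n", 4, 8)
	fmt.Printf("%q %v\n", out, err)
}
```

Here start and end play the role of pos1 and pos2: the start of the line to delete and the start of the line after it.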
Based on my read of the linecache module it takes a file and explodes it into an array based on '\n' line endings. You could replicate the same behavior in Go by using strings or bytes. You could also use the bufio library to read a file line by line and only store or save the lines you want.
package main
import (
	"bytes"
	"fmt"
	"io/ioutil"
)
func main() {
	b, e := ioutil.ReadFile("filename.txt")
	if e != nil {
		panic(e)
	}
	array := bytes.Split(b, []byte("\n"))
	fmt.Printf("%v", array)
}
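The bufio variant mentioned above could look roughly like this: stream the input line by line and copy only the lines that are not marked for deletion. dropLines and dropFrom are made-up names, line numbers are assumed to be 1-based, and the default Scanner limit of 64 KB per line applies.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// dropLines copies r to w, skipping the 1-based line numbers in drop.
// Every kept line is written with a "\n" terminator, so "\r\n" endings
// are normalized and a missing final newline is added.
func dropLines(r io.Reader, w io.Writer, drop map[int]bool) error {
	sc := bufio.NewScanner(r)
	bw := bufio.NewWriter(w)
	for n := 1; sc.Scan(); n++ {
		if drop[n] {
			continue
		}
		if _, err := bw.WriteString(sc.Text() + "\n"); err != nil {
			return err
		}
	}
	if err := sc.Err(); err != nil {
		return err
	}
	return bw.Flush()
}

// dropFrom is a demo wrapper that works on strings.
func dropFrom(s string, lines ...int) string {
	drop := make(map[int]bool)
	for _, n := range lines {
		drop[n] = true
	}
	var sb strings.Builder
	if err := dropLines(strings.NewReader(s), &sb, drop); err != nil {
		return ""
	}
	return sb.String()
}

func main() {
	fmt.Printf("%q\n", dropFrom("a\nb\nc\nd\n", 2, 4)) // prints "a\nc\n"
}
```

For a real file, a common pattern is to write the kept lines to a temporary file in the same directory and rename it over the original once the copy has succeeded.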
I wrote a small function that lets you remove a specific line from a file.
package main
import (
"io/ioutil"
"os"
"strings"
)
func main() {
path := "path/to/file.txt"
removeLine(path, 2)
}
func removeLine(path string, lineNumber int) {
file, err := ioutil.ReadFile(path)
if err != nil {
panic(err)
}
info, _ := os.Stat(path)
mode := info.Mode()
array := strings.Split(string(file), "\n")
array = append(array[:lineNumber], array[lineNumber+1:]...)
ioutil.WriteFile(path, []byte(strings.Join(array, "\n")), mode)
}

How to read a file starting from a specific line number using Scanner?

I am new to Go and I am trying to write a simple script that reads a file line by line. I also want to save the progress (i.e. the last line number that was read) on the filesystem somewhere so that if the same file was given as the input to the script again, it starts reading the file from the line where it left off. Following is what I have started off with.
package main
// Package Imports
import (
"bufio"
"flag"
"fmt"
"log"
"os"
)
// Variable Declaration
var (
ConfigFile = flag.String("configfile", "../config.json", "Path to json configuration file.")
)
// The main function that reads the file and parses the log entries
func main() {
flag.Parse()
settings := NewConfig(*ConfigFile)
inputFile, err := os.Open(settings.Source)
if err != nil {
log.Fatal(err)
}
defer inputFile.Close()
scanner := bufio.NewScanner(inputFile)
for scanner.Scan() {
fmt.Println(scanner.Text())
}
if err := scanner.Err(); err != nil {
log.Fatal(err)
}
}
// Saves the current progress
func SaveProgress() {
}
// Get the line count from the progress to make sure
func GetCounter() {
}
I could not find any methods that deal with line numbers in the scanner package. I know I can declare an integer, say counter := 0, and increment it each time a line is read with counter++. But the next time, how do I tell the scanner to start from a specific line? For example, if I read up to line 30, how can I make the scanner start reading from line 31 the next time I run the script with the same input file?
Update
One solution I can think of here is to use the counter as I stated above and use an if condition like the following.
scanner := bufio.NewScanner(inputFile)
for scanner.Scan() {
	counter++
	if counter > progress {
		fmt.Println(scanner.Text())
	}
}
I am pretty sure something like this would work, but it is still going to loop over the lines that we have already read. Please suggest a better way.
If you don't want to read but just skip the lines you read previously, you need to acquire the position where you left off.
The different solutions are presented in a form of a function which takes the input to read from and the start position (byte position) to start reading lines from, e.g.:
func solution(input io.ReadSeeker, start int64) error
A special io.Reader input is used which also implements io.Seeker, the common interface which allows skipping data without having to read them. *os.File implements this, so you are allowed to pass a *File to these functions. Good. The "merged" interface of both io.Reader and io.Seeker is io.ReadSeeker.
If you want a clean start (to start reading from the beginning of the file), simply pass start = 0. If you want to resume a previous processing, pass the byte position where the last processing was stopped/aborted. This position is the value of the pos local variable in the functions (solutions) below.
All the examples below with their testing code can be found on the Go Playground.
1. With bufio.Scanner
bufio.Scanner does not maintain the position, but we can very easily extend it to maintain the position (the read bytes), so when we want to restart next, we can seek to this position.
In order to do this with minimal effort, we can use a new split function which splits the input into tokens (lines). We can use Scanner.Split() to set the splitter function (the logic to decide where are the boundaries of tokens/lines). The default split function is bufio.ScanLines().
Let's take a look at the split function declaration: bufio.SplitFunc
type SplitFunc func(data []byte, atEOF bool) (advance int, token []byte, err error)
It returns the number of bytes to advance: advance. Exactly what we need to maintain the file position. So we can create a new split function using the builtin bufio.ScanLines(), so we don't even have to implement its logic, just use the advance return value to maintain position:
func withScanner(input io.ReadSeeker, start int64) error {
fmt.Println("--SCANNER, start:", start)
if _, err := input.Seek(start, 0); err != nil {
return err
}
scanner := bufio.NewScanner(input)
pos := start
scanLines := func(data []byte, atEOF bool) (advance int, token []byte, err error) {
advance, token, err = bufio.ScanLines(data, atEOF)
pos += int64(advance)
return
}
scanner.Split(scanLines)
for scanner.Scan() {
fmt.Printf("Pos: %d, Scanned: %s\n", pos, scanner.Text())
}
return scanner.Err()
}
2. With bufio.Reader
In this solution we use the bufio.Reader type instead of the Scanner. bufio.Reader already has a ReadBytes() method which is very similar to the "read a line" functionality if we pass the '\n' byte as the delimiter.
This solution is similar to JimB's, with the addition of handling all valid line terminator sequences and also stripping them off from the read line (it is very rare they are needed); in regular expression notation, it is \r?\n.
func withReader(input io.ReadSeeker, start int64) error {
fmt.Println("--READER, start:", start)
if _, err := input.Seek(start, 0); err != nil {
return err
}
r := bufio.NewReader(input)
pos := start
for {
data, err := r.ReadBytes('\n')
pos += int64(len(data))
if err == nil || err == io.EOF {
if len(data) > 0 && data[len(data)-1] == '\n' {
data = data[:len(data)-1]
}
if len(data) > 0 && data[len(data)-1] == '\r' {
data = data[:len(data)-1]
}
fmt.Printf("Pos: %d, Read: %s\n", pos, data)
}
if err != nil {
if err != io.EOF {
return err
}
break
}
}
return nil
}
Note: If the content ends with an empty line (line terminator), this solution will process an empty line. If you don't want this, you can simply check it like this:
if len(data) != 0 {
fmt.Printf("Pos: %d, Read: %s\n", pos, data)
} else {
// Last line is empty, omit it
}
Testing the solutions:
Testing code will simply use the content "first\r\nsecond\nthird\nfourth", which contains multiple lines with varying line terminators. We will use strings.NewReader() to obtain an io.ReadSeeker whose source is a string.
Test code first calls withScanner() and withReader() passing 0 start position: a clean start. In the next round we will pass a start position of start = 14, which is the position of the third line, so we won't see the first 2 lines processed (printed): resume simulation.
func main() {
const content = "first\r\nsecond\nthird\nfourth"
if err := withScanner(strings.NewReader(content), 0); err != nil {
fmt.Println("Scanner error:", err)
}
if err := withReader(strings.NewReader(content), 0); err != nil {
fmt.Println("Reader error:", err)
}
if err := withScanner(strings.NewReader(content), 14); err != nil {
fmt.Println("Scanner error:", err)
}
if err := withReader(strings.NewReader(content), 14); err != nil {
fmt.Println("Reader error:", err)
}
}
Output:
--SCANNER, start: 0
Pos: 7, Scanned: first
Pos: 14, Scanned: second
Pos: 20, Scanned: third
Pos: 26, Scanned: fourth
--READER, start: 0
Pos: 7, Read: first
Pos: 14, Read: second
Pos: 20, Read: third
Pos: 26, Read: fourth
--SCANNER, start: 14
Pos: 20, Scanned: third
Pos: 26, Scanned: fourth
--READER, start: 14
Pos: 20, Read: third
Pos: 26, Read: fourth
Try the solutions and testing code on the Go Playground.
Instead of using a Scanner, use a bufio.Reader, specifically the ReadBytes or ReadString methods. This way you can read up to each line termination, and still receive the full line with line endings.
r := bufio.NewReader(inputFile)
var line []byte
var fPos int64 // or saved position
for i := 1; ; i++ {
	line, err = r.ReadBytes('\n')
	fmt.Printf("[line:%d pos:%d] %q\n", i, fPos, line)
	if err != nil {
		break
	}
	fPos += int64(len(line))
}
}
if err != io.EOF {
log.Fatal(err)
}
You can store the combination of file position and line number however you choose, and the next time you start, you use inputFile.Seek(fPos, io.SeekStart) to move to where you left off (fPos must be an int64 for Seek; io.SeekStart replaces the deprecated os.SEEK_SET).
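How the position is stored is up to you. As one simple possibility, the byte offset can be kept as decimal text in a small file next to the input. The sketch below assumes that layout; saveProgress, loadProgress and roundTrip are made-up names and not part of the question's SaveProgress/GetCounter stubs.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// saveProgress stores the byte offset as decimal text in path.
func saveProgress(path string, pos int64) error {
	return os.WriteFile(path, []byte(strconv.FormatInt(pos, 10)), 0644)
}

// loadProgress returns the saved offset, or 0 when no progress file
// exists yet (a clean start).
func loadProgress(path string) (int64, error) {
	b, err := os.ReadFile(path)
	if os.IsNotExist(err) {
		return 0, nil
	}
	if err != nil {
		return 0, err
	}
	return strconv.ParseInt(strings.TrimSpace(string(b)), 10, 64)
}

// roundTrip is a demo helper: it saves pos in a temporary directory
// and reads it back.
func roundTrip(pos int64) (int64, error) {
	dir, err := os.MkdirTemp("", "progress")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "progress.txt")
	if first, err := loadProgress(path); err != nil || first != 0 {
		return 0, fmt.Errorf("expected a clean start, got %d, %v", first, err)
	}
	if err := saveProgress(path, pos); err != nil {
		return 0, err
	}
	return loadProgress(path)
}

func main() {
	fmt.Println(roundTrip(14))
}
```

The loaded value can be passed straight to Seek as the start offset on the next run.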
If you want to use Scanner, you have to go through the beginning of the file until you have passed GetCounter() line endings.
scanner := bufio.NewScanner(inputFile)
// context line above
// skip first GetCounter() lines
for i := 0; i < GetCounter(); i++ {
scanner.Scan()
}
// context line below
for scanner.Scan() {
fmt.Println(scanner.Text())
}
Alternatively you could store offset instead of line number in the counter but remember that termination token is stripped when using Scanner and for new line the token is \r?\n (regexp notation) so it isn't clear if you should add 1 or 2 to the text length:
// Not clear how to store offset unless custom SplitFunc provided
inputFile.Seek(GetCounter(), 0)
scanner := bufio.NewScanner(inputFile)
So it is better to use the previous solution or to not use Scanner at all.
The other answers are wordy and their code is not really reusable, so here's a reusable function that scans to the given line number (starting at 1) and returns the line and the offset where it starts.
func SeekToLine(r io.Reader, lineNo int) (line []byte, offset int, err error) {
s := bufio.NewScanner(r)
var pos int
s.Split(func(data []byte, atEof bool) (advance int, token []byte, err error) {
advance, token, err = bufio.ScanLines(data, atEof)
pos += advance
return advance, token, err
})
for i := 0; i < lineNo; i++ {
offset = pos
if !s.Scan() {
return nil, 0, io.EOF
}
}
return s.Bytes(), offset, nil
}

Resources