Fseek to a line number (with lines of variable length) in C

I have a large file (~10GB) with variable length lines, and I would like to programmatically go to different line numbers. Is there an efficient way to do so?

Yes: build an index. For example, in a single one-time pass you can create a side text file containing the byte offsets of selected line numbers, like this:
line,offset
0,0
10000,48272
20000,93726
And so on. Then when you want to go to line 13043, just jump to offset 48272 and skip another 3043 newlines. Simple and efficient.
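For concreteness, here is a minimal sketch of the lookup side, assuming the index has already been parsed into two parallel arrays (idx_lines, idx_offsets) sorted by line number; the names and the linear scan over the index are illustrative, and for files beyond 2 GB you would use fseeko()/ftello() with off_t rather than long:

#include <stdio.h>

/* Jump to a 0-based line number using a sparse index of
 * (line number, byte offset) pairs sorted by line number. */
int seek_to_line(FILE *f, long target,
                 const long *idx_lines, const long *idx_offsets, size_t n)
{
    /* Find the last index entry at or before the target line
     * (a binary search would also work here). */
    size_t i = 0;
    while (i + 1 < n && idx_lines[i + 1] <= target)
        i++;

    if (fseek(f, idx_offsets[i], SEEK_SET) != 0)
        return -1;

    /* Skip the remaining newlines one by one. */
    long remaining = target - idx_lines[i];
    int c;
    while (remaining > 0 && (c = fgetc(f)) != EOF)
        if (c == '\n')
            remaining--;

    return remaining == 0 ? 0 : -1;
}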
Another approach would be to make your line lengths constant. This would work well if they already have similar lengths so you don't waste too much space. You can pad them out with \0 characters or spaces or whatever, then index the file like a big matrix (line N is at N*LEN bytes).
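The fixed-length variant reduces to a single seek. A minimal sketch, assuming a record length of LINE_LEN bytes including the trailing newline (the value 80 is made up for this example):

#include <stdio.h>

#define LINE_LEN 80  /* assumed fixed record length, newline included */

/* With fixed-length records, line n starts at byte n * LINE_LEN. */
int seek_to_fixed_line(FILE *f, long n)
{
    return fseek(f, n * (long)LINE_LEN, SEEK_SET);
}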
Finally, you could simply write the line numbers at the beginning of the lines themselves. Then just binary-search within the file, skip to a newline, and inspect the next line number to know whether to look backward or forward (and even guess by how much).
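A sketch of that last idea, assuming each line begins with its decimal line number, so lines appear in increasing numeric order; the helper next_line_start() and all names are inventions of this example, and error handling is minimal:

#include <stdio.h>

/* Return the byte offset of the first line starting at or after 'from'.
 * A line starts at offset 0 or right after a '\n'. */
static long next_line_start(FILE *f, long from)
{
    if (from == 0)
        return 0;
    fseek(f, from - 1, SEEK_SET);
    int c;
    while ((c = fgetc(f)) != EOF && c != '\n')
        ;
    return c == EOF ? -1 : ftell(f);
}

/* Binary-search a file whose lines start with their own line number.
 * Returns the byte offset of the target line, or -1 if not found. */
long find_line(FILE *f, long target, long file_size)
{
    long lo = 0, hi = file_size;   /* target starts somewhere in [lo, hi) */
    while (lo < hi) {
        long mid = lo + (hi - lo) / 2;
        long start = next_line_start(f, mid);
        if (start < 0 || start >= hi) {   /* no line starts in [mid, hi) */
            hi = mid;
            continue;
        }
        long num;
        fseek(f, start, SEEK_SET);
        if (fscanf(f, "%ld", &num) != 1)  /* line must begin with a number */
            return -1;
        if (num == target)
            return start;
        if (num < target)
            lo = start + 1;        /* target lies strictly after this line */
        else
            hi = start;            /* target lies strictly before it */
    }
    return -1;
}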

There is no efficient way to do so without preparation. You need to scan the entire file once to record where the end-of-line markers are.
Pragmatically, that means a big loop around e.g. getline(3).
You could memoize e.g. the offset of every 100th line, perhaps in a big array or in some indexed file using GDBM or an SQLite database.
My feeling is that you should not have such a huge text file in the first place: randomly accessing a huge text file is the symptom of something wrong. A text file is not an efficient way to store data that you need to access randomly. You could, for example, predigest it to fill some database. Probably you should not put such a large piece of data in a text file at all, but directly into a database or similar.

Not directly with fseek, since it can only move the position by a byte count.
If the efficiency requirement comes from having to jump back and forth many times, a simple solution is to scan the whole file once, compute all the line lengths (or, equivalently, their starting offsets), store them in an array or map, and then use those values to move exactly where you want.
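A sketch of that one-time scan, using the POSIX getline() function (on glibc it needs a feature-test macro); for files over 2 GB you would use ftello() and off_t rather than ftell() and long:

#define _GNU_SOURCE   /* getline() is POSIX 2008, not ISO C */
#include <stdio.h>
#include <stdlib.h>

/* Scan the file once and record the byte offset of every line start.
 * Returns a malloc'd array of offsets and stores the line count in *n. */
long *build_line_index(FILE *f, size_t *n)
{
    char *buf = NULL;
    size_t cap = 0, count = 0, alloc = 1024;
    long *offsets = malloc(alloc * sizeof *offsets);
    if (!offsets)
        return NULL;

    long pos = ftell(f);
    while (getline(&buf, &cap, f) != -1) {
        if (count == alloc) {
            long *tmp = realloc(offsets, 2 * alloc * sizeof *offsets);
            if (!tmp) { free(offsets); free(buf); return NULL; }
            offsets = tmp;
            alloc *= 2;
        }
        offsets[count++] = pos;
        pos = ftell(f);            /* start of the next line */
    }
    free(buf);
    *n = count;
    return offsets;
}

Afterwards, moving to line k is just fseek(f, offsets[k], SEEK_SET).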

Related

Standard format for writing Compressed Row Storage into text files?

I have a large sparse matrix stored in Compressed Row Storage (CRS) format. This is basically three arrays: an array containing the Values, an array for Column Index, and a final array containing the Row Pointers. E.g. http://web.eecs.utk.edu/~dongarra/etemplates/node373.html
I want to write this information into a text (.txt) file, which is intended to be read and put into three arrays using C. I currently plan to do this by writing all the entries in the Value array in one long line separated by commas. E.g. 5.6,10,456,78.2,... etc. Then do the same for the other two arrays.
My C code will then read the first line and put all the values into an array labeled "Value", and so on.
Question
Is this "correct"? Or is there a standard way of putting CRS data into text files?
No standard format that I'm aware of. You decide on a format that makes your life easy.
First, consider that if you want to look at one of these text files, you'll be instantly put off by the long lines. Some text editors might simply hate you. There's nothing wrong with splitting lines up.
Second, consider writing out the number of elements in each array (well, I suppose there are only two distinct array lengths among the three arrays) at the beginning of the file. This will let you preallocate your arrays. If you have all the array lengths at hand, you even have the option of doing a single memory allocation.
Finally, consider writing out some sensible tag names: some kind of header that identifies your file as being in the correct format, then something to denote the start of each array. It's a sanity check that lets your code detect problems with the file. It might just be one character, but it's something.
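To make those three points concrete, here is one possible writer, a hedged sketch rather than any standard: the "CRS1" tag, the layout, and all names are inventions of this example. Note that rowptr conventionally has nrows + 1 entries:

#include <stdio.h>

/* Write CRS data as a small self-describing text file: a header tag,
 * the array lengths, then the three arrays, one value per line. */
int crs_write(const char *path, const double *val, const int *col,
              const int *rowptr, int nnz, int nrows)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "CRS1 %d %d\n", nnz, nrows);
    for (int i = 0; i < nnz; i++)
        fprintf(f, "%.17g\n", val[i]);   /* %.17g round-trips a double */
    for (int i = 0; i < nnz; i++)
        fprintf(f, "%d\n", col[i]);
    for (int i = 0; i <= nrows; i++)     /* rowptr has nrows + 1 entries */
        fprintf(f, "%d\n", rowptr[i]);
    return fclose(f);
}

The reader mirrors this: check the tag, read the two counts, malloc the three arrays, then loop with fscanf.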
Now... call me a grungy old programmer, but I'd probably just write the whole lot in binary. Especially if it's floating-point data, I wouldn't want to deal with the loss of precision you get when you write numbers out as text (or the space they consume when you write them with full precision). Binary files are easy to write and fast to read back. You just have to be careful if you're going to use them across platforms with different endianness.
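A sketch of that binary variant under the same made-up layout; fwrite of raw ints and doubles is compact and lossless, but as noted it is not portable across endianness:

#include <stdio.h>

/* Binary variant: the two counts, then the raw arrays. */
int crs_write_bin(const char *path, const double *val, const int *col,
                  const int *rowptr, int nnz, int nrows)
{
    FILE *f = fopen(path, "wb");
    if (!f)
        return -1;
    fwrite(&nnz, sizeof nnz, 1, f);
    fwrite(&nrows, sizeof nrows, 1, f);
    fwrite(val, sizeof *val, (size_t)nnz, f);
    fwrite(col, sizeof *col, (size_t)nnz, f);
    fwrite(rowptr, sizeof *rowptr, (size_t)nrows + 1, f);
    return fclose(f);
}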
That's my 2 cents' worth. Hope it's useful to you.
If you want to stick to some widely-used standards, have a look at the Matrix Market. This is a repository with many matrices arising in a variety of engineering and science problems. You can find software libraries to save and read the matrices as well.

How can we set a file pointer's new position in terms of lines in C

As I understand it, we have the fseek function to set a file pointer's new position measured in bytes. How can we move a file pointer to a new position in terms of lines?
The short answer: there's no easy way. A file in C is a bunch of bytes, and there is nothing in particular that makes the bytes '\n' and '\r' special (depending on your system). If you really care about a general solution, I would recommend building a lookup table for the byte offsets of line endings as you read the file, and then using it to jump around in the file later on.
You can't move the pointer directly to a line; you have to read through the file.
The basic stdio functions operate on bytes only. You will have to read the file byte by byte and count the lines yourself.
I was facing the same problem. My solution was to store the seek positions of some of the lines and to do a forward search from there.
E.g., if you have a million lines, you can store the seek positions of every thousandth line.
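A sketch of building such a checkpoint table in one pass; STRIDE and the function name are arbitrary choices for illustration:

#include <stdio.h>
#include <stdlib.h>

#define STRIDE 1000   /* remember the position of every 1000th line */

/* One pass over the file: chk[k] ends up holding the byte offset where
 * line k * STRIDE begins. To reach line n later, fseek() to
 * chk[n / STRIDE] and read forward across n % STRIDE newlines. */
long *build_checkpoints(FILE *f, size_t *nchk)
{
    size_t alloc = 64, n = 0;
    long *chk = malloc(alloc * sizeof *chk);
    long line = 0;
    int c;

    if (!chk)
        return NULL;
    chk[n++] = 0;                     /* line 0 starts at offset 0 */
    while ((c = fgetc(f)) != EOF) {
        if (c == '\n' && ++line % STRIDE == 0) {
            if (n == alloc) {
                long *tmp = realloc(chk, 2 * alloc * sizeof *chk);
                if (!tmp) { free(chk); return NULL; }
                chk = tmp;
                alloc *= 2;
            }
            chk[n++] = ftell(f);      /* start of line 'line' */
        }
    }
    *nchk = n;
    return chk;
}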

How to represent a random-access text file in memory (C)

I'm working on a project in which I need to read a text (source) file into memory and be able to perform random access into it (say, for instance, retrieve the address corresponding to line 3, column 15).
I would like to know if there is an established way to do this, or data structures that are particularly good for the job. I need to be able to perform a (probably amortized) constant time access. I'm working in C, but am willing to implement higher level data structures if it is worth it.
My first idea was to go with a linked list of large buffers that would hold the character data of the file. I would also keep an array, indexed by line number, whose contents are the addresses corresponding to the beginnings of the lines. This array would be reallocated as needed.
Subsidiary question: does anyone have an idea of the average size of a source file? I was surprised not to find this on Google.
To clarify:
The files I'm concerned about are source files, so their size should be manageable; they should not be modified, and the lines have variable length (though hopefully capped at some maximum).
The problem I'm working on needs mostly a read-only file representation, but I'm very interested in digging around the problem.
Conclusion:
There is a very interesting discussion of the data structures used to maintain a file (with read/insert/delete support) in the paper Data Structures for Text Sequences.
If you just need read-only access, get the file size, read it into memory with fread(), then maintain a dynamic array which maps line numbers (indices) to pointers to the first character of each line. Someone below suggested building this array lazily, which seems a good idea in many cases.
I'm not quite sure what the question is here, but there seems to be a bit of both "how do I keep the file in memory" and "how do I index it". Since you need random access to the file's contents, you're probably well advised to memory-map the file, unless you're tight on address space.
I don't think you'll be able to avoid a linear pass through the file once to find the line endings. As you said, you can create an index of the pointers to the beginning of each line. If you're not sure how much of the index you'll need, create it lazily (on demand). You can also store this index to disk (as offsets, not pointers) if you will need it on subsequent runs. You can estimate the size of the index based on the file size and the expected line length.
1) Read (or mmap) the entire file into one chunk of memory.
2) In a second pass create an array of pointers or offsets pointing to the beginnings of the lines (hint: one after the '\n' ) into that memory.
Now you can index the array to access a specific line.
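A sketch of step 2, assuming the file's bytes are already in buf (via fread() or mmap()); names are illustrative and the file is assumed non-empty:

#include <stdio.h>
#include <stdlib.h>

/* Build an array of pointers to the first character of each line
 * within an in-memory copy of the file. */
char **index_lines(char *buf, size_t len, size_t *nlines)
{
    size_t alloc = 64, n = 0;
    char **lines = malloc(alloc * sizeof *lines);
    if (!lines)
        return NULL;
    lines[n++] = buf;                      /* line 0 starts at buf[0] */
    for (size_t i = 0; i + 1 < len; i++) {
        if (buf[i] == '\n') {
            if (n == alloc) {
                char **tmp = realloc(lines, 2 * alloc * sizeof *lines);
                if (!tmp) { free(lines); return NULL; }
                lines = tmp;
                alloc *= 2;
            }
            lines[n++] = buf + i + 1;      /* one after the '\n' */
        }
    }
    *nlines = n;
    return lines;
}

With this, the address of line 3, column 15 from the question is simply lines[2] + 14 (both zero-based here).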
It's impossible to make insertion, deletion, and reading at a particular line/column/character address all simultaneously O(1). The best you can get is simultaneous O(log n) for all of these operations, and it can be achieved using various sorts of balanced binary trees for storing the file in memory.
Of course, unless your files will be larger than 100 kB or so, you're probably best off not bothering with anything fancy and just using a flat linear buffer...
Solution: if the lines are about the same size, make them all equally long by padding each one with the needed number of filler characters. Then you can compute the fseek() position directly from the line number, making your lookup O(1).
If the lines are sorted, you can perform a binary search, making your lookup O(log(nLines)).
If neither, you can store the offsets of the line beginnings. But then you have a problem if you modify the file a lot: if you insert, say, X characters somewhere, you have to work out which line that is and then add X to the offsets of all subsequent lines. Deletion is similar. You essentially get O(nLines), and the code gets ugly.
If you want to store the whole file in memory, just create an array of lines, char *lines[]. You then get a line with the first dereference and a character with the second.
As an alternate suggestion (although I do not fully understand the question), you might want to consider a struct-based, dynamically linked list of dynamic strings. If you want to be astutely clever, you could build a dynamically linked list of chars which you then export as strings.
You'd have to use OO type design for this to be manageable.
So structs you'd likely want to build are:
DynamicArray;
DynamicListOfArrays;
CharList;
So it goes:
CharList(Gets Chars/Size) -> (SetSize)DynamicArray -> (AddArray)DynamicListOfArrays
Build suitable helper functions for allocation and deletion, and make it so the structs can be freed either automatically or manually. Using the above combination won't get you O(1) read-in (which isn't possible unless the file has a static format), but it will get you good times.
If you know the file's lengths are static (at least per line), i.e. no bigger than 256 chars per line, then all you need is the DynamicListOfArrays: write directly into the array (preset to 256), create a new one, repeat. The downside is that it wastes memory.
Note: You'd have to convert the DynamicListOfArrays into a 'static' ArrayOfArrays before you could get direct point-to-point access.
If you need source code to give you an idea (although mine is built towards C++ it wouldn't take long to rewrite), leave a comment about it. As with any other code I offer on stackoverflow, it can be used for any purpose, even commercially.
Average size of a source file? Does such a thing exist? A source file can go from 0 bytes to thousands of bytes, like any text file; it depends on the number of characters it contains.

How do you read a file until you hit a certain string in c?

I wanted to know how, in C, you can read a certain file until the reading hits a certain string or character array. What I want to be able to do is, once the file hits that string, have the position set at that point. I am going to use fseek for that, and that's not a problem. It's just the reading until a certain string is hit that I am not able to do. I've been reading up on some of the functions, but there doesn't seem to be anything that helps with this. fgets is the closest thing, but I don't want to provide a fixed number of characters to read, as I don't know how many. Can you give me some tips on how to do this?
Thanks!
There are many efficient string searching algorithms, each of which can be implemented in C.
http://en.wikipedia.org/wiki/String_searching_algorithm
If you're looking for a string of length N, easiest is to keep a circular buffer of length N and read 1 byte at a time from the file adding it to the circular buffer. At each step you compare your buffer with the string you're searching for. It's highly inefficient but easy to code.
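A sketch of that circular-buffer idea (implemented here as a sliding window with memmove, which is equivalent and easier to read); the 256-byte cap on the search string is an arbitrary limit of this example:

#include <stdio.h>
#include <string.h>

/* Read one byte at a time, keeping the last strlen(needle) bytes in a
 * window. On a match, reposition the stream at the match and return
 * its byte offset; return -1 if not found. */
long find_in_file(FILE *f, const char *needle)
{
    size_t n = strlen(needle);
    char window[256];
    size_t filled = 0;
    int c;

    if (n == 0 || n > sizeof window)
        return -1;
    while ((c = fgetc(f)) != EOF) {
        if (filled == n) {                 /* slide the window left */
            memmove(window, window + 1, n - 1);
            filled = n - 1;
        }
        window[filled++] = (char)c;
        if (filled == n && memcmp(window, needle, n) == 0) {
            long start = ftell(f) - (long)n;
            fseek(f, start, SEEK_SET);     /* leave position at the match */
            return start;
        }
    }
    return -1;
}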
There's no built-in function to do exactly what you want, but there are a few options.
Option one: read data in chunks. You don't know exactly where your data is, so read a few kB of data at a time and search within these chunks. Make sure you deal with the case where the string you're looking for straddles a chunk boundary! Once you've located the string, use fseek() to position yourself at the start of it.
Option two: Memory map the file and use memmem() on the entire file (as mapped into memory). This requires unportable calls to set up the memory mapping, so you'll need to know your OS (or use a portability wrapper library like glib). On 32-bit machines, it will also limit the size of files you can search in to a few hundred megabytes. It is, however, a very simple and efficient approach when it's an option.
If you go with option one, the trickiest part will be dealing with the chunk-straddling case. One option is to always keep two chunks in memory, and restart the search so it begins (length of target string) - 1 bytes before the end of the previous block. The actual search could then be done using memmem() or any other string searching algorithm. You could also convert your search into a DFA (since it is a regular language) and keep the current state across blocks.
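A sketch of option one with the overlap handling, using memmem() (a GNU/BSD extension, not ISO C) and assuming the stream starts at the beginning of the file and the needle is shorter than one chunk:

#define _GNU_SOURCE   /* for memmem() on glibc */
#include <stdio.h>
#include <string.h>

/* Search a file in chunks, carrying needle-length - 1 bytes of overlap
 * between reads so a straddling match is still found. Returns the byte
 * offset of the first match (stream repositioned there), or -1. */
long chunked_find(FILE *f, const char *needle)
{
    enum { CHUNK = 4096 };
    size_t nlen = strlen(needle);
    char buf[CHUNK];
    size_t keep = 0;                 /* bytes carried over from last read */
    long base = 0;                   /* file offset of buf[0] */

    if (nlen == 0 || nlen >= CHUNK)
        return -1;
    for (;;) {
        size_t got = fread(buf + keep, 1, CHUNK - keep, f);
        if (got == 0)
            return -1;               /* EOF or read error: not found */
        size_t have = keep + got;
        char *hit = memmem(buf, have, needle, nlen);
        if (hit) {
            long start = base + (long)(hit - buf);
            fseek(f, start, SEEK_SET);
            return start;
        }
        /* carry the tail forward, but no more than we actually have */
        keep = nlen - 1 < have ? nlen - 1 : have;
        memmove(buf, buf + have - keep, keep);
        base += (long)(have - keep);
    }
}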

Creating a binary search of an alphabetically ordered .txt file in C

I'm working on creating a binary search algorithm in C that searches for a string in a .txt file. Each line is a string representing a stock ticker. Not being familiar with C, this is taking far too long. I have a few questions:
1.) Once I have opened a file using fopen, does it make more sense in terms of efficiency for the algorithm to step through the file using some function provided in the C library for scanning files, doing the compare directly from the file, or should I copy each line into an array and have the algorithm search the array?
2.) If I should compare directly from the file, what is the best way to step through it? Assume I have the number of lines in the file, is there some way to go directly to the middle line, scan the string and do the compare?
I'm sorry if this is too vague. Not too sure how to better explain. Thanks for your time
Unless your file is exceedingly big (> 2 GB), loading it into memory before searching it is the way to go. If you cannot load the file into memory, you could hold the offset of each line in an int[] or (if the file contains too many lines...) create another binary file and write the offset of each line into it as an integer...
Having everything in memory is by far preferable, though.
You cannot binary-search the lines of a text file without knowing the length of each line in advance, so you'll most likely want to read each line into memory first (unless the file is very big).
But if your goal is only to search for a single given line as quickly as possible, you might as well just do a linear search directly on the file. There's no point in getting O(log n) searches at the cost of an O(n) setup if the search is only done once.
Reading it all in with a bulk read and walking through it with pointers (to memory) is very fast. Avoid doing multiple I/O calls if you can.
I should also mention that memory mapped files can be very suitable for something like this. See mmap() if on Unix. This is definitely your best bet for really large files.
This is a great question!
The challenge of binary search is that the benefits of binary search come from being able to skip past half the elements at each step in O(1). This guarantees that, since you only do O(lg n) probes, that the runtime is O(lg n). This is why, for example, you can do a fast binary search on an array but not a linked list - in the linked list, finding the halfway point of the elements takes linear time, which dominates the time for the search.
When doing binary search on a file you are in a similar position. Since all the lines in the file might not have the same length, you can't easily jump to the nth line in the file given some number n. Consequently, implementing a good, fast binary search on a file will be a bit tricky. Somehow, you will need to know where each line starts and stops so that you can efficiently jump around in the file.
There are many ways you can do this. First, you could load all the strings from the file into an array, as you've suggested. This takes linear time, but once you have the array of strings in memory all future binary searches will be very fast. The catch is that if you have a very large file, this may take up a lot of memory and could be prohibitively expensive. Consequently, another alternative might be to store in the array not the actual strings, but rather the offsets into the file at which each string occurs. This would let you do the binary search quickly - you could seek the file to the proper offset when doing a comparison - and for large strings it can be much more space-efficient than the above. And if all the strings are roughly the same length, you could just pad every line to some fixed size to allow direct computation of the start position of each line.
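A sketch of the offset-based variant: one linear pass builds an array of line-start offsets (for instance with a getline()/ftell() loop like the one shown under an earlier question), then each probe seeks and reads a single line. The names and the 128-byte line buffer are assumptions of this example, and the file must be sorted in ascending order:

#include <stdio.h>
#include <string.h>

/* Binary-search a sorted file of lines given an array of line-start
 * byte offsets. Returns the matching line's offset, or -1. */
long bsearch_file(FILE *f, const char *key,
                  const long *offsets, size_t nlines)
{
    char line[128];                       /* tickers are short */
    size_t lo = 0, hi = nlines;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (fseek(f, offsets[mid], SEEK_SET) != 0)
            return -1;
        if (!fgets(line, sizeof line, f))
            return -1;
        line[strcspn(line, "\r\n")] = '\0';   /* strip the line ending */
        int cmp = strcmp(line, key);
        if (cmp == 0)
            return offsets[mid];
        if (cmp < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    return -1;
}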
If you're willing to expend some time implementing more complex solutions, you might want to consider preprocessing the file so that instead of having one string per line, instead you have at the top of the file a list of fixed-width integers containing the offsets of each string in the file. This essentially does the above work, but then stores the result back in the file to make future binary searches much faster. I have some experience with this sort of file structure, and it can be quite fast.
If you're REALLY up for a challenge, you could alternatively store the strings in the file using a B-tree, which would give you incredibly fast lookup times for each string by minimizing the number of disk reads you need to do.
Hope this helps!
I don't see how you could compare directly from the file. You will have to have a buffer to store data read from disk and then use that buffer. So it doesn't make sense; it is just impossible.
You cannot jump to a particular line in the file. Not unless you know the offset in bytes of the beginning of that line relative to the beginning of the file.
I'd recommend using mmap to map this file directly into memory and working with it as with a character array. The operating system will make the file operations (seeking, reading, writing) transparent to you, and you will just work with it like a buffer in memory. Note that mmap is limited by the address space, roughly 4 GB on 32-bit systems. But if the file is bigger than that, you probably need to ask why on earth you have such a big file that isn't in an indexed database.
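For reference, a minimal POSIX-only sketch of setting up such a mapping (not portable to Windows, which uses CreateFileMapping instead); the file is assumed non-empty, and the caller would munmap(p, len) when done:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a whole file read-only and return it as a char array;
 * stores the length in *len. Returns NULL on failure. */
char *map_file(const char *path, size_t *len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;
    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return NULL;
    }
    char *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                 /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return NULL;
    *len = (size_t)st.st_size;
    return p;
}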
