How to flush the CPU cache in Linux from a C program? - c

I am writing a C program in which I need to flush my memory. I would like to know if there is any UNIX system command to flush the CPU cache.
This is a requirement for my project which involves calculating the time taken for my logic.
I have read about the cacheflush(char *s, int a, int b) function but I am not sure as to whether it will be suitable and what to pass in the parameters.

I take it you mean "CPU cache", not memory cache
The link above is good: the suggestion "write a lot of data via CPU" is not Windows specific
Here's another variation on the same theme:
How to clear CPU L1 and L2 cache
Here's an article about Linux and CPU cache:
http://lwn.net/Articles/252125/
NOTE:
At this (very, very low) level, "Linux" != "Unix"

This is how Intel suggests flushing the cache:
void mem_flush(const void *p, unsigned int allocation_size)
{
    const size_t cache_line = 64;
    const char *cp = (const char *)p;
    size_t i = 0;

    if (p == NULL || allocation_size == 0)
        return;

    /* clflush every cache line covering the buffer */
    for (i = 0; i < allocation_size; i += cache_line) {
        asm volatile("clflush (%0)\n\t"
                     :
                     : "r"(&cp[i])
                     : "memory");
    }

    /* make sure the flushes have completed before continuing */
    asm volatile("sfence\n\t"
                 :
                 :
                 : "memory");
}

If you're writing a user-mode (not kernel-mode) program, and if it's single-threaded, then there's really no reason for you to ever bother flushing your cache in the first place. Your user-mode program can just forget that it even exists; it's just there to speed up your program's execution, and the OS manages it via the processor's MMU.
There are only a couple reasons I can think of that you might actually want to flush the cache from your user-mode application:
Your app is intended to run on a symmetric multiprocessor system, or it has data transactions with external hardware.
You're simply testing your cache for some sort of performance test (in which case you probably should be writing your test to operate in kernel mode, perhaps as a driver).
In any case, assuming you're using Linux...
#include <asm/cachectl.h>
int cacheflush(char *addr, int nbytes, int cache);
This assumes you have a block of memory you just wrote to and you want to make sure it's flushed out of the cache back to main memory. The block begins at addr, and it's nbytes long, and it's in one of the two caches (or both):
ICACHE Flush the instruction cache.
DCACHE Write back to memory and invalidate the affected valid cache lines.
BCACHE Same as (ICACHE|DCACHE).
Normally you'd only need to flush the DCACHE, since when you write data to "memory" (i.e. to the cache), it's normally data, not instructions.
If you want to flush "all of the cache" for some strange testing reason, you could malloc() a big block that you know is larger than your CPU's cache (shoot, make it 8 times as big!), write any old garbage into it, and just flush that entire block.
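For illustration, a rough sketch of that approach, reusing the clflush-based mem_flush() shown earlier (the 8 MB size is an assumed upper bound on the last-level cache, not a measured value):
#include <stdlib.h>
#include <string.h>

/* Assumes the mem_flush() shown above is available.
 * 8 MB is an assumed upper bound on the cache size -- adjust for your CPU. */
#define BIG (8 * 1024 * 1024)

static void flush_whole_cache(void)
{
    char *junk = malloc(BIG);
    if (junk == NULL)
        return;
    memset(junk, 0x5a, BIG);   /* write garbage so the lines really are cached */
    mem_flush(junk, BIG);      /* evict them all back to main memory */
    free(junk);
}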
See also: How to perform cache operations in C++?

OK, sorry about my first answer. I later read your follow-up comments below your question, so I realize now that you want to flush the INSTRUCTION CACHE to boot your program (or parts of it) out of the cache, so that when you test its performance, you also test its initial load time out of main memory into the instruction cache. Do you also need to flush any data your code will use out to main memory, so that both data and code are fresh loads?
Before anything else, I'd like to mention that main memory itself is also a form of cache, with your hard disk (either the program on disk, or swap space on disk) being the lowest, slowest place your program's instructions could be coming from. When you run through a routine for the first time, if it hasn't already been loaded into main memory from disk by virtue of being near other code that has already executed, then its CPU instructions will first have to be loaded from disk. That takes an order of magnitude or more longer than loading them from main memory into the cache. And once the code is in main memory, it takes roughly an order of magnitude longer to load it from main memory into the cache than it takes to move it from the cache into the CPU's instruction fetcher. So if you want to test your code's cold-start performance, you have to decide what cold-start means: pulling it off disk, or pulling it out of main memory. I don't know of any command to "flush" instructions/data out of main memory to swap space, so flushing it out to main memory is about as much as you can do (that I know of), but keep in mind that your test results may still differ from the first run (when it may be pulling it off disk) to subsequent runs, even if you do flush the instruction cache.
Now, how would one go about flushing the instruction cache to ensure that their own code is flushed out to main memory?
If I needed to do this (a very odd thing to do, in my opinion), I'd probably start by finding the length and approximate placement of my functions in memory. Since I'm using Linux, I'd run "objdump -d {myprogram} > myprogram.dump.txt", open myprogram.dump.txt in an editor, search for the functions I want to flush out, and figure out how long they are by subtracting their start address from their end address with a hex calculator. I'd write down the size of each. Later I'd add cacheflush() calls in my code, giving the address of each function I want to flush out as 'addr' and the length I found as 'nbytes', with ICACHE as the cache argument. Just for safety I'd probably fudge a little and add about 10% to the size, in case I make a few tweaks to the code and forget to adjust nbytes. I'd make a call to cacheflush() like this for each function I want to flush out (see the sketch after the declaration below). Then if I need to flush out the data also, and it's global/static data, I can flush that too (DCACHE); but if it's stack or heap data, there's really nothing realistic that I can (or should) do to flush it out of the cache. Trying to do so would be an exercise in silliness, because it would be creating a condition that would never or very rarely exist in normal execution.
Assuming you're using Linux...
#include <asm/cachectl.h>
int cacheflush(char *addr, int nbytes, int cache);
...where cache is one of:
ICACHE Flush the instruction cache.
DCACHE Write back to memory and invalidate the affected valid cache lines.
BCACHE Same as (ICACHE|DCACHE).
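As a rough sketch, and only on architectures where the cacheflush() syscall actually exists (e.g. MIPS; it is not available on x86), flushing one function's instructions might look like this. my_hot_function and its ~4 KB size are hypothetical values you'd pull from the objdump listing:
#include <asm/cachectl.h>

extern void my_hot_function(void);   /* hypothetical function under test */

static void evict_my_hot_function(void)
{
    int approx_len = 4096 + 4096 / 10;   /* size from objdump, padded ~10% as described above */
    cacheflush((char *)my_hot_function, approx_len, ICACHE);
}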
BTW, is this homework for a class?

Related

Why is memory barrier not required for UP? [duplicate]

Consider the following example taken from Wikipedia, slightly adapted, where the steps of the program correspond to individual processor instructions:
x = 0;
f = 0;
Thread #1:
while (f == 0);
print x;
Thread #2:
x = 42;
f = 1;
I'm aware that the print statement might print different values (42 or 0) when the threads are running on two different physical cores/processors due to the out-of-order execution.
However, I don't understand why this is not a problem on a single-core machine, with those two threads running on the same core (through preemption). According to Wikipedia:
When a program runs on a single-CPU machine, the hardware performs the necessary bookkeeping to ensure that the program executes as if all memory operations were performed in the order specified by the programmer (program order), so memory barriers are not necessary.
As far as I know single-core CPUs too reorder memory accesses (if their memory model is weak), so what makes sure the program order is preserved?
The CPU would not be aware that these are two threads. Threads are a software construct (1).
So the CPU sees these instructions, in this order:
store x = 42
store f = 1
test f == 0
jump if true ; not taken
load x
If the CPU were to re-order the store of x to the end, after the load, it would change the results. While the CPU is allowed out-of-order execution, it only does this when it doesn't change the result. If it were allowed to change results, virtually every sequence of instructions could fail, and it would be impossible to produce a working program.
In this case, a single CPU is not allowed to re-order a store past a load of the same address. At least, as far as the CPU can see, it is not re-ordered. As far as the L1, L2, L3 caches and main memory (and other CPUs!) are concerned, maybe the store has not been committed yet.
(1) Something like HyperThreads, two threads per core, common in modern CPUs, wouldn't count as "single-CPU" w.r.t. your question.
The CPU doesn't know or care about "context switches" or software threads. All it sees is some store and load instructions. (e.g. in the OS's context-switch code where it saves the old register state and loads the new register state)
The cardinal rule of out-of-order execution is that it must not break a single instruction stream. Code must run as if every instruction executed in program order, and all its side-effects finished before the next instruction starts. This includes software context-switching between threads on a single core, e.g. a single-core machine or green threads within one process.
(Usually we state this rule as not breaking single-threaded code, with the understanding of what exactly that means; weirdness can only happen when an SMP system loads from memory locations stored by other cores).
As far as I know single-core CPUs too reorder memory accesses (if their memory model is weak)
But remember, other threads aren't observing memory directly with a logic analyzer, they're just running load instructions on that same CPU core that's doing and tracking the reordering.
If you're writing a device driver, yes you might have to actually use a memory barrier after a store to make sure it's actually visible to off-chip hardware before doing a load from another MMIO location.
Or when interacting with DMA, making sure data is actually in memory, not in a CPU-private write-back cache, can be a problem. Also, MMIO is usually done in uncacheable memory regions that imply strong memory ordering. (x86 has cache-coherent DMA so you don't have to actually flush back to DRAM, only make sure it's globally visible with an instruction like x86 mfence that waits for the store buffer to drain. But some non-x86 systems, which had cache-control instructions designed in from the start, do require the OS to be aware of the cache: i.e. to make sure cache is invalidated before reading in new contents from disk, and to make sure it's at least written back to somewhere DMA can read from before asking a device to read from a page.)
And BTW, even x86's "strong" memory model is only acq/rel, not seq_cst (except for RMW operations which are full barriers). (Or more specifically, a store buffer with store forwarding on top of sequential consistency). Stores can be delayed until after later loads. (StoreLoad reordering). See https://preshing.com/20120930/weak-vs-strong-memory-models/
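Not in the original example, but as a concrete illustration: if you do want this flag hand-off to be correct on an SMP machine, a minimal sketch using C11 atomics with release/acquire ordering might look like this:
#include <stdatomic.h>
#include <stdio.h>

int x = 0;
atomic_int f = 0;

/* Thread #2 */
void producer(void) {
    x = 42;
    atomic_store_explicit(&f, 1, memory_order_release);  /* publish x */
}

/* Thread #1 */
void consumer(void) {
    while (atomic_load_explicit(&f, memory_order_acquire) == 0)
        ;                           /* spin until the flag is set */
    printf("%d\n", x);              /* prints 42 */
}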
so what makes sure the program order is preserved?
Hardware dependency tracking; loads snoop the store buffer for older stores to the same location. This makes sure loads take data from the last program-order write to any given memory location (see footnote 1).
Without this, code like
x = 1;
int tmp = x;
might load a stale value for x. That would be insane and unusable (and kill performance) if you had to put memory barriers after every store for your own reloads to reliably see the stored values.
We need all instructions running on a single core to give the illusion of running in program order, according to the ISA rules. Only DMA or other CPU cores can observe reordering.
Footnote 1: If the address for older stores isn't available yet, a CPU may even speculate that it will be to a different address and load from cache instead of waiting for the store-data part of the store instruction to execute. If it guessed wrong, it will have to roll back to a known good state, just like with branch misprediction.
This is called "memory disambiguation". See also Store-to-Load Forwarding and Memory Disambiguation in x86 Processors for a technical look at it, including cases of narrow reload from part of a wider store, including unaligned and maybe spanning a cache-line boundary...

char x[2048] and cache line issue

The following is the simple c source code, where char x[2048] is a global var and func1 is called by thread1, func2 is called by thread2:
#include <stdio.h>
#include <string.h>

char x[2048] = {0}, y[16] = {0};

void func1() {              /* called by thread1 */
    strcpy(x, y);
}

void func2() {              /* called by thread2 */
    printf("(%s)\n", x);
}

int main(int argc, char **argv) {
    strncpy(y, argv[1], sizeof(y) - 1);
    /* thread creation omitted in the original snippet */
}
In Intel's CPUs, one cache line holds 64 bytes, so x should occupy 32 cache lines. My questions are:
When thread1 calls func1, must all 32 cache lines be brought into that CPU's cache before it does the strcpy? Or does the compiler know that just one cache line is enough to do the job?
When thread2 calls func2, must all 32 cache lines be brought into that CPU's cache before it does the printf? Or can the compiler identify that one cache line is enough?
I suggest you read the Wikipedia page: https://en.wikipedia.org/wiki/CPU_cache
Some background:
Normally, cache lines ($L) are transparent to programs, so most programmers don't deal with cache lines (bringing them in, kicking them out) directly. The CPU, once it finds that code/data is not in $L, stalls for the memory access and brings the $L in on demand.
Although there are coding techniques to bring data into a cache line in code (e.g. via a prefetch instruction), normally the compiler won't be smart enough to do this for you, as it might prefetch too early (so by the time the $L is used, it has already been kicked out) or too late (the CPU still has to stall for the memory access).
Answer to your questions:
No. The compiler doesn't know how many $Ls need to be brought in (how could it know whether a piece of data is already in $L or not?), so it stays on the safe side and doesn't try to outsmart itself. The compiler just issues, for example, a MOV instruction, and the CPU, while executing that instruction, finds that the operand is not in $ and brings it in on demand. Since your program only copies up to '\0', the cache-line fetching stops there.
The same as #1. Only the $Ls that are read are brought in, and the compiler has nothing to do with it.
More Info:
The CPU prefetcher might bring in additional $Ls besides those currently needed. For example, it might bring in the next $L, hoping for data locality.
Some advanced programs use prefetch instructions to improve performance (see the sketch after the link below). If you know that your code will access some location in the near future, you can prefetch it, and by the time you need it, it is already there, so you won't incur the $L-miss penalty. But it's hard to get right: you have to know the memory access pattern of your code and insert the prefetch instruction at the right place. Some high-performance code uses software pipelining to do this, but again, that's an advanced topic.
https://en.wikipedia.org/wiki/Instruction_prefetch
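As an illustration of manual prefetching (not part of the original answer), here is a minimal sketch using the GCC/Clang __builtin_prefetch builtin; the look-ahead distance of 128 elements (~8 cache lines of 4-byte ints) is an arbitrary assumption that real code would have to tune:
#include <stddef.h>

/* Sum an array while prefetching a few cache lines ahead.
 * __builtin_prefetch is a GCC/Clang builtin; the look-ahead distance
 * below is an assumed value, not a recommendation. */
long sum_with_prefetch(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 128 < n)
            __builtin_prefetch(&a[i + 128]);  /* ~8 cache lines ahead for 4-byte ints */
        s += a[i];
    }
    return s;
}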
On x86 and x64 (as well as modern ARMs and other common CPUs), the cache is fully transparent to user-mode programs.
As a result, strcpy performs the first read, the CPU pulls in one cache line automatically, strcpy quits on the \0 and it's done. Same thing with printf("%s",x).

Does a cache write take longer with more caches to invalidate?

Can you please help me find out whether it takes longer for a cache write to finish when more cores/caches hold a copy of that line?
I also want to measure/quantify how much longer it actually takes.
I couldn't find anything useful on google and I have trouble measuring it myself plus interpret what I measure because of the many things that can happen on a modern processor.
(reordering, prefetching, buffering and god knows what)
Details:
My basic process of measuring it is roughly as follows:
write something to the cache line on processor 0
read it on processors 1 to n
rdtsc
write it on processor 0
rdtsc
I am not even sure which instructions to actually use for the read/write on processor 0 in order to make sure the write/invalidate is finished before the final time measurement.
At the moment I fiddle with an atomic exchange (__sync_fetch_and_add()), but it seems that the number of threads is itself important for the length of this operation (not the number of threads to invalidate) -- which is probably not what I want to measure?!
I also tried a read, then write, then memory barrier (__sync_synchronize()). This looks more like what I expect to see,
but here I am also not sure if the write is finished when the final rdtsc takes place.
As you can guess my knowledge of CPU internals is somewhat limited.
Any help is very much appreciated!
ps:
* I use linux, gcc and pthreads for the measurements.
* I want to know this for modeling a parallel algorithm of mine.
Edit:
In a week or so (going on vacation tomorrow) I'll do some more research and post my code and notes and link it here (In case someone is interested), because the time I can spend on this is limited.
I started writing a very long answer, describing exactly how this works, then realized, I probably don't know enough about the exact details. So I'll do a shorter answer....
So, when you write something on one processor, if it's not already in that processor's cache, it will have to be fetched in, and after the processor has read the data, it will perform the actual write. In doing so, it will send a cache-invalidate message to ALL other processors in the system. These will then throw away their copy of the line. If another processor has "dirty" content, it will itself write out the data and ask for an invalidation - in which case the first processor will have to RELOAD the data before finishing its write (otherwise, some other element in the same cache line may get destroyed).
Reading it back into the cache will be required on every other processor that is interested in that cache-line.
The __sync_fetch_and_add() will use a "lock" prefix [on x86; other processors may vary, but the general idea on processors that support "per instruction" locks is roughly the same] - this will issue an "I want this cacheline EXCLUSIVELY, everyone else please give it up and invalidate it". Just like the first case, the processor may well have to re-read anything that another processor may have made dirty.
A memory barrier will not ensure that data is updated "safely" - it will just make sure that "whatever happened (to memory) before now is visible to all processors by the time this instructon finishes".
The best way to optimize the use of processors is to share as little as possible, and in particular, avoid "false sharing". In a benchmark many years ago, there was a structure like [simplified] this:
struct stuff {
    int x[2];
    /* ... other data ... total data a few cachelines */
} data;

void thread1()
{
    for ( /* ... big number ... */ )
        data.x[0]++;
}

void thread2()
{
    for ( /* ... big number ... */ )
        data.x[1]++;
}

int main()
{
    start = timenow();
    create(thread1);
    create(thread2);
    end = timenow() - start;
}
Since EVERY time thread1 wrote to x[0], thread2's processor had to get rid of its copy of x[1], and vice versa, the result was that the SMP test [vs just running thread1] was running about 15 times slower. By altering the struct like this:
struct stuff {
    int x;
    /* ... other data ... */
} data[2];

and

void thread1()
{
    for ( /* ... big number ... */ )
        data[0].x++;
}
we got 200% of the 1 thread variant [give or take a few percent]
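For reference, here is a self-contained sketch of that kind of experiment with pthreads; the 64-byte line size, iteration count, and padding are assumptions, and removing the pad member is what re-introduces the false sharing:
#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000L   /* arbitrary iteration count */

/* Two counters forced onto different cache lines; 64 is an assumed line size. */
struct padded { long v; char pad[64 - sizeof(long)]; };

static struct padded shared[2] __attribute__((aligned(64)));

static void *worker(void *arg)
{
    long idx = (long)arg;
    for (long i = 0; i < ITERS; i++)
        shared[idx].v++;
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, (void *)0);
    pthread_create(&t2, NULL, worker, (void *)1);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%ld %ld\n", shared[0].v, shared[1].v);
    return 0;
}
Compile with -pthread and compare the run time with and without the pad member to see the effect described above.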
Right, so the processor has queues of buffers where write operations are stored while the processor is writing to memory. A memory barrier instruction (mfence, sfence or lfence) is there to ensure that any outstanding read/write, write, or read type operation (respectively) has completely finished before the processor proceeds to the next instruction. Normally, the processor would just continue on its jolly way through any following instructions, and eventually the memory operation becomes fulfilled some way or another. Since modern processors have a lot of parallel operations and buffers all over the place, it can take quite some time before something ACTUALLY trickles through to where it will eventually end up. So, when it's CRITICAL to make sure that something has ACTUALLY been done before proceeding (for example, if we have written a bunch of instructions to video memory and we now want to kick off the run of those instructions, we need to make sure that the writing has actually finished and some other part of the processor isn't still working on it), use an sfence to make sure that the write has really happened. That may not be a very realistic example, but I think you get the idea.
Cache writes have to get line-ownership before dirtying the cache line. Depending on the
cache coherence model implemented in the processor architecture, the time taken for this step varies. The most common coherence protocols that I know are:
Snooping Coherence Protocol: all caches monitor address lines for cached memory lines i.e. all memory requests have to be broadcast to all cpus i.e. non-scalable as cpus increase.
Directory-based Coherence Protocol: all cache lines shared among many cpus are kept in a directory; so, invalidating/gaining ownership is a point-to-point cpu request rather than a broadcast, i.e. more scalable, but latency suffers because the directory is a single point of contention.
Most cpu architectures support something called a PMU (performance monitoring unit). This unit exports counters for many things like cache hits, misses, cache write latency, read latency, TLB hits, etc. Please consult the CPU manual to see whether this info is available.
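Not part of the original answer, but for the rdtsc part of the question, a rough x86-only sketch of timing one store with fences (using GCC intrinsics from <x86intrin.h>); the result still includes fence and pipeline overhead, so treat it as a relative number only:
#include <stdint.h>
#include <x86intrin.h>

/* Time a single store to a shared line, with fences so the store
 * (and its coherence traffic) is drained before the second timestamp.
 * This is a sketch; results still include fence/pipeline overhead. */
static inline uint64_t timed_store(volatile int *shared_line, int value)
{
    unsigned int aux;
    _mm_mfence();
    uint64_t t0 = __rdtsc();
    *shared_line = value;
    _mm_mfence();                 /* wait for the store buffer to drain */
    uint64_t t1 = __rdtscp(&aux); /* rdtscp waits for prior instructions */
    return t1 - t0;
}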

reading data from filesystem vs compiling the data directly into program

I have a file (10-20MB) containing data, where each line is a single piece of data.
I have a C program that reads the file from the filesystem, and then, based on command line input, it reads each line of the file, does a calculation on each line to determine if that line should be returned, and then returns a subset of the data.
Assume that the program does an fread and reads the entire file into memory at the beginning, and then parses it directly from memory.
Would the program execute faster if, instead of reading it from the filesystem, I compiled the data into the program directly, by creating an array such as the following?
char *dataArray[] = {"data1", "data2", "data3"....};
Since the OS needs to read the entire binary from the filesystem, my gut feeling is that the execution time of both techniques would be similar, since reading from the filesystem would be the high order bit. However, would anyone have more definitive ideas on this?
Defining everything as a program literal will certainly be faster.
You do not need the relatively slow "open" call for the data file and you don't need to move the data from the buffer to your storage.
This was a common optimization circa 1970, and every programming/coding style book since then strongly recommends you do not do this. The actual performance increase is minimal, and what you gain in performance you lose in maintainability and flexibility.
Should you want a quick maintainable optimisation for this type of problem, then look at the "mmap" call, which makes the buffer directly available to your program and minimises data movement (see the sketch below).
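A minimal sketch of that mmap approach (error handling kept short; the function name and out-parameter are just for illustration):
#include <stddef.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map the data file read-only; the kernel pages it in on demand. */
const char *map_data_file(const char *path, size_t *len_out)
{
    struct stat st;
    const char *p;
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return NULL;
    if (fstat(fd, &st) == -1) {
        close(fd);
        return NULL;
    }
    p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return NULL;
    *len_out = (size_t)st.st_size;
    return p;                       /* munmap((void *)p, len) when done */
}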
I doubt the difference in execution time will be significant, but from a memory utilization standpoint, putting the data in the executable (and qualifying it const appropriately) will make a big difference.
If you read 10-20 megs of data from a file into memory allocated (e.g. via malloc) in your program, the data initially exists in two places in memory: the filesystem cache, and your program's private memory. The former copy can be discarded if memory is tight, but the latter occupies physical memory or swap permanently until it's freed.
If on the other hand the 10-20 megs of data are part of your program's image (in the executable file), the data will be demand-paged, and can be discarded whenever needed because the OS knows it can reload the pages if it needs them again.

Why is sequentially reading a large file row by row with mmap and madvise sequential slower than fgets?

Overview
I have a program bounded significantly by IO and am trying to speed it up.
Using mmap seemed to be a good idea, but it actually degrades the performance relative to just using a series of fgets calls.
Some demo code
I've squeezed the demos down to just the essentials, testing against an 800 MB file with about 3.5 million lines:
With fgets:
char buf[4096];
FILE *fp = fopen(argv[1], "r");
while (fgets(buf, 4096, fp) != 0) {
    // do stuff
}
fclose(fp);
return 0;
Runtime for 800mb file:
[juhani#xtest tests]$ time ./readfile /r/40/13479/14960
real 0m25.614s
user 0m0.192s
sys 0m0.124s
The mmap version:
struct stat finfo;
int fh, len;
char *mem;
char *row, *end;

if (stat(argv[1], &finfo) == -1) return 0;
if ((fh = open(argv[1], O_RDONLY)) == -1) return 0;

mem = (char *)mmap(NULL, finfo.st_size, PROT_READ, MAP_SHARED, fh, 0);
if (mem == (char *)-1) return 0;   /* i.e. MAP_FAILED */

madvise(mem, finfo.st_size, POSIX_MADV_SEQUENTIAL);

row = mem;
while ((end = strchr(row, '\n')) != 0) {
    // do stuff
    row = end + 1;
}

munmap(mem, finfo.st_size);
close(fh);
Runtime varies quite a bit, but never faster than fgets:
[juhani#xtest tests]$ time ./readfile_map /r/40/13479/14960
real 0m28.891s
user 0m0.252s
sys 0m0.732s
[juhani#xtest tests]$ time ./readfile_map /r/40/13479/14960
real 0m42.605s
user 0m0.144s
sys 0m0.472s
Other notes
Watching the process run in top, the memmapped version generated a few thousand page faults along the way.
CPU and memory usage are both very low for the fgets version.
Questions
Why is this the case? Is it just because the buffered file access implemented by fopen/fgets is better than the aggressive prefetching that mmap with madvise POSIX_MADV_SEQUENTIAL?
Is there an alternative method of possibly making this faster (other than on-the-fly compression/decompression to shift the IO load to the processor)? Looking at the runtime of 'wc -l' on the same file, I'm guessing this might not be the case.
POSIX_MADV_SEQUENTIAL is only a hint to the system and may be completely ignored by a particular POSIX implementation.
The difference between your two solutions is that mmap requires the file to be mapped into the virtual address space entirely, whereas fgets has the IO done entirely in kernel space and just copies the pages into a buffer that doesn't change.
This also has more potential for overlap, since the IO is done by some kernel thread.
You could perhaps increase the perceived performance of the mmap implementation by having one (or more) independent threads reading the first byte of each page. This (or these) thread then would have all the page faults and the time your application thread would come at a particular page it would already be loaded.
Reading the man pages of mmap reveals that the page faults could be prevented by adding MAP_POPULATE to mmap's flags:
MAP_POPULATE (since Linux 2.5.46): Populate (prefault) page tables for a mapping. For a file mapping, this causes read-ahead on the file. Later accesses to the mapping will not be blocked by page faults.
This way a page faulting pre-load thread (as suggested by Jens) will become obsolete.
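A sketch of that flag in use (Linux-specific; MAP_POPULATE may need _GNU_SOURCE and a sufficiently recent kernel):
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* MAP_POPULATE asks the kernel to read ahead and prefault the whole
 * mapping up front, trading startup latency for fewer faults during the scan. */
char *map_populated(int fd, size_t len)
{
    char *mem = mmap(NULL, len, PROT_READ,
                     MAP_SHARED | MAP_POPULATE, fd, 0);
    return mem == MAP_FAILED ? NULL : mem;
}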
Edit:
First of all the benchmarks you make should be done with the page cache flushed to get meaningful results:
echo 3 | sudo tee /proc/sys/vm/drop_caches
Additionally: the MADV_WILLNEED advice to madvise will pre-fault the required pages in (same as POSIX_FADV_WILLNEED with fadvise). Currently, unfortunately, these calls block until the requested pages are faulted in, even if the documentation says otherwise. But there are kernel patches underway which queue the pre-fault requests into a kernel work queue to make these calls asynchronous as one would expect - making a separate read-ahead user-space thread obsolete.
What you're doing - reading through the entire mmap space - is supposed to trigger a series of page faults. With mmap, the OS only lazily loads pages of the mmap'd data into memory (it loads them when you access them). That lazy loading is itself an optimization. Although you interface with mmap as if the entire thing were in RAM, it is not all in RAM - it is just a chunk set aside in virtual memory.
In contrast, when you read a file into a buffer, the OS pulls the entire structure into RAM (into your buffer). This can apply a lot of memory pressure, crowding out other pages and forcing them to be written back to disk. It can lead to thrashing if you're low on memory.
A common optimization technique when using mmap is to page-walk the data into memory: loop through the mmap space, incrementing your pointer by the page size, accessing a single byte per page and triggering the OS to pull all of the mmap's pages into memory; triggering all these page faults. This is an optimization technique to "prime the RAM", pulling the mmap in and readying it for future use. With this approach, the OS won't need to do as much lazy loading. You can do this on a separate thread to load the pages in prior to your main thread's access - just make sure you don't run out of RAM or get too far ahead of the main thread, or you'll actually begin to degrade performance. (A sketch of such a priming loop follows.)
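Here is a rough sketch of such a priming loop (my illustration, not the answer author's code); the volatile sink keeps the compiler from dropping the reads:
#include <stddef.h>
#include <unistd.h>

/* Touch one byte per page so the kernel faults the whole mapping in.
 * The volatile sink keeps the compiler from optimizing the reads away. */
void prime_mapping(const char *mem, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    volatile char sink;
    for (size_t off = 0; off < len; off += (size_t)page)
        sink = mem[off];
    (void)sink;
}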
What is the difference between page walking w/ mmap and read() into a large buffer? That's kind of complicated.
Older versions of UNIX, and some current versions, don't always use demand-paging (where the memory is divided up into chunks and swapped in / out as needed). Instead, in some cases, the OS uses traditional swapping - it treats data structures in memory as monolithic, and the entire structure is swapped in / out as needed. This may be more efficient when dealing with large files, where demand-paging requires copying pages into the universal buffer cache, and may lead to frequent swapping or even thrashing. Swapping may avoid use of the universal buffer cache - reducing memory consumption, avoiding an extra copy operation and avoiding frequent writes. Downside is you can't benefit from demand-paging.
With mmap, you're guaranteed demand-paging; with read() you are not.
Also bear in mind that page-walking in a full mmap memory space is always about 60% slower than a flat out read (not counting if you use MADV_SEQUENTIAL or other optimizations).
One note when using mmap w/ MADV_SEQUENTIAL - when you use this, you must be absolutely sure your data IS stored sequentially, otherwise this will actually slow down the paging in of the file by about 10x. Usually your data is not mapped to a continuous section of the disk, it's written to blocks that are spread around the disk. So I suggest you be careful and look closely into this.
Remember, too much data in RAM will pollute the RAM, making page faults a lot more common elsewhere. One common misconception about performance is that CPU optimization is more important than memory footprint. Not true - the time it takes to travel to disk exceeds the time of CPU operations by something like 8 orders of magnitude, even with today's SSDs. Therefore, when program execution speed is a concern, memory footprint and utilization are far more important.
A nice thing about read() is the data can be stored on the stack (assuming the stack is large enough), which will further speed up processing.
Using read() with a streaming approach is a good alternative to mmap, if it fits your use case. This is kind of what you're doing with fgets/fputs (fgets/fputs is internally implemented with read). Here what you do is, in a loop, read into a buffer, process the data, & then read in the next section / overwrite the old data. Streaming like this can keep your memory consumption very low, and can be the most efficient way of doing I/O. The only downside is that you never have the entire file in memory at once, and it doesn't persist in memory. So it's a one-off approach. If you can use it - great, do it. If not... use mmap.
So whether read or mmap is faster... it depends on many factors. Testing is probably what you need to do. Generally speaking, mmap is nice if you plan on using the data for an extended period, where you will benefit from demand-paging; or if you just can't handle that amount of data in memory at once. Read() is better if you are using a streaming approach - the data doesn't have to persist, or the data can fit in memory so memory pressure isn't a concern. Also if the data won't be in memory for very long, read() may be preferable.
Now, with your current implementation - which is a sort of streaming approach - you are using fgets() and stopping on \n. Large, bulk reads are more efficient than making millions of tiny line-by-line calls (which is what the fgets loop is doing). You don't have to use a giant buffer - you don't want excess memory pressure (which can pollute your cache and other things), and the system also has some internal buffering it uses. But you do want to be reading into a buffer of, let's say, 64k in size. You definitely don't want to be reading line by line. (A sketch of such a read loop follows.)
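A bare-bones sketch of that chunked read() loop (64 KB is the assumed buffer size; carrying a partial line across chunks -- the "buffer-stitching" mentioned below -- is left to the parser):
#include <fcntl.h>
#include <unistd.h>

#define CHUNK (64 * 1024)   /* assumed buffer size */

void stream_file(const char *path)
{
    static char buf[CHUNK];
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return;
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0) {
        /* parse the n bytes in buf here; carry any partial trailing
         * line over to the next iteration ("buffer-stitching") */
    }
    close(fd);
}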
You could multithread the parsing of that buffer. Just make sure the threads access data in different cache blocks - so find the size of the cache block, get your threads working on different portions of the buffer distanced by at least the cache block size.
Some more specific suggestions for your particular problem:
You might try reformatting the data into some binary format. For example, try changing the file encoding to a custom format instead of UTF-8 or whatever it is. That could reduce its size. 3.5 million lines is quite alot of characters to loop through... it's probably ~150 million character comparisons that you are doing.
If you can sort the file by line length prior to the program running... you can write an algorithm to much more quickly parse the lines - just increment a pointer and test the character you arrive at, making sure it's '\n'. Then do whatever processing you need to do.
You'll need to find a way to maintain the sorted file by inserting new data into appropriate places with this approach.
You can take this a step further - after sorting your file, maintain a list of how many lines of a given length are in the file. Use that to guide your parsing of lines - jump right to the end of each line w/out having to do character comparisons.
If you can't sort the file, just create a list of all the offsets from the start of each line to its terminating newline - 3.5 million offsets (a small sketch of building such an index follows the next point).
Write algorithms to update that list on insertion/deletion of lines from the file
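A minimal sketch of building that offset index in memory (the fixed capacity is an arbitrary assumption; a real version would grow the array):
#include <stddef.h>

#define MAX_LINES 4000000   /* assumed capacity, > 3.5 million lines */

/* Record the offset of the first character of every line in buf.
 * The caller supplies an offsets array with room for MAX_LINES entries. */
size_t build_line_index(const char *buf, size_t len, size_t *offsets)
{
    size_t count = 0, pos = 0;
    while (pos < len && count < MAX_LINES) {
        offsets[count++] = pos;
        while (pos < len && buf[pos] != '\n')
            pos++;
        pos++;                       /* skip the newline */
    }
    return count;                    /* number of lines indexed */
}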
When you get into file processing algorithms such as this... it begins to resemble the implementation of a noSQL database. An alternative might just be to insert all this data into a noSQL database. Depends on what you need to do: believe it or not, sometimes just raw custom file manipulation & maintenance described above is faster than any database implementation, even noSQL databases.
A few more things:
When you use this streaming approach with read() you must take care to handle the edge cases - where you reach the end of one buffer, and start a new buffer - appropriately. That's called buffer-stitching.
Lastly, on most modern systems when you use read() the data still gets stored in the universal buffer cache and then copied into your process. That's an extra copy operation. You can disable the buffer cache to speed up the IO in certain cases where you're handling big files. Beware, this will disable paging. But if the data is only in memory for a brief time, this doesn't matter.
The buffer cache is important - find a way to reenable it after the IO was finished. Maybe disable it just for the particular process, do your IO in a separate process, or something... I'm not sure about the details, but this is something that can be done.
I don't think that's actually your problem, though; tbh I think it's the character comparisons - once you fix that, it should be fine.
That's the best I've got, maybe the experts will have other ideas.
Carry onward!