Store read/write data in assembly similar to filesystem - file

I am creating a custom operating system. Is there any way to store data (almost like a filesystem) in assembly, so that if the computer shuts off and turns back on, the data will still be there?

You can write device drivers for SATA hard drives, USB mass storage, floppy disks, NVMe flash, or whatever else in asm. You might also be able to use BIOS functions to access them (especially if you're on x86). But then you have to manage writes in chunks of 512B or 4096B, because those kinds of storage are block-based.
A more literal interpretation of the question has an interesting answer: can a store instruction like mov [mem], eax put data into persistent storage where a load instruction can get it later (after a power cycle)?
Yes, if your hardware has some memory-mapped non-volatile RAM. (Physically memory-mapped NVRAM like an NVDIMM, not like mmap() to logically map a file into the virtual memory address space of a process). See this answer on Superuser about Intel Optane DC Persistent Memory
x86 for example has recently gotten more instructions to support NVRAM, like clwb to write-back a cache line (all the way to memory) without necessarily evicting it. Early implementations of clwb may just run it like clflushopt, though: @Ana reports that Skylake-X does evict.
Also, clflushopt is a more efficient way than legacy clflush to force multiple cache lines out to memory. Use a memory barrier like sfence after weakly-ordered flushes like clflushopt to make sure the data has reached non-volatile RAM before any further writes that depend on it.
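As a minimal sketch (assuming the platform maps a region of NVRAM into the physical address space and the CPU supports clflushopt; compile with something like -mclflushopt), a persistent store in C could look like this:

    #include <immintrin.h>
    #include <stdint.h>

    /* Sketch only: `nvram` is assumed to point into memory-mapped NVRAM
       (e.g. an NVDIMM region); CLFLUSHOPT support is also an assumption. */
    static void persist_store(uint64_t *nvram, uint64_t value)
    {
        *nvram = value;              /* ordinary store into the mapped NVRAM */
        _mm_clflushopt(nvram);       /* force the cache line out toward memory
                                        (CLWB would do the same without evicting) */
        _mm_sfence();                /* order the flush before later stores, so the
                                        data is durable before anything that depends on it */
    }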
For a while Intel was going to require pcommit as part of making sure data had hit non-volatile storage, but decided against it. With that in mind, see Why Intel added the CLWB and PCOMMIT instructions for more details about using persistent RAM.
IDK what the situation is on architectures other than x86, but presumably NV RAM is / will be usable with ARM and other CPUs, too.

Related

How does kernel restrict processes to their own memory pool?

This is a purely academic question, not related to any particular OS.
We have an x86 CPU and working memory. This memory resembles a pool of addressable units that can be read or written using their addresses via the CPU's MOV instruction (we can move data from / to this memory pool).
Given that our program is the kernel, we have full access to this whole memory pool. However, if our program is not running directly on the hardware, the kernel creates some "virtual" memory pool which lies somewhere inside the physical memory pool; our process treats it just like the physical memory pool and can write to it, read from it, or change its size, usually by calling something like sbrk or brk (on Linux).
My question is, how is this virtual pool implemented? I know I could read the whole Linux source code and maybe find it within a year, but I can also ask here :)
I suppose that one of these 3 potential solutions is being used:
Interpret the program's instructions (very inefficient and unlikely): the kernel would just read the program's machine code and interpret each instruction individually; e.g. if it saw a request to access memory the process isn't allowed to touch, it wouldn't allow it.
Create some OS-level API that would have to be used in order to read / write memory, and disallow access to raw memory, which is probably just as inefficient.
Hardware feature (probably best, but I have no idea how it works): the kernel would say "dear CPU, I am now going to run instructions from some unprivileged process; please restrict its memory accesses to the range 0x00ABC023 - 0xDEADBEEF", and the CPU wouldn't let the user process touch anything outside the range approved by the kernel.
The reason I am asking is to understand whether there is any overhead in running a program unprivileged under a kernel (let's not consider overhead caused by multithreading implemented by the kernel itself) compared to running it natively on the CPU (with no OS), as well as the overhead in memory access caused by virtualization, which probably uses a similar technique.
You're on the right track when you mention a hardware feature. This is a feature known as protected mode and was introduced to x86 by Intel on the 80286 model. That evolved and changed over time, and currently x86 has 4 modes.
Processors start running in real mode, and later privileged software (ring 0, your kernel for example) can switch between these modes.
The virtual addressing is implemented and enforced using the paging mechanism (How does x86 paging work?) supported by the processor.
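To make the paging idea concrete, here is a toy sketch (invented for illustration, not real kernel or MMU code) of the two-level translation a classic 32-bit x86 MMU performs in hardware: 10 bits of page-directory index, 10 bits of page-table index, and a 12-bit offset:

    #include <stdint.h>
    #include <stdbool.h>

    #define PRESENT 0x1u

    typedef struct {
        uint32_t entries[1024];        /* each entry: physical frame base | flags */
    } page_table_t;

    typedef struct {
        page_table_t *tables[1024];
        uint32_t flags[1024];
    } page_directory_t;

    /* Returns true and fills *phys if the virtual address is mapped;
       a real CPU would raise a page fault to the kernel instead of returning false. */
    static bool translate(const page_directory_t *dir, uint32_t virt, uint32_t *phys)
    {
        uint32_t pd_index = virt >> 22;
        uint32_t pt_index = (virt >> 12) & 0x3FF;
        uint32_t offset   = virt & 0xFFF;

        if (!(dir->flags[pd_index] & PRESENT))
            return false;                              /* no page table here: fault */
        uint32_t pte = dir->tables[pd_index]->entries[pt_index];
        if (!(pte & PRESENT))
            return false;                              /* page not mapped: fault */

        *phys = (pte & ~0xFFFu) | offset;              /* frame base + offset */
        return true;
    }

Only the kernel gets to fill in these structures; the user process just issues loads and stores and the hardware does this lookup on every access (with TLB caching to keep it fast).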
On a normal system, memory protection is enforced at the MMU, or memory management unit, which is a hardware block that configurably maps virtual to physical addresses. Only the kernel is allowed to directly configure it, and operations which are illegal or go to unmapped pages raise exceptions to the kernel, which can then discipline the offending process or fetch the missing page from disk as appropriate.
A virtual machine typically uses CPU hardware features to trap and emulate privileged operations, or those which would interact too directly with hardware state, while allowing ordinary operations to run natively, so the overall speed penalty is moderate. If those features are unavailable, the whole thing must be emulated, which is indeed slow.

How to selectively store a variable in memory segments using C?

We know that there are levels in the memory hierarchy:
cache, primary storage, and secondary storage.
Can we use a C program to selectively store a variable in a specific level of the memory hierarchy?
Reading your comments in the other answers, I would like to add a few things.
Inside an operating system you cannot restrict which level of the memory hierarchy your variables will be stored in, since the one who controls the metal is the operating system, and it forces you to play by its rules.
Despite this, you can do something that MAY get you close to measuring access time in the cache (mostly L1, depending on your test algorithm) and in RAM.
To test cache access: warm up by accessing a variable a few times. Then access the same variable a lot of times (cache is super fast, so you need a lot of accesses to measure the access time); a rough timing harness for this is sketched after this list.
To test the main memory (aka RAM): disable the cache memory in the BIOS and run your code.
To test secondary memory (aka disk): disable the disk cache for a given file (you can ask your operating system to do this; Google for the details) and repeatedly read some data from the disk, always from the same position. This may or may not work, depending on how far your OS lets you disable the disk cache.
To test other levels of memory, you must implement your own "Test Operating System", and even with that it may not be possible to disable some caching mechanisms due to hardware limitations (well, not actually limitations...).
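A rough sketch of the warm-up-then-measure idea above (plain C, timed with clock_gettime; it only gives a crude per-access average and does not isolate a single cache level):

    #include <stdio.h>
    #include <stdint.h>
    #include <time.h>

    #define WARMUP   1000
    #define ACCESSES 100000000UL

    int main(void)
    {
        volatile uint64_t var = 0;
        struct timespec start, end;

        for (int i = 0; i < WARMUP; i++)         /* warm up: pull the line into cache */
            var += 1;

        clock_gettime(CLOCK_MONOTONIC, &start);
        for (uint64_t i = 0; i < ACCESSES; i++)  /* many accesses so the total is measurable */
            var += 1;
        clock_gettime(CLOCK_MONOTONIC, &end);

        double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
        printf("average per access: %.3f ns\n", ns / ACCESSES);
        return 0;
    }

Running the same program again with the cache disabled in the BIOS (as suggested above) gives you the RAM-side number to compare against.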
Hope I helped.
Not really. Cache is designed to work transparently. Almost anything you do will end up in cache, because it's being operated on at the moment.
As for secondary storage, I assume you mean HDD, file, cloud, and so on. Nothing really ever gets stored there unless you do so explicitly, or set up a memory mapped region, or something gets paged to disk.
No. A normal computer program only has access to main memory. Even secondary storage (disk) is usually only available via operating system services, typically used via the stdio part of the library. There's basically no direct way to inspect or control the hierarchical caches closer to the CPU.
That said, there are cache profilers (like Valgrind's cachegrind tool) which give you an idea about how well your program uses a given type of cache architecture (as well as branch prediction), and which can be very useful to help you spot code paths that have poor cache performance. They do this essentially by emulating the hardware.
There may also be architecture-specific instructions that give you some control over caching (such as "nontemporal hints" on x86, or "prefetch" instructions), but those are rare and peculiar, and not commonly exposed to C program code.
It depends on the specific architecture and compiler that you're using.
For example, on x86/x64, most compilers expose the prefetch instructions, which come in several levels and hint to the CPU that a cache line should be moved from DRAM to a specific level in the cache hierarchy (or from a higher-order cache, e.g. from L3 to L2).
On some CPUs, non-temporal prefetch instructions are available that, when combined with non-temporal read instructions, allow you to bypass the caches and read directly into a register. (Some CPUs implement non-temporals by forcing the data into a specific way of the cache, though, so you have to read the docs.)
In general though, between L1 and L2 (or L3, or L4, or DRAM) it's a bit of a black box. You can't specifically store a value in one of these - some caches are inclusive of each other (so if a value is in L1, it's also in L2 and L3), some are not. And the caches are designed to drain over time - so if a write goes to L1, it eventually works its way out to L2, then L3, then DRAM - especially in multi-core architectures with strong memory models.
You can bypass them entirely on write (use streaming-store or mark the memory as write-combining).
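As a hedged sketch of those hints, using x86 intrinsics that most compilers expose (assuming SSE support, a 16-byte-aligned destination, and a length that is a multiple of 4 floats; whether the hints actually help depends on the microarchitecture):

    #include <immintrin.h>
    #include <stddef.h>

    /* Copy with a prefetch hint on the source and non-temporal (streaming)
       stores on the destination, so the written data bypasses the caches. */
    void copy_with_hints(float *dst, const float *src, size_t n)
    {
        for (size_t i = 0; i < n; i += 4) {
            _mm_prefetch((const char *)&src[i + 64], _MM_HINT_T0); /* hint: pull a line toward L1 */
            __m128 v = _mm_loadu_ps(&src[i]);
            _mm_stream_ps(&dst[i], v);   /* streaming store: write-combining, bypasses the caches */
        }
        _mm_sfence();                    /* make the streaming stores globally visible */
    }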
You can measure the different access times by:
Using memory-mapped files as a backing store for your data (this measures the time for the first access to reach the CPU - just wrap it in a timer call, such as QueryPerformanceCounter or __rdtscp). Be sure to unmap and close the file between each test, and turn off any caching. It'll take a while.
Flushing the cache between accesses to get the time-to-access from DRAM.
It's harder to measure the difference between the cache levels, but if your architecture supports it, you can prefetch into a specific cache level, spin in a loop for some period of time, and then time the access.
It's going to be hard to do this on a system using a commercial OS, because they tend to do a LOT of work all the time, which will disturb your measurements.
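To illustrate the flush-then-measure approach from the list above, here is a rough sketch using _mm_clflush and __rdtscp (cycle counts are noisy; in practice you would repeat the measurement many times and take the minimum):

    #include <x86intrin.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Time a single load of *p in TSC cycles, optionally flushing it from the
       cache hierarchy first so the load has to come from DRAM. */
    static uint64_t time_one_access(volatile uint64_t *p, int flush)
    {
        unsigned aux;
        if (flush) {
            _mm_clflush((const void *)p);   /* evict the line from every cache level */
            _mm_mfence();                   /* wait for the flush to complete */
        }
        uint64_t start = __rdtscp(&aux);
        (void)*p;                           /* the access being timed */
        uint64_t end = __rdtscp(&aux);
        return end - start;
    }

    int main(void)
    {
        static uint64_t data = 42;
        printf("flushed (DRAM-ish): %llu cycles\n",
               (unsigned long long)time_one_access(&data, 1));
        printf("cached:             %llu cycles\n",
               (unsigned long long)time_one_access(&data, 0));
        return 0;
    }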

How can I use DMA in linux kernel? [duplicate]

I am using memcpy() in my program. As I increase the number of variables, the CPU usage unfortunately increases; it is as if memcpy were implemented as a plain for loop. Is there a faster memcpy function on Linux? Should I apply a patch and recompile the kernel?
There are architectures where the bus between the CPU and memory is rather weak; some of those architectures add a DMA engine to allow big blocks of memory to be copied without having a loop running on the CPU.
In Linux, you would be able to access the DMA engine with the dmaengine subsystem, but it is very hardware-dependent whether such an engine is actually available.
x86 CPUs have a good memory subsystem, and also have special hardware support for copying large blocks, so using a DMA engine would be very unlikely to actually help.
(Intel added a DMA engine called I/OAT to some server boards, but the overall results were not much better than plain CPU copies.)
DMA forces the data out of the CPU caches, so doing DMA copies for your program's variables would be utterly pointless because the first CPU access afterwards would have to read them back into the cache.
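For completeness, here is a very rough sketch of what the dmaengine flow mentioned above looks like from kernel code, assuming the platform actually exposes a memcpy-capable DMA channel (error handling and the dma_map_single() setup for the source and destination buffers are omitted); note this is kernel-side code and will not speed up an ordinary user-space memcpy():

    #include <linux/dmaengine.h>
    #include <linux/dma-mapping.h>
    #include <linux/errno.h>

    /* Sketch only: request any memcpy-capable channel, queue one copy,
       kick it off, and busy-wait for completion. */
    static int dma_copy_example(dma_addr_t dst, dma_addr_t src, size_t len)
    {
        dma_cap_mask_t mask;
        struct dma_chan *chan;
        struct dma_async_tx_descriptor *tx;
        dma_cookie_t cookie;

        dma_cap_zero(mask);
        dma_cap_set(DMA_MEMCPY, mask);
        chan = dma_request_channel(mask, NULL, NULL);   /* any channel that can do memcpy */
        if (!chan)
            return -ENODEV;

        tx = dmaengine_prep_dma_memcpy(chan, dst, src, len, DMA_PREP_INTERRUPT);
        if (!tx) {
            dma_release_channel(chan);
            return -EIO;
        }
        cookie = dmaengine_submit(tx);
        dma_async_issue_pending(chan);                  /* actually start the transfer */
        dma_sync_wait(chan, cookie);                    /* poll until it completes */

        dma_release_channel(chan);
        return 0;
    }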

How to use more than 3 GB in a process on 32-bit PAE-enabled Linux app?

PAE (Physical Address Extension) was introduced in CPUs back in 1994. This allows a 32-bit processor to access 64 GB of memory instead of 4 GB. Linux kernels offer support for this starting with 2.3.23. Assume I am booting one of these kernels, and want to write an application in C that will access more than 3 GB of memory (why 3 GB? See this).
How would I go about accessing more than 3 GB of memory? Certainly, I could fork off multiple processes; each one would get access to 3 GB, and they could communicate with each other. But that's not a realistic solution for most use cases. What other options are available?
Obviously, the best solution in most cases would be to simply boot in 64-bit mode, but my question is strictly about how to make use of physical memory above 4 GB in an application running on a PAE-enabled 32-bit kernel.
You don't, directly -- as long as you're running on 32-bit, each process will be subject to the VM split that the kernel was built with (2GB, 3GB, or if you have a patched kernel with the 4GB/4GB split, 4GB).
One of the simplest ways to have a process work with more data and still keep it in RAM is to create a shmfs and then put your data in files on that fs, accessing them with the ordinary seek/read/write primitives, or mapping them into memory one at a time with mmap (which is basically equivalent to doing your own paging). But whatever you do it's going to take more work than using the first 3GB.
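A minimal sketch of that shmfs-plus-mmap-window idea, assuming a tmpfs/shmfs is mounted at /dev/shm and is large enough to hold the data (the path and sizes here are made up for illustration):

    #define _FILE_OFFSET_BITS 64      /* 64-bit off_t so offsets past 4 GB work on 32-bit */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define WINDOW (256UL * 1024 * 1024)   /* map 256 MiB of the data at a time */

    int main(void)
    {
        const char *path = "/dev/shm/bigdata";      /* assumed shmfs/tmpfs mount */
        off_t total = 6LL * 1024 * 1024 * 1024;     /* 6 GiB: more than a 32-bit process can map */
        int fd = open(path, O_RDWR | O_CREAT, 0600);
        if (fd < 0 || ftruncate(fd, total) != 0) { perror("setup"); return 1; }

        for (off_t off = 0; off < total; off += WINDOW) {
            char *win = mmap(NULL, WINDOW, PROT_READ | PROT_WRITE, MAP_SHARED, fd, off);
            if (win == MAP_FAILED) { perror("mmap"); return 1; }
            win[0] = 1;                              /* work on this window of the data */
            munmap(win, WINDOW);                     /* unmap before sliding to the next window */
        }
        close(fd);
        return 0;
    }

The data stays resident in RAM on the shmfs, while only one 256 MiB slice occupies the process's address space at a time - which is exactly the "doing your own paging" trade-off described above.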
Or you could fire up as many instances of memcached as needed until all physical memory is mapped. Each memcached instance could make 3GiB available on a 32 bit machine.
Then access memory in chunks via the APIs and language bindings for memcached. Depending on the application, it might be almost as fast as working on a 64-bit platform directly. For some applications you get the added benefit of creating a scalable program. Not many motherboards handle more than 64GiB RAM but with memcached you have easy access to as much RAM as you can pay for.
Edited to note that this approach of course works on Windows too, or on any platform which can run memcached.
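A small sketch of the memcached approach above, using the libmemcached C client (the server address, key name, and chunk here are made up; splitting your data set into values the daemon will accept is up to you):

    #include <libmemcached/memcached.h>
    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>

    int main(void)
    {
        memcached_st *memc = memcached_create(NULL);
        memcached_server_add(memc, "localhost", 11211);   /* one of the memcached instances */

        const char *key = "chunk:0";
        const char *chunk = "some block of data";         /* in practice, one chunk of your data set */
        memcached_set(memc, key, strlen(key), chunk, strlen(chunk), (time_t)0, (uint32_t)0);

        size_t len; uint32_t flags; memcached_return_t rc;
        char *back = memcached_get(memc, key, strlen(key), &len, &flags, &rc);
        if (rc == MEMCACHED_SUCCESS) {
            printf("got %zu bytes back\n", len);
            free(back);                                    /* memcached_get returns malloc'd memory */
        }
        memcached_free(memc);
        return 0;
    }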
PAE is an extension of the hardware's address bus, plus some page-table modifications to handle it. It doesn't change the fact that a pointer is still 32 bits, limiting you to 4 GB of address space in a single process. Honestly, in the modern world the proper way to write an application that needs more than 2 GB (Windows) or 3 GB (Linux) of address space is simply to target a 64-bit platform.
On Unix, one way to access that more-than-32-bit-addressable memory from user space is by using mmap/munmap if/when you want to access a subset of the memory that isn't currently mapped - kind of like manual paging. Another (easier) way is to implicitly use the memory by having different subsets of it live in multiple processes (if your code has a multi-process architecture).
The mmap method is essentially the same trick Commodore 128 programmers used for bank switching. In these post-Commodore-64 days, with 64-bit support so readily available, there aren't many good reasons to even think about it ;)
I had fun deleting all the hideous PAE code from our product a number of years ago.
You can't have pointers pointing to > 4G of address space, so you'd have to do a lot of tricks.
It should be possible to switch a block of address space between different physical pages by using mmap to map bits of a large file; you can change the mapping at any time by another call to mmap to change the offset into the file (in multiples of the OS page size).
However this is a really nasty technique and should be avoided. What are you planning on using the memory for? Surely there is an easier way?
Obviously, the best solution in most cases would be to simply boot in 64-bit mode, but my question is strictly about how to make use of physical memory above 4 GB in an application running on a PAE-enabled 32-bit kernel.
There's nothing special you need to do. Only the kernel needs to address physical memory, and with PAE, it knows how to address physical memory above 4 GB. The application will end up using physical memory above 4 GB automatically and with no issues - though each individual process is still limited to its own roughly 3 GB of virtual address space.

Resources