I have a situation where I have around 100,000 registers in a uvm_reg_block. Three drivers can drive transactions to these registers. As per standard UVM RAL methodology, I understand we need three separate uvm_reg_maps, each connected to one of the three sequencers. The problem is that duplicating the registers across all three uvm_reg_maps eats up CPU time: it takes an hour to even enter the data phase. Can you help me solve this? Is there a way to connect all three sequencers to one uvm_reg_map and, based on an argument, decide which physical sequencer it should pick?
Thanks in advance
A uvm_reg_map can only work with one sequencer.
You mention that creating multiple reg maps is too slow because of the add_reg(...) calls. It might be possible to separate the register map specification aspect (at what addresses the registers sit) from the sequencer aspect. For this you would need one uvm_reg_map instance on which you do your add_reg(...) calls; let's call this the specification map. For each sequencer you want to drive register accesses on, you would need another uvm_reg_map (sub-class) that points to the specification map; let's call these driving maps.
I don't have any code for this at the moment. One would need to look at how uvm_reg_map is called by other code and override those functions: instead of calling the implementations in uvm_reg_map, which deal with its own register storage, they would delegate to the specification map and interrogate it using get_reg(...) and so on. This might not work if the functions are not declared virtual in uvm_reg_map; UVM has a tendency to make extension impossible because code relies on implementations instead of abstractions.
From what I know, you can't. If one reg map were connected to more than one sequencer, how would you later choose which sequencer to run on? In addition, for each added reg map, only the map's handle is stored inside the uvm_reg's m_maps array via map.add_reg(); it is not as if each map creates its own registers, so the registers are not duplicated.
Another way is to create a driver that uses all three sequencers/agents (let's call it reg_driver).
reg_driver will have one sequencer that receives generic register transactions.
Based on a run-time switch, reg_driver selects which interface's register sequence drives a particular transaction.
Task
I have a small kernel module I wrote for my Raspberry Pi 2 which implements an additional system call for generating power consumption metrics. I would like to modify the system call so that it only gets invoked if a special user (such as "root" or user "pi") issues it. Otherwise, the call just skips the bulk of its body and returns success.
Background Work
I've read into the issue at length, and I've found a similar question on SO, but there are numerous problems with it, from my perspective (noted below).
Question
The linked question notes that struct task_struct contains a pointer element to struct cred, as defined in linux/sched.h and linux/cred.h. The latter of the two headers doesn't exist on my system(s), and the former doesn't show any declaration of a pointer to a struct cred element. Does this make sense?
Silly mistake: this is present in its entirety in the kernel headers (i.e., /usr/src/linux-headers-$(uname -r)/include/linux/cred.h); I was searching the gcc build headers in /usr/include/linux instead.
Even if the above worked, it doesn't mention whether I would be getting the real, effective, or saved UID for the process. Is it even possible to get each of these three values from within the system call?
cred.h already contains all of these.
Is there a safe way in the kernel module to quickly determine which groups the user belongs to without parsing /etc/group?
cred.h already contains all of these.
Update
So, the only valid question remaining is the following:
Note that iterating through processes and reading a process's credentials should be done under an RCU critical section.
... how do I ensure my check is run in this critical section? Are there any working examples of how to accomplish this? I've found some existing kernel documentation that instructs readers to wrap the relevant code with rcu_read_lock() and rcu_read_unlock(). Do I just need to wrap any read operations against the struct cred and/or struct task_struct data structures?
First, adding a new system call is rarely the right way to do things. It's best to do things via the existing mechanisms because you'll benefit from already-existing tools on both sides: existing utility functions in the kernel, existing libc and high-level language support in userland. Files are a central concept in Linux (like other Unix systems) and most data is exchanged via files, either device files or special filesystems such as proc and sysfs.
I would like to modify the system call so that it only gets invoked if a special user (such as "root" or user "pi") issues it.
You can't do this in the kernel. Not only is it wrong from a design point of view, it isn't even possible: the kernel knows nothing about user names. The only knowledge about users in the kernel is that some privileged actions are reserved to user 0 in the root namespace (don't forget that last part! If that's new to you, it's a sign that you shouldn't be doing advanced things like adding system calls). (Many actions actually check for a capability rather than for being root.)
What you want to use is sysfs. Read the kernel documentation and look for non-ancient online tutorials or existing kernel code (code that uses sysfs is typically pretty clean nowadays). With sysfs, you expose information through files under /sys. Access control is up to userland — have a sane default in the kernel and do things like calling chgrp, chmod or setfacl in the boot scripts. That's one of the many wheels that you don't need to reinvent on the user side when using the existing mechanisms.
The sysfs show method automatically takes a lock around the file, so only one kernel thread can be executing it at a time. That's one of the many wheels that you don't need to reinvent on the kernel side when using the existing mechanisms.
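For illustration, here is a minimal, untested sketch of a module exposing one read-only value this way; the power_metric name and the sysfs directory are made up, and sysfs_emit() assumes a reasonably recent kernel (use scnprintf on older ones):

#include <linux/init.h>
#include <linux/kobject.h>
#include <linux/module.h>
#include <linux/sysfs.h>

static int power_metric;    /* whatever your module measures */

static ssize_t power_metric_show(struct kobject *kobj,
                                 struct kobj_attribute *attr, char *buf)
{
    return sysfs_emit(buf, "%d\n", power_metric);
}

static struct kobj_attribute power_attr = __ATTR_RO(power_metric);
static struct kobject *demo_kobj;

static int __init demo_init(void)
{
    /* Creates /sys/kernel/power_metric_demo/power_metric. */
    demo_kobj = kobject_create_and_add("power_metric_demo", kernel_kobj);
    if (!demo_kobj)
        return -ENOMEM;
    return sysfs_create_file(demo_kobj, &power_attr.attr);
}

static void __exit demo_exit(void)
{
    kobject_put(demo_kobj);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");

Access control then happens in userland with chmod/chgrp on the resulting file.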
The linked question concerns a fundamentally different issue. To quote:
Please note that the uid that I want to get is NOT of the current process.
Clearly, a thread which is not the currently executing thread can in principle exit at any point or change its credentials. Measures need to be taken to ensure the stability of whatever we are fiddling with, and RCU is often the right answer. The answer provided there is somewhat wrong in the sense that there are other ways as well.
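For the RCU case, reading another task's credentials might look roughly like this (a sketch; the helper name is made up, and __task_cred() is only valid inside rcu_read_lock()/rcu_read_unlock()):

#include <linux/cred.h>
#include <linux/rcupdate.h>
#include <linux/sched.h>

static kuid_t example_task_ruid(struct task_struct *task)
{
    kuid_t uid;

    rcu_read_lock();
    /* __task_cred() returns the task's objective credentials and must
       only be dereferenced inside the RCU read-side critical section. */
    uid = __task_cred(task)->uid;
    rcu_read_unlock();

    return uid;
}

The kernel's own task_uid() and task_euid() helpers wrap exactly this pattern.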
Meanwhile, if you want to operate on the thread executing the very code, you know it won't exit (because it is executing your code, as opposed to an exit path). The question arises: what about the stability of its credentials? Good news: they are also guaranteed to be there and can be accessed with no preparation whatsoever. This can easily be verified by checking the code that does credential switching.
We are left with the question of what primitives can be used to do the access. To that end, one can use make_kuid, uid_eq and similar primitives.
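For instance, a check on the calling thread could look like this (a sketch; the function name is made up, and no locking is needed because these are current's own credentials):

#include <linux/cred.h>
#include <linux/types.h>
#include <linux/uidgid.h>

static bool example_caller_is_root(void)
{
    /* Compare the caller's effective UID against UID 0 as mapped in
       the caller's user namespace. */
    return uid_eq(current_euid(), make_kuid(current_user_ns(), 0));
}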
The real question is why is this a syscall as opposed to just a /proc file.
See this blog post for a somewhat more elaborate description of credential handling: http://codingtragedy.blogspot.com/2015/04/weird-stuff-thread-credentials-in-linux.html
Assume that a large file is saved on disk and I want to run a computation on every chunk of data contained in the file.
The C/C++ code that I would write to do so would load part of the file, then do the processing, then load the next part, then do the processing of this next part, and so on.
If, however, I am interested in doing this in the shortest possible time, I could instead do the following: first, tell the DMA controller to load the first part of the file. When this part is loaded, tell the DMA controller to load the second part (into some other part of memory) and then immediately start processing the first part.
If I get an interrupt from the DMA while processing the first part, I finish the first part and afterwards tell the DMA to overwrite it with the third part of the file; then I process the second part.
If I do not get an interrupt from the DMA while processing the first part, I finish the first part and wait for the interrupt from the DMA.
Depending on how long the processing takes relative to the disk read, this should be up to twice as fast. In reality, of course, one would have to measure. But that is not the question I am asking.
The question is: is it possible to do this (a) in C using some non-standard extension, or (b) in assembly? Or do operating systems generally not allow such things? The question is meant primarily in a single-threaded context, although I would also be interested to know how to do it with two threads. Also, I am not after specific code; this is more of a theoretical question.
You're right that you will not get the benefit of this by default, because a blocking read stops your thread from doing any processing. Hans is right that modern OSes already take care of all the little details of DMA and interrupt completion routines.
You need to use the architecture you've described: issuing a request in advance of when you will use the data. Issue asynchronous I/O requests (on Windows these are called OVERLAPPED). Then the flow will go exactly as you envision, but the DMA and interrupts are handled in the drivers.
On Windows, take a look at FILE_FLAG_OVERLAPPED (to CreateFile) and ReadFile (if you like events) or ReadFileEx (if you like callbacks). If you don't have to process the data in any particular order, then add a completion port to the mix, which queues the completion responses.
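The event flavor might look roughly like this (an untested sketch; the file name and buffer size are placeholders):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    HANDLE f = CreateFileA("large_file.bin", GENERIC_READ, FILE_SHARE_READ,
                           NULL, OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
    if (f == INVALID_HANDLE_VALUE) return 1;

    static char buf[1 << 20];
    OVERLAPPED ov = {0};
    ov.hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);

    /* Issue the read; it returns immediately while the driver does DMA. */
    if (!ReadFile(f, buf, sizeof buf, NULL, &ov) &&
        GetLastError() != ERROR_IO_PENDING) return 1;

    /* ... process the previous chunk here while the read is in flight ... */

    DWORD n;
    GetOverlappedResult(f, &ov, &n, TRUE);   /* wait for completion */
    printf("read %lu bytes\n", (unsigned long)n);
    CloseHandle(ov.hEvent);
    CloseHandle(f);
    return 0;
}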
On Linux, OS X, and many other Unix-like OSes, look at aio_read. Or posix_fadvise. Or use mmap with madvise.
And you can get these benefits without even writing native code. .NET recently added the ReadAsync method to its FileStream, which can be used with continuation-passing style in the form of Task objects, with async/await syntactic sugar in the C# compiler.
Typically, in a multi-mode (user/system) operating system, you do not have access to direct DMA or to interrupts. In systems that extend those features from kernel (system) mode down to user mode, the overhead eliminates the benefit of using them.
Ignoring that what you're asking to do requires a very specialized environment to support it, the idea is sound and common: declare two (or more) buffers so the DMA can fill the next one while you process the first. When two buffers are used, they're sometimes referred to as ping-pong buffers.
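To make the ping-pong idea concrete, here is an untested sketch using POSIX AIO on Linux; the file name, chunk size, and process() body are placeholders (link with -lrt):

#include <aio.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (1 << 20)

static char buf[2][CHUNK];   /* the two ping-pong buffers */

static void process(const char *p, ssize_t n) { (void)p; (void)n; }

static void start_read(struct aiocb *cb, int fd, void *b, off_t off)
{
    memset(cb, 0, sizeof *cb);
    cb->aio_fildes = fd;
    cb->aio_buf = b;
    cb->aio_nbytes = CHUNK;
    cb->aio_offset = off;
    aio_read(cb);                        /* queue the request and return */
}

int main(void)
{
    int fd = open("large_file.bin", O_RDONLY);
    if (fd < 0) return 1;

    struct aiocb cb[2];
    off_t off = 0;
    int cur = 0;

    start_read(&cb[cur], fd, buf[cur], off);
    for (;;) {
        const struct aiocb *pending[1] = { &cb[cur] };
        aio_suspend(pending, 1, NULL);   /* wait for the in-flight read */
        if (aio_error(&cb[cur]) != 0) break;
        ssize_t n = aio_return(&cb[cur]);
        if (n <= 0) break;

        off += n;
        int nxt = 1 - cur;
        start_read(&cb[nxt], fd, buf[nxt], off);  /* read ahead... */
        process(buf[cur], n);                     /* ...while computing */
        cur = nxt;
    }
    close(fd);
    return 0;
}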
I am aware that one cannot listen for, detect, and perform some action upon encountering context switches on Windows machines via managed languages such as C#, Java, etc. However, I was wondering if there was a way of doing this using assembly (or some other language, perhaps C)? If so, could you provide a small code snippet that gives an idea of how to do this (as I am relatively new to kernel programming)?
What this code will essentially be designed to do is run in the background on a standard Windows UI and listen for when a particular process is either context-switched in or out of the CPU. Upon detecting either of these actions, it will send a signal. To clarify, I am looking to detect only the context switches directly involving a specific process, not all context switches. What I ultimately would like to achieve is to notify another machine (via the internet) whenever a specific process begins making use of the CPU, as well as when it ceases doing so.
My first attempt at doing this involved simply calculating the CPU usage percentage of the specific process, but this ultimately proved too coarse-grained to catch the most minute computations. For example, I wrote a test program that simply performed the operation 2+2 and placed the answer inside an int. The CPU usage method did not pick up on this. Thus, I am looking for something lower level, hence this question. If there are potential alternatives, I would be more than happy to field them.
There's Event Tracing for Windows (ETW), which you can configure to receive messages about a variety of events occurring in the system.
You should be able to receive messages about thread scheduling events. The CSwitch class of events is for that.
Sorry, I don't know any good ETW samples that you could easily reuse for your task. Read MSDN and look around.
Simon pointed out a good link explaining why ETW can be useful. Very enlightening: http://randomascii.wordpress.com/2012/05/11/the-lost-xperf-documentationcpu-scheduling/
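I don't have a polished sample, but a rough, untested sketch of starting the NT Kernel Logger with context-switch events enabled looks something like this (must run elevated; link with advapi32; a consumer would then use OpenTrace/ProcessTrace and filter CSwitch events by the thread IDs of interest):

#define INITGUID               /* so SystemTraceControlGuid gets defined */
#include <windows.h>
#include <evntrace.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* EVENT_TRACE_PROPERTIES must be followed by room for the name. */
    ULONG size = sizeof(EVENT_TRACE_PROPERTIES) + sizeof(KERNEL_LOGGER_NAME);
    EVENT_TRACE_PROPERTIES *props = calloc(1, size);
    props->Wnode.BufferSize = size;
    props->Wnode.Flags = WNODE_FLAG_TRACED_GUID;
    props->Wnode.Guid = SystemTraceControlGuid;     /* NT Kernel Logger */
    props->EnableFlags = EVENT_TRACE_FLAG_CSWITCH;  /* CSwitch events */
    props->LogFileMode = EVENT_TRACE_REAL_TIME_MODE;
    props->LoggerNameOffset = sizeof(EVENT_TRACE_PROPERTIES);

    TRACEHANDLE session;
    ULONG rc = StartTrace(&session, KERNEL_LOGGER_NAME, props);
    if (rc != ERROR_SUCCESS) {
        fprintf(stderr, "StartTrace failed: %lu\n", rc);
        return 1;
    }
    /* ... consume events here, then shut the session down ... */
    ControlTrace(session, KERNEL_LOGGER_NAME, props,
                 EVENT_TRACE_CONTROL_STOP);
    free(props);
    return 0;
}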
Please see the edits below. In particular #3, ETW appears to be the way to go.
In theory you could install your own trap handler for the old int 2Eh and the new sysenter. However, in practice this isn't going to be as easy anymore as it used to be because of Patchguard (since Vista) and signing requirements. I'm not aware of any other generic means to detect context switches, meaning you'd have to roll your own. All context switches of the OS go through call gates (the aforementioned trap handlers) and ReactOS allows you to peek behind the scenes if you feel uncomfortable with debugging/disassembling.
However, in either case there shouldn't be a generic way to install something like this without kernel mode privileges (usually referred to as ring 0) - anything else would be a security flaw in Windows. I'm not aware of a Windows-supplied method to achieve what you want either.
The book "Undocumented Windows NT" has a pretty good chapter about the exact topic (although obviously targeted at the old int 2Eh method).
If you can live with hooking only certain functions, you may be able to get away with some filter driver(s) or user-mode API hooking. Depends on your exact requirements.
Update: reading your updated question, I think you need to read up on the internals, in particular on the concept of IRQLs (not to be confused with IRQs from DOS times) and the scheduler. The problem is that there can, and usually will, be literally hundreds of context switches every second. However, your watcher process (the one watching for context switches) will, like any user-mode process, be preemptible. This means there is no way for you to achieve real-time signaling or anything close to it, which puts a big question mark over the method.
What is it actually that you want to achieve? The number of context switches doesn't really give you anything. Every single SEH exception will cause a context switch. What is it that you are interested in? Perhaps performance counters cater to your needs better?
Update 2: the sheer amount of context switches even for a single thread will be flabbergasting within a single second. So assuming you'd install your own trap handler, you'd still end up (adversely) affecting all other threads on the system (after all you'd catch every context switch and then see whether it's the process/threads you care about and then do your thing or pass it on).
If you could tell us what you ultimately want to achieve, not with the means already pre-defined, we may be able to suggest alternatives.
Update 3: so apparently I was wrong in one respect here. Windows comes with something on board that signals context switches. And ETW can be harnessed to tap into those. Thanks to Simon for pointing out.
I'm currently in the process of writing a state machine in C for a microcontroller (a TI MSP430). Now, I don't have any problems with writing the code and implementing my design, but I am wondering how to prove the state machine logic without having to use the actual hardware (which, of course, isn't yet available).
Using debugging features, I can simulate interrupts (although I haven't yet tried to do this, I'm just assuming it will be okay - it's documented after all) and I have defined and reserved a specific area of memory for holding TEST data, which using debugging macros, I can access at runtime outside of the application in a Python script. In other words, I have some test foundations in place. However, the focus of my question is this:
"How best do I force a certain state machine flow for decisions that require hardware input, e.g., for when an input pin is high or low". For example, "if some pin is high, follow this path, otherwise follow this path".
Again, using debugging macros, I can write to registers outside of the application (for example, to light an LED), but I can't (understandably) write to the read-only registers used for input, and so forcing a state machine flow in the way described above is proving taxing.
I had thought of using #ifdefs, where if I wanted to test flow I could use an output pin and check this value instead of the input pin that would ultimately be used. However, this will no doubt pepper my codebase with test-only code, which feels like the wrong approach to take. Does anyone have any advice on a good way of achieving this level of testing? I'm aware that I could probably just use a simulator, but I want to use real hardware wherever possible (albeit an evaluation board at this stage).
Sounds like you need abstraction.
Instead of hard-coding input reads in the "application" code (the state machine) using e.g. GPIO register reads, encapsulate those reads into functions that do the check and return the value. Inside the function, you can put #ifdef'd code that reads from your TEST memory area instead, and thus simulates a response from the GPIO pin that isn't there.
This should be possible even if you're aiming for high performance: it's not a lot of overhead, and if you work at it, you should be able to inline the functions.
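A minimal sketch of the idea, assuming a TEST_BUILD define; the pin name, the 0x0200 test-memory address, and the P1IN/BIT0 register names are illustrative:

#include <stdint.h>

#ifdef TEST_BUILD
/* The reserved TEST memory area, written from outside via debug macros. */
#define TEST_INPUTS (*(volatile uint8_t *)0x0200)

static inline uint8_t read_start_pin(void)
{
    return TEST_INPUTS & 0x01;   /* simulated pin state */
}
#else
#include <msp430.h>

static inline uint8_t read_start_pin(void)
{
    return P1IN & BIT0;          /* real GPIO read */
}
#endif

The state machine then calls read_start_pin() everywhere instead of touching the register, and the test build returns whatever your Python script last wrote into the TEST area.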
Even though you don't have all the hardware yet, you can simulate pretty much everything.
A possible way of doing it in C...
Interrupt handlers = threads waiting on events.
Input devices = threads firing the above events. They can be "connected" to the PC keyboard, so you initiate "interrupts" manually. Or they can have their own state machines to do whatever necessary in an automated manner (you can script those too, they don't have to be hardwired to a fixed behavior!).
Output devices = likewise threads. They can be "connected" to the PC display, so you can see the "LED" states. You can log outputs to files as well.
I/O pins/ports can be just dedicated global variables. If you need to wake up I/O device threads upon reading/writing from/to them, you can do so too. Either wrap accesses to them into appropriate synchronization-and-communication code or even map the underlying memory in such a way that any access to these port variables would trigger a signal/page fault whose handler would do all the necessary synchronization and communication for you.
And the main part is in, well, main(). :)
This will create an environment very close to the real thing. You can even get race conditions!
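As a small illustration of the threads-as-devices idea in plain C with pthreads (all names and the one-second firing rate are made up):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t irq_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t irq_event = PTHREAD_COND_INITIALIZER;
static unsigned char port_in;   /* simulated input pin register */
static int irq_pending;

/* Interrupt handler = a thread waiting on the event. */
static void *isr_thread(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&irq_lock);
        while (!irq_pending)
            pthread_cond_wait(&irq_event, &irq_lock);
        irq_pending = 0;
        printf("ISR: pin state = 0x%02x\n", port_in);
        pthread_mutex_unlock(&irq_lock);
    }
    return NULL;
}

/* Input device = a thread firing the event. */
static void *device_thread(void *arg)
{
    (void)arg;
    for (;;) {
        sleep(1);
        pthread_mutex_lock(&irq_lock);
        port_in ^= 0x01;         /* toggle the simulated pin */
        irq_pending = 1;
        pthread_cond_signal(&irq_event);
        pthread_mutex_unlock(&irq_lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t isr, dev;
    pthread_create(&isr, NULL, isr_thread, NULL);
    pthread_create(&dev, NULL, device_thread, NULL);
    pthread_join(isr, NULL);     /* runs until killed */
    return 0;
}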
If you want to be even more hardcore about it, and if you have time, you can simulate the entire MSP430 as well. The instruction set is very compact and simple, and some simulators already exist, so you have reference code to leverage.
If you want to test your code well, you will need to make it flexible enough for the purpose. This may include adding #ifdefs and macros, passing explicit parameters to functions instead of accessing global variables, and using pointers to data and functions that you can override while testing: all kinds of test hooks.
You should also think of splitting the code into hardware-specific parts, very hardware-specific parts and plain business logic parts, which you can compile into separate libraries. If you do so, you'll be able to substitute the real hardware libs with test libs simulating the hardware.
Anyhow, you should abstract away the hardware devices and use test state machines to test production code and its state machines.
Build a test bench. First off, I recommend that when you read the input registers, for example, you do it through some sort of function call rather than a direct volatile-address access. Basically, everything gets at least one layer of abstraction. Now your main application can easily be lifted and placed anywhere, with test functions for each of the abstractions, and you can completely test that code without any of the real hardware. Also, once on the real hardware, you can use the abstraction (wrapper function, whatever you want to call it) as a way to change or fake the input.
switch (state)
{
    case X:
        r = read_gpio_port();
        if (r & 0x10) next_state = Y;
        break;
}
In a test bench (or even on hardware):
unsigned int test_count;

unsigned int read_gpio_port(void)
{
    test_count++;
    return test_count;
}
Eventually, implement read_gpio_port in asm or C to access the real GPIO port, and link that in with the main application instead of the test code.
Yes, you suffer a function call unless you inline, but in return your debugging and testing abilities are significantly greater.
Normally, if two applications send two write requests to the same place (LBA) on the disk, the applications or the file system will take a lock, so that only one request is handled at a time.
But now there is a difficult problem: there may be multiple write requests that should be written to the same place, but the applications don't take any lock. There is no file system, because the data is written directly to the raw disk. What I can do is modify the code of the storage system. Things are very complicated now. Suppose there are two requests, A and B. The data finally in the corresponding LBA may then be one of three results:
All data are from A.
All data are from B.
Parts of data are from A; parts of data are from B.
In my opinion, results 1 and 2 are acceptable, but result 3 is not. Someone else disagrees, though. What is your opinion?
I agree that it should be all of one or none of either. This can be done quite easily by using a form of storage system manager, and writing to the manager in large enough chunks. The manager can do appropriate locking internally so that only one block from one request is written at a time, and you don't get overlaps.
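A minimal sketch of such a manager, assuming POSIX pwrite and one gatekeeper mutex (all names are illustrative): each request is written whole while the lock is held, so the outcome is all of A or all of B, never a mix.

#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

static pthread_mutex_t disk_lock = PTHREAD_MUTEX_INITIALIZER;

/* All raw-disk writes funnel through this gatekeeper, which serializes
   whole requests so two writers can never interleave in one LBA range. */
ssize_t manager_write(int fd, const void *buf, size_t len, off_t lba_offset)
{
    pthread_mutex_lock(&disk_lock);
    ssize_t n = pwrite(fd, buf, len, lba_offset);
    pthread_mutex_unlock(&disk_lock);
    return n;
}

Note that this only serializes concurrent requests; it does not make a single write atomic against a crash or power loss mid-write.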