Workqueue implementation in Linux Kernel - c

Can anyone help me understand the difference between the following APIs in the Linux kernel:
struct workqueue_struct *create_workqueue(const char *name);
struct workqueue_struct *create_singlethread_workqueue(const char *name);
I wrote sample modules that use each of them. When I look at them using ps -aef, both have created a workqueue, but I cannot see any difference between the two.
I have referred to http://www.makelinux.net/ldd3/chp-7-sect-6, and according to LDD3:
If you use create_workqueue, you get a workqueue that has a dedicated thread for each processor on the system. In many cases, all those threads are simply overkill; if a single worker thread will suffice, create the workqueue with create_singlethread_workqueue instead.
But I was not able to see multiple worker threads (one for each processor).

Workqueues have changed since LDD3 was written.
These two functions are actually macros:
#define create_workqueue(name) \
    alloc_workqueue("%s", WQ_MEM_RECLAIM, 1, (name))
#define create_singlethread_workqueue(name) \
    alloc_workqueue("%s", WQ_UNBOUND | WQ_MEM_RECLAIM, 1, (name))
The alloc_workqueue documentation says:
Allocate a workqueue with the specified parameters. For detailed
information on WQ_* flags, please refer to Documentation/workqueue.txt.
That file is too big to quote entirely, but it says:
alloc_workqueue() allocates a wq. The original create_*workqueue()
functions are deprecated and scheduled for removal.
[...]
A wq no longer manages execution resources but serves as a domain for
forward progress guarantee, flush and work item attributes.
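To check what ps is actually showing, here is a minimal user-space sketch. It counts tasks whose name in /proc/<pid>/comm starts with a given prefix, so you can compare the threads named after your workqueue with the kworker threads. The helper names has_prefix and count_comm_prefix are made up for this example. My understanding (not verified against every kernel version) is that with alloc_workqueue() the work items run on shared kworker threads, and the one thread named after the workqueue is the rescuer that WQ_MEM_RECLAIM requests, which would explain why both macros look the same in ps.

```c
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Return 1 if `name` starts with `prefix`, 0 otherwise. */
int has_prefix(const char *name, const char *prefix)
{
    return strncmp(name, prefix, strlen(prefix)) == 0;
}

/*
 * Count tasks whose /proc/<pid>/comm starts with `prefix`.
 * Returns -1 if /proc cannot be opened.
 */
int count_comm_prefix(const char *prefix)
{
    DIR *proc = opendir("/proc");
    struct dirent *de;
    int count = 0;

    if (!proc)
        return -1;

    while ((de = readdir(proc)) != NULL) {
        char path[300];
        char comm[64];
        FILE *f;

        /* Only directories with numeric names are PIDs. */
        if (!isdigit((unsigned char)de->d_name[0]))
            continue;

        snprintf(path, sizeof(path), "/proc/%s/comm", de->d_name);
        f = fopen(path, "r");
        if (!f)
            continue;   /* the task may have exited in the meantime */
        if (fgets(comm, sizeof(comm), f) && has_prefix(comm, prefix))
            count++;
        fclose(f);
    }
    closedir(proc);
    return count;
}
```

For example, compare count_comm_prefix("kworker") with count_comm_prefix("my_wq"), where my_wq stands for whatever name your module passed to the macro.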

if (singlethread) {
    cwq = init_cpu_workqueue(wq, singlethread_cpu);
    err = create_workqueue_thread(cwq, singlethread_cpu);
    start_workqueue_thread(cwq, -1);
} else {
    list_add(&wq->list, &workqueues);
    for_each_possible_cpu(cpu) {
        cwq = init_cpu_workqueue(wq, cpu);
        err = create_workqueue_thread(cwq, cpu);
        start_workqueue_thread(cwq, cpu);
    }
}

Related

Linux kernel asynchronous AIO: do I need to copy over the struct iovec for later processing?

I have added support for AIO to my driver (the .aio_read and .aio_write calls in kernel land, libaio in userland). Looking at various sources, I cannot find out whether my .aio_read and .aio_write calls can just store a pointer to the iovec argument (on the assumption that this memory remains untouched until after, e.g., aio_complete is called), or whether I need to deep copy the iovec data structures.
static ssize_t aio_read( struct kiocb *iocb, const struct iovec *iovec, unsigned long nr_segs, loff_t pos );
static ssize_t aio_write( struct kiocb *iocb, const struct iovec *iovec, unsigned long nr_segs, loff_t pos );
Looking at the implementation of drivers/usb/gadget/inode.c as an example, it seems they just copy the pointer in the ep_aio_rwtail function, which has:
priv->iv = iv;
But when I try doing something similar, the data in the iovec has very often been "corrupted" by the time I process it.
E.g. in the aio_read/write calls I log:
iovector located at addr:0xbf1ebf04
segment 0: base: 0x76dbb468 len:512
But then, when I do the real work in a kernel thread (after attaching to the user space mm), I log the following:
iovector located at addr:0xbf1ebf04
segment 0: base: 0x804e00c8 len:-1088503900
This is with a very simple test case where I only submit 1 asynchronous command in my user application.
To make things more interesting:
I have the corruption about 80% of the time on a 3.13 kernel.
But I never saw it before on a 3.9 kernel (although I only used that for a short while before upgrading to 3.13; I have now reverted to it as a sanity check and tried a dozen times or so).
(An example run with a 3.9 kernel logs the following twice:
iovector located at addr:0xbf9ee054
segment 0: base: 0x76e28468 len:512)
Does this ring any bells?
(The other possibility is that I am corrupting these addresses/lengths myself of course, but it is strange that I never had this with a 3.9)
EDIT:
To answer my own question: after reviewing the 3.13 code for Linux AIO (which has changed significantly compared with the 3.9 code that was working), in fs/aio.c you have:
static ssize_t aio_run_iocb(struct kiocb *req, unsigned opcode,
                            char __user *buf, bool compat)
{
    ...
    struct iovec inline_vec, *iovec = &inline_vec;
    ...
    ret = rw_op(req, iovec, nr_segs, req->ki_pos);
    ...
}
So this iovec structure lives on the stack of aio_run_iocb, and it is lost as soon as the aio_read/write function returns.
And the gadget framework contains a bug (at least for 3.13) in drivers/usb/gadget/inode.c...
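Given that, the safe approach is to copy the iovec descriptors before returning from the aio_read/aio_write call. The following is only a user-space sketch of that deep copy, so that it can be compiled and tried outside the kernel; iovec_dup is a hypothetical helper name. In a driver the same idea would use the kernel allocator instead (for instance kmemdup() with GFP_KERNEL, and kfree() once the request has completed), which I am assuming here rather than quoting from any particular driver.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>

/*
 * Duplicate an array of nr_segs iovec descriptors.
 * Only the descriptors (base pointer + length) are copied, not the
 * buffers they point to. Returns NULL on allocation failure or when
 * nr_segs is 0; the caller releases the result with free().
 */
struct iovec *iovec_dup(const struct iovec *iov, unsigned long nr_segs)
{
    struct iovec *copy;

    /* Reject empty vectors and sizes that would overflow the multiplication. */
    if (nr_segs == 0 || nr_segs > SIZE_MAX / sizeof(*iov))
        return NULL;

    copy = malloc(nr_segs * sizeof(*iov));
    if (copy)
        memcpy(copy, iov, nr_segs * sizeof(*iov));
    return copy;
}
```

The copy stays valid after the caller's on-stack array has gone out of scope, which is exactly what storing the bare pointer (priv->iv = iv) does not guarantee.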
From the man page for aio_read:
NOTES
It is a good idea to zero out the control block before use. The control block must not be changed while the read operation is in progress.
The buffer area being read into must not be accessed during the operation or undefined results may occur. The memory areas involved must
remain valid.
Simultaneous I/O operations specifying the same aiocb structure produce
undefined results.
This suggests the driver can rely on the user's data structures during the operation. It would be prudent to abandon the operation and return an asynchronous error if, during the operation, you detect those structures have changed.

How to set breakpoint to obtain address of a function in fork.c , in the kernel source?

Good day to all. I have a question that I hope someone can help me with. Thanks in advance. I have spent hours searching but have been unable to find a solution.
My problem:
I need to obtain the address of the security_task_create(clone_flags) call in the following code snippet (located at line 926 of fork.c, as per "/usr/src/linux-2.6.27/kernel/fork.c"):
************************************ ************************************
static struct task_struct *copy_process(unsigned long clone_flags,
                                        unsigned long stack_start,
                                        struct pt_regs *regs,
                                        unsigned long stack_size,
                                        int __user *child_tidptr,
                                        struct pid *pid,
                                        int trace)
{
    int retval;
    struct task_struct *p;
    int cgroup_callbacks_done = 0;
    if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
        return ERR_PTR(-EINVAL);
    /*
     * Thread groups must share signals as well, and detached threads
     * can only be started up within the thread group.
     */
    if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND))
        return ERR_PTR(-EINVAL);
    /*
     * Shared signal handlers imply shared VM. By way of the above,
     * thread groups also imply shared VM. Blocking this case allows
     * for various simplifications in other code.
     */
    if ((clone_flags & CLONE_SIGHAND) && !(clone_flags & CLONE_VM))
        return ERR_PTR(-EINVAL);
    retval = security_task_create(clone_flags); /* <-- line 926, the call of interest */
    if (retval)
        goto fork_out;
    retval = -ENOMEM;
    p = dup_task_struct(current);
    if (!p)
        goto fork_out;
    rt_mutex_init_task(p);
************************************ ************************************
I've enabled KDB access over the keyboard on my Fedora Core 16 machine with kernel 3.1.7. Upon entering the KDB console (i.e. the "kdb[0]>" prompt), I typed security_task_create and a hex address, e.g. 0x0040118e, was displayed.
My Questions:
1. Is the displayed hex address the address of security_task_create once the kernel is loaded?
2. If not, how can I obtain the address of the security_task_create function? How do I configure KDB to obtain it?
What I have in mind is to insert a breakpoint at line 926 of fork.c using KDB, so that it triggers when the kernel runs security_task_create. If that is indeed the proper solution, how do I obtain the address of security_task_create this way?
To get the address of any symbol in the kernel, simply use the System.map file.
CONFIG_KALLSYMS needs to be enabled in the kernel configuration to get all symbols in that file.
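As a rough illustration of the lookup, here is a small sketch that scans a System.map-style file for one symbol. It assumes the usual three-column layout of "<hex address> <type> <name>" per line; lookup_symbol is a name invented for this example, not an existing API. The same parser should also cope with /proc/kallsyms, though that file may show zeroed addresses to unprivileged users.

```c
#include <stdio.h>
#include <string.h>

/*
 * Look up `symbol` in a System.map-style stream, where each line is
 * "<hex address> <type> <name>". On success store the address in
 * *addr and return 0; return -1 if the symbol is not found.
 */
int lookup_symbol(FILE *map, const char *symbol, unsigned long long *addr)
{
    char line[512];

    while (fgets(line, sizeof(line), map)) {
        unsigned long long a;
        char type;
        char name[256];

        if (sscanf(line, "%llx %c %255s", &a, &type, name) != 3)
            continue;   /* skip lines that do not match the layout */
        if (strcmp(name, symbol) == 0) {
            *addr = a;
            return 0;
        }
    }
    return -1;
}
```

Open the System.map that matches the running kernel with fopen(), call lookup_symbol(f, "security_task_create", &addr), and use the resulting address for the breakpoint.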
Just grep for printk in your source directory and I'm sure you'll find tons of examples.
printk(KERN_INFO "fork(): process `%s' used deprecated "
       "clone flags 0x%lx\n",
       get_task_comm(comm, current),
       clone_flags & CLONE_STOPPED);

Move memory pages per-thread in NUMA architecture

I have two questions in one:
(i) Suppose thread X is running on CPU Y. Is it possible to use the migrate_pages syscall, or even better move_pages (or their libnuma wrappers), to move the pages associated with X to the node to which Y is connected?
This question arises because the first argument of both syscalls is a PID (and I need a per-thread approach for some research I'm doing).
(ii) If the answer to (i) is yes, how can I get all the pages used by a given thread? My aim is to move the page(s) that contain array M[], for example. How do I "link" data structures with their memory pages so that I can use the syscalls above?
Some extra information: I'm using C with pthreads. Thanks in advance!
You want to use the higher level libnuma interfaces instead of the low level system calls.
The libnuma library offers a simple programming interface to the NUMA (Non Uniform Memory Access) policy supported by the Linux kernel. On a NUMA architecture some memory areas have different latency or bandwidth than others.
Available policies are page interleaving (i.e., allocate in a round-robin fashion from all, or a subset, of the nodes on the system), preferred node allocation (i.e., preferably allocate on a particular node), local allocation (i.e., allocate on the node on which the task is currently executing), or allocation only on specific nodes (i.e., allocate on some subset of the available nodes). It is also possible to bind tasks to specific nodes.
The man pages for the low level numa_* system calls warn you away from using them:
Link with -lnuma to get the system call definitions. libnuma and the required <numaif.h> header are available in the numactl package.
However, applications should not use these system calls directly. Instead, the higher level interface provided by the numa(3) functions in the numactl package is recommended. The numactl package is available at <ftp://oss.sgi.com/www/projects/libnuma/download/>. The package is also included in some Linux distributions. Some distributions include the development library and header in the separate numactl-devel package.
Here's the code I use for pinning a thread to a single CPU and moving the stack to the corresponding NUMA node (slightly adapted to remove some constants defined elsewhere). Note that I first create the thread normally, and then call the SetAffinityAndRelocateStack() below from within the thread. I think this is much better than trying to create your own stack, since stacks have special support for growing in case the bottom is reached.
The code can also be adapted to operate on the newly created thread from outside, but this could give rise to race conditions (e.g. if the thread performs I/O into its stack), so I wouldn't recommend it.
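#include <alloca.h>
#include <cassert>
#include <cstring>
#include <numa.h>     // numa_pagesize(), numa_node_of_cpu(); link with -lnuma
#include <numaif.h>   // mbind(), MPOL_*
#include <pthread.h>  // the *_np calls need _GNU_SOURCE (g++ defines it by default)
// Touch the next pages of the stack so that they get faulted in.
// The returned pointer is dangling once this function returns; never dereference it.
void* PreFaultStack()
{
    const size_t NUM_PAGES_TO_PRE_FAULT = 50;
    const size_t size = NUM_PAGES_TO_PRE_FAULT * numa_pagesize();
    void *allocaBase = alloca(size);
    memset(allocaBase, 0, size);
    return allocaBase;
}
// Call this from within the thread that is to be pinned.
void SetAffinityAndRelocateStack(int cpuNum)
{
    assert(-1 != cpuNum);
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(cpuNum, &cpuset);
    const int rc = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
    assert(0 == rc);
    // Find the address range of this thread's stack.
    pthread_attr_t attr;
    void *stackAddr = nullptr;
    size_t stackSize = 0;
    if ((0 != pthread_getattr_np(pthread_self(), &attr)) || (0 != pthread_attr_getstack(&attr, &stackAddr, &stackSize))) {
        assert(false);
    }
    pthread_attr_destroy(&attr);
    // Bind the stack to the CPU's node and move the pages that are already allocated.
    const int node = numa_node_of_cpu(cpuNum);
    assert(node >= 0);
    const unsigned long nodeMask = 1UL << node;
    // mbind() takes the size of the mask as a number of bits, not bytes.
    const auto bindRc = mbind(stackAddr, stackSize, MPOL_BIND, &nodeMask, sizeof(nodeMask) * 8, MPOL_MF_MOVE | MPOL_MF_STRICT);
    assert(0 == bindRc);
    PreFaultStack();
    // TODO: Also lock the stack with mlock() to guarantee it stays resident in RAM
}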
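For question (ii), a small sketch of how a data structure can be "linked" to its pages: round the start of the array down to a page boundary and step through it one page at a time. The result is the kind of address list that move_pages(2) expects in its pages argument. pages_of_buffer is a hypothetical helper written for this example and uses only sysconf(), so it does not need libnuma to build. Note also that, as far as I know, pages belong to the address space of the whole process rather than to a single thread, so "the pages of thread X" in practice means "the pages of the data that X works on".

```c
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Fill `pages` with the start address of every page that overlaps
 * [buf, buf + len). At most `max` entries are written. Returns the
 * number of pages the buffer spans (which may exceed `max`).
 */
size_t pages_of_buffer(const void *buf, size_t len, void **pages, size_t max)
{
    const uintptr_t page = (uintptr_t)sysconf(_SC_PAGESIZE);
    uintptr_t first, last, p;
    size_t n = 0;

    if (len == 0)
        return 0;

    first = (uintptr_t)buf & ~(page - 1);             /* round down to a page start */
    last = ((uintptr_t)buf + len - 1) & ~(page - 1);  /* page holding the last byte */

    for (p = first; ; p += page) {
        if (n < max)
            pages[n] = (void *)p;
        n++;
        if (p == last)
            break;
    }
    return n;
}
```

With the list filled in for M[], a call along the lines of move_pages(0, n, pages, nodes, status, MPOL_MF_MOVE) from <numaif.h> (pid 0 meaning the calling process, nodes holding the target node for each page) would request the migration; check the man page for the exact semantics on your system.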

pthread POSIX C library detachstate

I was asked how we know that passing NULL as the second argument to pthread_create() makes the thread joinable.
I mean, I know that the man pages state so, but a justification in the code was demanded.
I know that when NULL is passed in, default attributes are used:
const struct pthread_attr *iattr = (struct pthread_attr *) attr;
if (iattr == NULL)
    /* Is this the best idea? On NUMA machines this could mean accessing far-away memory. */
    iattr = &default_attr;
I know that it should be somewhere in the code of pthread library, but I don't know where exactly.
I know that the definition of default_attr is in pthread_create.c:
static const struct pthread_attr default_attr =
{
    /* Just some value > 0 which gets rounded to the nearest page size. */
    .guardsize = 1,
};
http://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/pthread_create.c;h=4fe0755079e5491ad360c3b4f26c182543a0bd6e;hb=HEAD#l457
but I do not know where exactly the code states that this results in a joinable thread.
Thanks in advance.
First off, from the code you pasted you can see that default_attr contains zeroes in almost all fields (there's no such thing as a half-initialized variable in C: if you only initialize some fields, the others are set to 0).
Second, pthread_create contains this code:
/* Initialize the field for the ID of the thread which is waiting
for us. This is a self-reference in case the thread is created
detached. */
pd->joinid = iattr->flags & ATTR_FLAG_DETACHSTATE ? pd : NULL;
This line checks whether iattr->flags has the ATTR_FLAG_DETACHSTATE bit set, which (for default_attr) it doesn't because default_attr.flags is 0. Thus it sets pd->joinid to NULL and not to pd as for detached threads.
(Note that this answer only applies to GNU glibc and not to POSIX pthreads in general.)
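As a run-time cross-check (not a substitute for reading the source), the following sketch asks the library itself. It uses only POSIX calls: it reads the detach state of a freshly initialised attribute object, and it creates a thread with a NULL attribute and joins it. The function names default_detachstate and join_null_attr_thread are made up for this example.

```c
#include <pthread.h>
#include <stddef.h>

/* Trivial thread body, used only so that there is something to join. */
static void *noop(void *arg)
{
    return arg;
}

/* Detach state of a freshly initialised attribute object, or -1 on error. */
int default_detachstate(void)
{
    pthread_attr_t attr;
    int state = -1;

    if (pthread_attr_init(&attr) != 0)
        return -1;
    if (pthread_attr_getdetachstate(&attr, &state) != 0)
        state = -1;
    pthread_attr_destroy(&attr);
    return state;
}

/*
 * Create a thread with attr == NULL and try to join it.
 * Returns 0 if both calls succeed, which is only expected for a
 * joinable thread; otherwise returns the error code.
 */
int join_null_attr_thread(void)
{
    pthread_t tid;
    int rc = pthread_create(&tid, NULL, noop, NULL);

    if (rc != 0)
        return rc;
    return pthread_join(tid, NULL);
}
```

On a conforming implementation default_detachstate() should equal PTHREAD_CREATE_JOINABLE and join_null_attr_thread() should return 0. Depending on the toolchain, the program may need to be built with -pthread.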

Logging in ISR, sprintf(), printk(), else?

When trying to log/debug an ISR, I've seen:
1) sprintf(), used in an example in O'Reilly's 'Linux Device Drivers':
irqreturn_t short_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
    struct timeval tv;
    int written;
    do_gettimeofday(&tv);
    /* Write a 16 byte record. Assume PAGE_SIZE is a multiple of 16 */
    written = sprintf((char *)short_head,"%08u.%06u\n",
                      (int)(tv.tv_sec % 100000000), (int)(tv.tv_usec));
    BUG_ON(written != 16);
    short_incr_bp(&short_head, written);
    wake_up_interruptible(&short_queue); /* awake any reading process */
    return IRQ_HANDLED;
}
Unlike printf(), sprintf() writes to memory instead of to the console, and does not seem to have re-entrancy or blocking issues, correct? But I've seen arguments against sprintf() on other forums. I am not sure whether that is only because of its performance overhead, or something else.
2) printk() is another one I've seen people use, but it gets criticized as well, again for performance issues (and maybe nothing else?).
What is a generally good method/function to use for logging or debugging an ISR in Linux these days?
Regarding sprintf(): search for it on any LXR site, for example here:
Freetext search: sprintf (4096 estimated hits)
drivers/video/mbx/mbxdebugfs.c, line 100 (100%)
drivers/isdn/hisax/q931.c, line 1207 (100%)
drivers/scsi/aic7xxx_old/aic7xxx_proc.c, line 141
I think this eliminates any doubts.
As for printk(), printk.h says:
/* If you are writing a driver, please use dev_dbg instead */
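The fixed-size record written by short_interrupt() in the question can be checked in user space. This sketch reuses the same format string with the bounded snprintf(); format_record is a hypothetical name. It shows why BUG_ON(written != 16) holds: 8 digits, a dot, 6 digits and a newline, provided the microsecond value stays below 1000000.

```c
#include <stdio.h>

/*
 * Format a timestamp as "SSSSSSSS.UUUUUU\n": 8 digits of seconds,
 * a dot, 6 digits of microseconds and a newline, 16 characters in all.
 * Returns the number of characters snprintf() reports.
 */
int format_record(char *buf, size_t size, long sec, long usec)
{
    return snprintf(buf, size, "%08u.%06u\n",
                    (unsigned)(sec % 100000000), (unsigned)usec);
}
```

Passing the buffer size is the main difference from the original: if the record ever grew beyond the space available, the bounded call would truncate it instead of writing past the end of the buffer.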

Resources