While following the code path of the wait system call, I noticed that tasklist_lock is taken before
do_wait_thread is called. I am trying to understand the significance
of tasklist_lock and where it is appropriate to use it.
read_lock(&tasklist_lock);
tsk = current;
do {
    retval = do_wait_thread(wo, tsk);
    if (retval)
        goto end;

    retval = ptrace_do_wait(wo, tsk);
    if (retval)
        goto end;

    if (wo->wo_flags & __WNOTHREAD)
        break;
} while_each_thread(current, tsk);
read_unlock(&tasklist_lock);
I looked at the declaration of tasklist_lock; it is as follows:
/*
 * This serializes "schedule()" and also protects
 * the run-queue from deletions/modifications (but
 * _adding_ to the beginning of the run-queue has
 * a separate lock).
 */
extern rwlock_t tasklist_lock;
extern spinlock_t mmlist_lock;
I am not able to understand where this lock should be used. Can you please explain?
I appreciate your help.
The loop iterates over each thread of the current task. Holding the tasklist lock ensures that none of those threads disappear while the loop is running.
In the Linux kernel (version 4.8),
"struct pid" is defined as follows (from file: http://lxr.free-electrons.com/source/include/linux/pid.h). Here "numbers[1]" (at line 64) is a fixed-size array that can hold only one element, because the array size is given as 1.
57 struct pid
58 {
59 atomic_t count;
60 unsigned int level;
61 /* lists of tasks that use this pid */
62 struct hlist_head tasks[PIDTYPE_MAX];
63 struct rcu_head rcu;
64 struct upid numbers[1];
65 };
But then, in the following code at lines 319 and 320 (from file: http://lxr.free-electrons.com/source/kernel/pid.c), the array "numbers" is indexed inside a for loop as 'numbers[i]'. How can this be correct, given that 'i' cannot take any value other than zero without causing a segmentation fault? I checked the value of 'i' during the loop to see whether it ever goes above zero. It does, but I still don't see any segmentation fault. Am I missing something here?
297 struct pid *alloc_pid(struct pid_namespace *ns)
298 {
299 struct pid *pid;
300 enum pid_type type;
301 int i, nr;
302 struct pid_namespace *tmp;
303 struct upid *upid;
304 int retval = -ENOMEM;
305
306 pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
307 if (!pid)
308 return ERR_PTR(retval);
309
310 tmp = ns;
311 pid->level = ns->level;
312 for (i = ns->level; i >= 0; i--) {
313 nr = alloc_pidmap(tmp);
314 if (nr < 0) {
315 retval = nr;
316 goto out_free;
317 }
318
319 pid->numbers[i].nr = nr;
320 pid->numbers[i].ns = tmp;
321 tmp = tmp->parent;
322 }
Is it possible to have number of elements in an array more than array's size which is defined at compile time?
Yes. It is called undefined behavior, and code should not be written to allow that.
How is it even correct because variable 'i' cannot have any value other than zero without causing segmentation fault?
It is possible because the code broke the contract. Writing outside an array's bounds may work. It may crash the program. It is undefined behavior.
C is not specified to prevent array access outside its bounds, nor to cause a seg fault. Such an access may be caught or not. The code itself is responsible for ensuring that access stays within bounds.
There are no training wheels and few safety nets specified in C.
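For comparison, here is a minimal sketch of the well-defined way to get a struct whose trailing array length is chosen at run time: a C99 flexible array member, with the allocation sized for the elements that will actually be used. As far as I know, the kernel makes the numbers[1] idiom workable in a similar spirit, by sizing the objects in ns->pid_cachep for level + 1 entries, but treat that as my reading rather than something the excerpt shows. The names struct my_upid, struct my_pid, alloc_my_pid and my_pid_nr are illustrative, not kernel API.

```c
#include <assert.h>
#include <stdlib.h>

struct my_upid {
    int nr;
};

struct my_pid {
    unsigned int level;
    struct my_upid numbers[];   /* flexible array member (C99) */
};

/* Allocate room for level + 1 entries after the fixed header. */
struct my_pid *alloc_my_pid(unsigned int level)
{
    struct my_pid *pid;

    pid = malloc(sizeof(*pid) + (level + 1) * sizeof(pid->numbers[0]));
    if (!pid)
        return NULL;

    pid->level = level;
    /* Every index 0..level lies inside the allocation. */
    for (unsigned int i = 0; i <= level; i++)
        pid->numbers[i].nr = (int)(100 + i);
    return pid;
}

/* The code, not the language, keeps the index within bounds. */
int my_pid_nr(const struct my_pid *pid, unsigned int i)
{
    assert(i <= pid->level);
    return pid->numbers[i].nr;
}
```

With this layout, an index up to level is valid because the memory behind it was really allocated. Nothing in C checks that for you, which is why the accessor asserts it.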
This is an algorithm that does not use OS synchronization primitives until two or more threads actually contend for the critical section. Even for recursive "locks" taken by the same thread, there is no real lock until a second thread is involved.
http://home.comcast.net/~pjbishop/Dave/QRL-OpLocks-BiasedLocking.pdf
There are two functions:
int qrlgeneric_acquire(qrlgeneric_lock *L, int id);
void qrlgeneric_release(qrlgeneric_lock *L, int acquiredquickly);
qrlgeneric_acquire: called when a thread wants to lock; id is the thread id
qrlgeneric_release: called when a thread wants to unlock
Example:
Thread_1, which already holds the lock, calls qrlgeneric_acquire again, so a recursive lock will be performed. At the same time, Thread_2 calls qrlgeneric_acquire, so there is contention (two threads want the lock, so a real OS sync primitive will be used).
Thread_1 will reach this condition on line 4.
04 if (BIASED(id) == status) // SO: this means this thread already has this lock
05 {
06 L->lockword.h.quicklock = 1;
07 if (BIASED(id) == HIGHWORD(L->lockword.data))
08 return 1;
09 L->lockword.h.quicklock = 0; /* I didn’t get the lock, so be sure */
10 /* not to block the process that did */
11 }
Thread_2 will reach this condition on line 35. CAS is the atomic compare-and-swap operation.
34 unsigned short biaslock = L->lockword.h.quicklock;
35 if (CAS(&L->lockword,
36 MAKEDWORD(biaslock, status),
37 MAKEDWORD(biaslock, REVOKED)))
38 {
39 /* I’m the revoker. Set up the default lock. */
40 /* *** INITIALIZE AND ACQUIRE THE DEFAULT LOCK HERE *** */
41 /* Note: this is an uncontended acquire, so it */
42 /* can be done without use of atomics if this is */
43 /* desirable. */
44 L->lockword.h.status = DEFAULT;
45
46 /* Wait until quicklock is free */
47 while (LOWWORD(L->lockword.data))
48 ;
49 return 0; /* And then it’s mine */
50 }
From the comments on lines 9 and 47, you can see that the statement on line 9 is there to support the loop on line 47, so that Thread_2 doesn't spin there forever.
QUESTION: Those comments suggest that the two conditions above should never both succeed; otherwise Thread_2 will spin on line 47, because the statement on line 9 will not be executed. I need help understanding why they can never both succeed, because I still think it can happen:
1. Thread_1: 06 L->lockword.h.quicklock = 1;
2. Thread_2: 34 unsigned short biaslock = L->lockword.h.quicklock;
3. Thread_1: 07 if (BIASED(id) == HIGHWORD(L->lockword.data))
4. Thread_2: 35 if (CAS(&L->lockword,MAKEDWORD(biaslock, status),MAKEDWORD(biaslock, REVOKED)))
The condition in step 3 will succeed because Thread_2 hasn't changed anything yet.
The condition in step 4 will succeed because steps 1 and 3 didn't affect it.
So I think both can succeed, which means Thread_2 will spin on line 47 until Thread_1 releases the lock. I think this is definitely wrong and shouldn't happen, so I probably misunderstand something. Can anyone help?
Whole algorithm:
/* statuses for qrl locks */
#define BIASED(id) ((int)(id) << 2)
#define NEUTRAL 1
#define DEFAULT 2
#define REVOKED 3
#define ISBIASED(status) (0 == ((status) & 3))

/* word manipulation (big-endian versions shown here) */
#define MAKEDWORD(low, high) (((unsigned int)(low) << 16) | (high))
#define HIGHWORD(dword) ((unsigned short)dword)
#define LOWWORD(dword) ((unsigned short)(((unsigned int)(dword)) >> 16))

typedef volatile struct tag_qrlgeneric_lock
{
    volatile union
    {
        volatile struct
        {
            volatile short quicklock;
            volatile short status;
        }
        h;
        volatile int data;
    }
    lockword;
    /* *** PLUS WHATEVER FIELDS ARE NEEDED FOR THE DEFAULT LOCK *** */
}
qrlgeneric_lock;

int qrlgeneric_acquire(qrlgeneric_lock *L, int id)
{
    int status = L->lockword.h.status;

    /* If the lock’s mine, I can reenter by just setting a flag */
    if (BIASED(id) == status)
    {
        L->lockword.h.quicklock = 1;
        if (BIASED(id) == HIGHWORD(L->lockword.data))
            return 1;
        L->lockword.h.quicklock = 0; /* I didn’t get the lock, so be sure */
                                     /* not to block the process that did */
    }
    if (DEFAULT != status)
    {
        /* If the lock is unowned, try to claim it */
        if (NEUTRAL == status)
        {
            if (CAS(&L->lockword,             /* By definition, if we saw */
                    MAKEDWORD(0, NEUTRAL),    /* neutral, the lock is unheld */
                    MAKEDWORD(1, BIASED(id))))
            {
                return 1;
            }
            /* If I didn’t bias the lock to me, someone else just grabbed
               it. Fall through to the revocation code */
            status = L->lockword.h.status; /* resample */
        }
        /* If someone else owns the lock, revoke them */
        if (ISBIASED(status))
        {
            do
            {
                unsigned short biaslock = L->lockword.h.quicklock;
                if (CAS(&L->lockword,
                        MAKEDWORD(biaslock, status),
                        MAKEDWORD(biaslock, REVOKED)))
                {
                    /* I’m the revoker. Set up the default lock. */
                    /* *** INITIALIZE AND ACQUIRE THE DEFAULT LOCK HERE *** */
                    /* Note: this is an uncontended acquire, so it */
                    /* can be done without use of atomics if this is */
                    /* desirable. */
                    L->lockword.h.status = DEFAULT;
                    /* Wait until quicklock is free */
                    while (LOWWORD(L->lockword.data))
                        ;
                    return 0; /* And then it’s mine */
                }
                /* The CAS could have failed and we got here for either of
                   two reasons. First, another process could have done the
                   revoking; in this case we need to fall through to the
                   default path once the other process is finished revoking.
                   Secondly, the bias process could have acquired or released
                   the biaslock field; in this case we need merely retry. */
                status = L->lockword.h.status;
            }
            while (ISBIASED(L->lockword.h.status));
        }
        /* If I get here, the lock has been revoked by someone other
           than me. Wait until they’re done revoking, then fall through
           to the default code. */
        while (DEFAULT != L->lockword.h.status)
            ;
    }
    /* Regular default lock from here on */
    assert(DEFAULT == L->lockword.h.status);
    /* *** DO NORMAL (CONTENDED) DEFAULT LOCK ACQUIRE FUNCTION HERE *** */
    return 0;
}

void qrlgeneric_release(qrlgeneric_lock *L, int acquiredquickly)
{
    if (acquiredquickly)
        L->lockword.h.quicklock = 0;
    else
    {
        /* *** DO NORMAL DEFAULT LOCK RELEASE FUNCTION HERE *** */
    }
}
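The listing relies on a CAS primitive that it never defines. Below is a minimal sketch of the semantics I am assuming for it, written with C11 <stdatomic.h>: the swap happens only if the whole lock word still equals the expected value. The helpers cas_word and try_revoke are hypothetical names of mine, the lock word is modeled as a plain atomic_uint rather than the union above, and the macros are copied from the listing.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define BIASED(id) ((int)(id) << 2)
#define REVOKED 3
#define MAKEDWORD(low, high) (((unsigned int)(low) << 16) | (high))

/* Assumed CAS semantics: if *word == expected, store desired and
 * return 1; otherwise leave *word untouched and return 0. */
int cas_word(atomic_uint *word, unsigned int expected, unsigned int desired)
{
    assert(word != NULL);
    return atomic_compare_exchange_strong(word, &expected, desired);
}

/* Mirrors the revocation attempt: it only succeeds if both the sampled
 * quicklock value and the sampled status are still current. */
int try_revoke(atomic_uint *lockword, unsigned short biaslock, int status)
{
    return cas_word(lockword,
                    MAKEDWORD(biaslock, status),
                    MAKEDWORD(biaslock, REVOKED));
}
```

Under these assumptions, a revoker that sampled quicklock before the biased thread changed it gets a failed CAS and has to resample, which matches the "we need merely retry" comment in the listing.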
The ptrace system call allows the parent process to inspect the attached child. For example, in Linux, strace (which is implemented with the ptrace system call) can inspect the system calls invoked by the child process.
When the attached child process invokes a system call, the ptracing parent process can be notified. But how exactly does that happen? I want to know the technical details behind this mechanism.
Thank you in advance.
When the attached child process invokes a system call, the ptracing parent process can be notified. But how exactly does that happen?
Either the tracer calls ptrace with PTRACE_ATTACH, or the child itself calls ptrace with the PTRACE_TRACEME option. Either way, the two processes are connected by filling in some fields of their task_struct (kernel/ptrace.c: sys_ptrace; the child gets the PT_PTRACED flag in the ptrace field of struct task_struct, the ptracer becomes its parent, and the child is linked in through ptrace_entry by __ptrace_link; the tracer records the child in its ptraced list).
Then strace calls ptrace with the PTRACE_SYSCALL request to register itself as a syscall debugger, which sets the thread flag TIF_SYSCALL_TRACE in the child's struct thread_info (by something like set_tsk_thread_flag(child, TIF_SYSCALL_TRACE);). From arch/x86/include/asm/thread_info.h:
/*
 * thread information flags
 * - these are process state flags that various assembly files
 *   may need to access ...*/
...
#define TIF_SYSCALL_TRACE 0 /* syscall trace active */
...
#define _TIF_SYSCALL_TRACE (1 << TIF_SYSCALL_TRACE)
On every syscall entry or exit, architecture-specific syscall entry code checks this _TIF_SYSCALL_TRACE flag (directly in the assembler implementation of the syscall path; for example, on x86, arch/x86/kernel/entry_32.S has jnz syscall_trace_entry in ENTRY(system_call) and similar code in syscall_exit_work). If it is set, the ptracer is notified with a signal (SIGTRAP) and the child is temporarily stopped. This is usually done in syscall_trace_enter and syscall_trace_leave:
long syscall_trace_enter(struct pt_regs *regs)
...
    if ((ret || test_thread_flag(TIF_SYSCALL_TRACE)) &&
        tracehook_report_syscall_entry(regs))
        ret = -1L;
...
void syscall_trace_leave(struct pt_regs *regs)
...
    if (step || test_thread_flag(TIF_SYSCALL_TRACE))
        tracehook_report_syscall_exit(regs, step);
The tracehook_report_syscall_* functions are the actual workers here; they call ptrace_report_syscall. From include/linux/tracehook.h:
/**
 * tracehook_report_syscall_entry - task is about to attempt a system call
 * @regs: user register state of current task
 *
 * This will be called if %TIF_SYSCALL_TRACE has been set, when the
 * current task has just entered the kernel for a system call.
 * Full user register state is available here. Changing the values
 * in @regs can affect the system call number and arguments to be tried.
 * It is safe to block here, preventing the system call from beginning.
 *
 * Returns zero normally, or nonzero if the calling arch code should abort
 * the system call. That must prevent normal entry so no system call is
 * made. If @task ever returns to user mode after this, its register state
 * is unspecified, but should be something harmless like an %ENOSYS error
 * return. It should preserve enough information so that syscall_rollback()
 * can work (see asm-generic/syscall.h).
 *
 * Called without locks, just after entering kernel mode.
 */
static inline __must_check int tracehook_report_syscall_entry(
    struct pt_regs *regs)
{
    return ptrace_report_syscall(regs);
}

/**
 * tracehook_report_syscall_exit - task has just finished a system call
 * @regs: user register state of current task
 * @step: nonzero if simulating single-step or block-step
 *
 * This will be called if %TIF_SYSCALL_TRACE has been set, when the
 * current task has just finished an attempted system call. Full
 * user register state is available here. It is safe to block here,
 * preventing signals from being processed.
 *
 * If @step is nonzero, this report is also in lieu of the normal
 * trap that would follow the system call instruction because
 * user_enable_block_step() or user_enable_single_step() was used.
 * In this case, %TIF_SYSCALL_TRACE might not be set.
 *
 * Called without locks, just before checking for pending signals.
 */
static inline void tracehook_report_syscall_exit(struct pt_regs *regs, int step)
{
...

    ptrace_report_syscall(regs);
}
ptrace_report_syscall then generates SIGTRAP for the debugger or strace via ptrace_notify/ptrace_do_notify:
/*
 * ptrace report for syscall entry and exit looks identical.
 */
static inline int ptrace_report_syscall(struct pt_regs *regs)
{
    int ptrace = current->ptrace;

    if (!(ptrace & PT_PTRACED))
        return 0;

    ptrace_notify(SIGTRAP | ((ptrace & PT_TRACESYSGOOD) ? 0x80 : 0));

    /*
     * this isn't the same as continuing with a signal, but it will do
     * for normal use. strace only continues with a signal if the
     * stopping signal is not SIGTRAP. -brl
     */
    if (current->exit_code) {
        send_sig(current->exit_code, current, 1);
        current->exit_code = 0;
    }

    return fatal_signal_pending(current);
}
ptrace_notify is implemented in kernel/signal.c; it stops the child and passes the siginfo to the ptracer:
static void ptrace_do_notify(int signr, int exit_code, int why)
{
    siginfo_t info;

    memset(&info, 0, sizeof info);
    info.si_signo = signr;
    info.si_code = exit_code;
    info.si_pid = task_pid_vnr(current);
    info.si_uid = from_kuid_munged(current_user_ns(), current_uid());

    /* Let the debugger run. */
    ptrace_stop(exit_code, why, 1, &info);
}

void ptrace_notify(int exit_code)
{
    BUG_ON((exit_code & (0x7f | ~0xffff)) != SIGTRAP);
    if (unlikely(current->task_works))
        task_work_run();

    spin_lock_irq(&current->sighand->siglock);
    ptrace_do_notify(SIGTRAP, exit_code, CLD_TRAPPED);
    spin_unlock_irq(&current->sighand->siglock);
}
ptrace_stop is in the same signal.c file, at line 1839 in 3.13.
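From the tracer's side, the same mechanism can be observed with a short user-space program. This is only a sketch of what a strace-like tool does, not strace's actual code: the child asks to be traced with PTRACE_TRACEME, the parent resumes it with PTRACE_SYSCALL, and each stop shows up in waitpid as SIGTRAP | 0x80 because PTRACE_O_TRACESYSGOOD is set. The function names count_syscall_stops and give_up are mine, and in sandboxes that forbid ptrace the function simply returns -1.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Kill and reap the child after a tracer-side error. */
static int give_up(pid_t child)
{
    kill(child, SIGKILL);
    waitpid(child, NULL, 0);
    return -1;
}

/* Fork a child that asks to be traced, then count its syscall stops.
 * Returns the number of stops seen, or -1 on error. */
int count_syscall_stops(void)
{
    int status;
    int stops = 0;
    pid_t child = fork();

    if (child < 0)
        return -1;

    if (child == 0) {
        /* Tracee: request tracing, then stop so that the parent can
         * set its options before any syscall is reported. */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(127);
        raise(SIGSTOP);
        getpid();               /* a few syscalls to observe */
        getppid();
        _exit(0);
    }

    assert(child > 0);
    /* Wait for the initial SIGSTOP of the tracee. */
    if (waitpid(child, &status, 0) == -1)
        return -1;
    if (!WIFSTOPPED(status))
        return -1;              /* child exited early and is already reaped */
    if (ptrace(PTRACE_SETOPTIONS, child, NULL,
               (void *)(long)PTRACE_O_TRACESYSGOOD) == -1)
        return give_up(child);

    for (;;) {
        /* Resume until the next syscall entry or exit. */
        if (ptrace(PTRACE_SYSCALL, child, NULL, NULL) == -1)
            return give_up(child);
        if (waitpid(child, &status, 0) == -1)
            return -1;
        if (WIFEXITED(status) || WIFSIGNALED(status))
            break;
        /* Bit 0x80 marks a syscall stop when TRACESYSGOOD is set. */
        if (WIFSTOPPED(status) && WSTOPSIG(status) == (SIGTRAP | 0x80))
            stops++;
    }
    return stops;
}
```

Each syscall normally produces two stops, one on entry and one on exit, which corresponds to the syscall_trace_enter and syscall_trace_leave paths shown above.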
I want to read from and write to a process's memory through /dev/mem.
First, I get the process's memory map through a Linux kernel module that I wrote myself; the output looks like this:
start_code_segment 4000000000000000
end_code_segment 4000000000019c38
start_data_segment 6000000000009c38
end_data_segment 600000000000b21d
start_brk 6000000000010000
brk 6000000000034000
start_stack 60000fffffde7b00
Second, I can convert a virtual address (VA) to a physical address (PA) through the same kernel module; for example, VA 0x4000000000000008 converts to PA 0x100100c49f8008.
Third, the function read_phy_mem reads the memory at PA 0x100100c49f8008; its code is at the end of this post.
Problem: when I read physical memory that belongs to the text segment, everything is OK, but when I read physical memory that belongs to the data segment, *((long *)mapAddr) on line 243 brings the system down. I also tried
memcpy( &data, (void *)mapAddr, sizeof(long) )
but it still brings the system down.
Other info: my computer is IA64 and the OS is Linux 2.6.18. When the system goes down, I get output like this on the console, and then the system restarts.
Entered OS MCA handler. PSP=20010000fff21320 cpu=0 monarch=1
cpu 0, MCA occurred in user space, original stack not modified
All OS MCA slaves have reached rendezvous
MCA: global MCA
mlogbuf_finish: printing switched to urgent mode, MCA/INIT might be dodgy or fail.
Delaying for 5 seconds...
Code of function read_phy_mem:
/*
* pa: physical address
* data: memory data in pa
*
* return int: success or failed
*/
188 int read_phy_mem(unsigned long pa,long *data)
189 {
190 int memfd;
191 int pageSize;
192 int shift;
193 int do_mlock;
194 void volatile *mapStart;
195 void volatile *mapAddr;
196 unsigned long pa_base;
197 unsigned long pa_offset;
198
199 memfd = open("/dev/mem", O_RDWR | O_SYNC);
200 if(memfd == -1)
201 {
202 perror("Failed to open /dev/mem");
203 return FAIL;
204 }
205
206 shift = 0;
207 pageSize = PAGE_SIZE; //#define PAGE_SIZE 16384
208 while(pageSize > 0)
209 {
210 pageSize = pageSize >> 1;
211 shift ++;
212 }
213 shift --;
214 pa_base = (pa >> shift) << shift;
215 pa_offset = pa - pa_base;
224 mapStart = (void volatile *)mmap(0, PAGE_SIZE, PROT_READ | PROT_WRITE,MAP_SHARED | MAP_LOCKED, memfd, pa_base);
226 if(mapStart == MAP_FAILED)
227 {
228 perror("Failed to mmap /dev/mem");
229 close(memfd);
230 return FAIL;
231 }
232 if(mlock((void *)mapStart, PAGE_SIZE) == -1)
233 {
234 perror("Failed to mlock mmaped space");
235 do_mlock = 0;
236 }
237 do_mlock = 1;
238
239 mapAddr = (void volatile *)((unsigned long)mapStart + pa_offset);
243 printf("mapAddr %p %d\n", mapAddr, *((long *)mapAddr));
256 if(munmap((void *)mapStart, PAGE_SIZE) != 0)
257 {
258 perror("Failed to munmap /dev/mem");
259 }
260 close(memfd);
269 return OK;
270 }
Can anyone understand why text segment works well but data segment does not?
My guess is that this happens because the code section stays resident in memory while the process executes (unless it is DLL code), whereas data-section pages are moved in and out of memory continuously.
Try it with the stack segment and check whether that works.
Write your own test program that allocates a few KB of memory dynamically and keeps that memory in use within a loop. Then use your code to read the memory segments of that test program. I think it will work.
I have done similar work on Windows to replace a BIOS address in the IVT.
You need to be the root user.
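The test program suggested above could look roughly like this. It is a sketch under my own assumptions: touch_heap is a made-up name, and the loop is bounded here so that the function returns, whereas a real experiment would keep looping (and print the buffer address) for as long as the other tool needs to inspect the pages.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Allocate `kib` KiB, then write and read every byte for `rounds`
 * passes so that the pages stay in use. Returns the sum of all bytes
 * from the last pass, or -1 if the allocation fails. */
long touch_heap(size_t kib, int rounds)
{
    size_t size = kib * 1024;
    unsigned char *buf;
    long sum = 0;

    assert(kib > 0 && rounds > 0);
    buf = malloc(size);
    if (!buf)
        return -1;

    for (int r = 0; r < rounds; r++) {
        memset(buf, r & 0xff, size);        /* write every byte */
        sum = 0;
        for (size_t i = 0; i < size; i++)   /* read every byte back */
            sum += buf[i];
    }
    free(buf);
    return sum;
}
```

For example, touch_heap(4, 3) fills 4096 bytes with the value 2 on its last pass, so it returns 8192.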
I am struggling to understand the Boehm GC allocation scheme, specifically GC_malloc. I don't see how it allocates memory; I have not found any malloc or mmap call that GC_malloc makes internally.
Can someone kindly help me? Any links or code snippets would be a big help.
Huge thanks in advance.
Boehm GC source code
/* Allocate lb bytes of composite (pointerful) data */
#ifdef THREAD_LOCAL_ALLOC
void * GC_core_malloc(size_t lb)
#else
void * GC_malloc(size_t lb)
#endif
{
    void *op;
    void **opp;
    size_t lg;
    DCL_LOCK_STATE;

    if(SMALL_OBJ(lb)) {
        lg = GC_size_map[lb];
        opp = (void **)&(GC_objfreelist[lg]);
        LOCK();
        if( EXPECT((op = *opp) == 0, 0) ) {
            UNLOCK();
            return(GENERAL_MALLOC((word)lb, NORMAL));
        }
        /* See above comment on signals. */
        GC_ASSERT(0 == obj_link(op)
                  || (word)obj_link(op)
                        <= (word)GC_greatest_plausible_heap_addr
                     && (word)obj_link(op)
                        >= (word)GC_least_plausible_heap_addr);
        *opp = obj_link(op);
        obj_link(op) = 0;
        GC_bytes_allocd += GRANULES_TO_BYTES(lg);
        UNLOCK();
        return op;
    } else {
        return(GENERAL_MALLOC(lb, NORMAL));
    }
}
There are two possibilities:
1. It returns a pointer given by GENERAL_MALLOC (two of the returns).
2. It sets op = *opp (the line with the EXPECT) and then returns op. I'd say the second case is there to reuse freed blocks.
For the second case, look at the value of opp before the if:
opp = (void **)&(GC_objfreelist[lg]);
opp holds a pointer to the "free" list of objects.
The if probably checks whether there is any block in that list. If there isn't (== 0), it uses GENERAL_MALLOC.
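To make the free-list idea concrete, here is a small sketch of the same fast path and slow path. It is not Boehm GC's code: small_alloc, small_free, the 16-byte granule and the fallback to plain malloc are all my own assumptions for illustration, and small_free merely stands in for the collector handing reclaimed blocks back to a list.

```c
#include <assert.h>
#include <stdlib.h>

#define GRANULE 16          /* assumed granule size in bytes */
#define NUM_CLASSES 8       /* size classes of 1..7 granules */

/* One singly linked free list per size class; the first word of a
 * free block stores the link to the next free block. */
static void *free_lists[NUM_CLASSES];

static size_t size_class(size_t bytes)
{
    return (bytes + GRANULE - 1) / GRANULE;
}

void *small_alloc(size_t bytes)
{
    size_t lg = size_class(bytes);
    void *op;

    assert(lg > 0 && lg < NUM_CLASSES);
    op = free_lists[lg];
    if (op == NULL)
        return malloc(lg * GRANULE);    /* slow path: the list is empty */

    free_lists[lg] = *(void **)op;      /* pop the head of the list */
    *(void **)op = NULL;                /* clear the link in the block */
    return op;
}

void small_free(void *op, size_t bytes)
{
    size_t lg = size_class(bytes);

    assert(op != NULL && lg > 0 && lg < NUM_CLASSES);
    *(void **)op = free_lists[lg];      /* push the block onto the list */
    free_lists[lg] = op;
}
```

The pop in small_alloc plays the role of *opp = obj_link(op) in GC_malloc, and the malloc fallback stands in for GENERAL_MALLOC.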
Look at the os_dep.c file, where most of the OS-dependent functions are implemented.
mmap can be used by Boehm GC if it's configured to use it. (See the various GC_unix_get_mem(bytes) functions.)
If mmap isn't used, the other (bare) allocator uses sbrk.
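As a rough illustration of the mmap route, a collector can ask the OS for fresh pages like this. This is a sketch, not the GC_unix_get_mem implementation: get_mem_mmap is a hypothetical helper, and it assumes a system where MAP_ANONYMOUS is available.

```c
#define _DEFAULT_SOURCE     /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: get `bytes` of zero-filled, page-aligned memory
 * straight from the OS. Returns NULL on failure. */
void *get_mem_mmap(size_t bytes)
{
    void *p;

    assert(bytes > 0);
    p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

An allocator built this way carves the returned region into blocks itself, which would explain why no malloc call shows up inside GC_malloc.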