What does sparc_do_fork() exactly do?

What does sparc_do_fork() exactly do? - c

I tried to find out the functionality of this function but I couldn't.. It is defined in Linux/arch/sparc/kernel/process_32.c Thanks
asmlinkage int sparc_do_fork(unsigned long clone_flags,
unsigned long stack_start,
struct pt_regs *regs,
unsigned long stack_size)
{
unsigned long parent_tid_ptr, child_tid_ptr;
unsigned long orig_i1 = regs->u_regs[UREG_I1];
long ret;
parent_tid_ptr = regs->u_regs[UREG_I2];
child_tid_ptr = regs->u_regs[UREG_I4];
ret = do_fork(clone_flags, stack_start, stack_size,
(int __user *) parent_tid_ptr,
(int __user *) child_tid_ptr);
/* If we get an error and potentially restart the system
* call, we're screwed because copy_thread() clobbered
* the parent's %o1. So detect that case and restore it
* here.
*/
if ((unsigned long)ret >= -ERESTART_RESTARTBLOCK)
regs->u_regs[UREG_I1] = orig_i1;
return ret;
}

It appears to be right there in the source code.
It's a wrapper around the regular Linux do_fork() call, one which saves and restores data (specifically, regs->u_regs[UREG_I1], which equates to the SPARC output register 1) that would otherwise be corrupted under certain circumstances:
/* If we get an error and potentially restart the system
* call, we're screwed because copy_thread() clobbered
* the parent's %o1. So detect that case and restore it
* here.
*/
It does this with:
unsigned long orig_i1 = regs->u_regs[UREG_I1]; // Save it.
ret = do_fork(...);
if ((unsigned long)ret >= -ERESTART_RESTARTBLOCK) // It may be corrupt
regs->u_regs[UREG_I1] = orig_i1; // so restore it.

Related

Syscall hijacking from kernel

I want to intercept the open() syscall, for testing each time a file is opened by the user, the message “OPEN IS!” should be displayed in dmesg.
The syscall table and open-call addresses in dmesg are displayed, but the message “OPEN IS!” is not visible. Kernel v. 4.18
I would like to know what the problem is. The code:
unsigned long cr0;
static unsigned long *__sys_call_table;
typedef asmlinkage int (*orig_open_t)(const char *, int, int);
orig_open_t orig_open;
unsigned long *
get_syscall_table_bf(void)
{
unsigned long *syscall_table;
unsigned long int i;
for (i = (unsigned long int)ksys_close; i < ULONG_MAX;
i += sizeof(void *)) {
syscall_table = (unsigned long *)i;
if (syscall_table[__NR_close] == (unsigned long)ksys_close) {
printk(KERN_INFO "syscall: %08lx\n", syscall_table);
return syscall_table;
}
}
return NULL;
}
asmlinkage int
hacked_open(const char *filename, int flags, int mode)
{
printk(KERN_INFO "OPEN IS!\n");
return 0;
}
static inline void
protect_memory(void)
{
write_cr0(cr0);
}
static inline void
unprotect_memory(void)
{
write_cr0(cr0 & ~0x00010000);
}
static int __init
diamorphine_init(void)
{
__sys_call_table = get_syscall_table_bf();
if (!__sys_call_table)
return -1;
cr0 = read_cr0();
orig_open = (orig_open_t)__sys_call_table[__NR_open];
unprotect_memory();
__sys_call_table[__NR_open] = (unsigned long)hacked_open;
printk(KERN_INFO "WE DO IT!\n");
printk(KERN_INFO "hacked is: %08lx\n", hacked_open);
protect_memory();
return 0;
}
static void __exit
diamorphine_cleanup(void)
{
unprotect_memory();
__sys_call_table[__NR_open] = (unsigned long)orig_open;
protect_memory();
}
module_init(diamorphine_init);
module_exit(diamorphine_cleanup);
MODULE_LICENSE("GPL");

I'm guessing something in your hooking is wrong. Either you're hooking a wrong offset of the syscall table or you're completely off. I couldn't understand why explicitly you start searching with ksys_close(), especially when it's an inlined function. You should try looking for the syscall table symbol as such:
typedef void (*_syscall_ptr_t)(void);
_syscall_ptr_t *_syscall_table = NULL;
_syscall_table=(_syscall_ptr_t *)kallsyms_lookup_name("sys_call_table");
A different (huge) issue I see with this is resetting CR0, which allows anything within your system to write to a read only memory at the time of your writing, instead of page-walking and setting the W bit on the specific page you're about to edit.
Additional one small word of advice: You should complete your hook to redirect to the original open syscall. Otherwise, you'll result in the entire system reading from STDIN for every newly opened file descriptor (which will kill your system, eventually)

Use of iret when returning from exec system call

I noticed that at the end of the start_thread function, which is called after most of the work of exec is done, there is a call to force_iret:
static void
start_thread_common(struct pt_regs *regs, unsigned long new_ip,
unsigned long new_sp,
unsigned int _cs, unsigned int _ss, unsigned int _ds)
{
loadsegment(fs, 0);
loadsegment(es, _ds);
loadsegment(ds, _ds);
load_gs_index(0);
regs->ip = new_ip;
regs->sp = new_sp;
regs->cs = _cs;
regs->ss = _ss;
regs->flags = X86_EFLAGS_IF;
force_iret();
}
I presume that this is done to ensure that that sysexit is not used to return to user space. So why does iret have to be used when returning from exec?

This function modifies registers that sysret/sysexit would not restore.
Here's arch/x86/include/asm/thread_info.h:
/*
* Force syscall return via IRET by making it look as if there was
* some work pending. IRET is our most capable (but slowest) syscall
* return path, which is able to restore modified SS, CS and certain
* EFLAGS values that other (fast) syscall return instructions
* are not able to restore properly.
*/
#define force_iret() set_thread_flag(TIF_NOTIFY_RESUME)

How to get "valid data length" of a file?

There is a function to set the "valid data length" value: SetFileValidData, but I didn't find a way to get the "valid data length" value.
I want to know about given file if the EOF is different from the VDL, because writing after the VDL in case of VDL<EOF will cause a performance penalty as described here.

I found this page, claims that:
there is no mechanism to query the value of the VDL
So the answer is "you can't".
If you care about performance you can set the VDL to the EOF, but then note that you may allow access old garbage on your disk - the part between those two pointers, that supposed to be zeros if you would access that file without setting the VDL to point the EOF.

Looked into this. No way to get this information via any API, even the e.g. NtQueryInformationFile API (FileEndOfFileInformation only worked with NtSetInformationFile). So finally I read this by manually reading NTFS records. If anyone has a better way, please tell! This also obviously only works with full system access (and NTFS) and might be out of sync with the in-memory information Windows uses.
#pragma pack(push)
#pragma pack(1)
struct NTFSFileRecord
{
char magic[4];
unsigned short sequence_offset;
unsigned short sequence_size;
uint64 lsn;
unsigned short squence_number;
unsigned short hardlink_count;
unsigned short attribute_offset;
unsigned short flags;
unsigned int real_size;
unsigned int allocated_size;
uint64 base_record;
unsigned short next_id;
//char padding[470];
};
struct MFTAttribute
{
unsigned int type;
unsigned int length;
unsigned char nonresident;
unsigned char name_lenght;
unsigned short name_offset;
unsigned short flags;
unsigned short attribute_id;
unsigned int attribute_length;
unsigned short attribute_offset;
unsigned char indexed_flag;
unsigned char padding1;
//char padding2[488];
};
struct MFTAttributeNonResident
{
unsigned int type;
unsigned int lenght;
unsigned char nonresident;
unsigned char name_length;
unsigned short name_offset;
unsigned short flags;
unsigned short attribute_id;
uint64 starting_vnc;
uint64 last_vnc;
unsigned short run_offset;
unsigned short compression_size;
unsigned int padding;
uint64 allocated_size;
uint64 real_size;
uint64 initial_size;
};
#pragma pack(pop)
HANDLE GetVolumeData(const std::wstring& volfn, NTFS_VOLUME_DATA_BUFFER& vol_data)
{
HANDLE vol = CreateFileW(volfn.c_str(), GENERIC_WRITE | GENERIC_READ,
FILE_SHARE_READ|FILE_SHARE_WRITE|FILE_SHARE_DELETE, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
if (vol == INVALID_HANDLE_VALUE)
return vol;
DWORD ret_bytes;
BOOL b = DeviceIoControl(vol, FSCTL_GET_NTFS_VOLUME_DATA,
NULL, 0, &vol_data, sizeof(vol_data), &ret_bytes, NULL);
if (!b)
{
CloseHandle(vol);
return INVALID_HANDLE_VALUE;
}
return vol;
}
int64 GetFileValidData(HANDLE file, HANDLE vol, const NTFS_VOLUME_DATA_BUFFER& vol_data)
{
BY_HANDLE_FILE_INFORMATION hfi;
BOOL b = GetFileInformationByHandle(file, &hfi);
if (!b)
return -1;
NTFS_FILE_RECORD_INPUT_BUFFER record_in;
record_in.FileReferenceNumber.HighPart = hfi.nFileIndexHigh;
record_in.FileReferenceNumber.LowPart = hfi.nFileIndexLow;
std::vector<BYTE> buf;
buf.resize(sizeof(NTFS_FILE_RECORD_OUTPUT_BUFFER) + vol_data.BytesPerFileRecordSegment - 1);
NTFS_FILE_RECORD_OUTPUT_BUFFER* record_out = reinterpret_cast<NTFS_FILE_RECORD_OUTPUT_BUFFER*>(buf.data());
DWORD bout;
b = DeviceIoControl(vol, FSCTL_GET_NTFS_FILE_RECORD, &record_in,
sizeof(record_in), record_out, 4096, &bout, NULL);
if (!b)
return -1;
NTFSFileRecord* record = reinterpret_cast<NTFSFileRecord*>(record_out->FileRecordBuffer);
unsigned int currpos = record->attribute_offset;
MFTAttribute* attr = nullptr;
while ( (attr==nullptr ||
attr->type != 0xFFFFFFFF )
&& record_out->FileRecordBuffer + currpos +sizeof(MFTAttribute)<buf.data() + bout)
{
attr = reinterpret_cast<MFTAttribute*>(record_out->FileRecordBuffer + currpos);
if (attr->type == 0x80
&& record_out->FileRecordBuffer + currpos + attr->attribute_offset+sizeof(MFTAttributeNonResident)
< buf.data()+ bout)
{
if (attr->nonresident == 0)
return -1;
MFTAttributeNonResident* dataattr = reinterpret_cast<MFTAttributeNonResident*>(record_out->FileRecordBuffer
+ currpos + attr->attribute_offset);
return dataattr->initial_size;
}
currpos += attr->length;
}
return -1;
}
[...]
NTFS_VOLUME_DATA_BUFFER vol_data;
HANDLE vol = GetVolumeData(L"\\??\\D:", vol_data);
if (vol != INVALID_HANDLE_VALUE)
{
int64 vdl = GetFileValidData(alloc_test->getOsHandle(), vol, vol_data);
if(vdl>=0) { [...] }
[...]
}
[...]

The SetValidData (according to MSDN) can be used to create for example a large file without having to write to the file. For a database this will allocate a (contiguous) storage area.
As a result, it seems the file size on disk will have changed without any data having been written to the file.
By implication, any GetValidData (which does not exist) just returns the size of the file, so you can use GetFileSize which returns the "valid" file size.

I think you are confused as to what "valid data length" actually means. Check this answer.
Basically, while SetEndOfFile lets you increase the length of a file quickly, and allocates the disk space, if you skip to the (new) end-of-file to write there, all the additionally allocated disk space would need to be overwritten with zeroes, which is kind of slow.
SetFileValidData lets you skip that zeroing-out. You're telling the system, "I am OK with whatever is in those disk blocks, get on with it". (This is why you need the SE_MANAGE_VOLUME_NAME priviledge, as it could reveal priviledged data to unpriviledged users if you don't overwrite the data. Users with this priviledge can access the raw drive data anyway.)
In either case, you have set the new effective size of the file. (Which you can read back.) What, exactly, should a seperate "read file valid data" report back? SetFileValidData told the system that whatever is in those disk blocks is "valid"...
Different approach of explanation:
The documentation mentions that the "valid data length" is being tracked; the purpose for this is for the system to know which range (from end-of-valid-data to end-of-file) it still needs to zero out, in the context of SetEndOfFile, when necessary (e.g. you closing the file). You don't need to read back this value, because the only way it could be different from the actual file size is because you, yourself, did change it via the aforementioned functions...

why is access_ok failing for this ioctl

EDIT: I don't have a good answer yet as to why I'm getting a failure here... So let me rephrase this a little. Do I even need the verify_area() check? What is the point of that? I have tested out the fact that my structure gets passed successfully to this ioctl, I'm thinking of just removing the failing check, but I'm not 100% what it's in there to do. Thoughts?
END EDIT
I'm working to update some older linux kernel drivers and while testing one out I'm getting a failure which seems odd to me. Here we go:
I have a simple ioctl call in user space:
Config_par_t cfg;
int ret;
cfg.target = CONF_TIMING;
cfg.val1 = nBaud;
ret = ioctl(fd, CAN_CONFIG, &cfg);
The Config_par_t is defined in can4linux.h file (this is the CAN driver that comes with uCLinux):
typedef struct Command_par {
int cmd; /**< special driver command */
int target; /**< special configuration target */
unsigned long val1; /**< 1. parameter for the target */
unsigned long val2; /**< 2. parameter for the target */
int error; /**< return value */
unsigned long retval; /**< return value */
} Command_par_t ;
In the kernel side of things, the ioctl function calls verify_area, which is the failing procedure:
long can_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
void *argp;
long retval = -EIO;
Message_par_t Message;
Command_par_t Command;
struct inode *inode = file->f_path.dentry->d_inode;
argp = &Message;
Can_errno = 0;
switch(cmd) {
case CONFIG:
if( verify_area(VERIFY_READ, (void *) arg, sizeof(Command_par_t))) {
return(retval);
}
Now I know that verify_area() isn't used anymore so I updated it in a header file with this macro to access_ok:
#if LINUX_VERSION_CODE > KERNEL_VERSION(2, 6, 0)
#define verify_area(type, addr, size) access_ok(type, addr, size)
#endif
I'm on a x86 platform so I'm pretty sure the actual access_ok() macro being called is the one in /usr/src/linux/arch/x86/include/asm/uaccess.h as defined here:
#define access_ok(type, addr, size) (likely(__range_not_ok(addr, size) == 0))
#define __range_not_ok(addr, size) \
({ \
unsigned long flag, roksum; \
__chk_user_ptr(addr); \
asm("add %3,%1 ; sbb %0,%0 ; cmp %1,%4 ; sbb $0,%0" \
: "=&r" (flag), "=r" (roksum) \
: "1" (addr), "g" ((long)(size)), \
"rm" (current_thread_info()->addr_limit.seg)); \
flag; \
})
I guess to me this looks like it should be working. Any ideas why I'm getting a 1 return from this verify_area if check? Or any ideas on how I can go about narrowing down the problem?
if( verify_area(VERIFY_READ, (void *) arg, sizeof(Command_par_t))) {

The macro access_ok returns 0 if the block is invalid and nonzero if it may be valid. So in your test, if the block is valid you immediately return -EIO. The way things look, you might want to negate the result of access_ok, something like:
if (!access_ok(...))

Is it necessary to disable IRQ during copying memory for sg_copy_buffer?

Here is the function of sg_copy_buffer for Linux kernel 2.6.32. Is it necessary to disable IRQ during copying memory?
static size_t sg_copy_buffer(struct scatterlist *sgl, unsigned int nents,
void *buf, size_t buflen, int to_buffer)
{
unsigned int offset = 0;
struct sg_mapping_iter miter;
unsigned long flags;
unsigned int sg_flags = SG_MITER_ATOMIC;
if (to_buffer)
sg_flags |= SG_MITER_FROM_SG;
else
sg_flags |= SG_MITER_TO_SG;
sg_miter_start(&miter, sgl, nents, sg_flags);
local_irq_save(flags);
while (sg_miter_next(&miter) && offset < buflen) {
unsigned int len;
len = min(miter.length, buflen - offset);
if (to_buffer)
memcpy(buf + offset, miter.addr, len);
else
memcpy(miter.addr, buf + offset, len);
offset += len;
}
sg_miter_stop(&miter);
local_irq_restore(flags);
return offset;
}

The sg_miter_start() function that is called in this function calls kmap_atomic(), which can only be used inside atomic (non interruptible) code paths. kmap_atomic() in turn is being used since it is MUCH cheaper then a regular kmap, since it does not need to do a global TLB flush.
The original implementation of sg_copy_buffer() left disabling interrupts to the caller, but after some callers forgot, causing bugs (e.g. https://bugzilla.kernel.org/show_bug.cgi?id=11529) the decision was made to disable interrupt in the function itself (see: http://www.spinics.net/lists/linux-scsi/msg29428.html for the discussion).