For now, only the non-standard _SC_NPROCESSORS_CONF and
_SC_NPROCESSORS_ONLN are implemented.
Use them to make ninja pick a better default -j value.
While here, make the ninja package script not fail if
no other port has been built yet.
The AT_* entries are placed after the environment variables, so that
they can be found by iterating until the end of the envp array, and then
going even further beyond :^)
When delivering urgent signals to the current thread
we need to check if we should be unblocked, and if not
we need to yield to another process.
We also need to make sure that we suppress context switches
during Process::exec() so that we don't clobber the registers
that it sets up (eip mainly) by a context switch. To be able
to do that we add the concept of a critical section, which are
similar to Process::m_in_irq but different in that they can be
requested at any time. Calls to Scheduler::yield and
Scheduler::donate_to will return instantly without triggering
a context switch, but the processor will then asynchronously
trigger a context switch once the critical section is left.
These new syscalls allow you to send and receive file descriptors over
a local domain socket. This will enable various privilege separation
techniques and other good stuff. :^)
ppoll() is similar() to poll(), but it takes its timeout
as timespec instead of as int, and it takes an additional
sigmask parameter.
Change the sys$poll parameters to match ppoll() and implement
poll() in terms of ppoll().
It looks like they're considered a bad idea, so let's not add
them before we need them. I figured it's good to have them in
git history if we ever do need them though, hence the add/remove
dance.
Add seteuid()/setegid() under _POSIX_SAVED_IDS semantics,
which also requires adding suid and sgid to Process, and
changing setuid()/setgid() to honor these semantics.
The exact semantics aren't specified by POSIX and differ
between different Unix implementations. This patch makes
serenity follow FreeBSD. The 2002 USENIX paper
"Setuid Demystified" explains the differences well.
In addition to seteuid() and setegid() this also adds
setreuid()/setregid() and setresuid()/setresgid(), and
the accessors getresuid()/getresgid().
Also reorder uid/euid functions so that they are the
same order everywhere (namely, the order that
geteuid()/getuid() already have).
You now have to pledge "sigaction" to change signal handlers/dispositions. This
is to prevent malicious code from messing with assertions (and segmentation
faults), which are normally expected to instantly terminate the process but can
do other things if you change signal disposition for them.
This was a holdover from the old times when each Process had a special
main thread with TID 0. Using it was a total crapshoot since it would
just return whichever thread was first on the process's thread list.
Now that I've removed all uses of it, we don't need it anymore. :^)
Instead of falling back to the suspicious "any_thread()" mechanism,
just fail with ESRCH if you try to kill() a PID that doesn't have a
corresponding TID.
This was supposed to be the foundation for some kind of pre-kernel
environment, but nobody is working on it right now, so let's move
everything back into the kernel and remove all the confusion.
We stopped using gettimeofday() in Core::EventLoop a while back,
in favor of clock_gettime() for monotonic time.
Maintaining an optimization for a syscall we're not using doesn't make
a lot of sense, so let's go back to the old-style sys$gettimeofday().
Ultimately we should not panic just because we can't fully commit a VM
region (by populating it with physical pages.)
This patch handles some of the situations where commit() can fail.
This patch adds PageFaultResponse::OutOfMemory which informs the fault
handler that we were unable to allocate a necessary physical page and
cannot continue.
In response to this, the kernel will crash the current process. Because
we are OOM, we can't symbolicate the crash like we normally would
(since the ELF symbolication code needs to allocate), so we also
communicate to Process::crash() that we're out of memory.
Now we can survive "allocate 300 MB" (only the allocate process dies.)
This is definitely not perfect and can easily end up killing a random
innocent other process who happened to allocate one page at the wrong
time, but it's a *lot* better than panicking on OOM. :^)
This is a special case that was previously not implemented.
The idea is that you can dispatch a signal to all other processes
the calling process has access to.
There was some minor refactoring to make the self signal logic
into a function so it could easily be easily re-used from do_killall.
Previously, when returning from a pthread's start_routine, we would
segfault. Now we instead implicitly call pthread_exit as specified in
the standard.
pthread_create now creates a thread running the new
pthread_create_helper, which properly manages the calling and exiting
of the start_routine supplied to pthread_create. To accomplish this,
the thread's stack initialization has been moved out of
sys$create_thread and into the userspace function create_thread.
PT_SETTREGS sets the regsiters of the traced thread. It can only be
used when the tracee is stopped.
Also, refactor ptrace.
The implementation was getting long and cluttered the alraedy large
Process.cpp file.
This commit moves the bulk of the implementation to Kernel/Ptrace.cpp,
and factors out peek & poke to separate methods of the Process class.
This patch adds the minherit() syscall originally invented by OpenBSD.
Only the MAP_INHERIT_ZERO mode is supported for now. If set on an mmap
region, that region will be zeroed out on fork().
This commit adds a basic implementation of
the ptrace syscall, which allows one process
(the tracer) to control another process (the tracee).
While a process is being traced, it is stopped whenever a signal is
received (other than SIGCONT).
The tracer can start tracing another thread with PT_ATTACH,
which causes the tracee to stop.
From there, the tracer can use PT_CONTINUE
to continue the execution of the tracee,
or use other request codes (which haven't been implemented yet)
to modify the state of the tracee.
Additional request codes are PT_SYSCALL, which causes the tracee to
continue exection but stop at the next entry or exit from a syscall,
and PT_GETREGS which fethces the last saved register set of the tracee
(can be used to inspect syscall arguments and return value).
A special request code is PT_TRACE_ME, which is issued by the tracee
and causes it to stop when it calls execve and wait for the
tracer to attach.
This was only used by the mechanism for mapping executables into each
process's own address space. Now that we remap executables on demand
when needed for symbolication, this can go away.
Previously we would map the entire executable of a program in its own
address space (but make it unavailable to userspace code.)
This patch removes that and changes the symbolication code to remap
the executable on demand (and into the kernel's own address space
instead of the process address space.)
This opens up a couple of further simplifications that will follow.
Add an extra out-parameter to shbuf_get() that receives the size of the
shared buffer. That way we don't need to make a separate syscall to
get the size, which we always did immediately after.
This feels a lot more consistent and Unixy:
create_shared_buffer() => shbuf_create()
share_buffer_with() => shbuf_allow_pid()
share_buffer_globally() => shbuf_allow_all()
get_shared_buffer() => shbuf_get()
release_shared_buffer() => shbuf_release()
seal_shared_buffer() => shbuf_seal()
get_shared_buffer_size() => shbuf_get_size()
Also, "shared_buffer_id" is shortened to "shbuf_id" all around.
This allows a process wich has more than 1 thread to call exec, even
from a thread. This kills all the other threads, but it won't wait for
them to finish, just makes sure that they are not in a running/runable
state.
In the case where a thread does exec, the new program PID will be the
thread TID, to keep the PID == TID in the new process.
This introduces a new function inside the Process class,
kill_threads_except_self which is called on exit() too (exit with
multiple threads wasn't properly working either).
Inside the Lock class, there is the need for a new function,
clear_waiters, which removes all the waiters from the
Process::big_lock. This is needed since after a exit/exec, there should
be no other threads waiting for this lock, the threads should be simply
killed. Only queued threads should wait for this lock at this point,
since blocked threads are handled in set_should_die.
Replace Process::m_being_inspected with an inspector reference count.
This prevents an assertion from firing when inspecting the same process
in /proc from multiple processes at the same time.
It was trivially reproducible by opening multiple FileManagers.
This mechanism wasn't actually used to create any WeakPtr<Process>.
Such pointers would be pretty hard to work with anyway, due to the
multi-step destruction ritual of Process.
Calling shutdown prevents further reads and/or writes on a socket.
We should do a few more things based on the type of socket, but this
initial implementation just puts the basic mechanism in place.
Work towards #428.
sys$waitid() takes an explicit description of whether it's waiting for a single
process with the given PID, all of the children, a group, etc., and returns its
info as a siginfo_t.
It also doesn't automatically imply WEXITED, which clears up the confusion in
the kernel.
This patch introduces sys$perf_event() with two event types:
- PERF_EVENT_MALLOC
- PERF_EVENT_FREE
After the first call to sys$perf_event(), a process will begin keeping
these events in a buffer. When the process dies, that buffer will be
written out to "perfcore" in the current directory unless that filename
is already taken.
This is probably not the best way to do this, but it's a start and will
make it possible to start doing memory allocation profiling. :^)
When using dbg() in the kernel, the output is automatically prefixed
with [Process(PID:TID)]. This makes it a lot easier to understand which
thread is generating the output.
This patch also cleans up some common logging messages and removes the
now-unnecessary "dbg() << *current << ..." pattern.
This syscall is a complement to pledge() and adds the same sort of
incremental relinquishing of capabilities for filesystem access.
The first call to unveil() will "drop a veil" on the process, and from
now on, only unveiled parts of the filesystem are visible to it.
Each call to unveil() specifies a path to either a directory or a file
along with permissions for that path. The permissions are a combination
of the following:
- r: Read access (like the "rpath" promise)
- w: Write access (like the "wpath" promise)
- x: Execute access
- c: Create/remove access (like the "cpath" promise)
Attempts to open a path that has not been unveiled with fail with
ENOENT. If the unveiled path lacks sufficient permissions, it will fail
with EACCES.
Like pledge(), subsequent calls to unveil() with the same path can only
remove permissions, not add them.
Once you call unveil(nullptr, nullptr), the veil is locked, and it's no
longer possible to unveil any more paths for the process, ever.
This concept comes from OpenBSD, and their implementation does various
things differently, I'm sure. This is just a first implementation for
SerenityOS, and we'll keep improving on it as we go. :^)
As suggested by Joshua, this commit adds the 2-clause BSD license as a
comment block to the top of every source file.
For the first pass, I've just added myself for simplicity. I encourage
everyone to add themselves as copyright holders of any file they've
added or modified in some significant way. If I've added myself in
error somewhere, feel free to replace it with the appropriate copyright
holder instead.
Going forward, all new source files should include a license header.
The syscall is now called sys$open(), but it behaves like the old sys$openat().
In userspace, open_with_path_length() is made a wrapper over openat_with_path_length().
This patch adds a new "accept" promise that allows you to call accept()
on an already listening socket. This lets programs set up a socket for
for listening and then dropping "inet" and/or "unix" so that only
incoming (and existing) connections are allowed from that point on.
No new outgoing connections or listening server sockets can be created.
In addition to accept() it also allows getsockopt() with SOL_SOCKET
and SO_PEERCRED, which is used to find the PID/UID/GID of the socket
peer. This is used by our IPC library when creating shared buffers that
should only be accessible to a specific peer process.
This allows us to drop "unix" in WindowServer and LookupServer. :^)
It also makes the debugging/introspection RPC sockets in CEventLoop
based programs work again.
This patch changes how exec() figures out which program image to
actually load. Previously, we opened the path to our main executable in
find_shebang_interpreter_for_executable, read the first page (or less,
if the file was smaller) and then decided whether to recurse with the
interpreter instead. We then then re-opened the main executable in
do_exec.
However, since we now want to parse the ELF header and Program Headers
of an elf image before even doing any memory region work, we can change
the way this whole process works. We open the file and read (up to) the
first page in exec() itself, then pass just the page and the amount read
to find_shebang_interpreter_for_executable. Since we now have that page
and the FileDescription for the main executable handy, we can do a few
things. First, validate the ELF header and ELF program headers for any
shenanigans. ELF32 Little Endian i386 only, please. Second, we can grab
the PT_INTERP interpreter from any ET_DYN files, and open that guy right
away if it exists. Finally, we can pass the main executable's and
optionally the PT_INTERP interpreter's file descriptions down to do_exec
and not have to feel guilty about opening the file twice.
In do_exec, we now have a choice. Are we going to load the main
executable, or the interpreter? We could load both, but it'll be way
easier for the inital pass on the RTLD if we only load the interpreter.
Then it can load the main executable itself like any old shared object,
just, the one with main in it :). Later on we can load both of them
into memory and the RTLD can relocate itself before trying to do
anything. The way it's written now the RTLD will get dibs on its
requested virtual addresses being the actual virtual addresses.
Right now there is a significant amount of boiler plate code required
to validate user mode parameters in syscalls. In an attempt to reduce
this a bit, introduce validate_read_and_copy_typed which combines the
usermode address check and does the copy internally if the validation
passes. This cleans up a little bit of code from a significant amount
of syscalls.
Since a chroot is in many ways similar to a separate root mount, we can also
apply mount flags to it as if it was an actual mount. These flags will apply
whenever the chrooted process accesses its root directory, but not when other
processes access this same directory for the outside. Since it's common to
chdir("/") immediately after chrooting (so that files accessed through the
current directory inherit the same mount flags), this effectively allows one to
apply additional limitations to a process confined inside a chroot.
To this effect, sys$chroot() gains a mount_flags argument (exposed as
chroot_with_mount_flags() in userspace) which can be set to all the same values
as the flags argument for sys$mount(), and additionally to -1 to keep the flags
set for that file system. Note that passing 0 as mount_flags will unset any
flags that may have been set for the file system, not keep them.
While I was updating syscalls to stop passing null-terminated strings,
I added some helpful struct types:
- StringArgument { const char*; size_t; }
- ImmutableBuffer<Data, Size> { const Data*; Size; }
- MutableBuffer<Data, Size> { Data*; Size; }
The Process class has some convenience functions for validating and
optionally extracting the contents from these structs:
- get_syscall_path_argument(StringArgument)
- validate_and_copy_string_from_user(StringArgument)
- validate(ImmutableBuffer)
- validate(MutableBuffer)
There's still so much code around this and I'm wondering if we should
generate most of it instead. Possible nice little project.
In order to preserve the absolute path of the process root, we save the
custody used by chroot() before stripping it to become the new "/".
There's probably a better way to do this.
The chroot() syscall now allows the superuser to isolate a process into
a specific subtree of the filesystem. This is not strictly permanent,
as it is also possible for a superuser to break *out* of a chroot, but
it is a useful mechanism for isolating unprivileged processes.
The VFS now uses the current process's root_directory() as the root for
path resolution purposes. The root directory is stored as an uncached
Custody in the Process object.
Note that I'm developing some helper types in the Syscall namespace as
I go here. Once I settle on some nice types, I will convert all the
other syscalls to use them as well.
The userspace execve() wrapper now measures all the strings and puts
them in a neat and tidy structure on the stack.
This way we know exactly how much to copy in the kernel, and we don't
have to use the SMAP-violating validate_read_str(). :^)
When loading a new executable, we now map the ELF image in kernel-only
memory and parse it there. Then we use copy_to_user() when initializing
writable regions with data from the executable.
Note that the exec() syscall still disables SMAP protection and will
require additional work. This patch only affects kernel-originated
process spawns.
This encourages callers to strongly reference file descriptions while
working with them.
This fixes a use-after-free issue where one thread would close() an
open fd while another thread was blocked on it becoming readable.
Test: Kernel/uaf-close-while-blocked-in-read.cpp
This code never worked, as was never used for anything. We can build
a much better SHM implementation on top of TmpFS or similar when we
get to the point when we need one.
Split a region into two/three if the desired mprotect range is a strict
subset of an existing region. We can then set the access bits on a new
region that is just our desired range and add both the new
desired subregion and the leftovers back to our page tables.
This patch introduces a syscall:
int set_thread_boost(int tid, int amount)
You can use this to add a permanent boost value to the effective thread
priority of any thread with your UID (or any thread in the system if
you are the superuser.)
This is quite crude, but opens up some interesting opportunities. :^)
Threads now have numeric priorities with a base priority in the 1-99
range.
Whenever a runnable thread is *not* scheduled, its effective priority
is incremented by 1. This is tracked in Thread::m_extra_priority.
The effective priority of a thread is m_priority + m_extra_priority.
When a runnable thread *is* scheduled, its m_extra_priority is reset to
zero and the effective priority returns to base.
This means that lower-priority threads will always eventually get
scheduled to run, once its effective priority becomes high enough to
exceed the base priority of threads "above" it.
The previous values for ThreadPriority (Low, Normal and High) are now
replaced as follows:
Low -> 10
Normal -> 30
High -> 50
In other words, it will take 20 ticks for a "Low" priority thread to
get to "Normal" effective priority, and another 20 to reach "High".
This is not perfect, and I've used some quite naive data structures,
but I think the mechanism will allow us to build various new and
interesting optimizations, and we can figure out better data structures
later on. :^)
This is memory that's loaded from an inode (file) but not modified in
memory, so still identical to what's on disk. This kind of memory can
be freed and reloaded transparently from disk if needed.
Dirty private memory is all memory in non-inode-backed mappings that's
process-private, meaning it's not shared with any other process.
This patch exposes that number via SystemMonitor, giving us an idea of
how much memory each process is responsible for all on its own.
We don't care about dead processes that were once members of a specific
process group.
This was causing us to try and send SIGINT to already-dead processes
when pressing Ctrl+C in a terminal whose pgrp they were once in.
Fixes#922.
This patch implements a simple version of the futex (fast userspace
mutex) API in the kernel and uses it to make the pthread_cond_t API's
block instead of busily sched_yield().
An arbitrary userspace address is passed to the kernel as a "token"
that identifies the futex and you can then FUTEX_WAIT and FUTEX_WAKE
that specific userspace address.
FUTEX_WAIT corresponds to pthread_cond_wait() and FUTEX_WAKE is used
for pthread_cond_signal() and pthread_cond_broadcast().
I'm pretty sure I'm missing something in this implementation, but it's
hopefully okay for a start. :^)
This is a little strange, but it's how I understand things should work.
The first thread in a new process now has TID == PID.
Additional threads subsequently spawned in that process all have unique
TID's generated by the PID allocator. TIDs are now globally unique.
The idea of all processes reliably having a main thread was nice in
some ways, but cumbersome in others. More importantly, it didn't match
up with POSIX thread semantics, so let's move away from it.
This thread gets rid of Process::main_thread() and you now we just have
a bunch of Thread objects floating around each Process.
When the finalizer nukes the last Thread in a Process, it will also
tear down the Process.
There's a bunch of more things to fix around this, but this is where we
get started :^)
This patch adds a single "kernel info page" that is mappable read-only
by any process and contains the current time of day.
This is then used to implement a version of gettimeofday() that doesn't
have to make a syscall.
To protect against race condition issues, the info page also has a
serial number which is incremented whenever the kernel updates the
contents of the page. Make sure to verify that the serial number is the
same before and after reading the information you want from the page.
The kernel now supports basic profiling of all the threads in a process
by calling profiling_enable(pid_t). You finish the profiling by calling
profiling_disable(pid_t).
This all works by recording thread stacks when the timer interrupt
fires and the current thread is in a process being profiled.
Note that symbolication is deferred until profiling_disable() to avoid
adding more noise than necessary to the profile.
A simple "/bin/profile" command is included here that can be used to
start/stop profiling like so:
$ profile 10 on
... wait ...
$ profile 10 off
After a profile has been recorded, it can be fetched in /proc/profile
There are various limits (or "bugs") on this mechanism at the moment:
- Only one process can be profiled at a time.
- We allocate 8MB for the samples, if you use more space, things will
not work, and probably break a bit.
- Things will probably fall apart if the profiled process dies during
profiling, or while extracing /proc/profile
This patch makes SharedBuffer use a PurgeableVMObject as its underlying
memory object.
A new syscall is added to control the volatile flag of a SharedBuffer.
It's now possible to get purgeable memory by using mmap(MAP_PURGEABLE).
Purgeable memory has a "volatile" flag that can be set using madvise():
- madvise(..., MADV_SET_VOLATILE)
- madvise(..., MADV_SET_NONVOLATILE)
When in the "volatile" state, the kernel may take away the underlying
physical memory pages at any time, without notifying the owner.
This gives you a guilt discount when caching very large things. :^)
Setting a purgeable region to non-volatile will return whether or not
the memory has been taken away by the kernel while being volatile.
Basically, if madvise(..., MADV_SET_NONVOLATILE) returns 1, that means
the memory was purged while volatile, and whatever was in that piece
of memory needs to be reconstructed before use.
The main thread of each kernel/user process will take the name of
the process. Extra threads will get a fancy new name
"ProcessName[<tid>]".
Thread backtraces now list the thread name in addtion to tid.
Add the thread name to /proc/all (should it get its own proc
file?).
Add two new syscalls, set_thread_name and get_thread_name.
It's now possible to load a .o file into the kernel via a syscall.
The kernel will perform all the necessary ELF relocations, and then
call the "module_init" symbol in the loaded module.
Add an initial implementation of pthread attributes for:
* detach state (joinable, detached)
* schedule params (just priority)
* guard page size (as skeleton) (requires kernel support maybe?)
* stack size and user-provided stack location (4 or 8 MB only, must be aligned)
Add some tests too, to the thread test program.
Also, LibC: Move pthread declarations to sys/types.h, where they belong.
This can be implemented entirely in userspace by calling tcgetattr().
To avoid screwing up the syscall indexes, this patch also adds a
mechanism for removing a syscall without shifting the index of other
syscalls.
Note that ports will still have to be rebuilt after this change,
as their LibC code will try to make the isatty() syscall on startup.
Have pthread_create() allocate a stack and passing it to the kernel
instead of this work happening in the kernel. The more of this we can
do in userspace, the better.
This patch also unexposes the raw create_thread() and exit_thread()
syscalls since they are now only used by LibPthread anyway.
It's now possible to block until another thread in the same process has
exited. We can also retrieve its exit value, which is whatever value it
passed to pthread_exit(). :^)
This patch adds pthread_create() and pthread_exit(), which currently
simply wrap our existing create_thread() and exit_thread() syscalls.
LibThread is also ported to using LibPthread.
POSIX's openat() is very similar to open(), except you also provide a
file descriptor referring to a directory from which relative paths
should be resolved.
Passing it the magical fd number AT_FDCWD means "resolve from current
directory" (which is indeed also what open() normally does.)
This fixes libarchive's bsdtar, since it was trying to do something
extremely wrong in the absence of openat() support. The issue has
recently been fixed upstream in libarchive:
https://github.com/libarchive/libarchive/issues/1239
However, we should have openat() support anyway, so I went ahead and
implemented it. :^)
Fixes#748.
Instead of the big ugly switch statement, build a lookup table using
the syscall enumeration macro.
This greatly simplifies the syscall implementation. :^)
It's not safe to use a raw pointer for Process::m_tty. A pseudoterminal
pair will disappear when file descriptors are closed, and we'd end up
looking dangly. Just use a RefPtr.
Scheduling priority is now set at the thread level instead of at the
process level.
This is a step towards allowing processes to set different priorities
for threads. There's no userspace API for that yet, since only the main
thread's priority is affected by sched_setparam().
Add the ability to both pass arguments to scripts with shebangs
(./script argument1 argument2) and to specify them in the shebang line
(#!/usr/local/bin/bash -x -e)
Fixes#585
The way it gets the entropy and blasts it to the buffer is pretty
ugly IMHO, but it does work for now. (It should be replaced, by
not truncating a u32.)
It implements an (unused for now) flags argument, like Linux but
instead of OpenBSD's. This is in case we want to distinguish
between entropy sources or any other reason and have to implement
a new syscall later. Of course, learn from Linux's struggles with
entropy sourcing too.
This patch adds three separate per-process fault counters:
- Inode faults
An inode fault happens when we've memory-mapped a file from disk
and we end up having to load 1 page (4KB) of the file into memory.
- Zero faults
Memory returned by mmap() is lazily zeroed out. Every time we have
to zero out 1 page, we count a zero fault.
- CoW faults
VM objects can be shared by multiple mappings that make their own
unique copy iff they want to modify it. The typical reason here is
memory shared between a parent and child process.
We were always returning the full VM range of the partially-unmapped
Region to the range allocator. This caused us to re-use those addresses
for subsequent VM allocations.
This patch also skips creating a new VMObject in partial munmap().
Instead we just make split regions that point into the same VMObject.
This fixes the mysterious GCC ICE on large C++ programs.
This simplifies the ownership model and makes Region easier to reason
about. Userspace Regions are now primarily kept by Process::m_regions.
Kernel Regions are kept in various OwnPtr<Regions>'s.
Regions now only ever get unmapped when they are destroyed.
This patch makes it possible to *run* text files that start with the
characters "#!" followed by an interpreter.
I've tested this with both the Serenity built-in shell and the Bash
shell, and it works as expected. :^)
The fchdir() function is equivalent to chdir() except that the
directory that is to be the new current working directory is
specified by a file descriptor.
This patch adds support for TLS according to the x86 System V ABI.
Each thread gets a thread-specific memory region, and the GS segment
register always points _to a pointer_ to the thread-specific memory.
In other words, to access thread-local variables, userspace programs
start by dereferencing the pointer at [gs:0].
The Process keeps a master copy of the TLS segment that new threads
should use, and when a new thread is created, they get a copy of it.
It's basically whatever the PT_TLS program header in the ELF says.
This was a workaround to be able to build on case-insensitive file
systems where it might get confused about <string.h> vs <String.h>.
Let's just not support building that way, so String.h can have an
objectively nicer name. :^)
This commit drastically changes how signals are handled.
In the case that an unblocked thread is signaled it works much
in the same way as previously. However, when a blocking syscall
is interrupted, we set up the signal trampoline on the user
stack, complete the blocking syscall, return down the kernel
stack and then jump to the handler. This means that from the
kernel stack's perspective, we only ever get one system call deep.
The signal trampoline has also been changed in order to properly
store the return value from system calls. This is necessary due
to the new way we exit from signaled system calls.
You can now munmap() a part of a region. The kernel will then create
one or two new regions around the "hole" and re-map them using the same
physical pages as before.
This goes towards fixing #175, but not all the way since we don't yet
do munmap() across multiple mappings.
It is now possible to unmount file systems from the VFS via `umount`.
It works via looking up the `fsid` of the filesystem from the `Inode`'s
metatdata so I'm not sure how fragile it is. It seems to work for now
though as something to get us going.
This patch adds the mprotect() syscall to allow changing the protection
flags for memory regions. We don't do any region splitting/merging yet,
so this only works on whole mmap() regions.
Added a "crash -r" flag to verify that we crash when you attempt to
write to read-only memory. :^)
This is not perfect as it uses a lot of VM, but since the buffers are
supposed to be temporary it's not super terrible.
This could be improved by giving back the unused VM to the kernel's
RangeAllocator after finishing the buffer building.
In the userspace, this mimics the Linux pipe2() syscall;
in the kernel, the Process::sys$pipe() now always accepts
a flags argument, the no-argument pipe() syscall is now a
userspace wrapper over pipe2().
It is now possible to mount ext2 `DiskDevice` devices under Serenity on
any folder in the root filesystem. Currently any user can do this with
any permissions. There's a fair amount of assumptions made here too,
that might not be too good, but can be worked on in the future. This is
a good start to allow more dynamic operation under the OS itself.
It is also currently impossible to unmount and such, and devices will
fail to mount in Linux as the FS 'needs to be cleaned'. I'll work on
getting `umount` done ASAP to rectify this (as well as working on less
assumption-making in the mount syscall. We don't want to just be able
to mount DiskDevices!). This could probably be fixed with some `-t`
flag or something similar.
Processes can now have an icon assigned, which is essentially a 16x16 RGBA32
bitmap exposed as a shared buffer ID.
You set the icon ID by calling set_process_icon(int) and the icon ID will be
exposed through /proc/all.
To make this work, I added a mechanism for making shared buffers globally
accessible. For safety reasons, each app seals the icon buffer before making
it global.
Right now the first call to GWindow::set_icon() is what determines the
process icon. We'll probably change this in the future. :^)
This makes assertion failures generate backtraces again. Sorry to everyone
who suffered from the lack of backtraces lately. :^)
We share code with the /proc/PID/stack implementation. You can now get the
current backtrace for a Thread via Thread::backtrace(), and all the traces
for a Process via Process::backtrace().
The syscall is quite simple:
int watch_file(const char* path, int path_length);
It returns a file descriptor referring to a "InodeWatcher" object in the
kernel. It becomes readable whenever something changes about the inode.
Currently this is implemented by hooking the "metadata dirty bit" in
Inode which isn't perfect, but it's a start. :^)
The "stddbg" stream was a cute idea but we never ended up using it in
practice, so let's simplify this and implement userspace dbgprintf() on top
of a simple dbgputch() syscall instead.
This makes debugging LibC startup a little bit easier. :^)
This is very simple but already very useful. Now you're able to call to
dump_backtrace() from anywhere userspace to get a nice symbolicated
backtrace in the debugger output. :^)
Generate a special page containing the "return from signal" trampoline code
on startup and then route signalled threads to it. This avoids a page
allocation in every process that ever receives a signal.
Region now has is_user_accessible(), which informs the memory manager how
to map these pages. Previously, we were just passing a "bool user_allowed"
to various functions and I'm not at all sure that any of that was correct.
All the Region constructors are now hidden, and you must go through one of
these helpers to construct a region:
- Region::create_user_accessible(...)
- Region::create_kernel_only(...)
That ensures that we don't accidentally create a Region without specifying
user accessibility. :^)
Rolling with the theme of adding a dialog to shutdown the machine, it is
probably nice to have a way to reboot the machine without performing a full
system powerdown.
A reboot program has been added to `/bin/` as well as a corresponding
`syscall` (SC_reboot). This syscall works by attempting to pulse the 8042
keyboard controller. Note that this is NOT supported on new machines, and
should only be a fallback until we have proper ACPI support.
The implementation causes a triple fault in QEMU, which then restarts the
system. The filesystems are locked and synchronized before this occurs,
so there shouldn't be any corruption etctera.
This allows us to seal a buffer *before* anyone else has access to it
(well, ok, the creating process still does, but you can't win them all).
It also means that a SharedBuffer can be shared with multiple clients:
all you need is to have access to it to share it on again.
This makes waitpid() return when a child process is stopped via a signal.
Use this in Shell to catch stopped children and return control to the
command line. :^)
Fixes#298.
This is obviously more readable. If we ever run into a situation where
ref count churn is actually causing trouble in the future, we can deal with
it then. For now, let's keep it simple. :^)
Instead of computing the path length inside the syscall handler, let the
caller do that work. This allows us to implement to new variants of open()
and creat(), called open_with_path_length() and creat_with_path_length().
These are suitable for use with e.g StringView.
String&& is just not very practical. Also return const String& when the
returned string is a member variable. The call site is free to make a copy
if he wants, but otherwise we can avoid the retain count churn.
After reading a bunch of POSIX specs, I've learned that a file descriptor
is the number that refers to a file description, not the description itself.
So this patch renames FileDescriptor to FileDescription, and Process now has
FileDescription* file_description(int fd).
The current working directory is now stored as a custody. Likewise for a
process executable file. This unbreaks /proc/PID/fd which has not been
working since we made the filesystem bigger.
This still needs a bunch of work, for instance when renaming or removing
a file somewhere, we have to update the relevant custody links.
Right now, we allow anything inside a user to raise or lower any other process's
priority. This feels simple enough to me. Linux disallows raising, but
that's annoying in practice.
Also run it across the whole tree to get everything using the One True Style.
We don't yet run this in an automated fashion as it's a little slow, but
there is a snippet to do so in makeall.sh.
Since we transition to a new PageDirectory on exec(), we need a matching
RangeAllocator to go with the new directory. Instead of juggling this in
Process and MemoryManager, simply attach the RangeAllocator to the
PageDirectory instead.
Fixes#61.
There are now two thread lists, one for runnable threads and one for non-
runnable threads. Thread::set_state() is responsible for moving threads
between the lists.
Each thread also has a back-pointer to the list it's currently in.
These functions were doing exactly the same thing for range allocation, so
share that code in an allocate_range() helper.
Region allocation will now also fail if range allocation fails, which means
that mmap() can actually fail without falling apart. Exciting times!
This replaces the previous virtual address allocator which was basically
just "m_next_address += size;"
With this in place, virtual addresses can get reused, which cuts down on
the number of page tables created. When we implement ASLR some day, we'll
probably have to do page table deallocation, but for now page tables are
only deallocated once the process dies.
Stash away the ELFLoader used to load an executable in Process so we can use
it for symbolicating userspace addresses later on. This will make debugging
userspace programs a lot nicer. :^)
Calling systrace(pid) gives you a file descriptor with a stream of the
syscalls made by a peer process. The process must be owned by the same
UID who calls systrace(). :^)
We can't have multiple threads in the same process running in the kernel
at the same time, so let's have a per-process lock that threads have to
acquire on syscall entry/exit (and yield while blocked.)
It's basically a userspace port of the kernel's Lock class.
Added gettid() and donate() syscalls to support the timeslice donation
feature we already enjoyed in the kernel.
The scheduler now operates on threads, rather than on processes.
Each process has a main thread, and can have any number of additional
threads. The process exits when the main thread exits.
This patch doesn't actually spawn any additional threads, it merely
does all the plumbing needed to make it possible. :^)
This is accomplished using a new Alarm class and a BlockedSnoozing state.
Basically, you call Process::snooze_until(some_alarm) and then the scheduler
won't wake up the process until some_alarm.is_ringing() returns true.
Only the receive timeout is hooked up yet. You can change the timeout by
calling setsockopt(..., SOL_SOCKET, SO_RCVTIMEO, ...).
Use this mechanism to make /bin/ping report timeouts.
Finally fixed the weird flaky crashing when resizing Terminal windows.
It was because we were dispatching a signal to "current" from the scheduler.
Yet another thing I dislike about even having a "current" process while
we're in the scheduler. Not sure yet how to fix this.
Let the signal handler's kernel stack be a kmalloc() allocation for now.
Once we can do allocation of consecutive physical pages in the supervisor
memory region, we can use that for all types of kernel stacks.
It's now possible to create symbolic links! :^)
This exposed an issue in Ext2FS where we'd write uninitialized data past
the end of an inode's content. Fix this by zeroing out the tail end of
the last block in a file.
When the kernel performs a successful exec(), whatever was on the kernel
stack for that process before goes away. For this reason, we need to make
sure we don't have any stack objects holding onto kmalloc memory.
This is a monster patch that required changing a whole bunch of things.
There are performance and stability issues all over the place, but it works.
Pretty cool, I have to admit :^)
Currently you can only mmap the entire framebuffer.
Using this when starting up the WindowServer gets us yet another step
closer towards it moving into userspace. :^)
This is really cool! :^)
Apps currently refuse to start if the WindowServer isn't listening on the
socket in /wsportal. This makes sense, but I guess it would also be nice
to have some sort of "wait for server on startup" mode.
This has performance issues, and I'll work on those, but this stuff seems
to actually work and I'm very happy with that.
In order to move the WindowServer to userspace, I have to eliminate its
dependence on system call facilities. The communication channel with each
client needs to be message-based in both directions.
For now, the WindowServer process will run with high priority,
while the Finalizer process will run with low priority.
Everyone else gets to be "normal".
At the moment, priority simply determines the size of your time slices.
Since we know who's holding the lock, and we're gonna have to yield anyway,
we can just ask the scheduler to donate any remaining ticks to that process.
Instead of processes themselves getting scheduled to finish dying,
let's have a Finalizer process that wakes up whenever someone is dying.
This way we can do all kinds of lock-taking in process cleanup without
risking reentering the scheduler.
- Don't cli() in Process::do_exec() unless current is execing.
Eventually this should go away once the scheduler is less retarded
in the face of interrupts.
- Improved memory access validation for ring0 processes.
We now look at the kernel ELF header to determine if an access
is appropriate. :^) It's very hackish but also kinda neat.
- Have Process::die() put the process into a new "Dying" state where
it can still get scheduled but no signals will be dispatched.
This way we can keep executing in die() but won't get our EIP
hijacked by signal dispatch. The main problem here was that die()
wanted to take various locks.
Instead of cowboy-calling the VESA BIOS in the bootloader, find the emulator
VGA adapter by scanning the PCI bus. Then set up the desired video mode by
sending device commands.
Font now uses the same in-memory format as the font files we have on disk.
This allows us to simply mmap() the font files and not use any additional
memory for them. Very cool! :^)
Hacking on this exposed a bug in file-backed VMObjects where the first client
to instantiate a VMObject for a specific inode also got to decide its size.
Since file-backed VMObjects always have the same size as the underlying file,
this made no sense, so I removed the ability to even set a size in that case.
Also use an enum for the rather-confusing return value in dispatch_signal().
I will go through the rest of the signals and set them up with the
appropriate default dispositions at some other point.
Now the filesystem is generated on-the-fly instead of manually adding and
removing inodes as processes spawn and die.
The code is convoluted and bloated as I wrote it while sleepless. However,
it's still vastly better than the old ProcFS, so I'm committing it.
I also added /proc/PID/fd/N symlinks for each of a process's open fd's.
GObjects can now register a timer with the GEventLoop. This will eventually
cause GTimerEvents to be dispatched to the GObject.
This needed a few supporting changes in the kernel:
- The PIT now ticks 1000 times/sec.
- select() now supports an arbitrary timeout.
- gettimeofday() now returns something in the tv_usec field.
With these changes, the clock window in guitest2 finally ticks on its own.
Use this to fix a use-after-free in ~GraphicsBitmap(). We'd hit this when
the WindowServer was doing a deferred destruction of a WSWindow whose
backing store referred to a now-reaped Process.
This required a fair bit of plumbing. The CharacterDevice::close() virtual
will now be closed by ~FileDescriptor(), allowing device implementations to
do custom cleanup at that point.
One big problem remains: if the master PTY is closed before the slave PTY,
we go into crashy land.
Only raw octal modes are supported right now.
This patch also changes mode_t from 32-bit to 16-bit to match the on-disk
type used by Ext2FS.
I also ran into EPERM being errno=0 which was confusing, so I inserted an
ESUCCESS in its place.
It's really only supported in Ext2FS since SynthFS doesn't really want you
mucking around with its files. This is pretty neat though :^)
I ran into some trouble with HashMap while working on this but opted to work
around it and leave that for a separate investigation.
Instead of clients painting whenever they feel like it, we now ask that they
paint in response to a paint message.
After finishing painting, clients notify the WindowServer about the rect(s)
they painted into and then flush eventually happens, etc.
This stuff leaves us with a lot of badly named things. Need to fix that.
This means we only have to do one fill_rect() per line and the whole process
ends up being ~10% faster than before.
Also added a read_tsc() syscall to give userspace access to the TSC.
To start painting, call:
gui$get_window_backing_store()
Then finish up with:
gui$release_window_backing_store()
Process will retain the underlying GraphicsBitmap behind the scenes.
This fixes racing between the WindowServer and GUI clients.
This patch also adds a WSWindowLocker that is exactly what it sounds like.
This patch adds most of the plumbing for working file deletion in Ext2FS.
Directory entries are removed and inode link counts updated.
We don't yet update the inode or block bitmaps, I will do that separately.
Make PageDirectory retainable and have each Region co-own the PageDirectory
they're mapped into. When unmapped, Region has no associated PageDirectory.
This allows Region to automatically unmap itself when destroyed.
The system can finally idle without burning CPU. :^)
There are some issues with scheduling making the mouse cursor sloppy
and unresponsive that need to be dealt with.
This is pretty cool. :^)
GraphicsBitmaps are now mapped into both the server and the client address
space (usually at different addresses but that doesn't matter.)
Added a GUI syscall for getting a window's backing store, and another one
for invalidating a window so that the server redraws it.
Userspace programs can now open /dev/gui_events and read a stream of GUI_Event
structs one at a time.
I was stuck on a stupid problem where we'd reenter Scheduler::yield() due to
having one of the has_data_available_for_reading() implementations using locks.
This is a lot better than having them in kmalloc memory. I'm gonna need
a way to keep track of which process owns which bitmap eventually,
maybe through some sort of resource keying system. We'll see.
This synchronous approach to inodes is silly, obviously. I need to rework
it so that the in-memory CoreInode object is the canonical inode, and then
we just need a sync() that flushes pending changes to disk.
The kernel now bills processes for time spent in kernelspace and userspace
separately. The accounting is forwarded to the parent process in reap().
This makes the "time" builtin in bash work.
This way the scheduler doesn't need to plumb the exit status into the waiter.
We still plumb the waitee pid though, I don't love it but it can be fixed.
mmap() will now map uncommitted pages that get allocated and zeroed upon the
first access. I also made /proc/PID/vm show number of "committed" bytes in
each region. This is so cool! :^)
I was surprised to find that dup()'ed fds don't share the close-on-exec flag.
That means it has to be stored separately from the FileDescriptor object.