Linux creates holes in block lists for all-zero content. This is very
reasonable and we can now handle that situation as well.
Note that we're not smart enough to generate these holes ourselves yet,
but now we can at least read from such files.
The kernel sampling profiler will walk thread stacks during the timer
tick handler. Since it's not safe to trigger page faults during IRQ's,
we now avoid this by checking the page tables manually before accessing
each stack location.
We're not equipped to deal with page faults during an IRQ handler,
so add an assertion so we can immediately tell what's wrong.
This is why profiling sometimes hangs the system -- walking the stack
of the profiled thread causes a page fault and things fall apart.
If we get an -ENOENT when resolving the target because of some part, that is not
the very last part, missing, we should just return the error instead of panicking
later :^)
To test:
$ mkdir /tmp/foo/
$ mv /tmp/foo/ /tmp/bar/
Related to https://github.com/SerenityOS/serenity/issues/1253
This is apparently a special case unlike any other, so let's handle it
directly in VFS::mkdir() instead of adding an alternative code path into
VFS::resolve_path().
Fixes https://github.com/SerenityOS/serenity/issues/1253
Anonymous VM objects should never have null entries in their physical
page list. Instead, "empty" or untouched pages should refer to the
shared zero page.
Fixes#1237.
This will allow us to run the system menu as any user. It will also
enable further lockdown of the WindowServer process since it should no
longer need to pledge proc and exec. :^)
Note that this program is not finished yet.
Work towards #1231.
Process teardown is divided into two main stages: finalize and reap.
Finalization happens in the "Finalizer" kernel and runs with interrupts
enabled, allowing destructors to take locks, etc.
Reaping happens either in sys$waitid() or in the scheduler for orphans.
The more work we can do in finalization, the better, since it's fully
pre-emptible and reduces the amount of time the system runs without
interrupts enabled.
Replace Process::m_being_inspected with an inspector reference count.
This prevents an assertion from firing when inspecting the same process
in /proc from multiple processes at the same time.
It was trivially reproducible by opening multiple FileManagers.
This patch adds NotificationServer, which runs as the "notify" user
and provides an IPC API for desktop notifications.
LibGUI gains the GUI::Notification class for showing notifications.
NotificationServer is spawned on demand and will unspawn after
dimissing all visible notifications. :^)
Finally, this also comes with a small /bin/notify utility.
This was actually rather painless and straightforward. WindowServer now
runs as the "window" user. Users in the "window" group can connect to
it via the socket in /tmp/portal/window as usual.
This mechanism wasn't actually used to create any WeakPtr<Process>.
Such pointers would be pretty hard to work with anyway, due to the
multi-step destruction ritual of Process.
A 16-bit refcount is just begging for trouble right nowl.
A 32-bit refcount will be begging for trouble later down the line,
so we'll have to revisit this eventually. :^)
This patch adds a globally shared zero-filled PhysicalPage that will
be mapped into every slot of every zero-filled AnonymousVMObject until
that page is written to, achieving CoW-like zero-filled pages.
Initial testing show that this doesn't actually achieve any sharing yet
but it seems like a good design regardless, since it may reduce the
number of page faults taken by programs.
If you look at the refcount of MM.shared_zero_page() it will have quite
a high refcount, but that's just because everything maps it everywhere.
If you want to see the "real" refcount, you can build with the
MAP_SHARED_ZERO_PAGE_LAZILY flag, and we'll defer mapping of the shared
zero page until the first NP read fault.
I've left this behavior behind a flag for future testing of this code.
This was only used by HashTable::dump() which I used when doing the
first HashTable implementation. Removing this allows us to also remove
most includes of <AK/kstdio.h>.
Warnings fixed:
* SC2086: Double quote to prevent globbing and word splitting.
* SC2006: Use $(...) notation instead of legacy backticked `...`
* SC2039: In POSIX sh, echo flags are undefined
* SC2209: Use var=$(command) to assign output (or quote to assign string)
* SC2164: Use 'cd ... || exit' or 'cd ... || return' in case cd fails
* SC2166: Prefer [ p ] && [ q ] as [ p -a q ] is not well defined.
* SC2034: i appears unused. Verify use (or export if used externally)
* SC2046: Quote this to prevent word splitting.
* SC2236: Use -z instead of ! -n.
There are still a lot of warnings in Kernel/run about:
- SC2086: Double quote to prevent globbing and word splitting.
However, splitting on space is intentional in this case, and not trivial to
change. Therefore ignore the warning for now - but we should fix this in
the future.
This server listens on port 8000 and serves HTML files from /www.
It's very simple and quite naive, but I think we can start here and
build our way to something pretty neat.
Work towards #792.
We can now participate in the TCP connection closing handshake. :^)
This implementation is definitely not complete and needs to handle a
bunch of other cases. But it's a huge improvement over not being able
to close connections at all.
Note that we hold on to pending-close sockets indefinitely, until they
are moved into the Closed state. This should also have a timeout but
that's still a FIXME. :^)
Fixes#428.
It doesn't look healthy to create raw references into an array before
a temporary unlock. In fact, that temporary unlock looks generally
unhealthy, but it's a different problem.
Calling shutdown prevents further reads and/or writes on a socket.
We should do a few more things based on the type of socket, but this
initial implementation just puts the basic mechanism in place.
Work towards #428.
The idea behind WeakPtr<NetworkAdapter> was to support hot-pluggable
network adapters, but on closer thought, that's super impractical so
let's not go down that road.
If there's not enough space in the output buffer for the whole sockaddr
we now simply truncate the address instead of returning EINVAL.
This patch also makes getpeername() actually return the peer address
rather than the local address.. :^)
According to POSIX, waitid() should fill si_signo and si_pid members
with zeroes if there are no children that have already changed their
state by the time of the call. Let's just fill the whole structure
with zeroes to avoid leaking kernel memory.
sys$waitid() takes an explicit description of whether it's waiting for a single
process with the given PID, all of the children, a group, etc., and returns its
info as a siginfo_t.
It also doesn't automatically imply WEXITED, which clears up the confusion in
the kernel.
We add this feature together with the VMWareBackdoor class.
VMWareBackdoor class is responsible for enabling the vmmouse, and then
controlling it from the PS2 mouse IRQ handler.
On my system (Void Linux) the root user has a default umask of 0077,
causing files and directories in the disk image to have zero group and
world permissions.
This patch introduces sys$perf_event() with two event types:
- PERF_EVENT_MALLOC
- PERF_EVENT_FREE
After the first call to sys$perf_event(), a process will begin keeping
these events in a buffer. When the process dies, that buffer will be
written out to "perfcore" in the current directory unless that filename
is already taken.
This is probably not the best way to do this, but it's a start and will
make it possible to start doing memory allocation profiling. :^)
Before putting itself back on the wait queue, the finalizer task will
now check if there's more work to do, and if so, do it first. :^)
This patch also puts a bunch of process/thread debug logging behind
PROCESS_DEBUG and THREAD_DEBUG since it was unbearable to debug this
stuff with all the spam.
Since we scrub both kmalloc() and kfree() with predictable values, we
can log a helpful message when hitting a crash that looks like it might
be a dereference of such scrubbed data.
DoubleBuffer is the internal buffer for things like TTY, FIFO, sockets,
etc. If you try to write more than the buffer can hold, it will now
do a short write instead of asserting.
This is likely to expose issues at higher levels, and we'll have to
deal with them as they are discovered.
Previously this API would return an InodeIdentifier, which meant that
there was a race in path resolution where an inode could be unlinked
in between finding the InodeIdentifier for a path component, and
actually resolving that to an Inode object.
Attaching a test that would quickly trip an assertion before.
Test: Kernel/path-resolution-race.cpp
Previously we were only checking that each of the virtual pages in the
specified range were valid.
This made it possible to pass in negative buffer sizes to some syscalls
as long as (address) and (address+size) were on the same page.
There's no sense in allowing arbitrarily huge resolutions. Instead, we
now cap the screen size at 4K DCI resolution and will reject attempts
to go bigger with EINVAL.
Previously it was not possible for this function to fail. You could
exploit this by triggering the creation of a VMObject whose physical
memory range would wrap around the 32-bit limit.
It was quite easy to map kernel memory into userspace and read/write
whatever you wanted in it.
Test: Kernel/bxvga-mmap-kernel-into-userspace.cpp
If the waitee process is dead, we don't need to inspect the thread.
This fixes an issue with sys$waitpid() failing before reap() since
dead processes will have no remaining threads alive.
There was a race window in a bunch of syscalls between calling
Thread::from_tid() and checking if the found thread was in the same
process as the calling thread.
If the found thread object was destroyed at that point, there was a
use-after-free that could be exploited by filling the kernel heap with
something that looked like a thread object.
While I was bringing up multitasking, I put the current PID in the SS2
(ring 2 stack segment) slot of the TSS. This was so I could see which
PID was currently running when just inspecting the CPU state.
Memory validation is used to verify that user syscalls are allowed to
access a given memory range. Ring 0 threads never make syscalls, and
so will never end up in validation anyway.
The reason we were allowing kmalloc memory accesses is because kernel
thread stacks used to be allocated in kmalloc memory. Since that's no
longer the case, we can stop making exceptions for kmalloc in the
validation code.
Move timeout management to the ReadBlocker and WriteBlocker classes.
Also get rid of the specialized ReceiveBlocker since it no longer does
anything that ReadBlocker can't do.
Vector::ensure_capacity() makes sure the underlying vector buffer can
contain all the data, but it doesn't update the Vector::size().
As a result, writev() would simply collect all the buffers to write,
and then do nothing.
It was possible to read uninitialized kernel memory via getsockname().
Of course, kmalloc() is a good boy and scrubs new allocations with 0xBB
so all you got was a bunch of 0xBB.
This implementation uses the new helper method of Bitmap called
find_longest_range_of_unset_bits. This method looks for the biggest
range of contiguous bits unset in the bitmap and returns the start of
the range back to the caller.
We should use dbg() instead of dbgprintf() as much as possible to
protect ourselves against format string bugs. Here's a bunch of
conversions in Ext2FS.
Move all the fork-specific inheritance logic to sys$fork(), and all the
stuff for setting up stdio for non-fork ring 3 processes moves to
Process::create_user_process().
Also: we were setting up the PGID, SID and umask twice. Also the code
for copying the open file descriptors was overly complicated. Now it's
just a simple Vector copy assignment. :^)
Since these are not part of the system call convention, we don't care
what userspace had in there. Might as well scrub it before entering
the kernel.
I would scrub EBP too, but that breaks the comfy kernel-thru-userspace
stack traces we currently get. It can be done with some effort.
This changes copyright holder to myself for the source code files that I've
created or have (almost) completely rewritten. Not included are the files
that were significantly changed by others even though it was me who originally
created them (think HtmlView), or the many other files I've contributed code to.
System components that need an IRQ handling are now inheriting the
InterruptHandler class.
In addition to that, the initialization process of PATAChannel was
changed to fit the changes.
PATAChannel, E1000NetworkAdapter and RTL8139NetworkAdapter are now
inheriting from PCI::Device instead of InterruptHandler directly.
The support is very basic - Each component that needs to handle IRQs
inherits from InterruptHandler class. When the InterruptHandler
constructor is called it registers itself in a SharedInterruptHandler.
When an IRQ is fired, the SharedInterruptHandler is invoked, then it
iterates through a list of the registered InterruptHandlers.
Also, InterruptEnabler class was created to provide a way to enable IRQ
handling temporarily, similar to InterruptDisabler (in CPU.h, which does
the opposite).
In addition to that a PCI::Device class has been added, that inherits
from InterruptHandler.
A process has one of three veil states:
- None: unveil() has never been called.
- Dropped: unveil() has been called, and can be called again.
- Locked: unveil() has been called, and cannot be called again.
When using dbg() in the kernel, the output is automatically prefixed
with [Process(PID:TID)]. This makes it a lot easier to understand which
thread is generating the output.
This patch also cleans up some common logging messages and removes the
now-unnecessary "dbg() << *current << ..." pattern.
Sergey suggested that having a non-zero O_RDONLY would make some things
less confusing, and it seems like he's right about that.
We can now easily check read/write permissions separately instead of
dancing around with the bits.
This patch also fixes unveil() validation for O_RDWR which previously
forgot to check for "r" permission.
We don't need to have this method anymore. It was a hack that was used
in many components in the system but currently we use better methods to
create virtual memory mappings. To prevent any further use of this
method it's best to just remove it completely.
Also, the APIC code is disabled for now since it doesn't help booting
the system, and is broken since it relies on identity mapping to exist
in the first 1MB. Any call to the APIC code will result in assertion
failed.
In addition to that, the name of the method which is responsible to
create an identity mapping between 1MB to 2MB was changed, to be more
precise about its purpose.
The problem was mostly in the initialization code, since in that stage
the parser assumed that there is an identity mapping in the first 1MB of
the address space. Now during initialization the parser will create the
correct mappings to locate the required data.
This syscall is a complement to pledge() and adds the same sort of
incremental relinquishing of capabilities for filesystem access.
The first call to unveil() will "drop a veil" on the process, and from
now on, only unveiled parts of the filesystem are visible to it.
Each call to unveil() specifies a path to either a directory or a file
along with permissions for that path. The permissions are a combination
of the following:
- r: Read access (like the "rpath" promise)
- w: Write access (like the "wpath" promise)
- x: Execute access
- c: Create/remove access (like the "cpath" promise)
Attempts to open a path that has not been unveiled with fail with
ENOENT. If the unveiled path lacks sufficient permissions, it will fail
with EACCES.
Like pledge(), subsequent calls to unveil() with the same path can only
remove permissions, not add them.
Once you call unveil(nullptr, nullptr), the veil is locked, and it's no
longer possible to unveil any more paths for the process, ever.
This concept comes from OpenBSD, and their implementation does various
things differently, I'm sure. This is just a first implementation for
SerenityOS, and we'll keep improving on it as we go. :^)
Background: DoubleBuffer is a handy buffer class in the kernel that
allows you to keep writing to it from the "outside" while the "inside"
reads from it. It's used for things like LocalSocket and TTY's.
Internally, it has a read buffer and a write buffer, but the two will
swap places when the read buffer is exhausted (by reading from it.)
Before this patch, it was internally implemented as two Vector<u8>
that we would swap between when the reader side had exhausted the data
in the read buffer. Now instead we preallocate a large KBuffer (64KB*2)
on DoubleBuffer construction and use that throughout its lifetime.
This removes all the kmalloc heap traffic caused by DoubleBuffers :^)
This broke with the >3GB paging overhaul. It's no longer possible to
write directly to physical addresses below the 8MB mark. Physical pages
need to be mapped into kernel VM by using a Region.
Fixes#1099.
There is no real "read protection" on x86, so we have no choice but to
map write-only pages simply as "present & read/write".
If we get a read page fault in a non-readable region, that's still a
correctness issue, so we crash the process. It's by no means a complete
protection against invalid reads, since it's trivial to fool the kernel
by first causing a write fault in the same region.
uintptr_t is 32-bit or 64-bit depending on the target platform.
This will help us write pointer size agnostic code so that when the day
comes that we want to do a 64-bit port, we'll be in better shape.
The userspace locks are very aggressively calling sys$gettid() to find
out which thread ID they have.
Since syscalls are quite heavy, this can get very expensive for some
programs. This patch adds a fast-path for sys$gettid(), which makes it
skip all of the usual syscall validation and just return the thread ID
right away.
This cuts Kernel/Process.cpp compile time by ~18%, from ~29 to ~24 sec.
Before this, we would end up in memcpy() churn hell when a program was
doing repeated write() calls to a file in /tmp.
An even better solution will be to only grow the VM allocation of the
underlying buffer and keep using the same physical pages. This would
eliminate all the memcpy() work.
I've benchmarked this using g++ to compile Kernel/Process.cpp.
With these changes, compilation goes from ~35 sec to ~31 sec. :^)
Instead of restoring CR3 to the current process's paging scope when a
ProcessPagingScope goes out of scope, we now restore exactly whatever
the CR3 value was when we created the ProcessPagingScope.
This fixes breakage in situations where a process ends up with nested
ProcessPagingScopes. This was making profiling very fragile, and with
this change it's now possible to profile g++! :^)