This patch adds sys$msyscall() which is loosely based on an OpenBSD
mechanism for preventing syscalls from non-blessed memory regions.
It works similarly to pledge and unveil, you can call it as many
times as you like, and when you're finished, you call it with a null
pointer and it will stop accepting new regions from then on.
If a syscall later happens and doesn't originate from one of the
previously blessed regions, the kernel will simply crash the process.
We had two ways of creating a new Ext2FS inode. Either they were empty,
or they started with some pre-allocated size.
In practice, the pre-sizing code path was only used for new directories
and it didn't actually improve anything as far as I can tell.
This patch simplifies inode creation by simply always allocating empty
inodes. Block allocation and block list generation now always happens
on the same code path.
This prevented from dmidecode to get the right PAGE_SIZE when using the
sysconf syscall.
I found this bug, when I tried to figure why dmidecode fails to mmap
/dev/mem when I passed --no-procfs, and the conclusion is that it tried
to mmap unaligned physical address 0xf5ae0 (SMBIOS data), and that was
caused by a wrong value returned after using the sysconf syscall to get
the plaform page size, therefore, allowing to send an unaligned address
to the mmap syscall.
Because it was 'static const' and also shared with userland programs,
the default keymap was defined in multiple places. This commit should
save several kilobytes! :^)
The enumeration code is already enumerating all buses, recursively
enumerating bridges (which are buses) makes devices on bridges being
enumerated multiple times. Also, the PCI code was incorrectly mixing up
terminology; let's settle down on bus, device and function because ever
since PCIe came along "slots" isn't really a thing anymore.
We had an exception that allowed SOL_SOCKET + SO_PEERCRED on local
socket to support LibIPC's PID exchange mechanism. This is no longer
needed so let's just remove the exception.
It's useful for programs to change their thread names to say something
interesting about what they are working on. Let's not require "thread"
for this since single-threaded programs may want to do it without
pledging "thread".
This used to be in Kernel/, next to the build-root-filesystem.sh script,
which was then moved to Meta/ during the transition to CMake but has the
working directory set to Build/, effectively expecting it there - which
seems silly.
TL;DR: Very confusing. Use an explicit path relative to SERENITY_ROOT
instead and update the .gitignore files.
This replaces the current disk detection and disk access code with
code based on https://wiki.osdev.org/IDE
This allows the system to boot on VirtualBox with serial debugging
enabled and VMWare Player.
I believe there were several issues with the current code:
- It didn't utilise the last 8 bits of the LBA in 24-bit mode.
- {read,write}_sectors_with_dma was not setting the obsolete bits,
which according to OSdev wiki aren't used but should be set.
- The PIO and DMA methods were using slightly different copy
and pasted access code, which is now put into a single
function called "ata_access"
- PIO mode doesn't work. This doesn't fix that and should
be looked into in the future.
- The detection code was not checking for ATA/ATAPI.
- The detection code accidentally had cyls/heads/spt as 8-bit,
when they're 16-bit.
- The capabilities of the device were not considered. This is now
brought in and is currently used to check if the device supports
LBA. If not, use CHS.
This prevents sys$mmap() and sys$mprotect() from creating executable
memory mappings in pledged programs that don't have this promise.
Note that the dynamic loader runs before pledging happens, so it's
unaffected by this.
The random address proposals were not checked with the size so it was
increasely likely to try to allocate outside of available space with
larger and larger sizes.
Now they will be ignored instead of triggering a Kernel assertion
failure.
This is a continuation of: c8e7baf4b8
This adds another layer of defense against introducing new code into a
running process. The only permitted way of doing so is by mmapping an
open file with PROT_READ | PROT_EXEC.
This does make any future JIT implementations slightly more complicated
but I think it's a worthwhile trade-off at this point. :^)
This patch adds enforcement of two new rules:
- Memory that was previously writable cannot become executable
- Memory that was previously executable cannot become writable
Unfortunately we have to make an exception for text relocations in the
dynamic loader. Since those necessitate writing into a private copy
of library code, we allow programs to transition from RW to RX under
very specific conditions. See the implementation of sys$mprotect()'s
should_make_executable_exception_for_dynamic_loader() for details.
When mounting an Ext2FS, a block device source is required. All other
filesystem types are unaffected, as most of them ignore the source file
descriptor anyway.
Fixes#5153.
`allocate_randomized` assert an already sanitized size but `mmap` were
just forwarding whatever the process asked so it was possible to
trigger a kernel panic from an unpriviliged process just by asking some
randomly placed memory and a size non alligned with the page size.
This fixes this issue by rounding up to the next page size before
calling `allocate_randomized`.
Fixes#5149
This allows us to get rid of the thread lists in SchedulerData.
Also, instead of iterating over all threads to find a thread by id,
just use a lookup table. In the rare case of having to iterate over
all threads, just iterate the lookup table.
This can be used to request random VM placement instead of the highly
predictable regular mmap(nullptr, ...) VM allocation strategy.
It will soon be used to implement ASLR in the dynamic loader. :^)
This broke with the change that gave each process a list of its own
threads. Since threads are removed slightly earlier from that list
during process teardown, we're not able to use it for generating
coredump backtraces. Fortunately we have the "threads for coredump"
list for just this purpose. :^)
This adds an optional argument to get_good_random_bytes that can be
used to only return randomness if it doesn't have to block.
Also add a SpinLock around using FortunaPRNG.
Fixes#5132
Since each Process now has its own list of threads, we don't need
to treat colonel any different anymore. This also means that it
reports all kernel threads, not just the idle threads.
In ab14b0ac64, mmap was changed so that
the size of the region is aligned before it was passed to the device
driver. The previous logic would assert when the framebuffer size was
not a multiple of the page size. I've also taken the liberty of
returning an error on mmap failure rather than asserting.
We need to make sure other processors can grab the MM lock while we
wait, so release it when we might block. Reading the page from
disk may also block, so release it during that time as well.
Rather than walking all Thread instances and putting them into
a vector to be sorted by priority, queue them into priority sorted
linked lists as soon as they become ready to be executed.
Attempt to wake idle processors to get threads to be scheduled more quickly.
We don't want to wait until the next timer tick if we have processors that
aren't doing anything.
This eliminates the window between calling Processor::current and
the member function where a thread could be moved to another
processor. This is generally not as big of a concern as with
Processor::current_thread, but also slightly more light weight.
Change Thread::current to be a static function and read using the fs
register, which eliminates a window between Processor::current()
returning and calling a function on it, which can trigger preemption
and a move to a different processor, which then causes operating
on the wrong object.
We also need to store m_in_critical in the Thread upon switching,
and we need to restore it. This solves a problem where threads
moving between different processors could end up with an unexpected
value.
This allows us to determine what the previous mode (user or kernel)
was, e.g. in the timer interrupt. This is used e.g. to determine
whether a signal handler should be set up.
Fixes#5096
If we find ourselves with a user-accessible, non-shared Region backed by
a SharedInodeVMObject, that's pretty bad news, so let's just panic the
kernel instead of getting abused.
There might be a better place for this kind of check, so I've added a
FIXME about putting more thought into that.
This was exploitable since the shared flag determines whether inode
permission checks are applied in sys$mprotect().
The bug was pretty hard to spot due to default arguments being used
instead. This patch removes the default arguments to make explicit
at each call site what's being done.
When passing nullptr for either promises or execpromises to pledge(),
the expected behaviour is to not change their current value at all - we
were accidentally resetting them to 0, effectively dropping previously
pledge()'d promises.
We now move the execpromises state into the regular promises, and clear
the execpromises state.
Also make sure to duplicate the promise state on fork.
This fixes an issue where "su" would launch a shell which immediately
crashed due to not having pledged "stdio".
Let's force callers to provide a VM range when allocating a region.
This makes ENOMEM error handling more visible and removes implicit
VM allocation which felt a bit magical.
This tells the kernel that the process wants to use pledge, but without
pledging anything - effectively restricting it to syscalls that don't
require a certain promise. This is part of OpenBSD's pledge() as well,
which served as basis for Serenity's.
We were enabling interrupts too early, before the first context switch to
a thread was complete. This could then trigger another context switch
within the context switch, which lead to a crash.
There is a window between acquiring/releasing the lock with the atomic
variables and subsequently waiting or waking threads. With a single
processor this window was closed by using a critical section, but
this doesn't prevent other processors from running these code paths.
To solve this, set a flag in the WaitQueue while holding m_lock which
determines if threads should be blocked at all.
Instead of letting each File subclass do range allocation in their
mmap() override, do it up front in sys$mmap().
This makes us honor alignment requests for file-backed memory mappings
and simplifies the code somwhat.
This was done with the help of several scripts, I dump them here to
easily find them later:
awk '/#ifdef/ { print "#cmakedefine01 "$2 }' AK/Debug.h.in
for debug_macro in $(awk '/#ifdef/ { print $2 }' AK/Debug.h.in)
do
find . \( -name '*.cpp' -o -name '*.h' -o -name '*.in' \) -not -path './Toolchain/*' -not -path './Build/*' -exec sed -i -E 's/#ifdef '$debug_macro'/#if '$debug_macro'/' {} \;
done
# Remember to remove WRAPPER_GERNERATOR_DEBUG from the list.
awk '/#cmake/ { print "set("$2" ON)" }' AK/Debug.h.in
Booting old computers without RDRAND/RDSEED and without a disk makes
the system severely starved for entropy. Uses interrupts as a source
to side-step that issue.
Also warn whenever the system is starved of entropy, because that's
a non-obvious failure mode.
For some reason we were keeping the bits 04777 in file modes. That
doesn't seem right and I can't think of a reason why the set-uid bit
should be allowed to slip through.
(mode & S_IFDIR) is not enough to check if "mode" is a directory,
we have to check all the bits in the S_IFMT mask.
Use the is_directory() helper to fix this bug.
Since devices are enumerable and can compute their own name inside the
/dev hierarchy, there is no need to try and parse "root=/dev/xxx" by
hand.
This also makes any block device a candidate for the boot device, which
now includes ramdisk devices, so SerenityOS can now boot diskless too.
The disk image generated for QEMU is suitable, as long as it fits in
memory with room to spare for the rest of the system.
Besides removing the monolithic DevFSDeviceInode::determine_name()
method, being able to determine a device's name inside the /dev
hierarchy outside of DevFS has its uses.
The kernel ignored the first 8 MiB of RAM while parsing the memory map
because the kmalloc heaps and the super physical pages lived here. Move
all that stuff inside the .bss segment so that those memory regions are
accounted for, otherwise we risk overwriting boot modules placed next
to the kernel.
This was just an alias for "unix" that I added early on back when there
was some belief that we might be compatible with OpenBSD. We're clearly
never going to be compatible with their pledges so just drop the alias.
It was possible to signal a process while it was paging in an inode
backed VM object. This would cause the inode read to EINTR, and the
page fault handler would assert.
Solve this by simply not unblocking threads due to signals if they are
currently busy handling a page fault. This is probably not the best way
to solve this issue, so I've added a FIXME to that effect.
..and allow implicit creation of KResult and KResultOr from ErrnoCode.
This means that kernel functions that return those types can finally
do "return EINVAL;" and it will just work.
There's a handful of functions that still deal with signed integers
that should be converted to return KResults.
This way, if something goes wrong, we get to keep the actual error.
Also, KResults are nodiscard, so we have to deal with that in Ext2FS
instead of just silently ignoring I/O errors(!)
Similar to LibC storing an assertion message before aborting, process
death by pledge violation now sets a "pledge_violation" key with the
respective pledge name as value in its coredump metadata, which the
CrashReporter will then show.
Path resolution will now refuse to follow symlinks in some cases where
you don't own the symlink, or when it's in a sticky world-writable
directory and the link has a different owner than the directory.
The point of all this is to prevent classic TOCTOU bugs in /tmp etc.
Fixes#4934
Both ESP and GDTR are left undefined by the Multiboot specification and
OS images must not rely on these values to be valid. Fix the undefined
behaviors so that booting with PXELINUX does not triple-fault the CPU.
This file was useful for debugging a long time ago, but has bitrotted
at this point. Instead of updating it, let's just remove it since
nothing is using it.
This adds support for FUTEX_WAKE_OP, FUTEX_WAIT_BITSET, FUTEX_WAKE_BITSET,
FUTEX_REQUEUE, and FUTEX_CMP_REQUEUE, as well well as global and private
futex and absolute/relative timeouts against the appropriate clock. This
also changes the implementation so that kernel resources are only used when
a thread is blocked on a futex.
Global futexes are implemented as offsets in VMObjects, so that different
processes can share a futex against the same VMObject despite potentially
being mapped at different virtual addresses.
This sort-of matches what some other systems do and seems like a
generally sane thing to do instead of allowing programs to spawn a
child with a nearly full stack.
When freeing an inode, we were checking if it's a directory *after*
wiping the inode metadata. This caused us to forget updating the block
group descriptor with the new directory count.
We forgot to remove the automatic SMAP disablers after fixing up all
this code to not access userspace memory directly. Let's lock things
down at last. :^)
All users of this mechanism have been switched to anonymous files and
passing file descriptors with sendfd()/recvfd().
Shbufs got us where we are today, but it's time we say good-bye to them
and welcome a much more idiomatic replacement. :^)
The priority boosting mechanism has been broken for a very long time.
Let's remove it from the codebase and we can bring it back the day
someone feels like implementing it in a working way. :^)
These changes are arbitrarily divided into multiple commits to make it
easier to find potentially introduced bugs with git bisect.
This commit touches some dbg() calls which are enclosed in macros. This
should be fine because with the new constexpr stuff, we ensure that the
stuff actually compiles.
Killing remaining threads already happens in Process::die(), but
coredumps are only written in Process::finalize(). We need to keep a
reference to each of those threads to prevent them from being destructed
between those two functions, otherwise coredumps will only ever contain
information about the last remaining thread.
Fixes the underlying problem of #4778, though the UI will need
refinements to not show every thread's backtrace mashed together.
This is in preparation of adding (much) more process information to
coredumps. As we can only have one null-terminated char[] of arbitrary
length in each struct it's now a single JSON blob, which is a great fit:
easily extensible in the future and allows for key/value pairs and even
nested objects, which will be used e.g. for the process environment, for
example.
This patch adds a new AnonymousFile class which is a File backed by
an AnonymousVMObject that can only be mmap'ed and nothing else, really.
I'm hoping that this can become a replacement for shbufs. :^)
Problem:
- Many constructors are defined as `{}` rather than using the ` =
default` compiler-provided constructor.
- Some types provide an implicit conversion operator from `nullptr_t`
instead of requiring the caller to default construct. This violates
the C++ Core Guidelines suggestion to declare single-argument
constructors explicit
(https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#c46-by-default-declare-single-argument-constructors-explicit).
Solution:
- Change default constructors to use the compiler-provided default
constructor.
- Remove implicit conversion operators from `nullptr_t` and change
usage to enforce type consistency without conversion.
The vast majority of programs don't ever need to use sys$ptrace(),
and it seems like a high-value system call to prevent a compromised
process from using.
This patch moves sys$ptrace() from the "proc" promise to its own,
new "ptrace" promise and updates the affected apps.
Problem:
- The implementation of `find` is coupled to the implementation of
`SinglyLinkedList`.
Solution:
- Decouple the implementation of `find` from the class by using a
generic `find` algorithm.
Problem:
- The implementation of `find` is coupled to the implementation of
`DoublyLinkedList`.
- `append` and `prepend` are implemented multiple times so that
r-value references can be moved from into the new node. This is
probably not called very often because a pr-value or x-value needs
to be used here.
Solution:
- Decouple the implementation of `find` from the class by using a
generic `find` algorithm.
- Make `append` and `prepend` be function templates so that they can
have binding references which can be forwarded.
This patch merges the profiling functionality in the kernel with the
performance events mechanism. A profiler sample is now just another
perf event, rather than a dedicated thing.
Since perf events were already per-process, this now makes profiling
per-process as well.
Processes with perf events would already write out a perfcore.PID file
to the current directory on death, but since we may want to profile
a process and then let it continue running, recorded perf events can
now be accessed at any time via /proc/PID/perf_events.
This patch also adds information about process memory regions to the
perfcore JSON format. This removes the need to supply a core dump to
the Profiler app for symbolication, and so the "profiler coredump"
mechanism is removed entirely.
There's still a hard limit of 4MB worth of perf events per process,
so this is by no means a perfect final design, but it's a nice step
forward for both simplicity and stability.
Fixes#4848Fixes#4849
When loading non position-independent programs, we now take care not to
load the dynamic loader at an address that collides with the location
the main program wants to load at.
Fixes#4847.
This will enable us to take the desired load address of non-position
independent programs into account when randomizing the load address
of the dynamic loader.
Trying to pass these onto the Terminal while handling an IRQ is a recipe
for disaster. Use Processor::deferred_call_queue to create an ad-hoc
"second half" of the interrupt handler.
Fixes#4889
SystemServer now creates the /tmp/coredump and /tmp/profiler_coredumps
directories at startup, ensuring that they are owned by root, and with
basic 0755 permissions.
The kernel will also now refuse to put core dumps in a directory that
doesn't fulfill the following criteria:
- Owned by 0:0
- Directory with sticky bit not set
- 0755 permissions
Fixes#4435Fixes#4850
We were not handling sticky parents properly in sys$rmdir(). Child
directories of a sticky parent should not be rmdir'able by just anyone.
Only the owner and root.
Fixes#4875.
Before this change, truncating an Ext2FS inode to a larger size than it
was before would give you uninitialized on-disk data.
Fix this by zeroing out all the new space when doing an inode resize.
This is pretty naively implemented via Inode::write_bytes() and there's
lots of room for cleverness here in the future.
These changes are arbitrarily divided into multiple commits to make it
easier to find potentially introduced bugs with git bisect.Everything:
The modifications in this commit were automatically made using the
following command:
find . -name '*.cpp' -exec sed -i -E 's/dbg\(\) << ("[^"{]*");/dbgln\(\1\);/' {} \;