Sockets remember their last error code in the SO_ERROR field, so we need
to take special care to remember this when returning an error.
This patch adds a SOCKET_TRY() that works like TRY() but also calls
set_so_error() on the failure path.
There's probably a lot more code that should be using this, but that's
outside the scope of this patch.
We don't really have anywhere to propagate the error in NetworkTask at
the moment, since it runs in its own kernel thread and has no direct
userspace caller.
A couple of things were changed:
1. Semantic changes - PCI segments are now called PCI domains, to better
match what they are really. It's also the name that Linux gave, and it
seems that Wikipedia also uses this name.
We also remove PCI::ChangeableAddress, because it was used in the past
but now it's no longer being used.
2. There are no WindowedMMIOAccess or MMIOAccess classes anymore, as
they made a bunch of unnecessary complexity. Instead, Windowed access is
removed entirely (this was tested, but never was benchmarked), so we are
left with IO access and memory access options. The memory access option
is essentially mapping the PCI bus (from the chosen PCI domain), to
virtual memory as-is. This means that unless needed, at any time, there
is only one PCI bus being mapped, and this is changed if access to
another PCI bus in the same PCI domain is needed. For now, we don't
support mapping of different PCI buses from different PCI domains at the
same time, because basically it's still a non-issue for most machines
out there.
2. OOM-safety is increased, especially when constructing the Access
object. It means that we pre-allocating any needed resources, and we try
to find PCI domains (if requested to initialize memory access) after we
attempt to construct the Access object, so it's possible to fail at this
point "gracefully".
3. All PCI API functions are now separated into a different header file,
which means only "clients" of the PCI subsystem API will need to include
that header file.
4. Functional changes - we only allow now to enumerate the bus after
a hardware scan. This means that the old method "enumerate_hardware"
is removed, so, when initializing an Access object, the initializing
function must call rescan on it to force it to find devices. This makes
it possible to fail rescan, and also to defer it after construction from
both OOM-safety terms and hotplug capabilities.
This expands the reach of error propagation greatly throughout the
kernel. Sadly, it also exposes the fact that we're allocating (and
doing other fallible things) in constructors all over the place.
This patch doesn't attempt to address that of course. That's work for
our future selves.
This commit moves the KResult and KResultOr objects to Kernel/API to
signify that they may now be freely used by userspace code at points
where a syscall-related error result is to be expected. It also exposes
KResult and KResultOr to the global namespace to make it nicer to use
for userspace code.
NetworkOrdered is a non trivial type, and it's undefined behavior to
cast a random pointer to it and then pretend it's that type.
Instead just call AK::convert_between_host_and_network_endian on the
individual u16*. This suppresses static analysis warnings.
I don't think there was a "bug" in the previous code, it worked, but
it was very brittle.
Prior to this change, both uid_t and gid_t were typedef'ed to `u32`.
This made it easy to use them interchangeably. Let's not allow that.
This patch adds UserID and GroupID using the AK::DistinctNumeric
mechanism we've already been employing for pid_t/ProcessID.
The `m_should_block` member variable that many of the Thread::Blocker
subclasses had was really only used to carry state from the constructor
to the immediate-unblock-without-blocking escape hatch.
This patch refactors the blockers so that we don't need to hold on
to this flag after setup_blocker(), and instead the return value from
setup_blocker() is the authority on whether the unblock conditions
are already met.
This was previously used after construction to check for early unblock
conditions that couldn't be communicated from the constructor.
Now that we've moved early unblock checks from the constructor into
setup_blocker(), we don't need should_block() anymore.
Instead of registering with blocker sets and whatnot in the various
Blocker subclass constructors, this patch moves such initialization
to a separate setup_blocker() virtual.
setup_blocker() returns false if there's no need to actually block
the thread. This allows us to bail earlier in Thread::block().
Namely, will_unblock_immediately_without_blocking(Reason).
This virtual function is called on a blocker *before any block occurs*,
if it turns out that we don't need to block the thread after all.
This can happens for one of two reasons:
- UnblockImmediatelyReason::UnblockConditionAlreadyMet
We don't need to block the thread because the condition for
unblocking it is already met.
- UnblockImmediatelyReason::TimeoutInThePast
We don't need to block the thread because a timeout was specified
and that timeout is already in the past.
This patch does not introduce any behavior changes, it's only meant to
clarify this part of the blocking logic.
Now that the old PCI::Device was removed, we can complete the PCI
changes by making the PCI::DeviceController to be named PCI::Device.
Really the entire purpose and the distinction between the two was about
interrupts, but since this is no longer a problem, just rename it to
simplify things further.
I created this class a long time ago just to be able to quickly make a
PCI device to also represent an interrupt handler (because PCI devices
have this capability for most devices).
Then after a while I introduced the PCI::DeviceController, which is
really almost the same thing (a PCI device class that has Address member
in it), but is not tied to interrupts so it can have no interrupts, or
spawn interrupt handlers however it wants to seems fit.
However I decided it's time to say goodbye for this class for
a couple of reasons:
1. It made a whole bunch of weird patterns where you had a PCI::Device
and a PCI::DeviceController being used in the topic of implementation,
where originally, they meant to be used mutually exclusively (you
can't and really don't want to use both).
2. We can really make all the classes that inherit from PCI::Device
to inherit from IRQHandler at this point. Later on, when we have MSI
interrupts support, we can go further and untie things even more.
3. It makes it possible to simplify the VirtIO implementation to a great
extent. While this commit almost doesn't change it, future changes
can untangle some complexity in the VirtIO code.
For UHCIController, E1000NetworkAdapter, NE2000NetworkAdapter,
RTL8139NetworkAdapter, RTL8168NetworkAdapter, E1000ENetworkAdapter we
are simply making them to inherit the IRQHandler. This makes some sense,
because the first 3 devices will never support anything besides IRQs.
For the last 2, they might have MSI support, so when we start to utilize
those, we might need to untie these classes from IRQHandler and spawn
IRQHandler(s) or MSIHandler(s) as needed.
The VirtIODevice class is also a case where we currently need to use
both PCI::DeviceController and IRQHandler classes as parents, but it
could also be untied from the latter.
Namely, unblock_all_blockers_whose_conditions_are_met().
The old name made it sound like things were getting unblocked no matter
what, but that's not actually the case.
What this actually does is iterate through the set of blockers,
unblocking those whose conditions are met. So give it a (very) verbose
name that errs on the side of descriptiveness.
This has several benefits:
1) We no longer just blindly derefence a null pointer in various places
2) We will get nicer runtime error messages if the current process does
turn out to be null in the call location
3) GCC no longer complains about possible nullptr dereferences when
compiling without KUBSAN
Previously when trying to debug the system's routing, the ARP
information would clutter the output and make it difficult to focus on
the routing decisions. It would be better to specify these
debug messages under ARP_DEBUG.
This patch removes KResult::operator int() and deals with the fallout.
This forces a lot of code to be more explicit in its handling of errors,
greatly improving readability.
When TCP sockets successfully establish a connection, any SO_ERROR
should be cleared back to success. For example, SO_ERROR gets set to
EINPROGRESS on asynchronous connect calls and should be cleared when
the socket moves to the Established state.
The only two paths for copying strings in the kernel should be going
through the existing Userspace<char const*>, or StringArgument methods.
Lets enforce this by removing the option for using the raw cstring APIs
that were previously available.
This fixes the placeholder stub for the SO_ERROR via getsockopt. It
leverages the m_so_error value that each socket maintains. The SO_ERROR
option obtains and then clears this field, which is useful when checking
for errors that occur between socket calls. This uses an integer value
to return the SO_ERROR status.
Resolves#146
Note: TCPSocket::create_client() has a dubious locking process where
the sockets by tuple table is first shared lock to check if the socket
exists and bail out if it does, then unlocks, then exclusively locks to
add the tuple. There could be a race condition where two client
creation requests for the same tuple happen at the same time and both
cleared the shared lock check. When in doubt, lock exclusively the
whole time.
Now that all KResult and KResultOr are used consistently throughout the
kernel, it's no longer necessary to return negative error codes.
However, we were still doing that in some places, so let's fix all those
(bugs) by removing the minuses. :^)
The IPv4Socket requires a DoubleBuffer for storage of any data it
received on the socket. However it was previously using the default
constructor which can not observe allocation failure. Address this by
plumbing the receive buffer through the various derived classes.
LocalSockets keep a DoubleBuffer for both client and server usage.
This change converts the usage from using the default constructor
which is unable to observe OOM, to the new try_create factory and
plumb the result through the constructor.
Read the appropriate registers for RTL8139, RTL8168 and E1000.
For NE2000 just assume 10mbit full duplex as there is no indicator
for it in the pure NE2000 spec. Mock values for loopback.
Previously there was no way for Serenity to send a packet without an
established socket connection, and there was no way to appropriately
respond to a SYN packet on a non-listening port. This patch will respond
to any non-established socket attempts with the appropraite RST/ACK,
letting the client know to close the connection.
In accordance with RFC 793, if the receiver is in the SYN-SENT state
it should respond to a RST by aborting the connection and immediately
move to the CLOSED state.
Previously the system would ACK all RST/ACKs, and the remote peer would
just respond with more RST packets.
Previously it was possible to leak the file descriptor if we error out
after allocating the first descriptor. Now we perform both fd
allocations back to back so we can handle the potential error when
processing the second fd allocation.
It's easy to forget the responsibility of validating and safely copying
kernel parameters in code that is far away from syscalls. ioctl's are
one such example, and bugs there are just as dangerous as at the root
syscall level.
To avoid this case, utilize the AK::Userspace<T> template in the ioctl
kernel interface so that implementors have no choice but to properly
validate and copy ioctl pointer arguments.
Right now, NE2000 NICs don't work because the link is down by default
and this will never change. Of all the NE2000 documentation I looked
at I could not find a link status indicator, so just assume the link
is up.
next_packet_page points to a page, but was being compared to a byte
offset rather than a page offset when adjusting the BOUNDARY register
when the ring buffer wraps around.
Fixes#8327.
Allocate all the RX buffers in one big memory region (and same for TX.)
This removes 38 lines from every crash dump (and just seems like a
reasonable idea in general.)
This commit converts naked `new`s to `AK::try_make` and `AK::try_create`
wherever possible. If the called constructor is private, this can not be
done, so we instead now use the standard-defined and compiler-agnostic
`new (nothrow)`.
If we are in a shared interrupt handler, the called handlers might
indicate it was not their interrupt, so we should not increment the
call counter of these handlers.
Previously we would not block the caller until the connection was
established and would instead return EPIPE for the first send() call
which then likely caused the caller to abandon the socket.
This was broken by 0625342.
These are pretty common on older LGA1366 & LGA1150 motherboards.
NOTE: Since the registers datasheets for all versions of the chip
besides versions 1 - 3 are still under NDAs i had to collect
several "magical vendor constants" from the *BSD driver and the
linux driver that i was not able to name verbosely, and as such
these are labeled with the comment "vendor magic values".
We call it E1000E, because the layout for these cards is somewhat not
the same like E1000 supported cards.
Also, this card supports advanced features that are not supported on
8254x cards.
Instead of initializing network adapters in init.cpp, let's move that
logic into a separate class to handle this.
Also, it seems like a good idea to shift responsiblity on enumeration
of network adapters after the boot process, so this singleton will take
care of finding the appropriate network adapter when asked to with an
IPv4 address or interface name.
With this change being merged, we simplify the creation logic of
NetworkAdapter derived classes, so we enumerate the PCI bus only once,
searching for driver candidates when doing so, and we let each driver
to test if it is resposible for the specified PCI device.
When attempting to write to a socket that is not connected or - for
connection-less protocols - doesn't have a peer address set we should
return EPIPE instead of blocking the thread.
When receiving a SYN packet for a connection that's in the "SYN
received" state we should ignore the duplicate SYN packet instead of
closing the connection. This can happen when we didn't accept the
connection in time and our peer has sent us another SYN packet because
it thought that the initial SYN packet was lost.
Previously we wouldn't release the buffer back to the network adapter
in all cases. While this didn't leak the buffer it would cause the
buffer to not be reused for other packets.
Previously we'd just dump those packets into the network adapter's
send queue and hope for the best. Instead we should wait until the peer
has sent TCP ACK packets.
Ideally this would parse the TCP window size option from the SYN or
SYN|ACK packet, but for now we just assume the window size is 64 kB.
Previously we'd allocate buffers when sending packets. This patch
avoids these allocations by using the NetworkAdapter's packet queue.
At the same time this also avoids copying partially constructed
packets in order to prepend Ethernet and/or IPv4 headers. It also
properly truncates UDP and raw IP packets.
Previously TCPSocket::send_tcp_packet() would try to send TCP packets
which matched whatever size the userspace program specified. We'd try to
break those packets up into smaller fragments, however a much better
approach is to limit TCP packets to the maximum segment size and
avoid fragmentation altogether.
There's no good reason to distinguish between network interfaces based
on their model. It's probably a good idea to try keep the names more
persistent so scripts written for a specific network interface will be
useable after hotplug event (or after rebooting with new hardware
setup).
Problem:
- `static` variables consume memory and sometimes are less
optimizable.
- `static const` variables can be `constexpr`, usually.
- `static` function-local variables require an initialization check
every time the function is run.
Solution:
- If a global `static` variable is only used in a single function then
move it into the function and make it non-`static` and `constexpr`.
- Make all global `static` variables `constexpr` instead of `const`.
- Change function-local `static const[expr]` variables to be just
`constexpr`.
This avoids two allocations when receiving network packets. One for
inserting a PacketWithTimestamp into m_packet_queue and another one
when inserting buffers into the list of unused packet buffers.
With this fixed the only allocations in NetworkTask happen when
initially allocating the PacketWithTimestamp structs and when switching
contexts.
Occasionally we'll see messages in the serial console like:
handle_tcp: unexpected flags in FinWait1 state
In these cases it would be nice to know what flags we are receiving that
we aren't expecting.
Previously we didn't retransmit lost TCP packets which would cause
connections to hang if packets were lost. Also we now time out
TCP connections after a number of retransmission attempts.
This wakes up NetworkTask every 500 milliseconds so that it can send
pending delayed TCP ACKs and isn't forced to send all of them early
when it goes to sleep like it did before.
When establishing the connection we should send ACKs right away so
as to not delay the connection process. This didn't previously
matter because we'd flush all delayed ACKs when NetworkTask waits
for incoming packets.
Avoid holding the sockets_by_tuple lock while allocating the TCPSocket.
While checking if the list contains the item we can also hold the lock
in shared mode, as we are only reading the hash table.
In addition the call to from_tuple appears to be superfluous, as we
created the socket, so we should be able to just return it directly.
This avoids the recursive lock acquisition, as well as the unnecessary
hash table lookups.
Note that the changes to IPv4Socket::create are unfortunately needed as
the return type of TCPSocket::create and IPv4Socket::create don't match.
- KResultOr<NonnullRefPtr<TcpSocket>>>
vs
- KResultOr<NonnullRefPtr<Socket>>>
To handle this we are forced to manually decompose the KResultOr<T> and
return the value() and error() separately.
This avoids allocations for initializing the Function<T>
for the NetworkAdapter::for_each callback argument.
Applying this patch decreases CPU utilization for NetworkTask
from 40% to 28% when receiving TCP packets at a rate of 100Mbit/s.
We already have another limit for the total number of packet buffers
allowed (max_packet_buffers). This second limit caused us to
repeatedly allocate and then free buffers.
This matches what other operating systems like Linux do:
$ ip route get 0.0.0.0
local 0.0.0.0 dev lo src 127.0.0.1 uid 1000
cache <local>
$ ssh 0.0.0.0
gunnar@0.0.0.0's password:
$ ss -na | grep :22 | grep ESTAB
tcp ESTAB 0 0 127.0.0.1:43118 127.0.0.1:22
tcp ESTAB 0 0 127.0.0.1:22 127.0.0.1:43118
When we receive a TCP packet with a sequence number that is not what
we expected we have lost one or more packets. We can signal this to
the sender by sending a TCP ACK with the previous ack number so that
they can resend the missing TCP fragments.
Previously we'd process TCP packets in whatever order we received
them in. In the case where packets arrived out of order we'd end
up passing garbage to the userspace process.
This was most evident for TLS connections:
courage:~ $ git clone https://github.com/SerenityOS/serenity
Cloning into 'serenity'...
remote: Enumerating objects: 178826, done.
remote: Counting objects: 100% (1880/1880), done.
remote: Compressing objects: 100% (907/907), done.
error: RPC failed; curl 56 OpenSSL SSL_read: error:1408F119:SSL
routines:SSL3_GET_RECORD:decryption failed or bad record mac, errno 0
error: 1918 bytes of body are still expected
fetch-pack: unexpected disconnect while reading sideband packet
fatal: early EOF
fatal: fetch-pack: invalid index-pack output
When the MSS option header is missing the default maximum segment
size is 536 which results in lots of very small TCP packets that
NetworkTask has to handle.
This adds the MSS option header to outbound TCP SYN packets and
sets it to an appropriate value depending on the interface's MTU.
Note that we do not currently do path MTU discovery so this could
cause problems when hops don't fragment packets properly.
This increases the default TCP window size to a more reasonable
value of 64k. This allows TCP peers to send us more packets before
waiting for corresponding ACKs.
This increases the buffer size for connection-oriented sockets
to 256kB. In combination with the other patches in this series
I was able to receive TCP packets at a rate of about 120Mbps.
Previously we'd incorrectly use the default gateway's MAC address.
Instead we must use destination MAC addresses that are derived from
the multicast IPv4 address.
With this patch applied I can query mDNS on a real network.
When reading UDP packets from userspace with recvmsg()/recv() we
would hit a VERIFY() if the supplied buffer is smaller than the
received UDP packet. Instead we should just return truncated data
to the caller.
This can be reproduced with:
$ dd if=/dev/zero bs=1k count=1 | nc -u 192.168.3.190 68
An IP socket can now join a multicast group by using the
IP_ADD_MEMBERSHIP sockopt, which will cause it to start receiving
packets sent to the multicast address, even though this address does
not belong to this host.
The previous patch already helped with this, however my idea of only
reading a few packets didn't work and we'd still sometimes end up not
receiving any more packets from the E1000 interface.
With this patch applied my NIC seems to receive packets just fine, at
least for now.
Raw sockets don't need a local port, so we shouldn't fail operations
if allocation yields an ENOPROTOOPT.
I'm not in love with the factoring here, just patching up the bug.
Without this patch we end up with sockets in the listener's accept
queue with state 'closed' when doing stealth SYN scans:
Client -> Server: SYN for port 22
Server -> Client: SYN/ACK
Client -> Server: RST (i.e. don't complete the TCP handshake)
This commit will add MSG_PEEK support, which allows a package to be
seen without taking it from the buffer, so that a subsequent recv()
without the MSG_PEEK flag can pick it up.
We had some inconsistencies before:
- Sometimes "The", sometimes "the"
- Sometimes trailing ".", sometimes no trailing "."
I picked the most common one (lowecase "the", trailing ".") and applied
it to all copyright headers.
By using the exact same string everywhere we can ensure nothing gets
missed during a global search (and replace), and that these
inconsistencies are not spread any further (as copyright headers are
commonly copied to new files).
Previously the E1000 network adapter would stop receiving further
packets when an RX buffer overrun occurred. This was the case
when connecting the adapter to a real network where enough broadcast
traffic caused the buffer to be full before the kernel had a chance
to clear the RX buffer.
The last IP address in an IPv4 subnet is considered the directed
broadcast address, e.g. for 192.168.3.0/24 the directed broadcast
address is 192.168.3.255. We need to consider this address as
belonging to the interface.
Here's an example with this fix applied, SerenityOS has 192.168.3.190:
[gunnar@nyx ~]$ ping -b 192.168.3.255
WARNING: pinging broadcast address
PING 192.168.3.255 (192.168.3.255) 56(84) bytes of data.
64 bytes from 192.168.3.175: icmp_seq=1 ttl=64 time=0.950 ms
64 bytes from 192.168.3.188: icmp_seq=1 ttl=64 time=2.33 ms
64 bytes from 192.168.3.46: icmp_seq=1 ttl=64 time=2.77 ms
64 bytes from 192.168.3.41: icmp_seq=1 ttl=64 time=4.15 ms
64 bytes from 192.168.3.190: icmp_seq=1 ttl=64 time=29.4 ms
64 bytes from 192.168.3.42: icmp_seq=1 ttl=64 time=30.8 ms
64 bytes from 192.168.3.55: icmp_seq=1 ttl=64 time=31.0 ms
64 bytes from 192.168.3.30: icmp_seq=1 ttl=64 time=33.2 ms
64 bytes from 192.168.3.31: icmp_seq=1 ttl=64 time=33.2 ms
64 bytes from 192.168.3.173: icmp_seq=1 ttl=64 time=41.7 ms
64 bytes from 192.168.3.43: icmp_seq=1 ttl=64 time=47.7 ms
^C
--- 192.168.3.255 ping statistics ---
1 packets transmitted, 1 received, +10 duplicates, 0% packet loss,
time 0ms, rtt min/avg/max/mdev = 0.950/23.376/47.676/16.539 ms
[gunnar@nyx ~]$
SPDX License Identifiers are a more compact / standardized
way of representing file license information.
See: https://spdx.dev/resources/use/#identifiers
This was done with the `ambr` search and replace tool.
ambr --no-parent-ignore --key-from-file --rep-from-file key.txt rep.txt *
Binding to port 0 is used to signal to listen() to bind to any port
that is available. (in serenity's case, to the port range of 32768 to
60999, which are not privileged ports)
This patch adds a few missing ioctls which were required by Wine.
SIOCGIFNETMASK, SIOCGIFBRDADDR and SIOCGIFMTU are fully implemented,
while SIOCGIFFLAGS and SIOCGIFCONF are stubs.