Commit graph

102 commits

Author SHA1 Message Date
Andreas Kling
6c51ba27a2 AK: Make URL percent encoding faster by exploiting ASCII knowledge
Once we know that a code point must be a valid ASCII character,
we now cast it to `char` and avoid the expensive generic
StringView::contains(u32 code_point) checks.

This dramatically speeds up URL parsing.
2023-12-30 13:49:50 +01:00
Ali Mohammad Pur
5e1499d104 Everywhere: Rename {Deprecated => Byte}String
This commit un-deprecates DeprecatedString, and repurposes it as a byte
string.
As the null state has already been removed, there are no other
particularly hairy blockers in repurposing this type as a byte string
(what it _really_ is).

This commit is auto-generated:
  $ xs=$(ack -l \bDeprecatedString\b\|deprecated_string AK Userland \
    Meta Ports Ladybird Tests Kernel)
  $ perl -pie 's/\bDeprecatedString\b/ByteString/g;
    s/deprecated_string/byte_string/g' $xs
  $ clang-format --style=file -i \
    $(git diff --name-only | grep \.cpp\|\.h)
  $ gn format $(git ls-files '*.gn' '*.gni')
2023-12-17 18:25:10 +03:30
Shannon Booth
ef27750982 AK: Port URL m_paths to String 2023-10-26 11:11:41 +02:00
Shannon Booth
e9670862a0 AK: Simplify "cannot have a username/password/port"
This is an editorial change in the URL spec, see:

https://github.com/whatwg/url/commit/f78785
2023-10-26 11:11:41 +02:00
Ali Mohammad Pur
aeee98b3a1 AK+Everywhere: Remove the null state of DeprecatedString
This commit removes DeprecatedString's "null" state, and replaces all
its users with one of the following:
- A normal, empty DeprecatedString
- Optional<DeprecatedString>

Note that null states of DeprecatedFlyString/StringView/etc are *not*
affected by this commit. However, DeprecatedString::empty() is now
considered equal to a null StringView.
2023-10-13 18:33:21 +03:30
Shannon Booth
daf6d8173c AK: Remove uneeded deprecated_string_percent_encode
This function was an artifact from when we were using DeprecatedString
much more in URL class before porting to Optional<String>, where we no
longer rely on the null state of DeprecatedString.
2023-10-13 18:33:21 +03:30
Kemal Zebari
f6c52f622d AK: Number the spec step comments in URL::serialize_path() 2023-09-19 21:51:31 +01:00
Kemal Zebari
b6b4e59bf7 AK: Implement URL::serialize_path() to spec
This commit also reverts db5ad0c since code outside of the web spec
expects serialized paths to be percent decoded.

Also, there are issues trying to implement the concept "opaque
path". For now, we still use the old cannot_be_a_base_url(), but its
usage needs to be removed in favor of a has_opaque_path() as the spec
has changed since then.
2023-09-15 11:15:43 -06:00
Shannon Booth
23e82114b4 AK: Do not consider port of 0 as a null port
This fixes an issue where if a port number of 0 was given for a non
special scheme the port number was being dropped.
2023-08-31 11:02:18 +02:00
Shannon Booth
9d60f23abc AK: Port URL::m_fragment from DeprecatedString to String 2023-08-13 15:03:53 -06:00
Shannon Booth
5663a2d3b4 AK+LibWeb: Do not percent encode/decode in URL fragment setter/getters
The web specs do not expect decoding or decoding to happen when calling
these helpers. This allows us to remove the raw_fragment helper function
from the URL class.
2023-08-13 15:03:53 -06:00
Shannon Booth
c25485700a AK: Port URL scheme from DeprecatedString to String 2023-08-13 15:03:53 -06:00
Shannon Booth
21fe86d235 AK: Port URL::m_query from DeprecatedString to String 2023-08-13 15:03:53 -06:00
Shannon Booth
55a01e72ca AK: Port URL username/password from DeprecatedString to String
And for cases that just need to check whether the password/username is
empty, add a raw_{password,username} helper to avoid any allocation.
2023-08-13 15:03:53 -06:00
Sam Atkins
28aa4ca767 AK: Add URL::raw_fragment()
This is a little hackish. Part of the algorithm to get the indicated
part of a DOM::Document
https://html.spec.whatwg.org/multipage/browsing-the-web.html#the-indicated-part-of-the-document
wants to first get the URL's fragment, use it, and then later
percent-decode it and use that. So, we need a way to get that
un-decoded fragment.
2023-08-12 08:39:04 +02:00
Shannon Booth
db5ad0c2b0 AK: Remove ApplyPercentDecoding from URL
Nowhere was setting this flag from the default.
2023-08-06 08:57:23 +02:00
Shannon Booth
98666b012d AK: Remove URL::ApplyPercentEncoding
Everywhere only ever expects percent encoding to occur, so let's just
remove this flag altogether. At the same time, replace some
DeprecatedString with StringView.
2023-08-06 08:57:23 +02:00
Karol Kosek
eb41f0144b AK: Decode data URLs to separate class (and parse like every other URL)
Parsing 'data:' URLs took it's own route. It never set standard URL
fields like path, query or fragment (except for scheme) and instead
gave us separate methods called `data_payload()`, `data_mime_type()`,
and `data_payload_is_base64()`.

Because parsing 'data:' didn't use standard fields, running the
following JS code:

    new URL('#a', 'data:text/plain,hello').toString()

not only cleared the path as URLParser doesn't check for data from
data_payload() function (making the result be 'data:#a'), but it also
crashes the program because we forbid having an empty MIME type when we
serialize to string.

With this change, 'data:' URLs will be parsed like every other URLs.
To decode the 'data:' URL contents, one needs to call process_data_url()
on a URL, which will return a struct containing MIME type with already
decoded data! :^)
2023-08-01 14:19:05 +02:00
Shannon Booth
4fdd4dd979 AK: Add missing default port definitions for FTP scheme URLs
This is defined in the spec, but was missing in our table. Fix this, and
add a spec comment for what is missing. Also begin a basic text based
test for URL, so we can get some coverage of LibWeb's usage of URL too.
2023-07-31 14:48:24 +02:00
Shannon Booth
8751be09f9 AK: Serialize URL hosts with 'concept-host-serializer'
In order to follow spec text to achieve this, we need to change the
underlying representation of a host in AK::URL to deserialized format.
Before this, we were parsing the host and then immediately serializing
it again.

Making that change resulted in a whole bunch of fallout.

After this change, callers can access the serialized data through
this concept-host-serializer. The functional end result of this
change is that IPv6 hosts are now correctly serialized to be
surrounded with '[' and ']'.
2023-07-31 05:18:51 +02:00
Shannon Booth
a1ae701a7d AK: Move URL::cannot_have_a_username_or_password_or_port out of line
This doesn't seem trivial enough to be defining in the header like this,
and should not be a performance critical function anyhow.

Also add spec comments while we are at it, and a FIXME since we do not
seem to exactly align.
2023-07-31 05:18:51 +02:00
Shannon Booth
7b3902e3d5 AK: Remove unused URL::scheme_requires_port
THis function does not seem to be used anywhere, and I cannot find any
spec equivalent for this function.
2023-07-25 06:43:50 -04:00
Shannon Booth
c8da880806 AK: Add spec comments to URL::serialize 2023-07-25 06:43:50 -04:00
Shannon Booth
5625ca5cb9 AK: Rename URLParser::parse to URLParser::basic_parse
To make it more clear that this function implements
'concept-basic-url-parser' instead of 'concept-url-parser'.
2023-07-15 09:45:16 +02:00
Kenneth Myhra
c03c0ec900 AK: Add URL::to_string() returning a String
For convenience wire up a to_string() method which returns a new String.
2023-06-17 20:38:20 +02:00
MacDue
5db1eb9961 AK+Everywhere: Replace URL::paths() with path_segment_at_index()
This allows accessing and looping over the path segments in a URL
without necessarily allocating a new vector if you want them percent
decoded too (which path_segment_at_index() has an option for).
2023-04-15 06:37:04 +02:00
MacDue
35612c6a7f AK+Everywhere: Change URL::path() to serialize_path()
This now defaults to serializing the path with percent decoded segments
(which is what all callers expect), but has an option not to. This fixes
`file://` URLs with spaces in their paths.

The name has been changed to serialize_path() path to make it more clear
that this method will generate a new string each call (except for the
cannot_be_a_base_url() case). A few callers have then been updated to
avoid repeatedly calling this function.
2023-04-15 06:37:04 +02:00
MacDue
5acd40c525 AK+Everywhere: Add ApplyPercentDecoding option to URL getters
The defaults selected for this are based on the behaviour of URL
when it applied percent decoding during parsing. This does mean now
in some cases the getters will allocate, but percent_decode() checks
if there's anything to decode first, so in many cases still won't.
2023-04-15 06:37:04 +02:00
MacDue
8283e8b88c AK: Don't store parts of URLs percent decoded
As noted in serval comments doing this goes against the WC3 spec,
and breaks parsing then re-serializing URLs that contain percent
encoded data, that was not encoded using the same character set as
the serializer.

For example, previously if you had a URL like:

https:://foo.com/what%2F%2F (the path is what + '//' percent encoded)

Creating URL("https:://foo.com/what%2F%2F").serialize() would return:

https://foo.com/what//

Which is incorrect and not the same as the URL we passed. This is
because the re-serializing uses the PercentEncodeSet::Path which
does not include '/'.

Only doing the percent encoding in the setters fixes this, which
is required to navigate to Google Street View (which includes a
percent encoded URL in its URL).

Seems to fix #13477 too
2023-04-12 07:40:22 +02:00
networkException
9915fa72fb AK+Everywhere: Use Optional for URLParser::parse's base_url parameter 2023-04-11 16:28:20 +02:00
Sam Atkins
abc01cc9fe AK+Tests+LibWeb: Make URL::complete_url() take a StringView
All it does is pass this to `URLParser::parse()` which takes a
StringView, so we might as well take one here too.
2023-02-15 12:48:26 -05:00
Linus Groh
6e7459322d AK: Remove StringBuilder::build() in favor of to_deprecated_string()
Having an alias function that only wraps another one is silly, and
keeping the more obvious name should flush out more uses of deprecated
strings.
No behavior change.
2023-01-27 20:38:49 +00:00
Alec Murphy
da4067a75f AK: Remove tilde from URL::PercentEncodeSet::EncodeURI 2022-12-26 04:51:55 +03:30
Linus Groh
57dc179b1f Everywhere: Rename to_{string => deprecated_string}() where applicable
This will make it easier to support both string types at the same time
while we convert code, and tracking down remaining uses.

One big exception is Value::to_string() in LibJS, where the name is
dictated by the ToString AO.
2022-12-06 08:54:33 +01:00
Linus Groh
6e19ab2bbc AK+Everywhere: Rename String to DeprecatedString
We have a new, improved string type coming up in AK (OOM aware, no null
state), and while it's going to use UTF-8, the name UTF8String is a
mouthful - so let's free up the String name by renaming the existing
class.
Making the old one have an annoying name will hopefully also help with
quick adoption :^)
2022-12-06 08:54:33 +01:00
Ben Wiederhake
3aeb57ed09 AK+Everywhere: Fix data corruption due to code-point-to-char conversion
In particular, StringView::contains(char) is often used with a u32
code point. When this is done, the compiler will for some reason allow
data corruption to occur silently.

In fact, this is one of two reasons for the following OSS Fuzz issue:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=49184
This is probably a very old bug.

In the particular case of URLParser, AK::is_url_code_point got confused:
    return /* ... */ || "!$&'()*+,-./:;=?@_~"sv.contains(code_point);
If code_point is a large code point that happens to have the correct
lower bytes, AK::is_url_code_point is then convinced that the given
code point is okay, even if it is actually problematic.

This commit fixes *only* the silent data corruption due to the erroneous
conversion, and does not fully resolve OSS-Fuzz#49184.
2022-10-09 10:37:20 -06:00
sin-ack
3f3f45580a Everywhere: Add sv suffix to strings relying on StringView(char const*)
Each of these strings would previously rely on StringView's char const*
constructor overload, which would call __builtin_strlen on the string.
Since we now have operator ""sv, we can replace these with much simpler
versions. This opens the door to being able to remove
StringView(char const*).

No functional changes.
2022-07-12 23:11:35 +02:00
Karol Kosek
65afa113e5 AK: Make URL ApplicationXWWWFormUrlencoded encoding closer to spec
It was mostly implemented based on a spec note, that described only
allowed characters, but instead of allowing some special characters not
to be escaped, we escaped every special character except those 'new in
this encode set' disallowed characters from the spec definition.
2022-06-10 22:32:29 +01:00
Karol Kosek
9e69a89f8e AK: Append correct number of port characters when serializing a URL
Instead of formatting a port string, it put bytes from stack, using the
port number as a length (so for port 8000 it appended 8000 bytes).
2022-06-10 22:32:29 +01:00
ForLoveOfCats
a7fe3183f5 AK: Add URL::create_with_help_scheme helper function 2022-04-21 09:12:37 +04:30
Andreas Kling
79c77debb0 AK: Don't destructively re-encode query strings in the URL parser
We were decoding and then re-encoding the query string in URLs.
This round-trip caused us to lose information about plus ('+')
ASCII characters encoded as "%2B".
2022-04-10 01:37:45 +02:00
Andreas Kling
3724ce765e AK+LibWeb: Encode ' ' as '+' in application/x-www-form-urlencoded
This matches what the URL and HTML specifications ask us to do.
2022-04-10 01:37:45 +02:00
GeekFiftyFive
832920c003 AK+LibHTTP: Revert prior change to percent encode plus signs
A change was made prior to percent encode plus signs in order to fix an
issue with the Google cookie consent page.

Unforunately, this was treating a symptom of a problem and not the root
cause and is incorrect behavior.
2022-04-08 20:44:49 +02:00
GeekFiftyFive
737f5b26b7 AK+LibHTTP: Ensure plus signs are percent encoded in query string
Adds a new optional parameter 'reserved_chars' to
AK::URL::percent_encode. This new optional parameter allows the caller
to specify custom characters to be percent encoded. This is then used
to percent encode plus signs by HttpRequest::to_raw_request.
2022-04-02 18:43:15 +02:00
Andreas Kling
8b1108e485 Everywhere: Pass AK::StringView by value 2021-11-11 01:27:46 +01:00
Idan Horowitz
d6cfa34667 AK: Make URL::m_port an Optional<u16>, Expose raw port getter
Our current way of signalling a missing port with m_port == 0 was
lacking, as 0 is a valid port number in URLs.
2021-09-14 00:14:45 +02:00
Idan Horowitz
55b67ba7a7 AK: Accept optional url and state override parameters in URLParser
These are required in the specification and used by the web's URL
built-in, this commit also removes the Badge<AK::URL> from URLParser
to allow other classes that need to call the parser directly like the
web's URL built-in to do so.
2021-09-14 00:14:45 +02:00
Idan Horowitz
6fa4fc8353 AK: Add URL::serialize_origin based on HTML's origin definition 2021-09-14 00:14:45 +02:00
Max Wipfli
9b8f35259c AK: Remove the LexicalPath::is_valid() API
Since this is always set to true on the non-default constructor and
subsequently never modified, it is somewhat pointless. Furthermore,
there are arguably no invalid relative paths.
2021-06-30 11:13:54 +02:00
Max Wipfli
bc8d16ad28 Everywhere: Replace ctype.h to avoid narrowing conversions
This replaces ctype.h with CharacterType.h everywhere I could find
issues with narrowing conversions. While using it will probably make
sense almost everywhere in the future, the most critical places should
have been addressed.
2021-06-03 13:31:46 +02:00