Removing State and Restore reduces painting commands count by 30-50%
on an average website.
Due to this change, it was also necessary to replace AddClipRect with
SetClipRect command.
This patch fixes the value of a module map entry being wrong in the
callbacks invoked in the set call. Previously we would set the value
in only after invoking the callbacks, leading to crashes when a
callback implementation would rightfully assume the value to be set
already.
Resolves#20994
0000459.pdf in 0000.zip in the pdfa dataset contains this as the
very first object:
```
1 0 obj
<<
/Creator (Developer 2000)
/CreatorDate (
/Author (Oracle Reports)
/Producer (Oracle PDF driver)
/Title (2021_06_29 Tutoritzacions APTES.PDF)
>>
endobj
```
The `/CreatorDate` value string is unterminated.
Before, we'd assert when trying to check if the first object is
a linearization dict.
Now, we never read the first object (an error during the linearization
dict reading is treated as "file is not linearized") unless we try
to print the document's metadata -- and there we now show an error
instead of asserting.
By storing painting command coordinates relative to the nearest
stacking context we can get rid of the Translate command.
Additionally, this allows us to easily check if the bounding
rectangles of the commands cover or intersect within a stacking
context. This should be useful if we decide to optimize by avoiding
the execution of commands that will be overpainted by the results of
subsequent commands.
Actually using separation color spaces still doesn't work, but we
now no longer assert on them when they're used.
Fixes 2 crashes on the `-n 500` 0000.zip pdfa dataset.
We perform such a check in other users of the paintable box in this file
as the box may be null before layout completes. This prevents UB seen in
some CI runs.
Several files have a comment after the trailer dict and the
`startxref` after it.
We really should add a consume_whitespace_and_comments() function
and call that in most places we currently call consume_whitespace().
But in this case, for non-linearized files, we first jump to the
end of the file, read `startxref`, then jump to `xref` from the
offset there, and then read the trailer after the `xref`,
only to read `startxref` again. So we can just not do that.
(For linearized files, we now completely ignore `startxref`.
But we don't use the data in `startxref` in linearized files
anyways, so it's fine to not read it there too.)
Reduces number of crashes on 300 random PDFs from the web (the first 300
from 0000.zip from
https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/)
from 25 (8%) to 23 (7%).
This commit limits `WOFF::Header::num_tables` to 4096. This limitation
is not explicitly mentioned in the specification, but allowing numbers
larger than this results in an overflow when calculating
`search_range` and `range_shift`.
This will create a string of the form:
Search DuckDuckGo for "Ladybird is awesome!"
If the provided query URL is unknown, the engine name is excluded (e.g.
for custom search URLs).
As spec comment in the code says we should use item’s max-content
contribution to calculate flex fraction.
Likely, it was calculate_max_content_size() because we didn't have
calculate_max_content_contribution() when this function was implemented
initially.
Without ENABLE_TIME_ZONE_DATA the user is able to update the time zone
selection when using "on_mousewheel" or "on_{up,down}_pressed" even
though UTC is supposed to be the only option. Due to an invalid model
index, the selection is set to "[null]".
Prior to this patch, the ComboBox text in "selection_updated" is set
at each call.
A page's /Contents can be an array of streams, and the page's contents
are then as if those streams are concatenated.
Most of the time, a stream ends with whitespace. But in some cases
(e.g. 0000642.pdf from 0000.zip from the pdfa dataset), the first
stream ends with an operator (`Q`) and the next stream starts with
one (`q`), and the concatenation would form a new, unkonwn operator
(`Qq`). Separate the streams' contents with a space to prevent that.
Reduces numbers of PDF files we fail to open in the -n 500 case
from 11 to 10 (in either case, we then crash on 18 of the PDFs
that we do manage to open).
These engines and their query URLs are duplicated in several places.
Before implementing search support in the AppKit chrome, let's move
these engines to LibWebView.
Type 1 fonts usually have a m_font_program and no m_font -- they only
have m_font if we're using a replacement font for the fonts that
were built-in to PDFs before Acrobat 4.0 (and must still work to
show existing files).
However, SimpleFont::get_glyph_width() used to always return a
float, which in Type1Font was only implemented if m_font was set.
Per spec, we're supposed to just use /MissingWidth for fonts that
are missing an entry in the descriptor's /Width array. However, for
built-in fonts, no explicit /Width array is needed (PDF 1.7 spec,
Appendix H.3, 5.5.1). So if we just always use /MissingWidth,
then PDFs that use a built-in font draw all their text on top
of each other (e.g. 000333.pdf from stillhq.com-pdfdb).
So change get_glyph_width() to return Optional<float>, return
it only in Type1Font if m_font is set, and use MissingWidth
if it isn't set.
That way, replacement fonts still return a width, and real
fonts that are supposed to have /Width and use /MissingWidth
for missing entries do what they're supposed to too, instead
of crashing.
From 20 (6%) to 16 (5%) crashes on the 300 first PDFs, and from
39 (7.8%) to 31 (6.2%) on the 500-random PDFs test.
`left` might be a number bigger than there are actually glyphs in the
CFF.
The spec says "The number of ranges is not explicitly specified in the
font. Instead, software utilizing this data simply processes ranges
until all glyphs in the font are covered." Apparently we have to check
for this within each range as well.
Needed for example in 0000054.pdf and 0000354.pdf in 0000.zip in the
pdfa dataset.
Together with the previous commit:
From 21 (7%) to 20 (6%) crashes on the 300 first PDFs, and from
41 (8.2%) to 39 (7.8%) on the 500-random PDFs test.
...and replace template instantiations with a loop, to make this
easily possible.
Vaguely nice for code size as well.
Needed for example in 0000054.pdf and 0000354.pdf in 0000.zip in the
pdfa dataset.
We used to use an u8 as loop counter, which would overflow
if there were more than 255 glyphs, producing hundreds of megabytes
of
Couldn't find string for SID x, going with space
output in the process, while all data until the end of the CFF
section got interpreted as SIDs, until a try_read() would finally
fail.
We now no longer fail miserably trying to render page 2 of
0000352.pdf of 0000.zip from the pdfa dataset.
Fixes just one crash of the larger 500-document test set, but
when I tweak test_pdf.py to print all stacks instead of just the
top 5, it no longer produces 260 MB of output.
As of https://tc39.es/ecma262/#sec-yearfromtime, YearFromTime(t) should
return `y` such that `TimeFromYear(YearFromTime(t)) <= t`. This wasn't
held, since the approximation contained decimal digits that would nudge
the final value in the wrong direction.
Adapted from Kiesel:
6548a85743
Co-authored-by: Linus Groh <mail@linusgroh.de>
Inline images can contain arbitrary binary data in the operator stream,
greatly confusing the operator parser.
Just skip them for now. They'll produce a
`Rendering of feature not supported: draw operation: inline_image_begin`
diag as usual, so we won't forget about it.
After #21536, reduces number of crashes on 300 random PDFs from the web
(the first 300 from 0000.zip from
https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/)
from 23 (7%) to 22 (7%).
On a larger sample (`Meta/test_pdf.py -n 500 ~/Downloads/0000`),
reduces number of crashes from 53 (10.6%) with 36 distinct crash
stacks to 46 (9.2%) with 33 distinct stacks.
Rewrites the grid area building to accurately identify areas that span
multiple rows. Also now we can recognize invalid areas but do not
handle them yet.
With the recording painter the actual painting operations are delayed,
so now if multiple corner clippers are constructed, and they use a
shared bitmap they can interfere with each other. The use of this shared
bitmap was somewhat questionable anyway, so this is not much of a loss.
This fixes the border-radius.html test page.
If a PDF uses `/CustomName cs` and `/CustomName` then points at just a
name like `/DeviceGray` instead of an array, that's ok. Just using
`/DeviceGray cs` is simpler, so this extra level of indirection is
somewhat rare in practice, but it's valid and it does happen. So support
it.
We already have a helper that does the right thing that we just need to
call.
Together with #21524 and #21525, reduces number of crashes on 300 random
PDFs from the web (the first 300 from 0000.zip from
https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/)
from 29 (9%) to 25 (8%).
This fixes a small bug from 39b2eed3f6: That commit tried to disable
filters for the very first object read, for the case covered in
Tests/LibPDF/password-is-sup.pdf.
However, it accidentally also disabled filters by default.
Most of the time, this isn't really a difference: We call
`set_filters_enabled(true);` very early in
`DocumentParser::initialize_linearization_dict()`, which explicitly
enables filters, and `initialize_linearization_dict()` is the very
first thing called in `DocumentParser::initialize()`.
But there's an early exit in `initialize_linearization_dict()`
for if there's nothing looking like an indirect object right
after the header, and in this case we used to not enable
filtering, and would hand compressed streams to the operand parser.
(And due to a 2nd bug, we'd even do this if the header line was
followed by an empty line.)
0000990.pdf from 0000.zip from
https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/
starts like so:
```
%PDF-1.7
4 0 obj
```
parse_heaader() used to put the cursor at the start of the 2nd,
empty, line. initialize_linearization_dict() would then check
if `m_reader.matches_number()` to see if there could possibly
be a linearization dict.
In this case, there isn't one, but we should detect linearization
dicts even if they're separated by whitespace from the first line.
Grid items should respect alignment properties if top/right/bottom/left
are not specified.
This change adds a separate implementation of
layout_absolutely_positioned_element that is extended with support for
alignment.
If the first pass of rows sizing results in the container's automatic
height being less than the specified min-height, we need to run a
second pass using the updated available space.
Per spec:
"If the color space is one that can be specified by a name and no
additional parameters (DeviceGray, DeviceRGB, DeviceCMYK, and certain
cases of Pattern), the name may be specified directly."
We still don't implement /Pattern color spaces, but now we no longer
crash trying to look up the potentially-nonexistent /ColorSpace
dictionary on the page object when /Pattern is used directly as color
space name.
On top of #21514, reduces number of crashes on 300 random PDFs from the
web (the first 300 from 0000.zip from
https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/)
from 42 (14%) to 34 (11%).
It used to be called ColorSpaceFamily::never_needs_parameters().
But in the cpp file, the macro arg was called ever_needs_parameters,
and the spec says
"If the color space is one that can be specified by a name and no
additional parameters (DeviceGray, DeviceRGB, DeviceCMYK, and certain
cases of Pattern), the name may be specified directly."
so let's use that language here.
No behavior change.
We now no longer crash on images that use an ICC-based color space.
Reduces number of crashes on 300 random PDFs from the web (the first 300
from 0000.zip from
https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/)
from 81 (27%) to 64 (21%).
Also fixes all remaining crashes in
411_getting_started_with_instruments.pdf and
513_high_efficiency_image_file_format.pdf.
This is meant to serve as the method all Ladybird chromes can use to
highlight the eTLD+1 substring of the URL. It uses the Public Suffix
List to break the URL into 3 parts: the scheme and subdomain, the
eTLD+1, and all remaining parts (port, path, query, etc.).
Fixes two places in child navigable destroy procedure where we used
content navigable instead of container's navigable.
With this change, iframe's nested histories are actually destroyed
along with the document that created them.
Implements the `ri` operator, and the `RI` key in a graphics state
dictionary.
We don't do anything yet with the color rendering intent except
store it.
No behavior change except removing a few "not yet implemented"
messages.
Follow-up to #21489. There, I made us use a RAII object.
That's great, but if the embedded instruction stream pushes
its own graphics state, then an early return would cause us to
not process graphics state pop instructions in the embedded stream.
To fix this, remember the graphics stack depth before entering
the nested instruction stream, and explicitly shrink the stack back
to that size upon exit.
Enables us to render all pages of
https://devstreaming-cdn.apple.com/videos/wwdc/2017/821kjtggolzxsv/821/821_get_started_with_display_p3.pdf
without crashing.
BMP files encode the direction of the rows with the sign of the height.
Our BMP decoder already makes all the proper checks, however when
constructing the Gfx::Bitmap, didn't actually make the height positive.
Boog neutralized :^)
This change separates the box outer shadow metrics calculations into a
separate function. This function is then used to obtain the shadow
bounding rectangle and skip painting if the entire shadow is outside
of the viewport.