ladybird/Userland/Libraries/LibRegex/RegexByteCode.cpp

1007 lines
37 KiB
C++
Raw Normal View History

LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
/*
* Copyright (c) 2020, Emanuel Sprung <emanuel.sprung@gmail.com>
*
* SPDX-License-Identifier: BSD-2-Clause
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
*/
#include "RegexByteCode.h"
#include "AK/StringBuilder.h"
#include "RegexDebug.h"
#include <AK/BinarySearch.h>
#include <AK/CharacterTypes.h>
#include <AK/Debug.h>
#include <LibUnicode/CharacterTypes.h>
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
namespace regex {
2021-07-23 15:55:14 +00:00
char const* OpCode::name(OpCodeId opcode_id)
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
{
switch (opcode_id) {
#define __ENUMERATE_OPCODE(x) \
case OpCodeId::x: \
return #x;
ENUMERATE_OPCODES
#undef __ENUMERATE_OPCODE
default:
VERIFY_NOT_REACHED();
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
return "<Unknown>";
}
}
2021-07-23 15:55:14 +00:00
char const* OpCode::name() const
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
{
return name(opcode_id());
}
2021-07-23 15:55:14 +00:00
char const* execution_result_name(ExecutionResult result)
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
{
switch (result) {
#define __ENUMERATE_EXECUTION_RESULT(x) \
case ExecutionResult::x: \
return #x;
ENUMERATE_EXECUTION_RESULTS
#undef __ENUMERATE_EXECUTION_RESULT
default:
VERIFY_NOT_REACHED();
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
return "<Unknown>";
}
}
char const* opcode_id_name(OpCodeId opcode)
{
switch (opcode) {
#define __ENUMERATE_OPCODE(x) \
case OpCodeId::x: \
return #x;
ENUMERATE_OPCODES
#undef __ENUMERATE_OPCODE
default:
VERIFY_NOT_REACHED();
return "<Unknown>";
}
}
2021-07-23 15:55:14 +00:00
char const* boundary_check_type_name(BoundaryCheckType ty)
{
switch (ty) {
#define __ENUMERATE_BOUNDARY_CHECK_TYPE(x) \
case BoundaryCheckType::x: \
return #x;
ENUMERATE_BOUNDARY_CHECK_TYPES
#undef __ENUMERATE_BOUNDARY_CHECK_TYPE
default:
VERIFY_NOT_REACHED();
return "<Unknown>";
}
}
2021-07-23 15:55:14 +00:00
char const* character_compare_type_name(CharacterCompareType ch_compare_type)
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
{
switch (ch_compare_type) {
#define __ENUMERATE_CHARACTER_COMPARE_TYPE(x) \
case CharacterCompareType::x: \
return #x;
ENUMERATE_CHARACTER_COMPARE_TYPES
#undef __ENUMERATE_CHARACTER_COMPARE_TYPE
default:
VERIFY_NOT_REACHED();
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
return "<Unknown>";
}
}
2021-07-23 15:55:14 +00:00
static char const* character_class_name(CharClass ch_class)
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
{
switch (ch_class) {
#define __ENUMERATE_CHARACTER_CLASS(x) \
case CharClass::x: \
return #x;
ENUMERATE_CHARACTER_CLASSES
#undef __ENUMERATE_CHARACTER_CLASS
default:
VERIFY_NOT_REACHED();
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
return "<Unknown>";
}
}
static void advance_string_position(MatchState& state, RegexStringView view, Optional<u32> code_point = {})
{
++state.string_position;
if (view.unicode()) {
if (!code_point.has_value() && (state.string_position_in_code_units < view.length_in_code_units()))
code_point = view[state.string_position_in_code_units];
if (code_point.has_value())
state.string_position_in_code_units += view.length_of_code_point(*code_point);
} else {
++state.string_position_in_code_units;
}
}
static void advance_string_position(MatchState& state, RegexStringView, RegexStringView advance_by)
{
state.string_position += advance_by.length();
state.string_position_in_code_units += advance_by.length_in_code_units();
}
static void reverse_string_position(MatchState& state, RegexStringView view, size_t amount)
{
VERIFY(state.string_position >= amount);
state.string_position -= amount;
if (view.unicode())
state.string_position_in_code_units = view.code_unit_offset_of(state.string_position);
else
state.string_position_in_code_units -= amount;
}
static void save_string_position(MatchInput const& input, MatchState const& state)
{
input.saved_positions.append(state.string_position);
input.saved_code_unit_positions.append(state.string_position_in_code_units);
}
static bool restore_string_position(MatchInput const& input, MatchState& state)
{
if (input.saved_positions.is_empty())
return false;
state.string_position = input.saved_positions.take_last();
state.string_position_in_code_units = input.saved_code_unit_positions.take_last();
return true;
}
OwnPtr<OpCode> ByteCode::s_opcodes[(size_t)OpCodeId::Last + 1];
bool ByteCode::s_opcodes_initialized { false };
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
void ByteCode::ensure_opcodes_initialized()
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
{
if (s_opcodes_initialized)
return;
for (u32 i = (u32)OpCodeId::First; i <= (u32)OpCodeId::Last; ++i) {
switch ((OpCodeId)i) {
#define __ENUMERATE_OPCODE(OpCode) \
case OpCodeId::OpCode: \
s_opcodes[i] = make<OpCode_##OpCode>(); \
break;
ENUMERATE_OPCODES
#undef __ENUMERATE_OPCODE
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
}
}
s_opcodes_initialized = true;
}
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
ALWAYS_INLINE OpCode& ByteCode::get_opcode_by_id(OpCodeId id) const
{
VERIFY(id >= OpCodeId::First && id <= OpCodeId::Last);
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
auto& opcode = s_opcodes[(u32)id];
opcode->set_bytecode(*const_cast<ByteCode*>(this));
return *opcode;
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
}
OpCode& ByteCode::get_opcode(MatchState& state) const
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
{
OpCodeId opcode_id;
if (auto opcode_ptr = static_cast<DisjointChunks<ByteCodeValueType> const&>(*this).find(state.instruction_position))
opcode_id = (OpCodeId)*opcode_ptr;
else
opcode_id = OpCodeId::Exit;
auto& opcode = get_opcode_by_id(opcode_id);
opcode.set_state(state);
return opcode;
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
}
ALWAYS_INLINE ExecutionResult OpCode_Exit::execute(MatchInput const& input, MatchState& state) const
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
{
if (state.string_position > input.view.length() || state.instruction_position >= m_bytecode->size())
return ExecutionResult::Succeeded;
return ExecutionResult::Failed;
}
ALWAYS_INLINE ExecutionResult OpCode_Save::execute(MatchInput const& input, MatchState& state) const
{
save_string_position(input, state);
return ExecutionResult::Continue;
}
ALWAYS_INLINE ExecutionResult OpCode_Restore::execute(MatchInput const& input, MatchState& state) const
{
if (!restore_string_position(input, state))
return ExecutionResult::Failed;
return ExecutionResult::Continue;
}
ALWAYS_INLINE ExecutionResult OpCode_GoBack::execute(MatchInput const& input, MatchState& state) const
{
if (count() > state.string_position)
return ExecutionResult::Failed_ExecuteLowPrioForks;
reverse_string_position(state, input.view, count());
return ExecutionResult::Continue;
}
ALWAYS_INLINE ExecutionResult OpCode_FailForks::execute(MatchInput const& input, MatchState&) const
{
VERIFY(count() > 0);
input.fail_counter += count() - 1;
return ExecutionResult::Failed_ExecuteLowPrioForks;
}
ALWAYS_INLINE ExecutionResult OpCode_Jump::execute(MatchInput const&, MatchState& state) const
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
{
state.instruction_position += offset();
return ExecutionResult::Continue;
}
ALWAYS_INLINE ExecutionResult OpCode_ForkJump::execute(MatchInput const&, MatchState& state) const
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
{
state.fork_at_position = state.instruction_position + size() + offset();
return ExecutionResult::Fork_PrioHigh;
}
ALWAYS_INLINE ExecutionResult OpCode_ForkReplaceJump::execute(MatchInput const& input, MatchState& state) const
{
state.fork_at_position = state.instruction_position + size() + offset();
input.fork_to_replace = state.instruction_position;
return ExecutionResult::Fork_PrioHigh;
}
ALWAYS_INLINE ExecutionResult OpCode_ForkStay::execute(MatchInput const&, MatchState& state) const
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
{
state.fork_at_position = state.instruction_position + size() + offset();
return ExecutionResult::Fork_PrioLow;
}
ALWAYS_INLINE ExecutionResult OpCode_ForkReplaceStay::execute(MatchInput const& input, MatchState& state) const
{
state.fork_at_position = state.instruction_position + size() + offset();
input.fork_to_replace = state.instruction_position;
return ExecutionResult::Fork_PrioLow;
}
ALWAYS_INLINE ExecutionResult OpCode_CheckBegin::execute(MatchInput const& input, MatchState& state) const
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
{
if (0 == state.string_position && (input.regex_options & AllFlags::MatchNotBeginOfLine))
return ExecutionResult::Failed_ExecuteLowPrioForks;
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
if ((0 == state.string_position && !(input.regex_options & AllFlags::MatchNotBeginOfLine))
|| (0 != state.string_position && (input.regex_options & AllFlags::MatchNotBeginOfLine))
|| (0 == state.string_position && (input.regex_options & AllFlags::Global)))
return ExecutionResult::Continue;
return ExecutionResult::Failed_ExecuteLowPrioForks;
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
}
ALWAYS_INLINE ExecutionResult OpCode_CheckBoundary::execute(MatchInput const& input, MatchState& state) const
{
auto isword = [](auto ch) { return is_ascii_alphanumeric(ch) || ch == '_'; };
auto is_word_boundary = [&] {
if (state.string_position == input.view.length()) {
if (state.string_position > 0 && isword(input.view[state.string_position_in_code_units - 1]))
return true;
return false;
}
if (state.string_position == 0) {
if (isword(input.view[0]))
return true;
return false;
}
return !!(isword(input.view[state.string_position_in_code_units]) ^ isword(input.view[state.string_position_in_code_units - 1]));
};
switch (type()) {
case BoundaryCheckType::Word: {
if (is_word_boundary())
return ExecutionResult::Continue;
return ExecutionResult::Failed_ExecuteLowPrioForks;
}
case BoundaryCheckType::NonWord: {
if (!is_word_boundary())
return ExecutionResult::Continue;
return ExecutionResult::Failed_ExecuteLowPrioForks;
}
}
VERIFY_NOT_REACHED();
}
ALWAYS_INLINE ExecutionResult OpCode_CheckEnd::execute(MatchInput const& input, MatchState& state) const
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
{
if (state.string_position == input.view.length() && (input.regex_options & AllFlags::MatchNotEndOfLine))
return ExecutionResult::Failed_ExecuteLowPrioForks;
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
if ((state.string_position == input.view.length() && !(input.regex_options & AllFlags::MatchNotEndOfLine))
|| (state.string_position != input.view.length() && (input.regex_options & AllFlags::MatchNotEndOfLine || input.regex_options & AllFlags::MatchNotBeginOfLine)))
return ExecutionResult::Continue;
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
return ExecutionResult::Failed_ExecuteLowPrioForks;
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
}
ALWAYS_INLINE ExecutionResult OpCode_ClearCaptureGroup::execute(MatchInput const& input, MatchState& state) const
{
if (input.match_index < state.capture_group_matches.size()) {
auto& group = state.capture_group_matches[input.match_index];
if (id() < group.size())
group[id()].reset();
}
return ExecutionResult::Continue;
}
ALWAYS_INLINE ExecutionResult OpCode_SaveLeftCaptureGroup::execute(MatchInput const& input, MatchState& state) const
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
{
if (input.match_index >= state.capture_group_matches.size()) {
state.capture_group_matches.ensure_capacity(input.match_index);
auto capacity = state.capture_group_matches.capacity();
for (size_t i = state.capture_group_matches.size(); i <= capacity; ++i)
state.capture_group_matches.empend();
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
}
if (id() >= state.capture_group_matches.at(input.match_index).size()) {
state.capture_group_matches.at(input.match_index).ensure_capacity(id());
auto capacity = state.capture_group_matches.at(input.match_index).capacity();
for (size_t i = state.capture_group_matches.at(input.match_index).size(); i <= capacity; ++i)
state.capture_group_matches.at(input.match_index).empend();
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
}
state.capture_group_matches.at(input.match_index).at(id()).left_column = state.string_position;
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
return ExecutionResult::Continue;
}
ALWAYS_INLINE ExecutionResult OpCode_SaveRightCaptureGroup::execute(MatchInput const& input, MatchState& state) const
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
{
auto& match = state.capture_group_matches.at(input.match_index).at(id());
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
auto start_position = match.left_column;
if (state.string_position < start_position)
return ExecutionResult::Failed_ExecuteLowPrioForks;
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
auto length = state.string_position - start_position;
if (start_position < match.column)
return ExecutionResult::Continue;
VERIFY(start_position + length <= input.view.length());
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
auto view = input.view.substring_view(start_position, length);
if (input.regex_options & AllFlags::StringCopyMatches) {
match = { view.to_string(), input.line, start_position, input.global_offset + start_position }; // create a copy of the original string
} else {
match = { view, input.line, start_position, input.global_offset + start_position }; // take view to original string
}
return ExecutionResult::Continue;
}
ALWAYS_INLINE ExecutionResult OpCode_SaveRightNamedCaptureGroup::execute(MatchInput const& input, MatchState& state) const
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
{
auto& match = state.capture_group_matches.at(input.match_index).at(id());
auto start_position = match.left_column;
if (state.string_position < start_position)
return ExecutionResult::Failed_ExecuteLowPrioForks;
auto length = state.string_position - start_position;
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
if (start_position < match.column)
return ExecutionResult::Continue;
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
VERIFY(start_position + length <= input.view.length());
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
auto view = input.view.substring_view(start_position, length);
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
if (input.regex_options & AllFlags::StringCopyMatches) {
match = { view.to_string(), name(), input.line, start_position, input.global_offset + start_position }; // create a copy of the original string
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
} else {
match = { view, name(), input.line, start_position, input.global_offset + start_position }; // take view to original string
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
}
return ExecutionResult::Continue;
}
ALWAYS_INLINE ExecutionResult OpCode_Compare::execute(MatchInput const& input, MatchState& state) const
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
{
bool inverse { false };
bool temporary_inverse { false };
bool reset_temp_inverse { false };
auto current_inversion_state = [&]() -> bool { return temporary_inverse ^ inverse; };
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
size_t string_position = state.string_position;
bool inverse_matched { false };
bool had_zero_length_match { false };
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
state.string_position_before_match = state.string_position;
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
size_t offset { state.instruction_position + 3 };
for (size_t i = 0; i < arguments_count(); ++i) {
if (state.string_position > string_position)
break;
if (reset_temp_inverse) {
reset_temp_inverse = false;
temporary_inverse = false;
} else {
reset_temp_inverse = true;
}
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
auto compare_type = (CharacterCompareType)m_bytecode->at(offset++);
if (compare_type == CharacterCompareType::Inverse)
inverse = true;
else if (compare_type == CharacterCompareType::TemporaryInverse) {
// If "TemporaryInverse" is given, negate the current inversion state only for the next opcode.
// it follows that this cannot be the last compare element.
VERIFY(i != arguments_count() - 1);
temporary_inverse = true;
reset_temp_inverse = false;
} else if (compare_type == CharacterCompareType::Char) {
u32 ch = m_bytecode->at(offset++);
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
// We want to compare a string that is longer or equal in length to the available string
if (input.view.length() <= state.string_position)
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
return ExecutionResult::Failed_ExecuteLowPrioForks;
compare_char(input, state, ch, current_inversion_state(), inverse_matched);
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
} else if (compare_type == CharacterCompareType::AnyChar) {
// We want to compare a string that is definitely longer than the available string
if (input.view.length() <= state.string_position)
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
return ExecutionResult::Failed_ExecuteLowPrioForks;
VERIFY(!current_inversion_state());
advance_string_position(state, input.view);
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
} else if (compare_type == CharacterCompareType::String) {
VERIFY(!current_inversion_state());
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
2021-07-23 15:55:14 +00:00
auto const& length = m_bytecode->at(offset++);
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
// We want to compare a string that is definitely longer than the available string
if (input.view.length() < state.string_position + length)
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
return ExecutionResult::Failed_ExecuteLowPrioForks;
Optional<String> str;
Vector<u16, 1> utf16;
Vector<u32> data;
data.ensure_capacity(length);
for (size_t i = offset; i < offset + length; ++i)
data.unchecked_append(m_bytecode->at(i));
auto view = input.view.construct_as_same(data, str, utf16);
offset += length;
if (!compare_string(input, state, view, had_zero_length_match))
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
return ExecutionResult::Failed_ExecuteLowPrioForks;
} else if (compare_type == CharacterCompareType::CharClass) {
if (input.view.length() <= state.string_position)
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
return ExecutionResult::Failed_ExecuteLowPrioForks;
auto character_class = (CharClass)m_bytecode->at(offset++);
auto ch = input.view[state.string_position_in_code_units];
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
compare_character_class(input, state, character_class, ch, current_inversion_state(), inverse_matched);
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
} else if (compare_type == CharacterCompareType::LookupTable) {
if (input.view.length() <= state.string_position)
return ExecutionResult::Failed_ExecuteLowPrioForks;
auto count = m_bytecode->at(offset++);
auto range_data = m_bytecode->spans().slice(offset, count);
offset += count;
auto ch = input.view.substring_view(state.string_position, 1)[0];
auto const* matching_range = binary_search(range_data, ch, nullptr, [insensitive = input.regex_options & AllFlags::Insensitive](auto needle, CharRange range) {
auto from = range.from;
auto to = range.to;
if (insensitive) {
from = to_ascii_lowercase(from);
to = to_ascii_lowercase(to);
needle = to_ascii_lowercase(needle);
}
if (needle > range.to)
return 1;
if (needle < range.from)
return -1;
return 0;
});
if (matching_range) {
if (current_inversion_state())
inverse_matched = true;
else
advance_string_position(state, input.view, ch);
}
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
} else if (compare_type == CharacterCompareType::CharRange) {
if (input.view.length() <= state.string_position)
return ExecutionResult::Failed_ExecuteLowPrioForks;
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
auto value = (CharRange)m_bytecode->at(offset++);
auto from = value.from;
auto to = value.to;
auto ch = input.view[state.string_position_in_code_units];
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
compare_character_range(input, state, from, to, ch, current_inversion_state(), inverse_matched);
} else if (compare_type == CharacterCompareType::Reference) {
auto reference_number = (size_t)m_bytecode->at(offset++);
auto& groups = state.capture_group_matches.at(input.match_index);
if (groups.size() <= reference_number)
return ExecutionResult::Failed_ExecuteLowPrioForks;
auto str = groups.at(reference_number).view;
// We want to compare a string that is definitely longer than the available string
if (input.view.length() < state.string_position + str.length())
return ExecutionResult::Failed_ExecuteLowPrioForks;
if (!compare_string(input, state, str, had_zero_length_match))
return ExecutionResult::Failed_ExecuteLowPrioForks;
} else if (compare_type == CharacterCompareType::Property) {
auto property = static_cast<Unicode::Property>(m_bytecode->at(offset++));
compare_property(input, state, property, current_inversion_state(), inverse_matched);
} else if (compare_type == CharacterCompareType::GeneralCategory) {
auto general_category = static_cast<Unicode::GeneralCategory>(m_bytecode->at(offset++));
compare_general_category(input, state, general_category, current_inversion_state(), inverse_matched);
} else if (compare_type == CharacterCompareType::Script) {
auto script = static_cast<Unicode::Script>(m_bytecode->at(offset++));
compare_script(input, state, script, current_inversion_state(), inverse_matched);
} else if (compare_type == CharacterCompareType::ScriptExtension) {
auto script = static_cast<Unicode::Script>(m_bytecode->at(offset++));
compare_script_extension(input, state, script, current_inversion_state(), inverse_matched);
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
} else {
warnln("Undefined comparison: {}", (int)compare_type);
VERIFY_NOT_REACHED();
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
break;
}
}
if (current_inversion_state() && !inverse_matched)
advance_string_position(state, input.view);
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
if ((!had_zero_length_match && string_position == state.string_position) || state.string_position > input.view.length())
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
return ExecutionResult::Failed_ExecuteLowPrioForks;
return ExecutionResult::Continue;
}
2021-07-23 15:55:14 +00:00
ALWAYS_INLINE void OpCode_Compare::compare_char(MatchInput const& input, MatchState& state, u32 ch1, bool inverse, bool& inverse_matched)
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
{
if (state.string_position == input.view.length())
return;
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
auto input_view = input.view.substring_view(state.string_position, 1)[0];
bool equal;
if (input.regex_options & AllFlags::Insensitive)
equal = to_ascii_lowercase(input_view) == to_ascii_lowercase(ch1); // FIXME: Implement case-insensitive matching for non-ascii characters
else
equal = input_view == ch1;
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
if (equal) {
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
if (inverse)
inverse_matched = true;
else
advance_string_position(state, input.view, ch1);
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
}
}
ALWAYS_INLINE bool OpCode_Compare::compare_string(MatchInput const& input, MatchState& state, RegexStringView str, bool& had_zero_length_match)
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
{
if (state.string_position + str.length() > input.view.length()) {
if (str.is_empty()) {
had_zero_length_match = true;
return true;
}
return false;
}
if (str.length() == 0) {
had_zero_length_match = true;
return true;
}
auto subject = input.view.substring_view(state.string_position, str.length());
bool equals;
if (input.regex_options & AllFlags::Insensitive)
equals = subject.equals_ignoring_case(str);
else
equals = subject.equals(str);
if (equals)
advance_string_position(state, input.view, str);
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
return equals;
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
}
2021-07-23 15:55:14 +00:00
ALWAYS_INLINE void OpCode_Compare::compare_character_class(MatchInput const& input, MatchState& state, CharClass character_class, u32 ch, bool inverse, bool& inverse_matched)
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
{
switch (character_class) {
case CharClass::Alnum:
if (is_ascii_alphanumeric(ch)) {
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
if (inverse)
inverse_matched = true;
else
advance_string_position(state, input.view, ch);
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
}
break;
case CharClass::Alpha:
if (is_ascii_alpha(ch))
advance_string_position(state, input.view, ch);
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
break;
case CharClass::Blank:
if (is_ascii_blank(ch)) {
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
if (inverse)
inverse_matched = true;
else
advance_string_position(state, input.view, ch);
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
}
break;
case CharClass::Cntrl:
if (is_ascii_control(ch)) {
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
if (inverse)
inverse_matched = true;
else
advance_string_position(state, input.view, ch);
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
}
break;
case CharClass::Digit:
if (is_ascii_digit(ch)) {
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
if (inverse)
inverse_matched = true;
else
advance_string_position(state, input.view, ch);
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
}
break;
case CharClass::Graph:
if (is_ascii_graphical(ch)) {
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
if (inverse)
inverse_matched = true;
else
advance_string_position(state, input.view, ch);
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
}
break;
case CharClass::Lower:
if (is_ascii_lower_alpha(ch) || ((input.regex_options & AllFlags::Insensitive) && is_ascii_upper_alpha(ch))) {
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
if (inverse)
inverse_matched = true;
else
advance_string_position(state, input.view, ch);
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
}
break;
case CharClass::Print:
if (is_ascii_printable(ch)) {
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
if (inverse)
inverse_matched = true;
else
advance_string_position(state, input.view, ch);
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
}
break;
case CharClass::Punct:
if (is_ascii_punctuation(ch)) {
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
if (inverse)
inverse_matched = true;
else
advance_string_position(state, input.view, ch);
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
}
break;
case CharClass::Space:
if (is_ascii_space(ch)) {
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
if (inverse)
inverse_matched = true;
else
advance_string_position(state, input.view, ch);
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
}
break;
case CharClass::Upper:
if (is_ascii_upper_alpha(ch) || ((input.regex_options & AllFlags::Insensitive) && is_ascii_lower_alpha(ch))) {
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
if (inverse)
inverse_matched = true;
else
advance_string_position(state, input.view, ch);
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
}
break;
case CharClass::Word:
if (is_ascii_alphanumeric(ch) || ch == '_') {
if (inverse)
inverse_matched = true;
else
advance_string_position(state, input.view, ch);
}
break;
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
case CharClass::Xdigit:
if (is_ascii_hex_digit(ch)) {
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
if (inverse)
inverse_matched = true;
else
advance_string_position(state, input.view, ch);
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
}
break;
}
}
2021-07-23 15:55:14 +00:00
ALWAYS_INLINE void OpCode_Compare::compare_character_range(MatchInput const& input, MatchState& state, u32 from, u32 to, u32 ch, bool inverse, bool& inverse_matched)
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
{
if (input.regex_options & AllFlags::Insensitive) {
from = to_ascii_lowercase(from);
to = to_ascii_lowercase(to);
ch = to_ascii_lowercase(ch);
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
}
if (ch >= from && ch <= to) {
if (inverse)
inverse_matched = true;
else
advance_string_position(state, input.view, ch);
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
}
}
ALWAYS_INLINE void OpCode_Compare::compare_property(MatchInput const& input, MatchState& state, Unicode::Property property, bool inverse, bool& inverse_matched)
{
if (state.string_position == input.view.length())
return;
u32 code_point = input.view[state.string_position_in_code_units];
bool equal = Unicode::code_point_has_property(code_point, property);
if (equal) {
if (inverse)
inverse_matched = true;
else
advance_string_position(state, input.view, code_point);
}
}
ALWAYS_INLINE void OpCode_Compare::compare_general_category(MatchInput const& input, MatchState& state, Unicode::GeneralCategory general_category, bool inverse, bool& inverse_matched)
{
if (state.string_position == input.view.length())
return;
u32 code_point = input.view[state.string_position_in_code_units];
bool equal = Unicode::code_point_has_general_category(code_point, general_category);
if (equal) {
if (inverse)
inverse_matched = true;
else
advance_string_position(state, input.view, code_point);
}
}
ALWAYS_INLINE void OpCode_Compare::compare_script(MatchInput const& input, MatchState& state, Unicode::Script script, bool inverse, bool& inverse_matched)
{
if (state.string_position == input.view.length())
return;
u32 code_point = input.view[state.string_position_in_code_units];
bool equal = Unicode::code_point_has_script(code_point, script);
if (equal) {
if (inverse)
inverse_matched = true;
else
advance_string_position(state, input.view, code_point);
}
}
ALWAYS_INLINE void OpCode_Compare::compare_script_extension(MatchInput const& input, MatchState& state, Unicode::Script script, bool inverse, bool& inverse_matched)
{
if (state.string_position == input.view.length())
return;
u32 code_point = input.view[state.string_position_in_code_units];
bool equal = Unicode::code_point_has_script_extension(code_point, script);
if (equal) {
if (inverse)
inverse_matched = true;
else
advance_string_position(state, input.view, code_point);
}
}
2021-07-23 15:55:14 +00:00
String const OpCode_Compare::arguments_string() const
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
{
return String::formatted("argc={}, args={} ", arguments_count(), arguments_size());
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
}
Vector<CompareTypeAndValuePair> OpCode_Compare::flat_compares() const
{
Vector<CompareTypeAndValuePair> result;
size_t offset { state().instruction_position + 3 };
for (size_t i = 0; i < arguments_count(); ++i) {
auto compare_type = (CharacterCompareType)m_bytecode->at(offset++);
if (compare_type == CharacterCompareType::Char) {
auto ch = m_bytecode->at(offset++);
result.append({ compare_type, ch });
} else if (compare_type == CharacterCompareType::Reference) {
auto ref = m_bytecode->at(offset++);
result.append({ compare_type, ref });
} else if (compare_type == CharacterCompareType::String) {
auto& length = m_bytecode->at(offset++);
if (length > 0)
result.append({ compare_type, m_bytecode->at(offset) });
StringBuilder str_builder;
offset += length;
} else if (compare_type == CharacterCompareType::CharClass) {
auto character_class = m_bytecode->at(offset++);
result.append({ compare_type, character_class });
} else if (compare_type == CharacterCompareType::CharRange) {
auto value = m_bytecode->at(offset++);
result.append({ compare_type, value });
} else if (compare_type == CharacterCompareType::LookupTable) {
auto count = m_bytecode->at(offset++);
for (size_t i = 0; i < count; ++i)
result.append({ CharacterCompareType::CharRange, m_bytecode->at(offset++) });
} else {
result.append({ compare_type, 0 });
}
}
return result;
}
2021-07-23 15:55:14 +00:00
Vector<String> const OpCode_Compare::variable_arguments_to_string(Optional<MatchInput> input) const
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
{
Vector<String> result;
size_t offset { state().instruction_position + 3 };
RegexStringView view = ((input.has_value()) ? input.value().view : nullptr);
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
for (size_t i = 0; i < arguments_count(); ++i) {
auto compare_type = (CharacterCompareType)m_bytecode->at(offset++);
result.empend(String::formatted("type={} [{}]", (size_t)compare_type, character_compare_type_name(compare_type)));
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
auto string_start_offset = state().string_position_before_match;
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
if (compare_type == CharacterCompareType::Char) {
auto ch = m_bytecode->at(offset++);
auto is_ascii = is_ascii_printable(ch);
if (is_ascii)
result.empend(String::formatted("value='{:c}'", static_cast<char>(ch)));
else
result.empend(String::formatted("value={:x}", ch));
if (!view.is_null() && view.length() > string_start_offset) {
if (is_ascii) {
result.empend(String::formatted(
"compare against: '{}'",
view.substring_view(string_start_offset, string_start_offset > view.length() ? 0 : 1).to_string()));
} else {
auto str = view.substring_view(string_start_offset, string_start_offset > view.length() ? 0 : 1).to_string();
u8 buf[8] { 0 };
__builtin_memcpy(buf, str.characters(), min(str.length(), sizeof(buf)));
result.empend(String::formatted("compare against: {:x},{:x},{:x},{:x},{:x},{:x},{:x},{:x}",
buf[0], buf[1], buf[2], buf[3], buf[4], buf[5], buf[6], buf[7]));
}
}
} else if (compare_type == CharacterCompareType::Reference) {
auto ref = m_bytecode->at(offset++);
result.empend(String::formatted("number={}", ref));
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
} else if (compare_type == CharacterCompareType::String) {
auto& length = m_bytecode->at(offset++);
StringBuilder str_builder;
for (size_t i = 0; i < length; ++i)
str_builder.append(m_bytecode->at(offset++));
result.empend(String::formatted("value=\"{}\"", str_builder.string_view().substring_view(0, length)));
if (!view.is_null() && view.length() > state().string_position)
result.empend(String::formatted(
"compare against: \"{}\"",
input.value().view.substring_view(string_start_offset, string_start_offset + length > view.length() ? 0 : length).to_string()));
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
} else if (compare_type == CharacterCompareType::CharClass) {
auto character_class = (CharClass)m_bytecode->at(offset++);
result.empend(String::formatted("ch_class={} [{}]", (size_t)character_class, character_class_name(character_class)));
if (!view.is_null() && view.length() > state().string_position)
result.empend(String::formatted(
"compare against: '{}'",
input.value().view.substring_view(string_start_offset, state().string_position > view.length() ? 0 : 1).to_string()));
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
} else if (compare_type == CharacterCompareType::CharRange) {
auto value = (CharRange)m_bytecode->at(offset++);
result.empend(String::formatted("ch_range={:x}-{:x}", value.from, value.to));
if (!view.is_null() && view.length() > state().string_position)
result.empend(String::formatted(
"compare against: '{}'",
input.value().view.substring_view(string_start_offset, state().string_position > view.length() ? 0 : 1).to_string()));
} else if (compare_type == CharacterCompareType::LookupTable) {
auto count = m_bytecode->at(offset++);
for (size_t j = 0; j < count; ++j) {
auto range = (CharRange)m_bytecode->at(offset++);
result.append(String::formatted("{:x}-{:x}", range.from, range.to));
}
if (!view.is_null() && view.length() > state().string_position)
result.empend(String::formatted(
"compare against: '{}'",
input.value().view.substring_view(string_start_offset, state().string_position > view.length() ? 0 : 1).to_string()));
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
}
}
return result;
}
ALWAYS_INLINE ExecutionResult OpCode_Repeat::execute(MatchInput const&, MatchState& state) const
{
VERIFY(count() > 0);
if (id() >= state.repetition_marks.size())
state.repetition_marks.resize(id() + 1);
auto& repetition_mark = state.repetition_marks.at(id());
if (repetition_mark == count() - 1) {
repetition_mark = 0;
} else {
state.instruction_position -= offset() + size();
++repetition_mark;
}
return ExecutionResult::Continue;
}
ALWAYS_INLINE ExecutionResult OpCode_ResetRepeat::execute(MatchInput const&, MatchState& state) const
{
if (id() >= state.repetition_marks.size())
state.repetition_marks.resize(id() + 1);
state.repetition_marks.at(id()) = 0;
return ExecutionResult::Continue;
}
ALWAYS_INLINE ExecutionResult OpCode_Checkpoint::execute(MatchInput const& input, MatchState& state) const
{
input.checkpoints.set(state.instruction_position, state.string_position);
return ExecutionResult::Continue;
}
ALWAYS_INLINE ExecutionResult OpCode_JumpNonEmpty::execute(MatchInput const& input, MatchState& state) const
{
auto current_position = state.string_position;
auto checkpoint_ip = state.instruction_position + size() + checkpoint();
if (input.checkpoints.get(checkpoint_ip).value_or(current_position) != current_position) {
auto form = this->form();
if (form == OpCodeId::Jump) {
state.instruction_position += offset();
return ExecutionResult::Continue;
}
state.fork_at_position = state.instruction_position + size() + offset();
if (form == OpCodeId::ForkJump)
return ExecutionResult::Fork_PrioHigh;
if (form == OpCodeId::ForkStay)
return ExecutionResult::Fork_PrioLow;
if (form == OpCodeId::ForkReplaceStay) {
input.fork_to_replace = state.instruction_position;
return ExecutionResult::Fork_PrioLow;
}
if (form == OpCodeId::ForkReplaceJump) {
input.fork_to_replace = state.instruction_position;
return ExecutionResult::Fork_PrioHigh;
}
}
return ExecutionResult::Continue;
}
LibRegex: Add a regular expression library This commit is a mix of several commits, squashed into one because the commits before 'Move regex to own Library and fix all the broken stuff' were not fixable in any elegant way. The commits are listed below for "historical" purposes: - AK: Add options/flags and Errors for regular expressions Flags can be provided for any possible flavour by adding a new scoped enum. Handling of flags is done by templated Options class and the overloaded '|' and '&' operators. - AK: Add Lexer for regular expressions The lexer parses the input and extracts tokens needed to parse a regular expression. - AK: Add regex Parser and PosixExtendedParser This patchset adds a abstract parser class that can be derived to implement different parsers. A parser produces bytecode to be executed within the regex matcher. - AK: Add regex matcher This patchset adds an regex matcher based on the principles of the T-REX VM. The bytecode pruduced by the respective Parser is put into the matcher and the VM will recursively execute the bytecode according to the available OpCodes. Possible improvement: the recursion could be replaced by multi threading capabilities. To match a Regular expression, e.g. for the Posix standard regular expression matcher use the following API: ``` Pattern<PosixExtendedParser> pattern("^.*$"); auto result = pattern.match("Well, hello friends!\nHello World!"); // Match whole needle EXPECT(result.count == 1); EXPECT(result.matches.at(0).view.starts_with("Well")); EXPECT(result.matches.at(0).view.end() == "!"); result = pattern.match("Well, hello friends!\nHello World!", PosixFlags::Multiline); // Match line by line EXPECT(result.count == 2); EXPECT(result.matches.at(0).view == "Well, hello friends!"); EXPECT(result.matches.at(1).view == "Hello World!"); EXPECT(pattern.has_match("Well,....")); // Just check if match without a result, which saves some resources. ``` - AK: Rework regex to work with opcodes objects This patchsets reworks the matcher to work on a more structured base. For that an abstract OpCode class and derived classes for the specific OpCodes have been added. The respective opcode logic is contained in each respective execute() method. - AK: Add benchmark for regex - AK: Some optimization in regex for runtime and memory - LibRegex: Move regex to own Library and fix all the broken stuff Now regex works again and grep utility is also in place for testing. This commit also fixes the use of regex.h in C by making `regex_t` an opaque (-ish) type, which makes its behaviour consistent between C and C++ compilers. Previously, <regex.h> would've blown C compilers up, and even if it didn't, would've caused a leak in C code, and not in C++ code (due to the existence of `OwnPtr` inside the struct). To make this whole ordeal easier to deal with (for now), this pulls the definitions of `reg*()` into LibRegex. pros: - The circular dependency between LibC and LibRegex is broken - Eaiser to test (without accidentally pulling in the host's libc!) cons: - Using any of the regex.h functions will require the user to link -lregex - The symbols will be missing from libc, which will be a big surprise down the line (especially with shared libs). Co-Authored-By: Ali Mohammad Pur <ali.mpfard@gmail.com>
2020-04-26 12:45:10 +00:00
}