Commit 52eb4f17 authored by Sergei Golubchik

Merge branch 'merge-pcre' into 10.1

parents 1389c94b 879f7e85
...@@ -8,7 +8,7 @@ Email domain: cam.ac.uk ...@@ -8,7 +8,7 @@ Email domain: cam.ac.uk
University of Cambridge Computing Service, University of Cambridge Computing Service,
Cambridge, England. Cambridge, England.
Copyright (c) 1997-2018 University of Cambridge Copyright (c) 1997-2019 University of Cambridge
All rights reserved All rights reserved
...@@ -19,7 +19,7 @@ Written by: Zoltan Herczeg ...@@ -19,7 +19,7 @@ Written by: Zoltan Herczeg
Email local part: hzmester Email local part: hzmester
Emain domain: freemail.hu Emain domain: freemail.hu
Copyright(c) 2010-2018 Zoltan Herczeg Copyright(c) 2010-2019 Zoltan Herczeg
All rights reserved. All rights reserved.
...@@ -30,7 +30,7 @@ Written by: Zoltan Herczeg ...@@ -30,7 +30,7 @@ Written by: Zoltan Herczeg
Email local part: hzmester Email local part: hzmester
Emain domain: freemail.hu Emain domain: freemail.hu
Copyright(c) 2009-2018 Zoltan Herczeg Copyright(c) 2009-2019 Zoltan Herczeg
All rights reserved. All rights reserved.
......
...@@ -5,6 +5,49 @@ Note that the PCRE 8.xx series (PCRE1) is now in a bugfix-only state. All ...@@ -5,6 +5,49 @@ Note that the PCRE 8.xx series (PCRE1) is now in a bugfix-only state. All
development is happening in the PCRE2 10.xx series. development is happening in the PCRE2 10.xx series.
Version 8.43 23-February-2019
-----------------------------
1. Some time ago the config macro SUPPORT_UTF8 was changed to SUPPORT_UTF
because it also applies to UTF-16 and UTF-32. However, this change was not made
in the pcre2cpp files; consequently the C++ wrapper has from then been compiled
with a bug in it, which would have been picked up by the unit test except that
it also had its UTF8 code cut out. The bug was in a global replace when moving
forward after matching an empty string.
2. The C++ wrapper got broken a long time ago (version 7.3, August 2007) when
(*CR) was invented (assuming it was the first such start-of-pattern option).
The wrapper could never handle such patterns because it wraps patterns in
(?:...)\z in order to support end anchoring. I have hacked in some code to fix
this, that is, move the wrapping till after any existing start-of-pattern
special settings.
3. "pcre2grep" (sic) was accidentally mentioned in an error message (fix was
ported from PCRE2).
4. Typo LCC_ALL for LC_ALL fixed in pcregrep.
5. In a pattern such as /[^\x{100}-\x{ffff}]*[\x80-\xff]/ which has a repeated
negative class with no characters less than 0x100 followed by a positive class
with only characters less than 0x100, the first class was incorrectly being
auto-possessified, causing incorrect match failures.
6. If the only branch in a conditional subpattern was anchored, the whole
subpattern was treated as anchored, when it should not have been, since the
assumed empty second branch cannot be anchored. Demonstrated by test patterns
such as /(?(1)^())b/ or /(?(?=^))b/.
7. Fix subject buffer overread in JIT when UTF is disabled and \X or \R has
a greater than 1 fixed quantifier. This issue was found by Yunho Kim.
8. If a pattern started with a subroutine call that had a quantifier with a
minimum of zero, an incorrect "match must start with this character" could be
recorded. Example: /(?&xxx)*ABC(?<xxx>XYZ)/ would (incorrectly) expect 'A' to
be the first character of a match.
9. Improve MAP_JIT flag usage on MacOS. Patch by Rich Siegel.
Version 8.42 20-March-2018 Version 8.42 20-March-2018
-------------------------- --------------------------
......
...@@ -25,7 +25,7 @@ Email domain: cam.ac.uk ...@@ -25,7 +25,7 @@ Email domain: cam.ac.uk
University of Cambridge Computing Service, University of Cambridge Computing Service,
Cambridge, England. Cambridge, England.
Copyright (c) 1997-2018 University of Cambridge Copyright (c) 1997-2019 University of Cambridge
All rights reserved. All rights reserved.
...@@ -34,9 +34,9 @@ PCRE JUST-IN-TIME COMPILATION SUPPORT ...@@ -34,9 +34,9 @@ PCRE JUST-IN-TIME COMPILATION SUPPORT
Written by: Zoltan Herczeg Written by: Zoltan Herczeg
Email local part: hzmester Email local part: hzmester
Emain domain: freemail.hu Email domain: freemail.hu
Copyright(c) 2010-2018 Zoltan Herczeg Copyright(c) 2010-2019 Zoltan Herczeg
All rights reserved. All rights reserved.
...@@ -45,9 +45,9 @@ STACK-LESS JUST-IN-TIME COMPILER ...@@ -45,9 +45,9 @@ STACK-LESS JUST-IN-TIME COMPILER
Written by: Zoltan Herczeg Written by: Zoltan Herczeg
Email local part: hzmester Email local part: hzmester
Emain domain: freemail.hu Email domain: freemail.hu
Copyright(c) 2009-2018 Zoltan Herczeg Copyright(c) 2009-2019 Zoltan Herczeg
All rights reserved. All rights reserved.
......
News about PCRE releases News about PCRE releases
------------------------ ------------------------
Note that this library (now called PCRE1) is now being maintained for bug fixes
only. New projects are advised to use the new PCRE2 libraries.
Release 8.43 23-February-2019
-----------------------------
This is a bug-fix release.
Release 8.42 20-March-2018 Release 8.42 20-March-2018
-------------------------- --------------------------
......
...@@ -9,17 +9,17 @@ dnl The PCRE_PRERELEASE feature is for identifying release candidates. It might ...@@ -9,17 +9,17 @@ dnl The PCRE_PRERELEASE feature is for identifying release candidates. It might
dnl be defined as -RC2, for example. For real releases, it should be empty. dnl be defined as -RC2, for example. For real releases, it should be empty.
m4_define(pcre_major, [8]) m4_define(pcre_major, [8])
m4_define(pcre_minor, [42]) m4_define(pcre_minor, [43])
m4_define(pcre_prerelease, []) m4_define(pcre_prerelease, [])
m4_define(pcre_date, [2018-03-20]) m4_define(pcre_date, [2019-02-23])
# NOTE: The CMakeLists.txt file searches for the above variables in the first # NOTE: The CMakeLists.txt file searches for the above variables in the first
# 50 lines of this file. Please update that if the variables above are moved. # 50 lines of this file. Please update that if the variables above are moved.
# Libtool shared library interface versions (current:revision:age) # Libtool shared library interface versions (current:revision:age)
m4_define(libpcre_version, [3:10:2]) m4_define(libpcre_version, [3:11:2])
m4_define(libpcre16_version, [2:10:2]) m4_define(libpcre16_version, [2:11:2])
m4_define(libpcre32_version, [0:10:0]) m4_define(libpcre32_version, [0:11:0])
m4_define(libpcreposix_version, [0:6:0]) m4_define(libpcreposix_version, [0:6:0])
m4_define(libpcrecpp_version, [0:1:0]) m4_define(libpcrecpp_version, [0:1:0])
......
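The libtool triples bumped in this hunk use the current:revision:age scheme. As a minimal sketch, assuming the usual Linux/ELF naming convention (other platforms map the triple differently), the file suffix can be derived as below. `LibtoolVersion` and `linux_suffix` are hypothetical names used only for illustration; they are not part of PCRE or libtool.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper type for illustration; not part of PCRE or libtool.
struct LibtoolVersion {
  unsigned current;   // newest interface number implemented
  unsigned revision;  // implementation revision of the current interface
  unsigned age;       // how many older interfaces are still supported
};

// Assumption: Linux/ELF naming, where the library file is called
// <name>.so.(current - age).age.revision
std::string linux_suffix(const LibtoolVersion& v) {
  assert(v.current >= v.age);  // age can never exceed current
  return ".so." + std::to_string(v.current - v.age) + "." +
         std::to_string(v.age) + "." + std::to_string(v.revision);
}
```

Under that assumption, changing libpcre from 3:10:2 to 3:11:2 only moves the last component of the suffix (`.so.1.2.10` to `.so.1.2.11`), which fits a bug-fix release that leaves the interface unchanged.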
...@@ -6,7 +6,7 @@ ...@@ -6,7 +6,7 @@
and semantics are as close as possible to those of the Perl 5 language. and semantics are as close as possible to those of the Perl 5 language.
Written by Philip Hazel Written by Philip Hazel
Copyright (c) 1997-2016 University of Cambridge Copyright (c) 1997-2018 University of Cambridge
----------------------------------------------------------------------------- -----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without Redistribution and use in source and binary forms, with or without
...@@ -3300,7 +3300,7 @@ for(;;) ...@@ -3300,7 +3300,7 @@ for(;;)
if ((*xclass_flags & XCL_MAP) == 0) if ((*xclass_flags & XCL_MAP) == 0)
{ {
/* No bits are set for characters < 256. */ /* No bits are set for characters < 256. */
if (list[1] == 0) return TRUE; if (list[1] == 0) return (*xclass_flags & XCL_NOT) == 0;
/* Might be an empty repeat. */ /* Might be an empty repeat. */
continue; continue;
} }
...@@ -7645,6 +7645,8 @@ for (;; ptr++) ...@@ -7645,6 +7645,8 @@ for (;; ptr++)
/* Can't determine a first byte now */ /* Can't determine a first byte now */
if (firstcharflags == REQ_UNSET) firstcharflags = REQ_NONE; if (firstcharflags == REQ_UNSET) firstcharflags = REQ_NONE;
zerofirstchar = firstchar;
zerofirstcharflags = firstcharflags;
continue; continue;
...@@ -8685,10 +8687,18 @@ do { ...@@ -8685,10 +8687,18 @@ do {
if (!is_anchored(scode, new_map, cd, atomcount)) return FALSE; if (!is_anchored(scode, new_map, cd, atomcount)) return FALSE;
} }
/* Positive forward assertions and conditions */ /* Positive forward assertion */
else if (op == OP_ASSERT || op == OP_COND) else if (op == OP_ASSERT)
{
if (!is_anchored(scode, bracket_map, cd, atomcount)) return FALSE;
}
/* Condition; not anchored if no second branch */
else if (op == OP_COND)
{ {
if (scode[GET(scode,1)] != OP_ALT) return FALSE;
if (!is_anchored(scode, bracket_map, cd, atomcount)) return FALSE; if (!is_anchored(scode, bracket_map, cd, atomcount)) return FALSE;
} }
......
...@@ -9002,7 +9002,7 @@ if (exact > 1) ...@@ -9002,7 +9002,7 @@ if (exact > 1)
#ifdef SUPPORT_UTF #ifdef SUPPORT_UTF
&& !common->utf && !common->utf
#endif #endif
) && type != OP_ANYNL && type != OP_EXTUNI)
{ {
OP2(SLJIT_ADD, TMP1, 0, STR_PTR, 0, SLJIT_IMM, IN_UCHARS(exact)); OP2(SLJIT_ADD, TMP1, 0, STR_PTR, 0, SLJIT_IMM, IN_UCHARS(exact));
add_jump(compiler, &backtrack->topbacktracks, CMP(SLJIT_GREATER, TMP1, 0, STR_END, 0)); add_jump(compiler, &backtrack->topbacktracks, CMP(SLJIT_GREATER, TMP1, 0, STR_END, 0));
......
...@@ -80,6 +80,24 @@ static const string empty_string; ...@@ -80,6 +80,24 @@ static const string empty_string;
// If the user doesn't ask for any options, we just use this one // If the user doesn't ask for any options, we just use this one
static RE_Options default_options; static RE_Options default_options;
// Specials for the start of patterns. See comments where start_options is used
// below. (PH June 2018)
static const char *start_options[] = {
"(*UTF8)",
"(*UTF)",
"(*UCP)",
"(*NO_START_OPT)",
"(*NO_AUTO_POSSESS)",
"(*LIMIT_RECURSION=",
"(*LIMIT_MATCH=",
"(*CRLF)",
"(*CR)",
"(*BSR_UNICODE)",
"(*BSR_ANYCRLF)",
"(*ANYCRLF)",
"(*ANY)",
"" };
void RE::Init(const string& pat, const RE_Options* options) { void RE::Init(const string& pat, const RE_Options* options) {
pattern_ = pat; pattern_ = pat;
if (options == NULL) { if (options == NULL) {
...@@ -135,7 +153,49 @@ pcre* RE::Compile(Anchor anchor) { ...@@ -135,7 +153,49 @@ pcre* RE::Compile(Anchor anchor) {
} else { } else {
// Tack a '\z' at the end of RE. Parenthesize it first so that // Tack a '\z' at the end of RE. Parenthesize it first so that
// the '\z' applies to all top-level alternatives in the regexp. // the '\z' applies to all top-level alternatives in the regexp.
string wrapped = "(?:"; // A non-counting grouping operator
/* When this code was written (for PCRE 6.0) it was enough just to
parenthesize the entire pattern. Unfortunately, when the feature of
starting patterns with (*UTF8) or (*CR) etc. was added to PCRE patterns,
this code was never updated. This bug was not noticed till 2018, long after
PCRE became obsolescent and its maintainer no longer around. Since PCRE is
frozen, I have added a hack to check for all the existing "start of
pattern" specials - knowing that no new ones will ever be added. I am not a
C++ programmer, so the code style is no doubt crude. It is also
inefficient, but is only run when the pattern starts with "(*".
PH June 2018. */
string wrapped = "";
if (pattern_.c_str()[0] == '(' && pattern_.c_str()[1] == '*') {
int kk, klen, kmat;
for (;;) { // Loop for any number of leading items
for (kk = 0; start_options[kk][0] != 0; kk++) {
klen = strlen(start_options[kk]);
kmat = strncmp(pattern_.c_str(), start_options[kk], klen);
if (kmat >= 0) break;
}
if (kmat != 0) break; // Not found
// If the item ended in "=" we must copy digits up to ")".
if (start_options[kk][klen-1] == '=') {
while (isdigit(pattern_.c_str()[klen])) klen++;
if (pattern_.c_str()[klen] != ')') break; // Syntax error
klen++;
}
// Move the item from the pattern to the start of the wrapped string.
wrapped += pattern_.substr(0, klen);
pattern_.erase(0, klen);
}
}
// Wrap the rest of the pattern.
wrapped += "(?:"; // A non-counting grouping operator
wrapped += pattern_; wrapped += pattern_;
wrapped += ")\\z"; wrapped += ")\\z";
re = pcre_compile(wrapped.c_str(), pcre_options, re = pcre_compile(wrapped.c_str(), pcre_options,
...@@ -415,7 +475,7 @@ int RE::GlobalReplace(const StringPiece& rewrite, ...@@ -415,7 +475,7 @@ int RE::GlobalReplace(const StringPiece& rewrite,
matchend++; matchend++;
} }
// We also need to advance more than one char if we're in utf8 mode. // We also need to advance more than one char if we're in utf8 mode.
#ifdef SUPPORT_UTF8 #ifdef SUPPORT_UTF
if (options_.utf8()) { if (options_.utf8()) {
while (matchend < static_cast<int>(str->length()) && while (matchend < static_cast<int>(str->length()) &&
((*str)[matchend] & 0xc0) == 0x80) ((*str)[matchend] & 0xc0) == 0x80)
......
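The GlobalReplace hunk above re-enables the code that steps over a whole UTF-8 character after an empty match. The following is a simplified stand-alone sketch of that stepping logic, not the pcrecpp implementation: `advance_one_char` is a hypothetical helper, and the CRLF handling that precedes this code in the real function is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: after an empty match at `pos`, return the offset at
// which the next match attempt should start.
std::size_t advance_one_char(const std::string& s, std::size_t pos, bool utf8) {
  assert(pos <= s.size());
  if (pos < s.size()) pos++;  // always step over at least one byte
  if (utf8) {
    // UTF-8 continuation bytes have the bit pattern 10xxxxxx, so keep
    // stepping while the next byte is a continuation byte.
    while (pos < s.size() &&
           (static_cast<unsigned char>(s[pos]) & 0xc0) == 0x80)
      pos++;
  }
  return pos;
}
```

With the `#ifdef` testing the obsolete SUPPORT_UTF8 macro, the `utf8` branch was compiled out, which corresponds to always calling this sketch with `utf8 == false` and landing in the middle of a multi-byte character.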
...@@ -309,7 +309,7 @@ static void TestReplace() { ...@@ -309,7 +309,7 @@ static void TestReplace() {
"@aa", "@aa",
"@@@", "@@@",
3 }, 3 },
#ifdef SUPPORT_UTF8 #ifdef SUPPORT_UTF
{ "b*", { "b*",
"bb", "bb",
"\xE3\x83\x9B\xE3\x83\xBC\xE3\x83\xA0\xE3\x81\xB8", // utf8 "\xE3\x83\x9B\xE3\x83\xBC\xE3\x83\xA0\xE3\x81\xB8", // utf8
...@@ -327,7 +327,7 @@ static void TestReplace() { ...@@ -327,7 +327,7 @@ static void TestReplace() {
{ "", NULL, NULL, NULL, NULL, 0 } { "", NULL, NULL, NULL, NULL, 0 }
}; };
#ifdef SUPPORT_UTF8 #ifdef SUPPORT_UTF
const bool support_utf8 = true; const bool support_utf8 = true;
#else #else
const bool support_utf8 = false; const bool support_utf8 = false;
...@@ -535,7 +535,7 @@ static void TestQuoteMetaLatin1() { ...@@ -535,7 +535,7 @@ static void TestQuoteMetaLatin1() {
} }
static void TestQuoteMetaUtf8() { static void TestQuoteMetaUtf8() {
#ifdef SUPPORT_UTF8 #ifdef SUPPORT_UTF
TestQuoteMeta("Pl\xc3\xa1\x63ido Domingo", pcrecpp::UTF8()); TestQuoteMeta("Pl\xc3\xa1\x63ido Domingo", pcrecpp::UTF8());
TestQuoteMeta("xyz", pcrecpp::UTF8()); // No fancy utf8 TestQuoteMeta("xyz", pcrecpp::UTF8()); // No fancy utf8
TestQuoteMeta("\xc2\xb0", pcrecpp::UTF8()); // 2-byte utf8 (degree symbol) TestQuoteMeta("\xc2\xb0", pcrecpp::UTF8()); // 2-byte utf8 (degree symbol)
...@@ -1178,7 +1178,7 @@ int main(int argc, char** argv) { ...@@ -1178,7 +1178,7 @@ int main(int argc, char** argv) {
CHECK(re.error().empty()); // Must have no error CHECK(re.error().empty()); // Must have no error
} }
#ifdef SUPPORT_UTF8 #ifdef SUPPORT_UTF
// Check UTF-8 handling // Check UTF-8 handling
{ {
printf("Testing UTF-8 handling\n"); printf("Testing UTF-8 handling\n");
...@@ -1203,6 +1203,30 @@ int main(int argc, char** argv) { ...@@ -1203,6 +1203,30 @@ int main(int argc, char** argv) {
RE re_test2("...", pcrecpp::UTF8()); RE re_test2("...", pcrecpp::UTF8());
CHECK(re_test2.FullMatch(utf8_string)); CHECK(re_test2.FullMatch(utf8_string));
// PH added these tests for leading option settings
RE re_testZ0("(*CR)(*NO_START_OPT).........");
CHECK(re_testZ0.FullMatch(utf8_string));
#ifdef SUPPORT_UTF
RE re_testZ1("(*UTF8)...");
CHECK(re_testZ1.FullMatch(utf8_string));
RE re_testZ2("(*UTF)...");
CHECK(re_testZ2.FullMatch(utf8_string));
#ifdef SUPPORT_UCP
RE re_testZ3("(*UCP)(*UTF)...");
CHECK(re_testZ3.FullMatch(utf8_string));
RE re_testZ4("(*UCP)(*LIMIT_MATCH=1000)(*UTF)...");
CHECK(re_testZ4.FullMatch(utf8_string));
RE re_testZ5("(*UCP)(*LIMIT_MATCH=1000)(*ANY)(*UTF)...");
CHECK(re_testZ5.FullMatch(utf8_string));
#endif
#endif
// Check that '.' matches one byte or UTF-8 character // Check that '.' matches one byte or UTF-8 character
// according to the mode. // according to the mode.
string ss; string ss;
...@@ -1248,7 +1272,7 @@ int main(int argc, char** argv) { ...@@ -1248,7 +1272,7 @@ int main(int argc, char** argv) {
CHECK(!match_sentence.FullMatch(target)); CHECK(!match_sentence.FullMatch(target));
CHECK(!match_sentence_re.FullMatch(target)); CHECK(!match_sentence_re.FullMatch(target));
} }
#endif /* def SUPPORT_UTF8 */ #endif /* def SUPPORT_UTF */
printf("Testing error reporting\n"); printf("Testing error reporting\n");
......
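The leading-option tests above exercise the new wrapping step in RE::Compile. As a rough stand-alone sketch of the idea (hoist any recognised start-of-pattern items, then wrap the remainder in `(?:...)\z`), the logic could look like the code below. `wrap_for_full_match` is a hypothetical function written for illustration; it uses exact prefix comparison rather than the committed code's `strncmp` loop and does not call PCRE.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical stand-alone version of the wrapping step; not pcrecpp's API.
std::string wrap_for_full_match(std::string pattern) {
  // Same items as the start_options table added in the commit.
  static const char* const kStartOptions[] = {
      "(*UTF8)", "(*UTF)", "(*UCP)", "(*NO_START_OPT)", "(*NO_AUTO_POSSESS)",
      "(*LIMIT_RECURSION=", "(*LIMIT_MATCH=", "(*CRLF)", "(*CR)",
      "(*BSR_UNICODE)", "(*BSR_ANYCRLF)", "(*ANYCRLF)", "(*ANY)"};

  std::string wrapped;
  bool found = true;
  while (found) {  // loop for any number of leading items
    found = false;
    for (const char* opt : kStartOptions) {
      const std::string item(opt);
      if (pattern.compare(0, item.size(), item) != 0) continue;
      std::size_t klen = item.size();
      if (item.back() == '=') {
        // Numeric setting: take the digits and require a closing ')'.
        while (klen < pattern.size() &&
               std::isdigit(static_cast<unsigned char>(pattern[klen])))
          klen++;
        if (klen >= pattern.size() || pattern[klen] != ')') break;  // syntax error
        klen++;
      }
      assert(klen <= pattern.size());
      // Move the item from the pattern to the front of the wrapped string.
      wrapped += pattern.substr(0, klen);
      pattern.erase(0, klen);
      found = true;
      break;
    }
  }
  // Wrap whatever is left so that \z applies to every top-level alternative.
  return wrapped + "(?:" + pattern + ")\\z";
}
```

For example, under this sketch `(*UCP)(*LIMIT_MATCH=1000)(*UTF)...` becomes `(*UCP)(*LIMIT_MATCH=1000)(*UTF)(?:...)\z`, while a pattern with no leading items is wrapped exactly as before.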
...@@ -2252,7 +2252,7 @@ if (isdirectory(pathname)) ...@@ -2252,7 +2252,7 @@ if (isdirectory(pathname))
int fnlength = strlen(pathname) + strlen(nextfile) + 2; int fnlength = strlen(pathname) + strlen(nextfile) + 2;
if (fnlength > 2048) if (fnlength > 2048)
{ {
fprintf(stderr, "pcre2grep: recursive filename is too long\n"); fprintf(stderr, "pcregrep: recursive filename is too long\n");
rc = 2; rc = 2;
break; break;
} }
...@@ -3034,7 +3034,7 @@ LC_ALL environment variable is set, and if so, use it. */ ...@@ -3034,7 +3034,7 @@ LC_ALL environment variable is set, and if so, use it. */
if (locale == NULL) if (locale == NULL)
{ {
locale = getenv("LC_ALL"); locale = getenv("LC_ALL");
locale_from = "LCC_ALL"; locale_from = "LC_ALL";
} }
if (locale == NULL) if (locale == NULL)
......
...@@ -5742,4 +5742,19 @@ AbcdCBefgBhiBqz ...@@ -5742,4 +5742,19 @@ AbcdCBefgBhiBqz
/X+(?#comment)?/ /X+(?#comment)?/
>XXX< >XXX<
/ (?<word> \w+ )* \. /xi
pokus.
/(?(DEFINE) (?<word> \w+ ) ) (?&word)* \./xi
pokus.
/(?(DEFINE) (?<word> \w+ ) ) ( (?&word)* ) \./xi
pokus.
/(?&word)* (?(DEFINE) (?<word> \w+ ) ) \./xi
pokus.
/(?&word)* \. (?<word> \w+ )/xi
pokus.hokus
/-- End of testinput1 --/ /-- End of testinput1 --/
...@@ -4257,4 +4257,7 @@ backtracking verbs. --/ ...@@ -4257,4 +4257,7 @@ backtracking verbs. --/
ab ab
aaab aaab
/(?(?=^))b/
abc
/-- End of testinput2 --/ /-- End of testinput2 --/
...@@ -727,4 +727,7 @@ ...@@ -727,4 +727,7 @@
/\C(\W?ſ)'?{{/8 /\C(\W?ſ)'?{{/8
\\C(\\W?ſ)'?{{ \\C(\\W?ſ)'?{{
/[^\x{100}-\x{ffff}]*[\x80-\xff]/8
\x{99}\x{99}\x{99}
/-- End of testinput4 --/ /-- End of testinput4 --/
...@@ -9446,4 +9446,28 @@ No match ...@@ -9446,4 +9446,28 @@ No match
>XXX< >XXX<
0: X 0: X
/ (?<word> \w+ )* \. /xi
pokus.
0: pokus.
1: pokus
/(?(DEFINE) (?<word> \w+ ) ) (?&word)* \./xi
pokus.
0: pokus.
/(?(DEFINE) (?<word> \w+ ) ) ( (?&word)* ) \./xi
pokus.
0: pokus.
1: <unset>
2: pokus
/(?&word)* (?(DEFINE) (?<word> \w+ ) ) \./xi
pokus.
0: pokus.
/(?&word)* \. (?<word> \w+ )/xi
pokus.hokus
0: pokus.hokus
1: hokus
/-- End of testinput1 --/ /-- End of testinput1 --/
...@@ -14721,4 +14721,8 @@ No need char ...@@ -14721,4 +14721,8 @@ No need char
0: ab 0: ab
1: a 1: a
/(?(?=^))b/
abc
0: b
/-- End of testinput2 --/ /-- End of testinput2 --/
...@@ -1277,4 +1277,8 @@ No match ...@@ -1277,4 +1277,8 @@ No match
\\C(\\W?ſ)'?{{ \\C(\\W?ſ)'?{{
No match No match
/[^\x{100}-\x{ffff}]*[\x80-\xff]/8
\x{99}\x{99}\x{99}
0: \x{99}\x{99}\x{99}
/-- End of testinput4 --/ /-- End of testinput4 --/