[3529] | 1 | Things to do for GNU grep
|
---|
| 2 |
|
---|
| 3 | Copyright (C) 1992, 1997-2002, 2004-2021 Free Software Foundation, Inc.
|
---|
| 4 |
|
---|
| 5 | Copying and distribution of this file, with or without modification,
|
---|
| 6 | are permitted in any medium without royalty provided the copyright
|
---|
| 7 | notice and this notice are preserved.
|
---|
| 8 |
|
---|
| 9 | ===============
|
---|
| 10 | Short term work
|
---|
| 11 | ===============
|
---|
| 12 |
|
---|
| 13 | See where we are with UTF-8 performance.
|
---|
| 14 |
|
---|
| 15 | Merge Debian patches that seem relevant.
|
---|
| 16 |
|
---|
| 17 | Go through patches in Savannah.
|
---|
| 18 |
|
---|
| 19 | Fix --directories=read.
|
---|
| 20 |
|
---|
| 21 | Write better Texinfo documentation for grep. The manual page would be a
|
---|
| 22 | good place to start, but Info documents are also supposed to contain a
|
---|
| 23 | tutorial and examples.
|
---|
| 24 |
|
---|
| 25 | Some tests in tests/spencer2.tests should have failed! Need to filter out
|
---|
| 26 | some bugs in dfa.[ch]/regex.[ch].
|
---|
| 27 |
|
---|
| 28 | Multithreading?
|
---|
| 29 |
|
---|
| 30 | GNU grep originally did 32-bit arithmetic. Although it has moved to
|
---|
| 31 | 64-bit on 64-bit platforms by using types like ptrdiff_t and size_t,
|
---|
| 32 | this conversion has not been entirely systematic and should be checked.
|
---|
| 33 |
|
---|
| 34 | Lazy dynamic linking of libpcre. See Debianâs 03-397262-dlopen-pcre.patch.
|
---|
| 35 |
|
---|
| 36 | Check FreeBSDâs integration of zgrep (-Z) and bzgrep (-J) in one
|
---|
| 37 | binary. Is there a possibility of doing even better by automatically
|
---|
| 38 | checking the magic of binary files ourselves (0x1F 0x8B for gzip, 0x1F
|
---|
| 39 | 0x9D for compress, and 0x42 0x5A 0x68 for bzip2)? Once what to do with
|
---|
| 40 | libpcre is decided, do the same for libz and libbz2.
|
---|
| 41 |
|
---|
| 42 | |
---|
| 43 |
|
---|
| 44 | ===================
|
---|
| 45 | Matching algorithms
|
---|
| 46 | ===================
|
---|
| 47 |
|
---|
| 48 | Take a look at these and consider opportunities for merging or cloning:
|
---|
| 49 |
|
---|
| 50 | -- http://osrd.org/projects/grep/global-regular-expression-print-tools-grep-variants
|
---|
| 51 | -- ja-grepâs mlb2 patch (Japanese grep)
|
---|
| 52 | <http://distcache.freebsd.org/ports-distfiles/grep-2.4.2-mlb2.patch.gz>
|
---|
| 53 | -- lgrep (from lv, a Powerful Multilingual File Viewer / Grep)
|
---|
| 54 | <http://www.mt.cs.keio.ac.jp/person/narita/lv/>;
|
---|
| 55 | -- cgrep (Context grep) <https://awgn.github.io/cgrep/>
|
---|
| 56 | seems like nice work;
|
---|
| 57 | -- sgrep (Struct grep) <https://www.cs.helsinki.fi/u/jjaakkol/sgrep.html>;
|
---|
| 58 | -- agrep (Approximate grep) <https://www.tgries.de/agrep/>,
|
---|
| 59 | from glimpse;
|
---|
| 60 | -- nr-grep (Nondeterministic reverse grep)
|
---|
| 61 | <https://www.dcc.uchile.cl/~gnavarro/software/>;
|
---|
| 62 | -- ggrep (Grouse grep) <http://www.grouse.com.au/ggrep/>;
|
---|
| 63 | -- freegrep <https://github.com/howardjp/freegrep>;
|
---|
| 64 |
|
---|
| 65 | Check some new algorithms for matching. See, for example, Faro &
|
---|
| 66 | Lecroq (cited in kwset.c).
|
---|
| 67 |
|
---|
| 68 | Fix the DFA matcher to never use exponential space. (Fortunately, these
|
---|
| 69 | cases are rare.)
|
---|
| 70 |
|
---|
| 71 | |
---|
| 72 |
|
---|
| 73 | ============================
|
---|
| 74 | Standards: POSIX and Unicode
|
---|
| 75 | ============================
|
---|
| 76 |
|
---|
| 77 | For POSIX compliance issues, see POSIX 1003.1.
|
---|
| 78 |
|
---|
| 79 | Current support for the POSIX [= =] and [. .] constructs is limited to
|
---|
| 80 | platforms whose regular expression matchers are sufficiently
|
---|
| 81 | compatible with the GNU C library so that the --without-included-regex
|
---|
| 82 | option of âconfigureâ is in effect. Extend this support to non-glibc
|
---|
| 83 | platforms, where --with-included-regex is in effect, by modifying the
|
---|
| 84 | included version of the regex code to defer to the native version when
|
---|
| 85 | handling [= =] and [. .].
|
---|
| 86 |
|
---|
| 87 | For Unicode, interesting things to check include the Unicode Standard
|
---|
| 88 | <https://www.unicode.org/standard/standard.html> and the Unicode Technical
|
---|
| 89 | Standard #18 (<https://www.unicode.org/reports/tr18/> âUnicode Regular
|
---|
| 90 | Expressionsâ). Talk to Bruno Haible whoâs maintaining GNU libunistring.
|
---|
| 91 | See also Unicode Standard Annex #15 (<https://www.unicode.org/reports/tr15/>
|
---|
| 92 | âUnicode Normalization Formsâ), already implemented by GNU libunistring.
|
---|
| 93 |
|
---|
| 94 | In particular, --ignore-case needs to be evaluated against the standards.
|
---|
| 95 | We may want to deviate from POSIX if Unicode provides better or clearer
|
---|
| 96 | semantics.
|
---|
| 97 |
|
---|
| 98 | POSIX and --ignore-case
|
---|
| 99 | -----------------------
|
---|
| 100 |
|
---|
| 101 | For this issue, interesting things to check in POSIX include the
|
---|
| 102 | Open Group Base Specifications, Chapter âRegular Expressionsâ, in
|
---|
| 103 | particular Section âRegular Expression General Requirementsâ and its
|
---|
| 104 | paragraph about caseless matching (this may not have been fully
|
---|
| 105 | thought through and that this text may be self-contradicting
|
---|
| 106 | [specifically: âof either data or patternsâ versus all the rest]).
|
---|
| 107 | See:
|
---|
| 108 |
|
---|
| 109 | http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_02
|
---|
| 110 |
|
---|
| 111 | In particular, consider the following with POSIXâs approach to case
|
---|
| 112 | folding in mind. Assume a non-Turkic locale with a character
|
---|
| 113 | repertoire reduced to the following various forms of âLATIN LETTER Iâ:
|
---|
| 114 |
|
---|
| 115 | 0049;LATIN CAPITAL LETTER I;Lu;0;L;;;;;N;;;;0069;
|
---|
| 116 | 0069;LATIN SMALL LETTER I;Ll;0;L;;;;;N;;;0049;;0049
|
---|
| 117 | 0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu;0;L;0049 0307;;;;N;\
|
---|
| 118 | LATIN CAPITAL LETTER I DOT;;;0069;
|
---|
| 119 | 0131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;0049
|
---|
| 120 |
|
---|
| 121 | UTF-8 octet lengths differ between U+0049 (0x49) and U+0069 (0x69)
|
---|
| 122 | versus U+0130 (0xC4 0xB0) and U+0131 (0xC4 0xB1). This implies that
|
---|
| 123 | whole UTF-8 strings cannot be case-converted in place, using the same
|
---|
| 124 | memory buffer, and that the needed octet-size of the new buffer cannot
|
---|
| 125 | merely be guessed (although thereâs a simple upper bound of five times
|
---|
| 126 | the size of the input, as the longest UTF-8 encoding of any character
|
---|
| 127 | is five bytes).
|
---|
| 128 |
|
---|
| 129 | We have
|
---|
| 130 |
|
---|
| 131 | lc(I) = i, uc(I) = I
|
---|
| 132 | lc(i) = i, uc(i) = I
|
---|
| 133 | lc(İ) = i, uc(İ) = İ
|
---|
| 134 | lc(ı) = ı, uc(ı) = I
|
---|
| 135 |
|
---|
| 136 | where lc() and uc() denote lower-case and upper-case conversions.
|
---|
| 137 |
|
---|
| 138 | There are several candidate --ignore-case logics. Using the
|
---|
| 139 |
|
---|
| 140 | if (lc(input_wchar) == lc(pattern_wchar))
|
---|
| 141 |
|
---|
| 142 | logic leads to the following matches:
|
---|
| 143 |
|
---|
| 144 | \in I i İ ı
|
---|
| 145 | pat\ ----------
|
---|
| 146 | I | Y Y Y n
|
---|
| 147 | i | Y Y Y n
|
---|
| 148 | İ | Y Y Y n
|
---|
| 149 | ı | n n n Y
|
---|
| 150 |
|
---|
| 151 | There is a lack of symmetry between CAPITAL and SMALL LETTERs with
|
---|
| 152 | this. Using the
|
---|
| 153 |
|
---|
| 154 | if (uc(input_wchar) == uc(pattern_wchar))
|
---|
| 155 |
|
---|
| 156 | logic (which is what GNU grep currently does although this is not
|
---|
| 157 | documented or guaranteed in the future), leads to the following
|
---|
| 158 | matches:
|
---|
| 159 |
|
---|
| 160 | \in I i İ ı
|
---|
| 161 | pat\ ----------
|
---|
| 162 | I | Y Y n Y
|
---|
| 163 | i | Y Y n Y
|
---|
| 164 | İ | n n Y n
|
---|
| 165 | ı | Y Y n Y
|
---|
| 166 |
|
---|
| 167 | There is a lack of symmetry between CAPITAL and SMALL LETTERs with
|
---|
| 168 | this.
|
---|
| 169 |
|
---|
| 170 | Using the
|
---|
| 171 |
|
---|
| 172 | if (lc(input_wchar) == lc(pattern_wchar)
|
---|
| 173 | || uc(input_wchar) == uc(pattern_wchar))
|
---|
| 174 |
|
---|
| 175 | logic leads to the following matches:
|
---|
| 176 |
|
---|
| 177 | \in I i İ ı
|
---|
| 178 | pat\ ----------
|
---|
| 179 | I | Y Y Y Y
|
---|
| 180 | i | Y Y Y Y
|
---|
| 181 | İ | Y Y Y n
|
---|
| 182 | ı | Y Y n Y
|
---|
| 183 |
|
---|
| 184 | There is some elegance and symmetry with this. But there are
|
---|
| 185 | potentially two conversions to be made per input character. If the
|
---|
| 186 | pattern is pre-converted, two copies of it need to be kept and used in
|
---|
| 187 | a mutually coherent fashion.
|
---|
| 188 |
|
---|
| 189 | Using the
|
---|
| 190 |
|
---|
| 191 | if (input_wchar == pattern_wchar
|
---|
| 192 | || lc(input_wchar) == pattern_wchar
|
---|
| 193 | || uc(input_wchar) == pattern_wchar)
|
---|
| 194 |
|
---|
| 195 | logic (a plausible interpretation of POSIX) leads to the following
|
---|
| 196 | matches:
|
---|
| 197 |
|
---|
| 198 | \in I i İ ı
|
---|
| 199 | pat\ ----------
|
---|
| 200 | I | Y Y n Y
|
---|
| 201 | i | Y Y Y n
|
---|
| 202 | İ | n n Y n
|
---|
| 203 | ı | n n n Y
|
---|
| 204 |
|
---|
| 205 | There is a different CAPITAL/SMALL symmetry with this. But thereâs
|
---|
| 206 | also a loss of pattern/input symmetry thatâs unique to it. Also there
|
---|
| 207 | are potentially two conversions to be made per input character.
|
---|
| 208 |
|
---|
| 209 | Using the
|
---|
| 210 |
|
---|
| 211 | if (lc(uc(input_wchar)) == lc(uc(pattern_wchar)))
|
---|
| 212 |
|
---|
| 213 | logic leads to the following matches:
|
---|
| 214 |
|
---|
| 215 | \in I i İ ı
|
---|
| 216 | pat\ ----------
|
---|
| 217 | I | Y Y Y Y
|
---|
| 218 | i | Y Y Y Y
|
---|
| 219 | İ | Y Y Y Y
|
---|
| 220 | ı | Y Y Y Y
|
---|
| 221 |
|
---|
| 222 | This shows total symmetry and transitivity (at least in this example
|
---|
| 223 | analysis). There are two conversions to be made per input character,
|
---|
| 224 | but support could be added for having a single straight mapping
|
---|
| 225 | performing a composition of the two conversions.
|
---|
| 226 |
|
---|
| 227 | Any optimization in the implementation of each logic must not change
|
---|
| 228 | its basic semantic.
|
---|
| 229 |
|
---|
| 230 |
|
---|
| 231 | Unicode and --ignore-case
|
---|
| 232 | -------------------------
|
---|
| 233 |
|
---|
| 234 | For this issue, interesting things to check in Unicode include:
|
---|
| 235 |
|
---|
| 236 | - The Unicode Standard, Chapter 3
|
---|
| 237 | (<https://www.unicode.org/versions/Unicode9.0.0/ch03.pdf>
|
---|
| 238 | âConformanceâ), Section 3.13 (âDefault Case Algorithmsâ) and the
|
---|
| 239 | toCasefold() case conversion operation.
|
---|
| 240 |
|
---|
| 241 | - The Unicode Standard, Chapter 4
|
---|
| 242 | (<https://www.unicode.org/versions/Unicode9.0.0/ch04.pdf>
|
---|
| 243 | âCharacter Propertiesâ), Section 4.2 (âCaseâ) and
|
---|
| 244 | the <https://www.unicode.org/Public/UNIDATA/SpecialCasing.txt>
|
---|
| 245 | SpecialCasing.txt and
|
---|
| 246 | <https://www.unicode.org/Public/UNIDATA/CaseFolding.txt>
|
---|
| 247 | CaseFolding.txt files.
|
---|
| 248 |
|
---|
| 249 | - The Unicode Standard, Chapter 5
|
---|
| 250 | (<https://www.unicode.org/versions/Unicode9.0.0/ch05.pdf>
|
---|
| 251 | âImplementation Guidelinesâ), Section 5.18 (âCase Mappingsâ),
|
---|
| 252 | Subsection âCaseless Matchingâ.
|
---|
| 253 |
|
---|
| 254 | - The Unicode case charts <https://www.unicode.org/charts/case/>.
|
---|
| 255 |
|
---|
| 256 | Unicode uses the
|
---|
| 257 |
|
---|
| 258 | if (toCasefold(input_wchar_string) == toCasefold(pattern_wchar_string))
|
---|
| 259 |
|
---|
| 260 | logic for caseless matching. Consider the âLATIN LETTER Iâ example
|
---|
| 261 | mentioned above. In a non-Turkic locale, simple case folding yields
|
---|
| 262 |
|
---|
| 263 | toCasefold_simple(U+0049) = U+0069
|
---|
| 264 | toCasefold_simple(U+0069) = U+0069
|
---|
| 265 | toCasefold_simple(U+0130) = U+0130
|
---|
| 266 | toCasefold_simple(U+0131) = U+0131
|
---|
| 267 |
|
---|
| 268 | which leads to the following matches:
|
---|
| 269 |
|
---|
| 270 | \in I i İ ı
|
---|
| 271 | pat\ ----------
|
---|
| 272 | I | Y Y n n
|
---|
| 273 | i | Y Y n n
|
---|
| 274 | İ | n n Y n
|
---|
| 275 | ı | n n n Y
|
---|
| 276 |
|
---|
| 277 | This is different from anything so far!
|
---|
| 278 |
|
---|
| 279 | In a non-Turkic locale, full case folding yields
|
---|
| 280 |
|
---|
| 281 | toCasefold_full(U+0049) = U+0069
|
---|
| 282 | toCasefold_full(U+0069) = U+0069
|
---|
| 283 | toCasefold_full(U+0130) = <U+0069, U+0307>
|
---|
| 284 | toCasefold_full(U+0131) = U+0131
|
---|
| 285 |
|
---|
| 286 | with
|
---|
| 287 |
|
---|
| 288 | 0307;COMBINING DOT ABOVE;Mn;230;NSM;;;;;N;NON-SPACING DOT ABOVE;;;;
|
---|
| 289 |
|
---|
| 290 | which leads to the following matches:
|
---|
| 291 |
|
---|
| 292 | \in I i İ ı
|
---|
| 293 | pat\ ----------
|
---|
| 294 | I | Y Y * n
|
---|
| 295 | i | Y Y * n
|
---|
| 296 | İ | n n Y n
|
---|
| 297 | ı | n n n Y
|
---|
| 298 |
|
---|
| 299 | This is just sad!
|
---|
| 300 |
|
---|
| 301 | Having toCasefold(U+0131), simple or full, map to itself instead of
|
---|
| 302 | U+0069 is in contradiction with the rules of Section 5.18 of the
|
---|
| 303 | Unicode Standard since toUpperCase(U+0131) is U+0049. Same thing for
|
---|
| 304 | toCasefold_simple(U+0130) since toLowerCase(U+0131) is U+0069. The
|
---|
| 305 | justification for the weird toCasefold_full(U+0130) mapping is
|
---|
| 306 | unknown; it doesnât even make sense to add a dot (U+0307) to a letter
|
---|
| 307 | that already has one (U+0069). It would have been so simple to put
|
---|
| 308 | them all in the same equivalence class!
|
---|
| 309 |
|
---|
| 310 | Otherwise, also consider the following problem with Unicodeâs approach
|
---|
| 311 | on case folding in mind. Assume that we want to perform
|
---|
| 312 |
|
---|
| 313 | echo 'AÃBC' | grep -i 'Sb'
|
---|
| 314 |
|
---|
| 315 | which corresponds to
|
---|
| 316 |
|
---|
| 317 | input: U+0041 U+00DF U+0042 U+0043 U+000A
|
---|
| 318 | pattern: U+0053 U+0062
|
---|
| 319 |
|
---|
| 320 | Following CaseFolding.txt, applying the toCasefold() transformation to
|
---|
| 321 | these yields
|
---|
| 322 |
|
---|
| 323 | input: U+0061 U+0073 U+0073 U+0062 U+0063 U+000A
|
---|
| 324 | pattern: U+0073 U+0062
|
---|
| 325 |
|
---|
| 326 | so, according to this approach, the input should match the pattern.
|
---|
| 327 | As long as the original input line is to be reported to the user as a
|
---|
| 328 | whole, there is no problem (from the userâs point-of-view;
|
---|
| 329 | implementation is complicated by this).
|
---|
| 330 |
|
---|
| 331 | However, consider both these GNU extensions:
|
---|
| 332 |
|
---|
| 333 | echo 'AÃBC' | grep -i --only-matching 'Sb'
|
---|
| 334 | echo 'AÃBC' | grep -i --color=always 'Sb'
|
---|
| 335 |
|
---|
| 336 | What is to be reported in these cases, since the match begins in the
|
---|
| 337 | *middle* of the original input character âÃâ?
|
---|
| 338 |
|
---|
| 339 | Unicodeâs toCasefold() cannot be implemented in terms of POSIXâs
|
---|
| 340 | towctrans() since that can only return a single wint_t value per input
|
---|
| 341 | wint_t value.
|
---|