source: vendor/grep/2.12/TODO@ 2893

Last change on this file since 2893 was 2595, checked in by bird, 13 years ago

gnu grep version 2.12 (grep-2.12.tar.xz, md5sum=8d2f0346d08b13c18afb81f0e8aa1e2f)

File size: 10.8 KB
Line 
1 Copyright (C) 1992, 1997-2002, 2004-2012 Free Software Foundation, Inc.
2
3 Copying and distribution of this file, with or without modification,
4 are permitted in any medium without royalty provided the copyright
5 notice and this notice are preserved.
6
7===============
8Short term work
9===============
10
11See where we are with UTF-8 performance.
12
13Merge Debian patches 55-bigfile.patch, 69-mbtowc.patch and
1470-man_apostrophe.patch. Go through patches in Savannah.
15
16Cleanup of the grep(), grepdir(), recursion (the "main loop") to use fts.
17Fix --directories=read.
18
19Write better Texinfo documentation for grep. The manual page would be a
20good place to start, but Info documents are also supposed to contain a
21tutorial and examples.
22
23Some test in tests/spencer2.tests should have failed! Need to filter out
24some bugs in dfa.[ch]/regex.[ch].
25
26Multithreading?
27
28GNU grep does 32-bit arithmetic, it needs to move to 64-bit (i.e.
29size_t/ptrdiff_t).
30
31Lazy dynamic linking of libpcre.
32
33Check FreeBSD's integration of zgrep (-Z) and bzgrep (-J) in one
34binary. Is there a possibility of doing even better by automatically
35checking the magic of binary files ourselves (0x1F 0x8B for gzip, 0x1F
360x9D for compress, and 0x42 0x5A 0x68 for bzip2)? Once what to do with
37libpcre is decided, do the same for libz and libbz2.
38
39
40
41==================
42Matching algorithms
43==================
44
45Check <http://tony.abou-assaleh.net/greps.html>. Take a look at these
46and consider opportunities for merging or cloning:
47
48 -- ja-grep's mlb2 patch (Japanese grep)
49 <ftp://ftp.freebsd.org/pub/FreeBSD/ports/distfiles/grep-2.4.2-mlb2.patch.gz>
50 -- lgrep (from lv, a Powerful Multilingual File Viewer / Grep)
51 <http://www.ff.iij4u.or.jp/~nrt/lv/>;
52 -- cgrep (Context grep) <http://plg.uwaterloo.ca/~ftp/mt/cgrep/>
53 seems like nice work;
54 -- sgrep (Struct grep) <http://www.cs.helsinki.fi/u/jjaakkol/sgrep.html>;
55 -- agrep (Approximate grep) <http://www.tgries.de/agrep/>,
56 from glimpse;
57 -- nr-grep (Nondeterministic reverse grep)
58 <http://www.dcc.uchile.cl/~gnavarro/software/>;
59 -- ggrep (Grouse grep) <http://www.grouse.com.au/ggrep/>;
60 -- grep.py (Python grep) <http://www.vdesmedt.com/~vds2212/grep.html>;
61 -- freegrep <http://www.vocito.com/downloads/software/grep/>;
62
63Check some new algorithms for matching; talk to Karl Berry and Nelson.
64Sunday's "Quick Search" Algorithm (CACM 33, 1990-08-08 pp. 132-142)
65claim that his algorithm is faster than Boyer-More. Worth checking.
66
67Fix the DFA matcher to never use exponential space. (Fortunately, these
68cases are rare.)
69
70
71
72============================
73Standards: POSIX and Unicode
74============================
75
76For POSIX compliance, see p10003.x. Current support for the POSIX [= =]
77and [. .] constructs is limited. This is difficult because it requires
78locale-dependent details of the character set and collating sequence,
79but POSIX does not standardize any method for accessing this information!
80
81For Unicode, interesting things to check include the Unicode Standard
82<http://www.unicode.org/standard/standard.html> and the Unicode Technical
83Standard #18 (<http://www.unicode.org/reports/tr18/> “Unicode Regular
84Expressions”). Talk to Bruno Haible who's maintaining GNU libunistring.
85See also Unicode Standard Annex #15 (<http://www.unicode.org/reports/tr15/>
86“Unicode Normalization Forms”), already implemented by GNU libunistring.
87
88In particular, --ignore-case needs to be evaluated against the standards.
89We may want to deviate from POSIX if Unicode provides better or clearer
90semantics.
91
92POSIX and --ignore-case
93-----------------------
94
95For this issue, interesting things to check in POSIX include the
96Volume “Base Definitions (XBD)”, Chapter “Regular Expressions” and in
97particular Section “Regular Expression General Requirements” and its
98paragraph about caseless matching (note that this may not have been
99fully thought through and that this text may be self-contradicting
100[specifically: “of either data or patterns” versus all the rest]).
101
102In particular, consider the following with POSIX's approach to case
103folding in mind. Assume a non-Turkic locale with a character
104repertoire reduced to the following various forms of “LATIN LETTER I”:
105
1060049;LATIN CAPITAL LETTER I;Lu;0;L;;;;;N;;;;0069;
1070069;LATIN SMALL LETTER I;Ll;0;L;;;;;N;;;0049;;0049
1080130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu;0;L;0049 0307;;;;N;LATIN CAPITAL LETTER I DOT;;;0069;
1090131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;0049
110
111First note the differing UTF-8 octet lengths of U+0049 (0x49) and
112U+0069 (0x69) versus U+0130 (0xC4 0xB0) and U+0131 (0xC4 0xB1). This
113implies that whole UTF-8 strings cannot be case-converted in place,
114using the same memory buffer, and that the needed octet-size of the
115new buffer cannot merely be guessed (although there's a simple upper
116bound of six times the size of the input, as the longest UTF-8
117encoding of any character is six bytes).
118
119We have
120
121lc(I) = i, uc(I) = I
122lc(i) = i, uc(i) = I
123lc(İ) = i, uc(İ) = İ
124lc(ı) = ı, uc(ı) = I
125
126where lc() and uc() denote lower-case and upper-case conversions.
127
128There are several candidate --ignore-case logics (including the one
129mandated by POSIX):
130
131Using the
132
133if (lc(input_wchar) == lc(pattern_wchar))
134
135logic leads to the following matches:
136
137 \in I i İ ı
138pat\ ----------
139"I" | Y Y Y n
140"i" | Y Y Y n
141"İ" | Y Y Y n
142"ı" | n n n Y
143
144There is a lack of symmetry between CAPITAL and SMALL LETTERs with
145this.
146
147Using the
148
149if (uc(input_wchar) == uc(pattern_wchar))
150
151logic leads to the following matches:
152
153 \in I i İ ı
154pat\ ----------
155"I" | Y Y n Y
156"i" | Y Y n Y
157"İ" | n n Y n
158"ı" | Y Y n Y
159
160There is a lack of symmetry between CAPITAL and SMALL LETTERs with
161this.
162
163Using the
164
165if ( lc(input_wchar) == lc(pattern_wchar)
166|| uc(input_wchar) == uc(pattern_wchar))
167
168logic leads to the following matches:
169
170 \in I i İ ı
171pat\ ----------
172"I" | Y Y Y Y
173"i" | Y Y Y Y
174"İ" | Y Y Y n
175"ı" | Y Y n Y
176
177There is some elegance and symmetry with this. But there are
178potentially two conversions to be made per input character. If the
179pattern is pre-converted, two copies of it need to be kept and used in
180a mutually coherent fashion.
181
182Using the
183
184if ( input_wchar == pattern_wchar
185|| lc(input_wchar) == pattern_wchar
186|| uc(input_wchar) == pattern_wchar)
187
188logic (as mandated by POSIX) leads to the following matches:
189
190 \in I i İ ı
191pat\ ----------
192"I" | Y Y n Y
193"i" | Y Y Y n
194"İ" | n n Y n
195"ı" | n n n Y
196
197There is a different CAPITAL/SMALL symmetry with this. But there's
198also a loss of pattern/input symmetry that's unique to it. Also there
199are potentially two conversions to be made per input character.
200
201Using the
202
203if (lc(uc(input_wchar)) == lc(uc(pattern_wchar)))
204
205
206logic leads to the following matches:
207
208 \in I i İ ı
209pat\ ----------
210"I" | Y Y Y Y
211"i" | Y Y Y Y
212"İ" | Y Y Y Y
213"ı" | Y Y Y Y
214
215This shows total symmetry and transitivity
216(at least in this example analysis).
217There are two conversions to be made per input character,
218but support could be added for having
219a single straight mapping performing
220a composition of the two conversions.
221
222Any optimization in the implementation of each logic
223must not change its basic semantic.
224
225
226Unicode and --ignore-case
227-------------------------
228
229For this issue, interesting things to check in Unicode include:
230
231 -- The Unicode Standard, Chapter 3
232 (<http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf>
233 “Conformance”), Section 3.13 (“Default Case Operations”) and the
234 toCasefold() case conversion operation.
235
236 -- The Unicode Standard, Chapter 4
237 (<http://www.unicode.org/versions/Unicode5.2.0/ch04.pdf>
238 “Character Properties”), Section 4.2 (“Case—Normative”) and
239 the <http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt>
240 SpecialCasing.txt and
241 <http://www.unicode.org/Public/UNIDATA/CaseFolding.txt>
242 CaseFolding.txt files from the
243 <http://www.unicode.org/Public/UNIDATA/UCD.html> Unicode
244 Character Database .
245
246The <http://www.unicode.org/standard/standard.html> Unicode Standard,
247Chapter 5 (“<http://www.unicode.org/versions/Unicode5.2.0/ch05.pdf>
248Implementation Guidelines ”), Section 5.18 (“Case Mappings”),
249Subsection “Caseless Matching”.
250
251The Unicode <http://www.unicode.org/charts/case/> case charts.
252
253Unicode uses the
254
255if (toCasefold(input_wchar_string) == toCasefold(pattern_wchar_string))
256
257logic for caseless matching. Let's consider the “LATIN LETTER I”
258example mentioned above. In a non-Turkic locale, simple case folding
259yields
260
261toCasefold_simple(U+0049) = U+0069
262toCasefold_simple(U+0069) = U+0069
263toCasefold_simple(U+0130) = U+0130
264toCasefold_simple(U+0131) = U+0131
265
266which leads to the following matches:
267
268 \in I i İ ı
269pat\ ----------
270"I" | Y Y n n
271"i" | Y Y n n
272"İ" | n n Y n
273"ı" | n n n Y
274
275This is different from anything so far!
276
277In a non-Turkic locale, full case folding yields
278
279toCasefold_full(U+0049) = U+0069
280toCasefold_full(U+0069) = U+0069
281toCasefold_full(U+0130) = <U+0069, U+0307>
282toCasefold_full(U+0131) = U+0131
283
284with
285
2860307;COMBINING DOT ABOVE;Mn;230;NSM;;;;;N;NON-SPACING DOT ABOVE;;;;
287
288which leads to the following matches:
289
290 \in I i İ ı
291pat\ ----------
292"I" | Y Y * n
293"i" | Y Y * n
294"İ" | n n Y n
295"ı" | n n n Y
296
297This is just sad!
298
299Note that having toCasefold(U+0131), simple or full, map to itself
300instead of U+0069 is in contradiction with the rules of Section 5.18
301of the Unicode Standard since toUpperCase(U+0131) is U+0049. Same
302thing for toCasefold_simple(U+0130) since toLowerCase(U+0131) is
303U+0069. The justification for the weird toCasefold_full(U+0130)
304mapping is unknown; it doesn't even make sense to add a dot (U+0307)
305to a letter that already has one (U+0069). It would have been so
306simple to put them all in the same equivalence class!
307
308Otherwise, also consider the following problem with Unicode's approach
309on case folding in mind. Assume that we want to perform
310
311echo 'AßBC | grep -i 'Sb'
312
313which corresponds to
314
315input: U+0041 U+00DF U+0042 U+0043 U+000A
316pattern: U+0053 U+0062
317
318Following “CaseFolding-4.1.0.txt”, applying the toCasefold()
319transformation to these yields
320
321input: U+0061 U+0073 U+0073 U+0062 U+0063 U+000A
322pattern: U+0073 U+0062
323
324so, according to this approach, the input should match the pattern. As
325long as the original input line is to be reported to the user as a
326whole, there is no problem (from the user's point-of-view;
327implementation is complicated by this).
328
329However, consider both these GNU extensions:
330
331echo 'AßBC' | grep -i --only-matching 'Sb'
332echo 'AßBC' | grep -i --color=always 'Sb'
333
334What is to be reported in these cases, since the match begins in the
335*middle* of the original input character 'ß'?
336
337Note that Unicode's toCasefold() cannot be implemented in terms of
338POSIX' towctrans() since that can only return a single wint_t value
339per input wint_t value.
Note: See TracBrowser for help on using the repository browser.