Context Navigation

← Previous Revision
Latest Revision
Next Revision →
Blame
Revision Log

TODO@ 2893

Last change on this file since 2893 was 2595, checked in by bird, 13 years ago
gnu grep version 2.12 (grep-2.12.tar.xz, md5sum=8d2f0346d08b13c18afb81f0e8aa1e2f)
File size: 10.8 KB

Line
1	Copyright (C) 1992, 1997-2002, 2004-2012 Free Software Foundation, Inc.
2
3	Copying and distribution of this file, with or without modification,
4	are permitted in any medium without royalty provided the copyright
5	notice and this notice are preserved.
6
7	===============
8	Short term work
9	===============
10
11	See where we are with UTF-8 performance.
12
13	Merge Debian patches 55-bigfile.patch, 69-mbtowc.patch and
14	70-man_apostrophe.patch. Go through patches in Savannah.
15
16	Cleanup of the grep(), grepdir(), recursion (the "main loop") to use fts.
17	Fix --directories=read.
18
19	Write better Texinfo documentation for grep. The manual page would be a
20	good place to start, but Info documents are also supposed to contain a
21	tutorial and examples.
22
23	Some test in tests/spencer2.tests should have failed! Need to filter out
24	some bugs in dfa.[ch]/regex.[ch].
25
26	Multithreading?
27
28	GNU grep does 32-bit arithmetic, it needs to move to 64-bit (i.e.
29	size_t/ptrdiff_t).
30
31	Lazy dynamic linking of libpcre.
32
33	Check FreeBSD's integration of zgrep (-Z) and bzgrep (-J) in one
34	binary. Is there a possibility of doing even better by automatically
35	checking the magic of binary files ourselves (0x1F 0x8B for gzip, 0x1F
36	0x9D for compress, and 0x42 0x5A 0x68 for bzip2)? Once what to do with
37	libpcre is decided, do the same for libz and libbz2.
38
39
40
41	==================
42	Matching algorithms
43	==================
44
45	Check <http://tony.abou-assaleh.net/greps.html>. Take a look at these
46	and consider opportunities for merging or cloning:
47
48	-- ja-grep's mlb2 patch (Japanese grep)
49	<ftp://ftp.freebsd.org/pub/FreeBSD/ports/distfiles/grep-2.4.2-mlb2.patch.gz>
50	-- lgrep (from lv, a Powerful Multilingual File Viewer / Grep)
51	<http://www.ff.iij4u.or.jp/~nrt/lv/>;
52	-- cgrep (Context grep) <http://plg.uwaterloo.ca/~ftp/mt/cgrep/>
53	seems like nice work;
54	-- sgrep (Struct grep) <http://www.cs.helsinki.fi/u/jjaakkol/sgrep.html>;
55	-- agrep (Approximate grep) <http://www.tgries.de/agrep/>,
56	from glimpse;
57	-- nr-grep (Nondeterministic reverse grep)
58	<http://www.dcc.uchile.cl/~gnavarro/software/>;
59	-- ggrep (Grouse grep) <http://www.grouse.com.au/ggrep/>;
60	-- grep.py (Python grep) <http://www.vdesmedt.com/~vds2212/grep.html>;
61	-- freegrep <http://www.vocito.com/downloads/software/grep/>;
62
63	Check some new algorithms for matching; talk to Karl Berry and Nelson.
64	Sunday's "Quick Search" Algorithm (CACM 33, 1990-08-08 pp. 132-142)
65	claim that his algorithm is faster than Boyer-More. Worth checking.
66
67	Fix the DFA matcher to never use exponential space. (Fortunately, these
68	cases are rare.)
69
70
71
72	============================
73	Standards: POSIX and Unicode
74	============================
75
76	For POSIX compliance, see p10003.x. Current support for the POSIX [= =]
77	and [. .] constructs is limited. This is difficult because it requires
78	locale-dependent details of the character set and collating sequence,
79	but POSIX does not standardize any method for accessing this information!
80
81	For Unicode, interesting things to check include the Unicode Standard
82	<http://www.unicode.org/standard/standard.html> and the Unicode Technical
83	Standard #18 (<http://www.unicode.org/reports/tr18/> âUnicode Regular
84	Expressionsâ). Talk to Bruno Haible who's maintaining GNU libunistring.
85	See also Unicode Standard Annex #15 (<http://www.unicode.org/reports/tr15/>
86	âUnicode Normalization Formsâ), already implemented by GNU libunistring.
87
88	In particular, --ignore-case needs to be evaluated against the standards.
89	We may want to deviate from POSIX if Unicode provides better or clearer
90	semantics.
91
92	POSIX and --ignore-case
93	-----------------------
94
95	For this issue, interesting things to check in POSIX include the
96	Volume âBase Definitions (XBD)â, Chapter âRegular Expressionsâ and in
97	particular Section âRegular Expression General Requirementsâ and its
98	paragraph about caseless matching (note that this may not have been
99	fully thought through and that this text may be self-contradicting
100	[specifically: âof either data or patternsâ versus all the rest]).
101
102	In particular, consider the following with POSIX's approach to case
103	folding in mind. Assume a non-Turkic locale with a character
104	repertoire reduced to the following various forms of âLATIN LETTER Iâ:
105
106	0049;LATIN CAPITAL LETTER I;Lu;0;L;;;;;N;;;;0069;
107	0069;LATIN SMALL LETTER I;Ll;0;L;;;;;N;;;0049;;0049
108	0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu;0;L;0049 0307;;;;N;LATIN CAPITAL LETTER I DOT;;;0069;
109	0131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;0049
110
111	First note the differing UTF-8 octet lengths of U+0049 (0x49) and
112	U+0069 (0x69) versus U+0130 (0xC4 0xB0) and U+0131 (0xC4 0xB1). This
113	implies that whole UTF-8 strings cannot be case-converted in place,
114	using the same memory buffer, and that the needed octet-size of the
115	new buffer cannot merely be guessed (although there's a simple upper
116	bound of six times the size of the input, as the longest UTF-8
117	encoding of any character is six bytes).
118
119	We have
120
121	lc(I) = i, uc(I) = I
122	lc(i) = i, uc(i) = I
123	lc(Ä°) = i, uc(Ä°) = Ä°
124	lc(Ä±) = Ä±, uc(Ä±) = I
125
126	where lc() and uc() denote lower-case and upper-case conversions.
127
128	There are several candidate --ignore-case logics (including the one
129	mandated by POSIX):
130
131	Using the
132
133	if (lc(input_wchar) == lc(pattern_wchar))
134
135	logic leads to the following matches:
136
137	\in I i Ä° Ä±
138	pat\ ----------
139	"I" \| Y Y Y n
140	"i" \| Y Y Y n
141	"Ä°" \| Y Y Y n
142	"Ä±" \| n n n Y
143
144	There is a lack of symmetry between CAPITAL and SMALL LETTERs with
145	this.
146
147	Using the
148
149	if (uc(input_wchar) == uc(pattern_wchar))
150
151	logic leads to the following matches:
152
153	\in I i Ä° Ä±
154	pat\ ----------
155	"I" \| Y Y n Y
156	"i" \| Y Y n Y
157	"Ä°" \| n n Y n
158	"Ä±" \| Y Y n Y
159
160	There is a lack of symmetry between CAPITAL and SMALL LETTERs with
161	this.
162
163	Using the
164
165	if ( lc(input_wchar) == lc(pattern_wchar)
166	\|\| uc(input_wchar) == uc(pattern_wchar))
167
168	logic leads to the following matches:
169
170	\in I i Ä° Ä±
171	pat\ ----------
172	"I" \| Y Y Y Y
173	"i" \| Y Y Y Y
174	"Ä°" \| Y Y Y n
175	"Ä±" \| Y Y n Y
176
177	There is some elegance and symmetry with this. But there are
178	potentially two conversions to be made per input character. If the
179	pattern is pre-converted, two copies of it need to be kept and used in
180	a mutually coherent fashion.
181
182	Using the
183
184	if ( input_wchar == pattern_wchar
185	\|\| lc(input_wchar) == pattern_wchar
186	\|\| uc(input_wchar) == pattern_wchar)
187
188	logic (as mandated by POSIX) leads to the following matches:
189
190	\in I i Ä° Ä±
191	pat\ ----------
192	"I" \| Y Y n Y
193	"i" \| Y Y Y n
194	"Ä°" \| n n Y n
195	"Ä±" \| n n n Y
196
197	There is a different CAPITAL/SMALL symmetry with this. But there's
198	also a loss of pattern/input symmetry that's unique to it. Also there
199	are potentially two conversions to be made per input character.
200
201	Using the
202
203	if (lc(uc(input_wchar)) == lc(uc(pattern_wchar)))
204
205
206	logic leads to the following matches:
207
208	\in I i Ä° Ä±
209	pat\ ----------
210	"I" \| Y Y Y Y
211	"i" \| Y Y Y Y
212	"Ä°" \| Y Y Y Y
213	"Ä±" \| Y Y Y Y
214
215	This shows total symmetry and transitivity
216	(at least in this example analysis).
217	There are two conversions to be made per input character,
218	but support could be added for having
219	a single straight mapping performing
220	a composition of the two conversions.
221
222	Any optimization in the implementation of each logic
223	must not change its basic semantic.
224
225
226	Unicode and --ignore-case
227	-------------------------
228
229	For this issue, interesting things to check in Unicode include:
230
231	-- The Unicode Standard, Chapter 3
232	(<http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf>
233	âConformanceâ), Section 3.13 (âDefault Case Operationsâ) and the
234	toCasefold() case conversion operation.
235
236	-- The Unicode Standard, Chapter 4
237	(<http://www.unicode.org/versions/Unicode5.2.0/ch04.pdf>
238	âCharacter Propertiesâ), Section 4.2 (âCaseâNormativeâ) and
239	the <http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt>
240	SpecialCasing.txt and
241	<http://www.unicode.org/Public/UNIDATA/CaseFolding.txt>
242	CaseFolding.txt files from the
243	<http://www.unicode.org/Public/UNIDATA/UCD.html> Unicode
244	Character Database .
245
246	The <http://www.unicode.org/standard/standard.html> Unicode Standard,
247	Chapter 5 (â<http://www.unicode.org/versions/Unicode5.2.0/ch05.pdf>
248	Implementation Guidelines â), Section 5.18 (âCase Mappingsâ),
249	Subsection âCaseless Matchingâ.
250
251	The Unicode <http://www.unicode.org/charts/case/> case charts.
252
253	Unicode uses the
254
255	if (toCasefold(input_wchar_string) == toCasefold(pattern_wchar_string))
256
257	logic for caseless matching. Let's consider the âLATIN LETTER Iâ
258	example mentioned above. In a non-Turkic locale, simple case folding
259	yields
260
261	toCasefold_simple(U+0049) = U+0069
262	toCasefold_simple(U+0069) = U+0069
263	toCasefold_simple(U+0130) = U+0130
264	toCasefold_simple(U+0131) = U+0131
265
266	which leads to the following matches:
267
268	\in I i Ä° Ä±
269	pat\ ----------
270	"I" \| Y Y n n
271	"i" \| Y Y n n
272	"Ä°" \| n n Y n
273	"Ä±" \| n n n Y
274
275	This is different from anything so far!
276
277	In a non-Turkic locale, full case folding yields
278
279	toCasefold_full(U+0049) = U+0069
280	toCasefold_full(U+0069) = U+0069
281	toCasefold_full(U+0130) = <U+0069, U+0307>
282	toCasefold_full(U+0131) = U+0131
283
284	with
285
286	0307;COMBINING DOT ABOVE;Mn;230;NSM;;;;;N;NON-SPACING DOT ABOVE;;;;
287
288	which leads to the following matches:
289
290	\in I i Ä° Ä±
291	pat\ ----------
292	"I" \| Y Y * n
293	"i" \| Y Y * n
294	"Ä°" \| n n Y n
295	"Ä±" \| n n n Y
296
297	This is just sad!
298
299	Note that having toCasefold(U+0131), simple or full, map to itself
300	instead of U+0069 is in contradiction with the rules of Section 5.18
301	of the Unicode Standard since toUpperCase(U+0131) is U+0049. Same
302	thing for toCasefold_simple(U+0130) since toLowerCase(U+0131) is
303	U+0069. The justification for the weird toCasefold_full(U+0130)
304	mapping is unknown; it doesn't even make sense to add a dot (U+0307)
305	to a letter that already has one (U+0069). It would have been so
306	simple to put them all in the same equivalence class!
307
308	Otherwise, also consider the following problem with Unicode's approach
309	on case folding in mind. Assume that we want to perform
310
311	echo 'AÃBC \| grep -i 'Sb'
312
313	which corresponds to
314
315	input: U+0041 U+00DF U+0042 U+0043 U+000A
316	pattern: U+0053 U+0062
317
318	Following âCaseFolding-4.1.0.txtâ, applying the toCasefold()
319	transformation to these yields
320
321	input: U+0061 U+0073 U+0073 U+0062 U+0063 U+000A
322	pattern: U+0073 U+0062
323
324	so, according to this approach, the input should match the pattern. As
325	long as the original input line is to be reported to the user as a
326	whole, there is no problem (from the user's point-of-view;
327	implementation is complicated by this).
328
329	However, consider both these GNU extensions:
330
331	echo 'AÃBC' \| grep -i --only-matching 'Sb'
332	echo 'AÃBC' \| grep -i --color=always 'Sb'
333
334	What is to be reported in these cases, since the match begins in the
335	middle of the original input character 'Ã'?
336
337	Note that Unicode's toCasefold() cannot be implemented in terms of
338	POSIX' towctrans() since that can only return a single wint_t value
339	per input wint_t value.

Note: See TracBrowser for help on using the repository browser.

Download in other formats:

Original Format