Context Navigation

TODO

Last change on this file was 3529, checked in by bird, 4 years ago
Imported grep 3.7 from grep-3.7.tar.gz (sha256: c22b0cf2d4f6bbe599c902387e8058990e1eee99aef333a203829e5fd3dbb342), applying minimal auto-props.
File size: 11.0 KB

Rev	Line
[3529]	1	Things to do for GNU grep
	2
	3	Copyright (C) 1992, 1997-2002, 2004-2021 Free Software Foundation, Inc.
	4
	5	Copying and distribution of this file, with or without modification,
	6	are permitted in any medium without royalty provided the copyright
	7	notice and this notice are preserved.
	8
	9	===============
	10	Short term work
	11	===============
	12
	13	See where we are with UTF-8 performance.
	14
	15	Merge Debian patches that seem relevant.
	16
	17	Go through patches in Savannah.
	18
	19	Fix --directories=read.
	20
	21	Write better Texinfo documentation for grep. The manual page would be a
	22	good place to start, but Info documents are also supposed to contain a
	23	tutorial and examples.
	24
	25	Some tests in tests/spencer2.tests should have failed! Need to filter out
	26	some bugs in dfa.[ch]/regex.[ch].
	27
	28	Multithreading?
	29
	30	GNU grep originally did 32-bit arithmetic. Although it has moved to
	31	64-bit on 64-bit platforms by using types like ptrdiff_t and size_t,
	32	this conversion has not been entirely systematic and should be checked.
	33
	34	Lazy dynamic linking of libpcre. See Debianâs 03-397262-dlopen-pcre.patch.
	35
	36	Check FreeBSDâs integration of zgrep (-Z) and bzgrep (-J) in one
	37	binary. Is there a possibility of doing even better by automatically
	38	checking the magic of binary files ourselves (0x1F 0x8B for gzip, 0x1F
	39	0x9D for compress, and 0x42 0x5A 0x68 for bzip2)? Once what to do with
	40	libpcre is decided, do the same for libz and libbz2.
	41
	42
	43
	44	===================
	45	Matching algorithms
	46	===================
	47
	48	Take a look at these and consider opportunities for merging or cloning:
	49
	50	-- http://osrd.org/projects/grep/global-regular-expression-print-tools-grep-variants
	51	-- ja-grepâs mlb2 patch (Japanese grep)
	52	<http://distcache.freebsd.org/ports-distfiles/grep-2.4.2-mlb2.patch.gz>
	53	-- lgrep (from lv, a Powerful Multilingual File Viewer / Grep)
	54	<http://www.mt.cs.keio.ac.jp/person/narita/lv/>;
	55	-- cgrep (Context grep) <https://awgn.github.io/cgrep/>
	56	seems like nice work;
	57	-- sgrep (Struct grep) <https://www.cs.helsinki.fi/u/jjaakkol/sgrep.html>;
	58	-- agrep (Approximate grep) <https://www.tgries.de/agrep/>,
	59	from glimpse;
	60	-- nr-grep (Nondeterministic reverse grep)
	61	<https://www.dcc.uchile.cl/~gnavarro/software/>;
	62	-- ggrep (Grouse grep) <http://www.grouse.com.au/ggrep/>;
	63	-- freegrep <https://github.com/howardjp/freegrep>;
	64
	65	Check some new algorithms for matching. See, for example, Faro &
	66	Lecroq (cited in kwset.c).
	67
	68	Fix the DFA matcher to never use exponential space. (Fortunately, these
	69	cases are rare.)
	70
	71
	72
	73	============================
	74	Standards: POSIX and Unicode
	75	============================
	76
	77	For POSIX compliance issues, see POSIX 1003.1.
	78
	79	Current support for the POSIX [= =] and [. .] constructs is limited to
	80	platforms whose regular expression matchers are sufficiently
	81	compatible with the GNU C library so that the --without-included-regex
	82	option of âconfigureâ is in effect. Extend this support to non-glibc
	83	platforms, where --with-included-regex is in effect, by modifying the
	84	included version of the regex code to defer to the native version when
	85	handling [= =] and [. .].
	86
	87	For Unicode, interesting things to check include the Unicode Standard
	88	<https://www.unicode.org/standard/standard.html> and the Unicode Technical
	89	Standard #18 (<https://www.unicode.org/reports/tr18/> âUnicode Regular
	90	Expressionsâ). Talk to Bruno Haible whoâs maintaining GNU libunistring.
	91	See also Unicode Standard Annex #15 (<https://www.unicode.org/reports/tr15/>
	92	âUnicode Normalization Formsâ), already implemented by GNU libunistring.
	93
	94	In particular, --ignore-case needs to be evaluated against the standards.
	95	We may want to deviate from POSIX if Unicode provides better or clearer
	96	semantics.
	97
	98	POSIX and --ignore-case
	99	-----------------------
	100
	101	For this issue, interesting things to check in POSIX include the
	102	Open Group Base Specifications, Chapter âRegular Expressionsâ, in
	103	particular Section âRegular Expression General Requirementsâ and its
	104	paragraph about caseless matching (this may not have been fully
	105	thought through and that this text may be self-contradicting
	106	[specifically: âof either data or patternsâ versus all the rest]).
	107	See:
	108
	109	http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_02
	110
	111	In particular, consider the following with POSIXâs approach to case
	112	folding in mind. Assume a non-Turkic locale with a character
	113	repertoire reduced to the following various forms of âLATIN LETTER Iâ:
	114
	115	0049;LATIN CAPITAL LETTER I;Lu;0;L;;;;;N;;;;0069;
	116	0069;LATIN SMALL LETTER I;Ll;0;L;;;;;N;;;0049;;0049
	117	0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu;0;L;0049 0307;;;;N;\
	118	LATIN CAPITAL LETTER I DOT;;;0069;
	119	0131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;0049
	120
	121	UTF-8 octet lengths differ between U+0049 (0x49) and U+0069 (0x69)
	122	versus U+0130 (0xC4 0xB0) and U+0131 (0xC4 0xB1). This implies that
	123	whole UTF-8 strings cannot be case-converted in place, using the same
	124	memory buffer, and that the needed octet-size of the new buffer cannot
	125	merely be guessed (although thereâs a simple upper bound of five times
	126	the size of the input, as the longest UTF-8 encoding of any character
	127	is five bytes).
	128
	129	We have
	130
	131	lc(I) = i, uc(I) = I
	132	lc(i) = i, uc(i) = I
	133	lc(Ä°) = i, uc(Ä°) = Ä°
	134	lc(Ä±) = Ä±, uc(Ä±) = I
	135
	136	where lc() and uc() denote lower-case and upper-case conversions.
	137
	138	There are several candidate --ignore-case logics. Using the
	139
	140	if (lc(input_wchar) == lc(pattern_wchar))
	141
	142	logic leads to the following matches:
	143
	144	\in I i Ä° Ä±
	145	pat\ ----------
	146	I \| Y Y Y n
	147	i \| Y Y Y n
	148	Ä° \| Y Y Y n
	149	Ä± \| n n n Y
	150
	151	There is a lack of symmetry between CAPITAL and SMALL LETTERs with
	152	this. Using the
	153
	154	if (uc(input_wchar) == uc(pattern_wchar))
	155
	156	logic (which is what GNU grep currently does although this is not
	157	documented or guaranteed in the future), leads to the following
	158	matches:
	159
	160	\in I i Ä° Ä±
	161	pat\ ----------
	162	I \| Y Y n Y
	163	i \| Y Y n Y
	164	Ä° \| n n Y n
	165	Ä± \| Y Y n Y
	166
	167	There is a lack of symmetry between CAPITAL and SMALL LETTERs with
	168	this.
	169
	170	Using the
	171
	172	if (lc(input_wchar) == lc(pattern_wchar)
	173	\|\| uc(input_wchar) == uc(pattern_wchar))
	174
	175	logic leads to the following matches:
	176
	177	\in I i Ä° Ä±
	178	pat\ ----------
	179	I \| Y Y Y Y
	180	i \| Y Y Y Y
	181	Ä° \| Y Y Y n
	182	Ä± \| Y Y n Y
	183
	184	There is some elegance and symmetry with this. But there are
	185	potentially two conversions to be made per input character. If the
	186	pattern is pre-converted, two copies of it need to be kept and used in
	187	a mutually coherent fashion.
	188
	189	Using the
	190
	191	if (input_wchar == pattern_wchar
	192	\|\| lc(input_wchar) == pattern_wchar
	193	\|\| uc(input_wchar) == pattern_wchar)
	194
	195	logic (a plausible interpretation of POSIX) leads to the following
	196	matches:
	197
	198	\in I i Ä° Ä±
	199	pat\ ----------
	200	I \| Y Y n Y
	201	i \| Y Y Y n
	202	Ä° \| n n Y n
	203	Ä± \| n n n Y
	204
	205	There is a different CAPITAL/SMALL symmetry with this. But thereâs
	206	also a loss of pattern/input symmetry thatâs unique to it. Also there
	207	are potentially two conversions to be made per input character.
	208
	209	Using the
	210
	211	if (lc(uc(input_wchar)) == lc(uc(pattern_wchar)))
	212
	213	logic leads to the following matches:
	214
	215	\in I i Ä° Ä±
	216	pat\ ----------
	217	I \| Y Y Y Y
	218	i \| Y Y Y Y
	219	Ä° \| Y Y Y Y
	220	Ä± \| Y Y Y Y
	221
	222	This shows total symmetry and transitivity (at least in this example
	223	analysis). There are two conversions to be made per input character,
	224	but support could be added for having a single straight mapping
	225	performing a composition of the two conversions.
	226
	227	Any optimization in the implementation of each logic must not change
	228	its basic semantic.
	229
	230
	231	Unicode and --ignore-case
	232	-------------------------
	233
	234	For this issue, interesting things to check in Unicode include:
	235
	236	- The Unicode Standard, Chapter 3
	237	(<https://www.unicode.org/versions/Unicode9.0.0/ch03.pdf>
	238	âConformanceâ), Section 3.13 (âDefault Case Algorithmsâ) and the
	239	toCasefold() case conversion operation.
	240
	241	- The Unicode Standard, Chapter 4
	242	(<https://www.unicode.org/versions/Unicode9.0.0/ch04.pdf>
	243	âCharacter Propertiesâ), Section 4.2 (âCaseâ) and
	244	the <https://www.unicode.org/Public/UNIDATA/SpecialCasing.txt>
	245	SpecialCasing.txt and
	246	<https://www.unicode.org/Public/UNIDATA/CaseFolding.txt>
	247	CaseFolding.txt files.
	248
	249	- The Unicode Standard, Chapter 5
	250	(<https://www.unicode.org/versions/Unicode9.0.0/ch05.pdf>
	251	âImplementation Guidelinesâ), Section 5.18 (âCase Mappingsâ),
	252	Subsection âCaseless Matchingâ.
	253
	254	- The Unicode case charts <https://www.unicode.org/charts/case/>.
	255
	256	Unicode uses the
	257
	258	if (toCasefold(input_wchar_string) == toCasefold(pattern_wchar_string))
	259
	260	logic for caseless matching. Consider the âLATIN LETTER Iâ example
	261	mentioned above. In a non-Turkic locale, simple case folding yields
	262
	263	toCasefold_simple(U+0049) = U+0069
	264	toCasefold_simple(U+0069) = U+0069
	265	toCasefold_simple(U+0130) = U+0130
	266	toCasefold_simple(U+0131) = U+0131
	267
	268	which leads to the following matches:
	269
	270	\in I i Ä° Ä±
	271	pat\ ----------
	272	I \| Y Y n n
	273	i \| Y Y n n
	274	Ä° \| n n Y n
	275	Ä± \| n n n Y
	276
	277	This is different from anything so far!
	278
	279	In a non-Turkic locale, full case folding yields
	280
	281	toCasefold_full(U+0049) = U+0069
	282	toCasefold_full(U+0069) = U+0069
	283	toCasefold_full(U+0130) = <U+0069, U+0307>
	284	toCasefold_full(U+0131) = U+0131
	285
	286	with
	287
	288	0307;COMBINING DOT ABOVE;Mn;230;NSM;;;;;N;NON-SPACING DOT ABOVE;;;;
	289
	290	which leads to the following matches:
	291
	292	\in I i Ä° Ä±
	293	pat\ ----------
	294	I \| Y Y * n
	295	i \| Y Y * n
	296	Ä° \| n n Y n
	297	Ä± \| n n n Y
	298
	299	This is just sad!
	300
	301	Having toCasefold(U+0131), simple or full, map to itself instead of
	302	U+0069 is in contradiction with the rules of Section 5.18 of the
	303	Unicode Standard since toUpperCase(U+0131) is U+0049. Same thing for
	304	toCasefold_simple(U+0130) since toLowerCase(U+0131) is U+0069. The
	305	justification for the weird toCasefold_full(U+0130) mapping is
	306	unknown; it doesnât even make sense to add a dot (U+0307) to a letter
	307	that already has one (U+0069). It would have been so simple to put
	308	them all in the same equivalence class!
	309
	310	Otherwise, also consider the following problem with Unicodeâs approach
	311	on case folding in mind. Assume that we want to perform
	312
	313	echo 'AÃBC' \| grep -i 'Sb'
	314
	315	which corresponds to
	316
	317	input: U+0041 U+00DF U+0042 U+0043 U+000A
	318	pattern: U+0053 U+0062
	319
	320	Following CaseFolding.txt, applying the toCasefold() transformation to
	321	these yields
	322
	323	input: U+0061 U+0073 U+0073 U+0062 U+0063 U+000A
	324	pattern: U+0073 U+0062
	325
	326	so, according to this approach, the input should match the pattern.
	327	As long as the original input line is to be reported to the user as a
	328	whole, there is no problem (from the userâs point-of-view;
	329	implementation is complicated by this).
	330
	331	However, consider both these GNU extensions:
	332
	333	echo 'AÃBC' \| grep -i --only-matching 'Sb'
	334	echo 'AÃBC' \| grep -i --color=always 'Sb'
	335
	336	What is to be reported in these cases, since the match begins in the
	337	middle of the original input character âÃâ?
	338
	339	Unicodeâs toCasefold() cannot be implemented in terms of POSIXâs
	340	towctrans() since that can only return a single wint_t value per input
	341	wint_t value.

Note: See TracBrowser for help on using the repository browser.

Download in other formats:

Original Format