Compression Using Grammatical Models
Authors | Title | Description
Kurapova E.V., Ryabko B.Ya. | Application of Formal Grammars to the Coding of Information Sources | Coding of database files and program libraries using a source model that corresponds to a given context-free grammar. The current grammar state determines which code is used; in other words, this is context modeling in its pure form.
Problems of Information Transmission, vol. 31, no. 1, 1995, pp. 28-32. PDF 344 KB
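The scheme described above is compact enough to sketch: a parser over the grammar tracks a state, and each state owns its own adaptive symbol statistics. A minimal illustration follows; the toy key=value grammar, byte alphabet, and Laplace smoothing are our assumptions, not the paper's.

```python
import math
from collections import defaultdict

# Toy regular grammar for records like "key=value;", hard-coded as a
# state machine. Each grammar state selects its own probability model.
TRANSITIONS = {            # state -> symbol class -> next state
    "KEY":   {"=": "VALUE", "alpha": "KEY"},
    "VALUE": {";": "KEY", "alpha": "VALUE", "digit": "VALUE"},
}

def classify(ch):
    if ch.isdigit(): return "digit"
    if ch.isalpha(): return "alpha"
    return ch

def code_length(text):
    """Ideal adaptive code length in bits, with one Laplace-smoothed
    frequency table per grammar state; a real implementation would feed
    these same probabilities to an arithmetic coder."""
    freq = defaultdict(lambda: defaultdict(int))   # state -> symbol -> count
    state, bits = "KEY", 0.0
    for ch in text:
        table = freq[state]
        total = sum(table.values()) + 256          # +1 smoothing, byte alphabet
        bits += -math.log2((table[ch] + 1) / total)
        table[ch] += 1
        state = TRANSITIONS[state].get(classify(ch), state)
    return bits

data = "name=ivan;age=42;city=omsk;" * 20
print(f"{code_length(data) / (8 * len(data)):.3f} of original size")
```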
Lehman E. | Approximation Algorithms for Grammar-Based Data Compression | This thesis considers the smallest grammar problem: find the smallest context-free grammar that generates exactly one given string. We show that this problem is intractable, and so our objective is to find approximation algorithms. This simple question is connected to many areas of research. Most importantly, there is a link to data compression; instead of storing a long string, one can store a small grammar that generates it. A small grammar for a string also naturally brings out underlying patterns, a fact that is useful, for example, in DNA analysis. Moreover, the size of the smallest context-free grammar generating a string can be regarded as a computable relaxation of Kolmogorov complexity... In this thesis, we establish hardness results, evaluate several previously proposed algorithms, and then present new procedures with much stronger approximation guarantees.
Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, February 2002. PDF 682 KB PS.RAR 334 KB
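For intuition about why a grammar can serve as a compressed representation: a straight-line grammar (exactly one rule per nonterminal, deriving exactly one string) can be far smaller than the string it generates. A hypothetical toy example:

```python
# A minimal illustration (invented example, not from the thesis) of the
# storage idea behind the smallest grammar problem: a straight-line
# grammar expands to exactly one string.
GRAMMAR = {  # each right-hand side is a list of terminals / nonterminals
    "S": ["A", "A", "A", "A"],
    "A": ["B", "B", "ab"],
    "B": ["ab", "ab"],
}

def expand(symbol, grammar):
    """Recursively expand a nonterminal into the unique string it derives."""
    if symbol not in grammar:          # terminal: already a literal string
        return symbol
    return "".join(expand(s, grammar) for s in grammar[symbol])

text = expand("S", GRAMMAR)
grammar_size = sum(len(rhs) for rhs in GRAMMAR.values())
print(text)                     # "ab" * 20
print(len(text), grammar_size)  # 40 string symbols vs 9 grammar symbols
```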
Bach J., Witten I.H. | Lexical Attraction for Text Compression | Some observations on how words in a text depend on non-adjacent words. Curious, but very far from a practical implementation. New methods of acquiring structural information in text documents may support better compression by identifying an appropriate prediction context for each symbol. The method of "lexical attraction" infers syntactic dependency structures from statistical analysis of large corpora. We describe the generation of a lexical attraction model, discuss its application to text compression, and explore its potential to outperform fixed-context models such as word-level PPM. Perhaps the most exciting aspect of this work is the prospect of using compression as a metric for structure discovery in text.
Proceedings of the 1999 IEEE Data Compression Conference, Snowbird, Utah, March 1999. HTML 54 KB PDF 69 KB PS.RAR 50 KB
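For the flavour of the raw signal involved, here is a hedged sketch that scores word pairs by pointwise mutual information within a co-occurrence window. The paper induces full dependency structures from corpus statistics of this kind; the corpus, window size, and count threshold below are invented for illustration.

```python
import math
from collections import Counter

def pmi_table(sentences, window=5):
    """Pointwise mutual information of ordered word pairs co-occurring
    within `window` positions; pairs seen fewer than twice are dropped."""
    word_n, pair_n = Counter(), Counter()
    n_words = n_pairs = 0
    for words in sentences:
        word_n.update(words)
        n_words += len(words)
        for i, w in enumerate(words):
            for v in words[i + 1:i + window]:
                pair_n[(w, v)] += 1
                n_pairs += 1
    def pmi(w, v, c):
        return math.log2((c / n_pairs) /
                         ((word_n[w] / n_words) * (word_n[v] / n_words)))
    return {(w, v): pmi(w, v, c) for (w, v), c in pair_n.items() if c >= 2}

sents = [s.split() for s in (
    "the dog chased the cat",
    "the dog bit the cat",
    "a dog likes a bone",
)]
for pair, score in sorted(pmi_table(sents).items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```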
Charikar M., Lehman E., Liu D., Panigrahy R., Prabhakaran M., Rasala A., Sahai A., Shelat A. | Approximating the Smallest Grammar: Kolmogorov Complexity in Natural Models | We consider the problem of finding the smallest context-free grammar that generates exactly one given string of length n. The size of this grammar is of theoretical interest as an efficiently computable variant of Kolmogorov complexity. The problem is of practical importance in areas such as data compression and pattern extraction. The smallest grammar is known to be hard to approximate to within a constant factor... Our main result is an exponential improvement of this ratio; we give an O(log(n/g*)) approximation algorithm, where g* is the size of the smallest grammar...
February 20, 2002. PDF 166 KB
Bookstein A., Klein Sh. | Compression, Information Theory and Grammars: A Unified Approach | ...We here propose the notion of a formal grammar as a flexible model of text generation that encompasses most of the models offered before as well as, in principle, extending the possibility of compression to a much more general class of languages. Assuming a general model of text generation, a derivation is given of the well-known Shannon entropy formula, making possible a theory of information based upon text representation rather than on communication. The ideas are shown to apply to a number of commonly used text models. Finally, we focus on a Markov model of text generation, suggest an information-theoretic measure of similarity between two probability distributions, and develop a clustering algorithm based on this measure...
Center for Information and Language Studies, University of Chicago, January 1990. PDF 297 KB PS.RAR 79 KB
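The abstract does not spell out the measure or the model, so the following sketch is only representative: the entropy rate of a first-order Markov text model, plus the symmetrized Kullback-Leibler (Jeffreys) divergence as one plausible information-theoretic similarity between two distributions.

```python
import math
from collections import Counter

def markov_entropy_rate(text):
    """Bits per symbol under an empirical first-order Markov model."""
    pairs = Counter(zip(text, text[1:]))
    ctx = Counter(text[:-1])
    n = len(text) - 1
    return sum(c * -math.log2(c / ctx[a]) for (a, _), c in pairs.items()) / n

def jeffreys(p, q, eps=1e-9):
    """Symmetrized KL divergence; eps is crude smoothing for zero entries."""
    get = lambda d, k: d.get(k, eps)
    return sum((get(p, k) - get(q, k)) * math.log2(get(p, k) / get(q, k))
               for k in set(p) | set(q))

print(markov_entropy_rate("ab" * 32))   # 0.0: fully predictable source
print(markov_entropy_rate("the quick brown fox jumps over the lazy dog"))
print(round(jeffreys({"a": 0.9, "b": 0.1}, {"a": 0.5, "b": 0.5}), 3))
```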
Hutchens J.L. | The UpWrite Predictor: A General Grammatical Inference Engine for Symbolic Time Series with Applications in Natural Language Acquisition and Data Compression | This dissertation is concerned with the development of a general Grammatical Inference Engine for symbolic time series -- a device which is capable of automatically constructing a predictive model from arbitrary symbolic data. We are particularly interested in the situation where the data is natural language text... The UpWrite Predictor may potentially be of use in many applications, and we explore one of these, data compression, in this dissertation. We show that the performance of a standard PPMC data compressor may be improved by using the UpWrite Predictor to abstract the data prior to compression taking place. During our exploration of the data compression problem, we introduce several modifications and additions to the traditional PPM technique...
The University of Western Australia, December 1999. PDF 1432 KB PS.RAR 417 KB
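A toy version of "abstracting the data prior to compression" (emphatically not the UpWrite Predictor itself): map words to one-byte tokens and let zlib stand in for the back-end compressor. The token stream alone is not decodable without the lexicon; this only shows the shape of the pipeline.

```python
import zlib

def tokenize(text):
    """Rewrite text as one byte per word (toy limit: < 256 distinct words)."""
    lexicon, out = {}, bytearray()
    for w in text.split():
        out.append(lexicon.setdefault(w, len(lexicon)))
    return bytes(out), lexicon

text = ("the cat sat on the mat and the dog sat on the rug " * 50).strip()
tokens, lexicon = tokenize(text)
print(len(zlib.compress(text.encode())), len(zlib.compress(tokens)))
```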
Nevill-Manning C. | Inferring Sequential Structure | A dissertation on SEQUITUR, an algorithm for the automatic, adaptive construction of context-free grammars. Structure exists in sequences ranging from human language and music to the genetic information encoded in our DNA. This thesis shows how that structure can be discovered automatically and made explicit... We develop simple, robust techniques that reveal interesting structure in a wide range of real-world sequences including music, English text, descriptions of plants, graphical figures, DNA sequences, word classes in language, a genealogical database, programming languages, execution traces, and diverse sequences from a data compression corpus. These techniques are useful: they produce comprehensible visual explanations of structure, identify morphological units and significant phrases, compress data, optimise graphical rendering, infer recursive grammars by recognising self-similarity, and infer programs.
University of Waikato, May 1996. PDF 787 KB
Nevill-Manning C., Witten I., Maulsby D. | Compression by Induction of Hierarchical Grammars | On using adaptively constructed context-free grammars for data compression. The algorithm described here later evolved into SEQUITUR.
Computer Science, University of Waikato, Hamilton, New Zealand, 1994. PDF 44 KB
Nevill-Manning C., Witten I., Olsen D. | Compressing semi-structured text using hierarchical phrase identification | Another paper on SEQUITUR. ...This paper takes a compression scheme that infers a hierarchical grammar from its input, and investigates its application to semi-structured text. Although there is a huge range and variety of data that comes within the ambit of "semi-structured," we focus attention on a particular, and very large, example of such text. Consequently the work is a case study of the application of grammar-based compression to a large-scale problem...
Computer Science, University of Waikato, Hamilton, New Zealand, 1996. PDF 53 KB
Nevill-Manning C., Witten I. | Online and offline heuristics for inferring hierarchies of repetitions in sequences | Yet another paper on SEQUITUR -- a swan song of sorts. Hierarchical dictionary-based compression schemes form a grammar for a text by replacing each repeated string with a production rule... In this paper, we compare the online method with three offline heuristics for selecting the next substring to replace: longest string first, most common string first, and the string that minimizes the size of the input... We evaluate each technique on artificial and natural sequences. In general, the locally-most-compressive heuristic performs best, followed by most frequent, the online technique, and, lagging by some distance, the longest-first technique.
2000. PDF 584 KB
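A hedged sketch of two of the offline heuristics named above, applied to plain lowercase strings with single-character nonterminals and with rewriting restricted to the top-level rule for brevity (the paper's procedures are more general):

```python
import re

def repeated_substrings(s):
    """Substrings of length >= 2 occurring >= 2 times (non-overlapping)."""
    found = {}
    for length in range(len(s) // 2, 1, -1):
        for i in range(len(s) - length + 1):
            sub = s[i:i + length]
            if sub not in found:
                n = len(re.findall(re.escape(sub), s))  # non-overlapping count
                if n >= 2:
                    found[sub] = n
    return found

def build_grammar(text, pick):
    """Repeatedly turn the substring chosen by `pick` into a new rule."""
    rules, s, names = {}, text, iter("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
    while True:
        cands = repeated_substrings(s)
        if not cands:
            break
        sub, nt = pick(cands), next(names)
        rules[nt] = sub
        s = s.replace(sub, nt)
        # factor the new rule out of earlier right-hand sides too
        rules = {k: (v if k == nt else v.replace(sub, nt))
                 for k, v in rules.items()}
    rules["S"] = s
    return rules

longest  = lambda c: max(c, key=len)                       # longest string first
frequent = lambda c: max(c, key=lambda k: (c[k], len(k)))  # most common first

text = "abcabcabcabcxyzxyz"
for name, pick in [("longest", longest), ("most frequent", frequent)]:
    g = build_grammar(text, pick)
    print(name, sum(len(r) for r in g.values()), g)
```

On this input the most-frequent heuristic yields the smaller grammar, consistent with the general ordering the paper reports.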
Evans W. | Compression via Guided Parsing | Not especially successful experiments on compressing source code by splitting it into token streams according to the rules of the language's context-free grammar. The author also investigated lossy compression of incidental information (formatting, comments).
1997. PDF 183 KB PS.RAR 66 KB
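A rough sketch of the stream-splitting idea, with a regex lexer standing in for the grammar-driven parser (the token classes and the code fragment are invented for illustration): compress each token-class stream separately and compare against compressing the source whole. The discarded whitespace is exactly the kind of incidental information the experiments treated lossily.

```python
import re
import zlib

SOURCE = b"""
int add(int a, int b) { return a + b; }
int mul(int a, int b) { return a * b; }
""" * 20

# Split the source into per-class token streams.
streams = {
    "idents":  b" ".join(re.findall(rb"[A-Za-z_]\w*", SOURCE)),
    "numbers": b" ".join(re.findall(rb"\d+", SOURCE)),
    "punct":   b"".join(re.findall(rb"[^\w\s]", SOURCE)),
}
split_size = sum(len(zlib.compress(s)) for s in streams.values())
print(split_size, len(zlib.compress(SOURCE)))
```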
Kieffer J., Yang E. | Grammar Based Codes: A New Class of Universal Lossless Source Codes | We investigate a type of lossless source code called a grammar based code, which, in response to any input data string x over a fixed finite alphabet, selects a context-free grammar Gx representing x in the sense that x is the unique string belonging to the language generated by Gx. Lossless compression of x takes place indirectly via compression of the production rules of the grammar Gx. It is shown that, subject to some mild restrictions, a grammar based code is a universal code with respect to the family of finite state information sources over the finite alphabet. Redundancy bounds for grammar based codes are established. Reduction rules for designing grammar based codes are presented.
2000. PDF 358 KB PS.RAR 98 KB
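The indirect coding step can be sketched as follows: linearize the production rules of a (hypothetical) grammar Gx into one symbol sequence and entropy-code that sequence instead of x itself. An order-0 ideal code length stands in for the real coder, and the rule terminator and rule ordering below are our own conventions.

```python
import math
from collections import Counter

GRAMMAR = {"S": ["A", "A", "B"], "A": ["B", "B", "c"], "B": ["a", "b"]}

def linearize(grammar):
    """Concatenate right-hand sides in a fixed rule order; '#' ends a rule,
    so left-hand sides are implicit and the grammar is recoverable."""
    seq = []
    for lhs in sorted(grammar):
        seq.extend(grammar[lhs] + ["#"])
    return seq

def order0_bits(seq):
    """Ideal order-0 code length of the rule sequence, in bits."""
    n, counts = len(seq), Counter(seq)
    return sum(c * -math.log2(c / n) for c in counts.values())

seq = linearize(GRAMMAR)
print(seq, round(order0_bits(seq), 1))
```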
Lake M. | Prediction by Grammatical Match | A rather hazy description of the author's PGM (Prediction by Grammatical Match) algorithm, which couples a parser driven by predefined left-linear grammars with PPM. For small model orders the result beats PPM in both compression and memory use.
Proceedings of the IEEE Data Compression Conference, Snowbird, Utah, March 28-30, 2000. PDF.RAR 135 KB
See also:
- Prediction by Partial Matching (PPM)
- Context modeling compression methods (other than PPM)
- Text compression
- Natural language modeling
- Text classification and tagging using data compression methods
- Surveys of universal data compression algorithms
- The book "Data Compression Methods", Section 1, "Lossless Compression Methods"