% libcsv.tex
% Source: vendor/python/2.5/Doc/lib/libcsv.tex
% Last change on this file was 3225, checked in by bird, 18 years ago
% Python 2.5
\section{\module{csv} --- CSV File Reading and Writing}

\declaremodule{standard}{csv}
\modulesynopsis{Write and read tabular data to and from delimited files.}
\sectionauthor{Skip Montanaro}{skip@pobox.com}

\versionadded{2.3}
\index{csv}
\indexii{data}{tabular}

The so-called CSV (Comma Separated Values) format is the most common import
and export format for spreadsheets and databases. There is no ``CSV
standard'', so the format is operationally defined by the many applications
which read and write it. The lack of a standard means that subtle
differences often exist in the data produced and consumed by different
applications. These differences can make it annoying to process CSV files
from multiple sources. Still, while the delimiters and quoting characters
vary, the overall format is similar enough that it is possible to write a
single module which can efficiently manipulate such data, hiding the details
of reading and writing the data from the programmer.

The \module{csv} module implements classes to read and write tabular data in
CSV format. It allows programmers to say, ``write this data in the format
preferred by Excel,'' or ``read data from this file which was generated by
Excel,'' without knowing the precise details of the CSV format used by
Excel. Programmers can also describe the CSV formats understood by other
applications or define their own special-purpose CSV formats.

The \module{csv} module's \class{reader} and \class{writer} objects read and
write sequences. Programmers can also read and write data in dictionary
form using the \class{DictReader} and \class{DictWriter} classes.

\begin{notice}
This version of the \module{csv} module doesn't support Unicode
input. Also, there are currently some issues regarding \ASCII{} NUL
characters. Accordingly, all input should be UTF-8 or printable
\ASCII{} to be safe; see the examples in section~\ref{csv-examples}.
These restrictions will be removed in the future.
\end{notice}

\begin{seealso}
% \seemodule{array}{Arrays of uniformly types numeric values.}
\seepep{305}{CSV File API}
{The Python Enhancement Proposal which proposed this addition
to Python.}
\end{seealso}


\subsection{Module Contents \label{csv-contents}}

The \module{csv} module defines the following functions:

\begin{funcdesc}{reader}{csvfile\optional{,
dialect=\code{'excel'}}\optional{, fmtparam}}
Return a reader object which will iterate over lines in the given
{}\var{csvfile}. \var{csvfile} can be any object which supports the
iterator protocol and returns a string each time its \method{next}
method is called --- file objects and list objects are both suitable.
If \var{csvfile} is a file object, it must be opened with
the 'b' flag on platforms where that makes a difference. An optional
{}\var{dialect} parameter can be given
which is used to define a set of parameters specific to a particular CSV
dialect. It may be an instance of a subclass of the \class{Dialect}
class or one of the strings returned by the \function{list_dialects}
function. The other optional {}\var{fmtparam} keyword arguments can be
given to override individual formatting parameters in the current
dialect. For full details about the dialect and formatting
parameters, see section~\ref{csv-fmt-params}, ``Dialects and Formatting
Parameters''.

All data read are returned as strings. No automatic data type
conversion is performed unless the \constant{QUOTE_NONNUMERIC} quoting
option is in effect.

\versionchanged[
The parser is now stricter with respect to multi-line quoted
fields. Previously, if a line ended within a quoted field without a
terminating newline character, a newline would be inserted into the
returned field. This behavior caused problems when reading files
which contained carriage return characters within fields. The
behavior was changed to return the field without inserting newlines. As
a consequence, if newlines embedded within fields are important, the
input should be split into lines in a manner which preserves the newline
characters]{2.5}

\end{funcdesc}
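
As a rough illustration of the description above, the following sketch
passes a plain list of strings as \var{csvfile} and overrides the
\var{delimiter} formatting parameter of the default dialect. The sample
data is invented for the example, and the parenthesized \code{print}
call is used only so that the sketch does not depend on one interpreter
version.

```python
import csv

# Any iterable that yields strings can act as csvfile; this list of
# made-up lines stands in for an open file.
lines = ['name|qty|price', 'widget|3|4.50', '"gadget|large"|1|10']

# delimiter is a formatting parameter; it overrides the value taken
# from the default 'excel' dialect.
rows = [row for row in csv.reader(lines, delimiter='|')]

for row in rows:
    print(row)
```

Every field comes back as a string, including \code{'3'} and
\code{'4.50'}; any numeric conversion is left to the caller.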

\begin{funcdesc}{writer}{csvfile\optional{,
dialect=\code{'excel'}}\optional{, fmtparam}}
Return a writer object responsible for converting the user's data into
delimited strings on the given file-like object. \var{csvfile} can be any
object with a \function{write} method. If \var{csvfile} is a file object,
it must be opened with the 'b' flag on platforms where that makes a
difference. An optional
{}\var{dialect} parameter can be given which is used to define a set of
parameters specific to a particular CSV dialect. It may be an instance
of a subclass of the \class{Dialect} class or one of the strings
returned by the \function{list_dialects} function. The other optional
{}\var{fmtparam} keyword arguments can be given to override individual
formatting parameters in the current dialect. For full details
about the dialect and formatting parameters, see
section~\ref{csv-fmt-params}, ``Dialects and Formatting Parameters''.
To make it as easy as possible to
interface with modules which implement the DB API, the value
\constant{None} is written as the empty string. While this isn't a
reversible transformation, it makes it easier to dump SQL NULL data values
to CSV files without preprocessing the data returned from a
\code{cursor.fetch*()} call. All other non-string data are stringified
with \function{str()} before being written.
\end{funcdesc}
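
The sketch below is one way to observe the behavior just described. It
relies only on the statement that \var{csvfile} may be any object with a
\function{write} method; the \class{ListSink} class is a hypothetical
helper written for this example and is not part of the \module{csv}
module.

```python
import csv

class ListSink(object):
    """Hypothetical stand-in for a file: collects whatever is written."""
    def __init__(self):
        self.chunks = []

    def write(self, text):
        self.chunks.append(text)

sink = ListSink()
writer = csv.writer(sink)

# None becomes an empty field; the numbers are converted to strings.
writer.writerow(['spam', None, 42, 1.5])
# Fields containing the delimiter or the quote character get quoted.
writer.writerow(['a,b', 'say "hi"'])

output = ''.join(sink.chunks)
```

With the default dialect each row ends in \code{'\e r\e n'}, and the
embedded quote characters in the second row are doubled.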

\begin{funcdesc}{register_dialect}{name\optional{, dialect}\optional{, fmtparam}}
Associate \var{dialect} with \var{name}. \var{name} must be a string
or Unicode object. The dialect can be specified either by passing a
subclass of \class{Dialect}, or by \var{fmtparam} keyword arguments,
or both, with keyword arguments overriding parameters of the dialect.
For full details about the dialect and formatting parameters, see
section~\ref{csv-fmt-params}, ``Dialects and Formatting Parameters''.
\end{funcdesc}

\begin{funcdesc}{unregister_dialect}{name}
Delete the dialect associated with \var{name} from the dialect registry. An
\exception{Error} is raised if \var{name} is not a registered dialect
name.
\end{funcdesc}

\begin{funcdesc}{get_dialect}{name}
Return the dialect associated with \var{name}. An \exception{Error} is
raised if \var{name} is not a registered dialect name.
\end{funcdesc}

\begin{funcdesc}{list_dialects}{}
Return the names of all registered dialects.
\end{funcdesc}

\begin{funcdesc}{field_size_limit}{\optional{new_limit}}
Return the current maximum field size allowed by the parser. If
\var{new_limit} is given, this becomes the new limit.
\versionadded{2.5}
\end{funcdesc}
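
The registry functions above can be exercised together, as in this
minimal sketch. The dialect name \code{'pipes'} and the sample row are
made up for the example.

```python
import csv

# Register a dialect purely from formatting parameters.
csv.register_dialect('pipes', delimiter='|', quoting=csv.QUOTE_NONE)
registered = 'pipes' in csv.list_dialects()

# Look the dialect up by name and use the name with a reader.
dialect = csv.get_dialect('pipes')
rows = list(csv.reader(['a|b|c'], 'pipes'))

# After unregistering, a lookup by the same name raises csv.Error.
csv.unregister_dialect('pipes')
try:
    csv.get_dialect('pipes')
    lookup_failed = False
except csv.Error:
    lookup_failed = True
```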


The \module{csv} module defines the following classes:

\begin{classdesc}{DictReader}{csvfile\optional{,
fieldnames=\constant{None}\optional{,
restkey=\constant{None}\optional{,
restval=\constant{None}\optional{,
dialect=\code{'excel'}\optional{,
*args, **kwds}}}}}}
Create an object which operates like a regular reader but maps the
information read into a dict whose keys are given by the optional
{}\var{fieldnames}
parameter. If the \var{fieldnames} parameter is omitted, the values in
the first row of the \var{csvfile} will be used as the fieldnames.
If the row read has more fields than the fieldnames sequence, the
remaining data is added as a sequence keyed by the value of
\var{restkey}. If the row read has fewer fields than the fieldnames
sequence, the remaining keys take the value of the optional
\var{restval} parameter. Any other optional or
keyword arguments are passed to the underlying \class{reader} instance.
\end{classdesc}
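
A short sketch of how \var{restkey} and \var{restval} come into play,
assuming invented input in which one row is too long and one is too
short. The key name \code{'overflow'} and the filler \code{'n/a'} are
arbitrary choices for the example.

```python
import csv

# fieldnames is omitted, so the first line supplies the keys.
lines = ['id,name', '1,alice,extra1,extra2', '2']

reader = csv.DictReader(lines, restkey='overflow', restval='n/a')
records = [dict(row) for row in reader]

# The long row keeps its surplus fields as a list under restkey.
long_row = records[0]
# The short row has its missing key filled in with restval.
short_row = records[1]
```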


\begin{classdesc}{DictWriter}{csvfile, fieldnames\optional{,
restval=""\optional{,
extrasaction=\code{'raise'}\optional{,
dialect=\code{'excel'}\optional{,
*args, **kwds}}}}}
Create an object which operates like a regular writer but maps dictionaries
onto output rows. The \var{fieldnames} parameter identifies the order in
which values in the dictionary passed to the \method{writerow()} method are
written to the \var{csvfile}. The optional \var{restval} parameter
specifies the value to be written if the dictionary is missing a key in
\var{fieldnames}. If the dictionary passed to the \method{writerow()}
method contains a key not found in \var{fieldnames}, the optional
\var{extrasaction} parameter indicates what action to take. If it is set
to \code{'raise'}, a \exception{ValueError} is raised. If it is set to
\code{'ignore'}, extra values in the dictionary are ignored. Any other
optional or keyword arguments are passed to the underlying \class{writer}
instance.

Note that unlike the \class{DictReader} class, the \var{fieldnames}
parameter of the \class{DictWriter} is not optional. Since Python's
\class{dict} objects are not ordered, there is not enough information
available to deduce the order in which the row should be written to the
\var{csvfile}.

\end{classdesc}
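
The two \var{extrasaction} settings and the \var{restval} filler can be
compared with a sketch such as the following. \class{ListSink} is a
hypothetical helper that merely provides a \function{write} method, and
the dictionaries are invented sample data.

```python
import csv

class ListSink(object):
    """Hypothetical stand-in for a file: collects whatever is written."""
    def __init__(self):
        self.chunks = []

    def write(self, text):
        self.chunks.append(text)

sink = ListSink()
writer = csv.DictWriter(sink, ['id', 'name'], restval='?',
                        extrasaction='ignore')
# The 'unused' key is not in fieldnames and is silently dropped.
writer.writerow({'id': 1, 'name': 'alice', 'unused': 'dropped'})
# The missing 'name' key is written as restval.
writer.writerow({'id': 2})
output = ''.join(sink.chunks)

# With the default extrasaction='raise', an unknown key is an error.
strict = csv.DictWriter(ListSink(), ['id', 'name'])
try:
    strict.writerow({'id': 3, 'nickname': 'bob'})
    raised = False
except ValueError:
    raised = True
```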

\begin{classdesc*}{Dialect}{}
The \class{Dialect} class is a container class relied on primarily for its
attributes, which are used to define the parameters for a specific
\class{reader} or \class{writer} instance.
\end{classdesc*}

\begin{classdesc}{excel}{}
The \class{excel} class defines the usual properties of an Excel-generated
CSV file.
\end{classdesc}

\begin{classdesc}{excel_tab}{}
The \class{excel_tab} class defines the usual properties of an
Excel-generated TAB-delimited file.
\end{classdesc}

\begin{classdesc}{Sniffer}{}
The \class{Sniffer} class is used to deduce the format of a CSV file.
\end{classdesc}

The \class{Sniffer} class provides two methods:

\begin{methoddesc}{sniff}{sample\optional{,delimiters=None}}
Analyze the given \var{sample} and return a \class{Dialect} subclass
reflecting the parameters found. If the optional \var{delimiters} parameter
is given, it is interpreted as a string containing possible valid delimiter
characters.
\end{methoddesc}

\begin{methoddesc}{has_header}{sample}
Analyze the sample text (presumed to be in CSV format) and return
\constant{True} if the first row appears to be a series of column
headers.
\end{methoddesc}
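
As a small sketch of \method{sniff}, the sample below is semicolon
delimited and the candidate delimiters are restricted to \code{';'} and
\code{','}. The sample text is invented; since the deduction is
heuristic, results on real data may differ from this tidy case.

```python
import csv

# A made-up sample in which every line has the same number of ';'.
sample = 'a;b;c\n1;2;3\n4;5;6\n'

# Restrict the guess to two candidate delimiters.
dialect = csv.Sniffer().sniff(sample, delimiters=';,')

# The returned Dialect subclass can be handed straight to a reader.
rows = list(csv.reader(sample.splitlines(), dialect))
```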


The \module{csv} module defines the following constants:

\begin{datadesc}{QUOTE_ALL}
Instructs \class{writer} objects to quote all fields.
\end{datadesc}

\begin{datadesc}{QUOTE_MINIMAL}
Instructs \class{writer} objects to only quote those fields which contain
special characters such as \var{delimiter}, \var{quotechar} or any of the
characters in \var{lineterminator}.
\end{datadesc}

\begin{datadesc}{QUOTE_NONNUMERIC}
Instructs \class{writer} objects to quote all non-numeric
fields.

Instructs the reader to convert all non-quoted fields to type \var{float}.
\end{datadesc}

\begin{datadesc}{QUOTE_NONE}
Instructs \class{writer} objects to never quote fields. When the current
\var{delimiter} occurs in output data it is preceded by the current
\var{escapechar} character. If \var{escapechar} is not set, the writer
will raise \exception{Error} if any characters that require escaping
are encountered.

Instructs \class{reader} to perform no special processing of quote characters.
\end{datadesc}
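
To compare the four constants side by side, the following sketch writes
the same row under each quoting mode and then reads one line back with
\constant{QUOTE_NONNUMERIC}. The \function{render} function and the
\class{ListSink} class are hypothetical helpers for this example;
\var{lineterminator} is set to \code{'\e n'} only to keep the results
short.

```python
import csv

class ListSink(object):
    """Hypothetical stand-in for a file: collects whatever is written."""
    def __init__(self):
        self.chunks = []

    def write(self, text):
        self.chunks.append(text)

def render(row, **fmtparams):
    """Write one row with the given formatting parameters; return the text."""
    sink = ListSink()
    csv.writer(sink, lineterminator='\n', **fmtparams).writerow(row)
    return ''.join(sink.chunks)

row = ['spam', 42, 'a,b']
minimal = render(row, quoting=csv.QUOTE_MINIMAL)        # spam,42,"a,b"
everything = render(row, quoting=csv.QUOTE_ALL)         # "spam","42","a,b"
nonnumeric = render(row, quoting=csv.QUOTE_NONNUMERIC)  # "spam",42,"a,b"
# QUOTE_NONE needs an escapechar to protect the embedded delimiter.
escaped = render(row, quoting=csv.QUOTE_NONE, escapechar='\\')

# On input, QUOTE_NONNUMERIC turns the unquoted field into a float.
parsed = list(csv.reader(['"spam",42,"a,b"'], quoting=csv.QUOTE_NONNUMERIC))
```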


The \module{csv} module defines the following exception:

\begin{excdesc}{Error}
Raised by any of the functions when an error is detected.
\end{excdesc}


\subsection{Dialects and Formatting Parameters\label{csv-fmt-params}}

To make it easier to specify the format of input and output records,
specific formatting parameters are grouped together into dialects. A
dialect is a subclass of the \class{Dialect} class having a set of specific
attributes and a single \method{validate()} method. When creating \class{reader}
or \class{writer} objects, the programmer can specify a string or a subclass
of the \class{Dialect} class as the dialect parameter. In addition to, or
instead of, the \var{dialect} parameter, the programmer can also specify
individual formatting parameters, which have the same names as the
attributes defined below for the \class{Dialect} class.

Dialects support the following attributes:

\begin{memberdesc}[Dialect]{delimiter}
A one-character string used to separate fields. It defaults to \code{','}.
\end{memberdesc}

\begin{memberdesc}[Dialect]{doublequote}
Controls how instances of \var{quotechar} appearing inside a field should
themselves be quoted. When \constant{True}, the character is doubled.
When \constant{False}, the \var{escapechar} is used as a prefix to the
\var{quotechar}. It defaults to \constant{True}.

On output, if \var{doublequote} is \constant{False} and no
\var{escapechar} is set, \exception{Error} is raised if a \var{quotechar}
is found in a field.
\end{memberdesc}

\begin{memberdesc}[Dialect]{escapechar}
A one-character string used by the writer to escape the \var{delimiter} if
\var{quoting} is set to \constant{QUOTE_NONE} and the \var{quotechar}
if \var{doublequote} is \constant{False}. On reading, the \var{escapechar}
removes any special meaning from the following character. It defaults
to \constant{None}, which disables escaping.
\end{memberdesc}

\begin{memberdesc}[Dialect]{lineterminator}
The string used to terminate lines produced by the \class{writer}.
It defaults to \code{'\e r\e n'}.

\note{The \class{reader} is hard-coded to recognise either \code{'\e r'}
or \code{'\e n'} as end-of-line, and ignores \var{lineterminator}. This
behavior may change in the future.}
\end{memberdesc}

\begin{memberdesc}[Dialect]{quotechar}
A one-character string used to quote fields containing special characters,
such as the \var{delimiter} or \var{quotechar}, or which contain new-line
characters. It defaults to \code{'"'}.
\end{memberdesc}

\begin{memberdesc}[Dialect]{quoting}
Controls when quotes should be generated by the writer and recognised
by the reader. It can take on any of the \constant{QUOTE_*} constants
(see section~\ref{csv-contents}) and defaults to \constant{QUOTE_MINIMAL}.
\end{memberdesc}

\begin{memberdesc}[Dialect]{skipinitialspace}
When \constant{True}, whitespace immediately following the \var{delimiter}
is ignored. The default is \constant{False}.
\end{memberdesc}
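
The attributes above can be gathered into a \class{Dialect} subclass,
as in this sketch. \class{ColonDialect} and its settings are a
hypothetical dialect invented for the example; the second reader shows
a keyword argument overriding one attribute of the dialect.

```python
import csv

class ColonDialect(csv.Dialect):
    """Hypothetical dialect, defined only for illustration."""
    delimiter = ':'
    quotechar = "'"
    doublequote = False
    escapechar = '\\'
    lineterminator = '\n'
    quoting = csv.QUOTE_MINIMAL
    skipinitialspace = True

# skipinitialspace drops the blank after each ':'; the quoted field
# keeps its embedded delimiter.
rows = list(csv.reader(["a: b: 'c:d'"], ColonDialect))

# A formatting parameter given by keyword overrides the dialect's value.
overridden = list(csv.reader(['a;b'], ColonDialect, delimiter=';'))
```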


\subsection{Reader Objects}

Reader objects (\class{DictReader} instances and objects returned by
the \function{reader()} function) have the following public methods:

\begin{methoddesc}[csv reader]{next}{}
Return the next row of the reader's iterable object as a list, parsed
according to the current dialect.
\end{methoddesc}

Reader objects have the following public attributes:

\begin{memberdesc}[csv reader]{dialect}
A read-only description of the dialect in use by the parser.
\end{memberdesc}

\begin{memberdesc}[csv reader]{line_num}
The number of lines read from the source iterator. This is not the same
as the number of records returned, as records can span multiple lines.
\end{memberdesc}
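
The difference between \member{line_num} and the record count shows up
as soon as a quoted field spans two lines. The sketch below uses
invented input whose lines keep their trailing newline characters, so
the newline embedded in the quoted field is preserved.

```python
import csv

# Four source lines, but only three records: the second record's
# quoted field continues onto the third line.
lines = ['id,comment\n', '1,"first line\n', 'second line"\n', '2,plain\n']

reader = csv.reader(lines)
seen = []
for row in reader:
    # line_num counts source lines consumed so far, not records.
    seen.append((reader.line_num, row))
```

After the loop, \var{seen} pairs the three records with the line
numbers 1, 3 and 4.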


\subsection{Writer Objects}

\class{Writer} objects (\class{DictWriter} instances and objects returned by
the \function{writer()} function) have the following public methods. A
{}\var{row} must be a sequence of strings or numbers for \class{Writer}
objects and a dictionary mapping fieldnames to strings or numbers for
{}\class{DictWriter} objects; numbers are passed through
\function{str()} before being written. Note that complex numbers are
written out surrounded by parentheses.
This may cause some problems for other programs which read CSV files
(assuming they support complex numbers at all).

\begin{methoddesc}[csv writer]{writerow}{row}
Write the \var{row} parameter to the writer's file object, formatted
according to the current dialect.
\end{methoddesc}

\begin{methoddesc}[csv writer]{writerows}{rows}
Write all the rows in the \var{rows} parameter (a list of \var{row}
objects as described above) to the writer's file object, formatted
according to the current dialect.
\end{methoddesc}

Writer objects have the following public attribute:

\begin{memberdesc}[csv writer]{dialect}
A read-only description of the dialect in use by the writer.
\end{memberdesc}
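
A brief sketch of \method{writerows}, including the complex-number case
mentioned above. \class{ListSink} is again a hypothetical object that
only supplies a \function{write} method, and the rows are invented.

```python
import csv

class ListSink(object):
    """Hypothetical stand-in for a file: collects whatever is written."""
    def __init__(self):
        self.chunks = []

    def write(self, text):
        self.chunks.append(text)

sink = ListSink()
writer = csv.writer(sink)

# Each row is a sequence of strings or numbers; the complex number
# is written in its str() form, parentheses included.
writer.writerows([(1, 'a'), (1 + 2j, 'b')])

output = ''.join(sink.chunks)
```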



\subsection{Examples\label{csv-examples}}

The simplest example of reading a CSV file:

\begin{verbatim}
import csv
reader = csv.reader(open("some.csv", "rb"))
for row in reader:
    print row
\end{verbatim}

Reading a file with an alternate format:

\begin{verbatim}
import csv
reader = csv.reader(open("passwd", "rb"), delimiter=':', quoting=csv.QUOTE_NONE)
for row in reader:
    print row
\end{verbatim}

The corresponding simplest possible writing example is:

\begin{verbatim}
import csv
writer = csv.writer(open("some.csv", "wb"))
writer.writerows(someiterable)
\end{verbatim}

Registering a new dialect:

\begin{verbatim}
import csv

csv.register_dialect('unixpwd', delimiter=':', quoting=csv.QUOTE_NONE)

reader = csv.reader(open("passwd", "rb"), 'unixpwd')
\end{verbatim}

A slightly more advanced use of the reader --- catching and reporting errors:

\begin{verbatim}
import csv, sys
filename = "some.csv"
reader = csv.reader(open(filename, "rb"))
try:
    for row in reader:
        print row
except csv.Error, e:
    sys.exit('file %s, line %d: %s' % (filename, reader.line_num, e))
\end{verbatim}

And while the module doesn't directly support parsing strings, it can
easily be done:

\begin{verbatim}
import csv
for row in csv.reader(['one,two,three']):
    print row
\end{verbatim}

The \module{csv} module doesn't directly support reading and writing
Unicode, but it is 8-bit-clean save for some problems with \ASCII{} NUL
characters. So you can write functions or classes that handle the
encoding and decoding for you as long as you avoid encodings like
UTF-16 that use NULs. UTF-8 is recommended.

\function{unicode_csv_reader} below is a generator that wraps
\class{csv.reader} to handle Unicode CSV data (a list of Unicode
strings). \function{utf_8_encoder} is a generator that encodes the
Unicode strings as UTF-8, one string (or row) at a time. The encoded
strings are parsed by the CSV reader, and
\function{unicode_csv_reader} decodes the UTF-8-encoded cells back
into Unicode:

\begin{verbatim}
import csv

def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]

def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')
\end{verbatim}

For all other encodings the following \class{UnicodeReader} and
\class{UnicodeWriter} classes can be used. They take an additional
\var{encoding} parameter in their constructor and make sure that the data
passes through the real reader or writer encoded as UTF-8:

\begin{verbatim}
import csv, codecs, cStringIO

class UTF8Recoder:
    """
    Iterator that reads an encoded stream and reencodes the input to UTF-8
    """
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)

    def __iter__(self):
        return self

    def next(self):
        return self.reader.next().encode("utf-8")

class UnicodeReader:
    """
    A CSV reader which will iterate over lines in the CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)

    def next(self):
        row = self.reader.next()
        return [unicode(s, "utf-8") for s in row]

    def __iter__(self):
        return self

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)
\end{verbatim}