\section{\module{csv} --- CSV File Reading and Writing}

\declaremodule{standard}{csv}
\modulesynopsis{Write and read tabular data to and from delimited files.}
\sectionauthor{Skip Montanaro}{skip@pobox.com}

\versionadded{2.3}
\index{csv}
\indexii{data}{tabular}

The so-called CSV (Comma Separated Values) format is the most common import
and export format for spreadsheets and databases. There is no ``CSV
standard'', so the format is operationally defined by the many applications
which read and write it. The lack of a standard means that subtle
differences often exist in the data produced and consumed by different
applications. These differences can make it annoying to process CSV files
from multiple sources. Still, while the delimiters and quoting characters
vary, the overall format is similar enough that it is possible to write a
single module which can efficiently manipulate such data, hiding the details
of reading and writing the data from the programmer.

The \module{csv} module implements classes to read and write tabular data in
CSV format. It allows programmers to say, ``write this data in the format
preferred by Excel,'' or ``read data from this file which was generated by
Excel,'' without knowing the precise details of the CSV format used by
Excel. Programmers can also describe the CSV formats understood by other
applications or define their own special-purpose CSV formats.

The \module{csv} module's \class{reader} and \class{writer} objects read and
write sequences. Programmers can also read and write data in dictionary
form using the \class{DictReader} and \class{DictWriter} classes.

\begin{notice}
This version of the \module{csv} module doesn't support Unicode
input. Also, there are currently some issues regarding \ASCII{} NUL
characters. Accordingly, all input should be UTF-8 or printable
\ASCII{} to be safe; see the examples in section~\ref{csv-examples}.
These restrictions will be removed in the future.
\end{notice}

\begin{seealso}
% \seemodule{array}{Arrays of uniformly typed numeric values.}
\seepep{305}{CSV File API}
       {The Python Enhancement Proposal which proposed this addition
        to Python.}
\end{seealso}


\subsection{Module Contents \label{csv-contents}}

The \module{csv} module defines the following functions:

\begin{funcdesc}{reader}{csvfile\optional{,
                         dialect=\code{'excel'}}\optional{, fmtparam}}
Return a reader object which will iterate over lines in the given
{}\var{csvfile}. \var{csvfile} can be any object which supports the
iterator protocol and returns a string each time its \method{next}
method is called; file objects and list objects are both suitable.
If \var{csvfile} is a file object, it must be opened with
the 'b' flag on platforms where that makes a difference. An optional
{}\var{dialect} parameter can be given
which is used to define a set of parameters specific to a particular CSV
dialect. It may be an instance of a subclass of the \class{Dialect}
class or one of the strings returned by the \function{list_dialects}
function. The other optional {}\var{fmtparam} keyword arguments can be
given to override individual formatting parameters in the current
dialect. For details of the dialect and formatting parameters, see
section~\ref{csv-fmt-params}, ``Dialects and Formatting Parameters''.

All data read are returned as strings. No automatic data type
conversion is performed.

\versionchanged[
The parser is now stricter with respect to multi-line quoted
fields. Previously, if a line ended within a quoted field without a
terminating newline character, a newline would be inserted into the
returned field. This behavior caused problems when reading files
which contained carriage return characters within fields. The
behavior was changed to return the field without inserting newlines. As
a consequence, if newlines embedded within fields are important, the
input should be split into lines in a manner which preserves the newline
characters]{2.5}

\end{funcdesc}

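Because \function{reader} accepts any iterable of strings, its behaviour can be tried without a file. The following is a minimal sketch rather than an excerpt from the module: the sample lines and variable names are invented for illustration, and only calls described above are used.

```python
import csv

# Invented sample data: any iterable yielding one string per line will do.
lines = ['name,qty,price', 'widget,3,9.99']

# The default 'excel' dialect is comma-delimited.
rows = list(csv.reader(lines))

# Every field comes back as a string; convert explicitly where needed.
header, record = rows
quantity = int(record[1])
price = float(record[2])

# A keyword argument (a "fmtparam") overrides one parameter of the dialect.
piped = list(csv.reader(['a|b|c'], delimiter='|'))
```

Passing a list keeps the sketch self-contained; a file object opened as described above would be used in the same way.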
\begin{funcdesc}{writer}{csvfile\optional{,
                         dialect=\code{'excel'}}\optional{, fmtparam}}
Return a writer object responsible for converting the user's data into
delimited strings on the given file-like object. \var{csvfile} can be any
object with a \function{write} method. If \var{csvfile} is a file object,
it must be opened with the 'b' flag on platforms where that makes a
difference. An optional
{}\var{dialect} parameter can be given which is used to define a set of
parameters specific to a particular CSV dialect. It may be an instance
of a subclass of the \class{Dialect} class or one of the strings
returned by the \function{list_dialects} function. The other optional
{}\var{fmtparam} keyword arguments can be given to override individual
formatting parameters in the current dialect. For details of the dialect
and formatting parameters, see section~\ref{csv-fmt-params},
``Dialects and Formatting Parameters''. To make it as easy as possible to
interface with modules which implement the DB API, the value
\constant{None} is written as the empty string. While this isn't a
reversible transformation, it makes it easier to dump SQL NULL data values
to CSV files without preprocessing the data returned from a
\code{cursor.fetch*()} call. All other non-string data are stringified
with \function{str()} before being written.
\end{funcdesc}

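The handling of \constant{None} and of non-string values can be observed with a small sketch. \class{LineSink} is a hypothetical helper invented here; it relies only on the statement above that any object with a \function{write} method is acceptable.

```python
import csv

class LineSink:
    """Hypothetical stand-in for a file: it only provides write()."""
    def __init__(self):
        self.chunks = []

    def write(self, text):
        self.chunks.append(text)

sink = LineSink()
writer = csv.writer(sink)

# None becomes an empty field; the int and the float are stringified.
writer.writerow(['spam', None, 42, 1.5])

output = ''.join(sink.chunks)
```

Under these assumptions \code{output} holds one line with an empty second field, terminated by the default line terminator of the \code{'excel'} dialect.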
\begin{funcdesc}{register_dialect}{name\optional{, dialect}\optional{, fmtparam}}
Associate \var{dialect} with \var{name}. \var{name} must be a string
or Unicode object. The dialect can be specified either by passing a
subclass of \class{Dialect}, or by \var{fmtparam} keyword arguments,
or both, with keyword arguments overriding parameters of the dialect.
For details of the dialect and formatting parameters, see
section~\ref{csv-fmt-params}, ``Dialects and Formatting Parameters''.
\end{funcdesc}

\begin{funcdesc}{unregister_dialect}{name}
Delete the dialect associated with \var{name} from the dialect registry. An
\exception{Error} is raised if \var{name} is not a registered dialect
name.
\end{funcdesc}

\begin{funcdesc}{get_dialect}{name}
Return the dialect associated with \var{name}. An \exception{Error} is
raised if \var{name} is not a registered dialect name.
\end{funcdesc}

\begin{funcdesc}{list_dialects}{}
Return the names of all registered dialects.
\end{funcdesc}

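The four registry functions fit together as sketched below. This is an illustration only; the dialect name \code{'unixpwd'} is an arbitrary example name and is not shipped with the module.

```python
import csv

# 'unixpwd' is an arbitrary example name, not a dialect the module ships.
csv.register_dialect('unixpwd', delimiter=':', quoting=csv.QUOTE_NONE)

registered = 'unixpwd' in csv.list_dialects()
delimiter = csv.get_dialect('unixpwd').delimiter

# A registered name can be passed wherever a dialect is expected.
rows = list(csv.reader(['root:x:0:0'], 'unixpwd'))

csv.unregister_dialect('unixpwd')

# Looking up a name that is no longer registered raises csv.Error.
try:
    csv.get_dialect('unixpwd')
    lookup_failed = False
except csv.Error:
    lookup_failed = True
```

Unregistering at the end keeps the sketch from leaving a stray entry in the process-wide registry.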
\begin{funcdesc}{field_size_limit}{\optional{new_limit}}
Return the current maximum field size allowed by the parser. If
\var{new_limit} is given, this becomes the new limit.
\versionadded{2.5}
\end{funcdesc}


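A short sketch of querying and changing the limit follows. The value 1024 is arbitrary, and the sketch assumes that a call with \var{new_limit} returns the limit that was in force before the call, which is how CPython behaves but is not spelled out above.

```python
import csv

# Called without an argument, the function only reports the current limit.
original_limit = csv.field_size_limit()

# Assumption: with an argument, the previous limit is returned.
previous = csv.field_size_limit(1024)   # 1024 is an arbitrary example value
current = csv.field_size_limit()

# Restore the original setting so other code is unaffected.
csv.field_size_limit(original_limit)
```

The limit is module-wide, which is why the sketch restores it afterwards.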
The \module{csv} module defines the following classes:

\begin{classdesc}{DictReader}{csvfile\optional{,
                              fieldnames=\constant{None}\optional{,
                              restkey=\constant{None}\optional{,
                              restval=\constant{None}\optional{,
                              dialect=\code{'excel'}\optional{,
                              *args, **kwds}}}}}}
Create an object which operates like a regular reader but maps the
information read into a dict whose keys are given by the optional
{}\var{fieldnames}
parameter. If the \var{fieldnames} parameter is omitted, the values in
the first row of the \var{csvfile} will be used as the fieldnames.
If the row read has more fields than the fieldnames sequence, the
remaining data is added as a sequence keyed by the value of
\var{restkey}. If the row read has fewer fields than the fieldnames
sequence, the remaining keys take the value of the optional
\var{restval} parameter. Any other optional or
keyword arguments are passed to the underlying \class{reader} instance.
\end{classdesc}


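The roles of \var{restkey} and \var{restval} are easiest to see with one row that is too long and one that is too short. The data and the key name \code{'overflow'} below are invented for illustration.

```python
import csv

# Invented sample data; the first line supplies the field names.
lines = ['name,qty', 'widget,3,extra1,extra2', 'gadget']

reader = csv.DictReader(lines, restkey='overflow', restval='n/a')
rows = [dict(row) for row in reader]

long_row, short_row = rows
```

Here \code{long\_row} carries the surplus fields as a list under \code{'overflow'}, while \code{short\_row} has \code{'n/a'} filled in for the missing \code{'qty'} field.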
\begin{classdesc}{DictWriter}{csvfile, fieldnames\optional{,
                              restval=""\optional{,
                              extrasaction=\code{'raise'}\optional{,
                              dialect=\code{'excel'}\optional{,
                              *args, **kwds}}}}}
Create an object which operates like a regular writer but maps dictionaries
onto output rows. The \var{fieldnames} parameter identifies the order in
which values in the dictionary passed to the \method{writerow()} method are
written to the \var{csvfile}. The optional \var{restval} parameter
specifies the value to be written if the dictionary is missing a key in
\var{fieldnames}. If the dictionary passed to the \method{writerow()}
method contains a key not found in \var{fieldnames}, the optional
\var{extrasaction} parameter indicates what action to take. If it is set
to \code{'raise'}, a \exception{ValueError} is raised. If it is set to
\code{'ignore'}, extra values in the dictionary are ignored. Any other
optional or keyword arguments are passed to the underlying \class{writer}
instance.

Note that unlike the \class{DictReader} class, the \var{fieldnames}
parameter of the \class{DictWriter} is not optional. Since Python's
\class{dict} objects are not ordered, there is not enough information
available to deduce the order in which the row should be written to the
\var{csvfile}.

\end{classdesc}

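The effect of \var{restval} and of the two \var{extrasaction} settings can be sketched as follows. \class{LineSink} is a hypothetical helper providing only a \function{write} method, and the field names are invented.

```python
import csv

class LineSink:
    """Hypothetical stand-in for a file: it only provides write()."""
    def __init__(self):
        self.chunks = []

    def write(self, text):
        self.chunks.append(text)

sink = LineSink()
writer = csv.DictWriter(sink, fieldnames=['name', 'qty'],
                        restval='0', extrasaction='ignore')

writer.writerow({'name': 'widget', 'qty': 3})
writer.writerow({'name': 'gadget'})                  # 'qty' missing: restval used
writer.writerow({'name': 'gizmo', 'colour': 'red'})  # extra key is dropped

output = ''.join(sink.chunks)

# With the default extrasaction='raise', the extra key is an error instead.
strict = csv.DictWriter(LineSink(), fieldnames=['name', 'qty'])
try:
    strict.writerow({'name': 'gizmo', 'colour': 'red'})
    raised = False
except ValueError:
    raised = True
```

Columns always appear in the order given by \var{fieldnames}, regardless of the order of keys in each dictionary.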
\begin{classdesc*}{Dialect}{}
The \class{Dialect} class is a container class relied on primarily for its
attributes, which are used to define the parameters for a specific
\class{reader} or \class{writer} instance.
\end{classdesc*}

\begin{classdesc}{excel}{}
The \class{excel} class defines the usual properties of an Excel-generated
CSV file.
\end{classdesc}

\begin{classdesc}{excel_tab}{}
The \class{excel_tab} class defines the usual properties of an
Excel-generated TAB-delimited file.
\end{classdesc}

\begin{classdesc}{Sniffer}{}
The \class{Sniffer} class is used to deduce the format of a CSV file.
\end{classdesc}

The \class{Sniffer} class provides two methods:

\begin{methoddesc}{sniff}{sample\optional{, delimiters=None}}
Analyze the given \var{sample} and return a \class{Dialect} subclass
reflecting the parameters found. If the optional \var{delimiters} parameter
is given, it is interpreted as a string containing possible valid delimiter
characters.
\end{methoddesc}

\begin{methoddesc}{has_header}{sample}
Analyze the sample text (presumed to be in CSV format) and return
\constant{True} if the first row appears to be a series of column
headers.
\end{methoddesc}


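A sketch of the two \class{Sniffer} methods on a small, invented sample follows. Both methods rely on heuristics, so results on real data may differ; the sample was chosen to be unambiguous.

```python
import csv

# Invented sample: semicolon-delimited text with a header row.
sample = 'name;age\nalice;30\nbobby;25\n'

sniffer = csv.Sniffer()

# Restrict the guess to two candidate delimiters.
dialect = sniffer.sniff(sample, delimiters=';,')
guessed_delimiter = dialect.delimiter

# The returned Dialect subclass can be handed straight to reader().
rows = list(csv.reader(sample.splitlines(), dialect))

# Heuristic check of whether the first row looks like column headers.
looks_like_header = sniffer.has_header(sample)
```

In practice the sample would typically be the first few kilobytes of the file whose format is unknown.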
The \module{csv} module defines the following constants:

\begin{datadesc}{QUOTE_ALL}
Instructs \class{writer} objects to quote all fields.
\end{datadesc}

\begin{datadesc}{QUOTE_MINIMAL}
Instructs \class{writer} objects to only quote those fields which contain
special characters such as \var{delimiter}, \var{quotechar} or any of the
characters in \var{lineterminator}.
\end{datadesc}

\begin{datadesc}{QUOTE_NONNUMERIC}
Instructs \class{writer} objects to quote all non-numeric
fields.

Instructs \class{reader} objects to convert all non-quoted fields to
type \var{float}.
\end{datadesc}

\begin{datadesc}{QUOTE_NONE}
Instructs \class{writer} objects to never quote fields. When the current
\var{delimiter} occurs in output data it is preceded by the current
\var{escapechar} character. If \var{escapechar} is not set, the writer
will raise \exception{Error} if any characters that require escaping
are encountered.

Instructs \class{reader} objects to perform no special processing of
quote characters.
\end{datadesc}


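Writing one row under each constant makes the differences concrete. In this sketch the row, the \function{render} helper and the \class{LineSink} class are all invented for illustration.

```python
import csv

class LineSink:
    """Hypothetical stand-in for a file: it only provides write()."""
    def __init__(self):
        self.chunks = []

    def write(self, text):
        self.chunks.append(text)

def render(row, **fmtparams):
    """Write one row with the given formatting parameters; return the text."""
    sink = LineSink()
    csv.writer(sink, **fmtparams).writerow(row)
    return ''.join(sink.chunks)

row = ['id', 42, 'a,b']

minimal = render(row, quoting=csv.QUOTE_MINIMAL)
everything = render(row, quoting=csv.QUOTE_ALL)
nonnumeric = render(row, quoting=csv.QUOTE_NONNUMERIC)
# QUOTE_NONE needs an escapechar here because 'a,b' contains the delimiter.
unquoted = render(row, quoting=csv.QUOTE_NONE, escapechar='\\')

# On input, QUOTE_NONNUMERIC turns unquoted fields into floats.
parsed = list(csv.reader(['"id",42'], quoting=csv.QUOTE_NONNUMERIC))
```

Only the third field needs protection under \constant{QUOTE_MINIMAL}; \constant{QUOTE_NONNUMERIC} additionally quotes \code{'id'} but leaves the number bare.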
The \module{csv} module defines the following exception:

\begin{excdesc}{Error}
Raised by any of the functions when an error is detected.
\end{excdesc}


\subsection{Dialects and Formatting Parameters\label{csv-fmt-params}}

To make it easier to specify the format of input and output records,
specific formatting parameters are grouped together into dialects. A
dialect is a subclass of the \class{Dialect} class having a set of specific
attributes and a single \method{validate()} method. When creating \class{reader}
or \class{writer} objects, the programmer can specify a string or a subclass
of the \class{Dialect} class as the dialect parameter. In addition to, or
instead of, the \var{dialect} parameter, the programmer can also specify
individual formatting parameters, which have the same names as the
attributes defined below for the \class{Dialect} class.

Dialects support the following attributes:

\begin{memberdesc}[Dialect]{delimiter}
A one-character string used to separate fields. It defaults to \code{','}.
\end{memberdesc}

\begin{memberdesc}[Dialect]{doublequote}
Controls how instances of \var{quotechar} appearing inside a field should
themselves be quoted. When \constant{True}, the character is doubled.
When \constant{False}, the \var{escapechar} is used as a prefix to the
\var{quotechar}. It defaults to \constant{True}.

On output, if \var{doublequote} is \constant{False} and no
\var{escapechar} is set, \exception{Error} is raised if a \var{quotechar}
is found in a field.
\end{memberdesc}

\begin{memberdesc}[Dialect]{escapechar}
A one-character string used by the writer to escape the \var{delimiter} if
\var{quoting} is set to \constant{QUOTE_NONE} and the \var{quotechar}
if \var{doublequote} is \constant{False}. On reading, the \var{escapechar}
removes any special meaning from the following character. It defaults
to \constant{None}, which disables escaping.
\end{memberdesc}

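The interplay of \var{doublequote} and \var{escapechar} can be sketched by writing the same row both ways and reading it back. The helpers \function{render} and \function{parse} and the \class{LineSink} class are invented for this illustration.

```python
import csv

class LineSink:
    """Hypothetical stand-in for a file: it only provides write()."""
    def __init__(self):
        self.chunks = []

    def write(self, text):
        self.chunks.append(text)

def render(row, **fmtparams):
    """Write one row with the given formatting parameters; return the text."""
    sink = LineSink()
    csv.writer(sink, **fmtparams).writerow(row)
    return ''.join(sink.chunks)

def parse(text, **fmtparams):
    """Read a single line of CSV text back into a list of fields."""
    return list(csv.reader([text], **fmtparams))[0]

row = ['say "hi"', 'x']

# Default: an embedded quotechar is doubled and the field is quoted.
doubled = render(row)

# doublequote=False: the embedded quotechar is prefixed with escapechar.
escaped = render(row, doublequote=False, escapechar='\\')

# Both forms read back to the original row with matching parameters.
roundtrip_doubled = parse(doubled)
roundtrip_escaped = parse(escaped, doublequote=False, escapechar='\\')

# Without an escapechar there is no way to represent the quote.
try:
    render(row, doublequote=False)
    raised = False
except csv.Error:
    raised = True
```

The reader must be given the same parameters as the writer; the text produced with \code{doublequote=False} is not meant to be read with the defaults.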
\begin{memberdesc}[Dialect]{lineterminator}
The string used to terminate lines produced by the \class{writer}.
It defaults to \code{'\e r\e n'}.

\note{The \class{reader} is hard-coded to recognise either \code{'\e r'}
or \code{'\e n'} as end-of-line, and ignores \var{lineterminator}. This
behavior may change in the future.}
\end{memberdesc}

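The asymmetry between writer and reader noted above can be sketched as follows. \class{LineSink} is a hypothetical helper, and the \code{'|'} terminator passed to the reader is deliberately nonsensical to show that it has no effect.

```python
import csv

class LineSink:
    """Hypothetical stand-in for a file: it only provides write()."""
    def __init__(self):
        self.chunks = []

    def write(self, text):
        self.chunks.append(text)

# The writer honours lineterminator.
sink = LineSink()
csv.writer(sink, lineterminator='\n').writerow(['a', 'b'])
written = ''.join(sink.chunks)

# The reader accepts the usual line endings whatever lineterminator says.
rows = list(csv.reader(['a,b\r\n', 'c,d\n'], lineterminator='|'))
```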
\begin{memberdesc}[Dialect]{quotechar}
A one-character string used to quote fields containing special characters,
such as the \var{delimiter} or \var{quotechar}, or which contain new-line
characters. It defaults to \code{'"'}.
\end{memberdesc}

\begin{memberdesc}[Dialect]{quoting}
Controls when quotes should be generated by the writer and recognised
by the reader. It can take on any of the \constant{QUOTE_*} constants
(see section~\ref{csv-contents}) and defaults to \constant{QUOTE_MINIMAL}.
\end{memberdesc}

\begin{memberdesc}[Dialect]{skipinitialspace}
When \constant{True}, whitespace immediately following the \var{delimiter}
is ignored. The default is \constant{False}.
\end{memberdesc}


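Two of these attributes are easy to try on the reading side. The input strings in this sketch are invented.

```python
import csv

line = 'a, b, c'

# By default the space after each delimiter is part of the field.
default_fields = list(csv.reader([line]))[0]

# skipinitialspace=True drops whitespace that follows a delimiter.
stripped_fields = list(csv.reader([line], skipinitialspace=True))[0]

# A different quotechar lets single-quoted fields protect the delimiter.
quoted_fields = list(csv.reader(["'a,b',c"], quotechar="'"))[0]
```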
\subsection{Reader Objects}

Reader objects (\class{DictReader} instances and objects returned by
the \function{reader()} function) have the following public methods:

\begin{methoddesc}[csv reader]{next}{}
Return the next row of the reader's iterable object as a list, parsed
according to the current dialect.
\end{methoddesc}

Reader objects have the following public attributes:

\begin{memberdesc}[csv reader]{dialect}
A read-only description of the dialect in use by the parser.
\end{memberdesc}

\begin{memberdesc}[csv reader]{line_num}
The number of lines read from the source iterator. This is not the same
as the number of records returned, as records can span multiple lines.
\end{memberdesc}


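The difference between lines read and records returned shows up as soon as a quoted field contains a newline. The input in this sketch is invented, and each string keeps its trailing newline so that the embedded line break is preserved.

```python
import csv

# Invented input: the first record has a quoted field spanning two lines.
lines = ['a,"multi\n', 'line",b\n', 'c,d,e\n']

reader = csv.reader(lines)
records = []
line_numbers = []
for row in reader:
    records.append(row)
    # line_num counts source lines consumed so far, not records.
    line_numbers.append(reader.line_num)

delimiter_in_use = reader.dialect.delimiter
```

After the first record \member{line_num} is already 2, which is what makes it useful in error messages that should point at a position in the source file.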
\subsection{Writer Objects}

Writer objects (\class{DictWriter} instances and objects returned by
the \function{writer()} function) have the following public methods. A
{}\var{row} must be a sequence of strings or numbers for writer
objects and a dictionary mapping fieldnames to strings or numbers for
{}\class{DictWriter} objects; numbers are passed through
\function{str()} before being written. Note that complex numbers are
written out surrounded by parentheses.
This may cause some problems for other programs which read CSV files
(assuming they support complex numbers at all).

\begin{methoddesc}[csv writer]{writerow}{row}
Write the \var{row} parameter to the writer's file object, formatted
according to the current dialect.
\end{methoddesc}

\begin{methoddesc}[csv writer]{writerows}{rows}
Write every row in the \var{rows} parameter (a list of \var{row} objects
as described above) to the writer's file object, formatted
according to the current dialect.
\end{methoddesc}

Writer objects have the following public attribute:

\begin{memberdesc}[csv writer]{dialect}
A read-only description of the dialect in use by the writer.
\end{memberdesc}



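The two writing methods and the treatment of complex numbers can be sketched as follows. \class{LineSink} is a hypothetical helper providing only a \function{write} method, and the rows are invented.

```python
import csv

class LineSink:
    """Hypothetical stand-in for a file: it only provides write()."""
    def __init__(self):
        self.chunks = []

    def write(self, text):
        self.chunks.append(text)

sink = LineSink()
writer = csv.writer(sink)

# One row at a time ...
writer.writerow(['x', 'y'])
# ... or several rows in one call; the complex number keeps its parentheses.
writer.writerows([[1, 2.5], [1 + 2j, 'text']])

output = ''.join(sink.chunks)
```

A program reading this output back would have to strip the parentheses itself before converting the field to a number.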
\subsection{Examples\label{csv-examples}}

The simplest example of reading a CSV file:

\begin{verbatim}
import csv
reader = csv.reader(open("some.csv", "rb"))
for row in reader:
    print row
\end{verbatim}

Reading a file with an alternate format:

\begin{verbatim}
import csv
reader = csv.reader(open("passwd", "rb"), delimiter=':', quoting=csv.QUOTE_NONE)
for row in reader:
    print row
\end{verbatim}

The corresponding simplest possible writing example is:

\begin{verbatim}
import csv
writer = csv.writer(open("some.csv", "wb"))
writer.writerows(someiterable)
\end{verbatim}

Registering a new dialect:

\begin{verbatim}
import csv

csv.register_dialect('unixpwd', delimiter=':', quoting=csv.QUOTE_NONE)

reader = csv.reader(open("passwd", "rb"), 'unixpwd')
\end{verbatim}

A slightly more advanced use of the reader, catching and reporting errors:

\begin{verbatim}
import csv, sys
filename = "some.csv"
reader = csv.reader(open(filename, "rb"))
try:
    for row in reader:
        print row
except csv.Error, e:
    sys.exit('file %s, line %d: %s' % (filename, reader.line_num, e))
\end{verbatim}

And while the module doesn't directly support parsing strings, it can
easily be done:

\begin{verbatim}
import csv
for row in csv.reader(['one,two,three']):
    print row
\end{verbatim}

The \module{csv} module doesn't directly support reading and writing
Unicode, but it is 8-bit-clean save for some problems with \ASCII{} NUL
characters. So you can write functions or classes that handle the
encoding and decoding for you as long as you avoid encodings like
UTF-16 that use NULs. UTF-8 is recommended.

\function{unicode_csv_reader} below is a generator that wraps
\class{csv.reader} to handle Unicode CSV data (a list of Unicode
strings). \function{utf_8_encoder} is a generator that encodes the
Unicode strings as UTF-8, one string (or row) at a time. The encoded
strings are parsed by the CSV reader, and
\function{unicode_csv_reader} decodes the UTF-8-encoded cells back
into Unicode:

\begin{verbatim}
import csv

def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]

def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')
\end{verbatim}

For all other encodings the following \class{UnicodeReader} and
\class{UnicodeWriter} classes can be used. They take an additional
\var{encoding} parameter in their constructor and make sure that the data
passes through the real reader or writer encoded as UTF-8:

\begin{verbatim}
import csv, codecs, cStringIO

class UTF8Recoder:
    """
    Iterator that reads an encoded stream and reencodes the input to UTF-8
    """
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)

    def __iter__(self):
        return self

    def next(self):
        return self.reader.next().encode("utf-8")

class UnicodeReader:
    """
    A CSV reader which will iterate over lines in the CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)

    def next(self):
        row = self.reader.next()
        return [unicode(s, "utf-8") for s in row]

    def __iter__(self):
        return self

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)
\end{verbatim}