Unicode Standard Annex #13

Unicode Newline Guidelines

Version	Unicode 3.1.0
Authors	Mark Davis (mark.davis@us.ibm.com)
Date	2001-03-23
This Version	http://www.unicode.org/unicode/reports/tr13/tr13-8.html
Previous Version	http://www.unicode.org/unicode/reports/tr13/tr13-6.html
Latest Version	http://www.unicode.org/unicode/reports/tr13
Tracking Number	7

Summary

This document describes guidelines for handling the different characters used to represent new lines on different platforms: CR, LF, CRLF, and NEL.

Status

This document has been reviewed by Unicode members and other interested parties, and has been approved by the Unicode Technical Committee as a Unicode Standard Annex. It is a stable document and may be used as reference material or cited as a normative reference from another document.

A Unicode Standard Annex (UAX) forms an integral part of the Unicode Standard, carrying the same version number, but is published as a separate document. Note that conformance to a version of the Unicode Standard includes conformance to its Unicode Standard Annexes.

A list of current Unicode Technical Reports is found on http://www.unicode.org/unicode/reports/. For more information about versions of the Unicode Standard, see http://www.unicode.org/unicode/standard/versions/.

The References provide related information that is useful in understanding this document. Please mail corrigenda and other comments to the author(s).

1 Introduction
2 Definitions
3 Background
4 Recommendations
References
Modifications

1 Introduction

Newlines are represented on different platforms by carriage return (CR), line feed (LF), CRLF, or next line (NEL). Unfortunately, not only are newlines represented by different characters on different platforms, they also have ambiguous behavior even on the same platform. Especially with the advent of the web, where text on a single machine can arise from many sources, this causes a significant problem.

Unfortunately, these characters are often transcoded directly into the corresponding Unicode codes when a character set is transcoded; this means that even programs handling pure Unicode have to deal with the problems. For information on handling newlines in regular expressions, see UTR #18: Unicode Regular Expression Guidelines [RegExp].

2 Definitions

The following table provides hexadecimal values for the acronyms used in the text. The Unicode Standard does not formally assign control characters; instead, it provides the 65 code values for use as in the 7- and 8-bit standards. See The Unicode Standard, Version 2.0, Section 2.6 Controls and Control Sequences.

Hex Values for Acronyms

Acronym	Unicode	ASCII	EBCDIC* (first)	EBCDIC* (second)
`CR`	`000D`	`0D`	`0D`	`0D`
`LF`	`000A`	`0A`	`25`	`15`
`CRLF`	`000D,000A`	`0D,0A`	`0D,25`	`0D,15`
`NEL*`	`0085`	`85`	`15`	`25`
`VT`	`000B`	`0B`	`0B`	`0B`
`FF`	`000C`	`0C`	`0C`	`0C`
`LS`	`2028`	n/a	n/a	n/a
`PS`	`2029`	n/a	n/a	n/a

* There are two mappings of LF and NEL used by EBCDIC systems. The first EBCDIC column shows the MVS Open Edition (including CP1047) mapping of these characters, while the second column shows the CDRA mapping. This difference arises from the use of the LF character as 'New Line' in ASCII-based Unix environments and in some data transfer protocols that use the Unix assumptions. The second column is based on the standardized definitions of LF in both ASCII and EBCDIC.
* NEL is not actually defined in ASCII; it is defined in ISO 6429 as a C1 control.

For clarity, when referring to the function that a particular character has, we will use lowercase (e.g., paragraph separator); when referring to the specific characters that represent those functions, we will use titlecase or an acronym (e.g., Paragraph Separator or PS).

The term NLF (new line function) stands for different characters depending on the platform; that is, any of CR, LF, CRLF, or NEL.

3 Background

A paragraph separator is used to indicate a separation between paragraphs, while a line separator indicates where a line break alone should occur, typically within a paragraph. For example:

This is a paragraph with a line separator at this point,
causing the word "causing" to appear on a different line, but not causing the typical paragraph indentation, sentence-breaking, line spacing, or change in flush (right, center or left paragraphs).

For comparison, line separators basically correspond to HTML <BR>, and paragraph separators to older usage of HTML <P> (modern HTML delimits paragraphs by enclosing them in <P>...</P>). In word processors, paragraph separators are usually entered using a keyboard RETURN or ENTER; line separators are usually entered using a modified RETURN or ENTER, such as SHIFT-ENTER.

A record separator is used to separate records. For example, when exchanging tabular data, a common format is to tab-separate the cells, and use a CRLF at the end of a line of cells. This function is not precisely the same as line separation, but the same characters are often used.

Traditionally, NLF started out as a line separator (and sometimes record separator). It is still used as a line separator in simple text editors such as program editors. As platforms and programs started to handle word processing with automatic line-wrap, these characters were reinterpreted to stand for paragraph separators. For example, even such simple programs as the Windows Notepad program or the Mac SimpleText program interpret their platform's NLF as a paragraph separator, not a line separator.

Once NLF was reinterpreted to stand for a paragraph separator, in some cases some other control character was impressed into service as a line separator. For example, vertical tabulation VT is used in Microsoft Word. However, the choice of character for line separator is even less standardized than the choice of character for NLF.

Yet many Internet protocols and much existing text treat NLF as a line separator, so you can't simply treat NLF as a paragraph separator in all circumstances.

4 Recommendations

The Unicode Standard defines two unambiguous separator characters, Paragraph Separator (PS = 2029₁₆) and Line Separator (LS = 2028₁₆). In Unicode text, the PS and LS characters should be used wherever the desired function is unambiguous. Otherwise, the following specifies how to cope with an NLF when converting from other character sets to Unicode, when interpreting characters in text, and when converting from Unicode to other character sets.

Note: Even if you know which characters represent NLF on your particular platform, on input and in interpretation, treat CR, LF, CRLF, and NEL the same. Only on output do you need to distinguish between them.
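
As a minimal sketch of this note, the following Python helper folds every NLF into a single internal form so that later interpretation treats them alike. Python, the name normalize_nlf, and the choice of LF as the internal form are assumptions of this example, not something the annex prescribes.

```python
import re

# CRLF is listed first so that it is matched as one NLF,
# not as a CR followed by a separate LF.
_NLF = re.compile("\r\n|[\r\n\u0085]")

def normalize_nlf(text, internal="\n"):
    """Replace every NLF (CR, LF, CRLF, NEL) with one internal form.

    LS (U+2028) and PS (U+2029) are left untouched because their
    meaning is already unambiguous.
    """
    # A function is used as the replacement so that `internal` is
    # inserted literally, without backslash processing.
    return _NLF.sub(lambda m: internal, text)
```

With this helper, text that arrived with mixed conventions, such as "a\r\nb\rc", compares equal after normalization to the same text written with LF alone. The platform-specific form is chosen again only when the text is written out.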

4.1 Converting from other character code sets

If you do know the exact usage of any NLF, then convert it to LS or PS.
If you don't know the exact usage of any NLF, remap it to your platform NLF. (This doesn't really help you in interpreting Unicode text unless you are the only source of that text, since someone else may have left in LF, CR, CRLF, or NEL.)
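
The two rules above might be sketched as follows in Python. The function name import_newlines and its usage parameter are hypothetical, and the sketch assumes the text has already been decoded from the source character set into a Python string, so that the source NLF characters appear as U+000D, U+000A, or U+0085.

```python
import os
import re

LS = "\u2028"  # Line Separator
PS = "\u2029"  # Paragraph Separator

# CRLF first, so the pair counts as a single NLF.
_NLF = re.compile("\r\n|[\r\n\u0085]")

def import_newlines(text, usage=None, platform_nlf=os.linesep):
    """Map each NLF in already-decoded text.

    usage="line"      -> NLF is known to be a line separator: use LS
    usage="paragraph" -> NLF is known to be a paragraph separator: use PS
    usage=None        -> usage unknown: remap to the platform NLF
    """
    if usage == "line":
        replacement = LS
    elif usage == "paragraph":
        replacement = PS
    elif usage is None:
        replacement = platform_nlf
    else:
        raise ValueError("usage must be 'line', 'paragraph', or None")
    return _NLF.sub(lambda m: replacement, text)
```

The default for platform_nlf is os.linesep, which is the Python counterpart of a platform newline property. Passing it explicitly makes the result reproducible across platforms.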

4.2 Interpreting characters in text

Always interpret PS as paragraph separator and LS as line separator.
In word processing, interpret any NLF the same as PS.
In simple text editors, interpret any NLF the same as LS.
In parsing, choose the safest interpretation. For example, if you are dealing with sentence-break heuristics, you would reason as follows that it is safer to interpret any NLF as an LS:
- Suppose you misinterpret an NLF as LS, when it was meant to be PS. Since most paragraphs are terminated with punctuation anyway, in only a few cases would this cause misidentification of sentence boundaries.
- Suppose you misinterpret an NLF as PS, when it was meant to be LS. In this case, line breaks would cause sentence breaks, which would mess up the sentence break heuristics significantly.
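
The reasoning above can be illustrated with a deliberately naive sentence splitter in Python. The splitter, its name sentences, and the sample text are assumptions of this sketch. It treats PS as a hard sentence boundary and LS as ordinary white space, which is enough to show the asymmetry between the two misinterpretations.

```python
import re

LS = "\u2028"  # Line Separator
PS = "\u2029"  # Paragraph Separator

_NLF = re.compile("\r\n|[\r\n\u0085]")

def sentences(text, nlf_as):
    """Naive sentence splitter; nlf_as is LS or PS.

    Every NLF is first interpreted as nlf_as. A PS always ends a
    sentence; an LS is treated as a space inside a paragraph.
    """
    text = _NLF.sub(lambda m: nlf_as, text)
    result = []
    for paragraph in text.split(PS):
        paragraph = paragraph.replace(LS, " ")
        # Break after ., !, or ? when followed by white space.
        for candidate in re.split(r"(?<=[.!?])\s+", paragraph):
            candidate = candidate.strip()
            if candidate:
                result.append(candidate)
    return result

# Hard-wrapped text: the NLFs were meant as line separators.
wrapped = "The quick brown fox\njumps over the dog.\nIt runs away."

# Paragraph-per-line text: the NLF was meant as a paragraph separator.
paragraphs = "First paragraph.\nSecond paragraph."
```

For the wrapped sample, interpreting NLF as LS yields two sentences, while interpreting it as PS yields three fragments, one per line. For the paragraphs sample, both interpretations yield the same two sentences, because the terminal punctuation already marks the boundary.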

4.3 Converting to other character code sets

If you know the intended target, map NLF, LS, and PS appropriately, depending on the target conventions. For example, when mapping to Microsoft Word's internal conventions for Windows documents you would map LS to VT, and PS and any NLF to CRLF.
If you don't know the intended target, map NLF, LS, and PS to the platform newline convention (CR, LF, CRLF, or NEL). In Java, for example, this is done by mapping to a string nlf, defined as:
String nlf = System.getProperty("line.separator");
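
Both cases can be sketched in Python as follows. The function names export_for_word and export_for_platform are hypothetical, and the first function encodes only the Microsoft Word convention described above (LS to VT, PS and any NLF to CRLF), not any actual Word file format.

```python
import os
import re

VT = "\u000B"
LS = "\u2028"
PS = "\u2029"

# Matches any NLF (CRLF first, as a single unit), LS, or PS.
_SEPARATOR = re.compile("\r\n|[\r\n\u0085\u2028\u2029]")

def export_for_word(text):
    """Known target: map LS to VT, and PS and any NLF to CRLF."""
    def replace(match):
        return VT if match.group() == LS else "\r\n"
    return _SEPARATOR.sub(replace, text)

def export_for_platform(text, platform_nlf=os.linesep):
    """Unknown target: map NLF, LS, and PS to the platform newline."""
    return _SEPARATOR.sub(lambda m: platform_nlf, text)
```

Note that the second mapping is lossy: once LS and PS are both written as the platform newline, a reader of the output can no longer tell them apart.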

4.4 Input and Output

A readline function should stop at NLF, LS, FF, or PS. In the typical implementation it does not include the NLF, LS, PS, or FF that caused it to stop. Note that since the separator is lost, the use of readline is limited to text processing, where there is no difference among the flavors of separators.
A writeline (or newline) function should convert NLF, LS, and PS according to the conventions in §4.3 Converting to other character code sets.
In C, gets is defined to terminate at a newline and replace the newline with '\0', while fgets is defined to terminate at a newline and include the newline in the array into which it copies the data. C implementations interpret '\n' either as LF or as the underlying platform newline NLF, depending on where it occurs. EBCDIC C compilers substitute the relevant codes, based on the EBCDIC execution set.
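
A readline that follows the recommendation above might look like the following Python sketch. The generator name read_lines is hypothetical, and it works on an in-memory string instead of a stream to keep the example short. Python's built-in str.splitlines is not used because it also breaks on VT and on the information separators U+001C to U+001E, which goes beyond the set named here.

```python
import re

# Stop characters: any NLF (CRLF as one unit), LS, PS, or FF.
_STOP = re.compile("\r\n|[\r\n\u0085\u2028\u2029\u000C]")

def read_lines(text):
    """Yield the lines of text, without the separator that ended each.

    Because the separator is dropped, the caller cannot tell which
    flavor of separator was present.
    """
    position = 0
    for match in _STOP.finditer(text):
        yield text[position:match.start()]
        position = match.end()
    # Text after the last separator, if any, is the final line.
    if position < len(text):
        yield text[position:]
```

In this sketch VT is deliberately not a stop character, so a line containing VT is returned intact.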

4.5 Page Separator

FF is commonly used as a page separator, and it should be interpreted that way in text. When displaying on the screen, it causes the text after the separator to be forced to the next page. It should be independent of paragraph separation: a paragraph can start on one page and continue on the next page. Except when displaying on pages, in most parsing and in readline it is interpreted in the same way as an LS.

References

[RegExp]

Unicode Technical Report #18: Unicode Regular Expression Guidelines

Modifications

The following summarizes modifications from the previous version of this document.

7	Updated for Unicode 3.1. Minor editing.

Copyright © 1998-2001 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.