adamroach/patch-errata

patch-errata

Proof-of-concept for updating RFCs to contain errata

Disclaimers

This is a fairly simple script that I threw together pretty quickly, with two high-level goals:

  • Determining the feasibility of automatically inserting RFC errata into the documents at sensible locations.
  • Exploring how such errata can be displayed to readers in a clear and obvious fashion.

The script is neither elegant nor complete. There are a number of limitations and known corner cases that need to be handled.

  • Currently, the script operates by scraping the human-readable errata page from the rfc-editor.org website. While this generally works pretty well, it can get tripped up by the presence of certain phrases (e.g., "Notes:") in the errata text itself. The database that backs the RFC Editor's page contains semantic information (section, orig_text, and correct_text); using that information will eliminate such confusion.
  • For the same reason, formatting of the "Notes" section of errata is occasionally haphazard. Again, working directly with the data in the form it exists in the errata database will eliminate this issue.
  • The script operates on HTML-escaped text. Where the escape-html npm module and the RFC Editor's webpage have different notions about the necessity of escaping certain characters, matches that should be possible are overlooked. Operating directly on the database information should improve this situation.
  • Errata are applied at the first place in the document that matches the old text. In the rare cases that the old text appears in multiple sections, the erratum may be inserted at the wrong point. This can be fixed by isolating document sections and searching only within the correct section.
  • When multiple errata attempt to correct the same text, the results can be unexpected.
  • Older documents are inconsistent about whether subsection headings and/or tables of contents are indented. This can confuse the section location logic.
  • Several errata (especially early ones) could be matched more effectively by accounting for the | and ^ characters that some reporters use to indicate the specific lines and columns to be changed (see, e.g., RFC 6016).
  • Many errata contain smartquotes and other UTF-8 characters, which should be normalized to their ASCII equivalents before attempting to apply them.
  • When original text spans page boundaries but does not include the footer and header text, it is not currently matched. This should be straightforward to fix.
  • While it renders okay, the HTML file that is output is pretty horrific.
  • Character set handling is incorrect.
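
As an illustration of the smartquote normalization mentioned above, here is a minimal sketch of how erratum text might be mapped to ASCII before matching. The helper name normalizeToAscii and the particular set of characters covered are assumptions made for this example; they are not taken from the script.

```javascript
// Hypothetical lookup table: common typographic characters and the ASCII
// sequences that RFC plain text would normally use in their place.
const ASCII_EQUIVALENTS = new Map([
  ["\u2018", "'"],   // left single quotation mark
  ["\u2019", "'"],   // right single quotation mark
  ["\u201C", '"'],   // left double quotation mark
  ["\u201D", '"'],   // right double quotation mark
  ["\u2013", "-"],   // en dash
  ["\u2014", "--"],  // em dash
  ["\u2026", "..."], // horizontal ellipsis
  ["\u00A0", " "],   // no-break space
]);

// Replace every character listed in the table; leave everything else alone.
function normalizeToAscii(text) {
  return text.replace(
    /[\u2018\u2019\u201C\u201D\u2013\u2014\u2026\u00A0]/g,
    (ch) => ASCII_EQUIVALENTS.get(ch)
  );
}

console.log(normalizeToAscii("\u201CMUST\u201D \u2013 don\u2019t\u2026"));
// prints: "MUST" - don't...
```

Running the same normalization over both the erratum's original text and the RFC text before searching would keep the two sides comparable. The table would likely need to grow as other characters turn up in the errata.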

Prerequisites

This script is written for Node.js and requires relatively modern language features.

  • On OS X with Homebrew: brew install node
  • On Debian-based Linux distributions (including Ubuntu): sudo apt-get install nodejs npm

This script relies on the following node libraries, which are available via npm:

  • node-fetch
  • node-html-parser
  • escape-html

You can install them with: npm install node-fetch node-html-parser escape-html

Use

Run ./patch-errata.js <number>, where <number> is the bare RFC number (no "rfc" prefix or ".txt" suffix), e.g., ./patch-errata.js 3261

The output is placed in a subdirectory called out (which you must create if it does not exist) as rfc<number>.html.
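
The script as described expects the out directory to exist already. Purely as a sketch of how that manual step could be avoided, the hypothetical helper below creates the directory on demand and builds the output file name. The name outputPathFor and its baseDir parameter are assumptions for this example and are not part of the script.

```javascript
const fs = require("fs");
const path = require("path");

// Hypothetical helper: return the path "<baseDir>/rfc<number>.html",
// creating baseDir first if needed. With { recursive: true }, mkdirSync
// does not throw when the directory already exists.
function outputPathFor(rfcNumber, baseDir = "out") {
  fs.mkdirSync(baseDir, { recursive: true });
  return path.join(baseDir, `rfc${rfcNumber}.html`);
}
```

Calling outputPathFor(3261) would create out if it is missing and return the path of rfc3261.html inside it.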

Preliminary Statistics

The script first attempts to place each erratum directly where the "text to replace" occurs. Over the entire corpus of RFCs with associated errata, it succeeds 56% of the time. (This number is significantly higher for more modern errata, which are given in a more rigorous "Old Text/New Text" format.) The majority of the remaining errata (27% of all errata) are attached to the start of the corresponding section. The stragglers that cannot be matched in either way (16%) are appended to the end of the document and listed at the very top.

Inline: 56%, Section: 27%, Endnote: 16%
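
The three-tier placement described above can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the script's actual code: it works on plain text rather than HTML-escaped text, it assumes section headings start in column 0, and the names placeErratum, findSectionStart, and the erratum fields section and origText are all hypothetical.

```javascript
// Escape characters that have special meaning in a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Return the offset of the heading line for `section` (e.g. "4.2"),
// or -1 if no such heading is found.
function findSectionStart(rfcText, section) {
  const heading = new RegExp(`^${escapeRegExp(section)}\\.?[ \\t]+\\S.*$`, "gm");
  for (const m of rfcText.matchAll(heading)) {
    // Skip table-of-contents entries, which end in dot leaders and a page number.
    if (!/\.{2,}\s*\d+\s*$/.test(m[0])) return m.index;
  }
  return -1;
}

// Decide where an erratum goes: inline at the first match of the original
// text, else at the start of its section, else as an endnote.
function placeErratum(rfcText, erratum) {
  const inlineAt = erratum.origText ? rfcText.indexOf(erratum.origText) : -1;
  if (inlineAt !== -1) return { kind: "inline", offset: inlineAt };

  const sectionAt = erratum.section ? findSectionStart(rfcText, erratum.section) : -1;
  if (sectionAt !== -1) return { kind: "section", offset: sectionAt };

  return { kind: "endnote", offset: rfcText.length };
}

const sampleDoc = [
  "4.2.  Matching ........ 12",
  "",
  "4.2.  Matching",
  "",
  "   The client MUST retry.",
].join("\n");

console.log(placeErratum(sampleDoc, { section: "4.2", origText: "MUST retry" }).kind);
// prints: inline
console.log(placeErratum(sampleDoc, { section: "4.2", origText: "SHOULD retry" }).kind);
// prints: section
console.log(placeErratum(sampleDoc, { section: "9", origText: "SHOULD retry" }).kind);
// prints: endnote
```

Because indexOf returns the first match anywhere in the document, this sketch shares the limitation noted in the disclaimers: original text that appears in several sections may be placed in the wrong one.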

Example Output

The following documents have significant deployment and numerous errata associated with them, and demonstrate the output of the script. Note that the directory these files are in contains output for all RFCs that have associated errata.

  • RFC 3261 - Session Initiation Protocol (SIP)
  • RFC 5246 - Transport Layer Security (TLS) 1.2
