
New NFS to bring parallel storage to the masses

January 21, 2009

This article was contributed by Joab Jackson

Sometime around the end of January or early February, the Internet Engineering Task Force will give its final blessing to the latest version of the venerable Network File System (NFS), version 4.1. While the authors of the standard have stressed that this is a minor revision of NFS, it does have at least one seemingly radical new option, called Parallel NFS (pNFS).

The "parallel" tag of pNFS means NFS clients can access large pools of storage directly, rather than go through the storage server. Unbeknown to the clients, what they store is striped across multiple disks, so when that data is needed it can be called back in parallel, cutting retrieval time even more. If you run a cluster computer system, you may immediately recognize the appeal of this approach.

"We're starting the process of feeding all these patches up to the Linux NFS maintainers," said Brent Welch, the director of software architecture for Panasas who is also one of that storage company's contributors of the pNFS code. He noted that the work for the prototyping and implementing pNFS in Linux, as part of NFS, has been going on for about two years. Ongoing work has included updating both the NFS client and NFS server software.

The code will be proposed for the Linux kernel in two sets, according to Welch. The first set will contain the basic procedures for setting up and tearing down pNFS sessions, using Remote Procedure Call (RPC) operations for exchanging IDs and for initiating and ending sessions. The development teams are gunning to have this basic outline of pNFS included in the 2.6.30 kernel. The second set, targeted at the 2.6.31 kernel, will be a larger patch, including the I/O commands for accessing and changing file layouts as well as for reading and writing data. Given that it will take a few more months after 2.6.31 is released for it to be picked up by the major distributions, pNFS probably won't start to be deployed by even the most ambitious IT shops until at least the early part of 2010.
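
For the curious, the handshake covered by that first patch set looks roughly like the following sketch. The operation names (EXCHANGE_ID, CREATE_SESSION, DESTROY_SESSION, DESTROY_CLIENTID) come from the NFSv4.1 draft; the rpc() helper and its arguments are invented for illustration.

    def establish_session(rpc):
        # Introduce ourselves and obtain a client ID from the server.
        client_id, sequence = rpc("EXCHANGE_ID", owner="client-hostname")
        # Turn that client ID into a session with negotiated channel limits.
        session_id = rpc("CREATE_SESSION", client_id=client_id,
                         sequence=sequence)
        return client_id, session_id

    def tear_down(rpc, client_id, session_id):
        # The mirror-image teardown operations.
        rpc("DESTROY_SESSION", session_id=session_id)
        rpc("DESTROY_CLIENTID", client_id=client_id)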

We all know NFS. It allows client machines to mount Unix file systems that reside across the network as if they were local disks. Many Network Attached Storage (NAS) arrays use NFS. With NAS, a lot of hard drives sit behind a single IP address, all managed by the NAS box. NAS allows organizations to pool storage, so storage administrators can more fluidly (and hence more efficiently) allocate that storage across all users.

In a 2004 problem statement, two of the developers responsible for getting pNFS in motion, Panasas chief technology officer Garth Gibson and Network Appliance (NetApp) engineer Peter Corbett, explained the limitations of this approach, especially in high performance computing environments:

The storage I/O bandwidth requirements of clients are rapidly outstripping the ability of network file servers to supply them. [...] The NFSv4 protocol currently requires that all the data in a single file system be accessible through a single exported network endpoint, constraining access to be through a single NFS server.

In a nutshell, the potential roadblock with NAS, or any type of NFS-based network storage, is the NAS head, or server, they explained. If too many of your clients hit the NAS server at the same time, I/O slows for everyone. You could go back to direct access, but then you lose the efficiencies of pooled storage. For cluster computer systems, in which dozens of nodes can be working on the same data set, such partitioned storage just isn't feasible. Nor are multiple storage servers: an NFS-based system cannot support multiple servers writing to the same file system.

Gibson and Corbett were early champions of developing pNFS, along with Los Alamos National Laboratory's Gary Grider. Additional work was carried out by engineers at EMC, Panasas, NetApp, and other companies. The University of Michigan's Center for Information Technology Integration (CITI), along with members of the IBM Almaden Research Center, is developing a pNFS implementation for Linux, for both clients and storage servers.

pNFS will allow clients to connect directly to the storage devices they need, rather than going through a storage gateway of some sort. The folks behind pNFS like to say that their approach separates the control traffic from the data traffic. When a client requests a particular file or block of storage, it sends a request to a server called the Metadata Server (MDS), which returns a map of where all the data resides within the storage network. The client can then access that data directly, according to permissions set by the file system. When the client alters that storage, it notifies the MDS of the changes, and the MDS updates the file layout.
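
As a sketch of that control/data split, a write might proceed along these lines (the mds and devices handles and their methods are invented for illustration; the LAYOUTGET and LAYOUTCOMMIT operation names come from the NFSv4.1 draft):

    def pnfs_write(mds, devices, path, offset, data):
        # Control path: ask the metadata server where the bytes live.
        layout = mds.call("LAYOUTGET", path=path,
                          offset=offset, length=len(data))

        # Data path: send each segment straight to the device that holds
        # it, bypassing the metadata server entirely.
        for seg in layout.segments:
            devices[seg.device_id].write(seg.device_offset,
                                         data[seg.start:seg.end])

        # Control path again: report the change so the MDS can update
        # the file's size, timestamps, and layout.
        mds.call("LAYOUTCOMMIT", path=path, offset=offset,
                 length=len(data))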

Since pNFS allows clients to talk directly to the storage devices, and permits client data to be striped across multiple storage devices, the client can enjoy a higher I/O rate than it would get by going through a single NAS head or a single storage server. In 2007, three developers from the IBM Almaden Research Center, Dean Hildebrand, Marc Eshel, and Roger Haskin, demonstrated [PDF] at the Supercomputing 2007 conference (SC07) how three clients could saturate a 10 gigabit link by drawing data from 336 Linux-based storage devices. Such throughput "would be hard to achieve using standard NFS in terms of accessing a single file," Hildebrand said. "We wanted to show that pNFS could scale to the network hardware available."

pNFS is largely made up of three protocols. The first handles the mapping, or layout, of resources and resides on the client; it interprets and makes use of the data map returned from the metadata server. The second is the transport protocol, which also resides on the client; it coordinates data transfer between the client and the storage devices and handles the actual I/O. The third is a control protocol that keeps the metadata server synchronized with the storage devices. This last protocol is the only one not specified by NFS; it is left to the storage vendors, though much of the work it does can be codified in NFS commands.
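
Viewed as interfaces, the three roles divide up roughly like this (a sketch only; the class and method names are invented for illustration, not taken from any implementation):

    class LayoutDriver:
        """Client side: interprets the map returned by the metadata server."""
        def segments_for(self, layout, offset, length): ...

    class TransportDriver:
        """Client side: moves the bytes between the client and the
        storage devices."""
        def read(self, device, device_offset, length): ...
        def write(self, device, device_offset, data): ...

    class ControlProtocol:
        """Server side: keeps the metadata server and the storage devices
        consistent.  This is the piece NFSv4.1 leaves to the storage
        vendor."""
        def invalidate_layouts(self, path): ...
        def propagate_attributes(self, path, attrs): ...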

pNFS can work with three types of storage: file-based, object-based, and block-based. The NFSv4.1 protocol itself contains the file-based storage protocol; additional RFCs are being developed for the object and block protocols. File-based storage is what most system administrators think of as storage; it is the standard approach of nesting files within a hierarchical set of directories. Block-based storage is used in Storage Area Networks (SANs), in which applications access disk space directly by sending Small Computer System Interface (SCSI) commands over Fibre Channel or, increasingly of late, over TCP/IP via the Internet SCSI (iSCSI) protocol. Object-based storage is a somewhat newer beast, a parallel approach that bundles the data itself with self-describing metadata.

A word on semantics: keep in mind that, just as NFS is not a file system itself, neither is pNFS. NFS provides the protocols to work with remote files as if they were local. Likewise, pNFS offers the ability to work with files managed by a parallel file system as if they were on a local drive, handling such tasks as setting permissions and ensuring data integrity. Fortunately, a number of parallel file systems have been spawned over the past few years that should work easily with pNFS. On the open source front, there is the Parallel Virtual File System (PVFS). Perhaps the most widely used open-source parallel file system is Lustre, now overseen by Sun Microsystems. On the commercial front, Panasas' PanFS file system has been successfully deployed in high-performance computing clusters, as has IBM's General Parallel File System (GPFS). All of these approaches use a similar idea: let the clients talk to the storage devices directly, while having some form of metadata server keep track of the storage layout. But most of the other options rely on using a single vendor's gear.

"The main advantage [to using pNFS] is expected to be on the client side", noted CITI programmer J. Bruce Fields, who does the NFS 4.1 testing on Linux servers. With most parallel file systems you have to do some kernel reconfigurations on the clients so that they can work with the file systems. With the prototype Linux client, you can run a standard mount command and get the files you need. "The client will automatically negotiate pNFS and find the data servers. By the time we're done that should work on any out-of-the-box Linux client from the distribution of your choice", he says.

The advantages that pNFS will bring are familiarity and the fact that it will come already built in as part of NFS. Since NFS is a standard component in almost all Linux kernel builds, that will greatly reduce the amount of work administrators need to do to set up a parallel file system for Linux servers. Most administrators are far more familiar with the general operation of NFS than with, say, Lustre, which requires numerous kernel patches and a different mindset when it comes to understanding its commands.

pNFS should help storage vendors as well, as they will not have to port client software to numerous Linux distributions. Welch, for instance, noted that Panasas has to maintain client code for dozens of different Linux distributions; with pNFS, the company can rely on NFS and focus on its storage devices. Already, Panasas, NetApp, EMC, and IBM have all promised [PDF] to support pNFS in at least some of their storage products, according to a collective talk some of the developers gave last month at the SC08 conference. Sun Microsystems also plans to support pNFS in Solaris.

And while much of the early focus of pNFS has been on large-scale cluster operations, one day even workstations and desktops may use pNFS in some form. LANL's Gary Grider pointed out that, "at some point, having several teraflops may even be possible in your office, in which case you may need something more than just NFS for data access for such a powerful personal system. pNFS may end up being handy in this environment as well."

Indeed. Once upon a time we were limited to working on files on our own machines, FTP'ing in anything that was located elsewhere. But NFS allowed us to mount drives across the network with a relatively simple command. Now, pNFS may simplify things a step further, by allowing us to pull in and write large files, or myriad files, at a speed we can now only dream about. At least that is the promise of pNFS.

Index entries for this article
Kernel: Clusters/Filesystems
Kernel: Network filesystems
GuestArticles: Jackson, Joab



New NFS to bring parallel storage to the masses

Posted Jan 22, 2009 4:21 UTC (Thu) by jwb (guest, #15467) [Link] (7 responses)

Does a pNFS mount have the full POSIX semantics like a real local filesystem? Can you safely deliver mail on it? I've tested storage from vendors like Ibrix and Isilon and I've always found that they fail in simple scenarios, such as two clients that open the same file in O_APPEND mode. On local unix filesystems this works fine, and you get coherent results, but on most commercial cluster storage you get gibberish. The only place where I've successfully exercised the full POSIX feature set is Lustre, which works perfectly in my experience. I hope that pNFS takes after Lustre more than it takes after NFSv4.

New NFS to bring parallel storage to the masses

Posted Jan 22, 2009 12:50 UTC (Thu) by epa (subscriber, #39769) [Link] (1 responses)

Is there a test suite you can run to check a filesystem's POSIX compliance for cases like this?

New NFS to bring parallel storage to the masses

Posted Jan 22, 2009 16:18 UTC (Thu) by eli (guest, #11265) [Link]

I don't know if fsx specifically tests that case, but it may:
http://www.codemonkey.org.uk/projects/fsx/

New NFS to bring parallel storage to the masses

Posted Jan 22, 2009 16:21 UTC (Thu) by snitm (guest, #4031) [Link] (4 responses)

Can you elaborate on your "two clients that open the same file in O_APPEND mode" scenario? What are your expectations? What is each client's write workload? Are they each just blasting N bytes into the same file without any higher-level application coordination? What are you saying Lustre gets right and Ibrix, Isilon, etc. get wrong?

New NFS to bring parallel storage to the masses

Posted Jan 22, 2009 16:56 UTC (Thu) by jwb (guest, #15467) [Link] (3 responses)

When a file is opened with O_APPEND, a call to write() causes a seek to the end of the file and a write, atomically. On a normal local filesystem, n-many clients can do this to the same file at once, and their writes will all be atomic. This also works on Lustre. It definitely does not work on ordinary NFS, and it also does not work on some of the other commercial distributed/cluster filesystems I have tested.
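
A minimal way to demonstrate the scenario in Python (the file name is arbitrary; run several copies of this at once against the same file):

    import os

    # O_APPEND makes each write() land atomically at the current end of
    # the file, so concurrent writers should not overwrite each other.
    # On a local POSIX filesystem every line should come out intact; the
    # gibberish described above is what happens when a filesystem does
    # not honor these semantics.
    fd = os.open("shared.log", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    for i in range(1000):
        os.write(fd, ("writer %d line %d\n" % (os.getpid(), i)).encode())
    os.close(fd)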

New NFS to bring parallel storage to the masses

Posted Jan 22, 2009 20:32 UTC (Thu) by felixfix (subscriber, #242) [Link]

There used to be a string attached to those atomic writes. If a write was too many bytes, either exceeding some absolute limit (4096 bytes?) or crossing a page boundary, it was split into multiple atomic writes. But I haven't had need to worry about this for many years, so I may misremember the details.

New NFS to bring parallel storage to the masses

Posted Jan 24, 2009 14:14 UTC (Sat) by xav (guest, #18536) [Link] (1 responses)

I don't think there has ever been a guarantee on write() to be atomic. What write() does is return the number of bytes it could store, that's all.

New NFS to bring parallel storage to the masses

Posted Jan 24, 2009 17:19 UTC (Sat) by jwb (guest, #15467) [Link]

"If set, then all write operations write the data at the end of the file, extending it, regardless of the current file position. This is the only reliable way to append to a file. In append mode, you are guaranteed that the data you write will always go to the current end of the file, regardless of other processes writing to the file. Conversely, if you simply set the file position to the end of file and write, then another process can extend the file after you set the file position but before you write, resulting in your data appearing someplace before the real end of file."

http://theory.uwinnipeg.ca/gnu/glibc/libc_144.html

New NFS to bring parallel storage to the masses

Posted Jan 25, 2009 22:04 UTC (Sun) by job (guest, #670) [Link]

So how will a typical setup look? Do you run several pNFS nodes on individual block devices? On top of Lustre? Instead of Lustre?

New NFS to bring parallel storage to the masses

Posted Jan 29, 2009 16:26 UTC (Thu) by malcolmparsons (guest, #46787) [Link] (1 responses)

> Unbeknown to the clients, what they store is striped across multiple discs

s/disc/disk/

New NFS to bring parallel storage to the masses

Posted Jan 29, 2009 16:48 UTC (Thu) by jake (editor, #205) [Link]

> s/disc/disk/

fixed, thanks!

jake

