Chapter 8: So You Want to Run Your Own Server
by Philip Greenspun, part of Philip and Alex's Guide to Web Publishing
Revised May 2008
There are three levels at which you can take responsibility for the technical layers of your public Web site:
Most Internet Service Providers (ISPs), including universities, will give customers a free Web site on one of their servers. You won't be able to do anything fancy with relational databases. You will have a URL that includes the ISP's domain name, e.g., http://members.aol.com/netspringr/ or http://www.mit.edu/people/thomasc/. If you are using a commercial ISP, they will usually be willing to sell you "domain-level service" for $50 to $100 a month. So now instead of advertising http://members.aol.com/greenspun/ as your public site, you advertise http://www.greenspun.com. Instead of "lovetochat@aol.com", you can publish an email address of "philip@greenspun.com". Domain-level service gives you a certain degree of freedom from your ISP. You can unsubscribe from AOL and all of the links to your site that you've carefully nurtured over the years will still work.
Even if you have your own domain, there are some potential downsides to having it parked on someone else's physical computer. You are entirely at the mercy of the system administrators of the remote machine. If e-mail for your domain isn't being forwarded, you can't have your own system administrator fix the problem or go digging around yourself--you have to wait for people who might not have the same priorities that you do. If you are building a sophisticated relational database-backed site, you might not have convenient access to your data because you aren't allowed shell access to the remote machine, only the ability to FTP and HTTP PUT files.
If you want your own domain and basic Web hosting, good starting points are Yahoo Domains, GoDaddy, and www.register.com.
The downside to running your own box is that you have to carefully watch over your computer and Web server program. Nobody else will care about whether your computer is serving pages. You'll have to carry a pager and sign up for a service like Uptime (http://uptime.openacs.org/uptime/) or http://www.redalert.com/ that will email you and beep you when your server is unreachable. You won't be able to go on vacation unless you find someone who understands your computer and the Web server configuration to watch it. Unless you are using free software, you may have to pay shockingly high licensing fees. Internet service providers charge between $250 a month and $4,000 a month for physical hosting, depending on who supplies the hardware, how much system administration the ISP performs, and how much bandwidth your site consumes.
The people who do this usually call the service co-location. The better vendors have a terminal concentrator rigged in reverse so that you can telnet into one of their boxes and then connect to the serial port on the back of your Unix box, just as though you were at the machine's console. For the Windows crowd, the good colo providers have a system whereby you can visit a Web page and power-cycle your machine to bring it back up. If all else fails, they have technicians in their facilities 24x7. Most colo providers have redundant connectivity and private peering with Internet backbone operators. If you take a tour they'll show you their huge bank of batteries and back-up natural gas generator. If this seems like overkill, here's an email that we got a few days after moving a server from MIT to a colo facility:
From: William Wohlfarth <wpwohlfa@PLANT.MIT.EDU>
Subject: Power Outage August 7, 1997
Date: Fri, 8 Aug 1997 07:39:43 -0400

At approximately 5:35pm on August 7, 1997 a manhole explosion in the Kendall Sq area caused Cambridge Electric to lose Kendall Station. MIT lost all power and the gas turbine tripped. Power was fully restored at 7pm. At approximately 7:05pm, a second manhole explosion caused 1 fatality, and injuries to 4 other Cambridge Electric utilitymen including a Cambridge Policeman. Putnam Station was also tripped and MIT lost all power again. At approximately 10:30pm, MIT had power restored to all buildings within our distribution system. Several East campus (E28, E32, E42, E56, E60, NE43) and North/Northwest buildings (N42, N51/52, N57, NW10, NW12, NW14, NW15, NW17, NW20, NW21, NW22, NW30, NW61, NW62, W11, WW15) which are fed directly from Cambridge Electric, were restored by Cambridge Electric personal. Cambridge Electric is still sorting out the chain of events. At last discussions with them, a total of 3 manhole explosions had taken place. Additional information will be posted when available.
Traditional "someone else's computer, someone else's network" service is provided by hosting.com, rackspace.com, and a range of competitors.
Most people buying a server computer make a choice between Unix and Windows. These operating systems offer important 1960s innovations like multiprocessing and protection among processes and therefore either choice can be made to work. Certainly the first thing to do is figure out which operating system supports the Web server and database management software that you want to use. If the software that appeals to you runs on both operating systems, make your selection based on which computer you, your friends, and your coworkers know how to administer. If that doesn't result in a conclusion, read the rest of this section.
Setting up a Unix server properly requires an expert with years of experience. The operating system configuration resides in hundreds of strangely formatted text files, which are typically edited by hand. One of the good things about a Unix server is stability. Assuming that you never touch any of these configuration files, the server will keep doing its job more or less forever.
After all was said and done, Sun's version of Unix proved to be the winner just in time for the entire commercial Unix market to be rendered irrelevant by the open source movement.
The Free Software Foundation, started in 1983 by Richard Stallman, developed a suite of software functionally comparable to the commercial Unices. This software, called GNU for "GNU's Not Unix", was not only open-source but also free and freely alterable and improvable by other programmers. Linus Torvalds built the final and most critical component (the kernel) and gave the finished system its popular name of "Linux", although it would be more accurate to call the system "GNU/Linux".
The GNU/Linux world is a complex one with incompatible distributions available from numerous vendors. Here's how an expert described the different options: "People swear by SuSE. Everyone hates RedHat, but they are the market leaders. Debian is loved by nerds and geeks. Mandrake is to give to your momma to install."
Windows NT crashed.
I am the Blue Screen of Death.
No one hears your screams.
-- Peter Rothman (in Salon Magazine's haiku error message contest)
The name "Microsoft Windows" has been applied to two completely different product development streams. The first was Win 3.1/95/98. This stream never resulted in a product suitable for Internet service or much of anything else. The second stream was Windows NT/2000/XP. Windows NT 3.1 was released in 1993 and did not work at all. NT 3.51 was released in 1995 and worked reasonably well if you were willing to reboot it every week. Windows 2000 was the first Microsoft operating system that was competitive with Unix for reliability. Windows XP contained some further reliability improvements and a hoarde of fancy features for desktop use. Windows .NET Server can be described as "XP for servers" and is just being released now (late 2002).
Windows performance and reliability were so disappointing for so long that Microsoft servers still have a terrible reputation among a lot of older (25+) computer geeks. This reputation is no longer justified, however, as demonstrated by the large number of high-volume Web services that operate from Microsoft servers.
Open-source purists don't like the fact that Windows is closed-source. However, if your goal is to be innovating with an Internet service it is very unlikely that your users will be well served by you mucking around in the operating system source code. The OS is several layers below anything that matters to a user.
Microsoft Windows is good for zero-time projects. You find someone who knows how to read a menu and click a mouse. That person should be able to put up a basic Web service on an XP or .NET Server machine without doing any programming or learning any arcane languages.
GNU/Linux is good for enormous server farms and extreme demands. The folks who run search engines such as Google can benefit by modifying the operating system to streamline the most common tasks. Also if you have 10,000 machines in a rack the savings on software license fees from running a free operating system are substantial.
Microsoft Windows is good for its gentle slope from launching a static Web site to a workgroup to a sophisticated RDBMS-backed multi-protocol Internet service. A person can start with menus and programs such as Microsoft FrontPage and graduate to line-by-line software development when the situation demands.
GNU/Linux is good for embedded systems. If you've built a $200 device that sits on folks' television sets you don't want to pay Microsoft a $10 tax on every unit and you don't want to waste time and money negotiating a deal with Microsoft to get the Windows source code that you'll need to make the OS run on your weird hardware. So you grab GNU/Linux and bend it to the needs of your strange little device.
In our course "Software Engineering for Internet Applications", about half of the students choose to work with Windows and half with GNU/Linux. The students are seniors at the Massachusetts Institute of Technology, majoring in computer science. As a group the Windows users are up and running much faster and with much more sophisticated services. A fair number of GNU/Linux users have such difficulty getting set up that they are forced to drop out of the course. By the end of the semester the remaining students do not seem to be much affected by their choice of operating system. Their struggles are with the data model, page flow, and user experience.
What if you have a database-backed site? If you're doing an SQL query for every page load, don't count on getting more than 500,000 hits a day out of a cheap computer (alternative formulation: no more than 10 requests per second per processor). Even 500,000 a day is too much if your RDBMS installation is regularly going out to disk to answer user queries. In other words, if you are going to support 10 to 20 SQL queries per second, you must have enough RAM to hold your entire data set and must configure your RDBMS to use that RAM. You must also be using a Web/DB integration tool like AOLserver or Microsoft IIS that does not spawn new processes or database connections for each user request.
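To put those numbers in perspective: 500,000 hits spread over the 86,400 seconds in a day works out to roughly 6 requests per second on average. Real traffic is bursty, with daytime peaks several times the daily average, which is why the 10-requests-per-second-per-processor formulation is the more useful one for capacity planning.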
What if your site gets linked from yahoo.com, aol.com, abcnews.com, and cnn.com ... all on the same day? Should you plan for that day by ordering an 8-CPU Oracle server, a rack of Web servers, and a load balancing switch? It depends. If you've got unlimited money and sysadmin resources, go for the big iron. Nobody hands out prizes in the IT world for efficient use of funds. People never adjust for the effort involved in putting together and serving a site. Consider the average catalog shopping site, which cost the average company between $7 million and $35 million according to a mid-1999 Interactive Week survey. Users connecting to the site never say "My God, what a bunch of losers. They could have had a high-school kid build this with Microsoft .NET and the free IBuySpy demo toolkit in about a week." The only thing that a user will remember is that they placed their order and got their Beanie Baby. Assuming that the site was built by a contractor for a publisher, the only thing that the publisher will remember is whether or not it was built on time.
More: See the "Scaling Gracefully" chapter in Internet Application Workbook (http://philip.greenspun.com/internet-application-workbook/).
Each of these factors (the server's API, its RDBMS connectivity, the availability of source code, and its speed) is elaborated upon below.
A Web server API makes it possible for you to customize the behavior of a Web server program without having to write a Web server program from scratch. In the early days of the Web, all the server programs were free. You would get the source code. If you wanted the program to work differently, you'd edit the source code and recompile the server. Assuming you were adept at reading other peoples' source code, this worked great until the next version of the server came along. Suppose the authors of NCSA HTTPD 1.4 decided to organize the program differently than the authors of NCSA HTTPD 1.3. If you wanted to take advantage of the features of the new version, you'd have to find a way to edit the source code of the new version to add your customizations.
An API is an abstraction barrier between your code and the core Web server program. The authors of the Web server program are saying, "Here are a bunch of hooks into our code. We guarantee and document that they will work a certain way. We reserve the right to change the core program but we will endeavor to preserve the behavior of the API call. If we can't, then we'll tell you in the release notes that we broke an old API call."
An API is especially critical for commercial Web server programs where the vendor does not release the source code. Here are some typical API calls from the AOLserver documentation (http://www.aolserver.com):
ns_passwordcheck user password returns 1 (one) if the user and password combination is legitimate. It returns 0 (zero) if either the user does not exist or the password is incorrect.
ns_sendmail to from subject body sends a mail message
Originally AOLserver was a commercial product. So the authors of AOLserver wouldn't give you their source code and they wouldn't tell you how they'd implemented the user/password database for URL access control. But they gave you a bunch of functions like ns_passwordcheck that let you query the database. If they redid the implementation of the user/password database in a subsequent release of the software then they redid their implementation of ns_passwordcheck so that you wouldn't have to change your code. The ns_sendmail API call not only shields you from changes by AOLserver programmers, it also allows you to not think about how sending e-mail works on various computers. Whether you are running AOLserver on Windows, HP Unix, or Linux, your extensions will send e-mail after a user submits a form or requests a particular page.
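To make the discussion concrete, here is a minimal sketch of a Tcl page that uses both calls. The page name, form fields, and e-mail text are invented for illustration; only ns_queryget, ns_passwordcheck, ns_sendmail, and ns_return are taken from the AOLserver documentation:

    # register-confirm.tcl -- hypothetical example, not part of any shipped software
    set user     [ns_queryget user]
    set password [ns_queryget password]
    set email    [ns_queryget email]

    if { [ns_passwordcheck $user $password] } {
        # notify the user by e-mail; the from-address is a placeholder
        ns_sendmail $email "webmaster@greenspun.com" "Welcome back" "Hello $user, your password checked out."
        ns_return 200 text/html "<html><body>Thanks, $user; confirmation mail is on its way.</body></html>"
    } else {
        ns_return 200 text/html "<html><body>Unknown user or bad password.</body></html>"
    }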
AOLserver is now open-source, with source code and a developer's community available from www.aolserver.com.
Aside from having a rich set of functions, a good API has a rapid development environment and a safe language. The most common API is for the C programming language. Unfortunately, C is probably the least suitable tool for Web development. Web sites are by their very nature experimental and must evolve. C programs like Microsoft Word remain unreliable despite hundreds of programmer-years of development and thousands of person-years of testing. A small error in a C subroutine that you might write to serve a single Web page could corrupt memory critical to the operation of the entire Web server and crash all of your site's Web services. On operating systems without interprocess protection, such as Windows 95 or the Macintosh, the same error could crash the entire computer.
Even if you were some kind of circus freak programmer and were able to consistently write bug-free code, C would still be the wrong language because it has to be compiled. Making a small change in a Web page might involve dragging out the C compiler and then restarting the Web server program so that it would load the newly compiled version of your API extension.
By the time a Web server gets to version 2.0 or 3.0, the authors have usually figured that C doesn't make sense and have compiled in an interpreter for Tcl, Visual Basic, Java byte codes, or JavaScript.
You'll want a Web server that can talk to this RDBMS. All Web servers can invoke CGI scripts that in turn open connections to an RDBMS, execute a query, and return the results formatted as an HTML page. However, some Web servers offer built-in RDBMS connectivity. With these, the same project can be accomplished with much cleaner and simpler programs and with a tenth of the server resources.
Did this same development cycle work well for Web server programs? Although the basic activity of transporting bits around the Internet has been going on for three decades, there was no Web at Xerox PARC in 1975. The early Web server programs did not anticipate even a fraction of user needs. Web publishers did not want to wait years for new features or bug fixes. Thus an important feature for a Web server circa 1998 was source code availability. If worst came to worst, you could always get an experienced programmer to extend the server or fix a bug.
At the time of this revision, November 2002, two factors combined to make source code availability irrelevant for most Internet application developers: (1) Web server programs today address all the most important user needs that were identified between 1990 and 2001, and (2) the pace of innovation among publishers and developers has slowed to the point where commercial software release cycles are reasonably in step with user goals.
If you're building something that is unlike anything else that you've seen on the public Internet, e.g., a real-time multiuser game with simultaneous voice and video streams flying among the users, you should look for a Web server program whose source code is available. If on the other hand you're building something vaguely like amazon.com or yahoo.com it is very likely that you'll never have the slightest interest in delving down into the Web server source.
It is possible to throw away 90 percent of your computer's resources by choosing the wrong Web server program. Traffic is so low at most sites and computer hardware so cheap that this doesn't become a problem until the big day when the site gets listed on the Netscape site's What's New page. In the summer of 1996, that link delivered several extra users every second at the Bill Gates Personal Wealth Clock (http://philip.greenspun.com/WealthClock). Every other site on Netscape's list was unreachable. The Wealth Clock was working perfectly from a slice of a 70 MHz pizza-box computer. Why? It was built using AOLserver and cached results from the Census Bureau and stock quote sites in AOLserver's virtual memory.
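The technique is easy to reproduce with a handful of API calls: fetch the upstream pages on a timer, park the results in the server's memory, and serve user requests from that in-memory copy. Here is a rough sketch; ns_httpget, ns_schedule_proc, and the nsv shared-variable commands are standard AOLserver facilities, while the URL, the variable names, and the 15-minute interval are made up (this is not the actual Wealth Clock source code):

    proc wealth_clock_refresh {} {
        # re-fetch the upstream page; swallow errors so a dead remote site can't hurt us
        if { ![catch { ns_httpget "http://www.census.gov/popclock" } page] } {
            nsv_set wealth_clock cached_page $page
        }
    }
    wealth_clock_refresh
    ns_schedule_proc 900 wealth_clock_refresh    ;# re-fetch every 15 minutes

    # ... and in the page that users actually request:
    set page [nsv_get wealth_clock cached_page]

Users never wait on the Census Bureau or a stock quote server; they get whatever copy is sitting in memory, which is at most a few minutes stale.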
Given the preceding criteria, let's evaluate the three Web server programs that are sensible choices for a modern Web site (see "Interfacing a Relational Database to the Web" for a few notes on why the rest of the field was excluded).
AOLserver has a rich and comprehensive set of API calls. Some of the more interesting ones are the following:
These are accessible from C and, more interestingly, from the Tcl interpreter that they've compiled into the server. It is virtually impossible to crash AOLserver with a defective Tcl program. There are several ways of developing Tcl software for the AOLserver but the one with the quickest development cycle is to use *.tcl URLs or *.adp pages.
A file with a .tcl extension anywhere among the .html pages will be sourced by the Tcl interpreter. So you have URLs like "/bboard/fetch-msg.tcl". If asking for the page results in an error, you know exactly where in the Unix file system to look for the program. After editing the program and saving it in the file system, the next time a browser asks for "/bboard/fetch-msg.tcl" the new version is sourced. You get all of the software maintenance advantages of interpreted CGI scripts without the CGI overhead.
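For instance, a hypothetical fetch-msg.tcl might contain nothing more than this (the real bulletin board software is more involved; ns_queryget and ns_return are the only AOLserver specifics here):

    # /bboard/fetch-msg.tcl -- illustrative sketch only
    set msg_id [ns_queryget msg_id]
    # a real version would look the message up in the database at this point
    ns_return 200 text/html "<html><body>You asked for message number $msg_id.</body></html>"

Edit the file, save it, and the very next request for "/bboard/fetch-msg.tcl" runs the new version; there is no compile step and no server restart.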
What the in-house AOL developers like the most are their .adp pages. These work like Microsoft Active Server Pages. A standard HTML document is a legal .adp page, but you can make it dynamic by adding little bits of Tcl code inside special tags.
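For example, an .adp page might look like the following, where everything is ordinary HTML except the fragment of Tcl between the <%= and %> delimiters (a sketch; consult the AOLserver ADP documentation for the full tag syntax):

    <html>
    <head><title>Hello</title></head>
    <body>
    <h2>Hello</h2>
    The time on the server is now <%= [clock format [clock seconds]] %>.
    </body>
    </html>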
AOLserver programmers also use Java and PHP.
Though AOLserver shines in the API department, its longest suit is its RDBMS connectivity. The server can hold open pools of connections to multiple relational database management systems. Your C or Tcl API program can ask for an already-open connection to an RDBMS. If none is available, the thread will wait until one is, then the AOLserver will hand your program the requested connection into which you can pipe SQL. This architecture improves Web/RDBMS throughput by at least a factor of ten over the standard CGI approach.
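Here is a sketch of what the Tcl side of this looks like; the table and column names are invented, but ns_db gethandle, ns_db select, ns_db getrow, ns_set get, and ns_db releasehandle are the documented AOLserver calls:

    set db [ns_db gethandle]       ;# an already-open connection, borrowed from the pool
    set row [ns_db select $db "select email, posting_time from bboard_messages"]
    set html "<ul>"
    while { [ns_db getrow $db $row] } {
        append html "<li>[ns_set get $row email] posted at [ns_set get $row posting_time]"
    }
    append html "</ul>"
    ns_db releasehandle $db        ;# hand the connection back for the next request
    ns_return 200 text/html "<html><body>$html</body></html>"

No process is forked and no new database connection is opened per request; the expensive setup work was done once, when the server started.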
Source code for AOLserver is available from www.aolserver.com.
The most substantial shrink-wrapped software package specifically for AOLserver is OpenACS, available from www.openacs.org, a toolkit for building online communities.
The Apache API reflects this dimorphism. The simple users aren't expected to touch it. The complex users are expected to be C programming wizards. So the API is flexible but doesn't present a very high level substrate on which to build.
Support for Apache is as good as your wallet is deep. You download the software free from http://www.apache.org and then buy support separately from the person or company of your choice. Because the source code is freely available there are thousands of people worldwide who are at least somewhat familiar with Apache's innards.
There don't seem to be too many shrink-wrapped software packages for Apache that actually solve any business problem as defined by a business person. If you're a programmer, however, you'll find lots of freeware modules to make your life easier, including some popular ones that let you
Apache is a pre-forking server and is therefore reasonably fast.
Source code availability is a weak point for IIS. If it doesn't do what you want you're stuck. On the plus side, however, the API for IIS is comprehensive and is specially well suited to the Brave New World of Web Services (computers talking to other computers using HTTP and XML-format requests).
Virtually everyone selling shrink-wrapped software for Web services offers a version for IIS, including Microsoft itself with a whole bunch of packages.
IIS is one of the very fastest Web server programs available.
For Web service, it is risky to rely on anyone other than a backbone network for T1 service. It is especially risky to route your T1 through a local ISP who in turn has T1 service from a backbone operator. Your local ISP may not manage their network competently. They may sell T1 service to 50 other companies and funnel you all up to Sprint through one T1 line.
You'll have to be serving more than 500,000 hits a day to max out a T1 line (or be serving 50 simultaneous real-time audio streams). When that happy day arrives, you can always get a 45-Mbps T3 line for about $15,000 a month.
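The arithmetic behind that claim: a T1 carries 1.544 megabits per second, or roughly 190 kilobytes per second. 500,000 hits per day averages out to about 6 requests per second, so the line is saturated once the average response (HTML plus images) grows past something on the order of 30 kilobytes per hit.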
If your office is in a city where a first-class colocation company does business, an intelligent way to use the T1 infrastructure is to co-locate your servers and then run a T1 line from your office to your co-location cage. For the price of just the "local loop" charge (under $500 per month), you'll have fast reliable access to your servers for development and administration. Your end-users will have fast, reliable, and redundant access to your servers through the colocation company's connections.
If all of the folks on your block started running pornographic Web servers on their Linux boxes then you'd have a bit of a bandwidth crunch because you are sharing a 10-Mbit channel with 100 other houses. But there is no reason the cable company couldn't split off some fraction of those houses onto another Ethernet in another unused video channel.
Cable modems are cheap and typically offer reasonable upload speeds of 384 Kbps (remember that it is the upload speed that determines how well your cable modem connection will function for running a Web service). There are some limitations with cable modems that make it tough to run a Web server from one. The first is that the IP address is typically dynamically assigned via DHCP and may change from time to time. This can be addressed by signing up with dyndns.org and writing a few scripts to periodically update the records at dyndns.org with your server's current IP address. When a user asks to visit "www.africancichlidclub.org", dyndns.org's name servers translate that hostname into the most recent IP address they have been given. A limitation that is tougher to get around is that some cable companies block incoming requests to port 80, the standard HTTP service port. This may be an effort to enforce a service agreement that prohibits customers from running servers at home. It may be possible to get around this with a URL forwarding service that will bounce a request for "www.africancichlidclub.org" to "24.128.190.85:8000".
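A sketch of the sort of update script people run from cron, written in Tcl to match the rest of this chapter's examples. The account details and helper URLs are placeholders, and the /nic/update interface shown is only my recollection of dyndns.org's protocol, so check their documentation before relying on it:

    #!/usr/bin/tclsh
    # update-dyndns.tcl -- hypothetical sketch; run every few minutes from cron
    set hostname "www.africancichlidclub.org"
    set user     "your-dyndns-account"          ;# placeholder
    set password "your-dyndns-password"         ;# placeholder

    # discover the current public IP address (the checkip URL is a placeholder)
    set reply [exec curl -s "http://checkip.dyndns.org/"]
    if { ![regexp {(\d+\.\d+\.\d+\.\d+)} $reply -> ip] } {
        puts stderr "could not determine public IP address"
        exit 1
    }

    # push the new address to dyndns.org
    exec curl -s --user "$user:$password" \
        "https://members.dyndns.org/nic/update?hostname=$hostname&myip=$ip"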
An alternative to cable is Digital Subscriber Line or "DSL". Telephony uses 1 percent of the bandwidth available in the twisted copper pair that runs from the central phone company office to your house ("the local loop"). ISDN uses about 10 percent of that bandwidth. With technology similar to that in a 28.8 modem, ADSL uses 100 percent of the local loop bandwidth, enough to deliver 6 Mbps to your home or business. There are only a handful of Web sites on today's Internet capable of maxing out an ADSL line. The price of ADSL is low because the service competes with cable modems, i.e., $20-50 per month. Telephone monopolies such as Verizon often impose restrictions on the use of the DSL line that make it impossible to run a home server. Look for competitive local exchange carriers (CLECs), smaller companies that provide DSL service from the same physical phone company office. An example of a CLEC is www.covad.com, which sells a fixed IP address for $70 per month (1.5 Mbps down and 384 Kbps up).
"Sprint's ION integrates voice, video, and data over one line," said Charles Flackenstine, a manager of technology services at Sprint. "For small and medium businesses it leverages the playing field, giving them the capability to become a virtual corporation."Processing power per dollar has been growing exponentially since the 1950s. In 1980, a home computer was lucky to execute 50,000 instructions per second and a fast modem was 2,400 bps. In 1997, a home computer could do 200 million instructions per second but it communicated through a 28,800-bps modem. We got a 4,000-fold improvement in processing power but only a tenfold improvement in communication speed.
-- news.com (June 2, 1998)
The price of bandwidth is going to start falling at an exponential rate. Our challenge as Web publishers is to figure out ways to use up all that bandwidth.