Anatomy of the Linux networking stack
Summary: One of the greatest features of the Linux® operating system is its networking stack. It was initially a derivative of the BSD stack and is well organized with a clean set of interfaces. Its interfaces range from the protocol agnostic, such as the common sockets layer interface or the device layer, to the specific interfaces of the individual networking protocols. This article explores the structure of the Linux networking stack from the perspective of its layers and also examines some of its major structures.
While formal introductions to networking commonly refer to the Open Systems Interconnection (OSI) model, this introduction to the basic networking stack in Linux uses the four-layer model known as the Internet model (see Figure 1).
Figure 1. The Internet model of a network stack
At the bottom of the stack is the link layer. The link layer refers to the device drivers providing access to the physical layer, which could be any of numerous media, such as serial links or Ethernet devices. Above the link layer is the network layer, which is responsible for directing packets to their destinations. The next layer, called the transport layer, is responsible for peer-to-peer communication (for example, within a host). While the network layer manages communication between hosts, the transport layer manages communication between endpoints within those hosts. Finally, there’s the application layer, which is commonly the semantic layer that understands the data being moved. For example, the Hypertext Transfer Protocol (HTTP) moves requests and responses for Web content between a server and a client.
Practically speaking, the layers of the networking stack go by much more recognizable names. At the link layer, you find Ethernet, the most common high-speed medium. Older link-layer protocols include the serial protocols such as the Serial Line Internet Protocol (SLIP), Compressed SLIP (CSLIP), and the Point-to-Point Protocol (PPP). The most common network layer protocol is the Internet Protocol (IP), but other protocols exist at the network layer that satisfy other needs, such as the Internet Control Message Protocol (ICMP) and the Address Resolution Protocol (ARP). At the transport layer are the Transmission Control Protocol (TCP) and the User Datagram Protocol (UDP). Finally, the application layer includes many familiar protocols, including the standard Web protocol, HTTP, and the e-mail protocol, Simple Mail Transfer Protocol (SMTP).
Now on to the architecture of the Linux network stack and how it implements the Internet model. Figure 2 provides a high-level view of the Linux network stack. At the top is the user space layer, or application layer, which defines the users of the network stack. At the bottom are the physical devices that provide connectivity to the networks (serial or high-speed networks such as Ethernet). In the middle, or kernel space, is the networking subsystem that is the focus of this article. Through the interior of the networking stack flow socket buffers (sk_buffs) that move packet data between sources and sinks. The sk_buff structure is described in more detail later in this article.
Figure 2. Linux high-level network stack architecture
First, here’s a quick overview of the core elements of the Linux networking subsystem, followed by more detail in later sections. At the top (see Figure 2) is the system call interface. This simply provides a way for user-space applications to gain access to the kernel’s networking subsystem. Next is a protocol-agnostic layer that provides a common way to work with the underlying transport-level protocols. Next are the actual protocols, which in Linux include the built-in protocols of TCP, UDP, and, of course, IP. Next is another agnostic layer that permits a common interface to and from the individual device drivers that are available, followed at the end by the individual device drivers themselves.
System call interface
The system call interface can be described from two perspectives. When a networking call is made by the user, it is multiplexed through the system call interface into the kernel. This ends up as a call to sys_socketcall in ./net/socket.c, which then further demultiplexes the call to its intended target. The other perspective of the system call interface is the use of normal file operations for networking I/O. For example, typical read and write operations may be performed on a networking socket (which is represented by a file descriptor, just as a normal file). Therefore, while there exist a number of operations that are specific to networking (creating a socket with the socket call, connecting it to a destination with the connect call, and so on), there are also a number of standard file operations that apply to networking objects just as they do to regular files. In the end, the syscall interface provides the means to transfer control between the user-space application and the kernel.
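As a minimal user-space sketch of the second perspective, the following C function pushes a message through a socket using only the ordinary write and read calls. It uses socketpair with AF_UNIX so that it runs without any network configuration; that choice and the helper name socket_file_roundtrip are assumptions made for illustration, not something taken from the kernel source.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: writes msg into one end of a connected socket pair
 * and reads it back from the other end with plain file operations.
 * Returns the number of bytes read, or -1 on error. */
ssize_t socket_file_roundtrip(const char *msg, char *out, size_t outlen)
{
    int fds[2];
    ssize_t n = -1;
    size_t len;

    assert(msg != NULL && out != NULL);
    len = strlen(msg);
    /* An empty message would leave read() blocked, so reject it. */
    if (len == 0 || outlen < 2)
        return -1;

    /* Both descriptors refer to sockets, yet they behave as ordinary fds. */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0)
        return -1;

    /* write() and read() are the same calls used on regular files.
     * A short message is assumed to arrive in a single read here. */
    if (write(fds[0], msg, len) == (ssize_t)len) {
        n = read(fds[1], out, outlen - 1);
        if (n >= 0)
            out[n] = '\0';
    }

    close(fds[0]);
    close(fds[1]);
    return n;
}
```

A caller passes a short string and a buffer; for a network connection the descriptor would instead come from socket and connect, but the read and write calls stay the same.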
Protocol agnostic interface
The sockets layer is a protocol agnostic interface that provides a set of common functions to support a variety of different protocols. The sockets layer not only supports the typical TCP and UDP protocols, but also IP, raw Ethernet, and other transport protocols, such as Stream Control Transmission Protocol (SCTP).
Communication through the network stack takes place with a socket. The socket structure in Linux is struct sock, which is defined in linux/include/net/sock.h. This large structure contains all of the required state of a particular socket, including the particular protocol used by the socket and the operations that may be performed on it.
The networking subsystem knows about the available protocols through a special structure that defines its capabilities. Each protocol maintains a structure called proto (found in linux/include/net/sock.h). This structure defines the particular socket operations that can be performed from the sockets layer to the transport layer (for example, how to create a socket, how to establish a connection with a socket, how to close a socket, and so on).
The network protocols section defines the particular networking protocols that are available (such as TCP, UDP, and so on). These are initialized at start of day in a function called inet_init in linux/net/ipv4/af_inet.c (as TCP and UDP are part of the inet family of protocols). The inet_init function registers each of the built-in protocols using the proto_register function. This function is defined in linux/net/core/sock.c, and, in addition to adding the protocol to the active protocol list, it also optionally allocates one or more slab caches if required.
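The registration idea can be sketched in user-space C as a table of function pointers added to a linked list. Everything named toy_ below is hypothetical and far simpler than the kernel's struct proto and proto_register; the duplicate-name check is an assumption of this sketch, and slab cache allocation is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for struct proto: a name plus operations the sockets
 * layer may invoke. Field names are illustrative only. */
struct toy_proto {
    const char *name;
    int (*connect)(int sock_id);
    void (*close)(int sock_id);
    struct toy_proto *next;      /* link in the active protocol list */
};

static struct toy_proto *toy_proto_list;   /* head of the active list */

/* Adds prot to the active list. Returns 0 on success, or -1 if a
 * protocol with the same name is already registered. */
int toy_proto_register(struct toy_proto *prot)
{
    assert(prot != NULL && prot->name != NULL);
    for (struct toy_proto *p = toy_proto_list; p != NULL; p = p->next)
        if (strcmp(p->name, prot->name) == 0)
            return -1;
    prot->next = toy_proto_list;
    toy_proto_list = prot;
    return 0;
}

/* Looks up a registered protocol by name; returns NULL if absent. */
struct toy_proto *toy_proto_find(const char *name)
{
    assert(name != NULL);
    for (struct toy_proto *p = toy_proto_list; p != NULL; p = p->next)
        if (strcmp(p->name, name) == 0)
            return p;
    return NULL;
}

/* Placeholder operations: a real protocol would do actual work here. */
static int toy_connect(int sock_id) { (void)sock_id; return 0; }
static void toy_close(int sock_id) { (void)sock_id; }

struct toy_proto toy_tcp = { "TCP", toy_connect, toy_close, NULL };
struct toy_proto toy_udp = { "UDP", toy_connect, toy_close, NULL };

/* Plays the role inet_init has in the text: register the built-ins. */
int toy_inet_init(void)
{
    if (toy_proto_register(&toy_tcp) != 0)
        return -1;
    return toy_proto_register(&toy_udp);
}
```

The point of the design is that the sockets layer only ever calls through the function pointers, so it needs no knowledge of which protocol sits behind them.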
You can see how individual protocols identify themselves through the proto structure in files tcp_ipv4.c, udp.c, and raw.c in linux/net/ipv4/. Each of these protocol structures is mapped by type and protocol into the inetsw_array, which maps the built-in protocols to their operations. The structure of inetsw_array and its relationships is shown in Figure 3. Each of the protocols in this array is initialized at start of day into inetsw through a call to inet_register_protosw from inet_init. The inet_init function also initializes the various inet modules, such as the ARP, ICMP, and IP modules, and the TCP and UDP modules.
Figure 3. Structure of the Internet protocol array
Socket protocol correlation
Recall that when a socket is created, one defines the type and protocol, such as my_sock = socket( AF_INET, SOCK_STREAM, 0 ). The AF_INET indicates an Internet address family with a stream socket defined as SOCK_STREAM (as shown in the inetsw_array in Figure 3).
Note from Figure 3 that the proto structure defines the transport-specific methods, while the proto_ops structure defines the general socket methods. Additional protocols can be added to the inetsw protocol switch through a call to inet_register_protosw. For example, SCTP adds itself through a call to sctp_init in linux/net/sctp/protocol.c.
Data movement for sockets takes place using a core structure called the socket buffer (sk_buff). An sk_buff contains packet data and also state data that cover multiple layers of the protocol stack. Each packet sent or received is represented with an sk_buff. The sk_buff structure is defined in linux/include/linux/skbuff.h and shown in Figure 4.
As shown, multiple sk_buff structures may be chained together for a given connection. Each sk_buff identifies the device structure (net_device) to which the packet is being sent or from which the packet was received. As each packet is represented with an sk_buff, the packet headers are conveniently located through a set of pointers (for example, mac for the Media Access Control, or MAC, header). Because sk_buff structures are central to socket data management, a number of support functions have been created to manage them. Functions exist for sk_buff creation and destruction, cloning, and queue management.
Socket buffers are designed to be linked together for a given socket and include a multitude of information, including the links to the protocol headers, a timestamp (when the packet was sent or received), and the device associated with the packet.
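To make the chaining concrete, here is a much-reduced user-space sketch of a socket buffer and a FIFO queue of buffers. The toy_ structures and functions are hypothetical; the kernel's struct sk_buff holds many more fields, and its queue helpers also handle locking, which this sketch ignores.

```c
#include <assert.h>
#include <stddef.h>

/* Toy device: only a name, standing in for net_device. */
struct toy_net_device {
    const char *name;
};

/* Toy socket buffer carrying the kinds of fields the text mentions. */
struct toy_skb {
    struct toy_skb *next;              /* link to the next buffer in a queue */
    struct toy_net_device *dev;        /* device the packet uses */
    long tstamp;                       /* when the packet was sent or received */
    const unsigned char *mac_header;   /* start of the link-layer header */
    const unsigned char *data;         /* start of the packet data */
    size_t len;                        /* number of data bytes */
};

/* Singly linked FIFO of toy buffers. */
struct toy_skb_queue {
    struct toy_skb *head;
    struct toy_skb *tail;
    size_t qlen;
};

void toy_skb_queue_init(struct toy_skb_queue *q)
{
    assert(q != NULL);
    q->head = NULL;
    q->tail = NULL;
    q->qlen = 0;
}

/* Appends skb to the end of the queue. */
void toy_skb_queue_tail(struct toy_skb_queue *q, struct toy_skb *skb)
{
    assert(q != NULL && skb != NULL);
    skb->next = NULL;
    if (q->tail != NULL)
        q->tail->next = skb;
    else
        q->head = skb;
    q->tail = skb;
    q->qlen++;
}

/* Removes and returns the oldest buffer, or NULL if the queue is empty. */
struct toy_skb *toy_skb_dequeue(struct toy_skb_queue *q)
{
    struct toy_skb *skb;

    assert(q != NULL);
    skb = q->head;
    if (skb == NULL)
        return NULL;
    q->head = skb->next;
    if (q->head == NULL)
        q->tail = NULL;
    skb->next = NULL;
    q->qlen--;
    return skb;
}
```

Because each buffer carries its own link, device reference, and header pointer, a buffer can move between queues at different layers without copying the packet data.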
Device agnostic interface
Below the protocols layer is another agnostic interface layer that connects protocols to a variety of hardware device drivers with varying capabilities. This layer provides a common set of functions to be used by lower-level network device drivers to allow them to operate with the higher-level protocol stack.
First, device drivers may register or unregister themselves with the kernel through a call to register_netdevice or unregister_netdevice. The caller first fills out the net_device structure and then passes it in for registration. The kernel calls its init function (if one is defined), performs a number of sanity checks, creates a sysfs entry, and then adds the new device to the device list (a linked list of devices active in the kernel). You can find the net_device structure in linux/include/linux/netdevice.h. The various functions are implemented in linux/net/core/dev.c.
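The registration sequence can be sketched in user-space C as follows. The toy_ names are hypothetical, the sanity checks shown (a non-empty, unique name) are assumptions chosen for illustration, and the sysfs step is omitted because it has no user-space equivalent here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy device structure filled out by the caller before registration. */
struct toy_net_device {
    char name[16];
    int (*init)(struct toy_net_device *dev);   /* optional init hook */
    int registered;
    struct toy_net_device *next;               /* link in the device list */
};

static struct toy_net_device *toy_dev_list;    /* list of active devices */

/* Registers dev. Returns 0 on success, -1 on any failure. */
int toy_register_netdevice(struct toy_net_device *dev)
{
    assert(dev != NULL);

    /* Call the driver's init function if one is defined. */
    if (dev->init != NULL && dev->init(dev) != 0)
        return -1;

    /* Sanity checks (assumed for this sketch): name set and unique. */
    if (dev->name[0] == '\0')
        return -1;
    for (struct toy_net_device *p = toy_dev_list; p != NULL; p = p->next)
        if (strcmp(p->name, dev->name) == 0)
            return -1;

    /* Add the new device to the list of active devices. */
    dev->next = toy_dev_list;
    toy_dev_list = dev;
    dev->registered = 1;
    return 0;
}

/* Removes dev from the list. Returns 0 on success, -1 if not found. */
int toy_unregister_netdevice(struct toy_net_device *dev)
{
    assert(dev != NULL);
    for (struct toy_net_device **pp = &toy_dev_list; *pp != NULL;
         pp = &(*pp)->next) {
        if (*pp == dev) {
            *pp = dev->next;
            dev->next = NULL;
            dev->registered = 0;
            return 0;
        }
    }
    return -1;
}
```

A driver in this model fills in the name and optional init hook, calls toy_register_netdevice once at start-up, and calls toy_unregister_netdevice when the device goes away.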
To send an sk_buff from the protocol layer to a device, the dev_queue_xmit function is used. This function enqueues an sk_buff for eventual transmission by the underlying device driver (with the network device being defined by the sk_buff->dev reference in the sk_buff). The dev structure contains a method, called hard_start_xmit, that holds the driver function for initiating transmission of an sk_buff.
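The hand-off from the protocol layer to a driver can be sketched as a call through a function pointer stored in the device structure. The toy_ names are hypothetical, and this sketch calls the driver routine directly, whereas the kernel's dev_queue_xmit queues the buffer for eventual transmission.

```c
#include <assert.h>
#include <stddef.h>

struct toy_skb;

/* Toy device with the transmit method the text describes. */
struct toy_net_device {
    const char *name;
    int (*hard_start_xmit)(struct toy_skb *skb, struct toy_net_device *dev);
    unsigned long tx_packets;          /* count of packets handed to the driver */
};

/* Toy buffer: the dev field selects the outgoing device. */
struct toy_skb {
    struct toy_net_device *dev;
    size_t len;
};

/* Hands skb to the driver named by skb->dev.
 * Returns the driver's result, or -1 if no transmit method is available. */
int toy_dev_queue_xmit(struct toy_skb *skb)
{
    struct toy_net_device *dev;

    assert(skb != NULL);
    dev = skb->dev;
    if (dev == NULL || dev->hard_start_xmit == NULL)
        return -1;
    return dev->hard_start_xmit(skb, dev);
}

/* Example driver routine. A real driver would typically move the packet
 * to a hardware ring or queue; this one only counts it. */
int toy_count_xmit(struct toy_skb *skb, struct toy_net_device *dev)
{
    assert(skb != NULL && dev != NULL);
    dev->tx_packets++;
    return 0;
}
```

The protocol layer never names a driver directly: it sets the device reference in the buffer, and the device's own method decides how the hardware is driven.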
Receiving a packet is performed conventionally with netif_rx. When a lower-level device driver receives a packet (contained within an allocated sk_buff), the sk_buff is passed up to the network layer through a call to netif_rx. This function then queues the sk_buff to an upper-layer protocol’s queue for further processing through netif_rx_schedule. You can find the dev_queue_xmit and netif_rx functions in linux/net/core/dev.c.
Recently, a new application program interface (NAPI) was introduced into the kernel to allow drivers to interface with the device agnostic layer (dev). Some drivers use NAPI, but the large majority still use the older frame reception interface (by a rough factor of six to one). NAPI can yield better performance under high loads by avoiding taking an interrupt for each incoming frame.
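The difference in interrupt load can be illustrated with a small simulation. This is a toy model, not driver code: it assumes a burst of frames is already waiting in a receive ring, and the toy_ names and the budget parameter are hypothetical simplifications of the polling idea.

```c
#include <assert.h>
#include <stddef.h>

/* Toy receive ring: only the number of waiting frames matters here. */
struct toy_rx_ring {
    unsigned pending;
};

/* Older model: every frame raises its own interrupt.
 * Returns the number of interrupts taken to drain the ring. */
unsigned toy_rx_per_frame(struct toy_rx_ring *ring)
{
    unsigned interrupts = 0;

    assert(ring != NULL);
    while (ring->pending > 0) {
        interrupts++;        /* one interrupt per incoming frame */
        ring->pending--;     /* the handler passes one frame up the stack */
    }
    return interrupts;
}

/* Polling model: the first frame raises one interrupt, after which the
 * ring is drained in polls of at most budget frames each.
 * Returns the interrupts taken and stores the poll count in *polls. */
unsigned toy_rx_polled(struct toy_rx_ring *ring, unsigned budget,
                       unsigned *polls)
{
    assert(ring != NULL && polls != NULL && budget > 0);
    *polls = 0;
    if (ring->pending == 0)
        return 0;

    while (ring->pending > 0) {
        unsigned done = ring->pending < budget ? ring->pending : budget;
        ring->pending -= done;   /* process up to budget frames per poll */
        (*polls)++;
    }
    return 1;                    /* a single interrupt started the polling */
}
```

With 100 frames waiting and a budget of 16, the per-frame model takes 100 interrupts, while the polled model takes one interrupt followed by seven polls.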
At the bottom of the network stack are the device drivers that manage the physical network devices. Examples of devices at this layer include the SLIP driver over a serial interface or an Ethernet driver over an Ethernet device.
At initialization time, a device driver allocates a net_device structure and then initializes it with its necessary routines. One of these routines, called dev->hard_start_xmit, defines how the upper layer should enqueue an sk_buff for transmission. This routine takes an sk_buff. The operation of this function is dependent upon the underlying hardware, but commonly the packet described by the sk_buff is moved to a hardware ring or queue. Frame receipt, as described in the device agnostic layer, uses the netif_rx interface or netif_receive_skb for a NAPI-compliant network driver. A NAPI driver puts constraints on the capabilities of the underlying hardware.