Anatomy of the Linux networking stack
Summary: One of the greatest features of the Linux® operating system is its networking stack. It was initially a derivative of the BSD stack and is well organized with a clean set of interfaces. Its interfaces range from the protocol-agnostic, such as the common sockets layer interface or the device layer, to the specific interfaces of the individual networking protocols. This article explores the structure of the Linux networking stack from the perspective of its layers and also examines some of its major structures.
Protocols introduction
While formal introductions to networking commonly refer to the Open Systems Interconnection (OSI) model, this introduction to the basic networking stack in Linux uses the four-layer model known as the Internet model (see Figure 1).
Figure 1. The Internet model of a network stack
At the bottom of the stack is the link layer. The link layer refers to the device drivers providing access to the physical layer, which could be any of numerous media, such as serial links or Ethernet devices. Above the link layer is the network layer, which is responsible for directing packets to their destinations. The next layer, called the transport layer, is responsible for peer-to-peer communication (for example, between endpoints within hosts). While the network layer manages communication between hosts, the transport layer manages communication between endpoints within those hosts. Finally, there’s the application layer, which is commonly the semantic layer that understands the data being moved. For example, the Hypertext Transfer Protocol (HTTP) moves requests and responses for Web content between a server and a client.
Practically speaking, the layers of the networking stack go by much more recognizable names. At the link layer, you find Ethernet, the most common high-speed medium. Older link-layer protocols include the serial protocols such as the Serial Line Internet Protocol (SLIP), Compressed SLIP (CSLIP), and the Point-to-Point Protocol (PPP). The most common network layer protocol is the Internet Protocol (IP), but other protocols exist at the network layer that satisfy other needs, such as the Internet Control Message Protocol (ICMP) and the Address Resolution Protocol (ARP). At the transport layer are the Transmission Control Protocol (TCP) and the User Datagram Protocol (UDP). Finally, the application layer includes many familiar protocols, including the standard Web protocol, HTTP, and the e-mail protocol, Simple Mail Transfer Protocol (SMTP).
Now on to the architecture of the Linux network stack and how it implements the Internet model. Figure 2 provides a high-level view of the Linux network stack. At the top is the user space layer, or application layer, which defines the users of the network stack. At the bottom are the physical devices that provide connectivity to the networks (serial or high-speed networks such as Ethernet). In the middle, or kernel space, is the networking subsystem that is the focus of this article. Through the interior of the networking stack flow socket buffers (sk_buffs) that move packet data between sources and sinks. You’ll see the sk_buff structure shortly.
Figure 2. Linux high-level network stack architecture
First, here’s a quick overview of the core elements of the Linux networking subsystem, followed by more detail in later sections. At the top (see Figure 2) is the system call interface. This simply provides a way for user-space applications to gain access to the kernel’s networking subsystem. Next is a protocol-agnostic layer that provides a common way to work with the underlying transport-level protocols. Next are the actual protocols, which in Linux include the built-in protocols of TCP, UDP, and, of course, IP. Next is another agnostic layer that permits a common interface to and from the individual device drivers that are available, followed at the end by the individual device drivers themselves.
System call interface
The system call interface can be described from two perspectives. When a networking call is made by the user, it is multiplexed through the system call interface into the kernel. This ends up as a call to sys_socketcall in ./net/socket.c, which then further demultiplexes the call to its intended target. The other perspective of the system call interface is the use of normal file operations for networking I/O. For example, typical read and write operations may be performed on a networking socket (which is represented by a file descriptor, just as a normal file). Therefore, while there exist a number of operations that are specific to networking (creating a socket with the socket call, connecting it to a destination with the connect call, and so on), there are also a number of standard file operations that apply to networking objects just as they do to regular files. In the end, the syscall interface provides the means to transfer control between the user-space application and the kernel.
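The point that ordinary file operations work on sockets can be tried from user space. The following is a minimal sketch in Python, not kernel code; it assumes a Unix-like system and uses a connected socketpair (a local socket pair rather than an AF_INET connection) purely so that no network setup is required.

```python
import os
import socket

# socketpair() returns two already-connected sockets, so the example
# needs no addresses, ports, or listening server.
left, right = socket.socketpair()

# Plain file-descriptor I/O: os.write() and os.read() issue the generic
# write and read system calls on the descriptor behind each socket.
bytes_written = os.write(left.fileno(), b"hello")
received = os.read(right.fileno(), 1024)

# The socket-specific calls operate on the very same descriptors.
right.send(b"world")
reply = left.recv(1024)

left.close()
right.close()
```

Both directions carry the data unchanged, which illustrates that a socket is reachable through the generic file interface as well as through the networking-specific calls.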
Protocol agnostic interface
The sockets layer is a protocol agnostic interface that provides a set of common functions to support a variety of different protocols. The sockets layer not only supports the typical TCP and UDP protocols, but also IP, raw Ethernet, and other transport protocols, such as Stream Control Transmission Protocol (SCTP).
Communication through the network stack takes place with a socket. The socket structure in Linux is struct sock, which is defined in linux/include/net/sock.h. This large structure contains all of the required state of a particular socket, including the particular protocol used by the socket and the operations that may be performed on it.
The networking subsystem knows about the available protocols through a special structure that defines its capabilities. Each protocol maintains a structure called proto (found in linux/include/net/sock.h). This structure defines the particular socket operations that can be performed from the sockets layer to the transport layer (for example, how to create a socket, how to establish a connection with a socket, how to close a socket, and so on).
Network protocols
The network protocols section defines the particular networking protocols that are available (such as TCP, UDP, and so on). These are initialized at start of day in a function called inet_init in linux/net/ipv4/af_inet.c (as TCP and UDP are part of the inet family of protocols). The inet_init function registers each of the built-in protocols using the proto_register function. This function is defined in linux/net/core/sock.c, and, in addition to adding the protocol to the active protocol list, it also optionally allocates one or more slab caches if required.
You can see how individual protocols identify themselves through the proto structure in files tcp_ipv4.c, udp.c, and raw.c in linux/net/ipv4/. Each of these protocol structures is mapped by type and protocol into the inetsw_array, which maps the built-in protocols to their operations. The structure of inetsw_array and its relationships is shown in Figure 3. Each of the protocols in this array is initialized at start of day into inetsw through a call to inet_register_protosw from inet_init. The inet_init function also initializes the various inet modules, such as the ARP, ICMP, and IP modules, and the TCP and UDP modules.
Figure 3. Structure of the Internet protocol array
Socket protocol correlation
Recall that when a socket is created one defines the type and protocol, such as my_sock = socket( AF_INET, SOCK_STREAM, 0 ). The AF_INET indicates an Internet address family with a stream socket defined as SOCK_STREAM (as shown here in inetsw_array).
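The socket call above can be reproduced from user space. This is a minimal sketch using Python's standard socket module, which passes the same three arguments through to the socket system call. The SO_PROTOCOL query is a Linux socket option and is an addition for illustration, not something the article describes; it is looked up defensively because the constant is not defined on every platform.

```python
import socket

# Same arguments as in the text: Internet address family, stream type,
# and protocol 0, meaning "use the default protocol for this type".
my_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM, 0)

family = my_sock.family
sock_type = my_sock.type

# Where SO_PROTOCOL is available, it reports the protocol that the
# kernel actually selected for the socket.
so_protocol = getattr(socket, "SO_PROTOCOL", None)
chosen = None
if so_protocol is not None:
    chosen = my_sock.getsockopt(socket.SOL_SOCKET, so_protocol)

my_sock.close()
```

On a system that supports the query, the selected protocol for a stream socket in the Internet family is expected to be TCP, which matches the type-to-protocol mapping described here.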
Note from Figure 3 that the proto structure defines the transport-specific methods, while the proto_ops structure defines the general socket methods. Additional protocols can be added to the inetsw protocol switch through a call to inet_register_protosw. For example, SCTP adds itself through a call to sctp_init in linux/net/sctp/protocol.c.
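The idea of a protocol switch keyed by socket type and protocol can be modeled in a few lines. The sketch below is a toy in Python, not the kernel's implementation: the names toy_inetsw, toy_register_protosw, and toy_lookup are hypothetical, the string labels merely stand in for the proto and proto_ops structures, and the wildcard rule (protocol 0 matches on either side) is a simplifying assumption.

```python
import socket
from dataclasses import dataclass


@dataclass(frozen=True)
class ProtoSwitchEntry:
    sock_type: int  # for example SOCK_STREAM
    protocol: int   # for example IPPROTO_TCP; 0 is treated as a wildcard
    prot: str       # label standing in for the transport-specific proto
    ops: str        # label standing in for the general proto_ops


# One list of entries per socket type (hypothetical stand-in for inetsw).
toy_inetsw = {}


def toy_register_protosw(entry):
    """Append an entry to the list kept for its socket type."""
    toy_inetsw.setdefault(entry.sock_type, []).append(entry)


def toy_lookup(sock_type, protocol):
    """Return the first entry matching the type and protocol, else None."""
    for entry in toy_inetsw.get(sock_type, []):
        if protocol == entry.protocol:
            return entry
        # Assumed wildcard rule: 0 on either side matches anything.
        if protocol == socket.IPPROTO_IP or entry.protocol == socket.IPPROTO_IP:
            return entry
    return None


# Built-in entries; the labels are illustrative placeholders.
for builtin in (
    ProtoSwitchEntry(socket.SOCK_STREAM, socket.IPPROTO_TCP, "tcp", "stream ops"),
    ProtoSwitchEntry(socket.SOCK_DGRAM, socket.IPPROTO_UDP, "udp", "dgram ops"),
    ProtoSwitchEntry(socket.SOCK_RAW, socket.IPPROTO_IP, "raw", "raw ops"),
):
    toy_register_protosw(builtin)

# A later registration, in the spirit of SCTP adding itself.
# 132 is the IANA-assigned protocol number for SCTP.
toy_register_protosw(ProtoSwitchEntry(socket.SOCK_STREAM, 132, "sctp", "sctp ops"))

stream_default = toy_lookup(socket.SOCK_STREAM, 0)
stream_sctp = toy_lookup(socket.SOCK_STREAM, 132)
mismatch = toy_lookup(socket.SOCK_DGRAM, socket.IPPROTO_TCP)
```

A request for a stream socket with protocol 0 resolves to the TCP entry because it was registered first, while an explicit request for protocol 132 skips TCP and finds the later SCTP entry.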
Data movement for sockets takes place using a core structure called the socket buffer (sk_buff). An sk_buff contains packet data and also state data that cover multiple layers of the protocol stack. Each packet sent or received is represented with an sk_buff. The sk_buff structure is defined in linux/include/linux/skbuff.h and shown in Figure 4.
Figure 4. Socket buffer and its relationship to other structures
As shown, multiple sk_buffs may be chained together for a given connection. Each sk_buff identifies the device structure (net_device) to which the packet is being sent or from which the packet was received. As each packet is represented with an sk_buff, the packet headers are conveniently located through a set of pointers (th, iph, and mac for the Media Access Control, or MAC, header). Because sk_buffs are central to socket data management, a number of support functions have been created to manage them. Functions exist for sk_buff creation and destruction, cloning, and queue management.
Socket buffers are designed to be linked together for a given socket and include a multitude of information, including the links to the protocol headers, a timestamp (when the packet was sent or received), and the device associated with the packet.
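To make the header pointers concrete, here is a toy model in Python. It is not the kernel structure: ToySkb, build_frame, and locate_headers are hypothetical names, the header pointers are modeled as byte offsets into one buffer, and the frame is a hand-built Ethernet, IPv4, and TCP packet with zeroed checksums, so it would not be valid on a real wire.

```python
import struct
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToySkb:
    data: bytes                      # the whole frame
    dev: str                         # name of the associated device
    mac: int = 0                     # offset of the link-layer (MAC) header
    iph: int = 0                     # offset of the IP header
    th: int = 0                      # offset of the transport header
    next: Optional["ToySkb"] = None  # link to the next buffer in a chain


def build_frame(payload):
    """Build an Ethernet + IPv4 + TCP frame with placeholder field values."""
    # Destination MAC, source MAC, EtherType 0x0800 (IPv4): 14 bytes.
    eth = bytes(6) + bytes(6) + struct.pack("!H", 0x0800)
    # Ports 1234 -> 80, data offset of 5 words (20 bytes), no flags.
    tcp = struct.pack("!HHIIBBHHH", 1234, 80, 0, 0, 5 << 4, 0, 0, 0, 0)
    total_len = 20 + len(tcp) + len(payload)
    # Version 4 with header length 5 words, TTL 64, protocol 6 (TCP).
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, total_len, 0, 0, 64, 6, 0,
                     bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]))
    return eth + ip + tcp + payload


def locate_headers(frame, dev):
    """Record where each header starts, as the header pointers would."""
    skb = ToySkb(data=frame, dev=dev)
    skb.mac = 0
    skb.iph = skb.mac + 14                  # untagged Ethernet header size
    ihl_words = frame[skb.iph] & 0x0F       # IPv4 header length in words
    skb.th = skb.iph + ihl_words * 4
    return skb


first = locate_headers(build_frame(b"GET /"), "eth0")
second = locate_headers(build_frame(b"more"), "eth0")
first.next = second  # chain two buffers, as for one connection

# Read fields through the recorded offsets.
dst_port = struct.unpack_from("!H", first.data, first.th + 2)[0]
payload_start = first.th + (first.data[first.th + 12] >> 4) * 4
payload = first.data[payload_start:]
```

Each layer can jump straight to its own header through the stored offset instead of re-parsing the frame from the start, which is the convenience the header pointers provide.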
Device agnostic interface
Below the protocols layer is another agnostic interface layer that connects protocols to a variety of hardware device drivers with varying capabilities. This layer provides a common set of functions to be used by lower-level network device drivers to allow them to operate with the higher-level protocol stack.
First, device drivers may register or unregister themselves with the kernel through a call to register_netdevice or unregister_netdevice. The caller first fills out the net_device structure and then passes it in for registration. The kernel calls its init function (if one is defined), performs a number of sanity checks, creates a sysfs entry, and then adds the new device to the device list (a linked list of devices active in the kernel). You can find the net_device structure in linux/include/linux/netdevice.h. The various functions are implemented in linux/net/core/dev.c.
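The registration sequence described above can be sketched as a toy model in Python. Every name here (ToyNetDevice, toy_register_netdevice, toy_unregister_netdevice, toy_dev_list) is hypothetical, the sanity checks are reduced to a name check, and the sysfs step is only marked by a comment.

```python
class ToyNetDevice:
    """Minimal stand-in for a device structure filled out by a driver."""

    def __init__(self, name, init=None):
        self.name = name
        self.init = init          # optional driver-supplied setup hook
        self.mtu = None
        self.registered = False


toy_dev_list = []  # stands in for the kernel's list of active devices


def toy_register_netdevice(dev):
    # 1. Call the device's init function, if one is defined.
    if dev.init is not None:
        dev.init(dev)
    # 2. Sanity checks (reduced here to a non-empty, unique name).
    if not dev.name or any(d.name == dev.name for d in toy_dev_list):
        raise ValueError("invalid or duplicate device name: %r" % dev.name)
    # 3. The real kernel creates a sysfs entry at this point.
    # 4. Add the device to the list of active devices.
    toy_dev_list.append(dev)
    dev.registered = True


def toy_unregister_netdevice(dev):
    toy_dev_list.remove(dev)
    dev.registered = False


def example_init(dev):
    dev.mtu = 1500  # illustrative value set by the hypothetical driver


eth0 = ToyNetDevice("eth0", init=example_init)
toy_register_netdevice(eth0)

# Registering a second device with the same name fails the sanity check.
try:
    toy_register_netdevice(ToyNetDevice("eth0"))
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

names_after_register = [d.name for d in toy_dev_list]
toy_unregister_netdevice(eth0)
```

The ordering mirrors the text: the init hook runs first, so the checks that follow see a fully initialized device.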
To send an sk_buff from the protocol layer to a device, the dev_queue_xmit function is used. This function enqueues an sk_buff for eventual transmission by the underlying device driver (with the network device being defined by the net_device or sk_buff->dev reference in the sk_buff). The dev structure contains a method, called hard_start_xmit, that holds the driver function for initiating transmission of an sk_buff.
Receiving a packet is performed conventionally with netif_rx. When a lower-level device driver receives a packet (contained within an allocated sk_buff), the sk_buff is passed up to the network layer through a call to netif_rx. This function then queues the sk_buff to an upper-layer protocol’s queue for further processing through netif_rx_schedule. You can find the dev_queue_xmit and netif_rx functions in linux/net/core/dev.c.
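The transmit and receive paths described above can be illustrated with a toy model in Python. All names prefixed with toy_ are hypothetical, buffers are plain dictionaries rather than sk_buffs, the transmit queue is drained immediately instead of being scheduled, and the example driver simply loops each buffer back into the receive path.

```python
from collections import deque


class ToyDevice:
    def __init__(self, name, hard_start_xmit):
        self.name = name
        self.hard_start_xmit = hard_start_xmit  # driver transmit routine
        self.tx_queue = deque()


def toy_dev_queue_xmit(skb):
    """Enqueue a buffer on its device, then hand queued buffers to the driver."""
    dev = skb["dev"]
    dev.tx_queue.append(skb)
    # Simplification: drain right away rather than deferring transmission.
    while dev.tx_queue:
        dev.hard_start_xmit(dev.tx_queue.popleft(), dev)


toy_backlog = deque()  # buffers waiting for upper-layer processing


def toy_netif_rx(skb):
    """Called by a driver for each received buffer: queue it and return."""
    toy_backlog.append(skb)


def toy_process_backlog(handlers):
    """Deliver queued buffers to the handler for their protocol."""
    while toy_backlog:
        skb = toy_backlog.popleft()
        handlers[skb["protocol"]](skb)


wire = []


def loopback_xmit(skb, dev):
    # A loopback-style driver: "transmit" by feeding the receive path.
    wire.append(skb["data"])
    toy_netif_rx(skb)


delivered = []
lo = ToyDevice("lo", loopback_xmit)
toy_dev_queue_xmit({"dev": lo, "protocol": 0x0800, "data": b"ping"})
toy_process_backlog({0x0800: lambda skb: delivered.append(skb["data"])})
```

The separation between queueing a received buffer and processing it later is the part worth noticing: the driver-facing call only enqueues, and protocol handling happens in a separate step.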
Recently, a new application program interface (NAPI) was introduced into the kernel to allow drivers to interface with the device agnostic layer (dev). Some drivers use NAPI, but the large majority still use the older frame reception interface (by a rough factor of six to one). NAPI can yield better performance under high loads by avoiding taking an interrupt for each incoming frame.
Device drivers
At the bottom of the network stack are the device drivers that manage the physical network devices. Examples of devices at this layer include the SLIP driver over a serial interface or an Ethernet driver over an Ethernet device.
At initialization time, a device driver allocates a net_device structure and then initializes it with its necessary routines. One of these routines, called dev->hard_start_xmit, defines how the upper layer should enqueue an sk_buff for transmission. This routine takes an sk_buff. The operation of this function is dependent upon the underlying hardware, but commonly the packet described by the sk_buff is moved to a hardware ring or queue. Frame receipt, as described in the device agnostic layer, uses the netif_rx interface or netif_receive_skb for a NAPI-compliant network driver. A NAPI driver puts constraints on the capabilities of the underlying hardware.