snowhp / 42_Webserv


Webserv Project

This project aims to develop a simple local HTTP server capable of listening on multiple ports, each with its own environment. We delved into HTTP and TCP/IP (and its differences from UDP), and implemented the project in C++98. My focus was on ensuring smooth request/response behavior similar to real-life scenarios. Additionally, we explored CGI handling and various transfer-encoding scenarios. Throughout the project, we followed a simple Agile approach to our workflow, where I took on the role of both team leader and developer.

Requirements and Configuration

Before running this project, ensure proper folder management, the presence of basic default HTML files, and that your machine has php-cgi and python3 installed. For detailed instructions, please refer to the ConfigReadme.md file.

Please note that this setup is necessary as the server is designed to function locally on any machine for any user. Therefore, it's not intended to be a "free-for-all" configuration in terms of default directories and pages. This approach ensures correct behavior, considering the vast array of machine configurations.

Configuration File

Webserv's configuration draws inspiration from NGINX. However, it features its own syntax and capabilities. Please refer to the ConfigReadme.md file for detailed information on how to configure Webserv. Additionally, we've included a couple of configuration files as reference.

Execution and Usage

After correctly setting up the configuration files, directories, and default pages, simply run make. In our Makefile, we use a Docker Hub repository to fetch our default settings. If you encounter errors, remove the rule responsible for this step from the Makefile.

We've included a run rule in the Makefile. When invoked (make run), it executes the program. We included this feature because, on most machines, binding ports below 1024 requires elevated privileges. However, we advise against using sudo for this purpose; if you do, a warning will be shown.

Once compiled successfully, execute the program with:

./webserv <configuration file>
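
For example, assuming a configuration file at configs/default.conf (the path here is illustrative):

./webserv configs/default.conf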

The program will attempt to parse the servers specified in the configuration file. If parsing is successful, it will try to initialize them. Please note that servers that pass the parsing phase may still fail during initialization if an invalid address or port is specified.

If a server is successfully started, it will be listed along with its settings. You can then access it via:

localhost:<port>/<optional_directory>/<optional_page>

If you wish to access it via the server_name, you must first add it to your /etc/hosts file.
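
For example, assuming a hypothetical server_name of webserv.local, the corresponding /etc/hosts entry would be:

127.0.0.1 webserv.local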

To terminate the program, use Ctrl + C or close the terminal.

CGI

If you're considering the use of CGI scripting within our project, it's essential to understand the standard procedures outlined in RFC 3875. CGI scripts are responsible for handling incoming requests and generating appropriate responses. In compliance with RFC guidelines, CGI scripts expect certain environment variables to be set, detailing crucial information about the request.

In our project, we facilitate CGI scripting by setting up the necessary environment variables based on the request type. For GET requests, the environment variables are built from the data passed through the URL. For POST requests, however, users need to modify the source code so that it parses each block of information generated by the HTML form they intend to execute, extracting and setting the required environment variables according to the form data. We cannot provide a standardized solution for parsing form data, given the variability in form designs and data structures. Users must therefore exercise caution and responsibility when modifying the source code for POST requests, ensuring that any changes align with the intended form's data structure and functionality. If they intend to modify the source code and use it, users must first ask permission from the project administrators.
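
For illustration, here is a minimal C++98-style sketch of how a server might build the RFC 3875 meta-variables for a GET request before launching the script (the helper name buildCgiEnv is hypothetical, not our actual code):

#include <string>
#include <vector>

// Hypothetical helper: builds RFC 3875 meta-variables for a GET request.
// The query string is everything after the '?' in the URL.
std::vector<std::string> buildCgiEnv(const std::string &scriptName,
                                     const std::string &queryString)
{
    std::vector<std::string> env;
    env.push_back("GATEWAY_INTERFACE=CGI/1.1");
    env.push_back("REQUEST_METHOD=GET");
    env.push_back("SCRIPT_NAME=" + scriptName);
    env.push_back("QUERY_STRING=" + queryString);
    env.push_back("SERVER_PROTOCOL=HTTP/1.1");
    return env;
}

These strings would then be converted into a NULL-terminated char* array and handed to the script via execve().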

If you wish to use CGI scripting, we've provided a few scripts. For file uploads, you can utilize our upload.py script in your HTML form. Similarly, for file deletion, you can use our delete.php. These are for POST requests.

If you wish to use CGI via GET requests, then use:

localhost:<port>/cgi-bin/<script_name>/<additional_path>?<query_string_name>=<data>
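
For example, with a server listening on port 8080 and a hypothetical script named hello.py:

localhost:8080/cgi-bin/hello.py/extra/path?name=value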

It's important to note that while we provide two hardcoded CGI scripts (upload.py and delete.php) for testing purposes, these scripts do not adhere to the RFC standards, as they serve specific use cases. Users are encouraged to develop their own CGI scripts in compliance with RFC guidelines, ensuring that they handle the required environment variables appropriately.

Furthermore, it's imperative for users to exercise caution and responsibility when creating CGI scripts. Misuse or mishandling of user input within CGI scripts can lead to security vulnerabilities or unexpected errors. Users should thoroughly understand and implement secure handling practices for user input to mitigate potential risks.

If you have questions about CGI, please refer to RFC 3875.

Chunked Transfer Encoding

Chunked Transfer Encoding is a mechanism used in HTTP to send data as a series of chunks, instead of in a single block. This allows the server to start sending the response before it knows the total size, which is particularly useful when the size of the response is not known in advance or when the response is being generated dynamically.

Syntax

In HTTP, chunked transfer encoding allows a request to be sent in chunks, primarily when the Content-Length header is not present or when the Transfer-Encoding header with the value chunked is included. The request body is divided into smaller "chunks," each prefixed by its size in hexadecimal format, followed by a carriage return and line feed (CRLF) pair, the chunk data, and another CRLF pair. The final chunk has a size of 0, indicating the end of the request body. This mechanism enables sending requests with dynamic or streaming content without requiring the total size of the data to be known in advance.

In a response, the use of chunked transfer encoding is indicated by the presence of the Transfer-Encoding: chunked header. Each chunk consists of the following components:

  1. Chunk Size: The size of the chunk in hexadecimal followed by a CRLF (Carriage Return + Line Feed).
  2. Chunk Data: The actual chunk data.
  3. Chunk Terminator: A CRLF that marks the end of the chunk.

The final chunk is a zero-length chunk, indicated by a chunk size of 0.

Chunked Response:

HTTP/1.1 200 OK
Content-Type: text/plain
Transfer-Encoding: chunked

4\r\n
Wiki\r\n
5\r\n
pedia\r\n
E\r\n
 in\r\n
\r\n
chunks.\r\n
0\r\n
\r\n

Chunked Request:

POST /api/data HTTP/1.1
Host: example.com
Content-Type: text/plain
Transfer-Encoding: chunked

4\r\n
data\r\n
5\r\n
chunk\r\n
0\r\n
\r\n

Real-life Scenarios

Chunked transfer encoding finds application in various scenarios:

  • Streaming media: Videos or audio files may be transferred in chunks to initiate playback before the entire file is received.
  • Large file downloads: Content delivery networks (CDNs) and download managers often utilize chunked encoding to enhance download performance, particularly for large files.
  • Dynamic content generation: Web servers generating responses dynamically may employ chunked encoding to send data to the client as it becomes available, without waiting for the complete response to be generated.

It's worth noting that while chunked responses are relatively common, chunked requests are much rarer in comparison. This is because most client-side software, such as web browsers, typically sends entire HTTP requests in one go rather than using chunked encoding. Chunked requests are generally reserved for specialized cases where the size of the request body is not known in advance or when the client needs to start sending data before it has received the entire request.

Project Application

For this project, we extensively discussed Chunked Transfer Encoding (CTE). There was debate about its availability in base NGINX, which HTTP version supports it, and in what form. Multiple sources claim that CTE is available in HTTP/1.1, but only for chunked responses (which we don't generate in our project), while others imply it's available for both responses and requests. Some references suggest that although HTTP/1.1 allows it, base NGINX doesn't and requires additional modules to be installed; alternatively, CTE handling can be disabled directly in an NGINX configuration file. Additionally, some results indicate that when an HTTP/1.1 server receives a chunked request, it may fall back to HTTP/1.0 behavior to handle the unexpected encoding.

To meet a requirement of the project, which mandates that chunked requests be unchunked and decoded, we've provided a Python script (chunker.py) to generate chunked requests. We couldn't find a way to generate them locally using a browser, hence the need for the script. We've implemented a simple handler that, upon receiving a chunked request, waits until the full request is received, decodes the hexadecimal values, and sends a simple 200 OK response to the script. This is a hardcoded example serving as a proof of concept since we rarely encounter chunked requests in normal scenarios.
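
To illustrate the decoding step, here is a simplified C++98 sketch (not our actual handler; it assumes the full, well-formed body is already buffered in memory):

#include <cstdlib>
#include <string>

// Simplified sketch: decodes an already-buffered chunked body.
// Real code must validate sizes and handle truncated input.
std::string decodeChunkedBody(const std::string &body)
{
    std::string decoded;
    std::string::size_type pos = 0;
    while (pos < body.size())
    {
        // Each chunk starts with its size in hexadecimal, ended by CRLF.
        std::string::size_type lineEnd = body.find("\r\n", pos);
        if (lineEnd == std::string::npos)
            break;
        long size = std::strtol(body.substr(pos, lineEnd - pos).c_str(), NULL, 16);
        if (size <= 0)                    // a size of 0 marks the final chunk
            break;
        pos = lineEnd + 2;                // skip the size line's CRLF
        decoded.append(body, pos, size);  // copy the chunk data
        pos += size + 2;                  // skip the data and its trailing CRLF
    }
    return decoded;
}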

To test this functionality, open a second terminal and run the Python script from the command line.

Logs

To help you track server block configurations, incoming requests, and Valgrind output, we’ve implemented a logging system. The first two types of logs are generated by the program. For Valgrind logs, we’ve added a dedicated rule in the Makefile. Once the program is compiled, you can run Valgrind with the following command:

make valgrind CONFIG_FILE=<configuration file>



Basics of an HTTP Server

HTTP (Hypertext Transfer Protocol) is the cornerstone protocol used for exchanging information over the internet, serving as the foundation of the World Wide Web. It enables communication between web browsers and servers, facilitating both server-side and client-side programming.

TCP and UDP

Before delving into HTTP, it's essential to understand the underlying transport protocols: TCP/IP and UDP.

  • TCP/IP (Transmission Control Protocol/Internet Protocol): TCP/IP is a reliable, connection-oriented protocol suite that provides error checking and guarantees the delivery of data packets in the correct order. It establishes a connection between the client and server before transferring data and ensures that packets arrive intact and in sequence. TCP/IP is commonly used for applications that require high reliability, such as web browsing, email, and file transfer.

  • UDP (User Datagram Protocol): UDP is a lightweight, connectionless protocol that operates without establishing a connection between the client and server. Unlike TCP/IP, UDP does not provide error checking, packet sequencing, or guaranteed delivery of data packets. Instead, it offers low-latency communication, making it suitable for real-time applications like online gaming, video streaming, and VoIP (Voice over Internet Protocol).

Understanding the differences between TCP/IP and UDP is crucial for designing and implementing efficient network applications, including HTTP servers.
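
To make the contrast concrete, here is a minimal sketch showing how each kind of socket is created with the BSD sockets API (illustrative only, not code from this project):

#include <sys/socket.h>
#include <unistd.h>

int main()
{
    // TCP: connection-oriented stream socket (what an HTTP server uses).
    int tcp_fd = socket(AF_INET, SOCK_STREAM, 0);

    // UDP: connectionless datagram socket; no handshake, no delivery guarantees.
    int udp_fd = socket(AF_INET, SOCK_DGRAM, 0);

    close(tcp_fd);
    close(udp_fd);
    return 0;
}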

Request-Response Cycle

The request-response cycle is a critical process in web communication, involving:

  1. Client Request: A client (e.g., web browser or mobile app) sends a request to the server, which includes the requested resource and any additional parameters.
  2. Server Processing: Upon receiving the request, the server processes it and generates a response message.
  3. Server Response: The server sends a response to the client, typically including the requested resource along with any additional information or metadata.
  4. Client Processing: The client receives and processes the server's response, often rendering the content in a web browser or displaying it in an app.

Clients may initiate additional requests, repeating this cycle as necessary.

Creating a Valid HTTP Request

To create a valid HTTP request, the following elements are essential:

  • URL: Represents the unique name pointing to a specific resource on the server.
  • HTTP Method: Indicates the action the client desires the server to take, such as GET, POST, or DELETE.
  • Headers: Provide context and additional instructions to the server from the client.
  • Body (optional): Contains data sent from the client to the server as part of the request.

Example of a simple GET request:

GET /watch?v=dQw4w9WgXcQ HTTP/1.1

Example of a more detailed GET request:

GET /api/data HTTP/1.1
Host: example.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36
Accept: application/json
Accept-Language: en-US,en;q=0.5
Authorization: Token abc123
Cache-Control: no-cache
Connection: keep-alive
Referer: https://www.google.com/
Pragma: no-cache

Response

The server's response to an HTTP request includes:

  • Status Line: Contains the HTTP version, status code, and a message indicating the outcome of the request.
  • Headers: Provide additional information about the response.
  • Message Body: Contains the actual response data (e.g., HTML, JSON, XML).

Example response:

HTTP/1.1 200 OK
Date: Tue, 28 Mar 2023 10:15:00 GMT
Content-Type: application/json
Server: Apache/2.4.39 (Unix) OpenSSL/1.1.1c PHP/7.3.6
Content-Length: 1024

{
	"name": "John Wick",
	"email": "johnwick@example.com",
	"age": 35,
	"address": {
		"street": "123 Main St",
		"city": "Anytown",
		"state": "CA",
		"zip": "12345"
	}
}

Common HTTP Status Codes

  • 100 Continue - The client may continue sending the request body.
  • 101 Switching Protocols - The server is switching protocols, as requested by the client.
  • 200 OK - The request succeeded.
  • 201 Created - The request succeeded and a new resource was created.
  • 202 Accepted - The request was accepted for processing, but processing is not yet complete.
  • 203 Non-Authoritative Information - The returned metadata comes from a cached or third-party copy rather than the origin server.
  • 301 Moved Permanently - The requested resource has been permanently moved to a new URL.
  • 404 Not Found - The requested resource was not found on the server.
  • 500 Internal Server Error - The server encountered an error while processing the request.

IO Multiplexing

I/O Multiplexing is a technique used for managing multiple input/output operations over a single blocking system call. It's crucial for applications that need to handle multiple data streams simultaneously without dedicating a separate thread or process to each one, thus significantly improving efficiency and performance in networked applications.

We chose to use epoll for our project, but first, we need to understand each type of I/O multiplexing: select, poll, and epoll.

select()

The select() system call allows a program to monitor multiple file descriptors to see if one or more of them are ready for an I/O operation (e.g., reading or writing).

How select() Works:

  1. Prepare three sets of file descriptors: read, write, and exceptions.
  2. Define a timeout duration.
  3. Invoke select(), which blocks until at least one descriptor becomes ready or a timeout occurs.
  4. After select() returns, check which descriptors are ready and proceed with the necessary operations.
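
A minimal sketch of this flow, watching a single listening socket for readability (illustrative only):

#include <sys/select.h>

// Wait up to 5 seconds for listen_fd to become readable.
bool isReadable(int listen_fd)
{
    fd_set read_fds;
    FD_ZERO(&read_fds);
    FD_SET(listen_fd, &read_fds);

    struct timeval timeout;
    timeout.tv_sec = 5;
    timeout.tv_usec = 0;

    // Blocks until a descriptor is ready or the timeout expires.
    int ready = select(listen_fd + 1, &read_fds, NULL, NULL, &timeout);
    return ready > 0 && FD_ISSET(listen_fd, &read_fds);
}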

Limitations:

  • Has a fixed limit on the number of descriptors (FD_SETSIZE).
  • Inefficient for large sets of descriptors due to its linear scanning mechanism.

poll()

poll() serves a similar purpose to select(), monitoring multiple file descriptors, but it does so using an array of struct pollfd structures to overcome some of select()'s limitations.

How poll() Works:

  1. Initialize an array of pollfd structures with descriptors and the events of interest.
  2. Call poll(), specifying the array, its size, and a timeout.
  3. After returning, iterate through the array to identify which descriptors had their events occur and handle them accordingly.
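
The equivalent sketch using poll() (illustrative only):

#include <poll.h>

// Wait up to 5 seconds for listen_fd to become readable.
bool isReadable(int listen_fd)
{
    struct pollfd pfd;
    pfd.fd = listen_fd;
    pfd.events = POLLIN;   // we're interested in readability
    pfd.revents = 0;

    // poll() takes the array, its size, and a timeout in milliseconds.
    int ready = poll(&pfd, 1, 5000);
    return ready > 0 && (pfd.revents & POLLIN);
}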

Unlike select(), poll() is more scalable and does not have a preset limit on the number of descriptors it can handle.

epoll()

Exclusive to Linux, epoll() is a modern alternative to select() and poll(), designed to efficiently handle a large number of file descriptors.

How epoll() Works:

  1. Create an epoll instance with epoll_create().
  2. Use epoll_ctl() to add file descriptors to the epoll instance, specifying the events to monitor.
  3. Call epoll_wait(), which blocks until events occur, returning only those descriptors with active events.

Advantages:

  • Reduces CPU usage by eliminating the need to check all file descriptors.
  • Scales efficiently, capable of managing thousands of concurrent connections.

In conclusion, while select() and poll() offer portability across Unix-like systems, epoll() provides superior performance and scalability on Linux systems, making it especially suitable for server applications that handle many simultaneous connections.

In our StartServers() function, we utilize our epoll group functions. We initialize an epoll_fd variable, the file descriptor referring to the epoll instance that manages all I/O operations. The _nServ parameter denotes the number of servers in our cluster, determining the quantity of file descriptors to handle.

Next, we define a buffer of 10 epoll_event structures, initialized to monitor input (EPOLLIN) and output (EPOLLOUT) events. Additionally, we include hang-up (EPOLLHUP) and error (EPOLLERR) events. It's important to handle these events, as epoll reports them even when they're not explicitly requested.

epoll_ctl(epoll_fd, EPOLL_CTL_ADD, events.data.fd, &events) registers the server's sockets, enabling the server cluster to monitor events on these sockets.

epoll_wait(epoll_fd, event_buffer, 10, 5000) is called to wait for events on the monitored sockets. The function blocks until events are available or the timeout (5000 milliseconds) occurs. If epoll_wait returns a negative value, we check whether a signal (gSignalStatus) interrupted the wait, typically SIGINT (Ctrl + C), which initiates the shutdown procedure for the program.

For each event detected, we check if the file descriptor from the event buffer belongs to a known server socket, indicating an incoming connection attempt. If it's a server socket, we accept the new incoming connection using accept(event_buffer[i].data.fd, (sockaddr*)&client_address, (socklen_t*)&addrlen), assigning it to a new client socket. The new client socket is then set to non-blocking mode and added to the epoll monitoring setup using epoll_ctl(epoll_fd, EPOLL_CTL_ADD, client_socket, &event_buffer[i]).

If an existing connection has new data to read (EPOLLIN event), the connection is handled by a method called connectionHandler(), which processes the request and sends the response. EPOLLIN typically indicates incoming data (the request), while EPOLLOUT signifies that the socket is ready to send data (the response). If checkSocketActivity() closes a file descriptor, we break the cycle and return to the beginning to avoid iterating over file descriptors that may have been removed from the buffer. This method may also perform periodic maintenance, such as removing inactive sockets from epoll.

We only add a new connection file descriptor if it does not exist previously. If it already exists, we use the existing one. This ensures that, for example, a request needing to fetch an HTML page and its favicon or an image, won't create new connections for both but will use the same one. This approach also works for chunked requests.
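
Putting the pieces described above together, here is a condensed sketch of such an event loop (a simplification for illustration, not our actual StartServers() code):

#include <csignal>
#include <fcntl.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

volatile std::sig_atomic_t gSignalStatus = 0;  // set by the SIGINT handler

void eventLoop(int server_fd)
{
    int epoll_fd = epoll_create(1);

    // Register the server socket for the events described above.
    struct epoll_event ev;
    ev.events = EPOLLIN | EPOLLOUT | EPOLLHUP | EPOLLERR;
    ev.data.fd = server_fd;
    epoll_ctl(epoll_fd, EPOLL_CTL_ADD, server_fd, &ev);

    struct epoll_event event_buffer[10];
    while (gSignalStatus == 0)
    {
        // Block until events arrive or the 5-second timeout elapses.
        int n = epoll_wait(epoll_fd, event_buffer, 10, 5000);
        for (int i = 0; i < n; ++i)
        {
            if (event_buffer[i].data.fd == server_fd)
            {
                // New connection: accept, set non-blocking, start monitoring.
                int client_fd = accept(server_fd, NULL, NULL);
                fcntl(client_fd, F_SETFL, O_NONBLOCK);
                ev.data.fd = client_fd;
                epoll_ctl(epoll_fd, EPOLL_CTL_ADD, client_fd, &ev);
            }
            else if (event_buffer[i].events & EPOLLIN)
            {
                // Existing connection has data: read the request,
                // then send the response (handling omitted here).
            }
        }
    }
    close(epoll_fd);  // shutdown after SIGINT
}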

Useful links:

  • Introductory Videos
  • Advanced Videos
  • Blog Posts
  • RFCs
  • CGI
  • HTTP
  • Response Content Type
  • Select and Poll Functions
  • Other
