
Stanford CS348K Assignment: A Burst-Mode Camera RAW Processing Pipeline

In this assignment you will implement a simple RAW image processing pipeline for the camera of the world's hottest smartphone, the kPhone 348. Your job is to process the data coming off the device's sensor to produce the highest quality image you can. The assignment is split into two parts, with two separate handins.

  • In part 1, you'll process a single RAW image to demosaic the image and correct for sensor defects.
  • In part 2, you'll extend your implementation to process a burst of images from a scene featuring a wide range of intensities. You'll align and merge the images of the burst to reduce noise, and then perform local tone mapping to produce a compelling HDR result. Part 2 involves more implementation effort than part 1.

Getting Started

Grab the assignment starter code.

git clone git@github.com:stanford-cs348k/camera_asst.git

To run the assignment, you will also need to download the scene datasets, located at http://cs348k.stanford.edu/fall18content/asst/scenes.tgz.

For example, on the Linux SU myth machines:

wget http://cs348k.stanford.edu/fall18content/asst/scenes.tgz
tar -xvf scenes.tgz

Build Instructions

The codebase uses a simple Makefile as the build system. To build the starter code, run make from the top level directory. The assignment source code is in src/, and object files and binaries will be generated into build/ and bin/ respectively.

Running the starter code:

Now you can run the camera. Just run:

./bin/kcamera MY_SCENES_DIR/taxi.bin output.bmp

(where MY_SCENES_DIR is the directory into which you extracted the scene datasets). The camera will "take a picture" and write the result of processing the RAW sensor data to output.bmp. The starter code simply copies the sensor data verbatim into the red, green, and blue channels of the output image (so the output is just a visualization of the RAW data from the sensor). So for a scene that looks like the image at left, you should see output that looks a bit like this.

RAW Example

Part 1 (35 points)

Due Monday October 8th, 11:59pm

In the first part of the assignment you must process the raw image data to produce an RGB image that, simply put, looks as good as you can make it. The entry point to your code should be CameraPipeline::ProcessShot() in camera_pipeline.cpp. This method reads RAW data from the sensor, and outputs an RGB image.

You will need to do the following in this function:

  • Demosaic the interleaved RGB channels in the raw sensor data. You may use any technique you wish.
  • Denoise the image and correct for any sensor defects (such as dead pixels that always read out as white). You may use any techniques you wish.
  • Rescale the floating point pixel values to 8-bit integer values in the 0-255 range. We recommend that you first do this by remapping the floating point range [0, 1] to the integer range [0, 255] (a sketch of this conversion follows the list). We'll address more intelligent forms of "tone mapping" in Part 2 of the assignment.
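
For the rescaling step, here is a minimal sketch of the float-to-8-bit conversion. It operates on a plain float buffer rather than the starter classes so that it stands alone; in ProcessShot() it would run after demosaicing and denoising, once per channel of every output pixel.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Convert linear values in [0, 1] to 8-bit codes in [0, 255] with clamping
// and rounding. Values outside [0, 1] are clamped before scaling.
std::vector<std::uint8_t> QuantizeTo8Bit(const std::vector<float>& linear) {
  std::vector<std::uint8_t> out(linear.size());
  for (std::size_t i = 0; i < linear.size(); ++i) {
    float clamped = std::min(std::max(linear[i], 0.f), 1.f);
    out[i] = static_cast<std::uint8_t>(clamped * 255.f + 0.5f);
  }
  return out;
}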

Test scenes:

  • Your primary test scenes are taken directly from Google's HDR+ dataset. These scenes have been chosen to stress different aspects of real-world images. They are: taxi.bin (busy street scene with low-light), hand.bin (close-up shot with lots of texture), church.bin (high dynamic range), and path.bin (complex, high-resolution scene). For each of these scenes, we've provided a SCENE_solution_part1.bmp which is the output of a reference pipeline which would achieve full points for this part of the assignment, and SCENE_google.bmp which is the output of the full Google HDR+ pipeline (as you can see, it's just a bit better than our reference implementation!).
  • In addition, we've provided some helpful debugging scenes, such as: black.bin (an all-black image), gray.bin (a 50% gray image, for which pixels without defects should be [128,128,128]), stripe.bin (a tough case for demosaicing), color.bin, and stanford.bin.

Tips:

  • You may implement this assignment in any way you wish. Certainly, you will have to demosaic the image to recover RGB values. The techniques you employ for handling noise and pixel defects are up to you.
  • We guarantee that pixel defects (stuck pixels, pixels with extra sensitivity) are static defects that are the same for every photograph taken for a given scene. Note that while the overall noise statistics of the sensor are the same per photograph, the perturbation of individual pixel values due to noise varies per photograph (that's the nature of noise!).
  • You may assume that RAW data read from the sensor is linear in incident light, and in the range [0, 1].
  • The following is the Bayer filter pattern used on the kPhone's sensor. Pixel (0,0) is the top-left of the image. Bayer Array
  • You should start with basic linear interpolation demosaicing as discussed in class (a bilinear sketch follows this list). However, we encourage you to attempt more advanced demosaicing solutions as discussed in lecture, or in this paper that's listed under the course recommended readings page.
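
Here is one possible bilinear demosaic, written over a plain float buffer rather than the starter classes. The BayerColor() helper assumes an RGGB layout with red at pixel (0, 0); that layout is an assumption made for illustration only, so adjust it to match the kPhone pattern shown in the figure above.

#include <vector>

// Color of a Bayer site, assuming an RGGB layout with red at (0, 0).
// Swap the tests below to match the kPhone's actual pattern.
inline int BayerColor(int row, int col) {          // 0 = R, 1 = G, 2 = B
  bool even_row = (row % 2 == 0), even_col = (col % 2 == 0);
  if (even_row && even_col) return 0;
  if (!even_row && !even_col) return 2;
  return 1;
}

// Basic bilinear demosaic over a plain float buffer: each output channel at a
// pixel is the average of the 3x3 neighbors that actually sampled that color.
void DemosaicBilinear(const std::vector<float>& raw, int width, int height,
                      std::vector<float>& r, std::vector<float>& g,
                      std::vector<float>& b) {
  r.assign(raw.size(), 0.f); g.assign(raw.size(), 0.f); b.assign(raw.size(), 0.f);
  for (int row = 0; row < height; ++row) {
    for (int col = 0; col < width; ++col) {
      float sum[3] = {0.f, 0.f, 0.f};
      int count[3] = {0, 0, 0};
      for (int dr = -1; dr <= 1; ++dr) {
        for (int dc = -1; dc <= 1; ++dc) {
          int rr = row + dr, cc = col + dc;
          if (rr < 0 || rr >= height || cc < 0 || cc >= width) continue;
          int c = BayerColor(rr, cc);
          sum[c] += raw[rr * width + cc];
          ++count[c];
        }
      }
      int i = row * width + col;
      r[i] = count[0] ? sum[0] / count[0] : 0.f;
      g[i] = count[1] ? sum[1] / count[1] : 0.f;
      b[i] = count[2] ? sum[2] / count[2] : 0.f;
    }
  }
}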

Description of the Starter Code

Much of the scaffolding code (reading and writing data, storing images, etc.) is provided for you. Your changes should only go in regions marked like:

// BEGIN: CS348K STUDENTS MODIFY THIS CODE 
...
// END: CS348K STUDENTS MODIFY THIS CODE 

Only two files contain such regions:

  • camera_pipeline.hpp and camera_pipeline.cpp where you can customize the CameraPipeline class (while maintaining the same API), and implement the camera pipeline itself.

The driver code for this assignment (containing main()) is located in camera_main.cpp.

CameraSensor is a class which presents the same interface as a real camera sensor. It has methods like SetLensCap() and GetSensorData(). From inside CameraPipeline class you can access the sensor via the local member variable sensor_.

CameraSensorData is a wrapper around the raw camera sensor data. It has a method data(row, col) which returns the floating point intensity at index (row, col) in the sensor array. This value will always be between 0 and 1. This is the object returned by CameraSensor::GetSensorData().

CameraPipeline holds your implementation of the CameraPipelineInterface interface, whose job it is to take in a CameraSensor object and output a processed RGB image.

Image is an image container for both RGB- and YUV-space images. It is nothing more than a wrapper on top of a buffer of pixels. It has one important function, which is operator ()(row, col), which returns the pixel at row row and column col. The pixel is returned by reference, so it can be modified directly.

Pixel (in Pixel.hpp) is struct of convenience routines for 3-channel pixels. You may find the routines RgbToYuv and YuvToRgb helpful.
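
To illustrate what such a conversion does, here is a standard BT.601-style RGB/YUV pair over plain floats. The struct and constants here are illustrative and may differ from the conventions used in Pixel.hpp, so prefer the provided RgbToYuv and YuvToRgb routines in your actual code.

// A standard BT.601-style RGB <-> YUV conversion, shown for reference only.
struct Yuv { float y, u, v; };

inline Yuv RgbToYuvExample(float r, float g, float b) {
  Yuv out;
  out.y =  0.299f   * r + 0.587f   * g + 0.114f   * b;   // luma / grayscale channel
  out.u = -0.14713f * r - 0.28886f * g + 0.436f   * b;   // blue-difference chroma
  out.v =  0.615f   * r - 0.51499f * g - 0.10001f * b;   // red-difference chroma
  return out;
}

inline void YuvToRgbExample(const Yuv& p, float& r, float& g, float& b) {
  r = p.y + 1.13983f * p.v;
  g = p.y - 0.39465f * p.u - 0.58060f * p.v;
  b = p.y + 2.03211f * p.u;
}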

Tips:

  • Most of the detail of the starter code has been abstracted away for you; effectively, your entire implementation of RAW processing can go into the CameraPipeline::ProcessShot() function.
  • For the relevant classes, look at the .hpp files to understand the public API.
  • If you are having trouble using some of the provided functionality, feel free to implement your own versions. We just ask that you do not modify existing classes, but add new ones if necessary.
  • The starter code makes heavy use of std::unique_ptr. If you are not familiar with std::unique_ptr, see https://shaharmike.com/cpp/unique-ptr/ (a minimal example follows this list).
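
In case the link is not enough, here is a minimal sketch of the std::unique_ptr idioms you will encounter; the buffer type and names are placeholders, not starter-code types.

#include <cstddef>
#include <memory>
#include <vector>

// std::unique_ptr in a nutshell: sole ownership, freed automatically, moved
// rather than copied.
std::unique_ptr<std::vector<float>> MakeBuffer(std::size_t n) {
  return std::make_unique<std::vector<float>>(n, 0.f);
}

void Demo() {
  auto buf = MakeBuffer(16);        // buf owns the vector
  (*buf)[0] = 1.f;                  // dereference like a raw pointer
  auto other = std::move(buf);      // ownership transfers; buf is now null
}                                   // the vector is destroyed when other goes out of scope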

Grading

Part 1 of the assignment will be graded on image quality. A reasonable implementation will address the challenges of demosaicing sensor output, correcting pixel defects, and removing noise. We don't have a numeric definition of "good" since there is no precise right answer here; it's a photograph, and you'll know a reasonable-looking one when you see it! We encourage you to start with simple algorithms, get them working, and then, if there is time, improve image quality by moving to more advanced techniques.

Handin

This assignment will be handed in using Canvas: http://canvas.stanford.edu

  • Please hand in camera_pipeline.cpp and camera_pipeline.hpp. We should be able to build and run the code on the myth machines by dropping these files into a freshly checked out starter code tree.
  • Please also include a short writeup describing the techniques you employed to implement demosaicing and to address image-quality problems caused by noise and sensor defects.

Part 2 (65 pts): Burst Mode Alignment for Denoising + Local Tone Mapping

Due Monday October 22nd, 11:59pm

Note: We've updated the scene assets from Part 1 to include reference images produced using a simple reference implementation of the alignment and tone mapping algorithms you will implement in this part of the assignment. You should redownload the scenes.tgz file if you'd like to compare against these references. (Note: the reference solutions involve a basic implementation of the required techniques. Motivated students will certainly be able to do better.)

When implementing your solution to the first part of this assignment, you might have noticed visual artifacts in your output. Consider the taxi.bin image:

Noise visualization

There are at least two issues you might notice here:

  1. Under-exposure: The scene exhibits high dynamic range, so to avoid over-exposing the bright sunset in the background, the image has been deliberately under-exposed. As a result, regions of the image which don't receive as much illumination (e.g. the front parts of the taxis) are very dark.
  2. Noise: Because the image was under-exposed, the effects of sensor noise are much more noticeable. In particular, the zoomed in region from the figure above shows how dominant the noise can become in dark regions.

In this part of the assignment, you'll address high dynamic range using a local tone mapping algorithm called exposure fusion. Then you will reduce noise in the tone mapped output by aligning and merging a sequence of underexposed shots as discussed in Burst Photography for High Dynamic Range and Low-light Imaging on Mobile Cameras.

Local Tone Mapping via Exposure Fusion

Tone mapping converts a high dynamic range image (with greater than 8 bits of information per channel) to a low dynamic range image (e.g., 8 bits per channel) that can be viewed on a low-dynamic range display. In a local tone mapping algorithm, different parts of the image are exposed differently so that detail is retained in both very bright and dark regions.

In this assignment, we'd like you to implement a modified version of Exposure Fusion as described by Mertens et al. The key idea of exposure fusion is that, while it is difficult to capture a single image where all parts of the image are well-exposed, it's possible to capture multiple exposures of the same scene and then combine the well-exposed parts of each of these images to create a satisfying high dynamic range photo.

Recall that the pixel data you receive from the sensor via GetSensorData() is represented as a 32-bit floating point value between 0 and 1. (Even though the mantissa of a single-precision float number is 23 bits, the data is from Google HDR+'s dataset, acquired via a Pixel phone, so the actual precision of these values is about 10 bits.) Rather than take multiple exposures with the camera as described in the paper, you'll first virtually create two 8-bit exposures from the high-precision input.

Your specific solution is allowed to differ (see further detail in the "Dynamic Range Compression" part of Section 6 of the HDR+ paper for heuristics), but one basic approach would create the following two virtual exposures after processing the data with your pipeline from part 1 of the assignment (but before conversion to 8-bit values):

  • dark, which is a grayscale version of the RGB image after basic RAW processing (convert RGB to YUV and take the Y channel to get grayscale).
  • bright, which is formed by multiplying dark by a scale factor.

The dark image will retain detail in the areas of the image with a lot of light (since it is largely under-exposed), while the bright image will retain detail in the areas with little light (since the digital gain applied by the scale factor brightens dark parts of the image).
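
Here is a minimal sketch of creating these two virtual exposures from the linear grayscale image, over plain float buffers. The gain of 4.0 is an arbitrary illustrative value; the "Dynamic Range Compression" discussion in Section 6 of the HDR+ paper suggests ways to choose it.

#include <algorithm>
#include <cstddef>
#include <vector>

// Build the two virtual exposures from the linear grayscale (Y) image produced
// by the Part 1 pipeline. gain is an illustrative default, not a prescribed value.
void MakeVirtualExposures(const std::vector<float>& gray_linear,
                          std::vector<float>& dark, std::vector<float>& bright,
                          float gain = 4.f) {
  dark = gray_linear;                       // the original under-exposed image
  bright.resize(gray_linear.size());
  for (std::size_t i = 0; i < gray_linear.size(); ++i) {
    bright[i] = std::min(gray_linear[i] * gain, 1.f);   // digital gain, clipped at 1
  }
}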

Exposure fusion then computes a per-pixel weight that selects between the bright and dark images. You can either see Section 3.1 of Mertens et al. for example heuristics, or the much simpler version described in the "Dynamic Range Compression" part of Section 6 of the HDR+ paper. The Laplacian pyramids of the dark and bright images (not the initial image pixels) are blended together according to this weight. Finally, the resulting Laplacian pyramid is flattened to get a merged grayscale image.

This modified grayscale image is then combined with the UV channels of the original pre-tone mapped RGB image to get a modified result.

In summary, here's a sketch of the modified exposure fusion algorithm (you should read over the original paper to understand each step in more detail):

  1. Convert output image from Part 1 of the assignment into grayscale
  2. Create the two artificial exposure brackets from the grayscale image: dark and bright
  3. Using a weighting function of your choice (section 3.1 in the paper), compute weights for both images
  4. Compute a Laplacian pyramid of both images and a Gaussian pyramid of the weights (see the pyramid sketch after this list)
  5. Blend the Laplacian pyramids together using the Gaussian pyramid of weights (section 3.2 in the paper)
  6. Extract the exposure fused image from the blended pyramid (flatten the pyramid) and use this as the Y channel of the final output image.
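
Below is one possible set of pyramid utilities plus the blend-and-collapse step, over plain float planes. To keep the sketch short it uses a crude 2x2 box filter for downsampling and pixel replication for upsampling; a small Gaussian kernel and bilinear upsampling will look better. This is an illustrative sketch under those assumptions, not the reference implementation.

#include <algorithm>
#include <cstddef>
#include <vector>

// A single-channel float image.
struct Plane {
  int width = 0, height = 0;
  std::vector<float> v;
  float& at(int r, int c) { return v[r * width + c]; }
  float  at(int r, int c) const { return v[r * width + c]; }
};

// Downsample by averaging 2x2 blocks (a crude low-pass; a 5-tap Gaussian is better).
Plane Downsample2x(const Plane& in) {
  Plane out; out.width = in.width / 2; out.height = in.height / 2;
  out.v.resize(out.width * out.height);
  for (int r = 0; r < out.height; ++r)
    for (int c = 0; c < out.width; ++c)
      out.at(r, c) = 0.25f * (in.at(2*r, 2*c) + in.at(2*r, 2*c+1) +
                              in.at(2*r+1, 2*c) + in.at(2*r+1, 2*c+1));
  return out;
}

// Upsample by pixel replication to the given size (bilinear is smoother).
Plane Upsample2x(const Plane& in, int width, int height) {
  Plane out; out.width = width; out.height = height; out.v.resize(width * height);
  for (int r = 0; r < height; ++r)
    for (int c = 0; c < width; ++c)
      out.at(r, c) = in.at(std::min(r / 2, in.height - 1), std::min(c / 2, in.width - 1));
  return out;
}

std::vector<Plane> GaussianPyramid(const Plane& img, int levels) {
  std::vector<Plane> pyr{img};
  for (int l = 1; l < levels; ++l) pyr.push_back(Downsample2x(pyr.back()));
  return pyr;
}

// Laplacian level l = Gaussian level l minus the upsampled next-coarser level;
// the coarsest level stores the residual Gaussian image itself.
std::vector<Plane> LaplacianPyramid(const Plane& img, int levels) {
  std::vector<Plane> gauss = GaussianPyramid(img, levels), lap(levels);
  for (int l = 0; l < levels - 1; ++l) {
    Plane up = Upsample2x(gauss[l + 1], gauss[l].width, gauss[l].height);
    lap[l] = gauss[l];
    for (std::size_t i = 0; i < lap[l].v.size(); ++i) lap[l].v[i] -= up.v[i];
  }
  lap[levels - 1] = gauss[levels - 1];
  return lap;
}

// Blend two Laplacian pyramids using a Gaussian pyramid of per-pixel dark weights
// (bright weight = 1 - dark weight), then collapse coarse-to-fine.
Plane FuseAndCollapse(const std::vector<Plane>& lap_dark,
                      const std::vector<Plane>& lap_bright,
                      const std::vector<Plane>& weight_dark) {
  int levels = static_cast<int>(lap_dark.size());
  Plane acc = lap_dark[levels - 1];                       // start at the coarsest level
  for (std::size_t i = 0; i < acc.v.size(); ++i) {
    float w = weight_dark[levels - 1].v[i];
    acc.v[i] = w * lap_dark[levels - 1].v[i] + (1.f - w) * lap_bright[levels - 1].v[i];
  }
  for (int l = levels - 2; l >= 0; --l) {
    Plane up = Upsample2x(acc, lap_dark[l].width, lap_dark[l].height);
    acc = up;
    for (std::size_t i = 0; i < acc.v.size(); ++i) {
      float w = weight_dark[l].v[i];
      acc.v[i] += w * lap_dark[l].v[i] + (1.f - w) * lap_bright[l].v[i];
    }
  }
  return acc;
}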

For example, here's our reference pipeline's dark and bright images with their corresponding weights and the final output: Exposure Fusion

White in the weight images represents a high value, and black represents a low value. As you can see, the weights for the dark image select the well-exposed sky in the background, while the weights for the bright image select the brightened taxis in the foreground.

Note: Be careful about whether you perform local tone mapping operations in linear intensity space (on luminance) or in a non-linear perceptual space (luma). The role of local tone mapping is to mimic how a human would perceive a scene if they were there in person. If we are using heuristics to select the exposure of different parts of the scene based on what we think would look good to a human, does it make more sense to apply these heuristics on luminance or luma values? How do you convert from luminance to luma? How does one convert luma back to luminance?
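
If you choose to work on luma, one simple (and assumed, not starter-provided) conversion pair uses a pure power-law gamma of 2.2; the sRGB transfer function is the more principled alternative.

#include <cmath>

// Luminance <-> luma using a plain gamma of 2.2. The constant is illustrative;
// it is not taken from the starter code or the reference pipeline.
inline float LuminanceToLuma(float y_linear) { return std::pow(y_linear, 1.f / 2.2f); }
inline float LumaToLuminance(float y_gamma)  { return std::pow(y_gamma, 2.2f); }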

This algorithm makes the image look much brighter in the dark regions without blowing out the already bright regions. But what about the noise? Let's zoom back into that dark region we were looking at before: Tone Mapped Zoom

Notice that while this algorithm produces a result where there is detail in all regions, by boosting the dark regions (which are already prone to sensor noise), we have accentuated noise artifacts. Fortunately, burst mode alignment from the HDR+ paper solves this problem and is the next sub-part of this assignment.

Reducing Noise by Aligning an Image Burst

In this sub-part of the assignment, you will write code to align and merge a burst of (potentially noisy) sensor captures to produce a less noisy output image. The entry point to your code is the same as in the previous parts, but instead of calling sensor_->GetSensorData(), you should call sensor_->GetBurstSensorData(). This method reads a burst of RAW data from the sensor and returns it as a std::vector of bayered images. Your job is to implement a simplified version of the alignment and merging steps from the HDR+ paper to produce a denoised bayer image that can be processed by your existing camera pipeline code. Note: the align/merge algorithm is used to produce a new (higher bit depth) pre-demosaiced RAW image that should then be passed through the rest of your RAW processing pipeline (including local tone mapping).

To illustrate why aligning the burst is necessary, consider the output of simply summing each capture in the burst, as shown below:

Average Denoising
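
For reference, the unaligned per-pixel average that produces this kind of result might look like the following sketch (frames are plain float Bayer buffers; averaging is just summation followed by a rescale):

#include <cstddef>
#include <vector>

// Naive per-pixel average across the burst, with no alignment. This raises
// SNR but smears detail whenever the camera or scene moves between frames.
std::vector<float> AverageBurst(const std::vector<std::vector<float>>& frames) {
  std::vector<float> out(frames.front().size(), 0.f);
  for (const auto& frame : frames)
    for (std::size_t i = 0; i < out.size(); ++i) out[i] += frame[i];
  for (std::size_t i = 0; i < out.size(); ++i) out[i] /= static_cast<float>(frames.size());
  return out;
}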

The resulting image is certainly less noisy, since summation increases the signal-to-noise ratio in dark regions. However, the result is also now blurry because the input images were captured at different points in time. This is the motivation for the HDR+ image alignment and merging steps. Here is a sketch of a simple implementation, though feel free to make modifications or enhancements to this algorithm that you think can produce a better result:

Alignment (Section 4 in the HDR+ paper): The following is a suggestion for how to implement the alignment step:

  1. Convert the stack of raw bayer images from GetBurstSensorData() to grayscale by averaging together every 2x2 bayer grid (effectively downsampling by 2x).
  2. Compute Gaussian pyramids for each of these grayscale images (You should be able to use your code from the earlier exposure fusion part of the assignment.).
  3. For each of the images in the burst, perform a hierarchical alignment to the reference image (the first image in the burst) by following steps 4-6.
  4. For each level of the Gaussian pyramid, starting at the coarsest:
  5. For each tile of the reference image at this level, find the closest matching tile in the image that is being matched against (we use the sum of absolute differences between tiles as the distance measure; a search sketch follows this list)
  6. Upsample the offsets to the next level and repeat step 5 using the upsampled offsets as starting points for the next search
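
Here is a sketch of the per-tile search at one pyramid level, over plain float buffers. Tile size, search radius, and the starting offset (upsampled from the coarser level) are left to the caller; the names and the out-of-bounds penalty are illustrative choices, not taken from the starter code or the paper.

#include <cmath>
#include <limits>
#include <vector>

struct Offset { int dr, dc; };

// Find the offset in [start - radius, start + radius]^2 that minimizes the
// sum of absolute differences between the reference tile and the alternate
// frame at this pyramid level.
Offset AlignTile(const std::vector<float>& ref, const std::vector<float>& alt,
                 int width, int height, int tile_row, int tile_col,
                 int tile_size, Offset start, int radius) {
  Offset best = start;
  float best_cost = std::numeric_limits<float>::max();
  for (int dr = start.dr - radius; dr <= start.dr + radius; ++dr) {
    for (int dc = start.dc - radius; dc <= start.dc + radius; ++dc) {
      float cost = 0.f;
      for (int r = 0; r < tile_size; ++r) {
        for (int c = 0; c < tile_size; ++c) {
          int rr = tile_row + r, cc = tile_col + c;          // reference pixel
          int ar = rr + dr, ac = cc + dc;                    // candidate alternate pixel
          if (rr < 0 || rr >= height || cc < 0 || cc >= width ||
              ar < 0 || ar >= height || ac < 0 || ac >= width) {
            cost += 1.f;                                     // penalize out-of-bounds candidates
            continue;
          }
          cost += std::fabs(ref[rr * width + cc] - alt[ar * width + ac]);
        }
      }
      if (cost < best_cost) { best_cost = cost; best = Offset{dr, dc}; }
    }
  }
  return best;
}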

The paper mentions many other additional steps: subpixel alignment using an L2 metric, a robust upsampling strategy for the alignment fields, varying search radii, Fourier transforms for fast matching, etc. These can certainly improve your alignment, and we encourage interested students to attempt to implement some of these more advanced techniques, but they are not necessary to achieve a reasonable output image for this assignment.

Merging (Section 5 in the HDR+ paper):

The merging algorithm in the HDR+ paper uses an advanced noise model and operates in the frequency domain. We encourage you to implement the merging step in whichever way you choose, but below we provide a sketch of a basic implementation which will produce decent results:

  1. For each overlapping tile in the reference image (our reference uses tiles of size 16 with stride 8; weighting and windowing sketches follow this list):
  2. For each neighboring image, use the alignment offset to find the tile to merge.
  3. Compute a merging weight for the tile from step 2 by comparing that tile to the reference image tile. In our implementation, we use the distance metric from the alignment step to compute an initial weight and then clamp all values below some minimum to 1 (full weight) and all values above some maximum distance to 0 (no weight) in order to throw out bad tile alignments which would blur the image if merged.
  4. Merge the weighted neighboring image tile into the reference tile.
  5. Repeat step 2-4 for all neighboring images.
  6. Repeat step 1-5 for all overlapping tiles in the reference image.
  7. Blend together the overlapping tiles using a raised cosine window (Overlapped tiles section in the paper).
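
Two small illustrative helpers for this step: a distance-to-weight mapping that realizes the clamping described above (with a linear ramp between the two thresholds, which is our own choice), and a raised-cosine window for blending overlapping tiles. The lo/hi thresholds are tuning constants you will need to pick empirically.

#include <cmath>

// Map the tile-distance metric from the alignment step to a merge weight in
// [0, 1]: distances below lo get full weight, above hi get zero weight.
inline float MergeWeight(float distance, float lo, float hi) {
  if (distance <= lo) return 1.f;
  if (distance >= hi) return 0.f;
  return (hi - distance) / (hi - lo);
}

// A raised-cosine window for tiles of size n at stride n/2: windows from
// neighboring tiles sum to 1 along each axis, so accumulating window-weighted
// tiles reconstructs the image without seams.
inline float RaisedCosine(int x, int n) {
  const float kPi = 3.14159265358979f;
  return 0.5f - 0.5f * std::cos(2.f * kPi * (x + 0.5f) / n);
}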

Now that your implementation can align/merge a burst of RAW images, and then apply exposure fusion to the result, you should obtain a result that looks something like this (obviously results will vary based on algorithms used):

Full Pipeline

Test scenes: Your primary test scenes are a subset of the scenes in Part 1. Specifically: taxi.bin, church.bin, and path.bin. These scenes each have a burst of 3 images. For each of these scenes, we've provided a SCENE_solution_part2.bmp which is the output of a reference pipeline which would achieve full points for this part of the assignment, and SCENE_google.bmp which is the output of the full Google HDR+ pipeline (as you can see, it's just a bit better than our reference implementation!).

Tips:

  • You may implement this assignment in any way you wish. We've provided a recommended sketch of the basic algorithms, but feel free to improve upon our suggested algorithm. (The algorithms described in the reference readings are certainly more advanced than the baseline approaches described here.)
  • Implement utility functions for generating Gaussian and Laplacian pyramids first, as they will be used in all parts of this assignment.

Grading

Like Part 1, Part 2 of the assignment will be graded on image quality. Your implementation should contain an approach for aligning/merging images in a burst, and a valid implementation of exposure fusion. You may adjust/improve these algorithms however you see fit. We encourage you to start with simple algorithms, get them to work, and then, if there is time, attempt to improve image quality by moving to more advanced techniques.

Handin

This assignment will be handed in using Canvas:

  • Please hand in camera_pipeline.cpp and camera_pipeline.hpp. We should be able to build and run the code on the myth machines by dropping these files into a freshly checked out starter code tree.
  • Please also include a short writeup describing the algorithm you implemented for exposure fusion and burst denoising.
