hadley / r-internals

Documentation for R's internal C API

Object size

hadley opened this issue

Formerly in adv-r

Something interesting occurs if we use obj_size() to systematically explore the size of an integer vector. The code below computes and plots the memory usage of integer vectors ranging in length from 0 to 50 elements. You might expect that the size of an empty vector would be zero and that memory usage would grow in proportion to length. Neither of those things is true!

library(lobstr)  # obj_size() is assumed here to come from the lobstr package

sizes <- sapply(0:50, function(n) obj_size(seq_len(n)))
plot(0:50, sizes, xlab = "Length", ylab = "Size (bytes)",
  type = "s")

This isn't just an artefact of integer vectors. Every length 0 vector occupies 40 bytes of memory:

obj_size(numeric())
obj_size(logical())
obj_size(raw())
obj_size(list())

Those 40 bytes include four components possessed by every object in R (the two pointers in the second item count separately):

  • Object metadata (4 bytes). These metadata store the base type (e.g. integer)
    and information used for debugging and memory management.

  • Two pointers: one to the next object in memory and one to the previous
    object (2 * 8 bytes). This doubly-linked list makes it easy for internal
    R code to loop through every object in memory.

  • A pointer to the attributes (8 bytes).

All vectors have three additional components:

  • The length of the vector (4 bytes). By using only 4 bytes, you might expect
    that R could only support vectors up to $2 ^ {4 \times 8 - 1}$ ($2 ^ {31}$, about
    two billion) elements. But in R 3.0.0 and later, you can actually have
    vectors up to $2 ^ {52}$ elements. [Read R-internals][long-vectors] to see how
    support for long vectors was added without having to change the size of this
    field.

  • The "true" length of the vector (4 bytes). This is basically never used,
    except when the object is the hash table used for an environment. In that
    case, the true length represents the allocated space, and the length
    represents the space currently used.

  • The data (variable number of bytes). An empty vector has 0 bytes of data.
    Numeric vectors occupy 8 bytes for every element, integer vectors 4, and
    complex vectors 16.

If you're keeping count, you'll notice that this only adds up to 36 bytes. The remaining 4 bytes are used for padding so that each component starts on an 8-byte (= 64-bit) boundary. Most CPU architectures require pointers to be aligned in this way, and even if they don't require it, accessing non-aligned pointers tends to be rather slow. (If you're interested, you can read more about it in C structure packing.)
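The tally above can be reproduced with a little arithmetic. The sketch below is an illustration only: the field sizes are the ones quoted in the text, and it assumes the usual C structure packing rule that each field is aligned to a multiple of its own size. The names `field_bytes`, `offsets`, and `end_offset` are made up for this example and do not correspond to anything in R's source.

```r
# Field sizes in bytes, in the order listed in the text
field_bytes <- c(
  metadata   = 4,
  next_ptr   = 8,
  prev_ptr   = 8,
  attributes = 8,
  length     = 4,
  truelength = 4
)

# Lay the fields out one after another, assuming each field must start
# at a multiple of its own size (natural alignment)
offsets <- numeric(length(field_bytes))
names(offsets) <- names(field_bytes)
end_offset <- 0
for (i in seq_along(field_bytes)) {
  align <- field_bytes[[i]]
  start <- align * ceiling(end_offset / align)  # skip padding if needed
  offsets[i] <- start
  end_offset <- start + field_bytes[[i]]
}

offsets                               # where each field starts
sum(field_bytes)                      # bytes of actual content
end_offset                            # bytes including padding
end_offset - sum(field_bytes)         # bytes of padding
```

Under these assumptions the content sums to 36 bytes and the layout ends at byte 40, with the 4 bytes of padding falling between the metadata and the first pointer.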

This explains the intercept on the graph. But why does the memory size grow irregularly? To understand why, you need to know a little bit about how R requests memory from the operating system. Requesting memory (with malloc()) is a relatively expensive operation. Having to request memory every time a small vector is created would slow R down considerably. Instead, R asks for a big block of memory and then manages that block itself. This block is called the small vector pool and is used for vectors of up to 128 bytes. For efficiency and simplicity, it only allocates vectors that are 8, 16, 32, 48, 64, or 128 bytes long. If we adjust our previous plot to remove the 40 bytes of overhead, we can see that those values correspond to the jumps in memory use.

plot(0:50, sizes - 40, xlab = "Length",
  ylab = "Bytes excluding overhead", type = "n")
abline(h = 0, col = "grey80")
abline(h = c(8, 16, 32, 48, 64, 128), col = "grey80")
# Reference line: 4 bytes per integer element
abline(a = 0, b = 4, col = "grey90", lwd = 4)
# Pass x explicitly; lines(sizes - 40) alone would plot against 1:51
lines(0:50, sizes - 40, type = "s")
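The size classes can also be checked numerically rather than by eye. The sketch below makes several substitutions that are my assumptions, not part of the original analysis: it uses base R's object.size() in place of obj_size(), it creates vectors with integer(n) rather than seq_len(n), and it subtracts the measured size of an empty vector instead of hard-coding 40, since the overhead may differ between R versions. pool_bytes() is a hypothetical helper written for this example, not a function in R.

```r
# Size classes of the small vector pool, in bytes, as listed in the text
pool_classes <- c(8, 16, 32, 48, 64, 128)

# Hypothetical helper: bytes reserved for `data_bytes` bytes of vector data,
# assuming the smallest size class that fits is the one used
pool_bytes <- function(data_bytes) {
  if (data_bytes == 0) return(0)
  min(pool_classes[pool_classes >= data_bytes])
}

n <- 0:32  # integer vectors holding at most 32 * 4 = 128 bytes of data
empty <- as.numeric(object.size(integer(0)))
observed <- sapply(n, function(k) as.numeric(object.size(integer(k))) - empty)
predicted <- sapply(4 * n, pool_bytes)

data.frame(length = n, observed = observed, predicted = predicted)
```

If the two columns agree on your build, the jumps in the plot are accounted for by the size classes. A mismatch would suggest that the allocator, or the way object.size() estimates it, differs in your version of R.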

Beyond 128 bytes, it no longer makes sense for R to manage the memory itself. After all, allocating big chunks of memory is something that operating systems are very good at. For these larger vectors, R asks for memory in multiples of 8 bytes, which ensures good alignment.
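The multiple-of-8 rule is easiest to see with integer vectors of odd length, where 4 bytes per element does not land on an 8-byte boundary. As a rough check, the sketch below again uses base R's object.size() as a stand-in for obj_size() and measures the empty-vector overhead rather than assuming 40 bytes. Both choices are assumptions made for this example.

```r
n <- 33:50  # integer vectors with more than 128 bytes of data
empty <- as.numeric(object.size(integer(0)))
observed <- sapply(n, function(k) as.numeric(object.size(integer(k))) - empty)

raw_bytes <- 4 * n                        # 4 bytes per integer element
rounded   <- 8 * ceiling(raw_bytes / 8)   # next multiple of 8

data.frame(length = n, raw_bytes, rounded, observed)
```

For odd lengths, raw_bytes and rounded differ by 4. Those are the bytes requested purely for alignment.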

Exercises

  1. Repeat the analysis above for numeric, logical, and complex vectors.

  2. If a data frame has one million rows, and three variables (two numeric, and
    one integer), how much space will it take up? Work it out from theory,
    then verify your work by creating a data frame and measuring its size.

  3. Compare the sizes of the elements in the following two lists. Each
    contains basically the same data, but one contains vectors of small
    strings while the other contains a single long string.

    vec <- lapply(0:50, function(i) c("ba", rep("na", i)))
    str <- lapply(vec, paste0, collapse = "")
    
  4. Which takes up more memory: a factor (x) or the equivalent character
    vector (as.character(x))? Why?

  5. Explain the difference in size between 1:5 and list(1:5).

> The "true" length of the vector (4 bytes). This is basically never used,

The "true" length is also used when a list is enlarged due to subassignment. (source: https://github.com/wch/r-source/blob/9b2a6fc19ca853cff5ae09cb0efcfb179c6da3a5/src/main/subassign.c#L143)

Example:

v = vector("list", 20)
.Internal(inspect(v))  # true length = 0
#> @7fd774c36370 19 VECSXP g0c7 [NAM(1)] (len=20, tl=0)
#>   @7fd76f82c8e0 00 NILSXP g1c0 [MARK,NAM(7)]
#>   @7fd76f82c8e0 00 NILSXP g1c0 [MARK,NAM(7)]
#>   @7fd76f82c8e0 00 NILSXP g1c0 [MARK,NAM(7)]
#>   @7fd76f82c8e0 00 NILSXP g1c0 [MARK,NAM(7)]
#>   @7fd76f82c8e0 00 NILSXP g1c0 [MARK,NAM(7)]
#>   ...
v[[21]] = 21
.Internal(inspect(v))  # new address, true length = 22
#> @7fd774c36440 19 VECSXP g0c7 [NAM(1),gp=0x20] (len=21, tl=22)
#>   @7fd76f82c8e0 00 NILSXP g1c0 [MARK,NAM(7)]
#>   @7fd76f82c8e0 00 NILSXP g1c0 [MARK,NAM(7)]
#>   @7fd76f82c8e0 00 NILSXP g1c0 [MARK,NAM(7)]
#>   @7fd76f82c8e0 00 NILSXP g1c0 [MARK,NAM(7)]
#>   @7fd76f82c8e0 00 NILSXP g1c0 [MARK,NAM(7)]
#>   ...
v[[22]] = 22
.Internal(inspect(v))  # grow in place, possibly because this memory was allocated previously
#> @7fd774c36440 19 VECSXP g0c7 [NAM(1),gp=0x20] (len=22, tl=22)
#>   @7fd76f82c8e0 00 NILSXP g1c0 [MARK,NAM(7)]
#>   @7fd76f82c8e0 00 NILSXP g1c0 [MARK,NAM(7)]
#>   @7fd76f82c8e0 00 NILSXP g1c0 [MARK,NAM(7)]
#>   @7fd76f82c8e0 00 NILSXP g1c0 [MARK,NAM(7)]
#>   @7fd76f82c8e0 00 NILSXP g1c0 [MARK,NAM(7)]
#>   ...