### How does `split.data.frame`

work?

```
function (x, f, drop = FALSE, ...)
lapply(split(x = seq_len(nrow(x)), f = f, drop = drop, ...),
function(ind) x[ind, , drop = FALSE])
```

It calls `split.default`

to split row index vector `seq_len(nrow(x))`

, then use an `lapply`

loop to extract associated rows into a list entry.

This isn’t strictly a “data.frame” method. **It splits any 2-dimensional objects by the 1st dimension, including splitting a matrix by rows**.

### How does `split.default`

work?

```
function (x, f, drop = FALSE, sep = ".", lex.order = FALSE, ...)
{
if (!missing(...))
.NotYetUsed(deparse(...), error = FALSE)
if (is.list(f))
f <- interaction(f, drop = drop, sep = sep, lex.order = lex.order)
else if (!is.factor(f))
f <- as.factor(f)
else if (drop)
f <- factor(f)
storage.mode(f) <- "integer"
if (is.null(attr(x, "class")))
return(.Internal(split(x, f)))
lf <- levels(f)
y <- vector("list", length(lf))
names(y) <- lf
ind <- .Internal(split(seq_along(x), f))
for (k in lf) y[[k]] <- x[ind[[k]]]
y
}
```

- if
`x`

has no classes (i.e., mostly an atomic vector),`.Internal(split(x, f))`

is used; - otherwise, it uses
`.Internal(split())`

to split the index along`x`

, then uses a`for`

loop to extract associated elements into a list entry.

An atomic vector (see `?vector`

) is a vector with the following mode:

- “logical”, “integer”, “numeric”, “complex”, “character” and “raw”
- “list”
- “expression”

An object with class… Er… there are so many!! Let me just give three examples:

- “factor”
- “data.frame”
- “matrix”

In my opinion the `split.default`

is not well written. There are so many objects with classes, yet `split.default`

would deal with them in the same way via`"["`

. This works fine with “factor” and “data.frame” (so we will be splitting data frame along the columns!), but it definitely does not work with a matrix in a way we expect.

```
A <- matrix(1:9, 3)
# [,1] [,2] [,3]
#[1,] 1 4 7
#[2,] 2 5 8
#[3,] 3 6 9
split.default(A, c(1, 1, 2)) ## it does not split the matrix by columns!
#$`1`
#[1] 1 2 4 5 7 8
#
#$`2`
#[1] 3 6 9
```

Actually recycling rule has been applied to `c(1, 1, 2)`

, and we are equivalently doing:

```
split(c(A), rep_len(c(1,1,2), length(A)))
```

Why doesn’t R core write another line for a “matrix”, like

```
for (k in lf) y[[k]] <- x[, ind[[k]], drop = FALSE]
```

Till now the only way to safely split a matrix by columns is to transpose it, then `split.data.frame`

, then another transpose.

```
lapply(split.data.frame(t(A), c(1, 1, 2)), t)
```

Another workaround via `lapply(split.default(data.frame(A), c(1, 1, 2)), as.matrix)`

is buggy if `A`

is a character matrix.

### How does `.Internal(split(x, f))`

work?

This is really the core of the core! I will take a small example below for explanation:

```
set.seed(0)
f <- sample(factor(letters[1:3]), 10, TRUE)
# [1] c a b b c a c c b b
#Levels: a b c
x <- 0:9
```

Basically there are 3 steps. To enhance readability, Equivalent R code are provided for each step.

**step 1: tabulation (counting occurrence of each factor level)**

```
## a factor has integer mode so `tabulate` works
tab <- tabulate(f, nbins = nlevels(f))
[1] 2 4 4
```

**step 2: storage allocation of the resulting list**

```
result <- vector("list", nlevels(f))
for (i in 1:length(tab)) result[[i]] <- vector(mode(x), tab[i])
names(result) <- levels(f)
```

I would annotate this list as follows, where each line is a list element which is a vector in this example, and each `[ ]`

is a placeholder for an entry of that vector.

```
$a: [ ] [ ]
$b: [ ] [ ] [ ] [ ]
$c: [ ] [ ] [ ] [ ]
```

**step 3: element allocation**

Now it is useful to uncover the internal integer mode for a factor:

```
.f <- as.integer(f)
#[1] 3 1 2 2 3 1 3 3 2 2
```

We need to scan `x`

and `.f`

, filling `x[i]`

into **the right entry** of `result[[.f[i]]]`

, informed by an accumulator buffer vector.

```
ab <- integer(nlevels(f)) ## accumulator buffer
for (i in 1:length(.f)) {
fi <- .f[i]
counter <- ab[fi] + 1L
result[[fi]][counter] <- x[i]
ab[fi] <- counter
}
```

In the following illustration, `^`

is a pointer to elements that are accessed or updated.

```
## i = 1
x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
^
ab: [0] [0] [0] ## on entry
^
$a: [ ] [ ]
$b: [ ] [ ] [ ] [ ]
$c: [0] [ ] [ ] [ ]
^
ab: [0] [0] [1] ## on exit
^
```

```
## i = 2
x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
^
ab: [0] [0] [1] ## on entry
^
$a: [1] [ ]
^
$b: [ ] [ ] [ ] [ ]
$c: [0] [ ] [ ] [ ]
ab: [1] [0] [1] ## on exit
^
```

```
## i = 3
x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
^
ab: [1] [0] [1] ## on entry
^
$a: [1] [ ]
$b: [2] [ ] [ ] [ ]
^
$c: [0] [ ] [ ] [ ]
ab: [1] [1] [1] ## on exit
^
```

```
## i = 4
x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
^
ab: [1] [1] [1] ## on entry
^
$a: [1] [ ]
$b: [2] [3] [ ] [ ]
^
$c: [0] [ ] [ ] [ ]
ab: [1] [2] [1] ## on exit
^
```

```
## i = 5
x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
^
ab: [1] [2] [1] ## on entry
^
$a: [1] [ ]
$b: [2] [3] [ ] [ ]
$c: [0] [4] [ ] [ ]
^
ab: [1] [2] [2] ## on exit
^
```

```
## i = 6
x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
^
ab: [1] [2] [2] ## on entry
^
$a: [1] [5]
^
$b: [2] [3] [ ] [ ]
$c: [0] [4] [ ] [ ]
ab: [2] [2] [2] ## on exit
^
```

```
## i = 7
x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
^
ab: [2] [2] [2] ## on entry
^
$a: [1] [5]
$b: [2] [3] [ ] [ ]
$c: [0] [4] [6] [ ]
^
ab: [2] [2] [3] ## on exit
^
```

```
## i = 8
x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
^
ab: [2] [2] [3] ## on entry
^
$a: [1] [5]
$b: [2] [3] [ ] [ ]
$c: [0] [4] [6] [7]
^
ab: [2] [2] [4] ## on exit
^
```

```
## i = 9
x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
^
ab: [2] [2] [4] ## on entry
^
$a: [1] [5]
$b: [2] [3] [8] [ ]
^
$c: [0] [4] [6] [7]
ab: [2] [3] [4] ## on exit
^
```

```
## i = 10
x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
^
ab: [2] [3] [4] ## on entry
^
$a: [1] [5]
$b: [2] [3] [8] [9]
^
$c: [0] [4] [6] [7]
ab: [2] [4] [4] ## on exit
^
```