# Dynamic programming on trees with minimal memory

Dynamic programming on trees has always been a beloved topic of mine. I think it nicely teaches fundamental principles of DP and recurrences, at least on a theoretical level. But in practice, when one has to implement a concrete program using DP on trees, there are some traps to avoid. For instance, one has to be aware that such a program usually consumes a pretty decent amount of memory. Asymptotically it is linear in the number of vertices in the tree, but the constant can be quite big. In this article we will show how much of this memory is really necessary when programming in C++, the most used language in programming competitions.

We will illustrate it with a classic problem of counting the number of maximum-size matchings in an undirected tree. There is nothing special about this particular problem, apart from the fact that I like it and it has nice educational properties – it is not very complicated, but at the same time not trivial, and it nicely exemplifies what we can expect from DP on trees. Therefore most of the things we say here can be applied to other DP problems on trees (maybe except the very last bits, in which we will use some features of this particular problem). Especially the part dealing with reading a tree from the input into memory is universal and interesting all by itself.

This article features the task *Matchings* that I prepared for the Polish Olympiad in Informatics
Training Camp in 2010 and used once more at the Brazilian ICPC Summer School in 2015.
It is a significantly expanded version of a lecture I gave at the latter camp.

## Definition of the problem and DP recurrences

We are given an undirected tree which has \(n\) vertices and \(n-1\) edges.
A *matching* is a subset of edges in which every vertex is adjacent
to at most one edge from the subset. The size of a matching is the number of
edges in this subset. We want to calculate the maximum size of a matching
and also the number of matchings of this size.
The size of a matching is limited by \( \frac{n}{2} \), since each edge in a matching “uses” two vertices.
However, the number
of matchings can be exponential (in a star-shaped tree where each ray has three vertices,
there are $\frac{n}{3} 2^{n/3}$ matchings), so we use a standard trick and ask for
this number modulo given integer \(M\).

The DP solution for this problem is not very hard. As usual, we pick an arbitrary vertex $v_\star$ and make it a root of the tree, then we direct every edge to make a directed tree. Finally we traverse this tree from leaves to the root, performing some calculations for every vertex $v$ based on calculations done previously for its children $u_0, \ldots, u_{m-1}$.

### Finding the maximum size

First, let's concentrate on finding the maximum size. We will proceed bottom-up and for every vertex \(v\) we will calculate the maximum size of a matching in a subtree rooted in vertex \(v\). In fact we will calculate two values: \(dt[v]\) – the maximum size of a matching in the subtree which uses vertex \(v\), i.e. there is an edge \(vu_i\) for some \(i\) in the matching, and \(dn[v]\) – the maximum size of a matching in the subtree which does not use vertex \(v\), i.e. there is no such edge.

The recurrence for these values is as follows. When the matching does not use vertex \(v\), all its children \(u_i\) are free to be used (or not) in their respective subtrees which can be considered independently. Thus it is:

\(dn[v] = \sum_{i} \max(dt[u_i], dn[u_i]) \)

For case when the matching uses vertex \(v\), we have to decide which edge \(vu_i\) goes into the matching. Then in the subtree rooted in \(u_i\) we cannot use this vertex, since it is already matched with its parent. Of course we use the edge which results in a matching of the largest size. Therefore

\(dt[v] = \max_{i} (1 + dn[u_i] + \sum_{j \neq i} \max(dt[u_j], dn[u_j])) \)

We assume that sum over empty set is \(0\), and max over empty set is \(-\infty\), thus for a leaf \(v\) we have \(dn[v] = 0\) and \(dt[v] = -\infty\). The above formula for $dt[v]$ is not very efficient, since it calculates sums over vertices many times. It is better to calculate the sum once (and in fact this is exactly what we have done in \(dn[v]\)), and then remove one element from it:

\(dt[v] = dn[v] + \max_{i} (1 + dn[u_i] - \max(dt[u_i], dn[u_i])) \)

Now calculating \(dn[v]\) and \(dt[v]\) takes time linear in the number of children of vertex \(v\), therefore calculating these values for all vertices takes \(O(n)\) time.

We can simplify these formulas a little bit. Assume that for some vertex \(v\) we have \(dn[v] \geq 1\), i.e. there is a non-empty matching which does not use \(v\). Take such a matching of maximum size and consider the top-most vertex matched in it by some edge; since the vertices above it are not used, we can move this edge up, so that vertex \(v\) becomes used. Thus we have \(dt[v] \geq dn[v]\). On the other hand if \(dn[v] = 0\), then either \(v\) is a leaf (and then \(dt[v] = -\infty\)) or \(dt[v] \geq 1\). So let's fix the definition of \(dt[v]\) to be \(0\) in case $v$ is a leaf. Thanks to that change we will always have \(dt[v] \geq dn[v]\), thus \(\max(dt[v], dn[v]) = dt[v]\) and the formulas become:

\(dn[v] = \sum_{i} dt[u_i] \)

\(dt[v] = dn[v] + \max_{i} (1 + dn[u_i] - dt[u_i]) \)

And for leaves we have \(dn[v] = dt[v] = 0\).

### Finding the number of maximum-size matchings

Now let's proceed to counting the matchings. Again, for every vertex \(v\) we introduce two variables \(cn[v]\) and \(ct[v]\) – the numbers of maximum-size matchings in the subtree of \(v\) which do not use and which use vertex \(v\), respectively. We also use the notation $c[v]$ to denote the number of maximum-size matchings in the subtree, regardless of whether vertex \(v\) is used. Since \(dt[v] \geq dn[v]\), the matchings counted by \(ct[v]\) are always included, but we include \(cn[v]\) only if \(dn[v]\) is equal to \(dt[v]\):

\( c[v] = ct[v] + cn[v] \cdot [ dn[v] = dt[v] ] \)

For \(cn[v]\), that is when calculating number of maximum-size matchings which do not use vertex $v$, we independently consider all subtrees rooted in children of vertex $v$, so we just take a product of numbers of all maximum-size matchings in these subtrees:

\( cn[v] = \prod_{i} c[u_i] \)

For calculating \(ct[v]\) we must iterate over each possible edge \(vu_i\) and check whether the maximum size of a matching using this edge equals $dt[v]$. If the answer is positive, we replace $c[u_i]$ with $cn[u_i]$ in the already calculated product $cn[v]$ (because we do not want to recalculate the whole product), and finally take the sum of such products:

\( ct[v] = \sum_{i} ( cn[v] / c[u_i] \cdot cn[u_i] ) \cdot [ dt[v] = dn[v] + 1 + dn[u_i] - dt[u_i] ] \)

Unfortunately, there is a slight problem with the above formula – we cannot perform division. It would be perfectly fine if we worked with numbers in full precision, since $c[v]$ is never equal to $0$, so we could do normal division. But since we are working with numbers modulo \(M\), we have to multiply by the modular inverse, and $c[u_i]$ could simply evaluate to $0$ modulo $M$, which has no inverse. (And if \(M\) is not a prime number, even more numbers lack inverses.)

Therefore we need to use a slightly different approach, but since it is a little bit technical, we will postpone it to a later part of the article. For now just suppose that it can be done, in linear time and with no additional memory.

If vertex $v_\star$ is the root of the tree, then the numbers $dt[v_\star]$ and $c[v_\star]$ are the final answer to the problem. Therefore, we have an algorithm which solves the problem in optimal time \(O(n)\) and memory complexity $O(n)$. To be more precise, the dynamic programming calculations use four variables per vertex, making a total of $4n$ variables.

## The standard approach for implementation

We will make some assumptions about the implementation.
We assume that the standard integer type *int* is big enough to hold
the numbers \(2n\) and \(M-1\), but not necessarily bigger.
We also assume that all pointers are of the same size as this int.
The code below is written in C++ with the help of one macro:
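The exact form of the macro is an assumption on my part; `REP` is the usual competitive-programming loop helper and it is the one used in the listings below:

```cpp
// Shorthand loop macro used throughout the code below.
#define REP(i, n) for (int i = 0; i < (n); ++i)
```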

We assume that all global variables and arrays are initialized to $0$.

The implementation consists of two phases: reading the description of the tree from the standard input, and traversing the tree and performing DP calculations.

### Reading the tree

First we need to know how the tree is given in the input. The most common way for specifying graphs (and trees in particular) is by listing their edges. More precisely, first we number vertices of the tree (e.g. using integers from \(0\) to \(n-1\)), and then we put number \(n\) in the input, followed by \(n-1\) pairs \(a, b\) which mean that vertex number \(a\) is adjacent to vertex number \(b\). Note that we number vertices only for convenience of such representation. We do not care that a particular vertex has some particular number.

The easiest way to input such a tree is to create for each vertex a vector \(adj[v]\) which will store all vertices adjacent to \(v\). The code is as follows:
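A sketch of this reading code might look as follows (the bound `N` and the use of `scanf` are assumptions):

```cpp
#include <cstdio>
#include <vector>
using namespace std;

const int N = 500000;          // assumed upper bound on n
int n;
vector<int> adj[N];            // adjacency lists

void read_tree() {
  scanf("%d", &n);
  for (int i = 0; i < n - 1; ++i) {
    int a, b;
    scanf("%d%d", &a, &b);
    adj[a].push_back(b);       // store the edge in both directions
    adj[b].push_back(a);
  }
}
```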

The whole tree is defined in the input using about \(2n\) ints (to be precise, one int to describe the size and $2n-2$ ints to describe the edges, but we will ignore additive constants here). Let's calculate how much memory it needs.

We have \(n\) vectors, each consisting of three pointers:
the start of the allocated buffer (pointer `begin`),
one place after the last element in the buffer (pointer `end`),
and one place after the allocated buffer (pointer `capacity`).
Thus the length of the allocated buffer and the number of elements stored in it
are represented as differences of the pointers `capacity` and `end`
with the pointer `begin`.
As we assumed in the beginning, pointers and ints have the same size,
so we need \(3n\) ints just for storing the vectors (even when they are empty).

So what happens when we read a tree?
On the first `push_back` to each vector, it will allocate a buffer of size $1$,
and every time the current buffer is completely filled, it will reallocate it to a buffer twice as big.
So the sum of the buffer sizes will be at least \(2n\) ints to store all the values,
but most probably more, because of unused reserved space.
In the worst case it could be as big as \(4n\) ints, e.g. for a star-shaped tree
with one vertex $v$ of degree $n-1$, where $n-2$ is a power of two.
The last `push_back` to $adj[v]$ will result in a buffer of size $2n-4$,
which during reallocation must coexist in memory with the old buffer of size $n-2$,
and of course we need $n-1$ ints for the one-element vectors of all the other vertices.

But that is not everything.
What may not be obvious is that when `vector` allocates a buffer, it calls
the library function `malloc`,
which needs some memory of its own to somehow keep track of all the allocated buffers.
How much depends on the implementation (and other factors such as memory alignment),
but it surely needs at least one pointer or pointer-sized int per buffer, so let's assume exactly that.
That gives us another \(n\) ints.

In summary: just to read the tree in the memory using vectors we need \(6n\) ints, possibly more (even up to \(8n\) ints).

### Calculating DP

Having read the tree we want to calculate the variables defined by DP recurrences. Again, the easiest way to do this is to make a DFS search through the tree starting from an arbitrary root, let's say $v_\star = 0$. We should track from which vertex we entered vertex \(v\), so we will know which vertex from \(adj[v]\) is the parent of \(v\); the rest of them are its children. During the search we can calculate all DP arrays:
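A sketch of such a DFS is below. Here `update_v` stands for the problem-specific DP update; as a placeholder (an assumption for illustration), this version merely records the order in which vertices are finished, showing that every vertex is processed after all of its children:

```cpp
#include <vector>
using namespace std;

const int N = 500000;          // assumed bound on n
int n;
vector<int> adj[N];

vector<int> order;             // placeholder: records the post-order

void update_v(int v, int par) {
  // The problem-specific DP update goes here; as a placeholder we
  // only record the order in which vertices are finished.
  order.push_back(v);
}

void recursive_dp(int v, int par) {
  for (vector<int>::iterator it = adj[v].begin(); it != adj[v].end(); ++it)
    if (*it != par)
      recursive_dp(*it, v);    // visit all children first
  update_v(v, par);            // then update DP arrays for node v
}
```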

The above code is quite general,
and below is a function which calculates arrays `dn` and `dt`
for our specific problem
(we won't show yet how to calculate arrays `cn` and `ct`
due to a problem with division mentioned before).
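Under the simplified recurrences derived earlier, this function might be sketched as follows (the array bound `N` is an assumption):

```cpp
#include <algorithm>
#include <vector>
using namespace std;

const int N = 500000;  // assumed bound on n
int n;
vector<int> adj[N];
int dn[N], dt[N];      // matching size not using / using vertex v

void update_v(int v, int par) {
  dn[v] = 0;
  int best = 0;        // 0 covers the leaf case, where dt[v] = dn[v]
  for (vector<int>::iterator it = adj[v].begin(); it != adj[v].end(); ++it) {
    const int u = *it;
    if (u == par) continue;
    dn[v] += dt[u];                       // dn[v] = sum of dt[u_i]
    best = max(best, 1 + dn[u] - dt[u]);  // gain from matching edge vu
  }
  dt[v] = dn[v] + best;
}

void recursive_dp(int v, int par) {
  for (vector<int>::iterator it = adj[v].begin(); it != adj[v].end(); ++it)
    if (*it != par) recursive_dp(*it, v);
  update_v(v, par);
}
```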

This is the second phase: dynamic programming. Memory used by this phase can
be divided into two parts: memory used by the DP arrays and memory used to traverse the tree.
Let's calculate the latter. Since we are using recursion, we need to know how it is
implemented by the compiler. There will be frames on the stack, each containing
*at least*: the return address, function arguments, and local variables. The return address is a
pointer, and in our case we have two arguments (`v`, `par`) and one local variable
(`it`, which is an iterator, thus can be implemented as a simple pointer). That's
four ints in one frame. In the worst case our tree could be just a long path
(so its depth will be \(n\)), thus the stack could contain \(n\) frames and will need
\(4n\) ints.

Actually it could be worse, since we do not know exactly what is inside a stack frame until we dive into the assembly. It is not uncommon to find there a pointer to the previous stack frame, or temporary local variables (used to calculate values of subexpressions in complex expressions). Moreover, on certain architectures stack frames must be aligned to certain boundaries. So we state that we need at least \(4n\) ints, but the reality could be much worse.

To sum up both phases of the algorithm: we need at least $6n$ ints (probably more) for storing the tree, $4n$ ints (probably more) for traversing it, and an additional $4n$ ints for the DP arrays. That is at least \(14n\) ints for solving the task. Can you guess how much this can be improved?

I will just say: a lot. But I won't disclose precisely how much – I think it is more fun to unwind it step by step.

So let's begin.

## Replacing things we cannot control

The saddest part of the above estimates is that we kept saying we need *at least this much, but probably more*.
First, we try to get rid of the sources of this uncertainty: vectors with malloc
(which were used for reading and storing the tree) and recursion
(which was used for the DP calculations).

### Replacing vectors

One clear redundancy with malloced vectors is that both malloc and the vector store
the length of the allocated buffer.
Since we cannot change the behavior of the `vector` class
(at least not without creating custom allocators, which is not what I would like to do here),
perhaps we can somehow manage without it.
Which leads to a question: why do we need vectors in the first place?
We need them to store the adjacency lists of vertices,
and since we are loading edges one by one, we do not know beforehand
how many vertices will go into each list.
Thus we need some kind of structure which is *resizable*,
and vectors look like the natural choice here.

But maybe vectors are not the best choice for storing adjacency lists?
Maybe *real lists* would fit better?
Since lists support the `push_back` operation,
we only have to change the declaration to `list<int> adj[N];`.

Memory calculation: each element of the list, apart from the value (an int), stores two pointers, to the previous and to the next element of the list (the list is doubly-linked) – three ints in total. Moreover, each time we insert a new element, memory for this single element is malloced, which results in at least one more int of malloc bookkeeping. There will be \(2n\) elements in all the lists, so the total memory consumption is at least \(8n\) ints. That is even more than the lower bound for vectors.

Of course, we don't need doubly-linked lists. We can use the
`forward_list<int>` class just fine, replacing `push_back`
operations with `push_front` (since the actual order of vertices
on the lists is not important for us).
That leaves us with at least \(6n\) ints, and part of this memory still goes
to malloc bookkeeping.
It is some improvement over vectors, but not much (only over their upper bound).

Let's try a different approach. We wouldn't need vectors if we knew the size of each adjacency list. Actually, we could get this information if we could go through the input twice:
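A sketch of the two-pass reading (assuming the input can be rewound, e.g. when it is a regular file):

```cpp
#include <cstdio>
#include <cstdlib>
#define REP(i, n) for (int i = 0; i < (n); ++i)

const int N = 500000;   // assumed bound on n
int n, deg[N], ptr[N];
int* adj[N];            // a malloced buffer for each adjacency list

void read_tree() {
  scanf("%d", &n);
  REP(i, n - 1) {
    int a, b;
    scanf("%d%d", &a, &b);
    deg[a]++; deg[b]++;             // first pass: count degrees only
  }
  REP(v, n) adj[v] = (int*) malloc(deg[v] * sizeof(int));
  rewind(stdin);                    // go through the input again
  scanf("%d", &n);
  REP(i, n - 1) {
    int a, b;
    scanf("%d%d", &a, &b);
    adj[a][ptr[a]++] = b;           // second pass: fill the buffers
    adj[b][ptr[b]++] = a;
  }
}
```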

For each vertex \(v\) we first calculate its degree \(deg[v]\), which is exactly the number of vertices on its adjacency list. After that we malloc a correctly-sized buffer \(adj[v]\). Then we rewind the file and read it once more, this time storing the vertices in the arrays. Index \(ptr[v]\) keeps track of where to put the next value in the array of vertex \(v\).

That gives us \(3n\) ints for arrays `deg`, `ptr` and `adj`,
and at least another \(n\) ints for malloc bookkeeping.
The sum of buffer lengths is \(2n\) ints, thus we need \(6n\) ints here,
the same as for lists. But we can apply a clever trick. Each time we
put a vertex number during the second pass, we can advance the pointer `adj[v]` itself, so we
do not need the array `ptr`. If we need the original beginnings of the buffers,
we can just subtract their lengths. Thus the code for the part after rewinding the input file is:

Then the code to iterate over the adjacency list of $v$ is as follows:
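A sketch (here summing the neighbours, as a stand-in for whatever processing we need):

```cpp
const int N = 500000;   // assumed bound on n
int deg[N];
int* adj[N];            // after reading, adj[v] points past the end of v's list

// example: iterate over the adjacency list of v
int neighbour_sum(int v) {
  int s = 0;
  for (int* it = adj[v] - deg[v]; it != adj[v]; ++it)
    s += *it;           // *it is the next neighbour u of v
  return s;
}
```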

This uses only \(5n\) ints for reading the tree. Unfortunately, in many contests we load the data from the standard input, so we can read it only once. Thus this approach will not work there, but it is good to know it anyhow. And don't throw away the above code: if we cannot read the list of edges twice, we can just store it, using an additional \(2n\) ints:
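For instance like this (arrays `a` and `b` hold the edges; the rest proceeds as in the two-pass version, only the "second pass" now runs over these arrays):

```cpp
#include <cstdio>
#include <cstdlib>
#define REP(i, n) for (int i = 0; i < (n); ++i)

const int N = 500000;   // assumed bound on n
int n, deg[N];
int a[N], b[N];         // 2n extra ints storing the list of edges
int* adj[N];

void read_tree() {
  scanf("%d", &n);
  REP(i, n - 1) {
    scanf("%d%d", &a[i], &b[i]);
    deg[a[i]]++; deg[b[i]]++;
  }
  REP(v, n) adj[v] = (int*) malloc(deg[v] * sizeof(int));
  REP(i, n - 1) {            // "second pass" over arrays a and b
    *adj[a[i]]++ = b[i];
    *adj[b[i]]++ = a[i];
  }
}
```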

This approach needs \(7n\) ints for reading the tree.
It is certainly better than the upper bound for vectors, but the approach with lists gave us \(6n\) ints.
The above code has one big advantage, though: we do need \(7n\) ints during the reading phase,
but after that phase only \(5n\) ints remain in use, since arrays `a` and `b`
are no longer needed and can be *reused* for different purposes.
That is why, from now on, we will specify the memory requirements of the first phase using two numbers:
the memory needed to store the tree after the phase,
plus the additional memory needed for actually performing the phase.
In this notation the last approach needs \((5+2)n\) ints.

### Replacing malloc

The above approach still uses malloc.
That is wasteful, not only because we do not control the memory used by malloc,
but also because it uses at least \(n\) ints for keeping track
of the lengths of the reserved buffers, which we already store redundantly in array `deg`.
Instead of malloc we will keep all adjacency lists in one array `Adj`
of length \(2n\):
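A sketch of this loading code (three loops: count degrees, turn them into end indices by prefix sums, then fill `Adj` backwards):

```cpp
#include <cstdio>
#define REP(i, n) for (int i = 0; i < (n); ++i)

const int N = 500000;      // assumed bound on n
int n, deg[N + 1];
int Adj[2 * N];            // all adjacency lists, one after another
int a[N], b[N];            // temporary storage for the edges

void read_tree() {
  scanf("%d", &n);
  REP(i, n - 1) {                       // first loop: degrees
    scanf("%d%d", &a[i], &b[i]);
    deg[a[i]]++; deg[b[i]]++;
  }
  REP(v, n - 1) deg[v + 1] += deg[v];   // second loop: prefix sums;
                                        // deg[v] = index after v's list
  REP(i, n - 1) {                       // third loop: fill Adj backwards
    Adj[--deg[a[i]]] = b[i];
    Adj[--deg[b[i]]] = a[i];
  }                                     // now deg[v] = beginning of v's list
  deg[n] = 2 * (n - 1);                 // end of the whole array
}
```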

The idea is that the subsequent lists are stored one after another in `Adj`.
After the second loop, \(deg[v]\) stores the index just after the last element of the adjacency list of \(v\).
Therefore after the third loop each \(deg[v]\) stores the index of the beginning of this list.
That is why a better name for this array after the loading code is `begin`.
Since it will be quite common for us to reuse memory for different purposes,
we will use aliases for clarity:
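One way the alias might be written:

```cpp
const int N = 500000;     // assumed bound on n
int deg[N + 1];
int* const begin = deg;   // after loading, deg[v] holds the beginning of
                          // v's list, so we refer to the array as begin
```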

Note that array `begin` has length $n+1$, since it has to store the end
of the array `Adj` in its last element.
This is how we
iterate over the adjacency list of \(v\):
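A sketch (again summing the neighbours as a stand-in for real processing):

```cpp
const int N = 500000;   // assumed bound on n
int Adj[2 * N], deg[N + 1];
int* const begin = deg;

// example: iterate over the adjacency list of v
int neighbour_sum(int v) {
  int s = 0;
  // v's list occupies Adj[begin[v]] .. Adj[begin[v+1] - 1]
  for (int i = begin[v]; i < begin[v + 1]; ++i)
    s += Adj[i];        // Adj[i] is the next neighbour u of v
  return s;
}
```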

In this approach we need \(5n\) ints for all the arrays, but after reading
only `deg` and `Adj` are used, so it is just \(3n\) ints.
In total, the memory consumption of this phase is \((3+2)n\) ints,
which is clearly better than the first approach with vectors (even if
we do not care to reuse arrays `a` and `b`).

### Replacing recursion

As we said before, recursion costs us at least \(4n\) ints, but in practice it could be much worse.
We pay all this just for the convenience that recursive programs are slightly easier to write,
but for just a little more programming work we can rewrite the function in an iterative way
and get rid of most of this overhead.
Let's see which data from the stack frame is essential:
actually only the vertex number `v` and the iterator `it`.
Observe that the second argument `par` is just a copy of `v` from the previous frame,
and since the `recursive_dp` function (apart from the call in main) is called in only one place,
the return addresses in all frames (except the first) will be the same.
Clearly this is a lot of wasted space, which will not be wasted if we do it by hand.

So as a start, let's just translate this function into an iterative one:

```cpp
int qe, q[N];
vector<int>::iterator it[N];

void iterative_dp(int root) {
  qe = 0;
  q[qe++] = -1; /* Sentinel */
  q[qe++] = root;
  REP(i, n) { it[i] = adj[i].begin(); }
  while (qe > 1) {
    const int v = q[qe-1], par = q[qe-2];
    if (it[v] == adj[v].end()) {
      /* Update DP arrays for node v. */
      update_v(v, par);
      /* Do something with edge from par to v. */
      qe--;
    } else {
      const int u = *it[v]++;
      if (u != par) { q[qe++] = u; }
    }
  }
}
```

As we can see, we only use the LIFO queue `q` (which replaces the stack)
and the array `it`, so indeed we have managed to do the DP traversal in \(2n\) ints of memory.
Of course, in the previous section we replaced vectors with array `Adj`,
so we can rewrite this accordingly:
Finally we have a new solution without vectors, malloc and recursion. Since the first phase used \((3+2)n\) ints of memory and in the second phase we use \(2n\) ints for the tree traversal, in total we use \(5n\) ints. Adding \(4n\) ints for the DP arrays, we get \(9n\) ints, which is slightly less than two-thirds of the memory usage of the first solution. But this is just a warm-up.

## Compressing the tree

In the previous section we took a standard implementation of a standard approach and got rid of unnecessary memory allocated by the parts of the code we didn't have control over (library and compiler code). Now that the dust has settled after squeezing the standard implementation, let's take a closer look at the standard approach itself. Maybe here we also have some room for improvement?

We specifically focus on a format in which we store the tree in memory in order to make a DP traversal over it.

### Remove pointers to parents

For every vertex \(v\) we store its adjacency list, that is, a list of all
vertices incident to it.
But the DP calculations are done over a rooted tree, in which one of these
incident vertices becomes the parent of vertex \(v\).
Luckily, we enter \(v\) from this parent, so we can skip it while iterating
over the adjacency list of \(v\).
But since the parents' presence in the adjacency lists is only a nuisance, why
don't we remove all of them before calculating the DP?
There are \(n-1\) parent entries in all the lists, so this removal
shortens array `Adj` by half!

This can be accomplished by another DFS traversal of the tree (performed before the DP calculations), during which for each non-root vertex \(v\) we copy the last element of its adjacency list into the position of the parent element on this list:
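One possible sketch: when we take vertex v off the stack, its own list is already parent-free, so every remaining entry is a child u; for each such u we locate v on u's list and overwrite it with the last element of that list:

```cpp
const int N = 500000;   // assumed bound on n
int n, Adj[2 * N], deg[N + 1];
int* const begin = deg;
int qe, q[N];

void remove_parents(int root) {
  qe = 0;
  q[qe++] = root;
  while (qe > 0) {
    const int v = q[--qe];
    // if v is not the root, its last slot was already copied over the
    // parent entry, so its effective list ends one position earlier
    const int end = begin[v + 1] - (v != root ? 1 : 0);
    for (int i = begin[v]; i < end; ++i) {
      const int u = Adj[i];
      // erase v from u's list: copy the last element over it
      for (int j = begin[u]; ; ++j)
        if (Adj[j] == v) { Adj[j] = Adj[begin[u + 1] - 1]; break; }
      q[qe++] = u;
    }
  }
}
```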

Next we can safely remove the last elements from all non-root adjacency lists. That means that we must move the elements of the adjacency list of vertex \(v\) by $v-1$ or $v$ positions to the left, depending on whether the number $v$ is greater than $v_\star$ or not:
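A sketch of the compaction; the running variable `shift` (the number of non-root vertices preceding v) equals exactly $v$ or $v-1$, as described:

```cpp
#define REP(i, n) for (int i = 0; i < (n); ++i)

const int N = 500000;   // assumed bound on n
int n, Adj[2 * N], deg[N + 1];
int* const begin = deg;

void compact(int root) {
  int shift = 0;        // non-root vertices before v: v or v-1
  REP(v, n) {
    const int b = begin[v];
    const int e = begin[v + 1] - (v != root ? 1 : 0);  // drop the dead slot
    begin[v] = b - shift;
    for (int i = b; i < e; ++i) Adj[i - shift] = Adj[i];
    if (v != root) shift++;
  }
  begin[n] = n - 1;     // n-1 child entries remain in total
}
```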

After such a transformation we no longer have to check for parents during the DP tree traversal.
But what is even more important, the second half of array `Adj` is no longer
needed and its $n$ ints can be reused.
To be able to reuse only part of some array, we can make one big array `array`
holding all the memory we want to reuse,
and create aliases inside this array, e.g. like this:
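A sketch of how the aliasing might look (the exact layout is an assumption; for simplicity the offsets use the maximal size `N` rather than the actual `n`):

```cpp
const int N = 500000;                  // assumed bound on n
int array[5 * N + 2];                  // one big array holding all the memory
int* const Adj   = array;              // 2n ints while reading, n ints afterwards
int* const begin = array + 2 * N;      // n+1 ints
int* const a     = array + 3 * N + 1;  // n ints, only during reading
int* const b     = array + 4 * N + 1;  // n ints, only during reading
// after removing the parents, array + N .. array + 2*N is free, e.g.:
int* const dn    = array + N;          // a DP array reusing the freed half of Adj
```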

Just remember to initialize the array with zeros, if it comes from reused space.

Thanks to that the first phase uses memory of \((2+3)n\) ints. Adding $2n+4n$ ints for the second phase, we get $8n$ ints for the whole program.

### Changing order of DP computations

Observe that we can perform the DP calculations in any order, as long as the calculations
for every child of a given vertex \(v\) are performed before the calculations for vertex \(v\).
In functions `recursive_dp` and `iterative_dp` we did it
using a DFS traversal, since at the same time we needed to know the parents of the vertices.
But now, after removing the parents, we can just as well traverse the vertices in BFS manner:
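A sketch; since the lists are already parent-free, every list entry is a child, so a plain BFS needs no visited marks:

```cpp
const int N = 500000;   // assumed bound on n
int n, Adj[N], deg[N + 1], q[N];
int* const begin = deg;

void bfs_order(int root) {
  int qb = 0, qe = 0;
  q[qe++] = root;
  while (qb < qe) {
    const int v = q[qb++];
    // lists are parent-free, so every entry is a child of v
    for (int i = begin[v]; i < begin[v + 1]; ++i)
      q[qe++] = Adj[i];
  }
}
```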

This code fills the array `q` of size $n$ with the BFS order of visiting the vertices of the tree.
Note that in the loop we no longer have access to the parent of \(v\), but we don't need it.
One important property of this order is that vertex \(v\) appears in array `q`
before each of its children. Therefore we can make the DP traversal of the tree using
only arrays `Adj`, `begin` and `q`, iterating over the vertices
of `q` in reverse order:
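A sketch, with the dn/dt update from before written inline:

```cpp
#include <algorithm>

const int N = 500000;   // assumed bound on n
int n, Adj[N], deg[N + 1], q[N];
int* const begin = deg;
int dn[N], dt[N];

void dp_in_bfs_order() {
  // reverse BFS order guarantees children are processed before parents
  for (int i = n - 1; i >= 0; --i) {
    const int v = q[i];
    dn[v] = 0;
    int best = 0;
    for (int j = begin[v]; j < begin[v + 1]; ++j) {
      const int u = Adj[j];           // u is a child of v
      dn[v] += dt[u];
      best = std::max(best, 1 + dn[u] - dt[u]);
    }
    dt[v] = dn[v] + best;
  }
}
```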

Therefore, in order to traverse the tree in the second phase, we can use the BFS order in array `q`
instead of the stack in array `q` together with array `ptr`.
Thanks to that we only need $n+4n$ ints for the second phase,
thus $7n$ ints for the whole solution (half of the memory usage of the standard approach).

### Remove vertex numbers

All the work we have done while reading the tree was about finding a correct
order of vertices.
We hinted before that the actual numbers of the vertices do not matter – the only thing
that matters is the shape of the tree.
So wouldn't it be nice if the vertices were numbered in the input in some sensible order,
for example in the BFS order we calculated in the previous section?
But since they aren't, maybe we can *renumber* them?

The idea is as follows. The array `q` contains the numbers of vertices
in the BFS order. Therefore the \(i\)-th node (zero-based) in this order
has number \(q[i]\). We would like to change the number of this node to simply \(i\).
Node numbers are stored in array `Adj` and are also used as indices into array `begin`.

First, let's look at how array `begin` would look after such
renumbering.
We can compute it by recovering the degrees of all vertices in BFS order
(storing them in array `newdeg`),
and once more transforming them into array `begin`:
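A sketch:

```cpp
#define REP(i, n) for (int i = 0; i < (n); ++i)

const int N = 500000;   // assumed bound on n
int n, deg[N + 1], q[N], newdeg[N];
int* const begin = deg;

void renumber() {
  // number of children of the i-th vertex in BFS order
  REP(i, n) newdeg[i] = begin[q[i] + 1] - begin[q[i]];
  // standard transformation of degrees into the begin array
  begin[0] = 0;
  REP(i, n) begin[i + 1] = begin[i] + newdeg[i];
}
```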

It looks like we would have more problems with array `Adj`, since
the vertex numbers in it should be renumbered according to `q`,
and also blocks of this array should be moved according to the changes in `begin`.
But it turns out that the new array `Adj` becomes trivial.
Look closely at how array `q` was constructed in the first place.
We iterated through subsequent vertices in BFS order, so subsequent cells of `Adj`
(after renumbering) must be equal to subsequent cells of `q`
(except the first element of `q`, which corresponds to the root of the tree).
Therefore $Adj[i] = i+1$ for all $i$.

Since we don't need `q` either (it is also trivial after renumbering),
we only need a temporary array of $n$ ints to store `newdeg`.
Therefore the new code for the DP computation uses neither `q` nor `Adj`:
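A sketch; vertex numbers now coincide with the BFS order, so children always have larger numbers than their parent, and a simple loop from $n-1$ down to $0$ is a valid DP order:

```cpp
#include <algorithm>

const int N = 500000;   // assumed bound on n
int n, deg[N + 1];
int* const begin = deg;
int dn[N], dt[N];

void calculate_dp() {
  for (int v = n - 1; v >= 0; --v) {
    dn[v] = 0;
    int best = 0;
    for (int i = begin[v]; i < begin[v + 1]; ++i) {
      const int u = i + 1;   // Adj[i] would equal i+1, no need to store it
      dn[v] += dt[u];
      best = std::max(best, 1 + dn[u] - dt[u]);
    }
    dt[v] = dn[v] + best;
  }
}
```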

Wow. We have just compressed the whole information about the tree into just one array `begin`
of size \(n\). Moreover, having just this array, we can perform a DP traversal of the tree.
That's something. The first phase now costs us \((1+4)n\) ints.
And we don't need any additional memory for traversing the tree in the second phase,
so it is just the \(4n\) ints of the DP arrays.
In total our solution uses $5n$ ints.

### Simpler method of compressing the tree

Actually, once we know that we can represent the whole tree using only the array `begin`,
we can calculate this array even more simply.
The process is as follows: we renumber the vertices of the tree in BFS order,
and create array `deg` which contains the degrees of the vertices in this order.
From array `deg` we retrieve the new `begin` using the standard method.

We start from our representation before any tree compression, thus
from array `Adj` of size $2n$ (still containing the parents) and the array of indices, which
we call `oldbegin` for clarity.
We will use
an array `q` for the queue (which will be reused for storing the degree of each
vertex popped from the queue) and a Boolean array `vis` marking visited vertices
(so that we do not walk back to parents during the BFS traversal):
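A sketch; a cell of `q` is safe to reuse the moment its vertex is popped:

```cpp
#define REP(i, n) for (int i = 0; i < (n); ++i)

const int N = 500000;   // assumed bound on n
int n, Adj[2 * N], oldbegin[N + 1], deg[N + 1], q[N];
bool vis[N];
int* const begin = deg;

void compress(int root) {
  int qb = 0, qe = 0;
  q[qe++] = root; vis[root] = true;
  while (qb < qe) {
    const int v = q[qb++];
    int children = 0;
    for (int i = oldbegin[v]; i < oldbegin[v + 1]; ++i) {
      const int u = Adj[i];
      if (!vis[u]) { vis[u] = true; q[qe++] = u; ++children; }
    }
    q[qb - 1] = children;  // the popped cell now stores v's child count
  }
  // q[i] is the degree of the i-th vertex in BFS order; build begin
  begin[0] = 0;
  REP(i, n) begin[i + 1] = begin[i] + q[i];
}
```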

At the end we only need array `begin`.
Thus the cost of the first phase is \((1+4)n\) ints.

So with the compression of the tree, the whole problem can be solved in just \(5n\) ints, a little over one-third of the original \(14n\). We won't gain much by improving the first phase as long as the DP arrays need \(4n\) ints, so let's go back to the DP calculations performed by our solution, to see if we can improve them.

## Improving DP calculations

First we keep the promise of showing how to calculate the value of $ct[v]$ without division. Then we try to reduce the memory needed by the DP calculations.

### Finishing finding the number of maximum-size matchings

Recall that in the formula for $ct[v]$ we used \(cn[v] / c[u_i]\) to calculate the product of values $c[\cdot]$ over all children except $u_i$:

\(c[u_0] \cdot c[u_1] \cdot \ldots \cdot c[u_{i-1}] \cdot c[u_{i+1}] \cdot c[u_{i+2}] \cdot \ldots \cdot c[u_{m-1}] \)

This approach worked fine with sums (because we could use subtraction), but with products and no ability to divide it does not work. However, we can solve it by calculating prefix and suffix products as follows:

\(cnl[i] = c[u_0] \cdot c[u_1] \cdot \ldots \cdot c[u_{i-1}] \)

\(cnr[i] = c[u_{i+1}] \cdot c[u_{i+2}] \cdot \ldots \cdot c[u_{m-1}] \)

And then use the equality \(cn[v] / c[u_i] = cnl[i] \cdot cnr[i]\).
The arrays `cnl` and `cnr` are calculated as follows:

\(cnl[0] = 1;\quad cnl[i] = cnl[i-1] \cdot c[u_{i-1}]\)

\(cnr[m-1] = 1;\quad cnr[i] = c[u_{i+1}] \cdot cnr[i+1]\)

So we are ready to present the full C++ code of the DP calculations.
The first loop calculates arrays `dn` and `dt` in the same manner
as before (only using array `begin` rather than vectors).
The second loop calculates the helper arrays `cnl` and `cnr`.
Finally, the third loop calculates arrays `cn` and `ct`.
For clarity, we do not include taking values modulo $M$ in the code.
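A sketch (as stated, modular reductions are omitted for clarity; a real solution would reduce every product modulo $M$):

```cpp
#include <algorithm>
#define REP(i, n) for (int i = 0; i < (n); ++i)

const int N = 500000;  // assumed bound on n
int n, deg[N + 1];
int* const begin = deg;
int dn[N], dt[N], cn[N], ct[N];
int cnl[N + 1], cnr[N];  // helper arrays: prefix and suffix products

// c[u]: the number of maximum-size matchings in u's subtree
int c(int u) { return ct[u] + (dn[u] == dt[u] ? cn[u] : 0); }

void calculate_dp() {
  // first loop: dn and dt (the child at position i is vertex i+1)
  for (int v = n - 1; v >= 0; --v) {
    dn[v] = 0;
    int best = 0;
    for (int i = begin[v]; i < begin[v + 1]; ++i) {
      dn[v] += dt[i + 1];
      best = std::max(best, 1 + dn[i + 1] - dt[i + 1]);
    }
    dt[v] = dn[v] + best;
  }
  for (int v = n - 1; v >= 0; --v) {
    const int m = begin[v + 1] - begin[v];
    // second loop: helper arrays cnl and cnr over the children of v
    cnl[0] = 1;
    REP(i, m) cnl[i + 1] = cnl[i] * c(begin[v] + i + 1);
    if (m > 0) {
      cnr[m - 1] = 1;
      for (int i = m - 2; i >= 0; --i)
        cnr[i] = c(begin[v] + i + 2) * cnr[i + 1];
    }
    // third loop: cn and ct
    cn[v] = cnl[m];
    ct[v] = 0;
    REP(i, m) {
      const int u = begin[v] + i + 1;
      if (dt[v] == dn[v] + 1 + dn[u] - dt[u])
        ct[v] += cnl[i] * cnr[i] * cn[u];
    }
  }
}
```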

Unfortunately, we need additional memory to store arrays `cnl` and `cnr`.
These are arrays of size \(m\), since they are only used during the calculations for one vertex,
but the maximal \(m\) can be as large as \(n-1\).
Therefore we need an additional \(2n\) ints, which is not good.

So let's take a closer look at the formula for \(ct[v]\). It has a form:

\( \sum_{0 \leq i < m} a_{0} \cdot a_{1} \cdot \ldots \cdot a_{i-1} \cdot b_{i} \cdot a_{i+1} \cdot a_{i+2} \cdot \ldots \cdot a_{m-1} \)

where \(a_i = c[u_i]\) and \(b_i = cn[u_i] \cdot [ dt[v] = dn[v] + 1 + dn[u_i] - dt[u_i] ] \). Let's write the $m$ products of this sum in $m$ subsequent rows, forming an $m \times m$ matrix. The following image shows it for $m=5$:

We can calculate this sum by evaluating sums of subsequent submatrices of size $i \times i$, as shown in the image. The value $S_i$ will denote the sum of the $i \times i$ submatrix, and the helper value $A_i$ will denote the partial product $a_0 \cdot a_1 \cdot \ldots \cdot a_{i-1}$. We start with \(S_0 = 0\) and \(A_0 = 1\), and use the formulas:

\( S_i = S_{i-1} \cdot a_{i-1} + A_{i-1} \cdot b_{i-1} \)

\( A_i = A_{i-1} \cdot a_{i-1} \)

The answer is \(ct[v] = S_m\). We can also observe that \(cn[v] = A_m\). The code following this method is much simpler than the previous one and, what is most important, it uses only constant additional memory:
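A sketch (the dn/dt step is repeated for completeness; modular reductions are again omitted):

```cpp
#include <algorithm>

const int N = 500000;   // assumed bound on n
int n, deg[N + 1];
int* const begin = deg;
int dn[N], dt[N], cn[N], ct[N];

void compute_sizes() {  // first step: dn and dt, as before
  for (int v = n - 1; v >= 0; --v) {
    dn[v] = 0;
    int best = 0;
    for (int i = begin[v]; i < begin[v + 1]; ++i) {
      dn[v] += dt[i + 1];
      best = std::max(best, 1 + dn[i + 1] - dt[i + 1]);
    }
    dt[v] = dn[v] + best;
  }
}

void count_matchings() {
  for (int v = n - 1; v >= 0; --v) {
    int S = 0, A = 1;                    // S_0 = 0, A_0 = 1
    for (int i = begin[v]; i < begin[v + 1]; ++i) {
      const int u = i + 1;               // the next child of v
      const int a = ct[u] + (dn[u] == dt[u] ? cn[u] : 0);             // a = c[u]
      const int b = (dt[v] == dn[v] + 1 + dn[u] - dt[u]) ? cn[u] : 0; // b
      S = S * a + A * b;
      A = A * a;
    }
    ct[v] = S;                           // ct[v] = S_m
    cn[v] = A;                           // cn[v] = A_m
  }
}
```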

### Compressing memory usage

Now let's take a closer look at the computations finding the maximum size. We showed that \(dt[v] \geq dn[v]\), that is, a maximum-size matching which uses vertex \(v\) is no smaller than a maximum-size matching which does not use this vertex. But how much bigger can it be? Let's take a matching which uses vertex \(v\) and remove the edge matching \(v\). We get a matching one edge smaller which does not use \(v\), therefore \(dn[v] \geq dt[v] - 1\). Thus these two numbers can differ by at most $1$, so let's introduce the notation:

\( dt[v] = dn[v] + z[v] \quad \) where \(\quad z[v] \in \{ 0, 1 \}\)

This way we no longer have to store \(dt[v]\); we just use \(dn[v]\) and \(z[v]\). Let's rewrite the recurrences using this new notation. First of all, we can simplify the Boolean expressions used in the formulas:

\([dn[v] = dt[v]] = [z[v] = 0] = 1 - z[v]\)

\([dt[v] = dn[v] + 1 + dn[u_i] - dt[u_i]] = [z[v] = 1-z[u_i]] = z[v] \ \textrm{xor}\ z[u_i]\)

Then we can rewrite all formulas:

\( dn[v] = \sum_i (dn[u_i] + z[u_i]) \)

\( z[v] = \max_i (1 - z[u_i]) \)

\( c[v] = ct[v] + cn[v] \cdot (1 - z[v]) \)

\( cn[v] = \prod_i c[u_i] \)

\( ct[v] = \sum_i (cn[v] / c[u_i] \cdot cn[u_i]) \cdot (z[v] \ \textrm{xor}\ z[u_i]) \)

We have two observations here. First of all, we replaced array `dt`
with array `z`. It looks like we didn't gain much, but note that
array `dt` was an int array (it used \(n\) ints), while `z` is a bit array
(it needs only \(n\) *bits*).
A bit array uses less memory, since it can be packed:
if an int has \(B\) bits, the array needs only \( \lceil n/B \rceil \) ints.

But actually we can pack it so that it won't use any additional memory at all. Recall that we assumed that an int variable can store the number \(2n\), whereas variable \(dn[v]\) stores numbers no larger than \(\frac{n}2\). Therefore we can use the unused most significant bit of \(dn[v]\) to store bit \(z[v]\). Thus both arrays needed to calculate the size of a maximum-size matching use only \(n\) ints of memory. This reduces the memory consumption of the DP arrays to \(3n\) ints.

What about arrays `cn` and `ct` for calculating the numbers of maximum-size matchings? Of course, we need them both. But the interesting thing is that for calculating them we don't need array `dn` at all, only bit array `z`. Thus the memory used by `dn` after the first step can be reused (for instance for storing one of `cn`, `ct`). But then again, we need to store array `z` somewhere else. We cannot pack it with `cn` or `ct`, because these arrays store numbers modulo \(M\), which possibly use all of the bits in their cells. But don't forget that we have one more array in use: array `begin`, in which we store the tree. It stores numbers up to \(n-1\), so each cell of this array has an unused bit. Therefore we can pack array `z` there.

To recap: first we calculate array `dn` and bit array `z`, the latter packed together with array `begin`. Next, we calculate `cn` and `ct`, using bit array `z` stored in `begin` and reusing the memory of the no-longer-needed array `dn`. Therefore the total memory consumption of the DP arrays decreases to only \(2n\) ints.

The following C++ code uses helper macros to quickly get and set bits in array `begin`.
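They could look like this (a sketch; the helper names and the choice of the top bit of an `unsigned` cell are assumptions):

```cpp
// z[v] lives in the most significant bit of begin[v]; the tree data itself
// (numbers up to n-1) fits in the remaining bits of the unsigned cell.
const unsigned ZBIT = 1u << 31;

unsigned begin[100];  // the `begin` array with packed z bits (size is illustrative)

#define GETZ(v)    ((begin[v] & ZBIT) != 0)
#define SETZ(v, x) (begin[v] = (begin[v] & ~ZBIT) | ((x) ? ZBIT : 0u))
#define BEG(v)     (begin[v] & ~ZBIT)  // the plain begin value, z bit masked out
```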

The cost of the first phase is $(1+4)n$ ints, but we have managed to squeeze the memory used by the second phase to only $2n$ ints. Once again the memory bottleneck of the algorithm is the first phase, so let's see if we can reduce it, and with it the total memory consumption.

## Better tree reading

So we know that we only need \(n\) ints of array to store the tree and be able to traverse it in order to perform the DP computations. In this light, the additional \(4n\) ints needed to create this array are way too much. Let's take a completely different approach to decrease it.

### Parent representation

The problem is that the representation of a tree as an edge list is not very convenient. It hides the structure of the tree, and if we want to do any traversal over the tree, we must convert it to some other representation. And since it is usually the first thing we do after reading such a tree, in some programming problems that deal with trees, other, more convenient representations are used in the input.

One such representation is a list of parents. In fact, it comes in two flavors, which we call here strict and relaxed. For the strict variant, we number each vertex with a number from \(0\) to \(n-1\). Next, we list \(n-1\) edges, but the \(i\)-th edge (one-based) must connect vertex \(i\) to some other vertex $p[i]$ of smaller number ($p[i] < i$). That means that by considering the edges one by one, we add new vertices to one connected component. Of course, the obvious advantage of this representation is that it only uses \(n\) ints to store the tree.

But this representation has another important property: for every vertex $v$ except $0$ there is some edge incident to the vertex which leads to a vertex $p[v]$ of a smaller number. Therefore, starting from any vertex $v$ and using such edges, we eventually end up in vertex $0$. That shows that if we root the tree in vertex $0$, then for each non-root vertex \(v\) the vertex \(p[v]\) is the parent of \(v\). It also shows that the number of every child is greater than the number of its parent.

For the relaxed variant, we also number the vertices from $0$ to $n-1$, but now we input $n$ numbers $p[i]$ for $0\leq i < n$. One vertex $i$ is special and we mark it with $p[i]=-1$; for all other vertices we assume that there is an edge between vertex $i$ and $p[i]$.

Let's denote by $v_\star$ the special vertex for which $p[v_\star] = -1$. If we do the same trick as before: from any vertex $v$ we start moving using edges directed from $i$ to $p[i]$, we finally must arrive at vertex $v_\star$ (since we cannot have any cycle in a tree). That shows that for the relaxed variant, if we root the tree in vertex $v_\star$, then for each non-root vertex $v$ we have that vertex $p[v]$ is the parent of $v$.

So let's imagine that we got the parent representation (strict or relaxed) in array `p`. How can we calculate array `begin` from it? First, it is easy to calculate the indegree of each vertex (in array `deg`). Then we just fill array `Adj` in the standard way:
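A possible implementation of this conversion (a sketch for the relaxed variant, with the root marked by `p[v] = -1`; as described below, `begin` shares memory with `deg`):

```cpp
#include <vector>
using std::vector;

// Converts the parent representation p[0..n-1] (root marked with -1) into
// child adjacency lists: the children of v occupy Adj[begin[v]] ... up to
// (but excluding) Adj[begin[v+1]] (the last segment ends at index n-1).
void parents_to_adj(int n, const vector<int> &p,
                    vector<int> &Adj, vector<int> &begin) {
  vector<int> &deg = begin;            // begin shares memory with deg
  deg.assign(n, 0);
  for (int v = 0; v < n; ++v)
    if (p[v] != -1) deg[p[v]]++;       // indegree = number of children
  for (int v = 1; v < n; ++v)          // inclusive prefix sums: segment ends
    deg[v] += deg[v - 1];
  Adj.assign(n - 1, 0);
  for (int v = n - 1; v >= 0; --v)     // fill backwards, so each begin[v]
    if (p[v] != -1) Adj[--begin[p[v]]] = v;  // ends up as a starting position
}
```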

The above code, based on array `p`, calculates arrays `Adj` and `begin` (equal to `deg`). These two arrays store the full information about the tree using \(2n\) ints of memory. So if we had array `p`, we could perform the first phase in \((2+1)n\) ints, and the whole problem in just \(4n\) ints. (As a side note: performing the DP on arrays obtained from the strict variant of the parent representation is even easier, since the vertices are already in a good order.)

We could also use the technique of removing vertex numbers to calculate array `begin`, which fully represents the tree. We no longer need array `vis`, since `Adj` does not store parents. So the cost would be \((1+2)n\) ints, which results in only \(3n\) ints for the whole problem!

### Obtaining parent representation using xoring

And here is where the magic begins. It turns out that we do not need much memory to calculate the parent representation of a tree from its edge list representation, but the idea is quite ingenious. The code is as follows:
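A sketch of this idea (function and array names are assumptions; it assumes \(n \geq 2\), and the next paragraphs explain why it works):

```cpp
#include <utility>
#include <vector>
using std::vector;

// From the list of edges, computes hash[v], which ends up equal to p[v] for
// every vertex except the root; returns the root. Arrays deg, hash and the
// queue q take 3n ints, but only hash (== p) is needed at the end.
int parents_by_xor(int n, const vector<std::pair<int,int>> &edges,
                   vector<int> &hash) {
  vector<int> deg(n, 0), q;
  hash.assign(n, 0);
  for (auto e : edges) {                // one pass over the edge list
    deg[e.first]++;   deg[e.second]++;
    hash[e.first] ^= e.second;
    hash[e.second] ^= e.first;
  }
  for (int v = 0; v < n; ++v)
    if (deg[v] == 1) q.push_back(v);    // all initial leaves
  for (int qb = 0; qb < n - 1; ++qb) {  // remove n-1 vertices, leaves first
    int v = q[qb], p = hash[v];         // p = the only remaining neighbour
    if (--deg[p] == 1) q.push_back(p);  // p became a leaf of the smaller tree
    hash[p] ^= v;                       // "un-xor" the removed vertex
  }
  return q[n - 1];                      // the only vertex never removed
}
```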

First we just iterate over all edges without storing them, and for each vertex \(v\) we calculate its degree \(deg[v]\) and also \(hash[v]\), which stands for the bitwise xor of the numbers of all vertices adjacent to $v$.

Next we will build the tree from the leaves up to some root. Observe that for any node \(v\) of degree \(1\), we know that it has exactly one adjacent node, so we know that the number of this adjacent node is \(p = hash[v]\). We therefore assume that the parent of vertex \(v\) is \(p\), and then we remove vertex \(v\) from the tree. The removal boils down to decreasing degree of vertex \(p\) and un-xoring \(v\) from \(hash[p]\) which results in a smaller tree which we attack recursively.

Thus we maintain a queue `q` of leaves of the current tree.
At first we put there all vertices with degree $1$.
Then we iteratively remove vertex $v$ from the queue, assign its parent $p$,
and update $deg[p]$ and $hash[p]$.
Any time the degree of such a vertex \(p\) decreases to \(1\), we push it onto the queue.

After removing $n-1$ vertices, we end up with the last vertex on the queue, which becomes the root of the tree, and any other vertex has correctly calculated parent \(p[v] = hash[v]\).

We used \(3n\) ints for arrays `deg`, `hash` and `q`, but only array `p` (equal to `hash`) is needed at the end. Using the earlier code, we calculate `begin` from it, so the first phase costs \((1+2)n\) ints. Therefore the whole problem can be solved in \(3n\) ints; slightly more than one-fifth of the original.

### First-child-next-sibling representation

It turns out that we can easily improve the code that calculates array `p`. Note that we use queue `q` to store vertices of degree \(1\), so that we know which of them to process later. But is it really necessary? Instead of putting them into the queue in the first loop, we could just process them right away. Moreover, if during the processing of some vertex its parent becomes a vertex of degree \(1\), we can also process it immediately (which could result in another such parent). The following code does not use the queue at all:
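A sketch of this queueless variant (names are assumptions; `deg` and `hash` are filled from the edge list as before, and \(n \geq 2\) is assumed):

```cpp
#include <vector>
using std::vector;

// Each removed vertex v is marked with deg[v] = -1 and gets its parent in
// hash[v]; the function returns the single surviving vertex, the root.
int parents_no_queue(int n, vector<int> &deg, vector<int> &hash) {
  for (int i = 0; i < n; ++i)
    for (int v = i; deg[v] == 1; ) {
      deg[v] = -1;            // remove the leaf v from the tree
      int p = hash[v];        // its only remaining neighbour
      deg[p]--;               // if p just became a leaf, the inner loop
      hash[p] ^= v;           // continues processing it immediately
      v = p;
    }
  for (int v = 0; v < n; ++v)
    if (deg[v] != -1) return v;
  return -1;                  // unreachable on a valid tree
}
```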

Every removed vertex $v$ is marked by setting $deg[v] = -1$. The last considered vertex becomes the root.

By removing array `q` we reduced the memory needed to calculate array `p` to only \((1+1)n\) ints. So maybe we could also reduce the memory needed to calculate array `begin` from array `p`?

To do this we introduce one more representation of a tree. It is called *first-child-next-sibling*, and for each vertex $v$ we store two numbers: the number of its first child $chi[v]$ (equal to \(-1\) if the vertex has no children) and the number of its next sibling $sib[v]$ (the number of the child after this vertex in its parent's child list, or \(-1\) if it is the last vertex on this list). It's fairly easy to calculate this representation from the array of parents:
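For instance (a sketch; the root is marked with `p[root] = -1`):

```cpp
#include <vector>
using std::vector;

// Builds the first-child / next-sibling representation from the parents.
// Iterating from the highest number down prepends, so every child list
// ends up in increasing order of vertex numbers.
void build_chi_sib(int n, const vector<int> &p,
                   vector<int> &chi, vector<int> &sib) {
  chi.assign(n, -1);
  sib.assign(n, -1);
  for (int v = n - 1; v >= 0; --v)
    if (p[v] != -1) {
      sib[v] = chi[p[v]];   // the old first child becomes v's next sibling
      chi[p[v]] = v;        // v becomes the first child of its parent
    }
}
```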

Each time we consider a new vertex \(v\), we assign it as the first child of its parent $p[v]$, and if any vertex was this child before, it is assigned as the next sibling of \(v\). Of course, we want to keep all computations within \(2n\) ints, and this can be done by observing that we can reuse memory between arrays `p` and `sib`, since after setting $sib[i]$ we no longer need access to $p[i]$. Thus the beginning of the above code could look like this:
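A sketch of that modified beginning (the cell of `p[v]` is read just before it is overwritten with `sib[v]`; conveniently, the root's cell already holds \(-1\), which is exactly its `sib` value):

```cpp
#include <vector>
using std::vector;

// The same loop as before, but sib shares memory with p.
void build_chi_sib_inplace(int n, vector<int> &p /* becomes sib */,
                           vector<int> &chi) {
  chi.assign(n, -1);
  for (int v = n - 1; v >= 0; --v)
    if (p[v] != -1) {
      int par = p[v];
      p[v] = chi[par];      // sib[v], stored in the cell of p[v]
      chi[par] = v;
    }
}
```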

The representation using `chi` and `sib` can easily be used to perform a BFS traversal. We start by putting the root in the queue `q`, and every time we visit a vertex $v$, we can iterate over its children by starting from $w = chi[v]$ and then performing $w = sib[w]$ until we hit $-1$. Therefore for each visited vertex we can also calculate its outdegree. The trick is to store this number in the memory shared with queue `q`. Finally, because these vertices are in BFS order, we get the removal of their numbers for free:
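A sketch of this traversal (names are assumptions):

```cpp
#include <vector>
using std::vector;

// BFS over the chi/sib representation. Every queue cell is reused: once
// vertex v = q[qb] is processed, q[qb] is overwritten with its outdegree.
// Because the queue is in BFS order, the vertex numbers vanish for free,
// and after the final prefix sums q serves as the `begin` array: children
// of the i-th vertex in BFS order are the vertices q[i], ..., q[i+1]-1.
void bfs_degrees(int n, int root, const vector<int> &chi,
                 const vector<int> &sib, vector<int> &q) {
  q.assign(n, 0);
  q[0] = root;
  int qe = 1;
  for (int qb = 0; qb < n; ++qb) {
    int v = q[qb], m = 0;
    for (int w = chi[v]; w != -1; w = sib[w]) { q[qe++] = w; ++m; }
    q[qb] = m;                          // outdegrees, in BFS order
  }
  for (int i = n - 1; i > 0; --i) q[i] = q[i - 1];  // shift right...
  q[0] = 1;
  for (int i = 1; i < n; ++i) q[i] += q[i - 1];     // ...then prefix sums
}
```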

The last loop performs the calculation of prefix sums in place, starting from $1$ (thus we first need to shift the array one position to the right). At the end we use only array `q` as array `begin`. But we used \(3n\) ints in total (for arrays `chi`, `sib` and `q`), so it does not look like an improvement.

### Even more compression with first-child-next-sibling representation

The trick is to squeeze queue `q` into arrays `chi` and `sib`. Since the children of each vertex are put into the queue sequentially, array `sib` already contains these parts of the queue. We only have to concatenate them together, using the last vertex of each child list (those have `sib` equal to \(-1\)):
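A sketch of this loop (names follow the article; the details are explained in the next paragraphs):

```cpp
#include <vector>
using std::vector;

// BFS without an explicit queue: sib links of last children are rewired so
// that following sib from the root enumerates all vertices in BFS order.
// chi is overwritten with the outdegrees (array ndeg, in original order).
void bfs_in_sib(int n, int root, vector<int> &chi /* becomes ndeg */,
                vector<int> &sib) {
  int v = root, ve = root;  // ve: vertex directly before chi[v] in BFS order
  for (int step = 0; step < n; ++step) {
    int m = 0;
    if (chi[v] != -1) {
      sib[ve] = chi[v];     // concatenate v's child list after ve
      for (int w = chi[v]; w != -1; w = sib[w]) { ++m; ve = w; }
    }
    int next = sib[v];      // the next vertex in BFS order (read after the
    chi[v] = m;             // rewiring above, which matters when ve == v)
    v = next;
  }
}
```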

During the whole loop, the variable $v$ iterates over all vertices of the tree in BFS order. Since it uses array `sib` for this iteration, for every vertex $v$ which is the last child on its parent's list we must set $sib[v]$ to the next vertex in this order. Therefore we maintain a variable $ve$ which stores the number of the vertex directly before $chi[v]$ in the BFS order (or, if $v$ has no children, directly before the position in which its first child would appear if it had one).

For subsequent vertices $v$ we proceed as follows: if $v$ has children, we put its first child as the next vertex after $ve$. Next, we iterate over all children of $v$, counting them and updating $ve$ to the number of the last child. Finally, we store the outdegree of $v$ in array `ndeg`, shared with `chi`, since we won't use $chi[v]$ anymore.

At the end we have array `sib`, which serves as the queue, and array `ndeg` (shared with `chi`), which stores the degrees of the nodes (but in the original order). The only thing left is to sort array `ndeg` according to the order stored in array `sib`:
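A sketch of these three steps (names follow the article):

```cpp
#include <utility>
#include <vector>
using std::vector;

// Step 1: replace the implicit linked list in sib by explicit BFS indices.
// Step 2: permute ndeg into BFS order; every swap puts one element in place.
// Step 3: prefix sums starting from 1, in place and in a single loop.
// Afterwards ndeg serves as the `begin` array.
void finish_begin(int n, int root, vector<int> &ndeg, vector<int> &sib) {
  for (int v = root, i = 0; v != -1; ++i) {
    int nv = sib[v];
    sib[v] = i;                        // v is the i-th vertex in BFS order
    v = nv;
  }
  for (int i = 0; i < n; ++i)
    while (sib[i] != i) {              // ndeg[i] belongs at position sib[i]
      int j = sib[i];
      std::swap(ndeg[i], ndeg[j]);
      std::swap(sib[i], sib[j]);
    }
  for (int i = 0, s = 1; i < n; ++i) {
    int d = ndeg[i];
    ndeg[i] = s;                       // begin[i] = 1 + deg[0] + ... + deg[i-1]
    s += d;
  }
}
```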

Note that the order in array `sib` is given implicitly,
as a linked list: the first element is `root`, then `sib[root]`, next
`sib[sib[root]]` etc.
The first loop iterates over this list and for every element $v$
stores in $sib[v]$ the value $i$ which denotes that element $v$
is the $i$-th element in the list.

The second loop reorders array `ndeg` using order stored in array `sib`.
We iterate over all elements $i$ and for each element which is out of place
(i.e. $sib[i] \neq i$) we move this element to its correct position (that is $sib[i]$)
and at the same time we move the element which previously was in this position to position $i$.
Thus in each iteration one more element is located correctly.

Finally, the last loop performs calculation of prefix sums, starting from $1$, but it does it a little differently than the implementation in the previous section, using only one loop instead of two.

The above clever code performs the first phase using only \((1+1)n\) ints of memory! This is a real achievement, because it allows us to read a tree in edge list representation into a form suitable for a DP traversal in \((1+1)n\) ints. So for a problem with one DP array of size \(n\) this results in an algorithm using only \(2n\) ints in total. In general, for a problem with \(k\) DP arrays we get \((1+\max(1,k))n\) ints. However, in our problem the DP arrays take \(2n\) ints, so we get a \(3n\)-int algorithm, which we already had before.

## Even more improvements in DP calculation

It looks like we have really squeezed the memory in this problem. We store the whole tree in one array of \(n\) ints. The other two arrays are used for the DP calculations. In fact, for calculating the size of the matching we use only one additional array `dn`. However, for calculating the number of matchings we need two arrays: both values `cn` and `ct` are numbers resulting from rather complicated computations modulo \(M\), so it is very unlikely that they have some general properties which could be discovered and exploited in a way that would let us remove one of them. So it looks like we have hit the limit. Or have we?

### Branching and non-branching vertices

But let's not give up yet. We already did some *magic* with xoring, so why not try to do something *impossible*?

If we cannot remove one of the DP arrays, maybe we can make them smaller? Do we really have to store the data for all vertices? For instance, we really don't have to do it for leaves. The DP values for them are constants which do not depend on other vertices, so we can easily recreate them when needed. When a parent vertex calculates its DP and one of its children is a leaf, it does not need to look into the DP table; it can just use the appropriate constants.

That is an idea that could save us some memory on trees with many leaves. Unfortunately, not all trees are like that: we could have a tree which is a long path with only one leaf, and then this optimization would gain us nothing. But note that such a tree is also special and easy to calculate DP on, because it consists of a long path of vertices of degree $1$ (meaning here vertices with exactly one child). The DP values for the top-most vertex of this path can also be calculated in constant memory, as long as we know the length of the path, since the DP in each vertex depends only on the DP value in the single vertex below (we can think of it as a variant of tail-call optimization). This is true in the general case: in every tree we can treat each path of vertices of degree $1$ specially, and we do not need to store DP values for the internal vertices of these paths.

So we only need to store DP values for *branching vertices*, namely those which have at least two children. And in any tree there are at most \(\frac{n}{2}\) branching vertices! This follows from two simple facts: the total number of branching vertices and leaves is bounded by \(n\), and there are more leaves than branching vertices.

Therefore we only need to store \(\frac{n}{2}\) values in each DP array. There is one problem though: branching vertices do not necessarily form a contiguous part of the arrays, so we need some renumbering mechanism if we want to save memory. To achieve this, we introduce an array $pos$, where in \(pos[v]\) we store the position of branching vertex \(v\) in the compressed DP array, i.e. \(pos[v]\) is the number of branching vertices with numbers smaller than \(v\):
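A sketch of this computation (for readability the outdegrees are passed in a plain array here; in the article's setting they would be read from `begin`):

```cpp
#include <vector>
using std::vector;

// pos[v] = number of branching vertices with a number smaller than v.
void compute_pos(const vector<int> &outdeg, vector<int> &pos) {
  int n = (int)outdeg.size(), cnt = 0;
  pos.assign(n, 0);
  for (int v = 0; v < n; ++v) {
    pos[v] = cnt;
    if (outdeg[v] >= 2) ++cnt;   // v itself is branching
  }
}
```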

Since the values stored so far in arrays `cn` and `ct`
will have to be calculated on-the-fly for non-branching vertices,
it will be handy to define a structure containing all DP values for
a single vertex:

The function which returns the constant values for a leaf is very simple:
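A sketch of both the structure and the leaf constants (the names `Val` and `leafVal` are assumptions):

```cpp
// DP values of a single vertex: counts (modulo M) of maximum-size matchings
// in its subtree that do not use the vertex (cn) and that do use it (ct).
struct Val {
  long long cn, ct;
};

// A leaf has exactly one maximum-size matching: the empty one of size 0,
// which does not use the leaf; no matching within its subtree can use it.
Val leafVal() { return {1, 0}; }
```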

Recall that the DP calculation for a single vertex $v$ with children $u_0, \ldots, u_{m-1}$ involves initializing the values `cn` and `ct` and iterating over subsequent vertices $u_i$. After the $i$-th iteration, the current values `cn` and `ct` contain the DP values for vertex $v$, assuming it has only children $u_0, \ldots, u_i$. Therefore we can write a function `join(a, b, zv, zu)` in which argument `a` denotes the DP values of a tree consisting of vertex $v$ and children $u_0, \ldots, u_{i-1}$, argument `b` denotes the DP values for child $u_i$, and the function returns the DP values for a tree consisting of vertex $v$ and children $u_0, \ldots, u_i$. Since the DP values depend on the values `z` stored in vertices $v$ and $u_i$, we also provide them as arguments:
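A sketch of this function, directly implementing one step of the recurrences above (the DP-values structure is repeated so the fragment stands alone; `Val` and `M` are assumptions):

```cpp
const long long M = 1000000007;    // the modulus (assumption)

struct Val { long long cn, ct; };  // the DP-values structure described above

// a: DP values of v with children u_0..u_{i-1}; b: DP values of child u_i;
// zv, zu: the bits z[v] and z[u_i]. One step of the recurrences:
//   cn' = cn * c[u],  ct' = ct * c[u] + cn * cn[u] * (zv xor zu),
// where c[u] = ct[u] + cn[u] * (1 - zu).
Val join(Val a, Val b, int zv, int zu) {
  long long cu = (b.ct + b.cn * (1 - zu)) % M;
  Val r;
  r.cn = a.cn * cu % M;
  r.ct = (a.ct * cu + a.cn * b.cn % M * (zv ^ zu)) % M;
  return r;
}
```

The iteration over the children of $v$ then starts from the neutral value `{1, 0}` (vertex $v$ with no children yet).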

We are ready to create a helper function which retrieves the DP values for vertices we have already visited. In other words, if $v$ is a branching vertex, we must query the appropriate cell in the compressed DP array; otherwise, we must recalculate the DP values on-the-fly. First, if $v$ is a vertex of degree $1$, the function calculates the length of the path to the first vertex $w$ of a different degree. Then, if $w$ is a leaf it uses the constants, and if $w$ is a branching vertex it queries the DP arrays. Finally, it redoes the DP calculation upwards along the path of vertices of degree $1$:
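A sketch of such a function, written against plain helper arrays instead of the packed `begin` representation (the names `outdeg`, `firstchild`, `z`, `pos`, `dp` and the small helpers are assumptions):

```cpp
#include <vector>
using std::vector;

const long long M = 1000000007;
struct Val { long long cn, ct; };
Val leafVal() { return {1, 0}; }
Val join(Val a, Val b, int zv, int zu) {     // the join described above
  long long cu = (b.ct + b.cn * (1 - zu)) % M;
  return { a.cn * cu % M, (a.ct * cu + a.cn * b.cn % M * (zv ^ zu)) % M };
}

// Plain-array stand-ins for the packed representation (assumptions):
vector<int> outdeg, firstchild, z, pos;
vector<Val> dp;                              // compressed DP array

// DP values for vertex v: walk down the path of degree-1 vertices below v,
// read the stored values at its bottom end (leaf constants or the dp cell
// of a branching vertex), then redo the DP back up the path. Going up,
// z flips at every step: Z(v) = 1 - Z(u) for a single child u.
Val calc(int v) {
  int len = 0, w = v;
  while (outdeg[w] == 1) { w = firstchild[w]; ++len; }
  Val r = (outdeg[w] == 0) ? leafVal() : dp[pos[w]];
  int zw = z[w];
  for (int k = 0; k < len; ++k) {
    int zu = zw;
    zw = 1 - zu;
    r = join({1, 0}, r, zw, zu);             // one single-child step upwards
  }
  return r;
}
```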

The only non-obvious part of this function is how we retrieve the values `z` when calculating the DP on the path of vertices of degree $1$. The problem is that we store the values `z` packed in array `begin`. When in the first loop we go down the path, calculating its length, we obtain the numbers of subsequent vertices $v$ on this path. We could obtain the values `z` for these vertices using `Z(v)`. However, in the last loop we go up the path, so we would need the sequence of these numbers in reverse order. We cannot afford to store this sequence, so we do not know the numbers of the vertices when we go up the path.

Fortunately, we can calculate the values of `z` on a path of degree-$1$ vertices directly from the definition (the recursive formula): when vertex $v$ has exactly one child $u$, then `Z(v) = 1 - Z(u)`.

Finally, we can present the main code calculating the values `cn` and `ct` for all vertices:
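A compact sketch of the whole second phase (again with plain arrays instead of the packed representation, and with the small helpers repeated so the fragment stands alone; all names are assumptions):

```cpp
#include <vector>
using std::vector;

const long long M = 1000000007;
struct Val { long long cn, ct; };
Val leafVal() { return {1, 0}; }
Val join(Val a, Val b, int zv, int zu) {
  long long cu = (b.ct + b.cn * (1 - zu)) % M;
  return { a.cn * cu % M, (a.ct * cu + a.cn * b.cn % M * (zv ^ zu)) % M };
}

vector<int> outdeg, z, pos;
vector<vector<int>> children;   // stand-in for the begin representation
vector<Val> dp;                 // DP cells for branching vertices only

Val calc(int v);                // the helper function described above

// Vertices are assumed to be in BFS numbering, so iterating in reverse
// processes every child before its parent. Only branching vertices get a
// cell in dp; all others are recomputed on the fly by calc().
void solve(int n) {
  for (int v = n - 1; v >= 0; --v) {
    if (outdeg[v] < 2) continue;          // not branching: nothing stored
    Val r = {1, 0};
    for (int u : children[v]) r = join(r, calc(u), z[v], z[u]);
    dp[pos[v]] = r;
  }
}

Val calc(int v) {
  int len = 0, w = v;
  while (outdeg[w] == 1) { w = children[w][0]; ++len; }
  Val r = (outdeg[w] == 0) ? leafVal() : dp[pos[w]];
  for (int zw = z[w], k = 0; k < len; ++k) {
    int zu = zw;
    zw = 1 - zu;
    r = join({1, 0}, r, zw, zu);
  }
  return r;
}
```

The final answer is then \(c[\textrm{root}] = ct + cn \cdot (1 - z[\textrm{root}])\), computed from `calc(root)`.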

Since the on-the-fly calculations in function `calc(i)` depend on the length of the path starting in vertex $i$, the time complexity may look a little suspicious. Fortunately, we call this function only on vertices $i$ which have a branching parent, so all the paths traversed by the algorithm are disjoint, and the time complexity of the algorithm is still $O(n)$.

The above algorithm uses an additional array `pos` of \(n\) ints, but thanks to it the DP arrays are half as big as before. Therefore for \(k\) DP arrays it uses \((2+k/2)n\) ints of memory. This is definitely an improvement for big values of \(k\), but in our problem it still results in an algorithm which uses \(3n\) ints. We really need something better.

### Destructively traversing the tree

One could argue that we do not have to store array `pos`, since each of its elements can be computed in constant memory from array `begin`. Unfortunately, this takes \(O(n)\) time, and since we query array `pos` in arbitrary order (when iterating over subsequent children which have paths underneath, we in fact query vertices at different heights), this would lead to quadratic time, which is unacceptable.

So if we really need to store `pos`, but we don't want to spend additional memory, we must once again reuse some already allocated memory. But apart from array `dp` storing the DP values, we only have array `begin`, which is essential for the tree traversal. Or is it?

Observe that we iterate over vertices in reverse BFS order, and after processing a branching vertex $v$ we will no longer use the values \(begin[u]\) for most vertices $u$ in the subtree of $v$. To be more precise: we *will* use $begin[v]$ and $begin[v+1]$ when later querying the number of children of $v$. Also, if $u_0, u_1, \ldots, u_{m-1}$ are the children of $v$, then we could use $begin[u_0]$ when querying the number of children of $u_0-1$ (especially if $v=u_0-1$). But we won't be querying the number of children of $u_0$ and $u_1$ (or iterating over them), thus we won't need the value $begin[u_1]$ of the second child of $v$ (note that $v$ does have a second child, since it is branching). We also won't be using the value `z` stored in this cell. Thus we *can* reuse this field to store $pos[v]$.

That requires only slight changes to the previous algorithm. First, we do not need the `pos` array anymore (nor its initialization code). Next, the line updating the value in the `dp` array should be replaced with:

And querying this array in function `calc` should be replaced with:

This is quite unbelievable, but this algorithm runs one DP traversal for $k$ DP arrays in only \((1 + \max(1, k/2))n\) ints of memory. Unfortunately, as a nasty side effect it destroys the traversed tree. Therefore it is not practical if we want to make more than one pass over the tree.

In our problem we need two passes, but we already discussed that the first pass uses only one DP array, so it can be done using non-destructive methods in \((1 + 1)n\) ints of memory. In the second pass we use the above trick, destructively calculating the number of matchings in the same memory.

And this is our final algorithm. It uses only $2n$ ints of memory to read the tree and perform non-trivial DP calculations on it. Compared with the at least $14n$ ints needed by the standard approach, it's a reduction to one-seventh of the original. I told you that it would be a lot.

## Conclusions

I learned a lot when writing this article, and I hope that the readers will also find it useful. Not only did the problem of dynamic programming on trees using as little memory as possible give us an excuse to talk about important implementation details of memory management (in the context of vectors, malloc and recursion), various representations of trees in memory, and ways to compress DP arrays. It was also a perfect quest for ingenious ideas in computing, once called by Michael Abrash these astonishing “right-brain optimizations”.