It doesn’t add up

Draft of 2017.05.24

May include: combinatorics, puzzle, math, recreation, &c.

A few disparate-feeling things conspire to make me write this today.

This last weekend (18–20 May, 2017) was the fifteenth annual Genetic Programming Theory and Practice workshop in Ann Arbor. I spent a lot of time admonishing my colleagues in the field (which is a fancy-ass way of saying “wishing I could make time for myself”) to extend the reach of genetic programming away from the traditional numerical symbolic regression problems that have been so handily claimed by Deep Learning models, and towards software synthesis and algorithmic discovery. GP is good at finding algorithms and rules, and it is (in some sense) better at explaining them than neural networks tend to be. After all, if the code one evolves is human-readable, then it’s a bit more likely to be human-understandable.

I’ve been gearing up for my promised series of Computer Etudes (or “puzzles” or “recreations” or something, but to be honest I think “Etudes” is best… unless we end up at “Experiments”). These are similar to the sketches I’ve been posting for a few years now, but organized and stuff. The inspirations are people like Martin Gardner and Ian Stewart and A. K. Dewdney and so forth: open questions with some leading examples and “isn’t that weird?” leading questions along the way, on topics in “complexology” or emergent systems or machine learning. Stuff slightly more towards the “isn’t that weird” part of the computational/mathematical spectrum, and not so much “practice on your professional skills” stuff you see in many “kata” and “dojo” practices for software developers.

Puzzles with no answers. Puzzling things, more like.

Then there’s the fact that I hit the interlibrary loan system a bit heavily some nights back. I was reminded of a book I had checked out several months ago, and which I remembered with interest but hadn’t had enough time to read in detail: Old and New Unsolved Problems in Plane Geometry and Number Theory. In this last round I ordered copies of a few other books like it: Unsolved Problems in Geometry and Erdős on Graphs: His Legacy of Unsolved Problems and Unsolved Problems in Number Theory and so forth. Quite a few, as it happens, since there are a lot of arXiv preprints I run across in combinatorics and number theory and enumeration and graph theory that seem to revolve around these sometimes-little, sometimes-hugely-important “open problems” and their penumbra of variations.

And I was just poking around with my own GP system the other day, wondering what sorts of problems are most amenable to evolutionary search and discovery. So this is in large part a self-serving “puzzle”. In posing GP problems and benchmarks and “puzzles”, I find it a good rule of thumb that a reasonably astute programmer should be able to make some progress on them, and can maybe even write a brute-force thing that will eventually get the “answer”… but only in a frustrating way.

So let’s talk about subset sums.

So here’s the thing

Suppose you have a set of \(N\) distinct integers. These numbers don’t need to be positive, and the set can include \(0\) if you like. But they should (for this part¹) be a proper set, with no duplicates.

Consider all nonempty subsets of those numbers. That is, the subsets that contain one number each, and all the subsets that contain two different numbers (from that set), and so on up to the \(N\) subsets that are missing one number each, and of course the whole set itself. There are \((2^{N}-1)\) of those, in case you’re wondering². Lots of subsets.

Now for each subset, add up all the elements. This produces a single integer sum for each subset.

Finally, count how often each distinct sum appears across all of those subsets.

For example, if the numbers are \(\{-1, 1, 3, 4\}\), then we have these subsets and sums:

  • \(\{-1\}\), with sum \(-1\)
  • \(\{1\}\), with sum \(1\)
  • \(\{3\}\), with sum \(3\)
  • \(\{4\}\), with sum \(4\)
  • \(\{-1,1\}\), with sum \(0\)
  • \(\{-1,3\}\), with sum \(2\)
  • \(\{-1,4\}\), with sum \(3\)
  • \(\{1,3\}\), with sum \(4\)
  • \(\{1,4\}\), with sum \(5\)
  • \(\{3,4\}\), with sum \(7\)
  • \(\{-1,1,3\}\), with sum \(3\)
  • \(\{-1,1,4\}\), with sum \(4\)
  • \(\{-1,3,4\}\), with sum \(6\)
  • \(\{1,3,4\}\), with sum \(8\)
  • \(\{-1,1,3,4\}\), with sum \(7\)

And when I count the frequencies of those sums, I get something like this:

 sum  count
 -1     1
  0     1
  1     1
  2     1
  3     3
  4     3
  5     1
  6     1
  7     2
  8     1

The most common subset sums are \(3\) and \(4\). Counting over all fifteen non-empty subsets, three add up to \(3\), another three add up to \(4\), and every other observed sum occurs less often. \(3\) and \(4\) are “tied winners” when it comes to how often they appear as subset sums for this set.
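
If you’d rather not tally those by hand, the whole procedure fits in a few lines of Ruby. Here is a minimal sketch (far tidier than the program I describe below, and the helper name subset_sum_counts is my own invention):

    # Count how often each sum occurs, over all nonempty subsets.
    def subset_sum_counts(numbers)
      counts = Hash.new(0)
      (1..numbers.size).each do |k|
        numbers.combination(k).each { |subset| counts[subset.sum] += 1 }
      end
      counts
    end

    p subset_sum_counts([-1, 1, 3, 4]).sort.to_h
    # => {-1=>1, 0=>1, 1=>1, 2=>1, 3=>3, 4=>3, 5=>1, 6=>1, 7=>2, 8=>1}

It is brute force, of course: it walks all \(2^{N}-1\) subsets, which is fine for small sets and hopeless for large ones.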

So what’s the problem?

A lot of the Number Theory and Combinatorics work I’ve encountered so far in this space is about the sums of pairs of numbers. Erdős, for example, posed several problems in which one is curious about whether any of the sums of subsets (often of a particular size) “hits” a given target value. There’s some kind of twisted relation tucked in there with the Goldbach conjecture, too, since that also has to do with sums and stuff. There is the 3SUM problem, in which one is asked whether a set of integers contains any triplet that sums to exactly zero, and the more general Subset Sum problem, where we want to determine whether there’s any subset that sums to zero (or to some other given target).

But while they may be interesting and informative, those aren’t the problem here.

Here, I’m asking about the frequency of sums of subsets. Specifically: can we tell, just by “looking at” a set of \(N\) integers, whether there will be one clear winner or not, among its sums?

By “clear winner” I mean: some single sum which appears with higher frequency than all the other sums.

In the example I gave above, \(\{-1, 1, 3, 4\}\), the sums \(3\) and \(4\) are tied for the “winning frequency”. That’s not the case for all sets of integers, it turns out. The set \(\{0, 1, 2, 3\}\) has a clear winner among the subset sums: the sum \(3\) occurs four times, while \(1\), \(2\), \(4\), \(5\), and \(6\) occur only twice each (and \(0\) only once). So \(\{0,1,2,3\}\) has a clear winner subset sum (\(3\)), but \(\{-1,1,3,4\}\) does not.
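
Checking that mechanically is just as short; this sketch builds on the counting helper above (again, the names are mine):

    # Return the clear winning sum, or nil when two or more sums
    # tie for the highest frequency.
    def clear_winner(numbers)
      counts = subset_sum_counts(numbers)
      top = counts.values.max
      winners = counts.select { |_sum, count| count == top }.keys
      winners.size == 1 ? winners.first : nil
    end

    p clear_winner([0, 1, 2, 3])   # => 3
    p clear_winner([-1, 1, 3, 4])  # => nil (3 and 4 tie)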

Some observations

The first thing I did when I started playing with these things was write a program. It was a particularly bad Ruby program. What you might call a “design sketch”, really. But it let me poke around and explore some sets of numbers and how their subsets sum up.

For example, to explore one aspect of the question I’m asking, you might look at sets of integers of the form \(\{0,1,2,\dots,n\}\). Among those, I find that the trivial set \(\{0\}\) has a clear winning sum, as does \(\{0,1\}\) (with the sum \(1\) winning), and also \(\{0,1,2,3\}\), and \(\{0,1,2,\dots,n\}\) when \(n=8, 11, 12, 15, 16, 19, 20, 23\). But not when \(n=2, 4, 5, 6, 7, 9, 10, 13, 14, 17, 18, 21, 22\). Above \(n=23\) I got bored, and my laptop was over-taxed because I didn’t bother to write an efficient (or lazy) algorithm.

Similarly, for sets of the form \(\{1,2,\dots,n\}\) (without the \(0\)), there are clear winning sums when \(n = 3, 8, 11, 12, 15, 16, 19, 20, 23\). Which is where I got burned (literally) by the brute-force algorithm I sketched, since my laptop CPU got really hot. Those are the same numbers one sees with the \(0\)-based sets of consecutive integers, which in hindsight may not be surprising.
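
Something like this loop reproduces the small end of that experiment, using the sketches above. Be warned that the work doubles with each increment of \(n\), which is exactly how laptops get cooked:

    # Which consecutive-integer sets {1,...,n} have a clear winner?
    (1..16).each do |n|
      winner = clear_winner((1..n).to_a)
      puts "n=#{n}: #{winner ? "clear winner #{winner}" : 'no clear winner'}"
    end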

This raises the question: is it the case that adding \(0\) never changes the result? There’s a tidy argument that it usually can’t: tossing \(0\) into a set pairs every nonempty subset \(S\) with \(S \cup \{0\}\), which has the same sum, so every frequency simply doubles, except that the new singleton \(\{0\}\) adds one more to the count of the sum \(0\). The relative order of the frequencies is therefore preserved everywhere except possibly at \(0\) itself. So: maybe, but I’m not convinced it’s universally true, especially in cases with negative entries where \(0\) may actually be (or be tied for) the most common sum.
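
That suspicion is easy to poke at empirically with the predicate sketched above. If the doubling argument is right, any “differs” lines should involve the sum \(0\) sitting in, or tied for, one of the top spots:

    # Does adding 0 to a set ever change the clear-winner verdict?
    100.times do
      set = ((-20..20).to_a - [0]).sample(7)
      without_zero = clear_winner(set)
      with_zero    = clear_winner(set + [0])
      puts "differs: #{set.inspect}" unless without_zero == with_zero
    end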

Then I started playing with small sets of numbers sampled from \([-m,m]\). For example, here are some sets with clear winning sums, for seven integers in the range \([-20,20]\):

  • \(\{-15, -1, 0, 5, 10, 12, 13\}\), winning sum \(12\)
  • \(\{-19, -17, -16, -12, -7, -1, 0\}\), winning sum \(-36\)
  • \(\{-16, -12, -5, -1, 4, 7, 11\}\), winning sum \(-6\)
  • \(\{-14, -13, 0, 3, 4, 16, 18\}\), winning sum \(7\)
  • \(\{-19, -17, -8, -6, 6, 11, 17\}\), winning sum \(-8\)
  • \(\{-20, -18, -12, 0, 6, 8, 12\}\), winning sum \(-12\)
  • \(\{-16, -10, -9, -3, 9, 10, 19\}\), winning sum \(0\)

For every set I found with a clear winning subset sum, there were many others sampled (though not an intolerable fraction) with no clear winning sum. So there’s another, more Number Theoryish question in there, I suppose:

For sets of \(N\) integers taken from the range \([-m,m]\), how does the fraction of sets with clear winning subset sums scale with respect to \(N\) and \(m\)?

I don’t know, but I poked around a bit with larger values of what I’m calling \(m\), and noticed that for \(m=1000\) there were way fewer sets with clear winning sums. Or at least that’s what it felt like.
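
One crude way to firm up that feeling would be a Monte Carlo probe like this sketch (the function name is mine, and there is nothing rigorous about it):

    # Estimate the fraction of random n-element sets drawn from
    # [-m, m] that have a clear winning subset sum.
    def clear_winner_fraction(n, m, trials = 200)
      hits = trials.times.count { clear_winner((-m..m).to_a.sample(n)) }
      hits.to_f / trials
    end

    p clear_winner_fraction(7, 20)
    p clear_winner_fraction(7, 1000)  # feels like it should be smaller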

There’s also something else worth noticing in the winning sums listed above. In some cases, those sums appear in the original set, but in other cases they don’t appear there at all. That feels salient somehow, and one wonders about this bit of Probability Theory, too:

For sets of \(N\) integers taken from the range \([-m,m]\), how does the fraction of sets whose clear winning subset sum does not itself appear in the original set scale with respect to \(N\) and \(m\)?

Winning by a head: gaps between first and second place

Another interesting thing I noticed—mainly because the sketchy exploratory Ruby program I wrote sort of dumped a lot of numbers to STDOUT for me to stare at—was the gap between the frequency of a clear winning sum, and the next-most-common sum. For example, here’s the sort of information my program dumps to the screen when I run it for randomly-sampled sets of seven numbers from the range \([-20,20]\):

[-19, -10, -9, -1, 14, 15, 18]
  clearance: 1
  [[4, 6], [-5, 5], [13, 5], [-10, 4], [9, 4], [5, 4], [14, 4]...
  {4=>6}
  1 / 127

[-18, -14, -13, -12, -9, -1, 1]
  clearance: 0
  [[-39, 5], [-26, 5], [-27, 5], [-40, 5], [-35, 4], [-13, 4],...
  {-40=>5, -39=>5, -27=>5, -26=>5}
  4 / 127

[-18, -9, -8, 1, 8, 9, 17]
  clearance: 1
  [[0, 11], [-9, 10], [9, 10], [-8, 8], [8, 8], [17, 7], [1, 7...
  {0=>11}
  1 / 127

[-11, -10, -3, 0, 4, 5, 17]
  clearance: 0
  [[4, 4], [5, 4], [11, 4], [2, 4], [1, 4], [12, 4], [-2, 4], ...
  {-10=>4, -9=>4, -7=>4, -6=>4, -5=>4, -4=>4, -3=>4, -2=>4, 1=>4, 2=>4, 4=>4, 5=>4, 6=>4, 7=>4, 8=>4, 9=>4, 11=>4, 12=>4}
  18 / 127

[-20, -15, -7, -2, 2, 9, 13]
  clearance: 1
  [[-20, 7], [2, 6], [-7, 6], [-13, 6], [-22, 6], [0, 6], [-11...
  {-20=>7}
  1 / 127

[-15, -3, 6, 7, 10, 11, 13]
  clearance: 0
  [[16, 5], [13, 5], [8, 4], [19, 4], [21, 4], [23, 4], [6, 4]...
  {13=>5, 16=>5}
  2 / 127

[-9, -7, -5, -3, -2, 0, 5]
  clearance: 0
  [[-7, 10], [-12, 10], [-14, 10], [-9, 10], [-16, 8], [-5, 8]...
  {-14=>10, -12=>10, -9=>10, -7=>10}
  4 / 127

[-10, -3, 0, 3, 14, 16, 20]
  clearance: 2
  [[20, 8], [23, 6], [17, 6], [33, 4], [7, 4], [6, 4], [4, 4],...
  {20=>8}
  1 / 127

[-19, -14, -11, -7, -1, 1, 12]
  clearance: 0
  [[-7, 5], [-25, 5], [-21, 5], [-20, 5], [-19, 5], [-18, 5], ...
  {-32=>5, -25=>5, -21=>5, -20=>5, -19=>5, -18=>5, -14=>5, -7=>5}
  8 / 127

[-20, -13, -3, 0, 1, 3, 11]
  clearance: 0
  [[-12, 6], [-22, 6], [-19, 6], [1, 6], [-2, 6], [-9, 6], [-2...
  {-22=>6, -19=>6, -12=>6, -9=>6, -2=>6, 1=>6}
  6 / 127

That’s just ten randomly-sampled sets, but it shows the sorts of features I’ve been looking into. The first line of each block is the set itself. Then the clearance line indicates the gap between the largest and second-largest frequencies in the histogram (which is “0” if there is no clear winning sum). Then I print the first few entries in the sorted list of [sum, count] pairs, then the winning sum (or tied winners) with their shared count, and finally how many distinct sums are tied for “most frequent”, out of the \(2^{N}-1\) subset sums.
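
For reference, the clearance statistic is trivial to recompute from the histogram. A sketch, using the counting helper from earlier:

    # Gap between the largest and second-largest frequencies in
    # the subset-sum histogram; 0 means a tie at the top.
    def clearance(numbers)
      sorted = subset_sum_counts(numbers).values.sort.reverse
      sorted[0] - (sorted[1] || 0)
    end

    p clearance([-10, -3, 0, 3, 14, 16, 20])  # => 2, per the dump above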

Some edge cases

As I was discussing this with some friends on social media, Ed Vielmetti pointed out a few edge cases which are interesting to consider, too. For example, if the set of numbers we start with (call it \(S\)) is something like \(\{1,10,100,1000,10000,100000\}\), there can never be a clear winning sum among the subsets. Indeed, the subset sums will all be distinct numbers in the range \([1,\sum S]\)—there will only ever be one subset that can possibly sum to any given value!

Further, that’s true for distinct powers of any integer base \(b \ge 2\), and there will never be any duplicate sums even if we skip some powers or start partway along the sequence. For example, \(\{1,2,4,8,16,32\}\) will produce subsets with sums hitting every integer in the range \([1,63]\), and the subsets of the set \(S = \{3, 81, 243, 729\}\) (assorted powers of \(3\)) will produce no duplicate sums.

Ed also points out, “1,2,3,4,8,12 has a lot of 15s, which would make it a good cribbage hand.”
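
Checking for that sort of “flatness” (no two distinct subsets sharing a sum) is a one-liner on top of the earlier counting sketch, and Ed’s cribbage hand duly fails it:

    # True when no two distinct nonempty subsets share a sum.
    def all_sums_distinct?(numbers)
      subset_sum_counts(numbers).values.all? { |count| count == 1 }
    end

    p all_sums_distinct?([1, 2, 4, 8, 16, 32])  # => true
    p all_sums_distinct?([3, 81, 243, 729])     # => true
    p all_sums_distinct?([1, 2, 3, 4, 8, 12])   # => false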

Some hints and guesses and a goal or two

Ultimately, I don’t know the “answers” to the questions I’ve posed above.

It feels as though there are clues in the values of, and relationships between, the numbers in the starting set, which one should be able to use to predict whether there is one clear winner or not. Something about those \(\{1,10,100,1000,\dots\}\) sets also seems to point towards a “feature” we could consider as a heuristic. Then there’s the observation that in at least a few of the sets with clear winners, the winner is a number in the original set, and there are also pairs of numbers that sum to \(0\), and pairs that sum to that winning number.

But what specifically is it about \(\{1,10,100,1000,\dots\}\) and \(\{1,3,9,27,\dots\}\) that makes those “flat”? It’s not clear that those are the only sets of values that will be flat in that way, and in fact it’s quite possible (seems to me) that Erdős has some proofs or conjectures or open problems of his own regarding whether there exist sets of integers for which no sums repeat; “sets with distinct subset sums” sounds like exactly his sort of thing. I haven’t scoured the literature, since to be honest I’m not in the literature business.

In poking around at this problem, one of the things I considered was whether the sequences of observations I mentioned above (for the consecutive-integer sets \(\{1,2,\dots,N\}\)) might have some clues buried in them. Most of the sequences I collected for the first few values of \(N\) don’t produce results when I paste them into the search box at OEIS.org, but one did: 1, 2, 2, 3, 5, 8, 14, 23, 40, 70, 124, 221, 397, 722, 1314, 2410, the sequence of frequencies of the most common subset sums. That is, this is the count of the winner (or the tied winners) for \(N=1,2,3,\dots\).

That turns up this sequence, which is an interesting one. The match isn’t exact, since a couple of terms differ at the start, but the comment on the OEIS entry is salient: “a(n) is the maximal number of subsets of {1,2,...,n} that share the same sum”. Is that a clue? Feels like it, though I note that the cited papers seem to be talking about something else entirely. Then again, that’s my common experience with number theory and combinatorics papers; I’m not trained in the fields, so there are counterintuitive “leaps” in many of the citation trees I have tried to traverse through the years.

The mathematicians among you may not be interested in rules of thumb and heuristics, and will probably be thinking more about existence proofs and asymptotic proportions and stuff. And that’s fine, but those aren’t my questions.

Rather, as I said right from the start, I’m interested in algorithms, and how people find them. This suite of relatively simple-to-describe problems has that “smell” I can only call “potential”—it could be interesting. It feels as though, if one were to start building algorithms that search for patterns and tendencies among sets of numbers, then after staring at enough examples for long enough, one would start to get a sense of the problem. Like Ed Vielmetti’s observation on powers of ten, there are no doubt other subtle patterns down in there that one could stumble across, tease apart, and cumulatively, in an exploratory fashion, piece together just what it is Tozier means by this puzzle.

There’s leverage here for machine learning to have a go. I framed this as a problem for program synthesis, and I think that could be pretty sweet. But I have no doubt some (possibly obscure) Deep Learning system might be able to work in an integer number world, and the results would be no less interesting and/or confusing than those from genetic programming or other semi-automated approaches.

Here, let’s pare the open questions down into something simpler.

Suppose we were to build a training set whose cases are sets of, say, eleven integers from the range \([-20,20]\). Those sets will be the input values or arguments of our desired algorithm. We should make a few thousand training cases (a sketch of one possible generator appears after this list). We are looking for something like these as outputs:

  • Is there a clear winner subset sum or not? (classification)
  • What is the (approximate) winning gap between the first and second most frequent subset sums? This could be \(0\) (if there’s no clear winner) or higher. (prediction)
  • How many different subset sums will tie for “winner”? This should possibly be represented as a proportion, as I’ve done in my sketchy code output, above.
  • How do these observations scale in various ways? What will happen if the integers are taken from an increasingly wider range (for example, if the values are in \([-10000,10000]\))? What happens as more or less of the possible range is sampled? For example, you might compare the expectations for there being a clear winner when your initial set of integers represents 10%, 50%, or 90% of the sampling range from \([-20,20]\).
  • What’s the largest gap by which a clear winner can win? That is, what’s the biggest difference between the first- and second-largest counts? Does this depend on \(N\)? (Yes!) Does it depend on \(m\)? (Probably.) How?
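
Here is the generator sketch promised above. The file name, the JSON-lines format, and the helper names are all placeholders of my own choosing, and it leans on the clear_winner predicate from earlier:

    require 'json'

    # One labeled case: eleven distinct integers from [-20, 20],
    # tagged with whether a clear winning subset sum exists.
    def training_case(n = 11, m = 20)
      set = (-m..m).to_a.sample(n).sort
      { inputs: set, clear_winner: !clear_winner(set).nil? }
    end

    File.open('training_cases.jsonl', 'w') do |file|
      2000.times { file.puts training_case.to_json }
    end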

On Beyond Zebra

I warned you this was one of my open problems, and it surely is.

  • What, if anything, will be different if we start with a multiset of integers rather than a simple set? That is, what if we allow multiple copies of the same integers? For example, are there clear winning subset sums for \(\{1,1,1,1\}\)? (Yes! The sum \(2\) wins, arising from six of the fifteen sub-multisets.) Note that by “subset” here I mean “subset of elements of the multiset”, so in this example we can construct four different “subsets” of elements that look like \(\{1\}\). But does this result in something qualitatively different from the case where we only permit a single copy of each value in the starting set?
  • What, if anything, is different if we start with a set of (unique) numbers, but instead of subsets of arbitrary size we examine the possible multisets of a fixed size \(k\) that can be constructed out of those elements? For example, if we start with \(\{-1,2,13\}\) and build all possible \(k=4\) multisets, those would be \(\{-1,-1,-1,-1\}\), \(\{-1,-1,-1,2\}\), and so on up to \(\{13,13,13,13\}\). Same problem, but a different combinatorial… thingie (“combinations with repetition” may be the word; see the sketch just after this list). Again, is this a qualitatively different thing? Are there different features in the original set that might affect whether there is a clear winning sum over these fixed-size words on the alphabet defined in the original set?
  • Is the basic clear winning sum question “easier” if we limit ourselves to counting the frequencies of sums over subsets of a fixed size \(k\)? So for example if I ask you to look at a set of \(N\) integers, and tell me whether there will be one clear winning sum over all possible subsets of size \(3\), will that be simpler in some sense?
  • Are there examples of initial integer sets where the presence of \(0\) makes a difference in the outcome? You’ll recall in the quick exploration of sets that contain only the first \(N\) integers that I described above, the winners were identical whether zero was included or not. But I wonder whether, especially in sets containing negative values, that may change. For example, in cases where there’s only a one-point “lead” for the clear winner, and either that winner or the next rank includes \(0\). A little switch up or down might affect the balance, at least in some rare cases. Or is that even possible?
  • Are there visualization methods that could make it intuitive to discover patterns in these sets? I have the sense that Combinatorics people like to make a jump from numbers to points on a line or plane. Or there may be some benefits in plotting the pairwise sums of the numbers in the set. Or in looking at how fast the sets of sums (or histograms of sums) grow as you move from one to two to three elements in the subsets. I don’t know. But again it feels like there’s something there….
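
For the fixed-size multiset variant mentioned above, Ruby happens to ship the relevant enumerator as Array#repeated_combination; a sketch:

    # Count sums over all size-k multisets drawn from the set's
    # elements (combinations with repetition).
    def multiset_sum_counts(numbers, k)
      counts = Hash.new(0)
      numbers.repeated_combination(k) { |multiset| counts[multiset.sum] += 1 }
      counts
    end

    p multiset_sum_counts([-1, 2, 13], 4).sort.to_h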

There are more of these conjectures and questions and exercises and games to play, of course. There are always more. That’s the beauty of little abstract puzzles: there’s room at the edges to grow.

You could think about these as programming exercises. Learn a new language, or brush up your development skills and mindful practices in a language you already know.

You could think of them as open questions in mathematics. Hit the literature and see if there’s an “obvious” (if possibly obscure) Hungarian paper on the topic from the 1980s. Odds are good, seems to me. Play with it that way, if you prefer.

You could think of them as a target for machine learning. What would a neural network need to be like to be able to classify sets of integers into “has a clear winner or not”? What would a genetic programming representation have to include, whether you tried to evolve an explicit brute-force algorithm for counting, or searched for heuristics and simpler features?

What can you make of it? Let me know.

  1. Later, we might come back and look at multisets. 

  2. It would be \(2^N\) if we also counted the empty subset, but we’re not, so it’s \((2^N-1)\). You can think of this as being equivalent to counting all the numbers you can build with exactly \(N\) binary “digits”: for each subset, every number either is or is not an element, so we can think of the “first” subset which contains only the first element as being "1000...0", the “next” subset containing only the second element as being "0100...0", the subsets with two items being "1100...0" and "1010...0" and so on, until eventually we reach "111...1", which contains every element. The only subset we’re not considering—and after all this, I’m not sure why exactly, except I haven’t done so until now—is "000...0", the empty subset.
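
(Incidentally, that bitstring correspondence translates directly into an alternative way to enumerate the subsets: each integer mask from \(1\) to \(2^N-1\) picks out one nonempty subset. A sketch:)

    # Enumerate nonempty subsets via bitmasks, per the footnote:
    # bit i of the mask says whether numbers[i] is in the subset.
    def subset_sum_counts_bitmask(numbers)
      counts = Hash.new(0)
      (1...(1 << numbers.size)).each do |mask|
        sum = 0
        numbers.each_with_index { |x, i| sum += x if mask[i] == 1 }
        counts[sum] += 1
      end
      counts
    end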