Code used in making this post is available here
I’ve been exploring whether particle swarm optimization, a fairly high-complexity algorithm1 that I’ve nonetheless found to be relatively computationally accessible in a distributed computing environment, can enhance the predictive power of ordinary least squares regression.
Before I describe why we might even want to do this, rather than simply use a more flexible method if predictive power is the goal, here are some first results: after using the algorithm to optimize the set of interaction terms to include in a regression on 54 variables from the Panel Study of Income Dynamics2 (PSID), a popular dataset for U.S. economics research, the regression’s predictions were no less accurate overall than those from a Random Forest.
The above chart shows the percentage reduction in total out-of-sample mean squared error (MSE) in a person’s predicted log earnings, relative to a mean predictor, that several learning methods3 were able to achieve under ten-fold cross-validation. (This metric is comparable to the R2 of a regression, though potentially less misleading in cases of over-fitting.)
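To make that metric concrete, here is a minimal Python sketch of it (the analysis code in the repository is R; the `fit_predict` callback and the fold format here are illustrative assumptions, not the repo’s actual interface):

```python
import statistics

def pct_mse_reduction(y, folds, fit_predict):
    """Percentage reduction in out-of-sample MSE, relative to a mean
    predictor, under k-fold cross-validation. `folds` is a list of
    (train_idx, test_idx) pairs; `fit_predict` is any learner that takes
    (train_idx, test_idx) and returns predictions for the test rows
    (a hypothetical signature for this sketch)."""
    model_sse, mean_sse = 0.0, 0.0
    for train_idx, test_idx in folds:
        y_test = [y[i] for i in test_idx]
        preds = fit_predict(train_idx, test_idx)
        baseline = statistics.mean(y[i] for i in train_idx)  # mean predictor
        model_sse += sum((t - p) ** 2 for t, p in zip(y_test, preds))
        mean_sse += sum((t - baseline) ** 2 for t in y_test)
    return 100.0 * (1.0 - model_sse / mean_sse)
```

A perfect learner scores 100, a learner no better than the training-fold mean scores 0, and negative values indicate predictions worse than the mean predictor.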
Why Try Regression?
If we’re aiming to make predictions, then a machine learning algorithm such as Random Forest might be a good contender: it uses an ensemble of decision trees, which can help ensure that any nonlinear or interaction relationships among variables contribute to its predictions4, something not always clear-cut with regression.
But if we’re aiming to infer the underlying structure of the process through which our data came together (see this note from Paul Allison, or a discussion here), I find an advantage of regression can be the nice, clean formula we can statistically test with it. For example, we can give a regression something5 such as
STUDENT TEST SCORE ~ HOURS STUDYING + HOURS SLEEP
“the score each student receives on the test is proportional to the amount of time he or she studies plus the amount of sleep he or she receives the night before, any random error aside”
and then receive back coefficients, p-values, correlation coefficients, and other statistics that help us infer whether this hypothesis is correct—that studying and sleep are in fact the systematic drivers of students’ scores, and to what degree they drive them. (There may still be more to work through: any omitted variable bias or collinearity, for example. Though with a formula, we have a starting point.)
This said, something I’ve been pondering is what to do if we’d like both. When it comes to interactions, for example, what if time spent studying is influential—but only with sufficient sleep? A regression using our formula above wouldn’t fit any parameters that test for this.
If we wanted, we could re-specify the formula and test this as a new hypothesis. But then—what if our data includes not merely 2 variables, but 50, or 100? With 50 variables, there are 1,225 possible variable pairs we could test in two-way interaction, 19,600 possible triplets, and so on, and this doesn’t consider the pairs, triplets, and so on of these interaction terms themselves that we could test in the formula. And so, as I alluded to in this post, with this kind of combinatorics, testing every single possible specification in one big “for” loop might not even be feasible computationally.
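The scale of that search space is easy to verify with a quick sketch:

```python
from math import comb

n_vars = 50
print(comb(n_vars, 2))  # possible two-way interaction pairs: 1225
print(comb(n_vars, 3))  # possible triplets: 19600

# And the number of *subsets* of those 1,225 pairs we could include in a
# single formula is 2 ** 1225 -- far beyond any exhaustive "for" loop.
```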
Particle Swarm Optimization
I was curious if an optimization routine might help here. With particle swarm optimization (PSO), the analogy is a school of fish swimming around, looking for the best area for food. Each individual fish helps with the discovery by exploring the density of food in its own vicinity, and then reports back to the group. After a given round of collecting results from all of the fish across the school, the school shifts course and swims on average toward the fish that reported the greatest density. Each individual fish moves with the school, and also deviates from this average trajectory a little so that it can further explore its own vicinity.
For our purposes, the parameters aren’t the x, y, and z coordinates of a fish—they should be something related to the way variables interact. Here’s what I’ve been using:
- The total number of explanatory variables to include in a test regression
- The likelihood for each given possible explanatory variable (drawing without replacement) that it will be included in the regression
- The highest “order” of an interaction to attempt in the regression: in other words, whether to attempt any two-variable interactions, three-variable interactions, and so on
- The number of interaction terms to include of each given order
- The likelihood for each possible combination of variables in the interactions of each order
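To make the parameter list above concrete, here is a hypothetical Python sketch of how one particle’s parameters could be decoded into a regression formula string (the repo’s actual decoding is in R, and every name and simplification here is illustrative — for instance, interaction terms are sampled uniformly rather than by per-combination likelihoods):

```python
import random
from itertools import combinations

def formula_from_params(variables, n_main, incl_probs, max_order,
                        n_per_order, rng):
    """Decode one particle's parameters into a formula string: pick `n_main`
    explanatory variables without replacement, weighted by `incl_probs`,
    then sample `n_per_order` interaction terms of each order up to
    `max_order` from the chosen variables."""
    chosen, pool = [], list(variables)
    weights = dict(zip(variables, incl_probs))
    for _ in range(min(n_main, len(pool))):
        v = rng.choices(pool, weights=[weights[p] for p in pool])[0]
        chosen.append(v)
        pool.remove(v)  # drawing without replacement
    terms = list(chosen)
    for order in range(2, max_order + 1):
        candidates = list(combinations(chosen, order))
        for combo in rng.sample(candidates, min(n_per_order, len(candidates))):
            terms.append(":".join(combo))  # R-style interaction term
    return "y ~ " + " + ".join(terms)

rng = random.Random(0)
print(formula_from_params(["study", "sleep", "age"], 3,
                          [0.5, 0.3, 0.2], 2, 1, rng))
```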
To set up the algorithm, I created a swarm of “particles” each storing a value for each of the parameters above. These parameters are randomized across the particles when the algorithm begins, and then each particle runs its own regression using the formula that its parameters dictate. The particles report back to the group and then determine which particle’s regression achieved the smallest cross-validated MSE. Every particle shifts its parameters a little toward those discovered as best so far globally, and the process repeats.
Also, two nuances. First, each particle keeps track of its own historically best parameter set and never strays too far from it; this way, the particles don’t all simply pile onto the one globally best parameter set discovered across the swarm. Instead, they incorporate information about it into their own searches by inching toward it. Since there might be an even better parameter set in some other vicinity, we want to ensure that some particle has a chance to discover it.
Second, we add a little noise to each particle’s parameter set at each iteration, which helps the particle more fully explore its vicinity.
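Putting these pieces together, one iteration’s movement can be sketched as the standard inertia-weight PSO update, with the exploration noise tacked on (the coefficients here are conventional defaults, not necessarily the values I used):

```python
import random

def pso_step(position, velocity, personal_best, global_best, rng,
             inertia=0.7, c_personal=1.5, c_global=1.5, noise=0.05):
    """One update for a single particle: keep some momentum (inertia),
    inch toward the particle's own best-so-far point, inch toward the
    swarm's best-so-far point, and add a little Gaussian noise so the
    particle keeps exploring its own vicinity."""
    new_pos, new_vel = [], []
    for x, v, pb, gb in zip(position, velocity, personal_best, global_best):
        v_next = (inertia * v
                  + c_personal * rng.random() * (pb - x)  # own historical best
                  + c_global * rng.random() * (gb - x)    # swarm's global best
                  + rng.gauss(0.0, noise))                # exploration noise
        new_vel.append(v_next)
        new_pos.append(x + v_next)
    return new_pos, new_vel
```

In the actual run, each coordinate would also be clipped or rounded back to a valid value, since parameters such as term counts must be non-negative integers.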
Setting it Up
Because this algorithm benefits from both more particles and more iterations, I thought I’d try applying Apache Spark to it. I set up a cluster of EC2 instances on Amazon Web Services (AWS): one m4.16xlarge instance as the Spark master, and three c4.8xlarge instances as Spark workers.
With this setup there were 172 cores available to apply to the problem. And since each given particle operates independently in between re-grouping rounds, the algorithm is fairly easily parallelized. At two particles per core (slightly oversubscribing the cores so as to take advantage of all CPU cycles), I was able to create a swarm with 344 particles. As for software versions, I used SparkR with Spark 2.0 (the latest as of this post), which has a nice interface for user-defined R functions.
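The re-grouping round itself can be sketched in a few lines. The repo does this with SparkR’s user-defined-function interface on the cluster, but a stdlib Python thread pool shows the shape of it (the quadratic `evaluate` stand-in is purely illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate(particle):
    # Stand-in for the expensive step: fit the particle's regression and
    # return its cross-validated MSE. Here, particle 3 is "best" by design.
    return (particle - 3) ** 2

particles = list(range(8))  # stand-ins for the particles' parameter sets

# Each particle is independent between re-grouping rounds, so this map is
# exactly what gets distributed across the cluster's cores in the real setup.
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(evaluate, particles))

best = min(range(len(particles)), key=lambda i: scores[i])
print(particles[best])  # prints 3
```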
Here’s a nice set of instructions for spinning up SparkR with Spark 2.0 on AWS. And here’s a repository containing R code for the algorithm (which should allow you to execute the particles either in series without Spark or in parallel with it, although I’m not sure I’d actually recommend serial mode for anything other than testing and getting a feel for the algorithm), along with the example pre-processed PSID data file.
Running the 344 particles for 500 iterations over the 54 PSID variables, the algorithm was able to discover a regression that performed just as well as Random Forest in predicting a person’s log earnings: that is, it reduced the MSE of these predictions by 29 percent6.
Of course I’m stealing a page out of the predictive modeling book here using cross-validated MSE as a performance metric. But here’s where I think there’s an opportunity for inference: we also receive back a clean formula that can help us in determining how this 29 percent of earnings variation is structured.
2012logEarnings ~ race:sex + sex:ACTScore + motherAgeAtFirstChild + birthOrderFromMother + fatherAgeWhenChildBorn + fatherTotalNumberOfKids + whetherBornOnTime + whetherHadPreKCare + householdIncomeAtAge10 + whetherPrimaryCareGiverChangedByAge10 + childhoodCalculationScore + childhoodAppliedProblemScore + childhoodLetterWordScore + totalEducationCompleted + 2012Age
Out of the 54 variables, the algorithm suggests that including a select 16 of them in the regression leads to the best possible predictions of a person’s log earnings given the data at hand. And! It discovered that, of all the conceivable variable interactions driving earnings, the interaction between race and sex is important, as is the interaction between sex and the person’s high school ACT score. Of course, we’re not yet certain whether we’ll also need to address omitted variable bias, collinearity, or other concerns regarding causality; and had we included more or different variables, the model may have been over- or under-fitted. And so my question is this: should this formula be the starting point of our inference work, from which we’ll apply instrumental variables, examine coefficients, variable correlation coefficients, p-values, and other statistics?
For kicks, there are also some results related to the algorithm’s progression that may be interesting. Here’s a graph of the cross-validated MSE of the best-so-far parameter set that each particle discovered in its own vicinity as the algorithm proceeded. Of these, whichever particle had the smallest MSE after a given iteration held the globally best-so-far parameter set, and we can see that the particle that discovered the swarm’s ultimate best parameter set (from which the above formula results) did so after approximately 200 iterations.
How did the parameters themselves change as the algorithm proceeded? As one example, here is a graph of the best so-far number of explanatory variables that each particle itself was able to discover in its own vicinity as the algorithm proceeded:
We can see that over time the particles’ individual best so-far numbers of explanatory variables slowly closed in on a range between 15 and 25.
Finally, here is a related graph, but of the number of explanatory variables each particle tried in its regression at every iteration:
The shape here is slightly different from that of the graph above: there is less monotonic winnowing toward the 15-25 range. Less winnowing is what we would expect, because despite what each particle knows at each iteration about the best number of explanatory variables in its vicinity, it should continue exploring for a potentially even better number (given the values of its other parameters).