Treating TensorFlow APIs Like a Genetics Experiment to Investigate MLP Performance Variations

I built two six-layer MLPs at different levels of abstraction: a lower-level MLP using explicit matrix multiplication and activation, and a higher-level MLP using tf.layers and tf.contrib.learn. Although my intention was simply to practice implementing simple MLPs at different levels of abstraction, and despite using the same optimizer and same architecture for training, the higher-level abstracted model performed much better (often achieving 100% accuracy on the validation datasets) than the model built around tf.matmul operations. That sort of mystery deserves an investigation, and I set out to find out what was leading to the performance difference and built two more models mixing tf.layers, tf.contrib.learn, and tf.matmul. I used the iris, wine, and digits datasets from scikit-learn as these are small enough to iterate over a lot of variations without taking too much time.

In genetics research it’s common practice to determine relationships between genes and traits by breaking things until the trait disappears, than trying to restore the trait by externally adding specific genes back to compensate for the broken one. These perturbations are called “knockout” and “rescue,” respectively, and I took a similar approach here. My main findings were:

  • Replacing tf.matmul operations with tf.layers didn’t have much effect. Changing dropout and other hyperparameters did not seem to effect the low-level and high-level models differently.
  • “Knocking out” the use of learn.Estimator.fit from tf.contrib.learnand running the training optimizer directly led to significantly degraded performance of the tf.layers model.
  • The model built around tf.matmul could be “rescued” by training with learn.Estimator.fitinstead of train_op.run.
  • The higher-level model using layers did generally perform a little better than the lower-level model, especially on the digits dataset.

So we can conclude that training with the tf.estimator API was likely responsible for the higher performance from the more abstracted model. Cross-validation curves demonstrating the training efficacy of the different models are shown below:

Cross-validation accuracy curves for different random seeds using the tf.layers model.

Cross-validation accuracy curves for different random seeds using the tf.matmul model.

These MLPs perform pretty well (and converge in just a few minutes) on the small sklearn datasets. The four models are built to be readily modifiable and iterable, and can be accessed from the Git repository