New decade, new opportunities! At Tenfifty we may well have Sweden’s most diverse set of real-world machine learning projects, so I decided to write up a post for the aspiring data scientist on the tools we use.

We use Python and Jupyter notebooks, and we deploy with Linux + Docker + Kubernetes. We usually run on Google Cloud or Microsoft Azure, depending on the customer.

1. Sklearn

Most practical work is still done with sklearn (scikit-learn).
Check out its website for very good documentation, tutorials, examples, etc.
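To give a flavour, here is a minimal sketch of a typical sklearn workflow: split the data, fit a small pipeline, and score it on held-out data. The dataset and model choice are just placeholders for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Any tabular dataset works here; this one ships with sklearn.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A pipeline keeps preprocessing and the model together.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```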

2. Gradient Boosted Decision Trees

For structured or tabular data (think spreadsheets), neural networks are usually not the most suitable choice. fast.ai has a neural-network module for tabular data which seems to give decent results, but the most reliable tool is still gradient boosted trees. Boosted trees are ugly in every way except one: they work. You don’t need to tune parameters (although you can), the results tend to be good, and learning is quick and robust with standard settings. They tend not to overfit even with very little data or very noisy data. They can handle continuous features, categories, missing values, etc. With NGBoost they can even give an estimate of the uncertainty for each prediction!

The standard libraries are XGBoost, LightGBM, CatBoost and the implementation that comes with sklearn. NGBoost, mentioned above, can also give a distribution rather than a point estimate as its answer. Here is a good comparison of a few of them. Our recent experience is that, with standard settings, CatBoost seems to come out on top. Their own benchmarks support this.
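As a rough sketch of how little is needed with standard settings, here is CatBoost on a toy tabular frame. The columns and values are made up; the point is that categories and missing values go straight in.

```python
import pandas as pd
from catboost import CatBoostRegressor

# Made-up toy data: one categorical column, one numeric column with a gap.
df = pd.DataFrame({
    "city": ["Gothenburg", "Stockholm", "Malmo", "Gothenburg", "Stockholm"],
    "size_m2": [55, 80, None, 120, 64],
    "price": [2.1, 4.5, 1.8, 3.9, 3.2],
})

model = CatBoostRegressor(verbose=0)  # default settings
model.fit(df[["city", "size_m2"]], df["price"], cat_features=["city"])
print(model.predict(df[["city", "size_m2"]]))
```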

Tree-based models are probably not the future. They are theoretically quite ugly and perhaps a dead end (in other words, a local optimum!). The output is discrete and everything is turned into ugly step functions, with no concept of priors, math, trends, similarity between subregions or cycles. This means that they are always constant outside the training region: they only do interpolation. It also means that there is no error gradient connected to the splits, so it is not trivial to layer them into anything like deep learning, where you input raw features and let the model do your feature engineering.

But dammit, they work.

3. Deep Learning

Which takes us to deep learning with neural networks, the poster child of AI. Deep learning is currently the only method used in practice for image and sound models. It is also very popular for modeling other unstructured data like text and time series, but there it at least has some competition from older and simpler methods. To do deep learning you need two things: fast matrix operations, preferably running on a GPU or TPU, and automatically computed gradients.

There are a number of libraries that provide these, the most popular being TensorFlow (maintained by Google) and PyTorch (maintained by Facebook). At Tenfifty we prefer PyTorch, since it is very flexible and feels more like regular imperative programming. fast.ai has great material for learning deep learning, as well as a higher-level neural-network library, which is built on top of PyTorch.
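To show what “regular imperative programming” means here, a minimal PyTorch sketch: build a small network, run a fake batch through it, and take one gradient step. The shapes and hyperparameters are arbitrary.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.randn(64, 10)  # a fake batch of 64 examples with 10 features
y = torch.randn(64, 1)

pred = model(x)
loss = loss_fn(pred, y)

optimizer.zero_grad()
loss.backward()          # gradients are computed automatically
optimizer.step()
print(loss.item())
```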

There is, however, a new kid in town: JAX (also from Google). JAX is a NumPy clone with additional features like JIT compilation and nth-order gradients of ordinary Python code. It is very similar to PyTorch but faster for small models that do not spend all their time in large matrix multiplications, and for methods that can take advantage of second-order gradients. It does not have a mature ecosystem yet, but based on community enthusiasm (and the great speed-ups we have seen on our small models) it may well become the future of deep learning (and of other things that use gradients and GPUs)!
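A tiny sketch of the two features mentioned above, jitting and higher-order gradients, applied to an ordinary Python function:

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sin(x) * x ** 2

df = jax.grad(f)             # first-order gradient of f
d2f = jax.grad(jax.grad(f))  # second-order, just by composing grad
fast_f = jax.jit(f)          # jit-compiled version of the same function

x = 1.5
print(f(x), df(x), d2f(x), fast_f(x))
```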

4. Natural language processing

GPT-2 came out a year ago but is still all the rage. It may well be the closest thing to “real” intelligence that we have witnessed in a model. These types of models can be used for all sorts of interesting work with language and text, but they are rather large and unwieldy for practical use.

What most people use in practice, I suspect, is spaCy. spaCy can do all the little practical things you want to do with text and also helps you train smaller, more practical models.
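For example, a few of those little practical things in spaCy: tokenization, part-of-speech tags and named entities. This assumes the small English model has been downloaded (python -m spacy download en_core_web_sm); the example sentence is made up.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Tenfifty builds machine learning systems in Gothenburg.")

# Tokens with part-of-speech tags and lemmas.
for token in doc:
    print(token.text, token.pos_, token.lemma_)

# Named entities found in the text.
for ent in doc.ents:
    print(ent.text, ent.label_)
```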

NLTK is a big and mature name in the Python ecosystem, but at Tenfifty we have never used it much.

5. Probabilistic Programming

We use probabilistic programming (PPL) for lots of things where we have hierarchical data, or just not very much data but reasonable guesses for how the data is generated, and whenever uncertainty is important. The best-known library may be PyMC3, but we like Pyro, which is newer and slightly less mature but nicer to work with. It is built on top of PyTorch. In the TensorFlow ecosystem, the equivalent library is Edward. Hamiltonian Monte Carlo for calculating an approximate posterior distribution is unfortunately slow in Pyro, so in recent months we have turned to NumPyro, which is much less mature but built on JAX and very fast.
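As a sketch of what this looks like in NumPyro, here is a tiny model of a coin’s bias with the posterior approximated by NUTS (a Hamiltonian Monte Carlo variant). The data is made up.

```python
import jax.numpy as jnp
from jax import random
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

data = jnp.array([1., 1., 0., 1., 0., 1., 1., 1.])  # made-up coin flips

def model(obs):
    # Prior over the coin's bias, then the likelihood of the flips.
    p = numpyro.sample("p", dist.Beta(1., 1.))
    numpyro.sample("obs", dist.Bernoulli(p), obs=obs)

mcmc = MCMC(NUTS(model), num_warmup=500, num_samples=1000)
mcmc.run(random.PRNGKey(0), data)
mcmc.print_summary()  # posterior mean and credible interval for p
```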

6. Forecasting

Facebook released Prophet a few years ago. It is based on a PPL and is state of the art when it comes to time-series forecasting on seasonal data. You can find trends of various kinds, search for periodic patterns, model the effects of holidays, and add additional regressors that you think might affect the result. For near-term forecasting you need to replace it, or combine it, with something that understands the current value and trend, not only the long-term patterns.
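A minimal sketch of the Prophet workflow: it expects a dataframe with a ds (date) column and a y (value) column, and forecasting is then a few lines. The series below is a stand-in for your own data; depending on the version, the package is installed as fbprophet or prophet.

```python
import pandas as pd
from fbprophet import Prophet  # in newer versions: from prophet import Prophet

# Stand-in daily series; replace with your own data.
df = pd.DataFrame({
    "ds": pd.date_range("2019-01-01", periods=365, freq="D"),
    "y": range(365),
})

m = Prophet(yearly_seasonality=True, weekly_seasonality=True)
m.fit(df)

future = m.make_future_dataframe(periods=30)  # forecast 30 days ahead
forecast = m.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```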

7. Parameter optimization

Setting up your own model for how something works and just keeping the parameters free is an underappreciated way of doing modeling work. It is also basically what you do when you build model pipelines anyway, except that there the hyperparameters are your model parameters.

The best way of doing parameter optimization for functions that are slow to evaluate is Bayesian optimization, where we fit a surrogate model over the fitness landscape of the parameters, including uncertainty. A popular choice for the surrogate is a Gaussian process, but you can use other things.

I like Dragonfly because it can also automatically control fidelity: the trade-off between how long you investigate a point (for example, how long you train) and the uncertainty in the result.
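Roughly, a basic Dragonfly call looks like the sketch below: minimise a black-box objective over a box domain with a fixed evaluation budget. The objective here is a cheap stand-in for something slow like training a model; check Dragonfly’s own documentation for the exact interface and the multi-fidelity options.

```python
from dragonfly import minimise_function

def objective(x):
    # Stand-in for an expensive evaluation, e.g. train a model with
    # hyperparameters x[0], x[1] and return the validation loss.
    return (x[0] - 0.3) ** 2 + (x[1] - 0.7) ** 2

domain = [[0, 1], [0, 1]]  # bounds for each parameter
min_val, min_pt, history = minimise_function(objective, domain, 20)
print(min_val, min_pt)
```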

Ax from Facebook (yes, everything is from Facebook and Google…) does similar things and seems mature enough, but we have not (yet) tried it on anything.

8. New fun stuff!

This isn’t really the place to describe exciting research directions, but there are some things I think/hope to use more in the near future.

Gaussian processes with deep kernels and deep kernel transfer: exciting stuff for meta-learning, few-shot learning and uncertainty. Generally, we in the ML industry need to use data in a less wasteful way.

Neural Tangents is an implementation of infinitely wide neural networks, which is a freaky idea. It turns out that such networks describe a type of Gaussian process, and that we can potentially use them for the same things, with the same benefits.
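A small sketch of the idea with the Neural Tangents library: describe a network architecture and get back the kernel of its infinite-width limit, which can then be used like a Gaussian process kernel. The architecture and inputs below are arbitrary, and the API may shift as the library matures.

```python
from jax import random
from neural_tangents import stax

# Describe an architecture; kernel_fn is its infinite-width kernel.
init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(512), stax.Relu(), stax.Dense(1)
)

x1 = random.normal(random.PRNGKey(0), (5, 10))
x2 = random.normal(random.PRNGKey(1), (3, 10))

# "nngp" is the Bayesian infinite-network kernel, "ntk" the one
# corresponding to gradient-descent training.
print(kernel_fn(x1, x2, "nngp").shape)
print(kernel_fn(x1, x2, "ntk").shape)
```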

We see planning and other applications of reinforcement learning used relatively little in practice at companies. There are great results from, for example, DeepMind and OpenAI, so I hope we see more real implementations soon. This is an excellent introduction from OpenAI.

That’s all for now. Leave a comment below if you have suggestions for other practical modeling libraries!

Author

David Fendrich
CTO