Takeaways from PyData Amsterdam 2019

Posted on Thu 14 March 2019 in Generic

Written by Enrico Rotundo & Michael Chong.

PyData Amsterdam 2019 is a wrap. It has been an amazing two-day conference (plus one day of workshops) where experts and users of data analytics tools shared their research, approaches and mistakes in a friendly and inclusive environment.

In this post, we present some takeaways from the conference, and a few reasons why you should consider attending next time.

Our favourite talks

Generally speaking, the level of the talks was very good, with a mix of experienced presenters and novices, professional entertainers and hardcore technical experts. Amongst many strong talks, these were our favourites:

The Profession of Solving (the Wrong Problem), by Vincent D. Warmerdam (GoDataDriven). Vincent is an entertaining speaker: his talk was fun, spot on in describing mistakes we all make, and very helpful in putting things in perspective. Through a few examples he illustrated how we get stuck because we look at the wrong thing or approach it the wrong way, and how often we are so focused on the details of the technology we are using that we forget about the problem itself. By changing perspective, you can often reach a solution that seemed impossible before.

People dreamt about the solution. Not the problem.

There are a few practical ways to help you change perspective when you are stuck, and some good practices you should always adopt. These include focusing on the problem rather than on the solution or the algorithm; talking with your client again and again to understand their pain points and needs; and setting up a pipeline that lets you evaluate the impact of anything before implementing the solution. Oh, and don’t forget to “go to your local theatre, there are some AMAZING things happening on stage”.

Online machine learning with creme, by Max Halford (Université Paul Sabatier, Toulouse). Max gave an entertaining presentation, full of memes and pop references, on a topic we don’t discuss often: doing machine learning on data streams. We are so used to thinking of machine learning as a batch problem that we often forget real data is often generated sequentially. He presented creme (from inCREMEntal learning), a Python library for online machine learning, a regime where the model learns one observation at a time. This enables rapid deployment, right after the first data point comes in 🤠, as well as a low memory footprint (there is no need to load the whole dataset into memory).

The library is quite young, but the core team seems very enthusiastic and committed. They need feedback on real-life use cases (i.e. real streaming data) to drive the development of the library (yes, things go wrong in real scenarios). If you are a bit confused about the usefulness of the approach, don’t worry: so are we. What is clear is that we are so used to the batch approach that we seldom question it. To better understand the implications of online learning, we will definitely try creme as soon as we get the chance.
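To give a flavour of what “learning one observation at a time” looks like in practice, here is a minimal sketch of the incremental predict/learn loop. The feature names and data below are made up, and the exact API may differ between creme versions.

```python
from creme import linear_model, metrics, preprocessing

# Scale features and fit a linear regression, one observation at a time.
model = preprocessing.StandardScaler() | linear_model.LinearRegression()
mae = metrics.MAE()

# A fake stream of (features, target) pairs; in practice this would be
# a generator reading from a queue, a log, or any other data source.
stream = [
    ({"temperature": 21.0, "humidity": 0.3}, 120.0),
    ({"temperature": 23.5, "humidity": 0.4}, 135.0),
    ({"temperature": 19.0, "humidity": 0.5}, 110.0),
]

for x, y in stream:
    y_pred = model.predict_one(x)   # predict before seeing the label
    mae.update(y, y_pred)           # track the online error
    model.fit_one(x, y)             # then learn from this observation

print(mae)
```

The appealing part is that the loop above never holds more than one observation in memory, so the same code works whether the stream has a hundred points or a billion.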

Sebenz.ai — South African job creation through gamified data labeling for machine learning, by Alex Conway (Sebenz.ai). Alex’s talk was the most colourful of the two days, both literally and figuratively. He presented Sebenz.ai, a way to label your data by having people play a sort of game on their smartphones (and get paid for it). If this sounds similar to Mechanical Turk, that is because it kind of is. But AWS’s tool is quite old, and there is nothing wrong with moving away from a de facto monopoly.

Generally we are very concerned with the amount of data, and we can overlook its quality, which is scarce and precious. Properly labelled data is even rarer: ask for a price quotation for annotated images if you don’t believe it. As Alex loves to put it:

if you are paying a data scientist with a PhD to label your data, you are not making a good deal

Not all data is like oil. Having an alternative tool to label your images, or to transcribe your audio notes, from a company that promises at least minimum-wage pay and tries to create jobs in a country with high unemployment, can’t do any harm.

Ethics and inclusion awareness is rising

That inclusion, diversity and ethics have become central topics was clear from the first keynote, an excellent presentation by Sasha Romijn about the need for empathy in our communities and products. Sasha highlighted unintentionally non-inclusive behaviours we may have, and the fact that until we experience discrimination ourselves, it is very difficult to realise how complex it is. One thing that blew our minds was the realisation that when we make a product accessible to people with disabilities, we may also make it available to groups of people we did not think about, as illustrated here.

Speaking of discrimination, we were happy to see that PyData adopted a code of conduct and made sure to share it with all the participants, both online and in the goodie bag.

Interestingly, questions about the ethical consequences of a solution or project came up several times during the conference, stimulating debate and awareness.

Finally, the topic of filter bubbles was touched upon in at least a couple of talks, and the talk by Sanne Vrijenhoek focused on how to measure diversity in news recommendations.

You’ll meet cool people

Such as core developers of open-source packages you use every day, like Ralf Gommers (SciPy, NumPy) and Carlos Córdoba (Spyder).

Conclusions

Attending this conference was a remarkable experience. Whether you’re a techie or more business-minded, you’ll get valuable takeaways from the community. And that’s not all: don’t forget to follow PyData Amsterdam’s meetup page to stay posted on the monthly meetups.

Acknowledgments: we want to thank for giving us the opportunity to attend this event. Also, we’d like to thank all the people who worked to make PyData Amsterdam 2019 such a successful event: the Organizing Committee, the volunteers, the speakers and NumFOCUS.

Originally published at https://medium.com on May 14, 2019.


A (visual) sneak peek into Kubeflow

Posted on Thu 27 December 2018 in Generic

Here’s what v0.3.4 looks like, at a glance

Kubeflow logo

You have probably heard about Kubernetes. Alright. If you are in the Data Science field, you might be wondering what’s being baked for you based on K8s. Well, for 2018… that’s probably Kubeflow!

Kubeflow is the machine learning toolkit for Kubernetes

It provides scalable machine learning workflows, and since it relies on K8s, it promises infrastructure-free containerized services for machine learning practitioners. In other words, Kubeflow ships with the following services:

  • JupyterHub, for the well-known Jupyter Notebooks and JupyterLab

  • TensorFlow model training

  • Optimized model serving, with support for NVIDIA TensorRT

  • Pipelines to manage experiments, deployments, etc.

That sounds great: finally, no more manual deployments of Jupyter and no more (re)adapting template scripts to train a bunch of models.

My reaction to my first reading on Kubeflow (src: giphy.com)

Thrilled by the glamorous announcements, I thought: “Hey, this Kubeflow sounds pretty cool, but I’d like to test it.” After unsuccessfully searching for screencast demos on YouTube, tutorials with explanatory visuals, or step-by-step guides on what you can do with Kubeflow, I quickly realized there’s very little material out there for the end user (e.g. a data scientist like me).

But… what does it look like?

Running Kubeflow

After wasting several hours trying to run Kubeflow v0.3.4 on macOS, I found the Deploy Kubeflow on GKE using the UI guide to work (almost) seamlessly. All you need is a GCP account and some spare time. After completing all the steps, you’ll have a running instance of Kubeflow. It starts with a pretty empty UI, which is the entry point to the services listed above.

View of Kubeflow v0.3.4 initial dashboard

JupyterHub

I had a bit of initial hassle with the JupyterHub service due to a “401: Unauthorized” error. However, after starting the server, you’ll be able to select an image for your Jupyter workspace. Note that there’s support for both CPU and GPU in all of the recent TensorFlow versions. Kubeflow should be able to automagically provision GPU resources if they are available in the cluster.

The user can configure details such as the CPU, GPU and RAM to assign to the Notebook instance.

After spawning a Jupyter instance, I started exploring around. It comes with two Python kernels pre-installed (py2 and py3). However, I was a bit disappointed about not being able to create any notebook, nor a terminal window: it would just fail with a cryptic “Not Found” message.

Need more info? Check the jupyterhub/kubespawner docs.
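Since the notebook servers are spawned through kubespawner, the resources you pick in the UI end up as spawner settings. Purely as a rough, hedged sketch of what the relevant options look like in a `jupyterhub_config.py` (trait names may differ between kubespawner versions, and the image name below is just an example):

```python
# jupyterhub_config.py -- illustrative snippet only
c.JupyterHub.spawner_class = 'kubespawner.KubeSpawner'

# Resources requested for each user's notebook pod.
c.KubeSpawner.cpu_limit = 2
c.KubeSpawner.mem_limit = '4G'

# Ask Kubernetes for a GPU, if the cluster has one to offer.
c.KubeSpawner.extra_resource_limits = {'nvidia.com/gpu': '1'}

# Example image; Kubeflow ships its own CPU/GPU TensorFlow images.
c.KubeSpawner.image = 'jupyter/tensorflow-notebook:latest'
```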

TensorFlow Job dashboard

The TF Job dashboard is the component you use to run TensorFlow training jobs. Since there’s K8s under the hood, I’d expect this to scale out smoothly and map the training jobs onto the cluster in a smart way. The UI is pretty basic here: you can create and monitor jobs, as well as manage them by namespace. Here’s what it looks like:

Want more info? Check the kubeflow/tf-operator docs.
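Behind the dashboard, a training run is just a TFJob custom resource handled by kubeflow/tf-operator. As an illustration only (the API group version changed across Kubeflow releases, and the image and namespace below are made up), submitting one from Python might look roughly like this:

```python
from kubernetes import client, config

config.load_kube_config()  # use your local kubeconfig
api = client.CustomObjectsApi()

# Minimal TFJob spec: a single worker running a training image.
tfjob = {
    "apiVersion": "kubeflow.org/v1alpha2",
    "kind": "TFJob",
    "metadata": {"name": "mnist-demo", "namespace": "kubeflow"},
    "spec": {
        "tfReplicaSpecs": {
            "Worker": {
                "replicas": 1,
                "template": {
                    "spec": {
                        "containers": [{
                            "name": "tensorflow",
                            "image": "example.io/my-mnist-trainer:latest",
                        }]
                    }
                },
            }
        }
    },
}

# Create the TFJob; the TF Job dashboard should then pick it up.
api.create_namespaced_custom_object(
    group="kubeflow.org", version="v1alpha2",
    namespace="kubeflow", plural="tfjobs", body=tfjob,
)
```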

Pipeline dashboard

This is where you compose and build data pipelines, and manage running experiments. You can build and inspect an ETL pipeline with visual tools.

More info on kubeflow/pipelines here.
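The pipeline definitions themselves are typically written with the kfp Python SDK, compiled, and then uploaded through this dashboard. A minimal, hedged sketch (the `ContainerOp` style reflects early kfp releases, and the image names are placeholders):

```python
import kfp
from kfp import dsl


@dsl.pipeline(name="toy-etl", description="Extract, then train.")
def toy_pipeline():
    # Each step is a container; the images below are placeholders.
    extract = dsl.ContainerOp(
        name="extract",
        image="example.io/extract:latest",
        arguments=["--output", "/data/raw.csv"],
    )
    train = dsl.ContainerOp(
        name="train",
        image="example.io/train:latest",
        arguments=["--input", "/data/raw.csv"],
    )
    train.after(extract)  # enforce step ordering


if __name__ == "__main__":
    # Produces an archive you can upload in the Pipelines UI.
    kfp.compiler.Compiler().compile(toy_pipeline, "toy_pipeline.tar.gz")
```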

Conclusions

I’d have liked to say more about Kubeflow, but v0.3.4 seemed pretty unusable to me. Although v0.4.0-rc2 is already out (see announcement), I didn’t find an easy (I mean dummy-proof!) way to deploy it for a quick test. Kubeflow comes with all the tools you need to set up an analysis environment: Jupyter, a model training scheduler, and an experiment dashboard. Once these three are fully working, machine learning practitioners will have a scalable and flexible workbench to play with.


JupyterCon 2018

Posted on Fri 31 August 2018 in Generic

JupyterCon is a yearly conference held in New York City that promotes the Jupyter project, its values, and the notable projects built on top of it. It's organized by NumFOCUS and O'Reilly Media with the support of several commercial sponsors. You can check out the schedule here and hopefully save the date for 2019!

Thanks to my company @Hal24k I attended the 2018 edition for the entire four days. It's been a great experience with lots of inspiring talks and insightful tutorials. Make sure to check out the related events: the poster session & drinks took place before or after the conference hours, and there were additional events arranged by the main sponsors (like Netflix) held in external venues, which are not listed on the conference webpage.

Highlights

You can check the keynotes linked on the O'Reilly page. My favorites are:

  • Paco Nathan's "Jupyter trends in 2018"
  • Carol Willing's "Sustaining wonder: Jupyter and the knowledge commons"
  • Tracy Teal's "Democratizing data"
  • Julia Meinwald's "Why contribute to open source?"
  • Fernando Perez's "Sea change: What happens when Jupyter becomes pervasive at a university?"

Since Paco's keynote is, in my opinion, the best, here's the video. This one in particular gives you an overview of the scale of the Jupyter project, and it's just terrific to see so much potential.

Lesson learned

The foremost take-home is a deeper understanding of what Jupyter actually is. Like many data scientists, I've been using Jupyter Notebook and JupyterLab in my daily work. I thought it was a revolutionary tool that brought IPython interactivity to the next level. I was wrong. The first mistake was confusing Jupyter with Notebook or Lab, as if they were one unique blob. If you think about it, that's quite common: most users start with these tools while working on a scientific assignment, and all they care about is getting to the results. You don't ask yourself much about what's behind all that.

The truth is that Notebook and Lab are just two of the many components of the Jupyter ecosystem. Indeed, you can think of Jupyter as a broader project that comprises a multitude of tools. This documentation page gives you a nice overview of what I'm talking about. To help the narrative, I'll report it here:

  • General: yes you can contribute to Jupyter!
  • User Interfaces: there are many more than Notebook and JupyterLab! Check out the third-party nteract, a desktop application for editing .ipynb files.
  • IPython
  • Kernels
  • Widgets
  • Notebook Documents
  • JupyterHub
  • Deployment
  • Foundations

A more technical diagram of the key components in Jupyter is available here; however, I believe it's incomplete, since it doesn't include many of the hottest releases like JupyterLab.

So? The Jupyter project is about supporting the development of modular components (listed above) that can be improved, extended and bundled by the entire community! The official Jupyter GitHub account alone counts 69 repositories, and these are only a subset of what's available on the web.

Developers put out a call to contribute to the modules and to get back to them with feedback.

JupyterLab's extensions

First of all, a clarification: in the Jupyter ecosystem there are many things called 'widgets', and this might lead to some confusion. For instance, the difference between Jupyter Widgets and JupyterLab Extensions is not entirely clear to me, at least in terms of implementation. Anyway, I'd like to spend a few words on JupyterLab's extensions, because I think they are the really cool innovation in Lab from the user's point of view.

The core development team was present at JupyterCon and they made it clear: JupyterLab is an extensible platform. The Lab application is nothing more than a bundle of extensions that delivers the "standard" experience, and users are encouraged to package their own.

Throughout the training, we got hands-on with this repo and built a set of extensions. Most of the code is in TypeScript and nodejs/npm, so you'll have to be a bit familiar with these web-dev technologies. Among other things, we managed to build a widget that renders MP4 videos with just a few lines of code. The idea is that you start from a MIME renderer template and write the code necessary to open, serialize and display MP4 content. A bit of CSS makes the layout nice and neat. This is a functional but relatively simple plugin that involves only front-end coding. More complex extensions, such as the one for accessing GitHub repositories, need a bit of back-end hacking.

The bottom line is: here's the platform, now build your own extensions. Indeed, there are quite a few extensions out there, but not everyone is aware of them. A good starting point is probably the "Extension Manager", a widget that lets you look up and install plugins. It's disabled by default (at least in v0.33.12) and here's how to enable it. The widget adds an entry in the left side tab listing all the available extensions, and you can install one just by clicking the "Install" button. Bear in mind that most extensions are early releases, so make sure you check the related repository to see whether it's suitable.

JupyterLab activate extension manager

I suggest looking up extensions directly on GitHub and npm as well.

Note: the current version is ~v0.34 and there's still some way to go before v1.0. According to @sccolber, the road to v1.0 will probably break some of the existing extensions.