Data Science Experience - a review

It's been a couple of weeks since I was accepted into the closed beta testing programme for IBM Data Science Experience (DSX), and it is about time I shared my thoughts on this offering.

DSX is a new product that IBM is positioning as a new-generation data science development and training platform. It aims, however, to be more than a typical "notebooks in the cloud" service, as it mixes open source tools (Jupyter, RStudio, and Spark), IBM value-adds, and sharing and collaboration functionality.

The first thing I noticed after logging into DSX is that all my Jupyter notebooks from Bluemix are already accessible directly from my home screen.

My Notebooks in DSX

Poking around the interface a bit also revealed a list of all my Bluemix services. Although not much interaction is possible at the moment (the manage link simply took me to Bluemix), this functionality hints at the possibility of deeper DSX-Bluemix integration.

Bluemix services in DSX

There is, however, functionality in place to provision new Spark back-end services directly from DSX. There is also an option to attach an Object Store to the Spark instance, which provides 5 GB of free storage space and should be sufficient for quick data wrangling and exploratory data analysis tasks.

DSX is not limited to pulling data from the IBM cloud, though. You can create connections to popular services like S3, Azure, and Salesforce.com, to external Oracle, Greenplum, Sybase, and MySQL databases, to Hadoop (Hive, Impala), and to many more.

External Data Services

Creating notebooks is fairly straightforward – you can create one from scratch, upload a file from your local machine, or pull an existing notebook from a URL. I tested the "From URL" option by pulling a notebook from GitHub and it worked like a charm.
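If the "From URL" option expects the address of the raw .ipynb file rather than the GitHub page that renders it (an assumption on my part, not something I verified in DSX), a small helper can turn one into the other. The sketch below relies only on GitHub's public URL layout; the function name and the example repository are hypothetical.

```python
from urllib.parse import urlparse


def github_blob_to_raw(url: str) -> str:
    """Convert a GitHub 'blob' page URL into the matching raw-file URL.

    URLs that are not GitHub blob pages are returned unchanged.
    """
    parts = urlparse(url)
    segments = parts.path.strip("/").split("/")
    # Expected shape: <owner>/<repo>/blob/<branch>/<path to file>
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        return url
    owner, repo, _, *rest = segments
    return "https://raw.githubusercontent.com/" + "/".join([owner, repo, *rest])


# Hypothetical repository and notebook names, for illustration only.
print(github_blob_to_raw(
    "https://github.com/example-user/example-repo/blob/master/demo.ipynb"
))
```

Anything that is not a GitHub blob link passes through untouched, so the helper is safe to apply to every URL before pasting it into the dialog.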

Another nice touch is that each notebook has a link that takes you to the Spark History Server of the associated Spark instance. This comes in handy if you want to take a quick look at the storage and stages of your workload, while still working on your source code.

There is also a sharing section that allows you to share your notebook with other people. The sharing mechanism is not very sophisticated, though – it just generates a unique permalink that you can pass on.

DSX has three main dashboards/perspectives – Data Science, Data Hub, and Exchange.

The Data Science area is where you create and work with your notebooks. This is also where you can access various data-science-related articles, tutorials, and notebooks.

Data Hub is the area where you create and work with Projects. Projects are a key feature of DSX that enables people to work together on a dedicated set of assets. You can create/assign connections, notebooks, and storage to a project, and then share these resources with a group of collaborators. You can assign a role to each collaborator, and the role governs what type of access to the project that individual gets (viewer, editor, etc.).

Being able to collaborate with my colleagues on shared notebooks is great, but I would also like to see some kind of source control integration. If multiple people are changing the same notebooks, I'd like to be able to keep different versions and to diff or preview the changes.
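To make the diff wish concrete, here is a minimal sketch of what comparing two versions of a notebook could look like. This is not DSX functionality; it simply reads the standard notebook JSON (nbformat 4), keeps the code cells, and runs them through Python's difflib. The function names and the two tiny sample notebooks are made up for illustration.

```python
import difflib
import json
from typing import List


def code_cells(notebook_json: str) -> List[str]:
    """Return the source of every code cell in an nbformat-4 notebook."""
    nb = json.loads(notebook_json)
    cells = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            source = cell.get("source", "")
            # nbformat allows source to be a single string or a list of lines.
            cells.append(source if isinstance(source, str) else "".join(source))
    return cells


def diff_notebooks(old_json: str, new_json: str) -> List[str]:
    """Unified diff of the code cells of two notebook versions."""
    old_lines = "\n".join(code_cells(old_json)).splitlines()
    new_lines = "\n".join(code_cells(new_json)).splitlines()
    return list(difflib.unified_diff(old_lines, new_lines, "old", "new", lineterm=""))


# Two made-up versions of the same one-cell notebook.
old = json.dumps({"nbformat": 4, "cells": [
    {"cell_type": "code", "source": ["x = 1\n", "print(x)"]}]})
new = json.dumps({"nbformat": 4, "cells": [
    {"cell_type": "code", "source": ["x = 2\n", "print(x)"]}]})

for line in diff_notebooks(old, new):
    print(line)
```

Ignoring outputs and metadata keeps the diff focused on what people actually edited, which is the part I would want to review before accepting a colleague's change.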

Data Hub

It is also interesting to note that projects sometimes get created automatically as a result of your actions. For example, when a program like Watson Analytics saves a data set in the IBM data lake, DSX will automatically pick up this event and create a project with a link to the data set assigned to it.

In the Exchange you get access to data sets, notebooks, and storybooks (Watson Analytics) shared by other people. You can browse a catalogue of freely available datasets (classified by industry), select a dataset, and generate an access key for your notebook with a push of a button:

Data set from the Exchange

Another very exciting feature, based on the partnership between IBM and the R Consortium, is the inclusion of RStudio in the Data Science Experience.

RStudio

You can launch a web version of the RStudio IDE with a push of a button and not only get access to hundreds of CRAN packages for enriching your R scripts, but also use popular features for building R-based web applications (I played a bit with Shiny and flexdashboard and had no problems whatsoever using them in DSX).

Data Science Experience is a logical continuation of IBM's investment in open source (think Spark Technology Center, the donation of SystemML to Apache, Big Data University, etc.) and is definitely worth trying. Having the capability to develop in the cloud using a hybrid data supply (a mix of internal and external data sources) is tempting. From a personal perspective, having RStudio and Jupyter is a great start, but I would like to see the toolset enhanced (I am thinking Zeppelin) and also a bit more done around security and version control.

I would also like to see a bit more done around Machine Learning – SystemML access in my Spark instances would be a great thing to have, although this request has more to do with Bluemix than DSX. The collaboration/social element could also benefit from further extension. Sharing notebooks with my team is great, but being able to communicate with them in the same environment would be even better; I really think that DSX could benefit from rolling out some forum-like capabilities, or maybe even Slack integration.

To summarize my view – Data Science Experience is a great start and I'll be keeping a close eye on it. If you are interested in signing up for early access, you can register at http://datascience.ibm.com.