RStudio Re-Posit-ions for Success

As a multi-language bridge to Julia and a next-gen arXiv

Layne Sadler
Jul 31, 2022 · 11 min read
Image by Author

1) Reflecting on the rebrand

At its annual rstudio::conf last week, RStudio announced that it is rebranding as Posit and prioritizing a multi-language approach.

Given that the company raised a $161M round of financing last year, the first time since 2005 that it has taken venture money, it was evident that something big was brewing. However, if you’ve been keeping tabs on RStudio, this move isn’t coming out of left field. In fact, given the natural progression of Python and Jupyter Notebook adoption within RStudio’s platform over the past several years, a chimeric strategy was fairly predictable — especially if you consider how vocal they were about these Pythonic features. RStudio had already done everything in the realm of R and built a sizable team, so what else could the money have been for?

“We’re not changing our name because we’re changing what we’re doing, we’re changing our name to better reflect what we’re already doing”
- Hadley Wickham

That being said, now that the renaming is actually here (set for October), it admittedly feels like a bit of a shock. Perhaps it’s the stark contrast between the two logos that gives that effect. The original, concrete and product-focused. The new, ethereal and idealistic — more of a hypothesis than a brand. Being synonymous with R, the RStudio name carries a lot of weight in the space. So forsaking a brand that strong really sends a message that the company means business in this new venture. The reality is that, due to the sheer strength of its association with R, shedding the old name is a must if RStudio wants to be taken seriously in the existing Python and emerging Julia arenas.

On the whole, RStudio’s rebranding is representative of how the field of data science has changed over the past decade. We’ve seen the explosive growth of the Python and machine learning ecosystems, as well as a glimpse of promise in the underlying principles of Julia. Large enterprises have been forced to use multiple languages in order to get things done. So although this repositioning comes from a place of necessity, I posit that it will ultimately play out well for them — especially in the biopharma industry.

2) Observations from a multi-language industry

From where I’m sitting, biotech is the only industry that matters; it is to the 21st century what computing was to the 20th century. Cells are the new machines.

If you’ve been observing the shifting landscape of scientific computing from the perspective of bioinformatics, then the multi-language writing has been on the wall for some time now.

🧬 2a. Package repositories

Image by Author

R and biostatistics have a storied history together. This is best represented by the development of the Bioconductor package repo, which has since been paid the compliment of imitation by Bioconda and BioJulia. So over the course of the past decade, you’ve got these enterprise pharma customers, titans of industry, hardening the RStudio platform and pushing its computing, collaboration, and compliance capabilities to the max.

🐍 2b. Python & deep learning

During the same time period, Python sees exponential growth. However, RStudio doesn’t turn a blind eye to this. They start developing reticulate, which allows Python to be run from within R, as early as 2016. I tinkered with these kinds of cross-language tools a bit at work, but piping other languages is just a hack to get stuff done in ad hoc scripts.

All in all, there’s just too much gravity coming from the star at the center of the Python solar system, machine learning (ML). Biology is complex and there is only so much insight you can glean from association studies. Although the ML frameworks provide R APIs, those APIs don’t see as much adoption. So biopharma turns to Python in order to answer its high-dimensional questions. They begin staffing in-house ML labs.

One of my friends is a big R user because she is a proteomics researcher at the Broad Institute. Her team starts collaborating with the Broad’s ML lab. Now you’ve got all of these R and Python users working side-by-side on the same project, but these Pythonistas can’t work their magic on the RStudio platform. The next thing you know, RStudio adds support for Python and Jupyter Notebooks.

✨ 2c. Spark

The Python pull turns out to be so strong that Apache Spark, which is developed in Scala, goes from having zero Python support to making it a first-class citizen and mirroring the Pandas API with the release of Koalas. RStudio attempts to fix SparkR with sparklyr, but, from what I see at the annual Databricks conference talks, it doesn’t get a lot of love from the broader Spark crowd.

Meanwhile, Pythonic Spark-based bioinformatics tools are center stage. The Broad Institute releases Hail and adds Spark distribution to GATK. Regeneron/Databricks make a big splash with the regenie/GLOW toolkit for use with the DNAnexus/UK Biobank RAP platform.

During this time, Dask emerges as an alternative to Spark that is tightly coupled with the PyData toolset (Pandas, NumPy, sklearn).

📊 2d. Visualization

Image by Author

Since we are using R at Genuity Science to mine the UK Biobank with pharma, we use Plotly for visual interpretation of our results. We choose Plotly over ggplot2 because it provides highly interactive charts and APIs for multiple languages: R, Python, and later Julia.

Plotly actually takes a page out of RStudio’s book when they imitate and improve upon R Shiny by making ReactJS dashboards accessible to data scientists in the form of Dash (also multi-language). They recently announced Dash Bio too. Coming full circle, RStudio is developing Shiny for Python, which likely means that Shiny for Julia is lurking around the corner.

Learn more from our blogs about the reactive web and Dash.

Considering everything up until this point, it would seem that all roads lead to Python. Or do they?

🤹‍♂️ 2e. Julia

JuliaCon was also last week. As seen in RStudio’s Quarto markdown talk, they have followed Plotly’s example in making Julia a first-class citizen in their new projects.

The Julia language is simple like Python (in spirit) and fast like C. As Jeremy Howard says, “it’s Julia all the way down… whereas Python needs to wrap optimized C code in order to do anything performant.” Much like R, the Julia ecosystem is extremely unified in its focus on data science and mathematics. So much so that the core language is fluent in its understanding of most mathematical symbols. Unlike R, the core language is designed with distributed computing and GPUs in mind from the outset. SciML is the crown jewel of the community. The gist of the framework is that it enables elegantly and efficiently solving differential equations via simulation, as opposed to the brute force/black box approach of neural networks where millions of small floats are multiplied together until they happen to arrive at the right conclusion.
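
To make “solving differential equations via simulation” concrete, here is a minimal sketch in plain Python. It is not Julia and it is not SciML’s API. It steps a first-order decay equation forward with a hand-written, fixed-step Runge-Kutta (RK4) integrator and compares the result with the known closed-form solution. The function name `rk4_solve`, the rate constant, and the starting value are all made up for illustration.

```python
import math


def rk4_solve(f, y0, t0, t1, steps):
    """Integrate dy/dt = f(t, y) from t0 to t1 using fixed-step classical RK4."""
    h = (t1 - t0) / steps
    t, y = t0, y0
    for _ in range(steps):
        # Four slope estimates per step, combined as a weighted average.
        k1 = f(t, y)
        k2 = f(t + h / 2, y + h * k1 / 2)
        k3 = f(t + h / 2, y + h * k2 / 2)
        k4 = f(t + h, y + h * k3)
        y += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t += h
    return y


# Hypothetical first-order decay, dC/dt = -k * C (values are illustrative only).
K_RATE = 0.3   # decay rate per hour (made up)
C0 = 10.0      # starting value (made up)


def decay(t, c):
    return -K_RATE * c


simulated = rk4_solve(decay, C0, 0.0, 8.0, steps=80)
exact = C0 * math.exp(-K_RATE * 8.0)  # closed-form solution for comparison

print(f"simulated={simulated:.6f} exact={exact:.6f}")
```

A real solver would layer adaptive step sizes and error control on top of this idea, which is the kind of machinery a dedicated framework exists to provide.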

Julia Computing goes to market with the JuliaHub platform and its partner Pumas-AI. Pumas, a product from Pumas-AI, supports drug development via quantitative pharmacology/pharmacometrics.

3) Why is Posit ideally suited for this strategy?

⏭️ 3a. Culture of pragmatism & humility

“R is not a language driven by purity of its philosophy. R is a language designed to get shit done.”
- Hadley Wickham

Despite having been the stewards of R for so long and being so dominant in their domain, they are not too proud to listen and embrace change. As the parent of the R baby, they are agreeing to shed their R brand. Their C-levels aren’t going to chase short-term wins. Any player who wants to survive in this space needs to be able to embrace change. You never stay on top for long because who knows when the next government/venture-backed moonshot is going to displace you.

If you’ve ever enacted strategic change in a corporation of any size, then you know that it’s like trying to turn an aircraft carrier around. At first I felt like this rebranding was a schism; it must have been pushed through by the board and met with great resistance by R diehards. However, after taking a look at RStudio’s About page, I came away with the notion that the company is acting in unison. As I scrolled past the first few rows of headshots, I discovered, much to my delight, that every associate in the company was displayed. Most companies of this size reserve the about page for flaunting their lineup of PhDs, executives, and board members. What’s left of the optimist in me would like to believe that they are going on this audacious journey together.

👨‍👩‍👧‍👦 3b. Data science & open science community building

Source: twitter.com/mitchoharawild/status/1007297976711110659/photo/1

It’s not enough to just build something; you have to get people behind it. RStudio knows how to nurture an open source community. They are pioneers of the open source & open science movements. Their ecosystem is a hexagonal quilt of user-contributed libraries.

Image by Author
Monthly Visits — Image by Author
Image by Author

arXiv, bioRxiv, and medRxiv are the most popular websites for early-access research articles. Together they form one of the few places on the internet frequented by biomedical researchers, machine learning engineers, and computer scientists alike, which makes it all the more insane that they are just static content as opposed to a dynamic community. This is a huge gap waiting to be filled by the right player.

If Quarto takes off as a publishing platform, then scientific papers will be showcased on a Posit-affiliated platform, similar to Papers with Code. It has the potential to drive quadratic growth like a social network. Is Posit’s intent to build the social network of open science? I think so. Why else would they stress that Quarto is for publishing? Why wouldn’t they position it as a computational notebook?

https://quarto.org/docs/publishing/quarto-pub.html
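
One way to read “quadratic growth” is the argument usually associated with Metcalfe’s law. This is an assumption here, since no exact model is given. The number of possible pairwise connections among n members is n(n-1)/2, so it scales roughly with the square of membership. A quick sketch, using a hypothetical helper name:

```python
def possible_connections(members: int) -> int:
    """Count the distinct pairs that can be formed among `members` participants."""
    return members * (members - 1) // 2


# Membership grows 10x at each step; possible connections grow roughly 100x.
for n in (10, 100, 1000):
    print(n, possible_connections(n))
```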

Posit is also Boston-based, which shouldn’t be overlooked. The “hub city,” as it’s known internally, is a burgeoning fountain of biomedical research and talent. It’s home to universities like MIT & Harvard, institutes like the Broad, top-notch academic medical centers, VCs like Flagship, several big pharma companies, and FAANG companies.

Reference this article by Michael Porter, the father of business strategy, to learn more about the competitive advantages associated with geographic clusters.

🤝 3c. Collaboration comes first on their enterprise platform

Most power users can solve their own problems when it comes to making their program do what it needs to do. Rather than use a vendor-provided analytical tool in their domain, they’d probably have more fun writing their own library. Where they really need help is with enterprise concerns like:

  • Sharing their work securely in a team setting
  • Architecting/maintaining platform infrastructure
  • Staying compliant with industry regulations

I hate to harp on the Jupyter Survey results again, but I thought for sure the main pain points of data scientists would be about, well, data science. It turns out that what they need most are collaborative features like “parity with Microsoft Word’s track-changes feature” and “support for our version control system.”

RStudio gets this and they already have it figured out. The little blue button in the RStudio desktop IDE lets you take whatever you are doing locally and publish it to your company’s instance of the RStudio cloud so that you can share it with your team via RStudio Connect. That kind of product development doesn’t happen by accident, and that level of simplicity is hard to come by in the world of optimization-obsessed computer scientists.

🧮 3d. Existing platform components for analytical productivity

Image by Author

RStudio already has a platform for scientific computing in place:

  • Powerful desktop and browser IDEs with good data exploration
  • Markdown/notebook support for interactive programming & publishing
  • Jobs system for batch execution of workloads
  • Visualization for interpreting raw data
  • Dashboarding for creating app-like experiences

With a fair amount of abstraction and refactoring, it can adapt and apply these components for the other languages. The principles of mathematics, statistics, and deep learning are the same in every programming language. In essence, that’s what the name, Posit, is all about.

You could argue that there is a gap with respect to large scale data management. Given the volume, variety, and velocity (3 Vs of big data) of multi-omic/modal data that pharma generates, customers genuinely need a data warehouse. However, for the most part, files are language agnostic and RStudio has been able to defer file management to existing filesystems (so much for the lake house). I suspect Posit will continue to focus on analytics, connecting to customer data wherever it lives.

An area where they are more likely to invest development resources is distributed and parallel analytics. This is a main benefit of adopting Python and Julia tooling in the first place. It also goes hand-in-hand with experiment tracking.

4) Final thoughts: a multi-language bridge to Julia

Image Source: commons.wikimedia.org/wiki/File:The_Tron_bridge_%2823349111204%29.jpg

In the long run, fighting the core benefits of Julia is like swimming against the current. Based on what I hear from the maintainers of the Python scientific computing community, the Python projects have neither the staff to match nor, more importantly, the motivation to counteract the new language.

However, if one tries switching from Python to Julia today [allow me to preface this by saying that I have tremendous respect for the Julia team and anyone who dares to build something great], then one may be discouraged to find that things neither “just work” nor are as functionally complete as one has come to expect in the magical land of Python. In pushing computer science forward while adhering to a purist philosophy, the Julia user experience hasn’t been smoothed out yet. Given the combination of this inherent complexity, the lack of tooling (which results in a decent amount of DIY), and the fact that Julia is a compiled language, I would only recommend it for:

  • Differential equations/simulations.
  • Embedded systems [though I can’t speak knowledgeably about this area].
  • Long-running batch execution workloads in the cloud where performance is a must for time & cost savings. This is an extremely important use case because time is, quite literally, money in the cloud.

So in the meantime, while the tooling expands to reach parity, there has to be a bridge. It makes more sense that an established platform would provide an R & Python-first experience while allowing customers to experiment with and gradually migrate to Julia as it matures.

Because if you’re an R user, then you’re already equipped with APIs for DataFrames, Spark, TensorFlow, charts, and dashboards. So why bother migrating to Python/Dask? Why not go straight to Julia?

And if you’re RStudio… err Posit… why bother snapping Python hexagons together in a fruitless effort to play years of catchup with your competitors? Why not skate to where the puck is going to be by building in Julia?

Layne Sadler

AIQC is an open source framework for rapid, rigorous, & reproducible deep learning.