Table of Contents

Introduction

Before I started with R I used to do quite a lot of python coding, however back in the days I was still using python 2.7 and was not really using a bona-fide data-centric workflow. In this series of notebooks I would like to document how my best-practices from using R can be carried over to the python universe.

IDE

R has with RStudio one obvious candidate for the best IDE to use with R. It has been developed just like R especially for maintaining a data-centric workflow. We have

  • plotting
  • interactive variable exploration
  • git integration
  • code completion
  • code documentation
  • package building
  • unit testing
  • markdown support
  • package and reproducibility management
  • execution of python code

python on the other hand has not been primarily developed for data science applications but has some great add-onn packages that can be used for scientific computation. There are a couple of python IDE that mimick the RStudio or the Matlab interface such as spyder and rodeo they support plotting and interactive variable exploration and have great code completion but they lack git and markdown support. The multi-purpose IDE pycharm however seems to support also the other features that RStudio is capable of. The professional version offers a scientific mode that also mimicks the RStudiointerface to some degree and that provides a decent project structure.

Here we will walk through the features mentioned above and see how they are implemented in pycharm

Ploting and interactive variable exploration

This requirement is fullfilled by all IDEs.

git integration

git integration is pretty straight forward with more options than in RStudio

Code Completion

Code completion in RStudio is pretty straight forward if you press Tab your namespace is sensibly searched for variables and functions that you might be typing. When you are typing a function it automatically displays the function documentation and which parameters the function is accepting. When you are defining the parameters of a function and you hit Tab RStudio will show you a list of all parameters of the function that you have not defined yet. You can then select the parameter from the list In python code completion is done by packages like jedi which seems to be what all the IDEs are using however code completion feels a bit different. spyder’s code completion looks and works exactly like the one of RStdio. pycharm has code completion but it tends to be cluttered with irrelevant object and method names when it does not find anything in the local namespace. In order to get the parameters of the function we also have to use an additional shortcut Ctrl + P (on windows). This only displays the parameter but we cannot select anything and thus have to type it ourselves.

Code Documentation

In R we can only document functions properly when writing a package using roxygen2 comments. We can call the documentation of a function using ? for example: ?myfunction(). This will display nicely as html in an additional viewer window. In order to generate consistent roxygen2 comments we can use the sinew package which will generate a sensible template from a finished function. Code examples can be copy pasted into the console or can be called via example()

In python we can use help(myfunction) or my_function.__doc__ to call the docstring associated with a function as console output. In pycharm we can place the caret inside the name of any function and press ctrl + Q in order to obtain a more well formated pop-up window containing the docstring.

The docstring in python can be written for any function and also makes sense outside of package writing. In pycharm we can insert a docstring template in any definition of a function by placing the caret in the function name of the definition and hit alt + enter to get to a menu in which one option is to insert a docstring template based on the parameters of the function.

There is a google style sheet for writing docstrings. link

Example code in the docstring should be started with >>>> and the line below should reflect the return value


>>>> len('foo')
3

Those examples can be run as unit tests by packages such as doctest, nose or unittests

see this blogpost

Package Building

package building and installation is very well integrated in RStudio, there is a steep learning curve but we can compile doumentation run tests and check all from within Rstudio and we have several option to create latex,pdf or html documentation.

In python package building seems to be pretty straight forward. you basically just need a __init__.py file in your directory and you have a package.

But there are some standards and nomenclature best practices that are described here.

Unit Testing

In R we use testthat and the integrated RStudio build tools to write and execute unit tests. In python the standard package seems to be unittest while there are a number of different packages for unit testing with a simpler synthax such as py.test. As already mentioned doctest will run python examples in your docstring.

See this blogpost for more information.

Package Management and Code Reproducibility

Code reproducibililty and portability is a big issue in R since there are so many packages that are constantly updated it is hard to track the dependencies for a specific analysis. Rstudio relies on packrat to ensure reproducibilty by saving all packages with each analysis or project which greatly increases the diskspace needed for a project. To tackle this I have actually written a package called (updateR)[https://github.com/erblast/updateR] resorting to a workflow where I archive R and all packages installed at this timepoint 4 times a year.

In python we seem to have different options. We can use pip freeze to create a requirements.txt file. We can use an anaconda environment which we can share with our analyis or we can use docker. All three options seem to be well supported by pycharm.

Markdown Support

RStudio and markdown go hand in hand. .Rmd are similar to .md files and can contain executable code chunks in either R or python and can be rendered into numereous formats such as word, slides, dashboards, html, ebooks, pdf and blogposts. .Rmd files are simple text files that contain a YAML header that defines the rendering options. RStudio even supports one click publishing of html content to Rpubs a publication platform for R code.

The python equivalent is jupyter notebook which uses the .ipynb format which is basically a json file. The json format does not easily display in a text editor so we always have to use an IDE of some sort to edit those files. The best method is to start a jupyter notebook kernel from the command line and connect to it via a web browser. This starts an easy to use webinterface that allows for editing .ipynb files and feels a bit like a .Rmd editing in RStudio. We have different cells and we can define the content of this cell as either being markdown or code of a variety of languages and we can see the output of the code being console output or a plot right below the cells. Pretty much the same as it is with RStudio.

The difference is in the file format, the Rmd format does not save any code output inside the file while .ipynb even stores binary information inside the document. This makes the changes in .ipynb files difficult to track for git and practically disqualifies them from being part of an entire analysis or deployment process. We can use Rmd files as actual bulding blocks of an entire analysis for which we can have html reporting for each steps that can be legible to none-coders. The use-case for jupyter notebook seems to be documentation of data exploration, tutorials or proof of concepts.

As far as I can tell we can edit .ipynb files inside pycharm but I had difficulties saving the changes that I had made to the notebooks and ended up loosing a lot of work. It seems as if in pycharm we need at least one code cell to save the file. I am more comfortable now working with the jupyter notebook webinterface.

Execution of foreign code

In RStudio we can use the reticulate package to execute python code and to convert python into R objects. We can also define python chunks inside Rmd documents. However we can only get python code completion or interactive variable exploration when we convert the python objects to R objects.

Both pycharm and jupyter notebook support R. We can probably use feather to pass dataframes between R and python.