Easily Use Python and R together with {reticulate}
I work in an environment where R and Python are used interchangeably, and most of the data scientists here have some familiarity with both languages. We regularly use one language to call the other and I’ve been struck by just how easy this is, particularly with RStudio’s {reticulate}
package (also with Python’s subprocess
). We’ve tried rpy2, but it’s a poor man’s reticulate in my opinion.
Here’s a quick example of utilising a Python script in R with the use of reticulate, based on a real problem I had recently.
library(dplyr)
library(reticulate)
I was working with some data that had a peculiar timestamp, which I could not parse correctly in R. I’m not saying this can’t be done in R, just that I couldn’t do it and didn’t have the time to delve into the complexities of date formats in R. I knew it parsed correctly with Pandas, so that’s what I used.
So here’s a tiny fake dataset that illustrates the issue:
<- tibble(
df random_id = sample(letters, 10),
timestamp = c(1537363730375, 1537363680645, 1537363720707, 1537363740836, 1537363740176,
1537363700649, 1537363780495, 1537363730636, 1537363730041, 1537363767311)
)
%>%
df mutate(date = as.POSIXct(timestamp, origin = "1970-01-01"))
## # A tibble: 10 x 3
## random_id timestamp date
## <chr> <dbl> <dttm>
## 1 b 1537363730375 50687-02-12 16:39:35
## 2 i 1537363680645 50687-02-12 02:50:45
## 3 w 1537363720707 50687-02-12 13:58:27
## 4 v 1537363740836 50687-02-12 19:33:56
## 5 h 1537363740176 50687-02-12 19:22:56
## 6 j 1537363700649 50687-02-12 08:24:09
## 7 r 1537363780495 50687-02-13 06:34:55
## 8 f 1537363730636 50687-02-12 16:43:56
## 9 u 1537363730041 50687-02-12 16:34:01
## 10 t 1537363767311 50687-02-13 02:55:11
As you can see, the dates are incorrect. Like I said, there’s probably a way to fix this in R, but after searching Stack Overflow and other resources, I didn’t find a solution so I quickly coded one up in Python:
import pandas as pd
def format_date(r_df):
"timestamp"] = pd.to_datetime(r_df["timestamp"], unit = "ms")
r_df[return r_df
Pandas’ unit
argument to to_datetime()
here is what solved it for me. We can just source this script with reticulate and this makes the function available to us in R, which we can use as part of a sequence of regular pipe operations:
source_python("/path/to/this/Python/script.py")
%>%
df format_date() %>% # Python function
as_tibble() %>%
pull(timestamp) %>%
str()
## POSIXct[1:10], format: "2018-09-19 13:28:50" "2018-09-19 13:28:00" "2018-09-19 13:28:40" "2018-09-19 13:29:00" "2018-09-19 13:29:00" ...
As you can see, the timestamp is now formatted in a way we can easily use. So. Easy.