R For SEO Part 8: Apply Methods in R
Welcome back again. Now we’re at part eight of my series on R for SEO and we’ve been on quite a journey so far, haven’t we? What was originally meant to take me eight weeks has taken several years! Between a pandemic, several jobs, two house-moves, some big shakeups in my personal life and a lot of changes in the marketing ecosystem, it’s been a lot. But today, we’re in the home stretch and we’ll be talking about using the apply family in R.
Applys are a more “R-centric” way of running functions across a range of data. They’re not entirely dissimilar to loops, but tend to be a more efficient way to write them and they do come with additional functionality which can be very useful in a number of situations.
You’ll have seen some of them being used already through this series, particularly when we talked about using APIs in R. So let’s take a look at the components of the apply family and what they can be used for.
As always, this is quite a long piece, so do feel free to skip around using the table of contents below and please do sign up for my email list to keep up to date when I publish new content.
What Are Apply Commands In R?
R’s apply family essentially allows you to apply a command or function across a range of data. The clue’s in the name, right? While not dissimilar to a loop in concept, there are a number of differences that do often make them a better or simpler choice.
As with loops, there are a few different variations that we can use. Let’s look at what they are before we get started on how we can use these different R apply methods in SEO work.
The Different Kinds Of Apply Methods In R
There are four key applies in the apply family:
- Apply: The most basic one. Apply runs a function across the rows or columns of a matrix or dataframe. I generally use this when I want to find the maximum or minimum value of a column or if I’m trying to find the number of times a term is mentioned in a text corpus
- Lapply: You’ll have seen me use this a couple of times throughout this series and, truthfully, it’s the one I use the most. Lapply applies a function to each element of a list or vector, always returns a list, and is great for running your own functions and API calls
- Sapply: Similar to lapply, and another one I use quite a bit. Sapply is the “simplifying” apply: it works like lapply but tries to simplify the output to a vector or matrix where it can
- Mapply: Mapply is the “multivariate” apply and is great when you need to run a function across multiple vectors or elements in parallel
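To make the list above a bit more concrete, here’s a rough side-by-side sketch of the four on a tiny toy dataset. The numbers and object names are made up purely for illustration, they aren’t from Search Console.

```r
# Toy data, made up purely for illustration
impressions <- c(10, 25, 40)
clicks <- c(1, 5, 2)
m <- cbind(impressions, clicks)

# apply: run a function over the rows (1) or columns (2) of a matrix
colMax <- apply(m, 2, max)

# lapply: run a function over each element, always returns a list
asList <- lapply(impressions, function(x) x * 2)

# sapply: same idea, but simplifies the result to a vector where it can
asVector <- sapply(impressions, function(x) x * 2)

# mapply: walk several vectors in parallel, element by element
ctr <- mapply(function(cl, im) cl / im, clicks, impressions)

colMax    # impressions 40, clicks 5
asVector  # 20 50 80
ctr       # 0.10 0.20 0.05
```

Same idea every time: a function gets run over each piece of the data. What changes is the shape of the input and the shape of what comes back.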
So now we know what they are, let’s look at how to apply (Ha! I’m a comedic genius!) these to SEO work.
Using R’s Apply Method To Count Keywords On Google Search Console Data
The easiest way to show the difference between loops and applys is probably to replicate our loops from the last piece using the different apply methods.
To use the basic R apply method, let’s rework what we did with our first for loop to count the number of keywords that have 20 or more impressions from our Google Search Console dataset – you can follow the steps from part 2 to get that data, or just export from Google Search Console and import it using the read.csv function from part 1.
Assuming you’ve named your dataframe gsc, use the following command:
kwCount <- sum(apply(gsc["Impressions"], 1, function(x) x >= 20))
Now if we investigate our kwCount object in the console, using
kwCount
We should see a single number: the count of keywords with 20 or more impressions. Your data will be different, and probably much higher than mine.
But we did that in one line, compared to the multiple lines of the loop we’d previously used, didn’t we?
As always, let’s break it down.
Our Apply Command Broken Down
Our apply command did what the loop achieved in one line rather than five. That’s interesting. Let’s see how it works.
- kwCount <-: As with all the commands we’ve used throughout the series, we’ve named our object and used <- to tell R we want to keep it and use this particular name
- sum(apply(: apply runs our function on each row and returns TRUE or FALSE for each one. Wrapping it in the sum command that we’re well familiar with adds up the TRUEs, which gives us our count
- gsc["Impressions"],: This tells R that we want to use this on our gsc dataset and that we want to focus on the Impressions column
- 1: The number 1 specifies that we want to use this on rows. If we wanted it to be across columns, we’d use 2
- function(x) x >= 20)): Finally, we’re using a very simple function that we’re applying across the rows to see if the value is greater than or equal to 20
Pretty handy, right? It gives the same output as our first loop from part 7, but in a lot less code.
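It’s worth knowing that, for a simple comparison like this one, R can also run the test over the whole column at once without apply. The sketch below compares the two on a small stand-in for the gsc dataframe. The queries and numbers are invented for the example, so swap in your own data.

```r
# Hypothetical stand-in for the Search Console export (values are made up)
gsc <- data.frame(
  Query = c("r for seo", "apply in r", "lapply example", "seo tools"),
  Impressions = c(12, 20, 150, 3),
  stringsAsFactors = FALSE
)

# The apply version from above: one TRUE/FALSE per row, then sum the TRUEs
kwCountApply <- sum(apply(gsc["Impressions"], 1, function(x) x >= 20))

# Vectorised version: the comparison already runs over the whole column
kwCountVector <- sum(gsc$Impressions >= 20)

kwCountApply   # 2
kwCountVector  # 2
```

Both give the same count. The apply version is still a good way to learn the pattern, because it carries over to functions that can’t be vectorised so easily.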
Now let’s take a look at lapply.
Using R’s Lapply On A List or Dataframe
Lapply is one of R’s more popular apply methods and one that I use much more heavily than the regular apply command. Lapply runs across a list or vector and is particularly good for using across multiple API calls or URLs, which is, naturally, very handy for SEO analysis.
Let’s take a look at how we can use lapply to replicate our “list” loop from part 7, where we use it to subset our Google Search Console dataset to show data for keywords with 20 or more impressions.
Again, obviously, we’d use subset from part 1 for this most of the time, but it’s a good first example for using lapply.
kw20 <- reduce(lapply(seq_along(gsc$Impressions), function(x) if (gsc$Impressions[x] >= 20) gsc[x, ] else NULL), bind_rows)
Here, we’re using the reduce function from the Tidyverse, together with bind_rows, to do the job of do.call(rbind, ...) in less code.
This will give you a dataframe called kw20 which only includes keywords, impressions, clicks and CTR where the impressions are 20 or more.
Run that in your console and now, as always, let’s break it down.
Our Lapply Command Broken Down
Again, we’ve replicated our loop in less code by using an apply method, lapply in this case.
Hopefully you’re starting to see the power of the apply family for writing less code but getting the same results. We’ve trimmed it down even further by using the Tidyverse’s reduce function. You’re probably seeing why I never start R without the Tidyverse.
Let’s take a look at how it works:
- kw20 <- reduce(lapply(: We’re creating an object called kw20 and invoking the Tidyverse’s reduce command before calling lapply. Reduce takes the list that lapply returns and combines its elements, one pair at a time, into a single output
- seq_along(gsc$Impressions: As with our previous loop, we’re using seq_along on the Impressions column of our gsc dataset to apply our function across every row
- function(x): We’re creating our very simple function with the variable x, similar to how we’ve done all the way through this series
- if (gsc$Impressions[x] >= 20) gsc[x, ] else NULL),: If you cast your mind back to part 5 where we talked about if statements in R, this won’t be too unfamiliar. Our function is a very simple if statement, where we’re seeing if our Impressions value in the specific row of the apply is greater than or equal to 20. If it’s not, the else is returning NULL, which means nothing
- bind_rows): Finally, we’re using the bind_rows command from dplyr, part of the Tidyverse, to stack each row that passed our test onto the output dataset. The NULLs are simply skipped
So there we have it – a really simple lapply command to run a simple function across multiple rows of a dataset.
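If you want to see what lapply hands back before reduce gets involved, here’s a small sketch using only base R. The gsc dataframe is a made-up stand-in, and rowsList, kw20Base and kw20Subset are names I’ve picked just for this example.

```r
# Hypothetical stand-in for the Search Console export (values are made up)
gsc <- data.frame(
  Query = c("r for seo", "apply in r", "lapply example", "seo tools"),
  Impressions = c(12, 20, 150, 3),
  stringsAsFactors = FALSE
)

# lapply returns a list with one element per row:
# either a one-row dataframe or NULL
rowsList <- lapply(seq_along(gsc$Impressions), function(x) {
  if (gsc$Impressions[x] >= 20) gsc[x, ] else NULL
})

# Base R alternative to reduce(..., bind_rows): rbind skips the NULLs
kw20Base <- do.call(rbind, rowsList)

# Logical subsetting picks out the same rows directly
kw20Subset <- gsc[gsc$Impressions >= 20, ]

length(rowsList)  # 4, one element per row of gsc
kw20Base$Query    # "apply in r" "lapply example"
```

The list is always as long as the input, which is why we need that final step to squash it back down into a dataframe.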
Now let’s look at sapply and how it’s similar, but also a little different.
Using R’s Sapply On A List or Dataframe
Sapply, the “simplifying” apply, can work very similarly to lapply in a lot of instances, but it tries to simplify the output into a vector or matrix, rather than a list. I find that when I use reduce and bind_rows from the Tidyverse, I get better results from lapply than sapply.
Either way, let’s run through how we can do the same command that we just did on lapply using sapply instead.
kw20 <- reduce(sapply(seq_along(gsc$Impressions), function(x) if (gsc$Impressions[x] >= 20) gsc[x, ], simplify = FALSE), bind_rows)
If this runs the way it should, it’ll give you the same output in kw20 as the lapply command did, but there are some differences in the command that we want to look at.
Our Sapply Command
As you’ll see from looking at the code, the command we’ve used here is pretty much exactly the same as our lapply command, with one key difference: we have the following parameter in there:
simplify = FALSE
Since the core tenet of sapply is to simplify, it tries to turn its output into a vector or matrix. Here, each result is a whole dataframe row (or NULL) that we want to keep intact for bind_rows, so we set simplify = FALSE to get a plain list back, just as lapply would give us.
However, if you’re trying to coerce a list or matrix into something a bit easier to work with, sapply is a great choice. I find myself mostly using it when I’m working across JSON outputs, but for this example, you can see that it works quite nicely in the same way our lapply method did.
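To show what that simplifying actually looks like, here’s a quick sketch on a made-up vector of impressions. The object names are just for this example.

```r
# Made-up values for illustration
impressions <- c(12, 20, 150, 3)

# lapply always returns a list, one element per input
flagsList <- lapply(impressions, function(x) x >= 20)

# sapply simplifies length-one results into a plain vector
flagsVector <- sapply(impressions, function(x) x >= 20)

# When each result has length 2, sapply simplifies to a matrix instead
rangeMatrix <- sapply(impressions, function(x) c(low = x * 0.9, high = x * 1.1))

# simplify = FALSE switches the simplification off, matching lapply
flagsUnsimplified <- sapply(impressions, function(x) x >= 20, simplify = FALSE)

class(flagsList)          # "list"
class(flagsVector)        # "logical"
dim(rangeMatrix)          # 2 4
class(flagsUnsimplified)  # "list"
```

So the function is the same each time. The only thing sapply changes is the container the results come back in.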
I’m sure you’re seeing plenty of ways that R’s apply family can be used for SEO by now. Let’s take a look at the most complex of the apply methods now – the powerful mapply().
Using R’s Mapply Method
Mapply allows you to run apply commands across multiple dataframe elements or vectors and is really handy when you need to do this. It’s a very powerful function and allows you to do some very complex analysis work with very little code. Let’s use a very simple example to identify a difference in our click through rates against a target, using our Google Search Console data.
Firstly, if you’ve imported your Google Search Console dataset from a CSV export rather than using the API, you’ll need to do a little preparation to remove the percentage character, set the column as a number and divide it by 100. This simple function below will do that:
ctrCleanup <- function(x){
  x <- gsub("%", "", x)
  x <- as.numeric(x)
  x <- x/100
}
You’ll have seen all of this throughout the series, but let’s break it down anyway:
- ctrCleanup <- function(x){: Our function is called ctrCleanup (catchy, right?) and has an x variable
- x <- gsub("%", "", x): On x, (our CTR column), we’re running gsub to find and replace the % character with nothing. Removing it, effectively
- x <- as.numeric(x): Now we’re setting that column as a number rather than the character it was previously since we’ve removed the % character
- x <- x/100: Finally, we’re dividing it by 100 so we can get our percentage as a decimal
Run this in your console with the following command:
gsc$CTR <- ctrCleanup(gsc$CTR)
and you’ll have your CTR column as decimals, ready to work with our mapply command.
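If you’d like to check the cleanup before running it on your real data, here’s a small sketch on a few made-up CTR strings in the percentage format described above. I’ve ended the function on x / 100 rather than an assignment so the result prints when you call it directly, which is my own tweak for the example. rawCTR is just an illustrative name.

```r
ctrCleanup <- function(x) {
  x <- gsub("%", "", x)   # "4.5%" becomes "4.5"
  x <- as.numeric(x)      # "4.5" becomes the number 4.5
  x / 100                 # 4.5 becomes 0.045
}

# Made-up CTR strings, not real Search Console data
rawCTR <- c("4.5%", "12%", "0%")
cleanCTR <- ctrCleanup(rawCTR)

cleanCTR  # 0.045 0.120 0.000
```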
Now we want to define our target CTR. Let’s take 5%.
targetCTR <- 0.05
Now we’re prepared, let’s run our mapply command.
gsc$ctrDiff <- mapply(function(x, y) abs(x - y), gsc$CTR, targetCTR)
This will create a new column in our dataset called ctrDiff with the difference between our actual click through rate and our target.
Our Mapply Command Broken Down
As always, let’s break it down:
- gsc$ctrDiff <- mapply(: We’re creating a new column called ctrDiff in our gsc dataframe and using mapply to run our function
- function(x, y) abs(x - y): Our function uses x and y parameters (our two data elements) and uses abs to find the absolute difference between them
- gsc$CTR, targetCTR): Finally, we’re defining our x and y variables – our actual click through rate from Google Search Console and our target CTR
Running this will create a new column in gsc and will give you the difference by query from your target CTR. Hopefully it’ll give you some idea of where you need to work on with your SEO efforts. Or maybe not, but at least it gives you an idea of how mapply works!
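Where mapply really earns its keep is when both inputs are full vectors. Here’s a sketch with made-up numbers: first the single target from our command, then a hypothetical per-query targets vector that I’ve invented for the example.

```r
# Made-up values for illustration
ctr <- c(0.02, 0.05, 0.11)
targetCTR <- 0.05

# Single target: mapply recycles the length-one targetCTR for every row
ctrDiff <- mapply(function(x, y) abs(x - y), ctr, targetCTR)

# With one fixed target, plain vectorised arithmetic gives the same result
ctrDiffVector <- abs(ctr - targetCTR)

# Hypothetical per-query targets: mapply pairs the vectors element by element
targets <- c(0.03, 0.05, 0.08)
ctrDiffPerQuery <- mapply(function(x, y) abs(x - y), ctr, targets)

ctrDiff          # 0.03 0.00 0.06
ctrDiffPerQuery  # 0.01 0.00 0.03
```

In the second call, the first CTR is compared with the first target, the second with the second, and so on, which is the “multiple elements” part of mapply.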
So that’s the basics of how the different apply methods in R can work. There’s a common thread to the anatomy of them, isn’t there?
Let’s look at that now.
The Anatomy of R’s Apply Methods
In general, the apply family has a common thread of how they work – a common anatomy. You’ll find yourself working around this as you go further in your R journey, but you’ll generally look at it like so:
dataframe <- applyMethod(data, function, extra parameters)
It seems a little backwards, doesn’t it? Usually we call our function before the data. Truthfully, I don’t fully know why it’s done this way, but I suspect it’s to do with the fact that we’re defining the data to apply the function to before we start actually applying it. The one exception is mapply, which takes the function first and the data after it, as you saw in our command earlier. All programming languages have these fun little elements to them, and I think this is one of R’s.
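Here’s that anatomy as a quick sketch, with each of the four called on toy data so you can compare the argument order. The values and object names are made up for the example.

```r
# Toy data, made up for illustration
x <- c(1, 2, 3)
m <- matrix(1:6, nrow = 2)

rowSums2 <- apply(m, 1, sum)                     # data, margin, function
doubledL <- lapply(x, function(v) v * 2)         # data, function
doubledS <- sapply(x, function(v) v * 2)         # data, function
added    <- mapply(function(a, b) a + b, x, x)   # function first, then the data

# Extra parameters after the function are passed along to it
rounded <- sapply(c(1.234, 5.678), round, digits = 1)

rowSums2  # 9 12
doubledS  # 2 4 6
added     # 2 4 6
rounded   # 1.2 5.7
```

The last call shows the “extra parameters” slot: digits = 1 isn’t used by sapply itself, it’s handed straight on to round.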
I think this might actually be the shortest of the posts I’ve written in this series, but don’t let that be an indication of the power of R’s apply family – they are absolutely vital in R programming and an essential element in using R for SEO.
Until next time, where we’ll be talking about web scraping in R.
Our Code From Today
# Install Packages
install.packages("tidyverse")
library(tidyverse)

# Read In Data
gsc <- read.csv("Queries.csv", stringsAsFactors = FALSE)

# Apply Methods

## Apply()
kwCount <- sum(apply(gsc["Impressions"], 1, function(x) x >= 20))

## Lapply()
kw20 <- reduce(lapply(seq_along(gsc$Impressions), function(x) if (gsc$Impressions[x] >= 20) gsc[x, ] else NULL), bind_rows)

## Sapply()
kw20 <- reduce(sapply(seq_along(gsc$Impressions), function(x) if (gsc$Impressions[x] >= 20) gsc[x, ], simplify = FALSE), bind_rows)

## Mapply
ctrCleanup <- function(x){
  x <- gsub("%", "", x)
  x <- as.numeric(x)
  x <- x/100
}

gsc$CTR <- ctrCleanup(gsc$CTR)

targetCTR <- 0.05

gsc$ctrDiff <- mapply(function(x, y) abs(x - y), gsc$CTR, targetCTR)