The Smart Chef

Search for number in string

Data is messy, and numbers are often stored inside text. Extracting numbers from strings in a dataframe column can help you clean, analyze, visualize, and model data more effectively.

To check whether the values in a character column of an R dataframe contain a number, you can use the grepl() function. The grepl() function returns a logical vector indicating, for each element of a character vector, whether a pattern is found.

Here is an example code snippet that demonstrates how to use grepl() to identify when a number is present in a character column in an R dataframe:

  

# Create a sample dataframe
df <- data.frame(col1 = c("abc", "123", "def", "456", "ghi"),
                 col2 = c("jkl", "mno", "789", "pqr", "stu"))

# Identify rows where col2 contains a number
has_number <- grepl("[0-9]", df$col2)

# View the rows where col2 contains a number
df[has_number,]

  

In this example, the grepl() function is used to search for any digits in the col2 column of the df dataframe. The resulting logical vector has_number is TRUE for rows that contain a number in col2. Finally, the subset of df where has_number is TRUE is displayed to show which rows contain a number in col2.
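A related check is telling apart strings that merely contain a digit from strings made up entirely of digits. The following is a minimal sketch of that idea; the vector x and the variable names are hypothetical and not part of the example above.

```r
# Hypothetical sample values, not taken from the dataframe above
x <- c("abc", "123", "a1b", "")

# TRUE when at least one digit appears anywhere in the string
contains_digit <- grepl("[0-9]", x)

# TRUE only when the whole string consists of digits
all_digits <- grepl("^[0-9]+$", x)

# Count how many values contain a digit
n_with_digit <- sum(contains_digit)
```

Anchoring the pattern with ^ and $ is one way to decide whether a value could be converted to a number without losing any text.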

In addition to determining whether a number exists in a string, what if you also want to know where in the string the number(s) are?

If you want to identify where in the string the number is located, you can use the gregexpr() function in addition to grepl(). The gregexpr() function returns a list with one element per string, and each element holds the starting positions of every match of the pattern in that string (or -1 if there is no match).

Here is an example code snippet that demonstrates how to use grepl() and gregexpr() to identify where in the string a number is located in a character column in an R dataframe:

  
# Create a sample dataframe
df <- data.frame(col1 = c("abc", "123", "def", "456", "ghi"),
                 col2 = c("jkl", "mno", "789", "pqr", "stu"))

# Identify rows where col2 contains a number
has_number <- grepl("[0-9]", df$col2)

# Get the starting position of every digit in each string (a list; -1 means no match)
num_position <- gregexpr("[0-9]", df$col2)

# View the rows and position where col2 contains a number
df[has_number,]
num_position[has_number]


  

In this example, the grepl() function is used to search for any digits in the col2 column of the df dataframe, as in the previous example. The resulting logical vector has_number is TRUE for rows that contain a number in col2.

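Then, the gregexpr() function is used to get the positions of the digits in each string. The resulting num_position object is a list with one element per string. Each element contains the starting position of every match in that string, not just the first, and is -1 when the string has no match. Because the pattern [0-9] matches a single digit, every digit is reported as its own match.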

Finally, the subset of df where has_number is TRUE is displayed to show which rows contain a number in col2, and the matching elements of the num_position list are displayed to show the positions of the digits in each of these strings.
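If you only need the position of the first number in each string, regexpr() may be a better fit than gregexpr(). The sketch below assumes a hypothetical vector x and uses the pattern [0-9]+ so that a run of digits counts as one match; regmatches() then pulls out the matched text.

```r
# Hypothetical sample strings
x <- c("jkl", "ab789", "12cd34")

# regexpr() reports only the first match per string; -1 means no match
first_match <- regexpr("[0-9]+", x)

# Starting position and length of the first run of digits
start_pos <- as.integer(first_match)
run_length <- attr(first_match, "match.length")

# regmatches() returns the matched text, skipping strings with no match
first_text <- regmatches(x, first_match)

# gregexpr() reports every match; here, the start of each run of digits
all_starts <- lapply(gregexpr("[0-9]+", x), as.integer)
```

Note that first_text is shorter than x because strings without a match are dropped, so it should not be assigned back to a dataframe column without accounting for the missing rows.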

Going one step further, you may want to look at not only the first number pulled from a string, but also any additional numbers.

To extract the values of the first and second numbers in a string column of a dataframe in R, you can use regular expressions and the str_extract_all() function from the stringr package. Here's an example code snippet:

  
# Load the stringr package
library(stringr)

# Create a sample dataframe
df <- data.frame(col1 = c("abc1def2ghi", "123jkl456mno", "pqr789stu"))

# Extract the first and second numbers from each string
numbers <- str_extract_all(df$col1, "\\d+")
first_numbers <- sapply(numbers, "[", 1)
second_numbers <- sapply(numbers, "[", 2)

# Add the first and second numbers to the original dataframe
df$first_number <- first_numbers
df$second_number <- second_numbers

# View the updated dataframe
df


  

In this example, we first load the stringr package. We then create a sample dataframe df with a column of strings containing numbers.

We use str_extract_all() to extract all of the numbers from each string in the col1 column, using the regular expression \\d+. This matches one or more digits in each string. The resulting numbers object is a list, where each element contains a vector of the numbers extracted from each string.

We then use sapply() to extract the first and second numbers from each string, using the [ function to extract the first or second element of each vector in the numbers object. The resulting first_numbers and second_numbers objects are character vectors holding the first and second numbers from each string in df$col1. When a string contains fewer than two numbers, as in "pqr789stu", the corresponding entry of second_numbers is NA.

Finally, we add the first_numbers and second_numbers vectors to the original dataframe df as new columns, first_number and second_number, respectively. We then view the updated dataframe to see the extracted values.
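The extracted values are still text, so they usually need converting before any arithmetic. As a rough sketch, the same extraction can also be done in base R with regmatches() and gregexpr(), followed by as.numeric(). The helper nth_number() is a hypothetical name introduced here for illustration.

```r
# Same sample strings as in the example above
x <- c("abc1def2ghi", "123jkl456mno", "pqr789stu")

# Base R counterpart to str_extract_all(): a list of character vectors
numbers <- regmatches(x, gregexpr("[0-9]+", x))

# Hypothetical helper: pick the n-th match from each string and convert it
# to numeric; strings with fewer than n matches give NA
nth_number <- function(matches, n) {
  as.numeric(vapply(matches, function(m) m[n], character(1)))
}

first_numbers <- nth_number(numbers, 1)
second_numbers <- nth_number(numbers, 2)
```

Using vapply() with character(1) instead of sapply() makes the helper fail loudly if an element is not a single string, which can be easier to debug on messy data.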