String Searching and Manipulation
R has extensive tools for searching for text (string) patterns and performing substitutions. These can be very helpful in manipulation of large data sets.
These functions can search for exact matches or can make use of an extensive syntax for wildcards, known as regular expressions. I will first describe the search functions and then explain regular expressions.
In this exercise, type all of the code blocks into your computer to see what happens! Experiment as you go!
Search Functions
grep()
Use grep() when you want to search for a single pattern in a vector of strings.
# for example: which state names have 'i' in them?
data(state)
state.name
## [1] "Alabama" "Alaska" "Arizona" "Arkansas"
## [5] "California" "Colorado" "Connecticut" "Delaware"
## [9] "Florida" "Georgia" "Hawaii" "Idaho"
## [13] "Illinois" "Indiana" "Iowa" "Kansas"
## [17] "Kentucky" "Louisiana" "Maine" "Maryland"
## [21] "Massachusetts" "Michigan" "Minnesota" "Mississippi"
## [25] "Missouri" "Montana" "Nebraska" "Nevada"
## [29] "New Hampshire" "New Jersey" "New Mexico" "New York"
## [33] "North Carolina" "North Dakota" "Ohio" "Oklahoma"
## [37] "Oregon" "Pennsylvania" "Rhode Island" "South Carolina"
## [41] "South Dakota" "Tennessee" "Texas" "Utah"
## [45] "Vermont" "Virginia" "Washington" "West Virginia"
## [49] "Wisconsin" "Wyoming"
# in its basic form, grep returns the position (index) of the items that
# match the pattern.
grep(pattern = "i", x = state.name, ignore.case = T)
## [1] 3 5 7 9 10 11 12 13 14 15 18 19 22 23 24 25 29 31 33 35 38 39 40
## [24] 46 47 48 49 50
# you can also ask grep to return the values that match
grep(pattern = "i", x = state.name, ignore.case = T, value = T)
## [1] "Arizona" "California" "Connecticut" "Florida"
## [5] "Georgia" "Hawaii" "Idaho" "Illinois"
## [9] "Indiana" "Iowa" "Louisiana" "Maine"
## [13] "Michigan" "Minnesota" "Mississippi" "Missouri"
## [17] "New Hampshire" "New Mexico" "North Carolina" "Ohio"
## [21] "Pennsylvania" "Rhode Island" "South Carolina" "Virginia"
## [25] "Washington" "West Virginia" "Wisconsin" "Wyoming"
The variant grepl() returns a logical vector of TRUE and FALSE values indicating whether or not a match occurred.
grepl(pattern = "i", x = state.name, ignore.case = T)
## [1] FALSE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE TRUE TRUE
## [12] TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE FALSE FALSE TRUE
## [23] TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE TRUE FALSE TRUE
## [34] FALSE TRUE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
## [45] FALSE TRUE TRUE TRUE TRUE TRUE
# recalling that TRUE = 1 and FALSE = 0, we can use grepl to quickly
# determine the number of states that have 'i' in them.
sum(grepl(pattern = "i", x = state.name, ignore.case = T))
## [1] 28
Exercise 1: Can you figure out how to count the number of states that do not have an “i” in their name (without changing the search pattern)? There are at least two easy ways.
sum(!grepl(pattern = "i", x = state.name, ignore.case = T))
## [1] 22
length(grep(pattern = "i", x = state.name, ignore.case = T, invert = T))
## [1] 22
match()
Use match() when you have two sets of values (let's call them Set A and Set B) and you want to know the (first) positions in Set B that match something in Set A. match() only finds exact matches and cannot be used with regular expressions.
# we have a list of favorite fruits and a list of citrus. Which of the
# favorites are citrus?
favorites <- c("peach", "banana", "blueberry", "orange", "plum", "strawberry",
"mandarin")
citrus <- c("kumquat", "grapefruit", "orange", "mandarin", "orange", "tangerine",
"tangelo", "lemon", "lime")
match(favorites, citrus) #what do the numbers returned refer to?
## [1] NA NA NA 3 NA NA 4
match() has an alternative form which I find more convenient: %in%. This form returns TRUEs and FALSEs indicating whether or not each element of Set A is in Set B.
favorites %in% citrus
## [1] FALSE FALSE FALSE TRUE FALSE FALSE TRUE
# most useful is to use this in combination with square brackets to
# extract
favorites[favorites %in% citrus]
## [1] "orange" "mandarin"
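A natural extension (my own example, reusing the vectors above): negate %in% with "!" to extract the favorites that are not citrus.

```r
# the same vectors as above, repeated so this chunk runs on its own
favorites <- c("peach", "banana", "blueberry", "orange", "plum", "strawberry",
    "mandarin")
citrus <- c("kumquat", "grapefruit", "orange", "mandarin", "orange", "tangerine",
    "tangelo", "lemon", "lime")
# ! negates the logical vector, so we keep the non-matches
favorites[!favorites %in% citrus]
## [1] "peach"      "banana"     "blueberry"  "plum"       "strawberry"
```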
sub()
sub() performs text substitutions. Think of it as find and replace.
# let's say that we wanted to replace 'w' (or 'W') with the phrase '_DoubleU_'
sub("w", "_DoubleU_", state.name, ignore.case = T)
## [1] "Alabama" "Alaska"
## [3] "Arizona" "Arkansas"
## [5] "California" "Colorado"
## [7] "Connecticut" "Dela_DoubleU_are"
## [9] "Florida" "Georgia"
## [11] "Ha_DoubleU_aii" "Idaho"
## [13] "Illinois" "Indiana"
## [15] "Io_DoubleU_a" "Kansas"
## [17] "Kentucky" "Louisiana"
## [19] "Maine" "Maryland"
## [21] "Massachusetts" "Michigan"
## [23] "Minnesota" "Mississippi"
## [25] "Missouri" "Montana"
## [27] "Nebraska" "Nevada"
## [29] "Ne_DoubleU_ Hampshire" "Ne_DoubleU_ Jersey"
## [31] "Ne_DoubleU_ Mexico" "Ne_DoubleU_ York"
## [33] "North Carolina" "North Dakota"
## [35] "Ohio" "Oklahoma"
## [37] "Oregon" "Pennsylvania"
## [39] "Rhode Island" "South Carolina"
## [41] "South Dakota" "Tennessee"
## [43] "Texas" "Utah"
## [45] "Vermont" "Virginia"
## [47] "_DoubleU_ashington" "_DoubleU_est Virginia"
## [49] "_DoubleU_isconsin" "_DoubleU_yoming"
Exercise 2: Try replacing all of the “i”s with “y”s. Are the results what you expected? Why or why not?
sub("i", "y", state.name, ignore.case = T)
## [1] "Alabama" "Alaska" "Aryzona" "Arkansas"
## [5] "Calyfornia" "Colorado" "Connectycut" "Delaware"
## [9] "Floryda" "Georgya" "Hawayi" "ydaho"
## [13] "yllinois" "yndiana" "yowa" "Kansas"
## [17] "Kentucky" "Louysiana" "Mayne" "Maryland"
## [21] "Massachusetts" "Mychigan" "Mynnesota" "Myssissippi"
## [25] "Myssouri" "Montana" "Nebraska" "Nevada"
## [29] "New Hampshyre" "New Jersey" "New Mexyco" "New York"
## [33] "North Carolyna" "North Dakota" "Ohyo" "Oklahoma"
## [37] "Oregon" "Pennsylvanya" "Rhode ysland" "South Carolyna"
## [41] "South Dakota" "Tennessee" "Texas" "Utah"
## [45] "Vermont" "Vyrginia" "Washyngton" "West Vyrginia"
## [49] "Wysconsin" "Wyomyng"
# Only the first 'i' in each name is replaced
gsub()
sub() only replaces the first occurrence of the pattern. gsub() will replace all occurrences.
gsub("i", "y", state.name, ignore.case = T)
## [1] "Alabama" "Alaska" "Aryzona" "Arkansas"
## [5] "Calyfornya" "Colorado" "Connectycut" "Delaware"
## [9] "Floryda" "Georgya" "Hawayy" "ydaho"
## [13] "yllynoys" "yndyana" "yowa" "Kansas"
## [17] "Kentucky" "Louysyana" "Mayne" "Maryland"
## [21] "Massachusetts" "Mychygan" "Mynnesota" "Myssyssyppy"
## [25] "Myssoury" "Montana" "Nebraska" "Nevada"
## [29] "New Hampshyre" "New Jersey" "New Mexyco" "New York"
## [33] "North Carolyna" "North Dakota" "Ohyo" "Oklahoma"
## [37] "Oregon" "Pennsylvanya" "Rhode ysland" "South Carolyna"
## [41] "South Dakota" "Tennessee" "Texas" "Utah"
## [45] "Vermont" "Vyrgynya" "Washyngton" "West Vyrgynya"
## [49] "Wysconsyn" "Wyomyng"
# compare to just using sub()
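A quick sanity check of the difference (my own addition): after sub() some names still contain an 'i', but after gsub() none do.

```r
data(state)
# sub() replaces only the first 'i' per name, so some 'i's survive
any(grepl("i", sub("i", "y", state.name, ignore.case = TRUE), ignore.case = TRUE))
## [1] TRUE
# gsub() replaces every 'i', so none survive
any(grepl("i", gsub("i", "y", state.name, ignore.case = TRUE), ignore.case = TRUE))
## [1] FALSE
```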
Regular Expressions
Regular expressions (regexps) are a way of specifying wildcards in the search operations described above. The same syntax (with some modifications) is used in many computer languages and tools, including Unix/Linux command-line tools, Perl, Python, and, very helpfully, SublimeText2.
Character classes I
Regexp have codes to match certain classes of characters.
- . Matches any single character
- \w Matches any character that would be found in a “word” including digits (excludes punctuation and white space)
- \W Is the opposite of \w and matches any non-word character
- \d Matches any digit character
- \D Matches any non-digit character
- \s Matches any white space character
- \S Matches any non space character
The following match specific characters or locations but are worth mentioning here:
- ^ Matches the beginning of a line
- $ Matches the end of a line
- \t tab character
- \n return character
Unless you use additional modifications described below, these match a single character.
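One detail worth noting: in an R pattern string the backslash itself must be escaped, so the class \d is typed as "\\d". A small sketch with a made-up vector:

```r
# 'x' is just a toy vector for illustration
x <- c("route 66", "abc", "  ", "3.14")
grepl("\\d", x)  # which elements contain a digit?
## [1]  TRUE FALSE FALSE  TRUE
grepl("^\\S", x)  # which elements begin with a non-space character?
## [1]  TRUE  TRUE FALSE  TRUE
```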
How would we find state names that have two “a”s separated by a single additional character?
data(state)
grep("a.a", state.name, value = T, ignore.case = T)
## [1] "Alabama" "Alaska" "Delaware" "Hawaii" "Indiana" "Louisiana"
## [7] "Montana" "Nevada"
Exercise 3: Find state names that have a space in their names.
grep(" ", state.name, value = T)
## [1] "New Hampshire" "New Jersey" "New Mexico" "New York"
## [5] "North Carolina" "North Dakota" "Rhode Island" "South Carolina"
## [9] "South Dakota" "West Virginia"
grep("\\s", state.name, value = T)
## [1] "New Hampshire" "New Jersey" "New Mexico" "New York"
## [5] "North Carolina" "North Dakota" "Rhode Island" "South Carolina"
## [9] "South Dakota" "West Virginia"
Exercise 4: Find state names that begin with “M”
grep("^M", state.name, value = T)
## [1] "Maine" "Maryland" "Massachusetts" "Michigan"
## [5] "Minnesota" "Mississippi" "Missouri" "Montana"
Character classes II
An alternative way of specifying character classes is to enclose them in square brackets.
For example, to find all of the state names that end with a vowel:
grep("[aeiou]$", state.name, value = T)
## [1] "Alabama" "Alaska" "Arizona" "California"
## [5] "Colorado" "Delaware" "Florida" "Georgia"
## [9] "Hawaii" "Idaho" "Indiana" "Iowa"
## [13] "Louisiana" "Maine" "Minnesota" "Mississippi"
## [17] "Missouri" "Montana" "Nebraska" "Nevada"
## [21] "New Hampshire" "New Mexico" "North Carolina" "North Dakota"
## [25] "Ohio" "Oklahoma" "Pennsylvania" "South Carolina"
## [29] "South Dakota" "Tennessee" "Virginia" "West Virginia"
The ^ sign, when it is the first character inside square brackets, inverts the match: the class then matches any character not listed.
Exercise 5: Find state names that begin with non-vowels.
grep("^[^aeiou]", state.name, value = T, ignore.case = T)
## [1] "California" "Colorado" "Connecticut" "Delaware"
## [5] "Florida" "Georgia" "Hawaii" "Kansas"
## [9] "Kentucky" "Louisiana" "Maine" "Maryland"
## [13] "Massachusetts" "Michigan" "Minnesota" "Mississippi"
## [17] "Missouri" "Montana" "Nebraska" "Nevada"
## [21] "New Hampshire" "New Jersey" "New Mexico" "New York"
## [25] "North Carolina" "North Dakota" "Pennsylvania" "Rhode Island"
## [29] "South Carolina" "South Dakota" "Tennessee" "Texas"
## [33] "Vermont" "Virginia" "Washington" "West Virginia"
## [37] "Wisconsin" "Wyoming"
You can specify ranges of characters inside square brackets:
- [0-9] : all digits
- [a-j] : the first 10 (lowercase) letters of the alphabet
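For example, a range in action: state names beginning with a letter between A and C.

```r
data(state)
# ^ anchors to the start; [A-C] matches one uppercase letter from A to C
grep("^[A-C]", state.name, value = T)
## [1] "Alabama"     "Alaska"      "Arizona"     "Arkansas"    "California"
## [6] "Colorado"    "Connecticut"
```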
There are also a number of predefined character classes. For example:
[:punct:] Punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
See ?regex for additional classes.
Perhaps confusingly, these classes must themselves be placed within square brackets:
grep("[[:space:]]", state.name, value = T)
## [1] "New Hampshire" "New Jersey" "New Mexico" "New York"
## [5] "North Carolina" "North Dakota" "Rhode Island" "South Carolina"
## [9] "South Dakota" "West Virginia"
Specifying repeats
You can specify the number of times that a character or character class is repeated.
- ? The preceding item is optional and will be matched at most once.
- * The preceding item will be matched zero or more times.
- + The preceding item will be matched one or more times.
- {n} The preceding item is matched exactly n times.
- {n,} The preceding item is matched n or more times.
- {n,m} The preceding item is matched at least n times, but not more than m times.
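Before the exercise, a toy illustration (my own vector) of the difference between ?, *, and +:

```r
x <- c("color", "colour", "colouur")
grepl("colou?r", x)  # zero or one 'u' between 'colo' and 'r'
## [1]  TRUE  TRUE FALSE
grepl("colou*r", x)  # zero or more 'u's
## [1] TRUE TRUE TRUE
grepl("colou+r", x)  # one or more 'u's
## [1] FALSE  TRUE  TRUE
```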
Exercise 6: Find the state names that have two consecutive vowels in them.
grep("[aeiou]{2}", state.name, value = T)
## [1] "California" "Georgia" "Hawaii" "Illinois"
## [5] "Indiana" "Louisiana" "Maine" "Missouri"
## [9] "Ohio" "Pennsylvania" "South Carolina" "South Dakota"
## [13] "Tennessee" "Virginia" "West Virginia"
Exercise 7: Using last week's data set, count the number of babies with the name Stacey or Stacy. Bonus 1: count Stacey, Stacy, or Staci. Bonus 2: list the unique set of names that start with “Stac”. The data set can be downloaded here
bnames <- read.csv("~/Documents/Teaching/RClub/plyr-tutorial/examples/bnames.csv",
as.is = T)
# check it first:
unique(grep("Stace?y", bnames$name, value = T))
## [1] "Stacy" "Stacey"
sum(grepl("Stace?y", bnames$name))
## [1] 228
# Bonus1:
unique(grep("Stace?[iy]$", bnames$name, value = T))
## [1] "Stacy" "Stacey" "Staci"
sum(grepl("Stace?[iy]$", bnames$name))
## [1] 262
# Bonus2:
unique(grep("Stac", bnames$name, value = T))
## [1] "Stacy" "Stacey" "Stacia" "Stacie" "Staci"
Alternates
The “|” character can be read as “or” and can be used to specify alternates.
# Example: change the path below to point to 'bnames.csv' on your computer
bnames <- read.csv("~/Documents/Teaching/RClub/plyr-tutorial/examples/bnames.csv",
as.is = T)
bnames[grep("Stac(y|i)", bnames$name), ]
## year name percent sex
## 897 1880 Stacy 0.000051 boy
## 5925 1885 Stacy 0.000052 boy
## 11834 1891 Stacy 0.000064 boy
## 12856 1892 Stacy 0.000061 boy
## 13896 1893 Stacy 0.000058 boy
## 14847 1894 Stacy 0.000064 boy
## 18796 1898 Stacy 0.000068 boy
## 23839 1903 Stacy 0.000070 boy
## 27928 1907 Stacy 0.000063 boy
## 28610 1908 Stacy 0.000114 boy
## 31921 1911 Stacy 0.000058 boy
## 39985 1919 Stacy 0.000047 boy
## 40842 1920 Stacy 0.000059 boy
## 55957 1935 Stacy 0.000042 boy
## 56923 1936 Stacy 0.000042 boy
## 59972 1939 Stacy 0.000038 boy
## 60925 1940 Stacy 0.000040 boy
## 61883 1941 Stacy 0.000041 boy
## 63950 1943 Stacy 0.000034 boy
## 64858 1944 Stacy 0.000040 boy
## 65808 1945 Stacy 0.000044 boy
## 66934 1946 Stacy 0.000032 boy
## 67867 1947 Stacy 0.000035 boy
## 68833 1948 Stacy 0.000038 boy
## 69689 1949 Stacy 0.000055 boy
## 70677 1950 Stacy 0.000058 boy
## 71656 1951 Stacy 0.000060 boy
## 72678 1952 Stacy 0.000054 boy
## 73675 1953 Stacy 0.000056 boy
## 74638 1954 Stacy 0.000064 boy
## 75640 1955 Stacy 0.000065 boy
## 76595 1956 Stacy 0.000076 boy
## 77551 1957 Stacy 0.000089 boy
## 78502 1958 Stacy 0.000111 boy
## 79469 1959 Stacy 0.000134 boy
## 80464 1960 Stacy 0.000136 boy
## 81450 1961 Stacy 0.000150 boy
## 82398 1962 Stacy 0.000198 boy
## 83361 1963 Stacy 0.000228 boy
## 84340 1964 Stacy 0.000263 boy
## 85307 1965 Stacy 0.000319 boy
## 86284 1966 Stacy 0.000373 boy
## 87162 1967 Stacy 0.000932 boy
## 88157 1968 Stacy 0.000977 boy
## 89186 1969 Stacy 0.000816 boy
## 90211 1970 Stacy 0.000620 boy
## 91231 1971 Stacy 0.000565 boy
## 92234 1972 Stacy 0.000546 boy
## 93230 1973 Stacy 0.000567 boy
## 94241 1974 Stacy 0.000509 boy
## 95255 1975 Stacy 0.000484 boy
## 96306 1976 Stacy 0.000351 boy
## 97376 1977 Stacy 0.000254 boy
## 98428 1978 Stacy 0.000201 boy
## 99526 1979 Stacy 0.000145 boy
## 100564 1980 Stacy 0.000126 boy
## 101668 1981 Stacy 0.000090 boy
## 102753 1982 Stacy 0.000073 boy
## 103774 1983 Stacy 0.000068 boy
## 104714 1984 Stacy 0.000080 boy
## 105759 1985 Stacy 0.000071 boy
## 106767 1986 Stacy 0.000073 boy
## 107764 1987 Stacy 0.000074 boy
## 108808 1988 Stacy 0.000069 boy
## 109802 1989 Stacy 0.000074 boy
## 110818 1990 Stacy 0.000073 boy
## 111849 1991 Stacy 0.000071 boy
## 112871 1992 Stacy 0.000070 boy
## 113955 1993 Stacy 0.000062 boy
## 162922 1913 Stacia 0.000052 girl
## 165901 1916 Stacia 0.000053 girl
## 166916 1917 Stacia 0.000052 girl
## 167813 1918 Stacia 0.000065 girl
## 168953 1919 Stacia 0.000048 girl
## 198981 1949 Stacy 0.000048 girl
## 200866 1951 Stacy 0.000059 girl
## 201798 1952 Stacy 0.000068 girl
## 202665 1953 Stacy 0.000095 girl
## 203535 1954 Stacy 0.000139 girl
## 204528 1955 Stacy 0.000151 girl
## 205426 1956 Stacy 0.000223 girl
## 205975 1956 Stacie 0.000052 girl
## 206348 1957 Stacy 0.000323 girl
## 206760 1957 Stacie 0.000082 girl
## 207256 1958 Stacy 0.000558 girl
## 207710 1958 Stacie 0.000094 girl
## 208217 1959 Stacy 0.000713 girl
## 208624 1959 Stacie 0.000123 girl
## 209222 1960 Stacy 0.000699 girl
## 209556 1960 Stacie 0.000149 girl
## 210206 1961 Stacy 0.000809 girl
## 210512 1961 Stacie 0.000176 girl
## 211176 1962 Stacy 0.001098 girl
## 211471 1962 Stacie 0.000207 girl
## 211751 1962 Staci 0.000094 girl
## 212129 1963 Stacy 0.001641 girl
## 212389 1963 Stacie 0.000307 girl
## 212580 1963 Staci 0.000145 girl
## 213123 1964 Stacy 0.001768 girl
## 213404 1964 Stacie 0.000299 girl
## 213459 1964 Staci 0.000230 girl
## 213961 1964 Stacia 0.000062 girl
## 214119 1965 Stacy 0.001754 girl
## 214364 1965 Stacie 0.000340 girl
## 214410 1965 Staci 0.000282 girl
## 214851 1965 Stacia 0.000076 girl
## 215091 1966 Stacy 0.002149 girl
## 215359 1966 Stacie 0.000366 girl
## 215380 1966 Staci 0.000331 girl
## 215839 1966 Stacia 0.000081 girl
## 216074 1967 Stacy 0.002678 girl
## 216301 1967 Stacie 0.000486 girl
## 216316 1967 Staci 0.000437 girl
## 216822 1967 Stacia 0.000086 girl
## 217064 1968 Stacy 0.003152 girl
## 217252 1968 Stacie 0.000631 girl
## 217277 1968 Staci 0.000571 girl
## 217737 1968 Stacia 0.000105 girl
## 218053 1969 Stacy 0.003681 girl
## 218209 1969 Stacie 0.000853 girl
## 218264 1969 Staci 0.000624 girl
## 218668 1969 Stacia 0.000125 girl
## 219042 1970 Stacy 0.004245 girl
## 219165 1970 Stacie 0.001105 girl
## 219268 1970 Staci 0.000626 girl
## 219581 1970 Stacia 0.000167 girl
## 220032 1971 Stacy 0.005201 girl
## 220131 1971 Stacie 0.001457 girl
## 220250 1971 Staci 0.000673 girl
## 220577 1971 Stacia 0.000174 girl
## 221033 1972 Stacy 0.004717 girl
## 221159 1972 Stacie 0.001130 girl
## 221253 1972 Staci 0.000659 girl
## 221653 1972 Stacia 0.000145 girl
## 222032 1973 Stacy 0.004723 girl
## 222150 1973 Stacie 0.001182 girl
## 222224 1973 Staci 0.000723 girl
## 222650 1973 Stacia 0.000149 girl
## 223038 1974 Stacy 0.004304 girl
## 223156 1974 Stacie 0.001122 girl
## 223271 1974 Staci 0.000601 girl
## 223646 1974 Stacia 0.000151 girl
## 224034 1975 Stacy 0.004596 girl
## 224146 1975 Stacie 0.001178 girl
## 224253 1975 Staci 0.000621 girl
## 224684 1975 Stacia 0.000141 girl
## 225038 1976 Stacy 0.004152 girl
## 225151 1976 Stacie 0.001098 girl
## 225269 1976 Staci 0.000553 girl
## 225703 1976 Stacia 0.000134 girl
## 226042 1977 Stacy 0.003743 girl
## 226185 1977 Stacie 0.000904 girl
## 226288 1977 Staci 0.000511 girl
## 226633 1977 Stacia 0.000161 girl
## 227044 1978 Stacy 0.003515 girl
## 227177 1978 Stacie 0.000892 girl
## 227304 1978 Staci 0.000464 girl
## 227756 1978 Stacia 0.000124 girl
## 228043 1979 Stacy 0.003316 girl
## 228201 1979 Stacie 0.000769 girl
## 228299 1979 Staci 0.000462 girl
## 228793 1979 Stacia 0.000116 girl
## 229053 1980 Stacy 0.002843 girl
## 229222 1980 Stacie 0.000649 girl
## 229310 1980 Staci 0.000438 girl
## 229833 1980 Stacia 0.000106 girl
## 230065 1981 Stacy 0.002504 girl
## 230235 1981 Stacie 0.000594 girl
## 230363 1981 Staci 0.000362 girl
## 230887 1981 Stacia 0.000096 girl
## 231072 1982 Stacy 0.002261 girl
## 231280 1982 Stacie 0.000479 girl
## 231354 1982 Staci 0.000365 girl
## 231891 1982 Stacia 0.000095 girl
## 232061 1983 Stacy 0.002615 girl
## 232261 1983 Stacie 0.000511 girl
## 232326 1983 Staci 0.000398 girl
## 232836 1983 Stacia 0.000103 girl
## 233067 1984 Stacy 0.002466 girl
## 233253 1984 Stacie 0.000547 girl
## 233327 1984 Staci 0.000399 girl
## 233905 1984 Stacia 0.000092 girl
## 234067 1985 Stacy 0.002228 girl
## 234282 1985 Stacie 0.000479 girl
## 234323 1985 Staci 0.000399 girl
## 234852 1985 Stacia 0.000103 girl
## 235087 1986 Stacy 0.001747 girl
## 235301 1986 Stacie 0.000440 girl
## 235369 1986 Staci 0.000346 girl
## 235835 1986 Stacia 0.000108 girl
## 236101 1987 Stacy 0.001466 girl
## 236330 1987 Stacie 0.000396 girl
## 236338 1987 Staci 0.000388 girl
## 237126 1988 Stacy 0.001180 girl
## 237323 1988 Staci 0.000401 girl
## 237340 1988 Stacie 0.000374 girl
## 237977 1988 Stacia 0.000089 girl
## 238152 1989 Stacy 0.000947 girl
## 238382 1989 Staci 0.000332 girl
## 238399 1989 Stacie 0.000311 girl
## 239173 1990 Stacy 0.000814 girl
## 239420 1990 Stacie 0.000294 girl
## 239439 1990 Staci 0.000278 girl
## 240212 1991 Stacy 0.000693 girl
## 240479 1991 Staci 0.000250 girl
## 240536 1991 Stacie 0.000215 girl
## 241232 1992 Stacy 0.000593 girl
## 241526 1992 Stacie 0.000225 girl
## 241537 1992 Staci 0.000219 girl
## 242288 1993 Stacy 0.000485 girl
## 242577 1993 Stacie 0.000208 girl
## 242636 1993 Staci 0.000179 girl
## 243319 1994 Stacy 0.000436 girl
## 243672 1994 Stacie 0.000163 girl
## 243799 1994 Staci 0.000130 girl
## 244389 1995 Stacy 0.000355 girl
## 244805 1995 Stacie 0.000129 girl
## 244959 1995 Staci 0.000103 girl
## 245420 1996 Stacy 0.000316 girl
## 245882 1996 Stacie 0.000117 girl
## 246471 1997 Stacy 0.000279 girl
## 247517 1998 Stacy 0.000253 girl
## 248554 1999 Stacy 0.000234 girl
## 249597 2000 Stacy 0.000212 girl
## 250603 2001 Stacy 0.000211 girl
## 251660 2002 Stacy 0.000195 girl
## 252663 2003 Stacy 0.000199 girl
## 253632 2004 Stacy 0.000214 girl
## 254650 2005 Stacy 0.000211 girl
## 255674 2006 Stacy 0.000202 girl
## 256691 2007 Stacy 0.000199 girl
## 257725 2008 Stacy 0.000198 girl
Exercise 8: Pull out the names “Jonathan”, “Jonnie”, and “Johnathon” but not other Jon names. Bonus: only have “Jon” listed once in your search string.
unique(grep("^(Johnathon)|(Jon(nie)|(athon))", bnames$name, value = T))
## [1] "Jonnie" "Jonathon" "Johnathon"
## note that this was harder than intended. I meant to ask for 'Jonathan',
## 'Jonnie', and 'Jonathon'
unique(grep("^Jon((nie)|(ath[ao]n))", bnames$name, value = T))
## [1] "Jonathan" "Jonnie" "Jonathon"
Escapes
What if you want to match a “.” or other special character? The following characters have special meaning in regular expressions: “. \ | ( ) [ { ^ $ * + ?” and if you want to search for them you have to do something special.
# say you want all Chrom one ILs
ILs <- c("IL.1.1", "IL.2.2", "IL.1.3", "IL.2.1", "IL.11.1", "IL.11.3", "IL.12.1",
"IL.12.2")
# the following seems logical at first:
grep("IL.1.", ILs, value = T)
## [1] "IL.1.1" "IL.1.3" "IL.11.1" "IL.11.3" "IL.12.1" "IL.12.2"
What happened? Remember that “.” matches any character. We need to tell grep that the “.” is just a regular character, not a wildcard. To do this we “escape” it by preceding it with a backslash (“\”). However, since “\” is itself a special character in an R string, it must also be escaped. So we use two backslashes.
grep("IL\\.1\\.", ILs, value = T)
## [1] "IL.1.1" "IL.1.3"
# an alternative, if you don't need any regex functionality is to use the
# argument 'fixed=T'
grep("IL.1.", ILs, value = T, fixed = T)
## [1] "IL.1.1" "IL.1.3"
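A third option, not shown above: a “.” loses its special meaning inside square brackets, so you can wrap it in a one-character class instead of escaping it.

```r
ILs <- c("IL.1.1", "IL.2.2", "IL.1.3", "IL.2.1", "IL.11.1", "IL.11.3", "IL.12.1",
    "IL.12.2")
# [.] is a character class containing only a literal dot
grep("IL[.]1[.]", ILs, value = T)
## [1] "IL.1.1" "IL.1.3"
```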
Back references
One of the powerful tools in regexps is the ability to refer back to a previous match. The item that you want to refer back to is enclosed in parentheses. You back-reference it with a backslash and a digit: “1” refers to the first group enclosed in parentheses, “2” to the second group, etc.
Suppose we want to find all state names that have letters repeated twice in a row.
grep("(.)\\1", state.name, value = T)
## [1] "Connecticut" "Hawaii" "Illinois" "Massachusetts"
## [5] "Minnesota" "Mississippi" "Missouri" "Pennsylvania"
## [9] "Tennessee"
# the '.' matches any character. Enclosing that in parentheses designates
# it as an item that we want to refer back to. The \\1 refers back to
# it.
Exercise 9: Find all state names that have two vowels repeated.
grep("([aeiou])\\1", state.name, value = T, ignore.case = T)
## [1] "Hawaii" "Tennessee"
I find back references particularly helpful in combination with sub(), because you can use them in your replacement string.
Exercise 10: For all two-worded state names reverse the order of the two words and add a comma between the words. (“North Carolina” should become “Carolina, North”)
sub("([[:alpha:]]+) ([[:alpha:]]+)", "\\2, \\1", state.name)
## [1] "Alabama" "Alaska" "Arizona"
## [4] "Arkansas" "California" "Colorado"
## [7] "Connecticut" "Delaware" "Florida"
## [10] "Georgia" "Hawaii" "Idaho"
## [13] "Illinois" "Indiana" "Iowa"
## [16] "Kansas" "Kentucky" "Louisiana"
## [19] "Maine" "Maryland" "Massachusetts"
## [22] "Michigan" "Minnesota" "Mississippi"
## [25] "Missouri" "Montana" "Nebraska"
## [28] "Nevada" "Hampshire, New" "Jersey, New"
## [31] "Mexico, New" "York, New" "Carolina, North"
## [34] "Dakota, North" "Ohio" "Oklahoma"
## [37] "Oregon" "Pennsylvania" "Island, Rhode"
## [40] "Carolina, South" "Dakota, South" "Tennessee"
## [43] "Texas" "Utah" "Vermont"
## [46] "Virginia" "Washington" "Virginia, West"
## [49] "Wisconsin" "Wyoming"
Exercise 11: For all two-worded state names, abbreviate the first word to be the first letter, followed by a “.” (“North Carolina” becomes “N. Carolina”).
sub("^(\\w)\\w+\\s", "\\1. ", state.name)
## [1] "Alabama" "Alaska" "Arizona" "Arkansas"
## [5] "California" "Colorado" "Connecticut" "Delaware"
## [9] "Florida" "Georgia" "Hawaii" "Idaho"
## [13] "Illinois" "Indiana" "Iowa" "Kansas"
## [17] "Kentucky" "Louisiana" "Maine" "Maryland"
## [21] "Massachusetts" "Michigan" "Minnesota" "Mississippi"
## [25] "Missouri" "Montana" "Nebraska" "Nevada"
## [29] "N. Hampshire" "N. Jersey" "N. Mexico" "N. York"
## [33] "N. Carolina" "N. Dakota" "Ohio" "Oklahoma"
## [37] "Oregon" "Pennsylvania" "R. Island" "S. Carolina"
## [41] "S. Dakota" "Tennessee" "Texas" "Utah"
## [45] "Vermont" "Virginia" "Washington" "W. Virginia"
## [49] "Wisconsin" "Wyoming"