Annotating random samples in R

Years ago, when I was still working on my master's thesis, I used to do a lot of manual data annotation in spreadsheet software (mainly LibreOffice Calc). Now that R has firmly become my main tool for doing research, I've been thinking about the best way to, for instance, manually annotate a small sample of sentences from a data frame.

Random sampling with dplyr

First of all, I must say that during the last couple of months I've grown more and more accustomed to using dplyr's piping and data manipulation techniques. They have absolutely revolutionized the way I write R code nowadays. Here's one typical use case for me that has to do with annotating samples.

Consider this dataset consisting of Finnish and Russian verbs:

# A tibble: 543 x 2
# Groups:   lang [2]
   lang  headverb
   <chr> <chr>
 1 fi    saada
 2 fi    tehdä
 3 ru    провести
 4 ru    принять
 5 ru    получить
 6 fi    voittaa
 7 ru    опубликовать
 8 fi    aloittaa
 9 fi    antaa
10 fi    ottaa
# ... with 533 more rows

If my goal were to take a random sample of these verbs, dplyr offers the convenient function sample_n. So I can just do:


> headverbs %>% sample_n(10)
# A tibble: 10 x 2
   lang  headverb
   <chr> <chr>
 1 ru    вынести
 2 ru    приговорить
 3 fi    ampua
 4 ru    пройти
 5 fi    nimittää
 6 ru    сдать
 7 ru    предложить
 8 ru    заказывать
 9 ru    запретить
10 fi    määrätä
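Since a random draw changes on every run, it can help to fix the seed before sampling so that the sample you annotate can be reproduced later. The following is only a minimal sketch: it assumes dplyr 1.0.0 or later, where slice_sample is the newer counterpart of sample_n, and headverbs_toy is a made-up stand-in for the real data.

```r
library(dplyr)

# Toy stand-in for the verb data (the real headverbs has 543 rows)
headverbs_toy <- tibble(
  lang     = rep(c("fi", "ru"), each = 5),
  headverb = c("saada", "tehdä", "voittaa", "aloittaa", "antaa",
               "провести", "принять", "получить", "опубликовать", "решить")
)

# Setting the same seed before each draw makes the sample repeatable
set.seed(2024)
first_draw <- headverbs_toy %>% slice_sample(n = 4)

set.seed(2024)
second_draw <- headverbs_toy %>% slice_sample(n = 4)

# Both draws contain the same rows in the same order
identical(first_draw, second_draw)
```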

Even better, using the group_by function, I can first group my data by language and then get a sample with n instances from each language, Finnish and Russian:



> headverbs %>% group_by(lang) %>% sample_n(10)
# A tibble: 20 x 2
# Groups:   lang [2]
   lang  headverb
   <chr> <chr>
 1 fi    käydä
 2 fi    avata
 3 fi    korottaa
 4 fi    pistää
 5 fi    osoittaa
 6 fi    uudistaa
 7 fi    korjata
 8 fi    kilpailuttaa
 9 fi    todistaa
10 fi    pelata
11 ru    обыграть
12 ru    проиграть
13 ru    представлять
14 ru    опубликовать
15 ru    решить
16 ru    комментировать
17 ru    свести
18 ru    испечь
19 ru    перенести
20 ru    взорвать
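The grouped sample stays grouped, which can affect later steps. As a rough sketch (again assuming dplyr 1.0.0 or later and a made-up toy tibble), one could draw the per-language sample with slice_sample, drop the grouping with ungroup, and check the group sizes with count:

```r
library(dplyr)

# Toy stand-in: 6 Finnish and 4 Russian placeholder verbs
headverbs_toy <- tibble(
  lang     = rep(c("fi", "ru"), times = c(6, 4)),
  headverb = paste0("verb", 1:10)
)

per_lang <- headverbs_toy %>%
  group_by(lang) %>%
  slice_sample(n = 3) %>%  # 3 rows from each language
  ungroup()                # drop the grouping before further processing

# One row per language with the number of sampled verbs
per_lang %>% count(lang)
```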

Manual annotations

Now, in order to make manual annotations possible without leaving R I wrote the following little function:


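library(readr)  # provides write_lines

# Show the selected columns of one row, ask for an annotation,
# append the result to a backup file and return the annotation.
CheckSample_df <- function(r, cols_to_show, backup_file="/tmp/backup.txt"){
    # Wrap each shown value to 79 characters
    content  <- sapply(r[cols_to_show],function(x) paste(strwrap(x, 79),collapse="\n"))
    cat("\n\n", paste(cols_to_show,content,sep="\n=====\n",collapse="\n\n"),"\n\n")
    def <- readline("\nYour annotation:\n")
    # Backup line: the shown values and the annotation, separated by "|"
    write_lines(paste0(
                       paste(r[cols_to_show],collapse="|"),
                       "|",def)
                ,backup_file,append=TRUE)
    return(def)
}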

The function is designed to be called with apply (for a tutorial, cf. e.g. here). For the verb dataset above, if I wanted to define an additional column describing, e.g., my interpretation of the semantic class of each verb, I could do the following:


headverbs$semantic_class <- apply(headverbs,1,CheckSample_df, cols_to_show=c("lang","headverb"))

The cols_to_show parameter defines which columns are shown to the user to help with the annotation. The backup_file parameter specifies a file to which the function appends each annotation result. This is a reasonable thing to do, especially if you have a lot to annotate: if R crashes in the middle of the process, it's nice to have something to use as a basis for data recovery.
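To illustrate the recovery step, here is one possible way to read such a backup file back into a data frame with base R. It is only a sketch: the file is simulated with a temporary file, the semantic class labels are invented for the example, and the approach assumes that neither the shown values nor the annotations contain the "|" character themselves.

```r
# Simulate a backup file in the format CheckSample_df writes:
# the shown columns and the annotation, separated by "|"
backup_file <- tempfile(fileext = ".txt")
writeLines(c("fi|saada|possession",
             "fi|ampua|violence",
             "fi|antaa|transfer"),
           backup_file)

# Read the pipe-separated lines back; quoting and comment handling are
# switched off so that quotes or "#" in an annotation are kept as-is
recovered <- read.table(
  backup_file,
  sep = "|",
  quote = "",
  comment.char = "",
  col.names = c("lang", "headverb", "semantic_class"),
  stringsAsFactors = FALSE,
  fileEncoding = "UTF-8"
)

recovered
```

The recovered data frame could then be joined back to the original sample by the shown columns.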

If you're just interested in a simpler version that you can use with sapply, the function could be written this way:


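# Simpler variant for a single character vector
# (requires readr for write_lines)
CheckSample_simple <- function(show_this, backup_file="/tmp/backup.txt"){
    cat("\n\n", paste(strwrap(show_this, 80), collapse="\n"), "\n\n")
    def <- readline("Your annotation:")
    # Backup line: the shown text and the annotation, separated by "|"
    write_lines(paste0(show_this,"|",def),backup_file,append=TRUE)
    return(def)
}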
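As a usage sketch for the sapply case, the variant below mirrors CheckSample_simple but adds a hypothetical ask argument (not part of the original function), so that the prompt can be replaced by a canned answer when no console is available. It also swaps readr's write_lines for base R's cat to stay self-contained; both changes are assumptions made for illustration.

```r
# Variant of CheckSample_simple with a hypothetical `ask` argument;
# by default it prompts with readline, as the original does
check_sample_sketch <- function(show_this, backup_file, ask = readline) {
  cat("\n\n", paste(strwrap(show_this, 80), collapse = "\n"), "\n\n")
  def <- ask("Your annotation: ")
  # Base R append, used here instead of readr::write_lines
  cat(paste0(show_this, "|", def), "\n",
      sep = "", file = backup_file, append = TRUE)
  def
}

verbs <- c("saada", "ampua", "antaa")
backup_file <- tempfile(fileext = ".txt")

# A canned answer stands in for the console; interactive use would omit `ask`
annotations <- sapply(verbs, check_sample_sketch,
                      backup_file = backup_file,
                      ask = function(prompt) "unclassified",
                      USE.NAMES = FALSE)
```

Passing USE.NAMES = FALSE keeps sapply from naming the result after the input strings, which is convenient when the result is assigned to a data frame column.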

Fine-tuning with pbapply

One improvement to the aforementioned technique is to get some feedback on how you are progressing with the annotation process. A great tool for this is the pbapply package. We can just turn the previous command into:


library(pbapply)
headverbs$semantic_class <- pbapply(headverbs,1,CheckSample_df, cols_to_show=c("lang","headverb"))

And we get a nice progress bar indicating the work that has already been done and an estimate of the time remaining:


   |+++++                                             | 9 % ~02m 22s
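The pbapply package also offers pbsapply, a progress-bar counterpart of sapply, which would pair with the simpler function. A small sketch, with a made-up annotate_stub standing in for the interactive annotation function so that it runs without a console:

```r
library(pbapply)

verbs <- c("saada", "ampua", "antaa")

# Non-interactive stand-in for the annotation function (hypothetical)
annotate_stub <- function(show_this) paste0("todo:", show_this)

# Same call shape as sapply, with a progress bar in interactive sessions
annotations <- pbsapply(verbs, annotate_stub, USE.NAMES = FALSE)
```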

Thoughts of a linguist-turned-dev
