quanteda.extras
This package contains statistical and plotting tools that expand the functionality of quanteda. These include functions for calculating:
keyness
effect size
dispersion
collocation
Installing quanteda.extras
Use devtools to install the package.
devtools::install_github("browndw/quanteda.extras")
Using quanteda.extras
The package is designed to bring basic corpus-analysis functions, of the kind offered by concordancers like AntConc or WordSmith, into the R programming environment.
As the name of the package suggests, many of the functions take outputs produced by the quanteda package. Users of the package should, therefore, be familiar with quanteda processing pipelines – particularly tokenizing and the creation of document-feature matrices.
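For readers new to those pipelines, here is a minimal sketch of the usual quanteda sequence: character vector to corpus, corpus to tokens, tokens to document-feature matrix. The two toy documents and the object names (`txts`, `toy_corpus`, `toy_tokens`, `toy_dfm`) are made up for illustration and are not part of quanteda.extras.

```r
library(quanteda)

# Two toy documents (hypothetical; not shipped with quanteda.extras)
txts <- c(doc1 = "The cat sat on the mat.",
          doc2 = "The dog chased the cat.")

toy_corpus <- corpus(txts)                             # character vector -> corpus
toy_tokens <- tokens(toy_corpus, remove_punct = TRUE)  # corpus -> tokens, dropping punctuation
toy_dfm    <- dfm(toy_tokens)                          # tokens -> document-feature matrix (lowercases by default)

ndoc(toy_dfm)       # number of documents
featnames(toy_dfm)  # the word types that became features
```

Each row of the resulting matrix is a document and each column a feature, so `as.matrix(toy_dfm)["doc1", "the"]` gives the count of "the" in the first document.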
A simple frequency table
As a simple example, we can start with the sample_corpus included with the package.
We can check the composition of the corpus by taking the text-type information from the doc_id variable:
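library(quanteda.extras) # Provides sample_corpus
library(dplyr)           # mutate(), group_by(), tally(), and %>%
library(stringr)         # str_extract()

sample_corpus %>%
  mutate(text_type = str_extract(doc_id, "[a-z]+")) %>% # First run of lowercase letters in doc_id
  group_by(text_type) %>%
  tally()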
The corpus contains 50 text samples of each type:
| text_type | n |
|---|---|
| acad | 50 |
| blog | 50 |
| fic | 50 |
| mag | 50 |
| news | 50 |
| spok | 50 |
| tvm | 50 |
| web | 50 |
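To see what the extraction step is doing, the same count can be sketched in base R on a handful of made-up ids. The id format used here (`acad_01` and so on) is an assumption for illustration only; the actual ids in sample_corpus may look different.

```r
# Hypothetical document ids; the real ids in sample_corpus may differ
ids <- c("acad_01", "acad_02", "blog_01", "news_01", "news_02", "news_03")

# Pull the first run of lowercase letters out of each id
text_type <- regmatches(ids, regexpr("[a-z]+", ids))

# Count how many documents fall under each text type
type_counts <- table(text_type)
type_counts
```

The pattern `[a-z]+` stops at the first character that is not a lowercase letter, which is why the numeric suffix is dropped.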
In order to generate a table from the frequency_table function, the data must first be tokenized.
library(quanteda) # corpus() and tokens(); assumes dplyr and quanteda.extras are already loaded

sc_tokens <- sample_corpus %>%
  mutate(text = preprocess_text(text)) %>% # Pre-process the text column
  corpus() %>% # Generate a corpus object
  tokens(what = "fastestword") # Tokenize

# Generate a frequency table
sc_counts <- frequency_table(sc_tokens)
The result is a table showing absolute frequency (AF), relative frequency per million tokens (Per_10.6), average reduced frequency (ARF), and deviation of proportions (DP).
| Token | AF | Per_10.6 | ARF | DP |
|---|---|---|---|---|
| the | 51046 | 49572.98822779917 | 31756.267973437283 | 0.14353581079341268 |
| to | 25942 | 25193.403216815543 | 16219.377547552038 | 0.09146807813084532 |
| and | 25344 | 24612.65943747487 | 15910.523549257367 | 0.12570387667514152 |
| of | 24331 | 23628.8911289931 | 14695.894699887542 | 0.17752068678885638 |
| a | 22694 | 22039.129311634104 | 13920.262892414788 | 0.1098711133049567 |
| in | 16959 | 16469.6216619372 | 10384.016099616008 | 0.14662171556500003 |
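To make the columns more concrete, here is a rough base-R sketch of how a per-million frequency and a deviation of proportions can be computed for one word. It uses the common definition of DP (half the sum of absolute differences between each document's share of the word and its share of the corpus). Whether frequency_table computes exactly this variant is an assumption, and the function `dispersion_sketch` and the toy documents are hypothetical. ARF is left out of the sketch.

```r
# Toy corpus: three "documents" as token vectors (hypothetical data)
docs <- list(
  d1 = c("the", "cat", "sat", "on", "the", "mat"),
  d2 = c("the", "dog", "barked"),
  d3 = c("a", "cat", "and", "a", "dog", "met", "the", "cat", "today")
)

# Hypothetical helper, not part of quanteda.extras
dispersion_sketch <- function(word, docs) {
  part_sizes <- lengths(docs)                                   # tokens per document
  hits <- vapply(docs, function(d) sum(d == word), numeric(1))  # occurrences per document
  af <- sum(hits)                                               # absolute frequency
  per_million <- af / sum(part_sizes) * 1e6                     # relative frequency per 10^6 tokens
  expected <- part_sizes / sum(part_sizes)                      # each document's share of the corpus
  observed <- hits / af                                         # each document's share of the word
  dp <- 0.5 * sum(abs(observed - expected))                     # 0 = perfectly even spread
  c(AF = af, Per_10.6 = per_million, DP = dp)
}

res <- dispersion_sketch("the", docs)
res
```

For "the" in this toy data, AF is 4 and DP is 0.25. The sketch does not guard against words that never occur, where the division by `af` is undefined.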