The aim of the rmdfiltr word count filter is to provide
a more accurate estimate of the number of words in a document than can
be gleaned from the R Markdown source document. Output from (inline) R
chunks, as well as formatted citations and references, cannot enter the
word count when the source document is analyzed. Hence, the word count
filter is applied after the document has been knitted and while it is
being processed by pandoc. At this stage, the document is
represented as an abstract syntax tree (AST), a semantic nested list,
and can be manipulated by applying so-called filters.
One of the filters applied to R Markdown documents by default is
citeproc (previously pandoc-citeproc), which
formats citations and inserts references. To obtain an accurate
estimate, the word count filter should therefore be applied
after citeproc has been applied. To do so, it is
necessary to disable the default application of citeproc,
because it is always applied last, by adding the following to the
document's YAML front matter:
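A minimal sketch of such front matter, assuming the standard R Markdown `citeproc` metadata field is what disables the default citeproc run (the title is a hypothetical placeholder):

```yaml
---
title: "My manuscript"   # placeholder title, not from the original document
citeproc: false          # assumption: stops R Markdown from adding citeproc itself
---
```

With the default run switched off, citeproc has to be added back manually through the pandoc arguments.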
To manually apply citeproc and subsequently the
rmdfiltr word count filter, add the pandoc
arguments to the output format of your R Markdown document as
pandoc_args. Each filter function returns a vector of command line
arguments; it takes the previous arguments as args and appends to
them. Hence, the calls to add filters can be nested:
rmdfiltr::add_wordcount_filter(rmdfiltr::add_citeproc_filter(args = NULL))
#> [1] "--citeproc"
#> [2] "--lua-filter"
#> [3] "/private/var/folders/nv/mz4ffsbn045101ngdd_mx0th0000gn/T/RtmpUaG0vc/Rinst1019a4a86f794/rmdfiltr/wordcount.lua"
When adding the filters to pandoc_args, the R code needs
to be preceded by !expr to declare it as a to-be-interpreted
expression.
output:
  html_document:
    pandoc_args: !expr rmdfiltr::add_wordcount_filter(rmdfiltr::add_citeproc_filter(args = NULL))
The word count filter reports the word counts in the console or, in RStudio, in the R Markdown tab.
285 words in text body
23 words in reference section
The rmdfiltr filter is an adapted combination of two other Lua filters by John MacFarlane and contributors.
Although word counting appears to be a trivial matter, the counts of different methods often disagree. The magnitude of those disagreements depends on the complexity of the document.
To get a feeling for the performance of the word count filter, I briefly compared the estimates for two documents across several common methods. The first document, a paper by Stahl & Aust (2018), is rather simple, consisting only of text with citations and a reference section. The second document is more complicated: it contains math, code, verbatim output, etc.
The word counts for the text body do not include tables or images
(or their captions) or the reference section (excluding these required some
manual labor in Word, Pages, and wordcounter.net).
Overall, all methods provide similar estimates for the text body of
the simple document. Although the document contains a considerable
number of citations, the wordcountaddin, which is applied to
the R Markdown source file before citeproc, provides a good
estimate. As expected, there is less agreement on the word count for the
shorter and more complex document. In particular, the
texcount word count is off: it displayed several errors
related to the displayed R code and verbatim output. I think the errors
may have caused texcount to ignore some parts of the document and are
probably the reason for the low word count of the text body. Similarly,
the wordcountaddin cannot count the verbatim output.
The patterns for the reference sections of the simple and complex
documents are comparable. Pages and texcount count more
words than Word, wordcounter.net and the
rmdfiltr word count filter. I suspect the difference is due
to how the methods handle the URLs in the references. The
wordcountaddin cannot provide a word count for reference
sections.
Overall, I’m fairly happy with the performance of the
rmdfiltr filter. The word counts are quite similar to those
of the majority of the other methods. I’m sure the filter can be
improved (and I’ll gladly take any suggestions), but I think in its
current form it is a decent solution.
Stahl, C., & Aust, F. (2018). Evaluative conditioning as memory-based judgment. Social Psychological Bulletin, 13(3), Article e28589. https://doi.org/10.5964/spb.v13i3.28589