Gabriel - 1 month ago
LaTeX Question

LaTeX document word statistics

I know there are a number of ways of counting words in a LaTeX document, some more precise than others.

What I'm after is a way to compute simple statistics on a LaTeX document. That is, instead of just lumping all the words together and reporting the total, I'd like to count the number of occurrences of each word separately.

The output would look something like this:

1. (15% - 456) that
++++++++++++++++++++++++++++++++++++++++++++
2. (10% - 308) the
++++++++++++++++++++++++++++++
3. (8% - 213) is
+++++++++++++++++++++
4. (4% - 102) of
+++++++++
5. (2% - 55) and
++++


Is there any tool out there that can do something similar to this?

Answer

I could not find any package/script to do what I needed, so I ended up building my own.

It's a small (rudimentary) Python script, but it does the job. The output looks like this:

Number of unique words: 1945
Total number of words: 16660

  0.  1210     (7.26%) - the
  1.   461     (2.77%) - in
  2.   431     (2.59%) - of
  3.   317     (1.90%) - a
  4.   313     (1.88%) - and
  5.   304     (1.82%) - for
  6.   304     (1.82%) - to
  7.   241     (1.45%) - is
  8.   176     (1.06%) - words
  9.   165     (0.99%) - by
Sum percentage: 23.5%

Word lengths distribution:
 1  ++ (317)
 2  ++++++++++++++++++++ (2602)
 3  ++++++++++++++++++++++++++++++ (3947)
 4  ++++++++++++++++++ (2342)
 5  +++++++++++++ (1752)
 6  ++++++++++ (1348)
 7  +++++++++ (1154)
 8  ++++++++ (1071)
 9  ++++++ (787)
10  ++++ (586)
11  +++ (383)
12  + (129)
13  + (123)
14  + (36)
15  + (83)

The script is available in the GitHub repo: LaTexWordStats.
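
For readers who just want the general idea, here is a minimal, independent sketch of how statistics like the ones above could be computed. It is not the code from the repository. The function names (`strip_latex`, `word_stats`, `report`) and the cleanup rules are my own assumptions, and the LaTeX stripping is deliberately crude: it drops comments, `\begin{...}`/`\end{...}` and command names, but arguments of other commands and the contents of math mode are still counted as words.

```python
import re
from collections import Counter


def strip_latex(text):
    """Rough LaTeX cleanup (assumed rules, not the original script's logic)."""
    text = re.sub(r"(?<!\\)%.*", "", text)                  # comments: unescaped % to end of line
    text = re.sub(r"\\(?:begin|end)\*?\{[^}]*\}", " ", text)  # \begin{env} and \end{env}
    text = re.sub(r"\\[a-zA-Z]+\*?", " ", text)             # remaining command names, e.g. \section
    return text


def word_stats(text):
    """Return (word -> count, word length -> count) for the cleaned text."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", strip_latex(text).lower())
    return Counter(words), Counter(len(w) for w in words)


def report(text, top=10, bar_width=30):
    """Build a plain-text report: top words plus a word-length histogram."""
    freq, lengths = word_stats(text)
    total = sum(freq.values())
    if total == 0:
        return "No words found."

    lines = [f"Number of unique words: {len(freq)}",
             f"Total number of words: {total}",
             ""]

    top_words = freq.most_common(top)
    for rank, (word, count) in enumerate(top_words):
        lines.append(f"{rank:3d}. {count:5d}  ({100 * count / total:5.2f}%) - {word}")
    covered = sum(count for _, count in top_words)
    lines.append(f"Sum percentage: {100 * covered / total:.1f}%")

    lines += ["", "Word lengths distribution:"]
    longest_bar = max(lengths.values())
    for length in sorted(lengths):
        n = lengths[length]
        # Scale bars to the most common length; always draw at least one '+'.
        bar = "+" * max(1, round(bar_width * n / longest_bar))
        lines.append(f"{length:2d}  {bar} ({n})")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = r"""
    \section{Intro} % a comment
    The cat and the dog saw the bird.
    """
    print(report(sample, top=3))
```

To run it on a real document, read the file first, e.g. `report(open("paper.tex", encoding="utf-8").read())`, where `paper.tex` is a placeholder file name. Since the word pattern only matches ASCII letters, accented words would be split; the pattern would need adjusting for non-English text.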