N-gram tokenizing:
The test will break up the content of each post
in your dataset into sequences of words, or “n-grams.”
Each word separated by a space is an individual token
in the n-gram, and the number of tokens indicates the
length of the n-gram.
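
As a rough illustration of the tokenizing step, the sketch below splits a
post on whitespace and slides a window of n tokens across it. The function
name `ngrams` is hypothetical, and splitting on any whitespace (rather than
on single spaces only) is an assumption; the test's actual implementation
may differ.

```python
def ngrams(text, n):
    """Return every run of n consecutive tokens in text, as tuples."""
    # Assumption: tokens are whitespace-separated words, with no
    # lowercasing or punctuation stripping.
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# A four-word post contains two n-grams of length three.
print(ngrams("he eats snails daily", 3))
```

A post with fewer than n words yields no n-grams of that length.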
Scanning for repeated n-grams:
The test will check whether any n-gram of length three
(e.g. “he eats snails”) to five (e.g. “The last time
I voted”) also appears in another post.
The test will reveal every n-gram that appears more
than once across different posts.
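
One way to picture the scanning step is to record, for every n-gram of
length three to five, which posts contain it, and then keep only the n-grams
found in at least two different posts. This is a minimal sketch: the
function name `repeated_ngrams` and its parameters are hypothetical, and
exact, case-sensitive matching is an assumption.

```python
from collections import defaultdict


def repeated_ngrams(posts, min_n=3, max_n=5):
    """Map each n-gram to the set of post indexes containing it,
    keeping only n-grams found in two or more different posts."""
    seen = defaultdict(set)
    for index, text in enumerate(posts):
        tokens = text.split()  # assumption: whitespace-separated tokens
        for n in range(min_n, max_n + 1):
            for i in range(len(tokens) - n + 1):
                seen[tuple(tokens[i:i + n])].add(index)
    return {gram: ids for gram, ids in seen.items() if len(ids) > 1}


posts = [
    "he eats snails daily",
    "today he eats snails",
    "nothing shared here at all",
]
print(repeated_ngrams(posts))
```

In this small example only “he eats snails” is shared, by the first and
second posts.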
Counting repeated n-grams:
The test will tally the number of times each n-gram is
repeated and will sort n-grams by number of repetitions
(highest to lowest) in the data output. This allows you
to see whether a particular phrase was copied and pasted
many times across posts.
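
The tally can be sketched with a counter that records every occurrence of
every n-gram and then sorts the repeated ones from most to least frequent.
The name `count_repeats` is hypothetical, and counting every occurrence
across the whole dataset is an assumption about how the test defines a
repetition.

```python
from collections import Counter


def count_repeats(posts, min_n=3, max_n=5):
    """Return (n-gram, count) pairs for n-grams seen at least twice,
    sorted from most to least frequent."""
    counts = Counter()
    for text in posts:
        tokens = text.split()  # assumption: whitespace-separated tokens
        for n in range(min_n, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    repeated = [(gram, c) for gram, c in counts.items() if c >= 2]
    return sorted(repeated, key=lambda item: item[1], reverse=True)


posts = [
    "buy this now please",
    "buy this now",
    "buy this now friends",
    "this now please",
]
for gram, count in count_repeats(posts):
    print(count, " ".join(gram))
```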
Output Format: .csv File
The test will produce a CSV file in which repeated n-grams
are presented in rows, with each row corresponding to
an individual post where a repeated n-gram was found.
The rows are arranged with the longest n-grams first:
the CSV will present all repeated n-grams of
length 5 first (if found), followed by repeated n-grams
of length 4 (if found), then 3 (if found).
Each row displays the content of the n-gram. Within each
length, rows are further sorted by the number of times the
n-gram was repeated by all users in the dataset,
from the most repeated n-grams to the least repeated
(a minimum of two occurrences).
Each row also presents the unique username of the user who
posted the n-gram. Rows for a given n-gram are further sorted
by how many times each user posted it, from the most times
to the least.
Each row also presents the full content of the post in which
the n-gram appeared, as well as the timestamp of when the post was made.
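
The row layout and sort order described above could be assembled roughly as
follows. This is only a sketch under stated assumptions: the input keys
(`user`, `text`, `timestamp`), the column names in `FIELDS`, and the
functions `build_rows` and `to_csv` are all hypothetical, and the real
output file may use different headers and tie-breaking rules.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical column names; the real file may label them differently.
FIELDS = ["ngram_length", "ngram", "repetitions", "user",
          "user_repetitions", "post", "timestamp"]


def build_rows(posts, min_n=3, max_n=5):
    """posts: list of dicts with assumed keys 'user', 'text', 'timestamp'."""
    total = Counter()          # occurrences of each n-gram in the dataset
    per_user = Counter()       # occurrences per (n-gram, user) pair
    where = defaultdict(list)  # n-gram -> posts that contain it
    for post in posts:
        tokens = post["text"].split()
        found = set()
        for n in range(min_n, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                total[gram] += 1
                per_user[(gram, post["user"])] += 1
                if gram not in found:  # one row per post per n-gram
                    found.add(gram)
                    where[gram].append(post)
    rows = []
    for gram, count in total.items():
        if count < 2:  # keep only n-grams with at least two occurrences
            continue
        for post in where[gram]:
            rows.append({
                "ngram_length": len(gram),
                "ngram": " ".join(gram),
                "repetitions": count,
                "user": post["user"],
                "user_repetitions": per_user[(gram, post["user"])],
                "post": post["text"],
                "timestamp": post["timestamp"],
            })
    # Longest n-grams first, then most repeated, then (keeping rows for
    # the same n-gram together) the users who posted it most often.
    rows.sort(key=lambda r: (-r["ngram_length"], -r["repetitions"],
                             r["ngram"], -r["user_repetitions"]))
    return rows


def to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


posts = [
    {"user": "a", "text": "the last time I voted", "timestamp": "t1"},
    {"user": "b", "text": "the last time I voted", "timestamp": "t2"},
    {"user": "a", "text": "the last time I voted", "timestamp": "t3"},
]
print(to_csv(build_rows(posts)))
```

Grouping rows by n-gram text before sorting by user is a design choice made
here so that all rows for one phrase stay adjacent; the description above
does not specify how ties between equally frequent n-grams are broken.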