token_stats {textTinyR} | R Documentation |
token statistics
# utl <- token_stats$new(x_vec = NULL, path_2folder = NULL, path_2file = NULL,
#                        file_delimiter = ' ', n_gram_delimiter = "_")
x_vec |
either NULL or a character string vector |
path_2folder |
either NULL or a valid path to a folder (each file in the folder should include words separated by a delimiter) |
path_2file |
either NULL or a valid path to a file |
file_delimiter |
either NULL or a character string specifying the file delimiter |
n_gram_delimiter |
either NULL or a character string specifying the n-gram delimiter. It is used in the collocation_words function |
subset |
either NULL or a vector specifying the subset of data to keep (number of rows of the print_frequency function) |
number |
a numeric value for the print_count_character function. All words with number of characters equal to the number parameter will be returned. |
word |
a character string for the print_collocations and print_prob_next functions |
dice_n_gram |
a numeric value specifying the n-gram for the dice method of the string_dissimilarity_matrix function |
method |
a character string specifying the method to use in the string_dissimilarity_matrix function. One of dice, levenshtein or cosine. |
split_separator |
a character string specifying the string split separator if method equals cosine in the string_dissimilarity_matrix function. The cosine method operates on sentences, so for a sentence such as "this_is_a_word_sentence" the split_separator should be "_" |
dice_thresh |
a float number used to threshold the data if method is dice in the string_dissimilarity_matrix function. It takes values between 0.0 and 1.0. The closer dice_thresh is to 0.0, the more values of the dissimilarity matrix will take the value 1.0. |
upper |
either TRUE or FALSE. If TRUE then both the lower and upper parts of the dissimilarity matrix of the string_dissimilarity_matrix function will be shown. Otherwise the upper part will be filled with NAs |
diagonal |
either TRUE or FALSE. If TRUE then the diagonal of the dissimilarity matrix of the string_dissimilarity_matrix function will be shown. Otherwise the diagonal will be filled with NAs |
threads |
a numeric value specifying the number of cores to use in parallel in the string_dissimilarity_matrix function |
n_grams |
a numeric value specifying the n-grams in the look_up_table function |
n_gram |
a character string specifying the n-gram to use in the print_words_lookup_tbl function |
An object of class R6ClassGenerator of length 24.
The path_2vector function reads the words of a folder or file into a character vector (using file_delimiter to split the input). Typical usage: reading a vocabulary from a text file.
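As a rough illustration of what reading a delimited vocabulary file into a vector involves, the following base-R sketch uses scan(). It is not the package's implementation; the temporary file and its contents are made up for the example.

```r
# Base-R sketch (an assumption, not textTinyR's internals):
# read space-delimited words from a file into a character vector.
tmp <- tempfile(fileext = ".txt")        # hypothetical input file
writeLines("alpha beta gamma", tmp)

# 'sep' plays the role of the file delimiter
vocab <- scan(tmp, what = character(), sep = " ", quiet = TRUE)

unlink(tmp)                              # clean up the temporary file
print(vocab)
```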
The freq_distribution function returns a named, unsorted frequency-distribution vector for EITHER a folder, a file OR a character string vector. A specific subset of the result can be retrieved using the print_frequency function.
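The idea of a named frequency vector can be sketched in base R with table(). This is only a conceptual illustration with made-up tokens; the ordering and exact return type of the package's own result may differ.

```r
# Base-R sketch of a token frequency distribution
# (illustrative only, not textTinyR's internals).
tokens <- c("the", "cat", "sat", "on", "the", "mat", "the")

# table() counts the occurrences of each distinct token
freq <- table(tokens)

# convert to a plain named integer vector
freq_vec <- setNames(as.integer(freq), names(freq))

# sort by decreasing count and keep the first two entries,
# loosely analogous to asking for a subset of the result
freq_sorted <- sort(freq_vec, decreasing = TRUE)
top_two <- head(freq_sorted, 2)
print(top_two)
```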
The count_character function returns the number of characters for each word of the corpus for EITHER a folder, a file OR a character string vector. Words with a specific number of characters can be retrieved using the print_count_character function.
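Counting characters per word and then selecting words of a given length can be sketched with nchar(). The word list is invented for the example, and the sketch only mirrors the idea, not the package's code.

```r
# Base-R sketch: characters per word, then words of a given length
# (illustrative only, not textTinyR's internals).
words <- c("one", "word", "token", "four")

# named vector: word -> number of characters
n_chars <- setNames(nchar(words), words)

# all words with exactly 4 characters, analogous to number = 4
four_letter <- names(n_chars)[n_chars == 4]
print(four_letter)
```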
The collocation_words function returns a co-occurrence frequency table for n-grams for EITHER a folder, a file OR a character string vector. A collocation is defined as a sequence of two or more consecutive words that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components ( http://nlp.stanford.edu/fsnlp/promo/colloc.pdf, page 172 ). The input to the function should be text n-grams separated by a delimiter (for instance 3-grams or 4-grams). A specific frequency table can be retrieved using the print_collocations function.
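One plausible reading of a co-occurrence table is sketched below in base R: for a target word, count the other words that appear in the same delimited n-gram. Whether the package reports raw counts or normalised frequencies is not stated here, so the helper cooccur() and its raw-count output are assumptions made for illustration.

```r
# Base-R sketch of word co-occurrence within delimited n-grams
# (cooccur() is a hypothetical helper, not part of textTinyR).
ngrams <- c("one_word_token", "two_words_token",
            "three_words_token", "four_words_token")

# split every n-gram on the n-gram delimiter
parts <- strsplit(ngrams, "_", fixed = TRUE)

cooccur <- function(word, parts) {
  # keep only the n-grams that contain the target word
  hits <- Filter(function(p) word %in% p, parts)
  # collect the remaining words of those n-grams
  others <- as.character(unlist(lapply(hits, function(p) p[p != word])))
  table(others)
}

co_token <- cooccur("token", parts)
print(co_token)
```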
The string_dissimilarity_matrix function returns a string dissimilarity matrix using either the dice, levenshtein or cosine distance. The input can be a character string vector only. If the method is dice, the dice coefficient (similarity) is calculated between two strings for a specific number of character n-grams (dice_n_gram).
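To make the dice method more concrete, here is a minimal base-R sketch of a Dice dissimilarity over character n-grams, using the common definition 1 - 2|X ∩ Y| / (|X| + |Y|) on sets of n-grams. The helpers char_ngrams() and dice_dissimilarity() are hypothetical, and the package may differ in details such as duplicate n-grams or thresholding.

```r
# Base-R sketch of a Dice dissimilarity on character n-grams
# (hypothetical helpers, not textTinyR's internals).
char_ngrams <- function(s, n = 2) {
  if (nchar(s) < n) return(character(0))
  starts <- seq_len(nchar(s) - n + 1)
  # distinct character n-grams of the string
  unique(substring(s, starts, starts + n - 1))
}

dice_dissimilarity <- function(a, b, n = 2) {
  x <- char_ngrams(a, n)
  y <- char_ngrams(b, n)
  # arbitrary convention of this sketch: two strings without n-grams are identical
  if (length(x) + length(y) == 0) return(0)
  1 - 2 * length(intersect(x, y)) / (length(x) + length(y))
}

# "night" and "nacht" share a single bigram ("ht") out of 4 + 4
d <- dice_dissimilarity("night", "nacht", n = 2)
print(d)
```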
The look_up_table function returns a look-up list where the list names are the n-grams and the list vectors are the words associated with those n-grams. The words for each n-gram can be retrieved using the print_words_lookup_tbl function. The input can be a character string vector only.
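The look-up structure described above can be approximated in base R as a named list keyed by character n-grams. The helper build_lookup() is hypothetical and assumes that the n-grams are character n-grams, which matches the 'e_w' key used in the examples of this page but is otherwise an assumption.

```r
# Base-R sketch of an n-gram look-up list
# (build_lookup() is a hypothetical helper, not textTinyR's internals).
build_lookup <- function(words, n = 3) {
  tbl <- list()
  for (w in words) {
    if (nchar(w) < n) next
    starts <- seq_len(nchar(w) - n + 1)
    # each distinct character n-gram of the word points back to the word
    for (g in unique(substring(w, starts, starts + n - 1))) {
      tbl[[g]] <- c(tbl[[g]], w)
    }
  }
  tbl
}

lut <- build_lookup(c("one_word_token", "two_words_token"), n = 3)

# words associated with a given 3-gram
print(lut[["e_w"]])
print(lut[["wor"]])
```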
token_stats$new(x_vec = NULL, path_2folder = NULL, path_2file = NULL, file_delimiter = ' ', n_gram_delimiter = "_")
--------------
path_2vector()
--------------
freq_distribution()
--------------
print_frequency(subset = NULL)
--------------
count_character()
--------------
print_count_character(number = NULL)
--------------
collocation_words()
--------------
print_collocations(word = NULL)
--------------
string_dissimilarity_matrix(dice_n_gram = 2, method = "dice", split_separator = " ", dice_thresh = 1.0, upper = TRUE, diagonal = TRUE, threads = 1)
--------------
look_up_table(n_grams = NULL)
--------------
print_words_lookup_tbl(n_gram = NULL)
library(textTinyR)

expl = c('one_word_token', 'two_words_token', 'three_words_token', 'four_words_token')

tk <- token_stats$new(x_vec = expl, path_2folder = NULL, path_2file = NULL)

#-------------------------
# frequency distribution:
#-------------------------

tk$freq_distribution()

# tk$print_frequency()

#------------------
# count characters:
#------------------

cnt <- tk$count_character()

# tk$print_count_character(number = 4)

#----------------------
# collocation of words:
#----------------------

col <- tk$collocation_words()

# tk$print_collocations(word = 'five')

#-----------------------------
# string dissimilarity matrix:
#-----------------------------

dism <- tk$string_dissimilarity_matrix(method = 'levenshtein')

#---------------------
# build a look-up-table:
#---------------------

lut <- tk$look_up_table(n_grams = 3)

# tk$print_words_lookup_tbl(n_gram = 'e_w')