LargeDataSetForText {aifeducation} | R Documentation
Abstract class for large data sets containing raw texts
Description
This object stores raw texts. The data of these objects is not stored in memory directly. By using memory mapping, these objects make it possible to work with data sets that do not fit into memory (RAM).
Value
Returns a new object of this class.
Super class
aifeducation::LargeDataSetBase -> LargeDataSetForText
Methods
Public methods
Inherited methods
aifeducation::LargeDataSetBase$get_all_fields()
aifeducation::LargeDataSetBase$get_colnames()
aifeducation::LargeDataSetBase$get_dataset()
aifeducation::LargeDataSetBase$get_ids()
aifeducation::LargeDataSetBase$load()
aifeducation::LargeDataSetBase$load_from_disk()
aifeducation::LargeDataSetBase$n_cols()
aifeducation::LargeDataSetBase$n_rows()
aifeducation::LargeDataSetBase$reduce_to_unique_ids()
aifeducation::LargeDataSetBase$save()
aifeducation::LargeDataSetBase$select()
Method new()
Method for creation of a LargeDataSetForText instance. It can be initialized with the init_data parameter if passed (uses the add_from_data.frame() method if init_data is a data.frame).
Usage
LargeDataSetForText$new(init_data = NULL)
Arguments
init_data
Initial data.frame for the data set.
Returns
A new instance of this class, initialized with init_data if passed.
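A minimal sketch of creating an instance from a data.frame. It assumes that the aifeducation package and its backend are installed and configured. The example texts and the helper name make_text_dataset are made up for illustration; the call is wrapped in a function so that nothing runs until the package is available.

```r
# Hypothetical example data; only "id" and "text" are mandatory.
init_texts <- data.frame(
  id = c("doc_1", "doc_2"),
  text = c("First raw text.", "Second raw text."),
  stringsAsFactors = FALSE
)

# Hypothetical helper: creates a data set, optionally filled with init_data.
make_text_dataset <- function(init_data = NULL) {
  aifeducation::LargeDataSetForText$new(init_data = init_data)
}

# Usage (requires aifeducation):
# dataset <- make_text_dataset(init_texts)
# dataset$n_rows()
```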
Method add_from_files_txt()
Method for adding raw texts saved within .txt files to the data set. Please note that the directory should contain one folder for each .txt file. To create an informative data set, every folder can contain the following additional files:
bib_entry.txt: containing a text version of the bibliographic information of the raw text.
license.txt: containing a statement about the license to use the raw text such as "CC BY".
url_license.txt: containing the url/link to the license on the internet.
text_license.txt: containing the license in raw text.
url_source.txt: containing the url/link to the source on the internet.
The id of every .txt file is the file name without file extension. Make sure that the file names are unique. Id and raw texts are mandatory, bibliographic and license information are optional.
Usage
LargeDataSetForText$add_from_files_txt( dir_path, batch_size = 500, log_file = NULL, log_write_interval = 2, log_top_value = 0, log_top_total = 1, log_top_message = NA, trace = TRUE )
Arguments
dir_path
Path to the directory where the files are stored.
batch_size
int determining the number of files to process at once.
log_file
string Path to the file where the log should be saved. If no logging is desired, set this argument to NULL.
log_write_interval
int Time in seconds determining the interval in which the logger should try to update the log files. Only relevant if log_file is not NULL.
log_top_value
int indicating the current iteration of the process.
log_top_total
int determining the maximal number of iterations.
log_top_message
string providing additional information of the process.
trace
bool If TRUE, information on the progress is printed to the console.
Returns
The method does not return anything. It adds new raw texts to the data set.
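The directory layout described above can be sketched in base R as follows. The folder names, texts, and metadata contents are placeholders, and add_txt_corpus is a hypothetical wrapper that expects an existing LargeDataSetForText object; it is only defined here, not called.

```r
# Build a small example corpus in a temporary directory:
# one folder per text, each holding the .txt file and optional metadata.
root <- file.path(tempdir(), "corpus_txt")
dir.create(root, recursive = TRUE, showWarnings = FALSE)

docs <- c(doc_1 = "First raw text.", doc_2 = "Second raw text.")
for (id in names(docs)) {
  folder <- file.path(root, id)
  dir.create(folder, showWarnings = FALSE)
  # The file name without extension becomes the id of the text.
  writeLines(docs[[id]], file.path(folder, paste0(id, ".txt")))
  # Optional metadata files (placeholder contents).
  writeLines("CC BY", file.path(folder, "license.txt"))
  writeLines("Placeholder bibliographic entry.", file.path(folder, "bib_entry.txt"))
}

# Hypothetical wrapper around the documented method.
add_txt_corpus <- function(dataset, dir_path) {
  dataset$add_from_files_txt(dir_path = dir_path, batch_size = 500, trace = FALSE)
  invisible(dataset)
}
```

With aifeducation available, a call such as add_txt_corpus(LargeDataSetForText$new(), root) would then add both texts under the ids doc_1 and doc_2.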
Method add_from_files_pdf()
Method for adding raw texts saved within .pdf files to the data set. Please note that the directory should contain one folder for each .pdf file. To create an informative data set, every folder can contain the following additional files:
bib_entry.txt: containing a text version of the bibliographic information of the raw text.
license.txt: containing a statement about the license to use the raw text such as "CC BY".
url_license.txt: containing the url/link to the license on the internet.
text_license.txt: containing the license in raw text.
url_source.txt: containing the url/link to the source on the internet.
The id of every .pdf file is the file name without file extension. Make sure that the file names are unique. Id and raw texts are mandatory, bibliographic and license information are optional.
Usage
LargeDataSetForText$add_from_files_pdf( dir_path, batch_size = 500, log_file = NULL, log_write_interval = 2, log_top_value = 0, log_top_total = 1, log_top_message = NA, trace = TRUE )
Arguments
dir_path
Path to the directory where the files are stored.
batch_size
int determining the number of files to process at once.
log_file
string Path to the file where the log should be saved. If no logging is desired, set this argument to NULL.
log_write_interval
int Time in seconds determining the interval in which the logger should try to update the log files. Only relevant if log_file is not NULL.
log_top_value
int indicating the current iteration of the process.
log_top_total
int determining the maximal number of iterations.
log_top_message
string providing additional information of the process.
trace
bool If TRUE, information on the progress is printed to the console.
Returns
The method does not return anything. It adds new raw texts to the data set.
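Because the id of each text is derived from the file name, it can help to check for duplicate names before importing. The following base R sketch is not part of aifeducation; find_duplicate_pdf_ids is a hypothetical helper, and the example files are empty placeholders created only to demonstrate the check.

```r
# Returns the ids (file names without extension) that occur more than once
# among the .pdf files below dir_path.
find_duplicate_pdf_ids <- function(dir_path) {
  files <- list.files(dir_path, pattern = "\\.pdf$", recursive = TRUE,
                      ignore.case = TRUE)
  ids <- tools::file_path_sans_ext(basename(files))
  unique(ids[duplicated(ids)])
}

# Demonstration with empty placeholder files in a temporary directory.
pdf_root <- file.path(tempdir(), "corpus_pdf")
for (folder in c("a", "b", "c")) {
  dir.create(file.path(pdf_root, folder), recursive = TRUE, showWarnings = FALSE)
}
file.create(file.path(pdf_root, "a", "report.pdf"))
file.create(file.path(pdf_root, "b", "report.pdf"))
file.create(file.path(pdf_root, "c", "other.pdf"))

duplicate_ids <- find_duplicate_pdf_ids(pdf_root)
```

If duplicate_ids is not empty, the affected files should be renamed before calling add_from_files_pdf().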
Method add_from_files_xlsx()
Method for adding raw texts saved within .xlsx files to the data set. The method assumes that every row holds one text and that the columns store the id and the raw text. In addition, columns for the bibliographic information, the license, the license's url, the license's text, and the source's url can be added. The names of these columns must be specified with the following arguments. They must be the same for all .xlsx files in the chosen directory. Id and raw texts are mandatory; bibliographic, license, license's url, license's text, and source's url are optional. Additional columns are dropped.
Usage
LargeDataSetForText$add_from_files_xlsx( dir_path, trace = TRUE, id_column = "id", text_column = "text", bib_entry_column = "bib_entry", license_column = "license", url_license_column = "url_license", text_license_column = "text_license", url_source_column = "url_source", log_file = NULL, log_write_interval = 2, log_top_value = 0, log_top_total = 1, log_top_message = NA )
Arguments
dir_path
Path to the directory where the files are stored.
trace
bool If TRUE, prints information on the progress to the console.
id_column
string Name of the column storing the ids for the texts.
text_column
string Name of the column storing the raw text.
bib_entry_column
string Name of the column storing the bibliographic information of the texts.
license_column
string Name of the column storing information about the licenses.
url_license_column
string Name of the column storing information about the url to the license on the internet.
text_license_column
string Name of the column storing the license as text.
url_source_column
string Name of the column storing information about the url to the source on the internet.
log_file
string Path to the file where the log should be saved. If no logging is desired, set this argument to NULL.
log_write_interval
int Time in seconds determining the interval in which the logger should try to update the log files. Only relevant if log_file is not NULL.
log_top_value
int indicating the current iteration of the process.
log_top_total
int determining the maximal number of iterations.
log_top_message
string providing additional information of the process.
Returns
The method does not return anything. It adds new raw texts to the data set.
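When the spreadsheets use other column names than the defaults, the mapping is passed through the column arguments. A minimal sketch, assuming the files use the hypothetical column names "doc_id" and "content"; import_xlsx_corpus is an illustrative wrapper that expects an existing LargeDataSetForText object.

```r
# Hypothetical wrapper: maps custom column names onto the documented arguments.
import_xlsx_corpus <- function(dataset, dir_path) {
  dataset$add_from_files_xlsx(
    dir_path = dir_path,
    trace = FALSE,
    id_column = "doc_id",      # assumed column name in the .xlsx files
    text_column = "content",   # assumed column name in the .xlsx files
    bib_entry_column = "bib_entry",
    license_column = "license",
    url_license_column = "url_license",
    text_license_column = "text_license",
    url_source_column = "url_source",
    log_file = NULL            # no logging
  )
  invisible(dataset)
}
```

All .xlsx files in dir_path have to use the same column names, since the mapping is applied to the whole directory.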
Method add_from_data.frame()
Method for adding raw texts from a data.frame.
Usage
LargeDataSetForText$add_from_data.frame(data_frame)
Arguments
data_frame
Object of class data.frame with at least the following columns: "id", "text", "bib_entry", "license", "url_license", "text_license", and "url_source". If "id" and/or "text" is missing, an error occurs. If the other columns are not present in the data.frame, they are added with empty values (NA). Additional columns are dropped.
Returns
The method does not return anything. It adds new raw texts to the data set.
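The column handling described for data_frame can be mirrored in base R to check data before it is passed to the method. This is a sketch of the documented behavior, not the package's own implementation; prepare_text_frame and the example data are hypothetical.

```r
expected_columns <- c("id", "text", "bib_entry", "license",
                      "url_license", "text_license", "url_source")

# Hypothetical helper: fails on missing mandatory columns, fills missing
# optional columns with NA, and drops all additional columns.
prepare_text_frame <- function(data_frame) {
  missing_mandatory <- setdiff(c("id", "text"), colnames(data_frame))
  if (length(missing_mandatory) > 0) {
    stop("Missing mandatory column(s): ",
         paste(missing_mandatory, collapse = ", "))
  }
  for (column in setdiff(expected_columns, colnames(data_frame))) {
    data_frame[[column]] <- NA_character_
  }
  data_frame[, expected_columns, drop = FALSE]
}

# Example: "year" is an additional column and is dropped.
raw <- data.frame(id = "doc_1", text = "Some raw text.", year = 2024,
                  stringsAsFactors = FALSE)
prepared <- prepare_text_frame(raw)
```

The result can then be passed on, for example with dataset$add_from_data.frame(prepared) on an existing LargeDataSetForText object.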
Method get_private()
Method for requesting all private fields and methods. Used for loading and updating an object.
Usage
LargeDataSetForText$get_private()
Returns
Returns a list with all private fields and methods.
Method clone()
The objects of this class are cloneable with this method.
Usage
LargeDataSetForText$clone(deep = FALSE)
Arguments
deep
Whether to make a deep clone.
See Also
Other Data Management: DataManagerClassifier, EmbeddedText, LargeDataSetForTextEmbeddings