parquet_schema {nanoparquet} | R Documentation
Create a Parquet schema
Description
You can use this schema to specify how to write out a data frame to
a Parquet file with write_parquet().
Usage
parquet_schema(...)
Arguments
...
Parquet type specifications, see below.
For backwards compatibility, you can supply a file name
here, and then parquet_schema() behaves as read_parquet_schema().
Details
A schema is a list of potentially named type specifications. A schema
is stored in a data frame. Each (potentially named) argument of
parquet_schema() may be a character scalar or a list. Parameterized
types need to be specified as a list. Primitive Parquet types may be
specified as a string or a list.
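As a minimal sketch of the two forms described above (the column names x and id are hypothetical, and the snippet assumes the nanoparquet package is installed):

```r
library(nanoparquet)

# A primitive type given as a plain string ...
s1 <- parquet_schema(x = "DOUBLE")

# ... and the same primitive type given as a one-element list.
s2 <- parquet_schema(x = list("DOUBLE"))

# A parameterized type has to be a list: the type name comes first,
# followed by its named parameters.
s3 <- parquet_schema(id = list("FIXED_LEN_BYTE_ARRAY", type_length = 16))

# Each result is a data frame describing the requested columns.
print(s3)
```

The list form is the more general one, so it may be convenient to use it throughout when a schema mixes primitive and parameterized types.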
Value
Data frame with the same columns as read_parquet_schema():
file_name, name, r_type, type, type_length, repetition_type,
converted_type, logical_type, num_children, scale, precision, field_id.
Possible types:
Special type:
- "AUTO": this is not a Parquet type, but it tells write_parquet()
  to map the R type to Parquet automatically, using the default mapping rules.
Primitive Parquet types:
- "BOOLEAN"
- "INT32"
- "INT64"
- "INT96"
- "FLOAT"
- "DOUBLE"
- "BYTE_ARRAY"
- "FIXED_LEN_BYTE_ARRAY": fixed-length byte array. It needs a type_length
  parameter, an integer between 0 and 2^31-1.
Parquet logical types:
- "STRING"
- "ENUM"
- "UUID"
- "INTEGER": signed or unsigned integer. It needs a bit_width and an
  is_signed parameter. bit_width must be 8, 16, 32 or 64. is_signed
  must be TRUE or FALSE.
- "INT": same as "INTEGER". The Parquet documentation uses "INT", but the
  actual specification uses "INTEGER". Both are supported in nanoparquet.
- "DECIMAL": decimal number of specified scale and precision. It needs the
  precision and primitive_type parameters. It also supports the scale
  parameter, which defaults to zero if not specified.
- "FLOAT16"
- "DATE"
- "TIME": needs an is_adjusted_utc (TRUE or FALSE) and a unit parameter.
  unit must be "MILLIS", "MICROS" or "NANOS".
- "TIMESTAMP": needs an is_adjusted_utc (TRUE or FALSE) and a unit
  parameter. unit must be "MILLIS", "MICROS" or "NANOS".
- "JSON"
- "BSON"
Logical types MAP, LIST and UNKNOWN are not supported currently.
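The parameterized logical types listed above could be requested as in the following sketch. The column names (price, created, small) are hypothetical, and the choice of "INT64" as primitive_type for the decimal column is an assumption made for illustration, not a recommendation from this page:

```r
library(nanoparquet)

schema <- parquet_schema(
  # Decimal with 10 digits in total, 2 of them after the decimal point,
  # assumed here to be stored in a 64-bit integer.
  price = list("DECIMAL", precision = 10, scale = 2, primitive_type = "INT64"),
  # Timestamp with microsecond resolution, adjusted to UTC.
  created = list("TIMESTAMP", is_adjusted_utc = TRUE, unit = "MICROS"),
  # Unsigned 16-bit integer.
  small = list("INT", bit_width = 16, is_signed = FALSE)
)

# Inspect how each column was recorded in the schema data frame.
print(schema[, c("name", "type", "logical_type")])
```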
Converted types are deprecated in the Parquet specification in favor of
logical types, but parquet_schema() accepts some converted types as a
syntactic shortcut for the corresponding logical types:
- INT_8 means list("INT", bit_width = 8, is_signed = TRUE).
- INT_16 means list("INT", bit_width = 16, is_signed = TRUE).
- INT_32 means list("INT", bit_width = 32, is_signed = TRUE).
- INT_64 means list("INT", bit_width = 64, is_signed = TRUE).
- TIME_MICROS means list("TIME", is_adjusted_utc = TRUE, unit = "MICROS").
- TIME_MILLIS means list("TIME", is_adjusted_utc = TRUE, unit = "MILLIS").
- TIMESTAMP_MICROS means list("TIMESTAMP", is_adjusted_utc = TRUE, unit = "MICROS").
- TIMESTAMP_MILLIS means list("TIMESTAMP", is_adjusted_utc = TRUE, unit = "MILLIS").
- UINT_8 means list("INT", bit_width = 8, is_signed = FALSE).
- UINT_16 means list("INT", bit_width = 16, is_signed = FALSE).
- UINT_32 means list("INT", bit_width = 32, is_signed = FALSE).
- UINT_64 means list("INT", bit_width = 64, is_signed = FALSE).
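To illustrate the shortcuts, the sketch below builds one schema from a converted type name and one from the equivalent list form. The column name n is hypothetical. According to the list above, the two calls are expected to describe the same column, so printing them side by side is a quick way to check:

```r
library(nanoparquet)

# Converted type name used as a shortcut ...
s_short <- parquet_schema(n = "UINT_16")

# ... and the logical type it stands for, written out in full.
s_long <- parquet_schema(n = list("INT", bit_width = 16, is_signed = FALSE))

print(s_short)
print(s_long)
```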
Missing values
Each type might also have a repetition_type parameter, with possible
values "REQUIRED", "OPTIONAL" or "REPEATED". "REQUIRED" columns
do not allow missing values. Missing values are allowed in "OPTIONAL"
columns. "REPEATED" columns are currently not supported in
write_parquet().
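A minimal sketch of how repetition_type might be used when writing a file follows. The data frame, its column names and the temporary file are made up for the example, and it assumes a nanoparquet version in which write_parquet() has a schema argument:

```r
library(nanoparquet)

# 'id' has no missing values, 'label' has one.
df <- data.frame(id = 1:3, label = c("a", NA, "c"))

schema <- parquet_schema(
  id = list("INT32", repetition_type = "REQUIRED"),
  label = list("STRING", repetition_type = "OPTIONAL")
)

tmp <- tempfile(fileext = ".parquet")
write_parquet(df, tmp, schema = schema)

# Read the file back; the missing value in 'label' is preserved.
out <- read_parquet(tmp)
print(out)
```

Declaring label as "OPTIONAL" is what allows the NA to be written. Per the description above, a "REQUIRED" column does not allow missing values.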
Examples
parquet_schema(
c1 = "INT32",
c2 = list("INT", bit_width = 64, is_signed = TRUE),
c3 = list("STRING", repetition_type = "OPTIONAL")
)