The dataset package extends R’s native data structures
with machine-readable metadata. It follows a semantic
early-binding approach: metadata is embedded as soon as the data is
created, making datasets suitable for long-term reuse, FAIR-compliant
publishing, and integration into semantic web systems.
In R, a data.frame
is defined as a tightly coupled
collection of variables that share many of the properties of matrices
and lists, and it serves as the fundamental data structure for most of
R’s modeling software. Users of the R ecosystem often use the term
data frame interchangeably with dataset. However, the
standards used in libraries, repositories, and statistical systems for
publishing, exchanging, and reusing datasets require metadata that even
“tidy” data frames do not provide.
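A minimal base R sketch of this gap: metadata attached to a plain vector as an ad hoc attribute does not survive ordinary subsetting. The attribute name "unit" and the values are illustrative assumptions, not a convention of any package.

```r
# A plain numeric vector with an ad hoc "unit" attribute
# (the attribute name and the values are illustrative only).
gdp_values <- c(2355, 2594, 2884)
attr(gdp_values, "unit") <- "CP_MEUR"

# The attribute is present on the full vector ...
attr(gdp_values, "unit")

# ... but `[` on an atomic vector keeps only names, dims and dimnames,
# so the unit is silently dropped from the subset.
subset_values <- gdp_values[1:2]
is.null(attr(subset_values, "unit"))
```

This is why a dedicated class with its own subsetting and printing methods is needed to keep the metadata attached to the data.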
This vignette introduces the dataset_df class and the dataset_df()
constructor, which extend tidy data frames with a semantic layer. For
details on semantically enriched vectors, see
vignette("defined", package = "dataset"). Readers interested in the
underlying ISO and W3C definitions of dataset will find them discussed in
vignette("design", package = "dataset").
The dataset_df() function helps you create semantically rich datasets
that meet the interoperability, exchange, and reuse requirements of
libraries, repositories, and statistical systems. It defines a new S3
class, inherited from the modernised data frame of tibble::tibble(), that
retains compatibility with existing workflows but is easier to:
This vignette walks you through creating such a dataset using a
subset of the GDP and main aggregates – international data
cooperation annual data dataset from Eurostat
(DOI: https://doi.org/10.2908/NAIDA_10_GDP).
print(gdp)
#> # A tibble: 10 × 5
#>    geo    year   gdp unit    freq
#>    <chr> <int> <dbl> <chr>   <chr>
#>  1 AD     2020 2355. CP_MEUR A
#>  2 AD     2021 2594. CP_MEUR A
#>  3 AD     2022 2884. CP_MEUR A
#>  4 AD     2023 3120. CP_MEUR A
#>  5 LI     2020 5430. CP_MEUR A
#>  6 LI     2021 6424. CP_MEUR A
#>  7 LI     2022 6759. CP_MEUR A
#>  8 SM     2020 1265. CP_MEUR A
#>  9 SM     2021 1461. CP_MEUR A
#> 10 SM     2022 1612. CP_MEUR A
This example dataset is already in tidy format: each row represents a
single observation for a country and year, and each column is a
variable. The dataset_df class builds on this structure by adding
semantic information to the variables and the dataset itself, ensuring
that both the shape and the meaning of the data are preserved and
unambiguous.
While the raw dataset represented in the gdp data.frame is valid and
tidy, it can be hard to interpret without external documentation. For
example:
Countries are encoded in the geo variable.
Reporting frequency (e.g., A for annual) is stored in freq.
The dataset_df() constructor enables two levels of semantic annotation
for a tbl_df object:
Variable-level metadata — label, unit, definition, namespace.
Dataset-level metadata — title, author, license, description.
Let’s create a smaller, semantically enriched dataset that carries both levels of metadata:
small_country_dataset <- dataset_df(
  geo = defined(
    gdp$geo,
    label = "Country name",
    concept = "http://purl.org/linked-data/sdmx/2009/dimension#refArea",
    namespace = "https://dd.eionet.europa.eu/vocabulary/eurostat/geo/$1"
  ),
  year = defined(
    gdp$year,
    label = "Reference Period (Year)",
    concept = "http://purl.org/linked-data/sdmx/2009/dimension#refPeriod"
  ),
  gdp = defined(
    gdp$gdp,
    label = "Gross Domestic Product",
    unit = "CP_MEUR",
    concept = "http://data.europa.eu/83i/aa/GDP"
  ),
  unit = defined(
    gdp$unit,
    label = "Unit of Measure",
    concept = "http://purl.org/linked-data/sdmx/2009/attribute#unitMeasure",
    namespace = "https://dd.eionet.europa.eu/vocabulary/eurostat/unit/$1"
  ),
  freq = defined(
    gdp$freq,
    label = "Frequency",
    concept = "http://purl.org/linked-data/sdmx/2009/code"
  ),
  dataset_bibentry = dublincore(
    title = "Small Country Dataset",
    creator = person("Jane", "Doe"),
    publisher = "Example Inc.",
    datasource = "https://doi.org/10.2908/NAIDA_10_GDP",
    rights = "CC-BY",
    coverage = "Andorra, Liechtenstein, San Marino and the Faroe Islands"
  )
)
Columns created with the defined class store semantic information such
as the label, the concept’s definition link, and the unit of measure.
You can check the variable label and the unit of measure of each column.
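A minimal sketch of such a check, assuming the package's var_label() and var_unit() accessors. The stand-alone vector gdp_column and its rounded values are illustrative stand-ins for the gdp column defined above.

```r
library(dataset)

# A small stand-alone defined vector mirroring the gdp column above
# (values are rounded, illustrative figures).
gdp_column <- defined(
  c(2355, 2594, 2884),
  label = "Gross Domestic Product",
  unit = "CP_MEUR",
  concept = "http://data.europa.eu/83i/aa/GDP"
)

var_label(gdp_column) # the human-readable label
var_unit(gdp_column)  # the unit of measure
```

The same calls should work on a column extracted from the dataset, for example small_country_dataset$gdp.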
A dataset_df() object can also store metadata describing the dataset as
a whole. This metadata follows widely adopted standards:
Dublin Core (dublincore()), used in libraries and data repositories.
DataCite (datacite()), commonly used in research data repositories.
Each metadata field can be accessed or modified using simple assignment functions. For example, you can set the dataset language.
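A hedged sketch of setting the language, assuming the package exposes a language() accessor with a matching assignment form. The toy_dataset object and its contents are hypothetical and exist only to keep the example self-contained.

```r
library(dataset)

# Hypothetical two-row dataset, used only for this illustration.
toy_dataset <- dataset_df(
  geo = defined(c("AD", "LI"), label = "Country name"),
  dataset_bibentry = dublincore(
    title = "Toy Dataset",
    creator = person("Jane", "Doe"),
    publisher = "Example Inc."
  )
)

# Assumed accessor pair: set the dataset-level language, then read it back.
language(toy_dataset) <- "en"
language(toy_dataset)
```

The language code may be stored in a normalised form, as the eng value in the printed record suggests.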
To see the complete dataset description, you can print it as a BibTeX-style entry, which is suitable for citation or export.
print(get_bibentry(small_country_dataset), "bibtex")
#> Dublin Core Metadata Record
#> --------------------------
#> Title: Small Country Dataset
#> Creator(s): Jane Doe [ctb]
#> Publisher: Example Inc.
#> Year: 2025
#> Language: eng
The previous dataset contains observations for three data subjects (Andorra, Liechtenstein, and San Marino) but does not include the Faroe Islands.
feroe_df <- data.frame(
  geo = rep("FO", 3),
  year = 2020:2022,
  gdp = c(2523.6, 2725.8, 3013.2),
  unit = rep("CP_MEUR", 3),
  freq = rep("A", 3)
)
The dataset_df class does not allow binding two datasets directly unless
their concept definitions, units of measure, and URI namespaces match.
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
While this constraint can feel restrictive during an analysis workflow, it ensures semantic consistency when the data is later published or exchanged.
This is similar in spirit to tidy data principles: when combining
datasets, both structure and meaning must align. In dataset_df, the tidy
data rule that “variables are columns” is complemented by the requirement
that variables with the same name also share the same definition, units,
and concept references.
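The idea behind this requirement can be sketched in base R. The helper same_semantics() below is hypothetical and is not the package's implementation. It only illustrates comparing unit, concept, and namespace attributes before two columns are combined. The CP_MUSD unit code and all values are made up for the example.

```r
# Hypothetical helper (not part of the dataset package): two columns are
# treated as compatible only if their unit, concept and namespace agree.
same_semantics <- function(x, y,
                           fields = c("unit", "concept", "namespace")) {
  all(vapply(
    fields,
    function(field) identical(attr(x, field), attr(y, field)),
    logical(1)
  ))
}

gdp_concept <- "http://data.europa.eu/83i/aa/GDP"

# Illustrative vectors carrying ad hoc semantic attributes.
gdp_meur <- structure(c(2355, 2594), unit = "CP_MEUR", concept = gdp_concept)
gdp_more <- structure(c(2524, 2726), unit = "CP_MEUR", concept = gdp_concept)
gdp_musd <- structure(c(2700, 3100), unit = "CP_MUSD", concept = gdp_concept)

same_semantics(gdp_meur, gdp_more) # TRUE: unit and concept agree
same_semantics(gdp_meur, gdp_musd) # FALSE: the units differ
```

A check of this kind rejects a combination that would silently mix incompatible units, which a purely structural row bind cannot detect.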
To add the missing Faroe Islands data, first create a compatible dataset using the same definitions, country coding, and units of measure as the original.
feroe_dataset <- dataset_df(
  geo = defined(
    feroe_df$geo,
    label = "Country name",
    concept = "http://purl.org/linked-data/sdmx/2009/dimension#refArea",
    namespace = "https://dd.eionet.europa.eu/vocabulary/eurostat/geo/$1"
  ),
  year = defined(
    feroe_df$year,
    label = "Reference Period (Year)",
    concept = "http://purl.org/linked-data/sdmx/2009/dimension#refPeriod"
  ),
  gdp = defined(
    feroe_df$gdp,
    label = "Gross Domestic Product",
    unit = "CP_MEUR",
    concept = "http://data.europa.eu/83i/aa/GDP"
  ),
  unit = defined(
    feroe_df$unit,
    label = "Unit of Measure",
    concept = "http://purl.org/linked-data/sdmx/2009/attribute#unitMeasure",
    namespace = "https://dd.eionet.europa.eu/vocabulary/eurostat/unit/$1"
  ),
  freq = defined(
    feroe_df$freq,
    label = "Frequency",
    concept = "http://purl.org/linked-data/sdmx/2009/code"
  )
)
Once the new dataset is defined in this way, you can combine it with
the existing one using bind_defined_rows().
joined_dataset <- bind_defined_rows(small_country_dataset, feroe_dataset)
joined_dataset
#> Doe (2025): Small Country Dataset [dataset]
#>    rowid     geo       year      gdp       unit      freq
#>    <defined> <defined> <defined> <defined> <defined> <defined>
#>  1 obs1      AD        2020      2355.     CP_MEUR   A
#>  2 obs2      AD        2021      2594.     CP_MEUR   A
#>  3 obs3      AD        2022      2884.     CP_MEUR   A
#>  4 obs4      AD        2023      3120.     CP_MEUR   A
#>  5 obs5      LI        2020      5430.     CP_MEUR   A
#>  6 obs6      LI        2021      6424.     CP_MEUR   A
#>  7 obs7      LI        2022      6759.     CP_MEUR   A
#>  8 obs8      SM        2020      1265.     CP_MEUR   A
#>  9 obs9      SM        2021      1461.     CP_MEUR   A
#> 10 obs10     SM        2022      1612.     CP_MEUR   A
#> 11 obs11     FO        2020      2524.     CP_MEUR   A
#> 12 obs12     FO        2021      2726.     CP_MEUR   A
#> 13 obs13     FO        2022      3013.     CP_MEUR   A
The combined dataset behaves like a regular tibble but retains its
metadata. If you convert it to a base R data.frame, you will lose the
helper methods and built-in checks, but the metadata will remain in the
object’s attributes.
attributes(as.data.frame(joined_dataset))
#> $names
#> [1] "rowid" "geo" "year" "gdp" "unit" "freq"
#>
#> $row.names
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13
#>
#> $dataset_bibentry
#> Dublin Core Metadata Record
#> --------------------------
#> Title: Small Country Dataset
#> Creator(s): Jane Doe [ctb]
#> Publisher: Example Inc.
#> Year: 2025
#> Language: eng
#>
#> $prov
#> [1] "<http://example.com/dataset_prov.nt> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Bundle> ."
#> [2] "<http://example.com/dataset#> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Entity> ."
#> [3] "<http://example.com/dataset#> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/linked-data/cube#DataSet> ."
#> [4] "_:unknownauthor <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Agent> ."
#> [5] "<https://doi.org/10.32614/CRAN.package.dataset> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#SoftwareAgent> ."
#> [6] "<http://example.com/creation> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Activity> ."
#> [7] "<http://example.com/creation> <http://www.w3.org/ns/prov#generatedAtTime> \"2025-08-25T22:11:30Z\"^^<xsd:dateTime> ."
#>
#> $subject
#> $term
#> [1] "Data sets"
#>
#> $subjectScheme
#> [1] "LCSH"
#>
#> $schemeURI
#> [1] "http://id.loc.gov/authorities/subjects"
#>
#> $valueURI
#> [1] "http://id.loc.gov/authorities/subjects/sh2018002256"
#>
#> $classificationCode
#> NULL
#>
#> $prefix
#> [1] "lcsh:"
#>
#> attr(,"class")
#> [1] "subject" "list"
#>
#> $class
#> [1] "data.frame"
With dataset_df()
your datasets are:
This approach supports the FAIR data principles (Findable,
Accessible, Interoperable, Reusable) and makes your data easier to
reuse, interpret, and validate. By maintaining metadata from creation
through publication, dataset_df helps preserve meaning across the entire
data lifecycle.
The package is designed to work seamlessly with the rOpenSci rdflib package and complements tidyverse workflows while enabling exports to semantic web formats like RDF.