dataset_df: Create Datasets that are Easy to Share, Exchange and Extend

The dataset package extends R’s native data structures with machine-readable metadata. It follows a semantic early-binding approach: metadata is embedded as soon as the data is created, making datasets suitable for long-term reuse, FAIR-compliant publishing, and integration into semantic web systems.

In R, a data.frame is defined as a tightly coupled collection of variables that share many of the properties of matrices and lists, and it serves as the fundamental data structure for most of R’s modeling software. Users of the R ecosystem often use the term data frame interchangeably with dataset. However, the standards used in libraries, repositories, and statistical systems for publishing, exchanging, and reusing datasets require metadata that even “tidy” data frames do not provide.
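
To illustrate the gap, here is a minimal base R sketch. The two rows are rounded values borrowed from the example data used later in this vignette, and the object name gdp_plain is only illustrative. It lists what a tidy data frame actually stores about itself:

```r
# A tidy data frame in base R: one row per observation, one column per variable.
gdp_plain <- data.frame(
  geo  = c("AD", "LI"),
  year = c(2020L, 2020L),
  gdp  = c(2355, 5430)
)

# The only attributes are structural: column names, class, and row names.
names(attributes(gdp_plain))

# Nothing records what `gdp` measures, in which unit, or who published it.
attributes(gdp_plain$gdp)
#> NULL
```

The structure is tidy, but the unit of measure, the meaning of the codes, and the provenance of the data all have to live in external documentation.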

This vignette introduces the dataset_df class and the dataset_df() constructor, which extend tidy data frames with a semantic layer. For details on semantically enriched vectors, see vignette("defined", package = "dataset"). Readers interested in the underlying ISO and W3C definitions of dataset will find them discussed in vignette("design", package = "dataset").

Purpose

The dataset_df() function helps you create semantically rich datasets that meet the interoperability, exchange, and reuse requirements of libraries, repositories, and statistical systems. It defines a new S3 class, inherited from the modernised data frame of tibble::tibble(), that retains compatibility with existing workflows but is easier to:

understand by humans,
validate and process by machines,
deposit, exchange, and publish,
share across tools, teams, and domains.

This vignette walks you through creating such a dataset using a subset of the Eurostat dataset “GDP and main aggregates – international data cooperation annual data” (DOI: https://doi.org/10.2908/NAIDA_10_GDP).

Load example data

library(dataset)
data("gdp")

print(gdp)
#> # A tibble: 10 × 5
#>    geo    year   gdp unit    freq 
#>    <chr> <int> <dbl> <chr>   <chr>
#>  1 AD     2020 2355. CP_MEUR A    
#>  2 AD     2021 2594. CP_MEUR A    
#>  3 AD     2022 2884. CP_MEUR A    
#>  4 AD     2023 3120. CP_MEUR A    
#>  5 LI     2020 5430. CP_MEUR A    
#>  6 LI     2021 6424. CP_MEUR A    
#>  7 LI     2022 6759. CP_MEUR A    
#>  8 SM     2020 1265. CP_MEUR A    
#>  9 SM     2021 1461. CP_MEUR A    
#> 10 SM     2022 1612. CP_MEUR A

This example dataset is already in tidy format: each row represents a single observation for a country and year, and each column is a variable. dataset_df builds on this structure by adding semantic information to the variables and the dataset itself, ensuring that both the shape and the meaning of the data are preserved and unambiguous.

While the raw dataset represented in the gdp data.frame is valid and tidy, it can be hard to interpret without external documentation. For example:

Countries are encoded in the geo variable.
Reporting frequency (e.g., A for annual) is stored in freq.

Add metadata to your dataset

The dataset_df() constructor enables two levels of semantic annotation for a tbl_df object:

Variable-level metadata — label, unit, definition, namespace.
Dataset-level metadata — title, author, license, description.

Let’s create a semantically enriched subset:

small_country_dataset <- dataset_df(
  geo = defined(
    gdp$geo,
    label = "Country name",
    concept = "http://purl.org/linked-data/sdmx/2009/dimension#refArea",
    namespace = "https://dd.eionet.europa.eu/vocabulary/eurostat/geo/$1"
  ),
  year = defined(
    gdp$year,
    label = "Reference Period (Year)",
    concept = "http://purl.org/linked-data/sdmx/2009/dimension#refPeriod"
  ),
  gdp = defined(
    gdp$gdp,
    label = "Gross Domestic Product",
    unit = "CP_MEUR",
    concept = "http://data.europa.eu/83i/aa/GDP"
  ),
  unit = defined(
    gdp$unit,
    label = "Unit of Measure",
    concept = "http://purl.org/linked-data/sdmx/2009/attribute#unitMeasure",
    namespace = "https://dd.eionet.europa.eu/vocabulary/eurostat/unit/$1"
  ),
  freq = defined(
    gdp$freq,
    label = "Frequency",
    concept = "http://purl.org/linked-data/sdmx/2009/code"
  ),
  dataset_bibentry = dublincore(
    title = "Small Country Dataset",
    creator = person("Jane", "Doe"),
    publisher = "Example Inc.",
    datasource = "https://doi.org/10.2908/NAIDA_10_GDP",
    rights = "CC-BY",
    coverage = "Andorra, Liechtenstein, San Marino and the Faroe Islands"
  )
)

Inspecting variable-level metadata

Columns created with the defined class store semantic information such as the label, the concept’s definition link, and the unit of measure.

Check the variable label:

var_label(small_country_dataset$gdp)
#> [1] "Gross Domestic Product"

And the unit of measure:

var_unit(small_country_dataset$gdp)
#> [1] "CP_MEUR"

Adding dataset-level metadata

A dataset_df() object can also store metadata describing the dataset as a whole. This metadata follows widely adopted standards:

Dublin Core Terms (dublincore()), used in libraries and data repositories.
DataCite (datacite()), commonly used in research data repositories.

Each metadata field can be accessed or modified using simple assignment functions. For example, you can set the dataset language.

language(small_country_dataset) <- "en"
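
As a hedged sketch of the same pattern on a self-contained object: the block below builds a tiny dataset (the name tiny and its contents are made up for illustration), sets the language, and reads it back. It assumes that language() also works as a getter, mirroring the assignment form shown above.

```r
library(dataset)

# A deliberately tiny dataset_df with one defined column and a minimal
# Dublin Core record (title and creator only).
tiny <- dataset_df(
  geo = defined(c("AD", "LI"), label = "Country name"),
  dataset_bibentry = dublincore(
    title = "Tiny Example",
    creator = person("Jane", "Doe")
  )
)

# Set the dataset-level language, then read it back.
# Assumption: language() is the getter counterpart of `language<-`.
language(tiny) <- "en"
language(tiny)
```

The stored value may be normalised by the package, so it need not be identical to the string that was assigned.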

Reviewing dataset-level metadata

To see the complete dataset description, you can print it as a BibTeX-style entry, which is suitable for citation or export.

print(get_bibentry(small_country_dataset), "bibtex")
#> Dublin Core Metadata Record
#> --------------------------
#> Title:       Small Country Dataset
#> Creator(s):  Jane Doe [ctb]
#> Publisher:   Example Inc.
#> Year:        2025
#> Language:    eng


Joining datasets

The previous dataset contains observations for three data subjects (Andorra, Liechtenstein, and San Marino) but does not include the Faroe Islands.

feroe_df <- data.frame(
  geo = rep("FO", 3),
  year = 2020:2022,
  gdp = c(2523.6, 2725.8, 3013.2),
  unit = rep("CP_MEUR", 3),
  freq = rep("A", 3)
)

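The dataset_df class does not allow binding two datasets directly unless their concept definitions, units of measure, and URI namespaces match. The plain feroe_df carries none of this metadata and also lacks the rowid column that dataset_df() adds, so a direct rbind() fails: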

rbind(small_country_dataset, feroe_df)
#> Error in rbind(deparse.level, ...) :
#>   numbers of columns of arguments do not match

While this constraint can feel restrictive during an analysis workflow, it ensures semantic consistency when the data is later published or exchanged.

This is similar in spirit to tidy data principles: when combining datasets, both structure and meaning must align. In dataset_df, the tidy data rule that “variables are columns” is complemented by the requirement that variables with the same name also share the same definition, units, and concept references.
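
The idea behind this rule can be sketched in base R with plain attributes. The helper same_semantics() below is hypothetical and not part of the dataset package, the unit code "CP_MUSD" and the numeric values are made up for illustration, and the package's own checks may differ in detail.

```r
# Hypothetical helper (not part of the dataset package): two vectors are
# treated as compatible when their unit, concept and namespace attributes agree.
same_semantics <- function(x, y, fields = c("unit", "concept", "namespace")) {
  all(vapply(
    fields,
    function(f) identical(attr(x, f), attr(y, f)),
    logical(1)
  ))
}

gdp_concept <- "http://data.europa.eu/83i/aa/GDP"

# Three numeric vectors annotated with plain attributes (illustrative values).
gdp_eur  <- structure(c(2355, 2594), unit = "CP_MEUR", concept = gdp_concept)
gdp_same <- structure(c(2524, 2726), unit = "CP_MEUR", concept = gdp_concept)
gdp_usd  <- structure(c(2900, 3100), unit = "CP_MUSD", concept = gdp_concept)

same_semantics(gdp_eur, gdp_same)  # TRUE: unit and concept agree
same_semantics(gdp_eur, gdp_usd)   # FALSE: the units differ
```

Appending gdp_usd to gdp_eur would produce a structurally valid column that silently mixes currencies, which is exactly the kind of error a semantic check is meant to prevent.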

To add the missing Faroe Islands data, first create a compatible dataset using the same definitions, country coding, and units of measure as the original.

feroe_dataset <- dataset_df(
  geo = defined(
    feroe_df$geo,
    label = "Country name",
    concept = "http://purl.org/linked-data/sdmx/2009/dimension#refArea",
    namespace = "https://dd.eionet.europa.eu/vocabulary/eurostat/geo/$1"
  ),
  year = defined(
    feroe_df$year,
    label = "Reference Period (Year)",
    concept = "http://purl.org/linked-data/sdmx/2009/dimension#refPeriod"
  ),
  gdp = defined(
    feroe_df$gdp,
    label = "Gross Domestic Product",
    unit = "CP_MEUR",
    concept = "http://data.europa.eu/83i/aa/GDP"
  ),
  unit = defined(
    feroe_df$unit,
    label = "Unit of Measure",
    concept = "http://purl.org/linked-data/sdmx/2009/attribute#unitMeasure",
    namespace = "https://dd.eionet.europa.eu/vocabulary/eurostat/unit/$1"
  ),
  freq = defined(
    feroe_df$freq,
    label = "Frequency",
    concept = "http://purl.org/linked-data/sdmx/2009/code"
  )
)

Once the new dataset is defined in this way, you can combine it with the existing one using bind_defined_rows().

joined_dataset <- bind_defined_rows(small_country_dataset, feroe_dataset)
joined_dataset
#> Doe (2025): Small Country Dataset [dataset]
#>    rowid     geo       year      gdp       unit      freq      
#>    <defined> <defined> <defined> <defined> <defined> <defined>
#>  1 obs1      AD        2020      2355.     CP_MEUR   A        
#>  2 obs2      AD        2021      2594.     CP_MEUR   A        
#>  3 obs3      AD        2022      2884.     CP_MEUR   A        
#>  4 obs4      AD        2023      3120.     CP_MEUR   A        
#>  5 obs5      LI        2020      5430.     CP_MEUR   A        
#>  6 obs6      LI        2021      6424.     CP_MEUR   A        
#>  7 obs7      LI        2022      6759.     CP_MEUR   A        
#>  8 obs8      SM        2020      1265.     CP_MEUR   A        
#>  9 obs9      SM        2021      1461.     CP_MEUR   A        
#> 10 obs10     SM        2022      1612.     CP_MEUR   A        
#> 11 obs11     FO        2020      2524.     CP_MEUR   A        
#> 12 obs12     FO        2021      2726.     CP_MEUR   A        
#> 13 obs13     FO        2022      3013.     CP_MEUR   A

The combined dataset behaves like a regular tibble but retains its metadata. If you convert it to a base R data.frame, you will lose the helper methods and built-in checks, but the metadata will remain in the object’s attributes.

attributes(as.data.frame(joined_dataset))
#> $names
#> [1] "rowid" "geo"   "year"  "gdp"   "unit"  "freq" 
#> 
#> $row.names
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13
#> 
#> $dataset_bibentry
#> Dublin Core Metadata Record
#> --------------------------
#> Title:       Small Country Dataset
#> Creator(s):  Jane Doe [ctb]
#> Publisher:   Example Inc.
#> Year:        2025
#> Language:    eng
#> 
#> $prov
#> [1] "<http://example.com/dataset_prov.nt> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Bundle> ."                  
#> [2] "<http://example.com/dataset#> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Entity> ."                         
#> [3] "<http://example.com/dataset#> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/linked-data/cube#DataSet> ."                 
#> [4] "_:unknownauthor <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Agent> ."                                        
#> [5] "<https://doi.org/10.32614/CRAN.package.dataset> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#SoftwareAgent> ."
#> [6] "<http://example.com/creation> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Activity> ."                       
#> [7] "<http://example.com/creation> <http://www.w3.org/ns/prov#generatedAtTime> \"2025-08-25T22:11:30Z\"^^<xsd:dateTime> ."                         
#> 
#> $subject
#> $term
#> [1] "Data sets"
#> 
#> $subjectScheme
#> [1] "LCSH"
#> 
#> $schemeURI
#> [1] "http://id.loc.gov/authorities/subjects"
#> 
#> $valueURI
#> [1] "http://id.loc.gov/authorities/subjects/sh2018002256"
#> 
#> $classificationCode
#> NULL
#> 
#> $prefix
#> [1] "lcsh:"
#> 
#> attr(,"class")
#> [1] "subject" "list"   
#> 
#> $class
#> [1] "data.frame"