By using the formula parameter, it is already possible to protect linked tables with the functions described in the other vignettes. The result is the strictest form of protection, which we call global protection.
This vignette illustrates alternative methods for linked tables. A common method for such protection is back-tracking, where one iterates until a consistent solution is found. In the functions described below, such a method can be achieved by specifying linkedGauss = "back-tracking". With the GaussSuppression package, one can find such a consistent solution using an improved approach, avoiding the need for iteration.
Below we start with some examples of tables protected with the alternative methods. Then we show in more detail different function calls that achieve this. We also discuss the parameters recordAware and collapseAware.
Finally, an example with interval protection is also shown.
We use a modified version of the example 1 dataset used elsewhere.
library(GaussSuppression)
dataset <- SSBtoolsData("example1")
dataset <- dataset[c(1, 2, 4, 6, 8, 10, 12, 13, 14, 15), ]
dataset$freq = c(6, 8, 9, 1, 2, 4, 3, 7, 2, 2)
print(dataset)
#> age geo eu year freq
#> 1 young Spain EU 2014 6
#> 2 young Iceland nonEU 2014 8
#> 4 old Spain EU 2014 9
#> 6 old Portugal EU 2014 1
#> 8 young Iceland nonEU 2015 2
#> 10 old Spain EU 2015 4
#> 12 old Portugal EU 2015 3
#> 13 young Spain EU 2016 7
#> 14 young Iceland nonEU 2016 2
#> 15 young Portugal EU 2016 2
In the examples, we work with two linked tables:
- a three-way table where age, eu, and year are crossed, and
- a two-way table where geo and year are crossed.
In this example, small counts (1s and 2s) are protected. All zeros are treated as known structural zeros and are omitted from both the input and the output.
As in the other vignettes, primary suppressed cells are underlined and labeled in red, while the secondary suppressed cells are labeled in purple.
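As a rough illustration of the primary suppression rule, the following base R sketch rebuilds the inner cells of the second table and flags counts of 1 or 2. It is only a sketch under stated assumptions: the data frame is typed in by hand from the printed dataset, only inner cells (not marginal totals) are checked, and it does not use the package.

```r
# Hand-typed copy of the columns needed from the printed dataset
dataset <- data.frame(
  geo  = c("Spain", "Iceland", "Spain", "Portugal", "Iceland",
           "Spain", "Portugal", "Spain", "Iceland", "Portugal"),
  year = c(2014, 2014, 2014, 2014, 2015, 2015, 2015, 2016, 2016, 2016),
  freq = c(6, 8, 9, 1, 2, 4, 3, 7, 2, 2))

# Inner cells of the geo-by-year table; combinations without records
# (the structural zeros) simply do not appear
table_2 <- aggregate(freq ~ geo + year, data = dataset, FUN = sum)

# Small counts (1s and 2s) are the primary suppressed cells
table_2$primary <- table_2$freq %in% 1:2
print(table_2[table_2$primary, ])
```

The flagged cells are the geo-year combinations that also have primary = TRUE in the package output for this table.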
We first illustrate local protection, where the tables are protected separately without any coordination between them.
Table 1: Linked suppressed tables by linkedGauss = "local"

age | year | nonEU | EU | Total |
---|---|---|---|---|
young | 2014 | 8 | 6 | 14 |
young | 2015 | 2 | | 2 |
young | 2016 | 2 | 9 | 11 |
young | Total | 12 | 15 | 27 |
old | 2014 | | 10 | 10 |
old | 2015 | | 7 | 7 |
old | 2016 | | | |
old | Total | | 17 | 17 |
Total | 2014 | 8 | 16 | 24 |
Total | 2015 | 2 | 7 | 9 |
Total | 2016 | 2 | 9 | 11 |
Total | Total | 12 | 32 | 44 |

year | Iceland | Portugal | Spain | Total |
---|---|---|---|---|
2014 | 8 | 1 | 15 | 24 |
2015 | 2 | 3 | 4 | 9 |
2016 | 2 | 2 | 7 | 11 |
Total | 12 | 6 | 26 | 44 |
Clearly, this is not a satisfactory solution. The totals for 2015 and 2016 are suppressed in one table, but not in the other. Furthermore, there is also an inconsistency for Iceland-2014, which is the same as nonEU-2014.
We continue with consistent protection.
Table 2: Linked suppressed tables by linkedGauss = "consistent"

age | year | nonEU | EU | Total |
---|---|---|---|---|
young | 2014 | 8 | 6 | 14 |
young | 2015 | 2 | | 2 |
young | 2016 | 2 | 9 | 11 |
young | Total | 12 | 15 | 27 |
old | 2014 | | 10 | 10 |
old | 2015 | | 7 | 7 |
old | 2016 | | | |
old | Total | | 17 | 17 |
Total | 2014 | 8 | 16 | 24 |
Total | 2015 | 2 | 7 | 9 |
Total | 2016 | 2 | 9 | 11 |
Total | Total | 12 | 32 | 44 |

year | Iceland | Portugal | Spain | Total |
---|---|---|---|---|
2014 | 8 | 1 | 15 | 24 |
2015 | 2 | 3 | 4 | 9 |
2016 | 2 | 2 | 7 | 11 |
Total | 12 | 6 | 26 | 44 |
The inconsistency problems are now avoided.
However, a remaining problem with this solution is that Spain-2015 can be derived from EU-2015 and Portugal-2015.
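The arithmetic behind this derivation can be checked directly from the data. The sketch below uses a hand-typed copy of the relevant columns and plain base R. It is an illustration only and makes no claim about which cells a given method actually publishes.

```r
# Hand-typed copy of the columns needed from the printed dataset
dataset <- data.frame(
  geo  = c("Spain", "Iceland", "Spain", "Portugal", "Iceland",
           "Spain", "Portugal", "Spain", "Iceland", "Portugal"),
  eu   = c("EU", "nonEU", "EU", "EU", "nonEU", "EU", "EU", "EU", "nonEU", "EU"),
  year = c(2014, 2014, 2014, 2014, 2015, 2015, 2015, 2016, 2016, 2016),
  freq = c(6, 8, 9, 1, 2, 4, 3, 7, 2, 2))

in_2015 <- dataset$year == 2015
eu_2015       <- sum(dataset$freq[in_2015 & dataset$eu == "EU"])        # from table 1
portugal_2015 <- sum(dataset$freq[in_2015 & dataset$geo == "Portugal"]) # from table 2
spain_2015    <- sum(dataset$freq[in_2015 & dataset$geo == "Spain"])    # the cell to protect

# Spain and Portugal are the only EU countries with records in 2015,
# so the difference reveals Spain-2015
eu_2015 - portugal_2015 == spain_2015
```

The derivation only works because an equation from one table is combined with a cell from the other, which is exactly what separate treatment of the tables overlooks.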
Finally, we illustrate an improved form of consistent protection, denoted as super-consistent, which also avoids this problem.
Table 3: Linked suppressed tables by linkedGauss = "super-consistent"

age | year | nonEU | EU | Total |
---|---|---|---|---|
young | 2014 | 8 | 6 | 14 |
young | 2015 | 2 | | 2 |
young | 2016 | 2 | 9 | 11 |
young | Total | 12 | 15 | 27 |
old | 2014 | | 10 | 10 |
old | 2015 | | 7 | 7 |
old | 2016 | | | |
old | Total | | 17 | 17 |
Total | 2014 | 8 | 16 | 24 |
Total | 2015 | 2 | 7 | 9 |
Total | 2016 | 2 | 9 | 11 |
Total | Total | 12 | 32 | 44 |

year | Iceland | Portugal | Spain | Total |
---|---|---|---|---|
2014 | 8 | 1 | 15 | 24 |
2015 | 2 | 3 | 4 | 9 |
2016 | 2 | 2 | 7 | 11 |
Total | 12 | 6 | 26 | 44 |
The suppressed cells in each table correspond to related equations that cannot be solved. The super-consistent method makes use of the fact that common cells across tables must have the same value. Thus, the equations from the different tables can be combined when searching for solutions. The super-consistent method ensures that suppressed cells cannot be uniquely determined from the combined system of equations. However, the coordination is not as strict as in the global method, where the system of equations becomes even larger. In this particular case, the super-consistent solution turns out to be the same as the global one.
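The idea of a combined system of equations can be sketched in base R. Each cell is written as a 0/1 vector over the ten records. A suppressed cell counts as determinable when adding its vector to the published ones does not increase the matrix rank. This is a simplified illustration and not the package's algorithm: the helper names are hypothetical, the data and the super-consistent suppression pattern are typed in by hand from the printed output, and non-negativity of counts is ignored.

```r
dataset <- data.frame(
  age  = c("young", "young", "old", "old", "young", "old", "old", "young", "young", "young"),
  geo  = c("Spain", "Iceland", "Spain", "Portugal", "Iceland",
           "Spain", "Portugal", "Spain", "Iceland", "Portugal"),
  eu   = c("EU", "nonEU", "EU", "EU", "nonEU", "EU", "EU", "EU", "nonEU", "EU"),
  year = c("2014", "2014", "2014", "2014", "2015", "2015", "2015", "2016", "2016", "2016"),
  stringsAsFactors = FALSE)

# All cells of a table: every combination of codes, with "Total" added per variable
all_cells <- function(data, vars) {
  expand.grid(lapply(data[vars], function(x) c("Total", unique(x))),
              stringsAsFactors = FALSE)
}

# Incidence matrix: one column per cell, 1 where a record contributes to the cell
incidence <- function(data, cells) {
  sapply(seq_len(nrow(cells)), function(i) {
    hit <- rep(TRUE, nrow(data))
    for (v in names(cells)) {
      if (cells[i, v] != "Total") hit <- hit & data[[v]] == cells[i, v]
    }
    as.numeric(hit)
  })
}

cells1 <- all_cells(dataset, c("age", "eu", "year"))
cells2 <- all_cells(dataset, c("geo", "year"))
key1 <- do.call(paste, c(cells1, sep = "-"))
key2 <- do.call(paste, c(cells2, sep = "-"))

# Suppressed cells copied from the printed super-consistent output
sup1 <- c("Total-Total-2015", "Total-Total-2016", "Total-nonEU-2015", "Total-nonEU-2016",
          "young-Total-2015", "young-Total-2016", "young-nonEU-2015", "young-nonEU-2016")
sup2 <- c("Total-2015", "Total-2016", "Iceland-2015", "Iceland-2016",
          "Portugal-2014", "Portugal-2016", "Spain-2014", "Spain-2016")

A1 <- incidence(dataset, cells1)
A2 <- incidence(dataset, cells2)
published  <- cbind(A1[, !(key1 %in% sup1)], A2[, !(key2 %in% sup2)])
suppressed <- cbind(A1[, key1 %in% sup1], A2[, key2 %in% sup2])

# A cell is determinable if its vector lies in the span of the published vectors
determinable <- function(published, a) {
  qr(cbind(published, a))$rank == qr(published)$rank
}
solvable <- apply(suppressed, 2, determinable, published = published)
any(solvable)
```

Because the published columns of both tables are bound together before the rank is computed, the check covers derivations that mix equations from the two tables.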
To achieve both treating zeros as known structural zeros and omitting them from the output, we use the parameter settings extend0 = FALSE and removeEmpty = TRUE.
In SuppressLinkedTables(), the argument withinArg specifies which parameters may differ between the linked tables. In our examples, we choose this to be either dimVar, hierarchies, or formula.
The output from SuppressLinkedTables() is a list, with one element for each of the linked tables.
SuppressLinkedTables() with dimVar
output <- SuppressLinkedTables(data = dataset,
              fun = SuppressSmallCounts,
              withinArg = list(table_1 = list(dimVar = c("age", "eu", "year")),
                               table_2 = list(dimVar = c("geo", "year"))),
              freqVar = "freq",
              maxN = 2,
              extend0 = FALSE,
              removeEmpty = TRUE,
              linkedGauss = "super-consistent")
#>
#> ====== Linked GaussSuppression by "super-consistent" algorithm:
#>
#> GaussSuppression_anySum: .....................................
print(output[["table_1"]])
#> age eu year freq primary suppressed
#> 1 Total Total Total 44 FALSE FALSE
#> 2 Total Total 2014 24 FALSE FALSE
#> 3 Total Total 2015 9 FALSE TRUE
#> 4 Total Total 2016 11 FALSE TRUE
#> 5 Total EU Total 32 FALSE FALSE
#> 6 Total EU 2014 16 FALSE FALSE
#> 7 Total EU 2015 7 FALSE FALSE
#> 8 Total EU 2016 9 FALSE FALSE
#> 9 Total nonEU Total 12 FALSE FALSE
#> 10 Total nonEU 2014 8 FALSE FALSE
#> 11 Total nonEU 2015 2 TRUE TRUE
#> 12 Total nonEU 2016 2 TRUE TRUE
#> 13 old Total Total 17 FALSE FALSE
#> 14 old Total 2014 10 FALSE FALSE
#> 15 old Total 2015 7 FALSE FALSE
#> 16 old EU Total 17 FALSE FALSE
#> 17 old EU 2014 10 FALSE FALSE
#> 18 old EU 2015 7 FALSE FALSE
#> 19 young Total Total 27 FALSE FALSE
#> 20 young Total 2014 14 FALSE FALSE
#> 21 young Total 2015 2 TRUE TRUE
#> 22 young Total 2016 11 FALSE TRUE
#> 23 young EU Total 15 FALSE FALSE
#> 24 young EU 2014 6 FALSE FALSE
#> 25 young EU 2016 9 FALSE FALSE
#> 26 young nonEU Total 12 FALSE FALSE
#> 27 young nonEU 2014 8 FALSE FALSE
#> 28 young nonEU 2015 2 TRUE TRUE
#> 29 young nonEU 2016 2 TRUE TRUE
print(output[["table_2"]])
#> geo year freq primary suppressed
#> 1 Total Total 44 FALSE FALSE
#> 2 Total 2014 24 FALSE FALSE
#> 3 Total 2015 9 FALSE TRUE
#> 4 Total 2016 11 FALSE TRUE
#> 5 Iceland Total 12 FALSE FALSE
#> 6 Iceland 2014 8 FALSE FALSE
#> 7 Iceland 2015 2 TRUE TRUE
#> 8 Iceland 2016 2 TRUE TRUE
#> 9 Portugal Total 6 FALSE FALSE
#> 10 Portugal 2014 1 TRUE TRUE
#> 11 Portugal 2015 3 FALSE FALSE
#> 12 Portugal 2016 2 TRUE TRUE
#> 13 Spain Total 26 FALSE FALSE
#> 14 Spain 2014 15 FALSE TRUE
#> 15 Spain 2015 4 FALSE FALSE
#> 16 Spain 2016 7 FALSE TRUE
SuppressLinkedTables() with hierarchies
First, we need hierarchies for the input. Here, these are generated separately with SSBtools::FindDimLists().
h_age <- SSBtools::FindDimLists(dataset["age"])[[1]]
h_geo <- SSBtools::FindDimLists(dataset["geo"])[[1]]
h_eu <- SSBtools::FindDimLists(dataset["eu"])[[1]]
h_year <- SSBtools::FindDimLists(dataset["year"])[[1]]
print(h_age)
#> levels codes
#> 1 @ Total
#> 2 @@ old
#> 3 @@ young
print(h_geo)
#> levels codes
#> 1 @ Total
#> 2 @@ Iceland
#> 3 @@ Portugal
#> 4 @@ Spain
print(h_eu)
#> levels codes
#> 1 @ Total
#> 2 @@ EU
#> 3 @@ nonEU
print(h_year)
#> levels codes
#> 1 @ Total
#> 2 @@ 2014
#> 3 @@ 2015
#> 4 @@ 2016
The output is identical to using dimVar, so we only show the code. Note that the only difference is the withinArg argument.
output <- SuppressLinkedTables(data = dataset,
              fun = SuppressSmallCounts,
              withinArg =
                list(table_1 = list(hierarchies = list(age = h_age, eu = h_eu, year = h_year)),
                     table_2 = list(hierarchies = list(geo = h_geo, year = h_year))),
              freqVar = "freq",
              maxN = 2,
              extend0 = FALSE,
              removeEmpty = TRUE,
              linkedGauss = "super-consistent")
SuppressLinkedTables() with formula
When using formula, the output is similar to that obtained with dimVar or hierarchies. The only difference in the output is the ordering of rows, so we only show the code.
Again, the only difference in the code is the withinArg argument. However, note that we have omitted removeEmpty = TRUE here, since this is the default when a formula is used as input.
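A minimal sketch of such a call, put together from the description above and the arguments used in the other calls of this vignette, might look as follows. The formulas are the ones that appear in the later SuppressSmallCounts() example. The requireNamespace() guard is an addition so that the snippet also runs where the package is not installed, and the name within_arg is chosen here for illustration.

```r
# Per-table arguments: only the formula differs between the two linked tables
within_arg <- list(table_1 = list(formula = ~age * eu * year),
                   table_2 = list(formula = ~geo * year))

if (requireNamespace("GaussSuppression", quietly = TRUE)) {
  library(GaussSuppression)

  # Same modified example 1 dataset as in the rest of the vignette
  dataset <- SSBtoolsData("example1")
  dataset <- dataset[c(1, 2, 4, 6, 8, 10, 12, 13, 14, 15), ]
  dataset$freq <- c(6, 8, 9, 1, 2, 4, 3, 7, 2, 2)

  # removeEmpty is left out, relying on the default for formula input
  output <- SuppressLinkedTables(data = dataset,
                                 fun = SuppressSmallCounts,
                                 withinArg = within_arg,
                                 freqVar = "freq",
                                 maxN = 2,
                                 extend0 = FALSE,
                                 linkedGauss = "super-consistent")
}
```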
SuppressSmallCounts() with formula and linkedGauss
Since only the formula parameter varies between the linked tables, one option is to run SuppressSmallCounts() directly with formula as a list and the linkedGauss parameter specified. Here we show 10 output rows.
output <- SuppressSmallCounts(data = dataset,
              formula = list(table_1 = ~age*eu*year, table_2 = ~geo*year),
              freqVar = "freq",
              maxN = 2,
              extend0 = FALSE,
              linkedGauss = "super-consistent")
#>
#> ====== Linked GaussSuppression by "super-consistent" algorithm:
#>
#> GaussSuppression_anySum: ....................................
print(output[c(1, 6:7, 12, 19, 23, 25:28), ])
#> age year geo freq primary suppressed
#> 1 Total Total Total 44 FALSE FALSE
#> 6 Total 2014 Total 24 FALSE FALSE
#> 7 Total 2015 Total 9 FALSE TRUE
#> 12 old Total EU 17 FALSE FALSE
#> 19 young 2016 Total 11 FALSE TRUE
#> 23 Total 2014 nonEU 8 FALSE FALSE
#> 25 Total 2016 nonEU 2 TRUE TRUE
#> 26 Total 2014 Iceland 8 FALSE FALSE
#> 27 Total 2014 Portugal 1 TRUE TRUE
#> 28 Total 2014 Spain 15 FALSE TRUE
tables_by_formulas() with formula and linkedGauss
Similar output can be obtained with tables_by_formulas(). In this case, the region variable is specified manually, and table membership variables are included in the output. Again, 10 output rows are shown.
output <- tables_by_formulas(data = dataset,
              table_fun = SuppressSmallCounts,
              table_formulas = list(table_1 = ~age*eu*year, table_2 = ~geo*year),
              freqVar = "freq",
              maxN = 2,
              extend0 = FALSE,
              linkedGauss = "super-consistent",
              substitute_vars = list(region = c("geo", "eu")))
#>
#> ====== Linked GaussSuppression by "super-consistent" algorithm:
#>
#> GaussSuppression_anySum: ....................................
print(output[c(1, 6:7, 12, 19, 23, 25:28), ])
#> age year region freq primary suppressed table_1 table_2
#> 1 Total Total Total 44 FALSE FALSE TRUE TRUE
#> 6 Total 2014 Total 24 FALSE FALSE TRUE TRUE
#> 7 Total 2015 Total 9 FALSE TRUE TRUE TRUE
#> 12 old Total EU 17 FALSE FALSE TRUE FALSE
#> 19 young 2016 Total 11 FALSE TRUE TRUE FALSE
#> 23 Total 2014 nonEU 8 FALSE FALSE TRUE FALSE
#> 25 Total 2016 nonEU 2 TRUE TRUE TRUE FALSE
#> 26 Total 2014 Iceland 8 FALSE FALSE FALSE TRUE
#> 27 Total 2014 Portugal 1 TRUE TRUE FALSE TRUE
#> 28 Total 2014 Spain 15 FALSE TRUE FALSE TRUE
recordAware and collapseAware
An important issue is which cells are considered common cells. In the functions, the parameter recordAware is set to TRUE by default. In this case, common cells are determined based on whether they aggregate the same underlying records. This is similar to the use of cell keys, a well-known concept from the cell-key method of statistical disclosure control.
When recordAware = FALSE, common cells are instead identified by matching variable combinations. This does not always work well. For example, here recordAware = TRUE is necessary to capture that Iceland-2014 and nonEU-2014 are the same.
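The record-based notion of a common cell can be illustrated with a small base R sketch. Here a cell is identified by the row numbers of the records it aggregates. This mimics the idea only: the helper record_key() is hypothetical, the data are typed in by hand, and the package's internal implementation may differ.

```r
# Hand-typed copy of the columns needed from the printed dataset
dataset <- data.frame(
  geo  = c("Spain", "Iceland", "Spain", "Portugal", "Iceland",
           "Spain", "Portugal", "Spain", "Iceland", "Portugal"),
  eu   = c("EU", "nonEU", "EU", "EU", "nonEU", "EU", "EU", "EU", "nonEU", "EU"),
  year = c(2014, 2014, 2014, 2014, 2015, 2015, 2015, 2016, 2016, 2016))

# Key for a cell: the sorted row indices of the records it aggregates
record_key <- function(rows) paste(sort(which(rows)), collapse = ",")

key_iceland_2014 <- record_key(dataset$geo == "Iceland" & dataset$year == 2014)
key_noneu_2014   <- record_key(dataset$eu == "nonEU" & dataset$year == 2014)
key_spain_2014   <- record_key(dataset$geo == "Spain" & dataset$year == 2014)

# Different variables (geo versus eu), yet the same underlying records
key_iceland_2014 == key_noneu_2014
# A cell built from other records gets a different key
key_iceland_2014 == key_spain_2014
```

Matching on variable combinations alone would miss the first pair, since one cell is defined through geo and the other through eu.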
A related parameter is collapseAware, but it is not available when using SuppressLinkedTables(). When it is used, even more cells are treated as common cells. In particular, the suppression algorithm then automatically accounts for cells in one table that are sums of cells in another table. In our example, this means that the combination "consistent" and collapseAware = TRUE gives the same result as "super-consistent".
For more details on parameters and options, see the documentation for SuppressLinkedTables().
Intervals for the primary suppressed cells are computed whenever the lpPackage parameter is specified. When linkedGauss = "super-consistent", intervals can be calculated using this method as well.
There are several possibilities. See the documentation for the parameter linkedIntervals in the help page for SuppressLinkedTables().
If rangePercent and/or rangeMin are provided, further suppression is performed to ensure that the interval width requirements are met. See the help page for GaussSuppressionFromData(), under the description of the lpPackage parameter, for more details.
In the example below, the required interval width is 4. To achieve this, two additional cells are suppressed: Portugal-2015 and Spain-2015. Without this additional suppression, some intervals are as narrow as 3 (see variables lo_1 and up_1 below).
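The effect of the two additional suppressions on Portugal-2014 can be reproduced by brute force. The base R sketch below enumerates all non-negative integer solutions for the unknown inner cells. It is a simplified stand-in for the linear programming done through lpPackage, not the package's method. The function name is hypothetical, and the published totals in the comments are read off the printed tables.

```r
# Range of Portugal-2014 over all non-negative integer solutions.
# Unknown inner cells: r3 = old Spain 2014, r4 = old Portugal 2014,
# r6 = old Spain 2015, r7 = old Portugal 2015,
# r8 = young Spain 2016, r10 = young Portugal 2016.
portugal_2014_range <- function(publish_2015) {
  feasible <- integer(0)
  for (r4 in 0:6) {
    for (r7 in 0:7) {
      r3  <- 10L - r4        # old-EU-2014 = 10
      r6  <- 7L - r7         # EU-2015 = 7
      r10 <- 6L - r4 - r7    # Portugal-Total = 6
      r8  <- 9L - r10        # young-EU-2016 = 9
      # Spain-Total = 26, of which young-EU-2014 = 6 is known
      ok <- all(c(r3, r6, r8, r10) >= 0) && (r3 + r6 + r8 == 20L)
      # Portugal-2015 published as 3; Spain-2015 = 4 then follows from EU-2015
      if (publish_2015) ok <- ok && (r7 == 3L)
      if (ok) feasible <- c(feasible, r4)
    }
  }
  range(feasible)
}

portugal_2014_range(publish_2015 = TRUE)   # Portugal-2015 and Spain-2015 published
portugal_2014_range(publish_2015 = FALSE)  # both additionally suppressed
```

With the 2015 cells published the range is 0 to 3, and with them suppressed it widens to 0 to 6, in line with the lo_1, up_1 and lo, up values reported for Portugal-2014.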
Table 4: Linked suppressed tables with intervals by linkedGauss = "super-consistent", rangeMin = 4

age | year | nonEU | EU | Total |
---|---|---|---|---|
young | 2014 | 8 | 6 | 14 |
young | 2015 | 2 [0, 4] | | 2 [0, 4] |
young | 2016 | 2 [0, 4] | 9 | 11 |
young | Total | 12 | 15 | 27 |
old | 2014 | | 10 | 10 |
old | 2015 | | 7 | 7 |
old | 2016 | | | |
old | Total | | 17 | 17 |
Total | 2014 | 8 | 16 | 24 |
Total | 2015 | 2 [0, 4] | 7 | 9 |
Total | 2016 | 2 [0, 4] | 9 | 11 |
Total | Total | 12 | 32 | 44 |

year | Iceland | Portugal | Spain | Total |
---|---|---|---|---|
2014 | 8 | 1 [0, 6] | 15 | 24 |
2015 | 2 [0, 4] | 3 | 4 | 9 |
2016 | 2 [0, 4] | 2 [0, 6] | 7 | 11 |
Total | 12 | 6 | 26 | 44 |
This functionality can be used with all the function calls above. The example below uses SuppressLinkedTables() with dimVar.
output <- SuppressLinkedTables(data = dataset,
              fun = SuppressSmallCounts,
              withinArg = list(table_1 = list(dimVar = c("age", "eu", "year")),
                               table_2 = list(dimVar = c("geo", "year"))),
              freqVar = "freq",
              maxN = 2,
              extend0 = FALSE,
              removeEmpty = TRUE,
              linkedGauss = "super-consistent",
              lpPackage = "highs",
              rangeMin = 4)
#>
#> ====== Linked GaussSuppression by "super-consistent" algorithm:
#>
#> GaussSuppression_anySum: .....................................
#> (20*18-DDrow->16*18->-0exact->9*5-DDcol2->9*3-GaussI->9*3)
#>
#> Using highs for intervals...
#> ----
#> (20*18-DDrow->16*18->)
#> ..................
#> 10+1-6+2-5+3+
#> 2: 1 new, (4.000) 1-
#> 1: 2 new, (3.000) 1+
#> GaussSuppression_none: .............................
#> (20*16-DDrow->16*16->-0exact->11*5-DDcol2->11*3-GaussI->11*3)
#>
#> Using highs for intervals...
#> ----
print(output[["table_1"]])
#> age eu year freq rlim_freq lo_1 up_1 lo up suppressed_integer primary
#> 1 Total Total Total 44 NA NA NA NA NA 0 FALSE
#> 2 Total Total 2014 24 NA NA NA NA NA 0 FALSE
#> 3 Total Total 2015 9 NA NA NA NA NA 2 FALSE
#> 4 Total Total 2016 11 NA NA NA NA NA 2 FALSE
#> 5 Total EU Total 32 NA NA NA NA NA 0 FALSE
#> 6 Total EU 2014 16 NA NA NA NA NA 0 FALSE
#> 7 Total EU 2015 7 NA NA NA NA NA 0 FALSE
#> 8 Total EU 2016 9 NA NA NA NA NA 0 FALSE
#> 9 Total nonEU Total 12 NA NA NA NA NA 0 FALSE
#> 10 Total nonEU 2014 8 NA NA NA NA NA 0 FALSE
#> 11 Total nonEU 2015 2 4 0 4 0 4 1 TRUE
#> 12 Total nonEU 2016 2 4 0 4 0 4 1 TRUE
#> 13 old Total Total 17 NA NA NA NA NA 0 FALSE
#> 14 old Total 2014 10 NA NA NA NA NA 0 FALSE
#> 15 old Total 2015 7 NA NA NA NA NA 0 FALSE
#> 16 old EU Total 17 NA NA NA NA NA 0 FALSE
#> 17 old EU 2014 10 NA NA NA NA NA 0 FALSE
#> 18 old EU 2015 7 NA NA NA NA NA 0 FALSE
#> 19 young Total Total 27 NA NA NA NA NA 0 FALSE
#> 20 young Total 2014 14 NA NA NA NA NA 0 FALSE
#> 21 young Total 2015 2 4 0 4 0 4 1 TRUE
#> 22 young Total 2016 11 NA NA NA NA NA 2 FALSE
#> 23 young EU Total 15 NA NA NA NA NA 0 FALSE
#> 24 young EU 2014 6 NA NA NA NA NA 0 FALSE
#> 25 young EU 2016 9 NA NA NA NA NA 0 FALSE
#> 26 young nonEU Total 12 NA NA NA NA NA 0 FALSE
#> 27 young nonEU 2014 8 NA NA NA NA NA 0 FALSE
#> 28 young nonEU 2015 2 4 0 4 0 4 1 TRUE
#> 29 young nonEU 2016 2 4 0 4 0 4 1 TRUE
#> suppressed
#> 1 FALSE
#> 2 FALSE
#> 3 TRUE
#> 4 TRUE
#> 5 FALSE
#> 6 FALSE
#> 7 FALSE
#> 8 FALSE
#> 9 FALSE
#> 10 FALSE
#> 11 TRUE
#> 12 TRUE
#> 13 FALSE
#> 14 FALSE
#> 15 FALSE
#> 16 FALSE
#> 17 FALSE
#> 18 FALSE
#> 19 FALSE
#> 20 FALSE
#> 21 TRUE
#> 22 TRUE
#> 23 FALSE
#> 24 FALSE
#> 25 FALSE
#> 26 FALSE
#> 27 FALSE
#> 28 TRUE
#> 29 TRUE
print(output[["table_2"]])
#> geo year freq rlim_freq lo_1 up_1 lo up suppressed_integer primary
#> 1 Total Total 44 NA NA NA NA NA 0 FALSE
#> 2 Total 2014 24 NA NA NA NA NA 0 FALSE
#> 3 Total 2015 9 NA NA NA NA NA 2 FALSE
#> 4 Total 2016 11 NA NA NA NA NA 2 FALSE
#> 5 Iceland Total 12 NA NA NA NA NA 0 FALSE
#> 6 Iceland 2014 8 NA NA NA NA NA 0 FALSE
#> 7 Iceland 2015 2 4 0 4 0 4 1 TRUE
#> 8 Iceland 2016 2 4 0 4 0 4 1 TRUE
#> 9 Portugal Total 6 NA NA NA NA NA 0 FALSE
#> 10 Portugal 2014 1 4 0 3 0 6 1 TRUE
#> 11 Portugal 2015 3 NA NA NA NA NA 3 FALSE
#> 12 Portugal 2016 2 4 0 3 0 6 1 TRUE
#> 13 Spain Total 26 NA NA NA NA NA 0 FALSE
#> 14 Spain 2014 15 NA NA NA NA NA 2 FALSE
#> 15 Spain 2015 4 NA NA NA NA NA 3 FALSE
#> 16 Spain 2016 7 NA NA NA NA NA 2 FALSE
#> suppressed
#> 1 FALSE
#> 2 FALSE
#> 3 TRUE
#> 4 TRUE
#> 5 FALSE
#> 6 FALSE
#> 7 TRUE
#> 8 TRUE
#> 9 FALSE
#> 10 TRUE
#> 11 TRUE
#> 12 TRUE
#> 13 FALSE
#> 14 TRUE
#> 15 TRUE
#> 16 TRUE