Meet Edith: How to get a clear view of hierarchical data? Answer: Sunburst Charts

Edith Darin

Background

Bayesian models are great for hierarchical data, that is observations that have a grouped structure, typically grades of students that belongs to different classrooms in different schools.

This type of data structure violates the assumption of independence, key for basic models and can produce mislead conclusions, when the group-specific patterns differ from the global pattern.

But the point here is not to describe Bayesian hierarchichal models.

We are taking a step backward and we want to visualise the hierarchical structure of our data, for two reasons:

communicate with a lay audience
explore the groupings

Indeed there are two types of data hierarchy: crossed and nested. A crossed design happens when all combination of groupings exist (either in data or in reality). Typically if we were to distinguish between foreign-born and native students, we would expect that this grouping can occur in every classroom. On the opposite, a nested design describes a strict hierarchy, typically, one classroom belongs to one and only one school.

We want to show that sunburst charts¹ are a great tool to visualise the hierarchy and combined with some tidyverse magic can unravel missing combinations.

Example of a sunburst chart

R set-up

show

library(tidyverse)
library(RColorBrewer)
library(kableExtra)
library(plotly) # contains function for sunburst plot

Data

We will use a dataset provided by Leasure and al. as supplement to their paper (Douglas R. Leasure et al. 2020a).

show

# download  tutorial data
download.file(
  "https://www.pnas.org/highwire/filestream/949050/field_highwire_adjunct_files/1/pnas.1913050117.sd01.xls",
  '_posts/2021-07-31-unraveling-hierarchichal-data-structure-with-sunburst-charts/nga_demo_data.xls',
  method='libcurl',
  mode='wb'
)
getwd()

It consist in household surveys that collected information on the total population in 1141 clusters in 15 of 37 states in Nigeria during 2016 and 2017. The surveys are further described in Douglas R. Leasure et al. (2020b) and Weber et al. (2018).

show

# prepare data
data <- readxl::read_excel('_posts/2021-07-31-unraveling-hierarchichal-data-structure-with-sunburst-charts/nga_demo_data.xls')

data <- data %>% 
  mutate(
    id = paste0('cluster',1:nrow(data)),
    pop_density=N/A
  ) %>% 
  rename(
    population=N
  ) %>% 
  select(id, population, pop_density, type, region, state, local )

data %>% head()  %>% kbl() %>% kable_minimal()

id	population	pop_density	type	region	state	local
cluster1	547	180.11207	1	2	1	4
cluster2	803	268.57792	1	2	1	10
cluster3	750	243.64275	1	2	1	12
cluster4	281	93.06772	1	2	1	17
cluster5	730	241.30606	1	2	1	4
cluster6	529	171.19344	1	2	1	17

There are two sources of grouping in the data: by settlement type and by administrative divisions. There is three levels of administrative divisions reported: region, state and local governement area local. Settlement type is characteristic of a crossed hierarchy: every settlement type can be (potentially) present in all subgrouping eg in all regions, states or local government area.

We plot first the variation of the response variable population_density by settlement type

show

# plot population density per region
ggplot(data %>% 
         group_by(
           region
         ) %>% 
         mutate(
           mean_popDens = mean(pop_density)
           ) %>% 
         ungroup(), 
       aes(fill=mean_popDens, x=pop_density, y=as.factor(region)))+
  geom_boxplot()+
  theme_minimal()+
  scale_fill_stepsn( colours = brewer.pal(6, "YlOrRd"))+
  labs(fill='Mean \npopulation \ndensity', x='Population density', y='Region')

Figure 1: Boxplot of population densities per region

Figure 1 shows the motivation for building a hierarchical model.

Visualising the hierarchical structure

We will make the use of the chart type sunburstof the package plotly.

We need first to prepare the data with unique identifier for each administrative region:

show

# create unique id for the nested admin level
data <- data %>% 
  mutate(state= paste0(state,region),
         local = paste0(state, local))

A sunburst chart is divided into concentric layers that are defined by the idof each grouping and its corresponding parent.

We define the first layer, that is the kernel of the circles with the level typefrom the hierarchy. Because it is the kernel it has no parents we leave this attribute blank. We define also an attribute hoverto display the number of clusters in each grouping on hover.

show

layer1 <- data %>% 
    group_by(type) %>% 
    summarise(n=n()) %>% 
    mutate(
      ids = paste0('settlement', type),
      labels = paste0('settlement <br>', type),
      parents = '') %>% 
    ungroup() %>% 
    select(ids,labels, parents,n) %>% 
  mutate(
    hover= paste('\n sample size', n)
  )

plot_ly(layer1, 
        ids = ~ids, 
        labels = ~labels, 
        parents = ~parents, 
        type = 'sunburst', 
        hovertext=~hover, insidetextorientation='radial')

The next level corresponds the region and has for parent the settlement type ids previously defined:

show

sunburst_data_2layers <- rbind(
  # first layer
  layer1,
  # second layer
  data %>% 
    group_by(type, region) %>% 
    summarise(n=n()) %>% 
    mutate(
      ids = paste('settlement', type, '-', 'region', region),
      labels = paste0('region ', region),
      parents = paste0('settlement', type))%>% 
    ungroup() %>% 
    select(ids,labels, parents,n)%>% 
  mutate(
    hover= paste(ids, '\n sample size', n)
  )) 

plot_ly(sunburst_data_2layers, 
        ids = ~ids, 
        labels = ~labels,
        parents = ~parents, 
        type = 'sunburst', 
        hovertext=~hover, insidetextorientation='radial')

To get the full picture we add the two levels, stateand local .

show

# create data for sunburst plot
sunburst_data_4layers <- rbind(
  # first layer
  layer1,
  # second layer
  data %>% 
    group_by(type, region) %>% 
    summarise(n=n()) %>% 
    mutate(
      ids = paste('settlement', type, '-', 'region', region),
      labels = paste0('region ', region),
      parents = paste0('settlement', type))%>% 
    ungroup() %>% 
    select(ids,labels, parents,n)%>% 
  mutate(
    hover= paste(ids, '\n sample size', n)
  ),
  # third layer
  data %>% 
    group_by(type, region, state) %>% 
    summarise(n=n()) %>% 
    mutate(
      ids = paste('settlement', type, '-', 'region', region, '-', 'state', state, '-', 'region', region),
      labels = paste0('state ', state),
      parents = paste('settlement', type, '-', 'region', region))%>% 
    ungroup() %>% 
    select(ids,labels, parents,n)%>% 
  mutate(
    hover= paste(ids, '\n sample size', n)
  ),
  # fourth layer
  data %>% 
    group_by(type, region, state, local) %>% 
    summarise(n=n()) %>% 
    mutate(
      ids = paste('settlement', type, '-', 'region', region, '-', 'state', state,  '-', local),
      labels = paste0('local ', local),
      parents = paste('settlement', type, '-', 'region', region, '-', 'state', state, '-', 'region', region))%>% 
    ungroup() %>% 
    select(ids,labels, parents,n) %>% 
  mutate(
    hover= paste(ids, '\n sample size', n)
  )
)

plot_ly(sunburst_data_4layers, 
        ids = ~ids, 
        labels = ~labels, 
        parents = ~parents, 
        type = 'sunburst', 
        hovertext=~hover, insidetextorientation='radial')

Figure 2: Full grouping structure of the survey

Plotly enables interaction with the chart: click on the grouping to get the detailed picture of the hierarchy.

Visualising missing combinations

It is important to explore what are the missing combinations in the data:.

to understand the characteristics of our sample
to index correctly the model

For that purpose we will make use of the wonderful function from the tidyversesuite: complete.

To demonstrate, we will use the two first layers that are typeand region, such that we want to spot the regions that do not contain all settlement types:

show

# create missing combinations
data_complete <- data %>% 
  complete(region, nesting(type))

data_complete %>% filter(is.na(local)) %>% kbl() %>% kable_minimal()

region	type	id	population	pop_density	state	local
4	1	NA	NA	NA	NA	NA
4	2	NA	NA	NA	NA	NA
4	3	NA	NA	NA	NA	NA
4	4	NA	NA	NA	NA	NA
10	1	NA	NA	NA	NA	NA
10	2	NA	NA	NA	NA	NA
10	3	NA	NA	NA	NA	NA
10	4	NA	NA	NA	NA	NA

We see that two regions (4 and 10) comprise only settlement type 5.

We can visualise it with a sunburst chart by adding a color attribute:

show

makeSunburst2layer <- function(data){
  layers <- rbind(
  # first layer
  data %>% 
    group_by(type) %>% 
    summarise(n=sum(!is.na(population))) %>% 
    mutate(
      ids = paste0('settlement', type),
      labels = paste0('settlement <br>', type),
      parents = '') %>% 
    ungroup() %>% 
    select(ids,labels, parents,n),
  # second layer
  data %>% 
    group_by(type, region) %>% 
    summarise(n=sum(!is.na(population))) %>% 
    mutate(
      ids = paste('settlement', type, '-', 'region', region),
      labels = paste0('region ', region),
      parents = paste0('settlement', type))%>% 
    ungroup() %>% 
    select(ids,labels, parents,n)) %>%
    mutate(
      hover= paste(ids, '\n sample size', n),
      color= ifelse(n==0, 'yellow','')
    )
  
 return(layers)
}

plot_ly() %>% 
   add_trace(data=makeSunburst2layer(data), 
             ids = ~ids, labels = ~labels, parents = ~parents, 
             type = 'sunburst', 
             hovertext=~hover, marker= list(colors=~color),  
             insidetextorientation='radial',
             domain = list(column = 0)) %>% 
     add_trace(data=makeSunburst2layer(data_complete), 
             ids = ~ids, labels = ~labels, parents = ~parents, 
             type = 'sunburst', 
             hovertext=~hover, marker= list(colors=~color),  
             insidetextorientation='radial',
             domain = list(column = 1))  %>%  
  layout(
      grid = list(columns =2, rows = 1),
      margin = list(l = 0, r = 0, b = 0, t = 0))

Figure 3: Full grouping structure of the survey with missing groups

Leasure, Douglas R., Warren C. Jochem, Eric M. Weber, Vincent Seaman, and Andrew J. Tatem. 2020a. “National Population Mapping from Sparse Survey Data: A Hierarchical Bayesian Modeling Framework to Account for Uncertainty.” Proceedings of the National Academy of Sciences 117 (39): 24173–79. https://doi.org/10.1073/pnas.1913050117.

Leasure, Douglas R, Warren C Jochem, Eric M Weber, Vincent Seaman, and Andrew J Tatem. 2020b. “National Population Mapping from Sparse Survey Data: A Hierarchical Bayesian Modeling Framework to Account for Uncertainty.” Proceedings of the National Academy of Sciences.

Weber, Eric M., Vincent Y. Seaman, Robert N. Stewart, Tomas J. Bird, Andrew J. Tatem, Jacob J. McKee, Budhendra L. Bhaduri, Jessica J. Moehl, and Andrew E. Reith. 2018. “Census-Independent Population Mapping in Northern Nigeria.” Remote Sensing of Environment 204 (January): 786–98. https://doi.org/10.1016/j.rse.2017.09.024.

The example sunburst chart is from: https://www.syncfusion.com/products/wpf/control/images/sfsunburst/sunburst.png)’)↩︎

How to get a clear view of hierarchical data? Answer: Sunburst Charts

Background

R set-up

Data

Visualising the hierarchical structure

Visualising missing combinations

References

Citation

region	type	id	population	pop_density	state	local
4	1	NA	NA	NA	NA	NA
4	2	NA	NA	NA	NA	NA
4	3	NA	NA	NA	NA	NA
4	4	NA	NA	NA	NA	NA
10	1	NA	NA	NA	NA	NA
10	2	NA	NA	NA	NA	NA
10	3	NA	NA	NA	NA	NA
10	4	NA	NA	NA	NA	NA

region	type	id	population	pop_density	state	local
4	1	NA	NA	NA	NA	NA
4	2	NA	NA	NA	NA	NA
4	3	NA	NA	NA	NA	NA
4	4	NA	NA	NA	NA	NA
10	1	NA	NA	NA	NA	NA
10	2	NA	NA	NA	NA	NA
10	3	NA	NA	NA	NA	NA
10	4	NA	NA	NA	NA	NA

region	type	id	population	pop_density	state	local
4	1	NA	NA	NA	NA	NA
4	2	NA	NA	NA	NA	NA
4	3	NA	NA	NA	NA	NA
4	4	NA	NA	NA	NA	NA
10	1	NA	NA	NA	NA	NA
10	2	NA	NA	NA	NA	NA
10	3	NA	NA	NA	NA	NA
10	4	NA	NA	NA	NA	NA