Join Aggregate Transform#

The Join Aggregate transform acts in almost every way the same as an Aggregate transform, but the resulting aggregate is joined to the original dataset. To make this more clear, consider the following dataset:

import pandas as pd
import numpy as np

rand = np.random.RandomState(0)

df = pd.DataFrame({
    'label': rand.choice(['A', 'B', 'C'], 10),
    'value': rand.randn(10),
})
df
      label     value
    0     A -0.173070
    1     B -1.761652
    2     A -0.087673
    3     B  1.366879
    4     B  1.125314
    5     C -0.358996
    6     A  1.220608
    7     C -1.339496
    8     A  0.428373
    9     A -0.123463

Here is a pandas operation that is equivalent to Altair’s Aggregate transform, using the mean as an example:

mean = df.groupby('label').mean().reset_index()
mean
      label     value
    0     A  0.252955
    1     B  0.243514
    2     C -0.849246

And here is an output that is equivalent to Altair’s Join Aggregate:

pd.merge(df, mean, on='label', suffixes=['', '_mean'])
      label     value  value_mean
    0     A -0.173070    0.252955
    1     A -0.087673    0.252955
    2     A  1.220608    0.252955
    3     A  0.428373    0.252955
    4     A -0.123463    0.252955
    5     B -1.761652    0.243514
    6     B  1.366879    0.243514
    7     B  1.125314    0.243514
    8     C -0.358996   -0.849246
    9     C -1.339496   -0.849246

Notice that the join aggregate joins the aggregated value with the original dataframe, such that the aggregated values can be used in tandem with the original values if desired.

Here is an example of how the join aggregate might be used: we compare the IMDB and Rotten Tomatoes movie ratings, normalized by their mean and standard deviation, which requires calculations on the joined data:

import altair as alt
from vega_datasets import data

alt.Chart(data.movies.url).transform_filter(
    'datum.IMDB_Rating != null  && datum.Rotten_Tomatoes_Rating != null'
).transform_joinaggregate(
    IMDB_mean='mean(IMDB_Rating)',
    IMDB_std='stdev(IMDB_Rating)',
    RT_mean='mean(Rotten_Tomatoes_Rating)',
    RT_std='stdev(Rotten_Tomatoes_Rating)'
).transform_calculate(
    IMDB_Deviation="(datum.IMDB_Rating - datum.IMDB_mean) / datum.IMDB_std",
    Rotten_Tomatoes_Deviation="(datum.Rotten_Tomatoes_Rating - datum.RT_mean) / datum.RT_std"
).mark_point().encode(
    x='IMDB_Deviation:Q',
    y="Rotten_Tomatoes_Deviation:Q"
)

Transform Options#

The transform_joinaggregate() method is built on the JoinAggregateTransform class, which has the following options:

Property

Type

Description

groupby

array(FieldName)

The data fields for partitioning the data objects into separate groups. If unspecified, all data points will be in a single group.

joinaggregate

array(JoinAggregateFieldDef)

The definition of the fields in the join aggregate, and what calculations to use.