Discovering Altair - interactive plots

Sep 5, 2021

Diving into data analysis subjects made me realize how little attention we pay to all kinds of bias in quantum chemistry research. Leaving a deeper exploration of this topic for the future, with this post, I would like to relate to the “cognitive bias,” which inevitably enters the data analysis and data presentation stages of the research process. Some of the questions I ponder on for quite some time are: how to present research results objectively, yet discuss them (necessarily) in the context of state-of-the-art knowledge, and how to document and discuss the data we collect, including the data that does not fit to the desired narrative and does not lead to anticipated conclusions.

The way I see it, such issues are - to some extent - “byproducts” of the format we most often use to communicate our research through publications in scientific journals. These typically impose a fixed form presentation and rarely focus exclusively on the data, which (if available at all) is usually buried in the supplementary information, lack good standardized descriptions (such as the “datasheets for datasets”) or are in formats that may be hard to reuse. All that does not invite to explore the data and does not make it easy for the readers to disengage from the author’s opinions. In quantum chemistry and other sciences where the presented data is “raw” or minimally processed, this may not be such a big issue as in social studies (as discussed in this work), yet I believe it is something to keep in mind.

Reading scientific publications requires a lot of expertise, critical thinking, and awareness that the author(s) of the publication had a specific objective to reach (e.g., publication in this journal) and a certain message to convey. This, I think, is one of the reasons for which entering a new research field can be so difficult - without the training and domain knowledge, one is more prone to be hooked by a catchy title or abstract and fall into a marketed story. Without the tools to actively read the research papers, it is also not easy to fully appreciate their contribution.

One way to reduce this bias and make the research more transparent and inviting is to create frameworks that allow the reader to easily interact with data. Such a framework could be considered a complement to the (then - more opinionated) scientific publication. It could also be an exciting option for the automatic summarization of research projects (as discussed recently), a way to ignite more scientific discussions and a more informed alternative to popular TOC abstracts, whose role is only to catch attention. In practice, this can be easily realized in computational notebooks, in environments offered by Jupyter, RStudio, Observable or many more, recently praised in an article in Nature.

Whether - as authors - we should strive for objectivity in the presentation of our research work is another question. Yet, I believe specifying and separating the roles and motivations for using various communication mediums - for instance, allowing a more subjective discussion in scientific publications while providing interactive platforms to enable others to draw their own conclusions - could set the expectations right. I also feel that this could trigger more empathetic, inclusive, and creative behaviors in research work - it would, for instance, address the fact that we all have different ways to learn, as the authors of the The “Paper” of the Future convey.

This is already a rather long introduction to a small technical post, yet I hope to have motivated the exercise I want to dive into. Specifically, I would like to show how to prepare a Jupyter dashboard that assembles few simple interactive plots designed with Altair. The data used in this example is taken from our recently submitted paper.

This post can also be viewed as a jupyter notebook - link, opened in jupyter lab and (for instance) rendered with Voila.

What & why:

The data describes the solvent effects on the NMR shielding constant of transition metal nuclei ($\sigma$) in multiple complexes, that were calculated with few models:

  • in full solvated complexes ($\sigma_{super}$)
  • in solute molecules with the solvent effects estimated with two variants of the Frozen Density Embedding (FDE) method:

    • canonical FDE with the density of the solvent kept fixed ($\sigma_{fde0}$)
    • FDE with the freeze-thaw procedure, in which the densities of the solute and solvent are relaxed in each other presence ($\sigma_{fdeN}$)
  • isolated solute molecules in their relaxed geometries - as if they were parts of a complex ($\sigma_{isol}$)
  • isolated solute molecules in their vacuum geometry ($\sigma_{vac}$)

$\sigma_{super}$ is the most accurate result within the quantum chemistry model used, while $\sigma_{vac}$ is the reference value (no solvent effects). $\sigma_{isol}$, $\sigma_{fde0}$, $\sigma_{fdeN}$ are approximations and should approach the $\sigma_{super}$ value.

Data description:

  • df_dirac and df_adf are dataframes that collect the data obtained from calculations with DIRAC (with the DC Hamiltonian) and ADF (with the SO-ZORA Hamiltonian). The data is read from from dirac_data.csv and adf_data.csv files.

  • some definitions used in this notebook:

    • $\delta$: useful to discuss solvent effects estimated by a given model:
      • $\delta1 \equiv \delta_{isol} = \sigma_{isol} - \sigma_{vac}$
      • $\delta2 \equiv \delta_{fde0} = \sigma_{fde0} - \sigma_{vac}$
      • $\delta3 \equiv \delta_{fdeN} = \sigma_{fdeN} - \sigma_{vac}$
      • $\delta4 \equiv \delta_{super} = \sigma_{super} - \sigma_{vac}$
    • $\Delta$: useful to discuss contributions to solvent effects:
      • $\Delta1 \equiv\Delta_{isol} = \sigma_{isol} - \sigma_{vac} = \delta_{isol}$
      • $\Delta2 \equiv\Delta_{fde0} = \sigma_{fde0} - \sigma_{isol}$
      • $\Delta3 \equiv\Delta_{fdeN} = \sigma_{fdeN} - \sigma_{fde0}$
      • $\Delta4 \equiv \Delta_{super} = \sigma_{super} - \sigma_{fdeN}$
import pandas as pd
import altair as alt
import altair_viewer
df1=pd.read_csv('dirac_data.csv')
df2=pd.read_csv('adf_data.csv')
def def_molprop():

    mol =  ['Ti',
            'V',
            'Cr',
            'Mn',
            'Fe',
            'Co',
            'Ni',
            'Cu',
            'Zn',
            'Nb',
            'Mo',
            'Tc',
            'Ru',
            'Pd',
            'Ag',
            'Cd',
            'Ta',
            'W',
            'Re',
            'Os',
            'Pt',
            'Hg'
    ]
 
    atomic_number_X={}
    atomic_number_X['Ti']=22
    atomic_number_X['V']=23
    atomic_number_X['Cr']=24
    atomic_number_X['Mn']=25
    atomic_number_X['Fe']=26
    atomic_number_X['Co']=27
    atomic_number_X['Ni']=28
    atomic_number_X['Cu']=29
    atomic_number_X['Zn']=30
    atomic_number_X['Nb']=41
    atomic_number_X['Mo']=42
    atomic_number_X['Tc']=43
    atomic_number_X['Ru']=44
    atomic_number_X['Pd']=46
    atomic_number_X['Ag']=47
    atomic_number_X['Cd']=48
    atomic_number_X['Ta']=73
    atomic_number_X['W']=74
    atomic_number_X['Re']=75
    atomic_number_X['Os']=76
    atomic_number_X['Pt']=78
    atomic_number_X['Hg']=80     
       
    solvent={} 
    solvent['Ti']='12TiCl₄'
    solvent['V']='6C₆H₆'
    solvent['Cr']='12H₂O'
    solvent['Mn']='12H₂O'
    solvent['Fe']='6C₆H₆'
    solvent['Co']='12H₂O'
    solvent['Ni']='6C₆H₆'
    solvent['Cu']='12CH₃CN'
    solvent['Zn']='12H₂O'
    solvent['Nb']='12CH₃CN'
    solvent['Mo']='12H₂O'
    solvent['Tc']='12H₂O'
    solvent['Ru']='12H₂O'
    solvent['Pd']='12H₂O'
    solvent['Ag']='12H₂O'
    solvent['Cd']='6Cd(CH₃)₂'
    solvent['Ta']='12CH₃CN'
    solvent['W']='12H₂O'
    solvent['Re']='12H₂O'
    solvent['Os']='12CCl₄'
    solvent['Pt']='12H₂O'
    solvent['Hg']='6Hg(CH₃)₂'
    
    cplx={} 
    cplx['Ti']='TiCl₄ + 12TiCl₄'
    cplx['V']='VOCl₃ + 6C₆H₆'
    cplx['Cr']='CrO₄²⁻ + 12H₂O'
    cplx['Mn']='MnO₄⁻ + 12H₂O'
    cplx['Fe']='Fe(CO)₅ + 6C₆H₆'
    cplx['Co']='Co(CN)₆³⁻ + 12H₂O'
    cplx['Ni']='Ni(CO)₄ + 6C₆H₆'
    cplx['Cu']='Cu(CH₃CN)₄⁺ + 12CH₃CN'
    cplx['Zn']='Zn(H₂O)₆²⁺ + 12H₂O'
    cplx['Nb']='NbCl₆⁻ + 12CH₃CN'
    cplx['Mo']='MoO₄²⁻ + 12H₂O'
    cplx['Tc']='TcO₄⁻ + 12H₂O'
    cplx['Ru']='Ru(CN)₆⁴⁻ + 12H₂O'
    cplx['Pd']='PdCl₆²⁻ + 12H₂O'
    cplx['Ag']='Ag(H₂O)₄⁺ + 12H₂O'
    cplx['Cd']='Cd(CH₃)₂ + 6Cd(CH₃)₂'
    cplx['Ta']='TaCl₆⁻ + 12CH₃CN'
    cplx['W']='WO₄²⁻ + 12H₂O'
    cplx['Re']='ReO₄⁻ + 12H₂O'
    cplx['Os']='OsO₄ + 12CCl₄'
    cplx['Pt']='PtCl₆²⁻ + 12H₂O'
    cplx['Hg']='Hg(CH₃)₂ + 6Hg(CH₃)₂'    

    charge={} 
    charge['Ti']=0
    charge['V']=0
    charge['Cr']=-2
    charge['Mn']=-1
    charge['Fe']=0
    charge['Co']=-3
    charge['Ni']=0
    charge['Cu']=1
    charge['Zn']=2
    charge['Nb']=-1
    charge['Mo']=-2
    charge['Tc']=-1
    charge['Ru']=-4
    charge['Pd']=-2
    charge['Ag']=1
    charge['Cd']=0
    charge['Ta']=-1
    charge['W']=-2
    charge['Re']=-1
    charge['Os']=0
    charge['Pt']=-2
    charge['Hg']=0
    
    row={} 
    row['Ti']=4
    row['V']=4
    row['Cr']=4
    row['Mn']=4
    row['Fe']=4
    row['Co']=4
    row['Ni']=4
    row['Cu']=4
    row['Zn']=4
    row['Nb']=5
    row['Mo']=5
    row['Tc']=5
    row['Ru']=5
    row['Pd']=5
    row['Ag']=5
    row['Cd']=5
    row['Ta']=6
    row['W']=6
    row['Re']=6
    row['Os']=6
    row['Pt']=6
    row['Hg']=6
      
    order_where=['DC','SO-ZORA']
    return atomic_number_X, cplx, solvent, charge, row


def prep_adf(df):
    
    an,cplx,solvent,charge,row=def_molprop()
   
    df_se = df[['mol']].copy()
    df_se['x_labels'] = df_se.mol
    df_se['where'] = 'SO-ZORA'
    df_se['solveff']  = df['supermolecule']-df['isolated_vac']
    df_se['row'] = df_se['mol'].map(row)
    df_se['charge'] = df_se['mol'].map(charge)
    df_se['solvent'] = df_se['mol'].map(solvent)
    df_se['complex'] = df_se['mol'].map(cplx)
    
    df_delta = df[['mol']].copy()
    df_delta['delta1']  = df['isolated_supergeom']-df['isolated_vac']
    df_delta['delta2']  = df['fde']-df['isolated_vac']
    df_delta['delta3']  = df['fnt']-df['isolated_vac']    
    df_delta['delta4']  = df['supermolecule']-df['isolated_vac']     
    df_delta=df_delta.drop(['mol'], axis=1)
    
    df_Delta = df[['mol']].copy()
    df_Delta['Delta1']  = df['isolated_supergeom']-df['isolated_vac']
    df_Delta['Delta2']  = df['fde']-df['isolated_supergeom']
    df_Delta['Delta3']  = df['fnt']-df['fde']    
    df_Delta['Delta4']  = df['supermolecule']-df['fnt']   
    df_Delta=df_Delta.drop(['mol'], axis=1)

    df_d = df[['mol']].copy() 
    df_d['d1']      = (df['isolated_supergeom']-df['supermolecule'])/df['isolated_vac']
    df_d['d2']      = (df['fde']-df['supermolecule'])/df['isolated_vac']
    df_d['d3']      = (df['fnt']-df['supermolecule'])/df['isolated_vac']
    df_d['d4']      = (df['supermolecule']-df['supermolecule'])/df['isolated_vac']
    df_d=df_d.drop(['mol'], axis=1)
    
    df_all = pd.concat([df_se, df_delta, df_Delta, df_d],axis=1,sort=False)
    
    return df_all, df_se, df_delta, df_Delta, df_d


def prep_dirac(df):
    
    an,cplx,solvent,charge,row=def_molprop()
 
    df_se = df[['mol']].copy()
    df_se['x_labels'] = df_se.mol
    df_se['where'] = 'DC'
    df_se['solveff']  = df['supermolecule']-df['isolated_vac']
    df_se['row'] = df_se['mol'].map(row)
    df_se['charge'] = df_se['mol'].map(charge)
    df_se['solvent'] = df_se['mol'].map(solvent)
    df_se['complex'] = df_se['mol'].map(cplx)

    df_delta = df[['mol']].copy()
    df_delta['delta1']  = df['isolated_supergeom_supergrid']-df['isolated_vac']
    df_delta['delta2']  = df['fde_vw11']-df['isolated_vac']
    df_delta['delta3']  = df['fnt_vw11']-df['isolated_vac'] 
    df_delta['delta4']  = df['supermolecule']-df['isolated_vac']     
    df_delta=df_delta.drop(['mol'], axis=1)
    
    df_Delta = df[['mol']].copy()
    df_Delta['Delta1']  = df['isolated_supergeom_supergrid']-df['isolated_vac']
    df_Delta['Delta2']  = df['fde_vw11']-df['isolated_supergeom_supergrid']
    df_Delta['Delta3']  = df['fnt_vw11']-df['fde_vw11']
    df_Delta['Delta4']  = df['supermolecule']-df['fnt_vw11'] 
    df_Delta=df_Delta.drop(['mol'], axis=1)

    df_d = df[['mol']].copy() 
    df_d['d1']      = (df['isolated_supergeom_supergrid']-df['supermolecule'])/df['isolated_vac']
    df_d['d2']      = (df['fde_vw11']-df['supermolecule'])/df['isolated_vac']
    df_d['d3']      = (df['fnt_vw11']-df['supermolecule'])/df['isolated_vac']
    df_d['d4']      = (df['supermolecule']-df['supermolecule'])/df['isolated_vac']
    df_d=df_d.drop(['mol'], axis=1)
    
    df_all = pd.concat([df_se, df_delta, df_Delta, df_d],axis=1,sort=False)
    
    return df_all, df_se, df_delta, df_Delta, df_d

df_adf_all, df_adf_se, df_adf_delta, df_adf_Delta, df_adf_d = prep_adf(df2)
df_adf_all_melted=df_adf_all.melt(id_vars =['mol','x_labels', 'where', 'row', 'charge', 'solvent', 'complex'])
df_adf_all_melted_delta=df_adf_all.melt(id_vars =['mol','x_labels', 'where', 'row', 'charge', 'solvent', 'complex',
                                                 'Delta1','Delta2', 'Delta3', 'Delta4',
                                                 'd1', 'd2', 'd3', 'd4',
                                                 'solveff'])

df_dirac_all, df_dirac_se, df_dirac_delta, df_dirac_Delta, df_dirac_d = prep_dirac(df1)
df_dirac_all_melted=df_dirac_all.melt(id_vars =['mol','x_labels', 'where', 'row', 'charge', 'solvent', 'complex'])
df_dirac_all_melted_delta=df_dirac_all.melt(id_vars =['mol','x_labels', 'where', 'row', 'charge', 'solvent', 'complex',
                                                 'Delta1','Delta2', 'Delta3', 'Delta4',
                                                 'd1', 'd2', 'd3', 'd4',
                                                 'solveff'])

Total solvent effects:

We measure solvent effects at a given approximation by $\delta$. If we want to focus on one approximation, for instance on $\delta_{super}$ (the total solvent effects), we can first show the dependence of $\delta_{super}$ on the row of the periodic table to which the nucleus of interest belongs, and on the solvent used.

Let’s try to apply some of the interactive Altair’s hover methods (full gallery) to the results from ADF.

def plot_chart(df, xData, yData, colorData, textData, yDataTitle):
    
    hover = alt.selection_single(on='mouseover', nearest=True, empty='none')

    bckcolor='lightgray'
    
    base = alt.Chart(df).encode(
        x=alt.X(xData,
                axis=alt.Axis(title="X",labelAngle=0)
                ),
        y=alt.Y(yData,
                axis=alt.Axis(title=yDataTitle, 
                              format=",.0f")
                ),
        color=alt.condition(hover,
                            colorData,
                            alt.value(bckcolor)
                           )
        ).properties(width=230,
                     height=190
                    )
    
    points = base.mark_point().add_selection(hover)

    text = base.mark_text(dy=-5).encode(text = textData,
                                        opacity = alt.condition(hover,
                                                                alt.value(1),
                                                                alt.value(0)
                                                                )
                                        )
    
    return points, text


def def_layer(points, text, xIndp, yIndp, plotSpacing, plotTitle):
    
    selected=alt.selection_multi(fields=['row'], bind='legend')
    
    layer=alt.layer(points, text).facet(facet=alt.Facet('row:N', title='row(X)'),
                                        spacing=plotSpacing,
                                        title=plotTitle
                                        )
    
    layer=layer.add_selection(selected).transform_filter(selected)
    
    if xIndp and yIndp:
        layer=layer.resolve_scale(x='independent', y='independent')
    elif xIndp:
        layer=layer.resolve_scale(x='independent')
    elif yIndp:
        layer=layer.resolve_scale(y='independent')
    
    return layer

In the plot below: we show the total solvent effects for all complexes; when we hover over the point, the color of that point indicates the solvent.

delta_points, delta_text=plot_chart(df_adf_all, 
                                    'mol:O', 
                                    'solveff', 
                                    'solvent:N', 
                                    'complex:N', 
                                    "δ(super) [ppm]")
                                          
layer=def_layer(delta_points, 
                delta_text,
                True, False,
                -30,
                'Solvent effects on σ(X) [ppm]')

layer

Let’s then try to show the data corresponding to all introduced approximations to the solvent effects (represented by all $\delta$s).

In the plot below, when we hover over the point, the color of the point indicates the solvent, and the text indicates the approximation to which the data point refers.

delta_points, delta_text=plot_chart(df_adf_all_melted_delta, 
                                    'mol:O', 
                                    'value', 
                                    'solvent:N', 
                                    'variable:N', 
                                    "δ(fdeN) [ppm]")
                                          
layer=def_layer(delta_points, 
                delta_text,
                True, False,
                10,
                'Approximations to solvent effects on σ(X) [ppm]')

layer

To zoom into the plots for each row, we may keep the y-axes of all plots independent:

delta_points, delta_text=plot_chart(df_adf_all_melted_delta, 
                                    'mol:O', 
                                    'value', 
                                    'solvent:N', 
                                    'variable:N', 
                                    "δ(fdeN) [ppm]")
                                          
layer=def_layer(delta_points, 
                delta_text,
                True, True,
                10,
                'Approximations to solvent effects on σ(X) [ppm] (note different y-scales)')

layer

To show the dependence of results on other parameters (e.g. charge of the complex), we can use the repeat chart method of Altair.

For that we need just a small adjustment of our plot_chart function:

def plot_chart2(df, colorData, rows, cols):
    
    hover = alt.selection_single(on='mouseover', nearest=True, empty='none')
    
    bckcolor='lightgray'
    
    base = alt.Chart(df).encode(
        
        x=alt.X(alt.repeat("column"),
                type='ordinal',
                ),
        y=alt.Y(alt.repeat("row"),
                type='ordinal',
                sort='descending',
                axis=alt.Axis(format=",.0f")
                ),
        color=alt.condition(hover,
                            colorData,
                            alt.value(bckcolor),
                            type='nominal',
                            scale=alt.Scale(scheme='dark2')
                           )
        ).properties(width=230,
                     height=190
                    )
    
    points = base.mark_point().encode().add_selection(hover).repeat(
        row=rows,
        column=cols)
    
    
    return points

delta_points=plot_chart2(df_adf_all,
                         'charge',
                         ['solveff', 'delta1', 'delta2', 'delta3', 'delta4'],
                         ['mol', 'solvent']
                        )
                                          
delta_points

There’s still a lot that could be done to improve the readibility of the plots, besides Altair has many other interactive options to offer in addition to simple hover. For instance, what could be done is to show the results from both programs on one plot or plot some more statistics. However, I will dive into this using larger data in the future.

I am still not entirely convinced of Altair (or maybe just got used to matplotlib+seaborn), but it has many exciting features that help to create interactive presentations. Also, perhaps the example above is not the best to advocate for the interactive plots. Since the data is small, these plots do not necessarily make up for better communication than the static graphs and tables. Yet, I hope it shows another possibility to present research results and that it brings some more fun into it.