This is the personal blog of Bastian Rieck. I add posts in irregular intervals about things I consider relevant. Send any feedback, comments etc. to bastian@annwfn.net.

In this post, I want to briefly explain how to combine methods from topological data analysis and machine learning. Given a Shakespearean social network, I want to predict whether it is likely to be a comedy or not.

Preliminaries & terms

Let’s first discuss some terms: a Shakespearean social network is a co-occurrence network generated from an annotated corpus of Shakespeare’s plays. The basic idea is to calculate a graph from the play in which characters that co-occur in a scene are connected by an edge. The edge weight roughly indicates how many lines of text a character has—the idea being that less important characters should be connected by edges with smaller weights.

I already wrote two posts about these networks, with the first one talking about the general ideas behind these networks and providing some basic statistics, and the second one talking about a more formal evaluation in the form of a short paper titled ‘Shall I compare thee to a network?’—Visualizing the Topological Structure of Shakespeare’s Plays.

The short paper focuses more on visualizing structural similarities between plays. Much has happened in the meantime, however. I started working on Aleph, a library for exploring persistent homology. Moreover, encouraged by previous publications (namely, my two papers about analysing clustering algorithms and dimensionality reduction methods), I started to delve a little bit deeper into machine learning.

In some sense, this post is a continuation of my work in that area.

Topological data analysis & persistent homology

Since I want to keep this post brief (I plan on writing up a more detailed explanation as a follow-up post), I merely focus on the basic idea behind my analysis. Using persistent homology, a method from topological data analysis, it is possible to measure the number of ‘holes’ in a network. Yes, you read that correctly. While many algorithms describe what is there, persistent homology focuses on what is missing. This paradigm is highly successful and has been applied in many different application domains so far—let me just mention Gunnar Carlsson’s paper Topological pattern recognition for point cloud data as one particularly striking example of this.

At the heart of persistent homology is the concept of a persistence diagram, a two-dimensional scatter plot that depicts scale information about each hole. For every dimension of an input data set, we have a separate persistence diagram. These diagrams are commonly taken as a feature descriptor that summarizes a data set. Their appeal lies in certain structural properties such as robustness. Furthermore, one can calculate the Wasserstein distance between two such diagrams, which makes it possible to compare two extremely complex objects, shapes, or spaces with a rather simple algorithm!
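
To give an impression of what such a computation involves, here is a minimal sketch of the p-Wasserstein distance between two toy persistence diagrams, assuming numpy and scipy are available and each diagram is given as a list of (birth, death) pairs. Leftover points may be matched to the diagonal, which is handled by the usual augmentation trick. This is only meant to illustrate the idea; it is not the implementation I use in my experiments.

import numpy as np
from scipy.optimize import linear_sum_assignment

def wasserstein( D1, D2, p=1 ):
    # p-Wasserstein distance between two persistence diagrams. Each point may
    # also be matched to the diagonal; this is handled by augmenting each
    # diagram with the diagonal projections of the other one.
    D1, D2 = np.atleast_2d(D1), np.atleast_2d(D2)
    diag   = lambda D: np.tile( D.mean(axis=1, keepdims=True), (1,2) )

    A = np.vstack( [D1, diag(D2)] )   # rows:    points of D1 plus diagonal copies
    B = np.vstack( [D2, diag(D1)] )   # columns: points of D2 plus diagonal copies

    C = np.max( np.abs( A[:,None,:] - B[None,:,:] ), axis=-1 ) ** p
    C[len(D1):, len(D2):] = 0.0       # matching two diagonal copies is free

    row, col = linear_sum_assignment(C)
    return C[row, col].sum() ** (1.0/p)

# Two toy diagrams; the second one lacks the low-persistence point (0.4, 0.5)
print( wasserstein( [(0.0, 1.0), (0.4, 0.5)], [(0.1, 0.9)] ) )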

Distances & kernels

In the aforementioned short paper, I used the Wasserstein distance between persistence diagrams in order to obtain a low-dimensional embedding of Shakespeare’s plays. Proximity in this embedding means that the co-occurrence networks of two plays have a similar structure. One of the results of the short paper was that comedies such as A Midsummer Night’s Dream appear to form clusters, while tragedies such as Hamlet appear to be spread out more.

It is thus a natural question to ask whether it is possible to decide if a random Shakespeare play, described as a co-occurrence network, is a comedy or not. This seems like a perfect job for machine learning! Based on the successful use of persistent homology in other contexts, it seemed natural to me to use it in order to assess the dissimilarity between the networks. There are other options out there for doing this—and it would be an interesting side-project to compare all of them—but seeing that most of my research centres on persistent homology one way or another, I will stick with it.

In order to unleash the power of machine learning on these unsuspecting networks, we require the definition of a suitable kernel. A kernel function permits us to use supervised learning algorithms without ever having to specify coordinates in the feature space. Given the fact that the feature space is extremely hard to describe—notwithstanding current efforts to introduce vector-based representations of persistence diagrams, such as persistence images—kernel methods are a blessing. Thanks to scikit-learn, they can be used quite easily.

There are multiple options for calculating kernels. Here, I experimented a little bit with a simple Wasserstein kernel (obtained from negating the Wasserstein distance values) as well as persistence indicator functions (of which I hope to write more in a subsequent post).
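
Once such a dissimilarity matrix is available, the machine learning part becomes a matter of a few lines. The following sketch uses random stand-in data; in the actual experiment, the distance matrix consists of the pairwise Wasserstein distances between the persistence diagrams of the plays, and the labels indicate whether a play is a comedy.

import numpy as np
from sklearn.svm import SVC

# Stand-in data: a symmetric matrix of pairwise distances between 37 'plays'
# and a label for each play (1 = comedy, 0 = everything else). In the real
# experiment, the distances come from the persistence diagrams.
rng       = np.random.default_rng(42)
points    = rng.normal( size=(37, 3) )
distances = np.linalg.norm( points[:,None] - points[None,:], axis=-1 )
labels    = np.array( [1] * 17 + [0] * 20 )

# A simple 'Wasserstein kernel': negate the distances and hand the matrix to
# a support vector machine as a precomputed kernel.
kernel = -distances

clf = SVC( kernel='precomputed' )
clf.fit( kernel, labels )
print( clf.predict( kernel[:3] ) )   # predictions for the first three plays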

I am leaving out the generation process of the kernel matrices for now, because I plan on describing them in a subsequent blog post: my library for calculating persistent homology is still subject to change and I do not want to prematurely write about APIs that are not yet ready.

Experiments & results

After generating the kernel matrices with Aleph, we can use support vector machines to test how easily we may detect comedies. Some caution is necessary, though: there are only 37 plays, of which 17 are comedies. A standard splitting of the data set into training and test data is thus not useful here because there are not enough samples. I thus opted for cross-validation techniques, more precisely leave-p-out cross-validation, with p up to 3.
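
A sketch of this evaluation, again with random stand-in data, might look as follows. Note that leave-p-out quickly becomes expensive: for p = 3 and 37 plays, there are already 7,770 splits to evaluate.

import numpy as np
from sklearn.model_selection import LeavePOut, cross_val_score
from sklearn.svm import SVC

# Stand-in data: 37 'plays', 17 of which are comedies, and a toy precomputed
# kernel (a linear kernel on random features). In the real experiment, the
# kernel matrix is derived from the persistence diagrams.
rng      = np.random.default_rng(0)
features = rng.normal( size=(37, 5) )
kernel   = features @ features.T
labels   = np.array( [1] * 17 + [0] * 20 )

clf    = SVC( kernel='precomputed' )
cv     = LeavePOut(p=2)   # p = 3 already yields 7,770 splits for 37 samples
scores = cross_val_score( clf, kernel, labels, cv=cv, scoring='accuracy' )
print( 'Mean accuracy: %.2f' % scores.mean() )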

The results are as follows:

Kernel                                  Accuracy    Average ROC AUC
Persistence indicator functions, d=1    0.86        0.91
Persistence indicator functions, d=2    0.73        0.80
Persistence indicator functions, d=3    0.66        0.74
Wasserstein, d=1                        0.86        0.95
Wasserstein, d=2                        0.81        0.94
Wasserstein, d=3                        0.81        0.93

The entries in the third column refer to the area under the curve of the receiver operating characteristic and give an idea of how the classifier will perform in general. These values are to be taken with a grain of salt, however, because there are few samples of both classes. I selected accuracy as the performance measure because there are only two classes and they are not highly skewed. Given that the probability of a random play being a comedy is only about 0.46 (17 of the 37 plays), it is nice to see that we are able to detect comedies rather well.

Nonetheless, the values in the table are somewhat surprising. First of all, from personal experience I would have expected a slightly better performance from the second Wasserstein distance—it is the one that is commonly used because its local costs resemble the basic Euclidean distance. This has the nice property of suppressing topological features with low persistence values. Since the first Wasserstein distance, which weights all features equally, performs at least as well here, it appears that precisely these low-persistence features are useful for classification purposes. Paul Bendich suspected something similar in his paper Persistent Homology Analysis of Brain Artery Trees:

That said, there is no guarantee of persistence being correlated with importance, just with reliability. Indeed, one of the findings in Section 5 is that dots of not-particularly-high persistence have the most distinguishing power in our specific application.

What now?

This was/is a fun project for me as it combines several interesting aspects. I look forward to trying out more kernels, such as the one proposed by Reininghaus et al. in A Stable Multi-Scale Kernel for Topological Machine Learning.

One of the results of this paper was that the Wasserstein distance is theoretically unsuitable as a kernel function. This is somewhat unfortunate because it still appears to work. I sincerely hope that the persistence indicator function proves to be a viable candidate for kernel methods because it can be calculated very quickly and, in contrast to persistence diagrams, permits the definition of a unique mean function.

As soon as time permits, I shall write a more detailed article about the technical details of this experiment. Please take a look at the GitHub repository that I created for this article. Here, you will find the raw data along with some scripts to reproduce the values reported in this blog post.

Have fun and feel free to drop me a line if you are interested in these (or related) methods. I will be more than happy to have an interesting discussion.

Posted Tuesday evening, May 16th, 2017 Tags:

Scatterplot matrices are one of the most practical direct visualizations of unstructured data sets—provided that the dimensionality does not become too large. Despite their prominence in the visualization community, there are few non-commercial tools out there that are able to produce publication-ready scatterplot matrices.

In this post, I want to briefly explain how to use a recent version of gnuplot, one of my favourite plotting tools, to create something that resembles a scatterplot matrix. In this example, I will use the famous Iris data set, which I am sure everyone is familiar with by now.

I prepared a slightly modified version of the data set, which makes processing it with gnuplot easier. To this end, I shortened the labels and inserted newlines after each block of identical labels. This makes using colours much easier.

The basic idea of the plot is to use the multiplot mode of gnuplot and create all plots in a nested loop. The diagonal elements are filled with a histogram of the corresponding attribute—which is of course also created using gnuplot. The only thing I am not proud of is the use of labels. I am using the x2label feature in order to put axis labels on top of a plot, but as gnuplot does not (yet) support a simpler mechanism for placing labels in multiple plots, I had to resort to a truly horrid way of getting this to work…

Without further ado, here’s the code:

And this is what the output looks like:

Scatterplot matrix of the Iris data set

You can find the code for this script as a gist on GitHub.

Posted late Wednesday evening, April 5th, 2017 Tags:

Most of the time, PhD comics hits too close to home. Recently, they ran a strip showing the temporal evolution of a typical thesis document. As expected, the comic proves that efforts are “doubly redoubled” when the deadline approaches. Naturally, I wondered how my own thesis—still under review—fares in that regard.

I keep my thesis under version control because I am a paranoid and cranky person when it comes to data security and backups. Consequently, I required a script for reconstructing the amount of changes I made to the thesis sources over time. Here’s what I came up with:

If I let this script run, with the output being stored in a file somewhere, I can use the venerable gnuplot to create a nice graph. This is what it looks like for my dissertation:

Word count in my thesis over time

Note that the script is unable to count “proper” words for now. Hence, changes in the LaTeX markup will also result in increasing the total number of words. I guess the slope of my own efforts is not too similar to the comic. At the same time, you can see that most of the content of the thesis was added in a comparatively short amount of time.
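
In essence, the idea boils down to the following rough sketch (not the exact script from the gist): walk over all commits in chronological order and count the whitespace-separated tokens of all .tex files at each commit.

#!/usr/bin/env python3

import subprocess

def git( *args ):
    return subprocess.run( ['git'] + list(args),
                           capture_output=True, text=True, check=True ).stdout

# All commits from oldest to newest, along with their author date
log = git( 'log', '--reverse', '--format=%H %ad', '--date=short' )

for line in log.splitlines():
    commit, date = line.split( maxsplit=1 )
    words        = 0

    for name in git( 'ls-tree', '-r', '--name-only', commit ).splitlines():
        if name.endswith( '.tex' ):
            # Crude word count: whitespace-separated tokens, markup included
            words += len( git( 'show', '%s:%s' % (commit, name) ).split() )

    print( date, words )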

Feel free to apply this to your own documents. The code is available as a gist on GitHub. This is my first real gist—thanks to Conrad for the suggestion!

Posted late Thursday evening, February 9th, 2017 Tags:

I recently had to convert several structured grids in VTK format to images and coordinate lists for further processing. Since I could not find any other code for doing this, here is my quick and dirty approach:

#!/usr/bin/env python3

from PIL import Image

import numpy
import os
import re
import sys

reDimensions = re.compile( r'DIMENSIONS\s+(\d+)\s+(\d+)\s+(\d+)' )
reSpacing    = re.compile( r'SPACING\s+(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)' )
rePointData  = re.compile( r'POINT_DATA\s+(\d+)' )
reTableData  = re.compile( r'(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)' )

for filename in sys.argv[1:]:
    with open( filename ) as f:
        width  = 0
        height = 0
        depth  = 0

        data = None # Contains matrix data
        ii   = 0    # Current data index for filling matrix

        # Stores only those coordinates (given as an x and y position)
        # that are nonzero. This permits to match adjacent time steps
        # in a very concise and sparse manner.
        nonZero = list()

        for line in f:
            if reDimensions.match(line):
                result = reDimensions.match(line)
                width  = int( result.group(1) )
                height = int( result.group(2) )
                depth  = int( result.group(3) )

                assert width * height * depth > 0, "Invalid dimensions"

                data = numpy.zeros( (height,width) )
            elif reSpacing.match(line):
                result = reSpacing.match(line)
                sx     = result.group(1)
                sy     = result.group(2)
                sz     = result.group(3)

                # TODO: I am ignoring the spacing, hence I am only
                # able to load uniform rectilinear grids.
                assert sx == sy and sy == sz, "Invalid spacing"
            elif rePointData.match(line):
                result = rePointData.match(line)
                num    = int( result.group(1) )

                assert num == width * height * depth, "Inconsistent dimensions"
            elif reTableData.match(line):
                result = reTableData.match(line) 
                d      = (u,v,w) = (
                                    float( result.group(1) ),
                                    float( result.group(2) ),
                                    float( result.group(3) )
                                   )

                for k in range(0,3):
                    ix     = ii % width
                    iy     = ii // width 
                    ii     = ii + 1

                    if d[k] > 0:
                        nonZero.append( (ix,iy) )

                    data[iy][ix] = d[k]

    # TODO: `data` is now a matrix corresponding to the dimensions of
    # the original grid. Do something with it...


    #
    # Store non-zero coordinates
    #

    outputCoordinates = "/tmp/"\
             + os.path.splitext( os.path.basename(filename) )[0]\
             + ".txt"

    with open(outputCoordinates, "w") as g:
        for (x,y) in nonZero:
            print("%d\t%d" % (x,y), file=g)

    #
    # Store image
    #

    outputImage = "/tmp/"\
             + os.path.splitext( os.path.basename(filename) )[0]\
             + ".png"

    # Scale the data to [0,255] and store it as a greyscale image (the
    # normalisation is a simple choice made here for completeness)
    image = Image.fromarray( numpy.uint8( data / max( data.max(), 1e-8 ) * 255 ) )
    image.save( outputImage )

The code from above iterates over a list of VTK files (assumed to be in ASCII format) that contain POINT_DATA, i.e. a scalar value for each grid cell. Provided that the grid follows uniform spacing, the code is capable of parsing it and storing the POINT_DATA values in a numpy matrix. Moreover, the list of non-zero coordinates of the matrix is written to /tmp, along with a pixel-based image of the grid. Both outputs are highly useful for debugging and understanding the input.

Feel free to do more with this code—I am hereby releasing it into the public domain.

Posted Wednesday afternoon, February 8th, 2017 Tags:

In a previous article, I have already talked about the motivation behind analysing social networks of Shakespeare’s plays. Since then, I have pursued this subject somewhat further. In particular, I took some time to formulate a more methodical analysis approach.

The result is a nice short paper, titled ‘Shall I compare thee to a network?’—Visualizing the Topological Structure of Shakespeare’s Plays. I was very honoured to present this at the 2016 Workshop on Visualization for the Digital Humanities at IEEE Vis 2016. It was a great venue with lots of interesting papers and I urge you to take a look at the list of submissions.

Our paper follows the basic strategy outlined in the previous blog post. Hence, we first extract a co-occurrence network between the characters of a play. There are numerous ways of doing that, and each of them influences the structures that may be detected in a subsequent analysis step. In the paper, I present two different weight schemes. One is based on the amount of speech uttered by each character, the other one measures the point in time at which characters start to interact with each other.

The co-occurrence network is then subjected to a topological analysis based on persistent homology. Basically, we consider the network to grow in accordance with the weights. This permits us to treat it as a graph-based filtration, which in turn means that we may calculate persistence diagrams and, from that, derive a measure of dissimilarity. The interesting thing here is that this approach results in well-defined distances between individual plays—meaning that we can measure structural differences and structural similarities.
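
To illustrate the filtration idea, here is a small sketch that computes only the zero-dimensional part of persistent homology for a weighted graph: all vertices are present from the start, edges enter in order of increasing weight, and a connected component dies as soon as it merges with another one. The character names and weights are made up, and the actual analysis of course uses more than just this zero-dimensional information.

def find( parent, u ):
    # Union-find with path halving
    while parent[u] != u:
        parent[u] = parent[ parent[u] ]
        u         = parent[u]
    return u

def zero_dimensional_persistence( vertices, weighted_edges ):
    parent = { v: v for v in vertices }
    pairs  = []

    for w, u, v in sorted( weighted_edges ):
        ru, rv = find( parent, u ), find( parent, v )
        if ru != rv:
            parent[ru] = rv
            pairs.append( (0.0, w) )      # a component born at 0 dies at weight w

    pairs.append( (0.0, float('inf')) )   # one component never dies
    return pairs

edges = [ (1.0, 'Macbeth', 'Banquo'),
          (2.0, 'Macbeth', 'Lady Macbeth'),
          (3.0, 'Banquo',  'Lady Macbeth') ]

print( zero_dimensional_persistence( ['Macbeth', 'Banquo', 'Lady Macbeth'], edges ) )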

The paper contains some results that appear to indicate the utility of this approach. Without further training, our algorithm is capable of figuring out that All’s Well That Ends Well is a comedy that is structurally extremely dissimilar to all remaining comedies. This is interesting, insofar as this play is considered to be one of Shakespeare’s problem plays.

In the paper, we describe more results of the same ilk. I have made the networks available on GitHub, where I also plan on uploading code for calculating the dissimilarity measure.

Posted late Monday evening, November 7th, 2016 Tags:

Debugging is one of the primary skills every computer science student should acquire. I want to provide an easy introduction to this topic, starring C++ and gdb, the debugger of the GNU project. In the following, I assume that the reader is somewhat familiar with the basic concepts of C++ programming.

A simple example

Let’s assume that we want to transform a string into a palindrome and write the following program to do so:

#include <iostream>
#include <string>

std::string palindrome( std::string s )
{
  for( auto i = s.size() - 1; i >= 0; i-- )
  {
    if( s[i] != s[s.size() - i - 1] )
      s[s.size() - i - 1] = s[i];
  }

  return s;
}

int main()
{
  std::cout << palindrome( "Hello, World!" ) << std::endl;
}

We compile the program using

$ g++ -std=c++11 foo.cc

and run ./a.out just to find out that the program crashes with a segmentation fault. Bummer. Let’s recompile the program and prepare it for debugging. To this end, we just add the -g switch on the command-line for the compiler. This ensures that debug symbols are included. Loosely speaking, these symbols are required for the debugger to connect the binary executable with our source code.

So, we issue

$ g++ -std=c++11 -g foo.cc

and run gdb with our executable as an argument:

$ gdb a.out

This should result in output that is somewhat similar to this:

GNU gdb (GDB) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
---Type <return> to continue, or q <return> to quit---
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
---Type <return> to continue, or q <return> to quit---
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from a.out...done.
(gdb)

The most important thing is that gdb greets us with its own prompt, meaning that from now on, we interact with the debugger instead of running the program directly.

We first run the binary to find out what happens:

(gdb) run

This should yield an output similar to

Program received signal SIGSEGV, Segmentation fault.
0x0000000000400bd4 in palindrome (s="!dlroW World!") at foo.cc:8
8       if( s[i] != s[s.size() - i - 1] )

and send us right back to the prompt. We can see that our program caused a segmentation fault. This error usually indicates that we accessed memory that does not belong to our program somewhere along the way.

How can we find out more about what went wrong? We first take a look at the backtrace of the error. Roughly speaking, these are all the functions that are currently active—currently meaning at the time of the error.

To find out more about the backtrace, we type bt or backtrace:

(gdb) bt

This gives us lines like these:

#0  0x0000000000400bd4 in palindrome (s="!dlroW World!") at foo.cc:8
#1  0x0000000000400c65 in main () at foo.cc:17

Not too surprisingly, the last entry, i.e. frame #1, is the main function of our program. From there, our palindrome function was called. Each of the lines in this list is referred to as a frame. To see a specific frame—i.e. a certain function along with its code—we type frame NUMBER or f NUMBER.

Let us first take a look at the main function:

(gdb) frame 1

This gives us some code:

#1  0x0000000000400c65 in main () at foo.cc:17
17    std::cout << palindrome( "Hello, World!" ) << std::endl;

We can see the line number (17) at which the offending call occurs. Furthermore, we see the next function call, viz. palindrome, which in turn causes the segmentation fault.

Since the line looks relatively safe, there must be an issue with the variables that are used to access the characters of the string. So, let’s get back to the first frame:

(gdb) frame 0
#0  0x0000000000400bd4 in palindrome (s="!dlroW World!") at foo.cc:8
8       if( s[i] != s[s.size() - i - 1] )

Again, the if condition seems relatively safe. To cause a segmentation fault, we must access some bad memory somewhere. Thus, let us take a look at the values of the variables at this point. To this end, we can use the print command or p for short.

Let us begin with the string itself:

(gdb) print s
$1 = "!dlroW World!"

This seems to be correct so far. There is certainly nothing wrong here. So let us try the loop variable next:

(gdb) print i
$2 = 18446744073709549580

Alright. I am fairly confident that our string is not that long. So, we have identified the culprit—something is going wrong with the loop counter. At this point, we could use the edit command to open the file in our favourite editor and add some std::cout statements.

Yes, I am serious. std::cout can be a very powerful debugging tool! As long as we give some serious thought to identifying potential places in our code that may cause an error, there is nothing wrong with sprinkling some printing statements over it. Of course we could also set breakpoints and watch how the value of i changes over the course of the program. But this is more advanced stuff, so I will rather cover it in another article.

So, let us assume that we have added enough std::cout statements. Running the program with these statements quickly shows that we are traversing the string incorrectly. As s.size() returns an unsigned type, auto deduces i to be unsigned as well, so the condition i >= 0 always evaluates to true. Instead of becoming negative, the counter wraps around to a huge value, which explains the absurd number we saw in the debugger. By traversing the string in a forward manner, e.g.

for( std::size_t i = 0; i < s.size(); i++ )

we finally get our program to work, resulting in the beautiful output of

Hello, ,olleH

Fun with make

If you have a Makefile—or if you are using CMake, for example—the debugger offers you a very cool integrated workflow. You can directly use edit to work on the code. Afterwards, you can type make to recompile the program, followed by run to check its behaviour again. For easily-fixable errors, this means that you never even have to leave the debugger. More about this in a subsequent article.

Conclusion

In the next article, we shall see how to combine debugging with make. Also, I want to show you the wonderful world of breakpoints. Until that time, here are the main lessons of debugging:

  • Try to figure out what causes the error
  • Always assume that your code is at fault
  • Do not be afraid of using std::cout or printf or whatever for debugging. Whatever works.
  • Always compile with extra warnings enabled. By compiling with -Wall -Wextra, the compiler already helps you catch a lot of potential issues. In our case, for example, the compiler would have complained with comparison of unsigned expression >= 0 is always true, which would have already caught the error before debugging.

Happy bug-fixing!

Posted late Wednesday evening, September 28th, 2016 Tags:

I recently had the pleasure of attending a series of talks about the style of Shakespeare, given by Prof. Dr. Beatrix Busse and her inspired staff. One of the talks, given by Ingo Kleiber, struck a nerve in me. He talked about creating social networks from Shakespeare’s plays and analysing them in the way we analyse, well, “real” social networks.

Naturally, I needed to see for myself how fun this would be. In the subsequent sections, I am going to describe some initial considerations. I plan on neglecting my dissertation for a little bit to present a more concise write-up later on.

Where to get some data?

Luckily for us, many people have already invested large chunks of their time to make Shakespeare’s plays accessible and readable for computers. There is a whole corpus of 37 plays, in a computer-readable format, available for download. Hence, no need to do any natural language processing for now.

What to do with it?

Some gentle prodding with Python makes it possible to extract a co-occurrence network from the play. How this extraction is done is somewhat subjective. In my initial attempt, following the analysis made by Ingo, I broke down the plays into individual scenes. After identifying the characters in every scene, I opted for adding an edge between two characters if they appear in the same scene. Of course, this is somewhat rough—we may never know whether they actually talk to each other. Also, I did not yet take stage directions into account. It is thus well possible that I add an edge between two characters that are never actually on the stage at the same time. This obviously needs to be improved for any “serious” analysis—from which I am very far removed anyway, by virtue of being a barbarian in the ways of linguistics!
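
To make this extraction step a little more concrete, here is a small sketch with made-up scene data: every pair of characters that shares a scene receives an edge, and the number of shared scenes becomes the edge weight.

from collections import Counter
from itertools import combinations

# Made-up scene data: each scene is simply the set of characters in it
scenes = [ {'Macbeth', 'Banquo', 'First Witch'},
           {'Macbeth', 'Lady Macbeth'},
           {'Macbeth', 'Banquo'} ]

weights = Counter()
for scene in scenes:
    for u, v in combinations( sorted(scene), 2 ):
        weights[ (u,v) ] += 1

for (u,v), w in sorted( weights.items() ):
    print( '%s -- %s: %d' % (u,v,w) )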

So, having extracted all those edges, I decided on building an adjacency matrix where edge weights indicate how often characters are part of the same scene. This permits me to draw really nice network diagrams, such as this one here (which comes from Macbeth, one of my favourite plays):

Macbeth co-occurrence network

Nodes have been scaled and coloured according to their degree, just like the edges. The darker the colour, the more important the character. Not too surprisingly, Macbeth himself appears to be important to this play. Who could have anticipated this? The graph layout algorithm—a simple force-directed scheme—also serves to separate the play quite well. The parts with the “Weird Sisters” are clearly shown to be somewhat different from the rest of the action.

Density

While this looks nice and all, it is not exactly informative. We need to compare more plays with and among each other! To do just that, I calculated the graph density—the ratio between the actual number of edges and the total number of possible edges, i.e. 2E / (V(V - 1)) for an undirected graph—for every play. The graph density measures how, well, dense the graph is in terms of edges. In other words, the more the graph resembles a complete graph, the higher its density will be. In theatrical terms, I imagine that a dense co-occurrence network indicates many plot lines that are interwoven and converge later on, prompting literally every character to be connected to the other characters.
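
As a small example of how easily this quantity can be computed, assuming networkx is available:

import networkx as nx

# A made-up miniature network with four characters
G = nx.Graph()
G.add_edges_from( [ ('Macbeth', 'Lady Macbeth'),
                    ('Macbeth', 'Banquo'),
                    ('Macbeth', 'Macduff'),
                    ('Banquo',  'Macduff') ] )

# 4 edges out of 6 possible ones for 4 nodes, hence a density of 2/3
print( nx.density(G) )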

To perform some exploratory data analysis on Shakespeare’s plays, let us take all plays, tag them according to three types (comedy, tragedy, and history), which the corpus fortunately provides on its own, and plot graph density against the number of characters. This is what it looks like:

Graph density vs. number of characters in Shakespearean plays

Finally, some nice patterns! Notice that all the historical plays—such as Henry V—are characterized by having a rather low density (less than 0.4) but, for the most part, a large number of characters. The comedies, by contrast, have a smaller number of characters and a substantially higher density. The most dense play, in this representation, is Love’s Labour’s Lost, one of Shakespeare’s early comedies. The most sparse one is Pericles, Prince of Tyre, where the authorship of Shakespeare is somewhat unclear.

What to make of this?

It is certainly interesting to see that some patterns start to emerge when one uses statistical or data analysis techniques to analyse a whole corpus of works. There are lots of interesting applications here. I did not include any dates, for example. How do the works change over time? Are plays of later Shakespearean periods less dense or more dense?

I am very much looking forward to doing more with these data.

Since I have not released any code so far, please take a look at Ingo’s repository, where he thankfully provides his sample code to create social networks from the plays.

Let the age of Bard Data begin!

Posted late Sunday evening, June 19th, 2016 Tags:

I am using the fine glossaries package to typeset glossaries and acronyms in my dissertation.

While the default settings are not bad, I like to do some custom adjustments to get the glossaries printed just right.

Capitalized descriptions in the glossary

I want entries in the glossary to be capitalized—regardless of the original capitalization. Hence, in the text I prefer minimum spanning tree (MST), but in the glossary, I want it to be written as Minimum spanning tree. The easiest way to get the package to comply is to define a new glossary style. Since I like the long glossary style, I can inherit all features from it and only need to change a small part:

\newglossarystyle{long-br}{%
  \setglossarystyle{long}

  \renewcommand{\glossentry}[2]{%
    \glsentryitem{##1}\glstarget{##1}{\glossentryname{##1}} &
    \Glossentrydesc{##1}\glspostdescription\space ##2\tabularnewline
  }%
}

\setglossarystyle{long-br}

The salient change is the usage of \Glossentrydesc instead of \glossentrydesc, which ensures that the description is printed capitalized.

Non-breaking spaces in the long-short form of an acronym

When glossaries typesets an acronym, I prefer the long-short form. For example, the first use of an acronym should look like minimum spanning tree (MST). Unfortunately, the space between the long form and the short form permits a line-break. This is of course unacceptable and needs to be rectified. The easiest way to solve this is to use

\renewcommand{\acrfullformat}[2]{#1~(#2)}   

in the preamble of the document. This does not work when you use a custom style, though. See below if you want to copy my style.

Printing acronyms always upright in the long-short form

When the long-short form of an acronym is used, I prefer any text decorations to apply only to the long description, but not to the acronym. Consequently, if the long form minimum spanning tree is typeset in italics, I want the subsequent ‘(MST)’ to remain upright instead of being italicized as well. To achieve this, we need to define our own style. Again, I like the long-short style, so I am going to base the custom style upon it:

\newacronymstyle{long-short-br}
{%
  \GlsUseAcrEntryDispStyle{long-short}%
}%
{%
  \GlsUseAcrStyleDefs{long-short}%  
  \renewcommand*{\genacrfullformat}[2]{%
    \glsentrylong{##1}##2~\textup{(\firstacronymfont{\glsentryshort{##1}})}%
  }%
  \renewcommand*{\Genacrfullformat}[2]{%
    \Glsentrylong{##1}##2~\textup{(\firstacronymfont{\glsentryshort{##1}})}%
  }%
  \renewcommand*{\genplacrfullformat}[2]{%
    \glsentrylongpl{##1}##2~\textup{(\firstacronymfont{\glsentryshortpl{##1}})}%
  }%
  \renewcommand*{\Genplacrfullformat}[2]{%
    \Glsentrylongpl{##1}##2~\textup{(\firstacronymfont{\Glsentryshortpl{##1}})}%
  }%
}

\setacronymstyle{long-short-br}

Note that this style also uses a non-breaking space ~ to connect the long form to the short form. If you do not want this, just use \space instead.

That’s it for now—as soon as I have found out more ways to procrastinate when using LaTeX, there will be a new post.

Posted Tuesday evening, May 31st, 2016 Tags:

Time for another tale from the trenches, better known as the course on C++ programming taught by my colleague Filip Sadlo. This time, the problem is very subtle and involves unsequenced operations. If you are not familiar with this term—because you have been working with nicer programming languages than C++ so far—let me briefly explain: A sequence point is a point in a program at which all side effects of previous evaluations are guaranteed to be complete. It thus delimits the smallest self-contained units of code that may be executed. If there is no sequence point between two sub-expressions, the order in which they are evaluated, and in which their side effects take place, is not necessarily specified.

At this point you may ask yourself: Why should I care? Well, the point is that in most cases, you do not need to care. However, there are a few cases where it makes sense to think about sequencing—especially when the same variable appears multiple times in an expression.

A classical example, which I have to admit also used to be part of our codebase, is

i = i++

for which the final value of i is ambiguous because the increment is not guaranteed to be executed before the assignment. In essence, the problem is that both operations modify i as a side-effect but the assignment operator does not define a particular execution order.

A similar example, which was part of an examination, is

--x == x--

which also combines multiple side-effects concerning the same variable.

So, what can a poor compiler do in these cases? Nothing much, it turns out. In fact, this is undefined behaviour territory. Just for the fun of it, let us take a look at how different compilers handle this. I will use the following code:

#include <iostream>

int main()
{
  int x = 0;

  std::cout << ( --x == x-- ) << std::endl;
  std::cout << ( x-- == --x ) << std::endl;
}

First, gcc is on.

$ g++ --version
g++ (GCC) 5.3.0
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ g++ -std=c++11 unsequenced.cc; ./a.out
0
0

No warnings? Shame on you. You can surely do better!

$ g++ -std=c++11 -Wall unsequenced.cc; ./a.out
unsequenced.cc: In function ‘int main()’:
unsequenced.cc:7:43: warning: operation on ‘x’ may be undefined [-Wsequence-point]
   std::cout << ( --x == x-- ) << std::endl;
                                           ^
unsequenced.cc:8:43: warning: operation on ‘x’ may be undefined [-Wsequence-point]
   std::cout << ( x-- == --x ) << std::endl;

0
0

Now on to clang!

$ clang++ --version
clang version 3.7.1 (tags/RELEASE_371/final)
Target: x86_64-unknown-linux-gnu
Thread model: posix
$ clang++ -std=c++11 unsequenced.cc; ./a.out      
unsequenced.cc:7:18: warning: multiple unsequenced modifications to 'x'
      [-Wunsequenced]
  std::cout << ( --x == x-- ) << std::endl;
                 ^       ~~
unsequenced.cc:8:19: warning: multiple unsequenced modifications to 'x'
      [-Wunsequenced]
  std::cout << ( x-- == --x ) << std::endl;
                  ^     ~~
2 warnings generated.
1
0

That’s more like it!

So, from the output we can see that anything may happen when we journey into undefined behaviour territory. It is important to be at least aware of these pitfalls—whenever you use some variable multiple times in an expression, I would recommend briefly reflecting on whether some unsequenced modifications may have slipped in.

By the way: As the output of gcc shows, it is almost always a very good idea to compile with a good set of warnings enabled. At the very least, you should use -Wall, which—contrary to its name—does not activate all warnings. I guess naming the switch -Wsome-but-not-all-warnings would have been too silly…

Posted Monday night, April 11th, 2016 Tags:

I am a fan of higher-order functions such as filter and map. To me, they are much more elegant and concise than the corresponding constructions with for-loops and the like. Of course, I wanted to use them in my daily C++ grind as well. Until recently, this turned out to be rather clunky—C++98 and C++03 did not include std::copy_if, which can be used to implement filter. Furthermore, std::transform, which is the equivalent of apply or map, takes an output iterator as an additional argument. This does not yield concise code.

Luckily, with C++11 the situation became better. First, let’s implement filter:

#include <algorithm>
#include <iterator>
#include <vector>

template <
  class InputIterator,
  class Functor>
std::vector<typename std::iterator_traits<InputIterator>::value_type>
filter( InputIterator begin, InputIterator end, Functor f )
{
  using ValueType = typename std::iterator_traits<InputIterator>::value_type;

  std::vector<ValueType> result;
  result.reserve( std::distance( begin, end ) );

  std::copy_if( begin, end,
                std::back_inserter( result ),
                f );

  return result;
}

This snippet permits us to take any range, apply a functor to it, and get a vector in return that only contains those elements for which the functor returned true. I realize that this is not as flexible as the STL output iterator concept, but be honest—when was the last time you actually wanted anything other than a vector to store stuff? Just as ed is the standard text editor, vector is the standard container in my opinion.

We can use the new implementation in the following manner:

std::vector<unsigned> a = { 2,3,4,5,6 };
auto b = filter( a.begin(), a.end(),
                 [] ( unsigned i ) { return i % 2 == 0; } );

That is pretty concise in my opinion.

Let’s implement map next:

#include <algorithm>
#include <iterator>
#include <type_traits>
#include <vector>

template <
  class InputIterator,
  class Functor>
auto map( InputIterator begin, InputIterator end, Functor f )
  -> std::vector<typename std::result_of< Functor( typename std::iterator_traits<InputIterator>::value_type ) >::type>
{
  using FunctorValueType = typename std::result_of< Functor( typename std::iterator_traits<InputIterator>::value_type ) >::type;

  std::vector<FunctorValueType> functorValues;
  functorValues.reserve( unsigned( std::distance( begin, end ) ) );

  std::transform( begin, end,
                  std::back_inserter( functorValues ), f );

  return functorValues;
}

That contains two awesome C++11 features: Trailing return types and std::result_of. We need this because the value type of the returned vector depends on the result of applying the functor to one of the values in the range! To be honest, the trailing return type is not really necessary, but I always wanted to use it somewhere. Again, I think that this implementation results in quite concise code:

std::vector<unsigned> a = { 2,3,4,5,6 };
auto b = map( a.begin(), a.end(),
              [] ( unsigned i ) { return i * 2; } );

The holy grail would of course be the possibility to return the same type of container that is used as the input, but the iterators do not permit this—for very good reasons. In any case, I am now happily using these constructions in my code and hope that you will, too.

Posted late Sunday evening, March 13th, 2016 Tags: