Some gnuplot and datamash adventures

Thursday, December 29, 2022 

I’ve been collecting data on the state of the Ukrainian digital network daily since about the start of the war; some details of the process are in this post.  I was creating and updating maps made with QGIS when particularly notable things happened, generally correlated with significant damage to the Ukrainian power infrastructure (and/or data infrastructure).  I wanted a way to provide a live update of the feed and, as all such projects go, the real reward was the friends made along the way to an automatically updated “live” summary stats table and graph.

My data collection tools generate some rather large CSV files for the mapping tools, but to keep a running summary, I also extract the daily total of responding servers, compute the day-over-day change, and append those values to a running tally CSV file.  Two really great free software tools help turn this simple data structure into a nicely formatted (I think) table and graph: datamash and gnuplot. I’m not remotely expert enough to get into the full details of these excellent tools, but I put together some tricks that are working for me and might help someone else trying to do something similar.
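A minimal sketch of that running tally, assuming a three-column layout of date, total, and day-over-day change (the commands later in this post read the total from column 2 and the change from column 3). The function name append_tally and the sample counts are made up for illustration and are not taken from the actual collection script.

```shell
#!/bin/sh
# Sketch only: append "date,total,change" to a running tally CSV.
# append_tally and the sample values are hypothetical.

append_tally() {
    tally_file=$1   # running tally CSV: date,total,change
    day=$2          # date stamp for the new row, e.g. 2022-12-28
    total=$3        # today's count of responding servers

    # Previous total is column 2 of the last row; empty if the file has no rows.
    prev=$(tail -n 1 "$tally_file" 2>/dev/null | cut -d, -f2)

    # With no previous row, record a change of 0.
    change=$(( $total - ${prev:-$total} ))

    printf '%s,%s,%s\n' "$day" "$total" "$change" >> "$tally_file"
}

# Usage with made-up numbers:
tally=$(mktemp)
append_tally "$tally" 2022-12-27 121000
append_tally "$tally" 2022-12-28 120915
last_row=$(tail -n 1 "$tally")
echo "$last_row"
```

Keeping the change in the file, rather than recomputing it at plot time, means both datamash and gnuplot can read it as a plain column.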

Using datamash for Statistical Summaries

Datamash is a great command line tool for getting statistics from text files like logs or CSV files or other relatively accessible and easily managed data sources.  It is quite a bit easier to use and less resource intensive than R or GNU Octave, but obviously also much more limited. I really only wanted very basic statistics, and wanted to be able to get to them from Bash with a cron job calling a simple script; for that sort of work, datamash is the tool of choice.

Basic statistics are easy to compute with datamash, but if you want the median of a data set formatted with comma thousands separators, like 120,915 (say), you might need a slightly more complicated (but still one-liner) command like this:

Median="$(/usr/bin/datamash -t, median 2  < /trend.csv | datamash round 1 | sed ':a;s/\B[0-9]\{3\}\>/,&/;ta')"

Median=               Assign the result to the variable $Median
-t,                   Comma delimited (instead of tab, default)
median                one of a bazillion stats datamash can compute
2                     use column two of the CSV data set.
< /trend.csv          feed the previous command a CSV file nom nom
| datamash round 1    pipe the result back to datamash to round the decimals away
| sed (yadda yadda)   pipe that result to sed to insert comma thousands separator*

*HT @sim
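The sed step is the least obvious part of that pipeline, so here is a small sketch of it on its own, using GNU sed as in the command above. The helper name group_thousands and the sample numbers are only for illustration.

```shell
#!/bin/sh
# Sketch: the thousands-grouping sed loop in isolation (GNU sed syntax).
#   :a                      define a label named "a"
#   s/\B[0-9]\{3\}\>/,&/    insert a comma before three digits that end a
#                           word and are preceded by another word character
#                           (\B = not at a word boundary, \> = end of word)
#   ta                      if a substitution was made, jump back to "a"
group_thousands() {
    sed ':a;s/\B[0-9]\{3\}\>/,&/;ta'
}

echo 120915  | group_thousands   # 120,915
echo 1234567 | group_thousands   # 1,234,567
echo 999     | group_thousands   # 999
```

Rounding before this step matters: the loop treats any run of digits the same way, so digits after a decimal point would get grouped too.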

Once I have these values properly formatted as readable strings, I needed a way to automatically insert those updates into a consistently formatted table like this:

Sample statistics chart

I first create a dummy table with a plugin called TablePress with target dummy values (like +++Median) which I then extract as HTML and save as a template for later modification. With the help of a little external file inclusion code into WordPress, you can pull that formatted but now static HTML back into the post from a server-side file.  Now all you need to do is modify the HTML file version of the table using sed via a cron job to replace the dummy values with the datamash computed values and then scp the table code with updated data to the server so it is rendered into the viewed page:

sed -i -e "s/+++Median/$Median/g" "stats_table.html"
/usr/bin/sshpass -P assphrase -f "$HOME/.pass" /usr/bin/scp -r stats_table.html user@site.org:/usr/local/www/wp-content/uploads/stats_table.html

For this specific application, the bash script runs daily via cron with the appropriate datamash lines and table variable replacements to keep the table current.  It first copies the table template into a working directory, computes the latest values with datamash, then seds those updated values into the working copy of the table template, and scps that over the old version in the wp-content directory for visitor viewing pleasure.
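The copy-then-substitute part of that flow can be sketched roughly as follows. The file names, the one-row stand-in template, the hard-coded value, and the fill_placeholder helper are all hypothetical; in the real script datamash supplies the value and an scp upload follows.

```shell
#!/bin/sh
# Sketch of the template-fill step; names and values are made up.

fill_placeholder() {
    file=$1    # working copy of the table HTML
    key=$2     # placeholder name without the +++ prefix, e.g. Median
    value=$3   # formatted string to insert, e.g. 120,915
    sed -i -e "s/+++$key/$value/g" "$file"
}

workdir=$(mktemp -d)

# 1. Stand-in for the saved TablePress template.
printf '<tr><td>Median</td><td>+++Median</td></tr>\n' > "$workdir/template.html"

# 2. Always start from a fresh copy so the placeholders are still present.
cp "$workdir/template.html" "$workdir/stats_table.html"

# 3. Substitute the computed value (hard-coded here for the sketch).
Median="120,915"
fill_placeholder "$workdir/stats_table.html" Median "$Median"

cat "$workdir/stats_table.html"
# 4. The upload step (scp) would follow here.
```

Working on a copy is what makes the job repeatable: once the placeholders have been replaced in a file, there is nothing left for the next day's sed to match. Values containing / or & would need escaping before being handed to sed this way.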

Using gnuplot for Generating a Live Graph

The basic process of providing live data to the server is about the same.  A different WordPress plugin, SVG Support, adds support for SVG file types within WordPress.  I suspect this is not default since SVG can contain active code, but a modern website without SVG support is like a fish without a bicycle, isn’t it? SVG is useful in this case in another way: the summary page integrates a scaled image which is linked to the full size SVG file.  For bitmapped files, the scaled image (or thumbnail) is generated by downsampling the original (with ImageMagick, optimally, not GD) and that needs an active request (i.e. PHP code) to update.  In this case, there’s no need since the SVG thumbnail is just the original file resized. SVG: Scalable Vector Graphics FTW.

Gnuplot is an impressively full-featured graphing tool with a complex command structure.  I had to piece together some details from various sources and then do some sedding to get the final touches as I wanted them.  As every plot is different, I’ll just document the bits I pieced together myself: the plotting details go in the gnuplot command script, the other bits in a bash script executed later to add some non-standard formatting to the gnuplot SVG output.

Title of the plot

The SVG <title> block is set as “Gnuplot” and I don’t see any way to change that from the command line, so I replaced it with the title I wanted, using a variable for the most recently updated data point extracted by datamash as above:

sed -i -e "s/<title>Gnuplot<\/title>/<title>Ukrainian Servers Responding on port 80 from 2022-03-05 to $LDate<\/title>/g" "/UKR-server-trend.svg"
sed -i -e "s/<desc>Produced by GNUPLOT 5.2 patchlevel 2 <\/desc>/<desc>Daily automated update of Ukrainian server response statistics.<\/desc>/g" "/UKR-server-trend.svg"

This title value is used as the tab title.  I’m not sure where the <desc> will show up, but it is likely read by various spiders and is an accessibility aid for online readers.
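One fragile spot is that the <desc> pattern spells out an exact gnuplot version, so it would silently stop matching after an upgrade. A possible variation, sketched here on a two-line stand-in file rather than real gnuplot output, matches any version text. The LDate value is made up.

```shell
#!/bin/sh
# Sketch: replace <title> and <desc> without depending on the gnuplot version.
# The sample file only imitates the two relevant lines of the SVG output.

svg=$(mktemp)
cat > "$svg" <<'EOF'
<title>Gnuplot</title>
<desc>Produced by GNUPLOT 5.2 patchlevel 2 </desc>
EOF

LDate="2022-12-28"

# Using # as the delimiter avoids escaping the slashes in the closing tags.
# [^<]* matches whatever version text follows "Produced by GNUPLOT".
sed -i \
    -e "s#<title>Gnuplot</title>#<title>Ukrainian Servers Responding on port 80 from 2022-03-05 to $LDate</title>#" \
    -e "s#<desc>Produced by GNUPLOT[^<]*</desc>#<desc>Daily automated update of Ukrainian server response statistics.</desc>#" \
    "$svg"

cat "$svg"
```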

Last Data Point

I wanted the most recent server count to be visible at the end of the plot.  This takes two steps: first plot that data point alone with a label (but no title so it doesn’t show up in the data key/legend) by adding a separate plot of just that last datum like:

"< tail -n 1 '/trend.csv'" u 1:2:2 w labels notitle

This works fine, but if you hover over the data point, it just pops up “gnuplot_plot_4” and I’d rather have more useful data so I sed that out and replace it with some values I got from datamash queries earlier in the script like so:

sed -i -e "s/<title>gnuplot_plot_4<\/title>/<title>Tot: $LTot; Diff: $LDif<\/title>/g" "/UKR-server-trend.svg"
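For completeness, here is one way the LTot and LDif values could be produced. This sketch swaps in tail and read instead of datamash, and assumes the tally rows are date,total,change. The sample rows and the one-line stand-in SVG are made up.

```shell
#!/bin/sh
# Sketch: fill the hover title for the last data point from the last CSV row.
# Uses tail and read rather than datamash; sample data is hypothetical.

csv=$(mktemp)
printf '2022-12-27,121000,150\n2022-12-28,120915,-85\n' > "$csv"

# Split the last row on commas into date, total, and day-over-day change.
IFS=, read -r LDate LTot LDif <<EOF
$(tail -n 1 "$csv")
EOF

svg=$(mktemp)
echo '<title>gnuplot_plot_4</title>' > "$svg"

sed -i -e "s#<title>gnuplot_plot_4</title>#<title>Tot: $LTot; Diff: $LDif</title>#" "$svg"
cat "$svg"
```

The number in gnuplot_plot_4 appears to follow the position of that plot in the plot command, so reordering or adding data series would mean adjusting the pattern.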

Adding Link Text

SVG supports clickable links, but you can’t (I don’t think) define those URLs in the label command.  So first set the visible text with a simple gnuplot label command:

set label "Black Rose Technology https://brt.llc" at graph 0.07,0.03 center tc rgb "#693738" font "copperplate,12"

and then enhance the resulting svg code with a link using good old sed:

sed -i -e "s#<text><tspan font-family=\"copperplate\" >Black Rose Technology https://brt.llc</tspan></text>#<a xlink:href=\"https://brt.llc/\" target=\"_blank\"><text><tspan font-family=\"copperplate\" >Black Rose Technology https://brt.llc</tspan></text></a>#g" "/UKR-server-trend.svg"

Hovertext for the Delta Bars

Adding hovertext to the ends of the daily delta bars was a bit more involved.  The SVG <title> type is interpreted by most browsers as a hoverable element but adding visible data labels to the ends of the bars makes the graph icky noisy.  Fortunately SVG supports transparent text. To get all this to work, I replot the entire bar graph data series as just labels like so:

'/trend.csv' using 1:3:3 with labels font "arial,4" notitle axes x1y2

But this leaves a very noisy looking graph, so we pull out our trusty sed to set opacity to “0” so the labels are hidden:

sed -i -e "s/\(stroke=\"none\" fill=\"black\"\)\( font-family=\"arial\" font-size=\"4.00\"\)/\1 opacity=\"0\"\2/g" "/UKR-server-trend.svg"

and then find the data value and generate a <title> element of that data value using back-references.  I must admit, I have not memorized regular expressions to the point where I can just write these and have them work on the first try: gnu’s sed tester is very helpful.

sed -i -e "s/\(<text><tspan font-family=\"arial\" >\)\([-1234567890]*\)<\/tspan><\/text>/\1\2<title>\2<\/title><\/tspan><\/text>/g" "/UKR-server-trend.svg"

And you get hovertext data interrogation.  W00t!
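To see what the two substitutions do, this sketch runs both on a single made-up label. The two sample lines imitate only the attributes the patterns look for; real gnuplot output carries more attributes on the <g> element.

```shell
#!/bin/sh
# Sketch: both hovertext substitutions applied to one sample label.
# The sample lines are hypothetical, modeled on the sed patterns.

svg=$(mktemp)
cat > "$svg" <<'EOF'
<g stroke="none" fill="black" font-family="arial" font-size="4.00">
<text><tspan font-family="arial" >-85</tspan></text>
EOF

# Single-quoted sed scripts avoid having to escape every double quote.
# First expression: hide the label. Second: copy its value into a <title>.
sed -i \
    -e 's/\(stroke="none" fill="black"\)\( font-family="arial" font-size="4.00"\)/\1 opacity="0"\2/g' \
    -e 's/\(<text><tspan font-family="arial" >\)\([-0-9]*\)<\/tspan><\/text>/\1\2<title>\2<\/title><\/tspan><\/text>/g' \
    "$svg"

cat "$svg"
```

Because the patterns key on the arial font at size 4, only labels plotted with that exact font setting are hidden and given a title; other labels on the graph are left alone.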

Sample of gnuplot showing hovertext

Note that cron jobs are executed with different environment variables than user executed scripts, which can change date formatting (which can be set explicitly in gnuplot) as well as the thousands separator and decimal characters (,/.). To get consistent results with a cron job, explicitly set the appropriate locale, either in the script like

#!/bin/bash
export LC_NUMERIC=en_US.UTF-8
...

or for all cron jobs, at the top of the crontab (crontab -e):

LC_NUMERIC=en_US.UTF-8
MAILTO=user@domain.com
# .---------------- minute (0 - 59) 
# |    .------------- hour (0 - 23)
# |    |      .---------- day of month (1 - 31)
# |    |      |    .------- month (1 - 12) OR jan,feb,mar,apr ... 
# |    |      |    |    .---- day of week (0 - 6) (Sunday=0 or 7)  OR sun,mon,tue,wed,thu,fri,sat 
# |    |      |    |    |
# *    *      *    *    *    <command to be executed>
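A quick way to check what a job would actually see is to approximate cron's sparse environment with env -i. This is only a sketch: the PATH value is an assumption, and cron's real defaults vary by system. It also shows that a variable set inside a script reaches child processes such as datamash or gnuplot only when it is exported.

```shell
#!/bin/sh
# Sketch: simulate a near-empty environment and check what child processes see.
# PATH=/usr/bin:/bin is an assumed stand-in for cron's default.

# Plain assignment in the parent shell: the child does not inherit it.
plain=$(env -i PATH=/usr/bin:/bin sh -c \
    'LC_NUMERIC=en_US.UTF-8; sh -c "echo \${LC_NUMERIC:-unset}"')

# Exported assignment: the child inherits it.
exported=$(env -i PATH=/usr/bin:/bin sh -c \
    'export LC_NUMERIC=en_US.UTF-8; sh -c "echo \${LC_NUMERIC:-unset}"')

echo "plain assignment: $plain"
echo "exported:         $exported"
```

Running the whole daily script the same way (env -i followed by the script path) is a cheap test before handing it to cron.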

 

The customized SVG file is copied to the server with scp as before, replacing the previous day’s.  Repeat visitors might have to clear their cache.  It’s also important to disable caching on the site for the page (for example, if using WP Super Cache or similar), because there’s no signal to the cache management engine that the file has been updated.

Posted at 05:19:01 GMT-0700

Category: Geopost, HowTo, Linux, Technology