Some gnuplot and datamash adventures
I’ve been collecting data on the state of the Ukrainian digital network on a daily basis since about the start of the war; some details of the process are in this post. I was creating and updating maps made with QGIS when particularly notable things happened, generally correlated with significant damage to the Ukrainian power infrastructure (and/or data infrastructure). I wanted a way to provide a live update of the feed, and as all such projects go, the real reward was the friends made along the way to an automatically updated “live” summary stats table and graph.
My data collection tools generate some rather large CSV files for the mapping tools, but to keep a running summary, I also extract the daily total of responding servers, compute the day-over-day change, and append those values to a running tally CSV file. A couple of really great free software tools help turn this simple data structure into a nicely formatted (I think) table and graph: GNU datamash and gnuplot. I’m not remotely expert enough to get into the full details of these excellent tools, but I put together some tricks that are working for me and might help someone else trying to do something similar.
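The running tally step might look something like the sketch below. It assumes a three-column layout of date, total, and day-over-day change, which is consistent with the column numbers used in the plotting commands later in this post. The function name `append_tally` and the numbers are made up for illustration; this is not the actual collection script.

```shell
#!/bin/bash
# Hypothetical helper: append "date,total,diff" to a running tally CSV.
# Assumes column 2 of the last existing row holds the previous day's total.
append_tally() {
    local file="$1" day="$2" total="$3" prev diff
    # Last recorded total; empty if the file is missing or has no rows yet.
    prev=$(tail -n 1 "$file" 2>/dev/null | cut -d, -f2)
    # On the very first row there is nothing to compare against, so diff is 0.
    diff=$(( total - ${prev:-$total} ))
    printf '%s,%s,%s\n' "$day" "$total" "$diff" >> "$file"
}

# Example with made-up numbers:
tally=$(mktemp)
append_tally "$tally" 2022-03-05 121000
append_tally "$tally" 2022-03-06 120915
cat "$tally"
```

Keeping the change as its own column means the later statistics and plotting steps never have to recompute it.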
Using datamash for Statistical Summaries
Datamash is a great command line tool for getting statistics from text files like logs or CSV files or other relatively accessible and easily managed data sources. It is quite a bit easier to use and less resource intensive than R or GNU Octave, but obviously also much more limited. I really only wanted very basic statistics and wanted to be able to get to them from Bash with a cron job calling a simple script, and for that sort of work, datamash is the tool of choice.
Basic statistics are easy to compute with datamash, but if you want the median of a data set formatted with comma thousands separators, like 120,915 (say), you might need a slightly more complicated (but still one-liner) command like this:
Median="$(/usr/bin/datamash -t, median 2 < /trend.csv | datamash round 1 | sed ':a;s/\B[0-9]\{3\}\>/,&/;ta')"
Median=               Assign the result to the variable $Median
-t,                   Comma delimited (instead of tab, default)
median                one of a bazillion stats datamash can compute
2                     use column two of the CSV data set.
< /trend.csv          feed the previous command a CSV file nom nom
| datamash round 1    pipe the result back to datamash to round the decimals away
| sed (yadda yadda)   pipe that result to sed to insert comma thousands separator*
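The sed loop at the end of that pipeline is the least obvious part, so here is a minimal sketch that isolates it. The function name `add_commas` is made up for illustration, and the `\B` and `\>` anchors assume GNU sed.

```shell
#!/bin/bash
# Hypothetical wrapper around the thousands-separator loop (GNU sed assumed).
# Each pass puts a comma in front of the final three digits of a digit run,
# provided more digits sit to their left; "ta" loops until nothing changes.
add_commas() {
    sed ':a;s/\B[0-9]\{3\}\>/,&/;ta'
}

printf '%s\n' 120915  | add_commas   # prints 120,915
printf '%s\n' 1234567 | add_commas   # prints 1,234,567
printf '%s\n' 999     | add_commas   # prints 999
```

The loop only makes sense on whole numbers, since digits after a decimal point would get grouped too, which is why the rounding step comes first in the pipeline.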
Once I had these values properly formatted as readable strings, I needed a way to automatically insert those updates into a consistently formatted table like this:
I first create a dummy table with a plugin called TablePress with target dummy values (like +++Median), which I then extract as HTML and save as a template for later modification. With the help of a little external file inclusion code in WordPress, you can pull that formatted but now static HTML back into the post from a server-side file. Now all you need to do is modify the HTML file version of the table using sed via a cron job to replace the dummy values with the datamash computed values and then scp the table code with updated data to the server so it is rendered into the viewed page:
sed -i -e "s/+++Median/$Median/g" "stats_table.html"
/usr/bin/sshpass -P assphrase -f ~/.pass /usr/bin/scp -r stats_table.html user@site.org:/usr/local/www/wp-content/uploads/stats_table.html
For this specific application the bash script runs daily via cron with appropriate datamash lines and table variable replacements to keep the table updated on a daily basis. It first copies the table template into a working directory, computes the latest values with datamash, then seds those updated values into the working copy of the table template, and scps that over the old version in the wp-content directory for visitor viewing pleasure.
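The copy-then-substitute part of that daily run could be sketched roughly as follows. The function names and the `+++Latest` placeholder are hypothetical (only `+++Median` appears above), and `sed -i` assumes GNU sed.

```shell
#!/bin/bash
# Hypothetical helper: swap every +++NAME placeholder in FILE for VALUE.
# In a basic regular expression "+" is literal, so +++Median needs no escaping.
# Assumes VALUE contains no "/" or "&", which holds for formatted numbers.
fill_placeholder() {
    local file="$1" name="$2" value="$3"
    sed -i -e "s/+++${name}/${value}/g" "$file"
}

# Hypothetical daily driver: start from the pristine template on every run so
# the placeholders are always there to be replaced.
render_table() {
    local template="$1" working="$2" median="$3" latest="$4"
    cp "$template" "$working"
    fill_placeholder "$working" Median "$median"
    fill_placeholder "$working" Latest "$latest"
    # The scp upload of "$working" would follow here.
}
```

Working on a fresh copy matters because once a placeholder has been replaced in a file, there is nothing left for the next day's sed to find.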
Using gnuplot for Generating a Live Graph
The basic process of providing live data to the server is about the same. A different WordPress plugin, SVG Support, adds support for SVG filetypes within WordPress. I suspect this is not default since SVG can contain active code, but a modern website without SVG support is like a fish without a bicycle, isn’t it? SVG is useful in this case in another way: the summary page integrates a scaled image which is linked to the full size SVG file. For bitmapped files, the scaled image (or thumbnail) is generated by downsampling the original (with ImageMagick, optimally, not GD) and that needs an active request (i.e. PHP code) to update. In this case, there’s no need since the SVG thumbnail is just the original file resized. SVG: Scalable Vector Graphics FTW.
Gnuplot is an impressively full-featured graphing tool with a complex command structure. I had to piece together some details from various sources and then do some sedding to get the final touches as I wanted them. As every plot is different, I’ll just document the bits I pieced together myself: the plotting details go in the gnuplot command script, the other bits in a bash script executed later to add some non-standard formatting to the gnuplot SVG output.
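For orientation, here is a rough sketch of what a gnuplot command script for this kind of plot could look like, written out from bash. Nearly everything in it is an assumption: the file names, the terminal size, the `%Y-%m-%d` date format, the series titles, and the order of the four plot elements. Only the two `with labels` elements are taken from the sections below.

```shell
#!/bin/bash
# Hypothetical sketch: write a gnuplot command script for a CSV with assumed
# columns date,total,diff. Sizes, titles, and styles are placeholders.
write_plot_script() {
    local csv="$1" svg="$2" script="$3"
    cat > "$script" <<EOF
set terminal svg size 1200,600
set output '$svg'
set datafile separator ','
set xdata time
set timefmt '%Y-%m-%d'
set format x '%Y-%m'
set ytics nomirror
set y2tics
set style fill solid
plot '$csv' using 1:2 with lines title 'Servers responding', \\
     '$csv' using 1:3 with boxes axes x1y2 title 'Daily change', \\
     '$csv' using 1:3:3 with labels font 'arial,4' notitle axes x1y2, \\
     "< tail -n 1 '$csv'" using 1:2:2 with labels notitle
EOF
}
```

Running `gnuplot` on the generated script would then produce the SVG that the sed commands in the following sections post-process.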
Title of the Plot
The SVG <title> block is set as “Gnuplot” and I don’t see any way to change that from the command line, so I replaced it with the title I wanted, using a variable for the most recently updated data point extracted by datamash as above:
sed -i -e "s/<title>Gnuplot<\/title>/<title>Ukrainian Servers Responding on port 80 from 2022-03-05 to $LDate<\/title>/g" "/UKR-server-trend.svg"
sed -i -e "s/<desc>Produced by GNUPLOT 5.2 patchlevel 2 <\/desc>/<desc>Daily automated update of Ukrainian server response statistics.<\/desc>/g" "/UKR-server-trend.svg"
This title value is used as the tab title. I’m not sure where the <desc> will show up, but it is likely read by various spiders and is an accessibility thing for online readers.
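One fragility worth noting: the description replacement only matches while gnuplot reports exactly “5.2 patchlevel 2”. A slightly more tolerant sketch, assuming the title and description elements each sit on a single line, matches whatever version string is present. The function name and arguments are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: set the SVG <title> and <desc> regardless of which
# gnuplot version produced the file. "#" is the sed delimiter so the "/" in
# closing tags needs no escaping; TITLE and DESC must not contain "#" or "&".
retitle_svg() {
    local svg="$1" title="$2" desc="$3"
    sed -i \
        -e "s#<title>Gnuplot</title>#<title>${title}</title>#" \
        -e "s#<desc>Produced by GNUPLOT[^<]*</desc>#<desc>${desc}</desc>#" \
        "$svg"
}
```

That way the cron job keeps working after a gnuplot upgrade instead of silently leaving the stock description in place.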
Last Data Point
I wanted the most recent server count to be visible at the end of the plot. This takes two steps: first plot that data point alone with a label (but no title so it doesn’t show up in the data key/legend) by adding a separate plot of just that last datum like:
"< tail -n 1 '/trend.csv'" u 1:2:2 w labels notitle
This works fine, but if you hover over the data point, it just pops up “gnuplot_plot_4” and I’d rather have more useful data, so I sed that out and replace it with some values I got from datamash queries earlier in the script like so:
sed -i -e "s/<title>gnuplot_plot_4<\/title>/<title>Tot: $LTot; Diff: $LDif<\/title>/g" "/UKR-server-trend.svg"
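As a self-contained illustration of the same idea, the sketch below reads the newest row straight from the tally file instead of using datamash, then swaps it in as the hover text. It assumes the columns are date, total, and change; the function name is made up, and the plot id is passed in because it depends on the order of the elements in the plot command.

```shell
#!/bin/bash
# Hypothetical sketch: take the newest row of the tally (assumed columns:
# date,total,diff) and use it as hover text for one gnuplot plot element.
# PLOT_ID is whatever id gnuplot assigned, e.g. gnuplot_plot_4.
label_last_point() {
    local svg="$1" csv="$2" plot_id="$3" day total diff
    IFS=, read -r day total diff <<EOF
$(tail -n 1 "$csv")
EOF
    sed -i -e "s#<title>${plot_id}</title>#<title>Tot: ${total}; Diff: ${diff}</title>#" "$svg"
}
```

Because the pattern includes the surrounding title tags, the `id="gnuplot_plot_4"` attribute on the group itself is left alone.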
Adding Link Text
SVG supports clickable links, but you can’t (I don’t think) define those URLs in the label command. So first set the visible text with a simple gnuplot label command:
set label "Black Rose Technology https://brt.llc" at graph 0.07,0.03 center tc rgb "#693738" font "copperplate,12"
and then enhance the resulting SVG code with a link using good old sed:
sed -i -e "s## Black Rose Technology https://brt.llc #g" "/UKR-server-trend.svg" Black Rose Technology https://brt.llc
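One way such a substitution could be written is sketched below. It assumes gnuplot wrote the label on one line as a text element holding a single tspan, and that the SVG root declares the xlink namespace (gnuplot's output normally does); the function name is hypothetical.

```shell
#!/bin/bash
# Hypothetical sketch: wrap the <text> element that holds LABEL in an SVG <a>
# so it becomes clickable. Assumes the markup on one line looks like
#   <text><tspan ...>LABEL</tspan></text>
# and that LABEL and URL contain no "#", "&", or regex specials besides ".".
link_label() {
    local svg="$1" label="$2" url="$3"
    sed -i -e "s#<text>\(<tspan[^>]*>${label}</tspan>\)</text>#<a xlink:href=\"${url}\" target=\"_blank\"><text>\1</text></a>#g" "$svg"
}
```

It would be called as `link_label "/UKR-server-trend.svg" "Black Rose Technology https://brt.llc" "https://brt.llc"`.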
Hovertext for the Delta Bars
Adding hovertext to the ends of the daily delta bars was a bit more involved. The SVG <title> type is interpreted by most browsers as a hoverable element, but adding visible data labels to the ends of the bars makes the graph icky noisy. Fortunately SVG supports transparent text. To get all this to work, I replot the entire bar graph data series as just labels like so:
'/trend.csv' using 1:3:3 with labels font "arial,4" notitle axes x1y2
But this leaves a very noisy looking graph, so we pull out our trusty sed to set opacity to “0” so they’re hidden:
sed -i -e "s/\(stroke=\"none\" fill=\"black\"\)\( font-family=\"arial\" font-size=\"4.00\"\)/\1 opacity=\"0\"\2/g" "/UKR-server-trend.svg"
and then find the data value and generate a <title> element of that data value using back-references. I must admit, I have not memorized regular expressions to the point where I can just write these and have them work on the first try: GNU’s sed tester is very helpful.
sed -i -e "s/\(\)\([-1234567890]*\)<\/tspan><\/text>/\1\2<title>\2<\/title><\/tspan><\/text>/g" "/UKR-server-trend.svg"
And you get hovertext data interrogation. W00t!
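Both passes can be tried on a small fragment with a sketch like the one below. It assumes gnuplot writes each tiny label as a group carrying the arial 4.00 font attributes followed by a text element with one tspan; the exact markup may differ between gnuplot versions, and the function name is made up.

```shell
#!/bin/bash
# Hypothetical sketch of both passes. Assumed input markup, one tag per line:
#   <g ... stroke="none" fill="black" font-family="arial" font-size="4.00" ...>
#   <text><tspan font-family="arial" >-85</tspan></text>
add_bar_hovertext() {
    local svg="$1"
    # Pass 1: hide the tiny labels. Pass 2: copy each (possibly negative)
    # integer into a <title> child so it appears on hover.
    sed -i \
        -e 's#\(stroke="none" fill="black"\)\( font-family="arial" font-size="4.00"\)#\1 opacity="0"\2#g' \
        -e 's#\(<tspan font-family="arial" >\)\(-\{0,1\}[0-9][0-9]*\)</tspan></text>#\1\2<title>\2</title></tspan></text>#g' \
        "$svg"
}
```

Matching on the arial font keeps the substitution away from axis tick labels, as long as nothing else in the plot uses that font.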
*Note that cron jobs are executed with different environment variables than user executed scripts, which can result in date formatting variations (which can be set explicitly in gnuplot) and different thousands separator and decimal characters (,/.). To get consistent results with a cron job, explicitly set the appropriate locale, either in the script like
#!/bin/bash
export LC_NUMERIC=en_US.UTF-8
...
or for all cron jobs as in crontab -e
LC_NUMERIC=en_US.UTF-8
MAILTO=user@domain.com
# .---------------- minute (0 - 59)
# |  .------------- hour (0 - 23)
# |  |  .---------- day of month (1 - 31)
# |  |  |  .------- month (1 - 12) OR jan,feb,mar,apr ...
# |  |  |  |  .---- day of week (0 - 6) (Sunday=0 or 7) OR sun,mon,tue,wed,thu,fri,sat
# |  |  |  |  |
# *  *  *  *  *
The customized SVG file is SCPd to the server as before, replacing the previous day’s. Repeat visitors might have to clear their cache. It’s also important to disable caching on the site for the page (for example, if using WP Super Cache or something similar) because there’s no signal to the cache management engine that the file has been updated.