Bash Speedup

Pascal Thalmann
Published in The Startup
5 min read · Dec 24, 2020


Photo by Herbert Aust on Pixabay

The other day I had to write a shell script that generates 5637 queries crammed into a single file. The goal was to run these queries against a database and correct discrepancies between what is and what should be. Anyway, this article is not about queries and databases, but about the raw power of Bash and why it pays to think twice before you take the beaten path.

The situation

The input comes from a file that contains exactly 5637 value pairs.

They look like this:

Value1.XYZ Value2.XYZ

The first field contains the current value stored in the database. The second field defines its new name. So I created a query template in which the placeholders “==VALUE1==” and “==VALUE2==” are replaced with Value1.XYZ and Value2.XYZ for every line of the input. The template is reused in a loop, once per pair:

[==TITLE==]
...
ROOT = DB/DO_SOMETHING(NEW_VALUE_NAME="==VALUE2==")
...
IMPORTANT_FIELD = DB_NAME::==VALUE1== 0
...

The template was much bigger, but for the sake of discretion, and because the rest would only distract, I replaced all other lines with “...”. There is also the field “==TITLE==”, which must be unique and is replaced with the two values joined together, something like Value1Value2.
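To make the mapping concrete, here is a minimal sketch of how one input line turns into the three replacement values. The sample line is hypothetical and only mirrors the format shown above; the title is built the same way as in the scripts in this article.

```bash
# Hypothetical sample line in the format of the input file.
line="Value1.XYZ Value2.XYZ"

# read splits the line on whitespace into the two fields.
read -r value1 value2 <<< "$line"

# The title is simply both values glued together.
title="${value1}${value2}"

echo "==VALUE1== -> ${value1}"
echo "==VALUE2== -> ${value2}"
echo "==TITLE==  -> ${title}"
```

Reading straight into two named variables is a small variation on the array used in the scripts in this article; both rely on the fields being separated by whitespace.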

The final query file that contains all those subqueries will have 90205 lines.

First attempt

Since this was a one-time task, I honestly did not think about performance. With a template, a loop, and some replacements involved, I immediately picked sed as the tool of choice, and the outcome looked similar to this:

template=template.tpl
workcopy=workcopy.tmp
output=query.out

while read -r line
do
    # Split the line into its two whitespace-separated fields.
    array=(${line})
    value1=${array[0]}
    value2=${array[1]}
    # Start from a fresh copy of the template, fill it in, append it.
    cp "${template}" "${workcopy}"
    sed -i "s/==TITLE==/${value1}${value2}/g" "${workcopy}"
    sed -i "s/==VALUE1==/${value1}/g" "${workcopy}"
    sed -i "s/==VALUE2==/${value2}/g" "${workcopy}"
    cat "${workcopy}" >> "${output}"
done < inputfile.txt

When I ran the script in my test VM, “time” reported the following measurements:

real    22m52.692s
user 0m33.331s
sys 1m47.344s

Wow, that took a long time: almost 23 minutes. The script has to run for all 20 or so DB instances, so that adds up to a lot of waiting. So I decided to make some improvements.

Second attempt

I realized that in my first attempt, each of the 5637 iterations copied the template to a work file on the filesystem. Then sed had to read that file into memory, compute the change, and write it back, and it did so three times. After that, the work copy was read once more from the filesystem and appended to the output file. On top of that, cp, sed, and cat are external programs that have to be started for every single call. In other words: a lot of I/O.
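A quick back-of-the-envelope count shows how many external programs the first attempt starts. This is only a sketch: the five processes per pair are read off the loop body above (one cp, three sed calls, one cat), and the variable names are made up for the calculation.

```bash
pairs=5637            # lines in the input file
procs_per_pair=5      # cp + 3x sed + cat in the first attempt

total_procs=$((pairs * procs_per_pair))
echo "external processes started: ${total_procs}"
```

That is 28185 process launches for a job that only has to produce one text file.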

So I had to get rid of all the unnecessary I/O and focused first on copying the template. My idea was to append the template directly to the output file and replace the “==XX==” fields there with the correct values. That was a horrible idea, and you will soon see why. Here is the code:

template=template.tpl
output=query.out

while read -r line
do
    # Split the line into its two whitespace-separated fields.
    array=(${line})
    value1=${array[0]}
    value2=${array[1]}
    # Append the raw template, then fill in the placeholders in the output file.
    cat "${template}" >> "${output}"
    sed -i "s/==TITLE==/${value1}${value2}/g" "${output}"
    sed -i "s/==VALUE1==/${value1}/g" "${output}"
    sed -i "s/==VALUE2==/${value2}/g" "${output}"
done < inputfile.txt

The result was:

real    74m30.974s
user 13m8.227s
sys 46m4.676s

74 minutes was definitely not an option. But what was happening here? Yes, I got rid of the unnecessary copying of the template. But at what cost! Now every sed call reads and rewrites the whole output file, and the file grows with each iteration, so each iteration adds more read/write work than the one before. In other words, this will never scale. With a big enough input file, this can bog down the whole system. So I went back to the start and decided to optimize only the sed calls.
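The growth can be estimated with a little shell arithmetic. The sketch below rests on one assumption that is not stated in the article: a block length of about 16 lines, derived from 90205 output lines divided by 5637 pairs. It counts only the lines that sed has to scan, not the writes.

```bash
pairs=5637
lines_per_block=16     # assumption: 90205 output lines / 5637 pairs is roughly 16
sed_passes=3

# First attempt: each sed pass only sees one fresh copy of the template.
first=$((pairs * lines_per_block * sed_passes))

# Second attempt: in iteration i the output file already holds i blocks,
# so the work adds up to 1 + 2 + ... + pairs blocks per sed pass.
blocks_scanned=$((pairs * (pairs + 1) / 2))
second=$((blocks_scanned * lines_per_block * sed_passes))

echo "lines scanned by sed, first attempt:  ${first}"
echo "lines scanned by sed, second attempt: ${second}"
echo "ratio: $((second / first))"
```

The measured run times differ by a much smaller factor than this ratio, which suggests that fixed per-call costs make up most of the first attempt. The sketch only illustrates how the scanned volume grows with the number of pairs.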

Third attempt

Removing the work copy from the process was a bad idea. But there is still the part where sed is called three times per iteration, each call causing its own read/write operations. So I changed this part:

sed -i "s/==TITLE==/${value1}${value2}/g" "${workcopy}"
sed -i "s/==VALUE1==/${value1}/g" "${workcopy}"
sed -i "s/==VALUE2==/${value2}/g" "${workcopy}"

to:

template=template.tpl
workcopy=workcopy.tmp
output=query.out

while read -r line
do
    # Split the line into its two whitespace-separated fields.
    array=(${line})
    value1=${array[0]}
    value2=${array[1]}
    cp "${template}" "${workcopy}"
    # One sed call applies all three substitutions in a single pass.
    sed -i "s/==TITLE==/${value1}${value2}/g;
            s/==VALUE1==/${value1}/g;
            s/==VALUE2==/${value2}/g" "${workcopy}"
    cat "${workcopy}" >> "${output}"
done < inputfile.txt

The template is still copied for every pair, but by passing all three substitutions to a single sed call, I reduced three read/write passes over the work copy to one. That gave a much better result than the first attempt:

real    14m26.738s
user 0m20.367s
sys 1m0.661s
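As an aside, sed does not need the work copy at all: without -i it reads the template and writes the result to standard output. The following is an untimed sketch of that variation, not one of the measured attempts; the file names live in a temporary directory and the two-line template and the two pairs are made up for the demo.

```bash
# Sketch: one sed call per pair, streaming the template straight into the output.
# Files and contents below are hypothetical demo data.
workdir=$(mktemp -d)
template="${workdir}/template.tpl"
input="${workdir}/inputfile.txt"
output="${workdir}/query.out"

printf '%s\n' '[==TITLE==]' 'IMPORTANT_FIELD = DB_NAME::==VALUE1== 0' > "$template"
printf '%s\n' 'A.XYZ B.XYZ' 'C.XYZ D.XYZ' > "$input"

while read -r value1 value2
do
    # No -i and no work copy: sed reads the template and writes to stdout.
    sed -e "s/==TITLE==/${value1}${value2}/g" \
        -e "s/==VALUE1==/${value1}/g" \
        -e "s/==VALUE2==/${value2}/g" "$template"
done < "$input" > "$output"   # the output file is opened once for the whole loop

cat "$output"
```

Note that the redirection after done overwrites the output file instead of appending to it. This still starts one sed process per pair, so it would not remove the per-call overhead.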

From almost 23 minutes down to about 14 and a half. Not bad, but still not good. So why not do all the computation in memory? Does Bash have the ability to do that? Can I even get rid of external tools like sed or awk?

Fourth attempt

The short answer is: yes, Bash can do that and does an outstanding job at it.

A variable can hold the content of a file, and substitutions can be done with a single line of code. But there are a few things to mention!

Preserving the structure

If I echo a variable without double quotes, Bash splits its value into words, and the newlines are lost. Instead of structured text, I end up with word salad unless I put the variable in double quotes.
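The difference is easy to see with a two-line string. This is a small sketch; the variable names are made up for the demo.

```bash
# A two-line string, standing in for a template read from a file.
text=$'line one\nline two'

# Unquoted: the shell splits the value into words, echo joins them with spaces.
unquoted=$(echo $text)

# Quoted: the value is passed as one argument, newline included.
quoted=$(echo "$text")

echo "unquoted: ${unquoted}"
echo "quoted:   ${quoted}"
```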

The second thing is that command substitution, as in $(cat file), strips trailing newlines. If you read a template from a file and append that text to itself repeatedly, the blocks run together. To prevent that, end the template not just with a newline but with a blank space after it; Bash will not cut that space. So I changed my template slightly, which I saw as the easiest way around the trailing newline problem:


[==TITLE==]
...
ROOT = DB/DO_SOMETHING(NEW_VALUE_NAME="==VALUE2==")
...
IMPORTANT_FIELD = DB_NAME::==VALUE1== 0
...
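The stripping of trailing newlines can be checked with a small experiment. The sketch below uses a throwaway demo file and also shows a common alternative to the trailing-space trick: appending a sentinel character inside the command substitution and removing it afterwards. That alternative is not what this article uses; it is included only for comparison.

```bash
# Hypothetical demo file that ends in three newlines.
demo=$(mktemp)
printf 'first\nlast\n\n\n' > "$demo"

# Command substitution drops every trailing newline.
plain=$(cat "$demo")

# Alternative: append a sentinel character, then strip it again.
kept=$(cat "$demo"; printf 'x')
kept=${kept%x}

printf 'plain has %d characters\n' "${#plain}"
printf 'kept has %d characters\n' "${#kept}"
rm -f "$demo"
```

The first variable holds 10 characters, the second all 13, including the three newlines at the end.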

Now to the code. I got rid of the work copy and the sed calls. I read the template into the variable $template once, and in each iteration I copy it to the variable $tempvar, which I use for the substitutions:

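template=$(cat template.tpl)
output=query.out
workcopy=""

while read -r line
do
    # Split the line into its two whitespace-separated fields.
    array=(${line})
    value1=${array[0]}
    value2=${array[1]}
    # Work on a fresh in-memory copy of the template.
    tempvar="$template"
    # ${var/pattern/replacement} replaces the first match only.
    tempvar=${tempvar/==TITLE==/${value1}${value2}}
    tempvar=${tempvar/==VALUE1==/${value1}}
    tempvar=${tempvar/==VALUE2==/${value2}}
    workcopy+="${tempvar}"
done < inputfile.txt
# Write everything to the output file in one go.
echo "$workcopy" >> "${output}"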

The result was overwhelming:

real    0m6.655s
user 0m2.883s
sys 0m3.272s

Under 7 seconds instead of almost 23 minutes. That is awesome.
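For readers who want to try the technique without the real files, here is a self-contained sketch. The template and the pairs are made up and held in variables. Unlike the script above, it uses the double-slash form ${var//pattern/replacement}, which replaces every match the way sed's g flag does, and it appends an explicit newline after each block instead of relying on a trailing space in the template.

```bash
# Hypothetical two-line template and two input pairs, held in memory for the demo.
template='[==TITLE==]
FIELD = DB_NAME::==VALUE1== -> ==VALUE1== / ==VALUE2=='
pairs='A.XYZ B.XYZ
C.XYZ D.XYZ'

result=""
while read -r value1 value2
do
    block=$template
    # A single slash would replace only the first match; // replaces all of them.
    block=${block//==TITLE==/${value1}${value2}}
    block=${block//==VALUE1==/${value1}}
    block=${block//==VALUE2==/${value2}}
    # Add the separating newline explicitly.
    result+="${block}"$'\n'
done <<< "$pairs"

printf '%s' "$result"
```

Values that contain characters such as & may need extra care in newer Bash versions, so treat this as a starting point rather than a drop-in replacement.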

Conclusion

You might ask: “Why all the fuss? Why not use Python or Perl instead?” And I agree: I am a huge fan of both languages, but sometimes all that is needed is a shell script. And for that, the shell provides some decent tools that give you quite remarkable speed.

If you stayed with me until this point, I want to thank you, and I hope this article might help you one day and save you some time!
