Continuous file ranges with bash

After the downtime and later recovery, DAS is finally back with a changed business model. Previous subscribers like me have access to content until some deadline, after which seasons must be purchased.

Having learned quite a bit from grb’s excellent screencasts, I wanted to archive them so that I could revisit them and discover anything I’d missed. I already had some of the files, so I wanted to download only the missing ones.

Disclaimer: I wrote all the code as single lines and reformatted it as multiline in WordPress; as such it might not work as shown, even though I think I’ve tested all of it. All code is public domain; hopefully you can learn something from it. I won’t be explaining everything, just a bit of the process behind getting to the final result. I also did not use any if statements, which makes the code rather unreadable at first.

DAS screencast filenames follow a readable format like das-([0]*[1-9][0-9]*)-([^.]+).mov, where the groups are the sequence number and the screencast title. Having recently had to come back to shell scripting after a while, I wanted to build a simple script to help me find the missing files.
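As a minimal sketch of that format in use, assuming bash and its built-in regex matching, the sequence number can be pulled out without sed. The helper name das_seq is made up for this illustration:

```bash
# Hypothetical helper: print the sequence number of a DAS screencast filename.
das_seq() {
    local name=$1
    # Same shape as the pattern above: das-<zero-padded number>-<title>.mov
    local re='^das-0*([1-9][0-9]*)-([^.]+)\.mov$'
    # Fail quietly for names that do not look like screencasts.
    [[ $name =~ $re ]] || return 1
    # Group 1 is the number with the zero padding already dropped.
    printf '%s\n' "${BASH_REMATCH[1]}"
}
```

For example, das_seq das-0005-foo.mov prints 5, and a non-matching name returns a non-zero status instead of echoing the input back the way sed does.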

Specification #1

To put it as an example, given current working directory with files:

  • das-0001-foo.mov
  • das-0005-foo.mov
  • das-0006-foo.mov

It should tell me that 2, 3, 4 are missing, and optionally that 1, 5, 6 were found.

Final solution

The solution I came up with is surely neither the simplest nor the most complete, but it seems to work:

show_found="true" # [true|false] to show or hide found ranges
open=""           # first found value in a range
last=""           # previous found file
while IFS= read -r filename; do 
    # parse out the screencast sequence number, dropping the zero padding
    current=$(echo "$filename" | sed -Ee 's/^das-0*([1-9][0-9]*)-.*$/\1/')

    # sed "returns" the input if 's' "did nothing"; ignore non-matching files
    [ "$filename" = "$current" ] && continue

    # did a range just start or are we at the first file of a range?
    ( [ -z "$last" ] || [ -z "$open" ] ) && open="$current"

    # we are not on the first line and the previous is not current - 1
    [ "$last" ] && [ $((current - 1)) -ne "$last" ] && {
        [ "$show_found" = "true" ] && {
            # format the range       single element      multiple elements
            [ "$open" = "$last" ] && echo "+ [$open]" || echo "+ [$open...$last]"; 
        }
        
        # is there a gap? if so, format the missing, again first 
        # single element then multiple
        [ "$((last + 1))" -eq "$((current - 1))" ] \
            && echo "- [$((last + 1))]" \
            || echo "- [$((last+1))...$((current - 1))]";

        # current always starts a new found range, so it becomes "open"
        open="$current";
    }
    
    # save our last found value
    last="$current"
done < <(ls -X1 .)

# if found ranges were wanted, we most likely have one to print
[ "$show_found" = "true" ] && [ "$open" ] && { 
    # again with the single/multiple range formatting
    [ "$open" = "$last" ] && echo "+ [$open]" || echo "+ [$open...$last]"
}

Coding it up

Looking at my bash history, it took me about 65 rounds to get it working as I wanted. I started off by testing the output of ls -1 (1) and ls -1 | sort -n (2), and then moved on to processing it line by line:

ls -1 . |\
    sort -n |\
    while read filename; do 
        current=$(echo "$filename" |sed -Ee 's/^.*-([0-9]){4}-.*$/\1/')
        [ "$last" ] && [ $((current - 1)) -ne "$last" ] && { 
            echo "missing $((current - 1))"
        }
        last=$current
    done

Attempts 3-5 were mostly about fixing the quick sed expression and missing semicolons after curly braces. The 6th attempt would have been fine, except that it only reports “missing 4” in a cwd like the one in the example. This is because the loop only ever sees the files that exist (one, five and six, since the input is generated by ls) and not every number in the range, so it reports just the number right before each found file.

I did not notice the actual problem stated above immediately; I thought something must have gone wrong with doing arithmetic on the zero-prefixed numbers that sed yields in the loop. As such I spent a few attempts refining the regexp to ignore zero-prefixes.
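The suspicion was not unreasonable, even though it was not the cause here: bash arithmetic reads a leading zero as an octal prefix, so $((0010)) is 8 and $((0008)) is an error. A small sketch of one way around it, with a made-up helper name to_decimal:

```bash
# Hypothetical helper: convert a zero-padded string to a plain decimal number.
to_decimal() {
    # The 10# prefix forces base 10; without it a leading zero means octal.
    echo "$((10#$1))"
}
```

With this, to_decimal 0008 prints 8, so the padding no longer has to be stripped by the regexp.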

After noticing that fixing the numbers did not fix this, I realized the need to loop from $((last + 1)) to $((current - 1)), that is over $(seq $((last + 1)) $((current - 1))), which finally worked in attempt 14:

ls -1 . |\
    sort -n |\
    while read filename; do 
        current=$(echo "$filename" |sed -Ee 's/^.*-0+([1-9]*[0-9])+-.*$/\1/')
        [ "$last" ] && [ $((current - 1)) -ne "$last" ] && {
            for x in $(seq "$((last + 1))" "$((current - 1))"); do 
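                # the quoted glob is never expanded, so this test always fails
                # and every $x in the gap is reported as missing
                ! [ -f "*$x*.mov" ] && echo "missing $x" || { nextlast=$x; break; }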
            done  
        } 
        [ "$nextlast" ] && { last="$nextlast"; nextlast=""; } || { last=$current;}
    done
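Since the input is sorted and comes from ls, every number strictly between two consecutive entries is already known to be missing, so the idea can be sketched without any file test. This is a simplified illustration and not the attempt from my history: it assumes the sequence numbers have already been parsed, one per line, it uses a bash arithmetic for-loop instead of seq, and the function name report_missing is made up:

```bash
# Hypothetical function: reads ascending sequence numbers on stdin and
# prints one "missing N" line for every number skipped between neighbours.
report_missing() {
    local last="" current x
    while read -r current; do
        if [ -n "$last" ]; then
            # Walk the gap between the previous and the current number.
            for ((x = last + 1; x < current; x++)); do
                echo "missing $x"
            done
        fi
        last=$current
    done
}
```

Feeding it 1, 5 and 6 on separate lines reports 2, 3 and 4 as missing.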

Specification #2

Well, attempt #14 worked nicely. But as the downloads were still going on, I had more time to spend; how about better output, so that I could see both the found and not found ranges? As an example, given a current working directory with the files:

  • das-0001-foo.mov
  • das-0005-foo.mov
  • das-0006-foo.mov
  • das-0009-foo.mov

The script should output something like:

+ [1]
- [2-4]
+ [5-6]
- [7-8]
+ [9]

Turns out that writing a script to output the above is a bit more complicated. Most importantly, the whole structure needed to be reorganized so that the first found value of a range (open) could be accessed after having processed all input with the while-loop. This could have been side-stepped by, for example, appending an empty line to the input, but I wanted to learn about variable scoping in bash.

stackoverflow.com has the answer, which explains that piping to a while-loop creates a sub-shell, which of course has its own variables that are not visible to the parent shell. The trick is to run the while loop in the current shell and feed it the input from a sub-shell through process substitution, as in:

while read line; do echo "$line"|wc -c; done < <(cat long-file.txt)
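To make the difference visible, here is a small sketch that counts lines both ways. It assumes bash with default options (in particular, lastpipe is not enabled), and the function names are made up for the illustration:

```bash
# Hypothetical demo: the loop runs in a sub-shell, so n is updated in a copy.
count_with_pipe() {
    local n=0 line
    printf 'a\nb\nc\n' | while read -r line; do n=$((n + 1)); done
    echo "$n"
}

# Hypothetical demo: the loop runs in the current shell, so n keeps its value.
count_with_procsub() {
    local n=0 line
    while read -r line; do n=$((n + 1)); done < <(printf 'a\nb\nc\n')
    echo "$n"
}
```

The piped version prints 0 because the increments are lost with the sub-shell, while the process substitution version prints 3.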

In reality, I didn’t quite figure out the output format I really wanted until attempt 26. By then the “one-liner” had grown into a not-really-functioning one-liner of 613 characters, which made me want to start over for a while.

After starting more downloads I realized the main problem was in the inner for-loop; it was wasteful, led to the introduction of even more variables (nextlast) and prevented me from actually outputting what I wanted. The trick was to start seeking a solution that would minimize the number of variables, and through that I realized that recording only the first found value of the current range (open) and the last processed value (last) was enough.
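The two-variable idea can be sketched on its own, separated from the filename parsing. This is an illustration rather than the script I ended up with: it uses if statements for readability, assumes ascending numbers on stdin, and the names print_ranges and fmt_range are made up:

```bash
# Hypothetical helper: print "<sign> [a]" for a single value,
# or "<sign> [a...b]" for a longer range.
fmt_range() {
    if [ "$2" -eq "$3" ]; then
        printf '%s [%s]\n' "$1" "$2"
    else
        printf '%s [%s...%s]\n' "$1" "$2" "$3"
    fi
}

# Hypothetical function: reads ascending integers on stdin and prints
# found (+) and missing (-) ranges, tracking only open and last.
print_ranges() {
    local open="" last="" current
    while read -r current; do
        if [ -z "$last" ]; then
            # the very first value opens the first found range
            open=$current
        elif [ "$current" -ne $((last + 1)) ]; then
            # a gap: close the found range, report the gap, open a new range
            fmt_range + "$open" "$last"
            fmt_range - $((last + 1)) $((current - 1))
            open=$current
        fi
        last=$current
    done
    # whatever range is still open after the input ends gets printed last
    [ -n "$open" ] && fmt_range + "$open" "$last"
    return 0
}
```

Because the last range is only known once the input ends, it has to be printed after the loop, which is exactly why open and last must survive the while-loop.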

The rest of the attempts were really about fixing cases I did not have in my das folder while the files were downloading, for example:

  • all downloaded
  • ranges of one

Tail

In the end, downloading the DAS screencasts taught me even more than I had been able to learn from the videos themselves. Also, a plot of “one-liner length” per attempt might be of interest:

[Figure: graph of characters per attempt for my one-liner]

Perhaps using a proper VCS instead of the history file would have shown the amount of rewriting that was going on.

And here’s the source of the graph; gnuplot is nice, and all you need is a single line (edited here for readability).

gnuplot <(
    echo "set terminal png;" 
    echo "set output 'daslines.png';"
    echo "set title 'Characters per attempt';"
    echo "set xlabel 'Attempt #';"
    echo "set ylabel 'Characters';"
    echo "set yrange [0:*];"
    echo "plot '-' using 1:2 title 'Attempt length' with linespoints"; 
    lineno=1; 
    while read line; do 
        echo $lineno $(echo "$line" |wc -c); 
        lineno=$((lineno + 1)); 
    done < ../dastest/selected-history
)