How to Count Unique Lines Using Unix Pipes
Sometimes I want to take a command line output and count the unique lines from that output. I can do this with a combination of standard unix tools and pipes. Read more about sort, uniq, and head in the linked GNU docs, or use man sort, etc. to see what flags are available on your system of choice. I always find new things when I read the docs: sort, for example, has a --batch-size flag that limits the maximum number of inputs to be merged at once. Hillel Wayne advocates for this kind of “browsing” documentation instead of “searching” in his newsletter.

The basic pipeline looks like this:
<output> | sort | uniq -c | sort -nr | head -n 5
By using the long flag names, we can get a better sense of what’s going on. I generally prefer long flags, because I tend to forget which short flag goes with what, especially if I haven’t used a particular command in a while. On my system, bash provides tab completion for these long flags, which makes them easier to discover and type. Here is the same pipeline with long flags:
<output> | sort \
| uniq --count \
| sort --numeric-sort --reverse \
| head --lines 5
Step by step:
First, we sort the output alphabetically with sort.
Then, uniq --count merges matching lines while prepending the count to each line. We needed to sort first because this merge only happens on identical lines that sit next to each other in uniq’s input.
Then, sort --numeric-sort --reverse sorts these lines by the prepended count, in descending order.
Finally, we use head --lines 5 to keep only the top 5 counts. I see this as the unix equivalent to the SQL: select count(field) from table group by field order by count(field) desc limit 5.
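For example, here is a small sketch of how the stages transform a hypothetical six-line input generated with printf (the # lines show the expected output):

printf 'b\na\nb\nc\nc\nc\n' | sort | uniq --count
#       1 a
#       2 b
#       3 c

printf 'b\na\nb\nc\nc\nc\n' | sort | uniq --count | sort --numeric-sort --reverse | head --lines 5
#       3 c
#       2 b
#       1 a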
Applications
I wanted to share a few places where I’ve found this useful recently. This technique combines well with other command line “power tools”, such as awk and jq:
Cut - Aggregate a CSV Column
If I have a csv like this:
hi,1,a
hello,2,a
world,3,b
I can use cut to select just a single column. Read more about cut here. I was originally going to use awk for this example; thanks to PgSuper for suggesting the simpler cut. The original awk was: awk 'BEGIN { FS = "," } ; { print $3 }'. The short-flag equivalent is cut -d',' -f3.

With the long flags, that looks like this:
cat my.csv \
| cut --delimiter ',' --fields 3
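That prints just the third column of the sample CSV:

a
a
b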
Then, I can combine this output with our aggregation:
<previous> | sort | uniq -c | sort -nr | head -n 5
To get our aggregated count:
2 a
1 b
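Spelled out in full with the long flags, the whole pipeline for this CSV is:

cat my.csv \
  | cut --delimiter ',' --fields 3 \
  | sort \
  | uniq --count \
  | sort --numeric-sort --reverse \
  | head --lines 5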
jq - Aggregating the Web
Our count-unique aggregation works well with jq for counting API output. I tend to prefer jaq, a clone of jq, for performance reasons, but jq is the more popular tool; both work with the same syntax in this example.

For example, Codeberg’s Forgejo instance exposes a public API with information about the repos hosted there. By default, /repos/search returns 25 items at the time of writing. We can fetch it with curl:
curl "https://codeberg.org/api/v1/repos/search"
I can use jq to turn the resulting JSON into a list of the primary language used in each repo. Learn more about jq from the tutorial. The filter looks like this:
<previous> | jq '.data[].language'
Here, I use jq to access the data key, operate over a list of objects, and then select the language key from each one.
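To see how that filter behaves, here is a minimal sketch with a small hand-written JSON snippet standing in for the real API response (the # lines show the expected output):

echo '{"data": [{"language": "Go"}, {"language": "Rust"}, {"language": ""}]}' | jq '.data[].language'
# "Go"
# "Rust"
# ""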
Finally, we can aggregate to see what languages are most used:
<previous> | sort | uniq -c | sort -nr | head -n 15
For me that looks like:
9 ""
4 "Shell"
4 "Lua"
2 "Python"
2 "Markdown"
1 "Kotlin"
1 "HTML"
1 "GDScript"
1 "Emacs Lisp"
It looks like a fair number of repos don’t have their language field set in the Forgejo API response.
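For reference, here is the whole pipeline spelled out end to end; the -s flag just tells curl to silence its progress output:

curl -s "https://codeberg.org/api/v1/repos/search" \
  | jq '.data[].language' \
  | sort \
  | uniq --count \
  | sort --numeric-sort --reverse \
  | head --lines 15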