How to Count Unique Lines Using Unix Pipes
Sometimes I want to take a command line output and count the unique lines from that output. I can do this with a combination of standard unix tools and pipes. Read more about sort, uniq, and head in the linked GNU docs, or use man sort, etc. to see what flags are available on your system of choice. I always find new things when I read the docs: sort, for example, has a --batch-size flag that limits the maximum number of inputs to be merged at once. Hillel Wayne advocates for this kind of “browsing” documentation instead of “searching” in his newsletter.

The basic pipeline looks like this:
<output> | sort | uniq -c | sort -nr | head -n 5
By using the long flag names, we can get a better sense of what’s going on. I generally prefer long flags, because I tend to forget which short flag goes with what, especially if I haven’t used a particular command in a while. On my system, bash provides tab completion for these long flags, which makes them easier to discover and type. Here is the same pipeline with long flags:
<output> | sort \
| uniq --count \
| sort --numeric-sort --reverse \
| head --lines 5
Step by step:
First, we sort the output alphabetically with sort.
Then, uniq --count merges matching lines while prepending the count to each line. We needed to sort first because this merge only happens on identical lines that sit next to each other in uniq’s input.
Then, sort --numeric-sort --reverse sorts these lines by the prepended count, in descending order.
Finally, we use head --lines 5 to keep only the top 5 counts. I see this as the unix equivalent to the SQL: select count(field) from table group by field order by count(field) desc limit 5.
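For example, here is a small sketch of how the stages transform a hypothetical six-line input generated with printf (the # lines show the expected output):

printf 'b\na\nb\nc\nc\nc\n' | sort | uniq --count
#       1 a
#       2 b
#       3 c

printf 'b\na\nb\nc\nc\nc\n' | sort | uniq --count | sort --numeric-sort --reverse | head --lines 5
#       3 c
#       2 b
#       1 a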
Applications
I wanted to share a few places where I’ve found this useful recently. This technique combines well with other command line “power tools”, such as awk and jq:
Cut - Aggregate a CSV Column
If I have a csv like this:
hi,1,a
hello,2,a
world,3,b
I can use cut to select just a single column. Read more about cut here. I was originally going to use awk for this example; thanks to PgSuper for suggesting the simpler cut. The original awk was: awk 'BEGIN { FS = "," } ; { print $3 }'. The short-flag equivalent is cut -d',' -f3.

With the long flags, that looks like this:
cat my.csv \
| cut --delimiter ',' --fields 3
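That prints just the third column of the sample CSV:

a
a
b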
Then, I can combine this output with our aggregation:
<previous> | sort | uniq -c | sort -nr | head -n 5
To get our aggregated count:
2 a
1 b
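Spelled out in full with the long flags, the whole pipeline for this CSV is:

cat my.csv \
  | cut --delimiter ',' --fields 3 \
  | sort \
  | uniq --count \
  | sort --numeric-sort --reverse \
  | head --lines 5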
jq - Aggregating the Web
Our count-unique aggregation works well with jq for counting API output. I tend to prefer jaq, a clone of jq, for performance reasons, but jq is the more popular tool; both work with the same syntax in this example.

For example, Codeberg’s Forgejo instance exposes a public API with information about the repos hosted there. By default, /repos/search returns 25 items at the time of writing. We can fetch it with curl:
curl "https://codeberg.org/api/v1/repos/search"
I can use jq to turn the resulting JSON into a list of the primary language used in each repo. Learn more about jq from the tutorial. The filter looks like this:
<previous> | jq '.data[].language'
Here, I use jq to access the data key, operate over a list of objects, and then select the language key from each one.
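To see how that filter behaves, here is a minimal sketch with a small hand-written JSON snippet standing in for the real API response (the # lines show the expected output):

echo '{"data": [{"language": "Go"}, {"language": "Rust"}, {"language": ""}]}' | jq '.data[].language'
# "Go"
# "Rust"
# ""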
Finally, we can aggregate to see what languages are most used:
<previous> | sort | uniq -c | sort -nr | head -n 15
For me that looks like:
9 ""
4 "Shell"
4 "Lua"
2 "Python"
2 "Markdown"
1 "Kotlin"
1 "HTML"
1 "GDScript"
1 "Emacs Lisp"
It looks like a fair number of repos don’t have their language field set in the Forgejo API response.
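For reference, here is the whole pipeline spelled out end to end; the -s flag just tells curl to silence its progress output:

curl -s "https://codeberg.org/api/v1/repos/search" \
  | jq '.data[].language' \
  | sort \
  | uniq --count \
  | sort --numeric-sort --reverse \
  | head --lines 15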