Old Skool Unix part 2 - sort and uniq
I promised you a tool for big-time logfile statistics. Not that logfiles are the only data you can apply this technique to.
First, let's do a bit of planning. We already cut the path out in our last installment, and we want to check which paths are accessed most on a particular day and site.
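As a quick recap of the cut exercise, here is a sketch with one hypothetical log line in the common "combined" format (your log format may differ): the request sits between the double quotes, and the path is the second space-separated field of the request.

```shell
# A made-up sample line in "combined" log format.
line='203.0.113.5 - - [10/Oct/2024:13:55:36 +0000] "GET /robots.txt HTTP/1.1" 200 68'

# Field 2 between the double quotes is the request; field 2 of that is the path.
path=$(printf '%s\n' "$line" | cut -d'"' -f2 | cut -d' ' -f2)
echo "$path"
```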
For that, we now (after our exercise with cut) have a list of all the paths that get accessed, in the order they got accessed. We don't want it ordered that way, because it looks chaotic, so let's just sort them first:
$ cut -d'"' -f2 <logfile | cut -d' ' -f2 | sort
/config.json
/posts/annoying-poetry-bug/
/robots.txt
/robots.txt
/robots.txt
/sitemap.xml
/tags/
/tags/coding/
Look at that! Looks much nicer already. But /robots.txt is mentioned three times; let's fix that with uniq:
$ cut -d'"' -f2 <logfile | cut -d' ' -f2 | sort | uniq
/config.json
/posts/annoying-poetry-bug/
/robots.txt
/sitemap.xml
/tags/
/tags/coding/
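By the way, this is why sort comes before uniq in the pipe: uniq only collapses *adjacent* duplicate lines. A small sketch with a made-up path list where the repeat is not adjacent:

```shell
# Without sorting, the non-adjacent duplicate survives: 3 lines remain.
unsorted=$(printf '/robots.txt\n/tags/\n/robots.txt\n' | uniq | wc -l)

# Sorting first makes duplicates adjacent, so uniq collapses them: 2 lines.
sorted=$(printf '/robots.txt\n/tags/\n/robots.txt\n' | sort | uniq | wc -l)

echo "$unsorted $sorted"
```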
But hey, you said we wanted to count them!
Yes, I did. It turns out uniq has a nice flag for just that, counting how many times each particular line was there to begin with:
$ cut -d'"' -f2 <logfile | cut -d' ' -f2 | sort | uniq -c
1 /config.json
1 /posts/annoying-poetry-bug/
3 /robots.txt
1 /sitemap.xml
1 /tags/
1 /tags/coding/
So yes, /robots.txt was accessed three times. Now let's sort the result again, to have a sorted list with the most accessed path on top:
$ cut -d'"' -f2 <logfile | cut -d' ' -f2 | sort | uniq -c | sort -r
3 /robots.txt
1 /tags/coding/
1 /tags/
1 /sitemap.xml
1 /posts/annoying-poetry-bug/
1 /config.json
Note that I used sort -r this time. That asks the sort utility to show the result in reverse order, that is, largest first (in this case, 3 before 1).
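One caveat worth knowing: sort -r compares lines as text, not as numbers, so once counts grow past one digit things can come out in the wrong order (as a character, "9" outranks "1", and thus "10"). Asking for a numeric sort with -n is the safer habit. A tiny sketch with made-up counts:

```shell
# As text, "9 /a" would sort after "10 /b" in reverse order -- wrong.
# With -n, sort compares the leading count numerically, so 10 wins.
top=$(printf '9 /a\n10 /b\n' | sort -rn | head -n 1)
echo "$top"
```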