Old Skool Unix part 2 - sort and uniq
I promised you a tool for big-time logfile statistics. Not that logfiles are the only data you can apply this technique to.
First, let's do a bit of planning. We already cut the path out in our last installment, and we want to check which paths are accessed most on a particular day and site.
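As a quick recap of the cut exercise, here is a sketch with one hypothetical log line in the common "combined" format (your log format may differ): the request sits between the double quotes, and the path is the second space-separated field of the request.

```shell
# A made-up sample line in "combined" log format.
line='203.0.113.5 - - [10/Oct/2024:13:55:36 +0000] "GET /robots.txt HTTP/1.1" 200 68'

# Field 2 between the double quotes is the request; field 2 of that is the path.
path=$(printf '%s\n' "$line" | cut -d'"' -f2 | cut -d' ' -f2)
echo "$path"
```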
For that, we now (after our exercise with cut) have a list of all the paths that get accessed, in the order they got accessed. We don't want it ordered that way, because it looks chaotic, so let's just sort them first:
$ cut -d'"' -f2 <logfile | cut -d' ' -f2 | sort
/config.json
/posts/annoying-poetry-bug/
/robots.txt
/robots.txt
/robots.txt
/sitemap.xml
/tags/
/tags/coding/
Look at that! Looks much nicer already. But /robots.txt is mentioned three times; let's fix that with uniq:
$ cut -d'"' -f2 <logfile | cut -d' ' -f2 | sort | uniq
/config.json
/posts/annoying-poetry-bug/
/robots.txt
/sitemap.xml
/tags/
/tags/coding/
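By the way, this is why sort comes before uniq in the pipe: uniq only collapses *adjacent* duplicate lines. A small sketch with a made-up path list where the repeat is not adjacent:

```shell
# Without sorting, the non-adjacent duplicate survives: 3 lines remain.
unsorted=$(printf '/robots.txt\n/tags/\n/robots.txt\n' | uniq | wc -l)

# Sorting first makes duplicates adjacent, so uniq collapses them: 2 lines.
sorted=$(printf '/robots.txt\n/tags/\n/robots.txt\n' | sort | uniq | wc -l)

echo "$unsorted $sorted"
```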
But hey, you said we wanted to count them!
Yes, I did. It turns out uniq has a nice flag for just that, counting how many times each particular line was there to begin with:
$ cut -d'"' -f2 <logfile | cut -d' ' -f2 | sort | uniq -c
1 /config.json
1 /posts/annoying-poetry-bug/
3 /robots.txt
1 /sitemap.xml
1 /tags/
1 /tags/coding/
So yes, /robots.txt was accessed three times. Now let's sort the result again, to have a sorted list with the most accessed path on top:
$ cut -d'"' -f2 <logfile | cut -d' ' -f2 | sort | uniq -c | sort -r
3 /robots.txt
1 /tags/coding/
1 /tags/
1 /sitemap.xml
1 /posts/annoying-poetry-bug/
1 /config.json
Note that I used sort -r this time. That asks the sort utility to show the result in reverse order, that is, largest first (in this case, 3 before 1).
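One caveat worth knowing: sort -r compares lines as text, not as numbers, so once counts grow past one digit things can come out in the wrong order (as a character, "9" outranks "1", and thus "10"). Asking for a numeric sort with -n is the safer habit. A tiny sketch with made-up counts:

```shell
# As text, "9 /a" would sort after "10 /b" in reverse order -- wrong.
# With -n, sort compares the leading count numerically, so 10 wins.
top=$(printf '9 /a\n10 /b\n' | sort -rn | head -n 1)
echo "$top"
```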