Sensibly sorting a shopping list with an LLM

I often get shopping deliveries from a large UK supermarket chain, Sainsbury's. They email me a PDF summary listing all the items, plus anything that's missing. I print that out and tick off the items that have arrived.

The list is sorted alphabetically, but the names are not that obvious. For example, you pick an apple out of the box, search the list and find:

Sainsbury's Braeburn Apple Single
Sainsbury's Royal Gala Apple Single

Those are both sorted under 'S' because they're Sainsbury's own brand, but within that they're under 'B' and 'R' for the type of apple, nowhere near 'A'.

Similarly the Sainsbury's Maris Piper Potatoes are sorted under 'M' not 'P' for potatoes.

A good, safe use for an LLM

I think this is a pretty good use for an LLM because:

It's a fairly fuzzy problem
The output of this problem is just reorganising data, not inventing new stuff
It's not critical - no one is going to get hurt if I have to spend longer scanning this list

I'm running an LLM locally, so there's no worrying about where my data is going. At the moment, this is the llama.cpp program running the Qwen3 Coder model, quantised down by Unsloth. I find it usably fast on just my CPU, and it gives reasonably sensible answers for an 18G model.

My approach

Given that I'm a beardy Unix person, and I don't actually want to trust the LLM that much, my approach is to use good old Unix tools to extract the text from the PDF and turn it into a simple textual list, then give the list to the LLM.

I then use the LLM to give a single-word output classifying each input line. That word becomes the main sorting key, and the sorting itself is done externally.

This way, the LLM output is very simple. It doesn't have to output the whole input text, and it can't mangle the input text, because that's not what is output at the end.

Building it

Testing out the LLM

I started out by trying a prompt and then throwing some tests at the model:

> For each line of input, give a single word from the input which is the most important in classifying the type of product.  Only ever give a single word output, and it must be one of the input words:

> Sainsbury's Maris Piper Potatoes 2kg
Potatoes

> Sainsbury's Thick Sliced Wholemeal Bread 800g
Bread

> St Helen's Farm Goat's Milk Natural Yogurt 450g
Yogurt

OK, that looks promising!

Wrangling the PDF

I used Poppler's pdftotext program to take the PDF and spit out raw text, and with a bit of awk and fmt got a reasonable one-per-line list of items. This is of course very delicate and will break if the retailer changes anything.


pdftotext -nopgbrk "$PDF" $TMPDIR/text

# The PDF contains a lot of headers and stuff, just get the items as text
# Obviously delicate as hell!
awk '/^Order summary/ { enable=0; } { if (enable) { print $0; }} /^Groceries / { enable=1; }' $TMPDIR/text | \
   fmt -300 |
   awk '{ac=ac" "$0} /^£/ { sub(" *", "", ac); print ac;ac=""}' > $TMPDIR/items.txt
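To see what the two awk stages do, here is a small sketch that runs the same pipeline on a few lines of made-up pdftotext output. The sample layout (item text wrapped over a couple of lines, then the price on a line of its own, with blank lines in between) is only an assumption about what the PDF extracts to, the prices are invented, and extract_items is just a name chosen for this sketch.

```shell
#!/bin/bash
# Sketch: the extraction pipeline from above, wrapped in a function that
# reads pdftotext-style text on stdin and prints one item per line.
extract_items() {
  # Stage 1: keep only the lines between "Groceries ..." and "Order summary".
  awk '/^Order summary/ { enable=0; } { if (enable) { print $0; }} /^Groceries / { enable=1; }' |
    fmt -300 |
    # Stage 2: accumulate text until a line starting with the price,
    # then print the accumulated item (leading spaces stripped) and reset.
    awk '{ac=ac" "$0} /^£/ { sub(" *", "", ac); print ac;ac=""}'
}

# Made-up sample input; the real PDF layout may well differ.
extract_items << 'SAMPLE'
Your order
Groceries (2 items)
1 Sainsbury's Natural Yogurt
500g

£1.10

2 Sainsbury's Maris Piper
Potatoes 2kg

£2.00

Order summary
Total £3.10
SAMPLE
```

With this sample, each item should come out on one line, with its wrapped text joined up and the price at the end. Anything before "Groceries" or from "Order summary" onwards is dropped.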

Running the LLM

Then I attached the prompt to the top of the items file and ran it through the LLM:


cat > $TMPDIR/llm-input << __HERE_MARKER 
$PROMPT
$(cat $TMPDIR/items.txt)
__HERE_MARKER

# Do the LLM magic
$LLAMAPATH -m $MODELPATH $MODELPARAMS $HOSTPARAMS -f $TMPDIR/llm-input --no-display-prompt < /dev/null 2>/dev/null > $TMPDIR/llm-output

# The output contains a blank line and an EOF marker that starts with > - strip them
sed -e '/^$/d' -e'/^>/d' $TMPDIR/llm-output > $TMPDIR/llm-output.clean
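As an illustration of the clean-up step, this sketch applies the same sed filter to some stand-in model output. The exact text of the marker line is an assumption (all that is known here is that it starts with >), and clean_llm_output is a name made up for the sketch.

```shell
#!/bin/bash
# Sketch: drop blank lines and any line starting with ">" from the raw
# model output, leaving one tag per line.
clean_llm_output() {
  sed -e '/^$/d' -e '/^>/d'
}

# Stand-in for the raw model output: a leading blank line, three tags,
# another blank line and a made-up marker line beginning with ">".
printf '\nPotatoes\nBread\nYogurt\n\n> EOF by user\n' | clean_llm_output
```

This leaves just the three tag lines. It also means a tag that happened to start with > would be thrown away, which the line-count check would then catch.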

Remembering not to trust the LLM, I then do a simple sanity check:


ITEMCOUNT=$(wc -l $TMPDIR/items.txt | cut -d" " -f1)
OUTLINES=$(wc -l $TMPDIR/llm-output.clean | cut -d" " -f1)

if [ $ITEMCOUNT -ne $OUTLINES ]
then
  echo "Mismatch items: $ITEMCOUNT  llm output: $OUTLINES" >&2
  exit 1
fi
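The count check only catches missing or extra lines. Since the prompt asks for a word taken from the input, a further check is possible: confirm that each tag really does appear in its item. The sketch below is one way that might look. It is not part of the original script, check_tags is a hypothetical name, and the substring match is deliberately loose, so a two-word tag or a different capitalisation still passes.

```shell
#!/bin/bash
# Sketch: check_tags TAGFILE ITEMFILE
# Pairs line N of TAGFILE with line N of ITEMFILE and reports any tag that
# does not occur (case-insensitively, as a substring) in its item.
# Exit status is 0 if every tag was found, 1 otherwise.
check_tags() {
  paste "$1" "$2" | awk -F'\t' '
    index(tolower($2), tolower($1)) == 0 {
      printf "line %d: tag \"%s\" is not in \"%s\"\n", NR, $1, $2
      bad = 1
    }
    END { exit bad + 0 }'
}

# Demo with made-up data: the second tag is not a word from its item.
tmp=$(mktemp -d)
printf 'Potatoes\nMilk\n' > "$tmp/tags"
printf "1 Sainsbury's Maris Piper Potatoes 2kg\n1 Sainsbury's Thick Sliced Wholemeal Bread 800g\n" > "$tmp/items"
check_tags "$tmp/tags" "$tmp/items" || echo "some tags were not taken from their item"
rm -r "$tmp"
```

In the script this could sit next to the count check, for example as check_tags $TMPDIR/llm-output.clean $TMPDIR/items.txt >&2 || exit 1, though whether a stray tag should be fatal is a judgement call.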

Using the output

Finally, I use the Unix paste command to glue the LLM output to the original list, and then pass it through sort.


paste $TMPDIR/llm-output.clean $TMPDIR/items.txt | sort > $TMPDIR/sorted-items.txt

This gets us output lines like:

  Yogurt	1 Sainsbury's Natural Yogurt 500g  £1.10
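A toy run of this step shows the apples ending up next to each other. The tags here are made up to stand in for LLM output, sort_by_tag is a name invented for the sketch, and LC_ALL=C is added only so the ordering is the same on every machine (the script itself uses plain sort).

```shell
#!/bin/bash
# Sketch: glue one tag per line onto one item per line (tab-separated)
# and sort the result, so the tag acts as the main sorting key.
sort_by_tag() {   # sort_by_tag TAGFILE ITEMFILE
  paste "$1" "$2" | LC_ALL=C sort
}

tmp=$(mktemp -d)
# Made-up tags, in the same order as the items below.
printf 'Apple\nPotatoes\nApple\nYogurt\n' > "$tmp/tags"
cat > "$tmp/items" << 'ITEMS'
1 Sainsbury's Braeburn Apple Single
1 Sainsbury's Maris Piper Potatoes 2kg
1 Sainsbury's Royal Gala Apple Single
1 Sainsbury's Natural Yogurt 500g
ITEMS
sort_by_tag "$tmp/tags" "$tmp/items"
rm -r "$tmp"
```

The two Apple lines come first, then Potatoes, then Yogurt. The item text on each line is byte-for-byte what was in the items file, since only the tag column came from the model. If the tag column isn't wanted on the printout, piping the result through cut -f2- would strip it again.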

Results

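In general it works pretty well. I've seen a few cases where the LLM decides to output a two-word phrase like Sweet Potato or Cottage Cheese, even though I explicitly told it not to. Fortunately this still works with the way I built it: paste separates the tag from the item with a tab, and sort just compares whole lines, so a space in the tag does no harm.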

For most things, the LLM's chosen tags are reasonable. For some things it isn't sure and gives different answers each time; for example, for Sainsbury's Choco Rice Pops I've seen outputs of both Choco and Pops.

From a user's point of view, that's pretty bad, because you can't know from one order to the next where to find an item. Adding an explicit rule to the prompt works for those, e.g. For 'Choco rice pops' the word "Choco" is the best answer.
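If the number of such rules grows, one option is to keep them in a small file and generate the prompt sentences from it. The sketch below assumes a hypothetical tab-separated rules file (item name, then preferred word) and a made-up build_prompt helper. It only reproduces the sentence style used above and says nothing about how many rules the model will reliably follow.

```shell
#!/bin/bash
# Sketch: build_prompt BASE RULESFILE
# RULESFILE holds one rule per line as: item name<TAB>preferred word.
# Prints BASE followed by one sentence per rule.
build_prompt() {
  printf '%s' "$1"
  # \047 is the octal escape for a single quote inside an awk string.
  awk -F'\t' 'NF >= 2 {
    printf " For \047%s\047 the word \"%s\" is the best answer.", $1, $2
  }' "$2"
  printf '\n'
}

# Demo with a one-line rules file.
rules=$(mktemp)
printf 'Choco rice pops\tChoco\n' > "$rules"
build_prompt "Only ever give a single word output." "$rules"
rm "$rules"
```

The result could be captured with something like PROMPT=$(build_prompt "$BASE" rules.tsv), where BASE and rules.tsv are placeholders. Any closing instruction, such as the "Exit afterwards." in the full script, would need appending after the rules.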

The whole script


#!/bin/bash

# Exit on error
set -e

LLAMAPATH=/discs/more/git/llama.cpp/build/bin/llama-cli
MODELPATH=/discs/fast/ai/Qwen3-Coder-30B-A3B-Instruct-IQ4_NL.gguf
MODELPARAMS="--jinja -ngl 99 --temp 0.7 --min-p 0.0 --top-p 0.80 --top-k 20 --presence-penalty 1.0 --cache-type-k q8_0"
HOSTPARAMS="-t 32 -dev none"

PROMPT="The following is a list of shopping items, one per line. For each item give a single word from the input which is the most important in classifying the type of product.  Only ever give a single word output, and it must be one of the input words for that item. For 'Choco rice pops' the word \"Choco\" is the best answer. Exit afterwards."


PDF="$1"
OUTFILE="$1-cat.txt"

TMPDIR=$(mktemp --tmpdir -d shopping-sort-XXXXXXXXXXXXXXX)
trap "/bin/rm -fr $TMPDIR" EXIT

pdftotext -nopgbrk "$PDF" $TMPDIR/text

# The PDF contains a lot of headers and stuff, just get the items as text
# Obviously delicate as hell!
awk '/^Order summary/ { enable=0; } { if (enable) { print $0; }} /^Groceries / { enable=1; }' $TMPDIR/text | \
   fmt -300 |
   awk '{ac=ac" "$0} /^£/ { sub(" *", "", ac); print ac;ac=""}' > $TMPDIR/items.txt

# Now prepare the full LLM input, prompt+list
cat > $TMPDIR/llm-input << __HERE_MARKER
$PROMPT
$(cat $TMPDIR/items.txt)
__HERE_MARKER

# Do the LLM magic
$LLAMAPATH -m $MODELPATH $MODELPARAMS $HOSTPARAMS -f $TMPDIR/llm-input --no-display-prompt < /dev/null 2>/dev/null > $TMPDIR/llm-output

# The output contains a blank line and an EOF marker that starts with > - strip them
sed -e '/^$/d' -e'/^>/d' $TMPDIR/llm-output > $TMPDIR/llm-output.clean

# Sanity check the number of items and output lines
ITEMCOUNT=$(wc -l $TMPDIR/items.txt | cut -d" " -f1)
OUTLINES=$(wc -l $TMPDIR/llm-output.clean | cut -d" " -f1)

if [ $ITEMCOUNT -ne $OUTLINES ]
then
  echo "Mismatch items: $ITEMCOUNT  llm output: $OUTLINES" >&2
  exit 1
fi

# Now sort based on the llm's idea of the category
paste $TMPDIR/llm-output.clean $TMPDIR/items.txt | sort > $TMPDIR/sorted-items.txt

# Add an extra line for printing
awk '{print $0,"\n"}' $TMPDIR/sorted-items.txt > "$OUTFILE"

mail: fromwebpage@treblig.org irc: penguin42 on libera.chat | matrix: penguin42 on matrix.org | mastodon: penguin42 on mastodon.org.uk

back to Dave Gilbert's Home Page