The list is sorted alphabetically, but the names are not that obvious. For example, you pick an apple out of the box, search the list and find:
Similarly, the Sainsbury's Maris Piper Potatoes are sorted under 'M', not 'P' for potatoes.
I then use the LLM to give a single-word output that classifies each input line, use that word as the main sort key, and do the sorting externally.
This way the LLM output is very simple: it doesn't have to reproduce the whole input text, and it can't mangle the input text, because the final list is built from the original input rather than from the LLM's output.
> For each line of input, give a single word from the input which is the most important in classifying the type of product. Only ever give a single word output, and it must be one of the input words:
> Sainsbury's Maris Piper Potatoes 2kg
Potatoes
> Sainsbury's Thick Sliced Wholemeal Bread 800g
Bread
> St Helen's Farm Goat's Milk Natural Yogurt 450g
Yogurt

OK, that looks promising!
pdftotext -nopgbrk "$PDF" $TMPDIR/text
# The PDF contains a lot of headers and stuff, just get the items as text
# Obviously delicate as hell!
awk '/^Order summary/ { enable=0; } { if (enable) { print $0; }} /^Groceries / { enable=1; }' $TMPDIR/text | \
fmt -300 |
awk '{ac=ac" "$0} /^£/ { sub(" *", "", ac); print ac;ac=""}' > $TMPDIR/items.txt
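To see what the two awk stages do, here is a small sketch that runs them on made-up input. The sample text is hypothetical (the real pdftotext layout may well differ), the function names select_items and join_items are mine, and the fmt step is left out for brevity. The second stage uses sub(/^ +/, ...) to strip the leading space, which should behave the same as the version above.

```shell
#!/bin/bash
# Stage 1: keep only the lines strictly between the "Groceries " header
# and the "Order summary" footer (both marker lines are dropped).
select_items() {
    awk '/^Order summary/ { enable=0 } enable { print } /^Groceries / { enable=1 }'
}

# Stage 2: accumulate lines until one starts with a price, then emit
# everything accumulated so far as a single item line.
join_items() {
    awk '{ ac = ac " " $0 } /^£/ { sub(/^ +/, "", ac); print ac; ac = "" }'
}

# Hypothetical sample input, not real pdftotext output.
printf '%s\n' 'Delivery slot' 'Groceries (2 items)' \
    "1 Sainsbury's Natural Yogurt 500g" '£1.10' \
    "2 Sainsbury's Maris Piper" 'Potatoes 2kg' '£3.00' \
    'Order summary' 'Total £4.10' | select_items | join_items
```

Each item ends up on one line with its price at the end, however many lines the name was wrapped over.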
Then I attached the prompt to the top of the items file and ran it through the LLM:
cat > $TMPDIR/llm-input << __HERE_MARKER
$PROMPT
$(cat $TMPDIR/items.txt)
__HERE_MARKER
# Do the LLM magic
$LLAMAPATH -m $MODELPATH $MODELPARAMS $HOSTPARAMS -f $TMPDIR/llm-input --no-display-prompt < /dev/null 2>/dev/null > $TMPDIR/llm-output
# The output contains a blank line and an EOF marker that starts with > - strip them
sed -e '/^$/d' -e'/^>/d' $TMPDIR/llm-output > $TMPDIR/llm-output.clean
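The clean-up step can be tried on its own. This is a sketch with hypothetical raw output (the exact marker text the LLM front end prints is an assumption); clean_output is my name for it, and the extra trailing-whitespace trim is an addition of mine that the script above does not do.

```shell
#!/bin/bash
# Drop blank lines and lines starting with '>', as in the sed call above.
# The first expression (trim trailing whitespace) is an extra, not part of
# the original script; it runs first so whitespace-only lines count as blank.
clean_output() {
    sed -e 's/[[:space:]]*$//' -e '/^$/d' -e '/^>/d'
}

# Hypothetical raw LLM output: two tags, a blank line and an end marker.
printf '%s\n' 'Potatoes' 'Bread  ' '' '> end of input' | clean_output
```

Only the two tag lines survive, one per input item, which is what the later line count relies on.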
Remembering not to trust the LLM, I then do a simple sanity check:
ITEMCOUNT=$(wc -l $TMPDIR/items.txt | cut -d" " -f1)
OUTLINES=$(wc -l $TMPDIR/llm-output.clean | cut -d" " -f1)
if [ $ITEMCOUNT -ne $OUTLINES ]
then
echo "Mismatch items: $ITEMCOUNT llm output: $OUTLINES" >&2
exit 1
fi
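The line count only catches missing or extra lines. Since the prompt also demands that each answer is one of the input words, that could be checked too. The following is a sketch of one way to do it, not something the script currently does; check_tags is a hypothetical helper, and it assumes the item lines contain no tab characters.

```shell
#!/bin/bash
# check_tags TAGS ITEMS
# Pair each tag with the item on the same line and complain on stderr about
# any tag that is not exactly one of the item's whitespace-separated words.
# Exit status is non-zero if any line fails.
check_tags() {
    paste "$1" "$2" | awk -F'\t' '
        BEGIN { bad = 0 }
        {
            found = 0
            n = split($2, words, " ")
            for (i = 1; i <= n; i++)
                if (words[i] == $1) found = 1
            if (!found) {
                printf "line %d: tag \"%s\" is not a word of: %s\n", NR, $1, $2 > "/dev/stderr"
                bad = 1
            }
        }
        END { exit bad }'
}

# Usage, with the files from the script:
#   check_tags "$TMPDIR/llm-output.clean" "$TMPDIR/items.txt" || exit 1
```

The match is exact, so a tag with different capitalisation or stray punctuation is reported; whether that is too strict depends on how the model behaves.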
Finally, I use the Unix paste command to glue the LLM output to the original list, and then pass it through sort.
paste $TMPDIR/llm-output.clean $TMPDIR/items.txt | sort > $TMPDIR/sorted-items.txt
This gets us output lines like:
Yogurt 1 Sainsbury's Natural Yogurt 500g £1.10
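A plain sort compares the whole pasted line, so items sharing a tag are ordered by their item text. As a sketch of an alternative, the sort can be limited to the tag column and made stable, so items with the same tag keep the order they had in the original list. The function name sort_by_tag and the fourth sample item are hypothetical, and the -s flag assumes a sort that supports it (GNU sort does).

```shell
#!/bin/bash
# sort_by_tag TAGS ITEMS
# Glue tag and item together with a tab, then sort on the tag field only.
# -s keeps items with equal tags in their original relative order.
sort_by_tag() {
    paste "$1" "$2" | sort -t "$(printf '\t')" -k1,1 -s
}

# Hypothetical sample data, one tag per item line.
tmp=$(mktemp -d)
printf '%s\n' Potatoes Bread Yogurt Bread > "$tmp/tags"
printf '%s\n' \
    "Sainsbury's Maris Piper Potatoes 2kg" \
    "Sainsbury's Thick Sliced Wholemeal Bread 800g" \
    "St Helen's Farm Goat's Milk Natural Yogurt 450g" \
    "Hypothetical Seeded Bread 400g" > "$tmp/items"
sort_by_tag "$tmp/tags" "$tmp/items"
rm -r "$tmp"
```

Here both Bread items come first, in list order, followed by Potatoes and Yogurt. Which tie-breaking is nicer on paper is a matter of taste.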
For most things, the LLM's chosen tags are reasonable, but for some things it isn't sure and gives different answers each time. For example, for Sainsbury's Choco Rice Pops I've seen outputs of both Choco and Pops.
From a user's point of view, that's pretty bad, because you can't know from one order to the next where to find it. Adding an explicit rule to the prompt works for those, e.g. For 'Choco rice pops' the word "Choco" is the best answer.
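One way to find which items need such a rule is to run the LLM twice on the same list and compare the answers. This is only a sketch of that idea: unstable_tags is a hypothetical helper, the sample data is made up, and it assumes two cleaned outputs that both passed the line count check.

```shell
#!/bin/bash
# unstable_tags RUN1 RUN2 ITEMS
# Print every item whose tag differs between two cleaned LLM outputs,
# showing both tags. The "" concatenation forces a string comparison.
unstable_tags() {
    paste "$1" "$2" "$3" |
        awk -F'\t' '($1 "") != ($2 "") { printf "%s / %s\t%s\n", $1, $2, $3 }'
}

# Hypothetical sample: the second item is tagged differently in each run.
tmp=$(mktemp -d)
printf '%s\n' Yogurt Choco > "$tmp/run1"
printf '%s\n' Yogurt Pops > "$tmp/run2"
printf '%s\n' "Sainsbury's Natural Yogurt 500g" "Sainsbury's Choco Rice Pops" > "$tmp/items"
unstable_tags "$tmp/run1" "$tmp/run2" "$tmp/items"
rm -r "$tmp"
```

Anything this prints is a candidate for an explicit rule in the prompt. Two runs agreeing doesn't prove a tag is stable, of course, it just catches the obvious cases.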
#!/bin/bash
# Exit on error
set -e
LLAMAPATH=/discs/more/git/llama.cpp/build/bin/llama-cli
MODELPATH=/discs/fast/ai/Qwen3-Coder-30B-A3B-Instruct-IQ4_NL.gguf
MODELPARAMS="--jinja -ngl 99 --temp 0.7 --min-p 0.0 --top-p 0.80 --top-k 20 --presence-penalty 1.0 --cache-type-k q8_0"
HOSTPARAMS="-t 32 -dev none"
PROMPT="The following is a list of shopping items, one per line. For each item give a single word from the input which is the most important in classifying the type of product. Only ever give a single word output, and it must be one of the input words for that item. For 'Choco rice pops' the word \"Choco\" is the best answer. Exit afterwards."
PDF="$1"
OUTFILE="$1-cat.txt"
TMPDIR=$(mktemp --tmpdir -d shopping-sort-XXXXXXXXXXXXXXX)
trap "/bin/rm -fr $TMPDIR" EXIT
pdftotext -nopgbrk "$PDF" $TMPDIR/text
# The PDF contains a lot of headers and stuff, just get the items as text
# Obviously delicate as hell!
awk '/^Order summary/ { enable=0; } { if (enable) { print $0; }} /^Groceries / { enable=1; }' $TMPDIR/text | \
fmt -300 |
awk '{ac=ac" "$0} /^£/ { sub(" *", "", ac); print ac;ac=""}' > $TMPDIR/items.txt
# Now prepare the full LLM input, prompt+list
cat > $TMPDIR/llm-input << __HERE_MARKER
$PROMPT
$(cat $TMPDIR/items.txt)
__HERE_MARKER
# Do the LLM magic
$LLAMAPATH -m $MODELPATH $MODELPARAMS $HOSTPARAMS -f $TMPDIR/llm-input --no-display-prompt < /dev/null 2>/dev/null > $TMPDIR/llm-output
# The output contains a blank line and an EOF marker that starts with > - strip them
sed -e '/^$/d' -e'/^>/d' $TMPDIR/llm-output > $TMPDIR/llm-output.clean
# Sanity check the number of items and output lines
ITEMCOUNT=$(wc -l $TMPDIR/items.txt | cut -d" " -f1)
OUTLINES=$(wc -l $TMPDIR/llm-output.clean | cut -d" " -f1)
if [ $ITEMCOUNT -ne $OUTLINES ]
then
echo "Mismatch items: $ITEMCOUNT llm output: $OUTLINES" >&2
exit 1
fi
# Now sort based on the llm's idea of the category
paste $TMPDIR/llm-output.clean $TMPDIR/items.txt | sort > $TMPDIR/sorted-items.txt
# Add an extra line for printing
awk '{print $0,"\n"}' $TMPDIR/sorted-items.txt > "$OUTFILE"
mail: fromwebpage@treblig.org irc: penguin42 on libera.chat | matrix: penguin42 on matrix.org | mastodon: penguin42 on mastodon.org.uk