Another day, another opinion
My earlier test results were still worrying me.
Had I made some terrible mistake?
Having done a bit more research, taken a closer look at my test results and conducted further testing, two important points became apparent.
1). My first problem was due to the ffmpeg command I used to create an audio file with copies of the chiffchaff call. This command was generated by AI and certainly does what I asked it to do.
Except that it created small differences in the call each time it replicated it. I couldn’t hear these variations, but when I took a closer look at the results of repeated test runs, I found that there was a pattern to the sequence.
The reason I missed this initially was that not all calls had a corresponding result (i.e. not all calls were detected).
For example, one record set may have been:-
86, 45, 67, 71, 91, 85, 92, 65, 90, 72
...and the next:-
45, 71, 91, 92, 65, 90, 72
...and the next:-
86, 67, 71, 34, 85, 92, 65, 90, 83
It started to look like each individual call segment in my file was unique, and when detected, BirdNet rendered the same, repeatable result for each call segment.
I then created a simple Python program to take a single instance of my chiffchaff call and stream it 30 times.
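For anyone who wants to try something similar, here’s a minimal sketch of the idea (not my exact script): it pushes the same wav file to a TCP listener 30 times using ffmpeg. The filename, address and ffmpeg options are assumptions you’d need to adapt to your own setup.

```python
#!/usr/bin/env python3
# Minimal sketch (assumptions, not the author's exact script): stream one
# chiffchaff call 30 times by pushing the same wav file over TCP with ffmpeg.
import subprocess

WAV_FILE = "chiffchaff_single_call.wav"   # hypothetical filename
TARGET = "tcp://192.168.1.50:9999"        # hypothetical listener address

for i in range(30):
    print(f"Streaming copy {i + 1} of 30")
    subprocess.run(
        [
            "ffmpeg",
            "-hide_banner", "-loglevel", "error",
            "-re",                    # read the input at its native (real-time) rate
            "-i", WAV_FILE,
            "-acodec", "pcm_s16le",   # keep the audio as plain 16-bit PCM
            "-ar", "48000",           # sample rate assumed to match the listener
            "-ac", "1",
            "-f", "s16le",
            TARGET,
        ],
        check=True,
    )
```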
The results on BirdNet-Pi were amazing: 30 detections, each with a confidence of exactly 93.43%. I was so amazed, I thought something was broken or stuck!
On BirdNet-Go this approach did not work. AI told me this was down to the way BirdNet-Go handles TCP streams: because my program kept starting and stopping the stream, BirdNet-Go just could not cope, whereas BirdNet-Pi was quite happy thanks to its different way of handling incoming streams.
I did spend some time trying other approaches (suggested by AI) in order to find a test method which would work on both BirdNet systems. Using ffmpeg to ‘loop’ the call while maintaining the TCP connection always resulted in slight differences between ‘copies’ of my chiffchaff call.
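For reference, the ‘loop while keeping the TCP connection open’ idea looked roughly like the sketch below; the filename, address and options are my illustrative assumptions rather than the exact command I ran.

```python
# Hypothetical sketch of the single-connection looping approach:
# ffmpeg replays the same input 30 times over one TCP connection.
import subprocess

subprocess.run(
    [
        "ffmpeg",
        "-hide_banner", "-loglevel", "error",
        "-stream_loop", "29",                 # replay the input 29 extra times (30 plays total)
        "-re",
        "-i", "chiffchaff_single_call.wav",   # hypothetical filename
        "-acodec", "pcm_s16le",
        "-ar", "48000",
        "-ac", "1",
        "-f", "s16le",
        "tcp://192.168.1.50:9999",            # hypothetical listener address
    ],
    check=True,
)
```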
Incidentally, working with mp3 audio files probably doesn’t help maintain audio integrity. So for now, I’ve stopped pursuing that approach.
Back with BirdNet-Pi, I tested with a few wav files taken from my primary BirdNet-Pi system, where confirmed species calls were infrequent, i.e. only one detection >80% in a day. These produced pretty wild results on the test system set up with a threshold of 5%: lots of different readings from 5% upwards for the target species, plus many other species detected at varying percentages.
2). The second problem I had was with my original evaluation of the test results, which was basically too simplistic.
Having read “Guidelines for appropriate use of BirdNET scores and other detector outputs” by Connor Wood & Stefan Kahl, I now realise I should be evaluating “precision” and “recall” rather than some simple concept of repeatability. Enter the world of the F-score...
*(Image by Walber, Wikipedia)*
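As a quick reminder of how those measures fit together, here’s a tiny worked example; the counts are invented purely to show the arithmetic and are not real BirdNet results.

```python
# A minimal worked example of precision, recall and the F1-score.
true_positives = 24    # target calls correctly detected
false_positives = 6    # detections that were not actually the target species
false_negatives = 6    # target calls that were missed

precision = true_positives / (true_positives + false_positives)   # 0.80
recall = true_positives / (true_positives + false_negatives)      # 0.80
f1 = 2 * precision * recall / (precision + recall)                # 0.80

print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```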
However, the problem now is finding an audio file that will adequately test BirdNet in a way that will allow me to evaluate system adjustments. Like all evaluation tests (including GCSE subjects), it’s no good if the results are ‘10 out of 10’. Initial test results need to be moderate, so that you can see any improvement or degradation as a result of altering settings.
My chiffchaff audio clip is too good (too easy), while the files with high variability may contain unconfirmed species.
So another pause, as I try to work out what to do next!