One final test method
Is BirdNet really so bad?
or is there an 'age-related' issue, i.e. 'me'?
The older I get, the more I question my judgement. So I keep thinking I must be making mistakes or missing something.
I decided to conduct one last series of tests to work out whether BirdNet repeatability was poor, or whether I'd been missing something.
This time, I've taken a winter call of a chiffchaff from my main BirdNet-Pi system. While the spring song of the chiffchaff is a two-note phrase which allegedly sounds like "chiff...chaff", the winter call is just a single tweet.
I trimmed this audio clip in Audacity to a short, single tweet of about 3 seconds' duration, and replicated it 30 times, with a 30-second silent lead-in and a 30-second gap between calls, using an ffmpeg command something like this:-
ffmpeg -i /home/steve/BirdNet-Chiffchaff-17-10-2025.mp3 -f lavfi -t 30 -i anullsrc=r=48000:cl=mono -filter_complex "[0:a]apad=pad_dur=30[a1];[a1]aloop=loop=29:size=2e+09[looped];[1:a]asplit=2[lead][tail];[lead][looped][tail]concat=n=3:v=0:a=1[a]" -map "[a]" /home/steve/chiffchaff-1call-30s.mp3
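To unpack that filter chain: apad appends 30 seconds of silence after the 3-second call, aloop=29 repeats that call-plus-gap segment so the call plays 30 times in total, asplit duplicates the 30-second silent anullsrc stream so it can serve as both lead-in and tail, and concat joins the three pieces into one file.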
To summarise, this test recording contained 30 sections, each comprising a single call plus some 'random' background noise. The call within the noise was clear enough for my poor ears to recognise and, importantly, it was typical of the calls captured by my main BirdNet-Pi system. Also, because of the way the file was built, each of the 30 call segments was identical to all the others.
I also created a second edit using the Audacity menu Effect > Amplify..., which offered to increase gain by over 30dB without clipping. I selected +24dB and the trace looked like this:
...and the Audacity spectrogram for this call looks like this:
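Incidentally, the same +24dB gain can be applied outside Audacity with ffmpeg's volume filter (a sketch; the filenames are placeholders):

ffmpeg -i chiffchaff-call.mp3 -filter:a "volume=24dB" chiffchaff-call-24dB.mp3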
I initially set up both test systems as before, but eventually lowered the threshold settings to 5% (0.05).
I streamed this test audio file to both units under test, BirdNet-Pi & BirdNet-Go.
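I won't detail the streaming setup here, but as a minimal sketch, assuming an RTSP server such as MediaMTX is listening on localhost:8554 (the URL, path and filename below are placeholders) and both systems are configured to read from that stream, a test file can be pushed to it with:

ffmpeg -re -i /home/steve/chiffchaff-1call-30s.mp3 -c:a aac -f rtsp rtsp://localhost:8554/birdtest

The -re flag makes ffmpeg feed the file at its native rate, so the stream behaves like a live microphone.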
After testing with the chiffchaff, I repeated a few tests using a European Robin call.
Conclusions
- Neither system reliably reported all 30 calls in a test run; generally, I got 23-30 calls
- results for the initial test file were similar to those for the +24dB file (this was to be expected, as increasing the amplitude does not improve the signal-to-noise ratio)
- increasing overlap from 1.0 to 2.0 did not appear to improve results
- some other (false positive) species were reported with higher confidence than some of the low target-species confidence levels.
Typical average confidence for BirdNet-Pi (BNP) was approximately 40%, with a population standard deviation of 18 percentage points.
Typical average confidence for BirdNet-Go (BNG) was approximately 54%, with a population standard deviation of 15 percentage points.
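For anyone wanting to check figures like these, the mean and population standard deviation can be computed from a plain list of confidence values (one per line, in a hypothetical confidences.txt) with an awk one-liner:

awk '{s+=$1; ss+=$1*$1; n++} END {m=s/n; printf "mean=%.1f, population sd=%.1f\n", m, sqrt(ss/n - m*m)}' confidences.txt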
So these results for BNG were better, but still dreadful. The wild, random nature of these results may mean that repeated tests could put either system ahead of the other.
Because I was still looking for a reason for these poor results, I put my phone inside a box along with a Bluetooth speaker and played these test recordings while using whoBIRD. I still got similar results, so I don't think I can blame my audio stream.
Even if there are defects in my audio recordings, I would expect these systems to give a more consistent result (consistently low, medium or high confidence levels).
I can't help but question why I've been using this system for the last 2 years.
I guess if you get enough recorded calls from a particular species (...and assuming it's not a dog or a fox) the system probably does indicate that it's genuinely within range.
Related posts:-
BirdNet Testing #2: a critical review from AI