I'm still troubled by my initial test results
And I haven't made any progress on further testing,
so I decided to get a second opinion.
As I mentioned in my first post detailing tests & results for BirdNet-Pi & BirdNet-Go, I was not impressed by the (seemingly) poor repeatability of both systems. Although I could (and probably should) repeat and continue testing, I don't expect to resolve the fundamental repeatability issues. This leads me to question why I am using this system, and what I expect to get from it.
Previously I was interested more in annual trends and the arrival/departure of migrants than in getting some kind of new high-score for the number of species detected in a day, a month, or a year.
From experience so far, I'm sure that species such as blue tit, robin, chiffchaff & so on do visit our garden, not only because BirdNet registers their calls, but also because we have seen them.
However, although we never see a tawny owl, some of the tawny owl detections are clearly correct, while others are more likely associated with a local football team practicing on a sports field, several hundred yards away.
Then there are the waterbirds: curlew, dunlin, gadwall, greylag goose & so on. While we don't see these in our garden, we live very near the RSPB Pagham Reserve, so they may very well fly over from time to time.
But basically, I'm never sure what to accept and what to reject from our BirdNet systems.
Anyway, to get a second opinion on whether my tests & assumptions were basically OK, I asked ChatGPT to review my post and provide feedback.
I include it here as it may be of interest.
Output from ChatGPT
I’ve read through your “BirdNet Systems Testing #1” report (on Captain Bodgit) and I have some thoughts, observations, and suggestions. First: nice work setting up a parallel test and documenting things carefully. There’s a lot of subtlety in comparing such systems, and you’re asking exactly the right kinds of questions (repeatability, consistency, biases). Below are my reflections and ideas — feel free to take or reject them as suits your setup.
What your results suggest (and caveats)
Here’s a quick summary of what I see as the main take-aways (with caveats) from your results, and what they might imply about BirdNet-Pi (Nachtzuster variant) vs BirdNet-Go (tphakala).
Key observations from your test
Different species lists / model versions
You note that the Go setup uses a newer species list (143 species) vs the Pi setups (111 species). That means the detection "universe" is different, so direct comparisons must keep that in mind.
Total detections and species counted differ
In your real-time test, both Pi systems detected more events (36 / 37) than BirdNet-Go, and detected slightly more species (10 / 11 vs 8). But the overlap in species is imperfect (some species seen by one, not seen by the others).
Repeatability is weak
In your replay tests (streaming a recorded file 5 times), the same system (BirdNet-Go) gave varying numbers of detections (11–13) and the presence/absence of detections per species varied (e.g. wood-pigeon was detected in only 1 of 5 runs). You remark "repeatability is not great."
Effect of overlap setting is counterintuitive
When you increased overlap from 1 to 2, the number of detections and species dropped rather than increased, which runs counter to expectations (more overlap typically gives more coverage / sensitivity in sliding window processing).
These results point to a few likely causes (or contributing factors), plus some things to examine more closely. The summary in your blog — “inconsistencies, further testing required, repeatability poor” — is accurate.
Here are my thoughts on what might be at play, and how you might refine your methodology.
Possible explanations / hypotheses & suggestions
Windowing, overlap, and fragmentation effects
The way BirdNet (and its various forks) segment the audio into windows and the overlap between windows can strongly influence detection outcomes. Slight misalignment of windows relative to call boundaries can cause a given call to either slip below threshold or be split across windows.
Your observation that increasing overlap reduced detections is odd but plausible if overlap creates more splitting or repeated coverage that changes scoring, or if overlapping windows lead to redundant scoring that the algorithm later merges or filters.
Suggestion: systematically sweep overlap = 0, 1, 2, maybe even 3, and see how detection counts trend. Also, examine the alignment between the signal (bird calls) and window boundaries. Perhaps log exactly which timestamp windows produce which detections, and compare across runs.
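Below is a minimal Python sketch of such a sweep. It assumes a command-line analyzer (shown here as a placeholder `analyzer` command) that accepts the recording, an overlap value and an output path, and that it writes a CSV with a `species` column; adapt both the invocation and the parsing to whatever your BirdNet-Pi / BirdNet-Go setup actually provides.

```python
# Sketch of an overlap sweep: run the same recording through an analyzer at
# several overlap settings and tally detections per run. The command below is
# a placeholder -- substitute the real invocation for your BirdNet-Pi or
# BirdNet-Go installation, and adjust the CSV parsing to match its output.
import csv
import subprocess

RECORDING = "test_clip.wav"          # the replayed test recording
OVERLAPS = [0, 1, 2, 3]              # window overlap settings to sweep

for overlap in OVERLAPS:
    out_csv = f"detections_overlap{overlap}.csv"
    # Hypothetical analyzer call -- replace with your system's real command.
    subprocess.run(
        ["analyzer", "--input", RECORDING, "--overlap", str(overlap),
         "--output", out_csv],
        check=True,
    )
    with open(out_csv, newline="") as f:
        rows = list(csv.DictReader(f))
    species = {row["species"] for row in rows}
    print(f"overlap={overlap}: {len(rows)} detections, {len(species)} species")
```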
Threshold / confidence calibration / hysteresis effects
You used threshold = 0.7 on test systems (vs 0.8 on your primary). Slight changes around thresholds can cause many calls just below threshold to flip in/out between runs.
If the scoring algorithm is non-deterministic (or has tie-breakers or soft thresholds), you may see small fluctuations.
Suggestion: run with a lower threshold (say 0.5) and then post-filter, to reduce sensitivity to just-on-the-cusp calls. Alternatively, record all candidate detections (with confidences) and then apply identical post-filtering.
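As a sketch of that identical-post-filter idea: dump every candidate detection (with its confidence) from each system to a CSV, then apply the same threshold sweep to both logs. The file names and the `species` / `confidence` column names are assumptions to adapt to your own exports.

```python
# Sketch: capture every candidate detection at a low threshold (e.g. 0.5),
# then apply the same post-filter to each system's log so that threshold
# handling is identical between systems.
import csv
from collections import Counter

def filter_detections(path, min_conf):
    """Keep only rows whose confidence meets the post-filter threshold."""
    with open(path, newline="") as f:
        return [row for row in csv.DictReader(f)
                if float(row["confidence"]) >= min_conf]

for threshold in (0.5, 0.6, 0.7, 0.8, 0.9):
    for system, log in [("BirdNet-Pi", "pi_candidates.csv"),
                        ("BirdNet-Go", "go_candidates.csv")]:
        kept = filter_detections(log, threshold)
        per_species = Counter(row["species"] for row in kept)
        print(f"{system} @ {threshold:.1f}: {len(kept)} detections, "
              f"{len(per_species)} species")
```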
Internal “debouncing” or bias adjustments in BirdNet-Go
In the blog’s comments, someone said: “birdnet-go has a feature that decreases confidence threshold for repeated species which amplifies the detection of common birds, while having a false positive feature that discards single vocalization birds … especially visible during the night.” (captainbodgit.blogspot.com)
If such heuristics are built into BirdNet-Go, it means that two runs on the same audio might not give identical outputs, because internal state or species-history memory might play a role.
You could try disabling or altering such heuristics (if configurable) during tests to force the system to treat each window independently, for cleaner comparisons.
Audio stream quality / buffering / jitter / timestamps
You suspect the RTSP audio stream “quality” might influence which system hears what. If there’s any packet loss, jitter, or buffering artifacts, small timing differences might alter which windows get clean data vs slightly degraded data.
Suggestion: capture identical raw WAV or PCM audio and feed it locally into BirdNet-Pi and BirdNet-Go (i.e. bypass RTSP) to eliminate streaming effects. That gives a ‘clean’ common input.
Alternatively, compare time-synchronised logs to see if one system started/shifted windows slightly ahead/behind the other.
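A rough sketch of that time-sync check follows, assuming you can export each system's detections as (species, timestamp) pairs; the log contents shown are illustrative only, not real data.

```python
# Sketch: rough time-alignment check between two systems' logs. For each
# species detected by both, compare the first detection timestamps; a
# consistent offset suggests one system's windows start ahead of the other's.
# Timestamps are POSIX seconds here; adapt the parsing to your log format.
def first_detections(log):
    """log: iterable of (species, timestamp) pairs -> earliest time per species."""
    first = {}
    for species, ts in log:
        if species not in first or ts < first[species]:
            first[species] = ts
    return first

pi_log = [("European Robin", 1700000012.0), ("Blue Tit", 1700000040.0)]
go_log = [("European Robin", 1700000015.5), ("Blue Tit", 1700000043.0)]

pi_first, go_first = first_detections(pi_log), first_detections(go_log)
for species in pi_first.keys() & go_first.keys():
    offset = go_first[species] - pi_first[species]
    print(f"{species}: BirdNet-Go offset vs BirdNet-Pi is {offset:+.1f} s")
```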
Species detection biases / model differences
The difference in species list, and possibly in training or tuning, will lead to biases: some species easier to detect, some more marginal. Especially for species whose calls are weak or marginal, small fluctuations can flip detection in/out between systems.
To probe this, you might build a “test set” of curated recordings (with known species calls) and use those as a benchmark, so you know ground truth.
Then evaluate sensitivity, false positives, misses per system on the same known calls.
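A small sketch of that scoring step, representing both the ground truth and a system's output as sets of (clip, species) pairs; the clip IDs and species names are made up purely for illustration.

```python
# Sketch: score a system against a curated "known call set" using
# precision / recall / F1 over (clip, species) pairs.
def precision_recall_f1(truth: set, detected: set):
    tp = len(truth & detected)            # correctly detected calls
    fp = len(detected - truth)            # false positives
    fn = len(truth - detected)            # missed calls
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Illustrative benchmark and one system's detections (invented values).
truth = {("clip01", "Eurasian Wren"), ("clip01", "European Robin"),
         ("clip02", "Tawny Owl")}
birdnet_pi = {("clip01", "Eurasian Wren"), ("clip02", "Tawny Owl"),
              ("clip02", "Eurasian Curlew")}

p, r, f1 = precision_recall_f1(truth, birdnet_pi)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```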
Statistical / measurement noise
At the scale of tens of detections, random variation or threshold noise may lead to ±1 or ±2 detections difference. That’s especially true if many calls lie near threshold. Some degree of variation is expected unless detection is highly robust.
You might expand the duration of test periods to increase sample size and reduce relative noise (i.e. accumulate over hours or days).
Run multiple replicates (you already started doing this) and calculate mean ± standard deviation for each system, per species.
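Something like the following would do that summary, using Python's statistics module; the per-run counts here are invented purely to show the shape of the calculation.

```python
# Sketch: summarise replicate replay runs as mean ± standard deviation of
# detection counts per species. Each run is a simple {species: count} mapping.
from statistics import mean, stdev

runs = [
    {"Wood Pigeon": 1, "European Robin": 4, "Blue Tit": 3},
    {"European Robin": 5, "Blue Tit": 2},
    {"Wood Pigeon": 0, "European Robin": 4, "Blue Tit": 3},
]

species = sorted({s for run in runs for s in run})
for s in species:
    counts = [run.get(s, 0) for run in runs]   # species absent from a run = 0
    sd = stdev(counts) if len(counts) > 1 else 0.0
    print(f"{s}: mean {mean(counts):.1f} ± {sd:.1f} over {len(counts)} runs")
```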
Post-processing / deduplication / merging logic
The logic that merges multiple window detections into a single event, or discards duplicates, etc., may differ between systems or runs. That may explain why the two Pi systems record similar counts (36/37) but slightly different species.
Inspect the raw per-window detection logs (before deduplication) and see how many windows each system triggers before merging. Then compare how the merging filters differ.
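As a sketch of how much the merging step can matter, the snippet below collapses raw per-window hits into events using a simple time-gap rule. The 3-second gap and the sample detections are arbitrary assumptions, not what either system actually does; comparing event counts at different gap values shows how strongly merging shapes the final totals.

```python
# Sketch: merge consecutive per-window detections of the same species that
# fall within `gap` seconds into a single event.
def merge_windows(detections, gap=3.0):
    """detections: list of (species, window start time in seconds)."""
    events = []
    for species, start in sorted(detections, key=lambda d: (d[0], d[1])):
        if events and events[-1][0] == species and start - events[-1][1] <= gap:
            events[-1] = (species, start)      # extend the current event
        else:
            events.append((species, start))    # start a new event
    return events

raw = [("European Robin", 0.0), ("European Robin", 1.5),
       ("European Robin", 12.0), ("Blue Tit", 6.0)]
print(len(merge_windows(raw, gap=3.0)), "events after merging")   # prints 3
```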
Specific suggestions for further testing / validation
To push your comparison into more robust territory, here are some concrete next steps / test plans you could adopt (if you have the time and inclination):
| Test / change | Purpose / what it probes | Expected benefit |
|---|---|---|
| Bypass RTSP, feed raw identical audio file to both BirdNet-Pi and BirdNet-Go | Remove streaming artifacts / network jitter from the equation | Any remaining differences must come from internal algorithmic differences |
| Use a “known call set” (curated recordings) with ground truth species | Give a benchmark to assess sensitivity, false positives, species recall | You can compute metrics (precision / recall / F1) per system |
| Sweep thresholds (0.5 → 0.9 in small steps) and plot detection counts / species counts vs threshold | Understand how close calls are to threshold and whether small shifts cause many flips | Might discover a “stable” threshold zone where variation is minimal |
| Sweep overlap settings (0, 1, 2, 3) and compare performance | See how overlap affects detection stability and count | Perhaps one overlap setting is optimal or more stable than others |
| Run long-duration parallel tests (days/weeks) with both systems on identical feed | Accumulate larger data sets, see long-term bias trends | Differences are likely more robust, less dominated by random fluctuation |
| Log window-level detection outputs from both systems | Compare in detail, window by window, which windows fired which detections | Helps diagnose where divergence begins (window misalignment, merging differences, thresholding) |
| Disable or monitor any internal heuristics (e.g. species memory, “debounce” logic) | If BirdNet-Go indeed uses internal heuristics, disabling them ensures deterministic behavior | Cleaner apples-to-apples comparisons |
What one can reasonably conclude (for now)
Given your current results, here are some cautious inferences that seem defensible, along with caveats:
Repeatability is not perfect. Even feeding (ideally) identical audio, you saw variation in detections and species presence. That suggests some non-determinism or sensitivity to small changes (timing, thresholds).
BirdNet-Pi (your installation) appears at least as aggressive/sensitive as BirdNet-Go in these tests, if not more so, in terms of raw detection counts and number of species, under your configuration.
However, “more detections” doesn’t always mean “better” — some of those extra detections may be false positives (or marginal calls).
Because the systems differ in species lists and possibly internal heuristics, direct “which is better overall” is premature.
Overlap or windowing settings may have a non-intuitive effect on detection counts, so the “expected rule” (more overlap → more detection) might not always hold in practice, due to interactions in merging, filtering, or thresholding.