Jai here, from Autofix Bot team. We've published results of the initial benchmark run[1] comparing Gitleaks, detect-secrets and trufflehog ~3 weeks ago. In the meantime, we've put together a significantly improved dataset, and we're planning to rerun those benchmarks shortly; will include Kingfisher to the list, and share the results here.
Btw, we use Kingfisher's validation system internally for generating request/expected_response pairs for a given secret, as the last step of the pipeline. We don't run/call the validation queries ourselves, due to rate limit issues. But, we add this information in a structured format as part of the response which can be executed on the client side (or) by the user who is integrating via the API. Thanks for building it :)
Loved this “lead bullets” framing, especially the parts on taint checking, scanners, and pre-processing/sampling logs. One practical add-on to the "Sensitive data scanners" section is verification: can you tell which candidates are actually live creds?
We’ve been working on an open source tool, Kingfisher, that pairs fast detection (Hyperscan + Tree-Sitter) with live validation for a bunch of providers (cloud + common SaaS) so you can down-rank false positives and focus on the secrets that really matter. It plugs in at the chokepoints this post suggests: CI, repo/org sweeps, and sampled log archives (stdin/S3) after a Vector/rsyslog hop.
One of the main benefits of Semgrep is its unified DSL that works across all supported languages. In contrast, using the Go module "smacker/go-tree-sitter" can expose you to differences in s-expression outputs due to variations and changes in independent grammars.
I've seen grammars that are part of "smacker/go-tree-sitter" change their syntax between versions, which can lead to broken S-expressions. Semgrep solves that with their DSL, because it's also an abstraction away from those kind of grammar changes.
I'm a bit concerned that tree-sitter s-expressions can become "write-only" and rely on the reader to also understand the grammar for which they've been generated.
For example, here's a semgrep rule for detecting a Jinja2 environment with autoescaping disabled:
That's a really interesting breakdown of the DSL vs. S-expression approach. I can see your point about the potential fragility of relying directly on tree-sitter outputs, especially with grammar drift. It took me a while to wrap my head around the S-expression syntax when I first started using tree-sitter, so I appreciate the comparison to a more human-readable DSL like Semgrep's.
The other benefit of a DSL like Semgrep's is that LLMs have become very good at generating it. See https://github.com/lambdasec/autogrep on how to automatically generate Semgrep rules from existing CVEs.
> One of the main benefits of Semgrep is its unified DSL that works across all supported languages.
> People can disagree, but I'm not sure that tree-sitter S-expressions as an upgrade over a DSL.
100% agree — a DSL is a better user experience for sure. But this is a deliberate choice we made of not inventing a new DSL and using tree-sitter natively. We've directly addressed this and agree that the S-expressions are gnarly; but we're optimizing for a scenario that you wouldn't need to write this by hand anyway.
It's a trade-off. We don't want to spend time inventing a DSL and port every language's idiosyncrasies to that DSL — we'd rather improve our runtime and add support for things that other tools don't support, or support only on a paid tier (like cross-file analysis — which you can do on Globstar today).
I'd love to hear how this project differs from Bearer, which is also written in Go and based on tree-sitter?
https://github.com/Bearer/bearer
Regardless, considering there is a large existing open-source collection of Semgrep rules, is there a way they can be adapted or transpiled to tree-sitter S-expressions so that they may be reused with Globstar?
> I'd love to hear how this project differs from Bearer, which is also written in Go and based on tree-sitter? https://github.com/Bearer/bearer
The primary difference is that we're optimizing for users to write their custom rules easily. We do plan to ship built-in checkers [1] so we cover at least OWASP Top 10 across all major programming languages. We're also truly open-source using the MIT license.
> Regardless, considering there is a large existing open-source collection of Semgrep rules, is there a way they can be adapted or transpiled to tree-sitter S-expressions so that they may be reused with Globstar?
I'm pretty sure there should be a way to make that work. We believe writing checkers (and having a long list of built-in checkers) will be a commodity in a world where AI can generate S-expressions (or tree-sitter node queries in Go) for any language with very high accuracy (which is where we have an advantage as compared to tools that use a custom DSL). To that extent, we're focused on improving the runtime itself so we can support complex use cases from our YAML and Go interfaces. If the community can help us port rules from other sources to our built-in checkers, we'd love that!
Any chance you could try and share results? Full disclosure, I built Kingfisher