nvdnadj92's comments | Hacker News

Amazing! I was planning on making a tool just like this, will take yours for a spin!


Thanks, let me know what you think!


Building a set of experiments that explore LLMs' visual understanding of your photos to learn about you, especially given the recent learnings from DeepSeek-OCR. Part of the experiments delves into storing the memories with GraphRAG so they can be effectively retrieved without losing too much information.


Laptop: Apple M2 Max, 32GB memory (2023)

Setup:

Terminal:

- Ghostty + Starship for modern terminal experience

- Homebrew to install system packages

IDE:

- Zed (can connect to local models via LM-Studio server)

- also experimenting with warp.dev

LLMs:

- LM Studio as an open-source model playground

- GPT-OSS 20B

- Qwen3-Coder-30B-A3B (4-bit quantized)

- Gemma3-12B

Other utilities:

- Rectangle.app (window tile manager)

- Wispr.flow - create voice notes

- Obsidian - track markdown notes


After reading this post and the readme, I'm not convinced that this is solving a real, observed problem. You outline an example with the long-term coaching mentorship, but why or how is your solution preferable to telling Claude to maintain a set of notes and observations about you, similar to https://github.com/heyitsnoah/claudesidian?

The jazz metaphors don't help provide additional context.


Fair feedback. Claudesidian is a productivity system where you organize knowledge and Claude assists. StoryKeeper is relational infrastructure that maintains emotional continuity across AI sessions and agent handoffs. Different layers of the stack, both valuable. I'll update the docs to make this clearer — appreciate the push for concreteness.


In case you need conversational data for the experiment you want to try, I developed an open-source CLI tool [1] that creates transcripts from voice chats on Discord. Feel free to try it out!

[1] https://github.com/naveedn/audio-transcriber


Agree with the first half of the article, but every example the author pointed out predates AI. What are examples of companies founded in the past three years that prove the author's point that the data model is the definitive edge?


What does AI have to do with anything here?


Just had a chat with AI to see how we could address the issues mentioned in the article. You can create models that cater to multiple use cases: split the domain model into facts (tables) and perspectives (views). This gives you a lot of flexibility in addressing the different perspectives presented in the article from a shared domain model.


I vibecoded a similar app. Here’s the open source link, if folks want to build their own:

https://github.com/naveedn/audio-transcriber


Slower


Yes, but by a negligible margin. My program is designed for multi-track audio, which means I run this in parallel on multiple 3-hour recordings and get results in 12 minutes.
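
A minimal sketch of the idea (not the actual audio-transcriber code; transcribe_track is a hypothetical stand-in for the per-file VAD + Whisper pipeline):

    # Fan the per-speaker tracks out across worker processes.
    from concurrent.futures import ProcessPoolExecutor
    from pathlib import Path

    def transcribe_track(path: Path) -> str:
        # Placeholder: run VAD + Whisper on one track, return its transcript.
        ...

    def transcribe_session(track_dir: Path) -> list[str]:
        tracks = sorted(track_dir.glob("*.wav"))  # one file per Discord speaker
        with ProcessPoolExecutor(max_workers=len(tracks)) as pool:
            return list(pool.map(transcribe_track, tracks))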

You haven’t shared any architectural details. What model? What size? How can anyone be sure that what you’re building is truly offline?


Yours isn't OSS, meaning I have no idea what I'm running


OSS would be incredibly slow, also seems like overkill for this use case


I was going to buy the app, but these responses are putting me off massively. How would making it OSS slow it down?


I suspect, from the responses of the creator here, that this app they are selling is likely violating a number of open source licenses…


The obnoxious site design and comments like this stopped me from clicking buy in the Apple App Store.


What does that even mean? Why would OSS make it slower? Why would it be an overkill? This is not Producthunt, you have to give at least some kind of explanation for your claims.


OSS as in open source software. Not Open Sound System. Just in case.


Can you back up your claim that it's slow?


I wanted to build my own speech-to-text transcription program [1] for Discord, similar to how Zoom or Google Hangouts works. I built it so that I can record my group's DND sessions and build applications/tools for VTTs (virtual tabletop gaming).

It can process a set of 3-hour audio files in ~20 mins.

I recorded a demo video of how it works here: https://www.youtube.com/watch?v=v0KZGyJARts&t=300s

[1] https://github.com/naveedn/audio-transcriber

I alluded to building this tool on a previous HN thread: https://news.ycombinator.com/item?id=45338694


I have found a hack. If you wait long enough, someone will build what you wanted to build :)

Thanks for building this. I am trying to set it up but facing this issue:

> `torch` (v2.3.1) only has wheels for the following platforms: `manylinux1_x86_64`, `manylinux2014_aarch64`, `macosx_11_0_arm64`, `win_amd64`


Ah lovely! I’d be happy to assist, create an issue on GitHub and we can go from there!


I would suggest two speaker-diarization libraries:

- https://huggingface.co/pyannote/speaker-diarization-3.1

- https://github.com/narcotic-sh/senko

I personally love Senko since it runs in seconds, whereas pyannote took hours, but there is a ~10% WER (word error rate) that is tough to get around.
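
For reference, the pyannote pipeline is only a few lines (a minimal sketch assuming a Hugging Face access token with access to the gated model; the token and file name are placeholders):

    # Diarize one recording with pyannote/speaker-diarization-3.1.
    from pyannote.audio import Pipeline

    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token="hf_...",  # placeholder token
    )
    diarization = pipeline("session.wav")

    for turn, _, speaker in diarization.itertracks(yield_label=True):
        print(f"{speaker}: {turn.start:.1f}s - {turn.end:.1f}s")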


I'm working on the same project myself and was planning to write a blog post similar to the author's. However, I'll share some additional tips and tricks that really made a difference for me.

For preprocessing, I found it best to convert files to 16 kHz WAV. I also add low-pass and high-pass filters to remove non-speech sounds. To avoid hallucinations, I run Silero VAD on the entire audio file to find the timestamps where someone is speaking. A side note on this: Silero requires careful tuning to prevent audio segments from being chopped up and clipped. I also use a post-processing step to merge adjacent VAD chunks, which helps produce cohesive chunks for Whisper.
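
Roughly, that chain looks like this (a simplified sketch; the filter cutoffs, file names, and merge gap are illustrative, not my exact settings):

    # 1) Resample to 16 kHz mono WAV with band-pass filtering via ffmpeg.
    import subprocess
    import torch

    subprocess.run([
        "ffmpeg", "-y", "-i", "input.flac",
        "-af", "highpass=f=80,lowpass=f=8000",
        "-ar", "16000", "-ac", "1", "speech.wav",
    ], check=True)

    # 2) Find speech regions with Silero VAD.
    model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
    get_speech_timestamps, _, read_audio, _, _ = utils
    audio = read_audio("speech.wav", sampling_rate=16000)
    segments = get_speech_timestamps(audio, model, sampling_rate=16000)

    # 3) Merge segments separated by short gaps so chunks stay cohesive.
    MAX_GAP = 16000  # ~1 second at 16 kHz; tune to taste
    merged = []
    for seg in segments:
        if merged and seg["start"] - merged[-1]["end"] < MAX_GAP:
            merged[-1]["end"] = seg["end"]
        else:
            merged.append(dict(seg))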

For the Whisper task, I run Whisper on small audio chunks that correspond to the VAD timestamps. Otherwise, it will hallucinate during silences and regurgitate the passed-in prompt. If you're on a Mac, use the whisper-mlx models from Hugging Face to speed up transcription. I ran a performance benchmark, and using a model designed for the Apple Neural Engine made a 22x difference.
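
The per-chunk pass with the mlx-whisper package is short (a minimal sketch; the model repo and paths are placeholders, and it assumes each VAD segment was already exported as its own WAV file):

    # Transcribe each VAD chunk with an MLX-converted Whisper model.
    from pathlib import Path

    import mlx_whisper

    transcript = []
    for chunk in sorted(Path("chunks").glob("*.wav")):  # one file per VAD segment
        out = mlx_whisper.transcribe(
            str(chunk),
            path_or_hf_repo="mlx-community/whisper-large-v3-mlx",
        )
        transcript.append(out["text"].strip())

    print("\n".join(transcript))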

For post-processing, I've found that running the generated SRT files through ChatGPT to identify and remove hallucinated chunks yields better results.
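
That pass is just an API call over the SRT text (illustrative sketch only; the prompt wording is not my exact prompt, and the model name is a placeholder):

    # Ask a model to strip hallucinated cue blocks from an SRT file.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    srt_text = open("session.srt", encoding="utf-8").read()

    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "This SRT transcript may contain Whisper hallucinations: "
                "repeated phrases, captions during silence, or text copied "
                "from the prompt. Return the SRT with those cue blocks "
                "removed and renumbered, otherwise unchanged.\n\n" + srt_text
            ),
        }],
    )
    print(resp.choices[0].message.content)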


I added EQ to a task after reading this and got much more accurate and consistent results using Whisper. Thanks for the obvious-in-retrospect tip.


Could you please share the prompt you use in ChatGPT to remove hallucination chunks?


If I understood correctly, VAD gives superior results to using ffmpeg silencedetect + silenceremove, right?

I think the latest version of ffmpeg can use Whisper with VAD [1], but I still need to explore how with a simple PoC script.

I'd love to know more about the post-processing prompt. My guess is that it looks like an improved version of the `semantic correction` prompt [2], but I may be wrong ¯\_(ツ)_/¯.

[1] https://ffmpeg.org/ffmpeg-filters.html#toc-whisper-1

[2] https://gist.github.com/eevmanu/0de2d449144e9cd40a563170b459...

