Building a set of experiments that explores LLMs' visual understanding of your photos to learn about you, especially given the recent learnings from DeepSeek-OCR. Part of the experiments delves into storing the memories with GraphRAG so they can be retrieved effectively without losing too much information.
After reading this post and the readme, I'm not convinced that this is solving a real, observed problem. You outline a long-term coaching/mentorship example, but why or how is your solution preferable to telling Claude to maintain a set of notes and observations about you, similar to https://github.com/heyitsnoah/claudesidian?
The jazz metaphors don't add useful context.
Fair feedback. Claudesidian is a productivity system where you organize knowledge and Claude assists. StoryKeeper is relational infrastructure that maintains emotional continuity across AI sessions and agent handoffs. Different layers of the stack, both valuable. I'll update the docs to make this clearer — appreciate the push for concreteness.
In case you need conversational data for the experiment you want to try, I developed an open-source CLI tool [1] that creates transcripts from voice chats on Discord. Feel free to try it out!
Agree with the first half of the article, but every example the author points out predates AI. What are examples of companies founded in the past three years that prove the author's point that the data model is the definitive edge?
Just had a chat with AI to see how we could address the issues mentioned in the article. You can create models that cater to multiple use cases by splitting the domain model into facts (tables) and perspectives (views). This gives you a lot of flexibility in addressing the different perspectives presented in the article from a shared domain model.
Yes, but by a negligible margin. My program is designed for multi-track audio, which means I run this in parallel on multiple 3-hour recordings and get results in 12 minutes.
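Roughly, the per-track parallelism looks like this (a minimal sketch; the function and file names are illustrative, not my actual code):

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def transcribe_track(path: Path) -> str:
    # Placeholder for the real per-track pipeline (VAD + Whisper + SRT output).
    return f"transcript for {path.name}"

def transcribe_session(track_dir: Path) -> list[str]:
    # One Discord speaker track per file; each track gets its own process,
    # so several 3-hour files finish in roughly the time of one.
    tracks = sorted(track_dir.glob("*.wav"))
    with ProcessPoolExecutor(max_workers=len(tracks) or 1) as pool:
        return list(pool.map(transcribe_track, tracks))
```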
You haven’t shared any architectural details. What model? What size? How can anyone be sure that what you’re building is truly offline?
What does that even mean? Why would OSS make it slower? Why would it be overkill?
This is not Product Hunt; you have to give at least some kind of explanation for your claims.
I wanted to build my own speech-to-text transcription program [1] for Discord, similar to how Zoom or Google Hangouts works. I built it so that I can record my group's D&D sessions and build applications and tools for virtual tabletops (VTTs).
It can process a set of 3-hour audio files in ~20 mins.
I personally love Senko since it runs in seconds, whereas pyannote took hours, but there is a ~10% WER (word error rate) that is tough to get around.
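For reference, the pyannote step I'm comparing against looks roughly like this (the pipeline name and token handling are assumptions, not exact code from my setup):

```python
from pyannote.audio import Pipeline

# Gated model: needs a Hugging Face token for a repo whose terms you've accepted.
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="HF_TOKEN",
)

diarization = pipeline("session.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.1f}s - {turn.end:.1f}s")
```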
I'm working on the same project myself and was planning to write a blog post similar to the author's. However, I'll share some additional tips and tricks that really made a difference for me.
For preprocessing, I found it best to convert files to 16 kHz WAV. I also add low-pass and high-pass filters to remove non-speech sounds. To avoid hallucinations, I run Silero VAD on the entire audio file to find timestamps where there's a speaker. A side note on this: Silero requires careful tuning to prevent audio segments from being chopped up and clipped. I also use a post-processing step to merge adjacent VAD chunks, which gives Whisper cohesive segments to work with.
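A rough sketch of that VAD step, assuming the `silero-vad` pip package; the gap and threshold values are things you'd tune for your own recordings:

```python
from silero_vad import load_silero_vad, read_audio, get_speech_timestamps

SAMPLE_RATE = 16_000  # resample everything to 16 kHz mono first

model = load_silero_vad()
wav = read_audio("session.wav", sampling_rate=SAMPLE_RATE)

# Timestamps (in seconds) where Silero thinks someone is speaking.
speech = get_speech_timestamps(
    wav,
    model,
    sampling_rate=SAMPLE_RATE,
    return_seconds=True,
)

def merge_chunks(chunks, max_gap=0.6):
    # Merge adjacent VAD chunks separated by short gaps so Whisper gets
    # cohesive segments instead of clipped fragments.
    merged = []
    for c in chunks:
        if merged and c["start"] - merged[-1]["end"] <= max_gap:
            merged[-1]["end"] = c["end"]
        else:
            merged.append(dict(c))
    return merged

segments = merge_chunks(speech)
```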
For the Whisper step, I run Whisper on small audio chunks that correspond to the VAD timestamps; otherwise it hallucinates during silences and regurgitates the passed-in prompt. If you're on a Mac, use the whisper-mlx models from Hugging Face to speed up transcription. In my benchmark, using a model built for the Apple Neural Engine made a 22x difference.
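The chunked transcription looks roughly like this (a sketch, not my exact code; the mlx-community repo name is just one of the converted models, swap in whichever size you use):

```python
import mlx_whisper
import soundfile as sf

SAMPLE_RATE = 16_000

def transcribe_segments(wav, segments, repo="mlx-community/whisper-small-mlx"):
    # wav: 1-D float audio at 16 kHz (call .numpy() first if it's a torch tensor);
    # segments: the merged VAD chunks, with "start"/"end" in seconds.
    results = []
    for i, seg in enumerate(segments):
        start = int(seg["start"] * SAMPLE_RATE)
        end = int(seg["end"] * SAMPLE_RATE)
        chunk_path = f"chunk_{i:04d}.wav"
        sf.write(chunk_path, wav[start:end], SAMPLE_RATE)
        out = mlx_whisper.transcribe(chunk_path, path_or_hf_repo=repo)
        results.append((seg["start"], out["text"].strip()))
    return results
```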
For post-processing, I've found that running the generated SRT files through ChatGPT to identify and remove hallucinated chunks yields better results.
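A minimal sketch of that cleanup pass with the OpenAI Python client; the model name and prompt wording here are placeholders, not my actual prompt:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def clean_srt(srt_text: str) -> str:
    system = (
        "You are cleaning a Whisper-generated SRT transcript. Remove subtitle "
        "blocks that look like hallucinations (repeated phrases, prompt echoes, "
        "text during obvious silence) and return valid SRT only."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; use whatever model you prefer
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": srt_text},
        ],
    )
    return resp.choices[0].message.content
```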
If I understood correctly, VAD gives better results than ffmpeg's silencedetect + silenceremove, right?
I think the latest version of ffmpeg can run Whisper with VAD [1], but I still need to explore how with a simple PoC script.
I'd love to know more about the post-processing prompt; my guess is that it looks like an improved version of the `semantic correction` prompt [2], but I may be wrong ¯\_(ツ)_/¯.