As far as I understand, the training data is closed source, but the training methodology is openly published, which allows independent parties to recreate the model from scratch and see similar results. Not only can you download the full >400GB model via Hugging Face or Ollama, but they also offer distilled versions of the model that are small enough to run on something like a Raspberry Pi. I'm running it locally on my machine at home with Perplexica (a perplexity.ai lookalike with search capabilities).
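For example, a distilled variant can be pulled and run with a single Ollama command. This is a sketch, not a prescription: I'm assuming the model in question is DeepSeek-R1 (which matches the sizes described), and the 7b tag is just one of the distilled variants, so adjust to whatever fits your hardware:

```sh
# Pull and chat with a distilled variant via Ollama.
# The deepseek-r1:7b model/tag is an assumption about which model/size you want.
ollama run deepseek-r1:7b
```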
I'm not home so I can't try it, but do you need to be so specific as to match the whole markdown syntax?
You might be able to get away with matching just #this%20is%20LIKELY%20a%20link.md as opposed to matching the whole markdown link: lowercase that entire match, then, on anything that looks like that, replace each %20 with a hyphen, all combined into a single sed command (see the sketch below). This only fails when an http link falls on the same line as a markdown hyperlink.
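Something like this might work. It's a minimal sketch, assuming GNU sed (for the \L lowercase escape); the fragment pattern and the notes.md filename are guesses at your setup:

```sh
# Pass 1: lowercase anything that looks like a .md fragment link.
# Pass 2 (the :a/ta loop): replace %20 with - inside those matches only,
# looping until no %20 remains between the # and the .md.
sed -E \
  -e 's/#[A-Za-z0-9%._-]+\.md/\L&/g' \
  -e ':a' \
  -e 's/(#[a-z0-9._-]*)%20([a-z0-9%._-]*\.md)/\1-\2/g' \
  -e 'ta' \
  notes.md
```

The loop is needed because each substitution consumes the match up to .md, so only one %20 per fragment is rewritten per pass; the ta branch repeats until nothing changes. As noted, a URL like http://example.com/page#My%20Note.md on the same line would get mangled too, since the pattern can't tell that fragment apart from a markdown one.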