Plan: 1. Similarity score on phrases. 2. QEMU coding
Overall approach: small projects.
Current approach to language:
1) As mentioned before, there's no free lunch for language processing: it takes a ton of computation, and there's no way around that cost.
2) Two approaches: a) use language directly, which is what LLMs do, what everyone is doing, and it's not satisfying to me. It's likely the only solution. b) Use language based on parse trees. This is the more satisfying approach since it lets me build explicit models of entities. It makes the models transparent and makes it easier to learn from little data. Since a) is already being done (and is less satisfying), I'm pursuing b).
3) Two approaches to modeling via parse trees: a) make embeddings, b) use similarity graphs/trees/co-occurrence data. The problem with a) is that nothing guarantees the embeddings don't overlap with other embeddings. We'd need to randomly sample negatives from the set of all embeddings OR do something like GloVe. We can't randomly sample because as the number of embeddings scales (millions, not tens of thousands), the number of negative samples needs to increase too, and the computation becomes intractable. Similarly, GloVe won't have enough data? Eh, will it? Training like GloVe definitely won't work since the data is so sparse; we'd need to do something iterative, like fit the phrase vectors to the existing word vectors and then update the word vectors separately, but this runs into the same problem of embeddings folding over themselves. We also hit the same issue of sparse data forcing us to update word embeddings differently from phrase embeddings. Getting this right would require a ton of hacks. If we tried something like factorizing the sparse co-occurrence matrix, well, I don't quite know which algorithms can do that, or whether they'd be efficient at the scale we're dealing with. The one advantage we have is sparsity: to exploit the nature of the problem, the algorithm would only need to look at the non-zero values in the matrix. And how would new data be fit to the model? Once again, we'd need to make sure embeddings didn't fold over themselves.
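As a sketch of the "only look at the non-zeros" idea: a minimal GloVe-style factorization trained by SGD over just the non-zero co-occurrence entries, so an epoch costs O(nnz × dim) rather than O(n² × dim). Everything here is illustrative, not the actual pipeline: `factorize_sparse`, the dict-of-counts input format, and the hyperparameters are all made up for the example, and the GloVe weighting term is dropped for brevity.

```python
import numpy as np

def factorize_sparse(cooc, dim=8, epochs=300, lr=0.1, seed=0):
    """GloVe-style SGD factorization that only ever touches the non-zero
    co-occurrence entries -- the sparsity the note wants to exploit.
    `cooc` maps (i, j) index pairs to counts (a hypothetical input
    format; a real system would stream these from disk)."""
    rng = np.random.default_rng(seed)
    n = 1 + max(max(i, j) for i, j in cooc)
    W = rng.normal(scale=0.1, size=(n, dim))  # row embeddings
    C = rng.normal(scale=0.1, size=(n, dim))  # context embeddings
    pairs = list(cooc.items())
    for _ in range(epochs):
        rng.shuffle(pairs)
        for (i, j), count in pairs:
            # fit the dot product to the log count (no GloVe weighting term)
            err = W[i] @ C[j] - np.log(count)
            gw, gc = err * C[j], err * W[i]
            W[i] -= lr * gw
            C[j] -= lr * gc
    return W, C
```

This doesn't solve the folding-over problem (nothing pushes unrelated embeddings apart), which is exactly the objection above; it only shows that the per-step cost can be kept proportional to the observed data.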
There's a separate problem, which is: why bother making embeddings for phrases at all? What's the point of checking the similarity between "red apple" and "green apple" vs. "red tomato" vs. "purple plum"? Similarity is what makes the model, what defines it. But the point here is that we can gauge similarity for phrases by other means, ones that don't need anything super complicated.
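One cheap version of those "other means": since the parse already splits a phrase into parts, score phrase similarity by aligning the parts and combining per-word similarities, with no learned phrase embedding at all. The (modifier, head) phrase format, the hand-filled word-similarity table, and the 0.3/0.7 weights below are all placeholder assumptions; the word-level scores would really come from whatever word-similarity source is on hand.

```python
# Hypothetical word-pair similarities, symmetric; stands in for word
# vectors, a thesaurus graph, or any other word-level similarity source.
WORD_SIM = {
    ("red", "green"): 0.8, ("red", "purple"): 0.7, ("green", "purple"): 0.7,
    ("apple", "tomato"): 0.6, ("apple", "plum"): 0.5, ("tomato", "plum"): 0.4,
}

def word_sim(a, b):
    if a == b:
        return 1.0
    return WORD_SIM.get((a, b), WORD_SIM.get((b, a), 0.0))

def phrase_sim(p, q):
    """Compare (modifier, head) phrases position-wise; the head noun is
    weighted more heavily than the modifier (weights are a guess)."""
    (mod1, head1), (mod2, head2) = p, q
    return 0.3 * word_sim(mod1, mod2) + 0.7 * word_sim(head1, head2)
```

On the examples above this gives the intuitive ordering: "red apple" is closest to "green apple" (same head), then "red tomato", then "purple plum".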
Admittedly, if something doesn't have much data, we don't need to model it, because we can just browse the data ourselves. If something has a lot of data, we can train a model on it: a black-box, large-LLM-type model. Fine-tune it on the data available and call it a day. If it comes to that, I should first find a use case where users are asking for that thing and then do it, instead of trying to anticipate what users want. To be fair, Google is something people use all the time, and I mainly want a way to search. Eh, but we don't need LLMs for searching this stuff. Eh, but for question answering, yeah, maybe. But fine-tuning on niche datasets wouldn't have much appeal for question answering, since RAG over an LLM would do just fine.
The whole approach I'm taking is to see if there's a route based on parse trees that *doesn't* use LLMs. Something that uses LLMs is merely a matter of using LLMs, which isn't super interesting to me. I might as well work on something else, something that interests me more. And maybe keep LLMs as a tool in the toolbox for later, when needed.