Taco (τ,τ):
I'd like to share my current perspective on the AI/ML feasibility/viability/sensibility of @bittensor_, especially compared to the efforts of EleutherAI
#AI #ML #bittensor #eleuther
1. Bittensor fine-tunes, while Eleuther pretrains. Fine-tuning has significantly lower bandwidth/compute requirements than pretraining. Fine-tuning could give an order of magnitude lower error than a pretrained model in many cases, depending on the domain
2. The Muskian limits-of-physics assessment of LLMs is basically that limited model capacity can contain only limited info. Pretraining a single model on a lot of info will only make it average across the board, hence the typical need to fine-tune a pretrained model to make it useful..
..this is basically HuggingFace's mission statement
3. NLU is entering the Post-RETRO era, because the [power law of scaling](arxiv.org/abs/2001.08361) of Neural Language Models gives diminishing returns, so new approaches are expanding AI faculties e.g.
..with retrieval (DeepMind RETRO) to benefit from specialized knowledge otherwise difficult to capture in general pretrained models..
..RETRO acts like on-the-fly fine-tuning, and gives an effective 25x parameter gain, so now a 7B model can outperform a 280B model, especially for specialized domains such as GitHub code prediction
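A rough back-of-envelope for what a 25x effective parameter gain buys under the linked power law. The exponent and constant below are the fitted values reported in that scaling-law paper; the 25x multiplier is RETRO's claimed effective gain, used here purely for illustration:

```python
# Kaplan et al. scaling law: L(N) = (N_c / N) ** ALPHA
# ALPHA and N_C are the paper's fitted values for loss vs. parameter count.
ALPHA = 0.076   # loss-vs-parameters exponent
N_C = 8.8e13    # fitted constant (non-embedding parameters)

def loss(n_params: float) -> float:
    """Predicted cross-entropy loss for a dense model with n_params parameters."""
    return (N_C / n_params) ** ALPHA

base = loss(7e9)          # a plain 7B model
boosted = loss(25 * 7e9)  # same model with an assumed 25x effective gain
print(f"7B loss ~ {base:.3f}, 25x-effective loss ~ {boosted:.3f}")
```

The small exponent is exactly the "diminishing returns" point: even a 25x jump in effective parameters moves the predicted loss only modestly, which is why retrieval-style multipliers matter more than brute-force scaling.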
4. A single commodity GPU with 11GB RAM can finetune a 6B GPT-J with 8-bit gradient compression, with negligible performance drop..
..Roughly, the medium-term vision is that Bittensor can represent 2000 x 6B fine-tuned RETRO models for a 25x gain over GPT-J, with a further order of magnitude improvement due to fine-tuning
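A minimal sketch of the 8-bit gradient idea: linear per-tensor quantization to int8, then dequantization. The real GPT-J setup relies on library-level blockwise quantizers, so this numpy version is illustrative only, showing the 4x size reduction and the small relative error:

```python
import numpy as np

def quantize_8bit(grad):
    """Linearly map a float32 gradient tensor onto int8 codes plus one scale."""
    scale = max(float(np.abs(grad).max()) / 127.0, 1e-12)  # avoid divide-by-zero
    codes = np.clip(np.round(grad / scale), -127, 127).astype(np.int8)
    return codes, scale

def dequantize_8bit(codes, scale):
    """Recover an approximate float32 gradient from the int8 codes."""
    return codes.astype(np.float32) * scale

rng = np.random.default_rng(0)
g = rng.normal(0, 1e-3, size=10_000).astype(np.float32)  # mock gradient tensor
codes, scale = quantize_8bit(g)
g_hat = dequantize_8bit(codes, scale)
rel_err = np.linalg.norm(g - g_hat) / np.linalg.norm(g)
print(f"bytes: {g.nbytes} -> {codes.nbytes}, relative error ~ {rel_err:.4f}")
```

The communicated payload shrinks 4x (int8 vs float32) while the reconstruction error stays around a percent, which is the intuition behind "negligible performance drop".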
5. Once multimodal data starts to integrate RETRO-style, the capacity concerns and the need for specialized experts become pronounced, and a divide-and-conquer approach becomes the only viable option..
..introspection and automatic knowledge derivation in specialized domains will further bolster the benefit from a RETRO-style approach
6. Bittensor distributes specialization to achieve parameter efficiency: a collective of fine-tuned deep monoliths, largely trained independently, so the bandwidth requirement is lessened.
7. Bittensor has also incentivized adversarially-resilient differentiable task-intelligence routing..
..the incentive promotes missing expertise, so our Advanced Miner will have a differentiable means of finding knowledge gaps over miners/servers, and will then fine-tune itself to provide that specialization to the network..
..top-down, this resembles an ontology: a routing tree from the most general root down to specialized leaves covering different domains/topics
8. Bittensor is a bottom-up differentiable self-organizing network that uses ML to construct this routing tree, by trial-and-error of routing tasks based on their content to different experts and obtaining actionable intelligence representations from the experts..
..so we have e.g. 4000 specialized leaf domains, each served by a small set of miners/servers fine-tuned for it, and collectively these cover all incentivized knowledge
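The "differentiable routing by content" idea can be sketched as a softmax gate over expert embeddings. This is generic mixture-of-experts machinery, not Bittensor's actual implementation; the expert count and top-k choice below are arbitrary:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def route(query_emb, expert_embs, top_k=2):
    """Score each expert against the query, softmax the scores, and return
    the top-k experts with their (differentiable) gate weights."""
    scores = expert_embs @ query_emb           # one relevance score per expert
    gates = softmax(scores)                    # differentiable routing weights
    top = np.argsort(gates)[::-1][:top_k]      # send the task to the best few
    return top, gates[top] / gates[top].sum()  # renormalized mixture weights

rng = np.random.default_rng(1)
experts = rng.normal(size=(8, 16))  # 8 mock expert (leaf-domain) embeddings
query = rng.normal(size=16)         # embedding of an incoming task
chosen, weights = route(query, experts)
print("route to experts", chosen, "with weights", weights)
```

Because the gate is a softmax over scores, the routing decision carries gradients, which is what lets trial-and-error routing be trained end-to-end rather than hand-built.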
9. The internet-bandwidth regime suffices for Bittensor, because it is predominantly concerned with constructing the domain routing tree, which could be under 1MB of info..
..furthermore, the tree is bootstrapped by defining large domain clusters when hosting HuggingFace fine-tuned models, e.g. a model per language
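Back-of-envelope for the "<1MB" claim, taking the ~4000 leaf figure from point 8 and an assumed per-node budget (a short domain id plus a quantized centroid, roughly 100 bytes, which is my assumption, not a spec):

```python
LEAVES = 4000         # specialized leaf domains (figure from the thread)
BYTES_PER_NODE = 100  # assumed: domain id + small quantized embedding
# A full binary-ish tree has roughly 2x the leaf count in total nodes.
total_nodes = 2 * LEAVES
size_bytes = total_nodes * BYTES_PER_NODE
print(f"routing tree ~ {size_bytes / 1e6:.2f} MB")
```

Even with generous per-node metadata the whole routing structure fits in well under a megabyte, which is why syncing it is cheap compared to moving gradients.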
10. The uptrend of global connectivity could 10x overall bandwidth in the next 10-20 years, which would make more intensive training ops with Bittensor even more viable.
11. Enforcing adversarial resilience by checking the informational quality of contributors likely takes more bandwidth, but this is not directly related to training, and the consensus mechanism can be sampled more liberally to fit available bandwidth.
12. Distillation is mostly a means of boosting service capacity by replicating existing capabilities, and for bootstrapping naive smaller models, but eventually all miners/servers will be incentivized to at least run large pretrained models with RETRO..
..and specialize in uncovered domains - which would push the importance of distillation into the background
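Distillation as described, a student matching a teacher's softened outputs, is usually the temperature-scaled KL objective below. A generic Hinton-style sketch, not Bittensor code; the logits are mock values:

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distill_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as in standard knowledge distillation."""
    p = softmax(teacher_logits, T)  # soft targets from the big model
    q = softmax(student_logits, T)  # the smaller student's predictions
    return float(np.sum(p * (np.log(p) - np.log(q)))) * T * T

teacher = np.array([4.0, 1.0, 0.5])  # mock teacher logits
student = np.array([3.5, 1.2, 0.4])  # mock student logits
print(f"distillation loss ~ {distill_loss(student, teacher):.4f}")
```

The loss is zero only when the student reproduces the teacher exactly, which is why distillation replicates existing capability rather than discovering new specialization.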
13. The collective moderates/promotes/censors representations according to consensus, so one of Eleuther's main concerns, the use of uncensored data, is thereby addressed in Bittensor..
..furthermore, Bittensor mainly offers actionable intelligence in the form of primed informational representations, but how it is enacted is entirely up to the user..
..the user decides what to do with the info and how to perform a given task, so task-level moderation can be applied locally