In an AI-powered world, Africa’s data dearth is more apparent than ever
January 16, 2024
Alexandria S Williams, The Africa Data Digest

“Algorithms can only see the numbers,” says Nigeria-based AI researcher Wuraola Oyewusi when asked her opinion about ChatGPT.

“These models don’t see that [the data] is from Africa”.

Oyewusi had been tinkering with open-source generative AI models for years before ChatGPT became a reality. That’s why she believes the assumption that African countries will be left out of the AI race is not grounded in a deeper understanding of the problem. In fact, Africa-focused researchers don’t even have to develop brand-new generative models. Ready-to-deploy models already exist; all it takes is someone to gather the data necessary to make them effective.

The case for repurposed AI in Africa

There is an entire branch of machine learning dedicated to adapting AI models for languages, cultures, and images that do not have the same breadth of data available in the Western, English-speaking world. “These methods just take a little bit more work,” Oyewusi said. One popular workaround is called transfer learning. It is technical in execution but human in spirit.

Let’s say, for example, your favourite tailor usually makes elaborate gowns for special occasions. That tailor is probably so good at their job that you could ask them to sew a few simple pants and shirts if you wanted. They would only need to swap a few materials and adjust a few measurements.

This is how transfer learning works. You take a model trained on a large data set, tweak it, add a few layers and retrain it for something smaller. It is one of many tools that Oyewusi and a team of researchers used to develop AFRIGAN, an African fashion style generator. Finding a pre-trained model was doable for Oyewusi, who was part of a vibrant open-source machine-learning community with knowledge and access.
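The tweak-and-retrain step can be sketched in a few lines of NumPy. Everything here is hypothetical — the weights, dataset, and layer sizes are toy stand-ins, not the AFRIGAN code — but the structure is the same: a frozen pretrained layer reused as a feature extractor, and a small new head trained on scarce task-specific data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "pretrained" layer: in practice these weights would come
# from a model trained on a large dataset, not from a random generator.
pretrained_weights = rng.normal(size=(8, 4))  # 8 raw inputs -> 4 features

def extract_features(x):
    """Frozen pretrained layer, reused as-is without retraining."""
    return np.tanh(x @ pretrained_weights)

# Small task-specific dataset: a toy stand-in for a niche domain where
# labelled data is scarce. Labels are built to be learnable from the
# frozen features so the sketch converges.
X = rng.normal(size=(32, 8))
feats = extract_features(X)
true_w = np.array([1.0, -1.0, 0.5, 2.0])      # hypothetical ground truth
y = (feats @ true_w > 0).astype(float)

# The new trainable "head": one logistic-regression layer fitted by
# plain gradient descent on top of the frozen features.
head_w, head_b, lr = np.zeros(4), 0.0, 0.5
for _ in range(200):
    preds = 1.0 / (1.0 + np.exp(-(feats @ head_w + head_b)))
    grad = preds - y                           # gradient of the log loss
    head_w -= lr * feats.T @ grad / len(y)
    head_b -= lr * grad.mean()

accuracy = ((feats @ head_w + head_b > 0) == (y > 0.5)).mean()
print(f"training accuracy of the new head: {accuracy:.2f}")
```

In a real pipeline the frozen layer would be the bulk of a large pretrained network (the tailor’s existing skill), and only the small new head (the changed measurements) is fitted to the scarce dataset.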

The issue was that African fashion had no central data repository to train their model with. So, the AFRIGAN creators went through the painstaking process of manually removing human models from images of African fashion they found on different e-commerce sites. The data was out there. It just took more time to curate. According to Oyewusi, better collection and curation are the keys AI researchers need to reimagine the future of AI in Africa.

Good data still needs curation

High-quality, curated data is akin to a perfectly organised pantry. When all the ingredients you need for a dish are well-labelled and thoughtfully placed in neat rows, it’s easier to make your favourite dish. You don’t mistake salt for sugar. You get through the recipe with ease. That’s what data curation feels like when a researcher sources information from “languages of the internet” like English, which makes up around 55% of the internet’s written content. For comparison, Spanish comes in second at a mere 5%, followed by Russian and German.

Researchers, scholars and technologists have added to these high-resource linguistic repositories for years, making them more sophisticated and easier to access. The opposite is true for most other languages, which, due to the low availability of online material published in them, are often referred to as “low resource” — a mere 0.28% of the world’s languages are “high resource” by machine learning standards. If an AI researcher wants to develop something on the scale of ChatGPT, whose underlying model has around 175 billion parameters and was trained on vast amounts of internet text, they’ll have a harder time finding all this information in one location.

Wenitte, who is the CEO and founder of Mandla, experienced this while developing an app for learning African languages.

Most language learning apps can use AI to convert text to speech or generate unique sentences for learners, but Mandla is rarely able to access these resources. Because African languages vary significantly in vocabulary, pronunciation, and grammar, the app’s team struggled to collect the language data needed to make the app work.

Wenitte worked with linguists to overcome this barrier, but observed that even a carefully selected group of specialists had difficulty reaching a consensus about which words should be used for specific languages. Without that consensus, he hesitated to adopt the host of AI tools available to other language learning apps like Duolingo, which reportedly uses AI to improve language learning through speech analysis.

Generative AI learns patterns and structures from the original dataset. If inputs are imperfect, then output is even further off the mark. In Mandla’s case, the app ran the risk of teaching learners patterns in languages that didn’t actually exist. This was a risk that Wenitte could not take. He saw the app as a step toward documenting African languages officially on the web and confronting this “low resource problem” head-on.
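The risk Wenitte describes can be shown with a toy character-level bigram model (all data here is made up for illustration): if the training text contains an error, the model absorbs it as a legitimate pattern and can reproduce it in generated output.

```python
from collections import defaultdict
import random

def bigram_counts(text):
    """Count character bigrams: the 'patterns' a generative model learns."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

# Toy "corpus" with a deliberate error: the sequence 'qx' never occurs
# in the clean text, but the model treats it as a legitimate pattern.
clean = "bata bata bata"
noisy = "bata bqxa bata"

def generate(counts, start, n, seed=0):
    """Sample characters according to the learned bigram frequencies."""
    random.seed(seed)
    out = start
    for _ in range(n):
        nxt = counts.get(out[-1])
        if not nxt:
            break
        chars, weights = zip(*nxt.items())
        out += random.choices(chars, weights=weights)[0]
    return out

print(generate(bigram_counts(noisy), "b", 10))
```

A learner studying the generated output has no way to tell the invented sequence from a real one — which is exactly why Wenitte would not ship AI-generated sentences without expert consensus on the underlying data.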

“If we were incorrect, we ran the risk of our mistake becoming the source of truth down the line,” Wenitte said.

Mandla is, as a result, a tool that utilizes AI but is still very human in nature. The team independently sourced most of the app’s language data and used human beings to voice words and sentences.

Experts believe that there have to be humans in the loop when it comes to using AI for low-resource data sets, explained Jeremy Kirshbaum, the CEO of Handshake Innovations, a company that works with businesses and governments to develop AI strategies. Human involvement is an effective tool for verification, but “it is difficult and expensive to organize at scale”, Kirshbaum said. This extends beyond language learning applications like Mandla. It is and will continue to be a conundrum for AI, which has yet to divorce itself entirely from human handlers.

The not-so-artificial side of AI

At the heart of what observers see as pure automation are human ingenuity and input. Whether it is Wuraola, who came up with an idea for an AI-generated book on African fashions, or ChatGPT itself, whose creators outsourced data labelling to low-paid Kenyan workers, AI shortfalls can’t be resolved without people. Cultures, languages and histories are not artificially generated, so they must be documented to be included.

That is why Wuraola champions better collection and curation of African data. There are already people doing this, she pointed out, referring to organisations like the Lacuna Fund and Masakhane that are deepening AI research and data collection for African purposes.

Wenitte sees a better future for language data. He cites policy changes in African countries that promote the use of local languages as a good step toward standardisation. However, he thinks governments should deepen investment in language data collection. If gaps in collection and curation are resolved, then tangible financial benefits for innovators in African countries can be realised.

Wenitte imagines an AI chatbot to accompany language learners on Mandla. He says that if ChatGPT were better versed in African languages, his team could use OpenAI’s API and create new worlds of possibility for Africa-based users wanting to interact with it in their mother tongues. But for now, in a world where what’s out there is good enough but not great, abundant but not centralized, creators like Oyewusi and Wenitte will have to keep tinkering and finding workarounds until Africa’s data problem is rectified.

This article was originally published by The Africa Data Digest and produced in collaboration with Founders Factory Africa.