Sunday, April 30, 2023

Only as good as the data

The Washington Post reports on the data used for ChatGPT and other large language models (LLMs):
We found several media outlets that rank low on NewsGuard’s independent scale for trustworthiness: No. 65, the Russian state-backed propaganda site; No. 159, a well-known source for far-right news and opinion; and No. 993, an anti-immigration site that has been associated with white supremacy.

Chatbots have been shown to confidently share incorrect information ... Untrustworthy training data could lead it to spread bias, propaganda and misinformation.

AI is only as good as its data. Obviously, training on known propaganda like Russia Today will be a problem for ChatGPT. More generally, including disinformation or misinformation in the training data will make the output worse.

AI/ML benefits from thinking hard about high-quality data and the metrics you use for evaluation. It's all an optimization process. Optimize for the wrong thing and your product will do the wrong thing.
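
To make the point about metrics concrete, here is a toy sketch. Everything in it is made up for illustration: the labels, the two candidate "models" (really just fixed lists of predictions), and the names `never_flags` and `flags_some`. It assumes a task where 5 of 100 items are misinformation, and shows that picking the winner by accuracy rewards a model that never flags anything.

```python
# Hypothetical data: 1 = misinformation, 0 = reliable. 5 bad items out of 100.
labels = [1] * 5 + [0] * 95

# Two made-up candidates, represented only by their predictions.
candidates = {
    # Never flags anything as misinformation.
    "never_flags": [0] * 100,
    # Catches all 5 bad items but also raises 10 false alarms.
    "flags_some": [1] * 5 + [1] * 10 + [0] * 85,
}


def accuracy(labels, preds):
    """Fraction of items where the prediction matches the label."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)


def recall(labels, preds):
    """Fraction of true misinformation items that were flagged."""
    caught = sum(1 for l, p in zip(labels, preds) if l == 1 and p == 1)
    return caught / sum(labels)


def pick_best(metric):
    """Return the name of the candidate that scores highest on the metric."""
    return max(candidates, key=lambda name: metric(labels, candidates[name]))


if __name__ == "__main__":
    for name, preds in candidates.items():
        print(name, accuracy(labels, preds), recall(labels, preds))
    print("best by accuracy:", pick_best(accuracy))
    print("best by recall:", pick_best(recall))
```

Under these assumptions, selecting by accuracy picks the candidate that misses every bad item, while selecting by recall picks the one that catches them all. Neither metric is "right" in general; the point is only that the choice of metric decides what you end up shipping.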

