OpenAI and robots.txt: the question of opting out

OpenAI recently documented their support for the venerable robots.txt format, allowing us to opt our sites out of their model-training web crawler. I initially assumed I’d just opt out everything I work on, but then I realized that it was not so simple.

For this site, for my blog and side projects, I added the opt-out. On the off chance that someone finds something I wrote useful, they can find it via Google or other links, and they can see my name on it. I’ve already written some thoughts on this topic: people’s words should be seen where they publish them, not fed through a blender into a slurry you can’t judge for accuracy and can’t attribute to anyone.
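For reference, the opt-out itself is only two lines of robots.txt. Here is a minimal sketch, assuming the crawler identifies itself as GPTBot (the user agent named in OpenAI's documentation). It uses Python's standard urllib.robotparser to check that the rules block that crawler while leaving others alone. The example.com URL and the `allowed` helper are placeholders for illustration.

```python
from urllib.robotparser import RobotFileParser

# Opt-out rules, assuming OpenAI's crawler identifies itself as "GPTBot".
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/blog/some-post"  # placeholder URL
    print("GPTBot allowed:", allowed("GPTBot", page))
    print("Googlebot allowed:", allowed("Googlebot", page))
```

A robots.txt without the GPTBot rule leaves that crawler allowed by default, so opting in needs no rule at all. As with any robots.txt rule, this only works if the crawler chooses to honor it.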

For timestory.app, however, I struggled a bit, and finally chose not to opt out. That’s a product site, meant for marketing and documenting my app. People are widely using these generative tools for search, and while I have no way of ensuring they say truthful things, my pages at least need to be available for ingestion for there to be a chance. And it’s a little bit frustrating.

One night, after discussing this, my wife started asking GPT-3.5 (via the Wavelength app) some questions about TimeStory and its competitors. As always, it gave back a mixture of facts and believable untruths. Nothing egregious, just wrong. She corrected it in the conversation, but somehow I doubt that will fix the answers for anyone else. Again, I know people are out there using these tools, so what can I do? If these automated remixers are going to exist, I at least hope that they remix enough of my product pages and user documentation that people can get honest answers from them, and maybe even see TimeStory recommended if they search for the right kind of things.