Stack Overflow is unique as a site, in the sense that its contributions are under a license that allows for reuse (Creative Commons Attribution-ShareAlike, CC BY-SA) as long as the individual users are properly credited. Does this mean that OverflowAI keeps the credit metadata and knows who wrote each individual part of an answer?
AI doesn’t work that way. No one wrote “part of the answer.” It’s more like each contributor cast a vote on what the next token should be, and it randomly picks one of the top ten voted tokens. (Very very roughly.)
Fair enough, but at least there should be a way for OverflowAI to list which contributors had the strongest link to the given answer, right?
Edit: definitely read the other responses because apparently there are some techniques I wasn’t aware of and don’t understand nearly as well as I understand the underlying AI technology - and I’m only an enthusiast layman.
I don’t think there is any way of doing that. AI is like a huge matrix that says ‘if (’ is followed by
‘ x’: 60%
‘ foo’: 19%
‘ person’: 9%
Etc.
And then it does it all over again for the next token: it randomly selects one of those tokens and then says ‘if ( person’ is followed by
‘.id’: 30%
‘.name’: 27%
Etc.
So even writing a simple ‘if person.name.startsWith(“foo”) {’ is the aggregate result of thousands of contributors - really pretty much every author of every code snippet ingested from the training material.
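To make the "vote on the next token" picture concrete, here is a rough sketch of top-k sampling over a toy lookup table. The table, its extra tokens, and the function name are all made up for illustration; a real model computes these probabilities with a neural network rather than looking them up. The thing to notice is that nothing in the table records who contributed to any number.

```python
import random

# Hypothetical next-token table. The first three entries per context echo the
# made-up percentages in the post; the rest are invented filler.
NEXT_TOKEN_PROBS = {
    "if (": {" x": 0.60, " foo": 0.19, " person": 0.09, " i": 0.07, " self": 0.05},
    "if ( person": {".id": 0.30, ".name": 0.27, ".age": 0.23, ")": 0.20},
}


def sample_next_token(context, k=3, rng=random):
    """Pick one of the k most likely next tokens, weighted by probability."""
    candidates = NEXT_TOKEN_PROBS[context]
    # Keep only the k highest-probability candidates.
    top_k = sorted(candidates.items(), key=lambda item: item[1], reverse=True)[:k]
    tokens = [token for token, _ in top_k]
    weights = [prob for _, prob in top_k]
    # random.choices normalizes the weights, so they need not sum to 1.
    return rng.choices(tokens, weights=weights, k=1)[0]


if __name__ == "__main__":
    context = "if ("
    token = sample_next_token(context)
    print(repr(context + token))
```

With k=1 this always returns the single most likely token; larger k lets the less likely continuations through some of the time.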
There is no single author even if the code matches existing code token for token. The only exception would be code that is so esoteric that there is only a single author writing code that does a particular thing. But even in that case, there is nothing in the probability matrix to indicate that a particular sequence of tokens is unique to a certain author. The best you could do is a full-text search for a line of code to see whether it matches anything in the training data, and whether there is a very small set of authors to whom credit might be assigned. That might be possible, but it would be an add-on (and a significant performance hit) to the actual AI itself. Sort of like how browser-integrated AI just runs a search and feeds the results into the context to make the output more likely to contain information from the top results.
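As a rough sketch of that add-on idea, the snippet below searches a tiny stand-in corpus for a generated line and only assigns credit when the set of matching authors is small. The corpus, author names, and the `max_authors` threshold are all hypothetical; a real training set would need a proper search index, which is where the performance hit comes from.

```python
# Hypothetical corpus: (author, snippet) pairs standing in for training data.
CORPUS = [
    ("alice", 'if person.name.startsWith("foo") {'),
    ("bob", 'if   person.name.startsWith("foo") {'),
    ("carol", "return frobnicate_quux(widget, 0x2A)"),
]


def normalize(line):
    """Collapse runs of whitespace so trivially reformatted lines still match."""
    return " ".join(line.split())


def authors_of(line, corpus=CORPUS):
    """Return the sorted list of authors whose snippets contain the line."""
    needle = normalize(line)
    return sorted(
        {author for author, snippet in corpus if needle in normalize(snippet)}
    )


def creditable_authors(line, max_authors=1, corpus=CORPUS):
    """Credit authors only when the match set is small enough to be meaningful."""
    found = authors_of(line, corpus)
    return found if 0 < len(found) <= max_authors else []


if __name__ == "__main__":
    print(creditable_authors("return frobnicate_quux(widget, 0x2A)"))
    print(creditable_authors('if person.name.startsWith("foo") {'))
```

The common line matches several authors and gets no credit assigned, while the esoteric one maps to a single author.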
Check out the article and feature video. It does appear to link to answers it pulled from. Bing and Bard do the same. Posters saying it’s impossible are mistaken.
Thanks for the TLDW - I could ogle a bit of the article but since I was at work, I couldn’t just play the video out loud.
Posters aren’t saying that it’s impossible to put search results through an LLM and ask it to cite the sources it reads. They’re saying that the neural networks, as used today in LLMs, do not store token attribution in the vocabulary or per node. You can build a system around the neural network that provides it the proper input (search results) and prodding (a prompt that biases the network toward citation), but the LLM on its own cannot come up with that attribution.
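A minimal sketch of that kind of wrapper might look like the following. It is not how OverflowAI, Bing, or Bard actually do it; the `SearchResult` type, the prompt wording, and the example.com URLs are assumptions for illustration. The attribution lives entirely in the surrounding system: sources are numbered in the prompt, and citation numbers in the output are mapped back to authors.

```python
from dataclasses import dataclass


@dataclass
class SearchResult:
    """One retrieved answer; a hypothetical shape, not a real API type."""
    author: str
    url: str
    text: str


def build_cited_prompt(question, results):
    """Number each retrieved answer and ask the model to cite by number."""
    lines = [
        "Answer the question using only the sources below.",
        "Cite sources by their bracketed number.",
        "",
    ]
    for index, result in enumerate(results, start=1):
        lines.append(f"[{index}] {result.author} ({result.url})")
        lines.append(result.text)
        lines.append("")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


def credits_for(cited_numbers, results):
    """Map citation numbers found in the model output back to authors."""
    return [results[n - 1].author for n in cited_numbers if 1 <= n <= len(results)]


if __name__ == "__main__":
    results = [
        SearchResult("alice", "https://example.com/a/1", 'name.startsWith("foo")'),
        SearchResult("bob", "https://example.com/a/2", 'name.indexOf("foo") == 0'),
    ]
    print(build_cited_prompt("How do I check a string prefix?", results))
    print(credits_for([2], results))
```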
If it’s doing a search for the code, pulling it into the context, and then spitting it back out in slightly modified form, then it can attribute the source it pulled in. That’s a very different thing from the AI drawing on its training data, because code that is pulled into context by a search has a strong influence on the output. The output is still generated the same way, but it would be reasonable to credit the author of the code that is pulled in. However, the code in the training data cannot be credited. How you would pull in just the right piece of code in the first place, though, is a bit of a mystery to me.
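On the "pull in just the right piece" part: one simple way to think about it is ranking stored posts by how much they overlap with the question. The sketch below uses plain word overlap (Jaccard similarity) over a made-up list of posts. This is an assumed stand-in, not a description of what any of the products mentioned here do; production systems typically rely on a search engine or learned embeddings instead.

```python
import re


def tokens(text):
    """Split text into a set of lowercase word-like tokens."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def jaccard(a, b):
    """Size of the intersection over size of the union; 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match(query, posts):
    """Return the (author, text) pair whose text is most similar to the query."""
    query_tokens = tokens(query)
    return max(posts, key=lambda post: jaccard(query_tokens, tokens(post[1])))


if __name__ == "__main__":
    # Hypothetical posts standing in for indexed answers.
    posts = [
        ("alice", "check if a string starts with a prefix using startsWith"),
        ("bob", "how to center a div in css"),
    ]
    author, text = best_match("string starts with prefix", posts)
    print(author, "->", text)
```

Because each stored post carries its author, whatever gets retrieved this way can be credited, unlike anything absorbed during training.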
Then I’m guilty of breaking the license. I have always been stealing code from Stack Overflow. Well, now that I’m a senior dev, I steal only from answers.
It does seem to do that in the feature video. It appears to link to all the answers it pulled from.