The idea I’m working on (in progress…) is to add Parlor-style reactions to a video as you watch it: “aha”, “agreed”, “huh?”, etc. (Another reference for this kind of reaction tagging: http://canv.as/)
It would be nice to be able to play back only the highlighted parts (or only the interesting/confusing/irritating parts), but for that there needs to be a mechanism to determine the length of each highlight. That might make the interaction cumbersome: having to keep the tag pressed until the “aha” moment has passed, for example, or clicking again to finish the highlight. It requires more attention, but it might be worth a try. An alternative is to have a default length that can be edited later as you watch the highlights.
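To make the default-length alternative concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the names `Highlight`, `tag_moment`, `adjust` and `playlist` are hypothetical, and so are the 10-second default and the guess that most of the clip should come before the tap (since you usually react a little after the moment itself).

```python
from dataclasses import dataclass, replace

DEFAULT_LENGTH = 10.0  # seconds; assumed default, to be tuned
LEAD_IN = 7.0          # seconds of the clip placed before the tap (assumption)


@dataclass(frozen=True)
class Highlight:
    tag: str      # e.g. "aha", "agreed", "huh?"
    start: float  # seconds from the start of the video
    end: float


def tag_moment(tag, tapped_at, video_length):
    """Create a highlight of default length around a single tap."""
    start = max(0.0, tapped_at - LEAD_IN)
    end = min(video_length, start + DEFAULT_LENGTH)
    return Highlight(tag, start, end)


def adjust(highlight, start=None, end=None):
    """Return a copy with edited bounds, e.g. while reviewing highlights later."""
    new_start = highlight.start if start is None else start
    new_end = highlight.end if end is None else end
    if not 0.0 <= new_start < new_end:
        raise ValueError("a highlight needs 0 <= start < end")
    return replace(highlight, start=new_start, end=new_end)


def playlist(highlights, tags=None):
    """Highlights to play back, in video order, optionally filtered by tag."""
    chosen = [h for h in highlights if tags is None or h.tag in tags]
    return sorted(chosen, key=lambda h: h.start)
```

With this shape, a single tap is enough while watching (`tag_moment("aha", 65.0, 600.0)`), and the trimming happens afterwards with `adjust`, so the extra attention is only spent on highlights you actually come back to.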
It might be good for the highlights to be private, but shareable (so that people don’t feel embarrassed about being confused about something, for example). A nice feature would be to find people who are confused about the same part. Or irritated. Or pleased (when you were irritated, or vice versa).
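One possible way to find people who reacted to the same part is to look for overlapping time ranges among the reactions that people have chosen to share. The sketch below is only an illustration under that assumption; `Reaction`, `overlaps` and `people_who_reacted` are hypothetical names, and private reactions are simply never passed in.

```python
from typing import NamedTuple


class Reaction(NamedTuple):
    user: str
    tag: str      # e.g. "huh?", "irritated", "pleased"
    start: float  # seconds from the start of the video
    end: float


def overlaps(a, b):
    """True if the two reactions cover at least one common moment."""
    return a.start < b.end and b.start < a.end


def people_who_reacted(mine, shared, tag=None):
    """Users whose shared reactions overlap mine.

    By default this matches my own tag (confused like me); pass a
    different tag to find, say, people who were pleased where I was
    irritated.
    """
    wanted = mine.tag if tag is None else tag
    return sorted({
        r.user
        for r in shared
        if r.user != mine.user and r.tag == wanted and overlaps(mine, r)
    })
```

Matching on overlap rather than on exact timestamps matters here, because two people confused by the same passage will almost never tap at the same second.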
I think I’d take the conversation elsewhere, making it easy to insert clips from the video for reference. The timeline might feel too confining, and the path of the conversation/argument is unlikely to follow the linear path of the video. Also, this way the focus would be on the conversation rather than on annotating the content (which has been done before).
A note: by default, I wouldn’t show other people’s remarks while you are watching the video.

