Comments to author (Associate Editor)
=====================================

I ask you to take the comments from the reviewers, especially those from Reviewer 5, into consideration in revising your paper.

Reviewer 5 of ITSC 2023 submission 347

Comments to the author
======================

The paper considers the task of predicting lane changes and turns at intersections by combining in-cabin and outside images. Compared to other methods, the main differences seem to be the addition of temporal integration, in the form of a recursive neural network, and the exploitation of context during training, the context being the lane the vehicle is currently in and whether or not it is close to an intersection. The first novelty is clearly explained, but the second is not.

The idea of temporal integration is a logical one, and it is implemented in a straightforward fashion. It is, however, not clear how much of the benefit of this approach is due to simply having more observations versus truly exploiting temporal scene changes. For instance, detecting the current lane becomes easier when multiple images are analyzed jointly, but this could also be achieved in post-processing, i.e., by first processing each image independently and then aggregating the output activations in a simple manner (e.g., averaging over one second); a minimal sketch of such a baseline is appended after these comments. The question then is: how much better does the more complicated proposed approach perform compared to this naive baseline? If the answer is "a lot", this would indicate that the network truly exploits the time dimension, e.g., by internally inferring the current speed of the vehicle, which obviously has predictive potential.

It is also not clear what the benefit is of fusing outside and in-cabin information at a very early stage. How would the results of the paper compare to fusing at a later stage?

The results indicate a small and barely significant improvement over state-of-the-art methods. For instance, the accuracy score in Table I is better by 0.02 compared to [7], whereas the standard deviation on that performance number is also 0.02. The results for the proposed method also seem to indicate that the outside observations contribute little to the overall performance (e.g., the difference in accuracy is below the tolerance).

The paper would be clearer if "traffic context" and "driver intent" were concretely defined in the introduction; as it stands, they only become clear much later.

Recursive approaches like the one in this paper are appropriate for predictable situations, but they may react poorly to unexpected scene changes, e.g., when driving in a narrow street and suddenly entering a main road. The longer the historic memory, the longer the network might take to recover from an incorrect assessment of the current scene; on the other hand, the shorter the memory, the weaker the network's predictive capabilities. It would be useful to look into this balance. Also, the network is presumably fed data at a fixed rate (tied to the camera frame rate), but in real life information flows at different rates: proportional to the speed of the vehicle for outside data, and with a much more complex dependency on that speed for in-cabin data. How does the system adapt to vehicle speed? How would it react if trained on data recorded at a standard speed but then evaluated on sequences with slower traffic?

Section D, "Context-consistency loss", is very poorly explained, and the error criterion seems rather ad hoc.
Is it correct that the only way this context is taken into account is at training time? If so, the method is not really exploiting context at inference.

Detailed comments

Several pieces of criticism in the paper are so vague as to be meaningless. For example: "This strategy may lead to suboptimal performance, as the individual processing steps might not be specifically designed for optimal integration of information from both sources." This is not convincing. Any method may have weaknesses, but such generic criticism does not belong in a scientific paper. Instead, the authors should provide more concrete arguments for this perceived weakness (or experimental evidence).

Or: "Directly optimizing the standard cross-entropy loss l_ce can lead to incorrect predictions of some scenarios into categories that conflict with the traffic context." Indeed it "can", but merely writing that does not make it true. Also, what is "a category that conflicts with the traffic context"? The whole Section D should be improved; it is not clear.

The statement "Furthermore, we observe that CEMFormer surpasses the other two methods in terms of the number of parameters." is confusing, as CEMFormer has fewer parameters and therefore does NOT surpass the other methods in that respect (which is a good thing).

The statement "These results indicate that the proper use of an episodic memory-guided architecture allows the model to learn complex spatial-temporal relationships from both in-cabin and external driving views." is also misleading: the results show that the proposed method works better, but this could equally be due to multi-observation integration without exploiting the time sequence of scene events. While it is possible or even likely that this time sequence has indeed been exploited, I do not think the paper actually provides evidence for that.

"This is in contrast to previous work such as [6], which claimed that outside views were not helpful for driver intention prediction tasks." True, but [7] shows the opposite. Why not mention that?

Model latency: latency should not be confused with throughput. For example, a network may take longer than desired to react to a new context because the recursive processing remembers the old context for too long. This "decision" latency could be long compared to the frame rate and is at least as important to study.

Why is the standard deviation on accuracy for the "in-cabin & external" case much larger for your method? Does this mean that it is much more sensitive to small variations in the training set, despite the smaller number of parameters?

Reviewer 14 of ITSC 2023 submission 347

Comments to the author
======================

In this work, the authors propose a technique to predict driver intention with a spatial-temporal transformer that takes in-cabin and external images as input. Furthermore, they implement a novel loss function and evaluate their approach on a publicly available dataset. The manuscript is well written, and the overall discussion is quite clear. The authors compare their approach with other state-of-the-art methods and report an ablation study demonstrating the effectiveness of the episodic memory and the context-consistency loss. In particular, their approach shows superior performance to the other methods considered. It would be appreciated if the authors could improve Fig. 1 by also reporting more details of the encoder part.
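Illustrative sketch of the naive per-frame baseline mentioned in Reviewer 5's comments above: each frame is classified independently and the output activations are averaged over roughly one second. The frame-level classifier, the 10 Hz frame rate, and the one-second window are assumptions made for illustration, not details taken from the submission.

    # Hypothetical per-frame baseline: classify each frame independently,
    # then average the class probabilities over about one second.
    import numpy as np

    def naive_temporal_baseline(frames, frame_model, fps=10, window_s=1.0):
        """Predict the intention class by averaging per-frame probabilities."""
        n = max(1, int(fps * window_s))                       # frames in the window
        window = frames[-n:]                                  # most recent ~1 s
        probs = np.stack([frame_model(f) for f in window])    # shape: (n, n_classes)
        mean_probs = probs.mean(axis=0)                       # naive aggregation
        return int(mean_probs.argmax())                       # predicted class

If the proposed model only matches such a baseline, the gain likely comes from simply having more observations; a clear margin over it would suggest that the temporal structure of the sequence is genuinely being exploited.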