\subsection{Intrinsic Rewards}
Most exploration approaches struggle when the reward for an action arrives only after a long delay (i.e., when rewards are sparse). Recall our room navigation example, and imagine a second goal \(\tilde{G}\) closer to the start, carrying a smaller reward. The algorithm will most likely stumble into this secondary goal first, and the action values around it will be updated relatively quickly to lead towards it.
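To make this mechanism explicit, here is a sketch of the one-step temporal-difference update of the action values; the exact variant (Q-learning is shown, Sarsa would look analogous) and the step size \(\alpha\) and discount \(\gamma\) are assumed from the earlier navigation example rather than fixed here:
\[
Q(s,a) \;\leftarrow\; Q(s,a) + \alpha \bigl( r + \gamma \max_{a'} Q(s',a') - Q(s,a) \bigr).
\]
Every episode that ends in \(\tilde{G}\) raises the value of the last action on that path, and subsequent episodes propagate this value one step further back along the path, which is why the arrows in the figure quickly start pointing towards \(\tilde{G}\).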
\begin{wrapfigure}{r}{5.5cm}
\begin{center}
\begin{tikzpicture}
\foreach \x in {0,1,2,3,4}{
\foreach \y in {0,1,2,3,4}{
\draw (\x,\y) rectangle +(1,1);
}
}
\node at (0.5,1.5){\(\tilde{G}\)};
\node at (1.5,1.5){\(\leftarrow\)};
\node at (1.5,2.5){\(\leftarrow\)};
\node at (2.5,1.5){\(\leftarrow\)};
\node at (3.5,1.5){\(\leftarrow\)};
\node at (2.5,0.5){\(\leftarrow\)};
\node at (0.5,2.5){\(\downarrow\)};
\node at (1.5,3.5){\(\downarrow\)};
\node at (2.5,2.5){\(\downarrow\)};
\node at (0.5,0.5){\(\uparrow\)};
\node at (1.5,0.5){S};
\node at (3.5, 4.5){G};
\end{tikzpicture}
\end{center}
\end{wrapfigure}
In the case of the \(\vep\)-greedy policy, the occasional exploratory actions only cause a one-step deviation from the shortest path to this secondary goal, and the algorithm uses its knowledge of the surroundings to quickly walk back towards it with subsequent greedy actions. Finding the larger goal \(G\), in contrast, requires multiple consecutive exploratory actions leading away from the secondary goal.
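A rough calculation shows how unlikely such an escape is. Assume, purely for illustration, that the exploratory step of \(\vep\)-greedy draws uniformly from the four moves, so a specific non-greedy move is taken with probability \(\vep/4\); reaching \(G\) from the neighbourhood of \(\tilde{G}\) then requires some number \(k\) of such moves in a row,
\[
\Pr[\text{$k$ consecutive exploratory moves towards } G] \approx \Bigl(\tfrac{\vep}{4}\Bigr)^{k},
\]
which for \(\vep = 0.1\) and \(k = 5\) is already of the order \(10^{-8}\). (The values \(\vep = 0.1\) and the four-action grid are illustrative assumptions, not quantities fixed in the text.)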
Answer 1
Wrapfig is easily confused by the environment it sits in and tends to be rather conservative about spacing. So here is what comes out if I reorganize your request into a more conventional one.
Following Leandriis's comment, I removed the top matter, which adds some whitespace, but kept the cut-off at [15]; if no caption is needed, this can be reduced to [14].
\documentclass[a4paper]{article}
\usepackage{lipsum,wrapfig,tikz}
\begin{document}
\subsection{Intrinsic Rewards}
Most exploration approaches struggle when the reward for an action arrives only after a long delay (i.e., when rewards are sparse). Recall our room navigation example, and imagine a second goal \(\tilde{G}\) closer to the start, carrying a smaller reward. The algorithm will most likely stumble into this secondary goal first, and the action values around it will be updated relatively quickly.
\begin{wrapfigure}[15]{r}{5.5cm}\centering% Title (Topmatter)
\begin{tikzpicture}
\foreach \x in {0,1,2,3,4}{
\foreach \y in {0,1,2,3,4}{
\draw (\x,\y) rectangle +(1,1);
}
}
\node at (0.5,1.5){\(\tilde{G}\)};
\node at (1.5,1.5){\(\leftarrow\)};
\node at (1.5,2.5){\(\leftarrow\)};
\node at (2.5,1.5){\(\leftarrow\)};
\node at (3.5,1.5){\(\leftarrow\)};
\node at (2.5,0.5){\(\leftarrow\)};
\node at (0.5,2.5){\(\downarrow\)};
\node at (1.5,3.5){\(\downarrow\)};
\node at (2.5,2.5){\(\downarrow\)};
\node at (0.5,0.5){\(\uparrow\)};
\node at (1.5,0.5){S};
\node at (3.5, 4.5){G};
\end{tikzpicture}
\small\textbf{Wrapfig[15] (Room for caption)}
\end{wrapfigure}
In the case of the \(\epsilon\)-greedy policy, the occasional exploratory actions only cause a one-step deviation from the shortest path to this secondary goal, and the algorithm uses its knowledge of the surroundings to quickly walk back towards it with subsequent greedy actions. Finding the larger goal \(G\), in contrast, requires multiple consecutive exploratory actions leading away from the secondary goal. \lipsum[66]
\end{document}
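As a usage note on the remark above: if the bold placeholder caption line is dropped, the reserved block can be shrunk by changing only the optional argument, roughly as in the sketch below (same grid figure as in the example, not compiled here).
\begin{wrapfigure}[14]{r}{5.5cm}\centering
\begin{tikzpicture}
% same grid, arrows, S, G and \(\tilde{G}\) nodes as above
\end{tikzpicture}
\end{wrapfigure}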