Rather than relying on specialized APIs, the system takes screenshots as its visual input and completes tasks by issuing virtual mouse and keyboard actions.
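The observe-then-act cycle this implies can be sketched as a small loop. The following is a minimal illustration under stated assumptions, not the system's actual implementation: the names `Click`, `TypeText`, `Done`, `run_episode`, and `demo` are hypothetical, the "screenshot" is a stand-in byte string, and the policy is a hand-written rule rather than a learned model. In a real setup, `capture` and `execute` would presumably be backed by an OS-level screenshot and input-injection layer.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


# Hypothetical action vocabulary: the only outputs the agent can produce.
@dataclass(frozen=True)
class Click:
    x: int
    y: int


@dataclass(frozen=True)
class TypeText:
    text: str


@dataclass(frozen=True)
class Done:
    pass


Action = Union[Click, TypeText, Done]


def run_episode(
    capture: Callable[[], bytes],
    policy: Callable[[bytes, str], Action],
    execute: Callable[[Action], None],
    task: str,
    max_steps: int = 10,
) -> List[Action]:
    """Repeat: take a screenshot, choose an action, perform it."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = capture()             # visual input only, no app API
        action = policy(screenshot, task)  # decide from pixels and the task
        if isinstance(action, Done):
            break
        execute(action)                    # virtual mouse or keyboard event
        history.append(action)
    return history


def demo() -> List[Action]:
    """Toy environment: a text field that must be clicked before typing."""
    screen = {"focused": False, "text": ""}

    def capture() -> bytes:
        # Stand-in for a real screenshot: a serialized view of the screen.
        return repr(sorted(screen.items())).encode()

    def policy(screenshot: bytes, task: str) -> Action:
        # Stand-in for a model: simple rules over what is "seen".
        seen = screenshot.decode()
        if "('focused', False)" in seen:
            return Click(x=100, y=40)
        if task not in seen:
            return TypeText(task)
        return Done()

    def execute(action: Action) -> None:
        # Stand-in for input injection: mutate the toy screen.
        if isinstance(action, Click):
            screen["focused"] = True
        elif isinstance(action, TypeText):
            screen["text"] += action.text

    return run_episode(capture, policy, execute, task="hello")


if __name__ == "__main__":
    for step in demo():
        print(step)
```

Passing `capture`, `policy`, and `execute` in as plain callables keeps the loop independent of any particular application, which mirrors the point of the sentence above: the agent needs only pixels in and input events out.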