Simulation Template

Data Collection

To train a model in the gym, we first need to write a simple battle loop that collects training data. The environment uses a framework similar to OpenAI's Gym: to advance to the next step, you call env.step(action).

def run_battle(env, randomize_attributes=False, random_policy=False):
    done = False
    # Per-step training data: states ("s"), actions ("a"), rewards ("r")
    data_collection = {"s": [], "a": [], "r": []}
    your_state, opponent_state = env.reset(randomize_attributes, random_policy)
    your_attributes = env.your_fighter["battle_attributes"]
    opponent_attributes = env.opponent_fighter["battle_attributes"]
    state = get_state(your_state, opponent_state, your_attributes, opponent_attributes)
    while not done:
        # Select an action with the model of env.fighters[0], then advance one step
        action = env.fighters[0]["model"].select_action(state)
        your_new_state, opponent_new_state, done, winner = env.step(action)

        reward = get_reward(your_state, your_new_state, opponent_state, opponent_new_state, winner)
        new_state = get_state(your_new_state, opponent_new_state, your_attributes, opponent_attributes)

        # Record the state the action was taken in, the action, and its reward
        data_collection["s"].append(state[0])
        data_collection["a"].append(action)
        data_collection["r"].append(reward)
        # Carry the new states over to the next iteration
        your_state = your_new_state.copy()
        opponent_state = opponent_new_state.copy()
        state = new_state.copy()

    return winner, data_collection

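This loop collects states, actions, and rewards as training data. However, we have not yet defined the get_reward function. A simple reward function is defined below, but we expect researchers to come up with more creative reward functions than this one.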

def get_reward(your_state, your_new_state, opponent_state, opponent_new_state, winner):
    # Change in health over this step (negative means damage was taken)
    opponent_health_delta = opponent_new_state["health"] - opponent_state["health"]
    your_health_delta = your_new_state["health"] - your_state["health"]

    # Small bonus for landing a hit, small penalty for getting hit
    hit_reward = (opponent_health_delta < 0) * 0.3
    get_hit_reward = (your_health_delta < 0) * -0.3

    # Larger reward or penalty once the battle has a winner
    result_reward = 0
    if winner == "You":
        result_reward = 2
    elif winner == "Opponent":
        result_reward = -2
    return result_reward + hit_reward + get_hit_reward
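
As a quick sanity check, the reward function above can be evaluated by hand on a few transitions. The sketch below repeats get_reward so that it runs on its own. The health values and the use of `None` as the winner during a mid-battle step are made-up assumptions for illustration, not values taken from the environment.

```python
def get_reward(your_state, your_new_state, opponent_state, opponent_new_state, winner):
    opponent_health_delta = opponent_new_state["health"] - opponent_state["health"]
    your_health_delta = your_new_state["health"] - your_state["health"]
    hit_reward = (opponent_health_delta < 0) * 0.3
    get_hit_reward = (your_health_delta < 0) * -0.3
    result_reward = 0
    if winner == "You":
        result_reward = 2
    elif winner == "Opponent":
        result_reward = -2
    return result_reward + hit_reward + get_hit_reward

# Hypothetical transitions (health numbers are invented for this example).
# You land a hit and take no damage: only the hit bonus applies.
landed_hit = get_reward({"health": 100}, {"health": 100},
                        {"health": 100}, {"health": 90}, None)

# Both fighters take damage: the bonus and the penalty cancel out.
traded = get_reward({"health": 100}, {"health": 95},
                    {"health": 100}, {"health": 90}, None)

# The final hit wins the battle: win reward plus the hit bonus.
winning_blow = get_reward({"health": 40}, {"health": 40},
                          {"health": 5}, {"health": 0}, "You")

print(landed_hit, traded, winning_blow)
```

Note that with this reward a trade of blows is worth the same as doing nothing, which is one place where a more creative reward function could differ.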

Now that we have the core battle loop for data collection, there are several ways to carry out training. Below, we define two templates that we have created. Both templates use the same training loop.

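import numpy as np

GAMMA = 0.95  # Discount factor for future rewards

def training_loop(env, episodes=100):
    for e in range(episodes):
        # Play one battle and collect its states, actions, and rewards
        winner, gameplay_data = run_battle(env)

        states = np.array(gameplay_data["s"])
        actions = np.array(gameplay_data["a"])
        # get_discounted_return is defined in the starter model
        discounted_return = get_discounted_return(gameplay_data["r"], GAMMA)

        env.fighters[0]["model"].train(states, actions, discounted_return)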

Check the starter model to see how we define the get_discounted_return function.
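
For readers without the starter model at hand, discounted returns are commonly computed by walking the rewards backwards and accumulating G_t = r_t + GAMMA * G_{t+1}. The sketch below follows that standard definition. It is an assumption about what get_discounted_return does, not the starter model's actual code, which may differ.

```python
import numpy as np

GAMMA = 0.95

def get_discounted_return(rewards, gamma):
    """Assumed definition: G_t = r_t + gamma * G_{t+1} for every step t."""
    returns = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    # Iterate from the last step to the first so each step can reuse G_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Three-step episode with invented rewards: nothing, a hit, then a win
example = get_discounted_return([0.0, 0.3, 2.0], GAMMA)
print(example)
```

The result has one entry per step, so it lines up with the states and actions arrays passed to train.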

<aside> ♻️ The core training loop repeats the following sequence:

Run battle → Collect data → Train

</aside>

One-Sided Reinforcement Learning (One-Sided RL)

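Most researchers will be familiar with this way of training an agent. The researcher focuses solely on improving a single agent and simulates that agent's behavior in a fixed environment. In AI Arena, the environment for this type of training consists of the game and an opponent agent that does not learn. In our starter template, we use a rule-based agent as the opponent.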

Self-Play

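In this training method, the researcher is in charge of training two models that eventually become copies of each other. The models are trained asynchronously at regular intervals and are continuously updated, which means a model always faces a better version of its prior self. To implement this, all we need to do is define the interval at which we swap which model is being trained. For more details, see Self-Play.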

SWAP_INTERVAL = 50

# Inside the training loop, where e is the episode index:
if (e + 1) % SWAP_INTERVAL == 0:
    env.swap_fighters()
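
To show where the swap check sits, the sketch below places it at the end of each episode of a training loop. `StubEnv`, `self_play_loop`, and the recorded swap episodes are hypothetical stand-ins so that the example runs on its own. Only `env.swap_fighters()` and `SWAP_INTERVAL` come from the snippet above, and the battle and training steps are left out as a comment.

```python
SWAP_INTERVAL = 50

class StubEnv:
    """Hypothetical stand-in for the real environment (illustration only)."""

    def __init__(self):
        self.fighters = ["model_a", "model_b"]

    def swap_fighters(self):
        # Assumed behavior: exchange which fighter sits in slot 0,
        # the slot whose model is trained in training_loop.
        self.fighters.reverse()

def self_play_loop(env, episodes=100):
    swap_episodes = []
    for e in range(episodes):
        # run_battle(env) and the model's train(...) call would go here,
        # exactly as in training_loop.
        if (e + 1) % SWAP_INTERVAL == 0:
            env.swap_fighters()
            swap_episodes.append(e + 1)
    return swap_episodes

env = StubEnv()
swaps = self_play_loop(env, episodes=120)
print(swaps)
```

Using `(e + 1)` rather than `e` means the first swap happens after a full interval of episodes has been played, not at episode zero.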
