Where exactly should I add the sleep?

June 8, 2024

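Rather than a fixed sleep, you can add an explicit wait in the part of the fetch_product_url() method that fetches the detail page, so that the code waits until a given element is visible. Here is one way to do it: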

python
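from time import sleep

import pandas as pd
import requests
from django.utils import timezone  # assumption: timezone.now() below is Django's
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from tqdm import tqdm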

# Detail-page fetching part of the fetch_product_url() method
def fetch_product_url(self):
    """Fetch the HTML and related data for each detail page.

    Returns:
        df: DataFrame holding the HTML information for the listing page at the current URL
    """

    df = pd.DataFrame()

    # Collect the detail-page links on the listing page once, then loop over them
    links = self.fetch_product_loop_url_atag()
    for link in tqdm(links, total=len(links)):
        # Read href before clicking; the element reference can go stale after navigation
        url = link.get_attribute("href")
        self.logger.info(f"detail_url:{url}")  # track the current detail page

        # Click the detail-page link and wait for the page to load
        try:
            link.click()
            # Explicit wait: block for up to 10 seconds until the element is visible
            WebDriverWait(self.driver, 10).until(
                EC.visibility_of_element_located((By.XPATH, '//*[@id="content"]'))
            )
        except Exception as e:
            self.logger.error(f"Error clicking detail page link: {e}")
            continue

        # Processing after the page has loaded
        try:
            df.at[0, "crawl_site_id"] = self.crawl_site_id
            df.at[0, "url"] = url
            df.at[0, "html_content"] = requests.get(url).text
            df.at[0, "html_sub_content"] = self.fetch_sub_content(link)
            df.at[0, "crawl_timestamp"] = timezone.now()
            self.update_or_create_crawl_data(df)  # write the crawl data to the table
            df = pd.DataFrame()  # free memory
            sleep(self.sleep_time)  # fixed pause between requests
        except Exception as e:
            self.logger.error(f"Error fetching HTML content for URL: {url}. Error: {e}")
            continue

    return df

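With this change, the code clicks the detail-page link and then waits up to 10 seconds for the element with id "content" to become visible, instead of pausing for a fixed time. Once the wait succeeds, the HTML content is fetched; if the wait times out, the error is logged and the loop moves on to the next link.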
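
To make the difference from a plain sleep concrete, here is a minimal sketch of the polling idea behind an explicit wait. It uses only the standard library; wait_until and element_is_visible are hypothetical names made up for this illustration and are not part of Selenium. The helper returns as soon as the condition holds and only raises once the timeout has passed, whereas a fixed sleep always waits the full duration.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Raises TimeoutError if `timeout` seconds pass without success.
    (Hypothetical helper for illustration; not Selenium's API.)
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)  # short pause between checks


# Stand-in for a page element that becomes visible on the third check.
checks = {"count": 0}


def element_is_visible():
    checks["count"] += 1
    return checks["count"] >= 3


# Returns after three quick checks instead of waiting the full 2 seconds.
wait_until(element_is_visible, timeout=2.0, poll=0.01)
```

In the real method, WebDriverWait(self.driver, 10).until(...) plays the role of wait_until, and the expected condition plays the role of element_is_visible.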

Posted by ぼっち