Can the .find_elements(By.TAG_NAME, "a") part be passed in as well? It won't always be By.TAG_NAME; sometimes it will be CLASS_NAME.

June 8, 2024

Yes, the locator passed to find_elements can also be specified dynamically. To do that, pass the locator as an argument to the fetch_product_url method and use it in the find_elements call.

The revised code is below.

python
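import pandas as pd
import requests
from time import sleep

from django.utils import timezone  # assumption: timezone.now() is Django's
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from tqdm import tqdm


class Fetch_product_url:
    '''Collects product URLs.

    elements is the locator of the wrapper and link_locator is the locator of
    the links inside it. Both are (By, value) tuples, e.g. (By.TAG_NAME, "a").
    '''

    def __init__(self, crawl_urls, crawl_site_id, elements, link_locator=(By.TAG_NAME, "a")):

        # ChromeDriver settings
        driver_path = "/home/NBSystem/chromedriver"
        chrome_options = webdriver.ChromeOptions()
        chrome_options.binary_location = '/usr/bin/google-chrome'
        chrome_options.add_argument("--headless")
        chrome_options.add_argument("--no-sandbox")
        chrome_options.add_argument("--disable-dev-shm-usage")
        service = Service(executable_path=driver_path)
        driver = webdriver.Chrome(options=chrome_options, service=service)

        self.driver = driver
        self.crawl_site_id = crawl_site_id
        self.elements = elements
        self.link_locator = link_locator

        # Crawl each start URL
        for crawl_url in crawl_urls:
            self.driver.get(crawl_url)
            concat_df = pd.DataFrame()
            count = 1
            while True:
                df = self.fetch_product_url(self.link_locator)
                # Append this page's rows
                concat_df = pd.concat([concat_df, df], axis=0)

                try:
                    # self.next is defined elsewhere in the class (not shown)
                    count += 1
                    self.next(str(count))
                except NoSuchElementException:
                    # No further page, so stop paging
                    break
                sleep(5)

            # Update duplicate records (method defined elsewhere, not shown)
            self.update_or_create_crawl_data(concat_df)

        self.driver.quit()

    # Fetch the URLs
    def fetch_product_url(self, link_locator):
        df = pd.DataFrame()
        # Find the wrapper first, then search inside it with the given locator
        wrapper = self.driver.find_element(*self.elements)
        for index, i in tqdm(enumerate(wrapper.find_elements(*link_locator))):
            href = i.get_attribute("href")
            df.at[index, "crawl_site_id"] = self.crawl_site_id
            df.at[index, "url"] = href
            df.at[index, "html_content"] = requests.get(href).text
            try:
                df.at[index, "html_sub_content"] = i.find_element(By.CLASS_NAME, "comment").text
            except NoSuchElementException:
                # This link has no "comment" element; leave the cell empty
                pass
            df.at[index, "crawl_timestamp"] = timezone.now()
            sleep(5)
        return df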

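With this change, the locator is passed to the fetch_product_url method as an argument and used in the find_elements call. A locator here is a (By, value) tuple such as (By.TAG_NAME, "a"), which * unpacks into the two arguments that find_elements expects.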
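The tuple-unpacking idea can be tried without a browser. The following is only a minimal sketch: FakeElement and collect_hrefs are hypothetical stand-ins invented for illustration, not part of Selenium, and the two strategy strings are assumed to match the values of Selenium's By.TAG_NAME and By.CLASS_NAME.

```python
# Strategy strings assumed to equal selenium's By.TAG_NAME / By.CLASS_NAME.
TAG_NAME = "tag name"
CLASS_NAME = "class name"


class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement (no browser needed)."""

    def __init__(self, tag, class_name="", href=None, children=()):
        self.tag = tag
        self.class_name = class_name
        self.href = href
        self.children = list(children)

    def _descendants(self):
        # Walk the subtree depth-first, in document order.
        for child in self.children:
            yield child
            yield from child._descendants()

    def find_elements(self, by, value):
        # Same call shape as WebElement.find_elements(by, value).
        if by == TAG_NAME:
            return [e for e in self._descendants() if e.tag == value]
        if by == CLASS_NAME:
            return [e for e in self._descendants()
                    if value in e.class_name.split()]
        raise ValueError(f"unsupported strategy: {by}")

    def get_attribute(self, name):
        return self.href if name == "href" else None


def collect_hrefs(wrapper, link_locator):
    """Unpack a (strategy, value) tuple into find_elements."""
    return [e.get_attribute("href")
            for e in wrapper.find_elements(*link_locator)]


wrapper = FakeElement("div", "wrapper", children=[
    FakeElement("a", "item", href="/p/1"),
    FakeElement("span", children=[FakeElement("a", "item sale", href="/p/2")]),
])

# The caller decides the strategy; collect_hrefs itself never changes.
by_tag = collect_hrefs(wrapper, (TAG_NAME, "a"))
by_class = collect_hrefs(wrapper, (CLASS_NAME, "sale"))
```

With real Selenium, the same tuples would be written as (By.TAG_NAME, "a") or (By.CLASS_NAME, "sale") and passed in when the class is constructed.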


Posted by ぼっち
