Web Scraping Example with Python + Playwright (Sync API)

Penguin Computing/Programming / 2023. 8. 29. 16:07

"""
XPath expression that extracts the titles of the blog posts
page.locator("//div[@class='entry']/div[@class='titleWrap']/h2/a").all_text_contents()

XPath expressions that extract the posting dates of the blog posts
page.locator("//div[contains(@class,'entry')]/div[@class='titleWrap']/div[@class='info']/span[@class='date']").first.text_content()
page.locator("//div[@class='entry']/div[@class='titleWrap']/div[@class='info']/span[@class='date']").last.text_content()
page.locator("//div[@class='entry']/div[@class='titleWrap']/div[@class='info']/span[@class='date']").all_text_contents()

CSS selector that extracts the posting date of a blog post
page.locator('div:nth-child(3) > div.titleWrap > div > span.date').text_content()

XPath expression that extracts the individual link URLs of the blog posts
link_locators = page.locator("//div[@class='entry']/div[@class='titleWrap']/h2/a").all()
for l_loc in link_locators:
    print(l_loc.get_attribute('href'))
    print(l_loc.text_content())

Note: inside locator() you can explicitly prefix the selector with css= or xpath=, but it is not required.
page.locator("xpath=/html/body/div[1]/div/div[2]/div[2]/div/div[3]/div[1]/div/span[2]").text_content()
"""

from playwright.sync_api import Playwright, sync_playwright
import os
from datetime import datetime

def run(playwright: Playwright) -> None:

    # Get the current working directory, with a trailing slash
    current_dir = os.getcwd()
    if current_dir[-1] != '/':
        current_dir = current_dir + '/'
    #print(current_dir)

    # Get the current date and time, formatted so it is safe to use in a file name
    current_datetime = datetime.now().strftime("%Y-%m-%d %H-%M-%S")
    #print("Current Date & Time : ", current_datetime)
    # strftime() already returns a str, so this conversion is a no-op
    str_current_datetime = str(current_datetime)

    ## Turn on the headless option so the browser window is not shown, and use the Chromium browser.
    ## With headless=False, the browser window is shown.
    browser = playwright.chromium.launch(headless=True, channel="chromium")
    context = browser.new_context()

    ## Open a new page (tab) in the browser context.
    page = context.new_page()

    ## Navigate to the URL below.
    page.goto("https://hook.tistory.com/")

    ## Take a screenshot of the page. Use the path= parameter to specify where it is saved.
    #page.screenshot(path=current_dir + 'capture/'+ f'screenshot-{str_current_datetime}.png')

    ## Find the title block of every post, then read the title, date, and link from each block.
    entry_locators = page.locator("//div[@class='entry']/div[@class='titleWrap']").all()
    for a_loc in entry_locators:
        print(a_loc.locator("//h2/a").text_content())
        print(a_loc.locator("//div[@class='info']/span[@class='date']").text_content())
        print(a_loc.locator("//h2/a").get_attribute('href'))
        print("----------------------------------------------------------------------------------------")

    ## Pause (opens the Playwright Inspector for debugging)
    #page.pause()

    # Close the browser
    context.close()
    browser.close()

# Main entry point
with sync_playwright() as playwright:
    run(playwright)
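
The commented-out screenshot call above builds its file path by concatenating strings and checking for a trailing slash by hand. As a rough sketch, the same path could be built with pathlib instead. The helper name `build_screenshot_path` and its `now` parameter are hypothetical and not part of the original script. The `capture` directory name and the timestamp format are taken from the script.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def build_screenshot_path(base_dir: Path, now: Optional[datetime] = None) -> Path:
    """Hypothetical helper: return base_dir/capture/screenshot-<timestamp>.png."""
    # Same timestamp format as the script above; it has no ':' so it is safe in file names.
    stamp = (now or datetime.now()).strftime("%Y-%m-%d %H-%M-%S")
    # The '/' operator on Path inserts the separator, so no trailing-slash check is needed.
    return base_dir / "capture" / f"screenshot-{stamp}.png"


if __name__ == "__main__":
    # Print the path a screenshot taken right now would be saved to.
    print(build_screenshot_path(Path.cwd()))
```

Inside run(), the result could then be passed straight to the screenshot call, for example page.screenshot(path=build_screenshot_path(Path.cwd())), since the path= parameter accepts either a str or a pathlib.Path.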

Posted by 훅크선장
