Mish Ushakov 1 year ago
Parent
Commit
3d29ea63ca
3 changed files with 25 additions and 12 deletions
  1. README.md (+20 −7)
  2. examples/codegen.ts (+4 −4)
  3. examples/streaming.ts (+1 −1)

+ 20 - 7
README.md

@@ -2,7 +2,7 @@
 
 
 <img width="1800" alt="Screenshot 2024-04-20 at 23 11 16" src="https://github.com/mishushakov/llm-scraper/assets/10400064/ab00e048-a9ff-43b6-81d5-2e58090e2e65">

-LLM Scraper is a TypeScript library that allows you to convert **any** webpages into structured data using LLMs.
+LLM Scraper is a TypeScript library that allows you to extract structured data from **any** webpage using LLMs.
 
 
 > [!TIP]
 > Under the hood, it uses function calling to convert pages to structured data. You can find more about this approach [here](https://til.simonwillison.net/gpt3/openai-python-functions-data-extraction)
@@ -14,7 +14,8 @@ LLM Scraper is a TypeScript library that allows you to convert **any** webpages
 - Full type-safety with TypeScript
 - Based on Playwright framework
 - Streaming objects
-- Supports 4 input modes:
+- **NEW** Code-generation
+- Supports 4 formatting modes:
   - `html` for loading raw HTML
   - `markdown` for loading markdown
   - `text` for loading extracted text (using [Readability.js](https://github.com/mozilla/readability))
@@ -137,15 +138,13 @@ await page.close()
 await browser.close()
 ```
 
 
-### Streaming
+## Streaming
 
 
 Replace your `run` function with `stream` to get a partial object stream (Vercel AI SDK only).

 ```ts
-// Run the scraper
-const { stream } = await scraper.stream(page, schema, {
-  format: 'html',
-})
+// Run the scraper in streaming mode
+const { stream } = await scraper.stream(page, schema)
 
 
 // Stream the result from LLM
 for await (const data of stream) {
@@ -153,6 +152,20 @@ for await (const data of stream) {
 }
 ```
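The partial object stream above is consumed with `for await`. As a self-contained sketch of that consumption pattern (the async generator below is a stand-in for the stream returned by `scraper.stream`, not the actual Vercel AI SDK object — names and shapes are illustrative):

```typescript
// Stand-in for the partial object stream returned by scraper.stream():
// each yielded value is a progressively more complete object.
async function* partialObjectStream() {
  yield { top: [{ title: 'First story' }] }
  yield { top: [{ title: 'First story' }, { title: 'Second story' }] }
}

async function main() {
  const sizes: number[] = []
  for await (const data of partialObjectStream()) {
    sizes.push(data.top.length) // each partial result grows
  }
  console.log(sizes.join(',')) // prints "1,2"
}

main()
```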
+## NEW: Code-generation
+
+Using the `generate` function, you can generate a re-usable Playwright script that scrapes the page according to a schema.
+
+```ts
+// Generate code and run it on the page
+const { code } = await scraper.generate(page, schema)
+const result = await page.evaluate(code)
+const data = schema.parse(result)
+
+// Show the parsed result
+console.log(data.news)
+```
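The generated `code` is a JavaScript string that Playwright runs in the page context. As a rough self-contained sketch of the evaluate-then-validate step (`new Function` here stands in for `page.evaluate`, and the manual shape check stands in for `schema.parse`; both are illustrative assumptions, not the library's API):

```typescript
// A hypothetical generated script: in the real flow, page.evaluate(code)
// executes a string like this in the browser and returns its value.
const generatedCode = `return { news: [{ title: 'Show HN: example', points: 42 }] }`

// Stand-in for page.evaluate(code): run the script string locally.
const result = new Function(generatedCode)() as {
  news: { title: string; points: number }[]
}

// Stand-in for schema.parse(result): reject values with the wrong shape.
if (!Array.isArray(result.news) || typeof result.news[0]?.title !== 'string') {
  throw new Error('result does not match schema')
}

console.log(result.news[0].title) // prints "Show HN: example"
```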
+
 ## Contributing

 As an open-source project, we welcome contributions from the community. If you are experiencing any bugs or want to add some improvements, please feel free to open an issue or pull request.

+ 4 - 4
examples/codegen.ts

@@ -27,15 +27,15 @@ const schema = z.object({
   ),
 })
 
 
-// Run the scraper
+// Generate code and run it on the page
 const { code } = await scraper.generate(page, schema)
 console.log('code', code)
 
 
 const result = await page.evaluate(code)
-const validated = schema.parse(result)
+const data = schema.parse(result)
 
 
-// Show the result from LLM
-console.log('result', validated.news)
+// Show the parsed result
+console.log('result', data)
 
 
 await page.close()
 await browser.close()

+ 1 - 1
examples/streaming.ts

@@ -31,7 +31,7 @@ const schema = z.object({
     .describe('Top 5 stories on Hacker News'),
 })
 
 
-// Run the scraper
+// Run the scraper in streaming mode
 const { stream } = await scraper.stream(page, schema, {
   format: 'html',
 })