I want to scrape a website asynchronously through a list of Tor circuits with different exit nodes, making sure each exit node makes at most one request every 5 seconds.

For testing purposes, I'm using https://books.toscrape.com/ and lowering the sleep time, the number of circuits, and the number of pages to scrape.

I'm getting the following two errors when I use the --tor argument; both are related to the torpy package.

'TorWebScraper' object has no attribute 'circuits'
'_GeneratorContextManager' object has no attribute 'create_stream'

Here is the relevant code causing the error:

async with aiohttp.ClientSession() as session:
    for circuit in self.circuits:
        async with circuit.create_stream() as stream:
            async with session.get(url, proxy=stream.proxy) as response:
                await asyncio.sleep(20e-3)
                text = await response.text()
                return url, text
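
A note on the second error: `_GeneratorContextManager` is the object returned by any function decorated with `contextlib.contextmanager`, so the message suggests the circuit factory was called but never entered with a `with` statement. The sketch below reproduces that message without torpy. `FakeCircuit`, `FakeStream` and `create_circuit` are hypothetical stand-ins, not torpy's real classes, and the assumption that torpy's circuit creation works this way is inferred only from the error text. The first error most likely just means `self.circuits` is never assigned before the loop runs.

```python
from contextlib import contextmanager


class FakeStream:
    """Hypothetical stand-in for a stream; not torpy's real class."""

    def __init__(self, target):
        self.target = target


class FakeCircuit:
    """Hypothetical stand-in for a circuit; not torpy's real class."""

    @contextmanager
    def create_stream(self, target):
        # A real implementation would open the stream here and close it afterwards.
        yield FakeStream(target)


@contextmanager
def create_circuit():
    # A real implementation would build the circuit here and tear it down afterwards.
    yield FakeCircuit()


def broken():
    # The factory is called but never entered, so `circuit` is the
    # context manager wrapper, not the circuit itself.
    circuit = create_circuit()
    try:
        circuit.create_stream(("books.toscrape.com", 80))
    except AttributeError as exc:
        return str(exc)
    return None


def fixed():
    # Entering both context managers with a plain `with` yields the real objects.
    with create_circuit() as circuit:
        with circuit.create_stream(("books.toscrape.com", 80)) as stream:
            return stream.target


print(broken())
print(fixed())
```

If that is what is happening, storing already-entered circuits in `self.circuits` (and using a plain `with` rather than `async with` for generator-based context managers) should make both messages go away.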



1 Answer

Best answer

I've managed to do this by creating subprocesses. The problem with this approach is that I'm using three hops when there is no need for it:

import subprocess


def main() -> None:
    for i in range(1, 11):
        # Run one worker per part through torsocks with automatic isolation (-i).
        cmd = [
            "torsocks", "-i",
            "python", "download_and_extract_metadata.py",
            "--download", "--metadata",
            "-o", "/mnt/T/fanfic/download",
            "--divide", "10",
            "--part", str(i),
        ]
        subprocess.Popen(cmd)


if __name__ == "__main__":
    main()
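
For anyone who still wants the single-process asyncio route, the "one request per exit node every 5 seconds" requirement could be sketched roughly as follows. This is only an outline under assumptions: `circuit_worker`, `scrape` and `fake_fetch` are hypothetical names, circuits are represented by plain labels, and `fake_fetch` is a placeholder for whatever actually sends the request through a given circuit. Each circuit gets its own worker that pulls URLs from a shared queue and waits until its own interval has elapsed before sending the next request.

```python
import asyncio
import time


async def circuit_worker(name, queue, fetch, interval, results):
    """Take URLs from the shared queue, starting at most one request per `interval` seconds."""
    next_allowed = 0.0
    while True:
        try:
            url = queue.get_nowait()
        except asyncio.QueueEmpty:
            return  # nothing left to scrape
        # Wait until this circuit is allowed to send again.
        delay = next_allowed - time.monotonic()
        if delay > 0:
            await asyncio.sleep(delay)
        next_allowed = time.monotonic() + interval
        body = await fetch(name, url)
        results.append((name, url, body))


async def scrape(urls, circuit_names, fetch, interval=5.0):
    """Spread `urls` over one worker per circuit and collect (circuit, url, body) tuples."""
    queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    results = []
    workers = [
        circuit_worker(name, queue, fetch, interval, results)
        for name in circuit_names
    ]
    await asyncio.gather(*workers)
    return results


async def fake_fetch(name, url):
    """Placeholder for the real request sent through the circuit called `name`."""
    await asyncio.sleep(0)
    return f"<html for {url} via {name}>"


if __name__ == "__main__":
    pages = ["https://books.toscrape.com/"] * 4
    # Short interval for testing, as in the question.
    for row in asyncio.run(scrape(pages, ["circuit-1", "circuit-2"], fake_fetch, interval=0.02)):
        print(row)
```

Because the workers run concurrently, different exit nodes can send requests at the same time while each one individually stays under the limit.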
Contributions licensed under CC0