execution errors #1
Hi @Rumperuu
I am trying to run the transcripts spider, but I am getting lots of errors.
First, the same error keeps appearing all over the output:
Traceback (most recent call last):
  File "C:\Users\k0st7as\AppData\Local\Continuum\Anaconda3\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "C:\Users\k0st7as\AppData\Local\Continuum\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "C:\Users\k0st7as\AppData\Local\Continuum\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "C:\Users\k0st7as\AppData\Local\Continuum\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\k0st7as\AppData\Local\Continuum\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\k0st7as\Documents\Scraping_Alpha\Scraping_Alpha\spiders\transcript_spider.py", line 109, in parse_transcript
    titleAndDate = chunks[i].css('p::text').extract[1]
TypeError: 'method' object is not subscriptable
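If I am reading the traceback right, the problem is that `extract` is never actually called, so `extract[1]` tries to index the method object itself. I would expect line 109 of transcript_spider.py to need something like this (a sketch only; I am guessing at the surrounding loop, and `chunks` is whatever the spider already builds):

```python
for i in range(len(chunks)):
    # `extract` is a method: call it first, then index the list it returns.
    texts = chunks[i].css('p::text').extract()
    if len(texts) < 2:
        # skipping short chunks would also avoid the IndexErrors
        # counted in the stats further down
        continue
    titleAndDate = texts[1]
```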
Second, the spider stops before it has parsed all 4193 pages of earnings calls. This is what I get:
2017-02-13 10:24:17 [scrapy.core.engine] INFO: Closing spider (finished)
2017-02-13 10:24:17 [scrapy.extensions.feedexport] INFO: Stored json feed (65 items) in: transcripts.json
2017-02-13 10:24:17 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 55793,
'downloader/request_count': 125,
'downloader/request_method_count/GET': 125,
'downloader/response_bytes': 10500279,
'downloader/response_count': 125,
'downloader/response_status_count/200': 96,
'downloader/response_status_count/403': 29,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 2, 13, 10, 24, 17, 574346),
'item_scraped_count': 65,
'log_count/DEBUG': 191,
'log_count/ERROR': 27,
'log_count/INFO': 44,
'request_depth_max': 4,
'response_received_count': 125,
'scheduler/dequeued': 125,
'scheduler/dequeued/memory': 125,
'scheduler/enqueued': 125,
'scheduler/enqueued/memory': 125,
'spider_exceptions/IndexError': 7,
'spider_exceptions/TypeError': 20,
'start_time': datetime.datetime(2017, 2, 13, 10, 16, 43, 876698)}
2017-02-13 10:24:17 [scrapy.core.engine] INFO: Spider closed (finished)
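The 29 responses with status 403 suggest the site is rejecting some of the requests, which would explain the missing items. Would something like this in settings.py be the right approach? (These are standard Scrapy settings, but the values are my guesses and untested against the site.)

```python
# settings.py (sketch; values are guesses, not tuned for Seeking Alpha)

# 403 is not in Scrapy's default retry list, so these responses are
# currently dropped rather than retried.
RETRY_ENABLED = True
RETRY_HTTP_CODES = [403, 500, 502, 503, 504]
RETRY_TIMES = 5

# Slow down, and let AutoThrottle back off further when the site pushes back.
DOWNLOAD_DELAY = 2.0
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0
```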
Third, the extracted earnings call transcripts are not complete: the spider only extracts the first page of each transcript. I have attached my transcripts.json file so you can see what happens. Adding a rotating user agent gives more results; a sketch of what I mean follows.
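By a rotating user agent I mean a downloader middleware along these lines (a minimal sketch; the agent strings are just examples, and the module path assumes the Scraping_Alpha project layout from the traceback above):

```python
# middlewares.py
import random

# Illustrative examples only; any pool of real browser strings would do.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12) AppleWebKit/602.3.12 '
    '(KHTML, like Gecko) Version/10.0.2 Safari/602.3.12',
    'Mozilla/5.0 (X11; Linux x86_64; rv:51.0) Gecko/20100101 Firefox/51.0',
]

class RotateUserAgentMiddleware(object):
    """Set a randomly chosen User-Agent header on every outgoing request."""
    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
```

enabled in settings.py with:

```python
DOWNLOADER_MIDDLEWARES = {
    'Scraping_Alpha.middlewares.RotateUserAgentMiddleware': 400,
}
```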
I wonder whether implementing a login within Scrapy would let the spider collect full transcripts by appending "?part=single" to the URL.
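For the login, the usual Scrapy pattern seems to be FormRequest.from_response from a login callback; this is roughly what I have in mind (the login URL and form field names here are guesses on my part, not taken from the actual site):

```python
import scrapy

class TranscriptLoginSpider(scrapy.Spider):
    name = 'transcript_login_sketch'
    # Hypothetical login URL -- I have not checked the real one.
    login_url = 'https://seekingalpha.com/login'

    def start_requests(self):
        yield scrapy.Request(self.login_url, callback=self.login)

    def login(self, response):
        # Field names 'email'/'password' are assumptions about the form.
        return scrapy.FormRequest.from_response(
            response,
            formdata={'email': 'user@example.com', 'password': 'secret'},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Scrapy carries the session cookies forward automatically, so the
        # single-page version of a transcript can now be requested directly.
        url = ('https://seekingalpha.com/article/0000000-example-transcript'
               '?part=single')
        yield scrapy.Request(url, callback=self.parse_transcript)

    def parse_transcript(self, response):
        # Placeholder: the real parsing lives in transcript_spider.py.
        self.logger.info('Fetched %s', response.url)
```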