execution errors #1

Open
opened 2017-02-13 16:24:22 +00:00 by k0st7as · 0 comments
k0st7as commented 2017-02-13 16:24:22 +00:00 (Migrated from github.com)

Hi @Rumperuu

I am trying to run the transcripts spider but I am getting lots of errors.

First, I am getting the following identical error all over the place.

Traceback (most recent call last):
File "C:\Users\k0st7as\AppData\Local\Continuum\Anaconda3\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
yield next(it)
File "C:\Users\k0st7as\AppData\Local\Continuum\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
for x in result:
File "C:\Users\k0st7as\AppData\Local\Continuum\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 22, in
return (_set_referer(r) for r in result or ())
File "C:\Users\k0st7as\AppData\Local\Continuum\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in
return (r for r in result or () if _filter(r))
File "C:\Users\k0st7as\AppData\Local\Continuum\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in
return (r for r in result or () if _filter(r))
File "C:\Users\k0st7as\Documents\Scraping_Alpha\Scraping_Alpha\spiders\transcript_spider.py", line 109, in parse_transcript
titleAndDate = chunks[i].css('p::text').extract[1]
TypeError: 'method' object is not subscriptable
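
For what it's worth, this looks like a missing pair of parentheses on line 109 of transcript_spider.py: extract is a method, so it has to be called before indexing the list it returns. Something like the following (my guess, which also guards against the IndexErrors shown in the stats below) makes the error go away:

    # transcript_spider.py, line 109 (original):
    #   titleAndDate = chunks[i].css('p::text').extract[1]
    # extract is a method; call it, then index the list it returns:
    texts = chunks[i].css('p::text').extract()
    if len(texts) > 1:
        titleAndDate = texts[1]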

Second, the spider stops before it parses all 4,193 pages of earnings calls. This is what I get:

2017-02-13 10:24:17 [scrapy.core.engine] INFO: Closing spider (finished)
2017-02-13 10:24:17 [scrapy.extensions.feedexport] INFO: Stored json feed (65 items) in: transcripts.json
2017-02-13 10:24:17 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 55793,
'downloader/request_count': 125,
'downloader/request_method_count/GET': 125,
'downloader/response_bytes': 10500279,
'downloader/response_count': 125,
'downloader/response_status_count/200': 96,
'downloader/response_status_count/403': 29,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 2, 13, 10, 24, 17, 574346),
'item_scraped_count': 65,
'log_count/DEBUG': 191,
'log_count/ERROR': 27,
'log_count/INFO': 44,
'request_depth_max': 4,
'response_received_count': 125,
'scheduler/dequeued': 125,
'scheduler/dequeued/memory': 125,
'scheduler/enqueued': 125,
'scheduler/enqueued/memory': 125,
'spider_exceptions/IndexError': 7,
'spider_exceptions/TypeError': 20,
'start_time': datetime.datetime(2017, 2, 13, 10, 16, 43, 876698)}
2017-02-13 10:24:17 [scrapy.core.engine] INFO: Spider closed (finished)
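
The 29 403 responses in the stats suggest the site is throttling the crawl, which would also explain why the spider finishes after only 125 requests. As a starting point (the values here are guesses), slowing the spider down and retrying 403s in settings.py might help; these are all standard Scrapy settings:

    # settings.py
    DOWNLOAD_DELAY = 2              # pause between requests to the same domain
    AUTOTHROTTLE_ENABLED = True     # adapt the delay to server responsiveness
    RETRY_ENABLED = True
    RETRY_HTTP_CODES = [403, 429, 500, 502, 503, 504]  # also retry throttled pages
    RETRY_TIMES = 3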

Third, the extracted earnings call transcripts are not complete: the spider only extracts the first page of each transcript. I have attached my transcripts.json file so you can see what happens. Adding a rotating user agent gives more results (sketch below).
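
The rotating user agent I mention is just a small downloader middleware that picks a random User-Agent header per request; the class name and the list of strings here are mine:

    # middlewares.py -- minimal rotating user-agent sketch
    import random

    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12)',
        'Mozilla/5.0 (X11; Linux x86_64)',
    ]

    class RotateUserAgentMiddleware(object):
        def process_request(self, request, spider):
            # overwrite the User-Agent header on every outgoing request
            request.headers['User-Agent'] = random.choice(USER_AGENTS)

It is enabled with DOWNLOADER_MIDDLEWARES = {'Scraping_Alpha.middlewares.RotateUserAgentMiddleware': 400} in settings.py.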

I wonder whether implementing a login function within Scrapy would allow the spider to collect full transcripts by appending "?part=single" to the URL.
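
If it helps, Scrapy's FormRequest.from_response is the usual way to log in before crawling. A rough sketch of what I have in mind is below; the form field names and the article URL are placeholders, since I have not inspected the actual login form:

    # transcript_spider.py -- hypothetical login sketch
    import scrapy

    class TranscriptSpider(scrapy.Spider):
        name = 'transcripts'
        start_urls = ['https://seekingalpha.com/account/login']

        def parse(self, response):
            # fill in and submit the login form found on the page
            return scrapy.FormRequest.from_response(
                response,
                formdata={'email': 'me@example.com', 'password': 'secret'},
                callback=self.after_login,
            )

        def after_login(self, response):
            # once authenticated, request the single-page transcript view
            url = response.urljoin('/article/example-transcript?part=single')
            yield scrapy.Request(url, callback=self.parse_transcript)

        def parse_transcript(self, response):
            pass  # the existing parsing logic would go here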
