This repository has been archived on 2022-08-01. You can view files and clone it, but cannot push or open issues or pull requests.
2016-12-27 00:45:46 +00:00

Scraping Alpha

Author

Ben Goldsworthy <email> <website>

Version

1.0

Abstract

Scraping Alpha is a series of Python scripts used to scrape Seeking Alpha earnings call transcripts and produce SQL from them.

It was created for Dr Lars Hass of the Lancaster University Management School.

Usage

The instructions for each step of the process can be found at the beginning of each of the files involved: transcript_spider.py, JSONtoSQL.py and execsAndAnalysts.py. They are repeated here for convenience.

transcript_spider.py

This file is the webspider that Scrapy uses to retrieve the information from the website. Left unattended, it will scrape all 4,000+ pages of results. To interrupt this behaviour and still be able to proceed with the other steps, cancel the script with CTRL+Z. This will likely leave an unfinished JSON item at the end of the output file. To clear this up, open the file in vim and type the following keys:

G
V
d
$
i
BACKSPACE
ENTER
]
ESC
:wq
ENTER

This will truncate the file at the last complete record and seal it off.
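
As an alternative to the vim keystrokes, the same clean-up can be sketched in Python. This is a minimal sketch and not part of the original scripts: the names salvage_json_array and repair_file are hypothetical, and it assumes the output file is a single JSON array that fits in memory.

```python
import json


def salvage_json_array(text):
    """Return every complete item from a possibly truncated JSON array."""
    decoder = json.JSONDecoder()
    items = []
    pos = text.find("[")
    if pos == -1:
        return items
    pos += 1
    while True:
        # Skip whitespace and the commas that separate items.
        while pos < len(text) and text[pos] in " \t\r\n,":
            pos += 1
        if pos >= len(text) or text[pos] == "]":
            break
        try:
            item, pos = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            # The half-written item left behind by the interruption.
            break
        items.append(item)
    return items


def repair_file(path):
    """Rewrite the file at `path` (hypothetical helper) as a valid JSON array."""
    with open(path, encoding="utf-8") as handle:
        items = salvage_json_array(handle.read())
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(items, handle)
    return len(items)
```

Calling repair_file("transcripts.json") would overwrite the file in place and return the number of records kept, so it is worth copying the file somewhere safe first.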

For Scrapy installation instructions, see here. This file should be in the spiders directory of the project, and is run from the command line via scrapy crawl transcripts -o transcripts.json (the output file is written to the current working directory).

JSONtoSQL.py

This file takes the transcripts.json file output of transcript_spider.py and converts it into SQL.

This file should be located in the same directory as transcripts.json, and is run via python JSONtoSQL.py > [FILE].sql, where [FILE] is the desired name of the output file.
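
To illustrate the kind of conversion this step performs, here is a minimal sketch. It is not the code in JSONtoSQL.py: the helper names sql_literal and insert_statement, the company field and the escaping rules (MySQL-style, with backslashes and single quotes doubled) are all assumptions made for illustration. Only the transcripts table name and the id column appear elsewhere in this README.

```python
import json


def sql_literal(value):
    """Render a Python value as a SQL literal (assumed MySQL-style escaping)."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):
        return "1" if value else "0"
    if isinstance(value, (int, float)):
        return str(value)
    if not isinstance(value, str):
        # Lists and dicts are stored as their JSON text in this sketch.
        value = json.dumps(value)
    escaped = value.replace("\\", "\\\\").replace("'", "''")
    return "'" + escaped + "'"


def insert_statement(table, row):
    """Build one INSERT statement from a dict of column name to value."""
    columns = ", ".join("`%s`" % name for name in row)
    values = ", ".join(sql_literal(value) for value in row.values())
    return "INSERT INTO `%s` (%s) VALUES (%s);" % (table, columns, values)


# Hypothetical record; the real transcripts.json fields may differ.
example = {"id": 1, "company": "O'Neil Corp"}
statement = insert_statement("transcripts", example)
```

Printing one such statement per record and redirecting the output to a file would give something importable in the same way as the real script's output.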

execsAndAnalysts.py

First, import the output file of JSONtoSQL.py to your chosen DBMS (I've tested it with phpMyAdmin). Then, run the following query:

SELECT `id`, `execs`, `analysts` FROM `transcripts`

Export the resulting table (instructions) to transcripts.sql, and place the file in the same directory as execsAndAnalysts.py. Run it with python execsAndAnalysts.py.

From this, it creates two files (execs.sql and analysts.sql). Import them into your DBMS to create two linking tables. The final instruction of analysts.sql deletes the superfluous execs and analysts columns from the transcripts table, so execs.sql must be imported first.

Future

Harvesting the URLs of slide images shouldn't be too hard to implement - slides_spider.py should in theory do this, but the link to a transcript's slides is added to the page later via JavaScript, which means at the moment it throws up a load of HTTP 200 status codes and nowt else. Scrapy+Splash may be the solution, however.