# ORSR Scraper

This application retrieves all records that were changed in ORSR (the Slovak business register, orsr.sk) on the current day.

The application consists of two parts:

### 1. Scraper:

- gets the data of all changed records
- downloads either the "aktuálna" (current) or the "úplná" (full) version of each record
- can use a SOCKS5 proxy
- stores the data in MongoDB

### 2. Flask app:

- Minimalistic Flask app that has two endpoints:
    - `/detail` with parameter `ico`
        - returns the JSON data for the record with the given `ico`
    - `/list`
        - returns a paginated list of records with their `ico` and `obchodneMeno`

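This README does not say how `/list` paginates its results. A minimal sketch of one common approach, page-number pagination, is shown here. The `page` and `per_page` parameters, the `paginate` helper and the sample records are all hypothetical and only illustrate which fields a list entry would carry.

```python
def paginate(records, page=1, per_page=2):
    """Return one page of records, keeping only ico and obchodneMeno.

    `page` and `per_page` are assumed names, not taken from the app.
    """
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be positive")
    start = (page - 1) * per_page
    items = [
        {"ico": r["ico"], "obchodneMeno": r["obchodneMeno"]}
        for r in records[start:start + per_page]
    ]
    total_pages = -(-len(records) // per_page)  # ceiling division
    return {"page": page, "pages": total_pages, "items": items}


# Made-up sample data; real records would come from MongoDB.
SAMPLE = [
    {"ico": "00000001", "obchodneMeno": "Example A s.r.o.", "extra": "x"},
    {"ico": "00000002", "obchodneMeno": "Example B a.s.", "extra": "y"},
    {"ico": "00000003", "obchodneMeno": "Example C s.r.o.", "extra": "z"},
]

if __name__ == "__main__":
    print(paginate(SAMPLE, page=2, per_page=2))
```

In a MongoDB-backed app the same idea is usually expressed with a skip and a limit on the query instead of slicing a list in memory.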
## Setup

### Prerequisites

You need to have installed, or have access to:

- a current version of Python
- MongoDB
- a SOCKS5 proxy (optional)

Installing these is out of scope of this README.

### 1. Download the app

Download or clone the application.

### 2. venv and requirements

Open a terminal, `cd` to the app folder and create a virtual environment:

```
cd [appPath]
python -m venv venv
```

Install the requirements from `requirements.txt`:

```
venv/bin/pip install -r requirements.txt

# for Windows:
venv\Scripts\pip.exe install -r requirements.txt
```

### 3. Config File

There is a default config file, `config_base.cfg`.

For local changes, copy this base config file and save it as `config.cfg`. The config file has the following structure:

```
[DB]
MONGODB_URI = mongodb://localhost:27017
MONGODB_DB = softone
MONGODB_COLLECTION = orsr

[WEB]
BASE_URL = https://www.orsr.sk/
ENDPOINT = hladaj_zmeny.asp

[PROXY]
#HTTP_PROXY = socks5://user:pass@host:port
#HTTPS_PROXY = socks5://user:pass@host:port

[APP]
THREADS = 8
```

Set up the connection to MongoDB, the number of threads used for collecting the data and, optionally, the SOCKS5 proxy parameters.

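To illustrate how the values in this file fit together, here is a minimal sketch that reads it with Python's standard `configparser`. It is not the application's own loading code: the helper names `load_settings` and `settings_from_parser` are hypothetical, and reading `config_base.cfg` first and letting `config.cfg` override it is an assumption.

```python
import configparser
from urllib.parse import urljoin


def settings_from_parser(parser):
    """Turn a parsed config into a plain dict of settings."""
    proxies = {}
    if parser.has_section("PROXY"):
        # Both proxy lines are commented out by default, so they may be absent.
        for option, scheme in (("HTTP_PROXY", "http"), ("HTTPS_PROXY", "https")):
            value = parser.get("PROXY", option, fallback=None)
            if value:
                proxies[scheme] = value
    return {
        "mongodb_uri": parser.get("DB", "MONGODB_URI"),
        "mongodb_db": parser.get("DB", "MONGODB_DB"),
        "mongodb_collection": parser.get("DB", "MONGODB_COLLECTION"),
        # BASE_URL and ENDPOINT combine into the URL that gets requested.
        "url": urljoin(parser.get("WEB", "BASE_URL"), parser.get("WEB", "ENDPOINT")),
        "threads": parser.getint("APP", "THREADS"),
        "proxies": proxies,
    }


def load_settings(paths=("config_base.cfg", "config.cfg")):
    """Read the config files; missing files are skipped, later ones win (assumed order)."""
    # interpolation=None keeps '%' characters in passwords or URLs literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(paths)
    return settings_from_parser(parser)
```

With the default file shown above, `url` would be `https://www.orsr.sk/hladaj_zmeny.asp`, `threads` would be `8` and `proxies` would be empty until the `[PROXY]` lines are uncommented.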
## Run the applications

### 1. Scraper

Run the scraper with:

```
venv/bin/python scraper.py

# for Windows:
venv\Scripts\python.exe scraper.py
```

It will ask whether you want to download the "aktuálny" (current) or the "úplný" (full) record.

Note: the tqdm progress bar sometimes continues on a new line when it is used together with the ThreadPool.

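For readers unfamiliar with the pattern, the sketch below shows how a thread pool of `THREADS` workers can collect many records in parallel. It is only an illustration under stated assumptions: `fetch_record` and `collect` are hypothetical names, the fetch is a stand-in that performs no network request, and the real `scraper.py` may be organised differently.

```python
from multiprocessing.pool import ThreadPool


def fetch_record(record_id):
    """Hypothetical stand-in for downloading and parsing one record."""
    return {"id": record_id, "status": "ok"}


def collect(record_ids, threads=8):
    """Fetch all records using a pool of worker threads."""
    with ThreadPool(threads) as pool:
        # imap_unordered yields each result as soon as its worker finishes,
        # so the results are not guaranteed to be in input order.
        return list(pool.imap_unordered(fetch_record, record_ids))


if __name__ == "__main__":
    results = collect(range(20), threads=8)
    print(len(results), "records collected")
```

A progress bar is typically added by wrapping the result iterator, for example `tqdm(pool.imap_unordered(...), total=len(ids))`. Output written by worker threads while the bar is being redrawn is one plausible reason for the bar jumping to a new line.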
### 2. Flask

Start the Flask application:

```
venv/bin/python flaskapp.py

# for Windows:
venv\Scripts\python.exe flaskapp.py
```

You can now get the data from the local test server, which usually runs on `http://127.0.0.1:5000`.
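
As a rough illustration of querying the running server from Python, the sketch below builds the two endpoint URLs and fetches JSON using only the standard library. It assumes that `ico` is passed as a query-string parameter and that both endpoints return JSON. The helper names and the ICO value in the usage comment are placeholders.

```python
import json
from urllib.parse import urlencode, urljoin
from urllib.request import urlopen

BASE = "http://127.0.0.1:5000"  # default address of the local test server


def detail_url(ico, base=BASE):
    """URL for one record; assumes ico is sent as a query-string parameter."""
    return urljoin(base, "/detail") + "?" + urlencode({"ico": ico})


def list_url(base=BASE):
    """URL for the paginated list of records."""
    return urljoin(base, "/list")


def get_json(url, timeout=10):
    """Fetch a URL and decode the response body as JSON."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Usage, with the Flask app running locally (12345678 is a placeholder ICO):
#   record = get_json(detail_url("12345678"))
#   listing = get_json(list_url())
```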