Implementing a lecturer IS scraper

For more than two years, I’ve been working as a mathematics lecturer at Populo. Kabinet, the information system we were using, was good enough for the basic agenda, but it offered no way to actively monitor courses (think of a watchdog). At least that was the case in July 2024.

So I decided to reverse-engineer the Kabinet API, and to my surprise, I had a working scraper prototype after one weekend.

Authentication & Cognito

The first step, of course, was to figure out which authentication method is used so that I could retrieve a JWT token, which is then used across the rest of the Kabinet REST API.

These endpoints were called after a successful login:

Requests after login

So it’s quite obvious the backend uses the AWS Cognito service to authenticate and authorize its users.

Since I didn’t have much knowledge about Cognito, I wasn’t able to figure out what the primary source of the user pool configuration was, so I decided to parse this setting from the main JavaScript bundle, where it is for some reason hardcoded:

...
const yxt = {
    Auth: {
        Cognito: {
            userPoolId: "eu-central-1_LyA4D7N6q",
            identityPoolId: "eu-central-1:edfb5190-0a0a-4b58-bcff-749e12065953",
            userPoolClientId: "3q7sbbp4u743ot2181ou1r8bdf"
        }
    }
};
...
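
To avoid hardcoding the IDs on my side as well, the scraper can pull them out of the bundle at runtime. A minimal sketch of the extraction, assuming the bundle URL below (a hypothetical placeholder) and that the key names survive minification verbatim:

import re

import requests

# hypothetical placeholder; the real URL would be resolved from the app's index.html
MAIN_JS_URL = "https://kabinet.example.com/assets/index.js"

COGNITO_KEYS = ("userPoolId", "identityPoolId", "userPoolClientId")


def parse_cognito_config(bundle_url: str = MAIN_JS_URL) -> dict:
    """Extract the hardcoded Cognito pool IDs from the main JS bundle."""
    source = requests.get(bundle_url).text
    config = {}
    for key in COGNITO_KEYS:
        # matches both `userPoolId: "..."` and the minified `"userPoolId":"..."`
        match = re.search(rf'"?{key}"?\s*:\s*"([^"]+)"', source)
        if match is None:
            raise LookupError(f"{key} not found in the bundle")
        config[key] = match.group(1)
    return config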

Since the script is obfuscated, I also needed a set of fallback values in my configuration that will be used if the Kabinet developers ever decide to configure the code obfuscation properly. :))
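
With the pool configuration in hand, getting the tokens is a standard Cognito flow. A minimal sketch using boto3 and the USER_PASSWORD_AUTH flow (which has to be enabled on the app client; the credentials are of course placeholders):

import boto3


def authenticate(username: str, password: str, client_id: str) -> dict:
    """Log in against the Cognito user pool and return the token set."""
    client = boto3.client("cognito-idp", region_name="eu-central-1")
    response = client.initiate_auth(
        AuthFlow="USER_PASSWORD_AUTH",
        AuthParameters={"USERNAME": username, "PASSWORD": password},
        ClientId=client_id,
    )
    # contains AccessToken, IdToken and RefreshToken
    return response["AuthenticationResult"]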

Kabinet API

Good, now we have a usable JWT with which we can fetch a list of available courses. No rocket science here:

from typing import Any, List

import requests
from mypy_boto3_cognito_idp import CognitoIdentityProviderClient
from mypy_boto3_cognito_idp.type_defs import AuthenticationResultTypeTypeDef

AVAILABLE_COURSES_URL = "{api_prefix}/tutor/{user}/available-courses"


def list_available_courses(
    cognito_client: CognitoIdentityProviderClient,
    auth_result: AuthenticationResultTypeTypeDef,
    api_prefix: str,
) -> List[Any]:
    """Fetch the raw list of courses available to the authenticated tutor."""
    # the endpoint is keyed by the Cognito username
    user = cognito_client.get_user(AccessToken=auth_result["AccessToken"])["Username"]
    courses = requests.get(
        AVAILABLE_COURSES_URL.format(user=user, api_prefix=api_prefix),
        headers={
            "Accept": "application/json",
            "Authorization": f"Bearer {auth_result['IdToken']}",
            # a None value makes requests drop the default session-level rubbish
            "User-Agent": None,
            "Connection": None,
            "Accept-Encoding": None,
        },
    )
    courses.raise_for_status()
    return list(courses.json()["data"])
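
Wiring the authentication and the listing together might look like this (the API prefix, client ID and credentials are placeholders; authenticate is the sketch from the previous section):

import boto3

cognito_client = boto3.client("cognito-idp", region_name="eu-central-1")
courses = list_available_courses(
    cognito_client=cognito_client,
    auth_result=authenticate("user@example.com", "secret", client_id="<clientId>"),
    api_prefix="https://kabinet.example.com/api",
)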

This already gave me a very usable response; I anonymised the payload for confidentiality reasons:

{
    "data": [
        {
            "externalId": "<ID>",
            "topic": {
                "externalId": "<ID>",
                "value": "Anglický jazyk"
            },
            "number": 12345,
            "firstMinute": false,
            "studyStart": "2024-05-06T00:00:00.000Z",
            "studyReason": null,
            "schoolType": {
                "externalId": "<ID>",
                "value": "Základní škola"
            },
            "studyYear": {
                "externalId": "<ID>",
                "value": "9."
            },
            "requiredLevel": {
                "externalId": "<ID>",
                "value": "ZŠ",
                "index": 1
            },
            "requiredQualification": null,
            "language": {
                "externalId": "<ID>",
                "value": "Czech (Czech Republic)"
            },
            "branch": {
                "externalId": "<ID>",
                "value": "Zlín",
                ...
            },
            "timeslots": [
                {
                    "externalId": "<ID>",
                    "startTime": "2024-04-27T15:00:00.000Z",
                    "dueTime": "2024-04-27T20:15:00.000Z",
                    "studyType": {
                        "externalId": "<ID>",
                        "value": "Prezenčně"
                    },
                    "weekday": {
                        "externalId": "<ID>",
                        "value": "Pátek",
                        "number": 5
                    },
                    "notes": "PREFERUJ%C3%8D+%2F%2F1x+45min"
                },
                ...
            ],
            "rate": "140.0000",
            "rateBonus": "20.0000",
            "rateType": {
                "externalId": "<ID>",
                "value": "Hod. sazba"
            },
            "interested": false,
            "stared": false,
            "interestedLength": 0
        },
        ...
    ]
}

Basically, the only thing left was to filter the payload down to interesting courses, i.e. courses matching my arbitrary criteria. In my case these are the following, hardcoded because I couldn’t be bothered to write a persistence layer (a filter sketch follows the list):

  • topic: Mathematics
  • school type: High School
  • branch: Brno
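
A minimal version of that filter over the raw payload (the Czech string values are my best guesses based on the sample response above):

from typing import Any, Dict, List

EXPECTED_TOPIC = "Matematika"
EXPECTED_SCHOOL_TYPE = "Střední škola"
EXPECTED_BRANCH = "Brno"


def filter_interesting(courses: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only the courses matching the hardcoded criteria above."""
    return [
        c
        for c in courses
        if c["topic"]["value"] == EXPECTED_TOPIC
        and c["schoolType"]["value"] == EXPECTED_SCHOOL_TYPE
        and c["branch"]["value"] == EXPECTED_BRANCH
    ]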

Once the courses are filtered, all I need are the following data, which help me decide whether a course is really for me:

from datetime import date
from typing import List

from pydantic import BaseModel


class CourseTimeSlot(BaseModel):
    startHour: int
    endHour: int
    weekDay: str


class InterestingAvailableCourse(BaseModel):
    start: date
    schoolType: str
    studyYear: str
    timeSlots: List[CourseTimeSlot]
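
Mapping the raw payload onto these models is mostly a matter of flattening the nested value objects and extracting hours from the ISO timestamps. A rough sketch of that conversion (field access mirrors the anonymised payload above; the hours end up in UTC):

from datetime import date, datetime
from typing import Any, Dict


def _hour_of(iso_ts: str) -> int:
    # the API sends e.g. "2024-04-27T15:00:00.000Z"; normalise the suffix first
    return datetime.fromisoformat(iso_ts.replace("Z", "+00:00")).hour


def to_interesting_course(raw: Dict[str, Any]) -> InterestingAvailableCourse:
    return InterestingAvailableCourse(
        start=date.fromisoformat(raw["studyStart"][:10]),
        schoolType=raw["schoolType"]["value"],
        studyYear=raw["studyYear"]["value"],
        timeSlots=[
            CourseTimeSlot(
                startHour=_hour_of(ts["startTime"]),
                endHour=_hour_of(ts["dueTime"]),
                weekDay=ts["weekday"]["value"],
            )
            for ts in raw["timeslots"]
        ],
    )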

As you might suspect, the result of scraping the Kabinet API is a list of InterestingAvailableCourse objects. The only thing left to do is to send the data somewhere.

Notifications

The trickiest part of the scraper was deciding how and where to send notifications.

I couldn’t come up with an easy way of sending notifications to my iPhone, so I came up with two ideas:

  • if my machine is on the same network as the scraper, send a notification to a Dunst daemon via SSH
  • if my machine is not on that network, send an e-mail report using SendGrid

Dunst

The idea is straightforward and nasty at the same time: if I locate a machine with an open port 2222 on the same subnet the scraper runs in, a Dunst notification is sent to that host via an SSH client.
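
Locating such a host boils down to a simple TCP connect scan. A rough sketch, assuming the subnet is passed in CIDR notation:

import ipaddress
import socket
from typing import Optional


def find_dunst_host(subnet: str, port: int = 2222, timeout: float = 0.3) -> Optional[str]:
    """Return the first host on the subnet accepting TCP connections on the port."""
    for ip in ipaddress.ip_network(subnet).hosts():
        try:
            with socket.create_connection((str(ip), port), timeout=timeout):
                return str(ip)
        except OSError:
            continue
    return None

Once a host is found, the notification itself goes over SSH using paramiko: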

from enum import Enum
from io import StringIO
from typing import List, Optional

from paramiko import AutoAddPolicy, ECDSAKey, SSHClient

from populoscraper.config import PopuloScraperConfig
from populoscraper.container_logging import log
from populoscraper.scraper.model import InterestingAvailableCourse, ScrapeSummary

# used as the notify-send application name and category
APP_NAME = "populoscraper"


class DunstNotificationSeverity(Enum):
    LOW = "low"
    NORMAL = "normal"
    HIGH = "critical"  # notify-send's highest urgency level


def send_dunst_notification(
    config: PopuloScraperConfig,
    interesting_courses: List[InterestingAvailableCourse],
    host: str,
) -> ScrapeSummary:
    ssh_client: Optional[SSHClient] = None
    try:
        log.info(f"Will send a notification to host {host}.")
        # key is provided in the PEM format from Vault
        private_key = ECDSAKey.from_private_key(StringIO(config.private_ssh_key))
        ssh_client = SSHClient()
        ssh_client.set_missing_host_key_policy(AutoAddPolicy())
        ssh_client.connect(hostname=host, username=config.ssh_username, pkey=private_key, port=config.ssh_port)
        # one paragraph per course, with an indented line per timeslot
        notification_body = "\n".join(
            [
                f"{c.start} - {c.schoolType}, {c.studyYear} tř."
                + "\n"
                + "\n".join([f"\t- {ts.weekDay}: {ts.startHour} - {ts.endHour}" for ts in c.timeSlots])
                for c in interesting_courses
            ]
        )
        notification_command = _send_critical_notification_command_string(
            notification_body=notification_body
        )
        log.info(f"Notification command: {notification_command}")
        ssh_client.exec_command(notification_command)
        return ScrapeSummary.DUNST_NOTIFICATION_SUCCESS
    except Exception as e:
        log.error(f"Error while sending notification: {e}")
        return ScrapeSummary.DUNST_NOTIFICATION_FAILURE
    finally:
        if ssh_client:
            ssh_client.close()


def _send_critical_notification_command_string(
    notification_summary: str = "Nové zajímavé kurzy!",
    notification_body: str = "",
    severity: DunstNotificationSeverity = DunstNotificationSeverity.HIGH,
) -> str:
    return f'notify-send "{notification_summary}" "{notification_body}" -u {severity.value} -c {APP_NAME} -a {APP_NAME}'

E-mail

If there is no active host with port 2222 open, the scraper falls back to sending an e-mail report, which is basically an HTML table.
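
A minimal sketch of that fallback using the official sendgrid package (the addresses and the environment variable are placeholders):

import os

from sendgrid import SendGridAPIClient
from sendgrid.helpers.mail import Mail


def send_email_report(html_table: str) -> None:
    """Send the rendered HTML report via SendGrid."""
    message = Mail(
        from_email="scraper@example.com",
        to_emails="me@example.com",
        subject="Nové zajímavé kurzy!",
        html_content=html_table,
    )
    SendGridAPIClient(os.environ["SENDGRID_API_KEY"]).send(message)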

Nothing complicated; the rendered report looks like this:

E-mail report example

This post is licensed under CC BY 4.0 by the author.
