2022-10-27 05:24:27 +00:00
# scrape_ob

Scrape Open Behavior and convert it to structured data

## Components

- `scrape.py` - Utility functions for listing and downloading files
- `parse.py` - Parse downloaded HTML files
- `main.py` - Wrapper/entrypoint scripts

## Usage

```
>>> scrape_ob --help
usage: scrape_ob [-h] [-u URL] [-o OUTPUT] [-d]

Scrape Open Behavior and return structured data

options:
  -h, --help            show this help message and exit
  -u URL, --url URL     Root URL for open behavior's open source project directory. Default is https://edspace.american.edu/openbehavior/open-source-tools/
  -o OUTPUT, --output OUTPUT
                        Output directory to store downloaded html files in. Default is ./html
  -d, --download        Just download html files without parsing them

Be kind and always give credit where labor has been done
```

### Downloading

We try to always work on local copies of the files so that we don't send too many
requests to the Open Behavior server, so we start by downloading HTML copies of the
project descriptions, say to the `./html` folder.

Interactively:

```python
from scrape_ob.scrape import download_all
download_all(output_folder='./html')
```

From the CLI:

```shell
scrape_ob --download
```

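
The "work on local copies" idea can be sketched as a download step that skips the request whenever a file is already on disk. This is a minimal illustration, not the actual implementation of `download_all`: the names `download_once` and `fetch_html`, and the rule for deriving the file name from the URL, are assumptions made for the example.

```python
from pathlib import Path
from typing import Callable
from urllib.request import urlopen


def fetch_html(url: str) -> str:
    """Fetch a page over HTTP with plain urllib (the real package may differ)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def download_once(url: str, output_folder: str = "./html",
                  fetch: Callable[[str], str] = fetch_html) -> Path:
    """Save `url` into `output_folder`, skipping the request if a copy exists."""
    folder = Path(output_folder)
    folder.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming rule: use the last path segment of the URL
    name = url.rstrip("/").rsplit("/", 1)[-1] + ".html"
    target = folder / name
    if not target.exists():
        # Only hit the server when there is no local copy yet
        target.write_text(fetch(url), encoding="utf-8")
    return target
```

Passing the fetcher in as an argument keeps the caching logic testable without touching the network.
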
### Parsing

From there we can parse an individual project's HTML representation into a
structured one using two primary dataclasses:

* `Project`: The primary representation of a project
* `Blurb`: The image and link boxes at the bottom of a page that link to
  further details about a project. Since these don't have structured information,
  use varying terminology, embed links within text, etc., we are leaving them
  relatively unprocessed for now, pending further refinement of the parsing classes.

The `Project` class can parse projects given a URL or a file, e.g.:

```python
from scrape_ob.parse import Project

ap = Project.from_file('html/autopilot.html')
```

This gives us a structured representation of the project, which
we can access directly as attributes or pull out as a dictionary
(omitting the `body` attribute for clarity):

```python
>>> print(ap.dict())
{
    'name': 'AutoPilot: python framework for behavior experiments with raspberry pi',
    'url': 'https://edspace.american.edu/openbehavior/project/autopilot/',
    'date': datetime.datetime(2019, 12, 12, 0, 0),
    'tags': [
        'automated',
        'cognition',
        'decision-making',
        'gui',
        'hardware',
        'perceptual',
        'raspberrypi',
        'sensory',
        'software'
    ],
    'categories': [
        'behavior-apparatus',
        'behavior-measurement',
        'data-analysis-software',
        'behavior-rigs',
        'behavior-analysis',
        'behavioral-tasks',
        'freely-moving',
        'integrated-systems',
        'stimuli'
    ],
    'rrids': ['SCR_021448', 'SCR_021518'],
    'blurbs': [
        {
            'links': [
                'https://www.biorxiv.org/content/10.1101/807693v1'
            ],
            'name': 'Paper',
            'type': None
        },
        {
            'links': [
                'https://github.com/wehr-lab/autopilot',
                'http://docs.auto-pi-lot.com/'
            ],
            'name': 'Github',
            'type': None
        },
        {
            'links': [
                'https://auto-pi-lot.com/',
                'https://auto-pi-lot.com/presentation/#/'
            ],
            'name': 'Website',
            'type': None
        }
    ],
    'docs': None,
    'repo': None,
    'paper': None
}
```

Note how we are able to pull out the "categories" and "tags" information, which
is usually hidden as part of the page metadata.

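
As a rough illustration of how such hidden metadata could be recovered, the sketch below reads CSS classes off an `<article>` element using only the standard library. It assumes the page marks tags and categories with `tag-` and `category-` class prefixes, which is a guess about the markup rather than something confirmed here, and `ClassCollector` is a hypothetical name that is not part of `scrape_ob`.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect tags and categories from CSS classes on <article> elements."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.categories = []

    def handle_starttag(self, tag, attrs):
        if tag != "article":
            return
        classes = (dict(attrs).get("class") or "").split()
        for cls in classes:
            # The "tag-" / "category-" prefixes are assumed, not confirmed
            if cls.startswith("tag-"):
                self.tags.append(cls[len("tag-"):])
            elif cls.startswith("category-"):
                self.categories.append(cls[len("category-"):])


collector = ClassCollector()
collector.feed('<article class="post-1 tag-gui category-stimuli">...</article>')
print(collector.tags, collector.categories)
```
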
The extra `docs`, `repo`, and `paper` fields are currently left unfilled, as we
will eventually use the `Blurb` class to parse out that information.

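
One way those fields might eventually be filled is by guessing from the hosts of the blurb links. The sketch below works on plain dictionaries shaped like the `blurbs` output above. The function name `guess_fields` and the host lists are hypothetical, and this is not how the `Blurb` class currently works.

```python
from urllib.parse import urlparse

# Hypothetical host hints; extend as needed
REPO_HOSTS = ("github.com", "gitlab.com", "bitbucket.org")
PAPER_HOSTS = ("biorxiv.org", "doi.org", "arxiv.org")


def guess_fields(blurbs):
    """Guess `docs`, `repo`, and `paper` links from a list of blurb dicts."""
    fields = {"docs": None, "repo": None, "paper": None}
    for blurb in blurbs:
        for link in blurb["links"]:
            host = urlparse(link).netloc.lower()
            if host.startswith("www."):
                host = host[len("www."):]
            if host in REPO_HOSTS:
                kind = "repo"
            elif host in PAPER_HOSTS:
                kind = "paper"
            elif host.startswith("docs.") or "readthedocs" in host:
                kind = "docs"
            else:
                continue  # unrecognized link, leave it in the blurb
            if fields[kind] is None:
                # Keep the first match for each field
                fields[kind] = link
    return fields
```

A host-based heuristic like this will miss plenty of cases (papers hosted on journal sites, for example), so it is best read as a starting point for the blurb parsing listed under TODO.
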
The body of the project description is extracted into `body` and converted to
markdown :)

```markdown
Jonny Saunders from Michael Wehr’s lab at the University of Oregon
recently posted a preprint documenting their project Autopilot, which is
a python framework for running behavioral experiments:

------------------------------------------------------------------------

[Autopilot](https://auto-pi-lot.com/) is a python framework for
behavioral experiments through utilizing [Raspberry Pi
microcontrollers](https://www.raspberrypi.org/). Autopilot incorporates
all aspects of an experiment, including the hardware, stimuli,
behavioral task paradigm, data management, data visualization, and a
user interface. The authors propose that Autopilot is the fastest, least
expensive, most flexibile behavioral system that is currently available.

The benefit of using Autopilot is that it allows more experimental
flexibility, which lets researchers to optimize it for their specific
experimental needs. Additionally, this project exemplifies how useful a
raspberry pi can be for performing experiments and recording data. The
preprint discusses many benefits of raspberry pis, including their
speed, precision and proper data logging, and they only cost $35 (!!).
Ultimately, the authors developed Autopilot in an effort to encourage
users to write reusable, portable experiments that is put into a public
central library to push replication and reproducibility.

*This research tool was created by your colleagues. Please acknowledge
the Principal Investigator, cite the article in which the tool was
described, and include an RRID in the Materials and Methods of your
future publications. Project portal RRID:SCR_021448;
Software RRID:SCR_021518*
```

The utility function `parse_folder` can be used to parse the entire downloaded
folder of HTML documents!

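
The signature of `parse_folder` isn't shown here, so the following is only a sketch of what such a helper could look like. `parse_html_folder` is a hypothetical stand-in that takes the parsing function as an argument, so it does not depend on any particular `scrape_ob` API.

```python
from pathlib import Path


def parse_html_folder(folder, parser):
    """Apply `parser` to every .html file in `folder`, in sorted order."""
    paths = sorted(Path(folder).glob("*.html"))
    # Pass plain string paths, matching the `from_file('html/...')` usage above
    return [parser(str(path)) for path in paths]
```

Assuming `Project.from_file` accepts such a path string, it could be used as `projects = parse_html_folder('./html', Project.from_file)`.
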
## TODO

- Complete parsing of blurbs
- Export `Project` to structured markdown with YAML headers, or to whatever format
  open neuro people want
- Make a MediaWiki template and client code to push into the open neuro wiki
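
For the markdown export item, one possible shape is sketched below. It works on the dictionary form of a project and writes each value as JSON, relying on YAML 1.2 accepting JSON flow syntax so that no YAML library is needed. The function name `to_markdown` and the choice of header layout are assumptions, not an existing part of `scrape_ob`.

```python
import json
from datetime import datetime


def to_markdown(project: dict) -> str:
    """Render a project dict as markdown with a YAML front-matter header."""
    lines = ["---"]
    for key, value in project.items():
        if key == "body":
            continue  # the body goes below the header
        if isinstance(value, datetime):
            value = value.date().isoformat()
        # JSON scalars, lists, and mappings are written as YAML flow syntax
        lines.append(f"{key}: {json.dumps(value, default=str)}")
    lines.append("---")
    lines.append("")
    lines.append(project.get("body") or "")
    return "\n".join(lines)


example = {
    "name": "AutoPilot",
    "date": datetime(2019, 12, 12),
    "tags": ["gui", "hardware"],
    "docs": None,
    "body": "Project description goes here.",
}
print(to_markdown(example))
```
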