# scrape_ob
Scrape Open Behavior pages and convert them to structured data.

## Components

- `scrape.py` - Utility functions for listing and downloading files
- `parse.py` - Parse downloaded HTML files
- `main.py` - Wrapper/entrypoint scripts

## Usage
```
>>> scrape_ob --help
usage: scrape_ob [-h] [-u URL] [-o OUTPUT] [-d]

Scrape Open Behavior and return structured data

options:
  -h, --help            show this help message and exit
  -u URL, --url URL     Root URL for open behavior's open source project
                        directory. Default is
                        https://edspace.american.edu/openbehavior/open-source-tools/
  -o OUTPUT, --output OUTPUT
                        Output directory to store downloaded html files in.
                        Default is ./html
  -d, --download        Just download html files without parsing them

Be kind and always give credit where labor has been done
```
### Downloading

We always try to work on local copies of the files so that we don't make too
many requests to their server, so we start by downloading html copies of their
project descriptions, say to the `./html` folder.

Interactively:
```python
from scrape_ob.scrape import download_all
download_all(output_folder='./html')
```

From the CLI:

```shell
scrape_ob --download
```
### Parsing

From there we can parse an individual project's HTML representation into a
structured one using two primary dataclasses:

* `Project`: The primary representation of a project
* `Blurb`: The image and link boxes at the bottom of a page that link to
  further details about a project. Since these don't have structured information,
  use different terminology, embed links within text, etc., we are leaving them
  relatively unprocessed for now, pending further refinement of the parsing classes.

The `Project` class can parse projects given a URL or a file, e.g.
```python
from scrape_ob.parse import Project
ap = Project.from_file('html/autopilot.html')
```
This gives us a structured representation of the project, whose fields we
can access directly as attributes or pull out as a dictionary
(omitting the `body` attribute for clarity):
```python
>>> print(ap.dict())
{
    'name': 'AutoPilot: python framework for behavior experiments with raspberry pi',
    'url': 'https://edspace.american.edu/openbehavior/project/autopilot/',
    'date': datetime.datetime(2019, 12, 12, 0, 0),
    'tags': [
        'automated',
        'cognition',
        'decision-making',
        'gui',
        'hardware',
        'perceptual',
        'raspberrypi',
        'sensory',
        'software'
    ],
    'categories': [
        'behavior-apparatus',
        'behavior-measurement',
        'data-analysis-software',
        'behavior-rigs',
        'behavior-analysis',
        'behavioral-tasks',
        'freely-moving',
        'integrated-systems',
        'stimuli'
    ],
    'rrids': ['SCR_021448', 'SCR_021518'],
    'blurbs': [
        {
            'links': [
                'https://www.biorxiv.org/content/10.1101/807693v1'
            ],
            'name': 'Paper',
            'type': None
        },
        {
            'links': [
                'https://github.com/wehr-lab/autopilot',
                'http://docs.auto-pi-lot.com/'
            ],
            'name': 'Github',
            'type': None
        },
        {
            'links': [
                'https://auto-pi-lot.com/',
                'https://auto-pi-lot.com/presentation/#/'
            ],
            'name': 'Website',
            'type': None
        }
    ],
    'docs': None,
    'repo': None,
    'paper': None
}
```
Note how we are able to pull out the "category" and "tags" information, which
is usually hidden as part of the page metadata.

The extra `docs`, `repo`, and `paper` fields are currently left unfilled; we
will eventually use the `Blurb` class to parse out that information.
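
On WordPress sites like this one, those tag and category slugs commonly surface
as `tag-*` and `category-*` classes on the page's `<body>` element. A minimal
stdlib sketch of that recovery (the actual selectors `parse.py` uses may differ):

```python
from html.parser import HTMLParser

class BodyClassParser(HTMLParser):
    """Collect tag-* and category-* classes from the <body> element."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.categories = []

    def handle_starttag(self, tag, attrs):
        if tag != 'body':
            return
        # attrs is a list of (name, value) pairs; split the class string
        classes = dict(attrs).get('class', '').split()
        self.tags = [c[len('tag-'):] for c in classes if c.startswith('tag-')]
        self.categories = [c[len('category-'):] for c in classes
                           if c.startswith('category-')]

page = '<html><body class="single tag-gui tag-raspberrypi category-stimuli">...</body></html>'
parser = BodyClassParser()
parser.feed(page)
# parser.tags == ['gui', 'raspberrypi']; parser.categories == ['stimuli']
```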
The body of the project description is extracted into `body` and converted to
markdown :)
```markdown
Jonny Saunders from Michael Wehrs lab at the University of Oregon
recently posted a preprint documenting their project Autopilot, which is
a python framework for running behavioral experiments:
------------------------------------------------------------------------
[Autopilot](https://auto-pi-lot.com/)\xa0is a python framework for
behavioral experiments through utilizing\xa0[Raspberry Pi
microcontrollers](https://www.raspberrypi.org/). Autopilot incorporates
all aspects of an experiment, including the hardware, stimuli,
behavioral task paradigm, data management, data visualization, and a
user interface. The authors propose that Autopilot is the fastest, least
expensive, most flexibile behavioral system that is currently available.
The benefit of using Autopilot is that it allows more experimental
flexibility, which lets researchers to optimize it for their specific
experimental needs. Additionally, this project exemplifies how useful a
raspberry pi can be for performing experiments and recording data. The
preprint discusses many benefits of raspberry pis, including their
speed, precision and proper data logging, and they only cost $35 (!!).
Ultimately, the authors developed Autopilot in an effort to encourage
users to write reusable, portable experiments that is put into a public
central library to push replication and reproducibility.
*This research tool was created by your colleagues. Please acknowledge
the Principal Investigator, cite the article in which the tool was
described, and include an\xa0RRID\xa0in the Materials and Methods of your
future publications.\xa0\xa0Project portal\xa0RRID:SCR_021448;
Software\xa0RRID:SCR_021518*
```
The utility function `parse_folder` can be used to parse the entire downloaded
folder of html documents!
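
`parse_folder`'s exact signature isn't shown here, but the iteration it performs
is presumably something like this sketch (`parse_folder_sketch` and the `*.html`
glob are assumptions; the real function would build `Project` objects rather
than return paths):

```python
from pathlib import Path

def parse_folder_sketch(folder: str) -> list:
    # Hypothetical stand-in for parse_folder: enumerate the downloaded
    # pages; the real function presumably calls Project.from_file on each.
    return sorted(Path(folder).glob('*.html'))
```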
## TODO
- Complete parsing of blurbs
- Export `Project` to structured markdown with YAML headers or to whatever format
open neuro ppl want
- Make mediawiki template and client code to push into the open neuro wiki
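
For the first export item, one possible shape of the output (a sketch only;
`project_to_markdown` is not part of the package, and the field names simply
follow the `dict()` output above):

```python
def project_to_markdown(record: dict) -> str:
    """Hypothetical exporter: YAML front matter followed by the body text."""
    front = ['---', f"name: {record['name']}", f"url: {record['url']}"]
    for key in ('tags', 'categories'):
        front.append(f'{key}:')
        front.extend(f'  - {item}' for item in record[key])
    front.append('---')
    return '\n'.join(front) + '\n\n' + record.get('body', '')
```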